_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (unofficial)
URI Visit Hacker News on the Web
COMMENT PAGE FOR:
URI TOON – Token Oriented Object Notation
yohbho wrote 4 hours 7 min ago:
LLMs read
> users[2]{id,name,role}:
1,Alice,admin
2,Bob,user
differently than me, I guess. I would read that as "at index two, i.e.
the third element of the array", and then find that the values
1,Alice,admin and 2,Bob,user are stored there, or not, since we want to
destructure these values and what's given is a pair of three-tuples. I
would be confused and think: wtf is that, dear user, did you omit or
misformat values?
viggity wrote 5 hours 21 min ago:
Since the whole point of this is to limit LLM token consumption, it'd
be interesting to see the results of prompts that use it.
I've seen a ton of people who just paste a CSV into a prompt and expect
it to work well because they don't know any better, but the results are
typically hot garbage: the data is too repetitive, and the model can't
memorize and/or process such a big chunk of it. Asking an LLM to use
pandas to iteratively analyze the CSV works great, though.
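A rough sketch of that pattern (the file and column names here are made
up):
    import pandas as pd

    # Load the CSV once; the model never sees the raw rows.
    df = pd.read_csv("sales.csv")  # hypothetical file

    # Hand the LLM small, pre-digested summaries instead of the full dump.
    summary = {
        "rows": len(df),
        "columns": list(df.columns),
        "revenue_by_region": df.groupby("region")["revenue"].sum().to_dict(),
    }
    print(summary)  # this compact result is what goes into the prompt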
pshirshov wrote 6 hours 41 min ago:
I have this: [1] , a binary deduplicating storage for JSON-like data
structures with efficient direct access (no parsing required). It's
even more efficient in terms of space and access speed (but not
manually editable).
URI [1]: https://github.com/7mind/sick
AvAn12 wrote 9 hours 10 min ago:
Similar to YAML? Also do consider ancient formats like fixed width -
in which case you don't even need delimiter characters. Are LLMs
clever enough to parse these if given a code book or old-school INPUT
statement? Cheers
neilv wrote 13 hours 4 min ago:
If you instead put parentheses around the lexical sequences, then you
wouldn't need syntax like `[3]` to denote length.
You also wouldn't need indentation levels to be syntactically
meaningful.
You could also get rid of LLM tokens like square brackets, curly
braces, colons, and commas.
And you could have objects nested to arbitrary depth.
All in near the same character count as TOON (sometimes more, sometimes
less).
(I was telling someone over the weekend that there are only a few small
wins for Lisps in most AI work right now. I hadn't considered that the
printed syntax itself might have a use with these huge LLM black
boxes.)
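A sketch of what the users example might look like in that style (one
possible rendering, not a spec):
    (users (id name role)
      (1 "Alice" "admin")
      (2 "Bob" "user"))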
hdjfjkremmr wrote 12 hours 27 min ago:
Have you tried it? Models struggle keeping track of opening/closing
braces, which is exactly why XML/CSV (or TOON) tends to work better
than JSON.
neilv wrote 4 hours 23 min ago:
What is the reason that the current LLMs work better with XML than
with JSON? Is it the names in the element tags?
rs186 wrote 16 hours 27 min ago:
I wonder how many tokens will be saved compared to real JSON if we use
a special version where property names don't require quotes, like in
JavaScript.
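For the users example, the JavaScript-object-literal style would be
(whether it saves much depends on the tokenizer, which often merges a
key with its surrounding punctuation anyway):
    {
      users: [
        { id: 1, name: "Alice", role: "admin" },
        { id: 2, name: "Bob", role: "user" }
      ]
    }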
awaseem wrote 17 hours 58 min ago:
This is awesome, I saw it on twitter and gave it a star
metalliqaz wrote 18 hours 29 min ago:
What is the font used on that README image?
drewlesueur wrote 18 hours 20 min ago:
Looks like one of the variations of Iosevka.
URI [1]: https://github.com/be5invis/Iosevka
metalliqaz wrote 18 hours 14 min ago:
Well done, sir!
chuckadams wrote 18 hours 44 min ago:
Indentation-based syntax sounds pretty brittle for a serialization
format. I imagine a tabular format that factors out repeating keys
could be expressed fairly compactly in JSON itself.
s1mon wrote 19 hours 34 min ago:
Obligatory XKCD:
URI [1]: https://xkcd.com/927/
andreygrehov wrote 19 hours 42 min ago:
I don't know what I'm talking about (pure fantasy), but what if you
train a model on compressed data and then perform inference on
compressed data as well? Could this work? With the output also being
compressed and then decompressed by the client?
WorldMaker wrote 2 hours 51 min ago:
The tokenizer is already a form of (somewhat lossy) compression of a
string of plaintext to a stream of token identifiers. You can reason
about Tokenizers/"embedding spaces" as a sort of massive "Dictionary
Table/Dictionary Function" like you might use in a zip/gzip stream.
Starting with already-compressed data doesn't necessarily mean fewer
tokens: you can probably assume similar (or worse) entropy when
expanding "dictionary words" from a compressed stream versus "tokens"
from a plaintext stream.
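A rough way to check that intuition, assuming the tiktoken package
(counts vary by model):
    import base64, json, zlib
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    data = json.dumps([{"id": i, "name": f"user{i}", "role": "user"}
                       for i in range(50)])
    # Compress, then base64 so the result survives as text in a prompt.
    packed = base64.b64encode(zlib.compress(data.encode())).decode()

    print(len(enc.encode(data)))    # tokens for the plain JSON
    print(len(enc.encode(packed)))  # typically comparable or worse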
Loranubi wrote 11 hours 9 min ago:
Since all input is run through a tokenizer, I would expect the
tokenizer space doesn't change a lot between one trained on
uncompressed vs one trained on compressed data.
mentalgear wrote 20 hours 26 min ago:
Neat. I did a similar thing with CSV (instead of JSON) a year back.
Great that there are measurements, but I think the really interesting
measure would be running it against the actual "Structured Output
Format" endpoints of LLM providers, e.g. those fine-tuned to return
valid JSON.
3cats-in-a-coat wrote 20 hours 53 min ago:
I'll say the obvious. A lot of this you can just do in JSON.
Let's take the example:
    {
      "users": [
        { "id": 1, "name": "Alice", "role": "admin" },
        { "id": 2, "name": "Bob", "role": "user" }
      ]
    }
and its TOON version:
    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user
We can keep it JSON, but use more compact list expressions, as tuples
when pragmatic:
["users",
[1, "Alice", "admin"],
[2, "Bob", "user"]
]
The thing is, the game with LLMs is not what's shortest, but what's:
1. Mainstream, so they understand it.
2. What they're tuned for, and they're tuned for what's mainstream
(JSON).
If you want to go extreme compression you can shove it all in JSON
strings too and keep the larger structure JSON:
["users",
"1:admin:Alice",
"2:user:Bob",
]
You may say "how is this better". Well, it's better because it's still
JSON: there's less to explain to the LLM, and to your other devs. Even
if we use a weird compact format like "id:role:name", it's still
shorter to explain than a completely different syntax with its whole
world of rules.
rc1 wrote 16 hours 43 min ago:
In fairness to TOON, the alternative JSON you're giving doesn't
include hints on structure.
I'm not sure LLMs are more "tuned" to JSON.
That said, your general point holds that TOON may be unnecessary,
especially in the examples given. Perhaps plain text would suffice.
TOON could be useful when automating inputs with many different
shapes.
copypaper wrote 12 hours 49 min ago:
Yea exactly. The LLMs are tuned to natural language. I don't think
anything will beat good ol' templating (a.k.a. plain text). In Go I
do something like this:
    // mytemplate.tmpl
    Description="The following data is for the users in our application."
    Format="id,name,role"
    Length={{len .}}
    Data:
    {{range . -}}
    {{.ID}},{{.Name}},{{.Role}}
    {{end}}
This way you're able to change the formatting to something the LLM
understands for each struct. The LLM might understand some structs
better as JSON, others as YAML, and others in an arbitrary format.
Templating gives you the most flexibility to choose which one will
work best.
hedgehog wrote 21 hours 38 min ago:
It would be interesting to compare this to BAML and TOML.
toobulkeh wrote 20 hours 57 min ago:
Definitely is a core feature of BAML. My main complaint with BAML is
that it's all or nothing: it's very opinionated and we can't get the
benefits without the DX and vice versa. Separating this feature out,
without requiring a DSL for model definition, is a great add.
hedgehog wrote 20 hours 6 min ago:
TOML has some readability and compactness benefits over JSON while
still being both common enough for models to easily be able to
process it relatively reliably and widely supported in most
languages. I suspect BAML still performs better, but likewise I
haven't integrated it, due to the tooling work required.
Pxtl wrote 22 hours 40 min ago:
I'm sorry I don't see this adding value over various other formats. I
don't really want a new object serialization format, I just want the
existing ones to have the features I need. YAML but with static typing
and schema. XML but without crazy internet features. TOML but with an
object format that doesn't hurt my brain. JSON but with decent
multiline strings and comments. NestedText but with a sub-standard
that provides static-typing and schema and whatnot.
rs186 wrote 16 hours 26 min ago:
I don't think you even need to care about this as a format. It could
exist only during communication, encoded/decoded by middleware, and
everything would still work.
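For the uniform-array case, the encoding half of such middleware is
small. A sketch that ignores TOON's quoting and indentation rules:
    def encode_tabular(name, rows):
        # All rows must share one flat shape for the tabular form.
        keys = list(rows[0])
        assert all(list(r) == keys for r in rows), "rows must share one shape"
        header = f"{name}[{len(rows)}]{{{','.join(keys)}}}:"
        lines = [",".join(str(r[k]) for k in keys) for r in rows]
        return "\n".join([header, *lines])

    users = [{"id": 1, "name": "Alice", "role": "admin"},
             {"id": 2, "name": "Bob", "role": "user"}]
    print(encode_tabular("users", users))
    # users[2]{id,name,role}:
    # 1,Alice,admin
    # 2,Bob,user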
tptacek wrote 18 hours 22 min ago:
This isn't really an interchange format so much as something you'd
JIT-compile down to when handing things off to an LLM, right?
furyofantares wrote 18 hours 12 min ago:
And on the way out of the LLM. Token savings are nice on the way out
too, and I have to imagine it's better for the LLM to see one format
in all of its context instead of two.
It seems like a nice idea to me if restricted to that. Although I
guess I'm not sure it's really intended that way: the array count, for
example, is probably pretty bad for LLM output.
tptacek wrote 18 hours 3 min ago:
I feel like on the output side you might be working against LLM
training? But I don't know.
verdverm wrote 20 hours 25 min ago:
[1] | [2] CUE can emit the other formats (minus XML, because it's a
beast of ambiguity, but there are other tools for json->xml, for
instance).
It also has modules and imports, a very underrated feature for config
languages if you haven't experienced it before
URI [1]: https://cuelang.org
URI [2]: https://cuetorials.com
foxglacier wrote 20 hours 58 min ago:
The benchmarks show it performs better than them, so that's the value
- cost savings and improved accuracy. I suppose you could convert
JSON to TOON just for the LLM and not actually read it with your own
brain.
inopinatus wrote 23 hours 7 min ago:
JSON unmarshalling often has to consider separately whether an
attribute is absent, false, zero, null, or the empty string, but this
was never quite semantically ambiguous enough for my tastes, so the
fact that void-ish values may now also be serialised as a tuple of
length [0] seems to me an excellent additional obfuscation.
anonymoushn wrote 12 hours 11 min ago:
Arrays of length 0 also exist in json?
andrus wrote 11 hours 0 min ago:
Yes, this is valid JSON: []
joshribakoff wrote 22 hours 46 min ago:
The use case here is to reduce token usage with LLMs, such as an
agent that outputs a list of commands, e.g. tuples of files to write
and their new contents.
Supporting this use case doesn't require perfectly marshaling every
data structure ever.
But to your point, the tool could have wider use cases without the
limitations.
inopinatus wrote 22 hours 16 min ago:
If one trains a model to understand it then that model will
inevitably emit it, which means in turn one shall have to parse it,
and now the application supports TOON for anything, and good luck
telling the users/customers any different.
ziofill wrote 19 hours 17 min ago:
What if there's a simple converter back to JSON after the model
output? Is that possible?
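For the tabular subset it looks like a few lines. A sketch (note every
value comes back as a string, since nothing in the row marks types,
which echoes the Norway point elsewhere in the thread):
    import re

    def decode_tabular(text):
        lines = text.strip().splitlines()
        # Parse a header like: users[2]{id,name,role}:
        m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:", lines[0])
        name, count, keys = m.group(1), int(m.group(2)), m.group(3).split(",")
        rows = [dict(zip(keys, ln.strip().split(","))) for ln in lines[1:]]
        assert len(rows) == count, "declared length must match row count"
        return {name: rows}

    print(decode_tabular("users[2]{id,name,role}:\n1,Alice,admin\n2,Bob,user"))
    # {'users': [{'id': '1', 'name': 'Alice', 'role': 'admin'}, ...]}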
moralestapia wrote 1 day ago:
[flagged]
jayd16 wrote 21 hours 58 min ago:
I'm not sure which one would win, but it's a bit telling that
compression isn't mentioned at all.
I guess since it's about LLMs the idea is it has to be plaintext? But
if you can train a model on TOON, can't you train it on BSON?
vessenes wrote 1 day ago:
I'll be interested to see benchmarks. My expectation is that accuracy
will take a hit on mid- or longer-context prompts: I'd bet that the
heavy use of JSON in fine-tuning will end up impacting the quality of
a more terse (less reasoning space) novel encoding.
That said: I like the idea!
mattcollins wrote 10 hours 28 min ago:
FWIW, I ran a test comparing LLM accuracy with TOON versus JSON, CSV
and a variety of other formats when using them to represent tabular
data: [1] I've only looked at one model (gpt-4.1-nano) so far. I'm
hoping to run similar tests on some other models but it gets
challenging to discern statistically significant differences with
better models as their accuracy tends to be a lot better across the
board.
URI [1]: https://www.improvingagents.com/blog/is-toon-good-for-table-...
saretup wrote 14 hours 56 min ago:
I would assume the next iterations/fine-tuned variants of current
models would reach similar accuracy for TOON as they do for JSON.
The current models unfortunately do not have TOON in their training
set, so they would probably require additional input tokens to grok
the notation, and even then probably won't have the same accuracy
as they do for JSON.
brian-bk wrote 1 day ago:
There are some very light benchmarks in the README, or are you looking
for more?
Mumps wrote 1 day ago:
Do you mean the Token Benchmarks section [1]? I only see token
count numbers.
Which doesn't address the question: do LLMs understand TOON the
same as they would JSON? It's quite likely that this notation is
not interpreted by most LLMs the same way JSON is. So benchmarks
on, say, data processing tasks would be warranted.
URI [1]: https://github.com/johannschopplich/toon?tab=readme-ov-fil...
tujux wrote 1 day ago:
I think they're talking about these sections:
1. Retrieval Accuracy - [1]
2. Performance by dataset - [2]
URI [1]: https://github.com/johannschopplich/toon?tab=readme-ov-f...
URI [2]: https://github.com/johannschopplich/toon?tab=readme-ov-f...
meander_water wrote 1 day ago:
I don't get it; can't you just use YAML instead of inventing another
DSL?
inopinatus wrote 22 hours 3 min ago:
Norway.
Too wrote 14 hours 3 min ago:
This TOON is bound to have the same problem, because strings are
not quoted. You can't differentiate between the number 123 and
the string "123".
For LLM consumption this might not matter, but don't use this for
anything else.
dragonwriter wrote 21 hours 54 min ago:
YAML 1.2 has been out for 16 years now, so I would simply not
assume that the suggestion to use YAML for a new purpose means
"use YAML 1.1".
flyer23 wrote 21 hours 16 min ago:
It is, but no one uses it :)
inopinatus wrote 21 hours 50 min ago:
I could agree that you would not make poor assumptions.
Your LLM, however, may experience cross-format feature
superposition and consequential spurious activation.
jscheel wrote 1 day ago:
For repeating objects of the same structure, YAML will still require
each key on each object, whereas this is a hybrid with CSV, so it
defines the keys once.
3cats-in-a-coat wrote 20 hours 49 min ago:
No one forces us to use objects in JSON with repeated keys, you
know.
jscheel wrote 5 hours 34 min ago:
For sure, but most people aren't thinking intentionally about
what they are dumping into their context either ;)
makapuf wrote 20 hours 19 min ago:
Indeed, a
    {"header": ["some","column","names"],
     "values": [[1,2,3],[4,5,6],...]}
could fit.
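A sketch of producing that shape from the usual list of objects:
    import json

    def to_header_values(rows):
        # Emit the keys once, then each row as a plain array of values.
        header = list(rows[0])
        return {"header": header,
                "values": [[r[k] for k in header] for r in rows]}

    users = [{"id": 1, "name": "Alice", "role": "admin"},
             {"id": 2, "name": "Bob", "role": "user"}]
    print(json.dumps(to_header_values(users)))
    # {"header": ["id", "name", "role"],
    #  "values": [[1, "Alice", "admin"], [2, "Bob", "user"]]}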
mhosayny wrote 1 day ago:
It's more compact than YAML. More like a combination of YAML and CSV.
anonymoushn wrote 1 day ago:
Hello, it's probably better to add leading spaces before all of the
words rather than none of them
DIR <- back to front page