        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Thoughts on the Word Spec in Rust
       
       
        cbondurant wrote 2 hours 12 min ago:
         It could easily be the case that it's just outside the goals or scope
         of docx-rs, but I wonder: it would probably be pretty reasonable to add
         some kind of catch-all "unknown" variant that backs itself up by
         storing the names of tags as interned strings.
        
         This is justified by the idea that unexpected tags should be uncommon
         precisely because they are unexpected (if it's common, you should have
         expected it), and so they can be relegated to a less-performant cold
         path.
        
         It wouldn't be the most fun time ever for a developer depending on
         docx-rs whose explicit requirement is interacting with and modifying a
         tag that ends up in the "whatever" bucket, but at least you could make
         sure that you (de)serialize losslessly.
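         
         A rough sketch of that idea in Rust (the type and field names here are
         made up, not the actual docx-rs API): known tags get real variants, and
         anything unrecognized drops into an Unknown variant that keeps its raw
         XML so it can be written back out unchanged.
         
           enum BodyElement {
               Paragraph(Paragraph), // known, fully modeled tag
               // Cold path for anything unrecognized: remember which tag it
               // was and keep its raw XML so (de)serialization stays lossless.
               Unknown { tag: String, raw_xml: String },
           }
           
           struct Paragraph { runs: Vec<String> }
           
           // On save, known variants are re-serialized from the model;
           // unknown ones are echoed back verbatim.
           fn write_element(out: &mut String, el: &BodyElement) {
               match el {
                   BodyElement::Paragraph(p) => {
                       out.push_str("<w:p>");
                       for run in &p.runs {
                           out.push_str(&format!("<w:r><w:t>{}</w:t></w:r>", run));
                       }
                       out.push_str("</w:p>");
                   }
                   BodyElement::Unknown { raw_xml, .. } => out.push_str(raw_xml),
               }
           }
           
           fn main() {
               let doc = vec![
                   BodyElement::Paragraph(Paragraph { runs: vec!["Hello".into()] }),
                   BodyElement::Unknown {
                       tag: "w:unexpectedThing".into(),
                       raw_xml: "<w:unexpectedThing w:val=\"1\"/>".into(),
                   },
               ];
               let mut out = String::new();
               for el in &doc {
                   write_element(&mut out, el);
               }
               println!("{}", out);
           }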
       
        cbm-vic-20 wrote 2 hours 40 min ago:
        The article links to a classic Joel on Software article "In Defense of
        Not-Invented-Here Syndrome" written in 2001.
        
        It's interesting to see how this has played out 24 years later with
        "vibe coding" and how Amazon does business.
        
        > Indeed during the recent dotcom mania a bunch of quack business
        writers suggested that the company of the future would be totally
        virtual — just a trendy couple sipping Chardonnay in their living
        room outsourcing everything. What these hyperventilating
        “visionaries” overlooked is that the market pays for value added.
        Two yuppies in a living room buying an e-commerce engine from company A
        and selling merchandise made by company B and warehoused and shipped by
        company C, with customer service from company D, isn’t honestly
        adding much value.
        
   URI  [1]: https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-in...
       
        amelius wrote 4 hours 44 min ago:
        The original Word format was a literal dump of a part of the data
        segment of the Word process. Basically like an mmapped file. Super
        fast. It is a pity that modern languages and their runtimes do not
        allow data structures to be saved like that.
       
          OskarS wrote 3 hours 35 min ago:
           You can absolutely save data like that, it's just that it's a
           terrible idea. There are obvious portability concerns: little-endian
           vs. big-endian, 32-bit vs. 64-bit, struct padding, etc.
          
           Essentially, this system works great if you know the exact hardware
           and compiler toolchain, and you never expect to upgrade them with
           things that might break the memory layout. Obviously this does not
           hold for Word: it was written originally in a 32-bit world and now we
           live in a 64-bit one, MSVC has been upgraded many times, etc. There's
           also the address-space concern: if you embed your pointers, are you
           SURE that you're always going to be able to load them at the same
           place in the address space?
          
          The overhead of deserialization is very small with a properly written
          file format, it's nowhere near worth the sacrifice in portability.
          This is not why Word is slow.
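           
           To make that concrete, here is roughly what the memory-image approach
           looks like in Rust (a toy sketch, nothing to do with Word's actual
           structures): the file is just the structs' bytes, so the format
           silently depends on this build's layout and endianness.
           
             use std::{fs, io, mem, slice};
             
             // Pin the field order; the fields are chosen so there is no
             // padding, but endianness would still differ across targets, and
             // pointers could not be dumped at all without a fix-up pass.
             #[repr(C)]
             #[derive(Debug, Clone, Copy)]
             struct ParaProps {
                 style_id: u32,
                 indent_twips: i32,
                 spacing: u32,
             }
             
             fn dump(path: &str, props: &[ParaProps]) -> io::Result<()> {
                 // Reinterpret the slice as raw bytes: the "memory image" write.
                 let bytes: &[u8] = unsafe {
                     slice::from_raw_parts(props.as_ptr().cast::<u8>(),
                                           mem::size_of_val(props))
                 };
                 fs::write(path, bytes)
             }
             
             fn load(path: &str) -> io::Result<Vec<ParaProps>> {
                 let bytes = fs::read(path)?;
                 let n = bytes.len() / mem::size_of::<ParaProps>();
                 let mut out: Vec<ParaProps> = Vec::with_capacity(n);
                 unsafe {
                     // Byte copy into a properly aligned buffer; this assumes
                     // the writer had the same struct layout and endianness,
                     // which nothing in the file enforces or records.
                     std::ptr::copy_nonoverlapping(bytes.as_ptr(),
                                                   out.as_mut_ptr().cast::<u8>(),
                                                   n * mem::size_of::<ParaProps>());
                     out.set_len(n);
                 }
                 Ok(out)
             }
             
             fn main() -> io::Result<()> {
                 let props = [ParaProps { style_id: 7, indent_twips: 720, spacing: 240 }];
                 dump("props.bin", &props)?;
                 println!("{:?}", load("props.bin")?);
                 Ok(())
             }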
       
            amelius wrote 48 min ago:
            It's only a terrible idea because our tools are terrible.
            
            That's exactly the point!
            
             (For example, if Rust detected a version change, it could rewrite
             the data into a compatible format, etc.)
       
            skywal_l wrote 2 hours 16 min ago:
             Andrew Kelley (the author of Zig) has a nice talk about programming
             without pointers, which allows ultra-fast serialization and
             deserialization. [1]
             
             And then you have things like Cap'n Proto if you want to control
             your memory layout. [2] But for "productivity" files, you are
             essentially right: portability and simplicity of the format are
             probably what matter.
            
   URI      [1]: https://www.hytradboi.com/2025/05c72e39-c07e-41bc-ac40-85e...
   URI      [2]: https://capnproto.org/
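             
             As a rough sketch of that pointer-free style (a toy example, not
             code from the talk or from Cap'n Proto): nodes refer to each other
             by u32 index instead of by pointer, so the whole tree is
             position-independent and can be written out and read back as flat
             little-endian arrays.
             
               #[derive(Debug, Clone, Copy, PartialEq)]
               struct Node {
                   parent: u32,       // index into the nodes array; NONE for the root
                   first_child: u32,  // NONE if the node has no children
                   next_sibling: u32, // NONE if the node is the last sibling
                   payload: u32,      // e.g. an index into a side table of strings
               }
               
               const NONE: u32 = u32::MAX;
               
               // Serialization is just walking the flat array and emitting
               // fixed-width little-endian fields: no pointers to fix up, no
               // host layout leaking into the file.
               fn serialize(nodes: &[Node]) -> Vec<u8> {
                   let mut out = Vec::with_capacity(4 + nodes.len() * 16);
                   out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
                   for n in nodes {
                       for field in [n.parent, n.first_child, n.next_sibling, n.payload] {
                           out.extend_from_slice(&field.to_le_bytes());
                       }
                   }
                   out
               }
               
               fn u32_at(bytes: &[u8], i: usize) -> u32 {
                   u32::from_le_bytes(bytes[i..i + 4].try_into().unwrap())
               }
               
               fn deserialize(bytes: &[u8]) -> Vec<Node> {
                   let count = u32_at(bytes, 0) as usize;
                   (0..count)
                       .map(|i| {
                           let base = 4 + i * 16;
                           Node {
                               parent: u32_at(bytes, base),
                               first_child: u32_at(bytes, base + 4),
                               next_sibling: u32_at(bytes, base + 8),
                               payload: u32_at(bytes, base + 12),
                           }
                       })
                       .collect()
               }
               
               fn main() {
                   // A root with two children, all living in one Vec.
                   let nodes = vec![
                       Node { parent: NONE, first_child: 1, next_sibling: NONE, payload: 0 },
                       Node { parent: 0, first_child: NONE, next_sibling: 2, payload: 1 },
                       Node { parent: 0, first_child: NONE, next_sibling: NONE, payload: 2 },
                   ];
                   let bytes = serialize(&nodes);
                   assert_eq!(deserialize(&bytes), nodes);
                   println!("round-tripped {} nodes in {} bytes", nodes.len(), bytes.len());
               }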
       
              OskarS wrote 26 min ago:
               That is true, Cap'n Proto and FlatBuffers are excellent
               realizations of this basic concept. But that's a very different
               thing from what the commenter describes Word doing in the 90s:
               just memory-mapping the internal data structures and being done
               with it.
       
          maxerickson wrote 4 hours 3 min ago:
           There were all kinds of discussions of recovering text from corrupted
           files that just kind of went away when they moved over to explicit
           serialization in docx.
       
          johngossman wrote 4 hours 13 min ago:
           Your mileage may be different. I didn't work on Word (though I talked
           to those guys about their format), but I worked on two other apps
           that used the same strategy in the same era. One, on load you had to
           fix up runtime data that landed in that part of the data segment.
           Two, the in-memory representation was actually somewhat sparse, which
           meant that a serializer actually read and wrote less to disk than
           mapping the file did. So documents were smaller, there was less I/O,
           and loads were faster.
          
           The reason I hated it, though, was that it was very hard to version.
           I know the Word team had that problem, especially when the mandate
           came down for older versions to be able to read newer versions. It's
           hard enough to organize the disk format so old versions can ignore
           stuff, but now you're putting the same requirements on the in-memory
           representation. Maybe Word did it better.
       
        iyn wrote 6 hours 24 min ago:
         I see the author is here, so I wonder: do you also handle PDFs? A quick
         look at the site suggests yes, but could you tell us more about it? Do
         you also have a custom parser/serializer? Do you allow PDF editing?
         
         The reason I'm asking is that I've had a shower thought of building a
         custom PDF/doc reader for myself that would let me easily take notes
         and integrate with Anki. I've been doing that in Obsidian with the PDF
         plugin, but it's too slow. At the same time, I've heard that the PDF
         spec is not easy to work with, so I'm curious about your experience on
         that front.
       
          piker wrote 6 hours 19 min ago:
          Yes, it does render PDFs.
          
           There's actually an example PDF in the bundle if you click "Fetch
           Example" from the web preview at [1].
           
           Under the hood, Tritium is using PDFium [2], the same library used by
           Chrome, for example. The PDF spec is another animal that will be
           tackled in due course, but most legal users only need to view and
           comment on PDFs.
           
           Try to find a binding to PDFium from your language of choice and
           start at that layer. PDFs are complex beasts, and most of that
           complexity may not be necessary to tackle in the first instance.
          
   URI    [1]: https://tritium.legal/preview
   URI    [2]: https://pdfium.googlesource.com/pdfium/
       
        joachimma wrote 7 hours 39 min ago:
         I wonder why round-tripping is such a small concern for people
         implementing serializers/deserializers of various kinds. I usually
         throw in an "Unknown" node type, which stores things unaltered until I
         can understand them again. The parsers I usually write are very small,
         so I haven't seen what issues come up at scale; maybe there are dragons
         lurking.
       
          piker wrote 7 hours 34 min ago:
           This is the solution Tritium uses for that particular issue.
          
          [NOTE: one dragon would be the memory consumption alluded to in the
          article.]
       
            robmccoll wrote 7 hours 25 min ago:
            Could you intern strings? Seems like you're likely to see the same
            tags and attributes over and over.
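             
             A tiny interner is enough for that; something like this (a toy
             sketch, not Tritium's or docx-rs's actual code) lets every repeated
             "w:p" or "w:rPr" share one allocation:
             
               use std::collections::HashSet;
               use std::rc::Rc;
               
               #[derive(Default)]
               struct Interner {
                   names: HashSet<Rc<str>>,
               }
               
               impl Interner {
                   // Returns a cheaply clonable handle; repeated names hit the
                   // set and reuse the existing allocation instead of
                   // allocating again.
                   fn intern(&mut self, name: &str) -> Rc<str> {
                       if let Some(existing) = self.names.get(name) {
                           return existing.clone();
                       }
                       let rc: Rc<str> = Rc::from(name);
                       self.names.insert(rc.clone());
                       rc
                   }
               }
               
               fn main() {
                   let mut tags = Interner::default();
                   let a = tags.intern("w:p");
                   let b = tags.intern("w:p");
                   assert!(Rc::ptr_eq(&a, &b)); // both handles share one allocation
                   println!("{} unique tag name(s) interned", tags.names.len());
               }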
       
              piker wrote 7 hours 24 min ago:
              Yes, and there are probably a lot of other clever ideas. But the
              better solution is probably just to implement more of the spec.
              Once you get through maybe 80% of the tags, you've eliminated
              99.9% of the memory issue given their frequency distribution.
       
        olivermuty wrote 7 hours 43 min ago:
         Would be cool if they published this as OSS.
       
          piker wrote 7 hours 37 min ago:
          Thanks for the kind words, and I have given it serious thought.
          
          It's definitely not impossible in the future.
          
          I just don't think there is enough interest right now in contributing
          to the underlying tech without generalizing it so much that it
          basically becomes an inferior LibreOffice.
          
           Instead, the business model for Tritium is to give away the niche
           legal product to the community for free, but charge commercial users
           who need more granular control over its network activity, etc. This
           gives smaller start-ups, law offices and in-house shops a chance to
           benefit from the niche features, while leaving it to more demanding
           organizations to express an interest in and benefit from the advanced
           features.
       
            IshKebab wrote 6 hours 21 min ago:
            I think you should just charge everyone. I can't imagine there are
            many people in the community who would have a use for it but aren't
            professionals who could pay money for it.
            
            You could make a special exemption for non-profits and public
            defenders.
            
            Giving it away for free just creates potential for freeloaders.
            
            Great product idea by the way! Hard to believe lawyers have gone
            without this for so long.
       
          maverwa wrote 7 hours 37 min ago:
          or - if feasible - extend the existing crate instead of creating a
          new one.
       
            piker wrote 7 hours 35 min ago:
             I thought about that as well, but it's really just so core to every
             aspect of the product that Tritium needs to own it 100%. We just
             don't have the capacity to take on a tradeoff that favors the
             broader use case. I highlighted this issue in particular, but there
             were other places where Tritium's needs diverged from the docx_rs
             approach (e.g., dealing with references).
       
       