_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Are You Sure You Want to Use MMAP in Your Database Management System? [pdf]
       
       
        kazinator wrote 3 days ago:
        I'm sure I've read a very similar paper years ago cautioning against
        using mmap. Darned if I can find it though!
       
        namibj wrote 3 days ago:
        Modern AMD Zen2/3 has more PCIe bandwidth than DRAM bandwidth.
        In other words, you can't saturate your flash if you buffer reads
        through DRAM, because the buffering costs you one write followed a bit
        later by one read.
        
         mmap hits its limits far earlier, because the kernel evicts pages
         with only a single thread and much of that eviction work needs
         global-ish locks.
        
        Use zoned storage and raw NVMe, with a fallback to io_uring. You need a
        userspace page cache of some sort. Maybe randomly sample the stack of
        pages you traversed to get to the page you're currently looking at, and
        bump them in the LRU.
         Feel free to default to streaming latency-insensitive table-scan
         operations without even caching them, so they don't pollute the
         cache.
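         
         A minimal sketch of that sampled-LRU idea, assuming a flat frame
         table keyed by page id; timestamps stand in for a real LRU list,
         and rand() picks which traversed ancestors get bumped (all names
         here are hypothetical, not from any real engine):
         
           /* Sketch: user-space frame table with sampled LRU bumps. */
           #include <stdint.h>
           #include <stdlib.h>
           
           #define NFRAMES   1024
           #define BUMP_1_IN 8   /* sample ~1/8 of ancestor pages */
           
           struct frame { uint64_t page_id; uint64_t last_used; };
           
           static struct frame frames[NFRAMES];
           static uint64_t clock_ticks;
           
           static struct frame *lookup(uint64_t page_id) {
               for (size_t i = 0; i < NFRAMES; i++)
                   if (frames[i].page_id == page_id)
                       return &frames[i];
               return NULL;  /* caller would fault the page in */
           }
           
           /* Bump the page we landed on, and randomly sample the stack
              of interior pages traversed to reach it, so hot ancestors
              stay resident without bumping every page on every access. */
           void touch_path(const uint64_t *path, size_t depth) {
               for (size_t i = 0; i < depth; i++) {
                   int is_leaf = (i == depth - 1);
                   if (is_leaf || rand() % BUMP_1_IN == 0) {
                       struct frame *f = lookup(path[i]);
                       if (f) f->last_used = ++clock_ticks;
                   }
               }
           }
           
           /* Victim selection: oldest timestamp loses (plain LRU). */
           struct frame *pick_victim(void) {
               struct frame *victim = &frames[0];
               for (size_t i = 1; i < NFRAMES; i++)
                   if (frames[i].last_used < victim->last_used)
                       victim = &frames[i];
               return victim;
           }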
       
          dataangel wrote 2 days ago:
          How do you avoid buffering reads through DRAM and also avoid using
          mmap? Like even io_uring has to get data off the device (which is
          probably memory mapped) and into DRAM, so maybe I'm just not
          understanding what you mean.
       
            namibj wrote 2 days ago:
            You can for example use a controller memory buffer with NVMe and
            just access the data over mapped PCIe, or you get it to DMA into L3
             cache and use/overwrite it before it gets its turn at the
            write-back part of the DRAM memory controller.
       
            gpderetta wrote 2 days ago:
            Some PCI devices can DMA directly into/from the CPU cache bypassing
             memory. See Intel Data Direct I/O (DDIO), for example.
       
        apavlo wrote 3 days ago:
        Please see the accompanying webpage with a 10min video:
        
   URI  [1]: https://db.cs.cmu.edu/mmap-cidr2022/
       
        jcranberry wrote 3 days ago:
        Seems like brk is the worse choice when implementing your buffer pool
        though. Especially if you want your DBMS to work on BSD.
       
        gavinray wrote 3 days ago:
        Database engineer folks:
        
        I recently got to talk to someone who had written a query engine on the
        JVM
        
         One thing I think they said is that:
        
          > "The buffer manager should probably be rewritten to use the native
        OS page cache. When we wrote it originally this functionality wasn't
        easily available and so we used Java DMA (Direct Memory Access)
        instead."
        
         I'm not familiar with what this means; would anyone be willing to
         explain more about the OS page cache and how you'd implement
         something like that?
        
        Would really appreciate it
       
          cpleppert wrote 3 days ago:
           I don't understand what that means, unfortunately. I would think the
           second part of the comment refers to direct buffers in NIO that were
           added in Java 1.4. One possibility is that he means directly
           interfacing with the operating system's I/O system calls and using
           sun.misc.Unsafe to read or write to buffers.
       
            gavinray wrote 3 days ago:
            If it helps, here's the docs and source for what I'm referring to:
            [1]
            
   URI      [1]: https://docs.jboss.org/author/display/TEIID/Memory%20Manag...
   URI      [2]: https://github.com/teiid/teiid/blob/master/engine/src/main...
   URI      [3]: https://github.com/teiid/teiid/blob/master/engine/src/main...
       
          sakras wrote 3 days ago:
          I’m not familiar with JVM things and what DMA is there, but usually
          when people talk about the OS page cache, they mean the in-memory
          file cache stored by the kernel. This means that if you read or write
          to a page in the cache, you’d be accessing memory instead of disk.
          
          The alternative is to open your file with O_DIRECT, which makes your
          reads/writes always interact with the storage system and bypasses the
          page cache.
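           
           A minimal sketch of that O_DIRECT path on Linux (the file name
           and the 4096-byte alignment are assumptions; O_DIRECT requires
           the buffer, offset, and length to be suitably aligned, hence
           the posix_memalign):
           
             /* Sketch: bypass the page cache with O_DIRECT (Linux). */
             #define _GNU_SOURCE   /* exposes O_DIRECT */
             #include <fcntl.h>
             #include <stdio.h>
             #include <stdlib.h>
             #include <unistd.h>
             
             int main(void) {
                 int fd = open("data.bin", O_RDONLY | O_DIRECT);
                 if (fd < 0) { perror("open"); return 1; }
             
                 /* Buffer, offset, and length must be aligned,
                    typically to the filesystem block size. */
                 void *buf;
                 if (posix_memalign(&buf, 4096, 4096)) return 1;
             
                 /* This read hits the device, not the page cache. */
                 if (pread(fd, buf, 4096, 0) < 0) perror("pread");
             
                 free(buf);
                 close(fd);
                 return 0;
             }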
       
          dharmab wrote 3 days ago:
          This article is a decent introduction to the page cache: [1] You
          typically have to go out of your way to _not_ use the page cache.
          
   URI    [1]: https://lwn.net/Articles/457667/
       
            gavinray wrote 3 days ago:
            This looks super useful -- thank you!
       
        znpy wrote 3 days ago:
         It's curious that OpenLDAP is not mentioned among the databases.
        
        The people at Symas (the company behind OpenLDAP) implemented their own
        storage layer, and wrote a document about that:
        
   URI  [1]: https://www.openldap.org/pub/hyc/mdb-paper.pdf
       
          tmulcahy wrote 3 days ago:
          They do mention LMDB, which OpenLDAP uses as a backend:
          
   URI    [1]: https://www.openldap.org/doc/admin24/backends.html
       
        ajross wrote 3 days ago:
        IMHO more iconoclasm is needed here:  "Are you sure you want to use
        storage in your DBMS?"
        
        Obviously some kind of persistent store is needed.  But in the modern
        world RAM is so cheap and so performant, and cross-site
        redundancy/failover techniques are so robust, and sharding paradigms so
        scalable, that... let's be honest, a deployed database is simply never
        going to need to restore from a powered-off/pickled representation. 
        Ever.
        
        The hard parts of data management are all in RAM now.  So... sure,
        don't mmap() files.  But... maybe consider not using a complicated file
        format at all.    Files should be a snapshotted dump, or a log, or some
        kind of simple combination thereof.  If you're worrying about B+ trees
        and flash latency, you're probably doing it wrong.
       
          ayende wrote 3 days ago:
           Generally, a really bad idea.
           Assume 512 GB of RAM with most of it in use.
          With a log file for durability.
          
          It sounds fine, but...
          
          What is the cost of restarts?
          You need to read through the log, apply it, etc.
          
          That can take a LOT of time, especially on cloud disks
       
            AtlasBarfed wrote 2 days ago:
            Your comment is so funny in contrast to the article.
            
            Article: "Don't use mmap because modern storage is soooo fast. It
            will BLOW YOUR MIND!!!"
            
            You: "I'm in the cloud!"
            
             .........oh. Nevermind.
       
          mamcx wrote 3 days ago:
           This is only true for the niche subset of "large-ish" datasets and
           VERY sophisticated customers, where operating a distributed RAM
           store is "fine".
           
           A lot, by a GIGANTIC margin, of the needs for a DB are not close
           to this, at all.
           
           Starting with SQLite, which is likely the most deployed: it is
           impossible that this scenario could work for it (and that's
           accounting for how small most DBs are).
       
          marginalia_nu wrote 3 days ago:
          RAM is really fucking expensive though. Even a modest data set of a
          couple of dozen terabytes will cost an arm and a leg if you're
          keeping it mostly in RAM (with redundancy).
       
            ajross wrote 3 days ago:
            Only for hardware.  Once you're at scale and deploying a redundant
            database, your hardware costs start to vanish anyway.
            
            I mean, yes, putting a bunch of drives on a local machine is cheap.
             But that too is a circumstance where a mmap()-using DBMS is
            probably inappropriate.
       
              loeg wrote 3 days ago:
              > Once you're at scale and deploying a redundant database, your
              hardware costs start to vanish anyway.
              
              This is really, really false.  It would make my life a lot easier
              if it were true.
       
                samhw wrote 3 days ago:
                I assume they mean "relative to [my idea of how much a company
                will make when serving that amount of traffic]" - which does
                depend on the industry they are in, not just in terms of how
                profitable it is but also its 'user load coefficient', so to
                speak.
                
                (The first job I had was at a 10-person startup in a coworking
                space, barely making enough revenue to break even, and still
                consuming a vast, vast amount of data, because the product
                involved constant streams of sensor data from a vast number of
                tiny cheap devices. People tend to forget that not every single
                business is a CRUD web/mobile app whose average user accounts
                for $20/month revenue against at most a few hundred HTTP
                requests and a couple megabytes of disk.)
       
                  Dylan16807 wrote 3 days ago:
                  > a vast, vast amount of data, because the product involved
                  constant streams of sensor data from a vast number of tiny
                  cheap devices
                  
                  Even that depends on a lot of factors.    Just as an example,
                  if it takes $10 to build and deploy a sensor, and it returns
                  one number per second, then $1 of RAM can hold 2+ years of
                  data before archiving it.
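                   
                   Rough arithmetic behind that (the 4 bytes per sample
                   and the ~$4/GiB RAM price are assumptions, not
                   figures from the thread):
                   
                     /* Back-of-the-envelope check, assumed prices. */
                     #include <stdio.h>
                     
                     int main(void) {
                         double bytes_per_year = 4.0 * 86400 * 365;
                         double usd_per_gib = 4.0;     /* assumed */
                         double years_per_usd =
                             (1 << 30) / usd_per_gib / bytes_per_year;
                         printf("%.0f MB/yr, %.1f yr per $1 of RAM\n",
                                bytes_per_year / 1e6, years_per_usd);
                         return 0;
                     }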
       
                    samhw wrote 1 day ago:
                    Hmm, how many bits are you allowing for the 'number'? Also,
                    you need to identify which user it belongs to, as well as -
                    in our case - which device and which sensor on that device.
                    Also, is that ordinary NVRAM? How do you protect against
                    bit flips? Google's well-known paper[0] found a 0.22%
                    average incidence of uncorrectable errors per DIMM - that's
                    corruption of more bits than ECC can fix (and I'm not sure
                    that your pricing is even for ECC RAM). Your tolerance of
                    errors may differ, but you'd probably want to replicate the
                    data at least once. Disk seems a good choice, for the added
                    benefit that you can actually survive a power cut without
                    going bust (i.e. redundancy against total loss, as well as
                    handling the 'freak errors' in data correctness that become
                    the norm when you're dealing with vast amounts of
                    hardware). I'm a fan of the kind of minimalism that you're
                    advocating, don't get me wrong, but it's from pushing that
                    limit that I've learned what the hard limits are.
                    
                    [0]
                    
   URI              [1]: https://static.googleusercontent.com/media/researc...
       
                      Dylan16807 wrote 1 day ago:
                      > Hmm, how many bits are you allowing for the 'number'?
                      
                      32.  Which I would expect to be overkill compared to the
                      precision of your average sensor.
                      
                      > Also, you need to identify which user it belongs to, as
                      well as - in our case - which device and which sensor on
                      that device.
                      
                      Which is the same for big chunks of readings, so I'm
                      assuming a system that's able to store that metadata once
                      per several readings.
                      
                      > How do you protect against bit flips?
                      
                      I dunno, I was just saying the cost of memory.    You can
                      have bit flips on any kind of server no matter how you're
                      doing it, and they might get persisted, so I'd say that's
                      out of scope.
                      
                      > average incidence of uncorrectable errors per DIMM
                      
                      That average is highly skewed by broken DIMMs though. 
                      They found a mean of a few thousand correctable errors
                      per DIMM, but they also found that 80% of DIMMs on one
                      platform and >96% of DIMMs on the other platform had zero
                      correctable errors.
                      
                      > Your tolerance of errors may differ, but you'd probably
                      want to replicate the data at least once. Disk seems a
                      good choice, for the added benefit that you can actually
                      survive a power cut without going bust
                      
                      Sure, disk for backup sounds lovely and would be
                      extremely cheap compared to the RAM.  I wasn't advocating
                      having only the RAM copy, just saying that depending on
                      other factors it might be reasonable for a RAM copy to be
                      the main analysis database even for sensor data.
                      
                      The post that started this thread directly says you
                      should be using persistent files as write-only
                      dumps/logs.
       
                    marginalia_nu wrote 3 days ago:
                    That ignores any type of structure to the data (such as
                    timestamps and which sensor it is). It would also probably
                    need to be indexed somehow.
       
                      Dylan16807 wrote 3 days ago:
                      I left spare space for that.  Especially if you store a
                      minute of samples at a time.
       
        eatonphil wrote 3 days ago:
        > Despite these apparent success stories, many other DBMSs have
        tried—and failed—to replace a traditional buffer pool with
         mmap-based file I/O. In the following, we recount some cautionary tales
        to illustrate how using mmap in your DBMS can go horribly wrong
        
         Saying "many other DBMSs have tried — and failed" is a little weirdly
         put, because above that they show a list of databases that use or used
         mmap, and the number still using mmap (MonetDB, LevelDB, LMDB,
         SQLite, QuestDB, RavenDB, and WiredTiger) is greater than the number
         they list as having once used mmap and moved off it (Mongo,
         SingleStore, and InfluxDB). Maybe they just omitted some others that
         moved off it?
        
        True they list a few more databases that considered mmap and decided
        not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true
        they list RocksDB as a fork of LevelDB to avoid mmap.
        
         My point being that this paper seems to downplay the number of systems
         it introduces as still using mmap. And it didn't go much into the
         potential benefits that, say, SQLite or LMDB sees in keeping mmap an
         option, other than in the introduction where they mentioned perceived
         benefits. Or maybe I missed it.
       
          jakewins wrote 3 days ago:
          I was heavily involved in the push to switch Neo4j off of memory
          mapping and on to an in house page cache.
          
          In retrospect, I’d say it was something like 50% driven by Java
          memory mapping having insane issues on Windows, 20% Java memory
          mapping having insane issues on every platform, 25% me being a noob
          and 5% actual issues with memory mapping on Linux.
          
           I think if I could say “this DB is POSIX only” I would try memory
           mapping for the next DB I build.
       
            icedchai wrote 3 days ago:
            I worked at a company that built a custom database on top of mmap,
            back in the early 2000's.  It was single threaded, intended for
            very limited OLTP use cases, not a generic DBMS, but worked
            wonderfully for those use cases. A couple gigabytes was more than
            enough address space. During this time, a machine with 4 gigabytes
            RAM was considered "high end"! It only ran on a couple of posix
            platforms.
            
            At startup, it would read everything into memory so everything was
            hot, ready to go. The "database" was intended to run on a dedicated
            node with no other applications running. It also kept a write ahead
            log for transaction recovery.
       
            apavlo wrote 3 days ago:
             > I think if I could say “this DB is POSIX only” I would try
             memory mapping for the next DB I build
            
            Great! Please reach out to us after you learn (again) that this is
            a bad idea!
       
            kjeetgill wrote 3 days ago:
            > Java memory mapping having insane issues on every platform
            
            I'm very interested, what issues? I've occasionally considered
            using mmap for a few things, but never had justification to. I'm
            curious what issues I might have run into.
            
            Did you use JNI/JNA to mmap or via RandomAccessFile?
       
              RyanHamilton wrote 3 days ago:
              2GB limit. Cost of indirect addressing. Inability to address RAM
              and disk using same method calls without overhead.
              
              I've implemented a modest database in both C and Java with memory
              mapping. It's hard to explain the feeling of programming both but
              in general: with java I found myself trying to add intelligence
              at the cost of complexity to make things run faster. In C I could
              get speed by removing overheads and doing less work.
       
                kjeetgill wrote 2 days ago:
                Thanks! Insightful. I wonder if you could get what you needed
                via unsafe but I agree: that's fighting the system in Java.
       
          mytherin wrote 3 days ago:
           SQLite does not use mmap as its main I/O mechanism by default; it
           provides it as an optional feature [1].
          
          mmap has one large advantage, which is that it allows you to share
          buffers with the operating system. This allows you to share buffers
          between different processes, and allows you to re-use caches between
          restarts of your process. This can be powerful in some situations,
          especially as an in-process database, and cannot be done without
          using mmap.
          
          There are many problems with using mmap as your only I/O and buffer
          management system (as listed in the paper and the SQLite docs). One
          of the main problems from a system design perspective is that mmap
          does not enforce a boundary on when I/O occurs and what memory to
          manage. This makes it very hard to switch away from mmap towards a
          dedicated buffer pool, as this will require significant
          re-engineering of the entire system.
          
          By contrast, adding optional support for mmap in a system with a
          dedicated buffer pool is straightforward.
          
   URI    [1]: https://www.sqlite.org/mmap.html
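           
           For example, SQLite's optional mmap mode is enabled
           per-connection with PRAGMA mmap_size; a minimal sketch (the
           file name and the 256 MiB cap are arbitrary):
           
             /* Sketch: enable SQLite's optional mmap I/O. */
             #include <sqlite3.h>
             #include <stdio.h>
             
             int main(void) {
                 sqlite3 *db;
                 if (sqlite3_open("test.db", &db) != SQLITE_OK)
                     return 1;
             
                 /* Map up to 256 MiB of the file; 0 disables mmap. */
                 char *err = NULL;
                 if (sqlite3_exec(db, "PRAGMA mmap_size=268435456;",
                                  NULL, NULL, &err) != SQLITE_OK) {
                     fprintf(stderr, "pragma failed: %s\n", err);
                     sqlite3_free(err);
                 }
             
                 sqlite3_close(db);
                 return 0;
             }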
       
          benbjohnson wrote 3 days ago:
           There was a previous discussion on HN a couple months ago [1].
          The tl;dr seemed to be that mmap() is a good first step and
          eventually you can swap it out for a custom-made buffer pool if you
          need to. A lot of new databases are trying to figure out
          product/market fit and spending time on a buffer pool initially is
          usually not worth it.
          
   URI    [1]: https://news.ycombinator.com/item?id=29936104
       
            derefr wrote 3 days ago:
            How custom are these “custom buffer pools”, anyway? Could DBMS
            use-cases be similar enough that a single buffer-pool library —
            if such a thing were to be created — could suit all their needs?
            A Jemalloc equivalent specifically tuned for managing large
            disk-backed allocations?
       
              jasonwatkinspdx wrote 3 days ago:
              The buffer pool, concurrency control, and recovery algorithm all
              need to dovetail into each other. In theory you could have a
              generic pool library that offers a Pin and Unpin based API, but
              as you start talking about stuff like garbage collection that has
              to play nice with incremental checkpointing and the WAL rollover
              and... it gets hard to make a reusable library vs just each
              database writing something narrowly tailored to their needs.
              
              Also, a lot of the interesting research stuff going on right now
              is on storage engines that look fairly different from a
              traditional RDBMS buffer pool centric storage engine.
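               
               A sketch of what such a generic Pin/Unpin surface might
               look like (entirely hypothetical names; as noted above,
               the hard part is how eviction, checkpointing, and the
               WAL would hook into it):
               
                 /* Hypothetical buffer-pool interface, sketch only. */
                 #include <stdbool.h>
                 #include <stddef.h>
                 #include <stdint.h>
                 
                 typedef uint64_t page_id_t;
                 typedef struct buffer_pool buffer_pool_t; /* opaque */
                 
                 /* A pool of `frames` page buffers backed by `path`. */
                 buffer_pool_t *pool_open(const char *path,
                                          size_t frames);
                 void pool_close(buffer_pool_t *pool);
                 
                 /* Pin returns a stable pointer; the frame cannot be
                    evicted until the matching unpin. `dirty` tells the
                    pool the page must be written back before eviction. */
                 void *pool_pin(buffer_pool_t *pool, page_id_t id);
                 void pool_unpin(buffer_pool_t *pool, page_id_t id,
                                 bool dirty);
                 
                 /* Where it stops being generic: flushing for an
                    incremental checkpoint, ordered behind the WAL. */
                 void pool_checkpoint(buffer_pool_t *pool);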
       
              jandrewrogers wrote 3 days ago:
              That is essentially what the kernel cache is.
              
              Caches are pretty heavily customized for the database design
              because they control so much of the runtime behavior of the
              entire system. The implementations are different based on the
              software architecture (e.g. thread-per-core versus
              multi-threaded), storage model, target workload, scale-up versus
              scale-out, etc. The "custom buffer pool" isn't just a cache, it
              is also a high-performance concurrent I/O scheduler since it is
              responsible for cache replacement.
              
              Even if you were only targeting a single software architecture
              and storage model, it would require a very elaborate C++
              metaprogramming library to come close to generating an
              optimized-for-purpose cache implementation. Not worth the effort.
              The internals are pretty modular and easy to hack on even in
              sophisticated implementations. In practice, it is often simpler
              to take components from existing implementations and do some
              custom assembly and tweaking to match the design objective.
              
              In either case, you still have to understand how and why the
              internals do what they do to know how to effect the desired code
              behavior. For people that have a lot of experience doing it, the
              process of writing yet another one from scratch is pretty
              mechanical.
       
              benbjohnson wrote 3 days ago:
              There’s enough commonality that you could make some kind of
              generic buffer pool library. However, writing databases is still
               quite niche so I’m not sure it’d be worth it. There are also
               a lot of details with regard to transactional semantics that
               differ per database.
       
            eatonphil wrote 3 days ago:
            Thanks! I missed that thread.
       
          thecleaner wrote 3 days ago:
           The point is that mmap is good for read-heavy workloads, and they do
           go into a test setup where they compare performance between mmap and
           a file-based I/O system. Bandwidth seems to fall after the page
           cache fills up.
           
           Anyways, it's not like the OS will do some magic once a page needs to
           be flushed to disk. It all depends on how quickly and how nicely the
           scheduling is done. The paper also goes into the software aspects of
           using an mmap system; sections 3.2 and 3.4 seem like quite relevant
           software problems.
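           
           Concretely, the two access paths being compared look roughly
           like this (a compressed sketch, not the paper's actual
           harness; error handling and sizes are simplified):
           
             /* Sketch: the two read paths under comparison. */
             #include <stdint.h>
             #include <sys/mman.h>
             #include <unistd.h>
             
             /* Path 1: mmap the file; page faults do the I/O and the
                kernel page cache decides what stays resident. */
             uint64_t scan_mmap(int fd, size_t len) {
                 uint8_t *p = mmap(NULL, len, PROT_READ,
                                   MAP_SHARED, fd, 0);
                 if (p == MAP_FAILED) return 0;
                 uint64_t sum = 0;
                 for (size_t i = 0; i < len; i += 4096) sum += p[i];
                 munmap(p, len);
                 return sum;
             }
             
             /* Path 2: explicit reads into the DBMS's own buffer;
                the engine decides what to cache and what to evict. */
             uint64_t scan_read(int fd, size_t len) {
                 static uint8_t buf[1 << 20];
                 uint64_t sum = 0;
                 for (off_t off = 0; (size_t)off < len;
                      off += sizeof buf) {
                     if (pread(fd, buf, sizeof buf, off) <= 0) break;
                     sum += buf[0];
                 }
                 return sum;
             }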
       
       
   DIR <- back to front page