_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   Erasure Coding versus Tail Latency
       
       
        benlivengood wrote 17 hours 22 min ago:
        The next level of efficiency is using nested erasure codes.  The outer
        code can be across regions/zones/machines/disks while the inner code is
        across chunks of a stripe.  Chunk unavailability is fast to correct
         with an extra outer chunk, and bit rot or corruption can be fixed by
         the inner code without an extra fetch.  In the fast path, only data
         chunks need to be fetched.
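
         Roughly, in Python (toy XOR parities standing in for real outer and
         inner codes; equal-size chunks that divide evenly are assumed, and
         the names are made up):

           def xor_blocks(blocks):
               out = bytearray(len(blocks[0]))
               for b in blocks:
                   for i, byte in enumerate(b):
                       out[i] ^= byte
               return bytes(out)

           def encode_stripe(data_chunks, inner_blocks=4):
               # Outer code: one XOR parity chunk, placed on another
               # zone/machine, so a missing chunk can be rebuilt.
               chunks = data_chunks + [xor_blocks(data_chunks)]
               size = len(chunks[0]) // inner_blocks
               # Inner code: each chunk also carries a parity over its own
               # sub-blocks, so corruption inside a chunk is repaired locally.
               return [(c, xor_blocks([c[i * size:(i + 1) * size]
                                       for i in range(inner_blocks)]))
                       for c in chunks]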
       
          dragontamer wrote 16 hours 45 min ago:
          That's just LDPC but not as extreme.
          
           If you're going to "nest" erasure codes, you might as well make them
           XOR-based (XOR being the fastest operation on modern CPUs and/or
           hardware) and calculate a randomization scheme that has a very, very
           high probability (99.9%+) of fixing errors, among other benefits.
          
          To provide assurances that you have enough LDPC, you then run LDPC on
          your LDPC-check bits. Then you run LDPC on your LDPC-LDPC-check bits.
           Then you run LDPC on your LDPC-LDPC-LDPC-check bits, until you've got
          the target probability / chance of fixing errors that you desire.
          
          --------
          
          LDPC's brilliance is that XOR-as-erasure-code is exceptionally fast,
          and this "repeated hierarchy" of error-correction codes leads to
           high probabilities (99.9999%+) of successful correction of erasures.
       
        jeffbee wrote 18 hours 43 min ago:
        I do not follow. How is it possible that the latency is lower in a
        4-of-5 read of a coded stripe compared to a 1-of-4 replicated stripe?
       
          benlivengood wrote 17 hours 28 min ago:
           Full replication is probably always lower latency, but it costs N
           times the storage, whereas erasure coding costs N/M.  The cost of
           replicating write traffic is similar.
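
           For concreteness (toy numbers of my own, not from the article), in
           Python:

             # Raw storage per logical byte: 3x replication vs a 4-of-6 code.
             replication_overhead = 3       # three full copies
             ec_overhead = 6 / 4            # six fragments, any four rebuild
             print(replication_overhead, ec_overhead)   # -> 3 1.5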
       
            jeffbee wrote 16 hours 43 min ago:
            Certainly, but there are other tradeoffs. You can write a byte to a
            replicated stripe, but encoded stripes either demand full-stripe
            writes or writing a byte is hilariously expensive. It's not clear
            in the blog if we are talking about sealed read-only data or what.
       
              benlivengood wrote 14 hours 2 min ago:
              Practically all modern storage has a minimum sector write size;
              now commonly 4096 bytes for HDD and megabytes for NAND.  Most
              systems with fancy erasure codes address this by gathering
              multiple writes into single stripes and are effectively
              append-only filesystems with garbage collection to support
              snapshots.
              
              A large number of RS erasure codes are over GF(2^8) which could
               allow individual byte updates to a stripe for e.g. Optane as
              storage.  Generally, of course, most filesystems checksum at a
              block level and so byte-writes are rarely going to be implemented
              as such.
       
                vlovich123 wrote 13 hours 28 min ago:
                Small correction I think. The 4KiB for HDDs is fairly new and
                afaik most NAND is going to also be 4KiB with the exception of
                QLC which last I saw had a 64KiB minimum sector write size. If
                you have a source that NAND flash is using MiB minimum sector
                size, can you point me to that link?
       
                  toast0 wrote 11 hours 4 min ago:
                   4k sectors for hard drives have been around since roughly
                  2010, depending on your specific drive. 512e drives are very
                  common; if you do a write that's not 4k aligned, the drive
                  will turn it into a read / modify / write cycle.
       
          ec109685 wrote 18 hours 23 min ago:
          Aren’t they comparing 4 of 5 with 4 of 4?
       
            jeffbee wrote 17 hours 46 min ago:
            If so that doesn't make a great deal of sense because the
            alternative to a coded stripe isn't a unique unreplicated stripe,
            it's a replicated one. So the alternative would be to race multiple
            reads of all known replicas, or to hedge them.
       
              ec109685 wrote 16 hours 7 min ago:
              Good point. I guess it’s really 2 out of 2 versus 4 out of 6.
       
        sujayakar wrote 19 hours 5 min ago:
        this is a really cool idea.
        
        one followup I was thinking of is whether this can generalize to
        queries other than key value point lookups. if I'm understanding
        correctly, the article is suggesting to take a key value store, and for
        every `(key, value)` in the system, split `value` into fragments that
        are stored on different shards with some `k` of `M` code. then at query
        time, we can split a query for `key` into `k` subqueries that we send
        to the relevant shards and reassemble the query results into `value`.
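
         concretely, the point-lookup read path would be something like this
         rough Python sketch (`fetch_fragment` and `reassemble` are
         hypothetical helpers; fan out to every shard holding a fragment of
         `key` and keep the first `k` responses):

           from concurrent.futures import ThreadPoolExecutor, as_completed

           def point_lookup(key, shards, k, fetch_fragment, reassemble):
               pool = ThreadPoolExecutor(max_workers=len(shards))
               futures = [pool.submit(fetch_fragment, s, key) for s in shards]
               fragments = []
               for fut in as_completed(futures):
                   if fut.exception() is None:
                       fragments.append(fut.result())
                       if len(fragments) == k:
                           break
               # don't wait for stragglers (cancel_futures needs Python 3.9+)
               pool.shutdown(wait=False, cancel_futures=True)
               # decode / concatenate any k fragments back into `value`
               return reassemble(fragments)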
        
        so, if we were to do the same business for an ordered map with range
        queries, we'd need to find a way to turn a query for `interval: [start,
        end]` into some number of subqueries that we could send to the
        different shards and reassemble into the final result. any ideas?
       
          ddorian43 wrote 18 hours 43 min ago:
           There's an open source db [1] that uses erasure coding for
           replication in a single zone/region.
          
   URI    [1]: https://ydb.tech/
       
            eivanov89 wrote 18 hours 24 min ago:
             In YDB with block 4+2 erasure coding, you need half the disk space
             compared to the mirror-3-dc schema. Meanwhile, CPU usage is just a
             little bit higher, so in high-throughput tests mirror-3-dc wins.
             Indeed, as mentioned in the post, there might be a tail latency
             win in latency runs, but if your task is throughput with
             reasonable latencies, replication might be a better choice.
       
              ddorian43 wrote 13 hours 38 min ago:
               I expect it to save a lot of CPU by only needing 1/3 of the
               compactions. You might want to do a benchmark on that ;). An
              example is quickwit (building inverted indexes is very
              expensive).
       
              pjdesno wrote 18 hours 9 min ago:
               If you only care about throughput, just fetch the data chunks
               and reads should be the same speed as triple-replicated.
              
              For writing, triple-rep has to write 2x as much data or more, so
              it's going to be slower unless your CPUs are horribly slow
              compared to your drives.
       
          ddorian43 wrote 18 hours 43 min ago:
          > so, if we were to do the same business for an ordered map with
          range queries, we'd need to find a way to turn a query for `interval:
          [start, end]` into some number of subqueries that we could send to
          the different shards and reassemble into the final result. any ideas?
          
           For dbs that are backed by s3-like storage, the storage layer does
           this for you, but for blocks of, say, 1MB, and not per-kv (too much
           overhead).
           
           Think: you use rocksdb in your db and erasure-code the sstables.
       
        loeg wrote 19 hours 16 min ago:
        Yeah.  And you get the storage for free if your distributed design also
        uses the erasure-encoded chunks for durability.  Facebook's Warm
        Storage infrastructure does something very similar to what this article
        describes.
       
        ot wrote 19 hours 28 min ago:
        It is worth noting that this does not come for free, and it would have
        been nice for the article to mention the trade-off: reconstruction is
        not cheap on CPU, if you use something like Reed-Solomon.
        
        Usually the codes used for erasure coding are in systematic form: there
        are k "preferential" parts out of M that are just literal fragments of
        the original blob, so if you get those you can just concatenate them to
        get the original data. If you get any other k-subset, you need to
        perform expensive reconstruction.
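
         In pseudo-Python, the read side typically looks like this (`decode`
         stands in for whatever Reed-Solomon solve the library provides;
         fragment indices below `k` are the systematic ones):

           def reassemble(fragments, k, decode):
               # fragments: dict index -> bytes, for whichever k arrived first
               # decode: the library's (expensive) non-systematic reconstruction
               if all(i < k for i in fragments):          # all k literal parts
                   return b"".join(fragments[i] for i in range(k))
               return decode(fragments, k)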
       
          gizmo686 wrote 11 hours 27 min ago:
           That is true if you want a "perfect" algorithm that can provide
           arbitrary M-of-N guarantees. But if you are a bit more flexible in
           your requirements, you can get some very cheap reconstruction.
          
          I worked on a system that uses a variant of parity packet encoding.
           Basic parity packet encoding is very simple. You divide your data into
          N blocks, then send the XOR of all the blocks as an extra packet.
          Both sender and receiver maintain a running XOR of packets. As soon
           as the Nth packet has been received, they immediately reconstruct the
           N+1th packet without any additional work. This amounts to 1 extra
          XOR operation per unit of data, which is a trivial amount of overhead
          in almost any workload.
          
          Of course, the above scheme is limited to N/N+1 recovery (and is
          probably as good as you can do for that particular use case).
          
          However, it has a fairly simple extension to N/N+M recovery. Arrange
           the data in an NxM grid, and construct M sets of "extra" packets.
           The first set is constructed row-wise (effectively devolving into the
           above case). For the second set, rotate each column by its
          column index. So if R(x,y) is a redundant packet, and D(x,y) is a
          data packet at location (x,y) in the grid, you would have
          
            * R(0,0) = D(0,0) ^ D(1,0) ^ D(2,0) ^ ... D(N,0)
            * R(0,1) = D(0,1) ^ D(1,1) ^ D(2,1) ^ ... D(N,1)
            ...
            * R(0,M-1) = D(0,M-1) ^ D(1,M-1) ^ D(2,M-1) ^ ... D(N,M-1)
            * R(1,0) = D(0,0) ^ D(1,1) ^ D(2,2) ^ ... D(N,N%M)
            * R(1,1) = D(0,1) ^ D(1,2) ^ D(2,3) ^ ... D(N,(N+1)%M)
             * R(1,M-1) = D(0,M-1) ^ D(1,0) ^ D(2,1) ^ ... D(N,(N-1)%M)
            ...
            * R(2,0) = D(0,0) ^ D(1,2) ^ D(2,4) ^ ... D(N,2N%M)
            * R(3,0) = D(0,0) ^ D(1,3) ^ D(2,6) ^ ... D(N,3N%M)
          
          Your overhead is now M XOR operations per unit of real data, which is
          still trivial for reasonable values of M. The downside of this scheme
          is that if the first redundancy packet is not enough to reconstruct
          the dropped packet, you need to wait for the entire NxM table to be
          sent, which could cause a significant long-tail spike in latency if
           you are not careful. (The upside of this downside is that it provides
           even stronger burst protection than a traditional K-of-M erasure
           coding. If you get even more creative with how you group packets for
           the extra redundancy packets, you can get even stronger burst
           protection.) The other downside is that you end up being less
           space-efficient than Reed-Solomon error correction.
          
          Interestingly, the recovery algorithm I described is not optimal in
          the sense that there are times where it fails to recover data that is
          theoretically recoverable. Recovering data in all theoretically
          possible cases probably would be quite intensive.
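
           For what it's worth, the encode side of that rotated scheme is only
           a few lines of Python (my own naming; the recovery bookkeeping is
           where the real work is):

             from functools import reduce

             def xor(a, b):
                 return bytes(x ^ y for x, y in zip(a, b))

             def redundancy(grid, num_sets):
                 # grid[i][j] is data packet D(i, j).  Returns R(s, j) where
                 # R(s, j) = D(0, j) ^ D(1, (j+s) % M) ^ D(2, (j+2s) % M) ^ ...
                 m = len(grid[0])
                 return [[reduce(xor, (row[(j + s * i) % m]
                                       for i, row in enumerate(grid)))
                          for j in range(m)]
                         for s in range(num_sets)]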
       
          mjb wrote 19 hours 23 min ago:
          It's true that Reed-Solomon isn't free.
          
          The first two codes (N of N+1 and N of N+2) are nearly trivial and
          can be done very fast indeed. On my hardware, the N of N+1 code
           (which is an XOR) can be arranged to be nearly as fast as a memcpy
          (which obviously isn't free either). They can also be done in a
          streaming way which can save the memcpy if you're feeding a stream
          into a parser (e.g. JSON or something) or decryption.
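
           For example, a streaming N-of-N+1 decode is just a running
           accumulator (rough sketch, my own naming):

             def recover_missing(fragments):
                 # XOR of any N of the N+1 equal-length fragments
                 # (data + parity) is exactly the one that didn't arrive.
                 acc = bytearray(len(fragments[0]))
                 for frag in fragments:    # can be applied as fragments arrive
                     for i, b in enumerate(frag):
                         acc[i] ^= b
                 return bytes(acc)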
          
          > Usually the codes used for erasure coding are in systematic form:
          there are k "preferential" parts out of M that are just literal
          fragments of the original blob, so if you get those you can just
          concatenate them to get the original data.
          
          Yeah, that's true. If you're CPU bound, it may be worth waiting a
          little longer for these 'diagonal' components to come back.
       
            dragontamer wrote 16 hours 48 min ago:
            Reed Solomon is closer to "perfect" but is unnecessary.
            
            IIRC, Turbo Codes and LDPCs are less-perfect (they cannot offer
            strict guarantees like Reed-Solomon can), but as XOR-based simple
            operations, they are extremely extremely fast to implement.
            
             LDPC has high probabilities of fixing errors (near Reed-Solomon
             level), which is good enough in practice. Especially since LDPC's
             simple XOR-based operation is far faster, roughly O(n) instead of
             Reed-Solomon's O(n^2) matrix-based algorithm.
            
            The state of the art has moved forward. Reed Solomon is great for
            proving the practice and providing strict assurances (likely better
            for storage where you have strict size limits and need strong
            guarantees for MTBF or other such statistics). But for a "faster"
            algorithm (ie: trying to prevent repeated packets in a
            communication stream like TCP or similar protocol), LDPC and/or
            Turbo codes are likely a better solution.
            
            -----
            
            Reed Solomon is probably best for "smaller" codes where the matrix
            is smaller and O(n^2) hasn't gotten out of hand yet. But as codes
            increase in size, the O(n) "less than perfect" codes (such as Turbo
            codes or LDPC codes) become better-and-better ideas.
            
            That being said: I can imagine some crazy GPU / SIMD algorithm
            where we have such cheap compute and low bandwidth where the O(n^2)
            operation might serve as a better basis than the cheap XOR
            operation. The future of computers is going to be more compute and
            less relative memory bandwidth after all, so the pendulum may swing
            the other way depending on how future machines end up.
       
              aero_code wrote 13 hours 59 min ago:
              I'm interested in using LDPC (or Turbo Codes) in software for
              error correction, but most of the resources I've found only cover
              soft-decision LDPC. When I've found LDPC papers, it's hard for me
              to know how efficient the algorithms are and whether it's worth
              spending time on them. Reed-Solomon has more learning resources
              that are often more approachable (plus open source libraries). Do
              you have more information on how to implement LDPC decoding using
              XOR-based operations?
       
                ajb wrote 11 hours 57 min ago:
                 The search term you want for that is "binary erasure channel"
                 + LDPC decoding.
       
                dragontamer wrote 13 hours 27 min ago:
                Alas, no. I'm mostly spitballing here since I know that
                XOR-based codes are obviously much faster than Reed-solomon
                erasure code / matrix solving methodology.
                
                There's a lot of names thrown out for practical LDPC erasure
                codes. Raptor Codes, Tornado Codes, and the like. Hopefully
                those names can give you a good starting point?
                
                EDIT: I also remember reading a paper on a LDPC Fountain Code
                (ex: keep sending data + LDPC checkbits until the other side
                got enough to reconstruct the data), as a kind of "might as
                 well keep sending data while waiting for the ACK" kind of
                thing, which should cut down on latency.
                
                --------
                
                I'm personally on the "Finished reading my book on Reed-Solomon
                codes. Figuring out what to study next" phase. There's a lot of
                codes out there, and LDPC is a huge class...
                
                Then again, the project at work that I had that benefited from
                these error (erm... erasure) correcting codes was complete and
                my Reed-solomon implementation was good enough and doesn't
                 really need to be touched anymore. So it's not like I have a
                real reason to study this stuff anymore. Just a little bit of
                extra data that allowed the protocol to cut off some latency
                and reduce the number of resends in a very noisy channel we
                had. The good ol' "MVP into shelved code" situation, lol.
                Enough code to prove it works, made a nice demo that impressed
                the higher-ups, and then no one ended up caring for the idea.
                
                If I were to productize the concept, I'd research these more
                modern, faster codes (like LDPC, Raptor, Tornado, etc. etc.)
                and implement a state-of-the-art erasure correction solution,
                 ya know? But at this point, the project's just dead.
                
                But honestly, the blog-post's situation (cut down on latency
                with forward error correction) is seemingly a common problem
                that's solved again and again in our industry. But at the same
                time, there's so much to learn in the world of Comp. Sci that
                 sometimes it's important to "be lazy" and "learn it when it's
                clearly going to be useful" (and not learning it to
                hypothetically improve a dead project, lol).
       
              nsguy wrote 14 hours 15 min ago:
              I haven't heard about LDPCs before. Thanks!
              
              Do they serve the same use case though? With Reed-Solomon the
              idea is to recover from complete loss of a fragment of data
               (erasure coding), isn't LDPC strictly for error
              correction/"noise" (e.g. certain bits flipping but the data
              overall exists)?
       
                dragontamer wrote 14 hours 11 min ago:
                I admit that I haven't thought it all the way through, but in
                general, all error-correction codes I'm aware of have a simpler
                erasure-code version available too.
                
                Reed Solomon traditionally is an error-correction code, for
                 example, but it has common implementations as a simplified
                erasure-only code. (Ex: fixing "lost data" is far easier than
                fixing "contradictory data").
                
                 I'm fairly certain that LDPC erasure decoding is as simple as
                 "is there only one missing erasure in this particular check?"
                 and "missing data == LDPC check XOR (other data)".
                
                EDIT: The "hard part" is the exact composition of (other data),
                of which there's many styles and different methodologies with
                tons of different tradeoffs.
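
                 A rough sketch of that peeling loop in Python (the data layout
                 is my own invention: each check is a set of symbol indices
                 plus the XOR of those symbols):

                   def peel(checks, symbols):
                       # checks: list of (indices, xor_of_those_symbols)
                       # symbols: dict index -> bytes for fragments we have
                       progress = True
                       while progress:
                           progress = False
                           for indices, check_xor in checks:
                               missing = [i for i in indices
                                          if i not in symbols]
                               if len(missing) != 1:
                                   continue   # 0: done; 2+: not yet solvable
                               acc = bytearray(check_xor)
                               for i in indices:
                                   if i in symbols:
                                       for j, b in enumerate(symbols[i]):
                                           acc[j] ^= b
                               symbols[missing[0]] = bytes(acc)
                               progress = True   # may unlock other checks
                       return symbols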
       
              mjb wrote 16 hours 39 min ago:
              That's true too, this approach isn't limited to Reed-Solomon (or
              MDS codes). For non-MDS codes the fetch logic becomes a little
              more complicated (you need to wait for a subset you can
              reconstruct from rather than just the first k), but that's not a
              huge increase in complexity.
       
            pjdesno wrote 18 hours 13 min ago:
            There are new Intel instructions (GFNI) which accelerate things a
            lot, as well as various hacks to make it go fast. 
            See [1] for some quick and dirty benchmarks on jerasure, one of the
            EC plugins for Ceph, IIRC not using GFNI. (TLDR: 25GB/s on a Ryzen
            7700X)
            
   URI      [1]: https://www.reddit.com/r/ceph/comments/17z1w08/but_is_my_c...
       
        dmw_ng wrote 19 hours 37 min ago:
        Nice to see this idea written about in detail. I had thought about it
        in the context of terrible availability bargain bucket storage (iDrive
        E2), where the cost of (3,2) erasure coding an object and distributing
        each segment to one of 3 regions would still be dramatically lower than
        paying for more expensive and more reliable storage.
        
         Say one chunk lives in each of Germany, Ireland, and the US. The
         client races GETs to all 3 regions and cancels the request to the
         slowest to respond (which may also be down). Final client latency is
         equivalent to that of the 2nd-slowest region, with substantially
         better availability due to the ability to tolerate any single region
         being down.
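
         In asyncio terms it's roughly this (the per-region GET coroutines are
         hypothetical; keep the first 2 of 3, cancel the straggler):

           import asyncio

           async def read_2_of_3(fetchers):
               # fetchers: one GET coroutine per region
               tasks = [asyncio.ensure_future(f) for f in fetchers]
               done, pending = set(), set(tasks)
               while len(done) < 2 and pending:
                   finished, pending = await asyncio.wait(
                       pending, return_when=asyncio.FIRST_COMPLETED)
                   done |= {t for t in finished if not t.exception()}
               for t in pending:
                   t.cancel()          # drop the slowest (or down) region
               return [t.result() for t in list(done)[:2]]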
        
         Still wouldn't recommend using E2 for anything important, but ^ was one
         potential approach to dealing with its terribleness. It still doesn't
         address the reality that when E2 regions go down, it is often for days
         and reportedly sometimes weeks at a time. So reliable writing in this
         scenario would necessitate some kind of queue with capacity for weeks
         of storage.
        
         There are variants of this scheme where you could potentially balance
         the horribly unreliable storage with some expensive reliable storage
         as part of the same system, but I never got that far in thinking about
         how it would work.
       
          derefr wrote 18 hours 55 min ago:
          Having not heard of E2 before (it never seems to come up in the
          comparisons the other object-storage providers do to make themselves
          look good) — do you know why it has such bad availability? "Weeks
          of downtime" sounds crazy for a business focused on storage.
          
          If they aren't also failing at durability, then it wouldn't be any of
          the classical problems associated with running a storage cluster. Do
          they just... not bother with online maintenance / upgrades / hardware
          transitions?
       
            dmw_ng wrote 18 hours 41 min ago:
            My best guess is their primary product is a Windows (I think)
            backup client that has a place for smarts allowing them to paper
            over problems, or something along those lines. Feels like an "oh
            Germany is down again, when is Frank back from holiday so he can
            catch a plane to replace the switch?" type affair. If you google
            around the Reddit data hoarder communities you'll find plenty of
            war stories about E2
       
          ddorian43 wrote 19 hours 12 min ago:
          > Client races GETs to all 3 regions and cancels the request to the
          slowest to respond
          
          Dropbox does this internally in magic pocket (see their eng blog)
       
        siscia wrote 19 hours 42 min ago:
        Nice to see this talked about here and Marc being public about it.
        
         AWS is such a big place that even after a bit of tenure you still have
         places to look to find interesting technical approaches, and when I
         was introduced to this scheme for Lambda storage I was surprised.
        
         As Marc mentions, it is such a simple and powerful idea, and it is
         definitely not mentioned enough.
       
          mjb wrote 19 hours 28 min ago:
          If folks are interested in more details about Lambda's container
          storage scheme, they can check out our ATC'23 paper here:
          
   URI    [1]: https://www.usenix.org/conference/atc23/presentation/brooker
       
            pjdesno wrote 17 hours 56 min ago:
            Note that the fast_read option on erasure-coded Ceph pools uses the
            same technique. I'm not sure which release it was in, but I see
            commits referencing it from 2015.
       
       