_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Dividing unsigned 8-bit numbers
       
       
        RaisingSpear wrote 28 min ago:
        For CPUs that support AVX-512 VBMI, there's a faster reciprocal-based
        approach: [1] A VBMI2 example implementation can be found in the
        function named divide_with_lookup_16b:
        
   URI  [1]: https://avereniect.github.io/2023/04/29/uint8_division_using_a...
   URI  [2]: https://godbolt.org/z/xdE1dx5Pj
       
        xpe wrote 8 hours 24 min ago:
        The bit-division burns my eyes and makes a mockery of my puny
        intellect, my liege. Mine fellow vassals possess skills most frightful
        in their potency.
       
        afstr wrote 9 hours 4 min ago:
        M  I’m looking I’m I’m
        O
        ; r
       
        jonhohle wrote 9 hours 5 min ago:
        I do a fair amount of decompiling and the list of constants is awesome!
        I usually can make a reasonable guess or have intuition, but frequently
        end up stepping through several integers to find the right value.
        Having a handy lookup table will be great.
       
        purplesyringa wrote 11 hours 35 min ago:
        It's odd that the only reason AVX-512 long division wins in the end is
        that it's compared to AVX2 reciprocal. Would it be possible to compare
        it to AVX-512 reciprocal computation?
       
        dzaima wrote 13 hours 5 min ago:
        Here's a version of the vrcpps idea, doing whole vectors of 32 u8's,
        bypassing lane crossing and somewhat simplifying the packing/unpacking,
        that's ~1.5x faster on Haswell:
        
   URI  [1]: https://godbolt.org/z/TExGahv3z
       
          Cold_Miserable wrote 11 hours 10 min ago:
          This is ~9.6x faster than "scalar".
          
          ASM_TestDiv proc    ;rcx out, rdx A, r8 B
          mov rax,05555555555555555H
          kmovq k1,rax
          vmovdqu8 zmm0,zmmword ptr [rdx]
          vmovdqu8 zmm4,zmmword ptr [r8]
          vpbroadcastw zmm3,word ptr [FLOAT16_F8]
          vmovdqu8 zmm2{k1},zmm0    ;lower 8-bit
          vmovdqu8 zmm16{k1},zmm4  ;lower 8-bit
          vpsrlw zmm1,zmm0,8  ;higher 8-bit
          vpsrlw zmm5,zmm4,8  ;higher 8-bit
          vpord zmm1,zmm1,zmm3
          vpord zmm2,zmm2,zmm3
          vpord zmm5,zmm5,zmm3
          vpord zmm16,zmm16,zmm3
          vsubph zmm1,zmm1,zmm3{rd-sae}  ;fast conv 16FP
          vsubph zmm2,zmm2,zmm3{rd-sae}
          vsubph zmm5,zmm5,zmm3{ru-sae}
          vsubph zmm16,zmm16,zmm3{ru-sae}
          vrcpph zmm5,zmm5
          vrcpph zmm16,zmm16
          vfmadd213ph zmm1,zmm5,zmm3{rd-sae}
          vfmadd213ph zmm2,zmm16,zmm3{rd-sae}
          vxorpd zmm1,zmm1,zmm3
          vxorpd zmm2,zmm2,zmm3
          vpsllw zmm1,zmm1,8
          vpord zmm1,zmm1,zmm2
          vmovdqu8 zmmword ptr [rcx],zmm1  ;16 8-bit unsigned int
          ret
       
        synthos wrote 14 hours 3 min ago:
        Newton Raphson could be used to calculate a reciprocal. (in a bit width
        larger than 8). If starting from a good reciprocal approximation,
        convergence to bit accurate reciprocal should not take many iterations.
        Then multiply the reciprocal with the numerator to perform the divide
       
        Cold_Miserable wrote 14 hours 11 min ago:
        Heh?
        Surely fast convert 8-bit int to 16-bit FP,rcp+mul/div is a no-brainer?
        edit make that fast convert,rcp,fma (float 16 constant 1.0) and xor
        (same constant)
       
          bremac wrote 9 hours 50 min ago:
          Unfortunately none of the hardware used for testing supports FP16
          arithmetic. Between Intel and AMD, the only platform that supports
          AVX512-FP16 is currently Sapphire Rapids.
       
          purplesyringa wrote 9 hours 54 min ago:
          I tried a similar approach with 32-bit FP before, and the problem
          here is that fast conversion is only fast in the sense of latency.
          Throughput-wise, it takes 2 uops instead of one, so in the end, a
          plain float<->int conversion wins.
       
        ack_complete wrote 14 hours 58 min ago:
        Think the SSE2 implementation could be tightened up by using the same
        register for the dividend and the quotient, shifting the quotient bits
        into the dividend as the dividend bits are shifted out. This was common
        practice in software CPU division routines.
       
        mlochbaum wrote 15 hours 17 min ago:
        We implemented something like this in CBQN last year (mainly for
        modulus, as floor division isn't a primitive). Commit is [1] , some
        proofs of when and why it works at [2] .
        
   URI  [1]: https://github.com/dzaima/CBQN/commit/d333902
   URI  [2]: https://mlochbaum.github.io/BQN/implementation/primitive/arith...
       
        ryao wrote 15 hours 17 min ago:
        Why not use libdivide?
        
   URI  [1]: https://libdivide.com/
       
          LegionMammal978 wrote 15 hours 12 min ago:
          libdivide is optimized for the case of a common divisor used many
          times, not for a long array of distinct divisors.
       
        dataflow wrote 15 hours 28 min ago:
        > SIMD ISAs usually do not provide the integer division; the only known
        exception is RISC-V Vector Extension
        
        It's kind of funny to read "the only known exception is..." in this
        context. What would an unknown exception be - an ISA that accidentally
        implements this but that the author believes nobody is aware of yet?
        
        More seriously, I actually don't understand the intended meaning here.
        I assume the author means "out of all the ISAs I know"? What is that
        set of ISAs?
       
          dzaima wrote 14 hours 50 min ago:
          Some SIMD ISAs with integer division:
          
          - Arm SVE, though only for 32-bit and 64-bit element types: [1] -
          loongarch64 LSX/LASX: [2] - MRISC32 (though that's somewhat obvious
          as basically everything in it is shared between scalar and vector).
          
   URI    [1]: https://developer.arm.com/architectures/instruction-sets/int...
   URI    [2]: https://jia.je/unofficial-loongarch-intrinsics-guide/viewer/...
       
          camel-cdr wrote 14 hours 54 min ago:
          The Convex C1 [0] and for a newer example NEC SX Aurora [1] both also
          support vector integer division.
          
          [0] [1]
          
   URI    [1]: https://bitsavers.org/pdf/convex/080-000120-000_CONVEX_Archi...
   URI    [2]: https://ftp.libre-soc.org/NEC_SX_Aurora_TSUBASA_VectorEngine...
       
          bee_rider wrote 15 hours 9 min ago:
          Practically, could the expression “only know exception” mean
          anything other than “known by me?” I mean, it is clearly possible
          for an exception to exist, on account of the existing known
          exception, so they can’t know that more exceptions don’t exist
          out there somewhere.
          
          I dunno. I think it is a more meaningful statement if we know more
          about the author; if we assume that they are very well informed, I
          guess we would assume that the fact that they don’t know about
          something is really meaningful. In the case of a blog post where most
          of us don’t know the author, it is hard to infer much. But at least
          it tells us why they decided to do the thing.
       
            zahlman wrote 11 hours 45 min ago:
            If a SIMD ISA exists, someone must know that it exists, because
            definitionally we only apply the term "SIMD ISA" to things that
            were consciously created to be such. So we could simply check every
            such example. Saying "only known example" is indeed silly.
            
            But e.g. in mathematics, if we say that "almost every member of set
            X has property Y; the only known exception is Z" then there
            absolutely could be more exceptions, even if we pool the knowledge
            of every mathematician. It isn't necessary that X is finite, or
            even enumerable. It could be possible for exceptions other than Z
            to exist even though every other member of the set that we know
            about has the property. It could be possible to prove that there
            are at most finitely many exceptions in an infinite set, and only
            know of Z but not be able to rule out the possibility of more
            exceptions than that.
            
            We don't even need to appeal to infinities. For example, there are
            problems in discrete math where nobody has found the exact answer
            (which necessarily is integer, by the construction of the problem)
            but we can prove upper and lower bounds. Suppose we find a problem
            where the known bounds are very tight (but not exact) and the
            bounded value is positive. Now, construct a set of integers ranging
            from 0 up to (proven upper bound + 1) inclusive... you can probably
            see where this is going.
            
            The latter doesn't apply to SIMD ISAs, because we know all the
            interesting (big hand-wave!) properties of all of them rather
            precisely - since they're designed to have those properties.
       
          perching_aix wrote 15 hours 16 min ago:
          You could come up with an ISA that implements it and it wouldn't be
          "known". Maybe that helps?
       
          michaelhoffman wrote 15 hours 22 min ago:
          > I assume the author means "out of all the ISAs I know"?
          
          Out of all the ISAs where they know whether it provides integer
          division or not.
       
            dataflow wrote 15 hours 11 min ago:
            Yeah but my point is that as a reader I'm trying to figure out
            which ISAs actually don't provide this (vs. which ISAs the author
            lacks knowledge of), and I still don't know what those are. The
            sentence looks like it's supposed to tell me, but it doesn't.
       
        foundry27 wrote 15 hours 38 min ago:
        I think the approximate reciprocal approach is interesting here. The
        doc mentions multiplying the dividend by ~1.00025 in the math to avoid
        FP error so you don’t end up off-by-one after truncation, but I think
        this hack is still incomplete! On some inputs (like 255, or other
        unlucky divisors near powers of two), you might get borderline rounding
        behaviour that flips a bit of the final integer. It’s easy to forget
        that single-precision floats don’t line up neatly with every 8bit
        integer ratio in real code, and a single off-by-one can break pixel ops
        or feed subtle bugs into a bigger pipeline.
        
        I suspect a hybrid scheme like using approximate reciprocals for most
        values but punting to scalar for unlucky ones could handle these corner
        cases without killing performance. That’d be interesting to benchmark
       
          LegionMammal978 wrote 15 hours 15 min ago:
          There are only 65280 possible inputs, that's easily small enough to
          test every value for correctness.
       
        orlp wrote 15 hours 45 min ago:
        What's not mentioned is that in most cases you have a constant divisor
        which lets you replace division by multiplication with the reciprocal.
        The reciprocal can be rounded to a nearby dyadic rational, letting you
        do the division with a right-shift.
        
        For example, 8-bit division by 3 is equivalent to widening
        multiplication by 171 followed by a right-shift of 9, as 171/2^9 =
        0.333984375 which is close enough to 1/3 that the results match
        exactly.
       
          15155 wrote 12 hours 23 min ago:
          These methods are especially useful in hardware/FPGA implementations
          where it's infeasible to have a ton of fully pipelined dividers.
       
            andrepd wrote 11 hours 27 min ago:
            They are actually useful for optimising compilers too! Mul or
            mul+shifts is often faster than div
       
          kevinventullo wrote 14 hours 15 min ago:
          Also, if you know ahead of time that it’s exact division, there is
          a similar approach that doesn’t even need widening multiplication!
       
            Footkerchief wrote 12 hours 55 min ago:
            Can you provide an example with details?  Thanks!
       
              kevinventullo wrote 12 hours 1 min ago:
              In 8-bit arithmetic (i.e. mod 256), the multiplicative inverse of
              11 is 163. So, if you take some multiple of 11, say 154, then you
              can compute 154/11 instead as 154*163.
              
              Indeed,
              
              154*163 = 25102, and
              
              25102 = 14 (mod 256).
       
            orlp wrote 13 hours 37 min ago:
            Yes, if you know something is an exact multiple of n = r*2^k where
            r is odd, you can divide out the multiple by right-shifting k
            followed by modular multiplication by the modular multiplicative
            inverse of r.
       
              thaumasiotes wrote 8 hours 47 min ago:
              Theoretically, you could also take advantage of two's complement
              arithmetic:
              
                     00011011 (27)
                   x 01010101 (-0.333...)
                  ------------------------
                     11110111 (-9)
                  
                  invert and add one:
                  
                     00001001 (9, like we wanted)
              
              I'd be interested in extending this to non-exact multiples, but
              if that's possible I don't know how.
       
          thaumasiotes wrote 14 hours 17 min ago:
          > For example, 8-bit division by 3 is equivalent to widening
          multiplication by 171 followed by a right-shift of 9, as 171/2^9 =
          0.333984375 which is close enough to 1/3 that the results match
          exactly.
          
          Is this related to the fact that 171 is the multiplicative inverse of
          3 (mod 256), or is that a coincidence?
       
            zahlman wrote 12 hours 19 min ago:
            Sort of. (After all, a "reciprocal" for our purposes is just a
            multiplicative inverse in the reals, so it makes sense that it
            would be related to the multiplicative inverse in other domains.)
            
            3 times 171 is 513. So to divide by 3, we could multiply by 171 and
            then divide by 513. Dividing by 513... isn't any easier, but 513 is
            close to 512, so we hope that dividing by 512 (which is trivial -
            just a right-shift) gets us close enough.
            
            Suppose for a moment we try dividing 3 by 3 using this trick. First
            we'll multiply by 171 to get 513. Consider that value in binary:
            
                1000000001
                 ^^^^^^^^^
                  ~~~~~~~~
            
            Dividing by 512 will shift away the ^ bits. For floor division, we
            therefore want the ^ underlined value to be close to zero. (That
            way, when we divide 255 (say) by 3, the error won't be big enough
            to overflow into the result bits.)
            
            The multiplicative inverse fact is equivalent to telling us that
            the ~ underlined bits are exactly 1. Conveniently, that's close to
            0 - but we didn't account for all the ^ underlined bits. For
            example, the multiplicative inverse of 7 (mod 256) is 183, but 7
            times 183 is 1281. That's close to 1280, but that doesn't really
            help us - we could right-shift by 8 but then we still have to
            divide by 5. If we just ignore the problem and divide by 1024
            (right-shift by 10), of course we get a lot of wrong results. (Even
            6 / 7 would give us 1 instead of 0.)
            
            It also turns out that we'll need more bits for accurate results in
            the general case. I think it's possible without overflowing 16-bit
            numbers, but it definitely requires a bit more trickery for
            problematic divisors like (from my testing) 195. I thought I
            remembered the details here but proved myself wrong at the Python
            REPL :(
       
              orlp wrote 11 hours 55 min ago:
              Division by 195 is trivial. The answer is simply
              
                  uint8_t div195(uint8_t x) { return x >= 195; }
       
                zahlman wrote 11 hours 42 min ago:
                Yes, but you can't incorporate that into an algorithm that
                allows the divisor to vary and avoids branching. 195 is
                problematic for the multiply-shift strategy.
       
            orlp wrote 13 hours 25 min ago:
            It's not entirely a coincidence but also not a general result that
            one should use the modular inverse as multiplier.
            
            171 * 3 = 2^9 + 1, which is not surprising as we know that 171 * 3
            = 1 (mod 2^8). So rearranged we have 171 / 2^9 = 1/3 + 1/(2^9*3)
            which shows it's a close approximation of 1/3.
       
          edflsafoiewq wrote 15 hours 13 min ago:
          A shift of 16 is enough for every 8-bit numerator, ie. x/a is
          (u32(x)*b)>>16 for some b depending only on a. You could precompute b
          for each a and store it a lookup table. The largest b is b=65536 for
          a=1 and the smallest is b=258 for a=255, so b fits in a u16 if stored
          with a 1-offset. Not sure it's worth it unless you reuse the
          denominator many times though.
       
        AlotOfReading wrote 15 hours 59 min ago:
        I'm not certain it'll actually be faster, but you should be able to
        merge the reciprocal, multiplication and rounding adjustment into a
        single fma in log space via evil floating point bit hacks. Then you'd
        just be paying the cost of converting to and from the float
        representation of the integer.
       
          ashvardanian wrote 15 hours 54 min ago:
          I'd also love to see such comparisons. In SimSIMD, if AVX-512FP16 is
          available, I use FP16 for most I8/U8 operations, but can't speak
          about division.
          
          Overall, using F64 directly for I32 division is a well known trick,
          and it should hold for smaller representations as well.
       
            AlotOfReading wrote 15 hours 45 min ago:
            Don't have time to figure out the actual numbers myself, but here's
            an example doing the same for various transcendental functions: [1]
            a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be able
            to do a*(fma(b, magic, constant)) with appropriate conversions on
            either side. Should work in 32 bits for u8s.
            
   URI      [1]: https://github.com/J-Montgomery/hacky_float_math/blob/623e...
       
        abcd_f wrote 16 hours 8 min ago:
        Since it's 8bit by 8bit, a precomputed lookup table (64K in size) will
        be another option.
       
          ack_complete wrote 15 hours 16 min ago:
          A lookup table in memory can only be accessed an element at a time,
          so it gets bottlenecked in the execution units on the load and
          address generation traffic. This algorithm uses 8-bit integers, so
          the vectorized version is processing between 16-64 elements per
          operation depending on the vector width. It's even worse if the 8-bit
          divide is integrated with other vector operations as then the lookup
          table method also has insert/extract overhead to exchange the values
          between the vector and integer units.
          
          A hybrid approach using small in-register lookup tables in the vector
          unit (pshufb/tbl) can be lucrative, but are very limited in table
          size.
       
            mithametacs wrote 14 hours 44 min ago:
            I've never thought of a micro-lookup-table!
            
            That's cool.  What have you used those for?
       
              anonymoushn wrote 4 hours 0 min ago:
              You can test whether a register full of bytes belong to a class
              of bytes with the high bit unset and distinct lower nibbles in 2
              instructions (each with 1 cycle latency). For example the
              characters that commonly occur in text and must be escaped in
              json are "\"\\\r\n\t" (5 different bytes). Their lower nibbles
              are:
              
              \": 0010
              
              \t: 1001
              
              \n: 1010
              
              \\: 1100
              
              \r: 1101
              
              Since there are no duplicate lower nibbles, we can test for
              membership in this set in 2 instructions. If we want to get the
              value into a general-purpose register and branch on whether it's
              0 or not, that's 2 more instructions.
              
              For larger sets or sets determined at runtime, see here:
              
   URI        [1]: http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html
       
              glangdale wrote 4 hours 42 min ago:
              Not OP, but we use these tables all the time in Hyperscan (for
              string and character class matching) and it's a long-standing
              technique to use it for things like SIMD population count
              (obsoleted now if you have the right AVX-512 ISA, ofc).
       
              ack_complete wrote 14 hours 28 min ago:
              Typically just like conventional lookup tables, where you can get
              the table size down small enough. Indexed palette / quantization
              coding is a case where this can often work. It's pretty niche
              given the limitations, but if you can make it work it's often a
              major speedup since you're able to do 16/32/64 lookups in
              parallel.
       
          ashvardanian wrote 15 hours 59 min ago:
          64 KB is a pretty significant budget for such a small operation. I've
          had a variant that uses 768 bytes with some extra logic, but will
          deprecate that kernel soon.
          
   URI    [1]: https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...
       
            sroussey wrote 15 hours 55 min ago:
            This seems like something that could be in hardware to allow native
            execution of the instruction. Is that not the case anywhere?
       
          mmastrac wrote 16 hours 5 min ago:
          Lookup tables aren't necessarily faster these days for a lot of
          things when using low-level languages, but it would have been
          interesting to see the comparison here.
       
            adhoc32 wrote 16 hours 1 min ago:
            pretty sure a memory access is faster than the methods presented in
            the article.
       
              ryao wrote 15 hours 6 min ago:
              An uncached random memory access is around 100 cycles.
       
                Sesse__ wrote 13 hours 17 min ago:
                100 cycles would be very low. Many systems have more than 100
                ns!
       
              PhilipRoman wrote 15 hours 27 min ago:
              Depends also heavily on the context. You pay for each cache miss
              twice - once for the miss itself, and next time when you access
              whatever was evicted during the first miss. This is why LUTs
              often shine in microbenchmarks, but drag down performance in real
              world scenarios when mixed with other cache bound code.
       
              Retr0id wrote 15 hours 27 min ago:
              64K is enough to fill L1 on many systems
       
              retrac wrote 15 hours 46 min ago:
              Access to main memory can be many many cycles; a short routine
              already in cache may be able to recompute a value more quickly
              than pulling it from main memory.
       
              dist-epoch wrote 15 hours 57 min ago:
              Hitting L2 is more than 3-4 cycles
       
       
   DIR <- back to front page