_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Lossless LLM compression for efficient GPU inference via dynamic-length float
       
       
        gitroom wrote 1 day ago:
         Pretty cool seeing how fast all this moves - feels like every week
         there's a new trick or hardware upgrade. I def get nerd sniped by these
         efficiency improvements lol.
       
        firefoxd wrote 1 day ago:
         Someone has figured out how to compress images even further with LLMs.
         They've been promising to publish a white paper since last year: [1] /s
         I'll show myself out
        
   URI  [1]: https://getproxyai.com/blog/this-image-is-4KB
       
        jsemrau wrote 1 day ago:
        I still hold the opinion that ternary instead of binary would lead to
        an even higher degree of compression.
       
          xmasotto wrote 1 day ago:
          The underlying memory is still binary, or were you proposing an
          entirely new computer architecture with ternary gates?
       
            buildbot wrote 1 day ago:
             Not necessarily new - the first ternary computer appeared back in 1959!
            
   URI      [1]: https://en.wikipedia.org/wiki/Setun
       
        thund wrote 1 day ago:
        Is this different than ZipNN? [1] I see it mentioned but can’t
        understand if it’s based on it or different/better…
        
   URI  [1]: https://arxiv.org/pdf/2411.05239
       
          jhj wrote 1 day ago:
           Not really; it's just adding some data transposition (coalescing the
           individual bytes of the data words together) and an option to run an
           LZ/dictionary-type compressor over the result. But an LZ-type
           compressor doesn't make much sense on NN weights, I think: they are
           not as redundant as most text data with its many repeats, and the
           space of possible dictionary matches is pretty small, since unless
           the data is highly sparse there may not be many repetitions you can
           leverage to offset the dictionary overhead.
          
          If you add an LZ-type compressor and have this be in the critical
          path for inference, then decompression will be a lot slower. It would
          be best to fuse decompression with the compute kernels (e.g., a GEMM
          that performs decompression on each tile before the arithmetic), and
          the simpler the decompression routine, the easier this will be.
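           
           To make the transposition point concrete, here is a rough sketch of
           mine (not ZipNN's actual code), using synthetic bfloat16 weights:
           splitting the 16-bit words into high and low bytes before
           compression tends to help a general-purpose codec, because the
           sign/exponent bytes are low-entropy while the mantissa bytes are
           close to noise.
           
           import zlib
           import numpy as np
           
           # synthetic "weights": bfloat16 is the top 16 bits of a float32
           rng = np.random.default_rng(0)
           w = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)
           bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)
           
           raw = bf16.tobytes()   # interleaved low/high bytes
           # high bytes: sign + top exponent bits (low entropy)
           hi = (bf16 >> 8).astype(np.uint8).tobytes()
           # low bytes: last exponent bit + mantissa (near-random)
           lo = (bf16 & 0xFF).astype(np.uint8).tobytes()
           
           print(len(zlib.compress(raw)) / len(raw))
           print((len(zlib.compress(hi)) + len(zlib.compress(lo))) / len(raw))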
       
          thund wrote 1 day ago:
          Found it, the news reminded me of this paper
          
   URI    [1]: https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7...
       
        aseligman wrote 1 day ago:
         Some additional context: many real-world agent use cases struggle to
         balance quality, cost, and performance. This technique can help avoid
         the tradeoffs that quantization introduces, including unpredictable
         results while you try to cost-optimize an agent. In some cases the
         cost savings can be significant with DFloat11 as you squeeze into more
         affordable GPUs.
        
        * I work with xmad.ai
       
        aazo11 wrote 1 day ago:
        This is a huge unlock for on-device inference. The download time of
        larger models makes local inference unusable for non-technical users.
       
        Animats wrote 1 day ago:
        Once this weight format war settles down, hardware can be built to
        support it. Presumably you want matrix multiply hardware optimized for
        whatever weight format turns out to be reasonably optimal.
       
          eoerl wrote 1 day ago:
           Optimization is post hoc here: you have to train first to be able to
           Huffman encode, so it's not a pure format question
       
        jhj wrote 1 day ago:
         This is just a consequence of the fact that bfloat16 has a very high
         dynamic range, most of which goes unused. People like hyperparameters
         that look like 0.01, not 10^10, even though the same fractional
         precision is available at every exponent. If you multiplied everything
         in a network - hyperparameters, initialized weights, training data,
         etc. - by 10^6, things would still work more or less the same, since
         the upper range is hardly used (with the possible exception of a small
         number of special functions).
        
         The typical entropy of bfloat16 values seen in weights (and
         activations) is about 10-12 bits (only 65-75% or so of the value
         range is used in practice). Sign and mantissa bits tend to be
         incompressible noise.
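         
         As a rough illustration of where those bits sit (using synthetic
         weights as a stand-in for a real checkpoint), the empirical entropy
         of each bfloat16 bit field can be measured like this:
         
         import numpy as np
         
         def entropy_bits(values):
             # empirical Shannon entropy, in bits per symbol
             _, counts = np.unique(values, return_counts=True)
             p = counts / counts.sum()
             return float(-(p * np.log2(p)).sum())
         
         # bfloat16 is the top 16 bits of a float32
         rng = np.random.default_rng(0)
         w = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)
         bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)
         
         print("sign    :", entropy_bits(bf16 >> 15), "of 1 bit")
         print("exponent:", entropy_bits((bf16 >> 7) & 0xFF), "of 8 bits")
         print("mantissa:", entropy_bits(bf16 & 0x7F), "of 7 bits")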
        
         This has been exploited several times before in the context of both
         classical HPC and AI, with lossless compression work from Martin
         Burtscher's lab ( [1] ), fpzip from LLNL ( [2] ), and my library
         dietgpu from 2021 ( [3] ). We used dietgpu to speed up training on a
         large GPU cluster by about 10% in overall wall-clock time by
         losslessly compressing all data prior to send and decompressing upon
         receive (e.g., gradients, weights from backup, etc.); since it is
         lossless, the cluster still computes exactly the same thing as it did
         before.
        
         Also, rANS is more efficient and easier to implement in SIMD-like
         instruction sets than Huffman coding. It would also reduce DFloat11's
         latency/throughput penalties (since we have to decompress before we
         do the arithmetic).
        
   URI  [1]: https://userweb.cs.txstate.edu/~burtscher/
   URI  [2]: https://computing.llnl.gov/projects/fpzip
   URI  [3]: https://github.com/facebookresearch/dietgpu
       
          Dylan16807 wrote 15 hours 5 min ago:
          Was bfloat a mistake then?  Wasn't the point of it to increase
          dynamic range?
          
          At least the cost to truncate and zero fill is small.
       
          liuliu wrote 1 day ago:
           That makes you think: if we could rewind time, maybe we should have
           just allocated one more exponent bit to half precision (6 exponent,
           9 mantissa bits) and skipped this whole bfloat16 thing.
       
          bjornsing wrote 1 day ago:
          > if you multiplied everything - hyperparameters, initialized
          weights, training data, etc in a network by 10^6 things will still
          work more or less the same since the upper range is hardly used (with
          the possible exception of some small number of special functions)
          
           I doubt that very much. The thing is that inputs are multiplied by
           weights and added together in a neural network layer, and then the
           output becomes the input of the next layer, in a cycle that can
           repeat a hundred times or more. By the time you get to the final
           output layer, that 10^6 factor has been applied so many times that
           it has snowballed into a 10^600 factor.
       
            ironbound wrote 1 day ago:
             The DeepSeek-V3 paper details a quantisation method that scales
             after the matmul but before accumulation to improve precision.
             This differs from a normal GEMM, where those operations are left
             until the end; you can read more in section 3.3 of the paper
             below.
            
   URI      [1]: https://arxiv.org/html/2412.19437v2#S3
       
          brookst wrote 1 day ago:
          Thanks for the fantastic explanation!
          
           Would it be more efficient to calculate some kind of per-model or
           per-layer mean, and then only specify deviations from it, maybe in
           fp8 or smaller?
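           
           Not what the paper does (DFloat11 stays lossless), but a toy sketch
           of the mean-plus-deviations idea being asked about might look like
           this, with int8 standing in for fp8 and a purely synthetic tensor:
           
           import numpy as np
           
           rng = np.random.default_rng(0)
           w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)
           
           mu, sigma = w.mean(), w.std()    # per-layer statistics
           
           # store each weight as its deviation from the mean in 1/16-sigma
           # steps, clipped to a signed byte (a crude stand-in for fp8)
           z = np.clip(np.round((w - mu) / sigma * 16), -127, 127).astype(np.int8)
           
           w_hat = mu + z.astype(np.float32) / 16 * sigma   # reconstruction
           print("max abs error:", float(np.abs(w - w_hat).max()))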
       
          refibrillator wrote 1 day ago:
           Note to others reading along: on the last appendix page, the OP paper
          reports DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8b and
          Qwen-2.5-14b/32b and Mistral-small-24b models (throughput penalty not
          reported for others).
          
          Using DFloat11, tokens/sec was higher only when compared relative to
          running inference with some layers offloaded to CPU.
          
          Classic comp sci tradeoff between space and speed, no free lunch,
          etc.
       
          hinkley wrote 1 day ago:
          Do you think there’s a call for introducing an even smaller float
          that can pack more values into a SIMD register? Like a 12 bit?
       
            boulos wrote 1 day ago:
            The latest GPUs and TPUs support fp8. It's a big part of the
            efficiency gain in the latest systems. Blackwell also supports fp4.
       
          vessenes wrote 1 day ago:
          Thanks Jeff -- can you point me to something written up about rANS?
           All I find online is turbulence modeling solutions; I presume this
          is not what you're referring to.
          
          As we know, quantizations are a critical tool for local LLM runners;
          RAM is typically the gating factor. Are you aware of other better
          lossless compression of BF16 weights out there?
          
           The reason I ask is that DFloat11 seems relatively easy to plug into
          existing quantization workflows, but you seem dismissive of the paper
          -- I presume it's my gap in understanding, and I'd like to
          understand.
       
            zorgmonkey wrote 1 day ago:
            I don't know of any great write-ups unfortunately, but the rANS
            you're looking for is range asymmetric numeral systems.
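             
             If it helps to see the mechanics, here is a toy byte-wise rANS
             coder in Python (my own illustration, not dietgpu or the paper's
             code; production implementations use power-of-two totals,
             interleaved states, and SIMD renormalization):
             
             from collections import Counter
             
             def build_model(data):
                 # symbol frequencies and cumulative counts
                 freq = Counter(data)
                 cum, total = {}, 0
                 for s in sorted(freq):
                     cum[s] = total
                     total += freq[s]
                 return freq, cum, total
             
             def encode(data, freq, cum, M):
                 L = M * 256                          # lower bound of state range
                 x, out = L, []
                 for s in reversed(data):             # rANS encodes backwards...
                     f = freq[s]
                     while x >= (L // M) * 256 * f:   # flush a byte to bound x
                         out.append(x & 0xFF)
                         x >>= 8
                     x = (x // f) * M + cum[s] + (x % f)
                 return x, bytes(reversed(out))
             
             def decode(x, stream, n, freq, cum, M):
                 L = M * 256
                 sym_at = {cum[s] + i: s for s in freq for i in range(freq[s])}
                 it, out = iter(stream), []
                 for _ in range(n):                   # ...and decodes forwards
                     slot = x % M
                     s = sym_at[slot]
                     out.append(s)
                     x = freq[s] * (x // M) + slot - cum[s]
                     while x < L:                     # refill from the stream
                         x = (x << 8) | next(it)
                 return bytes(out)
             
             msg = b"bfloat16 exponents compress well"
             freq, cum, M = build_model(msg)
             state, stream = encode(msg, freq, cum, M)
             assert decode(state, stream, len(msg), freq, cum, M) == msg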
       
              eln1 wrote 12 hours 30 min ago:
              There are lots of materials about ANS, e.g. gathered here:
              
   URI        [1]: https://encode.su/threads/2078-List-of-Asymmetric-Numera...
       
          iandanforth wrote 1 day ago:
          For those who don't bother to click through profiles, Jeff really
          knows what he's talking about. Much of Meta/FAIR + community benefits
          from his code.
       
            VladVladikoff wrote 1 day ago:
            I really love HN for this reason. Full of some of the brightest
            minds on the internet. Often the comments have very interesting
            information, instead of stupid knee jerk reactions to post titles.
       
        anticensor wrote 1 day ago:
        This is just a VBR mode for neural networks. Not quite useful when
        inference is already quite slow.
       
          vessenes wrote 1 day ago:
          Even presuming this is an accurate summary, the conclusion is not
          accurate - most local LLM inference users are constantly trading off
          quality for speed, in that speed drops dramatically once RAM is full.
           So, if you think in terms of speed at a desired quality, this could
           be very useful.
       
        yjftsjthsd-h wrote 1 day ago:
        > Compared to a potential alternative of offloading parts of an
        uncompressed model to the CPU to meet memory constraints, DFloat11
        achieves 1.9-38.8x higher throughput in token generation. With a fixed
        GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths
        than uncompressed models.
        
        The context length alone probably makes it worthwhile even if your
         models fit in memory, but I'm curious if it improves tokens/sec even
         when running entirely on GPU, since in my very amateur understanding
         LLMs tend to be
        constrained by memory bandwidth?
       
          brigade wrote 1 day ago:
          It does not; the decompression is memory to memory, one tensor at a
          time, so it’s worse. They claim less than 200 GB/s on an A100, and
          their benchmarks suggest it’s somewhere between 1.5-4x slower at
          batch size 1 depending on GPU and model. This overhead of course
          mostly disappears with a large enough batch size.
          
           Other lossless codecs can hit 600 GB/s on the same hardware, so
           there should be some room for improvement. But the A100’s raw
           memory bandwidth is 1.6 TB/s.
       
          hnuser123456 wrote 1 day ago:
           If the model is 70% of the size and inference is memory-bound, it
           should be up to 1/0.7 = 1.43x the speed.
       
          philjohn wrote 1 day ago:
           My mental model says it might, much like how DoubleSpace in DOS
           slightly sped up loading data from slow hard drives.
       
        hchja wrote 1 day ago:
        This is pretty useless in any case that doesn’t involve BFloat16
        models
       
          spindump8930 wrote 1 day ago:
           bf16 is the de facto default datatype and distribution type for LLMs,
          which are then often eagerly quantized by users with more limited
          hardware. See the recent Llama releases and e.g. the H100 spec sheet
          (advertised flops and metrics target bf16).
       
          throwaway314155 wrote 1 day ago:
          So an increasingly smaller number of cases?
       
        luotuoshangdui wrote 1 day ago:
        Does it affect speed?
       
        marksimi wrote 1 day ago:
        Time to (dynamically) float
       
        mountainriver wrote 1 day ago:
         Is it possible to run this on new models? It seems like the code is
         only for inference, unless I’m misunderstanding.
       
        ein0p wrote 1 day ago:
         Note that this is _way_ slower at the small batch sizes you'd need for
        interactive use. At batch size 1 this seems to run at 1/3rd the speed
        of bf16 (so about 1/6th the speed of fp8 you'd realistically be using)
        if figure 5 is to be believed. This is actually a pretty impressive
        feat in itself if you know anything about GPU kernel programming, but
        it is much slower nevertheless. For this to work at "wire speed" it'd
        need hardware support, which takes years. Their "baseline" elsewhere in
        the paper is CPU offloading, which is dog slow and can't be made fast
        due to PCIe bottleneck.
       
          ow5 wrote 1 day ago:
           Hi! I'm one of the contributors to the paper. We have kernels, not
           yet released, that can shave decoding latency down by >20%.
           
           Also, when we ran experiments for streaming with the current kernels,
           we were a median of ~1.3x slower at inference.
       
            ein0p wrote 1 day ago:
            Thanks for chiming in! How do you explain the top-most graph in
            Figure 5? Am I misreading it?
       
          timschmidt wrote 1 day ago:
          It's perfectly possible to run LLMs quickly on CPUs.  An Epyc or Xeon
          with 12 memory channels achieves similar memory bandwidth to a 4090,
           which is the limiting factor. Engineering sample Epycs in kits with
           motherboard and RAM are even available on Aliexpress for reasonable
           prices.
       
            ein0p wrote 1 day ago:
            Did I say it wasn't? If your context is short and your model is
            small, it is possible to run LLMs on high-end CPUs able to support
            12 channels of high-spec DDR5 RDIMMs. It's not possible to run them
            as fast as they'd run on a GPU equipped with HBM though. Nor would
            it be even remotely as energy efficient. Also, it's not possible to
            run LLMs quickly on CPU if your context is long, because CPUs do
            not have the requisite FLOPS to process long context quickly. And
            before you bring MoE into the conversation, MoE only affects the
            feedforward part of each transformer block, and full memory
            bandwidth and compute savings are only realized at batch size 1,
             sequence length 1, AKA the most inefficient mode that nobody other
             than Ollama users uses in practice. Sequence length 8 (common for
            speculative decoding) could be using up to 8x37B parameters
            (assuming you want to run DeepSeek - the strongest available open
            weights model). Batch size of even 2 with sequence length 8 could
             use almost all parameters if you're particularly unlucky. Prompt
             processing will almost certainly use all parameters, and will slam
             into the FLOPS wall of your EPYC's ALUs. So can LLMs (with an
             emphasis on
            "Large") be run on CPUs? Yes. Are you going to have a good time
            running them this way? No.
       
              timschmidt wrote 1 day ago:
              llamafile contains specific optimizations for prompt processing
              using AVX512 for dealing with just this issue: [1] (about a 10x
              speedup over llama.cpp)
              
              Somewhere between 8 and 192 cores I'm sure there's enough AVX512
              to get the job done.  And we've managed to reinvent Intel's
              Larrabee / Knights concept.
              
              Sadly, the highly optimized AVX512 kernels of llamafile don't
              support these exotic floats yet as far as I know.
              
               Yes, energy efficiency per query will be terrible compared to a
               hyperscaler. However, privacy will be perfect. Flexibility will
               be higher than with other options, as running on the CPU is
               almost always possible, even with new algorithms and
               experimental models.
              
   URI        [1]: https://justine.lol/matmul/
       
                ein0p wrote 1 day ago:
                At 192 cores you're way better off buying a Mac Studio, though.
       
        badmonster wrote 1 day ago:
        What stands out most is the practical implication: enabling lossless
        inference of a 405B-parameter model on a single node with 8×80GB GPUs
        is wild. That’s a huge unlock for research labs and startups alike
        that want to run frontier models without massive infrastructure costs.
       
          Der_Einzige wrote 1 day ago:
           4-bit quants of DeepSeek or Llama 3 405B already fit on those GPUs
           and are purported to have almost zero loss compared to the full
           model. Doesn't seem like that big of a deal given this.
       
          latchkey wrote 1 day ago:
          > That’s a huge unlock for research labs and startups alike that
          want to run frontier models without massive infrastructure costs.
          
          Or let one of the neoclouds take care of the infrastructure costs and
          rent it out from them. Disclosure: I run one of them.
       
            saagarjha wrote 1 day ago:
            That just moves the infrastructure costs to your cloud bill.
       
              latchkey wrote 1 day ago:
              True, but there is so much value that we provide above and beyond
              just a cloud bill, that I think it is worth it. This is way more
               than racking and stacking commodity servers and providing an ssh
              login.
              
              It is novel equipment that few have ever used before outside of a
              relatively small HPC community. It regularly breaks and has
              issues (bugs) that need industry relationships to manage
               properly. We've had one server down for over a month now because
              SMCI can't get their sh/t together to fix it. That's a $250k+
              350lbs paperweight. Good luck to any other small company that
              wants to negotiate that relationship.
              
              We are offering a very valuable service by enabling easy access
              to some of the most powerful compute available today. How many
              people do you think have a good grasp of what it takes to
              configure rocev2 & 8x400G across a cluster of servers? Good luck
              trying to hire talent that can set that up, they already have
              jobs.
              
              The capex / opex / complexity involved with deploying this level
              of gear is huge and only getting larger as the industry shifts to
               bigger/better/faster (i.e., air cooling is dead). Things are moving
              so quickly, that equipment you purchased a year ago is now
              already out of date (H100 -> H200 is a great example). You're
              going to have to have a pretty impressive depreciation model to
              deploy this yourself.
              
              I wouldn't just dismiss this as moving costs around.
       
                zarathustreal wrote 1 day ago:
                wait your competitive advantage is “human friction exists”?
                
                …how do you justify marketing yourself in a system like that?
                
                “In general, people in this vertical have difficulty doing
                their jobs. Luckily we’ve had drinks with most of them”
                ……
       
                  latchkey wrote 1 day ago:
                  It is obviously more than that, you've just chosen to pick a
                  single item off the list to focus on.
       
            Ringz wrote 1 day ago:
            I need your services in Cape Town South Africa. It’s hard to find
            good data centers here.
       
              latchkey wrote 1 day ago:
              Rent from us! hello@hotaisle.ai
       
            sundarurfriend wrote 1 day ago:
            > neoclouds
            
            For anyone else who hadn't heard of this term:
            
            > Neoclouds are startups specializing in AI-specific cloud
            computing. Unlike their larger competitors, they don’t develop
            proprietary chips. Instead, they rely heavily on Nvidia’s
            cutting-edge GPUs to power their operations. By focusing solely on
            AI workloads, these companies offer specialized solutions tailored
            to AI developers’ needs.
            
            from
            
   URI      [1]: https://www.tlciscreative.com/the-rise-of-neoclouds-shapin...
       
              latchkey wrote 1 day ago:
              I believe that the term was first coined by SemiAnalysis in this
              article:
              
   URI        [1]: https://semianalysis.com/2024/10/03/ai-neocloud-playbook...
       
            airstrike wrote 1 day ago:
            Keep up the great work! We need more of you and other players.
            
            Some unsolicited feedback: I would suggest reworking your landing
            page so that the language is always from your customers'
            perspective. Your customers want to solve a real internal problem
            that they have. Talking about how great your company is will always
            have less impact than talking about how you know what that problem
            is and how you intend to solve it.
            
            Your mission is relevant to you and your investors, not to your
            customers. They care about themselves.
            
            Your "quick start" should be an interactive form. I shouldn't have
            to remember what to put in an email to reach out to you. Make it
            easy for me. Also move that to the front page, provide a few
            "standard" packages and a custom one. Reduce the friction to
            clicking the CTA.
            
            Since your pricing is transparent, you should be able to tell me
            what that price will be before I even submit a request. I assume
            you're cheaper than the competition (otherwise why would I not go
            with them?) so make that obvious. Check out Backblaze's website for
            an example page: [1] Shell out a few grand and hire a designer to
            make your page look more professional. Something like [2] but with
            the points above, as they also make the same mistake of making
            their home page read like a pitch deck.
            
   URI      [1]: https://www.backblaze.com/cloud-storage/pricing
   URI      [2]: https://oxide.computer/
       
              latchkey wrote 1 day ago:
              Fantastic unsolicited feedback, I'm definitely taking this to
              heart!
              
               The website is intended to be more like documentation than a
               pitch deck or a useless splash page with a contact-us form. I
               dislike sites like Oxide; I scroll past and don't read or ingest
               any of
              the fancy parts. Of course, you're right, this probably needs to
              be less about me. =)
              
              Friction definitely needs to be improved. That part is being
              worked on right now. Our intention is to be fully self-service,
              so that you don't have to talk to us at all, unless you want to.
              Credit card and go.
              
              We recently lowered our prices to be competitive with the rest of
              the market vs. focusing on people who care more about what we
              offer. We weren't trying to be cheaper than everyone else, we
              were trying to offer a better service. Lesson learned and pricing
              adjusted. Streisand effect, I don't like to mention the other
              players much.
              
              Again, thanks!
       
          miohtama wrote 1 day ago:
           I am not an expert here, so I want to ask: what's magical about the
           405B number?
       
            daveguy wrote 1 day ago:
             That's the size of the largest, most capable open source models.
            Specifically Llama 3.1 has 405B parameters. Deepseek's largest
            model is 671B parameters.
       
              mhitza wrote 1 day ago:
               Small corrections: Llama 3.1 is not an Open Source model, but a
               Llama 3.1 Licensed model. Neither, apparently, is DeepSeek [1],
               which I had wrongly assumed was open source. Though I never
               considered using it, so I hadn't checked the license before.
              
   URI        [1]: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main...
       
                Der_Einzige wrote 1 day ago:
                You can just ignore the license since the existence of these
                models is based on piracy at a scale never before seen. Aaron
                Swartz couldn’t have even imagined violating copyright that
                hard.
                
                If you live in a glass house, you won’t throw stones. No one
                in the LLM space wants to be litigious
                
                It’s an open secret that DeepSeek used a ton of OpenAI
                   continuations both in pretraining and in distillation. That
                   totally violates OpenAI's TOS. No one cares.
       
                  LoganDark wrote 1 day ago:
                  > No one in the LLM space wants to be litigious
                  
                  Except for OpenAI.
       
                gunalx wrote 1 day ago:
                 Both DeepSeek R1 and V3-0324 are MIT licensed.
       
          danielmarkbruce wrote 1 day ago:
           It's... useful right now... but it's not a huge unlock in a world
           where model size, GPU memory size, and precision support are all
           changing quickly.
       
            jhj wrote 1 day ago:
             Unlike quantization, dimensionality reduction/low-rank
             approximation, distillation, etc., lossless compression is an
             always-correct addition to any ML system: you are computing the
             same thing you did before. The only questions are whether it is
             fast enough to avoid substantial bottlenecks and whether the
             achievable compression ratio is high enough to be useful.
            
            Floating point is just an inefficient use of bits (due to excessive
            dynamic range), especially during training, so it will always be
            welcome there. Extreme quantization techniques (some of the <=
            4-bit methods, say) also tend to increase entropy in the weights
            limiting the applicability of lossless compression, so lossless and
            lossy compression (e.g., quantization) sometimes go against each
            other.
            
             If you have billions of dollars in inference devices, even reducing
            the number of devices you need for a given workload by 5% is very
            useful.
       
              danielmarkbruce wrote 17 hours 55 min ago:
              "always correct"...
       
            striking wrote 1 day ago:
            Is GPU memory size really changing that quickly? For that matter,
            is model size?
       
              latchkey wrote 1 day ago:
              Both AMD and Nvidia are dumping more and more memory into their
              GPUs.
              
               MI300x is 192GB HBM3, MI325x is 256GB HBM3e, and MI355x should
               be 288GB HBM3e (and support FP4/FP6).
       
                NBJack wrote 1 day ago:
                 On the professional side of things, yes. For consumer-grade
                 GPUs, despite gaming-market trends that would otherwise call
                 for more memory, the values have stagnated a bit.
       
                  latchkey wrote 1 day ago:
                   I'm under NDA with AMD and sadly can't mention details, but I can
                  say the future is promising.
       
                    NBJack wrote 5 hours 26 min ago:
                    Music to my ears. The entire market needs more competitors.
                    As a happy Ryzen owner, I look forward to it.
                    
                    As long as AMD fixes the damn driver issues I've seen for
                    over a decade.
       
                    DrillShopper wrote 1 day ago:
                    I hope AMD cracks the CUDA Problem soon
       
                      latchkey wrote 1 day ago:
                      I'm personally really excited about this solution:
                      
   URI                [1]: https://docs.scale-lang.com/
       
              danielmarkbruce wrote 1 day ago:
              Yes, yes.
              
               Nvidia is about to release Blackwell Ultra with 288GB. Go back
               to maybe 2018 and the max was 16GB, if memory serves.
               
               DeepSeek recently released a 670GB model. A couple of years ago
               Falcon's 180GB seemed huge.
       
                spoaceman7777 wrote 1 day ago:
                 I'd assume that, in the context of LLM inference, "recent"
                 generally refers to the Ampere generation of GPUs and later,
                 when the demand for on-board memory went through the roof (as
                 the first truly usable LLMs were trained on A100s).
                
                We've been stuck with the same general caps on standard GPU
                memory since then though. Perhaps limited in part because of
                the generational upgrades happening in the bandwidth of the
                memory, rather than the capacity.
       
                  danielmarkbruce wrote 1 day ago:
                  Bandwidth is going up too. "It's not doubling every 18 months
                  and hence it's not moving" isn't a sensible way to view
                  change.
                  
                   A one-time effective 30% reduction in model size simply isn't
                  going to be some massive unlocker, in theory or in practice.
       
              kadushka wrote 1 day ago:
              What's rapidly changing are quantization algorithms, and hardware
              features to support those algorithms. For example, Blackwell GPUs
              support dynamic FP4 quantization with group size 16. At that
              group size it's close to lossless (in terms of accuracy metrics).
       
        wills_forward wrote 1 day ago:
         So this could universally decrease the memory requirements of
         un-quantized LLMs by 30%? Seems big if true.
       
          moffkalast wrote 1 day ago:
             Not as big when Q8 quantization is already considered overkill and
             cuts it down to 50% (with a flat 2x speed boost and no additional
             compute overhead, mind you), and the more common Q4_K_M is more
             like 30%. Definitely interesting if it can be added on top of
             existing quantization, but K-quants already use different precision
             levels for different layers depending on their general perplexity
             impact, which is similar to the entropy metric used here, e.g. Q6
             using a mix of 4-bit and 8-bit. And that's not even considering
             calibrated imatrix, which does something conceptually similar to
             FFT to compress even further.
       
            janalsncm wrote 1 day ago:
            Quantization is not lossless.
       
              danielmarkbruce wrote 1 day ago:
              Nobody really cares if it meets a strict definition of lossless.
       
                BoorishBears wrote 1 day ago:
                I do? I spend a ton of time post-training models for creative
                tasks.
                
                The effects of model quantization are usually qualified in
                terms of performance on benchmaxxed tasks with strong logit
                probabilities, temp 0, and a "right" answer the model has to
                pick. Or even worse they'll be measured on metrics that don't
                map to anything except themselves like perplexity ( [1] )
                
                I agree Q8 is strong but I also think the effects of
                quantization are constantly being underappreciated. People are
                often talking about how these models perform while
                fundamentally using 10+ variants of a single model with
                distinct performance profiles.
                
                Even knowing the bits per weight used isn't enough to know how
                exactly a given quant method is affecting the model:
                
   URI          [1]: https://arxiv.org/pdf/2407.09141
   URI          [2]: https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gg...
       
                  imtringued wrote 1 day ago:
                   If you've trained your own models, you'd be aware of
                   quantization-aware training.
       
                  danielmarkbruce wrote 1 day ago:
                  "Nobody really cares if it meets a strict definition of
                  lossless" != "quantization can be done haphazardly."
       
                    BoorishBears wrote 1 day ago:
                    If you're trying to really snarkily refer to the article on
                    Dynamic Quants 2.0 and how carefully developed they were,
                     they're comparing their quants to the methodology 99.99% of
                    quants out there use.
                    
                    The problem is not that people are making quants
                    "haphazardly", it's that people keep parroting that various
                    quants are "practically lossless" when they actually have
                    absolutely no clue how lossy they are given how application
                    specific the concept is for something as multidimensional
                    as an LLM.
                    
                     The moment anyone tries a little harder to quantify how
                     lossy they are, we repeatedly find that the answer is "not
                     lossless by any reasonable definition". Even their example
                     where Q4 is <1% away on MMLU 5-shot is probably massively
                     helped by a calibration dataset that maps to
                    MMLU-style tasks really well, just like constantly using
                    WikiText massively helps models that were trained on...
                    tons of text from Wikipedia.
                    
                    So unless you're doing your own calibrated quantization
                    with your own dataset (which is not impossible, but also
                    not near common), even their "non-haphazard" method could
                    have a noticeable impact on performance.
       
                      danielmarkbruce wrote 1 day ago:
                      Wasn't referring to that.
                      
                      You are saying that people are using quantized models
                      haphazardly and talking about them haphazardly. I'll
                      grant it's not the exact same thing as making them
                      haphazardly, but I think you took the point.
                      
                      The terms shouldn't be used here. They aren't helpful.
                      You are either getting good results or you are not. It
                      shouldn't be treated differently from further training on
                      dataset d. The weights changed - how much better or worse
                      at task Y did it just get?
       
                        BoorishBears wrote 1 day ago:
                        The term is perfectly fine to use here because choosing
                        a quantization strategy to deploy already has enough
                        variables:
                        
                        - quality for your specific application
                        
                        - time to first token
                        
                        - inter-token latency
                        
                        - memory usage (varies even for a given bits per
                        weight)
                        
                        - generation of hardware required to run
                        
                        Of those the hardest to measure is consistently
                        "quality for your specific application".
                        
                        It's so hard to measure robustly that many will take
                        significantly worse performance on the other fronts
                        just to not have to try to measure it... which is how
                        you end up with full precision deployments of a 405b
                        parameter model: [1] When people are paying multiples
                        more for compute to side-step a problem, language and
                        technology that allows you to erase it from the
                        equation is valid.
                        
   URI                  [1]: https://openrouter.ai/meta-llama/llama-3.1-405...
       
                          danielmarkbruce wrote 1 day ago:
                          You say that as though people know these things for
                          the full precision deployment and their use case.
                          
                           Some have the capability to figure it out and can do
                           it for both full precision and quantized. Most don't
                           and cannot.
       
                throwaway314155 wrote 1 day ago:
                Seems reductive.
       
                kridsdale3 wrote 1 day ago:
                That's not true. If there are measurable performance
                differences.
       
                  danielmarkbruce wrote 1 day ago:
                  "strict" means something. People, including yourself, only
                  care if there is a practical difference in performance. "this
                  is lossless and that isn't lossless" is a completely useless
                  statement in this realm. In many domains lossy compression is
                  either not tolerated, not legal or not practical.
       
                  kadushka wrote 1 day ago:
                  If you get any accuracy degradation with full 8 bits of
                  precision you're doing it wrong.
       
                    omneity wrote 1 day ago:
                    Or your model wasn't trained so well (weights are too
                    spiky)
       
                moffkalast wrote 1 day ago:
                And when you consider that the usual final step in the pipeline
                is that a sampler goes ham on the probabilities and just picks
                some random nonsense, the tolerance for lossy compression is
                fairly high.
                
                In fact, there's this funny occurrence where Q4 models on
                occasion perform better than their fp16 counterparts on
                 benchmarks run with top_k=1, since the outputs are slightly more
                random and they can less deterministically blunder past the
                local maximum into a more correct solution.
       
                  Der_Einzige wrote 1 day ago:
                  We got an oral at ICLR for calling out how shit samplers like
                  top_p and top_k are. Use min_p!
       
                    moffkalast wrote 1 day ago:
                    True yep, I wish more people benchmarked models with more
                    representative sampler settings and then took the average
                    of 5 or 10 responses.
       
        Havoc wrote 2 days ago:
         I'm guessing that by lossless they mean something other than what the
         word usually means in a compression context?
        
        >achieving near information-optimal compression without any loss of
        precision
        
         So perhaps "lossless" more in the sense of not losing
         perplexity/benchmark performance?
        
        In my mind lossless is precisely zero bits lost along the way.
       
          vintermann wrote 1 day ago:
           A good example of how information, i.e. bits, is only meaningful
           with respect to an end. If you don't know what the bits in a float
           will be used for, you can't throw them away; but if the floats live
           inside a function, and you know that certain bits can't affect the
           output of the function regardless of input, then you can throw those
           bits away and still have a lossless compression of the function.
       
          ziddoap wrote 1 day ago:
          The part you quote is a few sentences past the sentence that says
          "preserving outputs that are bit-for-bit identical to the original
          model".
       
          artemisart wrote 1 day ago:
          The first sentence of the introduction ends with "we introduce
          Dynamic-Length Float (DFloat11), a lossless compression framework
          that reduces LLM size by 30% while preserving outputs that are
          bit-for-bit identical to the original model" so yes it's lossless.
       
          8ytecoder wrote 1 day ago:
          Think Morse code, where frequently used letters have shorter codes
          than less frequent ones. This ensures zero loss of information.
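           
           As a concrete toy version of that idea (Huffman coding over
           characters here; per the thread above, DFloat11 applies this kind of
           entropy coding to the bfloat16 exponents), frequent symbols end up
           with the shortest bit strings:
           
           import heapq
           from collections import Counter
           
           def huffman_code(data):
               # greedily merge the two rarest subtrees; each merge prepends
               # one bit to every code in the merged subtree
               counts = Counter(data).items()
               heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts)]
               heapq.heapify(heap)
               tiebreak = len(heap)
               while len(heap) > 1:
                   n0, _, c0 = heapq.heappop(heap)
                   n1, _, c1 = heapq.heappop(heap)
                   merged = {s: "0" + c for s, c in c0.items()}
                   merged.update({s: "1" + c for s, c in c1.items()})
                   heapq.heappush(heap, (n0 + n1, tiebreak, merged))
                   tiebreak += 1
               return heap[0][2]
           
           msg = "the exponent distribution is heavily skewed"
           codes = huffman_code(msg)
           bits = sum(len(codes[ch]) for ch in msg)
           print(bits, "bits vs", 8 * len(msg), "bits uncoded")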
       
          Vendan wrote 1 day ago:
          information-optimal compression is "the theoretical minimum number of
          bits needed to represent data without losing any information, based
          on the data's entropy", so I think they mean the same thing you do
       
            brokencode wrote 1 day ago:
            Yeah, they’re saying that this compression is almost as good as
            is theoretically possible without losing any information.
       
        iamnotagenius wrote 2 days ago:
         Interesting, but not exactly practical for a local LLM user, as 4-bit
         is how LLMs are typically run locally.
       
          gojomo wrote 1 day ago:
           Some might prefer the fidelity of this method's 70% savings over the
           lossiness of 4-bit quantization's 75%.
          
          And, maybe the methods stack for those willing to trade both costs
          for the smallest representation.
       
            svachalek wrote 1 day ago:
            This is only a 30% savings, which is a cool technical feat but hard
            to see a use case for.
       
          sroussey wrote 1 day ago:
           True, but their research did include running on a 5080 locally.
           
           The big takeaway, in my opinion, is that their technique for LUTs
           etc. could also be applied to lossy quants. Say maybe you get 5-bit
           accuracy in the size of 4-bit?
           
           I don’t know, but maybe? Also, their two-stage design might improve
           current quantized kernel designs.
       
            spindump8930 wrote 1 day ago:
            Yes, it could be stacked on quants. It might be that quantized
            activations already are more "dense" and so they can't be
            compressed as much (from 16 -> ~11 bits), but certainly possible.
       
              jasonjmcghee wrote 1 day ago:
              I read it similarly - that this is a specific attribute of
              bfloat16, so the quants folks tend to run on local hardware don't
              have the same inefficiency to exploit
       
        loufe wrote 2 days ago:
        I'm so grateful to live through such exciting times. I can open HN
        every two to some exciting new news about ML/transformer models. I
        really should read more into it, but does llama.cpp use a "custom
        kernel" per se, with cublas, or is it just making good use of the
         cublas kernel?
       
          jonplackett wrote 1 day ago:
          It’s funny that you’re missing the time frame from your sentence.
          
          2 weeks? Two months? Two days? Two minutes?
          
          All of the above are true sometimes! Exciting times indeed.
       
            loufe wrote 1 day ago:
            Good catch, I meant every two days! :)
       
       
   DIR <- back to front page