_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   LLaMA 3 70B Llamafiles
       
       
        aappleby wrote 11 hours 28 min ago:
        What's the cheapest hardware setup that can run a 70B model at
        tolerably interactive rates? (say 10 characters a second)
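
         For comparison with the tok/sec figures in the replies: assuming
         very roughly 4 characters per token for English text, 10 characters
         a second works out to about 2.5 tokens a second.

             CHARS_PER_TOKEN = 4  # rough average for English (assumption)
             chars_per_sec = 10   # the interactive target above
             print(chars_per_sec / CHARS_PER_TOKEN, "tok/sec")  # 2.5 tok/sec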
       
          jart wrote 10 hours 12 min ago:
           Any MacBook with 32GB of RAM should be able to run
           Meta-Llama-3-70B-Instruct.Q2_K.llamafile, which I uploaded a few
           minutes ago. It's smart enough to solve math riddles, but at this
           level of quantization you should expect hallucinations.
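
           Once it's running, the llamafile serves a local web API, so you
           can also script against it. A minimal sketch in Python (standard
           library only), assuming the default port 8080 and the
           OpenAI-compatible chat completions endpoint that llamafile's
           server mode provides; the model name and prompt are just examples:

               import json, urllib.request

               # Ask a running llamafile server (default http://localhost:8080)
               # a question via its OpenAI-compatible chat completions endpoint.
               req = urllib.request.Request(
                   "http://localhost:8080/v1/chat/completions",
                   data=json.dumps({
                       "model": "local",  # placeholder model name
                       "messages": [{"role": "user", "content": "What is 3!?"}],
                       "temperature": 0.2,
                   }).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
               )
               with urllib.request.urlopen(req) as resp:
                   reply = json.load(resp)
               print(reply["choices"][0]["message"]["content"])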
          
           If you want to run Q4_0, you'll probably be able to squeeze it onto
           a $3,999 MacBook Pro M3 Max w/ 48GB of RAM.
          
           If you want to run Q5_K_M or Q8_0, the best choice is probably a
           Mac Studio. I have an Apple M2 Ultra w/ 24-core CPU, 60-core GPU,
           and 128GB RAM. It cost me $8,000 with the monitor. If I run
           Meta-Llama-3-70B-Instruct.Q4_0.llamafile, I get 14 tok/sec
           (prompt eval is 82 tok/sec) thanks to the Metal GPU.
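
           The RAM requirements follow from the quantization level: size is
           roughly parameter count times bits per weight, divided by 8, plus
           KV cache and runtime overhead. A rough sketch in Python (the
           bits-per-weight figures are approximations):

               # Rough in-memory size of a 70B model at common llama.cpp
               # quant levels. Bits-per-weight values are approximate; KV
               # cache and runtime overhead are ignored.
               PARAMS = 70e9
               BITS_PER_WEIGHT = {
                   "Q2_K": 3.0, "Q4_0": 4.5, "Q5_K_M": 5.7,
                   "Q8_0": 8.5, "F16": 16.0,
               }
               for quant, bits in BITS_PER_WEIGHT.items():
                   print(f"{quant:7s} ~{PARAMS * bits / 8 / 1e9:4.0f} GB")
               # Q2_K ~26 GB (32GB MacBook), Q4_0 ~39 GB (48GB machine),
               # Q5_K_M ~50 GB and Q8_0 ~74 GB (Mac Studio territory)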
          
           You could alternatively go on vast.ai and rent a system with 4x RTX
           4090s for a few bucks an hour. That'll run a 70B model. Or you could
           build your own, but the graphics cards alone will cost $10k+.
          
           An AMD Threadripper Pro 7995WX ($10k) does a good job too. I get
           5.9 tok/sec eval with Q4_0 and 49 tok/sec prompt eval. If I use F16
           weights, prompt eval goes up to 65 tok/sec.
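
           To put the prompt eval and generation rates together: a reply
           takes roughly prompt_tokens / prompt_speed plus output_tokens /
           generation_speed. A quick sketch with made-up token counts:

               # Rough end-to-end latency: prompt processing + generation.
               # The 500-token prompt and 300-token reply are made-up sizes.
               def reply_seconds(prompt_toks, out_toks, prompt_tps, gen_tps):
                   return prompt_toks / prompt_tps + out_toks / gen_tps

               # Threadripper Pro 7995WX, Q4_0 (49 / 5.9 tok/sec above):
               print(reply_seconds(500, 300, 49, 5.9))  # ~61 seconds
               # M2 Ultra w/ Metal, Q4_0 (82 / 14 tok/sec above):
               print(reply_seconds(500, 300, 82, 14))   # ~28 seconds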
       
          frozenport wrote 10 hours 23 min ago:
           M2 Macs can do it: [1] In practice, 10 tokens per second is kinda
           annoyingly slow.
          
           Most people running locally would opt for a smaller 7B model.
          
   URI    [1]: https://twitter.com/junrushao/status/1681828325923389440
       
            zarzavat wrote 7 hours 53 min ago:
             I’ve been playing around with Llama 3 7B today; it’s not very
             good. I’m sure that Facebook put everything they could into
             making it good, but 7B is apparently just not enough parameters.
       
              mistrial9 wrote 6 min ago:
               With llava-v1.5-7b-q4.llamafile, yes, I agree that the overall
               impression is poor.
       
              pennomi wrote 2 hours 37 min ago:
              I assume you mean 8B? There is no Llama 3 7B.
       
                sieszpak wrote 1 hour 12 min ago:
                 Llama 3 8B's answers seem sad... this is the first model in a
                 long time that has had trouble telling me what 3! (factorial)
                 is.
       
       