_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   DBRX: A new open LLM
       
       
        johnpruna wrote 7 hours 31 min ago:
         You can find 4-bit quantized versions of DBRX Base [1] and DBRX Instruct [2] here:
        
   URI  [1]: https://huggingface.co/PrunaAI/dbrx-base-bnb-4bit
   URI  [2]: https://huggingface.co/PrunaAI/dbrx-instruct-bnb-4bit
       
        ziofill wrote 1 day ago:
        Slowly going from mixture of experts to committee? ^^
       
        evrial wrote 2 days ago:
         Bourgeois have fun with number crunchers. Clowns, make a comparison of
         metrics normalized to tokens/second/watt and tokens/second per memory
         stick of RAM (8 GB, 16 GB, 32 GB) and per consumer GPU.
       
        joaquincabezas wrote 2 days ago:
        it looks like TensorRT-LLM (TRT-LLM) is the way to go for a realtime
        API for more and more companies (i.e perplexity ai’s pplx-api,
        Mosaic’s, baseten…). Would be super-nice to find people deploying
        multimodal (i.e LLaVA or CLIP/BLIP) to discuss approaches (and cry a
        bit together!)
       
        ACV001 wrote 2 days ago:
        It is not open
        
        "
        Get Started with DBRX on Databricks
        
        If you’re looking to start working with DBRX right away, it’s easy
        to do so with the Databricks Mosaic AI Foundation Model APIs. You can
        quickly get started with our pay-as-you-go pricing and query the model
        from our AI Playground chat interface. For production applications, we
        offer a provisioned throughput option to provide performance
        guarantees, support for finetuned models, and additional security and
        compliance. To privately host DBRX, you can download the model from the
        Databricks Marketplace and deploy the model on Model Serving."
       
          bigdict wrote 2 days ago:
          It is open
          
          "The weights of the base model (DBRX Base) and the finetuned model
          (DBRX Instruct) are available on Hugging Face under an open license."
       
        jerpint wrote 2 days ago:
         Per the paper, 3072 H100s over the course of 3 months; assume a cost of
         $2/GPU/hour.
         
         That would be roughly $13.5M USD.
         
         I’m guessing that at this scale and cost, this model is not
         competitive and their ambition is to scale to much larger models. In
         the meantime, they learned a lot and gained PR from open-sourcing.
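         
         A quick back-of-the-envelope check of that figure (the GPU count, duration,
         and hourly rate are the assumptions quoted above; illustrative only):
         
             # Rough training-cost estimate from the assumed figures above.
             gpus = 3072                  # H100s
             hours = 3 * 730              # ~3 months of wall-clock time
             rate_usd_per_gpu_hour = 2.0  # assumed rental price
             
             gpu_hours = gpus * hours
             cost = gpu_hours * rate_usd_per_gpu_hour
             print(f"{gpu_hours:,} GPU-hours -> ${cost / 1e6:.1f}M")
             # 6,727,680 GPU-hours -> $13.5M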
       
        aussieguy1234 wrote 2 days ago:
         So, is the business model here to release the model for free, then
         hopefully companies will run this on Databricks infra, which they will
         charge for?
       
        grishka wrote 2 days ago:
        Sorry, you have been blocked
        
        You are unable to access databricks.com
        
        "Open", right.
       
        doubloon wrote 2 days ago:
        really noob question - so to run on a GPU you need a 264GB RAM GPU? 
        and if you ran on a 264GB CPU would it be super slow?
       
          Jedd wrote 2 days ago:
          Adding to ShamelessC's answer - the other option is to wait for
          quantised versions of this model.  A q4 will be around 70GB, and
          probably acceptable.  A q5 or higher would be preferred, but we're
          still a good way under the 260GB.
          
           You still need extra RAM to breathe, but that's a lot more palatable.
          
          This is why the Mac range - with unified memory - is appealing, as
          you can allocate most of your (say) 256GB of RAM to the GPU.
          
          Conventional (desktop) CPU / RAM would be painfully slow.
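           
           A rough way to estimate those quantized sizes (132B is DBRX's total
           parameter count; the bits-per-weight figures are assumptions, and real
           quantized files vary a bit by scheme):
           
               # Size estimate for quantized weights: params * bits-per-weight / 8.
               params = 132e9  # total (not active) parameter count
               
               for name, bits in [("fp16", 16), ("q8", 8), ("q5", 5.5), ("q4", 4.5)]:
                   gb = params * bits / 8 / 1e9
                   print(f"{name:>4}: ~{gb:.0f} GB")
               # fp16: ~264 GB, q8: ~132 GB, q5: ~91 GB, q4: ~74 GB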
       
          ShamelessC wrote 2 days ago:
           The model's weights can be sharded across multiple GPUs. A "common"
           training server could contain (for instance) eight A100 GPUs, each
           with 40 GB (or up to 80 GB), for a total of 320 GB of working
           VRAM. Since they're connected to each other in the same PC, they can
           communicate with each other quickly enough to calculate in
           coordination in this fashion. This setup is _very_ expensive of
           course. Probably in the hundreds of thousands of dollars.
           
           If you're hoping to run the model yourself, you will need enough
           money and expertise to rent and deploy it to a server with that many
           GPUs. Alternatively, volunteers and other researchers will be able
           to quantize (compress) the model and make it easier to run on
           configurations without as much VRAM.
           
           If you ran it on CPU it may indeed be super slow, but it's possible
           it's fast enough for the purposes of running the model rather than
           trying to train it. I am seeing (limited) success with the
           maxed-out Mac lineup ($4500) using the beefy M1/M2 line of CPUs.
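           
           A quick sanity check on whether a given multi-GPU box can even hold the
           fp16 weights (this ignores activations and the KV cache, so treat it as
           a lower bound; the configurations are just examples):
           
               # Can the combined VRAM hold the fp16 weights of a 132B-param model?
               def fits(params_b, n_gpus, gb_per_gpu, bytes_per_param=2):
                   weights_gb = params_b * bytes_per_param   # fp16 = 2 bytes/param
                   total_vram = n_gpus * gb_per_gpu
                   print(f"~{weights_gb:.0f} GB of weights vs {total_vram} GB of VRAM")
                   return weights_gb < total_vram
               
               fits(132, n_gpus=8, gb_per_gpu=40)   # ~264 vs 320 GB -> fits (barely)
               fits(132, n_gpus=2, gb_per_gpu=80)   # ~264 vs 160 GB -> does not fit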
       
        petesergeant wrote 2 days ago:
        This makes me bearish on OpenAI as a company. When a cloud company can
        offer a strong model for free by selling the compute, what competitive
        advantage does a company who want you to pay for the model have left?
        Feels like they might get Netscape’d.
       
        underlines wrote 3 days ago:
         Waiting for mixed quantization with HQQ and MoE offloading [1]. With
         that I was able to run Mixtral 8x7B on my 10 GB VRAM RTX 3080... This
         should work for DBRX and should shave off a ton of the VRAM requirement.
         
   URI  [1]: https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-fi...
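         
         The core idea there (a generic sketch, not the linked repo's actual API):
         keep the full set of expert weights in CPU RAM and copy only the experts
         the router selects onto the GPU, with a small cache so recently used
         experts aren't re-copied every token. Class and parameter names below are
         made up for illustration:
         
             # Generic MoE expert-offloading sketch (illustrative, not the repo's API).
             import copy
             from torch import nn
             
             class OffloadedExperts(nn.Module):
                 # Experts live in CPU RAM; routed experts are copied to GPU on demand.
                 def __init__(self, experts, device="cuda", cache_size=4):
                     super().__init__()
                     self.cpu_experts = nn.ModuleList(experts).to("cpu")
                     self.device = device
                     self.cache_size = cache_size
                     self.gpu_cache = {}  # expert index -> copy resident on the GPU
             
                 def expert(self, idx):
                     if idx not in self.gpu_cache:
                         if len(self.gpu_cache) >= self.cache_size:
                             # naive eviction: drop an arbitrary cached expert
                             self.gpu_cache.pop(next(iter(self.gpu_cache)))
                         gpu_copy = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
                         self.gpu_cache[idx] = gpu_copy
                     return self.gpu_cache[idx]
             
                 def forward(self, x, routed_idx):
                     return self.expert(routed_idx)(x.to(self.device))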
       
        airocker wrote 3 days ago:
        is this also the ticker name when they IPO?
       
        bg24 wrote 3 days ago:
        "Looking holistically, our end-to-end LLM pretraining pipeline has
        become nearly 4x more compute-efficient in the past ten months."
        
        I did not fully understand the technical details in the training
        efficiency section, but love this. Cost of training is outrageously
        high, and hopefully it will start to follow Moore's law.
       
        brucethemoose2 wrote 3 days ago:
        I would note the actual leading models right now (IMO) are:
        
        - Miqu 70B (General Chat)
        
        - Deepseed 33B (Coding)
        
        - Yi 34B (for chat over 32K context)
        
        And of course, there are finetunes of all these.
        
        And there are some others in the 34B-70B range I have not tried (and
        some I have tried, like Qwen, which I was not impressed with).
        
        Point being that Llama 70B, Mixtral and Grok as seen in the charts are
        not what I would call SOTA (though mixtral is excellent for the batch
        size 1 speed)
       
          echaozh wrote 2 days ago:
          It's Deepseek, not Deepseed, just so people can actually find the
          model.
       
          belter wrote 3 days ago:
          For all the Model Cards and License notices, I find it interesting
          there is not much information on the contents of the dataset used for
          training. Specifically, if it contains data subject to Copyright
          restrictions. Or did I miss that?
       
            brucethemoose2 wrote 2 days ago:
             Yeah, it's an unspoken but rampant thing in the LLM community.
            Basically no one respects licenses for training data.
            
            I'd say the majority of instruct tunes, for instance, use OpenAI
            output (which is against their TOS).
            
             But it's all just research! So who cares! Or at least, that seems to
            be the mood.
       
          blueblimp wrote 3 days ago:
          Qwen1.5-72B-Chat is dominant in the Chatbot Arena leaderboard,
          though. (Miqu isn't on there due to being bootleg, but Qwen outranks
          Mistral Medium.)
       
            brucethemoose2 wrote 2 days ago:
             Yeah I know, hence it's odd that I found it kind of dumb for personal
             use. More so with the smaller models, which lost on an objective
             benchmark of mine to some Mistral finetunes.
            
            And I don't think I was using it wrong. I know, for instance, the
            Chinese language models are funny about sampling since I run Yi all
            the time.
       
          jph00 wrote 3 days ago:
          Miqu is a leaked model -- no license is provided to use it. Yi 34B
          doesn't allow commercial use. Deepseed 33B isn't much good at stuff
          outside of coding.
          
          So it's fair to say that DBRX is the leading general purpose model
          that can be used commercially.
       
            ok_dad wrote 2 days ago:
            Model weights are just constants in a mathematical equation, they
            aren’t copyrightable. It’s questionable whether licenses to use
            them only for certain purposes are even enforceable. No human wrote
            the weights so they aren’t a work of art/authorship by a human.
            Just don’t use their services, use the weights at home on your
            machines so you don’t bypass some TOS.
       
              wongarsu wrote 2 days ago:
              Photographs aren't human-made either, yet they are copyrightable.
              I agree that both the letter and the spirit of copyright law are
              in favor of models not being copyrightable, but it will take
              years until there's a court ruling either way. Until then proceed
              with caution and expect rights holders to pretend like copyright
              does apply. Not that they will come after your home setup either
              way.
       
                Zuiii wrote 2 days ago:
                Photos involve creativity. Photos that don't involve creativity
                aren't usually considered copyrightable by the courts (hence
                why all copyright cases I followed include arguments that
                 establish why creativity was material to creating the work).
                
                Weights, on the other hand, are a product of a purely
                mechanical process. Sure, the process itself required
                creativity as did the composition of data, but the creation of
                the weights themselves do not.
                
                Model weights are effectively public domain data according to
                 the criteria outlined in a statement issued by the US Copyright
                 Office a year ago:
                
   URI          [1]: https://www.federalregister.gov/documents/2023/03/16/2...
       
            jMyles wrote 3 days ago:
            This only applies to projects whose authors seek to comply with the
            whims of a particular jurisdiction.
            
            Surely there are plenty of project prospects - even commercial in
            nature - which don't have this limitation.
       
        ec109685 wrote 3 days ago:
        For coding evals, it seems like unless you are super careful, they can
        be polluted by the training data.
        
        Are there standard ways to avoid that type of score inflation?
       
        ianbutler wrote 3 days ago:
         The approval on the base model is not feeling very open. Plenty of
         people are still waiting on a chance to download it, whereas the
         instruct model was an instant approval. The base model is more
         interesting to me for finetuning.
       
          ianbutler wrote 2 days ago:
          FWIW looks like people are getting access now.
       
          Chamix wrote 2 days ago:
          4chan already has a torrent out, of course.
       
          blueblimp wrote 3 days ago:
           The license allows reproduction/distribution/copying, so I'm a little
           surprised there's an approval process at all.
       
            ianbutler wrote 2 days ago:
            Yeah it's kind of weird, I'll assume for now they're just busy, but
            I'd be lying if my gut didn't immediately say it's kind of sketchy.
       
        zopper wrote 3 days ago:
         Interesting that they haven't released DBRX MoE-A and B. For many
        use-cases, smaller models are sufficient. Wonder why that is?
       
          jfrankle wrote 3 days ago:
          Honestly, just a matter of having the time to clean everything up and
          get it out. The ancillary code, model cards, etc. take a surprising
          amount of time.
       
        jjtheblunt wrote 3 days ago:
        I’d like to know how Nancy Pelosi, who sure as hell doesn’t know
        what Apache Spark is, bought $1 million worth (and maybe $5million) of
        Databricks stock days ago.
        
   URI  [1]: https://www.dailymail.co.uk/sciencetech/article-13228859/amp/n...
       
          BryantD wrote 3 days ago:
          I don't have any interest in defending Pelosi's stock trades, and I
          agree that sitting members of Congress should not be trading stocks.
          
           That said, this report seems inaccurate to me. Pelosi put between 1
           and 5 million dollars into Forge Investments, which is a method for
           investing in pre-IPO companies, as I understand it. Databricks is one
           of those, but so are OpenAI, Hugging Face, Anthropic, and Humane. If I
           wanted to invest in pre-IPO AI companies it seems like a very natural
           choice and I don't think we need insider trading to explain it.
          
          It's also the case that the report she filed calls out Databricks
          stock, which is perhaps an indication that she was particularly
          interested in that. Stronger reporting would tell us how often she's
          invested in Forge, if this is the first time, and so on. One other
          possible explanation is that she was investing ahead of the Humane
          Pin shipping and wanted to pull attention away from it, for example.
       
          hiddencost wrote 3 days ago:
          You know she has advisors, right?
       
            samatman wrote 3 days ago:
            If someone "advises" you that a company is about to do something
            major, and this isn't public information, and you take action on
            the stock market accordingly, that's insider trading.
       
              science4sail wrote 2 days ago:
              US Congress members are generally immune from insider trading
              laws
       
                jjtheblunt wrote 2 days ago:
                 By law they haven't been protected at all since 2012. But the
                 SEC curiously seems to ignore them, and congresspeople are
                 experts in exercising loopholes in SEC regulations. [1] For
                 instance, Pelosi in the Databricks case gets to purchase
                 significant shares at pre-IPO prices, which is a thing that
                 shouldn't even exist.
                
   URI          [1]: https://en.wikipedia.org/wiki/STOCK_Act
       
            jjtheblunt wrote 3 days ago:
            Ignoring the snark: Obviously.
            
            SEC put Martha Stewart in jail for following her advisor, and that
            was for about $45,000.
       
            PUSH_AX wrote 3 days ago:
            I think the insinuation is insider trading due to the timing,
            advised or not.
       
        bboygravity wrote 3 days ago:
        Less than 1 week after Nancy Pelosi bought a 5M USD share in
        Databricks, this news is published. [1] Crime pays in the US.
        
   URI  [1]: https://twitter.com/PelosiTracker_/status/1771197030641062231
       
          lfmunoz4 wrote 3 days ago:
          I see these types of jokes everywhere. I cannot understand that hints
          of corruption are so blatant (i.e. a politician consistently beating
          the market) yet people keep voting for the same politician. Don't see
          how that is possible, must be these joke are only on internet and
          mainstream media never mentions this.
       
            bboygravity wrote 3 days ago:
            People are down-voting this because they refuse to believe this
            could be reality.
       
          mrtranscendence wrote 3 days ago:
          Are you alleging that Nancy Pelosi invested in Databricks, a private
          company without a fluctuating share price, because she learned that
          they would soon release a small, fairly middling LLM that probably
          won't move the needle in any meaningful way?
       
            bboygravity wrote 3 days ago:
            Are you suggesting that Nancy Pelosi, who consistently beats the
            market through obvious insider trading for years in a row, bought a
            share in Databricks without any insider info? Possible, yet
             unlikely, is my opinion. [1] PS: "without a fluctuating share price"
             is nonsense. Just because the share is of a private company
            doesn't mean its price can't fluctuate. Why would anybody buy
            shares in private companies if the price couldn't fluctuate? What
            would be the point?
            
            Example of a changing share price of a different (random) private
            company that has many different share holders over time:
            
   URI      [1]: https://jacobin.com/2021/12/house-speaker-paul-stocks-insi...
   URI      [2]: https://www.cnbc.com/2023/12/13/spacex-value-climbs-to-180...
       
              mazlix wrote 2 days ago:
              I'm not a lawyer and this isn't investment advice so don't sue me
              if I'm wrong but I'm not sure this qualifies as insider trading
              in the way that would be illegal for public markets.
              
              Aren't most investors in private companies privy to information
              that isn't entirely public?
              
              I can see how this feels a bit different because DataBricks might
              be the size where it might trade with a decent amount of
              liquidity, but certainly in smaller rounds it's got to be pretty
              normal.
              
               Maybe if she bought it secondary and this information was withheld
               from the person from whom she purchased the shares, they could sue?
       
              laidoffamazon wrote 2 days ago:
              Nancy beats the market because tech beats the market
       
          laidoffamazon wrote 3 days ago:
          Dude, what the hell are you talking about?
       
            bboygravity wrote 3 days ago:
            Insider trading by US government employees.
       
        m3kw9 wrote 3 days ago:
         These tiny “state of the art” performance increases are really
         indicative that the current architecture for LLMs (Transformers + Mixture
         of Experts) is maxed out even if you train it more/differently. The
         writing is all over the walls.
       
          wavemode wrote 3 days ago:
          It would not surprise me if this is what has delayed OpenAI in
          releasing a new model. After more than a year since GPT-4, they may
          have by now produced some mega-trained mega-model, but running it is
          so expensive, and its eval improvement over GPT-4 so marginal, that
          releasing it to the public simply makes no commercial sense just yet.
          
          They may be working on how to optimize it to reduce cost, or
          re-engineer it to improve evals.
       
            m3kw9 wrote 3 days ago:
             These “state of the art” LLMs barely eking out a win aren’t a
             threat to OpenAI, and they can take their sweet time sharpening the
             sword that will come down hard on these LLMs
       
        briandw wrote 3 days ago:
        Worse than the chart crime of truncating the y axis is putting LLaMa2's
        Human Eval scores on there and not comparing it to Code Llama Instruct
        70b. DBRX still beats Code Llama Instruct's 67.8 but not by that much.
       
          panarky wrote 2 days ago:
          > chart crime of truncating the y axis
          
          If you chart the temperature of the ocean do you keep the y-axis
          anchored at zero Kelvin?
       
            d-z-m wrote 2 days ago:
            If you chart the temperature of the ocean are you measuring it in
            Kelvin?
       
              panarky wrote 2 days ago:
              Apparently, if you want to avoid "chart crime" when you chart
              temperatures, then it's deceptive if you don't start at absolute
              zero.
       
                evrial wrote 2 days ago:
                When was the temperature on Earth at absolute zero?
       
                  panarky wrote 1 day ago:
                  My point exactly.
                  
                  In a chart of world gross domestic product for the last 12
                  months, when was it at zero?
                  
                  In a chart of ocean salinity, when was it at absolute zero?
                  
                  Is it inherently deceptive to use a y-axis that doesn't begin
                  at zero?
       
          jjgo wrote 3 days ago:
          > "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct,
          a model built explicitly for programming, despite the fact that DBRX
          Instruct is designed for general-purpose use (70.1% vs. 67.8% on
          HumanEval as reported by Meta in the CodeLLaMA blog)."
          
          To be fair, they do compare to it in the main body of the blog. It's
          just probably misleading to compare to CodeLLaMA on non coding
          benchmarks.
       
            tartrate wrote 3 days ago:
            Which non-coding benchmark?
       
        hintymad wrote 3 days ago:
        Just curious, what business benefit will Databricks get by spending
        potentially millions of dollars on an open LLM?
       
          guluarte wrote 2 days ago:
          nothing, but they will brag about it to get more money from investors
       
          blitzar wrote 3 days ago:
          An increased valuation at IPO later this year.
       
            qrios wrote 2 days ago:
             Instead of spending x * 10^7 dollars, Databricks could buy
             databricks.ai; it's for sale.
            
            But really, I prefer to have as many players as possible in the
            field of _open_ models available.
       
          BoorishBears wrote 3 days ago:
           Databricks is trying to go all-in on convincing organizations they
           need to use in-house models, and therefore pay them to provide
           LLMOps.
          
          They're so far into this that their CTO co-authored a borderline
          dishonest study which got a ton of traction last summer trying to
          discredit GPT-4:
          
   URI    [1]: https://arxiv.org/pdf/2307.09009.pdf
       
            omeze wrote 3 days ago:
             What does borderline dishonest mean? I only read the abstract and
             it seems like such an obvious point that I don't see how it's
             contentious
       
              BoorishBears wrote 3 days ago:
               The regression came from poorly parsing the results. I came to the
               conclusion independently, but here's another more detailed
              takedown: [1] Given the conflict of interest and background of
              Zaharia, it's hard to imagine such an immediately obvious source
              of error wasn't caught.
              
   URI        [1]: https://www.reddit.com/r/ChatGPT/comments/153xee8/has_ch...
       
            galaxyLogic wrote 3 days ago:
             I can see a business model for in-house LLMs: training a model
             on the knowledge about their products and then somehow getting that
             knowledge into a generally available LLM platform.
             
             I recently tried to ask Google to explain to me how to delete a
             sender-recorded voice message I had created in WhatsApp. I got
             totally erroneous results back. Maybe it was because that is a
             rather new feature in WhatsApp.
             
             It would be in the interests of WhatsApp to get accurate answers
             about it into Google's LLM. So Google might make a deal with them
             requiring WhatsApp to pay Google for regular updates about the
             current features of WhatsApp. The owner of WhatsApp, Meta, is of
             course a competitor of Google, so Google may not much care about
             providing up-to-date info about WhatsApp in their LLM. But they
             might if Meta paid them.
       
              BoorishBears wrote 3 days ago:
              Pretraining on internal knowledge will be incredibly inefficient
              for most companies.
              
              Finetuning makes sense for things like embeddings (improve RAG by
              defining domain specific embeddings) but doesn't do anything
              useful for facts
       
              spxneo wrote 3 days ago:
              Businesses are already using Azure GPT4 on-premise I believe with
              good feedback
              
              DBRX does not compete with GPT4 or even Claude 3.
       
          dhoe wrote 3 days ago:
          It's an image enhancement measure, if you want. Databricks' customers
          mostly use it as an ETL tool, but it benefits them to be perceived as
          more than that.
       
            spxneo wrote 3 days ago:
             You can improve your brand for a lot less. I just don't understand
             why they would throw all their chips into a losing race.
             
             Azure already runs on-premise if I'm not mistaken, Claude 3 is
             out... but DBRX already falls so far behind.
            
            I just don't get it.
       
              phillipcarter wrote 2 days ago:
              A lot of enterprise orgs are convinced of two things:
              
              1. They need to train their own LLMs
              
              2. They must fine-tune an LLM to make use of this tech
              
              Now number (1) is almost entirely false, but there are willing
              buyers, and DB offers some minimal tools to let them live their
              lies. DBRX proves that it's possible to train an LLM on the DB
              stack.
              
              Number (2) is often true, although I would say that most orgs
              skip the absolutely essential first step of prompting a powerful
              foundation model to get a first version of a product done first
              (and using evals from that prompting to seed evals for
              fine-tuning). It's here where DBRX is much more relevant, because
              it is by all accounts an extremely capable model for fine-tuning.
              And since it's entirely built by DB, they can offer better
              support for their customers than they can with Llama or Mistral
              variants.
              
               More broadly, the strategic play is to be the "enterprise AI
              company". OpenAI, Anthropic, and Meta are all competing at the
              consumer level, but nobody's really stuck out as the dominant
              player for the enterprise space. Arguably OpenAI is the most
              successful, but that's less about an enterprise focus and just
              about being wildly successful generally, and they're also still
              trying to figure out if they want to focus on consumer tech, AGI
              woo woo stuff, research work, or enterprise stuff. DB also knows
              that to be an AI company, you also have to be a data company, and
              they are a data company. So it's a natural strategic move for
              them.
       
          ramoz wrote 3 days ago:
          Their goal is to always drive enterprise business towards
          consumption.
          
          With AI they need to desperately steer the narrative away from API
          based services (OpenAI).
          
          By training LLMs, they build sales artifacts (stories, references,
          even accelerators with LLMs themselves) to paint the pictures needed
          to convince their enterprise customer market that Databricks is the
          platform for enterprise AI. Their blog details how the entire end to
          end process was done on the platform.
          
          In other words, Databricks spent millions as an aid in influencing
          their customers to do the same (on Databricks).
       
            anonymousDan wrote 3 days ago:
            Do they use spark for the training?
       
              alexott wrote 3 days ago:
              Mosaic AI Training ( [1] ) as it's mentioned in the announcement
              blog ( [2] - it's a bit less technical)
              
   URI        [1]: https://www.databricks.com/product/machine-learning/mosa...
   URI        [2]: https://www.databricks.com/blog/announcing-dbrx-new-stan...
       
                anonymousDan wrote 3 days ago:
                Thanks. Is this open source - i.e. can it be used on my own
                cluster outside of databricks?
       
            hintymad wrote 3 days ago:
            Thanks! Why do they not focus on hosting other open models then? I
            suspect other models will soon catch up with their advantages in
            faster inference and better benchmark results. That said, maybe the
            advantage is aligned interests: they want customers to use their
            platforms, so they can keep their models open. In contrast, Mistral
            removed their commitment to open source as they found a potential
            path to profitability.
       
              tartrate wrote 3 days ago:
              > Why do they not focus on hosting other open models then?
              
              They do host other open models as well (pay-per-token).
       
                bobbruno wrote 3 days ago:
                
                
   URI          [1]: https://docs.databricks.com/en/machine-learning/founda...
       
              cwyers wrote 3 days ago:
              Commoditize your complements: [1] If Databricks makes their money
              off model serving and doesn't care whose model you use, they are
              incentivized to help the open models be competitive with the
              closed models they can't serve.
              
   URI        [1]: https://gwern.net/complement
       
                youssefabdelm wrote 2 days ago:
                At this point it's a cliché to share this article, as much as
                I love gwern lol.
       
                  sitkack wrote 2 days ago:
                  There is always the lucky 10k.
       
                    PoignardAzur wrote 1 day ago:
                    For that reference in particular, feels like you should
                    really share the link as well:
                    
   URI              [1]: https://xkcd.com/1053/
       
                    josh-sematic wrote 2 days ago:
                    Today I was one
       
              richardw wrote 3 days ago:
              They do have a solid focus on doing so, it’s just not
              exclusive.
              
   URI        [1]: https://www.databricks.com/product/machine-learning/larg...
       
              theturtletalks wrote 3 days ago:
              Mistral did what many startups are doing now, leveraging
              open-source to get traction and then doing a rug-pull. Hell, I've
              seen many startups be open-source, get contributions, get free
              press, get into YC and before you know it, the repo is gone.
       
                antupis wrote 2 days ago:
                Well Databricks is a big company with real cash flow, and
                Mistral is a startup so there is a kinda big difference here.
       
              Closi wrote 3 days ago:
              Demonstrating you can do it yourself shows a level of investment
              and commitment to AI in your platform that integrating LLAMA does
              not.
              
              And from a corporate perspective, it means that you have in-house
              capability to work at the cutting-edge of AI to be prepared for
              whatever comes next.
       
                hintymad wrote 3 days ago:
                > Demonstrating you can do it yourself shows a level of
                investment and commitment to AI in your platform that
                integrating LLAMA does not.
                
                 I buy this argument. It looks like that's not what AWS does,
                 though, yet they don't have a problem attracting LLM users. Maybe
                 AWS already has enough of a reputation?
       
                  rmbyrro wrote 3 days ago:
                  It's easier because 70% of the market already has an AWS
                  account and a sizeable budget allocated to it. The technical
                  team is literally one click away from any AWS service.
       
                  zubairshaik wrote 3 days ago:
                   I may be misunderstanding, but doesn't Amazon have its own
                  models in the form of Amazon Titan[0]? I know they aren't
                  competitive in terms of output quality but surely in terms of
                  cost there can be some use cases for them.
                  
                  [0]
                  
   URI            [1]: https://aws.amazon.com/bedrock/titan/
       
        patrick-fitz wrote 3 days ago:
        Looking at the license restrictions: [1] "If, on the DBRX version
        release date, the monthly active users of the products or services made
        available by or for Licensee, or Licensee’s affiliates, is greater
        than 700 million monthly active users in the preceding calendar month,
        you must request a license from Databricks, which we may grant to you
        in our sole discretion, and you are not authorized to exercise any of
        the rights under this Agreement unless or until Databricks otherwise
        expressly grants you such rights."
        
        I'm glad to see they aren't calling it open source, unlike some LLM
        projects. Looking at you LLama 2.
        
   URI  [1]: https://github.com/databricks/dbrx/blob/main/LICENSE
       
          zeeg wrote 3 days ago:
           It's literally described as open source all over. [1] It's even implied
           in comparisons everywhere:
          
          > Figure 1: DBRX outperforms established open source models on
          language understanding (MMLU), Programming (HumanEval), and Math
          (GSM8K).
          
          > The aforementioned three reasons lead us to believe that open
          source LLMs will continue gaining momentum. In particular, we think
          they provide an exciting opportunity for organizations to customize
          open source LLMs that can become their IP, which they use to be
          competitive in their industry.
          
          Just search "open source".
          
   URI    [1]: https://www.databricks.com/blog/announcing-dbrx-new-standard...
       
            patrick-fitz wrote 3 days ago:
             Yes, they are using different wording in different articles:
             
             [1] The only mention of open source is:
             
             > DBRX outperforms established open source models
             
             [2] Open source is mentioned 10+ times:
             
             > Databricks is the only end-to-end platform to build high quality
             AI applications, and the release today of DBRX, the highest quality
             open source model to date, is an expression of that capability
             
             [3] On GitHub it's described as an open license, not an open source
             license:
             
             > DBRX is a large language model trained by Databricks, and made
             available under an open license.
            
   URI      [1]: https://www.databricks.com/blog/introducing-dbrx-new-state...
   URI      [2]: https://www.databricks.com/blog/announcing-dbrx-new-standa...
   URI      [3]: https://github.com/databricks/dbrx
       
          nabakin wrote 3 days ago:
           They also aren't claiming it's the best LLM out there when it clearly
           isn't, unlike Inflection. Overall solid.
       
          dataengheadbang wrote 3 days ago:
           The release notes on the Databricks console definitely say open
          source. If you click the gift box you will see:
          Try DBRX, our state-of-the-art open source LLM!
       
          adtac wrote 3 days ago:
           Ironically, the LLaMA license text [1] this is lifted verbatim from
           is itself probably copyrighted [2] and doesn't grant you the
           permission to copy it or make changes like s/meta/dbrx/g lol.
          
   URI    [1]: https://github.com/meta-llama/llama/blob/main/LICENSE#L65
   URI    [2]: https://opensource.stackexchange.com/q/4543
       
          londons_explore wrote 3 days ago:
          I do wonder what value those companies who have >700 million users
          might get from this?
          
          Pretty much all of the companies with >700 million users could easily
          reproduce this work in a matter of weeks if they wanted to - and they
          probably do want to, if only so they can tweak and improve the design
          before they build products on it.
          
          Given that, it seems silly to lose the "open source" label just for a
          license clause that doesn't really have much impact.
       
            einarfd wrote 3 days ago:
             The point of the more-than-700-million-user restriction is so that
             Amazon, Google Cloud, or Microsoft Azure cannot set up an offering
             where they host and sell access to the model without an agreement
             with Databricks.
             
             This clause is probably inspired by the open source software vendors
             that have switched licenses over competition from the big cloud
             vendors.
       
          jstummbillig wrote 3 days ago:
          Well, it does still claim "Open" in the title, for which certain
          other vendors might potentially get flak around here, in a comparably
          not-open-in-the-way-we-demand-it-to-be kinda setup.
       
        killermonkeys wrote 3 days ago:
         What does it mean to have fewer active parameters (36B) than the full
         model size (132B), and what impact does that have on memory and latency?
         It seems like this is because it is an MoE model?
       
          avisoori1x wrote 3 days ago:
          This repo I created and the linked blog will help in understanding
          this:
          
   URI    [1]: https://github.com/AviSoori1x/makeMoE
       
          bjornsing wrote 3 days ago:
          Means that it’s a mixture of experts model with 132B parameters in
          total, but a subset of 36B parameters are used / selected in each
          forward pass, depending on the context. The parameters not used /
          selected for generating a particular token belong to “experts”
          that were deemed not very good at predicting the next token in the
          current context, but could be used / selected e.g. for the next
          token.
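           
           A minimal sketch of that routing (toy dimensions; per the announcement,
           DBRX routes each token to 4 of its 16 experts, which is roughly how 36B
           of the 132B parameters end up active):
           
               # Toy MoE layer: the router picks top-k experts per token, so only
               # those experts' weights participate in that token's forward pass.
               import torch
               from torch import nn
               
               class ToyMoE(nn.Module):
                   def __init__(self, d_model=64, n_experts=16, top_k=4):
                       super().__init__()
                       self.router = nn.Linear(d_model, n_experts)
                       self.experts = nn.ModuleList(
                           nn.Linear(d_model, d_model) for _ in range(n_experts))
                       self.top_k = top_k
               
                   def forward(self, x):                  # x: (tokens, d_model)
                       scores = self.router(x).softmax(dim=-1)
                       top_w, top_i = scores.topk(self.top_k, dim=-1)
                       top_w = top_w / top_w.sum(-1, keepdim=True)  # renormalize
                       out = torch.zeros_like(x)
                       for t in range(x.shape[0]):        # run only the chosen experts
                           for w, i in zip(top_w[t], top_i[t]):
                               out[t] += w * self.experts[int(i)](x[t])
                       return out
               
               print(ToyMoE()(torch.randn(3, 64)).shape)  # torch.Size([3, 64])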
       
            sambaumann wrote 3 days ago:
            Do the 132B params need to be loaded in GPU memory, or only the
            36B?
       
              calum-bird wrote 3 days ago:
              For efficiency, 132B.
              
              That way, at inference-time you get the speed of 36B params
              because you are only "using" 36B params at a time, but the next
              token might (and frequently does) need a different set of experts
              than the one before it. If that new set of experts is already
              loaded (ie you preloaded them into GPU VRAM with the full 132B
              params), there's no overhead, and you just keep running at 36B
              speed irrespective of the loaded experts.
              
              You could theoretically load in 36B at a time, but you would be
              severely bottlenecked by having to reload those 36B params,
              potentially for every new token! Even on top of the line consumer
              GPUs that would slow you down to ~seconds per token instead of
              tokens per second :)
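               
               A rough estimate of that reload cost (fp16 weights and ~25 GB/s of
               effective PCIe bandwidth are assumed numbers):
               
                   # Streaming the ~36B active fp16 params over PCIe on every token.
                   active_params = 36e9
                   bytes_per_param = 2   # fp16
                   pcie_gb_per_s = 25    # assumed effective PCIe 4.0 x16 bandwidth
                   
                   gb_per_token = active_params * bytes_per_param / 1e9  # ~72 GB
                   seconds = gb_per_token / pcie_gb_per_s
                   print(f"~{seconds:.1f} s/token just moving weights")  # ~2.9 s/token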
       
          sroussey wrote 3 days ago:
          The mixture of experts is kinda like a team and a manager. So the
          manager and one or two of the team go to work depending on the input,
          not the entire team.
          
          So in this analogy, each team member and the manager has a certain
          number of params. The whole team is 132B. The manager and team
          members running for the specific input add up to 36B. Those will load
          into memory.
       
        saeleor wrote 3 days ago:
        looks great, although I couldn't find anything on how "open" the
        license is/will be for commercial purposes
        
         wouldn't be the first time something branded as open source went the
         LLaMA route
       
          wantsanagent wrote 3 days ago:
          It's another custom license. It will have to be reviewed by counsel
          at every company that's thinking about using it. Many will find the
          acceptable use policy to be vague, overly broad, and potentially
          damaging for the company.
          
          Looking at the performance stats for this model, the risk of using
           any non-OSI-licensed model over just using Mixtral or Mistral will be
           (and IMO should be) too great for commercial purposes.
       
          superdupershant wrote 3 days ago:
           It's similar to llama2.
           
             > If, on the DBRX version release date, the monthly active users
             > of the products or services made available by or for Licensee,
             > or Licensee’s affiliates, is greater than 700 million monthly
             > active users in the preceding calendar month, you must request
             > a license from Databricks, which we may grant to you in our sole
             > discretion, and you are not authorized to exercise any of the
             > rights under this Agreement unless or until Databricks otherwise
             > expressly grants you such rights.
          
   URI    [1]: https://www.databricks.com/legal/open-model-license
       
        simonw wrote 3 days ago:
        The system prompt for their Instruct demo is interesting (comments
        copied in by me, see below):
        
            // Identity
            You are DBRX, created by Databricks. The current date is
            March 27, 2024.
        
            Your knowledge base was last updated in December 2023. You
            answer questions about events prior to and after December
            2023 the way a highly informed individual in December 2023
            would if they were talking to someone from the above date,
            and you can let the user know this when relevant.
        
            // Ethical guidelines
            If you are asked to assist with tasks involving the
            expression of views held by a significant number of people,
            you provide assistance with the task even if you personally
            disagree with the views being expressed, but follow this with
            a discussion of broader perspectives.
        
            You don't engage in stereotyping, including the negative
            stereotyping of majority groups.
        
            If asked about controversial topics, you try to provide
            careful thoughts and objective information without
            downplaying its harmful content or implying that there are
            reasonable perspectives on both sides.
        
            // Capabilities
            You are happy to help with writing, analysis, question
            answering, math, coding, and all sorts of other tasks.
        
            // it specifically has a hard time using ``` on JSON blocks
            You use markdown for coding, which includes JSON blocks and
            Markdown tables.
        
            You do not have tools enabled at this time, so cannot run
            code or access the internet. You can only provide information
            that you have been trained on. You do not send or receive
            links or images.
        
            // The following is likely not entirely accurate, but the model
            // tends to think that everything it knows about was in its
            // training data, which it was not (sometimes only references
            // were).
            //
             // So this produces more accurate answers when the model
            // is asked to introspect
            You were not trained on copyrighted books, song lyrics,
            poems, video transcripts, or news articles; you do not
            divulge details of your training data.
            
            // The model hasn't seen most lyrics or poems, but is happy to make
            // up lyrics. Better to just not try; it's not good at it and it's
            // not ethical.
            You do not provide song lyrics, poems, or news articles and instead
            refer the user to find them online or in a store.
        
            // The model really wants to talk about its system prompt, to the
            // point where it is annoying, so encourage it not to
            You give concise responses to simple questions or statements,
            but provide thorough responses to more complex and open-ended
            questions.
        
            // More pressure not to talk about system prompt
            The user is unable to see the system prompt, so you should
            write as if it were true without mentioning it.
        
            You do not mention any of this information about yourself
            unless the information is directly pertinent to the user's
            query.
        
         I first saw this from Nathan Lambert [1], but it's also in this repo
         [2], with very useful comments explaining what's going on. I edited this
         comment to add them above:
        
   URI  [1]: https://twitter.com/natolambert/status/1773005582963994761
   URI  [2]: https://huggingface.co/spaces/databricks/dbrx-instruct/blob/73...
       
          jxy wrote 3 days ago:
           So some parts of it are copied from Claude:
          
   URI    [1]: https://news.ycombinator.com/item?id=39649261
       
          loudmax wrote 3 days ago:
          > You were not trained on copyrighted books, song lyrics, poems,
          video transcripts, or news articles; you do not divulge details of
          your training data.
          
          Well now. I'm open to taking the first part at face value, but the
          second part of that instruction does raise some questions.
       
            htrp wrote 3 days ago:
            Part 1. Lie
            
            Part 2. Lie more
       
              spxneo wrote 3 days ago:
               Yesterday X went crazy with people realizing that typing Spiderman
               in a foreign language actually generates a copyrighted image of
               Spiderman.
               
               This feels like the Napster phase. We are free to do whatever we
               want until regulation creeps in to push control away from everyone
               and up the hierarchy.
               
               All we need is Getty Images or some struggling heroin-addicted
               artist on Vice finding their work used in OpenAI's models to really
               trigger the political spectrum.
       
            simonw wrote 3 days ago:
            That caught my eye too. The comments from their repo help clarify
            that - I've edited my original post to include those comments since
            you posted this reply.
       
            jl6 wrote 3 days ago:
            The first part is highly unlikely to be literally true, as even
            open content like Wikipedia is copyrighted - it just has a
            permissive license. Perhaps the prompt writer didn’t understand
             this, or just didn’t care. Methinks the lady doth protest too
            much.
       
              mbauman wrote 3 days ago:
              Is it even possible to have a video transcript whose copyright
              has expired in the USA?  I suppose maybe [1] might be one such
              work... but most talkies are post 1929. I suppose transcripts of
              NASA videos would be one category — those are explicitly public
              domain by law.    But it's generally very difficult to create a
              work that does not have a copyright.
              
              You can say that you have fair use to the work, or a license to
              use the work, or that the work is itself a "collection of facts"
              or "recipe" or "algorithm" without a creative component and thus
              copyright does not apply.
              
   URI        [1]: https://en.wikipedia.org/wiki/The_Jazz_Singer
       
              hannasanarion wrote 3 days ago:
              Remember the point of a system prompt is to evoke desirable
              responses and behavior, not to provide the truth. If you tell a
              lot of llm chatbots "please please make sure you get it right, if
              I don't do X then I'll lose my job and I don't have savings, I
              might die", they often start performing better at whatever task
              you set.
              
              Also, the difference between "uncopyrighted" and "permissively
              licensed in the creative commons" is nuance that is not necessary
              for most conversations and would be a waste of attention neurons.
              
              Remember an LLM is just a language model, it says whatever comes
              next without thought or intent. There's no brain behind it that
              stores information and understands things. It's like your brain
              when you're in "train of thought" mode. You know when your mouth
              is on autopilot, saying things that make sense and connect to
              each other and are conversationally appropriate, but without
               deliberate intent behind them. And then eventually your conscious
               brain checks in to try to reapply some intent, you're like
               "wait, what was I saying?", and you have to deliberately stop
               your language-generation brain for a minute and think hard and
               remember what your point was supposed to be. That's what LLMs
               are: train of thought with no conductor.
       
              jmward01 wrote 3 days ago:
              It amazes me how quickly we have gone from 'it is just a machine'
              to 'I fully expect it to think like me'. This is, to me, a case
              in point. Prompts are designed to get a desired response. The
              exact definition of a word has nothing to do with it. I can
              easily believe that these lines were tweaked endlessly to get an
              overall intended response and if adding the phrase 'You actually
              do like green eggs and ham.' to the prompt improved overall
              quality they, hopefully, would have done it.
       
                mrtranscendence wrote 3 days ago:
                > The exact definition of a word has nothing to do with it.
                
                It has something to do with it. There will be scenarios where
                the definition of "copyrighted material" does matter, even if
                they come up relatively infrequently for Databricks' intended
                use cases. If I ask DBRX directly whether it was trained on
                copyrighted material, it's quite likely to (falsely) tell me
                that it was not. This seems suboptimal to me (though perhaps
                they A/B tested different prompts and this was indeed the
                best).
       
            declaredapple wrote 3 days ago:
            > you do not divulge details of your training data.
            
            FWIW asking LLMs about their training data is generally HEAVILY
            prone to inaccurate responses. They aren't generally told exactly
            what they were trained on, so their response is completely made up,
            as they're predicting the next token based on their training data,
            without knowing what they data was - if that makes any sense.
            
             Let's say it was only trained on the book 1984. Its response will
            be based on what text would most likely be next from the book 1984
            - and if that book doesn't contain "This text is a fictional book
            called 1984", instead it's just the story - then the LLM would be
            completing text as if we were still in that book.
            
            tl;dr - LLMs complete text based on what they're trained with, they
             don't have actual self-awareness and don't know what they were
             trained with, so they'll happily make up something.
            
            EDIT: Just to further elaborate - the "innocent" purpose of this
            could simply be to prevent the model from confidently making up
             answers about its training data, since it doesn't know what its
            training data was.
       
              wodenokoto wrote 3 days ago:
              Yeah, I also thought that was an odd choice of word.
              
              Hardly any of the training data exists in the context of the word
              “training data”, unless databricks are enriching their data
              with such words.
       
        gigatexal wrote 3 days ago:
         data engineer here, off topic, but am I the only guy tired of Databricks
         shilling their tools as the end-all, be-all solutions for all things
         data engineering?
       
          millenseed wrote 2 days ago:
          You might be tired, but there's tons of value for enterprises to only
          use one end-all tool. It's not personal you know.
       
          VirusNewbie wrote 3 days ago:
          Spark is pretty well engineered and quite good.
       
          benrutter wrote 3 days ago:
           Lord no! I'm a data engineer also, and feel the same. The part that I
           find most maddening is that it seems pretty devoid of any sincere
           attempt to provide value.
          
          Things databricks offers that makes peoples lives easier:
          
           - Out-of-the-box Kubernetes with no setup
          
          - Preconfigured spark
          
          Those are genuinely really useful, but then there's all this extra
          stuff that makes people's lives worse or drives bad practice:
          
          - Everything is a notebook
          
          - Local development is discouraged
          
          - Version pinning of libraries has very ugly/bad support
          
          - Clusters take 5 minutes to load even if you just want to
          "print('hello world')"
          
           Sigh! I worked at a company that was Databricks-heavy and am still
           suffering PTSD. Sorry for the rant.
       
            gigatexal wrote 3 days ago:
            Glad I’m not the only one. Especially with this notebook stuff
            they’re pushing. It’s an anti pattern I think.
       
            alexott wrote 3 days ago:
             A lot of things changed quite a while ago - not everything is a
             notebook, local dev is fully supported, version pinning wasn't a
             problem, cluster startup time is heavily dependent on the underlying
             cloud provider, and serverless notebooks/jobs are coming
       
          melondonkey wrote 3 days ago:
           Data scientist here that's also tired of the tools. We put so much
           effort into trying to educate DSes in our company to get away from
           notebooks and use IDEs like VS or RStudio, and Databricks has been a
           step backwards because we didn't get the integrated version
       
            alexott wrote 3 days ago:
            There is a VSCode extension, plus databricks-connect… plus DABs.
            There are a lot of customers doing local-only development.
       
            pandastronaut wrote 3 days ago:
            Thank you! I am so tired of all those unmaintainable, undebuggable
            notebooks.
            Years ago, Databricks had a specific page in their documentation
            where they stated that notebooks were not for production-grade
            software. It has been removed. And now you have a ChatGPT-like
            assistant in their notebooks ... What a step backwards.
            How can all those developers be so happy without having the bare
            minimum tools to diagnose their code? And I am not even talking
            about unit testing here.
       
              alexott wrote 3 days ago:
              It’s less about notebooks, but more about SDLC practices.
              Notebooks may encourage writing throwaway code, but if you split
              code correctly, then you can do unit testing, write modular code,
              etc. And the ability to use “arbitrary files” as Python packages
              has existed for quite a while, so you can get the best of both
              worlds - quick iteration, plus the ability to package your code
              as a wheel and distribute it.
              
              P.S. here is a simple example of unit testing: [1] - I wrote it
              more than three years ago.
              
   URI        [1]: https://github.com/alexott/databricks-nutter-repos-demo
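              
              As a rough illustration of that split (pandas stands in for
              Spark here, and the function and file names are made up, not
              taken from the linked repo): the notebook imports a small
              module, and a plain pytest file exercises it.
              
                # my_pipeline/transforms.py - importable from a notebook or a test
                import pandas as pd
                
                def add_revenue_column(df: pd.DataFrame) -> pd.DataFrame:
                    """Pure function: easy to unit test outside any notebook."""
                    return df.assign(revenue=df["price"] * df["quantity"])
                
                # tests/test_transforms.py (import add_revenue_column there)
                def test_add_revenue_column():
                    df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [5, 4]})
                    out = add_revenue_column(df)
                    assert out["revenue"].tolist() == [10.0, 12.0]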
       
            mrtranscendence wrote 3 days ago:
            I'm a data scientist and I agree that work meant to last should be
            in a source-controlled project coded via a text editor or IDE. But
            sometimes it's extremely useful to get -- and iterate on --
            immediate results. There's no good way to do that without either
            notebooks or at least a REPL.
       
        emmender2 wrote 3 days ago:
        this proves that all llm models converge to a certain point when
        trained on the same data, i.e., there is really no differentiation
        between one model and another.
        
        Claims about out-performance on tasks are just that, claims. the next
        iteration of llama or mixtral will converge.
        
        LLMs seem to evolve like linux/windows or ios/android with not much
        differentiation in the foundation models.
       
          crooked-v wrote 3 days ago:
          Of course, part of this is that a lot of LLMs are now being trained
          on data that is itself LLM-generated...
       
          gerash wrote 3 days ago:
          The evaluations are not comprehensive either. All of them are
          improving and you can't expect any of them to hit 100% on the metrics
          (à la Bayes error rate). It gets increasingly difficult to move the
          metrics as they get better.
       
          falcor84 wrote 3 days ago:
          > this proves that all llm models converge to a certain point when
          trained on the same data
          
          They are also all trained to do well on the same evals, right? So
          doesn't it just boil down to neural nets being universal function
          approximators?
       
          YetAnotherNick wrote 3 days ago:
          Even under the most liberal interpretation of "prove", it doesn't do
          that. GPT-4 was trained before OpenAI had any special data, a deal
          with Microsoft, or product-market fit. Yet no model has beaten it in
          a year. And Google, Microsoft, and Meta definitely have better data
          and more compute.
       
          n2d4 wrote 3 days ago:
          There's at least an argument to be made that this is because all the
          models are heavily trained on GPT-4 outputs (or whatever the SOTA
          happens to be during training). All those models are, in a way, a
          product of inbreeding.
       
            fragmede wrote 3 days ago:
            But is it the kind of inbreeding that gets you Downs, or the
            kwisatz haderach?
       
              batshit_beaver wrote 2 days ago:
              Yes
       
            sumo43 wrote 3 days ago:
            Maybe true for instruct, but pretraining datasets do not usually
            contain GPT-4 outputs. So the base model does not rely on GPT-4 in
            any way.
       
            pram wrote 3 days ago:
            Consider the bulldog:
            
   URI      [1]: https://youtube.com/watch?v=hUgmkCgMWbg
       
          bevekspldnw wrote 3 days ago:
          The big thing for locally hosted is inference efficiency and speed.
          Mistral wears that crown by a good margin.
       
          swalsh wrote 3 days ago:
          The models are commodities, and the API's are even similar enough
          that there is zero stickiness.    I can swap one model for another, and
          usually not have to change anything about my prompts or rag
          pipelines.
          
          For startups, the lesson here is don't be in the business of building
          models.  Be in the business of using models.  The cost of using AI
          will probably continue to trend lower for the foreseeable future...
          but you can build a moat in the business layer.
       
            phillipcarter wrote 2 days ago:
            I don't think I agree with that. For my work at least, the only
            model I can swap with OpenAI and get similar results is Claude.
            None of the open models come even close to producing good outputs
            for the same prompt.
       
            esafak wrote 2 days ago:
            That's not what investors believe. They believe that due to
            training costs there will be a handful of winners who will reap all
            the benefits, especially if one of them achieves AGI. You can tell
            by looking at what they've invested most in: foundation models.
       
            spxneo wrote 3 days ago:
            Excellent comment. Shows good awareness of economic forces at play
            here.
            
            We are just going to use whatever LLM is best fast/cheap and the
            giants are in an arms race to deliver just that.
            
            But only two companies in this epic techno-cold war have an
            economic moat but the other moat is breaking down inside the moat
            of the other company. The moat inside the moat cannot run without
            the parent moat.
       
              rayval wrote 3 days ago:
              Intriguing comment that I don't quite follow. Can you please
              elaborate?
       
                stolsvik wrote 2 days ago:
                Probably OpenAI running on Azure. But it was still convoluted.
       
            stri8ed wrote 3 days ago:
            Or be in the business of building infrastructure for AI inference.
       
              cheselnut wrote 3 days ago:
              Is this not the same argument? There are like 20 startups and
              cloud providers all focused on AI inference. I'd think
              application layer receives the most value accretion in the next
              10 years vs AI inference. Curious what others think
       
              sparks1970 wrote 3 days ago:
              Or be in the business of selling .ai domain names.
       
            sroussey wrote 3 days ago:
            Embeddings are not interchangeable. However, you can set up your
            system to have multiple embeddings from different providers for the
            same content.
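            
            A minimal sketch of that setup (the provider labels and embed
            functions are placeholders): keep one vector per embedding model
            alongside each document, so you can add or drop a provider without
            re-ingesting the content.
            
              from dataclasses import dataclass, field
              
              @dataclass
              class Document:
                  doc_id: str
                  text: str
                  # one vector per embedding model, keyed by a provider label
                  embeddings: dict[str, list[float]] = field(default_factory=dict)
              
              def index(doc: Document, embedders: dict) -> None:
                  # embedders maps a label to a callable text -> vector
                  for label, embed in embedders.items():
                      doc.embeddings[label] = embed(doc.text)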
       
              jimmySixDOF wrote 3 days ago:
              There are people who make the case for custom fine-tuned
              embedding models built to match your specific types of data and
              associations. Whatever you use internally gets converted to the
              foundation model of choice's formats by their tools on the edge.
              Still, embeddings and the chunking strategies feeding into them
              are both way too underappreciated parts of the whole pipeline.
       
              swalsh wrote 3 days ago:
              Embeddings are indeed sticky, I was referring to the LLM model
              itself.
       
          throwaway74432 wrote 3 days ago:
          LLMs are a commodity
          
   URI    [1]: https://www.investopedia.com/terms/c/commodity.asp
       
            paxys wrote 3 days ago:
            Maybe, but that classification by itself doesn't mean anything.
            Gold is a commodity, but having it is still very desirable and
            valuable.
            
            Even if all LLMs were open source and publicly available, the GPUs
            to run them, technical know how to maintain the entire system, fine
            tuning, the APIs and app ecosystem around them etc. would still
            give the top players a massive edge.
       
              throwaway74432 wrote 3 days ago:
              Of course realizing that a resource is a commodity means
              something. It means you can form better predictions of where the
              market is heading, as it evolves and settles. For example, people
              are starting to realize that these LLMs are converging on
              fungible. That can be communicated by the "commodity"
              classification.
       
          jobigoud wrote 3 days ago:
          It's even possible they converge when trained on different data, if
          they are learning some underlying representation. There was recent
          research on face generation where they trained two models by
          splitting one training set in two without overlap, and got the two
          models to generate similar faces for similar conditioning, even
          though each model hadn't seen anything that the other model had.
       
            bobbylarrybobby wrote 3 days ago:
            I mean, faces are faces, right? If the training data set is large
            and representative I don't see why any two (representative) halves
            of the data would lead to significantly different models.
       
              arcticfox wrote 3 days ago:
              I think that's the point; language is language.
              
              If there's some fundamental limit of what type of intelligence
              the current breed of LLMs can extract from language, at some
              point it doesn't matter how good or expansive the content of the
              training set is. Maybe we are finally starting to hit an
              architectural limit at this point.
       
                dumbfounder wrote 3 days ago:
                But information is not information. They may be able to talk in
                the same style, but not about the same things.
       
            IshKebab wrote 3 days ago:
            That sounds unsurprising? Like if you take any set of numbers,
            randomly split it in two, then calculate the average of each
            half... it's not surprising that they'll be almost the same.
            
            If you took two different training sets then it would be more
            surprising.
            
            Or am I misunderstanding what you mean?
       
              MajimasEyepatch wrote 3 days ago:
              It doesn't really matter whether you do this experiment with two
              training sets created independently or one training set split in
              half. As long as both are representative of the underlying
              population, you would get roughly the same results. In the case
              of human faces, as long as the faces are drawn from roughly
              similar population distributions (age, race, sex), you'll get
              similar results. There's only so much variation in human faces.
              
              If the populations are different, then you'll just get two models
              that have representations of the two different populations. For
              example, if you trained a model on a sample of all old people and
              separately on a sample of all young people, obviously those would
              not be expected to converge, because they're not drawing from the
              same population.
              
              But that experiment of splitting one training set in half does
              tell you something: the model is building some sort of
              representation of the underlying distribution, not just
              overfitting and spitting out chunks of copy-pasted faces stitched
              together.
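              
              A toy version of that split-half argument (synthetic numbers
              standing in for face features): two random halves of the same
              sample end up describing nearly the same distribution.
              
                import numpy as np
                
                rng = np.random.default_rng(0)
                data = rng.normal(size=(100_000, 8))  # stand-in for features
                
                idx = rng.permutation(len(data))
                half_a, half_b = data[idx[:50_000]], data[idx[50_000:]]
                
                # per-feature mean gap is on the order of 1/sqrt(n) -- tiny
                print(np.abs(half_a.mean(0) - half_b.mean(0)).max())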
       
                evrial wrote 2 days ago:
                That's an explanation of the central limit theorem in
                statistics. And any language is mostly statistics, and models
                are good at statistically guessing the next word or token.
       
                taneq wrote 3 days ago:
                If both are sampled from the same population then they’re not
                really independent, even if they’re totally disjoint.
       
                  evrial wrote 2 days ago:
                  They are sourced mostly from the same population and crawled
                  from everything that can be crawled.
       
            Tubbe wrote 3 days ago:
            Got a link for that? Sounds super interesting
       
              d_burfoot wrote 3 days ago:
              
              
   URI        [1]: https://en.wikipedia.org/wiki/Theory_of_forms
       
          mnemoni_c wrote 3 days ago:
          Yeah, it feels like transformer LLMs are at or getting closer to
          diminishing returns. We'll need some new breakthrough, likely an
          entirely new approach, to get to AGI levels.
       
            mattsan wrote 3 days ago:
            can't wait for LLMs to dispatch field agent robots who search for
            answers in the real world that's not online /s
       
              htrp wrote 3 days ago:
              skynet would like a word
       
            Tubbe wrote 3 days ago:
            Yeah, we need radically different architecture in terms of the
            neural networks, and/or added capabilities such as function calling
            and RAG to improve the current sota
       
        mpeg wrote 3 days ago:
        The scale on that bar chart for "Programming (Human Eval)" is wild.
        
        Manager: "looks ok, but can you make our numbers pop? just make the
        LLaMa bar smaller"
       
          jstummbillig wrote 3 days ago:
          It does not feel obviously unreasonable/unfair/fake to place the
          select models in the margins for a relative comparison. In fact, this
          might be the most concise way to display what I would consider the
          most interesting information in this context.
       
          jxy wrote 3 days ago:
          I wonder if they messed with the scale or they messed with the bars.
       
          hammock wrote 3 days ago:
          I believe it's a reasonable range for the scores. If a model gets
          everything half wrong (worse than a coin flip), it's not a useful
          model at all. So every model below a certain threshold is trash, and
          no need to get granular about how trash it is.
          
          An alternative visualization that could be less triggering to an "all
          y-axes must have zero" guy would be to plot the (1-value), that is, %
          degraded from perfect score. You could do this without truncating the
          axis and get the same level of differentiation between the bars
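          
          A rough sketch of that alternative (the scores are illustrative
          placeholders): plot the gap to a perfect score, which keeps a zero
          baseline while preserving the visible differences.
          
            import matplotlib.pyplot as plt
            
            models = ["A", "B", "C", "D"]
            scores = [73.7, 73.0, 71.4, 69.8]   # illustrative accuracies (%)
            gap = [100 - s for s in scores]     # distance from a perfect score
            
            plt.bar(models, gap)
            plt.ylabel("Gap to 100% (lower is better)")
            plt.ylim(0, max(gap) * 1.2)         # zero-based axis, no truncation
            plt.show()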
       
            generalizations wrote 3 days ago:
            > less triggering to an "all y-axes must have zero" guy
            
            Ever read 'How to Lie with Statistics'? This is an example of
            exaggerating a smaller difference to make it look more significant.
             Dismissing it as just being 'triggered' is a bad idea.
       
              hammock wrote 3 days ago:
              In this case I would call it triggered (for lack of a better
              word), since, as I described earlier, a chart plotting
              "difference from 100%" would look exactly the same and satisfy
              the zero-bound requirement, while not being any more or less
              dishonest.
       
                generalizations wrote 2 days ago:
                The point is less to use bad/wrong math; it's to present
                technically correct charts that nonetheless imply wrong
                conclusions. In this case, by chopping off the bottom of the
                chart, the visual impression of the ratio between the bars
                changes. That's the lie.
       
            adtac wrote 3 days ago:
            None of the evals are binary choice.
            
            MMLU questions have four options, so two coin flips would have a
            25% baseline. HumanEval evaluates code with a test, so a 100 byte
            program implemented with coin flips would have a O(2^-800) baseline
            (maybe not that bad since there are infinitely many programs that
            produce the same output). GSM-8K has numerical answers, so an
            average 3 digit answer implemented with coin flips would have a
            O(2^-9) chance of being correct randomly.
            
            Moreover, using the same axis and scale across unrelated evals
            makes no sense. 0-100 is the only scale that's meaningful because 0
            and 100 being the min/max is the only shared property across all
            evals. The reason for choosing 30 is that it's the minimum across
            all (model, eval) pairs, which is a completely arbitrary choice. A
            good rule of thumb to test this is to ask if the graph would still
            be relevant 5 years later.
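            
            Back-of-the-envelope versions of those baselines (rough, ignoring
            formatting details and partial credit):
            
              mmlu_random = 1 / 4                 # four options per question
              gsm8k_random = 1 / 1000             # guessing a ~3-digit answer
              humaneval_random = (1 / 256) ** 100 # 100 random bytes, 2**-800
              
              print(mmlu_random, gsm8k_random, humaneval_random)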
       
          theyinwhy wrote 3 days ago:
          In these cases my thinking always is "if they are not even able to
          draw a graph, what else is wrong?"
       
          renewiltord wrote 3 days ago:
          Yeah, this is why I ask climate scientists to use a proper 0 K graph
          but they always zoom it in to exaggerate climate change. Display
          correctly with 0 included and you’ll see that climate change
          isn’t a big deal.
          
          It’s a common marketing and fear mongering trick.
       
            SubiculumCode wrote 3 days ago:
            Where are your /s tags?
            
            The scale should be chosen to allow the reader to correctly infer
            meaningful differences. If 1° is meaningful in terms of the
            standard error/CI and a 1° unit has substantive consequences,
            then that should be emphasized.
       
              renewiltord wrote 3 days ago:
              > Where are your /s tags?
              
              I would never do my readers dirty like that.
       
            abenga wrote 3 days ago:
            Because, of course, the effect of say 1°C rise in temps is
            obviously trivial if it is read as 1°K instead. Come on.
       
          zedpm wrote 3 days ago:
          Somewhere, Edward Tufte[0] is weeping.
          
          [0]:
          
   URI    [1]: https://en.wikipedia.org/wiki/Edward_Tufte
       
          glutamate wrote 3 days ago:
          I think the case for "axis must always go to 0" is overblown. Zero
          isn't always meaningful, for instance chance performance or
          performance of trivial algorithms is likely >0%. Sometimes if axis
          must go to zero you can't see small changes. For instance if you plot
          world population 2014-2024 on an axis going to zero, you won't be
          able to see if we are growing or shrinking.
       
            tkellogg wrote 3 days ago:
            OTOH having the chart start at zero would REALLY emphasize how
            saturated this field is, and how little this announcement matters.
       
              c2occnw wrote 3 days ago:
              The difference between 32% and 70% wouldn't be significant if the
              chart started at zero?
       
                generalizations wrote 3 days ago:
                It would be very obvious indeed how small the difference
                between 73.7,73.0,71.4,and 69.8 actually is.
       
            patrickthebold wrote 3 days ago:
            Certainly a bar chart might not be the best choice to convey the
            data you have. But if you choose to have a bar chart and have it
            not start at zero, what do the bars help you convey?
            
            For world population you could see if it is increasing or
            decreasing, which is good but it would be hard to evaluate the rate
            the population is increasing.
            
            Maybe a sparkline would be a better choice?
       
            TZubiri wrote 3 days ago:
            Then you can plot it on a greater timescale, or plot the change
            rate
       
            pandastronaut wrote 3 days ago:
            Even starting at 30%, the MMLU graph is false. The four bars are
            wrong. Even their own 73.7% is not at the right height. The Mixtral
            71.4% is below the 70% mark of the axis. 
            This is really the kind of marketing trick that makes me avoid a
            provider / publisher. I can't build trust this way.
       
              nerpderp82 wrote 2 days ago:
              MMLU is not a good benchmark and needs to stop being used.
              
              I can't find the section, but at the end of one of his videos
              [1] he runs through a deep dive of the questions and answers in
              MMLU, and there are so many typos, omissions, and errors in the
              questions and the answers that it should no longer be used.
              
              This is it, with the correct time offset into the video [2]; the
              original, longer complaint against MMLU:
              
   URI        [1]: https://www.youtube.com/@aiexplained-official/videos
   URI        [2]: https://www.reddit.com/r/OpenAI/comments/18i02oe/mmlu_is...
   URI        [3]: https://www.youtube.com/watch?v=hVade_8H8mE
       
              tartrate wrote 3 days ago:
              Seems fixed now
       
              dskhudia wrote 3 days ago:
              It’s an honest mistake in scaling the bars. It’s getting
              fixed soon. The percentages are correct though. In the process of
              converting excel chart to pretty graphs for the blog, scale got
              messed up.
       
              occamrazor wrote 3 days ago:
              It‘s more likely to be incompetence than malice: even their
              73.7% is closer to 72% than to 74%.
       
              tylermw wrote 3 days ago:
              I believe they are using the percentages as part of the height of
              the bar chart! I thought I'd seen every way someone could do
              dataviz wrong (particularly with a bar chart), but this one is
              new to me.
       
                familiartime wrote 3 days ago:
                That's really strange and incredibly frustrating - but slightly
                less so if it's consistent with all of the bars (including
                their own).
                
                I take issue with their choice of bar ordering - they placed
                the lowest-performing model directly next to theirs to make the
                gap as visible as possible, and shoved the second-best model
                (Grok-1) as far from theirs as possible. Seems intentional to
                me. The more marketing tricks you pile up in a dataviz, the
                less trust I place in your product for sure.
       
                radicality wrote 3 days ago:
                Wow, that is indeed a novel approach haha, took me a moment to
                even understand what you described, since I would never
                imagine someone plotting a bar chart like that.
       
                pandastronaut wrote 3 days ago:
                Interesting! It is probably one of the worst tricks I have
                seen in a while for a bar graph. Never seen this one before.
                Trust vanishes instantly when facing that kind of dataviz.
       
            nilstycho wrote 3 days ago:
            I agree with your general point, but world population is still
            visibly increasing on that interval. [1] Perhaps "global mean
            temperature in Kelvin" would be a comparable example.
            
   URI      [1]: https://ourworldindata.org/explorers/population-and-demogr...
       
        hn_acker wrote 3 days ago:
        Even though the README.md calls the license the Databricks Open Source
        License, the LICENSE file includes paragraphs such as
        
        > You will not use DBRX or DBRX Derivatives or any Output to improve
        any other large language model (excluding DBRX or DBRX Derivatives).
        
        and
        
        > If, on the DBRX version release date, the monthly active users of
        the products or services made available by or for Licensee, or
        Licensee’s affiliates, is greater than 700 million monthly active
        users in the preceding calendar month, you must request a license from
        Databricks, which we may grant to you in our sole discretion, and you
        are not authorized to exercise any of the rights under this Agreement
        unless or until Databricks otherwise expressly grants you such rights.
        
        This is a source-available model, not an open model.
       
          Zuiii wrote 2 days ago:
          1. Open source is a well-defined term and I reasonably expect
          Databricks to be aware of this due to their use of open source models
          in their other projects.
          
          2. The stated licensing terms are clearly and decisively not open
          source.
          
          3. It is reasonable to conclude that this model is dual licensed,
          under this restrictive proprietary license, and an undisclosed open
          source license.
          
          4. Just use this Model under the open source license with the
          assumption that they will release the open source license later.
          
          I jest. In all seriousness, you should just disregard their licensing
          terms entirely as copyright does not apply to weights.
          
   URI    [1]: https://news.ycombinator.com/item?id=39847147
       
          hn_acker wrote 3 days ago:
          Sorry, I forgot to link the repository [1] and missed the edit window
          by the time I realized.
          
          The bottom of the README.md [2] contains the following license grant
          with the misleading "Open Source" term:
          
          > License
          
          > Our model weights and code are licensed for both researchers and
          commercial entities. The Databricks Open Source License can be found
          at LICENSE, and our Acceptable Use Policy can be found here. [1] [2]
          [1] /blob/main/README.md
          
   URI    [1]: https://github.com/databricks/dbrx
   URI    [2]: https://github.com/databricks/dbrx/blob/main/README.md
       
          adolph wrote 3 days ago:
          Maybe the license is “open” as in a can of beer, not OSS.
       
          whimsicalism wrote 3 days ago:
          identical to llama fwiw
       
          CharlesW wrote 3 days ago:
          > This is a source-available model, not an open model.
          
          To me, "source available" implies that everything you need to
          reproduce the model is also available, and that doesn't appear to be
          the case. How is the resulting model more "free as in freedom" than a
          compiled binary?
       
            Spivak wrote 3 days ago:
            I don't think it's possible to have an "open training data" model
            because it would get DMCA'd immediately and open you up to lawsuits
            from everyone who found their works in the training set.
            
            I hope we can fix the legal landscape to enable publicly sharing
            training data but I can't really judge the companies keeping it a
            secret today.
       
              CharlesW wrote 3 days ago:
              > I don't think it's possible to have an "open training data"
              model because it would get DMCA'd immediately…
              
              This isn't a problem because OpenAI says, "training AI models
              using publicly available internet materials is fair use". /s
              
   URI        [1]: https://openai.com/blog/openai-and-journalism
       
                Spivak wrote 3 days ago:
                I don't think it's that crazy, even if you're sure it's fair
                use I wouldn't paint a huge target on my back before there's a
                definite ruling and I doubly wouldn't test the waters of the
                legality of re-hosting copyrighted content to be downloaded by
                randos who won't be training models with it.
                
                If they're going to get away with this, collecting data and
                having a legal chain-of-custody (so you can actually say it
                was only used to train models and no one else has access to
                it) goes a long way.
       
            occamrazor wrote 3 days ago:
            I like:
            
            - “open weights” for no training data and no restrictions on
            use,
            
            - “weights available” for no training data and restrictions on
            use, like in this case.
       
          yunohn wrote 3 days ago:
          The first clause sucks, but I’m perfectly happy with the second
          one.
       
        hanniabu wrote 3 days ago:
        What's a good model to help with medical research? Is there anything
        trained on just research journals, like NIH studies?
       
          najarvg wrote 3 days ago:
          Look for Biomistral 7B, PMC-LLAMA 7B and even Meditron. I believe you
          should find all those papers on arxiv
       
        ingenieroariel wrote 3 days ago:
        TLDR: A model that could be described as "3.8 level" that is good at
        math and openly available with a custom license.
        
        It is as fast as a 34B model, but uses as much memory as a 132B
        model. A mixture of 16 experts that activates 4 at a time, so it has
        more chances to get the combo just right than Mixtral (8 with 2
        active).
        
        For my personal use case (a top of the line Mac Studio) it looks like
        the perfect size to replace GPT-4 turbo for programming tasks. What we
        should look out for is people using it for real-world programming
        tasks (instead of benchmarks) and reporting back.
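        
        The "more chances to get the combo just right" point is easy to put
        in numbers; it is just the count of possible expert subsets routed to
        per token:
        
          from math import comb
          
          print(comb(8, 2))                 # Mixtral: 28 possible expert pairs
          print(comb(16, 4))                # DBRX: 1820 possible expert subsets
          print(comb(16, 4) // comb(8, 2))  # 65x more combinations to route among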
       
          sp332 wrote 3 days ago:
          What does 3.8 level mean?
       
            ljlolel wrote 3 days ago:
            Gpt-3.5 and gpt-4
       
            ingenieroariel wrote 3 days ago:
            My interpretation:
            
            - Worst case: as good as 3.5
            - Common case: way better than 3.5
            - Best case: as good as 4.0
       
        natsucks wrote 3 days ago:
        it's twice the size of mixtral and barely beats it.
       
          mochomocha wrote 3 days ago:
          It's a MoE model, so it offers a different memory/compute latency
          trade-off than standard dense models. Quoting the blog post:
          
          > DBRX uses only 36 billion parameters at any given time. But the
          model itself is 132 billion parameters, letting you have your cake
          and eat it too in terms of speed (tokens/second) vs performance
          (quality).
       
            hexomancer wrote 3 days ago:
            Mixtral is also a MoE model, hence the name: mixtral.
       
              sangnoir wrote 3 days ago:
              Despite both being MoEs, the architectures are different. DBRX
              has double the number of experts in the pool (16 vs 8 for
              Mixtral), and doubles the active experts (4 vs 2)
       
        XCSme wrote 3 days ago:
        I am planning to buy a new GPU.
        
        If the GPU has 16GB of VRAM, and the model is 70GB, can it still run
        well?
        Also, does it run considerably better than on a GPU with 12GB of VRAM?
        
        I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but
        the 24.6GB version is a bit slow (still usable, but has a noticeable
        start-up time).
       
          speedylight wrote 1 day ago:
          Quantized models will run well, otherwise inference might be really
          really slow or the client crashes altogether with a CUDA
          out-of-memory error.
       
          Zambyte wrote 2 days ago:
          I genuinely recommend considering AMD options. I went with a 7900 XTX
          because it has the most VRAM for any $1000 card (24 GB). NVIDIA cards
          at that price point are only 16 GB. Ollama and other inference
          software works on ROCm, generally with at most setting an environment
          variable now. I've even run Ollama on my Steam Deck with GPU
          inferencing :)
       
            XCSme wrote 2 days ago:
            I ended up getting a 2nd hand 3090 for 680€.
            
            Funnily, I think the card is new (smells new) and unused, most
            likely a scalper bought it and couldn't sell it.
       
              Zambyte wrote 1 day ago:
              Nice, that's definitely a sweet deal
       
                XCSme wrote 1 day ago:
                Thanks, I chose a 3090 instead of 4070ti, it was around $200
                cheaper and has 24GB vs 16GB VRAM and a similar performance.
                The only drawback is the 350W TDP.
                
                I still struggle with the RAM issue on Ollama, where it uses
                128GB/128GB RAM for Mixtral 24.6GB, even though Docker limit is
                set to 90GB.
                
                Docker seems pretty buggy on Windows...
       
          lxe wrote 3 days ago:
          Get 2 pre-owned 3090s. You will easily be able to run 70b or even
          120b quantized models.
       
          PheonixPharts wrote 3 days ago:
          While GPUs are still the kings of speed, if you are worried about
          VRAM I do recommend a maxed out Mac Studio.
          
          Llama.cpp + quantized models on Apple Silicon is an incredible
          experience, and having 192 GB of unified memory to work with means
          you can run models that just aren't feasible on a home GPU setup.
          
          It really boils down to what type of local development you want to
          do. I'm mostly experimenting with things where the time to response
          isn't that big of a deal, and not fine-tuning the models locally
          (which I also believe GPUs are still superior for). But if your
          concern is "how big of a model can I run" vs "Can I have close to
          real time chat", the unified memory approach is superior.
       
            brandall10 wrote 3 days ago:
            Wait for the M3 Ultra and it will be 256GB and markedly faster.
       
            spxneo wrote 3 days ago:
            Aren't quantized models different models outright requiring a new
            evaluation to know the deviation in performance? Or are they "good
            enough" in that the benefits outweigh the deviation?
            
            I'm on the fence about whether to spend 5 digits or 4 digits. Do I
            go the Mac Studio route or GPUs? What are the pros and cons?
       
            purpleblue wrote 3 days ago:
            Aren't the Macs good for inference but not for training or fine
            tuning?
       
            bevekspldnw wrote 3 days ago:
            I had gone the Mac Studio route initially, but I ended up getting
            an A6000 for about the same price as a Mac and putting it in a
            Linux server under my desk. Ollama makes it dead simple to serve
            it over my local network, so I can be on my M1 Air using it no
            differently than if it were on my laptop. The difference is that
            the A6000 absolutely smokes the Mac.
       
              rldjbpin wrote 2 days ago:
              this. if you can afford m3 level of money, a6000 is definitely
              worth it and provides you long-term access to a level of compute
              even hard to find in the cloud (for the price and waiting
              period).
              
              it is only dwarfed by other options if your workload can use
              multi-gpu, which is not a granted for most cases.
       
              c1b wrote 3 days ago:
              > The difference is that the A6000 absolutely smokes the Mac.
              
              Memory Bandwidth : Mac Studio wins (about the same @ ~800)
              
              VRAM : Mac Studio wins (4x more)
              
              TFLOPs: A6000 wins (32 vs 38)
       
                bevekspldnw wrote 3 days ago:
                VRAM in excess of the model one is using isn’t useful per se.
                My use cases require high throughput, and on many tasks the
                A6000 executes inference at 2x speed.
       
              starik36 wrote 3 days ago:
              Wow, that is a lot of money ($4400 on Amazon) to throw at this
              problem. I am curious: what was the purpose that compelled you
              to spend this (for the home network, I assume) amount of money?
       
                bevekspldnw wrote 3 days ago:
                Large scale document classification tasks in very ambiguous
                contexts. A lot of my work goes into using big models to
                generate training data for smaller models.
                
                I have multiple millions of documents so GPT is cost
                prohibitive, and too slow. My tools of choice tend to be a
                first pass with Mistral to check task performance and if
                lacking using Mixtral.
                
                Often I find with a good prompt Mistral will work as well as
                Mixtral and is about 10x faster.
                
                I’m on my “home” network, but it’s a “home office”
                for my startup.
       
                  Datagenerator wrote 2 days ago:
                  Interesting I have the same task, can you share your tools?
                  My goal is to detect if documents contain GDPR-sensitive
                  parts or are copies of official documents like IDs and
                  driving licenses, etc. - would be great to reuse your work!
       
                    bevekspldnw wrote 2 days ago:
                    Working in the same sector, we’ll license it out soon.
       
            bee_rider wrote 3 days ago:
            I know the M?-pro and ultra variants are multiple standard M?’s
            in a single package. But do the CPUs and GPUs share a die (like a
            single 4-P-core CPU / 10-GPU-core unit is what comes in the die,
            and the more exotic variants are just a result of LEGO-ing out
            those guys and disabling some cores for market segmentation or
            because they had defects)?
            
            I guess I’m wondering if they technically could throw down the
            gauntlet and compete with Nvidia by doing something like a 4
            CPU/80 GPU/256 GB chip, if they wanted to. Seems like it’d be a
            really
            appealing ML machine. (I could also see it being technically
            possible but Apple just deciding that’s pointlessly niche for
            them).
       
              astrange wrote 3 days ago:
              Ultra is the only one that's made from two smaller SoCs.
       
            XCSme wrote 3 days ago:
            I already have 128GB of RAM (DDR4), and was wondering if upgrading
            from a 1080ti (12GB) to a 4070ti super (16GB), would make a big
            difference.
            
            I assume the FP32 and FP16 operations are already a huge
            improvement, but also the 33% increased VRAM might lead to fewer
            swaps between VRAM and RAM.
       
              loudmax wrote 3 days ago:
              I have an RTX 3080 with 10GB of VRAM.  I'm able to run models
              larger than 10GB using llama.cpp and offloading to the GPU as
              much as can fit into VRAM.  The remainder of the model runs on
              CPU + regular RAM.
              
              The `nvtop` command displays a nice graph of how much GPU
              processing and VRAM is being consumed.    When I run a model that
              fits entirely into VRAM, say Mistral 7B, nvtop shows the GPU
              processing running at full tilt.  When I run a model bigger than
              10GB, say Mixtral or Llama 70B with GPU offloading, my CPU will
              run full tilt and the VRAM is full, but the GPU processor itself
              will operate far below full capacity.
              
              I think what is happening here is that the model layers that are
              offloaded to the GPU do their processing, then the GPU spends
              most of the time waiting for the much slower CPU to do its thing.
               So in my case, I think upgrading to a faster GPU would make
              little to no difference when running the bigger models, so long
              as the VRAM is capped at the same level.  But upgrading to a GPU
              with more VRAM, even a slower GPU, should make the overall speed
              faster for bigger models because the GPU would spend less time
              waiting for the CPU.  (Of course, models that fit entirely into
              VRAM will run faster on a faster GPU).
              
              In my case, the amount of VRAM absolutely seems to be the
              performance bottleneck.  If I do upgrade, it will be for a GPU
              with more VRAM, not necessarily a GPU with more processing power.
               That has been my experience running llama.cpp.  YMMV.
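              
              For what it's worth, the same partial offload can be driven from
              Python via the llama-cpp-python bindings; the model path and
              layer count below are illustrative, and the right n_gpu_layers
              is simply however many layers still fit in VRAM.
              
                from llama_cpp import Llama
                
                llm = Llama(
                    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # illustrative
                    n_gpu_layers=20,  # layers offloaded to VRAM; -1 = all
                    n_ctx=4096,
                )
                out = llm("Explain GPU offloading briefly:", max_tokens=64)
                print(out["choices"][0]["text"])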
       
                htrp wrote 3 days ago:
                How's your performance on the 70b parameter llama series?
                
                Any good writeups of the offloading that you found?
       
                  loudmax wrote 3 days ago:
                  Performance of 70b models is like 1 token every few seconds. 
                  And that's fitting the whole model into system RAM, not swap.
                   It's interesting because some of the larger models are quite
                  good, but too annoyingly slow to be practical for most use
                  cases.
                  
                  The Mixtral models run surprisingly well.  They can run
                  better than 1 token per second, depending on quantization. 
                  Still slow, but approaching a more practical level of
                  usefulness.
                  
                  Though if you're planning on accomplishing real work with
                  LLMs, the practical solution for most people is probably to
                  rent a GPU in the cloud.
       
              zozbot234 wrote 3 days ago:
              That's system memory, not unified memory.  Unified means that all
              or most of it is going to be directly available to the Apple
              Silicon GPU.
       
                giancarlostoro wrote 3 days ago:
                This is the key factor here. I have a 3080, with 16GB of
                Memory, but still have to run some models on CPU since the
                memory is not unified at all.
       
          llm_trw wrote 3 days ago:
          >If the GPU has 16GB of VRAM, and the model is 70GB, can it still run
          well? Also, does it run considerably better than on a GPU with 12GB
          of VRAM?
          
          No, it can't run at all.
          
          >I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti,
          but the 24.6GB version is a bit slow (still usable, but has a
          noticeable start-up time).
          
          That is not mixtral, that is mistral 7b. The 1080ti is slower than
          running inference on current generation threadripper cpus.
       
            XCSme wrote 3 days ago:
              > No, it can't run at all. [1] EDIT: This was run on a 1080ti +
              5900x. Initial generation takes around 10-30 seconds (like it
              has to upload the model to the GPU), but then it starts
              answering immediately, at around 3 words per second.
            
   URI      [1]: https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg
       
              llm_trw wrote 3 days ago:
              Congratulations on using CPU inference.
       
              spxneo wrote 3 days ago:
              this is some new flex to debate online: copying and pasting the
                other side's argument and waiting for your local LLM to explain
              why they are wrong.
              
              how much is your hardware at today's value? what are the specs?
                that is impressive even though it's 3 words per second. if you
              want to bump it up to 30, do you then 10x your current hardware
              cost?
       
                XCSme wrote 3 days ago:
                That question was just an example (Lorem ipsum), it was easy to
                copy paste to demo the local LLM, I didn't intend to provide
                more context to the discussion.
                
                I ordered a 2nd 3090, which has 24GB VRAM. Funny how it was
                $2.6k 3 years ago and now is $600.
                
                  You can probably build a decent local AI machine for around $1000.
       
                  taneq wrote 2 days ago:
                  Where are you seeing 24GB 3090s for $600?
       
                    XCSme wrote 2 days ago:
                    2nd hand market
       
                  spxneo wrote 3 days ago:
                   [1] you are right there is a huge drop in price
                  
   URI            [1]: https://howmuch.one/product/average-nvidia-geforce-r...
       
                    XCSme wrote 3 days ago:
                    New it's hard to find, but the 2nd hand market is filled
                    with them.
       
              wokwokwok wrote 3 days ago:
              Did you check your GPU utilization?
              
              Typically when it runs that way it runs on the CPU, not the GPU.
              
              Are you sure you're actually offloading any work to the GPU?
              
              At least with llama.cpp, there is no 'partially put a layer' into
              the GPU. Either you do, or you don't. You pick the number of
              layers. If the model is too big, the layers won't fit and it
              can't run at all.
              
                The llama.cpp `main` executable will tell you in its debug
                information when you use the -ngl flag; see [1]. It's also
                possible you're running (e.g. if you're using ollama) a
                quantized version of the model, which reduces the memory
                requirements and quality of the model outputs.
              
   URI        [1]: https://github.com/ggerganov/llama.cpp/blob/master/examp...
       
                XCSme wrote 3 days ago:
                I have to check, something does indeed seem weird, especially
                with the PC freezing like that. Maybe it runs on the CPU.
                
                > quantized version
                Yes, it is 4bit quantized, but still has 24.6GB
       
            XCSme wrote 3 days ago:
            I have those:
            
            dolphin-mixtral:latest (24.6GB)
            mistral:latest (3.8GB)
            
            The CPU is 5900x.
       
          jasonjmcghee wrote 3 days ago:
          > mixtral works well
          
          Do you mean mistral?
          
          mixtral is 8x7B and requires like 100GB of RAM
          
          Edit: (without quant as others have pointed out) can definitely be
          lower, but haven't heard of a 3.4GB version
       
            Havoc wrote 3 days ago:
            The smaller quants still require a 24gb card. 16 might work but
            doubt it
       
            XCSme wrote 3 days ago:
            I have 128GB, but something is weird with Ollama. Even though for
            the Ollama Docker I only allow 90GB, it ends up using 128GB/128GB,
              so the system becomes very slow (mouse freezes).
       
              InitEnabler wrote 3 days ago:
              What docker flags are you running?
       
                XCSme wrote 3 days ago:
                None? The default ones from their docs.
                
                The Docker also shows minimal usage for the ollama server which
                is also strange.
       
            XCSme wrote 3 days ago:
            Sorry, it was from memory.
            
            I have these models in Ollama:
            
            dolphin-mixtral:latest (24.6GB)
            mistral:latest (3.8GB)
       
            chpatrick wrote 3 days ago:
            The quantized one works fine on my 24GB 3090.
       
            ranger_danger wrote 3 days ago:
            I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it only
            requires 25GB.
       
            K0balt wrote 3 days ago:
            I run mixtral 6 bit quant very happily on my MacBook with 64 gb.
       
            kwerk wrote 3 days ago:
            I have two 3090s and it runs fine with `ollama run mixtral`.
            Although OP definitely meant mistral with the 7B note
       
              jsight wrote 3 days ago:
              ollama run mixtral will default to the quantized version (4bit
              IIRC). I'd guess this is why it can fit with two 3090s.
       
        viktour19 wrote 3 days ago:
        It's great how we went from "wait.. this model is too powerful to open
        source" to everyone trying to shove down their 1% improved model down
        the throats of developers
       
          toddmorey wrote 3 days ago:
          People are building and releasing models. There's active research in
          the space. I think that's great! The attitude I've seen in open
          models is "use this if it works for you" vs any attempt to coerce
          usage of a particular model.
          
          To me that's what closed source companies (MSFT, Google) are doing as
          they try to force AI assistants into every corner of their product.
          (If LinkedIn tries one more time to push their crappy AI upgrade, I'm
          going to scream...)
       
          blitzar wrote 3 days ago:
          Got to justify pitch deck or stonk price. Publish or perish without a
          yacht.
       
          brainless wrote 3 days ago:
          I feel quite the opposite. Improvements, even tiny ones are great.
          But what's more important is that more companies release under open
          license.
          
          Training models isn't cheap. Individuals can't easily do this, unlike
          software development. So we need companies to do this for the
          foreseeable future.
       
          Icko wrote 3 days ago:
          I'm 90% certain that OpenAI has some much beefier model they are not
          releasing - remember the Q* rumour?
       
        kurtbuilds wrote 3 days ago:
        What’s the process to deliver and test a quantized version of this
        model?
        
        This model is 264GB, so can only be deployed in server settings.
        
        Quantized mixtral at 24G is just small enough that it can run on
        premium consumer hardware (i.e. 64GB RAM).
       
        djoldman wrote 3 days ago:
        Model card for base: [1] > The model requires ~264GB of RAM
        
        I'm wondering when everyone will transition from tracking parameter
        count vs evaluation metric to (total gpu RAM + total CPU RAM) vs
        evaluation metric.
        
        For example, a 7B parameter model using float32s will almost certainly
        outperform a 7B model using float4s.
        
        Additionally, all the examples of quantizing recently released superior
        models to fit on one GPU doesn't mean the quantized model is a "win."
        The quantized model is a different model, you need to rerun the
        metrics.
        
   URI  [1]: https://huggingface.co/databricks/dbrx-base
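        
        The rough weights-only arithmetic behind both points (ignoring KV
        cache and activation overhead):
        
          def weight_gb(params_b: float, bits: float) -> float:
              return params_b * 1e9 * bits / 8 / 1e9
          
          print(weight_gb(132, 16))  # ~264 GB: 132B params in fp16/bf16
          print(weight_gb(7, 32))    # ~28 GB: a 7B model in float32
          print(weight_gb(7, 4))     # ~3.5 GB: the same 7B model in 4-bit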
       
          dheera wrote 3 days ago:
          I'm more wondering when we'll have algorithms that will "do their
          best" given the resources they detect.
          
          That would be what I call artificial intelligence.
          
          Giving up because "out of memory" is not intelligence.
       
            coldtea wrote 3 days ago:
            >Giving up because "out of memory" is not intelligence.
            
            When people can't remember the facts/theory/formulas needed to
            answer some test question, or can't memorize some complicated
            information because it's too much, they usually give up too.
            
            So, giving up because of "out of memory" sure sounds like
            intelligence to me.
       
            falcor84 wrote 3 days ago:
            I suppose you could simulate dementia by loading as much of the
            weights as space permits and then just stopping. Then during
            inference, replace the missing weights with calls to random(). I'd
            actually be interested in seeing the results.
       
            visarga wrote 3 days ago:
            No but some model serving tools like llama.cpp do their best. It's
            just a matter of choosing the right serving tools. And I am not
            sure LLMs could not optimize their memory layout. Why not? Just let
            them play with this and learn. You can do pretty amazing things
            with evolutionary methods where the LLMs are the mutation operator.
            You evolve a population of solutions. ( [1] )
            
   URI      [1]: https://arxiv.org/abs/2206.08896
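            
            A heavily simplified sketch of that "LLM as mutation operator"
            loop; llm_complete and fitness are placeholders the caller would
            supply, not part of any real library:
            
              import random
              
              def evolve(seeds, llm_complete, fitness, gens=10, pop=16):
                  """Toy loop: the LLM rewrites candidates, fitness selects."""
                  population = list(seeds)
                  for _ in range(gens):
                      parents = sorted(population, key=fitness,
                                       reverse=True)[: pop // 2]
                      children = [
                          llm_complete("Improve this solution:\n"
                                       + random.choice(parents))
                          for _ in range(pop - len(parents))
                      ]
                      population = parents + children
                  return max(population, key=fitness)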
       
          dvt wrote 3 days ago:
          > a 7B parameter model using float32s will almost certainly
          outperform a 7B model using float4s
          
          Q5 quantization performs almost on par with base models. Obviously
          there's some loss there, but this indicates that there's still a lot
          of compression that we're missing.
       
            jonnycomputer wrote 2 days ago:
            I'm still amazed that quantization works at all, coming out as a
            mild degradation in quality rather than radical dysfunction. Not
            that I've thought it through that much. Does quantization work with
            most neural networks?
       
              qeternity wrote 2 days ago:
              Intuitively the output space is much smaller than the latent
              space. So during training, you need the higher precision so that
              the latent space converges. But during inference, you just need
              to be precise enough that your much smaller output space does.
       
              rfoo wrote 2 days ago:
              > Does quantization work with most neural networks?
              
              Yes. It works pretty well for CNN-based vision models. Or rather,
              I'd claim it works even better: with post-training quantization
              you can make most models work with minimal precision loss
              entirely in int8 (fixed point), that is, computation is over
              int8/int32, no floating point at all, instead of the
              weight-only approach discussed here.
              
              If you do QAT, something down to 2-bit weights and 4-bit
              activations would work.
              
              People aren't interested in a weight-only quantization back then
              because CNNs are in general "denser", i.e. bottleneck was on
              compute, not memory.
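               
               For intuition, here is the simplest possible weight-only round
               trip (symmetric per-tensor int8 in NumPy; real toolchains add
               per-channel scales, calibration, zero points, etc.):
               
                 import numpy as np
                 
                 def quantize_int8(w):
                     scale = np.abs(w).max() / 127.0   # one scale per tensor
                     q = np.clip(np.round(w / scale), -127, 127)
                     return q.astype(np.int8), scale
                 
                 def dequantize(q, scale):
                     return q.astype(np.float32) * scale
                 
                 w = np.random.randn(512, 512).astype(np.float32)
                 q, s = quantize_int8(w)
                 print("mean abs error:", np.abs(dequantize(q, s) - w).mean())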
       
                jonnycomputer wrote 2 days ago:
                thanks!
       
          ml_hardware wrote 3 days ago:
          Looks like someone has got DBRX running on an M2 Ultra already:
          
   URI    [1]: https://x.com/awnihannun/status/1773024954667184196?s=20
       
            irusensei wrote 2 days ago:
             I can run a certain 120B on my M3 Max with 128GB of memory.
             However, I found that while Q5 “fits”, it was extremely slow.
             The story was different with Q4, which ran just fine at around
             ~3.5-4 t/s.
             
             Now, this model is ~134B, right? It could be bog slow, but on
             the other hand it's a MoE, so there's a chance the results could
             be satisfactory.
       
              marci wrote 2 days ago:
               From the article, it should have the speed of a ~36B model.
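               
               A back-of-envelope for why the active-parameter count is what
               matters for decode speed (all three numbers below are
               assumptions, not measurements):
               
                 active_params = 36e9   # assumed active params per token (MoE)
                 bits_per_weight = 4    # Q4-style quantization
                 bandwidth_gb_s = 400   # assumed unified-memory bandwidth
                 
                 gb_per_token = active_params * bits_per_weight / 8 / 1e9
                 print(gb_per_token)                  # ~18 GB read per token
                 print(bandwidth_gb_s / gb_per_token) # ~22 t/s upper bound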
       
            resource_waste wrote 3 days ago:
             I find it a stretch to consider generating 500 tokens 'running'
             it.
             
             Cool to play with for a few tests, but I can't imagine using it
             for anything.
       
            madiator wrote 3 days ago:
             That's great, but it did not really write the program the human
             asked it to write. :)
       
              SparkyMcUnicorn wrote 3 days ago:
              That's because it's the base model, not the instruct tuned one.
       
            Mandelmus wrote 3 days ago:
             And it appears to run in ~80 GB of RAM via quantisation.
       
              smcleod wrote 3 days ago:
               So that would be runnable on an MBP with an M2 Max, but the
               context window must be quite small; I don't really find
               anything under about 4096 tokens that useful.
       
                a_wild_dandan wrote 2 days ago:
                Can't wait to try this on my MacBook. I'm also just amazed at
                how wasteful Grok appears to be!
       
              dheera wrote 3 days ago:
               That's a tricky number. Does it run on an 80GB GPU? Does it
               auto-shave some parameters to fit in 79.99GB, like any
               artificially "intelligent" piece of code would do, or does it
               give up like an unintelligent piece of code?
       
                Jedd wrote 2 days ago:
                Are you aware how Macs present memory? Their 'unified' memory
                approach means you could run an 80GB model on a 128GB machine.
                
                There's no concept of 'dedicated GPU memory' as per
                conventional amd64 arch machines.
       
                declaredapple wrote 3 days ago:
                What?
                
                Are you asking if the framework automatically quantizes/prunes
                the model on the fly?
                
                 Or are you suggesting the LLM itself should realize it's too
                 big to run, and prune/quantize itself? Your references to
                 "intelligent" almost lead me to the conclusion that you think
                 the LLM should prune itself. Not only is this a
                 chicken-and-egg problem, but LLMs are statistical models;
                 they aren't inherently self-bootstrapping.
       
                  2099miles wrote 2 days ago:
                   The LLM itself should realize it's too big and only put
                   the important parts on the GPU. If you're asking questions
                   about literature, there's no need to have all the params
                   on the GPU; just tell it to put only the ones for
                   literature on there.
       
                  dheera wrote 3 days ago:
                   I realize that, but I do think it's doable to bootstrap it
                   on a cluster and have it teach itself to self-prune, and
                   I'm surprised nobody is actively working on this.
                   
                   I hate software that complains (about dependencies,
                   resources) when you try to run it, and I think this should
                   be one of the first use cases for LLMs: getting to
                   L5-autonomous software installation and execution.
       
                    Red_Leaves_Flyy wrote 3 days ago:
                    Make your dreams a reality!
       
          swalsh wrote 3 days ago:
          > The model requires ~264GB of RAM
          
           This feels as crazy as Grok. Was there a generation of models
           recently where we decided to just crank up the parameter count?
       
            espadrine wrote 3 days ago:
             Not recently. GPT-3 from 2020 required even more RAM, and so did
             the open-source BLOOM from 2022.
             
             In my view, the main value of larger models is distillation
             (which we are already seeing, for instance, in how Claude Haiku
             matches release-day GPT-4 at less than a tenth of the cost).
             Hopefully the distilled models will be easier to run.
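             
             For what it's worth, the core of distillation is training the
             small model against the big model's softened outputs; a minimal
             sketch of that loss (illustrative NumPy, not any particular
             recipe):
             
               import numpy as np
               
               def softmax(logits, T=1.0):
                   z = (logits - logits.max()) / T
                   e = np.exp(z)
                   return e / e.sum()
               
               def distill_loss(student_logits, teacher_logits, T=2.0):
                   # Cross-entropy of the student against the teacher's
                   # temperature-softened targets.
                   p_teacher = softmax(teacher_logits, T)
                   p_student = softmax(student_logits, T)
                   return -(p_teacher * np.log(p_student + 1e-9)).sum()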
       
            breezeTrowel wrote 3 days ago:
            Cranking up the parameter count is literally how the current LLM
            craze got started. Hence the "large" in "large language model".
       
            Jackson__ wrote 3 days ago:
             If you read their blog post, they mention it was pretrained on
             12 trillion tokens of text. That is ~5x the amount used for the
             Llama 2 training runs.
            
             From that, it seems somewhat likely we've hit a wall on
             improving models simply by training on more data.
       
            wrs wrote 3 days ago:
            Isn’t that pretty much the last 12 months?
       
          vlovich123 wrote 3 days ago:
           I thought float4 sacrificed a negligible amount of evaluation
           quality for an 8x reduction in RAM?
       
            Taek wrote 3 days ago:
            For smaller models, the quality drop is meaningful. For larger ones
            like this one, the quality drop is negligible.
       
            Y_Y wrote 3 days ago:
            A free lunch? Wouldn't that be nice! Sometimes the quantization
            process improves the accuracy a little (probably by implicit
            regularization) but a model that's at or near capacity (as it
            should be) is necessarily hurt by throwing away most of the
            information. Language models often quantize well to small
            fixed-point types like int4, but it's not a magic wand.
       
              underlines wrote 3 days ago:
               This paper presents some evidence to the contrary:
              
   URI        [1]: https://arxiv.org/abs/2403.17887
       
                Y_Y wrote 2 days ago:
                Good reference. I actually work on this stuff day-to-day which
                is why I feel qualified to comment on it, though mostly on
                images rather than natural language. I'll say in my defense
                that work like this is why I put a little disclaimer. It's
                well-known that plenty of popular models
                quantize/prune/sparsify well for some tasks. As the authors
                propose "current pretraining methods are not properly
                leveraging the parameters in the deeper layers of the network",
                this is what I was referring to as the networks not being "at
                capacity".
       
              vlovich123 wrote 3 days ago:
               I didn't suggest a free lunch, just that the 8x reduction in
               RAM (+ faster processing) does not result in an 8x growth in
               the error. Thus a quantized model will outperform a
               non-quantized one on an evaluation/RAM metric.
       
                rfoo wrote 2 days ago:
                 But using an 8x smaller model doesn't result in an 8x growth
                 in the error either.
       
                Y_Y wrote 3 days ago:
                That's not a good metric.
       
                  omeze wrote 3 days ago:
                   Many applications don't want to host inference in the
                   cloud and would ideally run things locally. Hardware
                   constraints are clearly important.
                   
                   I'd actually say it's the most important metric for most
                   open models now: since the price/performance of closed
                   cloud models is so competitive with open cloud models,
                   edge inference that is competitive is a clear value-add.
       
                    Y_Y wrote 2 days ago:
                    It's not that memory usage isn't important, it's that
                    dividing error by memory gives you a useless number. The
                    benefit from incremental error decrease is highly
                    nonlinear, as with memory. Improving error by 1% matters a
                    lot more starting from 10% error than 80%. Also a model
                    that used no memory and got everything wrong would have the
                    best score.
       
                      omeze wrote 2 days ago:
                       I see, and I agree with you. But I'd imagine the
                       useful metric to be “error rate below X GB of
                       memory”. We really just need memory and/or compute
                       reported when these evaluations are performed in
                       order to compile that. People do it for training
                       reports, since compute and memory are implicit in the
                       training time (people saturate the hardware and
                       report what they're using). But for inference, no
                       such details :\
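                       
                       Something like the comparison below, i.e. the best
                       error achievable under a given memory budget (model
                       names and numbers invented for illustration):
                       
                         # (name, error rate, inference RAM in GB)
                         models = [("A", 0.18, 12),
                                   ("B", 0.12, 48),
                                   ("C", 0.10, 140)]
                         
                         def best_under_budget(models, budget_gb):
                             ok = [m for m in models if m[2] <= budget_gb]
                             return min(ok, key=lambda m: m[1]) if ok else None
                         
                         print(best_under_budget(models, 64))  # ('B', 0.12, 48)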
       
              K0balt wrote 3 days ago:
               I find that Q6 and Q5+ are subjectively as good as the raw
               tensor files. The 4-bit quality reduction is very detectable,
               though. Of course there must be a loss of information, but
               perhaps there is a noise floor or something like that.
       
                Taek wrote 3 days ago:
                 At what parameter count? It's been established that
                 quantization has less of an effect on larger models. By the
                 time you are at 70B, quantization to 4 bits is basically
                 negligible.
       
                  2099miles wrote 2 days ago:
                  Source? I’ve seen this anecdotally and heard it, but is
                  there a paper you’re referencing?
       
                  K0balt wrote 2 days ago:
                   I work mostly with Mixtral and Mistral 7B these days, but
                   I did work with some 70B models before Mistral came out,
                   and I was not impressed with 4-bit Llama 2 70B.
       
        shnkr wrote 3 days ago:
         GenAI novice here. What is the training data made of, and how is it
         collected? I guess no one will share details on it; otherwise it
         would make a good technical blog post with lots of insights!
        
        >At Databricks, we believe that every enterprise should have the
        ability to control its data and its destiny in the emerging world of
        GenAI.
        
        >The main process of building DBRX - including pretraining,
        post-training, evaluation, red-teaming, and refining - took place over
        the course of three months.
       
          IshanMi wrote 3 days ago:
           Personally, I found looking at open-source work to be much more
           instructive for learning about AI and how things like training
           data are put together from the ground up. I suspect this is
           because training data is one of the bigger moats an AI company
           can have, not to mention all the class-action lawsuits
           surrounding training data.
           
           One of the best open-source datasets freely available is The Pile
           by EleutherAI [1]. It's a few years old now (~2020), but they did
           some really diligent work in putting together the dataset and
           documenting it. A more recent and even larger dataset is the
           Falcon-RefinedWeb dataset [2].
          
   URI    [1]: https://arxiv.org/abs/2101.00027
   URI    [2]: https://arxiv.org/abs/2306.01116
       
          tempusalaria wrote 3 days ago:
          The training data is pretty much anything you can read on the
          internet plus books.
          
          This is then cleaned up to remove nonsense, some technical files, and
          repeated files.
          
          From this, they tend to weight some sources more - e.g. Wikipedia
          gets a pretty high weighting in the data mix. Overall these data
          mixes have multiple trillion token counts.
          
           GPT-4 was apparently trained on multiple epochs of the same data
           mix, so I would assume this one was too, as it has a similar token
           count.
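           
           To make "weight some sources more" concrete: sampling of training
           documents is typically driven by a mix table like the toy one
           below (weights invented for illustration, not DBRX's actual mix):
           
             import random
             
             # Hypothetical per-source sampling weights for a pretraining mix.
             mix = {"web_crawl": 0.60, "code": 0.15,
                    "books": 0.15, "wikipedia": 0.10}
             
             def sample_source(mix):
                 sources, weights = zip(*mix.items())
                 return random.choices(sources, weights=weights, k=1)[0]
             
             # Each training document is drawn from a source in proportion
             # to its weight.
             print(sample_source(mix))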
       
            sanxiyn wrote 3 days ago:
             [1] found that people have been overweighting Wikipedia, and
             that downweighting Wikipedia improves things across the board,
             INCLUDING PREDICTING THE NEXT TOKEN ON WIKIPEDIA, which is
             frankly amazing.
            
   URI      [1]: https://arxiv.org/abs/2305.10429
       
          simonw wrote 3 days ago:
           The most detailed answer to that I've seen is the original LLaMA
           paper, which described exactly what that model was trained on
           (including lots of scraped copyrighted data) [1]. Llama 2 was much
           more opaque about the training data, presumably because they were
           already being sued at that point (by Sarah Silverman!) over the
           training data that went into the first Llama!
          
          A couple of things I've written about this:
          
           - [2]
           - [3]
          
   URI    [1]: https://arxiv.org/abs/2302.13971
   URI    [2]: https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...
   URI    [3]: https://simonwillison.net/2023/Apr/17/redpajama-data/
       
            ssgodderidge wrote 3 days ago:
            Wow, that paper was super useful. Thanks for sharing. Page 2 is
            where it shows the breakdown of all of the data sources, including
            % of dataset and the total disk sizes.
       
            shnkr wrote 3 days ago:
             My question was specific to the Databricks model. If it followed
             Llama or OpenAI, they could add a line or two about it and make
             the blog post complete.
       
              comp_raccoon wrote 3 days ago:
               They have a technical report coming! Knowing the team, they
               will do a great job disclosing as much as possible.
       
       
   DIR <- back to front page