_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   A History of Large Language Models
       
       
        WolfOliver wrote 7 hours 25 min ago:
        with what tool was this article written?
       
          altilunium wrote 2 hours 40 min ago:
          
          
   URI    [1]: https://gregorygundersen.com/blog/2020/06/21/blog-theme/
       
        sreekanth850 wrote 7 hours 44 min ago:
         I wonder on what basis @sama keeps saying they are near AGI when in
         reality LLMs just calculate sequences and probabilities. I really
         doubt this bubble is going to burst soon.
       
        empiko wrote 8 hours 4 min ago:
        What a great write-up, kudos to the author! I’ve been in the field
        since 2014, so this really feels like reliving my career. I think one
        paradigm shift that isn’t fully represented in the article is what we
        now call “genAI.” Sure, we had all kinds of language models (BERTs,
        word embeddings, etc.), but in the end, most people used them to build
        customized classifiers or regression models. Nobody was thinking about
        “solving” tasks by asking oracle-like models questions in natural
        language. That was considered completely impossible with our technology
        even in 2018/19. Some people studied language models, but that
        definitely wasn’t their primary use case; they were mainly used to
        support tasks like speech-to-text, grammar correction, or similar
        applications.
        
         With GPT-3 and later ChatGPT, there was a fundamental shift in how
         people think about approaching NLP problems. Many techniques and
         methods became outdated, and you could suddenly do things that were
         not feasible before.
       
          mike_hearn wrote 6 hours 40 min ago:
           Are you sure? I wrote an essay at the end of 2016 about the state
           of AI research [1], and at the time researchers were demolishing
           benchmarks like FAIR's bAbI, which involved generating answers to
           questions. I wrote back then about story comprehension and about
           programming robots by giving them stories (we'd now call these
           prompts). From the bAbI paper's abstract [2]: One long-term goal
           of machine learning research is to produce methods that are
           applicable to reasoning and natural language, in particular
           building an intelligent dialogue agent. To measure progress
           towards that goal, we argue for the usefulness of a set of proxy
           tasks that evaluate reading comprehension via question answering.
           Our tasks measure understanding in several ways: whether a system
           is able to answer questions via chaining facts, simple induction,
           deduction and many more. The tasks are designed to be
           prerequisites for any system that aims to be capable of conversing
           with a human.
          
           So at least FAIR was thinking about making AI that you could ask
           questions of in natural language. Then they went and beat their
           own benchmark with the Memory Networks paper [3]:
           
           Fred went to the kitchen. Fred picked up the milk. Fred travelled
           to the office.
          
          Where is the milk ? A: office
          
          Where does milk come from ? A: milk come from cow
          
          What is a cow a type of ? A: cow be female of cattle
          
          Where are cattle found ? A: cattle farm become widespread in brazil
          
          What does milk taste like ? A: milk taste like milk
          
          What does milk go well with ? A: milk go with coffee
          
          Where was Fred before the office ? A: kitchen
          
           That was published in 2015, so we could see ChatGPT-like
           capabilities quite early, even though they were still quite
           primitive.
          
   URI    [1]: https://blog.plan99.net/the-science-of-westworld-ec624585e47
   URI    [2]: https://arxiv.org/abs/1502.05698
   URI    [3]: https://arxiv.org/pdf/1410.3916
       
          yobbo wrote 7 hours 48 min ago:
          > Nobody was thinking about “solving” tasks by asking oracle-like
          models
          
           I remember this being talked about maybe even earlier than
           2018/2019, but the scale of models at the time was still at least
           an order of magnitude too small for it to have a chance of
           working. It was the ridiculous scale of GPT that yielded the
           insight that scaling would make it useful.
          
          (Tangentially related; I remember a research project/system from
          maybe 2010 or earlier that could respond to natural language queries.
          One of the demos was to ask for distance between cities. It was based
          on some sort of language parsing and knowledge graph/database, not
          deep-learning. Would be interesting to read about this again, if
          anyone remembers.)
       
        Al-Khwarizmi wrote 8 hours 35 min ago:
         A great writeup; just let me make two nitpicks (not to diminish the
         author's awesome effort, but in case they wish to take
         suggestions).
        
         1. I think the article underemphasizes the relevance of BERT. While
         from today's LLM-centric perspective it may seem minor because it
         sits in a different branch of the tech tree, it smashed multiple
         benchmarks at the time and made previous approaches to many NLP
         analysis tasks immediately obsolete. While I don't much like
         citation counts as a metric, a testament to its impact is that it
         has more than 145K citations, the same order of magnitude as the
         Transformers paper (197K) and many more than GPT-1 (16K). GPT-1
         would ultimately be a landmark paper due to what came afterwards,
         but at the time it wasn't that useful, being oriented more toward
         generation (without being that good at it) and, IIRC, not really
         publicly available (it was technically open source but not posted
         in a repository or with a framework that let you actually run it).
         It's also worth remarking that for many non-generative NLP tasks
         (NER, parsing, sentence/document classification, etc.), the best
         alternative is often still a BERT-like model, even in 2025.
        
         2. The writing kind of implies that modern LLMs were something
         consciously sought after ("the transformer architecture was not
         enough. Researchers also needed advancements in how these models
         were trained in order to make the commodity LLMs most people
         interact with today"). The truth is that no one in the field
         expected modern LLMs. The story was more like the OpenAI
         researchers noticing that GPT-2 was good at generating random text
         that looked fluent and thinking "if we make it bigger it will do
         that even better". But it turned out that not only did it generate
         better random text, it started being able to actually state real
         facts (in spite of the occasional hallucinations), answer
         questions, translate, be creative, etc. All those emergent
         abilities that are the basis of "commodity LLMs most people
         interact with today" were a totally unexpected development. In
         fact, why they emerge is still poorly understood.
       
          jph00 wrote 6 hours 36 min ago:
           (2) is not quite right. I created ULMFiT specifically because I
           thought a language model pretrained on a large general corpus and
           then fine-tuned was the right way to go for creating generally
           capable NLP models. It wasn't an accident.
           
           The fact that, some time later, GPT-2 could do zero-shot
           generation was indeed something a lot of folks got excited about,
           but that was actually not the correct path. The 3-step ULMFiT
           approach (causal LM training on a general corpus, then on a
           specialised corpus, then classification task fine tuning) was
           what GPT-3.5 Instruct used, which formed the basis of the first
           ChatGPT product.
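           
           As a minimal sketch of those three steps (illustrative Python;
           pretrain_lm, finetune_lm, and finetune_classifier are
           hypothetical stand-ins, not fastai's or OpenAI's actual APIs):
           
           # Sketch of the 3-step ULMFiT recipe; every helper here is
           # hypothetical, named only to make the stages explicit.
           
           # Step 1: causal LM pretraining (predict the next token) on a
           # large general corpus.
           lm = pretrain_lm(corpus="general-text-corpus")
           
           # Step 2: continue causal LM fine-tuning on the target domain's
           # corpus, adapting vocabulary and style.
           lm = finetune_lm(lm, corpus="target-domain-text")
           
           # Step 3: replace the LM head with a task head and fine-tune on
           # labelled examples (a classifier in ULMFiT;
           # instruction-following data in the GPT-3.5 Instruct lineage).
           clf = finetune_classifier(lm, dataset="labelled-task-data")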
          
           So although it took quite a while to take off, the idea of the
           LLM was quite intentional and has largely developed as I planned,
           even though at the time almost no one else felt the same way.
           Luckily Alec Radford did! He told me in 2018 that reading the
           ULMFiT paper was a big "omg" moment for him, and he set to work
           on GPT right away.
          
           PS: On (1), if I may take a moment to highlight my team's recent
           work: we updated BERT last year to create ModernBERT, which
           showed that yes, this approach still has legs. Our models have
           had >1.5m downloads, and there are >2k fine-tunes and variants of
           it now on Huggingface:
          
   URI    [1]: https://huggingface.co/models?search=modernbert
       
            Al-Khwarizmi wrote 1 hour 35 min ago:
             Point taken (both from you and the sibling comment mentioning
             Phil Blunsom); I should know better than to carelessly drop
             such broad generalizations as "no one in the field
             expected..." :)
            
            Still, I think only a tiny minority of the field expected it, and I
            think it was also clear from the messaging at the time that the
            OpenAI researchers who saw how GPT-3 (pre-instruct) started solving
            arbitrary tasks and displaying emergent abilities were surprised by
            that. Maybe they did have an ultimate goal in mind of creating a
            general-purpose system via next word prediction, but I don't think
            they expected it so soon and just by scaling GPT-2.
       
            HarHarVeryFunny wrote 2 hours 0 min ago:
            When you say "classification task fine tuning", are you referring
            to RLHF?
            
            RLHF seems to have been the critical piece that "aligned" the
            otherwise rather wild output of a purely "causally" (next-token
            prediction) trained LLM with what a human expects in terms of
            conversational turn taking (e.g. Q & A) and instruction following,
            as well as more general preferences/expectations.
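             
             As a rough sketch of that RLHF pipeline (illustrative Python;
             load_pretrained_llm, train_reward_model, and ppo_finetune are
             hypothetical names, not any real library's API):
             
             # Hypothetical three-phase RLHF pipeline.
             
             # Phase 1: start from the purely next-token-trained base model.
             policy = load_pretrained_llm("base-model")
             
             # Phase 2: fit a reward model on human preference data --
             # pairs of candidate responses where annotators marked the
             # one they preferred.
             reward_model = train_reward_model(preference_pairs)
             
             # Phase 3: fine-tune the policy with RL (e.g. PPO) to maximise
             # the reward model's score, with a KL penalty to the base
             # model so the text stays fluent rather than reward-hacked.
             policy = ppo_finetune(policy, reward=reward_model, kl_coeff=0.1)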
       
          williamtrask wrote 7 hours 11 min ago:
           Nit: regarding (2), Phil Blunsom did (the same Blunsom from the
           article, who led language modeling at DeepMind for about 7-8
           years). He would often opine at Oxford (where he taught) that
           solving next-word prediction is a viable meta-path to AGI. Almost
           nobody agreed at the time. He also called out early that scaling
           and better data were the key, and they did end up being, although
           Google wasn't as "risk-on" as OpenAI about gathering the data for
           GPT-1/2. Had it been, history could easily have been different.
           People forget the position OAI was in at the time: Elon and his
           funding had left, key talent had left. Risk appetite was high for
           that kind of thing... and it paid off.
       
        brcmthrowaway wrote 9 hours 49 min ago:
         Dumb question: what is the difference between an embedding and a
         bag of words?
       
          Al-Khwarizmi wrote 9 hours 9 min ago:
           With bag of words, the representation of a word is a vector whose
           dimension is the dictionary size; all components are zero except
           the component corresponding to that word, which is one.
           
           This is not good for training neural networks (they like to be
           fed dense, continuous data, not sparse, discrete data), and it
           treats each word as an atomic entity without capturing
           relationships between words (you have no way to know that the
           words "plane" and "airplane" are more related than "plane" and
           "dog").
          
           With word embeddings, you get a space of continuous vectors with
           a predefined (lower) number of dimensions. This is more useful as
           input or training data for neural networks, and it is a
           representation of the meaning space ("plane" and "airplane" will
           have very similar vectors, while the one for "dog" will be
           different), which opens up a lot of possibilities for making
           models and systems more robust.
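           
           A runnable toy example of the difference (the 3-word vocabulary
           and the 3-d embedding values are made up for illustration):
           
           import numpy as np
           
           vocab = ["plane", "airplane", "dog"]
           
           # Bag-of-words / one-hot: one dimension per dictionary word.
           one_hot = np.eye(len(vocab))       # row i represents vocab[i]
           
           # Toy dense embeddings: related words get nearby vectors.
           emb = {"plane":    np.array([0.90, 0.10, 0.00]),
                  "airplane": np.array([0.85, 0.15, 0.05]),
                  "dog":      np.array([0.00, 0.20, 0.90])}
           
           def cos(a, b):
               return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
           
           # Distinct one-hot vectors are always orthogonal, so their
           # similarity carries no meaning...
           print(cos(one_hot[0], one_hot[1]))         # 0.0
           # ...while embeddings recover relatedness:
           print(cos(emb["plane"], emb["airplane"]))  # ~1.0
           print(cos(emb["plane"], emb["dog"]))       # ~0.02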
       
            HarHarVeryFunny wrote 1 hour 51 min ago:
            Also important to note that in a Transformer-based LLM, embeddings
            are more than just a way of representing the input words.
            Embeddings are what pass through the transformer, layer by layer,
            and get transformed by it.
            
            The size of the embedding space (number of vector dimensions) is
            therefore larger than needed to just represent word meanings - it
            needs to be large enough to also be able to represent the
            information added by these layer-wise transformations.
            
             The way I think of these transformations (happy to be
             corrected) is that they add information rather than modify what
             is already there: conceptually, the embeddings start as word
             embeddings, then maybe get augmented with part-of-speech
             information, then additional syntactic/parsing information,
             then semantic information, the embedding being incrementally
             enriched as it is "transformed" by successive layers.
       
              empiko wrote 38 min ago:
              > The way I think of these transformations, but happy to be
              corrected, is more a matter of adding information rather than
              modifying
              
               This is very much the case, given the residual connections
               within the model. The final representation can be expressed
               as a sum of representations from N layers, where the N-th
               representation is a function of the (N-1)-th.
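               
               A minimal runnable sketch of that identity (toy numpy
               stand-ins for transformer blocks):
               
               import numpy as np
               
               rng = np.random.default_rng(0)
               d = 16                        # toy embedding width
               
               # Stand-ins for transformer blocks: each returns a *delta*
               # that the residual connection adds to the stream.
               blocks = [
                   lambda x, W=rng.normal(size=(d, d)) * 0.1: np.tanh(x @ W)
                   for _ in range(4)
               ]
               
               x = rng.normal(size=d)        # initial word embedding
               
               h, deltas = x, []
               for f in blocks:              # h_n = h_{n-1} + f_n(h_{n-1})
                   delta = f(h)
                   deltas.append(delta)
                   h = h + delta
               
               # Final state = input + sum of per-layer contributions,
               # where each contribution depends on everything below it.
               assert np.allclose(h, x + sum(deltas))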
       
        jph00 wrote 10 hours 25 min ago:
        This is quite a good overview, and parts reflect well how things played
        out in language model research. It's certainly true that language
        models and deep learning were not considered particularly promising in
        NLP, which frustrated me greatly at the time since I knew otherwise!
        
         However, the article misses the first two LLMs entirely.
        
         Radford cited CoVE, ELMo, and ULMFiT as the inspirations for GPT.
         ULMFiT (my paper with Sebastian Ruder) was the only one which
         actually fine-tuned the full language model for downstream tasks
         [1]. ULMFiT also pioneered the 3-stage approach of fine-tuning the
         language model with a causal LM objective and then fine-tuning that
         with a classification objective, which much later was used in
         GPT-3.5 Instruct, and today is used pretty much everywhere.
        
         The other major oversight in the article is that Dai and Le (2015)
         is missing: it pre-dated even ULMFiT in fine-tuning a language
         model for downstream tasks, but missed the key insight that a
         general-purpose pretrained model built on a large corpus was the
         critical first step.
        
         It's also missing a key piece of the puzzle regarding attention and
         transformers: the Memory Networks paper recently had its 10th
         birthday, and there's a nice writeup of its history here: [2] It
         came out at about the same time as the Neural Turing Machines paper
         [3], covering similar territory; both pioneered the idea of
         combining attention and memory in ways later incorporated into
         transformers.
        
   URI  [1]: https://thundergolfer.com/blog/the-first-llm
   URI  [2]: https://x.com/tesatory/status/1911150652556026328?s=46
   URI  [3]: https://arxiv.org/abs/1410.5401
       
       