_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   Chinchilla Scaling: A replication attempt
       
       
        gwern wrote 1 day ago:
        The original Chinchilla authors have now identified the original bug,
        apparently:
        
   URI  [1]: https://twitter.com/borgeaud_s/status/1780988694163321250
       
          mirekrusin wrote 16 hours 32 min ago:
           Lovely, they are also open-sourcing the data.
       
            anonymousDan wrote 15 hours 59 min ago:
            The scientific process at work!
       
        warbaker wrote 1 day ago:
        Calling this a "replication attempt" implied to me that they tried to
        replicate the Chinchilla Scaling paper and found that it did not
        replicate, which would be a very big deal!
        
        Instead, they just redid the analysis based on a figure in the paper
        and found that the old model with slightly different parameters gave a
        better fit to the data. This is a valuable contribution, but a bit
        over-stated by the paper title, and the confrontational, "gotcha" tone
        of the paper is unwarranted.
        
        A better framing would have been something like "Chinchilla Scaling:
        Reanalyzed".
       
          ege_erdil wrote 1 day ago:
          one of their three approaches does not replicate and it's because of
          a software bug in the optimizer they used, i don't know what else we
          were supposed to say
       
        cgearhart wrote 1 day ago:
        TL;DR—couldn’t exactly replicate their results, but broadly
        confirmed their findings. They agree that the optimal range is 5–40
        tokens per parameter, and close to 20 for the “chinchilla” model
        from the original paper.
        
        Very unusual choice to reconstruct the dataset by eyeballing the graph
        in the source paper (why not just ask for it…?) and it’s not really
        clear why the result is dressed up behind the salacious-seeming
        abstract.
       
          ege_erdil wrote 1 day ago:
          we didn't eyeball the graph, there are more accurate ways of
          extracting the data from a pdf file than that
          
          we did ask for the data but got no response until we published on
          arxiv
          
          what is supposed to be "salacious" about the abstract?
       
        magnio wrote 1 day ago:
        > To extract the data from the figure, we first downloaded the PDF from
        Hoffmann et al.’s arXiv submission and saved it in SVG format. We
        then parsed the SVG content to navigate and search the SVG structure.
        Within the SVG, we identified the group of points representing the
        scatter plot data and iterated over each point to extract its fill
        color and position (x and y coordinates) using the attributes of the
        corresponding SVG elements.
        
        > To map the SVG coordinates to the model size and training FLOP
        values, we used the location of the labels or ticks on the respective
        axes. This allowed us to establish a correspondence between the SVG
        coordinates and the actual data values represented in the plot.
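
         For anyone curious, here is a rough sketch of that kind of SVG
         scraping. This is not the paper's code; the filename, tag name,
         and log-axis handling below are placeholders that depend on how
         the figure was exported:
         
           import xml.etree.ElementTree as ET
           import math
           
           tree = ET.parse("chinchilla_fig3.svg")  # hypothetical file
           points = []
           for el in tree.getroot().iter():
               tag = el.tag.split("}")[-1]   # drop the XML namespace
               if tag == "circle":           # scatter markers
                   points.append((float(el.get("cx")),
                                  float(el.get("cy")),
                                  el.get("fill")))
           
           # map an SVG coordinate to a data value using two known
           # axis ticks (v0, v1 in SVG units; d0, d1 in data units),
           # assuming a log-scaled axis
           def to_data(v, v0, v1, d0, d1):
               t = (v - v0) / (v1 - v0)
               return 10 ** (math.log10(d0)
                             + t * (math.log10(d1) - math.log10(d0)))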
        
         They ... reconstructed the data ... from a plot ... using a
         ruler and eyes? Why not just email the original authors for the
         raw data? I can't help but feel like this is @yuvaltheterrible
         debunking papers.
       
          ege_erdil wrote 1 day ago:
          we did and gave them a two week grace period to respond, but they
          only responded to us after we published on arxiv
          
          also, we didn't reconstruct the data using a ruler, you can automate
          that entire process so that it's much more reliable than that
       
            saurabh20n wrote 20 hours 32 min ago:
            Looks like you’re one of the authors.
            
            It would be nice if you could post if the actual data matches your
            reconstruction—now that you have it in hand. Would help us not
            worry about the data provenance and focus on the result you found.
       
              ege_erdil wrote 15 hours 38 min ago:
              we're not sure if the actual data exactly matches our
              reconstruction, but one of the authors pointed out to us that we
              can exactly reproduce their scaling law if we make the mistake
              they made when fitting it to the data
              
              what they did was to take the mean of the loss values across
              datapoints instead of summing them and used L-BFGS-B with the
              default tolerance settings, so the optimizer terminated early,
              and we can reproduce their results with this same mistake
              
              so our reconstruction appears to be good enough
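
               For the curious, here is a minimal toy sketch (not their
               code) of the interaction described above: with scipy's
               L-BFGS-B, averaging a per-point Huber loss instead of
               summing it shrinks the gradient by roughly 1/n, so the
               default projected-gradient tolerance (gtol=1e-5) is
               effectively ~n times looser. The constants are only
               illustrative:
               
                 import numpy as np
                 from scipy.optimize import minimize
                 from scipy.special import logsumexp, huber
                 
                 rng = np.random.default_rng(0)
                 n = 400                            # ~their sample size
                 N = 10 ** rng.uniform(7, 10, n)    # model parameters
                 D = 10 ** rng.uniform(9, 12, n)    # training tokens
                 # synthetic losses from Chinchilla-like constants
                 logL = logsumexp(
                     [np.log(406) - 0.34 * np.log(N),
                      np.log(411) - 0.28 * np.log(D),
                      np.full(n, np.log(1.69))], axis=0)
                 logL += rng.normal(0, 0.02, n)
                 
                 def pointwise(theta):
                     a, b, e, alpha, beta = theta
                     pred = logsumexp(
                         [a - alpha * np.log(N),
                          b - beta * np.log(D),
                          np.full(n, e)], axis=0)
                     return huber(1e-3, pred - logL)
                 
                 x0 = np.zeros(5)
                 by_sum = minimize(lambda t: pointwise(t).sum(),
                                   x0, method="L-BFGS-B")
                 by_mean = minimize(lambda t: pointwise(t).mean(),
                                    x0, method="L-BFGS-B")
                 # the mean objective's gradient is ~n times smaller,
                 # so gtol fires sooner; compare .nit and the fitted
                 # parameters of the two runs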
       
          williamdclt wrote 1 day ago:
          I particularly like this second quote, I appreciate them taking the
          time to explain "what is a graph" in a scientific paper!
       
          levocardia wrote 1 day ago:
          I do that all the time using WebPlotDigitizer [1]. Works great.
          
   URI    [1]: https://apps.automeris.io/wpd/
       
            dynm wrote 1 day ago:
            Seconded. When I first saw this, I thought it looked unintuitive
            and difficult to use, but when I tried it, it was very easy and I
            had the extracted data in a few minutes.
       
          acc_297 wrote 1 day ago:
           In fairness, they did not use a ruler or eyes. Based on the
           excerpts you quote, they extracted exact coordinates of the
           data from an SVG, which, if the SVG was created correctly,
           should at least give an unbiased dataset, maybe with less
           precision than the source.
       
          polygamous_bat wrote 1 day ago:
          > Why not just emailed the original authors for the raw data?
          
          Industry research labs, especially Google deepmind, are notoriously
          closed up about their “proprietary” data. I’ve hit this wall
          multiple times in my own work in AI.
       
            sp332 wrote 1 day ago:
             [1] says they're going to open the data from the paper. Not sure
            why they didn't do it before, but good news.
            
   URI      [1]: https://twitter.com/borgeaud_s/status/1780988694163321250
       
          Ajoo wrote 1 day ago:
          They claimed that they did ask several times in one of the replies.
       
          mxwsn wrote 1 day ago:
          Funnily enough, I've done this for a paper I wrote as well. Emailing
          authors is kind of a crapshoot. It's normal to get no response if
          it's been several years since the paper came out. In this case, a pdf
          plot is essentially lossless, and it's much faster than waiting for
          authors to maybe respond.
       
            V1ndaar wrote 1 day ago:
            And not only that, in many cases they will tell you (if they reply)
            "oh, we can't find the source of that plot anymore". Happened to me
            quite a few times (although in physics).
            
            I'm pretty sure I'm not the only one who's written themselves a
            mini tool to even extract data from a bitmap plot based on the
            axes. Involves some manual steps (cropping mainly), but is very
             convenient for the cases where people don't even use vector graphics,
            but sometimes even just screenshots of plots... Do I like it? Hell
            no! It's why I've put quite some effort in doing it better for my
            PhD thesis.
       
              mirekrusin wrote 16 hours 38 min ago:
              Somebody tell them that huggingface, github, gitlab, codeberg etc
              exist.
       
              WanderPanda wrote 18 hours 58 min ago:
               To be fair, sometimes (e.g. in the case of scatter plots with
              many dots) pdf renderers become very slow and/or mess up the
              rendering. In this case the easiest option is rasterizing it (for
              performance and consistency of the appearance)
       
                V1ndaar wrote 14 hours 3 min ago:
                 That is certainly true (and why I added a general "embed plot
                data as bitmap into SVG/PDF" option to [1] that works not only
                for raster heatmaps). But realistically such plots are often
                not ideal anyway (too many data points in a plot is often a
                sign that a different type of plot would be better; typically
                one that aggregates in some way) and it's just another argument
                to make the data for plots available as well.
                
   URI          [1]: https://github.com/Vindaar/ggplotnim
       
                jszymborski wrote 17 hours 19 min ago:
                If you have the misfortune of having to use Word for writing
                manuscripts and/or have scatter plots with a good number of
                points, SVGs will ruin your day in my experience.
                
                (Yes, I'd much rather use LaTeX)
       
              godelski wrote 1 day ago:
              Yeah it's very annoying especially these days when there's no
              real excuse to not have a copy. You can easily store all code and
              data for free and in an accessible manner. Even just GitHub for
              90+% is good enough. Hugging face helps, and there's many other
              ways too.
              
              I remember my first year in grad school I was trying to replicate
              a work by a very prestigious university. It definitely wasn't
              reproducible from text but I did my best. Couldn't get close to
               their claims, so I emailed the lead author (another grad student).
              No response. Luckily my advisor knew their advisor. Got a meeting
              and then I got sent code. It was nothing like what they claimed
              in the paper so I have no idea what they gave me. Anyways, my
              paper never got published because I couldn't beat them. It is
              what it is.
       
        newfocogi wrote 1 day ago:
         Key claims:
         
         "We have found three potential issues with Hoffmann et al.’s
         estimates of the Chinchilla scaling law that rely on Approach 3:
         1. Their estimated model fits the reconstructed data very poorly.
         These conclusions hold even when accounting for potential noise
         in data reconstruction and excluding outlier models.
         2. The confidence intervals are implausibly tight given the
         number of data points. Obtaining confidence intervals that tight
         would require many hundreds of thousands of observations, while
         they likely had only ∼400.
         3. Their estimated model implies a scaling policy that is
         inconsistent with their other approaches."
         
         Data point most people are probably looking for:
         "We find a range consistent with the 20 tokens per parameter
         rule of thumb. Indeed, our point estimates imply that 25.6
         tokens per parameter is optimal."
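
         For concreteness (my own arithmetic, not a figure from the
         paper), here is what those ratios imply for a hypothetical
         70B-parameter model:
         
           params = 70e9
           print(params * 20)     # 1.4e12 tokens (rule of thumb)
           print(params * 25.6)   # ~1.8e12 tokens (their point estimate)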
       
          moffkalast wrote 1 day ago:
           Their rule of thumb would imply that a 70B model is saturated
           with 1.7T tokens, which is inconsistent with reality.
       
            eldenring wrote 1 day ago:
             No, their claim is that, for a fixed training compute
             budget, there are diminishing returns to scaling up data
             past that threshold vs. scaling up params.
            
            This doesn't take inference into account either, obviously.
       
            og_kalu wrote 1 day ago:
            The Chinchilla laws were compute optimal scaling laws. They're not
            supposed to tell you what parameter-token combination will saturate
            a model.
       
              moffkalast wrote 1 day ago:
                 Compute optimal for what, training? There's nothing
                 optimal in blowing up model size beyond the absolute
                 minimum needed, or you'll spend the equivalent of a
                 country's electricity trying to scale inference later.
       
                og_kalu wrote 1 day ago:
                Training yes.
                
                Doubling your parameter count past that ratio will yield a
                better model than doubling your data and is much easier and
                cheaper to do.
       
                  naasking wrote 1 day ago:
                  That suggests that it's likely memorizing more special cases
                  rather than distilling general principles. They generalize to
                  some degree but clearly there's room for improvement.
       
                    og_kalu wrote 1 day ago:
                     It doesn't really suggest anything. Neither model
                     will be even close to saturation and, all else
                     equal, bigger models perform better in every way,
                     including generalization.
       
                      naasking wrote 7 hours 39 min ago:
                      But why do bigger models perform better? Arguably because
                      there's a larger state space that can be used to remember
                      more contexts, which helps with both generalization and
                      case-specific processing.
       
                        og_kalu wrote 5 min ago:
                        >But why do bigger models perform better?
                        
                        No one really knows the answer to this question.
                        
                        >Arguably because there's a larger state space that can
                        be used to remember more contexts, which helps with
                        both generalization and case-specific processing.
                        
                        What I'm trying to say is that both models in either
                        scenario are very over-parameterized and under-trained.
                        
                         You say the answer is extra space? The smaller
                         model has not used anywhere near the space it
                         has. They both have extra space.
                         
                         It's like arguing a bigger drum is better
                         because of extra space when all the water you
                         plan to store would not fill even half of the
                         smaller drum.
       
                FeepingCreature wrote 1 day ago:
                Blow up model size, get lots of space and parameters to do the
                double-descent grok thing in, then distill it way way down?
       
                rfw300 wrote 1 day ago:
                 Yes, compute-optimal for training only. The purpose of
                 the paper wasn’t to determine the most economically
                 practical model one could build, but the most
                 “intelligent” model one could build given some amount
                 of compute.
       
                  ijk wrote 1 day ago:
                  Quite. The big question at the time was "how much data do we
                  need to train GPT-3 equivalent models". Open models had
                  failed to live up to GPT performance, even ones with a
                  massive number of parameters. So getting results that
                  suggested a reason why other models were massively
                  undertrained was important.
                  
                  Meanwhile, people noticed that for deployed models, inference
                  cost often outweighs the initial training costs. It's
                  sometimes better to train a smaller, faster model longer on
                  more data, because it has lower overall cost (including
                  environmental impact) if you're expecting to run the model a
                  few million or billion times (e.g., [1]). So training past
                  the Chinchilla optimum point became a lot more common,
                  particularly after Llama.
                  
   URI            [1]: https://arxiv.org/abs/2401.00448
       
        cs702 wrote 1 day ago:
        Interesting! If the authors are right, it seems that the number of
        training tokens required per parameter (slowly) declines as models
        become larger (Figure 5).
        
        That's good news. I think it deserves wider dissemination, so I'm
        upvoting your post.
        
        Thank you for sharing this on HN!
       
          Kronopath wrote 1 day ago:
          This is not good news, this means that we could end up with a
          dangerously superintelligent AI just by scaling up the number of
          parameters, without increasing the amount of training data.
       
            pfdietz wrote 1 hour 50 min ago:
            It's only bad news if you don't want a dangerously superintelligent
            AI.
       
              Kronopath wrote 1 hour 45 min ago:
              No one should want this.
       
            kelseyfrog wrote 1 day ago:
            No, but LLMs require orders of magnitude more language input than
             humans[1]. It's very reasonable to assume that architectural
             differences (size among them) are more likely a constraint
             on performance.
            
            1. Specifically larger than the upper bound on lifetime language
            input for humans, even assuming 24/7 at max reading speed.
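
             To put a rough number on that bound (my arithmetic, assuming
             a generous 1,000 words per minute, nonstop, for 80 years):
             
               wpm = 1000                    # generous reading speed
               minutes = 60 * 24 * 365 * 80  # "24/7" for 80 years
               print(wpm * minutes)          # ~4.2e10 words
               # vs. trillions of tokens in current LLM pretraining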
       
              mirekrusin wrote 16 hours 34 min ago:
               LLMs are super-intelligent at mimicking already; it won't
               take much time to find some kind of RL loop there.
       
              TeMPOraL wrote 1 day ago:
              Yes, but LLMs come out of training as experts in approximately
              any single thing you can think of, and then some, and all that in
              dozen of languages. Humans don't achieve even a fraction of this
              kind of breadth.
       
                sdenton4 wrote 21 hours 18 min ago:
                LLMs are experts at everything except what the user is an
                expert in.
       
                  andai wrote 18 hours 20 min ago:
                  Gell-Mann Amnesia effect
                  
                  > You open the newspaper to an article on some subject you
                  know well. In Murray's case, physics. In mine, show business.
                  You read the article and see the journalist has absolutely no
                  understanding of either the facts or the issues. Often, the
                  article is so wrong it actually presents the story
                  backward—reversing cause and effect. I call these the "wet
                  streets cause rain" stories. Paper's full of them.
                  
                  > In any case, you read with exasperation or amusement the
                  multiple errors in a story, and then turn the page to
                  national or international affairs, and read as if the rest of
                  the newspaper was somehow more accurate about Palestine than
                  the baloney you just read. You turn the page, and forget what
                  you know.
                  
                  -Michael Crichton
                  
                  Edit: Found the speech this is from.
                  
   URI            [1]: https://web.archive.org/web/20190808123852/http://la...
       
                godelski wrote 1 day ago:
                This is not quite accurate, but complex because measurement is
                hard. The things they are being tested on are almost surely
                within the dataset. Let's take the bar exam for instance. Sure,
                we don't know what's in GPT data, but we know it has reddit,
                and we know reddit has many similar if not exact questions on
                it. We know that the first GPT4 did not have good semantic
                 similarity matching because they just used a 3-substring
                 match on 50 characters (Appendix C), and they only
                 consider the false-positive direction. Then there's this
                 line...
                
                  The RLHF post-training dataset is vastly smaller than the
                pretraining set and unlikely to have any particular question
                contaminated. However we did not check explicitly.
                
                But my favorite is the HumanEval. I'll just remind everyone
                that this was written by 60 authors, mostly from OpenAI
                
                  We evaluate functional correctness on a set of 164
                handwritten programming problems, which we call the HumanEval
                dataset. ... __It is important for these tasks to be
                hand-written, since our models are trained on a large fraction
                of GitHub, which already contains solutions to problems from a
                variety of sources.__
                
                 The problems? Well they're leetcode style... Can you tell
                 me you can write leetcode-style questions like these that
                 aren't already solved somewhere on GitHub?
                
                   Human Eval 2
                 
                   Prompt:
                     def truncate_number(number: float) -> float:
                         """ Given a positive floating point number, it
                         can be decomposed into and integer part (largest
                         integer smaller than given number) and decimals
                         (leftover part always smaller than 1). Return
                         the decimal part of the number.
                         >>> truncate_number(3.5)
                         0.5
                         """
                 
                   Solution:
                         return number % 1.0
                 
                   Human Eval 4
                 
                   Prompt:
                     from typing import List
                 
                     def mean_absolute_deviation(
                             numbers: List[float]) -> float:
                         """ For a given list of input numbers, calculate
                         Mean Absolute Deviation around the mean of this
                         dataset. Mean Absolute Deviation is the average
                         absolute difference between each element and a
                         centerpoint (mean in this case):
                         MAD = average | x - x_mean |
                         >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
                         1.0
                         """
                 
                   Solution:
                         mean = sum(numbers) / len(numbers)
                         return sum(abs(x - mean)
                                    for x in numbers) / len(numbers)
                
                You really want to bet that that isn't on github? Because I'll
                bet you any dollar amount you want that there are solutions in
                near exact form that are on github prior to their cutoff date
                (Don't trust me, you can find them too. They're searchable
                even). Hell, I've poisoned the dataset here!
                
                LLMs are (lossy) compression systems. So they're great for
                information retrieval. And a lot of what we consider
                intelligence (and possibly even creativity) is based on
                information retrieval. Doesn't mean these things are any less
                impressive but just a note on how we should be interpreting
                results and understanding the limitations of our tools.
                Measuring intelligence is a really difficult thing and we need
                to be aware that the term isn't universally agreed upon and so
                people are often talking past one another and also some people
                are conflating the differences as if they are the same.
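
                 For reference, a toy version of that 50-character
                 substring check (my simplification, not OpenAI's code;
                 as I recall, the real check also normalizes the text
                 before matching):
                 
                   import random
                   
                   def contaminated(item, corpus, k=50, probes=3):
                       # sample a few k-char substrings of the eval
                       # item and look for exact hits in the corpus
                       if len(item) <= k + probes:
                           return item in corpus
                       starts = random.sample(range(len(item) - k),
                                              probes)
                       return any(item[s:s + k] in corpus
                                  for s in starts)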
       
              p1esk wrote 1 day ago:
              How much language input does a human need to become intelligent
              if he doesn’t receive any other input?
       
              HeatrayEnjoyer wrote 1 day ago:
              Do they? What is the total size of all visual, audio, touch,
              locomotive, scent, and taste data collected between birth and
              when a human reaches IQ 100? There are multiple high-bandwidth
              feeds running into the brain 24/7.
       
                zarzavat wrote 18 hours 48 min ago:
                Vision is not necessary for language acquisition.
                
                Proof: blind and partially sighted people exist.
       
                cubefox wrote 1 day ago:
                > language input
       
            exe34 wrote 1 day ago:
            Like a corporation then. We should ban them until we can figure out
            how to align them!
       
              tehsauce wrote 1 day ago:
              ASI is nothing like a corporation
       
                TeMPOraL wrote 1 day ago:
                 It is very much like a corporation; a corp is effectively
                 an AGI, just running very slowly - at the speed of
                 bureaucracy.
       
                wizzwizz4 wrote 1 day ago:
                No, they're not. Corporations have known, concrete impacts on
                the world, whereas the dangers of AI are, so far, corporations.
                ASIs are (as yet) fictional.
                
                Another difference: most corporations will avoid doing illegal
                stuff if the penalties are large enough: the corporation
                alignment problem is political. Pretty much no extant AI
                systems can be instructed in this way: we don't know how to
                align AIs even in theory.
       
                  kelseyfrog wrote 4 hours 8 min ago:
                  > Corporations have known, concrete impacts on the world
                  
                  I hate to do this, but can you enumerate them?
       
                  andai wrote 18 hours 17 min ago:
                  For organisms the ultimate punishment is death. How do you
                  delete an AI from the internet?
       
                    exe34 wrote 13 hours 18 min ago:
                    sudo rm * -rf
       
                      wizzwizz4 wrote 9 hours 13 min ago:
                      That won't provide any motivation: no AI system yet
                      created fears death (except perhaps some of the really
                      simple, evolved ones – but I'd question whether they're
                      sophisticated enough to fear).
       
          dzdt wrote 1 day ago:
           Could it be that the independence of the available training
           points declines as the dataset size grows? At some point it
           becomes hard to add data that isn't essentially similar to
           something you've already added.
       
            cs702 wrote 1 day ago:
            Yes, could be. Not sure how or even if anyone could prove it,
            though.
       
              godelski wrote 1 day ago:
              This should be fairly de facto true. Remember your dataset is
              some proxy for some real (but almost surely intractable)
              distribution.
              
               Now let's think about filling the space with p-balls
               bounded by the nearest points, so that no data point lies
               inside a ball. Then we've turned this into a sphere-packing
               problem and can talk about the sizes and volumes of those
               spheres.
               
               So if we uniformly fill our real distribution with data,
               the average volume of those spheres decreases. If we fill
               it non-uniformly, the average ball still shrinks, but the
               largest ball shrinks more slowly (that being the case where
               we aren't properly covering data in that region). Either
               way, the more data you add, the more the balls shrink,
               which essentially means the differences between data points
               decrease. The harder question is about the under-represented
               regions: finding them and determining how to properly
               sample them.
              
               Another quick trick you can use to convince yourself is
               thinking about basis vectors (this won't be robust, btw,
               but it's a good starting point). In high dimensions, two
               randomly sampled vectors are almost certainly close to
               orthogonal. So think of drawing basis vectors (independent
               vectors that span our space). As we fill in data, we are
               initially very likely to get vectors (or data) that are
               independent in some way. But as we add more, the likelihood
               that they are orthogonal decreases. Of course your basis
               vectors don't need to be orthogonal, but that's more
               semantics, because we can always work in a space where
               that's true.
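
               If you want to see the near-orthogonality claim concretely,
               here is a quick toy check:
               
                 import numpy as np
                 rng = np.random.default_rng(0)
                 d = 1024
                 u, v = rng.normal(size=(2, d))
                 u /= np.linalg.norm(u)
                 v /= np.linalg.norm(v)
                 # typical |cosine| is ~1/sqrt(d), about 0.03 here
                 print(u @ v)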
       
                cs702 wrote 8 hours 11 min ago:
                I agree, but my question was not whether distance between data
                points tends to decrease as dataset size grows, but whether
                that is the reason why the number of training tokens required
                per parameter declines. It could be, but proving it would
                require a better understanding of how and why these giant AI
                models work.
       
                  godelski wrote 4 hours 28 min ago:
                  Wasn't your question about how *independent* the data is?
                  
                  We could talk about this in different ways, like variance.
                  But I'm failing to see how I didn't answer your question. Did
                   I miscommunicate? Did I misunderstand?
                  
                  The model is learning off of statistics so most of your
                  information gain would be through more independent data.
                  Think of this space we are talking about as "knowledge." And
                  our "intelligence" as how easy it is to get to any point in
                   this space. The vector view above might help with
                   understanding this: you can step in the direction of
                   any of the vectors you have, and combine those steps
                   to reach your final point. The questions are how many
                   vectors you have to use (how many "steps" away you
                   are) and, of course, how close you can get to your
                   final destination. As you can imagine from my previous
                   comment, having more vectors reduces the "steps" to
                   any given destination, but the utility of each
                   additional vector decreases as you add more.
                   (Generally. Of course, if you have a gap in knowledge
                   you can get a big help from a single vector that goes
                   into that area, but let's leave that aside.)
                  
                  Does this help clarify? If not I might need you to clarify
                  your question a bit more. (I am a ML researcher fwiw)
       
              sebzim4500 wrote 1 day ago:
               I guess you could artificially limit the training data (e.g. by
              removing languages, categories) and see if the utility of extra
              tokens drops off as a result.
       
       