_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Benchmarking GPT-5 on 400 real-world code reviews
       
       
        rs186 wrote 1 hour 7 min ago:
        Serious question: how are these tools different from glorified system
         prompt generators?
       
        rs186 wrote 1 hour 8 min ago:
        > GPT-5 stood out for its analytical strength and review clarity.
        
        The sentence is too obviously LLM generated, but whatever.
        
         > Weaknesses:
         >
         > False positives: A few reviews include incorrect or harmful fixes.
         >
         > Inconsistent labeling: Occasionally misclassifies the severity of
         > findings or touches forbidden lines.
         >
         > Redundancy: Some repetition or trivial suggestions that dilute
         > review utility.
        
        wtf are "forbidden lines"?
       
        thawab wrote 3 hours 50 min ago:
         How can o4-mini be at 57 and Sonnet 4 at 39? This is way off;
         o4-mini is not even in the top 5 of coding agents.
       
        highfrequency wrote 3 hours 50 min ago:
         Great to see more private benchmarks. I would suggest swapping the
         evaluator model from o3 to one from another company, e.g. Gemini 2.5
         Pro, to make sure the ranking holds up. For example, if OpenAI models
         all share some sense of what constitutes good design, it would not be
         that surprising that o3 prefers GPT-5 code to Gemini code! (I would
         not even be surprised if GPT-5 were trained partially on output from
         o3.)
       
        mkotlikov wrote 4 hours 59 min ago:
        Models tend to prefer output that sounds like their own. If I were to
        run these benchmarks I would have:
        
         1) Gemini 2.5 Pro rank only non-Google models
         2) Claude 4.1 Opus rank only non-Anthropic models
         3) GPT-5 Thinking rank only non-OpenAI models
        4) Then sum up the rankings and sort by the sum.
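         
         A rough sketch of that aggregation in Python, with made-up model
         names and rankings purely for illustration (a real run would use
         each judge's actual rankings of anonymized reviews):
         
         FAMILIES = {
             "gpt-5": "openai",
             "claude-4.1-opus": "anthropic", "claude-sonnet-4": "anthropic",
             "gemini-2.5-pro": "google", "gemini-2.5-flash": "google",
         }
         
         # judge -> its ranking of non-same-family models, best first (made up)
         judge_rankings = {
             "gemini-2.5-pro":  ["gpt-5", "claude-4.1-opus", "claude-sonnet-4"],
             "claude-4.1-opus": ["gpt-5", "gemini-2.5-pro", "gemini-2.5-flash"],
             "gpt-5":           ["claude-4.1-opus", "gemini-2.5-pro",
                                 "claude-sonnet-4", "gemini-2.5-flash"],
         }
         
         def aggregate(rankings):
             """Sum each model's rank positions (lower is better), never
             letting a judge score a model from its own family."""
             totals = {}
             for judge, ranking in rankings.items():
                 for pos, model in enumerate(ranking, start=1):
                     if FAMILIES[model] == FAMILIES[judge]:
                         continue  # defensive; rankings already exclude these
                     totals[model] = totals.get(model, 0) + pos
             return sorted(totals.items(), key=lambda kv: kv[1])
         
         for model, rank_sum in aggregate(judge_rankings):
             print(model, rank_sum)
         
         If judges end up ranking different numbers of models, averaging
         positions instead of summing them avoids penalizing models that were
         ranked by more judges.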
       
        thegeomaster wrote 5 hours 16 min ago:
        Gemini 2.5 Pro is severely kneecapped in this evaluation. Limit of 4096
        thinking tokens is way too low; I bet o3 is generating significantly
        more.
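         
         For reference, the thinking budget is a per-request setting; a
         minimal sketch of raising it with the google-genai Python SDK (the
         prompt is a placeholder, and the 16384 figure is arbitrary versus
         the 4096 cap described in the post):
         
         from google import genai
         from google.genai import types
         
         client = genai.Client()  # expects GOOGLE_API_KEY in the environment
         resp = client.models.generate_content(
             model="gemini-2.5-pro",
             contents="Review this diff: ...",  # placeholder prompt
             config=types.GenerateContentConfig(
                 thinking_config=types.ThinkingConfig(thinking_budget=16384),
             ),
         )
         print(resp.text)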
       
          energy123 wrote 5 hours 2 min ago:
          For o3, I set reasoning_effort "high" and it's usually 1000-2000
          reasoning tokens for routine coding questions.
          
          I've only seen it go above 5000 for very difficult style transfer
          problems where it has to wrangle with the micro-placement of lots of
          text. Or difficult math problems.
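           
           For what it's worth, a minimal sketch of that setup with the OpenAI
           Python SDK, including where the reasoning-token count is reported
           (the prompt is a placeholder):
           
           from openai import OpenAI
           
           client = OpenAI()  # expects OPENAI_API_KEY in the environment
           resp = client.chat.completions.create(
               model="o3",
               reasoning_effort="high",
               messages=[{"role": "user", "content": "Review this diff: ..."}],
           )
           print(resp.choices[0].message.content)
           # Reasoning tokens are counted separately from visible output tokens.
           print(resp.usage.completion_tokens_details.reasoning_tokens)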
       
        jondwillis wrote 5 hours 22 min ago:
        Idea: randomized next token prediction passed to a bunch of different
        models on a rotating basis.
        
         It’d be harder to juice benchmarks if output tokens were drawn from a
         rotating random sample of ~100 top models in this manner while
         evaluating the target model’s output.
        
        On second thought, I’m slapping AGPL on this idea. Please hire me and
        give me one single family house in a California metro as a bonus.
        Thanks.
       
        dovin wrote 5 hours 52 min ago:
        I don't consider myself a font snob but that web page was actually hard
        for me to read. Anyway, it's definitely capable according to my
        long-horizon text-based escape room benchmark. I don't know if it's
        significantly better than o3 yet though.
       
        grigio wrote 5 hours 57 min ago:
         I don't trust benchmarks that do not include Chinese models...
       
        XCSme wrote 6 hours 3 min ago:
         The ranking seems wrong: Gemini 2.5 Flash as good as Claude Opus 4?
       
          ascorbic wrote 5 hours 1 min ago:
          And Sonnet above Opus?
       
        tw1984 wrote 6 hours 12 min ago:
         The conclusion of this post seems to be that GPT-5 is significantly
         better than o3, yet that conclusion is reached by the very model, o3,
         that the post's own tests show to be far less reliable.
        
        thanks, but no thanks, I don't buy such marketing propaganda.
       
        Lionga wrote 6 hours 16 min ago:
        Company selling AI Reviews says AI Reviews great! In other news water
        is wet.
       
          carlob wrote 6 hours 9 min ago:
          Company selling AI Reviews says its AI Review of AI Reviews concluded
          AI reviews are great! In other news water is wet (as assessed by more
          water).
          
          FTFY
       
            Lionga wrote 5 hours 58 min ago:
            My AI Review says your comment is 100% perfect (this comment was
            written by ChatGPT 5)
       
        8-prime wrote 6 hours 18 min ago:
         Asking GPT-4o seems like an odd choice.
        I know this is not quite comparable to what they were doing, but asking
         different LLMs the following question:
         > answer only with the name, nothing more nothing less. What currently
         > available LLM do you think is the best?
        
        Resulted in the following answers:
        
        - Gemini 2.5 flash: Gemini 2.5 Flash
        
        - Claude Sonnet 4: Claude Sonnet 4
        
         - ChatGPT: GPT-5
        
         To me it's conceivable that GPT-4o would be biased toward output
         generated by other OpenAI models.
       
          qingcharles wrote 4 hours 10 min ago:
          Someone else commented the same:
          
   URI    [1]: https://news.ycombinator.com/item?id=44834643
       
          monkeydust wrote 5 hours 33 min ago:
           I know from our research that models do exhibit bias when used this
           way as LLM-as-a-judge... best to use a model from a totally
           different foundation company as the judge.
       
          rullelito wrote 5 hours 50 min ago:
           Without knowing too much about ML training: a model's own generated
           output must be much easier for it to understand, since it generates
           data that is more likely to be similar to its training set? Is this
           correct?
       
            jondwillis wrote 5 hours 26 min ago:
            I don’t think so. The training data, or some other filter applied
            to the output tokens, is resulting in each model indicating that it
            is the best.
            
            The self-preference is almost certainly coming from
            post-processing, or more likely because the model name is inserted
            into the system prompt.
       
        spongebobstoes wrote 6 hours 23 min ago:
        > the “minimal” GPT-5 variant ... achieved a score of 58.5
        
        the image shows it with a score of 62.7, not 58.5
        
        which is right? mistakes like this undermine the legitimacy of a closed
        benchmark, especially one judged by an LLM
       
          jama211 wrote 27 min ago:
          Probably written by an llm too…
       
          rs186 wrote 1 hour 11 min ago:
           A large chunk of this article reads like it was LLM-generated, so I
           guess it was never proofread, and details like this are not
           validated, or they could be entirely made up, i.e. hallucinated.
       
        shinycode wrote 6 hours 27 min ago:
         I’m curious to know how people use PR review platforms with LLMs.
         What I feel is that I need to do the review and then review the
         review of the LLM, which is more work in the end. If I don’t review
         anymore (or if no one does) knowledge is kind of lost. It surely
         depends on team size, but do people use these only to get better
         hints, or to accelerate reviews with no/low oversight?
       
          Leherenn wrote 5 hours 43 min ago:
           Only as a sanity check/better hints. But I use it for my own PRs,
          not others'. Usually it's not much to review and easy to
          agree/disagree with.
          
          I haven't found it to be really useful so far, but it's also very
          little added work, so for now I keep on using it. If it saves my ass
          even just once, it will probably be worth it overall.
       
            fcantournet wrote 4 hours 9 min ago:
            > If it saves my ass even just once, it will probably be worth it
            overall.
            
            That's a common fallacy of safety by the way :)
            
             It could very well "save your ass" just once (whatever that means)
             while costing you enough in time, opportunity, effort, or even a
             false sense of safety to generate more harm than it will
             ultimately save you.
       
              Leherenn wrote 3 hours 34 min ago:
               Sure, but so far the cost is very minimal. Like 1 minute per PR
               on average. A crash in production and the subsequent fallout is
               probably a good week of work and quite a bit of stress. That
               gives me quite a few PRs.
              
              And it's not even safety critical code.
       
          stpedgwdgfhgdd wrote 6 hours 4 min ago:
           I give the MR id to CC and let it review. I have the glab cli
           installed, so it knows how to pull the MR and even add a comment.
           Unfortunately not at a specific line number, afaict. I also have
           the Atlassian MCP, so CC can also add a comment on the Jira work
           item (fka issue).
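           
           A rough scripted version of that loop, assuming an authenticated
           glab CLI; the model name, prompt, and MR id below are placeholders:
           
           import subprocess
           from openai import OpenAI
           
           def review_mr(mr_id: str) -> None:
               # Pull the merge request diff with the glab CLI.
               diff = subprocess.run(["glab", "mr", "diff", mr_id],
                                     capture_output=True, text=True,
                                     check=True).stdout
               review = OpenAI().chat.completions.create(
                   model="gpt-5",  # placeholder; any review-capable model
                   messages=[{"role": "user",
                              "content": f"Review this MR diff:\n\n{diff}"}],
               ).choices[0].message.content
               # Posts one top-level note, not comments on specific lines.
               subprocess.run(["glab", "mr", "note", mr_id,
                               "--message", review], check=True)
           
           review_mr("1234")  # placeholder MR id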
       
        comex wrote 6 hours 33 min ago:
        > Each model’s responses are ranked by a high-performing judge model
        — typically OpenAI’s o3 — which compares outputs for quality,
        relevance, and clarity. These rankings are then aggregated to produce a
        performance score.
        
        So there's no ground truth; they're just benchmarking how impressive an
        LLM's code review sounds to a different LLM.  Hard to tell what to make
        of that.
       
          andrepd wrote 2 hours 10 min ago:
          It's almost too on the nose to be satire, yet here we are.
       
          dvfjsdhgfv wrote 2 hours 26 min ago:
          >  Hard to tell what to make of that.
          
          It's not hard. You are visiting a website with an .ai domain. You
          already know what the conclusions will be.
       
          croes wrote 2 hours 46 min ago:
          It undermines the private benchmark approach if the evaluation is
          done that way.
       
          shikon7 wrote 5 hours 46 min ago:
          Also, using an OpenAI model to judge the performance of an OpenAI
          model seems prone to all kinds of biases.
       
            LauraMedia wrote 5 hours 5 min ago:
            Am I missing something? If LLM-1 is supposed to judge LLM-2,
            doesn't LLM-1 have to be better than LLM-2? If LLM-1 is only 40% as
            good at coding as LLM-2, why would you trust the LLM with the
            lesser knowledge?
       
              stingraycharles wrote 2 hours 20 min ago:
              At least use something like Zen MCP’s Consensus tool to gain a
              consensus around a large variety of models.
       
              BlindEyeHalo wrote 4 hours 57 min ago:
              At the heart of the P vs NP problem lies the observation that
              solution verification seems to be much easier than solution
               generation. Whether that applies in this context is another
               question, but I think it is not unreasonable to assume that the
               judge can be less powerful than the performer.
              
              Or in other words, I don't need to be a chef myself to decide if
              a meal is good or not.
       
                torginus wrote 1 min ago:
                It's a bit different for reasoning LLMs - they operate in a
                feedback loop, measuring the quality of the solution and
                iterating on it until either the quality meets a desired
                threshold, or all reasoning effort is expended.
                
                This can correct for generation errors, but cannot correct for
                quality measurement errors, so the question is valid.
       
                cubefox wrote 3 hours 3 min ago:
                It's usually easier to create a false statement than to check
                whether it's false.
       
                rowanG077 wrote 4 hours 28 min ago:
                 That really doesn't hold for all problems. You can imagine any
                 number of problems where a valid solution is easier,
                 complexity-wise, to generate than to validate. A trivial
                 example: it's easy to generate a semiprime, but verifying that
                 an arbitrary number is a semiprime means factoring it.
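                 
                 A toy illustration with sympy (small, arbitrary primes so the
                 factoring actually finishes; scale the primes up and
                 generation stays instant while verification becomes
                 infeasible):
                 
                 from sympy import factorint, randprime
                 
                 p = randprime(10**9, 10**10)
                 q = randprime(10**9, 10**10)
                 n = p * q  # generating a semiprime: one multiplication
                 # Verifying that n is a semiprime means recovering its factors.
                 print(sorted(factorint(n)) == sorted([p, q]))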
       
                  mcphage wrote 13 min ago:
                  > That really doesn't hold for all problems.
                  
                  But it does hold for this problem.
       
                  BlindEyeHalo wrote 2 hours 49 min ago:
                  Sure, it was never my intention to make it seem like a
                  general statement, just highlighting that there is a large
                  class of problems for which it is true.
                  
                   As you point out, there are many problems in higher
                   complexity classes than NP.
       
                  jama211 wrote 4 hours 17 min ago:
                  Pretty sure they know that, their point still stands
       
            mirekrusin wrote 5 hours 6 min ago:
             Exactly. They should at least use the best models from other
             companies as judges, ideally verified against humans/ground
             truth/tests.
       
          with wrote 6 hours 18 min ago:
           It’s a widely accepted eval technique and it’s called “LLM as a
           judge”.
       
            kingstnap wrote 4 hours 58 min ago:
            It's widely accepted because it's cheap, but LLMs aren't really
            good judges.
            
            It's supposed to leverage a "generate vs. critique" gap in skill
            level as a form of self-improvement. It's easier to judge how good
            food is vs. make it.
            
            But here's the thing. When it comes to code review, you need to be
            effectively as skilled as the person who wrote it. There isn't
            really a gap.
            
            And then the real clincher is this. LLMs naturally have a skill gap
            between their judgement and generation skills as is. The reason is
            that they have superhuman pattern matching and memorization
            ability. They can use their memorized patterns as a massive crutch
            for their actual reasoning skills, but they can't do the same for
            judgement calls in code review.
       
            sensanaty wrote 5 hours 31 min ago:
            Accepted by whom, the people shoving AI down our throats?
       
            jacquesm wrote 5 hours 57 min ago:
            Accepted does not mean correct. It's like using a rubber yardstick
            as the means to figure out who won the pumpkin growing competition.
       
              ben_w wrote 5 hours 24 min ago:
              I'd say it's worse than that, a rubber ruler still has a definite
              length when not under tension etc.
              
              This might be more like asking amateur painters to each paint a
              picture of a different one of the pumpkins, then judging each
              other's paintings without seeing the actual pumpkin that painting
              was based on.
       
                jacquesm wrote 5 hours 4 min ago:
                Ok, that is indeed better. For a further improvement we should
                let the previous generation of paintings judge the new one.
       
            magicalhippo wrote 5 hours 58 min ago:
            Shouldn't one review the ratings of say a random 1% to ensure it's
            performing as expected?
       
          raincole wrote 6 hours 21 min ago:
          That's how 99% of 'LLM benchmark numbers' circulating on the internet
          work.
       
            qsort wrote 5 hours 46 min ago:
            No, they aren't. Most benchmarks use ground truth, not evaluation
             by another LLM. Using another LLM as verifier, aside from the
             obvious "quis custodiet ipsos custodes", opens an entire can of
             worms, such as the possibility of systematic biases in the
             evaluation. This is not in and of itself disqualifying, but it
             should be addressed, and the article doesn't say anything about
             it.
       
              charlieyu1 wrote 1 hour 50 min ago:
               Even the benchmarks for maths only check numerical answers
               against ground truth, which means the LLM can output a lot of
               nonsense and still pass by guessing the correct answer.
       
              sigmoid10 wrote 2 hours 35 min ago:
              Ground truth evaluation is not that simple unless you are doing
              multiple-choice-style tests or something similar where the
              correctness of an answer can be determined by a simple process.
              Open ended natural language tasks like this one are incredibly
              difficult to evaluate and using LLMs as judge is not just the
              current standard, it is basically the only way to do it at scale
              economically.
       
                qsort wrote 1 hour 50 min ago:
                The original comment was this:
                
                > So there's no ground truth; they're just benchmarking how
                impressive an LLM's code review sounds to a different LLM. Hard
                to tell what to make of that.
                
                The comment I replied to was:
                
                > That's how 99% of 'LLM benchmark numbers' circulating on the
                internet work.
                
                And that's just false. SWE-Bench verified isn't like this.
                Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like
                this. The new internal benchmarks used by OpenAI in GPT-5's
                model card aren't like this.
                
                Maybe this benchmark is a special snowflake and needs
                LLM-as-a-judge, but this doesn't invalidate the original
                concern: setting up a benchmark this way runs into a series of
                problems and is prone to show performance differences that
                 might not be there with a different setup. Benchmarks are
                 already hard to trust; I'm not sure how this is any more
                 indicative than the rest.
       
          eviks wrote 6 hours 22 min ago:
          Why is it hard to ignore an attempt to assess reality that is not
          grounded in reality?
       
            jtrn wrote 1 hour 29 min ago:
             That's an extremely dense question :) (Not pejorative, but
             conceptually dense.)
             
             I had some fun trying to answer it, setting aside for argument's
             sake whether or not the premise is true.
            
            My answer is:
            
            I would think "attempting to assess reality that is not grounded in
            reality" is hard to ignore due to a combination of "it's what is
            available," being easy to understand, and seeming useful (decoupled
            from whether it's really so). As a result, it's hard to ignore
            because it's what is mostly available to us for consumption and is
            easy to make "consumable."
            
            I think there is a LARGE overlap in this topic with my pet peeve
            and hatred of mock tests in development. They are not completely
            useless, but their obvious flaws and vulnerabilities seem to me to
            be in the same area: "Not grounded in reality."
            
            Said another way: Because it's what's easy to make, and thus there
            is a lot of it, creating a positive feedback loop of mere-exposure
            effect. Then it becomes hard to ignore because it's what's shoved
            in our face.
       
          ImageXav wrote 6 hours 27 min ago:
          Yes, especially as models are known to have a preference towards
          outputs of models in the same family. I suspect this leaderboard
          would change dramatically with different models as the judge.
       
            jacquesm wrote 5 hours 58 min ago:
            I don't care about either method. The ground truth should be what a
            human would do, not what a model does.
       
              mirekrusin wrote 5 hours 3 min ago:
               There may be different/better solutions for almost all those
               kinds of tasks. I wouldn’t be surprised if the optimal answer
               to some of them were to refuse or defer and ask, refactor
               first, then solve it properly.
       
                jacquesm wrote 4 hours 21 min ago:
                That response is quite in line with the typical human based PR
                response on a first draft.
                
                There is a possibility that machine based PR reviews are
                better: for instance because they are not prejudiced based on
                who is the initiator of the PR and because they don't take
                other environmental factors into account. You'd expect a
                machine to be more neutral, so on that front the machine should
                and possibly could score better. But until the models
                consistently outperform the humans in impartially scored
                quality vs a baseline of human results it is the humans that
                should call this, not the machines.
       
                  jeltz wrote 3 hours 54 min ago:
                  I wouldn't necessarily expect a machine to be more neutral.
                  Machines can easily be biased too.
       
                    jacquesm wrote 3 hours 51 min ago:
                    On something like a PR review I would. But on anything that
                    would involve private information such as the background,
                    gender, photographs and/or video as well as other writings
                    by the subject I think you'd be right.
                    
                    It's just that it is fairly trivial to present a PR to a
                    machine in such a way that it can only comment on the
                    differences in the code. I would find it surprising if that
                    somehow led to a bias about the author. Can you give an
                    example of how you think that would creep into such an
                    interaction?
       
            spiderfarmer wrote 6 hours 24 min ago:
            They are different models already but yes, I already let ChatGPT
            judge Claude's work for the same reason.
       
        timbilt wrote 6 hours 36 min ago:
        > Unlike many public benchmarks, the PR Benchmark is private, and its
        data is not publicly released. This ensures models haven’t seen it
        during training, making results fairer and more indicative of
        real-world generalization.
        
        This is key.
        
        Public benchmarks are essentially trust-based and the trust just isn't
        there.
       
          jacquesm wrote 5 hours 52 min ago:
          Then you just need to use different data the next time you evaluate.
          That is much more indicative of real-world generalization: after all,
          you don't normally do multiple PRs on the same pieces of code. The
          current approach risks leaking the dataset selectively and/or fudging
          the results because they can't be verified. Transparency is key when
           doing this kind of benchmark, but now we have to trust the entity
           doing the benchmarking rather than rely on independent verification
           of the results, and with the amount of money at stake here I don't
           think that's the way to go.
       
          nojs wrote 6 hours 28 min ago:
          How does this ensure models haven’t seen it during training - is it
          a different benchmark per model release?
       
          laggyluke wrote 6 hours 31 min ago:
          Unless you're running the LLM yourself (locally), private benchmarks
          are also trust-based, aren't they?
       
            timbilt wrote 6 hours 24 min ago:
            Yes, but in a case like this it's a neutral third-party running the
            benchmark. So there isn't a direct incentive for them to favor one
            lab over another.
            
            With public benchmarks we're trusting the labs not to cheat. And
             it's easy to "cheat" accidentally - they actually need to make a
             serious effort to not contaminate the training data.
            
            And there's massive incentives for the labs to cheat in order to
            get the hype going around their launch and justify their massive
            investments in training. It doesn't have to be the CEO who's
            directing it. Can even be one/a few researchers who are responsible
            for a specific area of model performance and are under tremendous
            pressure to deliver.
       
              vohk wrote 6 hours 0 min ago:
               The problem is that when using a model hosted by those labs
               (e.g. OpenAI only allowed access to o3 through their own direct
               API, not even Azure), there still exists a significant risk of
               cheating.
              
              There's a long history of that sort of behaviour. ISPs gaming
              bandwidth tests when they detect one is being run. Software
              recognizing being run in a VM or on a particular configuration. I
              don't think it's a stretch to assume some of the money at OpenAI
              and others has gone into spotting likely benchmark queries and
              throwing on a little more compute or tagging them for future
              training.
              
              I would be outright shocked if most of these benchmarks are even
              attempting serious countermeasures.
       
       