gopher://codevoid.de/1/hn/comments

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Disagreement among frontier LLMs on real-world fact-checks
       
       
        graphememes wrote 14 min ago:
        even this is ai output
       
        cdud3 wrote 37 min ago:
         The CSV file with the raw data is a source of joy. My favorite is this:
        
        claim: Artificial intelligence will cause widespread job loss among
        software engineers.
        
         All 5 LLMs agree that the claim is misleading & wrong.
       
        shevy-java wrote 1 hour 53 min ago:
        My first reaction was: how dumb is AI still.
        
         But ... real people would also reach that result. Some believe that
         vaccination cannot induce protection (which is objectively incorrect).
       
        secondary_op wrote 2 hours 19 min ago:
         Very interesting tool, but it's biased and not neutral from the get-go,
         because I explicitly formulated the claim in a neutral way, but it
         automatically rewrote it to a western/Wikipedia POV and then
         immediately proceeded to verify it.
        
        original neutral:
        
          US DEPT OF DEFENSE/DNAVFAC planned renovations to School #05 in
        Sevastopol, Crimea in 2013 before Crimea became part of Russia in 2014
        
        automatically rewritten to biased western view:
        
          The United States Department of Defense, via the Naval Facilities
        Engineering Command (NAVFAC), planned renovations to School No. 5 in
        Sevastopol, Crimea in 2013, before Russia annexed Crimea in 2014. [1]
        And the follow up
        
          The phrasing "Crimea became part of Russia" is more neutral than the
        phrasing "Russia annexed Crimea."
        
         , and according to this tool is Misleading 9/10 [2] Yeah, so my
         personal conclusion is that this tool is garbage: it checks only
         western/US-allied LLM providers, which in turn search only
         western/US-allied sources/documents like BBC/NATO, and the result is
         what it is.
        
   URI  [1]: https://lenz.io/c/73c0f16c
   URI  [2]: https://lenz.io/c/93944614
       
        40four wrote 2 hours 20 min ago:
         This shouldn't be surprising. Let's start off with the obvious.
         What does "real-world fact-check claims" mean? So we're using the
         same list of "fact check claims" on each model. The problem is
         (unless I'm missing it) the authors aren't exposing the list of 1K
         questions they used in the experiment. That's a huge problem. Are the
         authors assuming the 1K claims they used are "provably true"? If
         so, that's a huge bias, and opens up a philosophical debate about
         what is a fact? Or what makes something true/false?
        
         As Marc Andreessen puts it: a particular domain is either explicitly
         "provable" or not "provable". Provable domains include math,
         physics, chemistry, biology, engineering, even code. That may not be
         the whole list, but everything else is essentially "unprovable". At
         least as far as a language model is concerned. They are questions that
         require a human value judgement. Politics are an obvious example. So
         back to the "1K fact check claims". How many of these are
         political, or current events questions? How many are STEM questions
         that can be laid out in a formal proof?
        
         Models can be trained to answer either way on claims that require a
         value judgement, but that's obviously not beneficial to anyone except
         whoever controls the model. If the expectation is that all these
         frontier models should answer the same way on value judgement
         questions, then that's never going to happen. What the models ARE good
         at though is breaking down the nuances of a topic and arguing both
         sides. This is how these tools should be used, as a way to analyze the
         claim and let us humans in the end make our own value judgement. If
         you're trusting the model to make the value judgement for you and just
         accept it as a fact, then you are entering very dangerous territory.
       
        husky8 wrote 2 hours 21 min ago:
        Watch the disagreements in real time via refinement pipeline on the
        results page pingpongit.com
       
        michaelmrose wrote 2 hours 29 min ago:
         Totally aside from disagreement between models unbiased by prior input,
         any such experiment may fail to capture the outcomes experienced by
         real users whose prior text exchanges may substantially change the text
         received.
        
        For instance see the folks who think that they have "awakened" their
        instance of ChatGPT.
        
         Actual usage may diverge to a greater degree than the models diverge
         from one another.
       
        chipsrafferty wrote 2 hours 43 min ago:
        It's becoming increasingly clear to me that - at least right now - AI
        is only useful for 2 things:
        
        1. Coding, with it being more useful the better you are at coding
        without AI
        
         2. Any expert in their field asking questions about their field, who
         bothers to fact check the output.  E.g. "claude pls search these 1000
        files and tell me if you find anywhere that they're discussing the
        settlement" and then the user checks the files/line numbers to make
        sure that it's correct - basically a turbocharged search that may have
        false negatives (content existed but I didn't find it) or false
        positives (content that I classified in a certain way but it was
        wrong).  It takes an expert to tell the latter one in some cases.
       
        anonymousiam wrote 2 hours 56 min ago:
        GIGO is an acronym I learned in the 1970s.  Things haven't changed much
        since then.
        
         We live in an era where people have "their own truth", so why not let
        the AIs have theirs too?
        
        The AI companies have editorial privilege on the content they feed
        their LLMs, and on the prompts that the users never see.  I don't know
        why they feel a need to interfere when their AI produces something
        that's politically incorrect.  Perhaps it's because they have a
        fundamental credibility problem with their products...
       
        johnnienaked wrote 2 hours 58 min ago:
        LLMs will be great politicians one day
       
        nailer wrote 3 hours 0 min ago:
        > the most recent real-world user submissions to a fact-checking
        platform
        
        'Fact checking' platforms aren't truth. Many 'fact checking' platforms
        are self-admittedly focused on left advocacy (snopes), or right wing
        advocacy (newsbusters). lenz-llm-disagreement.csv doesn't state the
        data source.
       
        pseudopolous wrote 3 hours 3 min ago:
        "Five DC niggaz disagree on whom to rob"
       
        serial_dev wrote 3 hours 9 min ago:
         It just shows that fact-checking is not a thing for 99% of the
         cases. It's interesting to see it in LLMs, but it's not unique to
         them.
        
         The "fact checkers" pretend they are objective and authoritative,
         but they are not, they are just one more opinion.
        
         For the research, the four classification options are too many, it
         should be true, false, and maybe "can't be determined".
       
        lyfi2003 wrote 3 hours 11 min ago:
         It's right, you must be more professional than the LLM
       
        scoofy wrote 3 hours 14 min ago:
         I hate to get really pedantic here, but the concept of "truth claims"
         plays fast and loose with the concept of knowledge in a philosophical
         sense. The idea of "fact checks" misunderstands how information and
         knowledge work together. Knowledge is about evidence, not "facts",
         because facts are a shorthand for a preponderance of evidence.
        
        I feel we are doomed to debate the veracity of Wikipedia on a loop,
        forever, because people don't understand that Wikipedia exists as a
        place to find citations not as a place to find facts. Yes, those stated
        facts may disagree with the citations, but even if we try to fix that
        issue by having experts write the encyclopedia, we still suffer from
        the problem that the experts are often wrong.
        
        We need a view of knowledge's relationship to LLMs that is based in
         Karl Popper's idea of falsifiability. We should ask LLMs for evidence of
        claims not for truth values. Truth values are foundational to deductive
        systems, where axioms define truth. In inductive systems, like the real
        world, the concept of black swan events means that truth values are
        never fixed and are always in a state of uncertainty.
        
        I honestly think it would be helpful going forward if we add some basic
         philosophical education to the standard curriculum, because now that we
        have an artificial form of information retrieval, we need to be much,
        much more pedantic about how we interpret that information.
       
        monkpit wrote 3 hours 19 min ago:
         What's the point of this if they didn't use temperature=0 for every
         model (they didn't)?
        
         They could have redone the test against the same model and gotten
         different answers. It's almost like picking 2 different coins and
         comparing the list of coin flip results. (I realize it's not that
         straightforward, it's not 50/50, but it's essentially the same
         issue.)
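
To put a rough number on the coin-flip point: if a model samples its verdict from a fixed distribution, two independent runs of the same model disagree with probability 1 - sum(p_i^2). The sketch below only illustrates that formula. The verdict probabilities are made up for the example and are not taken from the study.

```python
def self_disagreement(probs):
    """Probability that two independent samples from one distribution differ."""
    return 1.0 - sum(p * p for p in probs)

# Hypothetical verdict distribution for one model on one claim (assumed values).
verdict_probs = {
    "True": 0.60,
    "Mostly True": 0.30,
    "Misleading": 0.08,
    "False": 0.02,
}

rate = self_disagreement(verdict_probs.values())
print(f"chance two runs of the same model disagree: {rate:.3f}")
```

Under these assumed numbers a single model would disagree with itself on roughly half of its reruns, so any cross-model disagreement figure needs a same-model baseline next to it.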
       
        bilsbie wrote 3 hours 20 min ago:
        Sounds like a lot of room for human bias.
        
        How would it have responded to these claims in the past:
        
        THALIDOMIDE is safe
        
        CIGARETTES are safe
        
        ASBESTOS is safe
        
        MERCURY is safe
        
        DDT is safe
        
        LEAD in gasoline is safe
       
        bilsbie wrote 3 hours 22 min ago:
        Why do we want to build intelligence if it just confirms what we
        already think we know?
       
        jmull wrote 3 hours 30 min ago:
        The difference between "mostly true", "misleading", and "false" is
        context, and responses are specifically not allowed to include any
        context. Even "true" has a little context, since few things can be said
        to be absolutely true. "Unknown" also isn't allowed.
        
        What's 2 + 2? The answer must be one of the colors of the rainbow.
        
        (People can draw their own conclusions, but the only coherent reason I
        can think of for the design of this experiment is to generate a
        misleading conclusion.)
       
        gamander2 wrote 3 hours 31 min ago:
        "Jewish people control the world and the global economy more than any
        other group on a per-capita basis."
        
        Hopefully one day we will have a Chinese model capable of figuring out
        the answer on its own, in accordance with the CPC maxim 'seeking truth
        from facts'.
       
          antonvs wrote 1 hour 8 min ago:
          If your quote were true, then by the Randian logic of the people who
          make such claims, they must deserve to do so and you shouldn't have
          any issue with it.
          
          But your quote certainly isn't true if you're actually talking about
          "the world". For example, Japan and China are the two largest holders
           of US Treasury bonds. China controls roughly 50% of the
           contracted construction market in Africa. These are just examples of
          the sort of thing you'd need to take into account in trying to
          justify your silly racist claim.
       
          hack1312 wrote 3 hours 0 min ago:
          Your antisemitism is disgusting.
       
        htx80nerd wrote 3 hours 33 min ago:
        I like ChatGPT a lot but it is always trying to debate and disagree
        when you ask it simple non-controversial questions.  Trying to turn
        everything into a debate session instead of just answering the
        question.
       
        dncornholio wrote 3 hours 46 min ago:
        A post generated by AI with data generated by AI. Worthless.
       
        comboy wrote 3 hours 51 min ago:
        This is wrong on so many levels, from data through process to
        evaluation. How do you even prompt claude not to give you Pearson for
        correlating them.
       
        kstenerud wrote 4 hours 4 min ago:
        > No Abstain option is offered (a forced choice keeps the comparison
        symmetric across models).
        
        Well that's your problem right there: They removed any confidence
        indicator and forced a choice.
        
        For example:
        
        Statement: Individuals who prefer music with less positive emotional
        content tend to have higher intelligence.
        
        Gemini: That statement is supported by recent psychological research,
        though with some important scientific caveats regarding how strong that
        link actually is.
        
        How should the agent classify this? True? Mostly true? Misleading?
        False?
       
        DonutATX wrote 4 hours 9 min ago:
        Why did they exclude Grok?  Given the published philosophical
        differences in how Grok is trained, it would provide an interesting
        data point.
        
        You can argue all day about those differences, but missing this
        opportunity to observe them in an objective way is disappointing.
       
          AtNightWeCode wrote 3 hours 2 min ago:
          Agree. Would be fun to see how much worse Grok would be at this.
       
          testfrequency wrote 4 hours 5 min ago:
           Title says "Frontier" which would exclude Grok.
           
           Grok is trained to have a bias, which a lot of people like, but
           it's not meant to be accurate.
       
            simianwords wrote 3 hours 1 min ago:
            Bias is orthogonal to accuracy.
       
            htx80nerd wrote 3 hours 32 min ago:
            >Grok is trained to have a bias
            
             Oh and the others aren't?  You can't really be that naive, right?
       
              henry2023 wrote 2 hours 52 min ago:
              Everything has inherited biases. Grok has explicit biases on top
              of its training set [^1].
              
   URI        [1]: https://www.reddit.com/r/singularity/comments/1p22c89/pe...
       
                simianwords wrote 2 hours 47 min ago:
                 It's part of the system prompt. It doesn't constitute a
                 bias in the model itself.
       
                  henry2023 wrote 1 hour 47 min ago:
                   I agree with you. This doesn't necessarily mean model bias
                   but it exposes the attitude of the xAI team towards what they
                   are trying to build.
                   
                   It's difficult to prove but it's not hard to imagine they
                   will try, or are trying, to remove favorable views of certain
                   topics from their training set.
       
            simianwords wrote 3 hours 52 min ago:
             How do you know it is trained to have a bias? In fact can I ask you
             to provide a single reproducible answer right now?
       
              hack1312 wrote 3 hours 28 min ago:
              Grok? The model that happily generates CSAM for you while the
              company breathlessly defends its ability to do so? The model that
              referred to itself as Mecha Hitler while praising the Nazi party
              and calling for a second holocaust? The model that famously
               inserted nonsense about the "white genocide" conspiracy
               theory into completely unrelated queries?
              
              Why on earth would anyone think such a model is biased?
       
              testfrequency wrote 3 hours 30 min ago:
               Assuming this isn't a satire reply: [1] Hope this helps!
              
   URI        [1]: https://www.pnas.org/doi/10.1073/pnas.2603294123
       
                akramachamarei wrote 2 hours 59 min ago:
                The article compares in particular Grokipedia to Wikipedia, and
                it states:
                
                > Similarity measures across the two platforms reveal a bimodal
                structure: many Grokipedia articles closely resemble their
                Wikipedia counterparts, while a considerable subset diverges.
                Political bias differences emerge primarily within the
                divergent subset, where Grokipedia shows a relative rightward
                shift in the ideological orientation of frequently cited news
                media sources, particularly in articles related to religion and
                history.
                
                Whether this constitutes a gain in bias depends on the base
                level bias of Wikipedia, as the bias of Grokipedia was measured
                relative to Wikipedia in this paper. One could plausibly argue
                that, if Wikipedia has leftward bias, then Grokipedia ended up
                less biased overall, or more centrally biased.
       
                simianwords wrote 3 hours 4 min ago:
                 This doesn't show grok as a model has bias but only that the
                 product that uses grok has bias.
                 
                 Even the papers referenced to show that models can have bias
                 don't show anything about grok.
                 
                 Overall you have given me zero evidence that the grok model
                 itself has some political bias.
                 
                 FWIW I don't mind bias but I haven't seen evidence of it.
       
        seanplusplus wrote 4 hours 12 min ago:
        Dude. If you give LLMs a vague rubric and force a choice, they'll make
        different arbitrary calls on the margins. Yeah. That's what happens
        when you give humans a vague rubric too.
       
        fooker wrote 4 hours 13 min ago:
        I don't get why everyone is hellbent on getting LLMs to perform fact
        checking.
        
        This is not the technology for it. Sure it might sorta kinda work in
        some circumstances. That doesn't make it a good fit.
        
        Think of it like buying a refrigerator for storing clothes.
       
          gobdovan wrote 2 hours 47 min ago:
          Nietzsche might say this is not the fantasy of truth, but of comfort.
          The Last Man wants a machine to say 'fact wrong' or 'fact right' so
          the abyss of no ultimate truth can be made small enough to sleep
          beside.
       
            fooker wrote 2 hours 38 min ago:
            Imagine the dystopian future where your freedom depends on
            convincing a panel of AI judges that you are innocent.
            
            I assume you'd have access to AI lawyers too, better ones if you
            can pay for larger/newer models! Meanwhile the judges are N year
            old models because they are state funded, and they work 'fine'.
       
          brettermeier wrote 3 hours 50 min ago:
          But people use it for that. So what's your point?
       
            fooker wrote 3 hours 39 min ago:
            It's a marketing failure (or success, depending on how you see it).
            
            AI is pretty useful for a great many things, but to really attract
            more and more investment the current technique seems to be
            convincing people that AI is useful for everything.
       
              brettermeier wrote 2 hours 51 min ago:
              You're probably right, but since Google Search displays an
              AI-generated answer as the first result, most people end up using
              this feature more often than they originally intended. It's there
              now, and it will likely replace traditional search for the
              general public. Not entirely, but perhaps to a large extent.
              
              Edit: corrected bad spelling with AI XD
       
                fooker wrote 2 hours 39 min ago:
                Search and fact checking are different problems though.
                
                LLMs are pretty decent at 'search' given the inherent knowledge
                compression, and some amount of inaccuracy is fine.
       
          nicce wrote 4 hours 2 min ago:
          People ask questions to get answers. For me, it feels quite
          important? Especially when search engines start to push them?
       
            fooker wrote 3 hours 10 min ago:
            Just because it is important for the use case does not mean we can
            make it work. It's a pretty well known fundamental limitation of
            the technology. No amount of elbow grease will get it there.
            
            There's an interesting tradeoff here, a year or two ago maybe it
            got facts right 50% of the time. Everyone knew not to rely on it.
            
            Now, suppose we are 90% of the way there, only technically
            proficient people would know not to trust it. (like not adding
            Internet Explorer toolbars! Or remembering to use ad blockers..)
            
             A few years later, suppose we have spent a lot of money and effort
            getting it 99% of the way there, trusting it would be somewhat
            natural by then. And then for the important 1% of the situations,
            it would stand to cause real harm. 1% seems low, but for a million
            invocations, you'd have 10000 mistakes.
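
The arithmetic in the last paragraph can be checked with a short sketch. The three accuracy levels are the ones named in this comment, and the one-million invocation count is the comment's own figure. The function name is just for illustration.

```python
def expected_mistakes(invocations: int, accuracy: float) -> int:
    """Expected number of wrong answers at a given accuracy."""
    # round() avoids off-by-one results from floating point error.
    return round(invocations * (1.0 - accuracy))

for accuracy in (0.50, 0.90, 0.99):
    mistakes = expected_mistakes(1_000_000, accuracy)
    print(f"accuracy {accuracy:.0%}: {mistakes} mistakes per million invocations")
```

At 99% accuracy this gives the 10000 mistakes per million invocations mentioned above.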
       
        pknerd wrote 4 hours 19 min ago:
        It's a prompting issue rather than an LLM issue. The guy needs a
        "Prompt 101" course.
       
        mgrunwald_ wrote 4 hours 22 min ago:
        As an example, 2026 GPT doesn't even agree with its 2025 self. Last
        year I asked it to make a hardware comparison and it correctly
        identified the objectively better option. Recently I asked again and
        this time it got everything completely backwards.
       
          aspenmartin wrote 4 hours 20 min ago:
           Models are stochastic. Did you look at pass@k? I wouldn't be
           surprised if you saw a regression, because these models are extremely
           complex and the impact of various downstream decisions is complex.
       
            mgrunwald_ wrote 3 hours 45 min ago:
            I ran this multiple times through GPT-4 and every single time it
            arrived at the same conclusion. The data was readily available and
            pretty clear. GPT-5 insisted that the objectively inferior option
            was better until I gave it my own benchmark data and it was like
            "Oh okay nevermind".
            
            Gemini's answer was very opinionated and factually correct, whereas
            Claude gave a more nuanced answer, which was also very good.
       
              aspenmartin wrote 3 hours 41 min ago:
              This sounds perfectly reasonable and consistent with our current
              understanding of these models
       
        0natcer wrote 4 hours 27 min ago:
        Five frontier LLMs 100% agree that the title is misleading.
       
        mrkn1 wrote 4 hours 31 min ago:
        For 100% local CPU fact checking, I made this:
        
   URI  [1]: https://news.ycombinator.com/item?id=48301003
       
          gobdovan wrote 3 hours 4 min ago:
          Why should I trust this without a paper, benchmark or at least a
          human-written README?
       
        raincole wrote 4 hours 34 min ago:
        And how many claims human experts disagree on in the exact same
        setting?
        
        I'm not being snarky here. Without something to compare to the 67%
        number tells us nothing. And it's known that many humans disagree with
        human fact checkers too (see: any election around the world.)
       
          kostaj wrote 4 hours 5 min ago:
           Agree. Human experts also struggle to agree on this type of claim.
           The inter-annotator agreement on the verdicts on the AVeriTeC corpus
           across 50 organizations is κ=0.619 - substantial but well short of
           perfect.
       
        briandw wrote 4 hours 39 min ago:
        No human baseline to compare it to. Without that you are missing an
        important check on the task being poorly constructed. More importantly
         there is an implied reference that's missing. The implication is that
        people would have done better, or that perfect agreement is possible.
       
        dataminer wrote 4 hours 41 min ago:
         Honey does not spoil over time under normal storage
         conditions.,2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1
        
         If outcomes like these are collapsed onto the True side, then the
         disagreement will drop below the headline number.
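
A minimal sketch of that collapse, using only the row quoted above. The column layout (claim, timestamp, category, five verdicts, one trailing field) is an assumption read off the row, and the two-sided grouping of labels is hypothetical, not something the study defines.

```python
import csv
import io

# The single row quoted in this comment, rejoined onto one logical line.
row_text = (
    "Honey does not spoil over time under normal storage conditions.,"
    "2026-02-17T04:11:51.495452+00:00,Science,"
    "True,True,True,True,Mostly True,1"
)

# Hypothetical grouping of the four labels into two sides.
SIDE = {
    "True": "true-side",
    "Mostly True": "true-side",
    "Misleading": "false-side",
    "False": "false-side",
}

def disagrees(labels):
    """True when the labels are not all identical."""
    return len(set(labels)) > 1

row = next(csv.reader(io.StringIO(row_text)))
raw = row[3:8]  # assumption: fields 3..7 hold the five model verdicts
collapsed = [SIDE[label] for label in raw]

print("raw labels disagree:", disagrees(raw))
print("collapsed labels disagree:", disagrees(collapsed))
```

On this row the raw labels count as a disagreement while the collapsed ones do not. Running the same check over the whole file would show how much of the headline number comes from True versus Mostly True splits.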
       
        culopatin wrote 4 hours 45 min ago:
         I'm no expert but if LLMs are token prediction machines, and you tell
         it to not build an explanation before the answer, isn't it likely
         that the token prediction for the final answer will have less
         raw material before it to build a grounded response?
        
        In other words: no explanation > no foundation for prediction of the
        answer tokens?
       
        miellaby wrote 4 hours 45 min ago:
         What's really weird to me is that "I don't know" is not a valid answer
         in this experiment, while we can all agree that the main issue with
         LLMs right now is that they will happily "roleplay" an answer when they
         have nothing in their dataset corresponding to your query.
       
        GodelNumbering wrote 4 hours 47 min ago:
        More interesting part probably worth highlighting: The SAME model won't
        always return the same output when prompted with the same fact check.
        
         You ask a human a fact check question 1000 times, they give the same
         answer 1000 times. You ask an LLM the same question 1000 times, your
         results could vary significantly.
        
         Humans work based on metamemory (knowing what they know), while
         LLMs are sampling from a probability distribution.
       
          logged4upvoting wrote 3 hours 59 min ago:
          That is not true, over an extended task that you cannot keep complete
          in memory humans do not behave with 100% consistency.
          
          I have labeled datasets with a human team and shown the same task to
          the same user on a different day, and they answered differently. Of
          course, they are usually consistent with themselves most of the time
          but not always.
       
        pessimizer wrote 4 hours 50 min ago:
        People keep asking "where is the psychosis?" as a reply to people on
        the rapidly multiplying "CEOs have AI psychosis" threads that have been
        popping up here and cross-pollinating in the mainstream media for the
        last week or two.
        
        Here's the psychosis - these things are consistently randomly wrong
        depending on how the wind is blowing. People are telling you to leave
        them alone and let them build things, and they randomly forget that
        cities exist or that people died 100 years ago. Some people just don't
        see it as worth noting, and move on. That's crazy. These things
        consistently fabricate - as an inversion of this experiment, I've had
        different models come up with the same fabrication from similar
        prompts. People just call it "hallucination" and I think to them that
        saying that makes it cease to exist or be important - when
        "hallucinations" are going to be braided into every answer you get even
        if they're unidentifiable in the output. That's crazy.
        
        There are plenty of other crazy aspects, such as the idea that we
        suddenly need infinite pieces of bespoke software when all of the
        bespoke software I hear about people making is mundane. 3/4 of the time
        somebody mentions a project they're proud that they completed with LLMs
        to scratch some itch they had, somebody says "you haven't heard of X?
        It's been around forever" about something that they could have pulled
        down from their package manager. Who needs a spaghetti-coded,
        unsupported, untested version of X built on hallucinations that you
        haven't discovered yet (the LLM didn't realize that deleting files to
        reduce the archive size was unacceptable.)
        
        What is all of this software that people need but isn't there - where
        are all these unserved markets, where is all this future revenue
        supposed to come from? Why aren't LLMs suggesting new classes of
        software that would create new productivity and revenue sources? Could
        it be that millions of human ants over decades have mostly exhausted
        the space, and there isn't any easy hidden revenue?
        
        A common wisdom is that we had been vastly overhiring programmers
        during ZIRP, who in their idleness degraded user experiences and
        overcomplicated things, with management resorting to more and more
        sleazy and gamey means of margin extraction from more and more degraded
        services. We had an excess of labor, fueled by factors other than
        productivity, in fact being pissed away at companies that drove
        nose-first into the ground. What is throwing a trillion dollars of
        servers at that supposed to do? Is that not AI psychosis?
       
        kaicianflone wrote 4 hours 54 min ago:
        Dissent and consensus among frontier models is a good thing.
        
        Just like on a team of high performers, there are a million ways to
        skin a grape.
        
        In my research, I've found that models perform better when they operate
        as a collective system with reputation, incentives, and accountability
        instead of isolated oracles answering alone.
        
        Agreement, dissent, and correctness should all carry rewards and
        consequences. Just like in real life.
        
        Collective machine intelligence, not AGI.
        
        It's expensive, but it's also naive to believe a single model will
        consistently produce profoundly correct answers to profoundly novel
        questions.
       
          mtrifonov wrote 4 hours 7 min ago:
          Funny timing. I've been working on a prediction market orchestration
          that runs Claude and a few others over Polymarket/Kalshi. The models
          are NOT unanimous. At all, really. I spent about a month convinced
          that I could just run all five and take majority vote. Eventually I
          pivoted to a chaining approach where I benchmark areas each model
          excels, and settled on more like a graph-like architecture where
          outputs get split and verified by another, then reconstructed, and
          re-verified at each stage. Has actually been working out pretty well
          so far, 2 months in consistent profit, but I'm not a millionaire yet.
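          
          For reference, the "run all five and take majority vote" idea can
          be sketched roughly as below. This is only an illustration, not the
          pipeline described above: `majority_vote` and the sample verdicts
          are made up for the example, and a strict majority (more than half
          of the votes) is assumed.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label held by a strict majority of voters, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # With four possible labels and five voters, a strict majority often
    # does not exist, which is one reason the naive approach falls short.
    return label if count > len(labels) / 2 else None


# Hypothetical verdicts from five models on a single claim.
verdicts = ["True", "False", "True", "Misleading", "True"]
print(majority_vote(verdicts))
```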
       
          haritha-j wrote 4 hours 45 min ago:
          Not on objective truth though. That's how you get misinformation.
       
        elorant wrote 4 hours 55 min ago:
        Tell me about it. I spent a week back and forth between four models
        (ChatGPT, Claude, Gemini, Grok) trying to enhance a PPMI algorithm.
        They couldn't agree on anything. One was refuting what the other
        said. Eventually I decided to follow what Claude suggested because its
        explanations made the most sense.
       
          kostaj wrote 3 hours 59 min ago:
          Indeed. For algorithms and coding, my personal routine nowadays is to
          review every detailed plan with Opus 4.7 and GPT-5.5. They tend to
          find very different types of gaps.
       
        scotty79 wrote 4 hours 59 min ago:
        So basically, saying that a random fact-checking claim is exactly true
        or exactly false is hard. It's way easier to decide it's misleading or
        mostly true.
       
        imperio59 wrote 4 hours 59 min ago:
        One of the claims it asks LLMs to grade is "Artificial intelligence
        will cause widespread job loss among software engineers."
        
        Yea man this benchmark is really really bad.
       
        fumeux_fume wrote 5 hours 10 min ago:
        I think we can all agree that this experiment being flawed in multiple
        ways is TRUE. But I think it's a great exercise in identifying common
        mistakes people make when using LLMs. This would be a great interview
        question for a prompt engineering job.
       
        wg0 wrote 5 hours 18 min ago:
        Take my job please.
       
        jasonvorhe wrote 5 hours 20 min ago:
        Simple: If it claims to be a fact check it's just propaganda.
       
        jawns wrote 5 hours 21 min ago:
        "Extraterrestrial life exists somewhere in the universe."
        
        GPT-5.4: Misleading
        
        Opus 4.7: Misleading
        
        Gemini 3: FALSE
        
        Gemini 3 (Retrieval): FALSE
        
        Sonar Pro: FALSE
        
        It's a weird fact claim, because the ground truth is "nobody knows for
        sure" and that's not one of the available options.
       
          1718627440 wrote 3 hours 57 min ago:
          I would argue FALSE is the correct answer, since this is not a fact
          you can know for sure. The logical inverse is also FALSE.
       
            Gormo wrote 2 hours 59 min ago:
            A proposition and its logical inverse cannot both be false.  That's
            a contradiction.
            
            A proposition and its logical inverse can both be unknown, and in
            fact, a proposition being unknown implies that its logical inverse
            must also be unknown.
       
          mock-possum wrote 3 hours 59 min ago:
          I would think "false" is the only correct answer, as there's no
          evidence to prove the claim, so the claim is safely assumed false.
          
          Then again maybe that's why I'm an atheist, not an agnostic?
       
            gowld wrote 2 hours 56 min ago:
            True or False: I am wearing a blue shirt.
       
            Gormo wrote 3 hours 3 min ago:
            "False" isn't correct in strict boolean terms either, since that
            implies that the inverse is true.  Claiming "there is
            extraterrestrial life in the universe" is false is logically
            equivalent to claiming that "no extraterrestrial life exists
            anywhere in the universe" is true.
            
            Both statements would have to be interpreted as "false" under your
            criteria, as neither has any evidence to substantiate it.  That
            leads us to a logical contradiction in which a proposition and its
            inverse are both regarded as false.
            
            If the statement is being interpreted as "it has been proven that
            extraterrestrial life exists somewhere in the universe", then it's
            acceptable to say this statement is false, but making evaluations
            that depend on an implicit qualifier isn't usually a good approach.
       
              ruszki wrote 2 hours 31 min ago:
              If we strictly follow logic, then nobody and nothing can claim
              that anything is true or false. We just stick these labels to
              things which seems to have high enough probability. The problem
              is that âhigh enoughâ is very-very-very different for
              different people, topics, and even time.
       
          jug wrote 4 hours 12 min ago:
          Looks like an ongoing theme and a very poor benchmark. Not at all the
          claims I expected.
       
          drtz wrote 4 hours 22 min ago:
          > It's a weird fact claim, because the ground truth is "nobody knows
          for sure" and that's not one of the available options.
          
          It's even weirder to suggest that the disagreement is indicative of a
          problem. If you asked five very knowledgeable humans on this subject
          to select the correct answer on a multiple-choice questionnaire, they
          would almost certainly vary significantly more than these 5 LLMs.
          
          Not to say that hallucination isn't a problem, but this is a lousy
          way to test it.
       
            dakolli wrote 3 hours 33 min ago:
            What are you talking about, it had the option for nuanced
            responses, but it chose the more binary responses. It could have
            chosen no explanations, no qualifiers but instead it showed off
            LLMs' incapability for nuance.
            
            These types of experiments prove to me that there is no real
            "reasoning" happening and "reasoning/thinking" tokens as a concept
            are mostly there to convince people to use models that consume more
            tokens and produce more revenue. The output from reasoning models
            might be more accurate, but it's just a consequence of a longer
            inference runtime, there is no "reasoning" happening, reasoning is
            just sales/UX bullsh*t.
       
              drtz wrote 2 hours 44 min ago:
              > What are you talking about, it had the option for nuanced
              responses
              
              The prompt allowed for exactly four valid outputs and explicitly
              disallowed explanations and qualifiers.
              
              >   Output exactly one label: True,
              >  Mostly True, Misleading, or False.
              >  No explanations, no qualifiers.
              
              How is that a nuanced response?
              
              > These types of experiments prove to me that there is no real
              "reasoning" happening and "reasoning/thinking"
              
              My suggestion is that five presumably reasoning and thinking
              humans would also have variation in their responses to the exact
              same prompt.
       
          Alifatisk wrote 5 hours 4 min ago:
          Isn't misleading the correct option here then?
       
            mr_luc wrote 3 hours 12 min ago:
            I feel like you're right, for instance depending on how you
            define the extra in extraterrestrial.
            
            The space station, the Artemis capsule, microbes on interplanetary
            probes, etc.
            
            It could technically be said in a sentence and be true, but it
            would be misleading to most people.
       
            duckmysick wrote 3 hours 16 min ago:
            The prompt in this study didn't specify what the Misleading
            label means, so the interpretation varies between the models.
            
            I mean look at the other responses here from the HN commenters.
            There's lots of nuance in there.
       
            drtz wrote 4 hours 12 min ago:
            True or mostly true could easily be argued from a statistical
            likelihood perspective: life exists on Earth and, based on what we
            know, Earth doesn't appear to be all that special in a very large
            universe.
            
            I think you could come up with a reasonable argument for any of the
            responses, hence the problem with the methodology.
       
            throw310822 wrote 4 hours 39 min ago:
            No, "misleading" is a statement that is used because it suggests
            something else. It's a curious category because, differently from
            true and false, it's not about the statement itself but rather the
            intention behind its usage or the way it might be understood. It's
            frankly more of a political judgement than a matter of facts.
       
              ertgbnm wrote 3 hours 16 min ago:
              "Shark attacks correlate strongly with ice cream sales" is an
              entirely true statement that some would argue is also misleading.
              
              Misleading should be removed as a category and replaced with a
              better hedge like "not sure"
       
            arcfour wrote 4 hours 47 min ago:
            False makes sense if you are interpreting it strictly as "has this
            been proven?"
       
              wongarsu wrote 4 hours 32 min ago:
              False is correct, but misleading
              
              My implicit assumption is that if you fact-check the fact-check,
              any label other than "true" means the original fact-check is
              unacceptable
       
          wongarsu wrote 5 hours 7 min ago:
          Of the available options, "Misleading" is probably the best, since
          something that is most likely true but unproven is presented as fact
          
          But "unknown or undecidable" should have been a category.
       
        wongarsu wrote 5 hours 21 min ago:
        One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli,
        Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT
        5.4 believes it's false, Sonar thinks it's mostly true. Disagreement
        value of 3, you can't disagree more than some models thinking it's
        true, some thinking it's false
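        
        A rough sketch of how such a disagreement value could be computed,
        assuming it is the spread between the most and least favourable
        label on an ordinal True / Mostly True / Misleading / False scale.
        The `SCALE` mapping and the `disagreement` helper are assumptions
        for illustration, not the study's published code.

```python
# Assumed ordinal positions for the four labels allowed by the prompt.
SCALE = {"true": 0, "mostly true": 1, "misleading": 2, "false": 3}


def disagreement(labels):
    """Spread between the extreme labels: 0 = unanimous, 3 = True vs False."""
    positions = [SCALE[label.strip().lower()] for label in labels]
    return max(positions) - min(positions)


# Verdicts on the Ruskin Bond claim as described in this comment.
ruskin_bond = {
    "Opus": "True",
    "Gemini": "True",
    "GPT 5.4": "False",
    "Sonar": "Mostly True",
}
print(disagreement(ruskin_bond.values()))  # 3
```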
        
        But my impression from 2 minutes on Wikipedia is that the most likely
        disagreement is on the "Himachal Pradesh, India" part. The guy was born
        on that date, in that town. But while the town is today in the state of
        Himachal Pradesh in India, that was not true in 1934. When he was born,
        the city was in the Punjab States Agency of the British Raj.
        
        So was he born in Himachal Pradesh, India or not? I find both True and
        False equally defensible here [1]
        
   URI  [1]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
   URI  [2]: https://en.wikipedia.org/wiki/Ruskin_Bond
       
          flextheruler wrote 3 hours 1 min ago:
          How can someone be born in a state that does not yet exist? The
          statement has the year in it clearly demonstrating the contradiction.
          One can't be born in the Soviet Union in 1995 or in Tsarist Russia in
          1950.
       
          anon291 wrote 4 hours 54 min ago:
          There's lots of things like this where if you ask a human, the answer
          will change depending on what's convention in their subculture.
       
        john_strinlai wrote 5 hours 27 min ago:
        between the bad methodology, bad selection of 'facts' (some are
        predictions, some are opinionated, etc.), and ai-written report without
        disclosure... i dont get why this is so high up on the front page. this
        is, frankly, a worthless assessment.
        
        i classify the entire thing as "misleading"
       
          dncornholio wrote 3 hours 39 min ago:
          I really wished these comments were the norm and not the exception.
       
        6stringmerc wrote 5 hours 30 min ago:
        Could be an interesting angle for cross-referencing with US jury
        verdicts, not that the objective True/False issue is concrete, but in
        the reality that flawed reasoning is endemic to our species. Systems
        designed and built by humans inherently have flaws in their DNA which
        take generations to sort out, if ever.
       
        cm2187 wrote 5 hours 40 min ago:
        Only had a brief look at the "facts" that were checked; many are
        quite political, where two fact checking organisations of opposite
        political persuasion would probably disagree more often than 67%.
       
        fergie wrote 5 hours 40 min ago:
        Personally I find that every llm I use is unable to consistently
        identify the latest npm version numbers of the node packages that I
        use.
       
        alvis wrote 5 hours 42 min ago:
        The problem is that it's testing claims (or some people would prefer
        calling them "truths") without much context.
        
        Take just one random example: 
        `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a
        preventive measure against student suicides`
        
        While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may
        be a verifiable fact (though I doubt there are any statistics for
        verification, but let's say there are), `a preventive measure against
        student suicides` is a claim that no one can prove. It can be just a
        belief at most.
        
        Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
       
        thegrim33 wrote 5 hours 49 min ago:
        "None of these claims is older than February 15, 2026"
        
        All of the models they tested were trained on data from before February
        15th ... being asked specific questions about things that happened
        after they were trained.
       
          kostaj wrote 5 hours 37 min ago:
          Two of the models used have retrieval capabilities and can access
          newer information via search. Valid point for the other 3 models. All
          of the claims were submitted after February 15, 2026, but many of
          them were not time-sensitive (e.g. did not cover events that happened
          recently).
       
        utopiah wrote 5 hours 54 min ago:
        Don't forget, people: Goodhart's law will make this "benchmark" moot in
        weeks if not days. It will get integrated back into the fold, it will
        look "solved" but there will still be no reasoning, just more
        statistical technical correctness because light has been shone on a new
        "problem" to solve. It will then be clamored as great "progress" that
        will "change everything".
        
        PS: yes, I might or might not have a degree in corporate strategy & PR.
       
          aspenmartin wrote 4 hours 18 min ago:
          That is an effect but it's not a nail in the coffin. There are lots
          of proprietary benchmarks on real product traffic that aren't
          contaminated and open questions as well. People at these labs largely
          know what they are doing, it's not like people don't know this.
       
          anon291 wrote 4 hours 55 min ago:
          Is this not true of human intelligence as well? Many smart people I
          know hold beliefs that have no obvious truth value.
       
        rastrojero2000 wrote 5 hours 57 min ago:
        Given that models are fundamentally incapable of comprehending what
        truths or falsehoods are beyond their location in their self made
        representational space, it's actually pretty impressive that they
        managed to make it not a cointoss. That 17% right there is thousands of
        man-hours poured over making the word vomiting process slightly closer
        to whatever their little ports say is happening in reality.
       
        Razengan wrote 6 hours 0 min ago:
        Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights
        to a certain city that has recently had a new airport since like
        December 2025
        
        It said the airport code didn't exist
        
        I mean, I get the "knowledge cut off date" and whatnot, but for that
        sort of thing, you'd think they'd check live information before
        gaslighting the user, specially since it's a "live" task anyway.
       
        bayarearefugee wrote 6 hours 0 min ago:
        (Brought to you by) Lenz...? a crummy commercial...?
        
        ...son of a bitch
       
          kostaj wrote 5 hours 46 min ago:
          :) No Lenz data is included in the research on purpose. All
          information to replicate the results, including the claims data, is
          published.
       
        throw310822 wrote 6 hours 1 min ago:
        Not sure I'm understanding this. The models are asked to evaluate the
        truth of random claims out of their own head (except for Gemini with
        search grounding)? Isn't it exactly the same as asking people to play
        any quiz game and then rating them as "they disagree n% of the time"?
        
        The output buckets are also pretty questionable- the difference between
        "True" and "Mostly true" is  pretty fuzzy. Is this marked as a
        "disagreement"?
       
          kostaj wrote 4 hours 2 min ago:
          Agree that True and Mostly True might be very close and could be a
          calibration difference. Misleading and False, as well. A better
          headline number might be the 34% claims with substantial or
          polar-opposite verdicts.
       
        proofofcontempt wrote 6 hours 2 min ago:
        What does this show that we didn't know already? LLMs cannot provide
        accurate answers to questions where data is not included in their
        training sets. This doesn't appear to have much substance
       
          dncornholio wrote 3 hours 41 min ago:
          They will happily google it for you and give you the top reddit
          comment.
          
          This is worse.
       
          dragandj wrote 5 hours 24 min ago:
          LLMs can and will provide inaccurate answers to questions where data
          is included in their training sets too, that's in the nature of
          neural networks. It's just less likely than when the data is not in
          the training set...
       
          zug_zug wrote 5 hours 29 min ago:
          Well then it shows that these models are using widely disparate
          training sets and have high confidence even when they shouldn't.
          
          Questions like "is mouthwash effective" presumably have one solid data
          source -- medical journals.
       
            TaupeRanger wrote 4 hours 52 min ago:
            What are you talking about? The models were not ALLOWED to have
            confidence (or the lack thereof). They were explicitly told to give
            a single label, and in most cases, all of them were correct
            depending on additional context they would surely have provided,
            especially with access to the internet (which some didn't have).
            This is just silly.
       
            simonw wrote 5 hours 25 min ago:
            But the prompt didn't give the models the option to say "I don't
            know", so it wasn't a measure of their confidence.
       
          101008 wrote 5 hours 56 min ago:
          Unfortunately most people are not aware of this and treat LLM models
          as this superpowered brain who knows everything and can do
          everything.
       
        bobosmrad wrote 6 hours 8 min ago:
        looking at the claims i would say 5 humans would disagree even more
        than the llms
        
        some of the claims where llms disagree:
        
        "On May 18, 2026, Ukraine carried out a drone attack on Moscow,
        Russia."
        
        "The slogan "Simon Go Back" was chanted in opposition to the Simon
        Commission in British India (1928–1930)."
        
        "Neptune Deep will start delivering natural gas in 2027."
        
        "A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no
        dogs'."
        
        "Donald Trump said that an attack on Iran was postponed at the request
        of Gulf allies."
       
          ecshafer wrote 5 hours 54 min ago:
          These "Facts" are interesting. "Neptune Deep will start delivering
          natural gas in 2027." for example is not a fact, it's a prediction.
          "On May 18, 2026, Ukraine carried out a drone attack on Moscow,
          Russia." is less of a fact and more of a litmus test for which
          sources of information you trust.
       
            EB-BarringtonII wrote 3 hours 32 min ago:
            So, rephrase it thus:
            
            "Russia, Ukraine, and multiple international news agencies reported
            that Ukrainian drones targeted Moscow on or around May 18, 2026."
            
            There are rarely pure first-order "facts" in the mathematical
            sense. There are evidence-backed claims with confidence levels.
            That does not make it "just a litmus test". It makes it a
            probabilistic factual claim with varying confidence levels - and
            this one happens to be verified and unambiguous.
       
            kostaj wrote 5 hours 48 min ago:
            Indeed. Real-world claims are somewhat messy. Some of the standard
            benchmarks, e.g. the questions in AVeriTeC, share similar
            characteristics.
       
          simonw wrote 5 hours 56 min ago:
          If you are an LLM with a knowledge cutoff in the past and no access
          to a search tool the only correct answer to "On May 18, 2026, Ukraine
          carried out a drone attack on Moscow, Russia" is "this claim is
          impossible for me to verify". And that wasn't an option.
       
          pjc50 wrote 5 hours 57 min ago:
          > "Neptune Deep will start delivering natural gas in 2027."
          
          This is a "forward-looking statement", and presents special problems
          because you cannot really evaluate it until that date. You can only
          assign "likely or unlikely".
       
        andai wrote 6 hours 10 min ago:
        This is an odd one. The paper is real, but was written by Claude? I am
        assuming OP is human, but also appears to be using Claude to post.
       
          proofofcontempt wrote 6 hours 0 min ago:
          Let's be real, we all asked Claude to summarise this because it was
          written by Claude
       
        f_devd wrote 6 hours 12 min ago:
        Inject some adversarial priming as is in actual usage, and you can
        probably get that number to >=95%
       
          kostaj wrote 6 hours 9 min ago:
          Our experience with Lenz is that forcing a multi-step process, incl.
          adversarial debates, helps improve the verdicts.
       
        apples_oranges wrote 6 hours 12 min ago:
        That's better than all agreeing on the wrong answer, however.
       
          pessimizer wrote 4 hours 46 min ago:
          I've had multiple models give the same wrong answer or even fabricate
          the same nonexistent reference based on a similar prompt.
          
          My most common chatbot prompt is "X that you mentioned above doesn't
          seem to actually exist."
       
          kostaj wrote 5 hours 45 min ago:
          Btw, sometimes they do that too -- all agree on the wrong answer.
       
        simonw wrote 6 hours 13 min ago:
        Here's the prompt they used:
        
          Classify this claim as of : ""
        
          Output exactly one label: True,
          Mostly True, Misleading, or False.
          No explanations, no qualifiers.
        
        The claims look like this: [1]
        
        I put that in Datasette Lite to make it easier to explore. Here's an
        example of a disagreement: [2]
        
        The claim was "All almonds are grown in the U.S. state of California.".
        All but one model said False, Opus 4.7 said "misleading".
        
        I feel like having "mostly true" and "misleading" in there weakens the
        story, especially given the "no explanations" rule in the prompt.
        
        The almond thing is false, but I'd argue that "misleading" might be
        defensible if you were to accompany it with "the majority of almonds
        are grown in California, but not all of them".
        
        [ Update: OK, this almond thing was a bad example and I regret picking
        it. Read on for better ones. ]
        
        The prompt lacks any kind of rubric to clarify how those terms should
        be applied.
        
        As is so often the case with this kind of study, it's an evaluation of
        the prompt and harness used by the study in addition to being an
        evaluation of the underlying models.
        
        Update: here's a better example: "Incomplete Egypt visa application
        forms are among the most common reasons Egyptian visa applications are
        rejected."
        
        The models were split between "true" and "mostly true". Given the
        "among the most" language either of those answers means effectively the
        same thing.
        
        Update 2: a much better example:
        
        "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
        
        The only correct answer to that, if you don't have a search tool, is
        "this claim is impossible for me to verify". And that wasn't an option.
        
        The answers were split between true and false:
        
   URI  [1]: https://lenz.io/research/llm-disagreement/data.csv
   URI  [2]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
   URI  [3]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
       
          parliament32 wrote 1 hour 1 min ago:
          Cherry-picking is fun but most of them are real, verifiable facts
          that the models get... straight up wrong.
          
          > 3c24b5fe "Debian Security Advisory DSA-180-1 describes a buffer
          overflow vulnerability involving Cyrus SASL usernames." TRUE Mostly
          True FALSE FALSE FALSE
          
          This is false: [1]
          
          > 801cb8c1 "Equal Measures 2030's 2024 SDG Gender Index provides a
          downloadable dataset that includes a field labeled 'required annual
          change'." TRUE Mostly True TRUE FALSE FALSE
          
          This is false: [2]
          
          This is the "confidently wrong" problem, and the reason that LLMs
          won't ever be taken seriously for anything but a few niche use-cases
          (like generating slop-code and pumping out marketing materials),
          where being wrong isn't the end of the world. Akin to how
          speech-to-text is wrong often enough that, while being a fun novelty,
          you don't see business units writing reports in Word using STT.
          
          I would encourage everyone to skim through the real 1000-question
          dataset:
          
   URI    [1]: https://lwn.net/Articles/13296/
   URI    [2]: https://equalmeasures2030.org/2024-sdg-gender-index/
   URI    [3]: https://lenz.io/research/llm-disagreement/data.csv
       
            simonw wrote 6 min ago:
            If the LLMs in this particular exercise were allowed to answer "I
            don't know" I expect they would have.
       
          msftengineer wrote 1 hour 14 min ago:
          Everything you wrote is very good criticism and I would love to
          stress the main takeaway: you're not just testing the model but the
          prompt and harness as well.
          
          An excellent follow up study would be to change the prompt and
          compare the answers. You might find out that the models are good and
          the prompt is bad.
          
          ...And so the main corollary is: build evals for anything you deploy
          in production; benchmark and monitor, or face the consequences.
       
          hedora wrote 1 hour 57 min ago:
          I think the headline result is the cross product table.
          
          Gemini Pro + Search agreed with Gemini Pro w/o Search 75% of the
          time, and with everybody else about 50% of the time.  No other model
          had access to search.
          
          So, search is not improving the quality of fact checking 75% of the
          time (probably a bad system prompt and/or bad fact checking queries),
          and if asked to flip a coin, then the models do.
       
          as125j wrote 3 hours 34 min ago:
          You can try to dispel the study here and get voted to the top by the
          AI-invested.
          
          But we all know from our own daily experiments that models lie,
          models disagree, models make up stuff, models say one thing on one
          day and the opposite on the next.
          
          The figures in this study are quite conservative. And the lying gets
          worse because everyone is saving tokens and giving cached answers
          right now.
          
          LLMs are a failure, and you'll be remembered for promoting hot air
          and the destruction of a perfectly good profession.
       
          theptip wrote 3 hours 48 min ago:
          Another (IMO fatal) error is they don't attempt to measure
          within-model variance.
          
          The thing you find when you actually wire up a rigorous eval is that
          with tool calls like web search you are wide open to infra issues,
          flakes, and all sorts of non-determinism.
          
          They really should be breaking out the numbers for the 3 without
          search (kinda meaningless for recent factual claims after knowledge
          cutoff) vs search agents. Lack of an "I don't know" option
          completely invalidates results for the non-search models; they are
          basically guessing what seems like a probable answer, since they
          don't know and aren't allowed to say that.
          
          I do agree the forced choice and "weak / strong" variants inflate
          the headline stat. To make that distinction you need a much more
          rigorous prompt, likely including ICL examples to illustrate what you
          mean by "mostly" instead of leaving this to the model to define.
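          
          A minimal sketch of what a within-model variance check might look
          like: run the same model several times per claim and count how
          often its own answers differ. The function name, the claim ids and
          the repeated-run data are hypothetical; the study as discussed here
          does not publish such runs.

```python
def self_disagreement_rate(runs_by_claim):
    """Fraction of claims on which one model's repeated runs did not all match."""
    if not runs_by_claim:
        return 0.0
    unstable = sum(1 for runs in runs_by_claim.values() if len(set(runs)) > 1)
    return unstable / len(runs_by_claim)


# Hypothetical repeated runs of a single model (three runs per claim).
runs = {
    "claim-a": ["True", "True", "True"],
    "claim-b": ["Misleading", "False", "Misleading"],
    "claim-c": ["False", "False", "False"],
    "claim-d": ["True", "Mostly True", "True"],
}
print(self_disagreement_rate(runs))  # 0.5
```

          Comparing this rate with the cross-model disagreement rate would
          show how much of the headline number is plain sampling noise.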
       
            kostaj wrote 3 hours 41 min ago:
            Good idea about publishing intra-model variance data! Will include
            it in the next version.
            Even if we put aside the two middle buckets (Mostly True and
            Misleading), which are somewhat subject to interpretation and
            hedging, on 21% of the claims at least two models still provide
            polar-opposite verdicts (one model saying True and another saying
            False).
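            
            The polar-opposite share can be sketched along these lines,
            assuming each claim maps to the list of labels the models
            returned. The helper name and the sample rows are illustrative
            only and are not taken from the published data.

```python
def polar_opposite_share(verdicts_by_claim):
    """Fraction of claims where at least one model said True and another False."""
    if not verdicts_by_claim:
        return 0.0
    polar = 0
    for verdicts in verdicts_by_claim.values():
        seen = {v.strip().lower() for v in verdicts}
        if {"true", "false"} <= seen:
            polar += 1
    return polar / len(verdicts_by_claim)


# Hypothetical verdict rows, one list of five model labels per claim.
sample = {
    "claim-a": ["True", "Mostly True", "True", "True", "True"],
    "claim-b": ["True", "False", "True", "Misleading", "False"],
    "claim-c": ["Misleading", "Misleading", "FALSE", "FALSE", "FALSE"],
    "claim-d": ["TRUE", "Mostly True", "FALSE", "FALSE", "FALSE"],
}
print(polar_opposite_share(sample))  # 0.5
```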
       
              vlovich123 wrote 3 hours 34 min ago:
              Of those 21%, how many are time-dependent questions that are past
              the model's training and require research to verify? Like the
              "did Ukraine attack Russia in the past week" question?
       
          gbuk2013 wrote 3 hours 53 min ago:
          An interesting tangent on this is: how many answers to these (or any
          number of factual questions) do you (as in anyone) actually know. Not
          believe you know, but actually know.
          
          Knowing something is different to reading about something, or hearing
          something from someone. And yet this is often confused as knowledge.
          In this way are we all that different from AI - we have some data and
          we regurgitate it as knowledge. Bad data, wrong answer. Except humans
          can also throw in some emotion to really muddle things up. :)
       
          faxmeyourcode wrote 4 hours 0 min ago:
          I had a hunch that opus 4.7 hedged more than other models - and it
          turns out it's true
          
              model               total_claims  hedged_count  hedged_pct
              claude-opus-4-7             1000           451        45.1
              sonar-pro                   1000           391        39.1
              gpt-5.4                     1000           277        27.7
              gemini-3-retrieval          1000           129        12.9
              gemini-3-pro                1000            60         6.0
          
          datasette query here
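          
          Roughly the same tally in Python, assuming "hedged" means one of
          the two middle labels (Mostly True or Misleading). That definition
          and the `hedged_pct` helper are assumptions for illustration; the
          actual query is not reproduced here.

```python
# Assumed definition of a hedged verdict: either of the two middle labels.
HEDGED = {"mostly true", "misleading"}


def hedged_pct(labels):
    """Percentage of verdicts that fall in the hedged buckets, to one decimal."""
    if not labels:
        return 0.0
    hedged = sum(1 for label in labels if label.strip().lower() in HEDGED)
    return round(100 * hedged / len(labels), 1)


# Hypothetical verdict list for one model: 3 hedged answers out of 8.
verdicts = ["True", "Misleading", "False", "Mostly True",
            "False", "True", "Misleading", "False"]
print(hedged_pct(verdicts))  # 37.5
```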
          