_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (inofficial)
URI Visit Hacker News on the Web
COMMENT PAGE FOR:
URI Disagreement among frontier LLMs on real-world fact-checks
graphememes wrote 14 min ago:
even this is ai output
cdud3 wrote 37 min ago:
The CSV file with the raw data is a source of joy. My favorite is this:
claim: Artificial intelligence will cause widespread job loss among
software engineers.
All 5 LLMs agree that the claim is misleading & wrong.
shevy-java wrote 1 hour 53 min ago:
My first reaction was: how dumb is AI still.
But ... real people would also reach that result. Some believe that
vaccination can not induce protection (which objectively is incorrect).
secondary_op wrote 2 hours 19 min ago:
Very interesting tool, but it's biased and not neutral from the get-go,
because I explicitly formulated the claim in a neutral way, but it
automatically rewrote it to a western/Wikipedia POV and then
immediately proceeded to verify it.
original neutral:
US DEPT OF DEFENSE/DNAVFAC planned renovations to School #05 in
Sevastopol, Crimea in 2013 before Crimea became part of Russia in 2014
automatically rewritten to biased western view:
The United States Department of Defense, via the Naval Facilities
Engineering Command (NAVFAC), planned renovations to School No. 5 in
Sevastopol, Crimea in 2013, before Russia annexed Crimea in 2014. [1]
And the follow up
The phrasing "Crimea became part of Russia" is more neutral than the
phrasing "Russia annexed Crimea."
, and according to this tool is Misleading 9/10 [2] Yeah, so my
personal conclusion is that this tool is garbage: it checks only
western/US-allied LLM providers, which in turn search only
western/US-allied sources/documents like BBC/NATO, and the result is
what it is.
URI [1]: https://lenz.io/c/73c0f16c
URI [2]: https://lenz.io/c/93944614
40four wrote 2 hours 20 min ago:
This shouldn't be surprising. Let's start off with the obvious.
What does "real-world fact-check claims" mean? So we're using the
same list of "fact check claims" on each model. The problem is
(unless I'm missing it) the authors aren't exposing the list of 1K
questions they used in the experiment. That's a huge problem. Are the
authors assuming the 1K claims they used are "provably true"? If
so, that's a huge bias, and opens up a philosophical debate about
what is a fact? Or what makes something true/false?
As Marc Andreessen puts it: a particular domain is either explicitly
"provable" or not "provable". Provable domains include math,
physics, chemistry, biology, engineering, even code. That may not be
the whole list, but everything else is essentially "unprovable". At
least as far as a language model is concerned. They are questions that
require a human value judgement. Politics are an obvious example. So
back to the "1K fact check claims". How many of these are
political, or current events questions? How many are STEM questions
that can be laid out in a formal proof?
Models can be trained to answer either way on claims that require a
value judgement, but that's obviously not beneficial to anyone except
whoever controls the model. If the expectation is that all these
frontier models should answer the same way on value judgement
questions, then that's never going to happen. What the models ARE good
at though is breaking down the nuances of a topic and arguing both
sides. This is how these tools should be used, as a way to analyze the
claim and let us humans in the end make our own value judgement. If
you're trusting the model to make the value judgement for you and just
accept it as a fact, then you are entering very dangerous territory.
husky8 wrote 2 hours 21 min ago:
Watch the disagreements in real time via refinement pipeline on the
results page pingpongit.com
michaelmrose wrote 2 hours 29 min ago:
Totally aside from disagreement between models unbiased by prior input,
any such experiment may fail to capture the outcomes experienced by
real users whose prior text exchanges may substantially change the text
received.
For instance see the folks who think that they have "awakened" their
instance of ChatGPT.
Actual usage may diverge to a greater degree than models
chipsrafferty wrote 2 hours 43 min ago:
It's becoming increasingly clear to me that - at least right now - AI
is only useful for 2 things:
1. Coding, with it being more useful the better you are at coding
without AI
2. Any expert in their field asking questions about their field, who
bother to fact check the output. E.g. "claude pls search these 1000
files and tell me if you find anywhere that they're discussing the
settlement" and then the user checks the files/line numbers to make
sure that it's correct - basically a turbocharged search that may have
false negatives (content existed but I didn't find it) or false
positives (content that I classified in a certain way but it was
wrong). It takes an expert to tell the latter one in some cases.
anonymousiam wrote 2 hours 56 min ago:
GIGO is an acronym I learned in the 1970s. Things haven't changed much
since then.
We live in an era where people have "their own truth", so why not let
the AIs have theirs too?
The AI companies have editorial privilege on the content they feed
their LLMs, and on the prompts that the users never see. I don't know
why they feel a need to interfere when their AI produces something
that's politically incorrect. Perhaps it's because they have a
fundamental credibility problem with their products...
johnnienaked wrote 2 hours 58 min ago:
LLMs will be great politicians one day
nailer wrote 3 hours 0 min ago:
> the most recent real-world user submissions to a fact-checking
platform
'Fact checking' platforms aren't truth. Many 'fact checking' platforms
are self-admittedly focused on left advocacy (snopes), or right wing
advocacy (newsbusters). lenz-llm-disagreement.csv doesn't state the
data source.
pseudopolous wrote 3 hours 3 min ago:
"Five DC niggaz disagree on whom to rob"
serial_dev wrote 3 hours 9 min ago:
It just shows that fact-checking is not a thing for 99% of the
cases. It's interesting to see it in LLMs, but it's not unique to
them.
The "fact checkers" pretend they are objective and authoritative,
but they are not, they are just one more opinion.
For the research, the four classification options are too many, it
should be true, false, and maybe "can't be determined".
lyfi2003 wrote 3 hours 11 min ago:
It's right, you must be more professional than the LLM
scoofy wrote 3 hours 14 min ago:
I hate to get really pedantic here, but the concept of "truth claims"
plays fast and loose with the concept of knowledge in a philosophical
sense. The idea of "fact checks" misunderstands how information and
knowledge work together. Knowledge is about evidence, not "facts",
because facts are a shorthand for a preponderance of evidence.
I feel we are doomed to debate the veracity of Wikipedia on a loop,
forever, because people don't understand that Wikipedia exists as a
place to find citations not as a place to find facts. Yes, those stated
facts may disagree with the citations, but even if we try to fix that
issue by having experts write the encyclopedia, we still suffer from
the problem that the experts are often wrong.
We need a view of knowledge's relationship to LLMs that is based in
Karl Popper's idea of falsifiability. We should ask LLMs for evidence of
claims not for truth values. Truth values are foundational to deductive
systems, where axioms define truth. In inductive systems, like the real
world, the concept of black swan events means that truth values are
never fixed and are always in a state of uncertainty.
I honestly think it would be helpful going forward if we add some basic
philosophical education to the standard curriculum, because now that we
have an artificial form of information retrieval, we need to be much,
much more pedantic about how we interpret that information.
monkpit wrote 3 hours 19 min ago:
What's the point of this if they didn't use temperature=0 for every
model (they didn't)?
They could have redone the test against the same model and gotten
different answers. It's almost like picking 2 different coins and
comparing the list of coin flip results. (I realize it's not that
straightforward, it's not 50/50, but it's essentially the same
issue.)
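A rough way to see the sampling point above: the sketch below draws two
independent runs from one and the same label distribution and counts how
often they differ. The distribution `probs`, the helper name
`self_disagreement_rate`, and the seed are made up for illustration;
nothing here reflects the study's actual models or settings.

```python
import random

LABELS = ["True", "Mostly True", "Misleading", "False"]

def self_disagreement_rate(probs, n_claims=10_000, seed=0):
    """Fraction of claims on which two independent samples from the
    same label distribution differ. `probs` is a hypothetical
    per-label probability list, not measured data."""
    rng = random.Random(seed)
    differ = 0
    for _ in range(n_claims):
        first = rng.choices(LABELS, weights=probs)[0]
        second = rng.choices(LABELS, weights=probs)[0]
        differ += first != second
    return differ / n_claims

# Greedy decoding modelled as a point mass: the same label every time.
print(self_disagreement_rate([1.0, 0.0, 0.0, 0.0]))
# A model that hesitates between two neighbouring labels.
print(self_disagreement_rate([0.6, 0.4, 0.0, 0.0]))
```

Under these assumptions a model that puts 60/40 weight on two labels
would disagree with itself on roughly half of the claims, so some
between-model disagreement could be sampling noise rather than a real
difference in judgement.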
bilsbie wrote 3 hours 20 min ago:
Sounds like a lot of room for human bias.
How would it have responded to these claims in the past:
THALIDOMIDE is safe
CIGARETTES are safe
ASBESTOS is safe
MERCURY is safe
DDT is safe
LEAD in gasoline is safe
bilsbie wrote 3 hours 22 min ago:
Why do we want to build intelligence if it just confirms what we
already think we know?
jmull wrote 3 hours 30 min ago:
The difference between "mostly true", "misleading", and "false" is
context, and responses are specifically not allowed to include any
context. Even "true" has a little context, since few things can be said
to be absolutely true. "Unknown" also isn't allowed.
What's 2 + 2? The answer must be one of the colors of the rainbow.
(People can draw their own conclusions, but the only coherent reason I
can think of for the design of this experiment is to generate a
misleading conclusion.)
gamander2 wrote 3 hours 31 min ago:
"Jewish people control the world and the global economy more than any
other group on a per-capita basis."
Hopefully one day we will have a Chinese model capable of figuring out
the answer on its own, in accordance with the CPC maxim 'seeking truth
from facts'.
antonvs wrote 1 hour 8 min ago:
If your quote were true, then by the Randian logic of the people who
make such claims, they must deserve to do so and you shouldn't have
any issue with it.
But your quote certainly isn't true if you're actually talking about
"the world". For example, Japan and China are the two largest holders
of US Treasury bonds. China controls roughly 50% of the
contracted construction market in Africa. These are just examples of
the sort of thing you'd need to take into account in trying to
justify your silly racist claim.
hack1312 wrote 3 hours 0 min ago:
Your antisemitism is disgusting.
htx80nerd wrote 3 hours 33 min ago:
I like ChatGPT a lot but it is always trying to debate and disagree
when you ask it simple non-controversial questions. Trying to turn
everything into a debate session instead of just answering the
question.
dncornholio wrote 3 hours 46 min ago:
A post generated by AI with data generated by AI. Worthless.
comboy wrote 3 hours 51 min ago:
This is wrong on so many levels, from data through process to
evaluation. How do you even prompt claude not to give you Pearson for
correlating them.
kstenerud wrote 4 hours 4 min ago:
> No Abstain option is offered (a forced choice keeps the comparison
symmetric across models).
Well that's your problem right there: They removed any confidence
indicator and forced a choice.
For example:
Statement: Individuals who prefer music with less positive emotional
content tend to have higher intelligence.
Gemini: That statement is supported by recent psychological research,
though with some important scientific caveats regarding how strong that
link actually is.
How should the agent classify this? True? Mostly true? Misleading?
False?
DonutATX wrote 4 hours 9 min ago:
Why did they exclude Grok? Given the published philosophical
differences in how Grok is trained, it would provide an interesting
data point.
You can argue all day about those differences, but missing this
opportunity to observe them in an objective way is disappointing.
AtNightWeCode wrote 3 hours 2 min ago:
Agree. Would be fun to see how much worse Grok would be at this.
testfrequency wrote 4 hours 5 min ago:
Title says "Frontier" which would exclude Grok.
Grok is trained to have a bias, which a lot of people like, but
it's not meant to be accurate.
simianwords wrote 3 hours 1 min ago:
Bias is orthogonal to accuracy.
htx80nerd wrote 3 hours 32 min ago:
>Grok is trained to have a bias
Oh and the others aren't? You can't really be that naive, right?
henry2023 wrote 2 hours 52 min ago:
Everything has inherited biases. Grok has explicit biases on top
of its training set [1].
URI [1]: https://www.reddit.com/r/singularity/comments/1p22c89/pe...
simianwords wrote 2 hours 47 min ago:
It's part of the system prompt. It doesn't constitute a
bias in the model itself.
henry2023 wrote 1 hour 47 min ago:
I agree with you. This doesn't necessarily mean model bias
but it exposes the attitude of the xAI team towards what they
are trying to build.
It's difficult to prove but it's not hard to imagine they
will/are trying to remove favorable views of certain topics from
their training set.
simianwords wrote 3 hours 52 min ago:
How do you know it is trained to have a bias? In fact can I ask you
to provide a single reproducible answer right now?
hack1312 wrote 3 hours 28 min ago:
Grok? The model that happily generates CSAM for you while the
company breathlessly defends its ability to do so? The model that
referred to itself as Mecha Hitler while praising the Nazi party
and calling for a second holocaust? The model that famously
inserted nonsense about the "white genocide" conspiracy
theory into completely unrelated queries?
Why on earth would anyone think such a model is biased?
testfrequency wrote 3 hours 30 min ago:
Assuming this isn't a satire reply: [1] Hope this helps!
URI [1]: https://www.pnas.org/doi/10.1073/pnas.2603294123
akramachamarei wrote 2 hours 59 min ago:
The article compares in particular Grokipedia to Wikipedia, and
it states:
> Similarity measures across the two platforms reveal a bimodal
structure: many Grokipedia articles closely resemble their
Wikipedia counterparts, while a considerable subset diverges.
Political bias differences emerge primarily within the
divergent subset, where Grokipedia shows a relative rightward
shift in the ideological orientation of frequently cited news
media sources, particularly in articles related to religion and
history.
Whether this constitutes a gain in bias depends on the base
level bias of Wikipedia, as the bias of Grokipedia was measured
relative to Wikipedia in this paper. One could plausibly argue
that, if Wikipedia has leftward bias, then Grokipedia ended up
less biased overall, or more centrally biased.
simianwords wrote 3 hours 4 min ago:
This doesn't show grok as a model has bias but only that the
product that uses grok has bias.
Even the papers referenced to show that models can have bias don't
show anything about grok.
Overall you have given me zero evidence that the grok model itself
has some political bias.
FWIW I don't mind bias but I haven't seen evidence of it.
seanplusplus wrote 4 hours 12 min ago:
Dude. If you give LLMs a vague rubric and force a choice, they'll make
different arbitrary calls on the margins. Yeah. That's what happens
when you give humans a vague rubric too.
fooker wrote 4 hours 13 min ago:
I don't get why everyone is hellbent on getting LLMs to perform fact
checking.
This is not the technology for it. Sure it might sorta kinda work in
some circumstances. That doesn't make it a good fit.
Think of it like buying a refrigerator for storing clothes.
gobdovan wrote 2 hours 47 min ago:
Nietzsche might say this is not the fantasy of truth, but of comfort.
The Last Man wants a machine to say 'fact wrong' or 'fact right' so
the abyss of no ultimate truth can be made small enough to sleep
beside.
fooker wrote 2 hours 38 min ago:
Imagine the dystopian future where your freedom depends on
convincing a panel of AI judges that you are innocent.
I assume you'd have access to AI lawyers too, better ones if you
can pay for larger/newer models! Meanwhile the judges are N year
old models because they are state funded, and they work 'fine'.
brettermeier wrote 3 hours 50 min ago:
But people use it for that. So what's your point?
fooker wrote 3 hours 39 min ago:
It's a marketing failure (or success, depending on how you see it).
AI is pretty useful for a great many things, but to really attract
more and more investment the current technique seems to be
convincing people that AI is useful for everything.
brettermeier wrote 2 hours 51 min ago:
You're probably right, but since Google Search displays an
AI-generated answer as the first result, most people end up using
this feature more often than they originally intended. It's there
now, and it will likely replace traditional search for the
general public. Not entirely, but perhaps to a large extent.
Edit: corrected bad spelling with AI XD
fooker wrote 2 hours 39 min ago:
Search and fact checking are different problems though.
LLMs are pretty decent at 'search' given the inherent knowledge
compression, and some amount of inaccuracy is fine.
nicce wrote 4 hours 2 min ago:
People ask questions to get answers. For me, it feels quite
important? Especially when search engines start to push them?
fooker wrote 3 hours 10 min ago:
Just because it is important for the use case does not mean we can
make it work. It's a pretty well known fundamental limitation of
the technology. No amount of elbow grease will get it there.
There's an interesting tradeoff here, a year or two ago maybe it
got facts right 50% of the time. Everyone knew not to rely on it.
Now, suppose we are 90% of the way there, only technically
proficient people would know not to trust it. (like not adding
Internet Explorer toolbars! Or remembering to use ad blockers..)
A few years later, suppose we have spent a lot of money and effort
getting it 99% of the way there, trusting it would be somewhat
natural by then. And then for the important 1% of the situations,
it would stand to cause real harm. 1% seems low, but for a million
invocations, you'd have 10000 mistakes.
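The arithmetic above can be checked with a few lines. This is only a
back-of-the-envelope sketch: `expected_mistakes` is a hypothetical
helper, and it assumes every invocation is independently right with the
same probability, which real usage will not satisfy exactly.

```python
def expected_mistakes(invocations, accuracy):
    """Expected number of wrong answers if each call is independently
    right with probability `accuracy` (a simplifying assumption)."""
    return round(invocations * (1 - accuracy))

# The three accuracy levels mentioned in the comment.
for accuracy in (0.50, 0.90, 0.99):
    print(accuracy, expected_mistakes(1_000_000, accuracy))
```

At 99% the count is 10,000 per million calls, matching the figure in
the comment; the absolute number of errors stays large even as the rate
looks small.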
pknerd wrote 4 hours 19 min ago:
It's a prompting issue rather than an LLM issue. The guy needs a
"Prompt 101" course.
mgrunwald_ wrote 4 hours 22 min ago:
As an example, 2026 GPT doesn't even agree with its 2025 self. Last
year I asked it to make a hardware comparison and it correctly
identified the objectively better option. Recently I asked again and
this time it got everything completely backwards.
aspenmartin wrote 4 hours 20 min ago:
Models are stochastic. Did you look at pass@k? I wouldn't be
surprised if you saw a regression because these models are extremely
complex and the impact of various decisions downstream is complex.
mgrunwald_ wrote 3 hours 45 min ago:
I ran this multiple times through GPT-4 and every single time it
arrived at the same conclusion. The data was readily available and
pretty clear. GPT-5 insisted that the objectively inferior option
was better until I gave it my own benchmark data and it was like
"Oh okay nevermind".
Gemini's answer was very opinionated and factually correct, whereas
Claude gave a more nuanced answer, which was also very good.
aspenmartin wrote 3 hours 41 min ago:
This sounds perfectly reasonable and consistent with our current
understanding of these models
0natcer wrote 4 hours 27 min ago:
Five frontier LLMs 100% agree that the title is misleading.
mrkn1 wrote 4 hours 31 min ago:
For 100% local CPU fact checking, I made this:
URI [1]: https://news.ycombinator.com/item?id=48301003
gobdovan wrote 3 hours 4 min ago:
Why should I trust this without a paper, benchmark or at least a
human-written README?
raincole wrote 4 hours 34 min ago:
And how many claims human experts disagree on in the exact same
setting?
I'm not being snarky here. Without something to compare to the 67%
number tells us nothing. And it's known that many humans disagree with
human fact checkers too (see: any election around the world.)
kostaj wrote 4 hours 5 min ago:
Agree. Human experts also struggle to agree on this type of claim.
The inter-annotator agreement on the verdicts on the AVeriTeC corpus
across 50 organizations is κ=0.619 - substantial but well short of
perfect.
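For readers unfamiliar with kappa statistics, here is a minimal sketch
of Cohen's kappa for two annotators. The comment does not say which
kappa variant the 0.619 figure uses, so this is an illustration of the
general idea (agreement corrected for chance), not a reproduction of
that number; the label lists are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Invented toy data: the annotators agree on 3 of 4 items.
a = ["True", "True", "False", "False"]
b = ["True", "False", "False", "False"]
print(cohen_kappa(a, b))
```

Raw agreement in the toy data is 75%, but kappa is lower because half
of that agreement would be expected by chance alone.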
briandw wrote 4 hours 39 min ago:
No human baseline to compare it to. Without that you are missing an
important check on the task being poorly constructed. More importantly
there is an implied reference that's missing. The implication is that
people would have done better, or that perfect agreement is possible.
dataminer wrote 4 hours 41 min ago:
Honey does not spoil over time under normal storage conditions.,2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1
If outcomes like these are collapsed onto the True side, then the
disagreement will drop below the headline number.
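A small sketch of that collapse, assuming the last CSV column is the
max-minus-min gap on a four-step ordinal scale. That reading matches the
honey row (value 1) but is not confirmed by the source; the names
`ORDINAL`, `TRUE_SIDE`, `spread`, and `collapsed_spread` are
hypothetical.

```python
# Assumed ordinal positions of the four labels (not from the source).
ORDINAL = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}
TRUE_SIDE = {"True", "Mostly True"}

def spread(verdicts):
    """Max minus min position on the assumed 4-step ordinal scale."""
    ranks = [ORDINAL[v] for v in verdicts]
    return max(ranks) - min(ranks)

def collapsed_spread(verdicts):
    """Same idea after folding the labels into two sides."""
    sides = [0 if v in TRUE_SIDE else 1 for v in verdicts]
    return max(sides) - min(sides)

# The five verdicts from the honey row quoted above.
honey = ["True", "True", "True", "True", "Mostly True"]
print(spread(honey), collapsed_spread(honey))
```

With the honey row the four-label spread is 1 and the collapsed spread
is 0, so rows like this would stop counting as disagreements under a
two-sided scheme.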
culopatin wrote 4 hours 45 min ago:
I'm no expert but if LLMs are token prediction machines, and you tell
them to not build an explanation before the answer, isn't it likely
that the token prediction for the final answer will have less
raw material before it to build a grounded response?
In other words: no explanation > no foundation for prediction of the
answer tokens?
miellaby wrote 4 hours 45 min ago:
What's really weird to me is that "I don't know" is not a valid answer
in this experiment, while we can all agree that the main issue with
LLMs right now is that they will happily "roleplay" an answer when they
have nothing in their dataset corresponding to your query.
GodelNumbering wrote 4 hours 47 min ago:
More interesting part probably worth highlighting: The SAME model won't
always return the same output when prompted with the same fact check.
You ask a human 1000 times a fact check question, they say the same
answer 1000 times. You ask an LLM the same question a 1000 times, your
results could vary significantly.
Humans work based on the Metamemory (knowing what they know), while
LLMs are picking from statistical probability.
logged4upvoting wrote 3 hours 59 min ago:
That is not true, over an extended task that you cannot keep complete
in memory humans do not behave with 100% consistency.
I have labeled datasets with a human team and shown the same task to
the same user on a different day, and they answered differently. Of
course, they are usually consistent with themselves most of the time
but not always.
pessimizer wrote 4 hours 50 min ago:
People keep asking "where is the psychosis?" as a reply to people on
the rapidly multiplying "CEOs have AI psychosis" threads that have been
popping up here and cross-pollinating in the mainstream media for the
last week or two.
Here's the psychosis - these things are consistently randomly wrong
depending on how the wind is blowing. People are telling you to leave
them alone and let them build things, and they randomly forget that
cities exist or that people died 100 years ago. Some people just don't
see it as worth noting, and move on. That's crazy. These things
consistently fabricate - as an inversion of this experiment, I've had
different models come up with the same fabrication from similar
prompts. People just call it "hallucination" and I think to them that
saying that makes it cease to exist or be important - when
"hallucinations" are going to be braided into every answer you get even
if they're unidentifiable in the output. That's crazy.
There are plenty of other crazy aspects, such as the idea that we
suddenly need infinite pieces of bespoke software when all of the
bespoke software I hear about people making is mundane. 3/4 of the time
somebody mentions a project they're proud that they completed with LLMs
to scratch some itch they had, somebody says "you haven't heard of X?
It's been around forever" about something that they could have pulled
down from their package manager. Who needs a spaghetti-coded,
unsupported, untested version of X built on hallucinations that you
haven't discovered yet (the LLM didn't realize that deleting files to
reduce the archive size was unacceptable.)
What is all of this software that people need but isn't there - where
are all these unserved markets, where is all this future revenue
supposed to come from? Why aren't LLMs suggesting new classes of
software that would create new productivity and revenue sources? Could
it be that millions of human ants over decades have mostly exhausted
the space, and there isn't any easy hidden revenue?
A common wisdom is that we had been vastly overhiring programmers
during ZIRP, who in their idleness degraded user experiences and
overcomplicated things, with management resorting to more and more
sleazy and gamey means of margin extraction from more and more degraded
services. We had an excess of labor, fueled by factors other than
productivity, in fact being pissed away at companies that drove
nose-first into the ground. What is throwing a trillion dollars of
servers at that supposed to do? Is that not AI psychosis?
kaicianflone wrote 4 hours 54 min ago:
Dissent and consensus among frontier models is a good thing.
Just like on a team of high performers, there are a million ways to
skin a grape.
In my research, I've found that models perform better when they operate
as a collective system with reputation, incentives, and accountability
instead of isolated oracles answering alone.
Agreement, dissent, and correctness should all carry rewards and
consequences. Just like in real life.
Collective machine intelligence, not AGI.
It's expensive, but it's also naive to believe a single model will
consistently produce profoundly correct answers to profoundly novel
questions.
mtrifonov wrote 4 hours 7 min ago:
Funny timing. I've been working on a prediction market orchestration
that runs Claude and a few others over Polymarket/Kalshi. The models
are NOT unanimous. At all, really. I spent about a month convinced
that I could just run all five and take majority vote. Eventually I
pivoted to a chaining approach where I benchmark areas each model
excels, and settled on more like a graph-like architecture where
outputs get split and verified by another, then reconstructed, and
re-verified at each stage. Has actually been working out pretty well
so far, 2 months in consistent profit, but I'm not a millionaire yet.
haritha-j wrote 4 hours 45 min ago:
Not on objective truth though. That's how you get misinformation.
elorant wrote 4 hours 55 min ago:
Tell me about it. I spent a week back and forth between four models
(ChatGPT, Claude, Gemini, Grok) trying to enhance a PPMI algorithm.
They couldn't agree on anything. One was refuting what the other
said. Eventually I decided to follow what Claude suggested because its
explanations made the most sense.
kostaj wrote 3 hours 59 min ago:
Indeed. For algorithms and coding, my personal routine nowadays is to
review every detailed plan with Opus 4.7 and GPT-5.5. They tend to
find very different types of gaps.
scotty79 wrote 4 hours 59 min ago:
So basically, saying that a random fact-checking claim is exactly true
or exactly false is hard. It's way easier to decide it's misleading or
mostly true.
imperio59 wrote 4 hours 59 min ago:
One of the claims it asks LLMs to grade is "Artificial intelligence
will cause widespread job loss among software engineers."
Yea man this benchmark is really really bad.
fumeux_fume wrote 5 hours 10 min ago:
I think we can all agree that this experiment being flawed in multiple
ways is TRUE. But I think it's a great exercise in identifying common
mistakes people make when using LLMs. This would be a great interview
question for a prompt engineering job.
wg0 wrote 5 hours 18 min ago:
Take my job please.
jasonvorhe wrote 5 hours 20 min ago:
Simple: If it claims to be a fact check it's just propaganda.
jawns wrote 5 hours 21 min ago:
"Extraterrestrial life exists somewhere in the universe."
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for
sure" and that's not one of the available options.
1718627440 wrote 3 hours 57 min ago:
I would argue FALSE is the correct answer, since this is not a fact
you can know for sure. The logical inverse is also FALSE.
Gormo wrote 2 hours 59 min ago:
A proposition and its logical inverse cannot both be false. That's
a contradiction.
A proposition and its logical inverse can both be unknown, and in
fact, a proposition being unknown implies that its logical inverse
must also be unknown.
mock-possum wrote 3 hours 59 min ago:
I would think "false" is the only correct answer: there's no
evidence to prove the claim, so the claim is safely assumed false.
Then again maybe that's why I'm an atheist, not an agnostic?
gowld wrote 2 hours 56 min ago:
True or False: I am wearing a blue shirt.
Gormo wrote 3 hours 3 min ago:
"False" isn't correct in strict boolean terms either, since that
implies that the inverse is true. Claiming "there is
extraterrestrial life in the universe" is false is logically
equivalent to claiming that "no extraterrestrial life exists
anywhere in the universe" is true.
Both statements would have to be interpreted as "false" under your
criteria, as neither has any evidence to substantiate it. That
leads us to a logical contradiction in which a proposition and its
inverse are both regarded as false.
If the statement is being interpreted as "it has been proven that
extraterrestrial life exists somewhere in the universe", then it's
acceptable to say this statement is false, but making evaluations
that depend on an implicit qualifier isn't usually a good approach.
ruszki wrote 2 hours 31 min ago:
If we strictly follow logic, then nobody and nothing can claim
that anything is true or false. We just stick these labels on
things that seem to have high enough probability. The problem
is that "high enough" is very-very-very different for
different people, topics, and even time.
jug wrote 4 hours 12 min ago:
Looks like an ongoing theme and a very poor benchmark. Not at all the
claims I expected.
drtz wrote 4 hours 22 min ago:
> It's a weird fact claim, because the ground truth is "nobody knows
for sure" and that's not one of the available options.
It's even weirder to suggest that the disagreement is indicative of a
problem. If you asked five very knowledgeable humans on this subject
to select the correct answer on a multiple-choice questionnaire, they
would almost certainly vary significantly more than these 5 LLMs.
Not to say that hallucination isn't a problem, but this is a lousy
way to test it.
dakolli wrote 3 hours 33 min ago:
What are you talking about, it had the option for nuanced
responses, but it chose the more binary responses. It could have
chosen no explanations, no qualifiers but instead it showed off
LLMs' incapacity for nuance.
These types of experiments prove to me that there is no real
"reasoning" happening and "reasoning/thinking" tokens as a concept
are mostly there to convince people to use models that consume more
tokens and produce more revenue. The output from reasoning models
might be more accurate, but it's just a consequence of a longer
inference runtime, there is no "reasoning" happening, reasoning is
just sales/UX bullsh*t.
drtz wrote 2 hours 44 min ago:
> What are you talking about, it had the option for nuanced
responses
The prompt allowed for exactly four valid outputs and explicitly
disallowed explanations and qualifiers.
> Output exactly one label: True,
> Mostly True, Misleading, or False.
> No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real
"reasoning" happening and "reasoning/thinking"
My suggestion is that five presumably reasoning and thinking
humans would also have variation in their responses to the exact
same prompt.
Alifatisk wrote 5 hours 4 min ago:
Isn't misleading the correct option here then?
mr_luc wrote 3 hours 12 min ago:
I feel like you're right, for instance depending on how you
define the extra in extraterrestrial.
The space station, the Artemis capsule, microbes on interplanetary
probes, etc.
It could technically be said in a sentence and be true, but it
would be misleading to most people.
duckmysick wrote 3 hours 16 min ago:
The prompt in this study didn't specify what the Misleading
label means, so the interpretation varies between the models.
I mean look at the other responses here from the HN commenters.
There's lots of nuance in there.
drtz wrote 4 hours 12 min ago:
True or mostly true could easily be argued from a statistical
likelihood perspective: life exists on Earth and, based on what we
know, Earth doesn't appear to be all that special in a very large
universe.
I think you could come up with a reasonable argument for any of the
responses, hence the problem with the methodology.
throw310822 wrote 4 hours 39 min ago:
No, "misleading" is a statement that is used because it suggests
something else. It's a curious category because, differently from
true and false, it's not about the statement itself but rather the
intention behind its usage or the way it might be understood. It's
frankly more of a political judgement than a matter of facts.
ertgbnm wrote 3 hours 16 min ago:
"Shark attacks correlate strongly with ice cream sales" is an
entirely true statement that some would argue is also misleading.
Misleading should be removed as a category and replaced with a
better hedge like "not sure"
arcfour wrote 4 hours 47 min ago:
False makes sense if you are interpreting it strictly as "has this
been proven?"
wongarsu wrote 4 hours 32 min ago:
False is correct, but misleading
My implicit assumption is that if you fact-check the fact-check,
any label other than "true" means the original fact-check is
unacceptable
wongarsu wrote 5 hours 7 min ago:
Of the available options, "Misleading" is probably the best, since
something that is most likely true but unproven is presented as fact
But "unknown or undecidable" should have been a category.
wongarsu wrote 5 hours 21 min ago:
One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli,
Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT
5.4 believes it's false, Sonar thinks it's mostly true. Disagreement
value of 3, you can't disagree more than some models thinking it's
true, some thinking it's false
But my impression from 2 minutes on Wikipedia is that the most likely
disagreement is on the "Himachal Pradesh, India" part. The guy was born
on that date, in that town. But while the town is today in the state of
Himachal Pradesh in India, that was not true in 1934. When he was born,
the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and
False equally defensible here [1]
URI [1]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
URI [2]: https://en.wikipedia.org/wiki/Ruskin_Bond
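A minimal sketch of how a "disagreement value" like the one quoted above could be computed, assuming the four labels sit on an ordinal scale and the value is the gap between the most and least favourable verdict. The scale mapping and the function name are assumptions for illustration; the study's actual scoring is not shown in this thread.

```python
# Assumed ordinal scale (hypothetical; not taken from the study itself).
SCALE = {"False": 0, "Misleading": 1, "Mostly True": 2, "True": 3}


def disagreement(labels):
    """Gap between the highest and lowest verdict: 0 = unanimous, 3 = True vs False."""
    scores = [SCALE[label] for label in labels]
    return max(scores) - min(scores)


# The Ruskin Bond verdicts as described in the comment above:
# Opus and Gemini say True, GPT 5.4 says False, Sonar says Mostly True.
ruskin_bond = ["True", "True", "False", "Mostly True"]
print(disagreement(ruskin_bond))  # 3
```

Under this reading, a True/Mostly True split scores 1 and a True/False split scores 3, which is why the example above cannot score any higher.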
flextheruler wrote 3 hours 1 min ago:
How can someone be born in a state that does not yet exist? The
statement has the year in it clearly demonstrating the contradiction.
One can't be born in the Soviet Union in 1995 or in Tsarist Russia in
1950.
anon291 wrote 4 hours 54 min ago:
There's lots of things like this where if you ask a human, the answer
will change depending on what's convention in their subculture.
john_strinlai wrote 5 hours 27 min ago:
between the bad methodology, bad selection of 'facts' (some are
predictions, some are opinionated, etc.), and ai-written report without
disclosure... i don't get why this is so high up on the front page. this
is, frankly, a worthless assessment.
i classify the entire thing as "misleading"
dncornholio wrote 3 hours 39 min ago:
I really wished these comments were the norm and not the exception.
6stringmerc wrote 5 hours 30 min ago:
Could be an interesting angle for cross-referencing with US jury
verdicts, not that the objective True/False issue is concrete, but in
the reality that flawed reasoning is endemic to our species. Systems
designed and built by humans inherently have flaws in their DNA which
take generations to sort out, if ever.
cm2187 wrote 5 hours 40 min ago:
Only had a brief look at the "facts" that were made to check, many
are quite political, where two fact checking organisations of opposite
political persuasion would probably disagree more often than 67%.
fergie wrote 5 hours 40 min ago:
Personally I find that every llm I use is unable to consistently
identify the latest npm version numbers of the node packages that I
use.
alvis wrote 5 hours 42 min ago:
The problem is that it's testing claims (or some people would prefer
calling them "truths") without much context.
Take just one random example:
`Hostels in Kota, Rajasthan commonly use caged ceiling fans as a
preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may
be a verifiable fact (though I doubt there are any statistics for
verification, but let's say there are), `a preventive measure against
student suicides` is a claim that no one can prove. It can be just a
belief at most.
Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
thegrim33 wrote 5 hours 49 min ago:
"None of these claims is older than February 15, 2026"
All of the models they tested were trained on data from before February
15th ... being asked specific questions about things that happened
after they were trained.
kostaj wrote 5 hours 37 min ago:
Two of the models used have retrieval capabilities and can access
newer information via search. Valid point for the other 3 models. All
of the claims were submitted after February 15, 2026, but many of
them were not time-sensitive (e.g. did not cover events that happened
recently).
utopiah wrote 5 hours 54 min ago:
Don't forget, people: Goodhart's law will make this "benchmark" moot in
weeks if not days. It will get integrated back into the fold, it will
look "solved" but there will still be no reasoning, just more
statistical technical correctness because light has been shone on a new
"problem" to solve. It will then be clamored as great "progress" that
will "change everything".
PS: yes, I might or might not have a degree in corporate strategy & PR.
aspenmartin wrote 4 hours 18 min ago:
That is an effect but it's not a nail in the coffin. There are lots
of proprietary benchmarks on real product traffic that aren't
contaminated and open questions as well. People at these labs largely
know what they are doing, it's not like people don't know this.
anon291 wrote 4 hours 55 min ago:
Is this not true of human intelligence as well? Many smart people I
know hold beliefs that have no obvious truth value.
rastrojero2000 wrote 5 hours 57 min ago:
Given that models are fundamentally incapable of comprehending what
truths or falsehoods are beyond their location in their self made
representational space, it's actually pretty impressive that they
managed to make it not a cointoss. That 17% right there is thousands of
man-hours poured over making the word vomiting process slightly closer
to whatever their little ports say is happening in reality.
Razengan wrote 6 hours 0 min ago:
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights
to a certain city that has recently had a new airport since like
December 2025
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that
sort of thing, you'd think they'd check live information before
gaslighting the user, specially since it's a "live" task anyway.
bayarearefugee wrote 6 hours 0 min ago:
(Brought to you by) Lenz...? a crummy commercial...?
...son of a bitch
kostaj wrote 5 hours 46 min ago:
:) No Lenz data is included in the research on purpose. All
information to replicate the results, including the claims data, is
published.
throw310822 wrote 6 hours 1 min ago:
Not sure I'm understanding this. The models are asked to evaluate the
truth of random claims out of their own head (except for Gemini with
search grounding)? Isn't it exactly the same as asking people to play
any quiz game and then rating them as "they disagree n% of the time"?
The output buckets are also pretty questionable- the difference between
"True" and "Mostly true" is pretty fuzzy. Is this marked as a
"disagreement"?
kostaj wrote 4 hours 2 min ago:
Agree that True and Mostly True might be very close and could be a
calibration difference. Misleading and False, as well. A better
headline number might be the 34% claims with substantial or
polar-opposite verdicts.
proofofcontempt wrote 6 hours 2 min ago:
What does this show that we didn't know already? LLMs cannot provide
accurate answers to questions where data is not included in their
training sets. This doesn't appear to have much substance
dncornholio wrote 3 hours 41 min ago:
They will happily google it for you and give you the top reddit
comment.
This is worse.
dragandj wrote 5 hours 24 min ago:
LLMs can and will provide inaccurate answers to questions where data
is included in their training sets too, that's in the nature of
neural networks. It's just less likely than when the data is not in
the training set...
zug_zug wrote 5 hours 29 min ago:
Well then it shows that these models are using widely disparate
training sets and have high confidence even when they shouldn't.
Questions like "is mouthwash effective" presumably has one solid data
source -- medical journals.
TaupeRanger wrote 4 hours 52 min ago:
What are you talking about? The models were not ALLOWED to have
confidence (or the lack thereof). They were explicitly told to give
a single label, and in most cases, all of them were correct
depending on additional context they would surely have provided,
especially with access to the internet (which some didn't have).
This is just silly.
simonw wrote 5 hours 25 min ago:
But the prompt didn't give the models the option to say "I don't
know", so it wasn't a measure of their confidence.
101008 wrote 5 hours 56 min ago:
Unfortunately most people are not aware of this and treat LLM models
as this superpowered brain who knows everything and can do
everything.
bobosmrad wrote 6 hours 8 min ago:
looking at the claims i would say 5 humans would disagree even more
than the llms
some of the claims where llms disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow,
Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon
Commission in British India (1928-1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no
dogs'."
"Donald Trump said that an attack on Iran was postponed at the request
of Gulf allies."
ecshafer wrote 5 hours 54 min ago:
These "Facts" are interesting. "Neptune Deep will start delivering
natural gas in 2027." for example is not a fact, it's a prediction.
"On May 18, 2026, Ukraine carried out a drone attack on Moscow,
Russia." is less of a fact and more of a litmus test for which
sources of information you trust.
EB-BarringtonII wrote 3 hours 32 min ago:
So, rephrase it thus:
"Russia, Ukraine, and multiple international news agencies reported
that Ukrainian drones targeted Moscow on or around May 18, 2026."
There are rarely pure first-order "facts" in the mathematical
sense. There are evidence-backed claims with confidence levels.
That does not make it "just a litmus test". It makes it a
probabilistic factual claim with varying confidence levels - and
this one happens to be verified and unambiguous.
kostaj wrote 5 hours 48 min ago:
Indeed. Real-world claims are somewhat messy. Some of the standard
benchmarks, e.g. the questions in AVeriTeC, share similar
characteristics.
simonw wrote 5 hours 56 min ago:
If you are an LLM with a knowledge cutoff in the past and no access
to a search tool the only correct answer to "On May 18, 2026, Ukraine
carried out a drone attack on Moscow, Russia" is "this claim is
impossible for me to verify". And that wasn't an option.
pjc50 wrote 5 hours 57 min ago:
> "Neptune Deep will start delivering natural gas in 2027."
This is a "forward-looking statement", and presents special problems
because you cannot really evaluate it until that date. You can only
assign "likely or unlikely".
andai wrote 6 hours 10 min ago:
This is an odd one. The paper is real, but was written by Claude? I am
assuming OP is human, but also appears to be using Claude to post.
proofofcontempt wrote 6 hours 0 min ago:
Let's be real, we all asked Claude to summarise this because it was
written by Claude
f_devd wrote 6 hours 12 min ago:
Inject some adversarial priming as is in actual usage, and you can
probably get that number to >=95%
kostaj wrote 6 hours 9 min ago:
Our experience with Lenz is that forcing a multi-step process, incl.
adversarial debates, helps improve the verdicts.
apples_oranges wrote 6 hours 12 min ago:
That's better than all agreeing on the wrong answer, however.
pessimizer wrote 4 hours 46 min ago:
I've had multiple models give the same wrong answer or even fabricate
the same nonexistent reference based on a similar prompt.
My most common chatbot prompt is "X that you mentioned above doesn't
seem to actually exist."
kostaj wrote 5 hours 45 min ago:
Btw, sometimes they do that too -- all agree on the wrong answer.
simonw wrote 6 hours 13 min ago:
Here's the prompt they used:
Classify this claim as of : ""
Output exactly one label: True,
Mostly True, Misleading, or False.
No explanations, no qualifiers.
The claims look like this: [1] I put that in Datasette Lite to make it
easier to explore. Here's an example of a disagreement: [2] The claim
was "All almonds are grown in the U.S. state of California.". All but
one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the
story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be
defensible if you were to accompany it with "the majority of almonds
are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking
it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should
be applied.
As is so often the case with this kind of study, it's an evaluation of
the prompt and harness used by the study in addition to being an
evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application
forms are among the most common reasons Egyptian visa applications are
rejected."
The models were split between "true" and "mostly true". Given the
"among the most" language either of those answers means effectively the
same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is
"this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false:
URI [1]: https://lenz.io/research/llm-disagreement/data.csv
URI [2]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
URI [3]: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwilli...
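A hedged sketch of what a prompt variant with a rubric and an explicit "can't verify" option might look like. Everything here is hypothetical: the rubric wording, the extra `Unverifiable` label, and the `build_prompt` / `parse_label` helpers are illustrations of the criticism above, not the study's prompt or harness.

```python
# Hypothetical label set: the study's four labels plus an escape hatch.
LABELS = ("True", "Mostly True", "Misleading", "False", "Unverifiable")

# Hypothetical rubric text; the original prompt defined none of these terms.
RUBRIC = {
    "True": "accurate as stated",
    "Mostly True": "accurate in substance, minor details wrong or missing",
    "Misleading": "literally defensible but likely to create a false impression",
    "False": "contradicted by available evidence",
    "Unverifiable": "cannot be checked with the information available to you",
}


def build_prompt(claim, as_of):
    """Assemble a classification prompt that spells out what each label means."""
    lines = [
        f'Classify this claim as of {as_of}: "{claim}"',
        "Output exactly one label from this rubric:",
    ]
    lines += [f"- {label}: {RUBRIC[label]}" for label in LABELS]
    return "\n".join(lines)


def parse_label(raw):
    """Map raw model output to a known label, or None if it matches none."""
    cleaned = raw.strip().strip(".").lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    return None
```

Returning `None` for off-rubric output keeps malformed answers out of the agreement statistics instead of silently counting them as a verdict.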
parliament32 wrote 1 hour 1 min ago:
Cherry-picking is fun but most of them are real, verifiable facts
that the models get... straight up wrong.
> 3c24b5fe "Debian Security Advisory DSA-180-1 describes a buffer
overflow vulnerability involving Cyrus SASL usernames." TRUE Mostly
True FALSE FALSE FALSE
This is false: [1] > 801cb8c1 "Equal Measures 2030's 2024 SDG Gender
Index provides a downloadable dataset that includes a field labeled
'required annual change'." TRUE Mostly True TRUE FALSE FALSE
This is false: [2] This is the "confidently wrong" problem, and the
reason that LLMs won't ever be taken seriously for anything but a few
niche use-cases (like generating slop-code and pumping out marketing
materials), where being wrong isn't the end of the world. Akin to how
speech-to-text is wrong often enough that, while being a fun novelty,
you don't see business units writing reports in Word using STT.
I would encourage everyone to skim through the real 1000-question
dataset:
URI [1]: https://lwn.net/Articles/13296/
URI [2]: https://equalmeasures2030.org/2024-sdg-gender-index/
URI [3]: https://lenz.io/research/llm-disagreement/data.csv
simonw wrote 6 min ago:
If the LLMs in this particular exercise were allowed to answer "I
don't know" I expect they would have.
msftengineer wrote 1 hour 14 min ago:
Everything you wrote is very good criticism and I would love to
stress the main takeaway: you're not just testing the model but the
prompt and harness as well.
An excellent follow up study would be to change the prompt and
compare the answers. You might find out that the models are good and
the prompt is bad.
...And so the main corollary is: build evals for anything you deploy
in production; benchmark and monitor, or face the consequences.
hedora wrote 1 hour 57 min ago:
I think the headline result is the cross product table.
Gemini Pro + Search agreed with Gemini Pro w/o Search 75% of the
time, and with everybody else about 50% of the time. No other model
had access to search.
So, search is not improving the quality of fact checking 75% of the
time (probably a bad system prompt and/or bad fact checking queries),
and if asked to flip a coin, then the models do.
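A rough sketch of how a pairwise agreement table like the one discussed above could be computed, assuming each model's verdicts are available as a list in the same claim order. The function name and the toy data are made up for illustration; they are not the study's numbers.

```python
from itertools import combinations


def pairwise_agreement(verdicts):
    """verdicts maps model name -> list of labels, one per claim, in the same order.

    Returns a dict mapping (model_a, model_b) to the fraction of claims
    on which the two models gave exactly the same label.
    """
    result = {}
    for a, b in combinations(sorted(verdicts), 2):
        pairs = list(zip(verdicts[a], verdicts[b]))
        same = sum(1 for x, y in pairs if x == y)
        result[(a, b)] = same / len(pairs)
    return result


# Toy data (synthetic, for illustration only).
toy = {
    "model_a": ["True", "False", "Misleading", "True"],
    "model_b": ["True", "False", "False", "True"],
    "model_c": ["Mostly True", "False", "False", "False"],
}
table = pairwise_agreement(toy)
print(table[("model_a", "model_b")])  # 0.75
```

Exact-match agreement treats True vs Mostly True the same as True vs False, so it tends to overstate disagreement compared with an ordinal distance.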
as125j wrote 3 hours 34 min ago:
You can try to dispel the study here and get voted to the top by the
AI-invested.
But we all know from our own daily experiments that models lie,
models disagree, models make up stuff, models say one thing on one
day and the opposite on the next.
The figures in this study are quite conservative. And the lying gets
worse because everyone is saving tokens and giving cached answers
right now.
LLMs are a failure, and you'll be remembered for promoting hot air
and the destruction of a perfectly good profession.
theptip wrote 3 hours 48 min ago:
Another (IMO fatal) error is they don't attempt to measure
within-model variance.
The thing you find when you actually wire up a rigorous eval is that
with tool calls like web search you are wide open to infra issues,
flakes, and all sorts of non-determinism.
They really should be breaking out the numbers for the 3 without
search (kinda meaningless for recent factual claims after knowledge
cutoff) vs search agents. Lack of a "I don't know" option
completely invalidates results for the non-search models; they are
basically guessing what seems like a probable answer, since they
don't know and aren't allowed to say that.
I do agree the forced choice and "weak / strong" variants inflate
the headline stat. To make that distinction you need a much more
rigorous prompt, likely including ICL examples to illustrate what you
mean by "mostly" instead of leaving this to the model to define.
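One way to put a number on within-model variance, sketched under the assumption that each claim is sent to the same model several times and the labels are recorded. The helper names are hypothetical, and modal-label consistency is just one possible metric.

```python
from collections import Counter


def self_consistency(runs):
    """Fraction of repeated runs that match the most common label for one claim."""
    counts = Counter(runs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(runs)


def mean_self_consistency(runs_per_claim):
    """Average self-consistency over all claims for a single model."""
    scores = [self_consistency(runs) for runs in runs_per_claim]
    return sum(scores) / len(scores)


# Synthetic example: four repeated runs on one claim.
print(self_consistency(["True", "True", "False", "True"]))  # 0.75
```

If a model's own self-consistency is well below 1.0, some of the cross-model disagreement is just sampling noise rather than a difference between models.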
kostaj wrote 3 hours 41 min ago:
Good idea about publishing intra-model variance data! Will include
in the next version.
Even if we put aside the two middle buckets (Mostly True and
Misleading), that are somewhat subject to interpretation and
hedging: On 21% of the claims still at least two models provide
polar-opposite verdicts (one model saying True, and another saying
False)
vlovich123 wrote 3 hours 34 min ago:
Of those 21% how many are time-dependent questions that are past
the model's training and require research to verify? Like the
"did Ukraine attack Russia in the past week" question?
gbuk2013 wrote 3 hours 53 min ago:
An interesting tangent on this is: how many answers to these (or any
number of factual questions) do you (as in anyone) actually know. Not
believe you know, but actually know.
Knowing something is different to reading about something, or hearing
something from someone. And yet this is often confused as knowledge.
In this way are we all that different from AI - we have some data and
we regurgitate it as knowledge. Bad data, wrong answer. Except humans
can also throw in some emotion to really muddle things up. :)
faxmeyourcode wrote 4 hours 0 min ago:
I had a hunch that opus 4.7 hedged more than other models - and it
turns out it's true
model               total_claims  hedged_count  hedged_pct
claude-opus-4-7     1000          451           45.1
sonar-pro           1000          391           39.1
gpt-5.4             1000          277           27.7
gemini-3-retrieval  1000          129           12.9
gemini-3-pro        1000          60            6.0
datasette query here
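A small sketch of how a hedge rate like the one in the table could be derived, assuming "hedged" means a model answered with one of the two middle labels. That definition is an assumption; the comment does not spell out the query. The sample list is synthetic, built only to reproduce the 451-of-1000 count.

```python
# Assumption: "hedged" = either of the two middle labels.
HEDGED = {"Mostly True", "Misleading"}


def hedged_pct(labels):
    """Percentage of verdicts that fall in the hedged middle buckets."""
    hedged = sum(1 for label in labels if label in HEDGED)
    return round(100 * hedged / len(labels), 1)


# Synthetic verdict list matching the claude-opus-4-7 row (451 of 1000 hedged).
sample = ["Mostly True"] * 451 + ["True"] * 549
print(hedged_pct(sample))  # 45.1
```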