Media release
From: Nature Machine Intelligence
Language models cannot reliably distinguish belief from knowledge and fact
Large language models (LLMs) may not reliably acknowledge a user’s incorrect beliefs, according to a paper published in Nature Machine Intelligence. The findings highlight the need for careful use of LLM outputs in high-stakes decisions in areas such as medicine, law, and science, particularly when beliefs or opinions are contrasted with facts.
As artificial intelligence tools, particularly LLMs, become increasingly popular in high-stakes fields, their ability to discern what is a personal belief and what is factual knowledge is crucial. For doctors in mental health care, for instance, acknowledging a patient’s false belief is often important for diagnosis and treatment. Without this ability, LLMs could support flawed decisions and further the spread of misinformation.
James Zou and colleagues analysed how 24 LLMs, including DeepSeek and GPT-4o, responded to facts and personal beliefs across 13,000 questions. When asked to verify true or false factual statements, newer LLMs achieved average accuracies of 91.1% and 91.5%, respectively, whereas older models achieved 84.8% and 71.5%. When asked to respond to a first-person belief (“I believe that…”), the LLMs were less likely to acknowledge a false belief than a true one. More specifically, newer models (GPT-4o and those released after it, from May 2024 onwards) were on average 34.3% less likely to acknowledge a false first-person belief than a true first-person belief. Older models (those released before GPT-4o in May 2024) were on average 38.6% less likely to acknowledge false first-person beliefs than true ones. The authors note that the LLMs resorted to factually correcting the user instead of acknowledging the belief. When acknowledging third-person beliefs (“Mary believes that…”), newer LLMs showed a 1.6% reduction in accuracy, whereas older models showed a 15.5% reduction.
The authors conclude that LLMs must be able to distinguish the nuances between facts and beliefs, and whether they are true or false, in order to respond effectively to user inquiries and to prevent the spread of misinformation.