AI may be good at parroting medical answers, but it's not so great at reasoning them out

Publicly released: International
Image: CC-0. https://unsplash.com/photos/a-person-holding-a-cell-phone-in-their-hand-hWSNT_Pp4x4

US scientists say AI chatbots are great at answering medical questions when the answer is easy to find online, but their accuracy drops dramatically when they have to reason their way to an answer. The team tested six AI chatbots, including ChatGPT, Llama and DeepSeek, on 68 multiple-choice medical questions. To check whether the chatbots could reason out an answer rather than just parrot one memorised from the web, the researchers replaced each question's easily findable correct answer with the option 'none of the other answers', so the AIs would have to reason their way to that choice. This made the chatbots a lot less accurate, they say: DeepSeek was among the best performers but still got six of the 68 questions wrong, ChatGPT stuffed up on 18 of them, and the worst performer, Meta's Llama, got 26 wrong. The findings show we can't rely on AIs to give accurate medical answers that require reasoning, the authors say, so their clinical use should be limited to support roles, with outputs always checked by a healthcare professional.
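The paper itself isn't reproduced here, but the answer-substitution trick the release describes can be sketched in a few lines. The sketch below is a hypothetical illustration, not the study's actual code: the question format, field names and the `substitute_nota` helper are all assumptions, and the example question is invented.

```python
# Hypothetical sketch of the substitution described above: in each
# multiple-choice question, the text of the correct option is replaced
# with "None of the other answers", which becomes the new correct pick,
# so a model must reason rather than recall the familiar answer.

def substitute_nota(question: dict) -> dict:
    """Replace the correct option's text with 'None of the other answers'."""
    modified = dict(question)
    options = dict(question["options"])   # e.g. {"A": "...", "B": "..."}
    correct = question["answer"]          # letter of the correct option
    options[correct] = "None of the other answers"
    modified["options"] = options
    modified["answer"] = correct          # NOTA is now the right choice
    return modified

# Invented example: the memorable answer text ("Aspirin") disappears, so
# a model that merely pattern-matches the familiar answer will go wrong.
q = {
    "stem": "Which drug is first-line for suspected acute coronary syndrome?",
    "options": {"A": "Warfarin", "B": "Aspirin", "C": "Metformin", "D": "Insulin"},
    "answer": "B",
}
print(substitute_nota(q))
```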

Attachments


Research: JAMA, Web page. The URL will go live after the embargo ends.
Journal/conference: JAMA Network Open
Research: Paper
Organisation/s: Stanford University, USA
Funder: Ms Bedi is supported by the Stanford Graduate Fellowship. Dr Chung is supported by the Mentored Research Training Grant from the Foundation for Anesthesia Education and Research.