Medical decision-making not improved by AI chatbots

Publicly released:
International
PHOTO: Engin Akyurt on Unsplash

Large language models (LLMs) like ChatGPT might be able to score highly on medical licensing exams, but new research finds they fall short when it comes to helping the general public make sound medical decisions. Researchers asked around 1,300 people in the UK to identify underlying medical conditions and then choose a course of action, such as calling their GP or going to an urgent care clinic. The groups that used one of three AI chatbots identified relevant conditions in roughly a third of cases and chose the correct course of action in less than 44% of cases, results no better than those of the control group, which could use any other resource such as a regular internet search. The researchers argue that current LLMs are not ready to be used for public medical advice.

News release

From: Springer Nature

Medicine: LLMs may not improve public medical decision-making

Large language models (LLMs) may not help members of the public make better decisions about their health in everyday medical situations, suggests a study published in Nature Medicine. The authors argue that future tools will need to be designed to better support real users before they can be safely used for public medical advice.

LLMs have been proposed by global healthcare providers as potential tools to improve public access to medical knowledge, enabling individuals to perform preliminary health assessments and manage conditions before seeking help from a clinician. However, previous research indicates that LLMs that achieve very high scores on medical licensing exams in controlled settings do not necessarily succeed in real-world interactions.

Adam Mahdi, Adam Bean and colleagues tested whether LLMs could assist members of the public in accurately identifying medical conditions — such as a common cold, anaemia or gallstones — and choosing a course of action, such as calling an ambulance or their general practitioner. A total of 1,298 participants in the UK were each given ten different medical scenarios and were randomly assigned to use one of three LLMs (GPT-4o, Llama 3 or Command R+) or their usual resources (in the control group), such as internet search engines.

When tested on the scenarios without human participants, the LLMs correctly identified conditions in 94.9% of cases and chose a correct course of action in 56.3% of cases on average. However, when the participants used the same LLMs, relevant conditions were identified in less than 34.5% of cases and a correct course of action was chosen in less than 44.2%, results that were no better than those of the control group. In a subset of 30 cases, the authors manually inspected human–LLM interactions and observed that participants often provided incomplete or incorrect information to the model, but also that LLMs would sometimes generate misleading or incorrect information.

The authors conclude that current LLMs are not ready for deployment in direct patient care, as pairing LLMs with human users introduces challenges that existing benchmarks and simulated interactions fail to predict.

Attachments

Note: Not all attachments are visible to the general public. Research URLs will go live after the embargo ends.

Research: Springer Nature, Web page (URL will go live after embargo lifts)
Journal/conference: Nature Medicine
Research: Paper
Organisation/s: Oxford Internet Institute, University of Oxford
Funder: A.M. acknowledges support from Prolific and support for the Dynabench platform from the Data-centric Machine Learning Working Group at MLCommons. A.M. and A.M.B. were partially supported by the Oxford Internet Institute’s Research Programme funded by the Dieter Schwarz Stiftung gGmbH. L.R. acknowledges support from the Royal Society Research grant no. RG\R2\232035 and the UKRI Future Leaders Fellowship (grant no. MR/Y015711/1). L.T. was supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC).