News release
Medicine: LLMs may not improve public medical decision-making
Large language models (LLMs) may not help members of the public make better decisions about their health in everyday medical situations, suggests a study published in Nature Medicine. The authors argue that future tools will need to be designed to better support real users before they can be safely used for public medical advice.
LLMs have been proposed by global healthcare providers as potential tools to improve public access to medical knowledge, enabling individuals to perform preliminary health assessments and manage conditions before seeking help from a clinician. However, previous research indicates that LLMs that achieve very high scores on medical licensing exams in controlled settings do not necessarily succeed in real-world interactions.
Adam Mahdi, Adam Bean and colleagues tested whether LLMs could assist members of the public in accurately identifying medical conditions — such as a common cold, anaemia or gallstones — and choosing a course of action, such as calling an ambulance or their general practitioner. A total of 1,298 participants in the UK were each given ten different medical scenarios and were randomly assigned to use one of three LLMs (GPT-4o, Llama 3 or Command R+) or their usual resources (in the control group), such as internet search engines.
When tested without human participants, the LLMs correctly identified the relevant conditions in 94.9% of scenarios and chose a correct course of action in 56.3% of cases on average. However, when participants used the same LLMs, relevant conditions were identified in less than 34.5% of cases and a correct course of action was chosen in less than 44.2% of cases, results no better than those of the control group. In a subset of 30 cases, the authors manually inspected the human–LLM interactions and observed that participants often provided incomplete or incorrect information to the model, but also that the LLMs themselves sometimes generated misleading or incorrect information.
The authors conclude that current LLMs are not ready for deployment in direct patient care, as pairing LLMs with human users introduces challenges that existing benchmarks and simulated interactions fail to predict.