
EXPERT REACTION: ChatGPT can (almost) pass the US Medical Licensing Exam

Peer-reviewed: This work was reviewed and scrutinised by relevant independent experts.

The AI software ChatGPT was able to score at or close to the roughly 60% passing threshold of the United States Medical Licensing Exam (USMLE), say US researchers. The USMLE is a notoriously difficult series of three exams, covering most medical disciplines, that are required to obtain a medical license in the US. The AI scored between 52.4% and 75.0% across the three exams, according to the team, who add that ChatGPT's responses were internally coherent and frequently contained insights.

Journal/conference: PLOS Digital Health

Link to research (DOI): 10.1371/journal.pdig.0000198

Organisation/s: AnsibleHealth, Inc., Mountain View, USA

Funder: The authors received no specific funding for this work.

Media release

From: PLOS

ChatGPT can (almost) pass the US Medical Licensing Exam

The AI software was able to achieve passing scores for the exam, which usually requires years of medical training

ChatGPT can score at or around the approximately 60 percent passing threshold for the United States Medical Licensing Exam (USMLE), with responses that make coherent, internal sense and contain frequent insights, according to a study published February 9, 2023 in the open-access journal PLOS Digital Health by Tiffany Kung, Victor Tseng, and colleagues at AnsibleHealth.

ChatGPT is a new artificial intelligence (AI) system, known as a large language model (LLM), designed to generate human-like writing by predicting upcoming word sequences. Unlike most chatbots, ChatGPT cannot search the internet. Instead, it generates text using word relationships predicted by its internal processes.

Kung and colleagues tested ChatGPT’s performance on the USMLE, a highly standardized and regulated series of three exams (Steps 1, 2CK, and 3) required for medical licensure in the United States. Taken by medical students and physicians-in-training, the USMLE assesses knowledge spanning most medical disciplines, ranging from biochemistry, to diagnostic reasoning, to bioethics.

After screening to remove image-based questions, the authors tested the software on 350 of the 376 public questions available from the June 2022 USMLE release.

After indeterminate responses were removed, ChatGPT scored between 52.4% and 75.0% across the three USMLE exams. The passing threshold each year is approximately 60%. ChatGPT also demonstrated 94.6% concordance across all its responses and produced at least one significant insight (something that was new, non-obvious, and clinically valid) for 88.9% of its responses. Notably, ChatGPT exceeded the performance of PubMedGPT, a counterpart model trained exclusively on biomedical domain literature, which scored 50.8% on an older dataset of USMLE-style questions.

While the relatively small input size restricted the depth and range of analyses, the authors note their findings provide a glimpse of ChatGPT’s potential to enhance medical education, and eventually, clinical practice. For example, they add, clinicians at AnsibleHealth already use ChatGPT to rewrite jargon-heavy reports for easier patient comprehension.

“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” say the authors.

Author Dr Tiffany Kung added that ChatGPT's role in this research went beyond being the study subject: "ChatGPT contributed substantially to the writing of [our] manuscript... We interacted with ChatGPT much like a colleague, asking it to synthesize, simplify, and offer counterpoints to drafts in progress...All of the co-authors valued ChatGPT's input."

Expert Reaction

These comments have been collated by the Science Media Centre to provide a variety of expert perspectives on this issue. Feel free to use these quotes in your stories. Views expressed are the personal opinions of the experts named. They do not represent the views of the SMC or any other organisation unless specifically stated.

Associate Professor Alex Polyakov is a Clinical Associate Professor at the Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne and a Consultant Obstetrician, Gynaecologist and Fertility Specialist at the Royal Women's Hospital, Melbourne. He is a Medical Director of Genea Fertility Melbourne.

As a medical educator, the findings of this study on ChatGPT's performance in the United States Medical Licensing Exam (USMLE) are intriguing and suggest potential applications in medical education. The results demonstrate that ChatGPT can reach the passing score for the USMLE, a highly standardised and regulated series of exams for medical licensure in the United States, with a score between 52.4% and 75.0% and with at least one significant insight for 88.9% of its responses.

The results highlight the potential of AI-based systems like ChatGPT to enhance medical education, specifically in areas such as student assessment, knowledge dissemination, and curriculum development. AI-based systems like ChatGPT can provide quick and efficient feedback to medical students and physicians-in-training, allowing them to identify areas for improvement and further study. Additionally, ChatGPT's ability to simplify and synthesise information could make it a valuable tool in disseminating knowledge to medical students, especially in the early stages of their education.

Moreover, ChatGPT's ability to provide insights and counterpoints could also be useful in developing new medical curricula. AI-based systems like ChatGPT can provide a wealth of information and ideas, which can help medical educators create innovative and engaging learning experiences for medical students.

However, it is important to note that while the results are promising, further research is needed to evaluate the long-term impact and reliability of AI-based systems like ChatGPT in medical education. Additionally, AI-based systems like ChatGPT should not replace human interaction and assessment in medical education. Medical students and physicians-in-training need to develop critical thinking skills, clinical reasoning, and ethical awareness, all of which require human interaction and feedback.

AI-based systems like ChatGPT have the potential to enhance medical education greatly, but it is important to approach their integration into medical education with caution and to consider their role alongside human interaction and assessment carefully.

(Written with the help of ChatGPT:)

Last updated: 18 Aug 2023 4:35pm
Declared conflicts of interest:
None declared.
Dr Simon McCallum, Senior Lecturer in Software Engineering, Te Herenga Waka, Victoria University of Wellington

This particular study was conducted in the first few weeks of ChatGPT becoming available. There have been three updates since November, with the latest on January 30th. These updates have improved the ability of the AI to answer the sorts of questions in the medical exam.

Google has developed a Large Language Model (the broad category of tools like ChatGPT) called Med-PaLM, which 'performs encouragingly on the axes of our pilot human evaluation framework.' Med-PaLM is a specialisation of Flan-PaLM, a system released by Google that is similar to ChatGPT, trained on general instructions. Med-PaLM focused its learning on medical text and conversations. 'For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared to 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast with 5.8% for Med-PaLM, comparable with clinician-generated answers (6.5%).'

Thus, ChatGPT may pass the exam, but Med-PaLM is able to give advice to patients that is as good as that of a professional GP. And both of these systems are improving.

ChatGPT is also good at simplifying content so that individuals can understand medical jargon or complex instructions. Asking the AI to simplify until the language fits the needs of the patient will improve people's ability to understand medical advice and remove the potential embarrassment associated with saying you do not understand.

Within university education we are having to pivot almost as fast as at the start of the pandemic to account for the ability of AI to perform tasks which were traditionally a sign of understanding. There is also a massive cultural shift when everybody has access to a tool that can assist in written communication. Careers and jobs which were seen as difficult may be automated by these AI tools. Microsoft has announced that ChatGPT is now integrated into Microsoft Teams Premium and will act as a meeting secretary, summarising meetings and creating action items. Bing will also include a ChatGPT advancement, linking version 4 of ChatGPT with up-to-date search information.

Society is about to change, and instead of warning about the hypochondria of randomly searching the internet for symptoms, we may soon get our medical advice from Doctor Google or Nurse Bing.

Last updated: 09 Feb 2023 1:02pm
Declared conflicts of interest:
Conflict of interest statement: "I am an active member of the Labour Party (Taieri LEC Chair). I am leading Te Herenga Waka, Victoria University of Wellington's response to AI tools." Expertise and background: "I have a PhD in Computer Science (in neural networks like those used in ChatGPT) from the University of Otago. I taught using GitHub Copilot last year. Copilot uses the same GPT model as ChatGPT but was focused on programming languages rather than human languages. My research has been in Games for Health and Games for Education, where AIs in games have been part of the tools integrated into research. I have also applied ChatGPT to many of our courses; it passes first-year courses and some of our second-year courses as of December, and may do even better now."
Dr Collin Bjork, Senior Lecturer in Science Communication and Podcasting, Massey University

The claim that ChatGPT can pass US medical exams is overblown and should come with a lengthy series of asterisks. Like ChatGPT itself, this research article is a dog and pony show designed to generate more hype than substance.

OpenAI had much to gain by releasing a free open-access version of ChatGPT in late 2022 and fomenting a media fervor around the world. Now, OpenAI is predicting $1 billion in revenue in 2024, even as a 'capped-profit' company.

Similarly, the authors of this article have much to gain by releasing a free open-access version of their article claiming that ChatGPT can pass the US Medical Licensing Exams. All of the authors but one work for AnsibleHealth, 'an early stage venture-backed healthcare startup' based in Silicon Valley. At two years old, this tiny company will likely need to go back to its venture capital investors soon to ask for more money. And the media splash from this well-timed journal article will certainly help fund its next round of growth. After all, a pre-print of this article already went viral on social media because the researchers listed ChatGPT as an author. But the removal of ChatGPT from the list of authors in the final article indicates that this too was just a publicity stunt.

As for the article itself, the findings are not as straightforward as the press release indicates. Here’s one example:

The authors claim that 'ChatGPT produced at least one significant insight in 88.9% of all responses' (8). But their definition of 'insight' as 'novelty, nonobviousness, and validity' (7) is too vague to be useful. Furthermore, the authors insist that these 'insights' indicate that ChatGPT 'possesses the partial ability to teach medicine by surfacing novel and nonobvious concepts that may not be in the learner’s sphere of awareness' (10). But how can an unaware learner distinguish between true and false insights, especially when ChatGPT only offers 'accurate' answers on the USMLE a little more than half the time?

The authors’ claims about ChatGPT’s insights and teaching potential are misleading and naive.

Last updated: 09 Feb 2023 12:59pm
Declared conflicts of interest:
No conflicts of interest

News for:

International
