Using AI to cut down on healthcare admin may not be so simple

International
Photo by Gaetano Sferrazza on Unsplash

Artificial intelligence chatbots have been proposed as a way to cut down on time-consuming and clunky hospital administration tasks, but international researchers say the current bots on the market can't handle them on their own. The researchers used data from 50,000 emergency department visits in a US health system to test nine leading chatbots on basic administrative tasks that currently require back-end work by data analysts. They say all the bots performed poorly when given plain instructions for tasks such as counting how many patients of a given type were admitted or filtering records based on multiple criteria. Prompting the bots to work through a problem step by step only modestly improved their performance, the researchers say, and accuracy degraded quickly as the amount of data grew.

News release

From: PLOS

AI language models struggle with basic hospital data tasks, study finds

Nine leading AI models were tested on simple administrative queries drawn from real-world emergency department records—and most failed unless paired with code-generation tools.

A new study finds that large language models (LLMs), used with straightforward prompting, perform poorly on routine number-crunching tasks that hospital administrators depend on every day to track patients and allocate resources. The findings were published this week in the open-access journal PLOS Digital Health by Eyal Klang of the Icahn School of Medicine at Mount Sinai, New York, USA, and colleagues.

Hospitals rely on structured electronic health record (EHR) data to monitor patient counts and resources and to generate administrative reports. These tasks are currently handled by data analysts using programming languages, creating delays when staff need fast answers. AI tools known as large language models, such as GPT-4o and Llama, have been proposed to simplify that process.

In the new study, researchers evaluated nine leading LLMs on two basic administrative tasks—counting patients meeting a condition and filtering records based on multiple criteria—using data drawn from 50,000 real emergency department visits at the Mount Sinai Health System.

The researchers found that straightforward prompting—asking the model a plain question like “how many patients in this table were admitted?”—produced uniformly poor results across all models. Chain-of-thought reasoning, in which the model is prompted to show step-by-step work before giving an answer, offered only modest improvements that degraded sharply as table size increased. Even GPT-4o, the top-performing model, saw accuracy drop from roughly 95% on the smallest datasets to below 60% on larger ones under chain-of-thought conditions.

A tool-based approach—where models were asked to generate code that was then executed—substantially improved accuracy for the most capable models, with GPT-4o and Qwen-2.5-72B achieving near-perfect performance. However, distilled DeepSeek models, optimized for speed and efficiency, struggled even with this approach. One model, Llama-3.1-8B, failed to produce usable output in the majority of trials and was excluded from further analysis.
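As a rough illustration of the tool-based approach, the sketch below shows the kind of short program a model might be asked to generate for the two tasks in the study, counting and multi-criteria filtering, so that the answer comes from executed code rather than from the model reading the table itself. It is not the study's code: the table, the column names (visit_id, age, triage_level, disposition) and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical miniature emergency department table.
# Column names and values are invented for illustration only.
VISITS_CSV = """visit_id,age,triage_level,disposition
1,34,3,discharged
2,71,2,admitted
3,58,2,admitted
4,19,4,discharged
5,80,1,admitted
6,66,3,discharged
"""


def load_visits(text):
    """Parse the CSV text into a list of dicts with numeric fields converted to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["age"] = int(row["age"])
        row["triage_level"] = int(row["triage_level"])
        rows.append(row)
    return rows


def count_admitted(visits):
    """Counting task: how many visits ended in admission?"""
    return sum(1 for v in visits if v["disposition"] == "admitted")


def filter_visits(visits, min_age, max_triage_level):
    """Filtering task: IDs of admitted visits meeting both the age and triage criteria."""
    return [
        v["visit_id"]
        for v in visits
        if v["disposition"] == "admitted"
        and v["age"] >= min_age
        and v["triage_level"] <= max_triage_level
    ]


visits = load_visits(VISITS_CSV)
admitted_count = count_admitted(visits)
older_urgent_admits = filter_visits(visits, min_age=65, max_triage_level=2)

print(admitted_count)        # 3
print(older_urgent_admits)   # ['2', '5']
```

Because the count and the filter are computed by code, the result does not depend on how many rows the table has, which is the property the plain-prompt and chain-of-thought conditions lacked as table size grew.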

“Our findings indicate that without using a tool-based strategy, current LLMs are unsuitable for standalone use even on minimally complex administrative tasks in clinical settings,” says Benjamin Glicksberg. “Structured data tasks in clinical workflows will require agentic approaches that combine LLMs with code execution to ensure accuracy and consistency.”

Journal/conference: PLOS Digital Health
Research: Paper
Organisation/s: Mount Sinai Health System, USA
Funder: The author(s) received no specific funding for this work.