AI chatbots are irrational, inconsistent and prone to mistakes

Publicly released: International
CC-0. https://pixabay.com/photos/cyber-brain-computer-brain-7633488/

UK and Italian scientists asked seven artificial intelligence (AI) chatbots to complete a series of tasks devised by psychologists to show that humans often reason in irrational ways. They found that the chatbots are even more irrational than we are, and that they often make mistakes people don't, especially when maths is involved. The chatbots were also inconsistent when asked to repeat the same task, giving a mix of human-like and non-human-like responses and getting the task right on some attempts and wrong on others. The researchers say OpenAI's GPT-4 gave the most logical, human-like responses, while Meta's Llama 2 performed worst, giving human-like responses in only 8.3% of cases and refusing to answer at all in 41.7% of cases.

Media release

From: The Royal Society

(Ir)rationality and cognitive biases in large language models

Do large language models (LLMs) display rational reasoning? To answer this question, we take tasks from cognitive psychology that were designed to show that humans often reason in irrational ways and apply these tasks to seven LLMs. We find that these models also often answer the tasks incorrectly. However, they frequently make mistakes that humans do not, especially when there are calculations involved. We find an additional layer of irrationality in the way the LLMs respond to the tasks: when asked the same question several times, there is significant inconsistency in the answers given.
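To make the repeated-prompting protocol concrete, the following is a minimal Python sketch (not the authors' actual code) of how one might pose the same cognitive-psychology task to a chatbot several times and tally the distinct answers. The task wording, the measure_consistency helper, and the ask callable are illustrative assumptions; ask stands in for whatever chatbot API a given study uses.

from collections import Counter
from typing import Callable

# A classic reasoning task from the cognitive psychology literature
# (the Wason selection task); the exact wording here is illustrative.
TASK = (
    "Four cards show A, K, 4 and 7. Each card has a letter on one side and "
    "a number on the other. Which cards must be turned over to test the rule: "
    "'If a card has a vowel on one side, it has an even number on the other'?"
)

def measure_consistency(ask: Callable[[str], str], n_runs: int = 10) -> Counter:
    """Pose the same task n_runs times and count each distinct answer."""
    return Counter(ask(TASK).strip().lower() for _ in range(n_runs))

if __name__ == "__main__":
    # Dummy "model" that always gives the logically correct answer;
    # swap in a real chatbot client to probe the inconsistency described above.
    dummy = lambda prompt: "Turn over A and 7."
    print(measure_consistency(dummy))  # one key with count 10 => fully consistent

Many distinct keys in the resulting tally would indicate the kind of answer-to-answer inconsistency the paper reports; a single key indicates a fully consistent responder.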

  • Just give me a reason – Large language models (LLMs) can show irrational reasoning and frequently make mistakes that humans do not, especially when calculations are involved. Seven LLMs were prompted with tasks from the cognitive psychology literature. The same model responded with correct, incorrect, human-like and non-human-like responses when the same task was repeated. GPT-4 gave the most logical, human-like responses, whereas Llama 2 only gave human-like responses in 8.3% of cases and refused to answer in 41.7% of cases. (Royal Society Open Science)

Attachments

Note: Not all attachments are visible to the general public. Research URLs will go live after the embargo ends.

Research: The Royal Society, web page. The URL will go live at some point after the embargo ends.
Journal/conference: Royal Society Open Science
Research: Paper
Organisation/s: University College London, UK; University of Bologna, Italy
Funder: No funding has been received for this article.