AIs fail 'Humanity’s Last Exam' on the cutting edge of human knowledge

Publicly released:
Australia; International; NSW; VIC; WA; ACT
Photo by Solen Feyissa on Unsplash. Story by Lyndal Byford, Australian Science Media Centre

AIs, including those behind ChatGPT, Gemini and Claude, have received what can only be described as a fail mark on 'Humanity’s Last Exam', a test designed to see how AIs fare against the cutting edge of human knowledge. The exam consists of 2,500 questions across dozens of subjects, including mathematics, the humanities and the natural sciences, and was developed by subject-matter experts around the world, including in Australia. Many of the AIs scored less than 10% on the exam, and the highest score was 25.3%, achieved by GPT-5. The researchers say this highlights a marked gap between current AI capabilities and the expert human frontier. By providing a clear measure of AI progress, 'Humanity’s Last Exam' creates a common reference point for scientists and policymakers to assess AI capabilities, the authors say.

Attachments

Note: Not all attachments are visible to the general public. Research URLs will go live after the embargo ends.

Research (Springer Nature, web page): Please link to the article in online versions of your report (the URL will go live after the embargo ends).
Journal/conference: Nature
Research: Paper
Organisation/s: The University of Western Australia, Swinburne University of Technology, The University of Sydney, University of Technology Sydney (UTS), The Australian National University, Monash University, La Trobe University, RMIT University, Murdoch University, The University of Melbourne, Center for AI Safety, USA
Funder: The research is supported by the Center for AI Safety and Scale AI.
Media Contact/s
Contact details are only visible to registered journalists.