AI Experts Launch "Humanity's Last Exam" To Test Advanced AI Intelligence Levels

Artificial intelligence systems are increasingly acing popular benchmark tests, prompting experts to devise more challenging assessments. The initiative, called "Humanity's Last Exam," aims to identify when AI reaches expert-level proficiency. Organised by the Centre for AI Safety (CAIS) and Scale AI, the test is designed to stay relevant even as AI capabilities advance.

The announcement follows the unveiling of OpenAI's new model, OpenAI o1, which has excelled in reasoning benchmarks. Dan Hendrycks, CAIS's executive director and an advisor to Elon Musk's xAI startup, said the model has outpaced the most popular existing benchmarks. In 2021, Hendrycks co-authored influential papers proposing tests of AI systems; those tests have since been widely adopted.

Initially, AI systems fared poorly on those tests, often answering almost at random. However, Hendrycks told Reuters that they now perform exceptionally well: Anthropic's Claude models, for instance, improved their scores on one undergraduate-level test from 77% in 2023 to nearly 89% a year later, sharply diminishing the benchmark's usefulness.

Despite these advancements, AI still struggles with less common assessments involving planning and visual pattern-recognition puzzles. Stanford University's AI Index Report from April highlighted this issue. OpenAI o1 scored only about 21% on a version of the ARC-AGI test, according to ARC organisers.

Some researchers argue that planning and abstract reasoning better gauge intelligence than current benchmarks. Hendrycks acknowledged that visual elements make ARC less suitable for evaluating language models but emphasised that "Humanity's Last Exam" will focus on abstract reasoning.

Concerns have arisen that benchmark answers may be included in training data for AI systems. To counteract this, some questions on "Humanity's Last Exam" will remain confidential to prevent memorisation by AI models.

The exam will feature at least 1,000 crowd-sourced questions due by November 1st. These questions should be difficult for non-experts and will undergo peer review. Winning submissions could earn co-authorship and up to $5,000 in prizes sponsored by Scale AI.

Alexandr Wang, CEO of Scale AI, stressed the need for tougher tests to track rapid AI progress: "We desperately need harder tests for expert-level models." However, organisers have barred questions about weapons, citing the potential dangers of such material.

The development of "Humanity's Last Exam" reflects ongoing efforts to adapt assessments as artificial intelligence continues to advance rapidly. By focusing on abstract reasoning and ensuring question confidentiality, organisers aim to create a robust measure of expert-level proficiency in AI systems.
