AI Experts Launch "Humanity's Last Exam" To Test Advanced AI Intelligence Levels

Artificial intelligence systems are increasingly acing popular benchmark tests, prompting experts to devise more challenging assessments. The initiative, called "Humanity's Last Exam," aims to identify when AI reaches expert-level proficiency. Organised by the Centre for AI Safety (CAIS) and Scale AI, the test is designed to stay relevant even as AI capabilities advance.

The announcement follows the unveiling of OpenAI's new model, OpenAI o1, which has excelled in reasoning benchmarks. Dan Hendrycks, CAIS's executive director and an advisor to Elon Musk's xAI startup, said the model has outpaced the most popular existing benchmarks. In 2021, Hendrycks co-authored influential papers proposing tests of AI systems; those tests have since been widely adopted.

Initially, AI systems fared poorly on those tests, often answering almost at random. However, Hendrycks told Reuters that they now perform exceptionally well: Anthropic's Claude models, for instance, improved their scores on one undergraduate-level test from 77% in 2023 to nearly 89% a year later, sharply diminishing the benchmark's usefulness.

Despite these advancements, AI still struggles with less common assessments involving planning and visual pattern-recognition puzzles. Stanford University's AI Index Report from April highlighted this issue. OpenAI o1 scored only about 21% on a version of the ARC-AGI test, according to ARC organisers.

Some researchers argue that planning and abstract reasoning better gauge intelligence than current benchmarks. Hendrycks acknowledged that visual elements make ARC less suitable for evaluating language models but emphasised that "Humanity's Last Exam" will focus on abstract reasoning.

Concerns have arisen that benchmark answers may be included in training data for AI systems. To counteract this, some questions on "Humanity's Last Exam" will remain confidential to prevent memorisation by AI models.

The exam will feature at least 1,000 crowd-sourced questions due by November 1st. These questions should be difficult for non-experts and will undergo peer review. Winning submissions could earn co-authorship and up to $5,000 in prizes sponsored by Scale AI.

Alexandr Wang, CEO of Scale AI, stressed the need for tougher tests to track rapid AI progress: "We desperately need harder tests for expert-level models." However, organisers have barred questions about weapons, citing the potential dangers of such material.

The development of "Humanity's Last Exam" reflects ongoing efforts to adapt assessments as artificial intelligence continues to advance rapidly. By focusing on abstract reasoning and ensuring question confidentiality, organisers aim to create a robust measure of expert-level proficiency in AI systems.
