AI specialists prepare “Humanity’s Last Exam” to stump powerful tech
On Monday, a group of international technology specialists issued a call for the most challenging questions to put to artificial intelligence systems, which have been breezing through popular benchmark tests.
The initiative, “Humanity’s Last Exam,” aims to determine when expert-level AI has arrived. Its organizers, the non-profit Center for AI Safety (CAIS) and the startup Scale AI, hope the test will remain relevant even as capabilities advance.
The call came days after the ChatGPT creator OpenAI previewed a new model, OpenAI o1, which “destroyed the most popular reasoning benchmarks,” according to Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk’s xAI startup.
Hendrycks co-authored two 2021 papers proposing tests of AI systems that are now in widespread use: one quizzing them on undergraduate-level knowledge of subjects like American history, the other probing their reasoning with competition-level mathematics problems.
The undergraduate-style test has been downloaded more than any other dataset from the online AI hub Hugging Face.
At the time those papers were written, AI systems were answering the exam questions almost at random. “They’re now crushed,” Hendrycks told Reuters.
According to a prominent capabilities leaderboard, the Claude models from the AI lab Anthropic, for instance, went from scoring roughly 77% on an undergraduate-level test in 2023 to nearly 89% a year later.
As a result, these common benchmarks have lost much of their meaning.
AI has appeared to perform worse on less common tests that require planning and visual pattern-recognition puzzles, according to Stanford University’s AI Index Report from April. OpenAI o1, for example, scored about 21% on the pattern-recognition ARC-AGI test, the ARC organizers said on Friday.
Some AI researchers argue that results like this show planning and abstract reasoning to be better measures of intelligence, though Hendrycks said the visual aspect of ARC makes it less suited to assessing language models. “Humanity’s Last Exam” will require abstract reasoning, he said.
Answers from common benchmarks may also have ended up in data used to train AI systems, industry observers have said. Hendrycks said some questions on “Humanity’s Last Exam” will remain private to make sure AI systems’ answers are not from memorization.
The exam will include at least 1,000 crowd-sourced questions, due November 1, that are hard for non-experts to answer. The questions will undergo peer review, with winning submissions offered co-authorship and prizes of up to $5,000 sponsored by Scale AI.
“We desperately need harder tests for expert-level models to measure the rapid progress of AI,” said Alexandr Wang, Scale’s CEO.
One restriction: the organizers want no questions about weapons, which some say would be too dangerous for AI to study.