Humanity's Last Exam
I was watching the recent Grok4 keynote presentation when I first heard the phrase “Humanity’s Last Exam.” It was mentioned almost in passing, but it stopped me cold. I paused the stream and started digging. What I discovered was both fascinating and unsettling in the best possible way.
Humanity’s Last Exam (HLE for short) isn’t some dramatic end-times metaphor. It’s a real thing: a new benchmark created by the Center for AI Safety in collaboration with Scale AI. Think of it as the Mount Everest of AI exams: 2,500 handcrafted, expert-level questions spanning more than 100 fields, from quantum physics and organic chemistry to ancient languages and mechanical engineering. These are not your average benchmark questions. They’re designed to test the outermost limits of what a language model can understand and reason through.
What hit me hardest was the intent behind it. Traditional benchmarks like Massive Multitask Language Understanding (MMLU) are being outpaced by top models. We’re at the point where GPT-4, Claude, Gemini, and now Grok4 are scoring above 90 percent on those older tests. But that doesn’t mean these models have reached true mastery. It just means the bar hasn’t kept up. HLE raises that bar… dramatically.
As I read more, I found myself reflecting on what it means to be smart in a world where AI can already outperform most people in structured knowledge recall. What happens when it starts out-thinking us in synthesis and reasoning too? It’s not just about the scores. It’s about the shifting boundary between human and machine capability and how fast that boundary is moving.
Current results from HLE are sobering. Models we often think of as “superintelligent” aren’t passing. Far from it. The best-performing model on record, xAI’s Grok4 (with tool assistance), scored somewhere between 38 and 45 percent. Gemini 2.5 Pro and OpenAI’s o3-high both landed just above 20 percent. Even these models, with billions in R&D behind them, are barely cracking the surface of this exam. They’re impressive, yes, but they’re also deeply fallible.
There’s something thrilling in that. It reminds us that while today’s AI models are astonishing, they’re not omniscient. And yet, their trajectory suggests they might get there, or at least closer than we’re comfortable with, sooner than we think. Benchmarks like HLE don’t just measure AI’s current limits. They force us to confront questions about what those limits should be, and what it means for society when they’re finally surpassed.
More than anything, encountering HLE made me think about the kinds of intelligence we value and whether we’re preparing ourselves, as individuals and institutions, for a world where synthetic intelligence might be part of every decision we make. Not just a tool, but a participant. Not just an assistant, but maybe a challenger. Someone at work told me that using an AI to help with coding feels like “cheating,” but in my mind, whether I hand my requirements to a developer or to an AI, it’s just me delegating. The skill is in knowing how and when to apply the tool. The capabilities of these tools have really impressed me.
It’s one thing, however, to be impressed by AI. It’s another to realize it’s starting to sit for the same exams we would, and often answer them better (in fact, humans would be hard-pressed to crack 5% on the current HLE). I’m still wrapping my head around what that means. But I know I’ll be watching the next set of results very closely.