Humanity's Last Exam (HLE)

HLE is a benchmark designed to test the limits of large language models (LLMs) on the most difficult expert-level questions.

Description

It consists of highly complex, multidisciplinary questions that require deep expertise and advanced reasoning to solve. It is intended to remain challenging even as models continue to improve.

Links

HLE Website

Alternatives

GPQA
MMLU

Backlog

Track the performance of upcoming models (o1, Claude 3.5 Opus).