GPQA (Graduate-Level Google-Proof Q&A)
GPQA is a challenging benchmark for evaluating high-level reasoning and knowledge in LLMs.
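Description
It consists of 448 multiple-choice questions written by domain experts (people who have or are pursuing a PhD) in biology, physics, and chemistry. The questions are designed to be "Google-proof": skilled non-experts struggle to answer them correctly even with unrestricted internet access. In the original paper, non-expert validators reached 34% accuracy despite spending over 30 minutes per question on average, compared with 65% for experts in the relevant field.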
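Key Metrics
- Accuracy: Percentage of questions answered correctly. Each question has four answer options, so random guessing scores about 25%.
- Self-Consistency: Agreement among multiple sampled answers to the same question, used as an indicator of reasoning reliability.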
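The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not an official GPQA scoring script: the function names `accuracy` and `majority_vote`, the answer letters, and the sample data are all hypothetical, and self-consistency is assumed here to mean the share of sampled answers that agree with the majority answer.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of questions whose predicted choice matches the gold choice."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("no questions to score")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def majority_vote(samples):
    """Return the most common sampled answer and the share of samples agreeing with it."""
    if not samples:
        raise ValueError("need at least one sample")
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


# Hypothetical data: three questions, four sampled answers per question.
gold = ["A", "C", "B"]
sampled = [
    ["A", "A", "B", "A"],
    ["C", "D", "D", "D"],
    ["B", "B", "B", "B"],
]

votes = [majority_vote(s) for s in sampled]
predictions = [answer for answer, _ in votes]

# Assumed definition: mean agreement rate with the majority answer.
mean_agreement = sum(rate for _, rate in votes) / len(votes)
acc = accuracy(predictions, gold)

print(f"accuracy={acc:.3f} self_consistency={mean_agreement:.3f}")
```

Note that the second question shows why the two metrics are reported separately: the samples agree with each other (three of four pick "D") but the majority answer is wrong, so high agreement does not imply high accuracy.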
Links
Alternatives
Backlog
- Add comparative results for Claude 3 and GPT-4o.