GPQA (Graduate-Level Google-Proof Q&A)

GPQA is a challenging benchmark for evaluating high-level reasoning and knowledge in LLMs.

Description

It consists of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof": they remain difficult for skilled non-experts even with unrestricted access to the internet.

Key Metrics

  • Accuracy: Percentage of correct answers.
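  • Self-Consistency: How often repeated samples of the model's answer to the same question agree with each other, used as an indicator of reasoning reliability.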
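
As a rough illustration of the two metrics, the sketch below computes accuracy as a percentage and a simple self-consistency score. The page does not define self-consistency precisely, so the definition used here (the mean share of sampled answers that match the majority answer for each question) is an assumption, and the function names and the toy data are hypothetical rather than part of any GPQA tooling.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Percentage of predictions that exactly match the gold answer letters."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def majority_answer(samples):
    """Most frequent answer among the samples for one question."""
    return Counter(samples).most_common(1)[0][0]


def self_consistency(samples_per_question):
    """Assumed definition: mean fraction of samples agreeing with the majority answer."""
    if not samples_per_question:
        raise ValueError("need at least one question")
    rates = []
    for samples in samples_per_question:
        top_count = Counter(samples).most_common(1)[0][1]
        rates.append(top_count / len(samples))
    return sum(rates) / len(rates)


# Toy data (made up): four questions, three sampled answers each.
gold = ["A", "C", "B", "D"]
sampled = [
    ["A", "A", "B"],
    ["C", "D", "C"],
    ["B", "B", "B"],
    ["A", "A", "D"],
]

voted = [majority_answer(s) for s in sampled]
print(accuracy(voted, gold))      # accuracy of the majority-voted answers, in percent
print(self_consistency(sampled))  # mean agreement with the majority answer, in [0, 1]
```

Because GPQA questions are multiple-choice, exact matching on the answer letter is enough for scoring; the harder part in practice is extracting that letter reliably from free-form model output, which this sketch leaves out.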

Alternatives

Backlog

  • Add comparative results for Claude 3 and GPT-4o.