LM Evaluation Harness

A framework for few-shot evaluation of autoregressive language models.

Description

The harness provides a unified interface for evaluating models on hundreds of tasks, including MMLU, ARC, HellaSwag, and many more.
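As a sketch of typical usage (assuming the package is installed, e.g. via `pip install lm-eval`, and that you want to evaluate a small Hugging Face model — the model name here is just an illustrative choice):

```shell
# Evaluate a Hugging Face model on HellaSwag with 5-shot prompting.
# --model hf selects the Hugging Face backend; --tasks takes a
# comma-separated list of task names.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-125m \
    --tasks hellaswag \
    --num_fewshot 5
```

Run `lm_eval --tasks list` to see the available task names before choosing a suite.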

Alternatives

Backlog

  • Configure custom task suite for internal model evaluation.
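For the custom task suite item, new tasks in the harness are typically described by YAML configs. A minimal hypothetical sketch (the task name, dataset path, and field names below are placeholders, not a real internal config):

```yaml
# Hypothetical custom task config for internal evaluation.
task: internal_qa            # placeholder task name
dataset_path: json           # load from local JSON files
dataset_name: null
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
```

The exact set of supported keys is defined by the harness's task configuration schema, so check the project's task documentation before committing a suite.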