LM Evaluation Harness
A framework for few-shot evaluation of autoregressive language models.
Description
The harness provides a unified interface for evaluating autoregressive language models on hundreds of benchmark tasks, including MMLU, ARC, and HellaSwag.
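A minimal sketch of what a typical command-line invocation looks like, assuming the pip-installed `lm_eval` entry point with its documented `--model`, `--model_args`, `--tasks`, and `--batch_size` flags; the model name and task list below are placeholders for illustration, not recommendations.

```python
# Assemble a typical lm_eval CLI command (hedged sketch; the model
# checkpoint "EleutherAI/pythia-160m" is a placeholder example).
cmd = [
    "lm_eval",
    "--model", "hf",                                    # Hugging Face backend
    "--model_args", "pretrained=EleutherAI/pythia-160m",
    "--tasks", "mmlu,arc_easy,hellaswag",               # comma-separated task names
    "--batch_size", "8",
]
print(" ".join(cmd))
```

Running the printed command from a shell (with the harness installed) evaluates the chosen checkpoint on each listed task and reports per-task metrics.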
Links
Alternatives
Backlog
- Configure custom task suite for internal model evaluation.