HumanEval

HumanEval is a benchmark released by OpenAI to evaluate the code generation capabilities of large language models.

Description

It consists of 164 handwritten Python programming problems, each including a function signature, a docstring, a reference function body, and several unit tests. The problems are self-contained and assess a model's ability to solve basic algorithmic tasks.
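To make the problem structure concrete, here is a minimal sketch of a toy problem and a functional-correctness check. The `add` problem is invented for illustration and is not a benchmark item. The field names (`prompt`, `canonical_solution`, `test`, `entry_point`) are assumed to mirror the released dataset, and `passes_tests` is a hypothetical helper, not the official harness.

```python
# Toy problem in the shape described above; NOT an actual HumanEval item.
problem = {
    # Function signature plus docstring: this is what the model is shown.
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    # Reference body written by the problem author.
    "canonical_solution": "    return a + b\n",
    # Unit tests, wrapped in a check(candidate) function.
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    # Name of the function the tests should call.
    "entry_point": "add",
}


def passes_tests(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes every unit test.

    Hypothetical helper. exec() runs arbitrary code, so a real harness
    should execute model output in a sandbox with a timeout.
    """
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


print(passes_tests(problem, problem["canonical_solution"]))
print(passes_tests(problem, "    return a - b\n"))
```

A completion counts as correct only if every assertion in `check` succeeds; partially passing the tests earns no credit.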

Key Metrics

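  • Pass@k: The probability that at least one of $k$ code samples generated for a problem passes all of its unit tests, averaged over all problems. The samples are drawn from the model, not ranked, so there is no "top $k$" selection.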
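In practice, pass@k is usually estimated by generating $n \ge k$ samples per problem, counting the $c$ samples that pass, and computing $1 - \binom{n-c}{k} / \binom{n}{k}$, the unbiased estimator introduced alongside the benchmark. The sketch below assumes that estimator; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and not part of any official API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: sample budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k samples drawn without replacement are failures).
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: three problems with 10 samples each and 0, 3, and 10 passing samples.
print(mean_pass_at_k([(10, 0), (10, 3), (10, 10)], k=1))
```

For $k = 1$ the estimator reduces to the fraction of passing samples, $c / n$. Larger $k$ rewards a model that solves a problem only occasionally.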

Alternatives

Backlog

  • Add comparison with HumanEval-X for multilingual support.