SWE-bench

SWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks.

Description

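Each task is built from an actual GitHub issue. Given the issue text and the repository at that point in time, the model must generate a patch that resolves the issue. The patch is checked by running the repository's tests: the tests tied to the issue must pass, and tests that passed before must still pass.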
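The pass/fail rule for a single task can be sketched as below. This is a minimal illustration, not the benchmark's own harness: the function name `is_resolved`, the `test_passed` mapping, and the test ids are hypothetical, and the two lists are assumed to correspond to the per-task lists of tests that the fix should turn green and tests that should stay green.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    test_passed: Dict[str, bool],
) -> bool:
    """Return True if the patch fixes the issue without breaking other tests.

    test_passed maps a test id to whether it passed after applying the patch.
    A test missing from the mapping is treated as failed (an assumption).
    """
    # Tests that failed before the fix and must pass after it.
    fixed = all(test_passed.get(t, False) for t in fail_to_pass)
    # Tests that passed before the fix and must not regress.
    unbroken = all(test_passed.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


# Hypothetical test ids, for illustration only.
results = {"test_bug_is_fixed": True, "test_old_behavior": True}
print(is_resolved(["test_bug_is_fixed"], ["test_old_behavior"], results))
```

Treating a missing result as a failure keeps the check conservative: a patch that prevents a test from running at all is not counted as resolving the task.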

Alternatives

Backlog

  • Track "Jules" performance on a subset of SWE-bench.
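One possible way to track a score on a subset is sketched here, assuming per-task outcomes are already available as a mapping from instance id to resolved/unresolved. The helper names `sample_subset` and `resolve_rate`, the fixed seed, and the instance ids are all hypothetical; nothing here reflects how "Jules" is actually run or scored.

```python
import random
from typing import Dict, List


def sample_subset(instance_ids: List[str], k: int, seed: int = 0) -> List[str]:
    """Pick a reproducible subset of k instance ids."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(instance_ids), k))


def resolve_rate(subset: List[str], resolved: Dict[str, bool]) -> float:
    """Fraction of the subset resolved; missing instances count as unresolved."""
    if not subset:
        return 0.0
    return sum(resolved.get(i, False) for i in subset) / len(subset)


# Hypothetical instance ids and outcomes, for illustration only.
all_ids = [f"example__repo-{n}" for n in range(10)]
subset = sample_subset(all_ids, k=4)
outcomes = {subset[0]: True, subset[1]: False}
print(subset, resolve_rate(subset, outcomes))
```

Fixing the seed keeps the subset stable between runs, so changes in the score reflect changes in the system rather than in the sample.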