SWE-bench
SWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks.
Description
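Each task pairs a real GitHub issue with the repository snapshot it was filed against. The model must generate a patch that resolves the issue. The patch is judged by running the repository's tests: the tests tied to the fix must pass, and tests that passed before the patch must keep passing.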
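The pass/fail criterion can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness. `FAIL_TO_PASS` and `PASS_TO_PASS` are the names SWE-bench uses for its two test lists; the helper names `is_resolved` and `resolve_rate`, the test names, and the shape of the `test_status` mapping are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def is_resolved(test_status: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Return True if every required test passed after applying the patch.

    test_status maps a test name to whether it passed (hypothetical format).
    fail_to_pass: tests that failed before the fix and should now pass.
    pass_to_pass: tests that passed before and must not regress.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as a failure.
    return all(test_status.get(name, False) for name in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of task instances whose patch was judged resolved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Two hypothetical task instances: one resolved, one with a regression.
outcomes = [
    is_resolved({"test_fix": True, "test_old": True}, ["test_fix"], ["test_old"]),
    is_resolved({"test_fix": True, "test_old": False}, ["test_fix"], ["test_old"]),
]
print(resolve_rate(outcomes))  # prints 0.5
```

The headline number usually reported for a model is the resolve rate, so scoring a subset only requires restricting the list of task instances before calling `resolve_rate`.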
Links
Alternatives
Backlog
- Track "Jules" performance on a subset of SWE-bench.