This document summarizes the results of a SWE-PolyBench run. It covers overall statistics, pass rates broken down by programming language and task category, and file retrieval metrics.
- Total resolved: 129/382
- Total pass rate: 33.77%

Pass rate by language:
- Python: 41/113 (36.3%)
- JavaScript: 30/100 (30.0%)
- TypeScript: 35/100 (35.0%)
- Java: 23/69 (33.3%)

Pass rate by task category:
- Bug Fix: 112/299 (37.5%)
- Feature: 13/70 (18.6%)
- Refactoring: 4/13 (30.8%)
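
The percentages above follow directly from the resolved/total counts. As a quick consistency check, the sketch below recomputes them in Python. The dictionary layout and the `pass_rate` and `totals` helpers are illustrative names chosen here, not part of the benchmark tooling.

```python
# Resolved / total counts copied from the summary above.
# The data layout and helper names are illustrative assumptions.
by_language = {
    "Python": (41, 113),
    "JavaScript": (30, 100),
    "TypeScript": (35, 100),
    "Java": (23, 69),
}
by_category = {
    "Bug Fix": (112, 299),
    "Feature": (13, 70),
    "Refactoring": (4, 13),
}


def pass_rate(resolved: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * resolved / total


def totals(groups: dict) -> tuple:
    """Sum resolved and total counts across all groups."""
    resolved = sum(r for r, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return resolved, total


for name, groups in (("language", by_language), ("category", by_category)):
    resolved, total = totals(groups)
    print(f"by {name}: {resolved}/{total} = {pass_rate(resolved, total):.2f}%")
    for key, (r, t) in groups.items():
        print(f"  {key}: {r}/{t} ({pass_rate(r, t):.1f}%)")
```

Both breakdowns sum to 129/382, so the language and category groupings partition the same set of tasks.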
File retrieval metrics by language:
| Language | file_recall | file_precision | file_f1 |
|---|---|---|---|
| Java | 0.63 | 0.68 | 0.60 |
| JavaScript | 0.58 | 0.62 | 0.55 |
| Python | 0.69 | 0.68 | 0.65 |
| TypeScript | 0.55 | 0.69 | 0.55 |
File retrieval metrics overall:
- file_recall: 0.61
- file_precision: 0.66
- file_f1: 0.59
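
The report does not define the file retrieval metrics. A common reading, assumed here, is a per-task set comparison between the files touched by the predicted patch and the files touched by the reference patch, with each metric then averaged over tasks. The sketch below illustrates that assumed definition; the function names and the two example tasks are hypothetical.

```python
def file_metrics(predicted: set, reference: set) -> tuple:
    """Return (recall, precision, f1) for one task."""
    hits = len(predicted & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f1


def macro_average(tasks: list) -> tuple:
    """Score each task first, then average each metric over tasks."""
    scores = [file_metrics(pred, ref) for pred, ref in tasks]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two hypothetical tasks (file names invented for illustration).
tasks = [
    ({"a.py"}, {"a.py", "b.py"}),   # recall 0.5, precision 1.0
    ({"c.py", "d.py"}, {"c.py"}),   # recall 1.0, precision 0.5
]
recall, precision, f1 = macro_average(tasks)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# prints: recall=0.75 precision=0.75 f1=0.67
```

Under this kind of per-task averaging, the mean F1 can fall below both the mean recall and the mean precision, as it does in the example. The reported overall F1 (0.59) is likewise below both recall (0.61) and precision (0.66), which is consistent with per-task averaging but does not confirm it.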
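
Key observations:
- The overall pass rate is 33.77% (129 of 382 tasks).
- Python has the highest pass rate (36.3%) as well as the highest file retrieval F1 (0.65) and recall (0.69).
- TypeScript has the second-highest pass rate (35.0%) despite the lowest file recall (0.55).
- Feature tasks are the most challenging, with the lowest pass rate (18.6%), roughly half that of Bug Fix tasks (37.5%). The Refactoring rate (30.8%) rests on only 13 tasks.
- File retrieval precision exceeds recall overall (0.66 vs. 0.61) and for every language except Python (0.68 vs. 0.69), indicating that retrieved files tend to be relevant but some target files are missed.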
- For more details, check the repository test configuration and logs.