EuniAI/Polybench-experiment

SWE-Polybench Results

Overview

This document summarizes the results of running SWE-Polybench. It includes overall statistics, pass rates broken down by programming language and task category, and file retrieval metrics.

Total resolved: 129/382
Total pass rate: 33.77%

Pass rate by language:
Python: 41/113 (36.3%)
JavaScript: 30/100 (30.0%)
TypeScript: 35/100 (35.0%)
Java: 23/69 (33.3%)

Pass rate by task category:
Bug Fix: 112/299 (37.5%)
Feature: 13/70 (18.6%)
Refactoring: 4/13 (30.8%)

File retrieval metrics by language:
            file_recall  file_precision  file_f1
language                                        
Java               0.63            0.68     0.60
JavaScript         0.58            0.62     0.55
Python             0.69            0.68     0.65
TypeScript         0.55            0.69     0.55

File retrieval metrics overall:
file_recall       0.61
file_precision    0.66
file_f1           0.59

Key Results (Summary)

  • Total resolved: 129 / 382
  • Total pass rate: 33.77%

Pass Rate by Language

  • Python: 41 / 113 (36.3%)
  • JavaScript: 30 / 100 (30.0%)
  • TypeScript: 35 / 100 (35.0%)
  • Java: 23 / 69 (33.3%)

Pass Rate by Task Category

  • Bug Fix: 112 / 299 (37.5%)
  • Feature: 13 / 70 (18.6%)
  • Refactoring: 4 / 13 (30.8%)
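
The percentages above follow directly from the resolved and total counts. A minimal sketch that recomputes them and checks that both breakdowns sum to the same overall total; the helper names `pass_rate` and `totals` are illustrative and not part of the repository:

```python
# Resolved and total task counts, copied from the lists above.
by_language = {
    "Python": (41, 113),
    "JavaScript": (30, 100),
    "TypeScript": (35, 100),
    "Java": (23, 69),
}
by_category = {
    "Bug Fix": (112, 299),
    "Feature": (13, 70),
    "Refactoring": (4, 13),
}


def pass_rate(resolved: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * resolved / total


def totals(groups: dict) -> tuple:
    """Sum resolved and total counts over all groups."""
    resolved = sum(r for r, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return resolved, total


for name, groups in (("language", by_language), ("category", by_category)):
    resolved, total = totals(groups)
    print(f"by {name}: {resolved}/{total} = {pass_rate(resolved, total):.2f}%")
    for label, (r, t) in groups.items():
        print(f"  {label}: {r}/{t} ({pass_rate(r, t):.1f}%)")
```

Both breakdowns add up to 129 resolved out of 382 tasks, so the language and category splits cover the same task set.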

File Retrieval Metrics (by language)

Language     file_recall  file_precision  file_f1
Java                0.63            0.68     0.60
JavaScript          0.58            0.62     0.55
Python              0.69            0.68     0.65
TypeScript          0.55            0.69     0.55
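
This README does not state how the file retrieval metrics are computed. One common definition, assumed here for illustration only, compares the set of files a patch modifies with the set modified by the reference patch, scores each task separately, and then averages over tasks. The function names and file paths below are hypothetical:

```python
from statistics import mean


def file_metrics(predicted: set, gold: set) -> dict:
    """Per-task precision, recall, and F1 over sets of file paths."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"file_recall": recall, "file_precision": precision, "file_f1": f1}


def mean_metrics(instances: list) -> dict:
    """Average each metric over tasks (assumed aggregation)."""
    rows = [file_metrics(pred, gold) for pred, gold in instances]
    return {key: mean(row[key] for row in rows) for key in rows[0]}


# Hypothetical tasks: (files touched by the patch, files in the reference).
examples = [
    ({"src/a.py", "src/b.py"}, {"src/a.py"}),  # one extra file
    ({"src/c.py"}, {"src/c.py", "src/d.py"}),  # one missed file
]
print(mean_metrics(examples))
```

Under this assumption the averaged F1 can fall below both the averaged precision and the averaged recall, because F1 is computed per task before averaging. In the example, precision and recall each average 0.75 while F1 averages about 0.67. That would be consistent with rows such as Java, where file_f1 (0.60) is lower than both file_recall (0.63) and file_precision (0.68).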

File Retrieval Metrics (overall)

  • file_recall: 0.61
  • file_precision: 0.66
  • file_f1: 0.59
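
As a consistency check, the overall figures can be compared with a task-weighted mean of the per-language rows. This sketch assumes that every one of the 382 tasks contributes one row to the metrics and that the overall numbers are plain means over tasks; neither assumption is confirmed by the README. The task counts are taken from the pass rate breakdown by language:

```python
# Task counts per language, from the pass rate breakdown.
task_counts = {"Java": 69, "JavaScript": 100, "Python": 113, "TypeScript": 100}

# Per-language metrics as reported (rounded to two decimals).
per_language = {
    "Java": {"file_recall": 0.63, "file_precision": 0.68, "file_f1": 0.60},
    "JavaScript": {"file_recall": 0.58, "file_precision": 0.62, "file_f1": 0.55},
    "Python": {"file_recall": 0.69, "file_precision": 0.68, "file_f1": 0.65},
    "TypeScript": {"file_recall": 0.55, "file_precision": 0.69, "file_f1": 0.55},
}

reported_overall = {"file_recall": 0.61, "file_precision": 0.66, "file_f1": 0.59}


def weighted_mean(metric: str) -> float:
    """Mean of a per-language metric, weighted by task count."""
    total = sum(task_counts.values())
    weighted = sum(per_language[lang][metric] * n for lang, n in task_counts.items())
    return weighted / total


for metric, reported in reported_overall.items():
    estimate = weighted_mean(metric)
    print(f"{metric}: weighted mean {estimate:.3f}, reported {reported:.2f}")
```

Each weighted mean lands within 0.01 of the reported overall value. Gaps of that size are expected because the per-language inputs are themselves rounded to two decimals.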

Conclusions & Highlights

  • Overall pass rate is 33.77% (129 of 382 tasks).
  • Python achieves the highest file retrieval F1 (0.65) and strongest recall (0.69).
  • TypeScript has the second-highest pass rate (35.0%), behind Python (36.3%), despite the lowest file recall (0.55).
  • Feature tasks are the most challenging, with the lowest pass rate (18.6%).
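  • File retrieval precision is higher than recall overall (0.66 vs. 0.61) and in three of the four languages, indicating that retrieved files tend to be relevant but some targets are missed. Python is the exception, with recall (0.69) slightly above precision (0.68).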

Contact

  • For more details, check the repository test configuration and logs.
