Fix: Continue SWE-bench evaluation when Modal sandbox creation fails #416
Reference Issues/PRs
Addresses issue #414 where SWE-bench evaluation fails completely when a Modal sandbox cannot be created for some problems, preventing the generation of the final report.
What does this implement/fix? Explain your changes.
This PR modifies the `run_instances_modal` function in `swebench/harness/modal_eval/run_evaluation_modal.py` to handle Modal sandbox creation failures gracefully.

Currently, when running with `--modal true`, the entire evaluation fails if any single instance encounters an error during sandbox creation (e.g., when a dependency like `types-setuptools==0.1.3` cannot be installed). This prevents the final report from being generated even if most problems could be successfully evaluated.

The main change in this PR is to catch errors raised while creating or running the sandbox for an individual instance, log the failure, and continue evaluating the remaining instances so that the final report can still be produced. A sketch of the pattern follows.
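As a rough illustration of the approach (the loop structure, signature, and the `run_instance_modal` helper here are assumptions for the sketch, not the actual harness code):

```python
import traceback

def run_instances_modal(instances, run_id):
    """Evaluate each instance in its own Modal sandbox, skipping failures.

    Sketch only: the real function's signature and internals differ; the
    point is the per-instance try/except that lets the run continue.
    """
    results, failed = {}, []
    for instance in instances:
        instance_id = instance["instance_id"]
        try:
            # Sandbox creation (e.g., installing the instance's pinned
            # dependencies) is the step that can fail for individual problems.
            results[instance_id] = run_instance_modal(instance, run_id)
        except Exception as exc:
            # Record the failure and move on instead of aborting the whole run.
            print(f"Sandbox creation/evaluation failed for {instance_id}: {exc}")
            traceback.print_exc()
            failed.append(instance_id)
    # The final report can now be generated from `results`, with `failed`
    # listed separately so the user knows which instances were skipped.
    return results, failed
```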
This is not a complete fix for the sandbox creation failures themselves (which may require more investigation into dependency issues), but it provides a significant improvement by allowing the evaluation to continue and produce a report even when some instances fail.
Any other comments?
This change is particularly important for large-scale evaluations where a small number of sandbox failures shouldn't prevent results from being generated for all other problems.