Fix: Continue SWE-bench evaluation when Modal sandbox creation fails by anastasiya1155 · Pull Request #416 · SWE-bench/SWE-bench

Fix: Continue SWE-bench evaluation when Modal sandbox creation fails #416


Open · wants to merge 1 commit into main

Conversation

@anastasiya1155 commented Jun 9, 2025

Reference Issues/PRs

Addresses issue #414 where SWE-bench evaluation fails completely when a Modal sandbox cannot be created for some problems, preventing the generation of the final report.

What does this implement/fix? Explain your changes.

This PR modifies the run_instances_modal function in swebench/harness/modal_eval/run_evaluation_modal.py to handle Modal sandbox creation failures gracefully.

Currently, when running with --modal true, the entire evaluation fails if any single instance encounters an error during sandbox creation (e.g., when a dependency like types-setuptools==0.1.3 cannot be installed). This prevents the final report from being generated even if most problems could be successfully evaluated.

The main changes in this PR are (a sketch of the resulting loop follows the list):

  1. Process each instance individually in a loop rather than all at once
  2. Wrap each instance's processing in a try/except block to catch and isolate errors
  3. Create log files for both successful and failed runs
  4. Continue processing other instances even if one fails
  5. Always attempt to generate the final report at the end
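
Below is a minimal sketch of that control flow, not the harness's actual code: the helper names `run_single_instance` and `make_run_report` are hypothetical placeholders for the per-instance Modal evaluation call and the report generation in run_evaluation_modal.py.

```python
# Sketch of per-instance error isolation (placeholder helpers, not the real API).
import json
import traceback
from pathlib import Path


def run_single_instance(instance: dict) -> dict:
    """Placeholder for the per-instance Modal sandbox run; may raise on failure."""
    raise NotImplementedError


def make_run_report(results: dict, run_id: str) -> None:
    """Placeholder for final report generation."""
    succeeded = sum(r is not None for r in results.values())
    print(f"run {run_id}: {succeeded}/{len(results)} instances completed")


def run_instances_with_isolation(instances: list[dict], log_dir: Path, run_id: str) -> dict:
    log_dir.mkdir(parents=True, exist_ok=True)
    results: dict[str, dict | None] = {}
    for instance in instances:
        instance_id = instance["instance_id"]
        log_path = log_dir / f"{instance_id}.log"
        try:
            # Steps 1-2: process one instance at a time inside try/except,
            # so a sandbox-creation failure cannot abort the whole run.
            results[instance_id] = run_single_instance(instance)
            # Step 3: write a log entry for the successful run.
            log_path.write_text(json.dumps({"status": "completed"}))
        except Exception:
            # Steps 3-4: record the error and continue with the remaining instances.
            log_path.write_text(json.dumps(
                {"status": "error", "traceback": traceback.format_exc()}
            ))
            results[instance_id] = None
    # Step 5: always attempt the final report, even with partial failures.
    make_run_report(results, run_id)
    return results
```

Collecting results per instance and deferring report generation to the end means a single failed sandbox only costs its own log entry rather than the whole evaluation run.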

This is not a complete fix for the sandbox creation failures themselves (which may require more investigation into dependency issues), but it provides a significant improvement by allowing the evaluation to continue and produce a report even when some instances fail.

Any other comments?

This change is particularly important for large-scale evaluations where a small number of sandbox failures shouldn't prevent results from being generated for all other problems.
