Fix: Continue SWE-bench evaluation when Modal sandbox creation fails #416
Reference Issues/PRs
Addresses issue #414 where SWE-bench evaluation fails completely when a Modal sandbox cannot be created for some problems, preventing the generation of the final report.
What does this implement/fix? Explain your changes.
This PR modifies the `run_instances_modal` function in `swebench/harness/modal_eval/run_evaluation_modal.py` to handle Modal sandbox creation failures gracefully.

Currently, when running with `--modal true`, the entire evaluation fails if any single instance encounters an error during sandbox creation (e.g., when a dependency like `types-setuptools==0.1.3` cannot be installed). This prevents the final report from being generated even if most problems could be successfully evaluated.

The main change in this PR is to catch errors raised while creating or running the sandbox for an individual instance, log the failure, and continue evaluating the remaining instances so that the final report can still be produced. A sketch of the pattern follows.
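As a rough illustration of the approach (the loop structure, signature, and the `run_instance_modal` helper here are assumptions for the sketch, not the actual harness code):

```python
import traceback

def run_instances_modal(instances, run_id):
    """Evaluate each instance in its own Modal sandbox, skipping failures.

    Sketch only: the real function's signature and internals differ; the
    point is the per-instance try/except that lets the run continue.
    """
    results, failed = {}, []
    for instance in instances:
        instance_id = instance["instance_id"]
        try:
            # Sandbox creation (e.g., installing the instance's pinned
            # dependencies) is the step that can fail for individual problems.
            results[instance_id] = run_instance_modal(instance, run_id)
        except Exception as exc:
            # Record the failure and move on instead of aborting the whole run.
            print(f"Sandbox creation/evaluation failed for {instance_id}: {exc}")
            traceback.print_exc()
            failed.append(instance_id)
    # The final report can now be generated from `results`, with `failed`
    # listed separately so the user knows which instances were skipped.
    return results, failed
```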
This is not a complete fix for the sandbox creation failures themselves (which may require more investigation into dependency issues), but it provides a significant improvement by allowing the evaluation to continue and produce a report even when some instances fail.
Any other comments?
This change is particularly important for large-scale evaluations where a small number of sandbox failures shouldn't prevent results from being generated for all other problems.