Update CRDS cache workflow usage to reduce random failures #9040
Codecov Report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main    #9040      +/-   ##
==========================================
- Coverage   76.90%   76.90%   -0.01%
==========================================
  Files         498      498
  Lines       45761    45769       +8
==========================================
+ Hits        35194    35199       +5
- Misses      10567    10570       +3

This pull request uses carry forward flags. View full report in Codecov by Sentry.
This reverts commit 8b5e7bf.
I'm going to merge since it's approved, but please let me know if there are any questions. I requested reviews from some folks who reported the initial failures, and want to highlight one part of the description above:
https://github.com/spacetelescope/jwst/actions/runs/12587715413/attempts/1 provides an example of another wrinkle with the whole caching approach as a way to work around the CRDS parallel cache write issue. The caches are per-branch, so after merging this PR the run on … The failed jobs succeeded when re-run (since they now use the valid cache).
As this is all CI related, marking as "no-changelog".
This PR partially reverts #8987 related commits with the following end results:

- keep `jwst-edit` as the default virtual context
- update the `contexts` workflow to resolve the `jwst-edit` virtual context to a valid `pmap`
- use the `pmap` (not the virtual context) (the main bug fix; the other changes are in support of this)

This (partially) addresses the recent CI failures:
At the core I think this is a CRDS issue (or an issue in how the pipeline uses CRDS) where simultaneous requests for a single file result in attempts to read partial files. This issue led to more common (random-ish) failures when #8987 switched the cache key to `crds-jwst-edit`. This is only a partial fix because the first run for a new context can still hit the same failure. However, the failures will go away (with these changes) after 1 successful run (needed to generate a valid cache for a given pmap).
To provide some details (for future readers and curious reviewers):
CRDS partial files for xdist runs
Let's assume a job starts with an empty (or incomplete) CRDS cache. In this situation the test run will trigger downloading map files (and reference files). As the tests run using xdist (spawning multiple worker processes that each run tests in parallel) it's possible that worker 1 will be fetching a mapping that worker 2 tries to use. If worker 1 is partway through downloading the file, the mapping file will be on disk but incomplete. Worker 2 will see the file exists, attempt to read it, and fail (see the above linked failures).
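The failure mode can be reproduced without CRDS at all. Below is a toy sketch (JSON stands in for a real mapping file; all names are hypothetical) of one worker reading a file another worker has only partially written, plus the usual mitigation of downloading to a temporary file and atomically renaming it into place:

```python
import json
import os
import tempfile

MAPPING = {"observatory": "jwst", "mapping": "pipeline"}
SERIALIZED = json.dumps(MAPPING)

def naive_write(path, data, nbytes):
    # Simulate worker 1 mid-download: only the first nbytes are on disk.
    with open(path, "w") as f:
        f.write(data[:nbytes])

def atomic_write(path, data):
    # Download to a temp file in the same directory, then rename.
    # os.replace() is atomic on POSIX, so readers see either nothing
    # or the complete file -- never a partial one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)

def read_mapping(path):
    with open(path) as f:
        return json.loads(f.read())

with tempfile.TemporaryDirectory() as d:
    partial = os.path.join(d, "jwst.pmap")
    naive_write(partial, SERIALIZED, nbytes=10)  # worker 1 still downloading
    try:
        read_mapping(partial)  # worker 2 sees the file and tries to use it
        partial_read_failed = False
    except json.JSONDecodeError:
        partial_read_failed = True

    complete = os.path.join(d, "jwst2.pmap")
    atomic_write(complete, SERIALIZED)
    roundtrip = read_mapping(complete)

print(partial_read_failed, roundtrip == MAPPING)
```

The naive path fails exactly the way the linked CI runs do: the file exists but does not parse.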
This only happens for empty (or incomplete) CRDS caches. For an empty cache, if the run doesn't happen to hit the problem, the run will complete and the cache will be stored (since it was originally a cache miss). No future runs that use this cache should have the problem unless a new mapping or a new reference file is needed (those will be incomplete-cache runs). For an empty cache, if the run fails, no cache will be stored. For an incomplete-cache run, even if the run is successful (and the required mappings or references are downloaded), no updated cache will be stored (because the key resulted in a cache hit).
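These store-only-on-miss semantics can be modeled in a few lines; this is a toy model with hypothetical names, not the actual `actions/cache` implementation:

```python
class MissOnlyCache:
    """Toy model of a CI cache: an entry is saved at the end of a job
    only if the key was a miss at the start of the job."""

    def __init__(self):
        self.entries = {}

    def run_job(self, key, job_succeeded, files_after_job):
        hit = key in self.entries
        if job_succeeded and not hit:
            # Cache miss + successful run: the populated cache is stored.
            self.entries[key] = set(files_after_job)
        # On a hit, nothing is saved, even if the job downloaded more files.
        return hit

cache = MissOnlyCache()

# Run 1: empty cache, run succeeds -> cache stored for this key.
hit1 = cache.run_job("crds-jwst-edit", True, {"a.pmap", "ref1.fits"})
# A new reference file is later needed; run 2 hits the existing entry.
hit2 = cache.run_job("crds-jwst-edit", True, {"a.pmap", "ref1.fits", "ref2.fits"})
# The stored entry still lacks ref2.fits, so every later run re-downloads
# it -- the "incomplete cache" state that keeps the race window open.
print(hit1, hit2, "ref2.fits" in cache.entries["crds-jwst-edit"])
```

The hit on run 2 is exactly why an incomplete cache never self-heals: the saved entry is frozen at whatever the first successful miss produced.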
Using a virtual context as a cache key
On main currently the default virtual context is `jwst-edit`. This leads to a cache key `crds-jwst-edit`. However, since `jwst-edit` is a virtual context, the cache will become stale (incomplete) when `jwst-edit` is updated to point to a new pmap. When the associated cache is stale, every run will trigger downloading mapping and reference files that weren't in the pmap previously pointed to by `jwst-edit` (see above how an incomplete cache can lead to random test failures due to CRDS behavior and pytest-xdist).

Partial fix
This PR is only a partial fix because the first time a new pmap is used (either because `jwst-edit` was updated to point to a new pmap or a pmap was manually specified) it's possible the tests will randomly fail due to the "empty" cache behavior discussed above. A more complete fix would be to: …

With the partial fix in this PR (which seems more useful than any of the available more complete fixes) there is the possibility the issue will reappear when a new context is used. If failures of this type are seen, re-running might resolve the issue (especially if a different run succeeded and stored a valid cache), or in the worst case a manual run could be triggered using `-n 1` to run the tests serially. We could update the workflows to run `-n 1` on a cache miss to avoid the failures (at the expense of a slow first run for a new context), but this would require a more extensive refactoring of the workflows, which I think would be working around a more fundamental issue with CRDS. If a user were to start 2 pipeline runs that use the same reference files (missing from their local cache), shouldn't CRDS avoid trying to load the partial file if run 2 starts slightly after run 1?

Tasks
- add a build milestone, i.e. `Build 11.3` (use the latest build if not sure)
- if no changelog entry is needed, label with `no-changelog-entry-needed`
- otherwise add a news fragment under `changes/`: `echo "changed something" > changes/<PR#>.<changetype>.rst` (see below for change types)
- update the relevant `docs/` page
- run `okify_regtests` to update the truth files

news fragment change types...
changes/<PR#>.general.rst: infrastructure or miscellaneous change
changes/<PR#>.docs.rst
changes/<PR#>.stpipe.rst
changes/<PR#>.datamodels.rst
changes/<PR#>.scripts.rst
changes/<PR#>.fits_generator.rst
changes/<PR#>.set_telescope_pointing.rst
changes/<PR#>.pipeline.rst
stage 1
changes/<PR#>.group_scale.rst
changes/<PR#>.dq_init.rst
changes/<PR#>.emicorr.rst
changes/<PR#>.saturation.rst
changes/<PR#>.ipc.rst
changes/<PR#>.firstframe.rst
changes/<PR#>.lastframe.rst
changes/<PR#>.reset.rst
changes/<PR#>.superbias.rst
changes/<PR#>.refpix.rst
changes/<PR#>.linearity.rst
changes/<PR#>.rscd.rst
changes/<PR#>.persistence.rst
changes/<PR#>.dark_current.rst
changes/<PR#>.charge_migration.rst
changes/<PR#>.jump.rst
changes/<PR#>.clean_flicker_noise.rst
changes/<PR#>.ramp_fitting.rst
changes/<PR#>.gain_scale.rst
stage 2
changes/<PR#>.assign_wcs.rst
changes/<PR#>.badpix_selfcal.rst
changes/<PR#>.msaflagopen.rst
changes/<PR#>.nsclean.rst
changes/<PR#>.imprint.rst
changes/<PR#>.background.rst
changes/<PR#>.extract_2d.rst
changes/<PR#>.master_background.rst
changes/<PR#>.wavecorr.rst
changes/<PR#>.srctype.rst
changes/<PR#>.straylight.rst
changes/<PR#>.wfss_contam.rst
changes/<PR#>.flatfield.rst
changes/<PR#>.fringe.rst
changes/<PR#>.pathloss.rst
changes/<PR#>.barshadow.rst
changes/<PR#>.photom.rst
changes/<PR#>.pixel_replace.rst
changes/<PR#>.resample_spec.rst
changes/<PR#>.residual_fringe.rst
changes/<PR#>.cube_build.rst
changes/<PR#>.extract_1d.rst
changes/<PR#>.resample.rst
stage 3
changes/<PR#>.assign_mtwcs.rst
changes/<PR#>.mrs_imatch.rst
changes/<PR#>.tweakreg.rst
changes/<PR#>.skymatch.rst
changes/<PR#>.exp_to_source.rst
changes/<PR#>.outlier_detection.rst
changes/<PR#>.tso_photometry.rst
changes/<PR#>.stack_refs.rst
changes/<PR#>.align_refs.rst
changes/<PR#>.klip.rst
changes/<PR#>.spectral_leak.rst
changes/<PR#>.source_catalog.rst
changes/<PR#>.combine_1d.rst
changes/<PR#>.ami.rst
other
changes/<PR#>.wfs_combine.rst
changes/<PR#>.white_light.rst
changes/<PR#>.cube_skymatch.rst
changes/<PR#>.engdb_tools.rst
changes/<PR#>.guider_cds.rst