Update CRDS cache workflow usage to reduce random failures by braingram · Pull Request #9040 · spacetelescope/jwst · GitHub

Update CRDS cache workflow usage to reduce random failures #9040


Merged

braingram merged 3 commits into spacetelescope:main from fix_crds_cache on Jan 2, 2025

Conversation

@braingram braingram commented Jan 2, 2025

As this is all CI related, I'm marking this as "no-changelog".

This PR partially reverts #8987 related commits with the following end results:

  • retain jwst-edit as the default virtual context
  • remove uses of the unstable crds jsonrpc call
  • restore the contexts workflow to resolve the jwst-edit virtual context to a valid pmap
  • cache based on the pmap (not the virtual context) (the main bug fix, the other changes are in support of this)

This (partially) addresses the recent CI failures:

At the core I think this is a CRDS issue (or an issue in how the pipeline uses CRDS) where simultaneous requests for a single file result in attempts to read partial files. This issue led to more frequent (random-ish) failures when #8987 switched the cache key to crds-jwst-edit.

This is only a partial fix because the first run for a new context can still lead to the same failure popping up. However, the failures will go away (with these changes) after 1 successful run (needed to generate a valid cache for a given pmap).

To provide some details (for future readers and curious reviewers):

CRDS partial files for xdist runs

Let's assume a job starts with an empty (or incomplete) CRDS cache. In this situation the test run will trigger downloading map files (and reference files). As the tests run using xdist (spawning multiple worker processes that each run tests in parallel), it's possible that worker 1 will be fetching a mapping that worker 2 tries to use. If worker 1 is partway through downloading the file, the mapping file will be on disk but incomplete. Worker 2 will see that the file exists, attempt to read it, and fail (see the above linked failures).
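To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical file name and payload, not real CRDS or jwst code) of the race: a "downloader" process writes a mapping-like file in small chunks while a "reader" process sees that the file exists and tries to parse it before the download finishes.

```python
import json
import multiprocessing
import os
import tempfile
import time


def downloader(path):
    # Write the file in small chunks to simulate a slow network download.
    payload = json.dumps({"mapping": "x" * 1_000_000})
    with open(path, "w") as f:
        for start in range(0, len(payload), 1024):
            f.write(payload[start:start + 1024])
            f.flush()
            time.sleep(0.001)


def reader(path):
    # Mimic a second xdist worker: as soon as the file exists, try to use it.
    while not os.path.exists(path):
        time.sleep(0.001)
    try:
        with open(path) as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        print("reader failed on a partial file:", exc)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo_mapping.json")
    workers = [multiprocessing.Process(target=downloader, args=(path,)),
               multiprocessing.Process(target=reader, args=(path,))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Running this almost always prints a JSON decode error from the reader, which is the same "file exists but is incomplete" situation the xdist workers hit with the CRDS cache.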

This only happens for empty (or incomplete) CRDS caches. For an empty cache, if the run happens to avoid the problem, it will complete and the cache will be stored (since the key was originally a cache miss); no future run that uses this cache should hit the problem unless a new mapping or reference file is required (those become incomplete-cache runs). If a run starting from an empty cache fails, no cache is stored at all. For an incomplete-cache run, even if the run succeeds (and the required mappings or references are downloaded), no updated cache will be stored, because the key resulted in a cache hit.

Using a virtual context as a cache key

On main, the default virtual context is currently jwst-edit. This leads to a cache key of crds-jwst-edit. However, since jwst-edit is a virtual context, the cache will become stale (incomplete) whenever jwst-edit is updated to point to a new pmap. When the associated cache is stale, every run will trigger downloading mapping and reference files that weren't in the pmap previously pointed to by jwst-edit (see above for how an incomplete cache can lead to random test failures due to CRDS behavior and pytest-xdist).
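For illustration, here is a minimal sketch of the idea behind this PR's cache key: resolve the virtual context to the concrete pmap it currently points to and key the cache on that, so the key changes whenever jwst-edit moves. The client call below and its signature are assumptions (resolving jwst-edit specifically may require a different call or argument, and the server URL must already be configured, e.g. via CRDS_SERVER_URL); the real contexts workflow in this repository may do this differently.

```python
# Sketch only: crds.client.api.get_default_context and its exact signature
# are assumptions -- verify against the installed crds version. CRDS_SERVER_URL
# (and CRDS_PATH) are expected to be set in the environment, as in the CI
# workflows.
from crds.client import api


def resolve_cache_key(observatory="jwst"):
    # Ask the CRDS server which concrete pmap the virtual context currently
    # points to, e.g. "jwst_1321.pmap".
    pmap = api.get_default_context(observatory)
    # Key the workflow cache on the resolved pmap, not on "jwst-edit", so a
    # new pmap produces a new key (a cache miss) instead of a stale hit.
    return f"crds-{pmap}"


if __name__ == "__main__":
    print(resolve_cache_key())
```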

Partial fix

This PR is only a partial fix because the first time a new pmap is used (either because jwst-edit was updated to point to a new pmap or because a pmap was manually specified), it's possible the tests will randomly fail due to the "empty" cache behavior discussed above. A more complete fix would be to:

  • enable CRDS cache locking. Options exist but none are currently compatible with parallel test runs
  • pre-fetch all mappings and reference files to populate the cache before using it. This may require more than the allowed 10 GB of cache space and would produce a cache much larger than needed.
  • stop running tests in parallel, which would lead to much slower test runs.

With the partial fix in this PR (which seems more useful than any of the available more complete fixes), the issue may reappear when a new context is used. If failures of this type are seen, re-running might resolve the issue (especially if a different run succeeded and stored a valid cache); in the worst case, a manual run could be triggered using -n 1 to run the tests serially. We could update the workflows to run with -n 1 on a cache miss to avoid the failures (at the expense of a slow first run for a new context), but this would require a more extensive refactoring of the workflows, and I think it would be working around a more fundamental issue with CRDS: if a user starts 2 pipeline runs that use the same reference files (missing from their local cache), shouldn't CRDS avoid trying to load the partial file when run 2 starts slightly after run 1?
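On that last point, the usual mitigation for "reader sees a partial file" is to download to a temporary file in the same directory and atomically rename it into place, so other processes only ever see a missing file or a complete one. A minimal sketch of that pattern (this is not CRDS's actual download code):

```python
import os
import tempfile
import urllib.request


def atomic_download(url, dest_path):
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temporary file must live in the same directory (same filesystem)
    # as the destination, otherwise os.replace() is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            while chunk := resp.read(64 * 1024):
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic rename into place
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With this pattern a second worker either downloads the file itself (possibly redundantly) or sees the complete file after the rename; it never parses a half-written one.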

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly
news fragment change types...
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change
  • changes/<PR#>.docs.rst
  • changes/<PR#>.stpipe.rst
  • changes/<PR#>.datamodels.rst
  • changes/<PR#>.scripts.rst
  • changes/<PR#>.fits_generator.rst
  • changes/<PR#>.set_telescope_pointing.rst
  • changes/<PR#>.pipeline.rst

stage 1

  • changes/<PR#>.group_scale.rst
  • changes/<PR#>.dq_init.rst
  • changes/<PR#>.emicorr.rst
  • changes/<PR#>.saturation.rst
  • changes/<PR#>.ipc.rst
  • changes/<PR#>.firstframe.rst
  • changes/<PR#>.lastframe.rst
  • changes/<PR#>.reset.rst
  • changes/<PR#>.superbias.rst
  • changes/<PR#>.refpix.rst
  • changes/<PR#>.linearity.rst
  • changes/<PR#>.rscd.rst
  • changes/<PR#>.persistence.rst
  • changes/<PR#>.dark_current.rst
  • changes/<PR#>.charge_migration.rst
  • changes/<PR#>.jump.rst
  • changes/<PR#>.clean_flicker_noise.rst
  • changes/<PR#>.ramp_fitting.rst
  • changes/<PR#>.gain_scale.rst

stage 2

  • changes/<PR#>.assign_wcs.rst
  • changes/<PR#>.badpix_selfcal.rst
  • changes/<PR#>.msaflagopen.rst
  • changes/<PR#>.nsclean.rst
  • changes/<PR#>.imprint.rst
  • changes/<PR#>.background.rst
  • changes/<PR#>.extract_2d.rst
  • changes/<PR#>.master_background.rst
  • changes/<PR#>.wavecorr.rst
  • changes/<PR#>.srctype.rst
  • changes/<PR#>.straylight.rst
  • changes/<PR#>.wfss_contam.rst
  • changes/<PR#>.flatfield.rst
  • changes/<PR#>.fringe.rst
  • changes/<PR#>.pathloss.rst
  • changes/<PR#>.barshadow.rst
  • changes/<PR#>.photom.rst
  • changes/<PR#>.pixel_replace.rst
  • changes/<PR#>.resample_spec.rst
  • changes/<PR#>.residual_fringe.rst
  • changes/<PR#>.cube_build.rst
  • changes/<PR#>.extract_1d.rst
  • changes/<PR#>.resample.rst

stage 3

  • changes/<PR#>.assign_mtwcs.rst
  • changes/<PR#>.mrs_imatch.rst
  • changes/<PR#>.tweakreg.rst
  • changes/<PR#>.skymatch.rst
  • changes/<PR#>.exp_to_source.rst
  • changes/<PR#>.outlier_detection.rst
  • changes/<PR#>.tso_photometry.rst
  • changes/<PR#>.stack_refs.rst
  • changes/<PR#>.align_refs.rst
  • changes/<PR#>.klip.rst
  • changes/<PR#>.spectral_leak.rst
  • changes/<PR#>.source_catalog.rst
  • changes/<PR#>.combine_1d.rst
  • changes/<PR#>.ami.rst

other

  • changes/<PR#>.wfs_combine.rst
  • changes/<PR#>.white_light.rst
  • changes/<PR#>.cube_skymatch.rst
  • changes/<PR#>.engdb_tools.rst
  • changes/<PR#>.guider_cds.rst

@github-actions bot added the testing and automation (Continuous Integration (CI) and testing automation tools) labels on Jan 2, 2025
@braingram force-pushed the fix_crds_cache branch 2 times, most recently from 9dacb42 to 8fa726e on January 2, 2025 17:30
codecov bot commented Jan 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.90%. Comparing base (bb62f0c) to head (e82bac9).
Report is 822 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9040      +/-   ##
==========================================
- Coverage   76.90%   76.90%   -0.01%     
==========================================
  Files         498      498              
  Lines       45761    45769       +8     
==========================================
+ Hits        35194    35199       +5     
- Misses      10567    10570       +3     
Flag      Coverage Δ        *Carryforward flag
nightly   77.39% <ø> (ø)    Carriedforward from bb62f0c

*This pull request uses carry forward flags.


@braingram braingram marked this pull request as ready for review January 2, 2025 19:12
@braingram braingram requested a review from a team as a code owner January 2, 2025 19:12
@braingram braingram changed the title Fix crds cache Update CRDS cache workflow usage to reduce random failures Jan 2, 2025
@braingram (Collaborator, Author) commented:

I'm going to merge since it's approved, but please let me know if there are any questions.

I requested reviews from some folks who reported the initial failures, and I want to highlight one part of the description above:

If failures of this type are seen, re-running might resolve the issue (especially if a different run succeeded and stored a valid cache); in the worst case, a manual run could be triggered using -n 1 to run the tests serially.

@braingram braingram merged commit adadfaa into spacetelescope:main Jan 2, 2025
31 checks passed
@braingram braingram deleted the fix_crds_cache branch January 2, 2025 19:31
@braingram (Collaborator, Author) commented on Jan 2, 2025

https://github.com/spacetelescope/jwst/actions/runs/12587715413/attempts/1 provides an example of another wrinkle with using caching as a way to work around the CRDS parallel cache-write issue.

The caches are per-branch, so after merging this PR the run on main didn't see the crds-jwst_1321.pmap cache generated for this PR. It ran with an empty cache and had a few failures (due to the above issue combined with starting from an empty cache).

The failed jobs succeeded when re-run (since they now use the valid cache).
https://github.com/spacetelescope/jwst/actions/runs/12587715413

Labels
automation (Continuous Integration (CI) and testing automation tools), no-changelog-entry-needed, testing