Update CRDS cache workflow usage to reduce random failures by braingram · Pull Request #9040 · spacetelescope/jwst · GitHub

Update CRDS cache workflow usage to reduce random failures #9040


Merged

braingram merged 3 commits into spacetelescope:main from fix_crds_cache on Jan 2, 2025

Conversation

@braingram braingram commented Jan 2, 2025

As this is all CI related, I'm marking this as "no-changelog".

This PR partially reverts #8987 related commits with the following end results:

  • retain jwst-edit as the default virtual context
  • remove uses of the unstable crds jsonrpc call
  • restore the contexts workflow to resolve the jwst-edit virtual context to a valid pmap
  • cache based on the pmap (not the virtual context) (the main bug fix, the other changes are in support of this)

This (partially) addresses the recent CI failures:

At the core I think this is a CRDS issue (or an issue in how the pipeline uses CRDS) where simultaneous requests for a single file result in attempts to read partial files. This issue led to more frequent (random-ish) failures when #8987 switched the cache key to crds-jwst-edit.

This is only a partial fix because the first run for a new context can still lead to the same failure popping up. However, the failures will go away (with these changes) after 1 successful run (needed to generate a valid cache for a given pmap).

To provide some details (for future readers and curious reviewers):

CRDS partial files for xdist runs

Let's assume a job starts with an empty (or incomplete) CRDS cache. In this situation the test run will trigger downloading map files (and reference files). As the tests run using xdist (spawning multiple worker processes that each run tests in parallel), it's possible that worker 1 will be fetching a mapping that worker 2 tries to use. If worker 1 is partway through downloading the file, the mapping file will be on disk but incomplete. Worker 2 will see that the file exists, attempt to read it, and fail (see the above linked failures).
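To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical file name and payload, not real CRDS or jwst code) of the race: a "downloader" process writes a mapping-like file in small chunks while a "reader" process sees that the file exists and tries to parse it before the download finishes.

```python
import json
import multiprocessing
import os
import tempfile
import time


def downloader(path):
    # Write the file in small chunks to simulate a slow network download.
    payload = json.dumps({"mapping": "x" * 1_000_000})
    with open(path, "w") as f:
        for start in range(0, len(payload), 1024):
            f.write(payload[start:start + 1024])
            f.flush()
            time.sleep(0.001)


def reader(path):
    # Mimic a second xdist worker: as soon as the file exists, try to use it.
    while not os.path.exists(path):
        time.sleep(0.001)
    try:
        with open(path) as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        print("reader failed on a partial file:", exc)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo_mapping.json")
    workers = [multiprocessing.Process(target=downloader, args=(path,)),
               multiprocessing.Process(target=reader, args=(path,))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Running this almost always prints a JSON decode error from the reader, which is the same "file exists but is incomplete" situation the xdist workers hit with the CRDS cache.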

This only happens for empty (or incomplete) CRDS caches. For an empty cache, if the run happens to avoid the problem, it will complete and the cache will be stored (since the key was originally a cache miss); no future run that uses this cache should hit the problem unless a new mapping or reference file is required (those become incomplete-cache runs). If a run starting from an empty cache fails, no cache is stored at all. For an incomplete-cache run, even if the run succeeds (and the required mappings or references are downloaded), no updated cache will be stored, because the key resulted in a cache hit.

Using a virtual context as a cache key

On main, the default virtual context is currently jwst-edit. This leads to a cache key of crds-jwst-edit. However, since jwst-edit is a virtual context, the cache will become stale (incomplete) whenever jwst-edit is updated to point to a new pmap. When the associated cache is stale, every run will trigger downloading mapping and reference files that weren't in the pmap previously pointed to by jwst-edit (see above for how an incomplete cache can lead to random test failures due to CRDS behavior and pytest-xdist).
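For illustration, here is a minimal sketch of the idea behind this PR's cache key: resolve the virtual context to the concrete pmap it currently points to and key the cache on that, so the key changes whenever jwst-edit moves. The client call below and its signature are assumptions (resolving jwst-edit specifically may require a different call or argument, and the server URL must already be configured, e.g. via CRDS_SERVER_URL); the real contexts workflow in this repository may do this differently.

```python
# Sketch only: crds.client.api.get_default_context and its exact signature
# are assumptions -- verify against the installed crds version. CRDS_SERVER_URL
# (and CRDS_PATH) are expected to be set in the environment, as in the CI
# workflows.
from crds.client import api


def resolve_cache_key(observatory="jwst"):
    # Ask the CRDS server which concrete pmap the virtual context currently
    # points to, e.g. "jwst_1321.pmap".
    pmap = api.get_default_context(observatory)
    # Key the workflow cache on the resolved pmap, not on "jwst-edit", so a
    # new pmap produces a new key (a cache miss) instead of a stale hit.
    return f"crds-{pmap}"


if __name__ == "__main__":
    print(resolve_cache_key())
```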

Partial fix

This PR is only a partial fix because the first time a new pmap is used (either because jwst-edit was updated to point to a new pmap or because a pmap was manually specified), it's possible the tests will randomly fail due to the "empty" cache behavior discussed above. A more complete fix would be to:

  • enable CRDS cache locking. Options exist but none are currently compatible with parallel test runs
  • pre-fetch all mappings and reference files to populate the cache before using it. This may require more than the allowed 10 GB of cache space and would produce a cache much larger than needed.
  • stop running tests in parallel, which would lead to much slower test runs.

With the partial fix in this PR (which seems more useful than any of the available more complete fixes), the issue may reappear when a new context is used. If failures of this type are seen, re-running might resolve the issue (especially if a different run succeeded and stored a valid cache); in the worst case, a manual run could be triggered using -n 1 to run the tests serially. We could update the workflows to run with -n 1 on a cache miss to avoid the failures (at the expense of a slow first run for a new context), but this would require a more extensive refactoring of the workflows, and I think it would be working around a more fundamental issue with CRDS: if a user starts 2 pipeline runs that use the same reference files (missing from their local cache), shouldn't CRDS avoid trying to load the partial file when run 2 starts slightly after run 1?
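On that last point, the usual mitigation for "reader sees a partial file" is to download to a temporary file in the same directory and atomically rename it into place, so other processes only ever see a missing file or a complete one. A minimal sketch of that pattern (this is not CRDS's actual download code):

```python
import os
import tempfile
import urllib.request


def atomic_download(url, dest_path):
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temporary file must live in the same directory (same filesystem)
    # as the destination, otherwise os.replace() is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            while chunk := resp.read(64 * 1024):
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic rename into place
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With this pattern a second worker either downloads the file itself (possibly redundantly) or sees the complete file after the rename; it never parses a half-written one.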

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly
news fragment change types...
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change
  • changes/<PR#>.docs.rst
  • changes/<PR#>.stpipe.rst
  • changes/<PR#>.datamodels.rst
  • changes/<PR#>.scripts.rst
  • changes/<PR#>.fits_generator.rst
  • changes/<PR#>.set_telescope_pointing.rst
  • changes/<PR#>.pipeline.rst

stage 1

  • changes/<PR#>.group_scale.rst
  • changes/<PR#>.dq_init.rst
  • changes/<PR#>.emicorr.rst
  • changes/<PR#>.saturation.rst
  • changes/<PR#>.ipc.rst
  • changes/<PR#>.firstframe.rst
  • changes/<PR#>.lastframe.rst
  • changes/<PR#>.reset.rst
  • changes/<PR#>.superbias.rst
  • changes/<PR#>.refpix.rst
  • changes/<PR#>.linearity.rst
  • changes/<PR#>.rscd.rst
  • changes/<PR#>.persistence.rst
  • changes/<PR#>.dark_current.rst
  • changes/<PR#>.charge_migration.rst
  • changes/<PR#>.jump.rst
  • changes/<PR#>.clean_flicker_noise.rst
  • changes/<PR#>.ramp_fitting.rst
  • changes/<PR#>.gain_scale.rst

stage 2

  • changes/<PR#>.assign_wcs.rst
  • changes/<PR#>.badpix_selfcal.rst
  • changes/<PR#>.msaflagopen.rst
  • changes/<PR#>.nsclean.rst
  • changes/<PR#>.imprint.rst
  • changes/<PR#>.background.rst
  • changes/<PR#>.extract_2d.rst
  • changes/<PR#>.master_background.rst
  • changes/<PR#>.wavecorr.rst
  • changes/<PR#>.srctype.rst
  • changes/<PR#>.straylight.rst
  • changes/<PR#>.wfss_contam.rst
  • changes/<PR#>.flatfield.rst
  • changes/<PR#>.fringe.rst
  • changes/<PR#>.pathloss.rst
  • changes/<PR#>.barshadow.rst
  • changes/<PR#>.photom.rst
  • changes/<PR#>.pixel_replace.rst
  • changes/<PR#>.resample_spec.rst
  • changes/<PR#>.residual_fringe.rst
  • changes/<PR#>.cube_build.rst
  • changes/<PR#>.extract_1d.rst
  • changes/<PR#>.resample.rst

stage 3

  • changes/<PR#>.assign_mtwcs.rst
  • changes/<PR#>.mrs_imatch.rst
  • changes/<PR#>.tweakreg.rst
  • changes/<PR#>.skymatch.rst
  • changes/<PR#>.exp_to_source.rst
  • changes/<PR#>.outlier_detection.rst
  • changes/<PR#>.tso_photometry.rst
  • changes/<PR#>.stack_refs.rst
  • changes/<PR#>.align_refs.rst
  • changes/<PR#>.klip.rst
  • changes/<PR#>.spectral_leak.rst
  • changes/<PR#>.source_catalog.rst
  • changes/<PR#>.combine_1d.rst
  • changes/<PR#>.ami.rst

other

  • changes/<PR#>.wfs_combine.rst
  • changes/<PR#>.white_light.rst
  • changes/<PR#>.cube_skymatch.rst
  • changes/<PR#>.engdb_tools.rst
  • changes/<PR#>.guider_cds.rst

@github-actions bot added the testing and automation (Continuous Integration (CI) and testing automation tools) labels on Jan 2, 2025
@braingram force-pushed the fix_crds_cache branch 2 times, most recently from 9dacb42 to 8fa726e on January 2, 2025 17:30
codecov bot commented Jan 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.90%. Comparing base (bb62f0c) to head (e82bac9).
Report is 822 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9040      +/-   ##
==========================================
- Coverage   76.90%   76.90%   -0.01%     
==========================================
  Files         498      498              
  Lines       45761    45769       +8     
==========================================
+ Hits        35194    35199       +5     
- Misses      10567    10570       +3     
Flag      Coverage Δ        *Carryforward flag
nightly   77.39% <ø> (ø)    Carriedforward from bb62f0c

*This pull request uses carry forward flags.


@braingram braingram marked this pull request as ready for review January 2, 2025 19:12
@braingram braingram requested a review from a team as a code owner January 2, 2025 19:12
@braingram braingram changed the title Fix crds cache Update CRDS cache workflow usage to reduce random failures Jan 2, 2025
@braingram (Collaborator, Author) commented:

I'm going to merge since it's approved, but please let me know if there are any questions.

I requested reviews from some folks who reported the initial failures, and I want to highlight one part of the description above:

If failures of this type are seen, re-running might resolve the issue (especially if a different run succeeded and stored a valid cache); in the worst case, a manual run could be triggered using -n 1 to run the tests serially.

@braingram braingram merged commit adadfaa into spacetelescope:main Jan 2, 2025
31 checks passed
@braingram braingram deleted the fix_crds_cache branch January 2, 2025 19:31
@braingram (Collaborator, Author) commented on Jan 2, 2025

https://github.com/spacetelescope/jwst/actions/runs/12587715413/attempts/1 provides an example of another wrinkle with using caching as a way to work around the CRDS parallel cache-write issue.

The caches are per-branch, so after merging this PR the run on main didn't see the crds-jwst_1321.pmap cache generated for this PR. It ran with an empty cache and had a few failures (due to the above issue combined with starting from an empty cache).

The failed jobs succeeded when re-run (since they now use the valid cache).
https://github.com/spacetelescope/jwst/actions/runs/12587715413

Labels
automation (Continuous Integration (CI) and testing automation tools), no-changelog-entry-needed, testing