8000 JP-3773: Store logs from calibration pipeline in datamodel by braingram · Pull Request #9211 · spacetelescope/jwst · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

JP-3773: Store logs from calibration pipeline in datamodel #9211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Feb 28, 2025

Conversation

braingram
Copy link
Collaborator
@braingram braingram commented Feb 21, 2025

Builds off #8857

Resolves JP-3773

Closes #8855

From #8857

This PR adds the log records during processing to a datamodel attribute model.cal_logs. This attribute is a dictionary, with keys equal to steps and/or pipelines applied to the data, and values associated with those keys equal to the log output.

This PR extends that by:

  • updating the code to conform to new style checks
  • applying the suggestions on that PR
  • adds scrubbing code that removes log lines that contain ip addresses, hostname or username
  • adds unit tests

Regression tests: https://github.com/spacetelescope/RegressionTests/actions/runs/13464767772

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly
news fragment change types...
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change
  • changes/<PR#>.docs.rst
  • changes/<PR#>.stpipe.rst
  • changes/<PR#>.datamodels.rst
  • changes/<PR#>.scripts.rst
  • changes/<PR#>.set_telescope_pointing.rst
  • changes/<PR#>.pipeline.rst

stage 1

  • changes/<PR#>.group_scale.rst
  • changes/<PR#>.dq_init.rst
  • changes/<PR#>.emicorr.rst
  • changes/<PR#>.saturation.rst
  • changes/<PR#>.ipc.rst
  • changes/<PR#>.firstframe.rst
  • changes/<PR#>.lastframe.rst
  • changes/<PR#>.reset.rst
  • changes/<PR#>.superbias.rst
  • changes/<PR#>.refpix.rst
  • changes/<PR#>.linearity.rst
  • changes/<PR#>.rscd.rst
  • changes/<PR#>.persistence.rst
  • changes/<PR#>.dark_current.rst
  • changes/<PR#>.charge_migration.rst
  • changes/<PR#>.jump.rst
  • changes/<PR#>.clean_flicker_noise.rst
  • changes/<PR#>.ramp_fitting.rst
  • changes/<PR#>.gain_scale.rst

stage 2

  • changes/<PR#>.assign_wcs.rst
  • changes/<PR#>.badpix_selfcal.rst
  • changes/<PR#>.msaflagopen.rst
  • changes/<PR#>.nsclean.rst
  • changes/<PR#>.imprint.rst
  • changes/<PR#>.background.rst
  • changes/<PR#>.extract_2d.rst
  • changes/<PR#>.master_background.rst
  • changes/<PR#>.wavecorr.rst
  • changes/<PR#>.srctype.rst
  • changes/<PR#>.straylight.rst
  • changes/<PR#>.wfss_contam.rst
  • changes/<PR#>.flatfield.rst
  • changes/<PR#>.fringe.rst
  • changes/<PR#>.pathloss.rst
  • changes/<PR#>.barshadow.rst
  • changes/<PR#>.photom.rst
  • changes/<PR#>.pixel_replace.rst
  • changes/<PR#>.resample_spec.rst
  • changes/<PR#>.residual_fringe.rst
  • changes/<PR#>.cube_build.rst
  • changes/<PR#>.extract_1d.rst
  • changes/<PR#>.resample.rst

stage 3

  • changes/<PR#>.assign_mtwcs.rst
  • changes/<PR#>.mrs_imatch.rst
  • changes/<PR#>.tweakreg.rst
  • changes/<PR#>.skymatch.rst
  • changes/<PR#>.exp_to_source.rst
  • changes/<PR#>.outlier_d 8000 etection.rst
  • changes/<PR#>.tso_photometry.rst
  • changes/<PR#>.stack_refs.rst
  • changes/<PR#>.align_refs.rst
  • changes/<PR#>.klip.rst
  • changes/<PR#>.spectral_leak.rst
  • changes/<PR#>.source_catalog.rst
  • changes/<PR#>.combine_1d.rst
  • changes/<PR#>.ami.rst

other

  • changes/<PR#>.wfs_combine.rst
  • changes/<PR#>.white_light.rst
  • changes/<PR#>.cube_skymatch.rst
  • changes/<PR#>.engdb_tools.rst
  • changes/<PR#>.guider_cds.rst

Copy link
codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.68%. Comparing base (2eae615) to head (50da0fe).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
jwst/stpipe/core.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9211      +/-   ##
==========================================
+ Coverage   73.65%   73.68%   +0.02%     
==========================================
  Files         368      369       +1     
  Lines       36374    36406      +32     
==========================================
+ Hits        26792    26824      +32     
  Misses       9582     9582              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@braingram braingram marked this pull request as ready for review February 21, 2025 21:52
@braingram braingram requested review from a team as code owners February 21, 2025 21:52
@braingram braingram requested a review from tapastro February 21, 2025 21:52
@tapastro
Copy link
Contributor

One issue that popped up in my testing - I ran some files locally, replacing the scrubbing behavior with a prepended scrub tag to see what's getting removed. The only scrub case I saw in testing was the removal of statements recording which reference files are being used, due to my user ID being present in the directory tree of my crds_cache.

Could we alter the scrubbing behavior to "target scrub" the USER and HOSTNAME matching, e.g. by running a replace on the USER string with something like {\SCRUBBED USER_ID} ?

@braingram
Copy link
Collaborator Author
braingram commented Feb 24, 2025

One issue that popped up in my testing - I ran some files locally, replacing the scrubbing behavior with a prepended scrub tag to see what's getting removed. The only scrub case I saw in testing was the removal of statements recording which reference files are being used, due to my user ID being present in the directory tree of my crds_cache.

Could we alter the scrubbing behavior to "target scrub" the USER and HOSTNAME matching, e.g. by running a replace on the USER string with something like {\SCRUBBED USER_ID} ?

Thanks for giving this a try. Are you asking about switching the "scrubbing" so that instead of removing the whole log line just replacing just the user name (for example) portion of the log line? So for a log line like the following:

file created by user braingram

The current PR would replace the entire log line with an empty string. Are you asking if instead we could replace it with the following?:

file created by user SCRUBBED USER_ID

I didn't go with that approach because of possible inadvertent leaking of the user or hostname due to the string replace. Lets say ops decides to name the user "tim". The following line:

total exposure time: 42

Would become:

total exposure SCRUBBED USER_IDe: 42

which would leak the user name (on a line that otherwise would not contain that information).

@tapastro
Copy link
Contributor

This is going to be tough to solve - one reason why I was hoping to avoid scrubbing in general...

The reference file history is one of the most useful components of the log - it would be a shame to implement this with such a useful aspect missing.

I'm not sure if it's worth attempting to implement a more "targeted vaccine", i.e. through string matching the user ID as part of a directory string, or if that just leads us further into the labyrinth.

@braingram
Copy link
Collaborator Author

This is going to be tough to solve - one reason why I was hoping to avoid scrubbing in general...

The reference file history is one of the most useful components of the log - it would be a shame to implement this with such a useful aspect missing.

Can you give examples of log messages that are useful that were scrubbed?

I'm not sure if it's worth attempting to implement a more "targeted vaccine", i.e. through string matching the user ID as part of a directory string, or if that just leads us further into the labyrinth.

Please do let me know if there are other approaches you'd like to explore.

@tapastro
Copy link
Contributor
tapastro commented Feb 24, 2025

Here's a selection:
>>> print('\n'.join(scrubs1+scrubs2))

SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for DARK reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_dark_0424.fits'.
SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for GAIN reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_gain_0097.fits'.
SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for LINEARITY reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_linearity_0052.fits'.
SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for MASK reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_mask_0076.fits'.
SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for READNOISE reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_readnoise_0266.fits'.
SCRUBBEDU: 2025-02-24T17:47:23.787 - stpipe.Detector1Pipeline - INFO - Prefetch for SATURATION reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_saturation_0097.fits'.
...
SCRUBBEDU: 2025-02-24T17:47:26.550 - stpipe.Detector1Pipeline.dark_current - INFO - Using DARK reference file /Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_dark_0424.fits
SCRUBBEDU: 2025-02-24T17:47:28.482 - stpipe.Detector1Pipeline.jump - INFO - Using GAIN reference file: /Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_gain_0097.fits
SCRUBBEDU: 2025-02-24T17:47:28.490 - stpipe.Detector1Pipeline.jump - INFO - Using READNOISE reference file: /Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_readnoise_0266.fits
SCRUBBEDU: 2025-02-24T17:47:53.593 - stpipe.Detector1Pipeline.ramp_fit - INFO - Using READNOISE reference file: /Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_readnoise_0266.fits
SCRUBBEDU: 2025-02-24T17:47:53.593 - stpipe.Detector1Pipeline.ramp_fit - INFO - Using GAIN reference file: /Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_gain_0097.fits
SCRUBBEDU: 2025-02-24T17:51:35.617 - stpipe.Image2Pipeline - INFO - Prefetch for AREA reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_area_0207.fits'.
SCRUBBEDU: 2025-02-24T17:51:35.617 - stpipe.Image2Pipeline - INFO - Prefetch for DISTORTION reference file is '/Users/tpauly/crds_cache/references/jwst/nircam/jwst_nircam_distortion_0280.asdf'.
...

@braingram
Copy link
Collaborator Author

Thanks. The "Prefetch for" ones appear to be from:
https://github.com/spacetelescope/stpipe/blob/47c3867cdab9d8c9fb13150047df44cfbbd97f9e/src/stpipe/pipeline.py#L327
I don't think the path of those files is useful but the final part of the filename certainly is but those are separately recorded in finalize_result:

if len(reference_files_used) > 0:

(and possibly other log messages).

The other log messages contain similar information and appear to be coming from steps like this one for the DARK reference:

self.log.info('Using DARK reference file %s', self.dark_name)

Are the paths useful in either of these cases for a run on ops? For a local run the log messages that end up on the commandline shouldn't be touched by this PR.

@tapastro
Copy link
Contributor

The log messages in the commandline are not affected, but I think having the stored logs out of sync with the command line logs is not an ideal situation. We do record the reference file names in the headers, but having the complete location is helpful in some cases. Perhaps it's not worth worrying about.

@braingram
Copy link
Collaborator Author

The log messages in the commandline are not affected, but I think having the stored logs out of sync with the command line logs is not an ideal situation. We do record the reference file names in the headers, but having the complete location is helpful in some cases. Perhaps it's not worth worrying about.

Yeah, any scrubbing will result in the 2 being out of sync. If we need to scrub file paths we can't have the full paths for the reference files. At the moment the files contain no logs so having something is better than nothing I guess. I'm not prepared to argue that we shouldn't scrub the logs and will leave that up to you.

@tapastro
Copy link
Contributor

After mulling it over, I think the keyword storage of reference file names is sufficient - we can re-visit the scrubbing if we receive negative feedback on missing entries.

@tapastro tapastro enabled auto-merge (squash) February 28, 2025 18:50
@tapastro tapastro merged commit 9c4f7bc into spacetelescope:main Feb 28, 2025
30 checks passed
@braingram braingram mentioned this pull request Mar 7, 2025
10 tasks
@braingram braingram deleted the cal_logs branch March 24, 2025 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store log output in datamodel extension
2 participants
0