Intermittent array size I/O error while reading DistTargetsDESI · Issue #309 · desihub/redrock · GitHub
Intermittent array size I/O error while reading DistTargetsDESI #309
Open
@sbailey

Description

During the Jura run, we have encountered multiple cases of I/O errors of the form:

# from jura healpix/main/dark/176/17625/logs/redrock-main-dark-17625.log.0
...
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 929, in rrdesi
    targets = DistTargetsDESI(args.infiles, coadd=(not args.allspec),
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 570, in __init__
    hdata = hdus[extname].data[rows]
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/utils/decorators.py", line 837, in __get__
    val = self.fget(obj)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 250, in data
    data = self._get_scaled_image_data(self._data_offset, self.shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 809, in _get_scaled_image_data
    raw_data = self._get_raw_data(shape, code, offset)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/base.py", line 559, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/file.py", line 400, in readarray
    data.shape = shape
Proc 0: ValueError: cannot reshape array of size 768 into shape (2875,11,2881)
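
For scale, here is a quick check of how short the read was. This is only arithmetic on the two numbers in the ValueError above; the variable names are mine, and it says nothing about why the read came up short.

```python
import math

# Both numbers are taken from the ValueError message in the traceback.
expected_shape = (2875, 11, 2881)  # shape astropy tried to assign
elements_read = 768                # size of the array that was actually read

# Number of elements a complete read of this HDU would have returned.
expected_elements = math.prod(expected_shape)  # 91111625

# Fraction of the expected elements that actually arrived.
fraction_read = elements_read / expected_elements

print(f"expected {expected_elements} elements, got {elements_read}")
print(f"fraction read: {fraction_read:.2e}")
```

So the reader got back only a tiny fraction of the image rather than being off by a row or two.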

The incorrect array size varies from job to job, and the same files and code work when resubmitted. Admittedly, because of checkpoint/restart, the resubmitted jobs resume only at the previously failed step, so they do not exactly reproduce all of the prior history.
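
Since resubmission succeeds, one possible stopgap is to retry the read inside the job. The following is a minimal sketch, not anything currently in redrock: `retry_on_short_read` and its parameters are hypothetical names, and whether a retry within the same process would actually succeed against this failure is untested.

```python
import time


def retry_on_short_read(read_func, max_tries=3, wait_seconds=1.0):
    """Call read_func() and return its result, retrying on ValueError.

    Hypothetical helper: read_func is any zero-argument callable that
    performs the read. ValueError is the exception type seen in the
    traceback when the array cannot be reshaped.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return read_func()
        except ValueError as err:
            if attempt == max_tries:
                # Out of retries: let the original error propagate.
                raise
            print(f"read attempt {attempt} failed ({err}); retrying")
            time.sleep(wait_seconds)


# Hypothetical usage at the failing line in DistTargetsDESI.__init__:
#   hdata = retry_on_short_read(lambda: hdus[extname].data[rows])
```

This would only mask the problem, so logging each failed attempt (as above) seems worthwhile to keep a record of how often it happens.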

Other examples (some failed when qso_qn called redrock, some during the original redrock run):

healpix         jobid
main-dark-17625 26153196
main-dark-17352 26153083
main-dark-20239 26153880
main-dark-8676  26151991
main-dark-26147 26154319
main-dark-7272  26151594

The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case with the rows slicing. Documenting it here for the search record.

@dmargala does this sound familiar from any other reports at NERSC? I could file a NERSC ticket too, but the DESI MPI+astropy combination is so specific that I'm not sure how useful that would be.
