Description
During the Jura run, we have encountered multiple cases of I/O errors of the form:
```
# from jura healpix/main/dark/176/17625/logs/redrock-main-dark-17625.log.0
...
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 929, in rrdesi
    targets = DistTargetsDESI(args.infiles, coadd=(not args.allspec),
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 570, in __init__
Proc 0:     hdata = hdus[extname].data[rows]
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/utils/decorators.py", line 837, in __get__
Proc 0:     val = self.fget(obj)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 250, in data
Proc 0:     data = self._get_scaled_image_data(self._data_offset, self.shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 809, in _get_scaled_image_data
Proc 0:     raw_data = self._get_raw_data(shape, code, offset)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/base.py", line 559, in _get_raw_data
Proc 0:     return self._file.readarray(offset=offset, dtype=code, shape=shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/file.py", line 400, in readarray
Proc 0:     data.shape = shape
Proc 0: ValueError: cannot reshape array of size 768 into shape (2875,11,2881)
```
The incorrect array size varies between jobs (here only 768 elements were read where 2875×11×2881 = 91,111,625 were expected), and the same files/code work when resubmitted, though admittedly, because of checkpoint/restart, the resubmitted jobs resume only from the previously failed step and don't exactly reproduce all prior history.
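Since resubmission succeeds, one possible band-aid while the root cause is unknown would be to retry the failing read. A minimal sketch, assuming a simple wrapper around the astropy call (read_rows_with_retry and its retry parameters are hypothetical, not anything redrock currently does):

```python
# Hypothetical mitigation sketch: retry the row read that fails in desi.py,
# since the same file reads fine on resubmission. Not part of redrock;
# read_rows_with_retry and the retry parameters are invented for illustration.
import time

from astropy.io import fits


def read_rows_with_retry(infile, extname, rows, ntries=3, delay=5.0):
    """Return hdus[extname].data[rows], retrying on the truncated-read ValueError."""
    for attempt in range(ntries):
        try:
            with fits.open(infile, memmap=False) as hdus:
                return hdus[extname].data[rows]
        except ValueError:
            # e.g. "cannot reshape array of size 768 into shape (2875,11,2881)"
            if attempt == ntries - 1:
                raise
            time.sleep(delay)
```

This wouldn't explain the truncated reads, but it could keep large runs from dying on what looks like a transient condition.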
Other examples (some failing when qso_qn calls redrock, some during the original redrock run):
| healpix | jobid |
| --- | --- |
| main-dark-17625 | 26153196 |
| main-dark-17352 | 26153083 |
| main-dark-20239 | 26153880 |
| main-dark-8676 | 26151991 |
| main-dark-26147 | 26154319 |
| main-dark-7272 | 26151594 |
The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case with the rows slicing. Documenting it here for the search record.
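One way to separate a CFS/dvs_ro problem from an astropy one might be to bypass astropy's readarray and check the raw bytes at the HDU data offset directly. A rough diagnostic sketch (the path and extension name are placeholders, not the actual failing files):

```python
# Hypothetical diagnostic sketch: re-read the bytes at the HDU data offset with
# plain file I/O, to see whether the short read also shows up outside astropy.
# The path and extension name below are placeholders.
import os

import numpy as np
from astropy.io import fits

infile = "/dvs_ro/cfs/cdirs/desi/spectro/redux/jura/healpix/..."  # placeholder
extname = "FLUX"  # placeholder

with fits.open(infile) as hdus:
    hdu = hdus[extname]
    offset = hdu._data_offset  # same offset astropy passes to readarray
    nbytes = abs(hdu.header["BITPIX"]) // 8 * int(np.prod(hdu.shape))

with open(infile, "rb") as f:
    f.seek(offset)
    raw = f.read(nbytes)

print(f"expected {nbytes} bytes at offset {offset}, got {len(raw)}; "
      f"file size {os.path.getsize(infile)}")
```

If the plain read also comes up short on /dvs_ro, that would point at the filesystem rather than the astropy installation.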
@dmargala does this sound familiar from any other reports at NERSC? I could file a NERSC ticket too, but the DESI MPI+astropy combination is so specific that I'm not sure how useful that would be.