IndexError: list index out of range · Issue #32 · mlcommons/GaNDLF
Closed
Karol-G opened this issue Mar 28, 2021 · 7 comments · Fixed by #35

@Karol-G (Collaborator) commented Mar 28, 2021

Hi again,

Sorry for opening so many issues >.<
When I try to train on the toy dataset in testing/data.zip, I get the error IndexError: list index out of range. This might originate from normalize_nonZero in data_preprocessing. I am using the newest pull of GaNDLF-refactor and am on Linux.

The train.csv and the toy data:
https://cloud-ext.igd.fraunhofer.de/s/8kCtZzcFRX96Xt8

Full error log:

Using default folds for testing split:  -5
Using default folds for validation split:  -5
Number of channels :  3
Channel Keys :  ['subject_id', '1', '2', '3', 'label', 'path_to_metadata', 'value_0']



Initializing training at :  2021-03-28 10:34:16.335452
Found a pre-existing file for logging, now appending logs to that file!
Found a pre-existing file for logging, now appending logs to that file!
Device requested via CUDA_VISIBLE_DEVICES:  0
Total number of CUDA devices:  1
Device finally used:  0
Sending model to aforementioned device
Memory Total :  15.9 GB, Allocated:  0.1 GB, Cached:  0.1 GB
Device - Current: 0 Count: 1 Name: Tesla P100-PCIE-16GB Availability: True
Using device: cuda
********************
Starting Epoch :  0
Epoch start time :  2021-03-28 10:34:18.923012
Traceback (most recent call last):
  File "gandlf_run", line 75, in <module>
    main()
  File "gandlf_run", line 70, in main
    TrainingManager(dataframe=data_full, headers = headers, outputDir=model_path, parameters=parameters, device=device, reset_prev = reset_prev)
  File "/content/GaNDLF-refactor/GANDLF/training_manager.py", line 146, in TrainingManager
    device=device, params=parameters, testing_data=testingData)
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 477, in training_loop
    model, train_dataloader, optimizer, params
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 133, in train_network
    for batch_idx, (subject) in enumerate(train_dataloader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torchio/data/queue.py", line 164, in __getitem__
    self.fill()
  File "/usr/local/lib/python3.7/dist-packages/torchio/data/queue.py", line 228, in fill
    subject = self.get_next_subject()
  File "/usr/local/lib/python3.7/dist-packages/torchio/data/queue.py", line 238, in get_next_subject
    subject = next(self.subjects_iterable)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torchio/data/dataset.py", line 85, in __getitem__
    subject = self._transform(subject)
  File "/usr/local/lib/python3.7/dist-packages/torchio/transforms/transform.py", line 121, in __call__
    transformed = self.apply_transform(subject)
  File "/usr/local/lib/python3.7/dist-packages/torchio/transforms/augmentation/composition.py", line 47, in apply_transform
    subject = transform(subject)
  File "/usr/local/lib/python3.7/dist-packages/torchio/transforms/transform.py", line 121, in __call__
    transformed = self.apply_transform(subject)
  File "/content/GaNDLF-refactor/GANDLF/preprocessing.py", line 221, in apply_transform
    images_dict[names_list[idx]]['data'] = torch.tensor(np.expand_dims(array, axis=0))
IndexError: list index out of range
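
For illustration, a minimal sketch (hypothetical names and shapes, not the actual GaNDLF code) of the failure mode at preprocessing.py:221: if more arrays are produced than there are entries in names_list, the lookup names_list[idx] runs past the end of the list.

# Hypothetical illustration of the failure seen at preprocessing.py:221; names and
# shapes are made up and do not come from GaNDLF itself.
import numpy as np
import torch

names_list = ["1", "2"]                               # channel keys gathered from the subject
arrays = [np.zeros((64, 64)) for _ in range(3)]       # one array more than there are names

images_dict = {name: {"data": None} for name in names_list}
for idx, array in enumerate(arrays):
    # idx == 2 has no matching entry in names_list -> IndexError: list index out of range
    images_dict[names_list[idx]]["data"] = torch.tensor(np.expand_dims(array, axis=0))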

This is the model.yaml:

# affix version
version:
  {
    minimum: 0.0.8,
    maximum: 0.0.8 # this should NOT be made a variable, but should be tested after every tag is created
  }
metrics:
  - mse
# Choose the model parameters here
model:
  {
    dimension: 2, # the dimension of the model and dataset: defines dimensionality of computations
    base_filters: 30, # Set base filters: number of filters present in the initial module of the U-Net convolution; for IncU-Net, keep this divisible by 4
    architecture: vgg16, # options: unet, resunet, fcn, uinc, vgg, densenet
    batch_norm: True, # this is only used for vgg
    final_layer: None, # can be either sigmoid, softmax or none (none == regression)
    amp: False, # Set if you want to use Automatic Mixed Precision for your operations or not - options: True, False
    n_channels: 3, # set the input channels - useful when reading RGB or images that have vectored pixel types
  }
# this is to enable or disable lazy loading - setting to true reads all data once during data loading, resulting in improvements
# in I/O at the expense of memory consumption
in_memory: False
# this will save the generated masks for validation and testing data for qualitative analysis
save_masks: False
# Set the Modality : rad for radiology, path for histopathology
modality: rad
# Patch size during training - 2D patch for breast images since third dimension is not patched 
patch_size: [64,64,64]
# uniform: UniformSampler or label: LabelSampler
patch_sampler: uniform
# Number of epochs
num_epochs: 100
# Set the patience - measured in number of epochs after which, if the performance metric does not improve, exit the training loop - defaults to the number of epochs
patience: 50
# Set the batch size
batch_size: 1
# Set the initial learning rate
learning_rate: 0.001
# Learning rate scheduler - options: triangle, triangle_modified, exp, reduce-on-lr, step, more to come soon - default hyperparameters can be changed thru code
scheduler: triangle
# Set which loss function you want to use - options : 'dc' - for dice only, 'dcce' - for sum of dice and CE and you can guess the next (only lower-case please)
# options: dc (dice only), dc_log (-log of dice), ce (), dcce (sum of dice and ce), mse () ...
# mse is the MSE defined by torch and can define a variable 'reduction'; see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss
# use mse_torch for regression/classification problems and dice for segmentation
loss_function: mse
# this parameter weights the loss to handle imbalanced losses better
weighted_loss: True 
#loss_function:
#  {
#    'mse':{
#      'reduction': 'mean' # see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss for all options
#    }
#  }
# Which optimizer do you want to use - adam/sgd
opt: adam
# this parameter controls the nested training process
# performs randomized k-fold cross-validation
# split is performed using sklearn's KFold method
# for single fold run, use '-' before the fold number
nested_training:
  {
    #testing: 5, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    #validation: 5 # this controls the validation data splits for model training
  }
## pre-processing
# this constructs an order of transformations, which is applied to all images in the data loader
# order: resize --> threshold/clip --> resample --> normalize
# 'threshold': performs intensity thresholding; i.e., if x[i] < min: x[i] = 0; and if x[i] > max: x[i] = 0
# 'clip': performs intensity clipping; i.e., if x[i] < min: x[i] = min; and if x[i] > max: x[i] = max
# 'threshold'/'clip': if either min/max is not defined, it is taken as the minimum/maximum of the image, respectively
# 'normalize': performs z-score normalization: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.ZNormalization
# 'normalize_nonZero': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels
# 'normalize_nonZero_masked': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels with the stats applied on non-zero pixels
# 'crop_external_zero_planes': crops all non-zero planes from input tensor to reduce image search space
# 'resample: resolution: X,Y,Z': resample the voxel resolution: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# 'resample: resolution: X': resample the voxel resolution in an isotropic manner: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# resize the image(s) and mask (this should be greater than or equal to patch_size); resize is done ONLY when resample is not defined
data_preprocessing:
  {
    # 'normalize',
    'normalize_nonZero', # this performs z-score normalization only on non-zero pixels
    'resample':{
      'resolution': [1,2,3]
    },
    #'resize': [128,128], # this is generally not recommended, as it changes image properties in unexpected ways
    'crop_external_zero_planes', # this will crop all zero-valued planes across all axes
  }
# various data augmentation techniques
# options: affine, elastic, downsample, motion, ghosting, bias, blur, gaussianNoise, swap
# keep/edit as needed
# all transforms: https://torchio.readthedocs.io/transforms/transforms.html?highlight=transforms
# 'kspace': one of motion, ghosting or spiking is picked (randomly) for augmentation
# 'probability' subkey adds the probability of the particular augmentation getting added during training (this is always 1 for normalize and resampling)
data_augmentation: 
  {
    default_probability: 0.5,
    'affine',
    'elastic',
    'kspace':{
      'probability': 1
    },
    'bias',
    'blur': {
      'std': [0, 1] # default std-dev range, for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomBlur
    },
    'noise': { # for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomNoise
      'mean': 0, # default mean
      'std': [0, 1] # default std-dev range
    },
    'anisotropic':{
      'axis': [0,1],
      'downsampling': [2,2.5]
    },
  }
# parallel training on HPC - here goes the command to prepend to send to a high performance computing
# cluster for parallel computing during multi-fold training
# not used for single fold training
# this gets passed before the training_loop, so ensure enough memory is provided along with other parameters
# that your HPC would expect
# ${outputDir} will be changed to the outputDir you pass in CLI + '/${fold_number}'
# ensure that the correct location of the virtual environment is getting invoked, otherwise it would pick up the system python, which might not have all dependencies
# parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'
## queue configuration - https://torchio.readthedocs.io/data/patch_training.html?#queue
# this determines the maximum number of patches that can be stored in the queue. Using a large number means that the queue needs to be filled less often, but more CPU memory is needed to store the patches
q_max_length: 40
# this determines the number of patches to extract from each volume. A small number of patches ensures a large variability in the queue, but training will be slower
q_samples_per_volume: 5
# this determines the number subprocesses to use for data loading; '0' means main process is used
q_num_workers: 2 # scale this according to available CPU resources
# used for debugging
q_verbose: False

Best
Karol

@sarthakpati (Collaborator)

No worries at all, this helps us with stress testing, so thank you! 😄

This could be due to the resampling. Could you try the following:

data_preprocessing:
  {
    # 'normalize',
    'normalize_nonZero', # this performs z-score normalization only on non-zero pixels
  }
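
For reference, 'normalize_nonZero' behaves roughly like TorchIO's ZNormalization restricted to non-zero voxels; a minimal stand-alone sketch (assuming TorchIO's masking_method hook, not GaNDLF's actual wiring) would be:

# Rough stand-alone equivalent of 'normalize_nonZero' using TorchIO
# (assumption: the GaNDLF option maps onto ZNormalization over non-zero voxels).
import torch
import torchio as tio

znorm = tio.ZNormalization(masking_method=lambda x: x > 0)  # stats from non-zero voxels only

image = tio.ScalarImage(tensor=torch.rand(1, 32, 32, 32))   # dummy channels-first 3D image
subject = tio.Subject(image=image)
normalized = znorm(subject)
print(normalized['image'].data.mean(), normalized['image'].data.std())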

@sarthakpati (Collaborator) commented Mar 28, 2021

I would also recommend keeping the configuration as light as possible for the toy examples (basically, let the defaults ride). Essentially, the configs in the testing module should be your starting point. 😄 🚀

@Karol-G (Collaborator, Author) commented Mar 28, 2021

Will do :)
The old error is fixed; the new one is this 😄:

This option has been superceded by 'model'
Number of channels :  3
Channel Keys :  ['subject_id', '1', '2', '3', 'label', 'path_to_metadata', 'value_0']



Initializing training at :  2021-03-28 13:12:04.708672
Found a pre-existing file for logging, now appending logs to that file!
Found a pre-existing file for logging, now appending logs to that file!
Device requested via CUDA_VISIBLE_DEVICES:  0
Total number of CUDA devices:  1
Device finally used:  0
Sending model to aforementioned device
Memory Total :  14.8 GB, Allocated:  0.1 GB, Cached:  0.1 GB
Device - Current: 0 Count: 1 Name: Tesla T4 Availability: True
Using device: cuda
********************
Starting Epoch :  0
Epoch start time :  2021-03-28 13:12:07.872670
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py:528: UserWarning: Using a target size (torch.Size([1, 1])) that is different to the input size (torch.Size([1, 2])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
Epoch Average Train loss :  tensor(1.6588, device='cuda:0', grad_fn=<DivBackward0>)
Epoch Average Train mse :  tensor(1.6588, device='cuda:0', grad_fn=<DivBackward0>)
Epoch Average Train loss :  tensor(0.9822, device='cuda:0', grad_fn=<DivBackward0>)
Epoch Average Train mse :  tensor(0.9822, device='cuda:0', grad_fn=<DivBackward0>)
********************
Starting validation : 
********************
Traceback (most recent call last):
  File "gandlf_run", line 75, in <module>
    main()
  File "gandlf_run", line 70, in main
    TrainingManager(dataframe=data_full, headers = headers, outputDir=model_path, parameters=parameters, device=device, reset_prev = reset_prev)
  File "/content/GaNDLF-refactor/GANDLF/training_manager.py", line 146, in TrainingManager
    device=device, params=parameters, testing_data=testingData)
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 480, in training_loop
    model, val_dataloader, params
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 269, in validate_network
    output_prediction += output.cpu().data.item()# this probably needs customization for classification (majority voting or median, perhaps?)
ValueError: only one element tensors can be converted to Python scalars

@sarthakpati (Collaborator)

Can you pull from my branch now and retry?

@Karol-G (Collaborator, Author) commented Mar 30, 2021

Sorry for replying so late! I was busy the last few days.
Sadly, the error is still the same with the new refactor pull.

@Karol-G (Collaborator, Author) commented Mar 31, 2021

You can try to reproduce the error from my fork: https://github.com/Karol-G/GaNDLF/tree/refactor
I added the experiment's model.yaml, train.csv, and the toy dataset.

You should be able to reproduce the error with the following command:
gandlf_run -config ./my_experiments/2d_classification/model_simple.yaml -data ./my_experiments/2d_classification/train.csv -output ./my_experiments/2d_classification/output_dir/ -train 1 -device -1 -reset_prev True

When debugging, I noticed that the model output is of size [1,2,1] which cannot be converted to a scalar with .item().

There is a comment there saying "this probably needs customization for classification (majority voting or median, perhaps?)", so I guess that is probably the solution 😄
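
As a rough sketch of what such an aggregation could look like (a hypothetical helper, not the change that landed in #35): reduce the per-patch outputs to a single class before accumulating, e.g. by majority vote.

# Hypothetical aggregation of per-patch classification outputs; not the fix from #35.
import torch

def aggregate_patch_predictions(output: torch.Tensor) -> int:
    """Majority vote over patches, assuming output is shaped [batch, classes, patches]."""
    votes = output.detach().cpu().argmax(dim=1).reshape(-1)  # predicted class per patch
    return int(votes.mode().values.item())                   # the most frequent class wins

output = torch.randn(1, 2, 1)       # the [1, 2, 1] shape reported above
predicted_class = aggregate_patch_predictions(output)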

@Karol-G (Collaborator, Author) commented Mar 31, 2021

Yes! It is working now 😁 Thanks!

Karol-G closed this as completed on Mar 31, 2021.