Hi, when I run vq_apc pretraining like this: `python3 run_pretrain.py -u vq_apc -g pretrain/vq_apc/config_model.yaml --multi_gpu -n my_pretrain`, I encounter the following error:
```
[Runner] - Start a new experiment
[UpstreamPretrainExpert] - Using upstream config from: pretrain/vq_apc/config_model.yaml
[UpstreamPretrainExpert] - Using the apc preprocessor, on-the-fly feature preprocessing
[UpstreamPretrainExpert] - Initializing model...
[Dataset] - Sampling random segments for training, sample length: 1500
[Dataset] - Training data from these sets: ['train-clean-100']
[Dataset] - Number of individual training instances: 28539
[UpstreamPretrainExpert] - Multi-GPU training Enabled: 4
[UpstreamPretrainExpert] - Number of parameters: 4630096
[Runner] - Loss to device
[Runner] - Accumulated batch size: 32
[Runner] - Training for 100 epochs, which is equivalent to 89200 steps
overall: 0%| | 0/89200 [00:00<?, ?it/s]
train: 0%| | 0/892 [00:00<?, ?it/s]
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
train: 0%| | 0/892 [00:15<?, ?it/s]
Traceback (most recent call last):
File "//root/st/s3prl/s3prl/run_pretrain.py", line 158, in
main()
File "//root/st/s3prl/s3prl/run_pretrain.py", line 153, in main
eval('runner.train')()
File "//root/st/s3prl/s3prl/pretrain/runner.py", line 162, in train
loss, records = self.upstream(
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "//root/st/s3prl/s3prl/pretrain/apc/pretrain_expert.py", line 126, in forward
pred_spec, _ = self.model(audio_feat[:,:-self.n_future,:], audio_len, testing=False)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
output.reraise()
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "//root/st/s3prl/s3prl/upstream/apc/apc.py", line 118, in forward
packed_rnn_inputs = pack_padded_sequence(
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/utils/rnn.py", line 264, in pack_padded_sequence
_VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
overall: 0%| | 0/89200 [00:16<?, ?it/s]
```
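For context, the `srcIndex < srcSelectDimSize` assertion is raised by CUDA's index-select/embedding kernels when an index is out of range for the tensor being gathered from. Because CUDA reports errors asynchronously, the Python frame in the traceback (`pack_padded_sequence`) is not necessarily the operation that actually asserted; re-running with `CUDA_LAUNCH_BLOCKING=1`, as the error message itself suggests, gives an accurate stack trace. A minimal sketch that reproduces the same device-side assert, using a hypothetical embedding rather than any s3prl code:

```python
import torch

# Minimal sketch (hypothetical tensors, not s3prl code): an out-of-range index
# in a GPU embedding lookup / index_select triggers the same device-side
# assert shown above ("srcIndex < srcSelectDimSize").
if torch.cuda.is_available():
    emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4).cuda()
    bad_idx = torch.tensor([0, 3, 12], device="cuda")  # 12 >= num_embeddings
    try:
        _ = emb(bad_idx)
        torch.cuda.synchronize()  # the asynchronous CUDA error surfaces here
    except RuntimeError as err:
        # After a device-side assert the CUDA context is unusable for the
        # rest of the process, which is why the whole training run aborts.
        print("device-side assert:", err)
```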
Can you try running the pretraining without specifying --multi_gpu? It looks like an issue with multi-GPU training, probably due to a PyTorch version problem.
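For reference, the single-GPU run is the same command without the flag: `python3 run_pretrain.py -u vq_apc -g pretrain/vq_apc/config_model.yaml -n my_pretrain`. The traceback shows that `--multi_gpu` wraps the model in `torch.nn.DataParallel`, so a quick way to narrow things down is to compare the same forward pass with and without the wrapper; a minimal sketch with a hypothetical toy module (not the s3prl runner):

```python
import torch

# Sketch with a toy module, not the s3prl runner: nn.DataParallel splits each
# tensor input along dim 0 across the visible GPUs and runs one replica per
# device, so the multi-GPU path should match the single-GPU forward.
model = torch.nn.Linear(80, 80)
x = torch.randn(16, 80)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    single_out = model(x)                        # plain single-GPU forward
    if torch.cuda.device_count() > 1:
        dp_model = torch.nn.DataParallel(model)  # roughly what --multi_gpu enables here
        multi_out = dp_model(x)                  # each GPU sees a slice of the batch
        print(torch.allclose(single_out, multi_out, atol=1e-6))
```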
Yes, the script can run normally without specifying '--multi_gpu'. My PyTorch version is 2.3.1+cu121.
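Since the failure only shows up with `--multi_gpu`, one thing worth checking (a common DataParallel pitfall, stated here as an assumption rather than something confirmed from the s3prl code) is how the feature tensor and the per-utterance lengths are split across replicas: `DataParallel` chunks tensor arguments along the batch dimension but replicates a plain Python list in full to every replica, so features and lengths can go out of sync on the multi-GPU path only. A minimal sketch of that scatter behaviour, with hypothetical shapes:

```python
import torch
from torch.nn.parallel import scatter

# Hypothetical shapes (batch=8, time=1500, mel=80), not s3prl data: tensors are
# chunked along dim 0 across GPUs, while a plain Python list of lengths is
# copied whole to every replica, so each replica would get 4 feature rows but
# all 8 lengths.
if torch.cuda.device_count() >= 2:
    devices = [0, 1]
    feats = torch.randn(8, 1500, 80)
    lengths_list = [1500, 1400, 1300, 1200, 1100, 1000, 900, 800]

    feat_chunks = scatter(feats, devices)           # two tensors of shape (4, 1500, 80)
    length_chunks = scatter(lengths_list, devices)  # two full copies of all 8 lengths
    print([tuple(c.shape) for c in feat_chunks])
    print([len(c) for c in length_chunks])

    # Passing the lengths as a tensor keeps them aligned with the feature chunks.
    length_chunks_t = scatter(torch.tensor(lengths_list), devices)
    print([len(c) for c in length_chunks_t])        # 4 and 4
```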