
run vq_apc pretrain failed #542

Open
moodeerf opened this issue Jun 4, 2024 · 3 comments

moodeerf commented Jun 4, 2024

Hi, when I run vq_apc pretraining with `python3 run_pretrain.py -u vq_apc -g pretrain/vq_apc/config_model.yaml --multi_gpu -n my_pretrain`, I encountered the following error:
```
[Runner] - Start a new experiment
[UpstreamPretrainExpert] - Using upstream config from: pretrain/vq_apc/config_model.yaml
[UpstreamPretrainExpert] - Using the apc preprocessor, on-the-fly feature preprocessing
[UpstreamPretrainExpert] - Initializing model...
[Dataset] - Sampling random segments for training, sample length: 1500
[Dataset] - Training data from these sets: ['train-clean-100']
[Dataset] - Number of individual training instances: 28539
[UpstreamPretrainExpert] - Multi-GPU training Enabled: 4
[UpstreamPretrainExpert] - Number of parameters: 4630096
[Runner] - Loss to device
[Runner] - Accumulated batch size: 32
[Runner] - Training for 100 epochs, which is equivalent to 89200 steps

overall: 0%| | 0/89200 [00:00<?, ?it/s]

train: 0%| | 0/892 [00:00<?, ?it/s]
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
....
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [45,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.

train: 0%| | 0/892 [00:15<?, ?it/s]
Traceback (most recent call last):
File "//root/st/s3prl/s3prl/run_pretrain.py", line 158, in <module>
main()
File "//root/st/s3prl/s3prl/run_pretrain.py", line 153, in main
eval('runner.train')()
File "//root/st/s3prl/s3prl/pretrain/runner.py", line 162, in train
loss, records = self.upstream(
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "//root/st/s3prl/s3prl/pretrain/apc/pretrain_expert.py", line 126, in forward
pred_spec, _ = self.model(audio_feat[:,:-self.n_future,:], audio_len, testing=False)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
output.reraise()
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "//root/st/s3prl/s3prl/upstream/apc/apc.py", line 118, in forward
packed_rnn_inputs = pack_padded_sequence(
File "/root/miniconda3/envs/wis/lib/python3.10/site-packages/torch/nn/utils/rnn.py", line 264, in pack_padded_sequence
_VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

overall: 0%| | 0/89200 [00:16<?, ?it/s]
```
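
As the error message itself suggests, re-running with `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the Python traceback should point at the call that actually triggered the device-side assert. A minimal sketch of that invocation, assuming the same flags as the command above:

```
# Sketch: same command as above, with synchronous CUDA launches for debugging.
# CUDA_LAUNCH_BLOCKING must be set before the process initializes CUDA.
CUDA_LAUNCH_BLOCKING=1 python3 run_pretrain.py -u vq_apc \
    -g pretrain/vq_apc/config_model.yaml --multi_gpu -n my_pretrain
```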

leo19941227 (Member) commented

Hi @andi611,

Could you look into this issue when you are available?
(P.S. I did not touch the pre-training part of APC; it was implemented by Andy.)

andi611 (Member) commented Jun 16, 2024

Hi @moodeerf,

Can you try running the pretraining without specifying --multi_gpu? It looks like an issue with multi-GPU training, probably due to a PyTorch version problem.
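
For reference, the same run with only the `--multi_gpu` flag dropped would look like the sketch below (all other flags unchanged from the original report):

```
# Sketch: the reported command without --multi_gpu, i.e. single-GPU training.
python3 run_pretrain.py -u vq_apc -g pretrain/vq_apc/config_model.yaml -n my_pretrain
```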

moodeerf (Author) commented

> Hi @moodeerf,
>
> Can you try running the pretraining without specifying --multi_gpu? It looks like an issue with multi-GPU training, probably due to a PyTorch version problem.

Yes, the script runs normally without `--multi_gpu`. My PyTorch version is 2.3.1+cu121.
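
In case it helps, the installed PyTorch and CUDA build versions can be double-checked with the one-liner sketched below (not part of the original report):

```
# Sketch: print the PyTorch version and the CUDA version it was built against.
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```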
