An illegal memory access was encountered with more than 1 GPU #2638
-
I think that torchtune should see all GPUs I selected:
[GPU status table output truncated in the original post]
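A quick way to cross-check what the process actually sees after the export (a minimal sketch, not from the original post; it assumes PyTorch is importable in the job environment, and check_gpus.py is just an illustrative name):

# check_gpus.py -- print what this single process sees after
# CUDA_VISIBLE_DEVICES has been exported by the PBS script.
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))

With CUDA_VISIBLE_DEVICES=1,3 this should report two devices, renumbered as cuda:0 and cuda:1.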
-
Hi @mariozupan, can you share the contents of the file fft.yaml?
-
Yeah, this is the try with 2 GPUs requested. I exported CUDA_VISIBLE_DEVICES=1,3 and, following your advice, prefixed the command with CUDA_LAUNCH_BLOCKING=1. I got:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config: batch_size: 1
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 578835463. Local seed is seed + rank = 578835463 + 0
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1987 (most recent call first):
Error an illegal memory access was encountered at line 113 in file /src/csrc/ops.cu
W0502 18:14:22.769415 1012458 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1013865 closing signal SIGTERM
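One way to narrow this down (a hedged suggestion, not something proposed in the thread): run a plain NCCL collective under the same GPU mapping, outside of torchtune, to see whether two-process communication on these GPUs already triggers the illegal memory access. The file name nccl_smoke.py and the launch line are illustrative assumptions:

# nccl_smoke.py -- launch with the same mapping, e.g.:
#   CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes 1 --nproc_per_node 2 nccl_smoke.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Each rank contributes its rank id; with 2 processes the all_reduce
    # result should be 0 + 1 = 1 on both ranks if NCCL works on these GPUs.
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this small script also crashes, the problem is likely in the GPU/NCCL setup rather than in the recipe; if it passes, the recipe or one of its CUDA extensions becomes the more likely suspect.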
-
I need help with submitting a job on HPC. With the following command I selected 2 of the 4 available GPUs on the HPC:
#PBS -l select=1:ncpus=8:mem=200GB:ngpus=2
with
export CUDA_VISIBLE_DEVICES="${mapped%,}"
echo "Mapped CUDA_VISIBLE_DEVICES to: $CUDA_VISIBLE_DEVICES"
I'm getting:
Mapped CUDA_VISIBLE_DEVICES to: 1,3
Sometimes I get 0,1; it depends on which GPUs are free.
When I start the full fine-tune distributed torchtune recipe with nproc 1, it works:
tune run --nnodes 1 --nproc_per_node 1
full_finetune_distributed --config ./fft.yaml
However, with nproc 2 it doesn't work. I'm getting:
"RuntimeError: CUDA error: an illegal memory access was encountered"
My torch version is 2.8.0.dev20250421+cu126.
I tried decreasing the batch size from 4 to 1 and the validation batch size from 8 to 2, but it doesn't help.
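For what it's worth, a per-rank binding check (a minimal sketch, assuming PyTorch and torchrun are available in the job environment; rank_check.py is an illustrative name): with CUDA_VISIBLE_DEVICES=1,3 the two physical GPUs are renumbered to cuda:0 and cuda:1 inside each worker, so with --nproc_per_node 2 each rank should see two devices and bind to its own local index.

# rank_check.py -- launch with, e.g.:
#   CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes 1 --nproc_per_node 2 rank_check.py
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
print(
    f"rank {os.environ.get('RANK', '0')}: "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
    f"device_count={torch.cuda.device_count()}, "
    f"current_device=cuda:{torch.cuda.current_device()} "
    f"({torch.cuda.get_device_name(local_rank)})"
)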