An illegal memory access was encountered with more than 1 GPU #2638
-
I think that torchtune should see all GPUs I selected:
[GPU status table output truncated in the original post]
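A quick way to cross-check what the process actually sees after the export (a minimal sketch, not from the original post; it assumes PyTorch is importable in the job environment, and check_gpus.py is just an illustrative name):

# check_gpus.py -- print what this single process sees after
# CUDA_VISIBLE_DEVICES has been exported by the PBS script.
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))

With CUDA_VISIBLE_DEVICES=1,3 this should report two devices, renumbered as cuda:0 and cuda:1.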
-
Hi @mariozupan, can you share the contents of the file fft.yaml?
-
Yeah, this is the try with 2 GPUs requested. I exported CUDA_VISIBLE_DEVICES=1,3 and, following your advice, prefixed the command with CUDA_LAUNCH_BLOCKING=1. I got:
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config: batch_size: 1
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 578835463. Local seed is seed + rank = 578835463 + 0
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1987 (most recent call first):
Error an illegal memory access was encountered at line 113 in file /src/csrc/ops.cu
W0502 18:14:22.769415 1012458 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1013865 closing signal SIGTERM
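One way to narrow this down (a hedged suggestion, not something proposed in the thread): run a plain NCCL collective under the same GPU mapping, outside of torchtune, to see whether two-process communication on these GPUs already triggers the illegal memory access. The file name nccl_smoke.py and the launch line are illustrative assumptions:

# nccl_smoke.py -- launch with the same mapping, e.g.:
#   CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes 1 --nproc_per_node 2 nccl_smoke.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Each rank contributes its rank id; with 2 processes the all_reduce
    # result should be 0 + 1 = 1 on both ranks if NCCL works on these GPUs.
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this small script also crashes, the problem is likely in the GPU/NCCL setup rather than in the recipe; if it passes, the recipe or one of its CUDA extensions becomes the more likely suspect.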
-
I need help with submitting a job on HPC. With the following command I selected 2 of the 4 available GPUs on the HPC:
#PBS -l select=1:ncpus=8:mem=200GB:ngpus=2
with
export CUDA_VISIBLE_DEVICES="${mapped%,}"
echo "Mapped CUDA_VISIBLE_DEVICES to: $CUDA_VISIBLE_DEVICES"
I'm getting:
Mapped CUDA_VISIBLE_DEVICES to: 1,3
Sometimes I get 0,1; it depends on which GPUs are free.
When I start the full fine-tune distributed torchtune recipe with nproc 1, it works:
tune run --nnodes 1 --nproc_per_node 1
full_finetune_distributed --config ./fft.yaml
However, with nproc 2 it doesn't work. I'm getting:
"RuntimeError: CUDA error: an illegal memory access was encountered"
My torch version is 2.8.0.dev20250421+cu126.
I tried decreasing the batch size from 4 to 1 and the validation batch size from 8 to 2, but it doesn't help.
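For what it's worth, a per-rank binding check (a minimal sketch, assuming PyTorch and torchrun are available in the job environment; rank_check.py is an illustrative name): with CUDA_VISIBLE_DEVICES=1,3 the two physical GPUs are renumbered to cuda:0 and cuda:1 inside each worker, so with --nproc_per_node 2 each rank should see two devices and bind to its own local index.

# rank_check.py -- launch with, e.g.:
#   CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes 1 --nproc_per_node 2 rank_check.py
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
print(
    f"rank {os.environ.get('RANK', '0')}: "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
    f"device_count={torch.cuda.device_count()}, "
    f"current_device=cuda:{torch.cuda.current_device()} "
    f"({torch.cuda.get_device_name(local_rank)})"
)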