CUDA error: an Illegal memory access was encountered #2957

ali-fani-sd · 2025-05-07T22:24:00Z

We are using a DLRM model for personalization and are hitting a CUDA error. After setting the CUDA_LAUNCH_BLOCKING flag and enabling CUDA core dumps, the failure pointed to two places where the issue might be happening:
1: torchrec/distributed/embeddingbag.py: input_dist
2: torchrec/sparse/jagged_tensor.py: permute()
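
For anyone trying to reproduce this, here is a rough sketch of how the two debugging settings can be turned on from Python. CUDA_LAUNCH_BLOCKING is the flag mentioned above. CUDA_ENABLE_COREDUMP_ON_EXCEPTION is our assumption about the variable behind "cuda core dump", since the post does not name it. Both need to be set before the process makes its first CUDA call.

```python
import os

# Set these before importing torch or touching the GPU, otherwise the
# CUDA runtime may already be initialized and ignore them.

# Run kernels synchronously so the Python stack trace points at the
# launch that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Assumption: this is the variable used to enable GPU core dumps when a
# device exception occurs.
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
```

Exporting the same variables in the shell that launches the training job should have the same effect.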

Some of our JaggedTensors carry weights. When we debug jagged_tensor.py, we see a mismatch between the values (the sum of the permuted lengths per key) and the weights. Could that be the root cause of the CUDA error?
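
To make the suspected mismatch concrete, below is a minimal sketch of the consistency check we have in mind. It is plain Python and works on sizes only. The helper name `find_size_mismatches` is hypothetical and is not part of torchrec. It assumes that a well-formed jagged input has `sum(lengths)` equal to the number of values, and that weights, when present, have one entry per value.

```python
from typing import List, Optional, Sequence


def find_size_mismatches(
    lengths: Sequence[int],
    num_values: int,
    num_weights: Optional[int] = None,
) -> List[str]:
    """Return human-readable descriptions of any size inconsistencies.

    Hypothetical helper for illustration; an empty list means the sizes agree.
    """
    problems: List[str] = []

    # A negative length can never be valid.
    if any(n < 0 for n in lengths):
        problems.append("negative entry found in lengths")

    # The lengths must account for every element of values, no more, no less.
    total = sum(lengths)
    if total != num_values:
        problems.append(
            f"sum(lengths)={total} but values has {num_values} elements"
        )

    # When weights are used, there must be exactly one weight per value.
    if num_weights is not None and num_weights != num_values:
        problems.append(
            f"weights has {num_weights} elements but values has {num_values}"
        )

    return problems
```

With a KeyedJaggedTensor `kjt`, the inputs could be taken from `kjt.lengths().tolist()`, `kjt.values().numel()`, and `kjt.weights().numel()` for weighted features. Running this on each rank just before input_dist should show whether the sizes already disagree before permute() is reached.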

Comments