Description
I built a model with parallel_nsa and MoE, and I get the following error when training it in parallel on 2 GPUs. Do you know where the error comes from: is it from parallel_nsa or from MoE? There is no parallel setting in the MoE module. Is there an example of parallel training with NSA and MoE?
[default1]:[rank1]: Traceback (most recent call last):
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[default1]:[rank1]: return f(*args, **kwargs)
[default1]:[rank1]: ^^^^^^^^^^^^^^^^^^
[default1]:[rank1]: ~train.py", line 650, in main
[default1]:[rank1]: loss.backward()
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
[default1]:[rank1]: torch.autograd.backward(
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[default1]:[rank1]: _engine_run_backward(
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[default1]:[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[default1]:[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[default1]:[rank1]: frame.check_recomputed_tensors_match(gid)
[default1]:[rank1]: ~/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 865, in check_recomputed_tensors_match
[default1]:[rank1]: raise CheckpointError(
[default1]:[rank1]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
[default1]:[rank1]: Number of tensors saved during forward: 344
[default1]:[rank1]: Number of tensors saved during recomputation: 239
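The CheckpointError in the traceback is raised by torch.utils.checkpoint (activation checkpointing): the recomputed forward saved a different set of tensors than the original forward did. One way to narrow down whether the NSA path or the MoE path triggers the mismatch is to checkpoint each block in isolation and see which probe raises. Below is a minimal sketch, assuming PyTorch >= 2.1 (for the debug flag of torch.utils.checkpoint.checkpoint); nsa_block and moe_block are hypothetical handles to the real modules, and the nn.Linear is only a runnable placeholder.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def probe(block: nn.Module, x: torch.Tensor) -> None:
    # Wrap only this block in non-reentrant activation checkpointing.
    # With debug=True, a saved-tensor mismatch also reports traces of the
    # ops run in the original forward vs. the recomputation.
    out = checkpoint(block, x, use_reentrant=False, debug=True)
    out.sum().backward()

x = torch.randn(2, 8, requires_grad=True)
probe(nn.Linear(8, 8), x)           # placeholder block: runs cleanly
# probe(nsa_block, hidden_states)   # if this raises, the NSA path is the culprit
# probe(moe_block, hidden_states)   # if this raises, the MoE path is the culprit
```

Whichever probe reproduces the CheckpointError points at the module whose forward is non-deterministic under recomputation (e.g. data-dependent branching such as MoE routing, or kernels that save a varying number of tensors), which should help answer whether parallel_nsa or MoE is responsible.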