Parallel training error: different number of tensors for forward and backward pass · Issue #24 · fla-org/native-sparse-attention
Open
@CaiYitao

Description


I built a model with parallel_nsa and MoE, and got the following error when training it in parallel on 2 GPUs. Do you know where the error comes from: parallel_nsa or the MoE? There is no parallel-specific setting in the MoE module. Is there an example of parallel training with NSA and MoE?

[default1]:[rank1]: Traceback (most recent call last):

[default1]:[rank1]:   ~/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[default1]:[rank1]:     return f(*args, **kwargs)
[default1]:[rank1]:            ^^^^^^^^^^^^^^^^^^
[default1]:[rank1]:  ~train.py", line 650, in main
[default1]:[rank1]:     loss.backward()
[default1]:[rank1]:   ~/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
[default1]:[rank1]:     torch.autograd.backward(
[default1]:[rank1]:  ~/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[default1]:[rank1]:     _engine_run_backward(
[default1]:[rank1]:  ~/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[default1]:[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[default1]:[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[default1]:[rank1]:   ~/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[default1]:[rank1]:     frame.check_recomputed_tensors_match(gid)
[default1]:[rank1]:   ~/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 865, in check_recomputed_tensors_match
[default1]:[rank1]:     raise CheckpointError(
[default1]:[rank1]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
[default1]:[rank1]: Number of tensors saved during forward: 344
[default1]:[rank1]: Number of tensors saved during recomputation: 239
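For context, this CheckpointError comes from PyTorch's non-reentrant activation checkpointing: it counts the tensors packed for backward during the original forward and again during recomputation, and raises if the counts differ. A common trigger is a checkpointed function that is not deterministic across the two passes, for example data-dependent branching such as MoE token routing. Below is a minimal, hypothetical sketch (not taken from this repo) that reproduces the same error by deliberately taking a different branch on the recompute pass:

```python
# Hypothetical repro sketch, assuming the mismatch is caused by
# non-determinism inside the checkpointed region (e.g. MoE routing
# resolving differently on recomputation).
import torch
from torch.utils.checkpoint import checkpoint

calls = {"n": 0}

def fn(x):
    # Take a different branch on the recompute pass than on the original
    # forward, so a different number of tensors is saved for backward.
    calls["n"] += 1
    if calls["n"] == 1:
        return (x.sin() * x.cos()).sum()  # saves more intermediates
    return x.sin().sum()                  # saves fewer intermediates

x = torch.randn(8, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False)
out.backward()  # raises CheckpointError: different number of tensors saved
```

If something like this is the cause here, making the checkpointed region deterministic between forward and recomputation (or moving the checkpoint boundary so it does not cut through the data-dependent code) would be the direction to look.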
