Description
Describe the bug
In PT2.7, torch.compile does not appear to fill the forces tensor, which is left at exactly zero. The compile test passes anyway because the ground-truth forces are very close to zero. To trigger an explicit error (as suggested in NVIDIA/cuEquivariance#77), add fullgraph=True to torch.compile; this raises an exception in dynamo.
To Reproduce
- Install PT2.7 and the latest MACE from GitHub.
- Modify test_mace in tests/test_compile.py as follows:
# skip if on windows
@pytest.mark.skipif(os.name == "nt", reason="Not supported on Windows")
@pytest.mark.parametrize("device", ["cuda"])
def test_mace(device, default_dtype):  # pylint: disable=W0621
    print(f"using default dtype = {default_dtype}")
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip(reason="cuda is not available")

    model_defaults = create_mace(device)
    tmp_model = mace_compile.prepare(create_mace)(device)
    model_compiled = torch.compile(tmp_model)  # Add fullgraph=True to see the dynamo exception

    batch = create_batch(device)
    output1 = model_defaults(batch, training=True)
    output2 = model_compiled(batch, training=True)

    print(torch.norm(output1["forces"]))
    print(torch.norm(output2["forces"]))
    assert False  # force a failure so pytest shows the printed norms

    assert_close(output1["energy"], output2["energy"])
    assert_close(output1["forces"], output2["forces"])
Expected behavior
The first tensor is close to (but not exactly) zero, while the second tensor is exactly zero, which is suspicious and is the reason the test passes. With fullgraph=True, the following exception is raised:
E torch._dynamo.exc.Unsupported: SKIPPED INLINING <code object forward at 0x7fafdfbb24a0, file "/global/cfs/cdirs/m1982/vbharadw/equivariant_spmm/mace_oeq_integration/mace/modules/symmetric_contraction.py", line 212>:
E
E from user code:
E File "/global/cfs/cdirs/m1982/vbharadw/equivariant_spmm/mace_oeq_integration/mace/modules/models.py", line 430, in forward
E node_feats = product(
E File "/global/cfs/cdirs/m1982/vbharadw/equivariant_spmm/mace_oeq_integration/mace/modules/blocks.py", line 301, in forward
E node_feats = self.symmetric_contractions(node_feats, node_attrs)
E File "/global/cfs/cdirs/m1982/vbharadw/equivariant_spmm/mace_oeq_integration/mace/modules/symmetric_contraction.py", line 82, in forward
E outs = [contraction(x, y) for contraction in self.contractions]
E File "/global/cfs/cdirs/m1982/vbharadw/equivariant_spmm/mace_oeq_integration/mace/modules/symmetric_contraction.py", line 82, in <listcomp>
E outs = [contraction(x, y) for contraction in self.contractions]
E
E Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
/global/cfs/projectdirs/m1982/vbharadw/conda/pt27/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:659: Unsupported
In PT2.6, both tensors are nonzero and the error is not raised.
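Since the reference forces are this small, an absolute-tolerance comparison cannot tell a correct result from an unfilled tensor. A minimal sketch of a stricter check, comparing the error against the reference's own norm, is below. The helper name `relative_error` and the 1e-4 threshold in the usage note are my own assumptions, not part of the MACE test suite.

```python
import torch


def relative_error(reference: torch.Tensor, candidate: torch.Tensor) -> float:
    """Return ||candidate - reference|| / ||reference|| over all elements.

    An all-zero candidate gives 1.0 whenever the reference is nonzero,
    no matter how small the reference values are.
    """
    ref_norm = torch.linalg.vector_norm(reference)
    if ref_norm.item() == 0.0:
        raise ValueError("reference is identically zero; relative error is undefined")
    return (torch.linalg.vector_norm(candidate - reference) / ref_norm).item()


# Tiny but nonzero reference forces, as in the failing test.
ref = torch.full((5, 3), 1e-7, dtype=torch.float64)
unfilled = torch.zeros_like(ref)

print(relative_error(ref, ref.clone()))  # identical tensors
print(relative_error(ref, unfilled))     # unfilled (all-zero) output
```

In test_mace this could replace the force comparison with something like `assert relative_error(output1["forces"], output2["forces"]) < 1e-4`, where the threshold would need tuning per dtype.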
Platform
NERSC Perlmutter A100 nodes
ETA: I tested this behavior with MACE 3.10 as well, so the issue does not seem to come from a recent change in MACE.