CUDA Memory leak w/ torch.compile in both stable and trunk #119607


Closed

xmfan opened this issue Feb 9, 2024 · 27 comments
Labels
high priority · module: dynamo · oncall: pt2 · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@xmfan (Member) commented Feb 9, 2024

🐛 Describe the bug

Models traced with torch.compile don't seem to be freeing CUDA memory:

import torch
import gc

def main():
    x = torch.randn(1000, 3000, device="cuda", requires_grad=True)
    model = torch.nn.Sequential(
        torch.nn.Linear(3000, 10000),
        torch.nn.ReLU(),
        torch.nn.Linear(10000, 50000),
        torch.nn.ReLU(),
        torch.nn.Linear(50000, 20000),
        torch.nn.ReLU(),
        torch.nn.Linear(20000, 1234),
    ).to("cuda")
    model = torch.compile(model, backend="eager")
    model(x)

if __name__ == "__main__":
    main()

    # Everything CUDA-related is local to main(), so the memory should be
    # reclaimable here. Tried clearing it in a few ways:
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    torch._C._cuda_clearCublasWorkspaces()
    gc.collect()

    print(f"{torch.cuda.memory_allocated()/1e9} GB!!")  # 6.219729408 GB!!

One high priority use case for fixing this is compiled autograd, which calls torch.compile once for the compiled forward and once for the compiled backward, leading to 2x memory use.
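For reference, a minimal sketch of that compiled-autograd pattern, assuming the trunk-era torch._dynamo.compiled_autograd.enable context manager (an internal API whose name and signature may differ across versions): the forward is compiled by torch.compile, and the backward graph is compiled separately, so two sets of compiled artifacts and their captured state are live at once.

import torch

def compiler_fn(gm):
    # Compile the captured autograd graph the same way the forward is compiled.
    return torch.compile(gm, backend="eager")

model = torch.compile(
    torch.nn.Linear(3000, 1234).to("cuda"), backend="eager"
)  # first torch.compile use: the forward
x = torch.randn(1000, 3000, device="cuda", requires_grad=True)

with torch._dynamo.compiled_autograd.enable(compiler_fn):
    out = model(x)
    out.sum().backward()  # second compilation: the backward graph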

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @Chillee

Versions

2.2.0
trunk

@xmfan changed the title: Memory leak in nightly → CUDA Memory leak w/ torch.compile in nightly Feb 9, 2024
@xmfan changed the title: CUDA Memory leak w/ torch.compile in nightly → CUDA Memory leak w/ torch.compile in both stable and nightly Feb 10, 2024
@xmfan (Member, Author) commented Feb 10, 2024

Marking as dynamo since it happens with backend="eager".

@xmfan changed the title: CUDA Memory leak w/ torch.compile in both stable and nightly → CUDA Memory leak w/ torch.compile in both stable and trunk Feb 10, 2024
@anijain2305 (Contributor)

Cc @williamwen42

@malfet (Contributor) commented Feb 12, 2024

Dynamo has its own mechanism for cleaning compiled artifacts; wouldn't that be sufficient? And perhaps something like that is needed on the Triton side as well.
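Presumably the mechanism in question is torch._dynamo.reset(); a minimal sketch of calling it after the original repro (as the rest of the thread shows, the leak here turns out to be a stray reference to the module rather than cache growth, so clearing caches alone may not help):

import torch

# Drop Dynamo's compilation caches (compiled code, guards, ...).
torch._dynamo.reset()
# Return cached, unused blocks in the CUDA caching allocator to the driver.
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated() / 1e9} GB still allocated")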

@williamwen42 (Member) commented Feb 12, 2024

This is not intended behavior, but I find that if I wrap the torch.nn.Sequential inside a custom nn.Module, then the memory gets freed:

import gc
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # self.fc1 = torch.nn.Linear(3000, 50000)
        self.fc1 = torch.nn.Sequential(
            torch.nn.Linear(3000, 10000),
            torch.nn.ReLU(),
            torch.nn.Linear(10000, 50000),
            torch.nn.ReLU(),
            torch.nn.Linear(50000, 20000),
            torch.nn.ReLU(),
            torch.nn.Linear(20000, 1234),
        )

    def forward(self, out):
        out = self.fc1(out)
        return out

def run(compile):
    mod = MyModel().cuda()
    if compile:
        mod = torch.compile(mod, backend="eager")
    inp = torch.rand(10000, 3000).cuda()
    mod(inp)

def clean_and_report_memory():
    gc.collect()
    print(f"max memory: {torch.cuda.max_memory_allocated()}, curr memory: {torch.cuda.memory_allocated()}")

run(False)
clean_and_report_memory()

run(True)
clean_and_report_memory()

torch._dynamo.reset()
clean_and_report_memory()

Output:

max memory: 2730451456, curr memory: 8519680
max memory: 2730451456, curr memory: 8519680
max memory: 2730451456, curr memory: 8519680

I will continue to investigate why memory is not being freed in the original code snippet.

@gchanan (Contributor) commented Feb 12, 2024

Is it a regression?

@soulitzer added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and removed the triage review label Feb 13, 2024
@atalman (Contributor) commented Feb 14, 2024

Moving to release 2.2.2, since the fix is not out yet.

@atalman modified the milestones: 2.2.1 → 2.2.2 Feb 14, 2024
@williamwen42 (Member) commented Feb 15, 2024

I've simplified the repro:

import gc
import weakref
import torch

mod = torch.nn.Linear(3000, 50000).cuda()
def fn(x):
    return mod(x)

ref = weakref.ref(mod, lambda _: print("mod deleted"))
weakref.finalize(fn, lambda: print("fn deleted"))

inp = torch.rand(10000, 3000).cuda()

torch.compile(backend="eager")(fn)(inp)

del mod
del fn

gc.collect()

# expect finalizers to run before this point
breakpoint()

It seems that dynamo holds on to a reference to mod somewhere.
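One way to hunt for that reference (a debugging aid added here for illustration, not part of the original repro) is to resolve the weakref after the gc.collect() and walk the referrers of the still-alive module:

import gc

obj = ref()  # `ref` is the weakref to `mod` created in the repro above
if obj is not None:
    # `mod` is still alive; list what is keeping it so. The current frame
    # will show up too, since `obj` is a local here.
    for referrer in gc.get_referrers(obj):
        print(type(referrer), str(referrer)[:120])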

@lezcano (Collaborator) commented Feb 22, 2024

A shot in the dark, but might this be related to the memory leak that you were once hunting, @Fidget-Spinner?

@Fidget-Spinner (Collaborator)

The memory leak I was once hunting concerned a circular reference between the compiled code cache and the code object itself, IIRC. If anyone is aware of a reference from the compiled artefact object to the symbolic evaluator, that might be a source of leaks, because the symbolic stuff definitely holds a reference to mod in this case. Otherwise, it wouldn't be that.
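For illustration only (plain Python, not Dynamo's actual cache layout): a function caught in a reference cycle with a cache entry can only be reclaimed by the cycle collector, and leaks outright if the cache entry stays reachable from elsewhere.

import gc
import weakref

def fn(x):
    return x + 1

# Hypothetical cache entry referencing both the code object and the function.
entry = {"code": fn.__code__, "fn": fn}
fn.cache_entry = entry        # fn -> entry -> fn: a reference cycle

ref = weakref.ref(fn)
del fn, entry
print(ref() is None)          # False: refcounting alone cannot break the cycle
gc.collect()
print(ref() is None)          # True: the cycle collector reclaims it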

@lezcano (Collaborator) commented Feb 22, 2024

Alas, @IvanYashchuk's patch in #109422 didn't seem to help with this one, so I guess there's something else going on here.

@ezyang (Contributor) commented Feb 26, 2024

@williamwen42 any updates on this?

@williamwen42 (Member) commented Feb 26, 2024

Still working on this - got blocked recently because #112090 was happening again, but it has been resolved.

@williamwen42 (Member)

The repro no longer leaks due to #120578, but I still observe a leak if the model is a builtin module (e.g. torch.nn.Linear), or if we are on Python 3.11+.
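A minimal sketch of the remaining builtin-module case described above (compiling a torch.nn.Linear directly), for anyone who wants to reproduce it:

import gc
import torch

def run():
    mod = torch.compile(torch.nn.Linear(3000, 50000).cuda(), backend="eager")
    inp = torch.rand(10000, 3000).cuda()
    mod(inp)

run()
gc.collect()
torch._dynamo.reset()
# Expected to drop back toward zero once freed; stays high while the leak is present.
print(f"curr memory: {torch.cuda.memory_allocated()}")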

williamwen42 added a commit that referenced this issue Apr 17, 2024
Fixes #119607 for 3.11+.

In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs, since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.

Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
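For readers unfamiliar with the bytecode involved: on CPython 3.11+ a function with free variables (such as the fn closing over mod in the earlier repro) begins its bytecode with a COPY_FREE_VARS instruction, which is the instruction the fix skips in the shadow frame when the original frame already ran it. A quick, illustrative way to look at it with the standard library:

import dis

def make(mod):
    def fn(x):
        return mod(x)  # `mod` is a free variable of fn
    return fn

fn = make(print)
# On 3.11+ the prologue is expected to include COPY_FREE_VARS (assumption based
# on the commit message above); inspect the first few opcodes to see it.
print([ins.opname for ins in dis.get_instructions(fn)][:4])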
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this issue Apr 18, 2024
Summary:
Fixes pytorch/pytorch#119607 for 3.11+.

X-link: pytorch/pytorch#124238
Approved by: https://github.com/jansel

Reviewed By: PaliC

Differential Revision: D56289286

Pulled By: williamwen42

fbshipit-source-id: 121abe4d8165d3bb4a2145841a8909bbd23a98dc
@JerrickLiu

@williamwen42 is there a fix for this? I am also experiencing a memory leak with torch.compile and Python 3.11.

@williamwen42 (Member)

Do you have a repro? The fix only recently went in, so you should try the nightly binaries.

@JerrickLiu

Not a local repro, unfortunately. I can try the nightly binary. How long does it take to make it into the default stable installation?

Where can I find the nightly binaries that would have your fix?

@JerrickLiu

I'm also hitting this leak with backend=cudagraphs, if your fix accounts for that

@JerrickLiu commented Apr 19, 2024

@williamwen42 bump on the nightly binary. Can I just use the one found here (https://pytorch.org/get-started/locally/) and select nightly? I have a nightly build, but with the above repro I still see the memory leak, most likely because I don't have your changes. Is there a way to verify I have your change?

@williamwen42 (Member)

Yeah, that's the right link. Give it another day and it should be in the nightlies. The fix will not be in the stable binaries until the next release (2.4). The leak occurs at a very high level in the PT2 stack (dynamo) - it should occur even on the eager backend. Can you confirm this?
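A quick way to check the installed build and re-run the simplified repro from earlier on the eager backend (the exact number printed is machine-dependent; the expectation, once the fix is in, is that it drops to roughly zero):

import gc
import torch

print(torch.__version__)  # nightly builds report a dev/date-suffixed version

mod = torch.nn.Linear(3000, 50000).cuda()
fn = torch.compile(lambda x: mod(x), backend="eager")
fn(torch.rand(10000, 3000).cuda())

del fn, mod
gc.collect()
torch._dynamo.reset()
print(f"curr memory: {torch.cuda.memory_allocated()}")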

pytorch-bot bot pushed a commit that referenced this issue Apr 22, 2024
…120756)

Fixes remaining refleaks found when debugging #119607, tests added in #120657.

Also fixes some tests that xfail: #120631 (not entirely sure why), but introduced tests now fail.

Pull Request resolved: #120756
Approved by: https://github.com/jansel
pytorch-bot bot pushed a commit that referenced this issue Apr 22, 2024
Fixes #119607 for 3.11+.

Pull Request resolved: #124238
Approved by: https://github.com/jansel
@JerrickLiu

@williamwen42 I confirmed with a nightly build that the memory leak is fixed

petrex pushed a commit to petrex/pytorch that referenced this issue May 3, 2024
Fixes pytorch#119607 for 3.11+.

Pull Request resolved: pytorch#124238
Approved by: https://github.com/jansel
pytorchbot pushed a commit that referenced this issue May 13, 2024
Fixes #119607 for 3.11+.

Pull Request resolved: #124238
Approved by: https://github.com/jansel

(cherry picked from commit 812bae0)
williamwen42 added a commit that referenced this issue May 15, 2024
…120756)

Fixes remaining refleaks found when debugging #119607, tests added in #120657.

Pull Request resolved: #120756
Approved by: https://github.com/jansel
atalman pushed a commit that referenced this issue May 22, 2024
…126332)

* [dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756)

Fixes remaining refleaks found when debugging #119607, tests added in #120657.

Pull Request resolved: #120756
Approved by: https://github.com/jansel
@huydhn (Contributor) commented May 30, 2024

This issue has been fixed on Python 3.11 in the upcoming 2.3.1 release: https://dev-discuss.pytorch.org/t/pytorch-release-2-3-1-final-rc-is-available/2126
