Added explicit memory management during the VAE decode process. < 8000 span class="f1-light color-fg-muted">#7587

KBRASK · 2025-04-13T04:05:48Z

Hello, I noticed that during the VAE decode process, the feature maps as intermediate variables were not explicitly deleted, which led to unnecessary CUDA out of memory errors. I have fixed this by expli 8000 citly deleting the intermediate variables.

comfyanonymous · 2025-04-20T00:00:40Z

I can't merge this as is because it would break on non cuda devices but does it actually improve things?

h = h_new does "del" the h variable. del doesn't actually delete anything, it only removes that reference to the object.

KBRASK · 2025-04-20T02:50:20Z

@comfyanonymous Hi, thank you for your review.
In this section, during the iterative upsampling process, the feature map h continuously accumulates, which is a significant reason for the CUDA out of memory errors.
I removed the reference to the old h and then used torch.cuda.empty_cache(). This allows the memory for the unused h tensor to be freed. This approach can reduce the peak VRAM usage during the VAE decoding process in ComfyUI by 40%.

KBRASK · 2025-04-20T02:55:09Z

You are correct that non-CUDA devices cannot use torch.cuda.empty_cache() to release VRAM. However, this step is quite important for CUDA devices. I can add logic here to check if CUDA is available and apply this optimization specifically for CUDA devices.

KBRASK · 2025-04-20T03:00:57Z

I can also show you the difference in VRAM usage during the VAE decoding stage, comparing the results with and without explicit memory management.
Previously (in the original logic), all feature maps generated during upsampling were kept in memory until the entire decoding phase was finished, at which point they were all released together.
Now (in the new logic), we clear the previous feature map immediately after each upsampling step. This avoids it taking up VRAM unnecessarily.

chaObserv · 2025-04-20T03:38:06Z

You are correct that non-CUDA devices cannot use torch.cuda.empty_cache() to release VRAM. However, this step is quite important for CUDA devices. I can add logic here to check if CUDA is available and apply this optimization specifically for CUDA devices.

If empty_cache is needed, comfy.model_management.soft_empty_cache() might be better for compatibility.

…ache() with model_management.soft_empty_cache().

KBRASK · 2025-04-20T07:50:11Z

@chaObserv Thank you for your valuable suggestion! I have already replaced torch.cuda.empty_cache() with model_management.soft_empty_cache() to support non-CUDA devices.

Added explicit memory management during the VAE decode process.

8e1d708

KBRASK requested a review from comfyanonymous as a code owner April 13, 2025 04:05

KBRASK and others added 2 commits April 20, 2025 15:26

Merge branch 'comfyanonymous:master' into master

4d4e6a0

To enable support for non-CUDA devices, substitute torch.cuda.empty_c…

0264149

…ache() with model_management.soft_empty_cache().

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added explicit memory management during the VAE decode process. < 8000 span class="f1-light color-fg-muted">#7587

Added explicit memory management during the VAE decode process. #7587

Added explicit memory management during the VAE decode process. < 8000 span class="f1-light color-fg-muted">#7587

Are you sure you want to change the base?

Added explicit memory management during the VAE decode process. #7587

Conversation