Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage by bazik210 · Pull Request #8136 · comfyanonymous/ComfyUI · GitHub

Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage #8136


Open

bazik210 wants to merge 2 commits into master
Conversation

bazik210

Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage

This update introduces fast_sampler.py, a new module designed to enhance the performance of sampling and VAE decoding in ComfyUI. It replaces or augments functionality previously handled in model_management.py, providing better VRAM management, FP16 support, and tiled decoding for low-memory scenarios. Additionally, it improves the PreviewImage node in nodes.py for faster and more efficient preview generation. These changes improve efficiency, stability, and usability, particularly for GPU-based workflows.

Key Changes:

  • Implemented fast_ksampler for optimized sampling with improved memory management, FP16 support via torch.amp.autocast, and the channels_last memory format for better GPU performance (see the first sketch after this list).
  • Added fast_vae_decode for efficient VAE decoding, incorporating FP16 support, channels_last, and selective VRAM clearing to prevent out-of-memory errors.
  • Introduced fast_vae_tiled_decode for tiled VAE decoding, enabling processing of large latents on GPUs with limited VRAM by using configurable tile sizes and overlaps (sketched below).
  • Added profiling and debugging utilities (profile_section, profile_cuda_sync) to track execution times and VRAM usage when the --profile or --debug flags are enabled (sketched below).
  • Improved VRAM management with clear_vram, which ensures sufficient free memory before loading models or the VAE, with configurable thresholds and minimum-free-memory requirements (sketched below).
  • Implemented is_fp16_safe to check GPU compatibility for FP16 operations, disabling them on unsupported hardware such as the GTX 1660 and other problematic Turing cards (sketched below).
  • Optimized tensor transfers with optimized_transfer and optimized_conditioning for synchronous device placement and dtype casting (sketched below).
  • Enhanced model preloading with preload_model, which unloads the VAE before loading the U-Net to conserve VRAM and skips the transfer when the VAE is already loaded.
  • Integrated cudnn.benchmark for testing; disabled by default.
  • Overall, VRAM is now reclaimed more aggressively between the sampling and decoding stages.
  • Updated the PreviewImage node in nodes.py to adaptively resize previews to a maximum dimension of ~512 pixels while preserving aspect ratio, using Image.LANCZOS for quality. Increased compress_level from 1 to 4, trading a slightly slower per-pixel PNG encode for smaller files; combined with the downscaling, preview generation is still faster overall (sketched below).

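The sketches below illustrate several of the mechanisms above. They are minimal approximations, not the PR's actual code; any name not mentioned in this description (run_sampler, decode_fn, shrink_for_preview, and so on) is a placeholder. First, the autocast plus channels_last pattern described for fast_ksampler:

```python
import torch

def run_sampler(model, latent):
    # Hypothetical stand-in for the real KSampler invocation.
    return model(latent)

def sample_with_autocast(model, latent, fp16_ok: bool):
    # channels_last improves convolution throughput on recent NVIDIA GPUs.
    latent = latent.to("cuda", memory_format=torch.channels_last)
    # autocast runs eligible ops in FP16 while keeping FP32 where needed;
    # it is disabled entirely when the GPU fails the FP16 safety check.
    with torch.amp.autocast(device_type="cuda", dtype=torch.float16,
                            enabled=fp16_ok):
        return run_sampler(model, latent)
```
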
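The tiled decode follows the usual tile-and-stitch idea: only one tile's activations live on the GPU at a time. A simplified sketch, assuming a fixed 8x latent-to-pixel scale and plain overwrite in the overlap regions (a production implementation would blend overlaps to hide seams):

```python
import torch

def tiled_decode(decode_fn, latent, tile=64, overlap=16):
    """Decode a (B, C, H, W) latent tile by tile to bound peak VRAM.
    Simplified: overlapping borders are overwritten, not blended."""
    b, _, h, w = latent.shape
    scale = 8  # typical latent-to-pixel factor for SD-style VAEs
    out = None
    step = max(tile - overlap, 1)
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            piece = decode_fn(latent[:, :, y:y1, x:x1])
            if out is None:
                out = torch.zeros(b, piece.shape[1], h * scale, w * scale,
                                  device=piece.device, dtype=piece.dtype)
            out[:, :, y * scale:y1 * scale, x * scale:x1 * scale] = piece
    return out
```
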
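A timing context manager in the spirit of profile_section, assuming it synchronizes CUDA so queued GPU work is actually counted; the output format is illustrative:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def profile_section(name: str, enabled: bool = True):
    """Time a block of work, including pending GPU kernels."""
    if not enabled:
        yield
        return
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before timing
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure timed work has finished
            vram_mib = torch.cuda.memory_allocated() / 2**20
        else:
            vram_mib = 0.0
        print(f"[profile] {name}: {time.perf_counter() - start:.3f}s, "
              f"{vram_mib:.0f} MiB allocated")

# Usage: with profile_section("vae_decode"): images = vae.decode(latent)
```
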
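clear_vram with a configurable minimum-free threshold presumably follows the standard gc-then-empty_cache pattern:

```python
import gc

import torch

def clear_vram(min_free_mb: int = 1024):
    """Release cached allocator blocks when free VRAM dips below a threshold."""
    if not torch.cuda.is_available():
        return
    free_bytes, _total = torch.cuda.mem_get_info()
    if free_bytes / 2**20 < min_free_mb:
        gc.collect()              # drop dangling Python references first
        torch.cuda.empty_cache()  # then return cached blocks to the driver
```
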
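An illustrative is_fp16_safe; the exact capability cutoff and card list are the PR's, so the values below are placeholders:

```python
import torch

def is_fp16_safe() -> bool:
    """Allow FP16 only on GPUs without known half-precision problems."""
    if not torch.cuda.is_available():
        return False
    name = torch.cuda.get_device_name()
    if "GTX 16" in name:  # e.g. GTX 1660: FP16 can produce black images
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 7  # placeholder cutoff; the PR gates on >7 / >=8
```
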
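The synchronous transfer helper likely reduces to a fused move-and-cast followed by an explicit sync:

```python
import torch

def optimized_transfer(t: torch.Tensor, device: torch.device,
                       dtype: torch.dtype) -> torch.Tensor:
    """Move and cast in one .to() call, then wait so later ops see the data."""
    t = t.to(device=device, dtype=dtype, non_blocking=True)
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return t
```
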
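Finally, the PreviewImage change reduces to a longest-side clamp with LANCZOS resampling plus a moderate PNG compression level:

```python
from PIL import Image

def shrink_for_preview(img: Image.Image, max_size: int = 512) -> Image.Image:
    """Downscale so the longer side is at most max_size, keeping aspect ratio."""
    w, h = img.size
    if max(w, h) <= max_size:
        return img  # already small enough
    scale = max_size / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

# Saving with compress_level=4 (PIL's default is 6, the old value was 1)
# trades a little encode time for noticeably smaller preview PNGs:
# img.save(path, format="PNG", compress_level=4)
```
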
Impact:

  • Significantly reduces VRAM usage during sampling and VAE decoding, making ComfyUI more stable on GPUs with limited memory.
  • Improves performance for large-scale image generation through tiled decoding and FP16 optimizations.
  • Enhances debugging capabilities with detailed profiling and logging, aiding development and optimization.

Dependencies:

  • Relies on nodes.py for integration with KSampler, VAEDecode, VAEDecodeTiled, and PreviewImage nodes.
  • Assumes compatibility with existing ModelPatcher functionality for model patching (e.g., in LoraLoader).

Notes:

  • Users should enable the --profile or --debug flags to access detailed performance logs.
  • FP16 requires a compatible GPU (compute capability above 7.0; is_fp16_safe disables it on known-problematic cards such as the GTX 1660).
  • Tiled decoding parameters (tile_size, overlap, etc.) may need tuning for specific workflows.
  • Preview images are now smaller and faster to generate, but users can adjust max_size in PreviewImage if higher resolution previews are needed.

This is a foundational change to improve ComfyUI's performance and scalability, particularly for resource-constrained environments.

Thanks to Grok @ xAI for help.

loxotron added 2 commits May 15, 2025 06:54
… PreviewImage

@bazik210 bazik210 requested a review from comfyanonymous as a code owner May 15, 2025 09:24