Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage by bazik210 · Pull Request #8136 · comfyanonymous/ComfyUI · GitHub

Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage #8136


Open

bazik210 wants to merge 2 commits into master
Conversation

bazik210

Add fast_sampler.py with optimized sampling and VAE decoding, enhance PreviewImage

This update introduces fast_sampler.py, a new module designed to enhance the performance of sampling and VAE decoding in ComfyUI. It replaces or augments functionality previously handled in model_management.py, providing better VRAM management, FP16 support, and tiled decoding for low-memory scenarios. Additionally, it improves the PreviewImage node in nodes.py for faster and more efficient preview generation. These changes improve efficiency, stability, and usability, particularly for GPU-based workflows.

Key Changes:

  • Implemented fast_ksampler for optimized sampling with improved memory management, FP16 support via torch.amp.autocast, and the channels_last memory format for better GPU performance (see the first sketch after this list).
  • Added fast_vae_decode for efficient VAE decoding, incorporating FP16 support, channels_last, and selective VRAM clearing to prevent out-of-memory errors.
  • Introduced fast_vae_tiled_decode for tiled VAE decoding, enabling processing of large latents on GPUs with limited VRAM by using configurable tile sizes and overlaps (sketched below).
  • Added profiling and debugging utilities (profile_section, profile_cuda_sync) to track execution times and VRAM usage when the --profile or --debug flags are enabled (sketched below).
  • Improved VRAM management with clear_vram, which ensures sufficient free memory before loading models or the VAE, with configurable thresholds and minimum-free-memory requirements (sketched below).
  • Implemented is_fp16_safe to check GPU compatibility for FP16 operations, disabling them on unsupported hardware such as the GTX 1660 and other problematic Turing cards (sketched below).
  • Optimized tensor transfers with optimized_transfer and optimized_conditioning for synchronous device placement and dtype casting (sketched below).
  • Enhanced model preloading with preload_model, which unloads the VAE before loading the U-Net to conserve VRAM and skips the transfer when the VAE is already loaded.
  • Integrated cudnn.benchmark for testing; disabled by default.
  • Overall, VRAM is now reclaimed more aggressively between the sampling and decoding stages.
  • Updated the PreviewImage node in nodes.py to adaptively resize previews to a maximum dimension of ~512 pixels while preserving aspect ratio, using Image.LANCZOS for quality. Increased compress_level from 1 to 4, trading a slightly slower per-pixel PNG encode for smaller files; combined with the downscaling, preview generation is still faster overall (sketched below).

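The sketches below illustrate several of the mechanisms above. They are minimal approximations, not the PR's actual code; any name not mentioned in this description (run_sampler, decode_fn, shrink_for_preview, and so on) is a placeholder. First, the autocast plus channels_last pattern described for fast_ksampler:

```python
import torch

def run_sampler(model, latent):
    # Hypothetical stand-in for the real KSampler invocation.
    return model(latent)

def sample_with_autocast(model, latent, fp16_ok: bool):
    # channels_last improves convolution throughput on recent NVIDIA GPUs.
    latent = latent.to("cuda", memory_format=torch.channels_last)
    # autocast runs eligible ops in FP16 while keeping FP32 where needed;
    # it is disabled entirely when the GPU fails the FP16 safety check.
    with torch.amp.autocast(device_type="cuda", dtype=torch.float16,
                            enabled=fp16_ok):
        return run_sampler(model, latent)
```
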
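The tiled decode follows the usual tile-and-stitch idea: only one tile's activations live on the GPU at a time. A simplified sketch, assuming a fixed 8x latent-to-pixel scale and plain overwrite in the overlap regions (a production implementation would blend overlaps to hide seams):

```python
import torch

def tiled_decode(decode_fn, latent, tile=64, overlap=16):
    """Decode a (B, C, H, W) latent tile by tile to bound peak VRAM.
    Simplified: overlapping borders are overwritten, not blended."""
    b, _, h, w = latent.shape
    scale = 8  # typical latent-to-pixel factor for SD-style VAEs
    out = None
    step = max(tile - overlap, 1)
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            piece = decode_fn(latent[:, :, y:y1, x:x1])
            if out is None:
                out = torch.zeros(b, piece.shape[1], h * scale, w * scale,
                                  device=piece.device, dtype=piece.dtype)
            out[:, :, y * scale:y1 * scale, x * scale:x1 * scale] = piece
    return out
```
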
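A timing context manager in the spirit of profile_section, assuming it synchronizes CUDA so queued GPU work is actually counted; the output format is illustrative:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def profile_section(name: str, enabled: bool = True):
    """Time a block of work, including pending GPU kernels."""
    if not enabled:
        yield
        return
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before timing
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure timed work has finished
            vram_mib = torch.cuda.memory_allocated() / 2**20
        else:
            vram_mib = 0.0
        print(f"[profile] {name}: {time.perf_counter() - start:.3f}s, "
              f"{vram_mib:.0f} MiB allocated")

# Usage: with profile_section("vae_decode"): images = vae.decode(latent)
```
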
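clear_vram with a configurable minimum-free threshold presumably follows the standard gc-then-empty_cache pattern:

```python
import gc

import torch

def clear_vram(min_free_mb: int = 1024):
    """Release cached allocator blocks when free VRAM dips below a threshold."""
    if not torch.cuda.is_available():
        return
    free_bytes, _total = torch.cuda.mem_get_info()
    if free_bytes / 2**20 < min_free_mb:
        gc.collect()              # drop dangling Python references first
        torch.cuda.empty_cache()  # then return cached blocks to the driver
```
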
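An illustrative is_fp16_safe; the exact capability cutoff and card list are the PR's, so the values below are placeholders:

```python
import torch

def is_fp16_safe() -> bool:
    """Allow FP16 only on GPUs without known half-precision problems."""
    if not torch.cuda.is_available():
        return False
    name = torch.cuda.get_device_name()
    if "GTX 16" in name:  # e.g. GTX 1660: FP16 can produce black images
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 7  # placeholder cutoff; the PR gates on >7 / >=8
```
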
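The synchronous transfer helper likely reduces to a fused move-and-cast followed by an explicit sync:

```python
import torch

def optimized_transfer(t: torch.Tensor, device: torch.device,
                       dtype: torch.dtype) -> torch.Tensor:
    """Move and cast in one .to() call, then wait so later ops see the data."""
    t = t.to(device=device, dtype=dtype, non_blocking=True)
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return t
```
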
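Finally, the PreviewImage change reduces to a longest-side clamp with LANCZOS resampling plus a moderate PNG compression level:

```python
from PIL import Image

def shrink_for_preview(img: Image.Image, max_size: int = 512) -> Image.Image:
    """Downscale so the longer side is at most max_size, keeping aspect ratio."""
    w, h = img.size
    if max(w, h) <= max_size:
        return img  # already small enough
    scale = max_size / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

# Saving with compress_level=4 (PIL's default is 6, the old value was 1)
# trades a little encode time for noticeably smaller preview PNGs:
# img.save(path, format="PNG", compress_level=4)
```
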
Impact:

  • Significantly reduces VRAM usage during sampling and VAE decoding, making ComfyUI more stable on GPUs with limited memory.
  • Improves performance for large-scale image generation through tiled decoding and FP16 optimizations.
  • Enhances debugging capabilities with detailed profiling and logging, aiding development and optimization.

Dependencies:

  • Relies on nodes.py for integration with KSampler, VAEDecode, VAEDecodeTiled, and PreviewImage nodes.
  • Assumes compatibility with existing ModelPatcher functionality for model patching (e.g., in LoraLoader).

Notes:

  • Users should enable the --profile or --debug flags to access detailed performance logs.
  • FP16 requires a compatible GPU (compute capability above 7.0; is_fp16_safe disables it on known-problematic cards such as the GTX 1660).
  • Tiled decoding parameters (tile_size, overlap, etc.) may need tuning for specific workflows.
  • Preview images are now smaller and faster to generate, but users can adjust max_size in PreviewImage if higher resolution previews are needed.

This is a foundational change to improve ComfyUI's performance and scalability, particularly for resource-constrained environments.

Thanks to Grok @ xAI for help.

loxotron added 2 commits May 15, 2025 06:54
… PreviewImage

@bazik210 bazik210 requested a review from comfyanonymous as a code owner May 15, 2025 09:24