Add activations_shape info in UNet models #7482
Merged
Currently, the TransformerBlocks inside the UNet only receive the flattened activations. This makes it impossible for them to know the actual spatial dimensions of the activations they were given (because of downsampling, padding, etc., the aspect ratio of the activations differs from layer to layer, so it can't even be reliably inferred from the original latent shape).
For example, Cubiq's IPAdapter implementation makes an initial guess at the mask size/shape, but in some cases it has to give up and pad the mask with zeros, likely producing incorrect results.
Similarly, the official comfy code for SAG actually attempts to factorize the attention map shape and takes whichever "possible shape" is closest to the shape of the original latent.
Clearly, having to implement a heuristic that guesses the original activation shape makes writing an attention patch needlessly complex. This PR adds a new key to transformer_options, set only inside SpatialTransformers and SpatialVideoTransformers, that contains the non-flattened, spatial shape of the activations inside that block. This makes it easy for attention patches to resize their masks correctly, compute attention maps, etc.
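As a rough illustration of how a patch could consume this, here is a minimal sketch of an attention output patch that resizes a user mask to the block's real spatial resolution instead of factorizing the flattened sequence length. It assumes the key is named `activations_shape` and carries the (batch, channels, height, width) of the block's input, and that the patch receives the transformer options via `extra_options` as in ComfyUI's existing attention patch convention; `make_masked_output_patch` and `my_mask` are hypothetical names, so check the merged diff for the exact key and layout.

```python
import torch
import torch.nn.functional as F

def make_masked_output_patch(mask: torch.Tensor):
    # `mask` is assumed to be a 2D tensor in [0, 1] at the latent resolution.
    def output_patch(out, extra_options):
        shape = extra_options.get("activations_shape")
        if shape is None:
            return out  # key not present (e.g. non-spatial block): leave untouched
        _, _, h, w = shape  # assumed (batch, channels, height, width) layout
        # Resize the mask to this block's true spatial resolution instead of
        # guessing it from the flattened token count.
        m = F.interpolate(mask[None, None].to(out.device, out.dtype),
                          size=(h, w), mode="bilinear")
        m = m.reshape(1, h * w, 1)  # match the flattened (batch, tokens, channels) layout
        return out * m
    return output_patch

# Hypothetical usage on a cloned ModelPatcher:
# m = model.clone()
# m.set_model_attn1_output_patch(make_masked_output_patch(my_mask))
```

The point is just that the resize becomes a single `interpolate` call driven by the shape hint, with no fallback heuristics or zero-padding needed.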