CogVideoX + RIFLEx: adjust num_frames by duration/fps #10

Open
loretoparisi opened this issue Mar 3, 2025 · 1 comment

@loretoparisi

In my CogVideoX pipeline I added the following:

    # --- LP: RIFLEx BEGIN
    from functools import partial
    from types import MethodType

    # Index of the intrinsic frequency
    k = 2
    # Period of the intrinsic frequency in latent space
    N_k = 20
    # Whether the model is a fine-tuned version
    finetune = None
    # Number of latent frames at inference
    L_test = (num_frames - 1) // 4 + 1

    # Training-free: if the extrapolated length exceeds the period of the intrinsic frequency, modify RoPE
    if L_test > N_k and not finetune:
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=k, L_test=L_test), pipe)

    # Fine-tuned: the model was fine-tuned on new theta_k and N_k, so modify RoPE to match that setting
    if finetune:
        L_test = N_k  # the fine-tuning frequency setting
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=k, L_test=L_test), pipe)
    # --- LP: RIFLEx END
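
As a quick check of when the training-free branch fires, here is a minimal sketch that applies the same `L_test` formula to a few values of `num_frames` and compares the result with `N_k = 20`. The helper name `latent_frames` is only for illustration; it is not part of RIFLEx or diffusers.

```python
def latent_frames(num_frames: int) -> int:
    """Latent frame count, using the same formula as L_test above."""
    return (num_frames - 1) // 4 + 1


N_k = 20  # period of the intrinsic frequency in latent space, as set above

for num_frames in (49, 97, 145):
    L_test = latent_frames(num_frames)
    # The training-free RoPE patch is applied only when L_test > N_k
    print(f"num_frames={num_frames} L_test={L_test} patch_rope={L_test > N_k}")
```

With these values, 49 frames map to 13 latent frames and leave RoPE untouched, while 97 and 145 frames map to 25 and 37 latent frames and trigger the patch.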

In my basic case I always have finetune=None (no fine-tuned model).
Before that, the pipe is created like this:

    # create pipe
    pipe = None
    if video_input != "":
        # V2V pipe
        pass
    elif image_input != "":
        # I2V pipe
        model_path = "THUDM/CogVideoX-5b-I2V"
        
        # i2v transformer
        i2v_transformer = CogVideoXTransformer3DModel.from_pretrained(
            model_path, 
            subfolder="transformer", 
            torch_dtype=torch.bfloat16
        )
        i2v_transformer = quantize_model(part=i2v_transformer, quantization_scheme=quantization_scheme)
        pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=dtype).to("cpu")
        pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
        pipe = CogVideoXImageToVideoPipeline.from_pretrained(
            model_path,
            transformer=i2v_transformer,
            vae=vae,
            scheduler=pipe.scheduler,
            tokenizer=pipe.tokenizer,
            text_encoder=text_encoder,
            torch_dtype=dtype,
        ).to(device)
    else:
        # T2V pipe
        pipe = CogVideoXPipeline.from_pretrained(
            model_path,
            text_encoder=text_encoder,
            transformer=transformer,
            vae=vae,
            torch_dtype=dtype,
            #device_map="balanced"
        ).to(device)
        pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
        
    print(f'created pipe {pipe}')

I do it this way because I have both I2V and T2V pipes.
In my system I then do these calculations:

    # set frame rate
    '''
     4 SECONDS

     For 12 FPS

     4 * 12 = 48 FRAMES

     6 SECONDS

     For 8 FPS

     6 * 8 = 48 FRAMES
     num_frames = 49 because of (num_seconds * fps + 1)

     For 24 FPS

     6 * 24 = 144 FRAMES
     num_frames = 145
     '''

    max_seconds = 6
    num_seconds = duration # up to 6
    
    if num_seconds>max_seconds:
        num_seconds = max_seconds
       
    max_frames = 49
    # (num_seconds * fps + 1) = 6 * 4 + 1 = 25
    # ValueError: The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation.
    num_frames = num_seconds * fps + 1
    
    if num_frames>max_frames:
        # exceeded max frames
        num_frames = max_frames
    else:
        # too few frames set max
        num_frames = max_frames
    
    # adjust fps if not upscale
    fps = fps if enable_rife else math.ceil((num_frames-1) / num_seconds)
    
    print(f"Generating using seconds:{num_seconds} max seconds:{max_seconds} using frames:{num_frames} max frames:{max_frames} @{fps} FPS")

I did this because I wanted to let the user pass num_seconds and then adapt num_frames accordingly:

    parser.add_argument(
        "--duration", type=int, default=6, help="Duration in seconds"
    )
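
For reference, the frame-count logic above can be condensed into one small function. This is only a sketch of the snippet as posted; the function name `current_num_frames` is made up for illustration. Because both branches of the `if` assign `max_frames`, the result is 49 for every combination of duration and fps.

```python
def current_num_frames(duration: int, fps: int,
                       max_seconds: int = 6, max_frames: int = 49) -> int:
    """Condensed version of the clamping logic shown above."""
    num_seconds = min(duration, max_seconds)
    num_frames = num_seconds * fps + 1
    if num_frames > max_frames:
        # exceeded max frames
        num_frames = max_frames
    else:
        # too few frames: also set to max
        num_frames = max_frames
    return num_frames


for duration, fps in [(4, 12), (6, 8), (6, 24)]:
    print(duration, fps, current_num_frames(duration, fps))
```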

In RIFLEx, however, I see that num_frames defaults to 97. Why?
Also, I don't understand this assert, given that k defaults to 2:

    # num_frames defaults to 97, hence `num_frames - 1 = 96`, why?
    assert (num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"

Is this because you want to ensure that num_frames is at least k times 4? Why?

@zhuhz22
Collaborator
zhuhz22 commented Mar 4, 2025

Hi @loretoparisi, thank you for your attention to our work!

  • About num_frames: RIFLEx is a tool for video length extrapolation, which enables video models to generate videos longer than the training length. For CogVideoX, the training length is 49 frames, and with RIFLEx we allow the video model to generate videos of twice the length (i.e., 97 frames) or even longer.

    So in your code, the following block should be deleted, as there is no longer a limit on video length:

    if num_frames > max_frames:
        # exceeded max frames
        num_frames = max_frames
    else:
        # too few frames set max
        num_frames = max_frames

    In our code, num_frames defaults to 97, which lets the model generate videos twice the training length. You can certainly also set num_frames your own way, e.g. num_frames = num_seconds * fps + 1, provided that (num_frames - 1) is divisible by 4 (see the next point).

  • About the assertion of 4 * k + 1: CogVideoX uses a causal VAE that encodes (4k+1) pixel frames into (k+1) latent frames, which requires num_frames to be 1 modulo 4. In this assertion, k stands for any positive integer; it is unrelated to args.k, which is the index of the intrinsic frequency.
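
The two points above can be combined into a small helper that derives num_frames from a duration and an fps while keeping it of the form 4 * k + 1. This is a minimal sketch: the name `frames_for_duration` is hypothetical, and rounding up to the next valid value (rather than down) is an assumption, not something prescribed by RIFLEx.

```python
def frames_for_duration(num_seconds: int, fps: int) -> int:
    """Smallest num_frames >= num_seconds * fps + 1 with (num_frames - 1) % 4 == 0."""
    target = num_seconds * fps   # desired value of num_frames - 1
    k = max(1, -(-target // 4))  # ceiling division; k is a positive integer
    return 4 * k + 1


for num_seconds, fps in [(6, 8), (6, 16), (6, 24), (5, 10)]:
    num_frames = frames_for_duration(num_seconds, fps)
    # Same condition as the assertion discussed above
    assert (num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"
    print(num_seconds, fps, num_frames)
```

For example, 6 seconds at 16 fps gives exactly 97 frames, while 5 seconds at 10 fps (51 frames requested) is rounded up to 53.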
