CogVideoX + RIFLEx: adjust num_frames by duration/fps #10

loretoparisi · 2025-03-03T17:02:43Z

In my CogVideoX pipeline I added the following:

    # --- LP: RIFLEx BEGIN
    # Index of the intrinsic frequency
    k = 2
    # Period of the intrinsic frequency in latent space
    N_k = 20
    # Whether model is finetuned version
    finetune = None
    # the number of frames for inference
    L_test = (num_frames - 1) // 4 + 1  # latent frames
    
    # For training-free, if extrapolate length exceeds the period of intrinsic frequency, modify RoPE
    if L_test > N_k and not finetune:
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=k, L_test=L_test), pipe)

    # We fine-tune the model on new theta_k and N_k, and thus modify RoPE to match the fine-tuning setting.
    if finetune:
        L_test = N_k  # the fine-tuning frequency setting
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)
    # --- LP: RIFLEx END

In my basic case I always have finetune=None (no fine-tuned model).
The pipe is created earlier, like this:

    # create pipe
    pipe = None
    if video_input != "":
        # V2V pipe
        pass
    elif image_input != "":
        # I2V pipe
        model_path = "THUDM/CogVideoX-5b-I2V"
        
        # i2v transformer
        i2v_transformer = CogVideoXTransformer3DModel.from_pretrained(
            model_path, 
            subfolder="transformer", 
            torch_dtype=torch.bfloat16
        )
        i2v_transformer = quantize_model(part=i2v_transformer, quantization_scheme=quantization_scheme)
        pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=dtype).to("cpu")
        pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
        pipe = CogVideoXImageToVideoPipeline.from_pretrained(
            model_path,
            transformer=i2v_transformer,
            vae=vae,
            scheduler=pipe.scheduler,
            tokenizer=pipe.tokenizer,
            text_encoder=text_encoder,
            torch_dtype=dtype,
        ).to(device)
    else:
        # T2V pipe
        pipe = CogVideoXPipeline.from_pretrained(
            model_path,
            text_encoder=text_encoder,
            transformer=transformer,
            vae=vae,
            torch_dtype=dtype,
            #device_map="balanced"
        ).to(device)
        pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
        
    print(f'created pipe {pipe}')

because I have both I2V and T2V pipes.
I then compute the number of frames like this:

    # set frame rate
    '''
     
     4 SECONDS
 
     For 12 FPS
     
     4 * 12 = 48 FRAMES
     
     6 SECONDS

     6 * 8 = 48 FRAMES
     num_frames = 49 because of (num_seconds * fps + 1) 

     For 24 FPS

     6 * 24 = 144 FRAMES
     num_frames = 145
     '''

    max_seconds = 6
    num_seconds = duration # up to 6
    
    if num_seconds>max_seconds:
        num_seconds = max_seconds
       
    max_frames = 49
    # (num_seconds * fps + 1) = 6 * 4 + 1 = 25
    # ValueError: The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation.
    num_frames = num_seconds * fps + 1
    
    if num_frames>max_frames:
        # exceeded max frames
        num_frames = max_frames
    else:
        # too few frames set max
        num_frames = max_frames
    
    # adjust fps if not upscale
    fps = fps if enable_rife else math.ceil((num_frames-1) / num_seconds)
    
    print(f"Generating using seconds:{num_seconds} max seconds:{max_seconds} using frames:{num_frames} max frames:{max_frames} @{fps} FPS")

I did this because I want to let the user pass num_seconds and then derive num_frames from it:

    parser.add_argument(
        "--duration", type=int, default=6, help="Duration in seconds"
    )

In RIFLEx, however, I see that num_frames defaults to 97. Why?
I also don't understand this assert, given that k defaults to 2:

    # num_frames defaults to 97, hence `num_frames - 1 = 96`, why?
    assert (num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"

Is this because you want to ensure that num_frames is at least k times 4? Why?


zhuhz22 · 2025-03-04T00:31:35Z

Hi @loretoparisi , thank you for your attention to our work!

About num_frames: RIFLEx is a tool for video length extrapolation, which enables video models to generate videos longer than their training length. For CogVideoX, the training length is 49 frames, and with RIFLEx the model can generate videos of twice that length (i.e., 97 frames) or even longer.
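As a rough illustration of these numbers, the sketch below reuses the latent-frame formula and the `N_k = 20` value from the snippet in the question. The helper name `latent_frames` is hypothetical and not part of RIFLEx; it only shows why the RoPE modification is skipped at 49 frames and applied at 97.

```python
def latent_frames(num_frames: int) -> int:
    """Latent frame count, same formula as L_test in the question's snippet."""
    return (num_frames - 1) // 4 + 1


# Period of the intrinsic frequency in latent space, as set in the question.
N_K = 20

for num_frames in (49, 97):
    l_test = latent_frames(num_frames)
    # The training-free branch in the question modifies RoPE only when L_test > N_k.
    print(f"num_frames={num_frames} latent={l_test} modify_rope={l_test > N_K}")
```

With these values, 49 pixel frames map to 13 latent frames (below the period) and 97 map to 25 (above it).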

So in your code, the following block should be deleted, as there is no limit on video length:
```
if num_frames > max_frames:
    # exceeded max frames
    num_frames = max_frames
else:
    # too few frames set max
    num_frames = max_frames
```
In our code, num_frames defaults to 97, which lets the model generate videos twice the training length. You can certainly also set num_frames your own way, such as num_frames = num_seconds * fps + 1.
About the assertion of 4 * k + 1: CogVideoX uses a causal VAE that encodes (4k+1) pixel frames into (k+1) latent frames, so num_frames must be congruent to 1 modulo 4. In this assertion, k stands for any positive integer; it is not args.k, which is the index of the intrinsic frequency.
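One way to combine both points is to derive num_frames from the requested duration and then round it up to the next value of the form 4k + 1. This is only a sketch: `frames_for_duration` is a hypothetical helper, not something provided by RIFLEx or diffusers, and rounding up rather than down is an assumption.

```python
import math


def frames_for_duration(num_seconds: float, fps: int) -> int:
    """Smallest num_frames >= num_seconds * fps + 1 with (num_frames - 1) % 4 == 0."""
    raw = math.ceil(num_seconds * fps) + 1
    # k is "any positive integer" in the assertion, unrelated to the RIFLEx index k.
    k = max(1, math.ceil((raw - 1) / 4))
    return 4 * k + 1


num_frames = frames_for_duration(6, 16)
assert (num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"
print(num_frames)
```

For example, 6 seconds at 8 FPS stays at 49 frames, while 5 seconds at 10 FPS (51 raw frames) is rounded up to 53.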
