AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation


📌 Table of Contents

  • 👀 Introduction
  • 📐 Dataset
  • 🔮 AnimeShooterGen
  • 🔍 License
  • 📜 Citation

👀 Introduction

Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we also introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. Please refer to our research paper for more details.

Key Features of AnimeShooter

  1. Focus on the animation domain;
  2. Hierarchical story scripts for multi-shot annotation;
  3. Reference images for consistent character guidance.

📐 Dataset

Collection and Annotation

Our dataset collection begins by sourcing large-scale, diverse animated content from YouTube using keywords (e.g., "short animation", "cartoon short film"). All videos are then filtered by content and duration and further divided into 1-minute segments. Following this, we annotate all segments using Gemini-2.0-flash with a top-down multi-shot captioning strategy. Finally, we retrieve all related shots and use Sa2VA to segment the corresponding reference images.

Download Source Videos

To download the source videos from YouTube, we recommend using the yt-dlp tool with the following command:

mkdir ./videos
cd ./videos
yt-dlp --batch-file video_ids.txt -o "%(id)s.%(ext)s" -f "bv*[height<=720][height>=480][ext=mp4]"

In our dataset, each 1-minute segment serves as an individual sample representing a self-contained narrative unit (one story). To align the shot-level timestamps of each 1-minute segment with the original video, please refer to crop_segment.py for segment splitting and corresponding story script extraction. You can refer to rle_to_reference.py for reference image generation.
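The released rle_to_reference.py is the authoritative implementation of the reference image step. Purely as a rough illustration of what the RLE masks encode, the sketch below decodes one character mask and cuts the character out of the corresponding frame; it assumes COCO-style RLE objects decodable with pycocotools and uses placeholder paths, and the field names follow the annotation structure described in the next section.

# Sketch only: decode one character's RLE mask and cut out a reference image.
# Assumes COCO-style RLE masks (pycocotools) and placeholder paths; see
# rle_to_reference.py in this repository for the authoritative pipeline.
import json
import cv2
from pycocotools import mask as mask_utils

annotation = json.load(open("annotations/<video_id>.json"))
cap = cv2.VideoCapture("videos/<video_id>.mp4")

segment = annotation["segments"][0]
ref = segment["reference images"][0]   # first annotated character
mask_entry = ref["masks"][0]           # first mask for that character

# Frame indices in the annotation refer to the original video.
cap.set(cv2.CAP_PROP_POS_FRAMES, mask_entry["frame index"])
ok, frame = cap.read()
assert ok, "could not read the annotated frame"

# Decode the RLE into a binary mask and black out the background.
binary = mask_utils.decode(mask_entry["rle mask"]).astype(bool)
reference = frame.copy()
reference[~binary] = 0                 # keep only the character's pixels
cv2.imwrite(f"reference_{ref['ID']}.png", reference)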

Dataset Structure

The dataset annotations can be downloaded from Hugging Face. The complete annotation for each video is provided in JSON format, including reference image masks. Each JSON file contains the following fields (a parsing sketch follows the list):

video ID: string - Unique YouTube identifier.
url: string - Direct YouTube link.
fps: float - Frame rate of the original video, used for temporal alignment.
segments: list - A list of segment objects. Each video is divided into these coherent story segments.

  • start frame index: integer - The starting frame index of the segment in the original video.
  • end frame index: integer - The ending frame index of the segment in the original video.
  • story script: object - Contains the annotations of the segment.
    • storyline: string - A high-level summary of the story told in this segment.
    • main characters: list of objects - Primary characters appearing in this segment. Each character object has:
      • ID: string - A unique identifier for the character.
      • appearance: string - A textual description of the character's visual appearance.
    • main scenes: list of objects - Key locations or environments featured in this segment. Each scene object has:
      • ID: string - A unique identifier for the scene.
      • environment: string - A textual description of the scene.
    • shots: list of objects - A detailed breakdown of the segment into individual shots.
      • start time: string - The start time of the shot (e.g., "00:00", "00:12") in the segment.
      • end time: string - The end time of the shot (e.g., "00:00", "00:12") in the segment.
      • is_prologue_or_epilogue: boolean - Whether the shot is part of a prologue or epilogue.
      • main characters: list of strings - Character IDs active in this shot.
      • scene: string - The ID of the scene for this shot.
      • visual annotation: object - Textual descriptions of the shot's visual content:
        • narrative caption: string - Plot-driven description.
        • descriptive caption: string - Detailed visual inventory.
      • audio annotation: object - Only in AnimeShooter-audio.
  • reference images: list of objects - Character-specific reference images and masks.
    • ID: string - The identifier of the character.
    • masks: list of objects - Mask data for the character.
      • frame index: integer - Frame index for this mask in the original video.
      • rle mask: object - Run-Length Encoding for the mask.
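
As a quick orientation to these fields, the sketch below walks one annotation file and converts each shot's mm:ss timestamps (given relative to its 1-minute segment) into frame indices in the original video, using the segment's start frame index and the video fps. The annotation path is a placeholder; the field names follow the structure listed above.

# Sketch: traverse one annotation file and map shot timestamps back to
# original-video frame indices. Field names follow the structure above;
# the path is a placeholder for a downloaded annotation JSON.
import json

def mmss_to_seconds(t: str) -> int:
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)

annotation = json.load(open("annotations/<video_id>.json"))
fps = annotation["fps"]

for segment in annotation["segments"]:
    seg_start = segment["start frame index"]
    script = segment["story script"]
    print("Storyline:", script["storyline"])
    for shot in script["shots"]:
        # Shot times are relative to the 1-minute segment, so offset by the
        # segment's starting frame to recover original-video indices.
        start_frame = seg_start + round(mmss_to_seconds(shot["start time"]) * fps)
        end_frame = seg_start + round(mmss_to_seconds(shot["end time"]) * fps)
        caption = shot["visual annotation"]["narrative caption"]
        print(f"  Shot frames {start_frame}-{end_frame}: {caption}")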

🔮 AnimeShooterGen

Model Architecture

Quick Start

📍 Environment Setup:

To set up the environment for NVILA and CogVideo, you can run the following command:

git clone https://github.com/qiulu66/Anime-Shooter.git
cd Anime-Shooter
bash environment_setup.sh animeshooter

Download the pretrained weights of NVILA-8B-Video and CogVideoX-2B:

mkdir ./ckpt
cd ./ckpt
git clone https://huggingface.co/Efficient-Large-Model/NVILA-8B-Video
git clone https://huggingface.co/THUDM/CogVideoX-2b

📍 Inference:

We provide three IP-specific LoRA weights with corresponding demos: a yellow dog (video id: 1dCd6hCRoaQ), a wolf (video id: 5VtARFNISH4), and a young girl (video id: yjIvkCaq0Zc). Please first download the model weights and then run the inference code:

cd ./ckpt
git clone https://huggingface.co/qiulu66/AnimeShooterGen
cd ..
bash scripts/inference.sh $VIDEO_ID
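
For example, to run the yellow-dog demo (a usage sketch, assuming the inference script takes the video id as its only argument):

bash scripts/inference.sh 1dCd6hCRoaQ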

📍 Finetune Your Own Model:

We provide three IP-specific datasets in ./datasets for finetuning, and you can also prepare your own dataset in the same format with a unique video id. Then, run the following command to train your own model:

bash scripts/train.sh $VIDEO_ID

🔍 License

  • The AnimeShooter and AnimeShooter-audio datasets are released under the CC BY-NC 4.0 License for academic purposes only. The AnimeShooterGen model is released under the Apache 2.0 License. It is built upon NVILA and CogVideo; please refer to https://github.com/NVlabs/VILA and https://github.com/THUDM/CogVideo for their licenses.
  • All videos in our dataset were obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or meaning of these videos; the copyright remains with the original owners.
  • If any video in our dataset infringes upon your rights, please contact us for removal.

📜 Citation

If you find our work helpful for your research, please consider citing it:

@misc{qiu2025animeshooter,
    title = {AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation},
    author = {Qiu, Lu and Li, Yizhuo and Ge, Yuying and Ge, Yixiao and Shan, Ying and Liu, Xihui},
    year = {2025},
    url = {https://arxiv.org/abs/2506.03126}
}   
