This is the official repository of Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
To install requirements:
pip install -r requirements.txt
To test our benchmark, download Videos.tar and qa.json from Hugging Face into this directory, then extract Videos.tar so that the Videos/ folder and qa.json sit next to the code.
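For example, assuming the huggingface-cli tool is installed and <hf_dataset_repo_id> is replaced with the actual Daily-Omni dataset repository:
huggingface-cli download <hf_dataset_repo_id> Videos.tar qa.json --repo-type dataset --local-dir .
tar -xf Videos.tar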
📋 We provide a script to reproduce our Daily-Omni QA generation pipeline and generate QA pairs. Before running it, revise the config.py file to set the parameters (a hedged sketch of such a config follows this list):
- Set the API keys, base_urls and model_name. You can create API keys from the following links: Gemini, OpenAI, Deepseek, Aliyun.
- Set BASE_DIR and CSV_PATH to the video folder you want to annotate and to the path of the CSV file that records the videos. You can use example_videos and example_metadata.csv as templates.
- Set MAX_WORKERS_PROCESSES to the number of processes you want to use to run the pipeline. You can set execution_mode in run_pipeline.py to choose the execution mode if your API key has a parallel-request limit.
- Set run_pipeline_flags in run_pipeline.py to choose which parts of the pipeline you want to run.
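A minimal sketch of what config.py might contain, assuming flat module-level variables: BASE_DIR, CSV_PATH and MAX_WORKERS_PROCESSES come from the steps above, while the key, URL and model values here are placeholders, so check the actual file for the exact names and structure:

# Hedged sketch of config.py -- verify names against the real file in this repo.
GEMINI_API_KEY = "your-gemini-key"        # placeholder API keys
OPENAI_API_KEY = "your-openai-key"
DEEPSEEK_API_KEY = "your-deepseek-key"
ALIYUN_API_KEY = "your-aliyun-key"

base_url = "https://api.deepseek.com/v1"  # placeholder base URL for an OpenAI-compatible endpoint
model_name = "deepseek-chat"              # placeholder model name

BASE_DIR = "example_videos"               # folder with the videos to annotate
CSV_PATH = "example_metadata.csv"         # CSV file that records the videos
MAX_WORKERS_PROCESSES = 4                 # number of parallel worker processes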
Test the pipeline with:
python run_pipeline.py
📋 To test the benchmark on third-party models (Gemini, GPT-4o, Deepseek) via API and reproduce the results, you can use the script provided in
test_model_api/
python test_model_api/test_model.py --model <model_name> --mode <Execution_mode> --max_items <Maximum number of QA items to process (for testing)>
You can check the model options in test_model_api/test_config.py
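For example, a quick smoke test might look like the following; the model and mode values here are placeholders, so pick them from the options defined in test_model_api/test_config.py:
python test_model_api/test_model.py --model gpt-4o --mode sequential --max_items 20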
📋 To test the benchmark on third-party models (Qwen2.5-Omni, Qwen2.5-VL, VideoLLaMA2, Ola, Unified-IO 2) on a local machine, check the code in
test_model/
Install the dependencies following the instructions from the official Qwen2.5-Omni repo.
Run the test script with the following command:
python test_model/Qwen2.5-Omni/testmodel.py --video_base_dir VIDEO_BASE_DIR --json_file_path JSON_FILE_PATH --model_name_or_path MODEL_NAME_OR_PATH --processor_name_or_path PROCESSOR_NAME_OR_PATH --use_audio_in_video --disable_audio_output
You can test Qwen2.5-Omni's performance without audio by removing the --use_audio_in_video flag.
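For example (the model and processor paths below assume the public Qwen/Qwen2.5-Omni-7B checkpoint; substitute your own local paths if needed):
python test_model/Qwen2.5-Omni/testmodel.py --video_base_dir Videos/ --json_file_path qa.json --model_name_or_path Qwen/Qwen2.5-Omni-7B --processor_name_or_path Qwen/Qwen2.5-Omni-7B --disable_audio_output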
Install the dependencies following the instructions from the official Qwen2.5-VL repo.
Run the test script with the following command:
python test_model/Qwen2.5-VL/testmodel.py --video_base_dir VIDEO_BASE_DIR --json_file_path JSON_FILE_PATH --model_name_or_path MODEL_NAME_OR_PATH --processor_name_or_path PROCESSOR_NAME_OR_PATH
Install the dependencies following the instructions from the official VideoLLaMA2 repo.
Run the test script with the following command:
python test_model/VideoLLaMA2-av/testmodel.py --video_base_dir VIDEO_BASE_DIR --json_file_path JSON_FILE_PATH
Install the dependencies following the instructions from the official Unified-IO 2 repo.
Run the test script with the following command:
python test_model/unified-io-2.pytorch/testmodel.py --video_base_dir VIDEO_BASE_DIR --json_file_path JSON_FILE_PATH --model MODEL_NAME
Install the dependencies following the instructions from the official Ola repo.
Run the test script with the following command:
python test_model/Ola/inference/testmodel.py VIDEO_BASE_DIR --json_file_path JSON_FILE_PATH
We provide a script to test the Daily-Omni Agent on the Daily-Omni benchmark. For efficiency, we used the API provided by Bailian Aliyun for Qwen2.5-VL and Qwen2.5, while Qwen2-Audio was deployed locally. According to the official documentation, the qwen2.5-vl-7b-instruct and qwen2.5-14b-instruct models provided by Bailian Aliyun are identical to their open-source counterparts. Our code passes local_video_path directly to the Qwen2.5-VL API; however, this functionality might require you to contact Aliyun customer service to enable. If direct path input is not activated, you can alternatively pass a list of video frames, though this may result in suboptimal performance. A hedged sketch of both call styles is shown below.
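A minimal sketch of the two call styles, assuming the DashScope MultiModalConversation interface; the exact message fields and response parsing used in baseline/base_model.py may differ, and the file paths and question below are hypothetical:

import dashscope
from dashscope import MultiModalConversation

dashscope.api_key = "your-aliyun-api-key"   # or export DASHSCOPE_API_KEY

question = "What is the speaker doing when the music starts?"  # illustrative question

# Preferred: pass the local video file directly (may need Aliyun support to enable).
content = [{"video": "file:///abs/path/to/Videos/example.mp4"}, {"text": question}]

# Fallback: pass a list of pre-extracted frames instead of the video file.
# content = [{"video": ["file:///abs/path/frame_00.jpg", "file:///abs/path/frame_01.jpg"]},
#            {"text": question}]

response = MultiModalConversation.call(
    model="qwen2.5-vl-7b-instruct",          # Bailian model name mentioned above
    messages=[{"role": "user", "content": content}],
)
print(response.output.choices[0].message.content)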
To run the agent, you need to set up a new environment for Qwen2-Audio according to the instructions in the official repository.
Then launch the Qwen2-Audio server locally by running:
python baseline/qwen_audio.py
Segment the video and audio clips:
python baseline/segment_av.py
Run Daily-Omni Agent on Daily-Omni benchmark with the following command:
python baseline/base_model.py
This script will automatically evaluate the performance of the model on the Daily-Omni benchmark.
@misc{zhou2025dailyomni,
title={Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities},
author={Ziwei Zhou and Rui Wang and Zuxuan Wu},
year={2025},
eprint={2505.17862},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.17862},
}