This page covers the Neptune Dataset Collection, a set of video question answering (QA) datasets. The collection currently includes the original Neptune dataset and the MINERVA dataset.
MINERVA consists of ~1.5K challenging question-answer-decoy (QAD) sets for videos of variable length. For each question, we provide 5 answer choices, as well as a detailed, manually annotated reasoning trace. Every question in MINERVA requires complex reasoning using two or more skills (for example numerical reasoning, temporal reasoning, spatial navigation). Videos span multiple domains (short films, sports, instructional videos, etc.) and a range of lengths (from 2 minutes to over 1.5 hours). The hand-crafted, detailed reasoning trace accompanying each question outlines the steps required to arrive at the correct answer. These traces include timestamps where necessary to refer to relevant sections of the video, describe key actions and objects, and outline the logical reasoning steps. More details are provided in our arXiv paper.
We provide a JSON file that contains the YouTube IDs and annotations.
The JSON file contains the following fields (a minimal loading sketch follows the list):
- key: Unique identifier for each question
- video_id: YouTube URL
- question: Free-form question
- answer: Free-form answer
- answer_choice_{i}: The five answer options for MCQ evaluation (the correct answer and four decoys), with i from 0 to 4
- answer_id: Index of the correct answer among the answer choices
- reasoning: Detailed reasoning trace
- question type: A comma-separated list of the skills needed to answer the question
- split: Coarse video domain
- category: Fine-grained video domain
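As a quick illustration, here is a minimal Python sketch for reading these fields. The file name, and the assumption that the file parses to a list of question records, are placeholders rather than part of the release.

```python
import json

# Path is a placeholder; point it at the downloaded annotation file.
with open("minerva_annotations.json", "r") as f:
    examples = json.load(f)  # assumed here to be a list of question records

for ex in examples:
    video_url = ex["video_id"]      # YouTube URL
    question = ex["question"]
    # Five answer options: the correct answer plus four decoys.
    choices = [ex[f"answer_choice_{i}"] for i in range(5)]
    correct_answer = choices[int(ex["answer_id"])]
    # The skills needed for the question, given as a comma-separated list.
    skills = [s.strip() for s in ex["question type"].split(",")]
    reasoning = ex["reasoning"]
```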
@article{minerva25,
title={MINERVA: Evaluating Complex Video Reasoning},
author={Nagrani, Arsha and Menon, Sachit and Iscen, Ahmet and Buch, Shyamal and Mehran, Ramin and Jha, Nilpa and Hauth, Anja and Zhu, Yukun and Vondrick, Carl and Sirotenko, Mikhail and Schmid, Cordelia and Weyand, Tobias},
journal={arXiv preprint arXiv:2505.00681},
year={2025}
}
Neptune is a dataset consisting of challenging question-answer-decoy (QAD) sets for variable length videos (up to 15 minutes). The goal of this dataset is to test video-language models for a broad range of long video reasoning abilities, which are provided as "question type" labels for each question, for example "video summarization", "temporal ordering", "state changes" and "creator intent" amongst others. More details are provided in our arXiv paper.
Neptune allows for two modes of evaluation: multiple-choice and open-ended question answering. For the latter, we provide our own open-ended metric based on Gemma, called Gemma Equivalence Metric (GEM).
Neptune was created using a semi-automatic pipeline that involves careful prompting of LLMs and VLMs, including Gemini. More details are provided in the paper.
Neptune has more than 3,200 questions for over 2,400 videos.
We provide links to json files that contain the YouTube IDs and annotations for each split below. Please see the paper for details regarding each split.
The JSON files contain the following fields:
- key: Unique identifier for each question
- video_id: YouTube URL
- question: Free-form question
- answer: Free-form answer
- answer_choice_{i}: The five answer options for MCQ evaluation (the correct answer and four decoys), with i from 0 to 4
- answer_id: Index of the correct answer among the answer choices
- question type: The question type label (for example "video summarization" or "temporal ordering")
Multiple-choice evaluation involves selecting the correct answer from 5 options (4 of which are decoys), with accuracy as the metric.
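The snippet below sketches this protocol over a list of loaded question records; `predict_fn` is a hypothetical stand-in for the model being evaluated, not part of the release.

```python
def mcq_accuracy(examples, predict_fn):
    """Multiple-choice accuracy over a list of question records.

    `predict_fn(question, choices)` is a hypothetical model wrapper that
    returns an integer index into `choices`.
    """
    correct = 0
    for ex in examples:
        choices = [ex[f"answer_choice_{i}"] for i in range(5)]
        predicted = predict_fn(ex["question"], choices)
        correct += int(predicted == int(ex["answer_id"]))
    return correct / len(examples)
```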
For open-ended evaluation, we introduce a new language-model-based metric, the Gemma Equivalence Metric (GEM). We fine-tune a Gemma checkpoint on the BEM answer equivalence dataset and prompt it to determine whether a produced answer is equivalent to the ground truth.
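The sketch below illustrates open-ended scoring in the spirit of GEM, under the same record format. Both `answer_fn` (your model) and `judge_fn` (an equivalence judge such as the fine-tuned Gemma checkpoint) are hypothetical placeholders, not a released API.

```python
def gem_score(examples, answer_fn, judge_fn):
    """Open-ended evaluation in the spirit of GEM.

    `answer_fn(question)` returns the model's free-form answer.
    `judge_fn(question, reference, candidate)` stands in for the
    equivalence judge and returns True if the candidate answer is
    equivalent to the reference answer.
    """
    equivalent = 0
    for ex in examples:
        candidate = answer_fn(ex["question"])
        equivalent += int(judge_fn(ex["question"], ex["answer"], candidate))
    return equivalent / len(examples)
```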
@article{neptune24,
title={Neptune: The Long Orbit to Benchmarking Long Video Understanding},
author={Nagrani, Arsha and Zhang, Mingda and Mehran, Ramin and Hornung, Rachel and Gundavarapu, Nitesh Bharadwaj and Jha, Nilpa and Myers, Austin and Zhou, Xingyi and Gong, Boqing and Schmid, Cordelia and Sirotenko, Mikhail and Zhu, Yukun and Weyand, Tobias},
journal={arXiv preprint arXiv:2412.09582},
year={2024},
}
Copyright 2024 DeepMind Technologies Limited
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.