This page covers the Neptune Dataset Collection, a set of video question answering (QA) datasets. The collection currently includes the original Neptune dataset and the MINERVA dataset.
MINERVA consists of ~1.5K challenging question-answer-decoy (QAD) sets for videos of variable length. For each question, we provide 5 answer choices, as well as a detailed, manually annotated reasoning trace. Every question in MINERVA requires complex reasoning using two or more skills (for example numerical reasoning, temporal reasoning, spatial navigation). Videos span multiple domains (short films, sports, instructional videos, etc.) and a range of lengths (from 2 minutes to over 1.5 hours). The hand-crafted, detailed reasoning trace accompanying each question outlines the steps required to arrive at the correct answer. These traces include timestamps where necessary to refer to relevant sections of the video, describe key actions and objects, and outline the logical reasoning steps. More details are provided in our arXiv paper.
We provide a JSON file that contains the YouTube IDs and annotations.
The JSON file contains the following fields (a minimal loading sketch follows the list):
- key: Unique identifier for each question
- video_id: YouTube URL
- question: Free-form question
- answer: Free-form answer
- answer_choice_{i}: The five answer options for MCQ evaluation (the correct answer and four decoys), with i from 0 to 4
- answer_id: Index of the correct answer among the answer choices
- reasoning: Detailed reasoning trace
- question type: A comma-separated list of the skills needed to answer the question
- split: Coarse video domain
- category: Fine-grained video domain
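As a quick illustration, here is a minimal Python sketch for reading these fields. The file name, and the assumption that the file parses to a list of question records, are placeholders rather than part of the release.

```python
import json

# Path is a placeholder; point it at the downloaded annotation file.
with open("minerva_annotations.json", "r") as f:
    examples = json.load(f)  # assumed here to be a list of question records

for ex in examples:
    video_url = ex["video_id"]      # YouTube URL
    question = ex["question"]
    # Five answer options: the correct answer plus four decoys.
    choices = [ex[f"answer_choice_{i}"] for i in range(5)]
    correct_answer = choices[int(ex["answer_id"])]
    # The skills needed for the question, given as a comma-separated list.
    skills = [s.strip() for s in ex["question type"].split(",")]
    reasoning = ex["reasoning"]
```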
@article{minerva25,
title={MINERVA: Evaluating Complex Video Reasoning},
author={Nagrani, Arsha and Menon, Sachit and Iscen, Ahmet and Buch, Shyamal and Mehran, Ramin and Jha, Nilpa and Hauth, Anja and Zhu, Yukun and Vondrick, Carl and Sirotenko, Mikhail and Schmid, Cordelia and Weyand, Tobias},
journal={arXiv preprint arXiv:2505.00681},
year={2025}
}
Neptune is a dataset consisting of challenging question-answer-decoy (QAD) sets for variable length videos (up to 15 minutes). The goal of this dataset is to test video-language models for a broad range of long video reasoning abilities, which are provided as "question type" labels for each question, for example "video summarization", "temporal ordering", "state changes" and "creator intent" amongst others. More details are provided in our arXiv paper.
Neptune allows for two modes of evaluation: multiple-choice and open-ended question answering. For the latter, we provide our own open-ended metric based on Gemma, called Gemma Equivalence Metric (GEM).
Neptune was created using a semi-automatic pipeline that involves careful prompting of LLMs and VLMs, including Gemini. More details are provided in the paper.
Neptune has more than 3,200 questions for over 2,400 videos.
We provide links to json files that contain the YouTube IDs and annotations for each split below. Please see the paper for details regarding each split.
The JSON files contain the following fields:
- key: Unique identifier for each question
- video_id: YouTube URL
- question: Free-form question
- answer: Free-form answer
- answer_choice_{i}: The five answer options for MCQ evaluation (the correct answer and four decoys), with i from 0 to 4
- answer_id: Index of the correct answer among the answer choices
- question type: The question type label (for example "video summarization" or "temporal ordering")
Multiple-choice evaluation involves selecting the correct answer from 5 options (4 of which are decoys), with accuracy as the metric.
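The snippet below sketches this protocol over a list of loaded question records; `predict_fn` is a hypothetical stand-in for the model being evaluated, not part of the release.

```python
def mcq_accuracy(examples, predict_fn):
    """Multiple-choice accuracy over a list of question records.

    `predict_fn(question, choices)` is a hypothetical model wrapper that
    returns an integer index into `choices`.
    """
    correct = 0
    for ex in examples:
        choices = [ex[f"answer_choice_{i}"] for i in range(5)]
        predicted = predict_fn(ex["question"], choices)
        correct += int(predicted == int(ex["answer_id"]))
    return correct / len(examples)
```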
For open-ended evaluation, we introduce a new language-model-based metric, the Gemma Equivalence Metric (GEM). We fine-tune a Gemma checkpoint on the BEM answer equivalence dataset and prompt it to determine whether a produced answer is equivalent to the ground truth.
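The sketch below illustrates open-ended scoring in the spirit of GEM, under the same record format. Both `answer_fn` (your model) and `judge_fn` (an equivalence judge such as the fine-tuned Gemma checkpoint) are hypothetical placeholders, not a released API.

```python
def gem_score(examples, answer_fn, judge_fn):
    """Open-ended evaluation in the spirit of GEM.

    `answer_fn(question)` returns the model's free-form answer.
    `judge_fn(question, reference, candidate)` stands in for the
    equivalence judge and returns True if the candidate answer is
    equivalent to the reference answer.
    """
    equivalent = 0
    for ex in examples:
        candidate = answer_fn(ex["question"])
        equivalent += int(judge_fn(ex["question"], ex["answer"], candidate))
    return equivalent / len(examples)
```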
@article{neptune24,
title={Neptune: The Long Orbit to Benchmarking Long Video Understanding},
author={Nagrani, Arsha and Zhang, Mingda and Mehran, Ramin and Hornung, Rachel and Gundavarapu, Nitesh Bharadwaj and Jha, Nilpa and Myers, Austin and Zhou, Xingyi and Gong, Boqing and Schmid, Cordelia and Sirotenko, Mikhail and Zhu, Yukun and Weyand, Tobias},
journal={arXiv preprint arXiv:2412.09582},
year={2024},
}
Copyright 2024 DeepMind Technologies Limited
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.