📹 Instructional Visual Answer Localization | 🤖 Large and Pre-trained Language Models
🙋 Human-Computer Interaction
We propose Ask2Loc, an interactive visual answer localization framework that identifies the precise video segments answering a user question by acquiring auxiliary knowledge through multiple simulated rounds of interaction in the form of asking. The top-level framework consists of the following primary phases, as shown in the figure above.
- Chatting for Intention Awareness: Instructional videos often contain extensive domain knowledge that users are unfamiliar with, which leads to vague initial queries. This phase therefore leverages large language models (LLMs) to simulate interactive dialogue, progressively refining the user's intent through follow-up questions and thus producing the responses users actually expect. (A prompting sketch for this and the rewriting phase follows this list.)
- Rewriting for Description Completeness: The In-VAL process faces two forms of semantic incompleteness: incomplete subtitle expressions within video segments, and a semantic gap between the prior QA dialogue and the user's actual intent. Both are addressed through LLM-based rewriting, which improves linguistic completeness and the alignment between user input and video content.
- Searching for Context Expansion: To simulate human-like localization behavior, we propose a context expansion strategy that leverages a fine-tuned pre-trained language model (PLM) to identify semantically similar video segments, enhancing the understanding and assessment of a given segment. This method is inspired by embedding-based retrieval in retrieval-augmented generation (RAG) systems. (A retrieval sketch follows this list.)
- Learning for Answer Location Detection: We formulate deciding whether each video segment falls within the answer span as a classification problem: visual features are projected into the same space as textual features, fused with contextual encodings via a PLM, and jointly optimized through PLM-based fine-tuning using ground-truth and predicted location labels. (A schematic classification sketch follows this list.)
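Below is a minimal sketch of how the chatting and rewriting phases could be driven with an OpenAI-compatible chat API. The model name, prompt wording, and helper functions (`chat_rounds`, `rewrite`) are illustrative assumptions, not the exact prompts or interfaces used in this repository.

```python
# Minimal sketch (assumptions: OpenAI-compatible chat API, illustrative prompts).
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name


def chat_rounds(question: str, subtitle: str, rounds: int = 3) -> list[dict]:
    """Simulate R rounds of follow-up questioning to refine the user's intent."""
    messages = [
        {"role": "system",
         "content": "You clarify vague questions about an instructional video "
                    "by asking one short follow-up question per turn."},
        {"role": "user", "content": f"Question: {question}\nSubtitle: {subtitle}"},
    ]
    for _ in range(rounds):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        follow_up = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": follow_up})
        # In a live setting the user would answer here; when constructing the
        # dataset offline, the LLM can be asked to play both roles.
        messages.append({"role": "user",
                         "content": "Please answer your own follow-up question "
                                    "based on the subtitle."})
    return messages


def rewrite(question: str, subtitle: str, dialogue: list[dict]) -> tuple[str, str]:
    """Rewrite the question (user intent) and the subtitle (current content)."""
    history = "\n".join(m["content"] for m in dialogue if m["role"] != "system")
    prompt = (f"Dialogue:\n{history}\n\nRewrite (1) the question so it states the "
              f"user's full intent and (2) the subtitle '{subtitle}' so it is a "
              f"complete, self-contained description. Return them on two lines.")
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    lines = reply.choices[0].message.content.strip().split("\n")
    return lines[0], lines[-1]
```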
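The searching phase boils down to embedding-based retrieval over a video's subtitles. The sketch below uses an off-the-shelf sentence-transformers checkpoint as a stand-in for the fine-tuned PLM; the checkpoint name, the `expand_context` helper, and the value of K are assumptions.

```python
# Minimal sketch of context expansion via embedding retrieval.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned PLM


def expand_context(query: str, subtitles: list[str], k: int = 5) -> list[str]:
    """Return the top-K subtitles most similar to the (rewritten) question."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    subtitle_embs = encoder.encode(subtitles, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, subtitle_embs)[0]        # shape: (num_subtitles,)
    top_k = scores.topk(k=min(k, len(subtitles))).indices.tolist()
    return [subtitles[i] for i in top_k]
```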
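The learning phase can be pictured as the schematic PyTorch module below: visual features are projected into the text space, concatenated with the PLM encoding of the textual input, and classified as within/outside the answer span. The backbone name, hidden sizes, and simple concatenation fusion are assumptions, not the exact released architecture.

```python
# Schematic sketch of the detection head (backbone, dimensions, and fusion are assumptions).
import torch
import torch.nn as nn
from transformers import AutoModel


class SegmentClassifier(nn.Module):
    def __init__(self, plm_name: str = "bert-base-uncased", visual_dim: int = 768):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        hidden = self.plm.config.hidden_size
        self.visual_proj = nn.Linear(visual_dim, hidden)  # project visual features into the text space
        self.head = nn.Linear(2 * hidden, 2)              # within-answer vs. outside-answer

    def forward(self, input_ids, attention_mask, visual_feats):
        # [CLS] encoding of (rewritten question + subtitle + expanded context)
        text = self.plm(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        vis = self.visual_proj(visual_feats)
        return self.head(torch.cat([text, vis], dim=-1))  # per-segment logits


# Training would pair each segment's tokenized text with its visual feature vector
# and optimize cross-entropy against the 0/1 "within answer" labels.
```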
We reconstruct three instructional visual answer localization datasets for the In-VAL task: In-MedVidQA (medical, English), In-VehicleVQA (vehicle, English), and In-CMIVQA (medical, Chinese), covering multiple domains and multilingual scenarios.
For each sample (an individual video segment) in these datasets, the following data fields are included:
- Input Question
- Answer Span (start time and duration)
- Within Answer Label (0 or 1)
- Current Subtitle
- Chatting Dialogue (R Rounds)
- Rewritten Question (User Intent)
- Rewritten Subtitle (Current Content)
- Expanded Context (Top K Relevant Subtitles)
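For illustration only, a single sample could be laid out as the record below; the field names and values are assumptions based on the list above, not the exact schema of the released files.

```python
# Illustrative sample record (field names and values are assumptions, not the released schema).
sample = {
    "question": "How do I check the engine oil level?",
    "answer_span": {"start_time": 62.0, "duration": 18.5},   # seconds
    "within_answer": 1,                                       # 1 = segment lies inside the answer span
    "current_subtitle": "pull out the dipstick and wipe it clean",
    "chatting_dialogue": [                                    # R rounds of simulated chat
        {"ask": "Should the engine be cold?", "reply": "Yes, check before starting the car."},
    ],
    "rewritten_question": "How do I correctly check the engine oil level on a cold engine?",
    "rewritten_subtitle": "Pull out the dipstick and wipe it clean before reading the oil level.",
    "expanded_context": [                                     # top-K most relevant subtitles
        "locate the dipstick near the front of the engine",
        "reinsert the dipstick fully, then remove it to read the level",
    ],
}
```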
For the video subtitles and visual features, please download them from our GoogleDrive (to be updated).
For the In-VAL datasets, including questions, descriptions, context, and visual locations, please refer to the Dataset folder.
- Install Requirements (see the installation note after this list)
- Setup Training Configuration: `vim config.py`
- Run Training: `python train.py`
- Run Evaluation: `python evaluate.py`
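Assuming the repository ships a standard `requirements.txt` at its root (an assumption; check the repository layout), installation would typically be:

```bash
# Install Python dependencies (assumes a requirements.txt in the repository root).
pip install -r requirements.txt
```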
Important Results (mIoU):

| Framework | Method | In-MedVidQA | In-VehicleVQA | In-CMIVQA |
| --- | --- | --- | --- | --- |
| End-to-End | RandomGuess | 5.96 | 4.96 | 4.21 |
| End-to-End | LLM-Gen | 9.42 | 10.17 | 8.57 |
| End-to-End | Video-LLM | 12.90 | 14.65 | 9.45 |
| End-to-End | PLM-Fusion | 29.36 | 38.26 | 22.70 |
| End-to-End | PLM-Context | 35.46 | 36.88 | 20.26 |
| End-to-End | PLM-Prompt | 37.08 | 40.37 | 27.52 |
| Two-Stage | Retrieval-Loc | 28.52 | 33.72 | 16.59 |
| Two-Stage | Expand-Loc | 29.47 | 32.14 | 15.81 |
| Two-Stage | Describe-Loc | 31.02 | 26.75 | 20.76 |
| Interactive | Ask2Loc (Ours) | 43.22 | 55.28 | 31.37 |
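mIoU here denotes the mean temporal Intersection over Union between predicted and ground-truth answer spans. The sketch below shows the standard IoU computation for illustration only; refer to `evaluate.py` for the exact protocol used to produce the numbers above.

```python
# Minimal sketch of temporal IoU / mIoU (standard definition; see evaluate.py for the exact protocol).

def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: list[tuple[float, float]], golds: list[tuple[float, float]]) -> float:
    """Average IoU, scaled by 100 to match the percentage-style numbers in the table (an assumption)."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)
```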