This repository contains the code for our ACL 2025 (Findings) paper on "Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data".
Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding them with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of an analytical nature. The challenge is to provide humans with convenient access with a small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.
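To give an intuition for the recursive decomposition idea, here is a minimal sketch of an executable operator tree over mock personal data. The operator names mirror the paper's terminology, but the classes, signatures, and data below are hypothetical illustrations, not the actual ReQAP implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Operator:
    name: str                      # e.g. "RETRIEVE", "EXTRACT", "SUM"
    fn: Callable                   # computes this node's result from its children's results
    children: List["Operator"] = field(default_factory=list)

    def execute(self):
        # Recursively execute children first, then combine their results;
        # the per-node results make the final answer traceable.
        return self.fn(*[child.execute() for child in self.children])

# Mock personal data: workout log entries.
workouts = [
    {"type": "running", "distance_km": 5.0},
    {"type": "cycling", "distance_km": 20.0},
    {"type": "running", "distance_km": 10.0},
]

# "How far did I run in total?" decomposed into RETRIEVE -> EXTRACT -> SUM.
retrieve = Operator("RETRIEVE", lambda: [w for w in workouts if w["type"] == "running"])
extract = Operator("EXTRACT", lambda events: [e["distance_km"] for e in events], [retrieve])
total = Operator("SUM", lambda values: sum(values), [extract])

print(total.execute())  # 15.0
```

Executing the root recursively executes the whole tree, so structured filters and learned extraction steps can be freely composed.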
If you use this code, please cite:
@inproceedings{christmann2025recursive,
  title={Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data},
  author={Christmann, Philipp and Weikum, Gerhard},
  booktitle={ACL 2025 Findings},
  year={2025}
}
All code was tested on Linux only.
- Conda
- PyTorch
- GPU (for training)
We recommend installation via conda, and provide the corresponding environment file in conda-reqap.yml:
git clone https://github.com/PhilippChr/ReQAP.git
cd ReQAP/
conda env create --file conda-reqap.yml
conda activate reqap
pip install -e .
Alternatively, you can install the requirements via pip, using the requirements.txt or requirements-cpu.txt file. In that case, further packages might be required to run the code on a GPU.
To initialize the repo (download data, benchmark, models), run:
bash scripts/initialize.sh
- ReQAP SFT
- Run QUD stage
bash scripts/pipeline.sh --qud-test config/perqa/reqap_sft.yml # much faster with GPU
- Run OTX stage
bash scripts/pipeline.sh --otx-test config/perqa/reqap_sft.yml # much faster with GPU
- ReQAP with LLaMA
- Run QUD stage
bash scripts/pipeline.sh --qud-test config/perqa/reqap_llama.yml # much faster with GPU
- Run OTX stage
bash scripts/pipeline.sh --otx-test config/perqa/reqap_llama.yml # much faster with GPU
- ReQAP with GPT
- Add your OpenAI credentials in config/perqa/reqap_openai.yml
- Run QUD stage
bash scripts/pipeline.sh --qud-test config/perqa/reqap_openai.yml
- Run OTX stage
bash scripts/pipeline.sh --otx-test config/perqa/reqap_openai.yml # much faster with GPU
This requires adding your OpenAI credentials in config/perqa/reqap_openai.yml, or replacing config/perqa/reqap_openai.yml with config/perqa/reqap_llama.yml.
- Run QUD stage via ICL on all train + dev questions
- Run QUD-ICL on train set
bash scripts/pipeline.sh --create_qu_plans-train config/perqa/reqap_openai.yml # requires GPU/API
- Run QUD-ICL on dev set
bash scripts/pipeline.sh --create_qu_plans-dev config/perqa/reqap_openai.yml # requires GPU/API
- Train RETRIEVE and EXTRACT operators
- RETRIEVE
- Detect all RETRIEVE calls
bash scripts/run_retrieval.sh --derive_retrieve_calls config/perqa/reqap_openai.yml
- Construct SPLADE indices
bash scripts/run_retrieval.sh --construct_index config/perqa/reqap_openai.yml # much faster with GPU
- Prepare data for training RETRIEVE models
bash scripts/prepare_retrieval_data.sh config/perqa/reqap_openai.yml # CPU: runs 14 parallel scripts
- Merge RETRIEVE training data for all personas
bash scripts/merge_retrieval_data.sh data/training_data/perqa
- Train RETRIEVE models of size L (default)
bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-12.yml # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-12.yml # requires GPU
- [OPTIONAL] Train RETRIEVE models of size M
bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-6.yml # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-6.yml # requires GPU
- [OPTIONAL] Train RETRIEVE models of size S
bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-2.yml # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-2.yml # requires GPU
- [OPTIONAL] Train RETRIEVE models of size XS
bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-TinyBERT-L-2.yml # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-TinyBERT-L-2.yml # requires GPU
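For intuition on the SPLADE-style sparse indices used here: each document and query is represented as a sparse term-weight vector, and relevance is their dot product. Real SPLADE learns the weights with a language model; the hand-set weights and document IDs below are purely illustrative:

```python
# Toy illustration of sparse retrieval scoring: relevance is the dot product
# of sparse {term: weight} vectors. The index entries and weights below are
# hypothetical; SPLADE would learn them with a language model.

def sparse_dot(query_vec, doc_vec):
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

index = {
    "calendar_entry_1": {"dentist": 2.1, "appointment": 1.3, "monday": 0.8},
    "workout_3": {"running": 1.9, "distance": 1.1, "monday": 0.5},
}

query = {"dentist": 1.5, "appointment": 1.0}
scores = {doc_id: sparse_dot(query, vec) for doc_id, vec in index.items()}
best = max(scores, key=scores.get)
print(best)  # calendar_entry_1
```

Because the vectors are sparse over the vocabulary, such indices can be stored and queried efficiently with inverted lists.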
- EXTRACT
- Derive EXTRACT calls with related attributes
bash scripts/run_extract.sh --derive_attributes config/perqa/reqap_openai.yml
- Identify aliases for keys in EXTRACT calls
bash scripts/run_extract.sh --derive_attribute_mappings config/perqa/reqap_openai.yml # requires GPU/API
- Derive training data for EXTRACT model
bash scripts/run_extract.sh --derive_data config/perqa/reqap_openai.yml
- Train EXTRACT model of size L (default)
bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-bart-base.yml # requires GPU
- [OPTIONAL] Train EXTRACT model of size M
bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-bart-small.yml # requires GPU
- [OPTIONAL] Train EXTRACT model of size S
bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-t5-efficient-mini.yml # requires GPU
- [OPTIONAL] Train EXTRACT model of size XS
bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-t5-efficient-tiny.yml # requires GPU
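To illustrate the EXTRACT operator's input/output contract: given verbalized events and a requested attribute, it returns structured values. The real operator is a trained seq2seq model (e.g. BART/T5); the regex stand-in and attribute name below are entirely hypothetical:

```python
import re

# Toy stand-in for EXTRACT: map verbalized events to structured attribute
# values. Only illustrates the contract; the real model is learned.

def extract(events, attribute):
    pattern = {"distance_km": r"ran ([\d.]+) km"}[attribute]
    values = []
    for text in events:
        match = re.search(pattern, text)
        values.append(float(match.group(1)) if match else None)
    return values

events = ["On Monday I ran 5.0 km in the park.", "Cycling session, 20 km."]
print(extract(events, "distance_km"))  # [5.0, None]
```

The training steps above teach a model to do this robustly across phrasings, where a fixed pattern like the one here would fail.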
- Derive data for model distillation
- [OPTION 1] Identify correct operator trees (in single runs)
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml # requires GPU
- [OPTION 2] Identify correct operator trees (run individually per persona)
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_0 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_1 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_2 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_3 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_4 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_5 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_6 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_7 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_8 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_9 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_10 # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_11 # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_0 # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_1 # requires GPU
- [OPTION 3] Identify correct operator trees for single train persona (much faster, less training data)
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_0 # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_0 # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_1 # requires GPU
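The distillation idea behind these loops, sketched minimally: run the LLM-based pipeline on train/dev questions, keep only the operator trees whose execution yields the gold answer, and fine-tune a smaller model on those (question, plan) pairs. The data structures below are hypothetical:

```python
# Hypothetical candidate runs: each pairs a question with a generated plan,
# the answer obtained by executing that plan, and the gold answer.
candidates = [
    {"question": "How often did I run in May?", "plan": "plan_a", "predicted": 4, "gold": 4},
    {"question": "How often did I run in May?", "plan": "plan_b", "predicted": 7, "gold": 4},
]

# Keep only plans whose execution matched the gold answer.
distillation_data = [
    (c["question"], c["plan"]) for c in candidates if c["predicted"] == c["gold"]
]
print(distillation_data)  # [('How often did I run in May?', 'plan_a')]
```

Execution-based filtering gives supervision without manually annotated operator trees.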
- Merge data for model distillation
bash scripts/merge_qu_data.sh
- Train QUD stage
- Derive training data for QUD stage
bash scripts/run_qu.sh --derive_data config/perqa/reqap_sft.yml
- Train QUD model of size M (default)
bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-llama1b.yml # requires GPU
- [OPTIONAL] Train QUD model of size L
bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-llama3b.yml # requires GPU
- [OPTIONAL] Train QUD model of size S
bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-hf-smollm2-360M.yml # requires GPU
- [OPTIONAL] Train QUD model of size XS
bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-hf-smollm2-135M.yml # requires GPU
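The derived QUD training data pairs each question with a serialized operator plan for supervised fine-tuning. A minimal sketch of what such examples could look like as JSONL; the exact fields and plan syntax here are hypothetical:

```python
import json

# Hypothetical SFT examples: question paired with a serialized operator plan.
examples = [
    {"question": "How far did I run in total?",
     "plan": "SUM(EXTRACT(RETRIEVE('running workouts'), 'distance_km'))"},
]

# Serialize to JSONL and read one record back.
jsonl = "\n".join(json.dumps(e) for e in examples)
record = json.loads(jsonl.splitlines()[0])
print(record["question"])  # How far did I run in total?
```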
- Run ReQAP (SFT) inference
- Run QUD stage
bash scripts/pipeline.sh --qud-test config/perqa/reqap_sft.yml # much faster with GPU
- Run OTX stage
bash scripts/pipeline.sh --otx-test config/perqa/reqap_sft.yml # much faster with GPU
- RAG SFT (requires the training steps below)
- Run retrieval
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test # much faster with GPU
- Run generation
bash scripts/rag.sh --test config/perqa/rag_sft.yml # requires GPU
- RAG with LLaMA
- Run retrieval
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test # much faster with GPU
- Run generation
bash scripts/rag.sh --test config/perqa/rag_llama.yml # requires GPU
- RAG with GPT
- Add your OpenAI credentials in config/perqa/rag_openai.yml
- Run retrieval
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test # much faster with GPU
- Run generation
bash scripts/rag.sh --test config/perqa/rag_openai.yml # requires GPU
- Train retrieval for RAG baseline
- Derive training data (uses the ReQAP retrieval training data, which is assumed to exist already)
bash scripts/rag.sh --ce_derive_data config/perqa/rag_openai.yml
- Train the cross-encoder
bash scripts/rag.sh --ce_train config/perqa/rag_openai.yml # requires GPU
- Run retrieval inference
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml train # much faster with GPU
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml dev # much faster with GPU
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test # much faster with GPU
- Train answering model
- Derive training data
bash scripts/rag.sh --derive_data config/perqa/rag_sft.yml
- Train model
bash scripts/rag.sh --train config/perqa/rag_sft.yml # requires GPU
- Inference
bash scripts/rag.sh --test config/perqa/rag_sft.yml # requires GPU
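The RAG baseline's two stages, sketched minimally: score all verbalized entries against the question, take the top-k as context, and hand them to an answering model. Scoring below is naive token overlap and all data is mock; the actual baseline uses a trained cross-encoder and a fine-tuned generator:

```python
# Toy retrieve-then-generate sketch; the overlap score is a stand-in for a
# trained cross-encoder, and the prompt format is hypothetical.

def score(question, entry):
    q_tokens = set(question.lower().split())
    return len(q_tokens & set(entry.lower().split()))

entries = [
    "dentist appointment on monday at 9am",
    "ran 5 km on tuesday morning",
    "bought groceries on sunday",
]

question = "when is my dentist appointment"
top_k = sorted(entries, key=lambda e: score(question, e), reverse=True)[:2]
prompt = f"Context: {' | '.join(top_k)}\nQuestion: {question}\nAnswer:"
print(top_k[0])  # dentist appointment on monday at 9am
```

Unlike ReQAP's operator trees, the answer here depends entirely on what the retriever surfaces into the prompt.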
- CodeGen SFT (requires the training steps below)
bash scripts/query_generation.sh --test config/perqa/query_generation_sft.yml # requires GPU
- CodeGen with LLaMA
bash scripts/query_generation.sh --test config/perqa/query_generation_llama.yml # requires GPU
- CodeGen with GPT
- Add your OpenAI credentials in config/perqa/query_generation_openai.yml
bash scripts/query_generation.sh --test config/perqa/query_generation_openai.yml # requires GPU
- Prepare training data
bash scripts/query_generation.sh --derive_data config/perqa/query_generation_sft.yml
- Train translation model
bash scripts/query_generation.sh --train config/perqa/query_generation_sft.yml # requires GPU
- Inference
bash scripts/query_generation.sh --test config/perqa/query_generation_sft.yml # requires GPU
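For intuition on the CodeGen baseline: it translates the question into an executable query over the structured personal data. A toy sqlite3 illustration; the schema, data, and the hard-coded "generated" SQL are hypothetical stand-ins for the model's output:

```python
import sqlite3

# Mock structured personal data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workouts (type TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO workouts VALUES (?, ?)",
    [("running", 5.0), ("cycling", 20.0), ("running", 10.0)],
)

# Question: "How far did I run in total?" -> query a model would generate (mocked here):
generated_sql = "SELECT SUM(distance_km) FROM workouts WHERE type = 'running'"
(answer,) = conn.execute(generated_sql).fetchone()
print(answer)  # 15.0
```

This works well when all relevant data is structured, but unlike ReQAP it cannot tap unstructured sources within the same query.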
We have tried our best to document the code of this project and make it easy to use. If you feel that some parts of the documentation or code could be improved, or if you have other feedback, please do not hesitate to let us know!
You can contact us via mail: pchristm@mpi-inf.mpg.de. Any feedback (also positive ;) ) is much appreciated!
The ReQAP project by Philipp Christmann and Gerhard Weikum is licensed under the MIT license.
Our retrieval utilizes SPLADE (https://github.com/naver/splade). We adapt parts of their code in this repository.