https://arxiv.org/abs/2502.01142
Download the Wikipedia dump from the DPR repository using the following commands:
mkdir -p data/dpr
wget -O data/dpr/psgs_w100.tsv.gz https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
pushd data/dpr
gzip -d psgs_w100.tsv.gz
popd
Use Elasticsearch to index the Wikipedia dump:
cd data
wget -O elasticsearch-7.17.9.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.9-linux-x86_64.tar.gz # download Elasticsearch
tar zxvf elasticsearch-7.17.9.tar.gz
rm elasticsearch-7.17.9.tar.gz
cd elasticsearch-7.17.9
nohup bin/elasticsearch & # run Elasticsearch in background
cd ../..
python prep_elastic.py --data_path data/dpr/psgs_w100.tsv --index_name wiki # build index
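prep_elastic.py handles the indexing. If you need to adapt it or just want to see what that step involves, the sketch below shows one way to bulk-index the passages with the official Elasticsearch Python client; it assumes the default localhost:9200 endpoint and the standard DPR TSV columns (id, text, title), and is illustrative rather than the exact logic of prep_elastic.py.
# Illustrative bulk-indexing sketch (not the repository's prep_elastic.py).
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
index_name = "wiki"

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name)

def passage_actions(tsv_path):
    with open(tsv_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row: id, text, title
        for pid, text, title in reader:
            yield {
                "_index": index_name,
                "_id": pid,
                "_source": {"title": title, "text": text},
            }

helpers.bulk(es, passage_actions("data/dpr/psgs_w100.tsv"), chunk_size=1000)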
If you want to construct the data from scratch, follow the instructions below, using Llama-3-8B as an example:
bash scripts/launch/run.sh
bash scripts/launch/run-72b.sh
bash scripts/data-construct/stage1.sh
Inference results will be saved in construct/sft/*
Use the eval script to evaluate the inference results in construct/sft/*.
Use scripts/data-construct/filter-stage1/extract.py to filter the correct responses, then use scripts/data-construct/filter-stage1/tokenize_sft.py to tokenize them for further training.
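The filtering logic lives in extract.py; as a rough illustration of the idea, assuming each inference record carries a model prediction and the gold answers (the field names and file paths below are hypothetical), keeping only correctly answered questions might look like this:
# Hypothetical sketch of response filtering; field names ("prediction",
# "answers") and paths are assumptions -- see extract.py for the real logic.
import json

def is_correct(record):
    pred = record["prediction"].lower()
    return any(ans.lower() in pred for ans in record["answers"])

with open("construct/sft/inference.jsonl") as fin, \
     open("construct/sft/correct.jsonl", "w") as fout:
    for line in fin:
        if is_correct(json.loads(line)):
            fout.write(line)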
Meanwhile, you can use scripts/data-construct/filter-stage1/reformat.py to visualize the tokenized data format.
The tokenized data will be saved to construct/sft/tokenized_dataset.
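tokenize_sft.py is the script actually used; the sketch below only illustrates the standard supervised fine-tuning tokenization pattern, where prompt tokens are masked out of the loss with -100. The model name and the prompt/response inputs are assumptions.
# General SFT tokenization pattern (a sketch, not tokenize_sft.py itself).
from transformers import AutoTokenizer

# assumed model; substitute the checkpoint you are actually training
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def tokenize_example(prompt, response):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # -100 labels exclude the prompt tokens from the loss
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}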
We use Llama-Factory for training.
The training script is in scripts/training/stage1.sh.
# the model path should point to the trained SFT model
bash scripts/launch/run.sh
bash scripts/data-construct/stage2.sh
Inference results will be saved in construct/dpo/*
Use the eval script to evaluate the inference results in construct/dpo/*.
Use scripts/data-construct/filter-stage2/tokenize_dpo.py to tokenize the responses for further training.
Meanwhile, you can use scripts/data-construct/filter-stage2/make_pair.py to visualize the tokenized data format.
The tokenized data will be saved to construct/dpo/tokenized_dataset.
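tokenize_dpo.py and make_pair.py define the actual format; the sketch below only illustrates the general shape of a DPO preference record, i.e. a shared prompt with a chosen and a rejected response. The field names, paths, and toy question are assumptions.
# Hypothetical sketch of a DPO preference record (see tokenize_dpo.py and
# make_pair.py for the format actually used by the pipeline).
import json

def make_preference_record(prompt, chosen, rejected):
    return {
        "prompt": prompt,      # the question / trajectory prefix
        "chosen": chosen,      # preferred response (e.g. answered correctly)
        "rejected": rejected,  # dispreferred response
    }

record = make_preference_record(
    "Question: Who wrote The Old Man and the Sea?",
    "Ernest Hemingway wrote The Old Man and the Sea.",
    "It was written by F. Scott Fitzgerald.",
)
with open("construct/dpo/pairs.jsonl", "w") as fout:
    fout.write(json.dumps(record) + "\n")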
We use Llama-Factory for training.
The training script is in scripts/training/stage2.sh.
We further validate our framework's effectiveness using reinforcement learning. The implementation can be found in our RL extension repository.
https://github.com/gxy-gxy/Search-R1-for-DeepRAG/tree/main
# replace the model path with your own
bash scripts/launch/run.sh
bash scripts/inference/run.sh
The evaluation is inherited from DRAGIN.
bash scripts/eval/run.sh
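DRAGIN's evaluation reports standard open-domain QA metrics; for reference, a minimal sketch of SQuAD-style exact match and token-level F1 is shown below. The eval scripts in this repository are authoritative.
# Minimal SQuAD-style EM / F1 sketch, for reference only.
import re, string
from collections import Counter

def normalize(s):
    # lowercase, strip punctuation and articles, collapse whitespace
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)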
Coming soon.
This code is heavily based on DRAGIN, which provides a framework for building multiple baselines. We enhance the inference pipeline with API-based methods along with multi-process acceleration.
If you find this work helpful, please cite our paper:
@article{guan2025deepragthinkingretrievalstep,
title={DeepRAG: Thinking to Retrieve Step by Step for Large Language Models},
author={Xinyan Guan and Jiali Zeng and Fandong Meng and Chunlei Xin and Yaojie Lu and Hongyu Lin and Xianpei Han and Le Sun and Jie Zhou},
year={2025},
journal={arXiv preprint arXiv:2502.01142},
url={https://arxiv.org/abs/2502.01142}
}