Codebase for the paper:
"DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities"
(EMNLP 2024)
Create and activate the environment:
conda create --name lsr python=3.9.12
conda activate lsr
Clone the repository and install the required packages:
git clone https://github.com/thongnt99/DyVo
cd DyVo
pip install -r requirements.txt
Create a directory for the data:
mkdir dyvo_data
cd dyvo_data
Make sure the Hugging Face CLI is installed:
pip install huggingface_hub
Then download the data:
huggingface-cli download lsr42/dyvo_data
Note:
- You may need to log in to Hugging Face before downloading:
huggingface-cli login
- The downloaded files will be cached locally. Refer to the Hugging Face CLI documentation for cache settings if needed.
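The same download can also be scripted from Python. The sketch below uses `snapshot_download` from `huggingface_hub` to place the files in a chosen directory instead of only the cache. The helper name `download_dyvo_data` and the `dyvo_data` target directory are assumptions for illustration, and the sketch relies on the default repository type, as the CLI command above does.

```python
from pathlib import Path

REPO_ID = "lsr42/dyvo_data"  # repository named in the CLI command above


def download_dyvo_data(local_dir="dyvo_data", repo_type=None):
    """Download the repository snapshot into local_dir and return its path.

    Hypothetical helper. repo_type=None keeps the library default; set it
    (e.g. "dataset") only if the repository is hosted as another type.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type=repo_type,
        local_dir=str(target),
    )
```

Calling `download_dyvo_data()` returns the local path of the downloaded snapshot. If you are already inside `dyvo_data`, pass `local_dir="."`.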
Queries and documents are accessible via ir_datasets.
Refer to the ir_datasets website (https://ir-datasets.com/) for instructions on how to download them.
Dataset | ir_datasets Key |
---|---|
Wapo | wapo/v2/trec-core-2018 |
Robust04 | disks45/nocr/trec-robust-2004 |
Codec | codec |
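As a rough illustration of how the keys in the table are used, the sketch below loads a dataset with `ir_datasets.load` and takes a few items from its iterators. The short names in `IR_DATASETS_KEYS` and the helpers `resolve_key`, `load_dataset`, and `peek` are hypothetical; they are not part of this repository.

```python
from itertools import islice

# Keys copied from the table above; the short names are illustrative.
IR_DATASETS_KEYS = {
    "wapo": "wapo/v2/trec-core-2018",
    "robust04": "disks45/nocr/trec-robust-2004",
    "codec": "codec",
}


def resolve_key(name):
    """Map a short dataset name (case-insensitive) to its ir_datasets key."""
    return IR_DATASETS_KEYS[name.lower()]


def load_dataset(name):
    """Load a dataset through ir_datasets (requires `pip install ir_datasets`)."""
    import ir_datasets  # imported lazily; only needed when actually loading

    return ir_datasets.load(resolve_key(name))


def peek(iterator, n=3):
    """Return the first n items of a query or document iterator."""
    return list(islice(iterator, n))
```

For example, `peek(load_dataset("codec").queries_iter())` would return the first three queries, and `docs_iter()` can be used the same way for documents. Some collections may need their source files obtained separately before they can be loaded; the ir_datasets documentation describes this per dataset.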
Example command to start training (run from the repository root, so that the lsr package is importable):
python -m lsr.train +experiment=qmlp_dmlm_emlm_laque_wapo_msmarco_pretrained_inparsv2_monot53b_distillation_l1_0.0_0.001_entw_0.05.yaml training_arguments.fp16=True
- The list of experiment configuration files can be found in the lsr/configs/experiment/ directory.
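To see which experiments are available, a small script can list the configuration files and print the matching training command for each. This is a minimal sketch that assumes the configs are `.yaml` files, as in the example above, and that it is run from the repository root. `list_experiments` and `training_command` are hypothetical helper names.

```python
from pathlib import Path


def list_experiments(config_dir="lsr/configs/experiment"):
    """Return the sorted file names of the YAML configs in config_dir."""
    return sorted(p.name for p in Path(config_dir).glob("*.yaml"))


def training_command(experiment, fp16=True):
    """Build the training command string for one experiment file."""
    return (
        f"python -m lsr.train +experiment={experiment} "
        f"training_arguments.fp16={fp16}"
    )
```

Printing `training_command(name)` for each `name` in `list_experiments()` gives one ready-to-run command per configuration file, in the same form as the example command.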
If you find this repository helpful, please cite our paper:
@inproceedings{nguyen-etal-2024-dyvo,
title = "DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities",
author = "Nguyen, Thong and
Chatterjee, Shubham and
MacAvaney, Sean and
Mackie, Iain and
Dalton, Jeff and
Yates, Andrew",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024"
}