Closed
Description
Currently, HNSW indexing in Anserini reads input vectors from a very verbose, text-based JSON format, which is inefficient. We want to replace it with a more efficient binary encoding.
Additional background:
- Anserini Dense Vector Format: binary encoding format for input vectors anserini#1956
- Conversion between different vector formats #3
safetensors seems like the best bet.
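To give a rough sense of the overhead, here is a minimal sketch comparing a JSON text encoding of one vector against a raw float32 binary encoding. It uses only the Python standard library and does not produce a safetensors file. The field names `docid` and `vector`, the document id, and the 768-dimensional size are illustrative assumptions, not Anserini's confirmed schema.

```python
import json
import random
import struct

DIM = 768  # assumed embedding dimension, for illustration only

random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(DIM)]

# Text encoding: one JSON object per document (hypothetical field names).
json_bytes = json.dumps({"docid": "doc0", "vector": vector}).encode("utf-8")

# Binary encoding: 4 bytes per value as little-endian float32.
binary_bytes = struct.pack(f"<{DIM}f", *vector)

# Round-trip the binary form back into Python floats (float32 precision).
decoded = struct.unpack(f"<{DIM}f", binary_bytes)

print(f"JSON:   {len(json_bytes)} bytes")
print(f"Binary: {len(binary_bytes)} bytes")
```

The binary form is always exactly 4 bytes per dimension, while the JSON form spends many characters per value on decimal digits and separators, and it also has to be parsed as text at indexing time.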
If you want to work on this task, get started by running the BEIR regressions here: https://github.com/castorini/anserini?tab=readme-ov-file#%EF%B8%8F-end-to-end-regression-experiments
In particular, run the BGE regressions on NFCorpus, which aligns with the onboarding exercise. If your personal machine isn't big enough to run the regression, the student Linux environment should be sufficient.