Anserini: replace verbose JSON-based vector format with a more compact binary encoding

Currently, for HNSW indexing in Anserini, we read vectors from a verbose, text-based JSON format, which is inefficient in both storage and parsing. We want to replace it with a more efficient binary encoding.
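
To give a rough sense of the overhead, here is a minimal sketch that encodes one vector both ways and compares the sizes. The JSON field names (`docid`, `vector`) and the 768-dimensional vector are assumptions for illustration, not taken from the actual Anserini input files; the binary side is plain little-endian float32.

```python
import json
import random
import struct


def encode_json(docid, vector):
    """Text encoding: one JSON object per line (field names are assumed)."""
    return (json.dumps({"docid": docid, "vector": vector}) + "\n").encode("utf-8")


def encode_binary(vector):
    """Binary encoding: little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_binary(blob):
    """Inverse of encode_binary: unpack 4-byte floats back into a list."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


random.seed(0)
dim = 768  # assumed embedding dimensionality, for illustration only
vector = [random.uniform(-1.0, 1.0) for _ in range(dim)]

json_bytes = encode_json("doc0", vector)
binary_bytes = encode_binary(vector)

# The binary form is exactly 4 * dim bytes; the JSON form spends many
# characters per component on decimal digits and separators.
print(f"JSON: {len(json_bytes)} bytes, binary: {len(binary_bytes)} bytes")
```

The binary form loses nothing that a float32 index would keep, and decoding it is a single unpack instead of a full text parse.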

Additional background:

safetensors seems like the best bet.
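
For anyone unfamiliar with it, the sketch below shows the safetensors file layout as I understand the published format: an 8-byte little-endian header length, a JSON header describing each tensor (dtype, shape, byte offsets), then the raw data buffer. This is a standard-library-only illustration, not how Anserini will necessarily do it; real code should go through the safetensors library, and the helper names and the tensor name `vectors` are hypothetical. It also ignores how docids would be stored alongside the vectors.

```python
import json
import os
import struct
import tempfile


def write_safetensors(path, name, rows):
    """Write one 2-D float32 tensor in the safetensors layout (sketch)."""
    n, d = len(rows), len(rows[0])
    data = b"".join(struct.pack(f"<{d}f", *row) for row in rows)
    header = json.dumps(
        {name: {"dtype": "F32", "shape": [n, d], "data_offsets": [0, len(data)]}}
    ).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))  # 8-byte header length
        f.write(header)                          # JSON header
        f.write(data)                            # raw tensor bytes


def read_safetensors(path, name):
    """Read the tensor back as a list of rows (list of float lists)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        meta = json.loads(f.read(header_len))[name]
        begin, end = meta["data_offsets"]
        f.seek(begin, 1)  # offsets are relative to the start of the data buffer
        blob = f.read(end - begin)
    n, d = meta["shape"]
    return [list(struct.unpack_from(f"<{d}f", blob, i * d * 4)) for i in range(n)]


# Round-trip two small vectors; these values are exactly representable in float32.
rows = [[0.5, -1.25, 2.0], [3.0, 0.0, -0.75]]
path = os.path.join(tempfile.mkdtemp(), "vectors.safetensors")
write_safetensors(path, "vectors", rows)
restored = read_safetensors(path, "vectors")
```

Because the header records the shape and offsets up front, a reader can seek directly to any vector without parsing the rest of the file, which is the main appeal over line-by-line JSON.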

If you want to work on this task, get started by running the BEIR regressions here: https://github.com/castorini/anserini?tab=readme-ov-file#%EF%B8%8F-end-to-end-regression-experiments

In particular, run the BGE regressions on NFCorpus, which aligns with the onboarding exercise. If your personal machine isn't big enough to run the regression, the student Linux environment should be sufficient.
