From 4aac8bbb9ab0507acb611b73ec99524c16577b1f Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Wed, 21 May 2025 09:37:01 -0400
Subject: [PATCH 1/8] add download via wget

---
 docs/experiments-msmarco-v2.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index b88eb43c5..f35c0e5ac 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -3,6 +3,12 @@
 This guide contains instructions for running baselines on the MS MARCO V2 passage and document test collections, available [here](https://microsoft.github.io/msmarco/TREC-Deep-Learning.html).
 Note that Pyserini provides a [comparable guide](https://github.com/castorini/pyserini/blob/msmarco-v2/docs/experiments-msmarco-v2.md), so if you don't like Java, you can get the same results from Python.
 
+To speed up the downloads, you can use:
+
+```bash
+wget --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar
+```
+
 If you're having issues downloading the collection via `wget`, try using [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
 For example, to download passage collection:
 

From 928488dab26ca7843bfa40e54d700f7490ccc667 Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Wed, 21 May 2025 10:15:56 -0400
Subject: [PATCH 2/8] fix to use passage instead of documents

---
 docs/experiments-msmarco-v2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index f35c0e5ac..4c793f2b0 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -6,7 +6,7 @@ Note that Pyserini provides a [comparable guide](https://github.com/castorini/py
 To speed up the downloads, you can use:
 
 ```bash
-wget --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar
+wget --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar
 ```
 
 If you're having issues downloading the collection via `wget`, try using [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
From aa84e8b72d530d1a330bf6992b09f215df1f013a Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Thu, 22 May 2025 11:13:13 -0400
Subject: [PATCH 3/8] fix augmented passage url

---
 docs/experiments-msmarco-v2.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index 4c793f2b0..41958f079 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -13,7 +13,7 @@ If you're having issues downloading the collection via `wget`, try using [AzCopy
 For example, to download passage collection:
 
 ```bash
-azcopy copy https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_passage.tar ./collections
+azcopy copy https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar ./collections
 ```
 
 The speedup using `azcopy` is significant compared to `wget`, but the actual downloading time will vary based on your location as well as many other factors. Azcopy will ask you to login to your Microsoft account with a code generated in the terminal.
@@ -88,7 +88,7 @@ The passage corpus contains only passage texts; it is missing additional informa
 This information is available in the document collection, and we have written [a Python script](https://github.com/castorini/pyserini/blob/master/scripts/msmarco_v2/augment_passage_corpus.py) to augment the passage collection with these additional fields (specifically `url`, `title`, `headings`).
 
 For convenience, this augmented corpus is being distributed as part of the MS MARCO dataset as part of "additional resources", `msmarco_v2_passage_augmented.tar` (21 GB, MD5 checksum of `69acf3962608b614dbaaeb10282b2ab8`).
-The tarball can be downloaded [here](https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_passage_augmented.tar).
+The tarball can be downloaded [here](https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage_augmented.tar).
 Once again, we recommend downloading with [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
 
 Indexing this augmented collection:

From d91413d0abff6491d031543c7c109db4593e90bb Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Thu, 22 May 2025 16:02:28 -0400
Subject: [PATCH 4/8] add unpacking for augmented and note about no decompression

---
 docs/experiments-msmarco-v2.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index 41958f079..be02bd013 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -27,6 +27,7 @@ Download and unpack the collection into `collections/`.
 tar -xvf collections/msmarco_v2_passage.tar -C ./collections
 ```
 
+There's no need to uncompress the files, as Anserini can directly index gzipped files.
 Here's the indexing command for the passage collection, which is 21 GB compressed:
 
 ```bash
@@ -91,6 +92,12 @@ For convenience, this augmented corpus is being distributed as part of the MS MA
 The tarball can be downloaded [here](https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage_augmented.tar).
 Once again, we recommend downloading with [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
 
+Unpack the collection into `collections/`.
+
+```bash
+tar -xvf collections/msmarco_v2_passage_augmented.tar -C ./collections
+```
+
 Indexing this augmented collection:
 
 ```bash

From cbcefd3e4b3042315f7aab51840ba8a67fbaabcf Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Thu, 22 May 2025 16:10:20 -0400
Subject: [PATCH 5/8] no more machine

---
 docs/experiments-msmarco-v2.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index be02bd013..723a76a00 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -162,7 +162,6 @@ bin/run.sh io.anserini.index.IndexCollection -collection MsMarcoV2DocCollection
 ```
 
 Same instructions as above.
-On the same machine as described above, indexing takes around 40 minutes.
 The complete index with the above configuration occupies 134 GB (11,959,635 documents).
 Index size can be reduced by removing the options `-storePositions`, `-storeDocvectors`, `-storeRaw` as appropriate.
 For reference:

From aa5f1402ff0c18d91e9b38417dd2112685aa7746 Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Wed, 28 May 2025 10:38:40 -0400
Subject: [PATCH 6/8] update urls

---
 docs/experiments-msmarco-v2.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index 723a76a00..4c8f232b0 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -150,7 +150,7 @@ We see that adding these additional fields gives a nice bump to effectiveness.
 
 ## Document Collection
 
-Download and unpack the collection into `collections/`.
+[Download](https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar) and unpack the collection into `collections/`.
 Here's the indexing command for the document collection, which is 33 GB compressed:
 
 ```bash
@@ -221,7 +221,7 @@ Sentence chunking is performed with spaCy (v2.3.5); the version is important if
 We have also experimented with _not_ trimming each document to the first 10k characters; the collection becomes much bigger and the results become worse on the dev queries below.
 
 For convenience, this segmented corpus is being distributed as part of the MS MARCO dataset as part of "additional resources", `msmarco_v2_doc_segmented.tar` (26 GB, MD5 checksum of `f18c3a75eb3426efeb6040dca3e885dc`).
-The tarball can be downloaded [here](https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc_segmented.tar).
+The tarball can be downloaded [here](https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc_segmented.tar).
 Once again, we recommend downloading with [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
 
 The segmented document collection can be indexed with the following command:

From 3b44ef81781a462bf331cc84953c86ac68af1296 Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Wed, 28 May 2025 12:11:00 -0400
Subject: [PATCH 7/8] add repro log

---
 docs/experiments-msmarco-v2.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index 4c8f232b0..03fd22768 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -284,3 +284,4 @@ As we can see, even as first-stage retrieval (i.e., without reranking), retrieva
 + Results reproduced by [@crystina-z](https://github.com/crystina-z) on 2021-06-25 (commit [`dbc71ee`](https://github.com/castorini/anserini/commit/dbc71ee51fc7dbcdcb9118c9f7ad554b8b753a27))
 + Results reproduced by [@t-k-](https://github.com/t-k-) on 2021-07-29 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c))
 + Results reproduced by [@vincent-4](https://github.com/vincent-4) on 2025-04-07 (commit [`1af285b`](https://github.com/castorini/anserini/commit/1af285b610364acd0fb7a692e66b4cf432ddf7df))
++ Results reproduced by [@lilyjge](https://github.com/lilyjge) on 2025-04-07 (commit [`bd4c3c7`](https://github.com/castorini/anserini/commit/bd4c3c78823e26bf5ea2ae81a89ab69e1b630575))
\ No newline at end of file

From ff894ef687f448c2744f85f50cff95479608acab Mon Sep 17 00:00:00 2001
From: lilyjge
Date: Wed, 28 May 2025 16:19:06 -0400
Subject: [PATCH 8/8] fix date

---
 docs/experiments-msmarco-v2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/experiments-msmarco-v2.md b/docs/experiments-msmarco-v2.md
index 03fd22768..e4e65359c 100644
--- a/docs/experiments-msmarco-v2.md
+++ b/docs/experiments-msmarco-v2.md
@@ -284,4 +284,4 @@ As we can see, even as first-stage retrieval (i.e., without reranking), retrieva
 + Results reproduced by [@crystina-z](https://github.com/crystina-z) on 2021-06-25 (commit [`dbc71ee`](https://github.com/castorini/anserini/commit/dbc71ee51fc7dbcdcb9118c9f7ad554b8b753a27))
 + Results reproduced by [@t-k-](https://github.com/t-k-) on 2021-07-29 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c))
 + Results reproduced by [@vincent-4](https://github.com/vincent-4) on 2025-04-07 (commit [`1af285b`](https://github.com/castorini/anserini/commit/1af285b610364acd0fb7a692e66b4cf432ddf7df))
-+ Results reproduced by [@lilyjge](https://github.com/lilyjge) on 2025-04-07 (commit [`bd4c3c7`](https://github.com/castorini/anserini/commit/bd4c3c78823e26bf5ea2ae81a89ab69e1b630575))
\ No newline at end of file
++ Results reproduced by [@lilyjge](https://github.com/lilyjge) on 2025-05-28 (commit [`bd4c3c7`](https://github.com/castorini/anserini/commit/bd4c3c78823e26bf5ea2ae81a89ab69e1b630575))
\ No newline at end of file
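
The guide touched by this series lists MD5 checksums for the augmented and segmented tarballs, so a download can be checked before unpacking. A minimal sketch, assuming GNU `md5sum` is on the `PATH` (macOS ships `md5` instead); `verify_md5` is a hypothetical helper written for illustration and is not part of Anserini:

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Anserini): compare a file's MD5 digest
# against an expected value. Assumes GNU coreutils `md5sum` is installed.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Usage, once the tarballs are in collections/ (checksums as quoted in the guide):
# verify_md5 collections/msmarco_v2_passage_augmented.tar 69acf3962608b614dbaaeb10282b2ab8
# verify_md5 collections/msmarco_v2_doc_segmented.tar f18c3a75eb3426efeb6040dca3e885dc
```

The helper returns a non-zero status on mismatch, so it can gate the `tar -xvf` step in a script, for example `verify_md5 ... && tar -xvf ...`.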