Contents
Code
Installation
Usage
msmarco-passage bm25
Code
git clone https://github.com/castorini/pyserini.git --recurse-submodules
pyserini/tools is a git submodule; its repository is anserini-tools.
Installation
https://github.com/castorini/pyserini/blob/master/docs/installation.md
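Pyserini depends on a Java environment, which can be installed directly with conda. After installing, check the Java version with "java --version".
If you only want to run tests against the prebuilt open-source indexes, the PyPI Installation is enough.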
If you are developing on Pyserini itself, you need the Development Installation. Its last step copies the fatjar into pyserini/resources/jars/. There are two ways to obtain the fatjar:
- Build it in the anserini project with "mvn clean package"; the jar is written to anserini/target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar. See https://github.com/castorini/anserini?tab=readme-ov-file#-installation
- Download it directly: https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar (see https://github.com/castorini/anserini/blob/master/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md)
Usage
Default download cache path: ~/.cache/pyserini/
To use a different cache path: export PYSERINI_CACHE=/path/to/cache
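The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the rule "use PYSERINI_CACHE if set, otherwise ~/.cache/pyserini"; resolve_cache_dir is a hypothetical helper, not a Pyserini function, and Pyserini's own resolution logic may differ in detail.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the cache directory following the rule described above.

    Assumption: a non-empty PYSERINI_CACHE overrides the default
    ~/.cache/pyserini location. This helper is hypothetical.
    """
    env = os.environ if env is None else env
    custom = env.get("PYSERINI_CACHE")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".cache" / "pyserini"


if __name__ == "__main__":
    # Prints the directory that would be used in the current shell.
    print(resolve_cache_dir())
```

Running it before and after the export is a quick way to confirm that the variable is visible to the Python process.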
msmarco-passage bm25
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md
Download the dataset
mkdir collections/msmarco-passage
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
Convert the collection to JSONL
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
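To make the target format concrete, here is a minimal sketch of the same conversion in plain Python. It assumes each line of collection.tsv is "docid<TAB>text" and writes one JSON object per line with the "id" and "contents" fields that JsonCollection reads. tsv_to_jsonl is a hypothetical helper; unlike the script above, it writes a single output file instead of splitting the collection into several.

```python
import json


def tsv_to_jsonl(tsv_path, out_path):
    """Convert 'docid<TAB>text' lines into one JSON object per line.

    Returns the number of documents written. Hypothetical helper for
    illustration; not the script shipped in tools/scripts/msmarco.
    """
    count = 0
    with open(tsv_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split only on the first tab so tabs inside the text survive.
            doc_id, text = line.split("\t", 1)
            record = {"id": doc_id, "contents": text}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For the full collection the official script is the better choice, since it shards the output so indexing can read files in parallel.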
Build the index
python -m pyserini.index.lucene \
--collection JsonCollection \
--input collections/msmarco-passage/collection_jsonl \
--index indexes/lucene-index-msmarco-passage \
--generator DefaultLuceneDocumentGenerator \
--threads 9 \
--storePositions --storeDocvectors --storeRaw
# --index is the path where the index is saved
Search
python -m pyserini.search.lucene \
--index indexes/lucene-index-msmarco-passage \
--topics msmarco-passage-dev-subset \
--output runs/run.msmarco-passage.bm25tuned.txt \
--output-format msmarco \
--hits 1000 \
--bm25 --k1 0.82 --b 0.68 \
--threads 4 --batch-size 16
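To show what the --k1 and --b values control, here is a small sketch of a textbook BM25 scoring function. It is an illustration only: the function names and the toy inputs are hypothetical, and Lucene's actual BM25 implementation differs in details such as constant factors and how document lengths are stored.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.82, b=0.68):
    """Score one query term against one document (textbook BM25 variant)."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b sets how strongly long documents are penalized.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: k1 sets how quickly repeated occurrences stop adding score.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


def bm25_score(query_terms, doc_terms, avg_doc_len, n_docs, doc_freqs,
               k1=0.82, b=0.68):
    """Sum per-term scores over the query terms that occur in the document."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        score += bm25_term_score(tf, len(doc_terms), avg_doc_len,
                                 n_docs, doc_freqs[term], k1, b)
    return score
```

With b = 0 the document length has no effect on the score, and a smaller k1 makes the score saturate sooner as the term frequency grows.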
Compute metrics
python -m pyserini.eval.msmarco_passage_eval \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
runs/run.msmarco-passage.bm25tuned.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
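For reference, MRR@10 can be sketched as follows: for each query, take the reciprocal rank of the first relevant passage within the top 10 (0 if there is none), then average over the queries in the qrels. mrr_at_k and its dictionary inputs are hypothetical names for illustration; the official evaluation script also performs input validation that is omitted here.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    qrels: {query_id: set of relevant doc ids}
    run:   {query_id: list of doc ids in ranked order, best first}
    Queries missing from the run contribute a reciprocal rank of 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, docid in enumerate(ranked, start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels) if qrels else 0.0
```

A relevant passage at rank 1 contributes 1.0, at rank 2 contributes 0.5, and anything below rank 10 contributes nothing.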
To compute other metrics with trec_eval, first convert the run file to TREC format; the qrels must be in TREC format as well.
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#evaluation
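The difference between the two run formats is small. A run in MS MARCO format has one "qid<TAB>docid<TAB>rank" line per hit, while a TREC run has "qid Q0 docid rank score tag". The sketch below shows the mapping; msmarco_run_to_trec is a hypothetical helper, and using 1/rank as the score and "anserini" as the tag are assumptions made here because the MS MARCO format carries no score.

```python
def msmarco_run_to_trec(lines, tag="anserini"):
    """Convert 'qid<TAB>docid<TAB>rank' lines to TREC run lines.

    Each output line is 'qid Q0 docid rank score tag'.
    """
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, docid, rank = line.split("\t")
        # Assumption: no score is available, so 1/rank stands in for it.
        score = 1.0 / int(rank)
        converted.append(f"{qid} Q0 {docid} {rank} {score} {tag}")
    return converted
```

For real runs, the conversion step documented at the link above should be preferred; this sketch only shows what changes between the formats.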
Custom datasets
https://github.com/castorini/pyserini/blob/master/docs/usage-index.md