BM25¶
Retrieving documents using BM25. We must always encode documents before queries when
using BM25 retriever in order to fit the vectorizer, otherwise the system will raise an error.
BM25 takes as input two distinct parameters, b
and k1
. b is a float value that determines
the impact of document length normalization. The default value is 0.75
. The higher the value, the
more penalized longer documents will be. k1 is a float value that determines how quickly the impact
of term frequency saturates. The default value is 1.5
. The higher the value, the more influential
term frequency will be.
from neural_cherche import retrieve
from lenlp import sparse
documents = [
{"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
{"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
{"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]
retriever = retrieve.BM25(
key="id",
on=["title", "text"],
count_vectorizer=sparse.CountVectorizer(normalize=True, ngram_range=(3, 5), analyzer="char_wb"),
k1=1.5,
b=0.75,
epsilon=0.,
)
documents_embeddings = retriever.encode_documents(
documents=documents,
)
retriever.add(
documents_embeddings=documents_embeddings,
)
Once we have created our index, we can use the retriever to retrieve the candidates.
queries = [
"What is the capital of France?",
"What is the largest city in Quebec?",
"Where is Bordeaux?",
]
queries_embeddings = retriever.encode_queries(
queries=queries,
)
scores = retriever(
queries_embeddings=queries_embeddings,
k=100,
)
scores
[[{'id': 'doc1', 'similarity': 88.86143220961094},
{'id': 'doc2', 'similarity': 8.409232541918755},
{'id': 'doc3', 'similarity': 7.134543210268021}],
[{'id': 'doc2', 'similarity': 107.05374336242676},
{'id': 'doc1', 'similarity': 9.28911879658699},
{'id': 'doc3', 'similarity': 1.9025448560714722}],
[{'id': 'doc3', 'similarity': 18.506150543689728},
{'id': 'doc1', 'similarity': 0.7961864173412323},
{'id': 'doc2', 'similarity': 0.7676786482334137}]]