BM25

This page shows how to retrieve documents with BM25. With the BM25 retriever, we must always encode documents before queries, because encoding the documents is what fits the vectorizer; encoding queries first raises an error. BM25 takes two parameters, b and k1. b is a float that controls the impact of document length normalization: the higher the value, the more longer documents are penalized. Its default value is 0.75. k1 is a float that controls how quickly the impact of term frequency saturates: the higher the value, the more influential term frequency is. Its default value is 1.5.
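To make the roles of b and k1 concrete, here is a minimal sketch of the term-frequency part of the textbook Okapi BM25 weight. It is an illustration only: the function name `bm25_term_weight` and its arguments are hypothetical, the IDF factor is left out, and the exact variant computed by the retriever may differ.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency part of the textbook Okapi BM25 weight (IDF omitted).

    Illustrative helper, not part of neural_cherche or lenlp.
    """
    # Length normalization: equals 1 for an average-length document,
    # grows with document length, and is disabled entirely when b = 0.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # Saturating term frequency: the weight approaches k1 + 1 as tf grows.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Same term frequency, but the second document is twice the average length.
average_doc = bm25_term_weight(tf=3, doc_len=10, avg_doc_len=10)
long_doc = bm25_term_weight(tf=3, doc_len=20, avg_doc_len=10)
print(average_doc > long_doc)  # the longer document is penalized

# A higher k1 lets repeated terms keep adding to the weight for longer.
for k1 in (0.5, 1.5, 3.0):
    gain = bm25_term_weight(5, 10, 10, k1=k1) / bm25_term_weight(1, 10, 10, k1=k1)
    print(k1, round(gain, 3))
```

In this sketch, setting b to 0 makes the weight independent of document length, and setting k1 to 0 makes it independent of term frequency.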

from neural_cherche import retrieve
from lenlp import sparse

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(normalize=True, ngram_range=(3, 5), analyzer="char_wb"),
    k1=1.5,
    b=0.75,
    epsilon=0.,
)

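# Encoding the documents also fits the vectorizer,
# so this must run before encode_queries.
documents_embeddings = retriever.encode_documents(
    documents=documents,
)

# Add the encoded documents to the index.
retriever.add(
    documents_embeddings=documents_embeddings,
)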
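
The retriever above is configured with analyzer="char_wb" and ngram_range=(3, 5), which means documents are represented by character n-grams rather than whole words. The sketch below is a rough pure-Python illustration of that idea, assuming the common convention of padding each word with a space and sliding a window over it. The helper `char_wb_ngrams` is hypothetical, the lowercasing step is an assumption, and lenlp's own implementation may differ in its details.

```python
def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Character n-grams taken inside each space-padded word (illustrative only)."""
    min_n, max_n = ngram_range
    ngrams = []
    # Lowercasing is an assumption of this sketch.
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Slide a window of size n over the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("Paris", ngram_range=(3, 3)))
# [' pa', 'par', 'ari', 'ris', 'is ']
```

Because n-grams overlap, a query can still share features with a document when the words do not match exactly, for example with a typo or a different word ending.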

Once the index is built, we can use the retriever to retrieve candidate documents for a set of queries.

queries = [
    "What is the capital of France?",
    "What is the largest city in Quebec?",
    "Where is Bordeaux?",
]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

scores = retriever(
    queries_embeddings=queries_embeddings,
    k=100,
)

scores

[[{'id': 'doc1', 'similarity': 88.86143220961094},
  {'id': 'doc2', 'similarity': 8.409232541918755},
  {'id': 'doc3', 'similarity': 7.134543210268021}],
 [{'id': 'doc2', 'similarity': 107.05374336242676},
  {'id': 'doc1', 'similarity': 9.28911879658699},
  {'id': 'doc3', 'similarity': 1.9025448560714722}],
 [{'id': 'doc3', 'similarity': 18.506150543689728},
  {'id': 'doc1', 'similarity': 0.7961864173412323},
  {'id': 'doc2', 'similarity': 0.7676786482334137}]]
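
The output is a list with one entry per query, each entry being a list of hits with an id and a similarity. As a small sketch of how this structure can be consumed, the snippet below picks the best hit for each query. It uses only plain Python and, to stay self-contained, the scores shown above copied in as a literal.

```python
# Scores copied from the output above: one list of hits per query.
scores = [
    [{"id": "doc1", "similarity": 88.86143220961094},
     {"id": "doc2", "similarity": 8.409232541918755},
     {"id": "doc3", "similarity": 7.134543210268021}],
    [{"id": "doc2", "similarity": 107.05374336242676},
     {"id": "doc1", "similarity": 9.28911879658699},
     {"id": "doc3", "similarity": 1.9025448560714722}],
    [{"id": "doc3", "similarity": 18.506150543689728},
     {"id": "doc1", "similarity": 0.7961864173412323},
     {"id": "doc2", "similarity": 0.7676786482334137}],
]

queries = [
    "What is the capital of France?",
    "What is the largest city in Quebec?",
    "Where is Bordeaux?",
]

# Keep the highest-scoring hit for each query.
top_ids = []
for query, hits in zip(queries, scores):
    best = max(hits, key=lambda hit: hit["similarity"])
    top_ids.append(best["id"])
    print(f"{query} -> {best['id']} ({best['similarity']:.2f})")
```

Using max rather than taking the first element avoids relying on the hits being sorted, although in the output above they are listed in decreasing order of similarity.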