rank.Embedding
The rank.Embedding model uses pre-computed embeddings to re-rank the documents returned by the retriever. If you have a custom model that produces its own embeddings and want to re-rank documents with them, rank.Embedding is the right tool for the job.
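To make the idea of re-ranking with pre-computed embeddings concrete, here is a minimal NumPy sketch. It is not cherche's implementation: `rerank` is a hypothetical helper, the toy embeddings are made-up values, and the `normalize` flag is assumed here to mean "scale vectors to unit length so that the dot product equals cosine similarity".

```python
import numpy as np


def rerank(query_embedding, candidate_ids, embeddings_documents, k=30, normalize=True):
    """Re-rank candidate document ids by similarity to a query embedding.

    Hypothetical helper for illustration; not part of cherche.
    """
    query = np.asarray(query_embedding, dtype=float)
    # Keep only the rows of the candidates proposed by the retriever.
    candidates = np.asarray(embeddings_documents, dtype=float)[candidate_ids]

    if normalize:
        # With unit-length vectors the dot product equals cosine similarity.
        query = query / np.linalg.norm(query)
        candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)

    scores = candidates @ query
    # Sort by decreasing similarity and keep at most k candidates.
    order = np.argsort(-scores)[:k]
    return [
        {"id": int(candidate_ids[i]), "similarity": float(scores[i])} for i in order
    ]


# Toy 2-dimensional embeddings for three documents (made-up values).
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.0])

# Only documents 1 and 2 were proposed by the (imaginary) retriever.
ranked = rerank(toy_query, candidate_ids=[1, 2], embeddings_documents=toy_embeddings)
print(ranked)
```

Document 0 would score highest against this query, but it never appears in the result because the sketch only scores the candidates it is given. This is the sense in which a ranker re-orders the retriever's output rather than searching the whole collection.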
Tutorial
>>> from cherche import retrieve, rank
>>> from sentence_transformers import SentenceTransformer
>>> documents = [
... {
... "id": 0,
... "article": "Paris is the capital and most populous city of France",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 1,
... "article": "Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 2,
... "article": "The City of Paris is the centre and seat of government of the region and province of Île-de-France .",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... }
... ]
# Use a custom encoder to create document embeddings of shape (n_documents, dim_embeddings)
>>> encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
>>> embeddings_documents = encoder.encode([
... document["article"] for document in documents
... ])
>>> queries = ["paris", "art", "fashion"]
# Query embeddings of shape (n_queries, dim_embeddings)
>>> embeddings_queries = encoder.encode(queries)
>>> retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents)
>>> ranker = rank.Embedding(
...     key="id",
...     normalize=True,
... )
>>> ranker = ranker.add(
... documents=documents,
... embeddings_documents=embeddings_documents,
... )
>>> match = retriever(queries, k=100)
# Re-rank the output of the retriever
>>> ranker(q=embeddings_queries, documents=match, k=30)
[[{'id': 0, 'similarity': 0.6560695}, # Query 1
{'id': 1, 'similarity': 0.58203197},
{'id': 2, 'similarity': 0.5283624}],
[{'id': 1, 'similarity': 0.1115652}], # Query 2
[{'id': 1, 'similarity': 0.2555524}, {'id': 2, 'similarity': 0.06398084}]] # Query 3
Map index to documents
We can map the ids returned by the pipeline back to their full documents by adding the documents to the ranker.
>>> ranker += documents
>>> match = retriever(queries, k=100)
>>> ranker(q=embeddings_queries, documents=match, k=30)
[[{'id': 0,
'article': 'Paris is the capital and most populous city of France',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.6560695},
{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.58203197},
{'id': 2,
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.5283624}],
[{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.1115652}],
[{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.2555524},
{'id': 2,
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 0.06398084}]]
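Conceptually, this mapping step amounts to a lookup from id to document followed by a merge with each match. The sketch below illustrates the idea in plain Python. It is an assumption about the general technique, not a description of cherche's internals, and `attach_documents` is a hypothetical helper; the shortened documents and similarity values are made up.

```python
def attach_documents(ranked, documents, key="id"):
    """Merge each ranked match with the full document sharing the same key.

    Hypothetical helper for illustration; not part of cherche.
    """
    # Build the lookup table once: id -> full document.
    index = {document[key]: document for document in documents}
    return [
        # Document fields come first, then the fields of the match (similarity).
        [{**index[match[key]], **match} for match in query_matches]
        for query_matches in ranked
    ]


# Shortened toy documents and made-up similarity scores.
toy_documents = [
    {"id": 0, "title": "Paris"},
    {"id": 1, "title": "Paris"},
]
toy_ranked = [
    [{"id": 1, "similarity": 0.25}, {"id": 0, "similarity": 0.125}],  # Query 1
    [],  # Query 2: the retriever found no candidates
]

enriched = attach_documents(toy_ranked, toy_documents)
print(enriched)
```

Building the lookup dictionary once keeps the merge linear in the number of matches, and a query with no candidates simply yields an empty list.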