rank.CrossEncoder¶
The rank.CrossEncoder model re-ranks the documents output by the retriever using a pre-trained cross-encoder. The cross-encoder takes both the query and the document as input and produces a relevance score for the pair. Because each score depends on the query, rank.CrossEncoder can't pre-compute embeddings to speed up search. A GPU will significantly speed up the cross-encoder.
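As a rough illustration of the scoring step, the sketch below re-ranks candidates with a toy scorer. `toy_encoder` and `rerank` are hypothetical names, not part of cherche; the toy function only mimics the call shape of `CrossEncoder(...).predict`, which takes a list of (query, text) pairs and returns one score per pair.

```python
def toy_encoder(pairs):
    """Stand-in for CrossEncoder(...).predict: one score per (query, text) pair.

    The score here is simply the number of query tokens found in the text.
    """
    scores = []
    for query, text in pairs:
        text_tokens = set(text.lower().split())
        hits = sum(token in text_tokens for token in query.lower().split())
        scores.append(float(hits))
    return scores


def rerank(query, candidates, on, encoder, k=None):
    """Score each candidate jointly with the query, then sort by descending score."""
    pairs = [(query, " ".join(doc[field] for field in on)) for doc in candidates]
    scores = encoder(pairs)
    ranked = sorted(
        ({**doc, "similarity": score} for doc, score in zip(candidates, scores)),
        key=lambda doc: doc["similarity"],
        reverse=True,
    )
    return ranked[:k]  # k=None keeps every candidate


candidates = [
    {"id": 0, "title": "Paris", "article": "capital of France"},
    {"id": 1, "title": "Paris", "article": "centre of fashion and arts"},
]
top = rerank("paris fashion", candidates, on=["title", "article"], encoder=toy_encoder)
print([doc["id"] for doc in top])
```

Every pair has to be scored at query time, which is why nothing can be cached per document ahead of the search.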
Requirements¶
To use the CrossEncoder ranker, install "cherche[cpu]":
pip install "cherche[cpu]"
Or, to run on GPU:
pip install "cherche[gpu]"
Documents¶
The cross-encoder needs the content of each document, not only its key, so the documents must be mapped onto the retriever's output before ranking:
search = retriever + documents + cross_encoder
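Conceptually, the mapping step looks up each retrieved key and attaches the stored fields to it. The following is a minimal sketch of that idea, assuming the retriever returns dictionaries holding only the key and a score; `map_documents` is a hypothetical helper, not cherche's API.

```python
def map_documents(matches, documents, key="id"):
    """Attach the stored fields to each retrieved match, looked up by key."""
    index = {doc[key]: doc for doc in documents}
    return [
        [{**index[match[key]], **match} for match in query_matches]
        for query_matches in matches
    ]


documents = [
    {
        "id": 0,
        "title": "Paris",
        "article": "Paris is the capital and most populous city of France",
    },
]
# Assumed shape of a retriever's output: one list of matches per query.
matches = [[{"id": 0, "similarity": 0.5}]]
enriched = map_documents(matches, documents)
print(enriched[0][0]["article"])
```

After this step each match carries the `title` and `article` fields that the ranker reads through its `on` parameter.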
Tutorial¶
>>> from cherche import retrieve, rank
>>> from sentence_transformers import CrossEncoder
>>> documents = [
... {
... "id": 0,
... "article": "Paris is the capital and most populous city of France",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 1,
... "article": "Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 2,
... "article": "The City of Paris is the centre and seat of government of the region and province of Île-de-France .",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... }
... ]
>>> retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents)
>>> # The cross-encoder needs the document contents, so we map the documents onto the retriever's output.
>>> retriever += documents
>>> ranker = rank.CrossEncoder(
...     on=["title", "article"],
...     encoder=CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1").predict,
... )
>>> match = retriever(["paris", "art", "fashion"], k=100)
>>> # Re-rank the output of the retriever.
>>> ranker(["paris", "art", "fashion"], documents=match, k=30)
[[{'id': 0, # Query 1
'article': 'Paris is the capital and most populous city of France',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 6.915566},
{'id': 2,
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 6.651541},
{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 4.3157015}],
[{'id': 1, # Query 2
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -2.9981978}],
[{'id': 2, # Query 3
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -4.356096},
{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -4.7529125}]]
Ranker in pipeline¶
>>> from cherche import retrieve, rank
>>> from sentence_transformers import CrossEncoder
>>> documents = [
... {
... "id": 0,
... "article": "Paris is the capital and most populous city of France",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 1,
... "article": "Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... },
... {
... "id": 2,
... "article": "The City of Paris is the centre and seat of government of the region and province of Île-de-France .",
... "title": "Paris",
... "url": "https://en.wikipedia.org/wiki/Paris"
... }
... ]
>>> retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=100)
>>> ranker = rank.CrossEncoder(
...     on=["title", "article"],
...     encoder=CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1").predict,
... )
>>> # The cross-encoder needs the document contents, so we map the documents onto the retriever's output.
>>> search = retriever + documents + ranker
>>> search(q=["paris", "arts", "fashion"])
[[{'id': 0, # Query 1
'article': 'Paris is the capital and most populous city of France',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 6.915566},
{'id': 2,
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 6.651541},
{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': 4.3157015}],
[{'id': 1, # Query 2
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -2.9981978}],
[{'id': 2, # Query 3
'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France .',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -4.356096},
{'id': 1,
'article': 'Paris has been one of Europe major centres of finance, diplomacy , commerce , fashion , gastronomy , science , and arts.',
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'similarity': -4.7529125}]]
Pre-trained Cross-Encoders¶
Pre-trained cross-encoders are available on the Hugging Face Hub.