Encoder as a retriever
With small corpora in particular, a user's query may match few or no documents when a lexical retriever is used. Neural search helps here: the encoder can act as a fallback and locate relevant documents that traditional retrievers fail to find.
from cherche import retrieve, rank, data
from sentence_transformers import SentenceTransformer
Let's load a dummy dataset.
documents = data.load_towns()
documents[:2]
[{'id': 0, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'}, {'id': 1, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."}]
First, we search with a TfIdf retriever to show that its ability to retrieve documents can be limited.
retriever = retrieve.TfIdf(key="id", on=["article", "title"], documents=documents)
retriever
TfIdf retriever key : id on : article, title documents: 105
Only two documents match the query "food" using the default TfIdf.
retriever("food", k=10)
[{'id': 96, 'similarity': 0.057060669878117906}, {'id': 20, 'similarity': 0.02514090300945658}]
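To illustrate why a lexical retriever returns so few results here, the following is a minimal sketch of whole-word matching. It is a simplification and not the TfIdf retriever's actual vectorizer; the helper names `tokenize` and `lexical_matches` are hypothetical, and the two texts are shortened snippets from the towns dataset.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Shortened snippets from documents 48 and 96 of the towns dataset.
corpus = {
    48: "The city is recognised for its cuisine and gastronomy",
    96: "tourism, food, fashion, video game development",
}


def lexical_matches(query, corpus):
    """Return the ids of documents sharing at least one token with the query."""
    query_tokens = tokenize(query)
    return [doc_id for doc_id, text in corpus.items() if query_tokens & tokenize(text)]


# Only the document that literally contains "food" is matched.
print(lexical_matches("food", corpus))  # [96]
```

The document about cuisine and gastronomy is missed because it shares no token with the query, even though it is on topic.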
We can now compare these results with retrieve.Encoder, using a Sentence Transformers model. The add method takes time because the retriever computes an embedding for every document.
retriever = retrieve.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
)
retriever.add(documents=documents)
Encoder index creation: 100%|█| 2/2 [00:02<00:00, 1.30s/it
Encoder retriever key : id on : title, article documents: 105
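The cost of indexing comes from embedding each document once, up front. The sketch below shows that idea in plain Python under several assumptions: `toy_encoder` is a hypothetical stand-in for a real model, the fields are joined with a space, and the batch size of 64 is a guess (it would split 105 documents into two batches, in line with the progress bar shown here). None of this is taken from cherche's implementation.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_encoder(texts):
    """Hypothetical stand-in for a real encoder: one tiny vector per text."""
    return [[float(len(text)), float(len(text.split()))] for text in texts]


def build_index(documents, key, on, encoder, batch_size=64):
    """Embed every document once and store the vectors by document key."""
    index = {}
    for batch in batches(documents, batch_size):
        # Assumption: the indexed fields are joined into one string per document.
        texts = [" ".join(doc.get(field, "") for field in on) for doc in batch]
        for doc, vector in zip(batch, encoder(texts)):
            index[doc[key]] = vector
    return index


# 105 placeholder documents, the same count as the towns dataset.
toy_documents = [
    {"id": i, "title": f"Town {i}", "article": "Some text."} for i in range(105)
]
index = build_index(toy_documents, key="id", on=["title", "article"], encoder=toy_encoder)
print(len(index))  # 105
```

Once the vectors are stored, a query only needs one encoder call, which is why searching is much faster than indexing.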
As can be seen, the encoder recalls more documents, even when they do not contain the word "food". Several of them, such as those mentioning cuisine and gastronomy, are relevant.
retriever("food", k=5)
[{'id': 48, 'similarity': 0.3757082873324092}, {'id': 66, 'similarity': 0.3735201261683402}, {'id': 96, 'similarity': 0.37012889770913526}, {'id': 16, 'similarity': 0.3682042586662517}, {'id': 49, 'similarity': 0.3594711511884871}]
pipeline = retriever + documents
pipeline("food", k=5)
[{'id': 48, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.", 'similarity': 0.3757082873324092}, {'id': 66, 'title': 'Bordeaux', 'url': 'https://en.wikipedia.org/wiki/Bordeaux', 'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.', 'similarity': 0.3735201261683402}, {'id': 96, 'title': 'Montreal', 'url': 'https://en.wikipedia.org/wiki/Montreal', 'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.', 'similarity': 0.37012889770913526}, {'id': 16, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris received 12.', 'similarity': 0.3682042586662517}, {'id': 49, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': 'Lyon was historically an important area for the production and weaving of silk.', 'similarity': 0.3594711511884871}]
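The ranking above can be pictured as a nearest-neighbour search over embeddings. The following sketch assumes cosine similarity as the scoring function, which this page does not confirm, and uses hand-written 3-dimensional vectors that are invented for illustration only; real embeddings come from the Sentence Transformers model. The names `cosine` and `dense_search` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example (not model outputs).
toy_embeddings = {
    "gastronomy": [0.9, 0.1, 0.0],
    "silk weaving": [0.1, 0.9, 0.2],
    "rugby club": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.1]  # hypothetical embedding of the query "food"


def dense_search(query_vector, embeddings, k):
    """Score every document against the query and return the k best."""
    scored = [
        {"id": key, "similarity": cosine(query_vector, vector)}
        for key, vector in embeddings.items()
    ]
    scored.sort(key=lambda doc: doc["similarity"], reverse=True)
    return scored[:k]


print([doc["id"] for doc in dense_search(query_vector, toy_embeddings, k=2)])
# ['gastronomy', 'silk weaving']
```

Every document receives a score, so the retriever always returns k results. This explains both the higher recall and the presence of weaker matches near the bottom of the list.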
We can build a neural search pipeline that combines TfIdf precision with Sentence Transformers recall using the union operator |.
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode
# Precision pipeline
precision = retrieve.TfIdf(
    key="id", on=["article", "title"], documents=documents
) + rank.Encoder(key="id", on=["title", "article"], encoder=encoder)
# Recall pipeline
recall = retrieve.Encoder(key="id", on=["title", "article"], encoder=encoder)
search = precision | recall
search.add(documents=documents)
Encoder ranker: 100%|████████| 2/2 [00:02<00:00, 1.37s/it] Encoder index creation: 100%|█| 2/2 [00:02<00:00, 1.31s/it
Union Pipeline ----- TfIdf retriever key : id on : article, title documents: 105 Encoder ranker key : id on : title, article normalize : True embeddings: 105 Encoder retriever key : id on : title, article documents: 105 -----
Our pipeline first proposes documents from the precision pipeline, then documents from the recall pipeline.
search("food", k=100)[:3]
TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 740.78it/s] Ranker scoring: 1it [00:00, 10407.70it/s] Ranker sorting: 1it [00:00, 15196.75it/s] Encoder retriever: 100%|█████| 1/1 [00:00<00:00, 19.30it/s]
[{'id': 96, 'similarity': 2.4}, {'id': 20, 'similarity': 1.0206185567010309}, {'id': 48, 'similarity': 0.3333333333333333}]
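The ordering produced by the union can be sketched as follows. This is a plain-Python illustration of "precision results first, then unseen recall results", not cherche's implementation; the function name `union` is hypothetical, the ids are copied from the outputs on this page, and similarity scores are left out because the page does not explain how the union rescales them.

```python
def union(first, second, key="id"):
    """Keep every result of `first`, then append unseen results of `second`."""
    seen = set()
    merged = []
    for doc in first + second:
        if doc[key] not in seen:
            seen.add(doc[key])
            merged.append(doc)
    return merged


# Ids returned by the precision pipeline (TfIdf matches, reranked).
precision_results = [{"id": 96}, {"id": 20}]
# Ids returned by the encoder retriever for the same query.
recall_results = [{"id": 48}, {"id": 66}, {"id": 96}, {"id": 16}, {"id": 49}]

print([doc["id"] for doc in union(precision_results, recall_results)])
# [96, 20, 48, 66, 16, 49]
```

Document 96 appears in both lists but is kept only once, at the position given by the precision pipeline.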
# Map documents to the pipeline.
search = search + documents
search("food", k=100)[:3]
TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 898.33it/s] Ranker scoring: 1it [00:00, 16644.06it/s] Ranker sorting: 1it [00:00, 20460.02it/s] Encoder retriever: 100%|█████| 1/1 [00:00<00:00, 18.98it/s]
[{'id': 96, 'title': 'Montreal', 'url': 'https://en.wikipedia.org/wiki/Montreal', 'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.', 'similarity': 2.4}, {'id': 20, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.', 'similarity': 1.0206185567010309}, {'id': 48, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.", 'similarity': 0.3333333333333333}]