Encoder as a retriever

With small corpora in particular, a user's query may match few or no documents when retrieval relies on exact term overlap. Neural search helps here: the encoder can act as a fallback that locates relevant documents when a traditional retriever fails to do so.

In [1]:

from cherche import retrieve, rank, data
from sentence_transformers import SentenceTransformer

Let's load a dummy dataset:

In [2]:

documents = data.load_towns()
documents[:2]

Out[2]:

[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."}]

First, we search with TfIdf to show that its ability to retrieve documents can be limited.

In [3]:

retriever = retrieve.TfIdf(key="id", on=["article", "title"], documents=documents)
retriever

Out[3]:

TfIdf retriever
	key      : id
	on       : article, title
	documents: 105

Only two documents match the query "food" with the default TfIdf, even though we ask for up to ten (k=10).

In [4]:

retriever("food", k=10)

Out[4]:

[{'id': 96, 'similarity': 0.057060669878117906},
 {'id': 20, 'similarity': 0.02514090300945658}]
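
To see why a purely lexical retriever can come back nearly empty-handed, here is a minimal sketch in plain Python. It is not cherche's TfIdf implementation: the `lexical_search` helper and the three toy documents are hypothetical, and scoring is plain word overlap rather than TF-IDF weights.

```python
import re

# Hypothetical toy corpus; the texts paraphrase the towns dataset.
toy_documents = [
    {"id": 0, "article": "Lyon is recognised for its cuisine and gastronomy."},
    {"id": 1, "article": "Montreal is a centre of tourism, food and fashion."},
    {"id": 2, "article": "Bordeaux is one of the centers of gastronomy."},
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_search(query, documents, k=10):
    """Return up to k documents sharing at least one word with the query."""
    query_tokens = tokenize(query)
    matches = []
    for document in documents:
        overlap = query_tokens & tokenize(document["article"])
        if overlap:
            matches.append({"id": document["id"], "similarity": len(overlap)})
    matches.sort(key=lambda match: match["similarity"], reverse=True)
    return matches[:k]


lexical_hits = lexical_search("food", toy_documents)
print(lexical_hits)  # only the document that literally contains "food"
```

The two documents about gastronomy are invisible to this kind of matching because they share no word with the query, whatever the value of `k`.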

We can now compare these results with retrieve.Encoder backed by a Sentence Transformers (Sentence-BERT) model. The add method takes time because the retriever computes an embedding for every document.

In [5]:

retriever = retrieve.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
)

retriever.add(documents=documents)

Encoder index creation: 100%|█| 2/2 [00:02<00:00,  1.30s/it

Out[5]:

Encoder retriever
	key      : id
	on       : title, article
	documents: 105

The encoder recalls more documents for the query "food", including some that do not contain the word itself. The top-ranked ones seem relevant.

In [6]:

retriever("food", k=5)

Out[6]:

[{'id': 48, 'similarity': 0.3757082873324092},
 {'id': 66, 'similarity': 0.3735201261683402},
 {'id': 96, 'similarity': 0.37012889770913526},
 {'id': 16, 'similarity': 0.3682042586662517},
 {'id': 49, 'similarity': 0.3594711511884871}]

In [7]:

pipeline = retriever + documents
pipeline("food", k=5)

Out[7]:

[{'id': 48,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.",
  'similarity': 0.3757082873324092},
 {'id': 66,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.',
  'similarity': 0.3735201261683402},
 {'id': 96,
  'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 0.37012889770913526},
 {'id': 16,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris received 12.',
  'similarity': 0.3682042586662517},
 {'id': 49,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon was historically an important area for the production and weaving of silk.',
  'similarity': 0.3594711511884871}]
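
The idea behind an encoder-based retriever can be sketched without any model: embed every document once, then rank documents by cosine similarity with the query embedding. Everything below is a hypothetical illustration and not the cherche API. In particular, `toy_encoder` stands in for a sentence encoder by counting words from hand-written concept groups, whereas a real model learns such associations from data.

```python
import math
import re

# Hypothetical stand-in for a sentence encoder: each dimension counts the
# words that belong to one hand-written concept group.
CONCEPTS = [
    {"food", "cuisine", "gastronomy"},
    {"football", "rugby", "club"},
    {"silk", "weaving"},
]


def toy_encoder(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [float(sum(word in concept for word in words)) for concept in CONCEPTS]


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


class ToyEncoderRetriever:
    """Illustrative only; not the cherche implementation."""

    def __init__(self, key, on, encoder):
        self.key, self.on, self.encoder = key, on, encoder
        self.index = {}

    def add(self, documents):
        # The costly step: one encoder call per document, done once up front.
        for document in documents:
            text = " ".join(document[field] for field in self.on)
            self.index[document[self.key]] = self.encoder(text)
        return self

    def __call__(self, q, k=5):
        query_embedding = self.encoder(q)
        # No filtering: nearest-neighbour search returns the k closest
        # documents, whether or not they are actually relevant.
        scored = [
            {self.key: key, "similarity": cosine(query_embedding, embedding)}
            for key, embedding in self.index.items()
        ]
        scored.sort(key=lambda result: result["similarity"], reverse=True)
        return scored[:k]


toy_documents = [
    {"id": 0, "title": "Lyon", "article": "Known for its cuisine and gastronomy."},
    {"id": 1, "title": "Paris", "article": "Home to a football club and a rugby club."},
    {"id": 2, "title": "Lyon", "article": "Historically important for weaving silk."},
]

toy_retriever = ToyEncoderRetriever(key="id", on=["title", "article"], encoder=toy_encoder)
toy_retriever.add(toy_documents)
dense_hits = toy_retriever("food", k=5)
print(dense_hits)
```

The document about cuisine ranks first although it never contains the word "food". Because the search always returns the k nearest documents, weakly related entries can also appear near the bottom of the list.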

We can combine the two approaches with the union operator | to build a neural search pipeline that benefits from both TfIdf precision and Sentence Transformers recall.

In [8]:

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode

In [9]:

# Precision pipeline
precision = retrieve.TfIdf(
    key="id", on=["article", "title"], documents=documents
) + rank.Encoder(key="id", on=["title", "article"], encoder=encoder)

# Recall pipeline
recall = retrieve.Encoder(key="id", on=["title", "article"], encoder=encoder)

search = precision | recall

search.add(documents=documents)

Encoder ranker: 100%|████████| 2/2 [00:02<00:00,  1.37s/it]
Encoder index creation: 100%|█| 2/2 [00:02<00:00,  1.31s/it

Out[9]:

Union Pipeline
-----
TfIdf retriever
	key      : id
	on       : article, title
	documents: 105
Encoder ranker
	key       : id
	on        : title, article
	normalize : True
	embeddings: 105
Encoder retriever
	key      : id
	on       : title, article
	documents: 105
-----

The pipeline returns documents from the precision pipeline first, followed by documents from the recall pipeline.

In [10]:

search("food", k=100)[:3]

TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 740.78it/s]
Ranker scoring: 1it [00:00, 10407.70it/s]
Ranker sorting: 1it [00:00, 15196.75it/s]
Encoder retriever: 100%|█████| 1/1 [00:00<00:00, 19.30it/s]

Out[10]:

[{'id': 96, 'similarity': 2.4},
 {'id': 20, 'similarity': 1.0206185567010309},
 {'id': 48, 'similarity': 0.3333333333333333}]
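
The ordering of a union can be sketched in plain Python. This is an assumption about the merge order only (precision results first, then recall results that have not been seen yet). The `union` helper is hypothetical and does not reproduce the similarity scores that cherche assigns in the output above.

```python
def union(*ranked_lists, key="id"):
    """Concatenate ranked result lists, keeping the first occurrence of each key."""
    seen = set()
    merged = []
    for ranked in ranked_lists:
        for result in ranked:
            if result[key] not in seen:
                seen.add(result[key])
                merged.append(result)
    return merged


# Ids taken from the outputs shown earlier; scores are omitted on purpose.
precision_results = [{"id": 96}, {"id": 20}]
recall_results = [{"id": 48}, {"id": 66}, {"id": 96}, {"id": 16}, {"id": 49}]

merged_results = union(precision_results, recall_results)
print([result["id"] for result in merged_results])
# [96, 20, 48, 66, 16, 49]
```

Document 96 is returned by both pipelines but appears only once, at the position given by the precision pipeline.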

In [11]:

# Map documents to the pipeline.
search = search + documents
search("food", k=100)[:3]

TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 898.33it/s]
Ranker scoring: 1it [00:00, 16644.06it/s]
Ranker sorting: 1it [00:00, 20460.02it/s]
Encoder retriever: 100%|█████| 1/1 [00:00<00:00, 18.98it/s]

Out[11]:

[{'id': 96,
  'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 2.4},
 {'id': 20,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.',
  'similarity': 1.0206185567010309},
 {'id': 48,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.",
  'similarity': 0.3333333333333333}]
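
Mapping documents onto a pipeline, as in `search + documents`, can be thought of as a lookup by key. The sketch below is illustrative rather than cherche's implementation, and `map_documents` is a hypothetical helper.

```python
def map_documents(results, documents, key="id"):
    """Attach each document's fields to the matching search result."""
    by_key = {document[key]: document for document in documents}
    # Result fields are merged last so that "similarity" is preserved.
    return [{**by_key.get(result[key], {}), **result} for result in results]


toy_documents = [
    {"id": 96, "title": "Montreal"},
    {"id": 48, "title": "Lyon"},
]
toy_results = [{"id": 48, "similarity": 0.33}, {"id": 96, "similarity": 2.4}]

mapped = map_documents(toy_results, toy_documents)
print(mapped)
```

The order of the results is left untouched; only the document fields are added to each entry.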