Retriever and ranker

This notebook presents a simple neural search pipeline composed of a retriever and a ranker.

In [2]:

from cherche import data, rank, retrieve, utils
from sentence_transformers import SentenceTransformer

The first step is to define the corpus on which we will perform the neural search. The towns dataset contains about a hundred documents. Each document has four attributes: the id, the title of the article, the url and the content of the article.

In [3]:

documents = data.load_towns()
documents[:4]

Out[3]:

[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'id': 3,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017.'}]
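
Before indexing, it can be worth checking that every document carries those four attributes. The sketch below uses a hypothetical helper, `missing_fields`, which is not part of cherche; it only assumes the document shape shown in the output above.

```python
# Hypothetical helper, not part of cherche: report which of the four
# expected attributes a document is missing.
EXPECTED_FIELDS = ("id", "title", "url", "article")


def missing_fields(document: dict) -> list:
    """Return the expected field names that are absent from `document`."""
    return [field for field in EXPECTED_FIELDS if field not in document]


complete = {
    "id": 3,
    "title": "Paris",
    "url": "https://en.wikipedia.org/wiki/Paris",
    "article": "The Paris Region had a GDP of €709 billion ($808 billion) in 2017.",
}
incomplete = {"id": 4, "title": "Paris"}  # made-up document for illustration

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['url', 'article']
```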

We start by creating a retriever whose job is to quickly filter the documents. The on parameter tells it to match queries against the title and content of each article.

In [4]:

retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents)

We then add a ranker to the pipeline to re-order the retrieved documents according to their semantic similarity with the query. The ranker uses the title and content of the article, as set by its on parameter.

In [5]:

ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
)
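
To make the ranking step more concrete, here is a minimal sketch of similarity-based re-ranking in plain Python. It is not cherche's implementation: `toy_encoder` is a hypothetical stand-in for the sentence encoder (it only counts a few hand-picked keywords), and `rerank` is an illustrative name. The idea it shows is the one described above: encode the query and each candidate, score by cosine similarity, and sort.

```python
import math


def toy_encoder(text: str) -> list:
    """Hypothetical stand-in for a sentence encoder: keyword counts."""
    vocabulary = ["football", "club", "tennis", "gdp"]
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [float(tokens.count(word)) for word in vocabulary]


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query: str, candidates: list, on: list, encoder) -> list:
    """Score each candidate against the query and sort by similarity."""
    query_vector = encoder(query)
    scored = []
    for doc in candidates:
        text = " ".join(doc[field] for field in on)  # concatenate the `on` fields
        scored.append({"id": doc["id"], "similarity": cosine(query_vector, encoder(text))})
    return sorted(scored, key=lambda result: result["similarity"], reverse=True)


candidates = [
    {"id": 3, "title": "Paris", "article": "The Paris Region had a GDP of €709 billion ($808 billion) in 2017."},
    {"id": 22, "title": "Paris", "article": "Paris hosts the annual French Open Grand Slam tennis tournament on the red clay of Roland Garros."},
    {"id": 20, "title": "Paris", "article": "The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris."},
]

ranked = rerank("paris football", candidates, on=["title", "article"], encoder=toy_encoder)
print(ranked[0]["id"])  # 20
```

A real sentence encoder captures meaning rather than keyword counts, so its scores will differ; only the encode, compare, sort structure carries over.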

We initialise the pipeline, then ask the retriever to index the documents and the ranker to pre-compute the document embeddings. This step can take some time if you have a lot of documents, in which case a GPU can speed up the embedding computation. The embeddings will be stored in the encoder.pkl file.

In [6]:

search = retriever + ranker
search.add(documents)

Encoder ranker: 100%|████████| 2/2 [00:02<00:00,  1.33s/it]

Out[6]:

TfIdf retriever
	key      : id
	on       : title, article
	documents: 105
Encoder ranker
	key       : id
	on        : title, article
	normalize : True
	embeddings: 105

Let's call our model to retrieve documents related to football in Paris. The search pipeline provides a similarity score for each document. The documents are sorted in order of relevance, from most similar to least similar.

In [7]:

search("paris football", k=30)

Out[7]:

[{'id': 20, 'similarity': 0.7220986},
 {'id': 16, 'similarity': 0.48418275},
 {'id': 21, 'similarity': 0.47666836},
 {'id': 56, 'similarity': 0.47011483},
 {'id': 22, 'similarity': 0.45666158},
 {'id': 1, 'similarity': 0.44948608},
 {'id': 0, 'similarity': 0.44595104},
 {'id': 2, 'similarity': 0.4206621},
 {'id': 25, 'similarity': 0.4146704},
 {'id': 6, 'similarity': 0.41367412},
 {'id': 3, 'similarity': 0.4131328},
 {'id': 23, 'similarity': 0.41079015},
 {'id': 14, 'similarity': 0.37518078},
 {'id': 51, 'similarity': 0.37361926},
 {'id': 7, 'similarity': 0.37052304},
 {'id': 8, 'similarity': 0.36798736},
 {'id': 17, 'similarity': 0.35948235},
 {'id': 9, 'similarity': 0.34356856},
 {'id': 13, 'similarity': 0.33688956},
 {'id': 12, 'similarity': 0.31458178},
 {'id': 15, 'similarity': 0.3111611},
 {'id': 53, 'similarity': 0.30873594},
 {'id': 5, 'similarity': 0.30330563},
 {'id': 52, 'similarity': 0.30239156},
 {'id': 10, 'similarity': 0.2945645},
 {'id': 19, 'similarity': 0.2915255},
 {'id': 94, 'similarity': 0.28307498},
 {'id': 11, 'similarity': 0.27992725},
 {'id': 4, 'similarity': 0.276568},
 {'id': 18, 'similarity': 0.20204495}]
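
Because the output is a plain list of dictionaries, it is easy to post-process. As a small sketch, the hypothetical helper below keeps only results above a similarity threshold; the value 0.45 is an arbitrary choice for illustration, and the sample results are the first six entries of the output above.

```python
# First six results copied from the output above.
results = [
    {"id": 20, "similarity": 0.7220986},
    {"id": 16, "similarity": 0.48418275},
    {"id": 21, "similarity": 0.47666836},
    {"id": 56, "similarity": 0.47011483},
    {"id": 22, "similarity": 0.45666158},
    {"id": 1, "similarity": 0.44948608},
]


def above_threshold(results: list, threshold: float) -> list:
    """Keep results whose similarity is at least `threshold`, preserving order."""
    return [result for result in results if result["similarity"] >= threshold]


kept = above_threshold(results, threshold=0.45)  # 0.45 is an arbitrary cut-off
print([result["id"] for result in kept])  # [20, 16, 21, 56, 22]
```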

The retriever we use is a bit too basic: the word aerospace appears in the corpus, but aero does not. A retriever that matches whole words only is therefore unable to retrieve relevant documents for the query aero.

In [8]:

search("aero", k=30)  # Aerospace

Out[8]:

[{'id': 67, 'similarity': 0.32282117},
 {'id': 29, 'similarity': 0.30668122},
 {'id': 31, 'similarity': 0.2690589},
 {'id': 96, 'similarity': 0.027692636}]

We can improve the retrieval by indexing character n-grams rather than whole words, using the analyzer and ngram_range parameters of scikit-learn's TfidfVectorizer. This change reduces the retriever's precision but increases its recall.

In [9]:

from sklearn.feature_extraction.text import TfidfVectorizer

retriever = retrieve.TfIdf(
    key="id",
    on=["title", "article"],
    documents=documents,
    tfidf=TfidfVectorizer(ngram_range=(4, 10), analyzer="char_wb", max_df=0.3),
)

search = retriever + ranker
search.add(documents)

Encoder ranker: 100%|████████| 2/2 [00:02<00:00,  1.32s/it]

Out[9]:

TfIdf retriever
	key      : id
	on       : title, article
	documents: 105
Encoder ranker
	key       : id
	on        : title, article
	normalize : True
	embeddings: 105
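
To see why character n-grams help with a query such as aero, the sketch below builds n-grams by hand. `char_wb_ngrams` is a simplified, hypothetical re-implementation of the idea behind analyzer="char_wb" (each word is padded with a space and cut into substrings); it does not reproduce scikit-learn's exact behaviour for very short words.

```python
def char_wb_ngrams(text: str, min_n: int, max_n: int) -> set:
    """Character n-grams taken inside each space-padded word (simplified)."""
    ngrams = set()
    for word in text.lower().split():
        padded = f" {word} "  # pad so that word boundaries become part of the n-grams
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                ngrams.add(padded[start:start + n])
    return ngrams


query_ngrams = char_wb_ngrams("aero", min_n=4, max_n=10)
document_ngrams = char_wb_ngrams("aerospace", min_n=4, max_n=10)

# Whole-word matching finds nothing in common...
print("aero" in {"aerospace"})  # False

# ...but the two words share several character n-grams.
shared = sorted(query_ngrams & document_ngrams)
print(shared)  # [' aer', ' aero', 'aero']
```

Those shared n-grams give the TF-IDF vectors of the query and the document a non-zero overlap, which is what lets the retriever return aerospace documents for the query aero.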

In [10]:

search("paris football", k=30)

Out[10]:

[{'id': 20, 'similarity': 0.7220986},
 {'id': 24, 'similarity': 0.5216039},
 {'id': 16, 'similarity': 0.48418275},
 {'id': 21, 'similarity': 0.47666836},
 {'id': 56, 'similarity': 0.47011483},
 {'id': 22, 'similarity': 0.45666158},
 {'id': 1, 'similarity': 0.44948608},
 {'id': 0, 'similarity': 0.44595104},
 {'id': 2, 'similarity': 0.4206621},
 {'id': 25, 'similarity': 0.4146704},
 {'id': 6, 'similarity': 0.41367412},
 {'id': 3, 'similarity': 0.4131328},
 {'id': 23, 'similarity': 0.41079015},
 {'id': 14, 'similarity': 0.37518078},
 {'id': 7, 'similarity': 0.37052304},
 {'id': 8, 'similarity': 0.36798736},
 {'id': 17, 'similarity': 0.35948235},
 {'id': 9, 'similarity': 0.34356856},
 {'id': 13, 'similarity': 0.33688956},
 {'id': 12, 'similarity': 0.31458178},
 {'id': 15, 'similarity': 0.3111611},
 {'id': 5, 'similarity': 0.30330563},
 {'id': 10, 'similarity': 0.2945645},
 {'id': 19, 'similarity': 0.2915255},
 {'id': 11, 'similarity': 0.27992725},
 {'id': 4, 'similarity': 0.276568},
 {'id': 43, 'similarity': 0.2750644},
 {'id': 96, 'similarity': 0.21408883},
 {'id': 18, 'similarity': 0.20204495},
 {'id': 79, 'similarity': 0.09676781}]
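
One way to see what changed is to compare the ids returned by the two retrievers for the same query. The sketch below copies the ids from Out[7] and Out[10] and lists which documents entered or left the top 30; the variable names are illustrative.

```python
# Ids copied from Out[7] (first retriever) and Out[10] (character n-grams).
ids_first = [20, 16, 21, 56, 22, 1, 0, 2, 25, 6, 3, 23, 14, 51, 7,
             8, 17, 9, 13, 12, 15, 53, 5, 52, 10, 19, 94, 11, 4, 18]
ids_char_ngrams = [20, 24, 16, 21, 56, 22, 1, 0, 2, 25, 6, 3, 23, 14, 7,
                   8, 17, 9, 13, 12, 15, 5, 10, 19, 11, 4, 43, 96, 18, 79]

first_set, char_set = set(ids_first), set(ids_char_ngrams)

# Documents that only the character n-gram retriever surfaced, in ranked order.
gained = [doc_id for doc_id in ids_char_ngrams if doc_id not in first_set]
# Documents that dropped out of the top 30.
lost = [doc_id for doc_id in ids_first if doc_id not in char_set]

print(gained)  # [24, 43, 96, 79]
print(lost)    # [51, 53, 52, 94]
```

Document 24 enters directly in second position with a similarity of 0.5216039, which suggests the first retriever had missed a candidate that the ranker considers relevant.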

By working at the character level, we have built a retriever with better recall.

In [11]:

search("aero", k=30)  # Aerospace

Out[11]:

[{'id': 67, 'similarity': 0.32282117},
 {'id': 29, 'similarity': 0.30668122},
 {'id': 31, 'similarity': 0.2690589},
 {'id': 96, 'similarity': 0.027692636}]

Let's map the returned ids back to our documents.

In [12]:

search += documents
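
Conceptually, this mapping amounts to a lookup by id followed by a merge with the similarity score. The helper below, `attach_documents`, is a hypothetical sketch of that idea and not cherche's own code.

```python
def attach_documents(results: list, documents: list, key: str = "id") -> list:
    """Merge each result with the full document that has the same key."""
    index = {doc[key]: doc for doc in documents}  # id -> document lookup
    return [{**index[result[key]], **result} for result in results]


documents = [
    {
        "id": 3,
        "title": "Paris",
        "url": "https://en.wikipedia.org/wiki/Paris",
        "article": "The Paris Region had a GDP of €709 billion ($808 billion) in 2017.",
    },
    {
        "id": 20,
        "title": "Paris",
        "url": "https://en.wikipedia.org/wiki/Paris",
        "article": "The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.",
    },
]
results = [{"id": 20, "similarity": 0.7220986}, {"id": 3, "similarity": 0.4131328}]

enriched = attach_documents(results, documents)
print(list(enriched[0]))  # ['id', 'title', 'url', 'article', 'similarity']
```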

In [13]:

search("paris football", k=10)

Out[13]:

[{'id': 20,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.',
  'similarity': 0.7220986},
 {'id': 16,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris received 12.',
  'similarity': 0.48418275},
 {'id': 21,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The 80,000-seat Stade de France, built for the 1998 FIFA World Cup, is located just north of Paris in the neighbouring commune of Saint-Denis.',
  'similarity': 0.47666836},
 {'id': 22,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris hosts the annual French Open Grand Slam tennis tournament on the red clay of Roland Garros.',
  'similarity': 0.45666158},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts.",
  'similarity': 0.44948608},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.',
  'similarity': 0.4206621},
 {'id': 3,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017.',
  'similarity': 0.4131328},
 {'id': 7,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Opened in 1900, the city's subway system, the Paris Métro, serves 5.",
  'similarity': 0.37052304},
 {'id': 5,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Another source ranked Paris as most expensive, on par with Singapore and Hong Kong, in 2018.',
  'similarity': 0.30330563},
 {'id': 18,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The number of foreign visitors declined by 80.',
  'similarity': 0.20204495}]

In [14]:

search("aero", k=30)  # Aerospace

Out[14]:

[{'id': 67,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'It is a central and strategic hub for the aeronautics, military and space sector, home to international companies such as Dassault Aviation, Ariane Group, Safran and Thalès.',
  'similarity': 0.32282117},
 {'id': 29,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'Toulouse is the centre of the European aerospace industry, with the headquarters of Airbus (formerly EADS), the SPOT satellite system, ATR and the Aerospace Valley.',
  'similarity': 0.30668122},
 {'id': 31,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'Thales Alenia Space, ATR, SAFRAN, Liebherr-Aerospace and Airbus Defence and Space also have a significant presence in Toulouse.',
  'similarity': 0.2690589},
 {'id': 96,
  'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 0.027692636}]