Retriever and ranker
This notebook presents a simple neural search pipeline composed of a retriever and a ranker.
from cherche import data, rank, retrieve, utils
from sentence_transformers import SentenceTransformer
The first step is to define the corpus on which we will perform the neural search. The towns dataset contains about a hundred documents. Each document has four attributes: the id, the title of the article, the url, and the content of the article.
documents = data.load_towns()
documents[:4]
[{'id': 0, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'}, {'id': 1, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."}, {'id': 2, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'}, {'id': 3, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017.'}]
We start by creating a retriever whose job is to quickly filter the documents. The on parameter tells this retriever to match documents on the title and content of the article.
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents)
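To make the filtering step more concrete, here is a minimal standard-library sketch of word-level TF-IDF retrieval. It is not cherche's implementation: the toy corpus, the smoothed IDF formula, and the helper names (tokenize, weigh, retrieve_ids) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Toy corpus, made up for this sketch (not the towns dataset).
docs = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Toulouse", "article": "Toulouse is the centre of the European aerospace industry."},
    {"id": 2, "title": "Lyon", "article": "Lyon is known for its cuisine and gastronomy."},
]
on = ["title", "article"]  # fields to search, mirroring the `on` parameter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


# One bag of words per document, built from the concatenated `on` fields.
bags = [Counter(tokenize(" ".join(doc[field] for field in on))) for doc in docs]

# Smoothed inverse document frequency: rarer terms get a larger weight.
n_docs = len(bags)
doc_freq = Counter(term for bag in bags for term in bag)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def weigh(bag):
    """Turn raw term counts into TF-IDF weights, dropping unknown terms."""
    return {term: tf * idf[term] for term, tf in bag.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


doc_vectors = [weigh(bag) for bag in bags]


def retrieve_ids(query, k=2):
    """Return the ids of the k best-matching documents with a non-zero score."""
    query_vector = weigh(Counter(tokenize(query)))
    scored = [(cosine(query_vector, vec), doc["id"]) for vec, doc in zip(doc_vectors, docs)]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored if score > 0][:k]


print(retrieve_ids("paris france"))
print(retrieve_ids("aero"))
```

In this sketch a query only matches documents that contain one of its exact words, so a partial word such as aero finds nothing even though aerospace is in the corpus.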
We then add a ranker to the pipeline to re-order the results according to the semantic similarity between the query and the retriever's output documents. The ranker will be based on the title and content of the article.
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
)
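The ranking idea can be sketched without downloading a model. The example below assumes that every document and the query have already been turned into vectors. The three-dimensional embeddings and the rank_candidates helper are hypothetical; a real encoder such as all-mpnet-base-v2 produces much larger vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical pre-computed document embeddings, keyed by document id.
doc_embeddings = {
    0: [0.9, 0.1, 0.0],
    1: [0.4, 0.6, 0.1],
    2: [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query.
query_embedding = [1.0, 0.2, 0.0]


def rank_candidates(query_vector, candidate_ids, embeddings):
    """Score each candidate against the query and sort from most to least similar."""
    scored = [
        {"id": doc_id, "similarity": cosine(query_vector, embeddings[doc_id])}
        for doc_id in candidate_ids
    ]
    return sorted(scored, key=lambda item: item["similarity"], reverse=True)


# The candidate ids would normally come from the retriever, in any order.
ranked = rank_candidates(query_embedding, [2, 1, 0], doc_embeddings)
print(ranked)
```

Only the candidates returned by the retriever are scored, which is what keeps the ranking step affordable on larger corpora.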
We initialise the pipeline, then ask the retriever to index the documents and the ranker to pre-compute the document embeddings. This step can take some time if you have a lot of documents, in which case a GPU can speed up the pre-computation of the embeddings. The embeddings will be stored in the encoder.pkl file.
search = retriever + ranker
search.add(documents)
Encoder ranker: 100%|████████| 2/2 [00:02<00:00, 1.33s/it]
TfIdf retriever key : id on : title, article documents: 105 Encoder ranker key : id on : title, article normalize : True embeddings: 105
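As a rough mental model of how two stages can be chained with the + operator, here is a small sketch based on operator overloading. The Stage, Pipeline, KeywordRetriever and OverlapRanker classes are hypothetical stand-ins written for this example and do not reflect cherche's internals.

```python
class Stage:
    """Base class: `a + b` builds a pipeline that runs a, then b."""

    def __add__(self, other):
        return Pipeline([self, other])


class Pipeline(Stage):
    """Runs each stage in order, feeding the candidates from one to the next."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __add__(self, other):
        return Pipeline(self.stages + [other])

    def __call__(self, query, candidates=None):
        for stage in self.stages:
            candidates = stage(query, candidates)
        return candidates


class KeywordRetriever(Stage):
    """Toy retriever: keeps documents containing at least one query term."""

    def __init__(self, documents):
        self.documents = documents

    def __call__(self, query, candidates=None):
        terms = query.lower().split()
        return [d for d in self.documents if any(t in d["article"].lower() for t in terms)]


class OverlapRanker(Stage):
    """Toy ranker: sorts candidates by the number of query words they contain."""

    def __call__(self, query, candidates):
        terms = set(query.lower().split())

        def score(document):
            return len(terms & set(document["article"].lower().split()))

        return sorted(candidates, key=score, reverse=True)


# Toy corpus, made up for this sketch.
toy_documents = [
    {"id": 0, "article": "paris is the capital of france"},
    {"id": 1, "article": "paris has a football club"},
    {"id": 2, "article": "lyon is known for gastronomy"},
]

toy_search = KeywordRetriever(toy_documents) + OverlapRanker()
print([d["id"] for d in toy_search("paris football")])
```

The retriever narrows the corpus to two candidates and the ranker only re-orders them, which is the same division of labour as in the pipeline above.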
Let's call our model to retrieve documents related to football in Paris. The search pipeline provides a similarity score for each document. The documents are sorted in order of relevance, from most similar to least similar.
search("paris football", k=30)
[{'id': 20, 'similarity': 0.7220986}, {'id': 16, 'similarity': 0.48418275}, {'id': 21, 'similarity': 0.47666836}, {'id': 56, 'similarity': 0.47011483}, {'id': 22, 'similarity': 0.45666158}, {'id': 1, 'similarity': 0.44948608}, {'id': 0, 'similarity': 0.44595104}, {'id': 2, 'similarity': 0.4206621}, {'id': 25, 'similarity': 0.4146704}, {'id': 6, 'similarity': 0.41367412}, {'id': 3, 'similarity': 0.4131328}, {'id': 23, 'similarity': 0.41079015}, {'id': 14, 'similarity': 0.37518078}, {'id': 51, 'similarity': 0.37361926}, {'id': 7, 'similarity': 0.37052304}, {'id': 8, 'similarity': 0.36798736}, {'id': 17, 'similarity': 0.35948235}, {'id': 9, 'similarity': 0.34356856}, {'id': 13, 'similarity': 0.33688956}, {'id': 12, 'similarity': 0.31458178}, {'id': 15, 'similarity': 0.3111611}, {'id': 53, 'similarity': 0.30873594}, {'id': 5, 'similarity': 0.30330563}, {'id': 52, 'similarity': 0.30239156}, {'id': 10, 'similarity': 0.2945645}, {'id': 19, 'similarity': 0.2915255}, {'id': 94, 'similarity': 0.28307498}, {'id': 11, 'similarity': 0.27992725}, {'id': 4, 'similarity': 0.276568}, {'id': 18, 'similarity': 0.20204495}]
The retriever we use is a bit too basic: the word aerospace appears in the corpus, but aero does not. A word-level retriever is therefore unable to retrieve relevant documents for the query aero.
search("aero", k=30) # Aerospace
[{'id': 67, 'similarity': 0.32282117}, {'id': 29, 'similarity': 0.30668122}, {'id': 31, 'similarity': 0.2690589}, {'id': 96, 'similarity': 0.027692636}]
We can improve the retrieval by processing sub-units of words (character n-grams) using the ngram_range parameter of the TfidfVectorizer model. This update to the retriever will reduce its precision but increase its recall.
from sklearn.feature_extraction.text import TfidfVectorizer
retriever = retrieve.TfIdf(
    key="id",
    on=["title", "article"],
    documents=documents,
    tfidf=TfidfVectorizer(ngram_range=(4, 10), analyzer="char_wb", max_df=0.3),
)
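To see why character n-grams help with a query like aero, the sketch below approximates what analyzer="char_wb" does with ngram_range=(4, 10): each word is padded with a space on both sides and cut into overlapping character windows. The char_wb_ngrams function is a simplified illustration written for this example. It ignores n-gram counts and words shorter than the smallest window, so it is not a drop-in replacement for scikit-learn's analyzer.

```python
def char_wb_ngrams(text, min_n=4, max_n=10):
    """Return the set of character n-grams found inside space-padded words."""
    ngrams = set()
    for word in text.lower().split():
        padded = f" {word} "  # word boundaries become part of the n-grams
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                ngrams.add(padded[start:start + n])
    return ngrams


query_ngrams = char_wb_ngrams("aero")
document_ngrams = char_wb_ngrams("aerospace")

# At the word level the two strings are different tokens...
print("aero" == "aerospace")

# ...but at the character level they share several n-grams.
shared = query_ngrams & document_ngrams
print(sorted(shared))
```

The shared n-grams give the TF-IDF retriever a non-zero score for documents mentioning aerospace, at the cost of also matching unrelated words that happen to share character sequences.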
search = retriever + ranker
search.add(documents)
Encoder ranker: 100%|████████| 2/2 [00:02<00:00, 1.32s/it]
TfIdf retriever key : id on : title, article documents: 105 Encoder ranker key : id on : title, article normalize : True embeddings: 105
search("paris football", k=30)
[{'id': 20, 'similarity': 0.7220986}, {'id': 24, 'similarity': 0.5216039}, {'id': 16, 'similarity': 0.48418275}, {'id': 21, 'similarity': 0.47666836}, {'id': 56, 'similarity': 0.47011483}, {'id': 22, 'similarity': 0.45666158}, {'id': 1, 'similarity': 0.44948608}, {'id': 0, 'similarity': 0.44595104}, {'id': 2, 'similarity': 0.4206621}, {'id': 25, 'similarity': 0.4146704}, {'id': 6, 'similarity': 0.41367412}, {'id': 3, 'similarity': 0.4131328}, {'id': 23, 'similarity': 0.41079015}, {'id': 14, 'similarity': 0.37518078}, {'id': 7, 'similarity': 0.37052304}, {'id': 8, 'similarity': 0.36798736}, {'id': 17, 'similarity': 0.35948235}, {'id': 9, 'similarity': 0.34356856}, {'id': 13, 'similarity': 0.33688956}, {'id': 12, 'similarity': 0.31458178}, {'id': 15, 'similarity': 0.3111611}, {'id': 5, 'similarity': 0.30330563}, {'id': 10, 'similarity': 0.2945645}, {'id': 19, 'similarity': 0.2915255}, {'id': 11, 'similarity': 0.27992725}, {'id': 4, 'similarity': 0.276568}, {'id': 43, 'similarity': 0.2750644}, {'id': 96, 'similarity': 0.21408883}, {'id': 18, 'similarity': 0.20204495}, {'id': 79, 'similarity': 0.09676781}]
By working on character n-grams instead of whole words, we have built a retriever with better recall.
search("aero", k=30) # Aerospace
[{'id': 67, 'similarity': 0.32282117}, {'id': 29, 'similarity': 0.30668122}, {'id': 31, 'similarity': 0.2690589}, {'id': 96, 'similarity': 0.027692636}]
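The precision and recall trade-off mentioned earlier can be illustrated with a small calculation. The relevant set and the two candidate lists below are invented for the example and are not measurements on the towns dataset.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth and retriever outputs.
relevant_ids = {1, 2, 3, 4}
word_level_ids = [1, 2]               # few candidates, all of them relevant
char_level_ids = [1, 2, 3, 7, 8, 9]   # more candidates, some of them noise

print(precision_recall(word_level_ids, relevant_ids))
print(precision_recall(char_level_ids, relevant_ids))
```

In a retriever plus ranker pipeline, a loss of precision at the retrieval stage is often acceptable because the ranker pushes the noisy candidates down the list.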
Let's map the ids returned by the pipeline back to the full documents.
search += documents
search("paris football", k=10)
[{'id': 20, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.', 'similarity': 0.7220986}, {'id': 16, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris received 12.', 'similarity': 0.48418275}, {'id': 21, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The 80,000-seat Stade de France, built for the 1998 FIFA World Cup, is located just north of Paris in the neighbouring commune of Saint-Denis.', 'similarity': 0.47666836}, {'id': 22, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris hosts the annual French Open Grand Slam tennis tournament on the red clay of Roland Garros.', 'similarity': 0.45666158}, {'id': 1, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts.", 'similarity': 0.44948608}, {'id': 2, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.', 'similarity': 0.4206621}, {'id': 3, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017.', 'similarity': 0.4131328}, {'id': 7, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Opened in 1900, the city's subway system, the Paris Métro, serves 5.", 'similarity': 0.37052304}, {'id': 5, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Another source ranked Paris as most expensive, on par with Singapore and Hong Kong, in 2018.', 'similarity': 0.30330563}, {'id': 18, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The number of foreign visitors declined by 80.', 'similarity': 0.20204495}]
search("aero", k=30) # Aerospace
[{'id': 67, 'title': 'Bordeaux', 'url': 'https://en.wikipedia.org/wiki/Bordeaux', 'article': 'It is a central and strategic hub for the aeronautics, military and space sector, home to international companies such as Dassault Aviation, Ariane Group, Safran and Thalès.', 'similarity': 0.32282117}, {'id': 29, 'title': 'Toulouse', 'url': 'https://en.wikipedia.org/wiki/Toulouse', 'article': 'Toulouse is the centre of the European aerospace industry, with the headquarters of Airbus (formerly EADS), the SPOT satellite system, ATR and the Aerospace Valley.', 'similarity': 0.30668122}, {'id': 31, 'title': 'Toulouse', 'url': 'https://en.wikipedia.org/wiki/Toulouse', 'article': 'Thales Alenia Space, ATR, SAFRAN, Liebherr-Aerospace and Airbus Defence and Space also have a significant presence in Toulouse.', 'similarity': 0.2690589}, {'id': 96, 'title': 'Montreal', 'url': 'https://en.wikipedia.org/wiki/Montreal', 'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.', 'similarity': 0.027692636}]
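Conceptually, the mapping step is a lookup from each returned id to the stored document, merged with the similarity score. The sketch below shows one way this could be done with a plain dictionary. The attach_documents helper and the toy data are assumptions for illustration, and cherche's own implementation may differ.

```python
# Toy documents and pipeline output, made up for this sketch.
toy_documents = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Toulouse", "article": "Toulouse is a centre of the aerospace industry."},
]
toy_results = [{"id": 1, "similarity": 0.31}, {"id": 0, "similarity": 0.12}]

# Index the documents by id once, so each lookup is a constant-time operation.
index = {doc["id"]: doc for doc in toy_documents}


def attach_documents(results, index):
    """Merge each result with its document, keeping the ranking order."""
    return [{**index[r["id"]], **r} for r in results if r["id"] in index]


mapped = attach_documents(toy_results, index)
print(mapped)
```

Because the merge happens after ranking, the order produced by the ranker is preserved and the similarity score is simply added to each document's fields.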