Voting operator for retrievers and rankers
Let's build a pipeline using the voting * and union | operators.
from cherche import data, rank, retrieve
from sentence_transformers import SentenceTransformer
The first step is to define the corpus on which we will perform the neural search. The towns dataset contains about a hundred documents, each with four attributes: an id, the title of the article, the url, and the content of the article (stored under the article key).
documents = data.load_towns()
documents[:4]
[{'id': 0, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'}, {'id': 1, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."}, {'id': 2, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'}, {'id': 3, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017.'}]
We start by creating a retriever whose job is to quickly filter the documents. The retriever matches the query against the title and article fields of each document, as specified by the on parameter.
retriever = retrieve.TfIdf(
    key="id", on=["title", "article"], documents=documents, k=100
)
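To make the role of the key, on, and k parameters concrete, here is a minimal sketch of a field-based filter. It is a toy stand-in, not TF-IDF and not cherche's implementation: `toy_retrieve` and `toy_docs` are hypothetical names, and scoring by token overlap is an assumption made only for illustration.

```python
def toy_retrieve(query, documents, key, on, k):
    """Rank documents by how many query tokens appear in the fields named in `on`."""
    query_tokens = set(query.lower().split())
    scored = []
    for doc in documents:
        # Only the fields listed in `on` are searched; other fields are ignored.
        text = " ".join(str(doc[field]) for field in on).lower()
        overlap = len(query_tokens & set(text.split()))
        if overlap > 0:
            scored.append({key: doc[key], "similarity": overlap})
    # Highest overlap first, then keep at most k candidates.
    scored.sort(key=lambda d: d["similarity"], reverse=True)
    return scored[:k]


# Hypothetical miniature corpus shaped like the towns documents.
toy_docs = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Lyon", "article": "Lyon is known for its gastronomy."},
    {"id": 2, "title": "Paris", "article": "The football club is based in Paris."},
]

results = toy_retrieve("Paris football", toy_docs, key="id", on=["title", "article"], k=2)
print(results)
```

Restricting on to ["title"] would leave only the title available for matching, so the two Paris documents would tie.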
Voting
We will use two pre-trained models as rankers and combine them with the voting operator *.
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=30,
) * rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer(
        "sentence-transformers/multi-qa-mpnet-base-cos-v1"
    ).encode,
    k=30,
)
search = retriever + ranker
search.add(documents)
Encoder ranker: 100%|████████| 2/2 [00:02<00:00, 1.29s/it]
Encoder ranker: 100%|████████| 2/2 [00:02<00:00, 1.26s/it]
TfIdf retriever
    key      : id
    on       : title, article
    documents: 105
Vote
-----
Encoder ranker
    key       : id
    on        : title, article
    normalize : True
    embeddings: 105
Encoder ranker
    key       : id
    on        : title, article
    normalize : True
    embeddings: 105
-----
The similarity score returned by the pipeline is the average of the similarity scores of the models. Scores are normalized for each model before they are combined.
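The idea of averaging per-model normalized scores can be sketched in plain Python. This is a minimal illustration of the principle, not cherche's internal code: `normalize` and `vote` are hypothetical helpers, and dividing by the sum of scores is only one possible normalization, so the numbers it produces will not match the output of the pipeline.

```python
from collections import defaultdict


def normalize(results):
    """Scale one model's scores so they sum to 1 (an assumed normalization)."""
    total = sum(r["similarity"] for r in results)
    if not total:
        return {}
    return {r["id"]: r["similarity"] / total for r in results}


def vote(*ranked_lists):
    """Average the normalized score of each document across all models."""
    combined = defaultdict(float)
    for results in ranked_lists:
        for doc_id, score in normalize(results).items():
            # A document missing from a model simply contributes 0 for that model.
            combined[doc_id] += score / len(ranked_lists)
    merged = [{"id": doc_id, "similarity": score} for doc_id, score in combined.items()]
    return sorted(merged, key=lambda d: d["similarity"], reverse=True)


# Hypothetical outputs of two rankers for the same query.
model_a = [{"id": 20, "similarity": 0.75}, {"id": 24, "similarity": 0.25}]
model_b = [{"id": 20, "similarity": 0.5}, {"id": 16, "similarity": 0.5}]

voted = vote(model_a, model_b)
print(voted)
```

A document ranked highly by both models ends up ahead of one that only a single model favours, which is the behaviour voting is meant to provide.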
search("Paris football")
[{'id': 20, 'similarity': 2.064516129032258}, {'id': 24, 'similarity': 1.0625}, {'id': 16, 'similarity': 0.7254901960784313}, {'id': 21, 'similarity': 0.5606060606060606}, {'id': 56, 'similarity': 0.4540540540540541}, {'id': 22, 'similarity': 0.3904761904761905}, {'id': 1, 'similarity': 0.33699633699633696}, {'id': 0, 'similarity': 0.3055555555555556}, {'id': 41, 'similarity': 0.27485380116959063}, {'id': 2, 'similarity': 0.24761904761904763}, {'id': 25, 'similarity': 0.2202797202797203}, {'id': 6, 'similarity': 0.21666666666666667}, {'id': 3, 'similarity': 0.19732441471571907}, {'id': 23, 'similarity': 0.18285714285714286}, {'id': 35, 'similarity': 0.17588652482269504}, {'id': 14, 'similarity': 0.15555555555555556}, {'id': 33, 'similarity': 0.1507177033492823}, {'id': 8, 'similarity': 0.13968957871396898}, {'id': 7, 'similarity': 0.1369047619047619}, {'id': 42, 'similarity': 0.1246923707957342}, {'id': 32, 'similarity': 0.12414965986394558}, {'id': 17, 'similarity': 0.11201079622132254}, {'id': 54, 'similarity': 0.10846560846560846}, {'id': 9, 'similarity': 0.10532915360501567}, {'id': 27, 'similarity': 0.10238095238095238}, {'id': 55, 'similarity': 0.0625}, {'id': 57, 'similarity': 0.058823529411764705}, {'id': 51, 'similarity': 0.05}, {'id': 46, 'similarity': 0.04}, {'id': 45, 'similarity': 0.037037037037037035}, {'id': 19, 'similarity': 0.023255813953488372}, {'id': 13, 'similarity': 0.0196078431372549}, {'id': 18, 'similarity': 0.017241379310344827}, {'id': 39, 'similarity': 0.01694915254237288}, {'id': 37, 'similarity': 0.016666666666666666}]
search("speciality Lyon")
[{'id': 52, 'similarity': 2.064516129032258}, {'id': 49, 'similarity': 1.0606060606060606}, {'id': 56, 'similarity': 0.7291666666666666}, {'id': 45, 'similarity': 0.5555555555555556}, {'id': 48, 'similarity': 0.45882352941176474}, {'id': 41, 'similarity': 0.38738738738738737}, {'id': 54, 'similarity': 0.3322259136212624}, {'id': 47, 'similarity': 0.3026315789473684}, {'id': 50, 'similarity': 0.27100271002710025}, {'id': 53, 'similarity': 0.25}, {'id': 42, 'similarity': 0.23896103896103896}, {'id': 51, 'similarity': 0.21014492753623187}, {'id': 46, 'similarity': 0.1982905982905983}, {'id': 55, 'similarity': 0.18831168831168832}, {'id': 44, 'similarity': 0.18095238095238095}, {'id': 43, 'similarity': 0.1689291101055807}, {'id': 67, 'similarity': 0.1675531914893617}, {'id': 63, 'similarity': 0.15192743764172334}, {'id': 69, 'similarity': 0.1437246963562753}, {'id': 29, 'similarity': 0.13773584905660377}, {'id': 74, 'similarity': 0.1286231884057971}, {'id': 35, 'similarity': 0.12727272727272726}, {'id': 37, 'similarity': 0.1172316384180791}, {'id': 57, 'similarity': 0.11703703703703704}, {'id': 70, 'similarity': 0.11407407407407408}, {'id': 28, 'similarity': 0.10714285714285714}, {'id': 93, 'similarity': 0.10114942528735632}, {'id': 32, 'similarity': 0.047619047619047616}, {'id': 40, 'similarity': 0.038461538461538464}, {'id': 36, 'similarity': 0.034482758620689655}, {'id': 90, 'similarity': 0.0196078431372549}, {'id': 81, 'similarity': 0.017543859649122806}, {'id': 68, 'similarity': 0.016666666666666666}]
We can automatically map document identifiers to their content.
search += documents
search("Paris football")[:3]
[{'id': 20, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.', 'similarity': 2.064516129032258}, {'id': 24, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The 1938 and 1998 FIFA World Cups, the 2007 Rugby World Cup, as well as the 1960, 1984 and 2016 UEFA European Championships were also held in the city.', 'similarity': 1.0625}, {'id': 16, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris received 12.', 'similarity': 0.7254901960784313}]
search("speciality Lyon")[:3]
[{'id': 52, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': 'Economically, Lyon is a major centre for banking, as well as for the chemical, pharmaceutical and biotech industries.', 'similarity': 2.064516129032258}, {'id': 49, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': 'Lyon was historically an important area for the production and weaving of silk.', 'similarity': 1.0606060606060606}, {'id': 56, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': "It ranked second in France and 40th globally in Mercer's 2019 liveability rankings.", 'similarity': 0.7291666666666666}]
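The mapping from identifiers to content shown above can be pictured as a dictionary lookup keyed on id. The sketch below is an assumption about the general idea, not the library's implementation; `map_documents` and `toy_docs` are hypothetical names.

```python
def map_documents(results, documents, key="id"):
    """Attach the stored fields of each document to its ranked result."""
    # Build the lookup once so each result is resolved in constant time.
    index = {doc[key]: doc for doc in documents}
    # Merge document fields with the result, keeping the similarity score.
    return [{**index[r[key]], **r} for r in results if r[key] in index]


# Hypothetical miniature corpus and ranked output.
toy_docs = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 2, "title": "Lyon", "article": "Lyon is known for its gastronomy."},
]
ranked = [{"id": 2, "similarity": 0.9}, {"id": 0, "similarity": 0.4}]

enriched = map_documents(ranked, toy_docs)
print(enriched)
```

The ranking order is preserved; only the fields of each result are extended.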
Voting is also compatible with retrievers
retriever = retrieve.TfIdf(
    key="id", on=["title", "article"], documents=documents, k=100
) * retrieve.Lunr(key="id", on=["title", "article"], documents=documents, k=100)
search = retriever + documents
search("Paris football")[:3]
TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 362.83it/s]
Lunr retriever: 100%|███████| 1/1 [00:00<00:00, 545.57it/s]
[{'id': 20, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.', 'similarity': 2.0238095238095237}, {'id': 16, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris received 12.', 'similarity': 1.0235294117647058}, {'id': 7, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': "Opened in 1900, the city's subway system, the Paris Métro, serves 5.", 'similarity': 0.6893939393939393}]
search("speciality Lyon")[:3]
TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 279.47it/s]
Lunr retriever: 100%|███████| 1/1 [00:00<00:00, 820.16it/s]
[{'id': 10, 'title': 'Paris', 'url': 'https://en.wikipedia.org/wiki/Paris', 'article': 'Paris is especially known for its museums and architectural landmarks: the Louvre remained the most-visited museum in the world with 2,677,504 visitors in 2020, despite the long museum closings caused by the COVID-19 virus.', 'similarity': 1.0}, {'id': 44, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': 'Lyon and 58 suburban municipalities have formed since 2015 the Metropolis of Lyon, a directly elected metropolitan authority now in charge of most urban issues, with a population of 1,385,927 in 2017.', 'similarity': 0.6974358974358974}, {'id': 41, 'title': 'Lyon', 'url': 'https://en.wikipedia.org/wiki/Lyon', 'article': 'Lyon or Lyons (UK: , US: , French: [ljɔ̃] (listen); Arpitan: Liyon, pronounced [ʎjɔ̃]) is the third-largest city and second-largest urban area of France.', 'similarity': 0.5303030303030303}]