Semanlink automatic tagging and evaluation
This notebook shows how to evaluate a neural search pipeline using pairs of queries and answers. We will automatically tag arXiv papers that François-Paul Servant manually annotated as part of the Semanlink Knowledge Graph.
from pprint import pprint as print
from cherche import data, rank, retrieve, evaluate
from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
documents, query_answers = data.arxiv_tags(
arxiv_title=True, arxiv_summary=False, comment=False
)
The documents contain the list of tags. Each tag is represented as a dictionary with a set of attributes. We will try to automate the tagging of arXiv documents with a neural search pipeline that retrieves tags based on their attributes, using the title, abstract, and comments of the arXiv articles as queries (here, only the titles). For each query, there is a list of relevant document identifiers.
print(query_answers[:2])
[(' Joint Embedding of Words and Labels for Text Classification', [{'uri': 'http://www.semanlink.net/tag/deep_learning_attention'}, {'uri': 'http://www.semanlink.net/tag/arxiv_doc'}, {'uri': 'http://www.semanlink.net/tag/nlp_text_classification'}, {'uri': 'http://www.semanlink.net/tag/label_embedding'}]), (' A Survey on Recent Approaches for Natural Language Processing in ' 'Low-Resource Scenarios', [{'uri': 'http://www.semanlink.net/tag/bosch'}, {'uri': 'http://www.semanlink.net/tag/survey'}, {'uri': 'http://www.semanlink.net/tag/arxiv_doc'}, {'uri': 'http://www.semanlink.net/tag/nlp_low_resource_scenarios'}, {'uri': 'http://www.semanlink.net/tag/low_resource_languages'}])]
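To get a feel for the dataset, we can count the queries and the average number of expected tags per query. A minimal sketch, assuming query_answers is the list of (query, relevant tags) pairs shown above:

# Number of evaluation queries and average number of relevant tags per query.
print(len(query_answers))
print(sum(len(tags) for _, tags in query_answers) / len(query_answers))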
Here is the list of attributes each tag has:
documents[0]
{'prefLabel': ['Attention mechanism'], 'type': ['http://www.semanlink.net/2001/00/semanlink-schema#Tag'], 'broader': ['http://www.semanlink.net/tag/deep_learning'], 'creationTime': '2016-01-07T00:58:24Z', 'creationDate': '2016-01-07', 'comment': 'Good explanation is this [blog post by D. Britz](/doc/?uri=http%3A%2F%2Fwww.wildml.com%2F2016%2F01%2Fattention-and-memory-in-deep-learning-and-nlp%2F). (But the best explanation related to attention is to be found in this [post](/doc/2019/08/transformers_from_scratch_%7C_pet) about Self-Attention.) \r\n\r\nWhile simple Seq2Seq builds a single context vector out of the encoder’s last hidden state, attention creates\r\nshortcuts between the context vector and the entire source input: the context vector has access to the entire input sequence.\r\nThe decoder can “attend” to different parts of the source sentence at each step of the output generation, and the model learns what to attend to based on the input sentence and what it has produced so far.\r\n\r\nPossible to interpret what the model is doing by looking at the Attention weight matrix\r\n\r\nCost: We need to calculate an attention value for each combination of input and output word (D. Britz: -> "attention is a bit of a misnomer: we look at everything in details before deciding what to focus on")\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n', 'uri': 'http://www.semanlink.net/tag/deep_learning_attention', 'broader_prefLabel': ['Deep Learning'], 'broader_related': ['http://www.semanlink.net/tag/feature_learning', 'http://www.semanlink.net/tag/feature_extraction'], 'broader_prefLabel_text': 'Deep Learning', 'prefLabel_text': 'Attention mechanism'}
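Not every tag defines every attribute. Since the retrievers below search over the prefLabel_text and altLabel_text fields, a quick sanity check is to count how many tags actually provide each of them (a small sketch, assuming documents is the list of tag dictionaries loaded above):

# Count how many tags expose each field used by the retrievers.
for field in ["prefLabel_text", "altLabel_text"]:
    print((field, sum(1 for tag in documents if field in tag)))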
Let's evaluate a first pipeline made of a single retriever.
retriever = retrieve.TfIdf(
key="uri",
on=["prefLabel_text", "altLabel_text"],
documents=documents,
tfidf=TfidfVectorizer(
lowercase=True, max_df=0.9, ngram_range=(3, 7), analyzer="char"
),
k=30,
)
evaluate.evaluation(search=retriever, query_answers=query_answers, hits_k=range(6))
TfIdf retriever: 100%|███████| 1/1 [00:00<00:00, 30.78it/s]
{'Precision@1': '63.06%', 'Precision@2': '43.47%', 'Precision@3': '33.12%', 'Precision@4': '26.67%', 'Precision@5': '22.55%', 'Recall@1': '16.79%', 'Recall@2': '22.22%', 'Recall@3': '25.25%', 'Recall@4': '27.03%', 'Recall@5': '28.54%', 'F1@1': '26.52%', 'F1@2': '29.41%', 'F1@3': '28.65%', 'F1@4': '26.85%', 'F1@5': '25.19%', 'R-Precision': '26.95%'}
Let's compare with the Lunr retriever. On this dataset its Precision@1 is a bit lower than TfIdf's, though its recall and F1 at higher cut-offs are slightly better.
retriever = retrieve.Lunr(
key="uri", on=["prefLabel_text", "altLabel_text"], documents=documents, k=30
)
evaluate.evaluation(search=retriever, query_answers=query_answers, hits_k=range(6))
Lunr retriever: 100%|██| 314/314 [00:00<00:00, 2258.93it/s]
{'Precision@1': '60.38%', 'Precision@2': '45.35%', 'Precision@3': '36.92%', 'Precision@4': '31.01%', 'Precision@5': '26.00%', 'Recall@1': '16.22%', 'Recall@2': '23.62%', 'Recall@3': '28.22%', 'Recall@4': '31.23%', 'Recall@5': '32.30%', 'F1@1': '25.57%', 'F1@2': '31.06%', 'F1@3': '31.99%', 'F1@4': '31.12%', 'F1@5': '28.81%', 'R-Precision': '30.95%'}
You can find an explanation of the metrics here. The TfIdf retriever using character n-grams did well.
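As a rough intuition for these metrics (an illustrative sketch, not Cherche's actual implementation): Precision@k is the share of the top-k retrieved tags that are relevant, Recall@k is the share of relevant tags found in the top k, F1@k is their harmonic mean, and R-Precision is the precision at a cut-off equal to the number of relevant tags.

# Illustrative Precision@k, Recall@k and F1@k for a single query.
def metrics_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    true_positives = len(set(top_k) & set(relevant))
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(metrics_at_k(retrieved=["a", "b", "c"], relevant=["a", "c", "d"], k=2))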
Here is what tagging looks like using our retriever:
retriever(
q="ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction",
)
[{'uri': 'http://www.semanlink.net/tag/information_retrieval', 'similarity': 4.147}, {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval', 'similarity': 3.489}, {'uri': 'http://www.semanlink.net/tag/ranking_information_retrieval', 'similarity': 3.489}, {'uri': 'http://www.semanlink.net/tag/embeddings_in_ir', 'similarity': 3.489}, {'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm', 'similarity': 3.489}, {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp', 'similarity': 3.489}, {'uri': 'http://www.semanlink.net/tag/entity_discovery_and_linking', 'similarity': 1.579}, {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval', 'similarity': 1.479}]
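The partial matches above come from overlapping character n-grams between the query and the tag labels. Here is what character n-grams of length 3 to 7 look like with scikit-learn (a standalone sketch on a made-up label, independent of Cherche's internals):

# A few of the character n-grams the vectorizer extracts from a label.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(3, 7), analyzer="char")
vectorizer.fit(["Information retrieval"])
print(sorted(vectorizer.get_feature_names_out())[:10])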
Let's try to improve those results using a ranker.
retriever = retrieve.TfIdf(
key="uri",
on=["prefLabel_text", "altLabel_text"],
documents=documents,
tfidf=TfidfVectorizer(
lowercase=True, max_df=0.9, ngram_range=(3, 7), analyzer="char"
),
k=100,
)
ranker = rank.Encoder(
key="uri",
on=["prefLabel_text", "altLabel_text"],
encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
k=30,
).add(documents)
Encoder ranker: 100%|████████| 7/7 [00:02<00:00, 2.35it/s]
search = retriever + ranker
evaluate.evaluation(search=search, query_answers=query_answers, hits_k=range(6))
TfIdf retriever: 100%|███████| 1/1 [00:00<00:00, 26.88it/s]
{'Precision@1': '62.42%', 'Precision@2': '41.88%', 'Precision@3': '32.27%', 'Precision@4': '26.19%', 'Precision@5': '22.42%', 'Recall@1': '16.87%', 'Recall@2': '22.20%', 'Recall@3': '25.41%', 'Recall@4': '26.88%', 'Recall@5': '28.53%', 'F1@1': '26.56%', 'F1@2': '29.02%', 'F1@3': '28.44%', 'F1@4': '26.53%', 'F1@5': '25.11%', 'R-Precision': '27.20%'}
The SentenceTransformer ranker improved the retriever's results a little: F1@1, Recall@1, and R-Precision all increased slightly.
Here are the tags proposed for the ColBERTv2 paper using the retriever followed by the ranker:
search(
"ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction", k=5
)
[{'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm', 'similarity': 0.54491174}, {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval', 'similarity': 0.42808783}, {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval', 'similarity': 0.42641872}, {'uri': 'http://www.semanlink.net/tag/information_retrieval', 'similarity': 0.40513238}, {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp', 'similarity': 0.32937095}]
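Conceptually, the Encoder ranker embeds the query and the candidate tag labels with the SentenceTransformer model and re-orders the candidates by cosine similarity. A minimal sketch of that idea, with made-up candidate labels (not Cherche's actual code):

from sentence_transformers import util

# Rank a few candidate labels against the query by cosine similarity.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
query = "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction"
labels = ["Information retrieval", "Attention mechanism", "Dense passage retrieval"]
scores = util.cos_sim(model.encode(query), model.encode(labels))[0]
print(sorted(zip(labels, scores.tolist()), key=lambda pair: pair[1], reverse=True))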
Let's try using Flash as a retriever. FlashText retrieves tag labels that appear directly in the title.
retriever = retrieve.Flash(
key="uri",
on=["prefLabel", "altLabel"],
)
search = retriever + ranker
search.add(documents)
Encoder ranker: 100%|████████| 7/7 [00:03<00:00, 2.17it/s]
Flash retriever
    key      : uri
    on       : prefLabel, altLabel
    documents: 604
Encoder ranker
    key       : uri
    on        : prefLabel_text, altLabel_text
    normalize : True
    embeddings: 433
FlashText as a retriever provides fewer candidates than TfIdf but has higher precision.
evaluate.evaluation(search=search, query_answers=query_answers, hits_k=range(6))
Flash retriever: 100%|█| 314/314 [00:00<00:00, 110173.29it/
{'Precision@1': '72.80%', 'Precision@2': '61.90%', 'Precision@3': '59.90%', 'Precision@4': '59.27%', 'Precision@5': '59.37%', 'Recall@1': '16.33%', 'Recall@2': '19.54%', 'Recall@3': '20.11%', 'Recall@4': '20.16%', 'Recall@5': '20.20%', 'F1@1': '26.67%', 'F1@2': '29.71%', 'F1@3': '30.11%', 'F1@4': '30.08%', 'F1@5': '30.15%', 'R-Precision': '20.20%'}
search("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")
[]
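The empty result is expected: FlashText only matches tag labels that appear verbatim in the query, and apparently none of them does in this title. The idea behind this kind of keyword matching can be illustrated with the flashtext library directly (a sketch of the concept, not Cherche's implementation):

from flashtext import KeywordProcessor

# Exact keyword matching: only the labels present in the text are returned.
keywords = KeywordProcessor()
keywords.add_keyword("information retrieval")
keywords.add_keyword("late interaction")
print(keywords.extract_keywords("Retrieval via lightweight late interaction"))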
We can get the best of both worlds with a pipeline union. It gets a bit more involved, but the union lets us keep the best candidates from the first model and then add the candidates from the second model without duplicates (no matter how many models are in the union). Our first retriever and ranker (Flash + Encoder) have low recall and high precision. The second retriever has lower precision but higher recall. So we can mix things up and propose the Flash + ranker candidates first, then the TfIdf + ranker candidates second.
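Under the hood, the union keeps the order of the first pipeline's candidates and appends the candidates of the second pipeline that were not already proposed. A rough sketch of that deduplication, assuming each pipeline returns a list of dictionaries keyed by uri (illustrative only, not Cherche's code):

# Merge two candidate lists, keeping the first occurrence of each uri.
def union(first, second):
    seen, merged = set(), []
    for candidate in first + second:
        if candidate["uri"] not in seen:
            seen.add(candidate["uri"])
            merged.append(candidate)
    return merged

print(union([{"uri": "a"}, {"uri": "b"}], [{"uri": "b"}, {"uri": "c"}]))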
ranker = rank.Encoder(
key="uri",
on=["prefLabel_text", "altLabel_text"],
encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
k=30,
).add(documents)
precision = (
retrieve.Flash(
key="uri",
on=["prefLabel", "altLabel"],
).add(documents)
+ ranker
)
recall = (
retrieve.TfIdf(
key="uri",
on=["prefLabel_text", "altLabel_text"],
documents=documents,
tfidf=TfidfVectorizer(lowercase=True, ngram_range=(3, 7), analyzer="char"),
k=30,
)
+ ranker
)
search = precision | recall
Encoder ranker: 100%|████████| 7/7 [00:03<00:00, 2.25it/s]
evaluate.evaluation(search=search, query_answers=query_answers, hits_k=range(6))
Flash retriever: 100%|█| 314/314 [00:00<00:00, 108022.59it/ TfIdf retriever: 100%|███████| 1/1 [00:00<00:00, 33.39it/s]
{'Precision@1': '69.11%', 'Precision@2': '49.84%', 'Precision@3': '39.07%', 'Precision@4': '31.13%', 'Precision@5': '25.92%', 'Recall@1': '18.74%', 'Recall@2': '25.89%', 'Recall@3': '30.10%', 'Recall@4': '31.58%', 'Recall@5': '32.57%', 'F1@1': '29.49%', 'F1@2': '34.08%', 'F1@3': '34.00%', 'F1@4': '31.35%', 'F1@5': '28.87%', 'R-Precision': '31.99%'}
The union of pipelines did improve the F1 and recall scores.
We could also calculate a voting score between the precision and recall pipelines.
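In the final output below, the similarity scores decrease roughly like reciprocal ranks, so one way to picture the vote is a rank-based fusion of the two candidate lists, where candidates proposed early by both pipelines come out on top. A sketch of that idea only; Cherche's actual scoring may differ:

# Rank-based fusion: candidates ranked high by both pipelines score highest.
def vote(first, second):
    scores = {}
    for candidates in (first, second):
        for position, candidate in enumerate(candidates):
            scores[candidate["uri"]] = scores.get(candidate["uri"], 0.0) + 1 / (position + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(vote([{"uri": "a"}, {"uri": "b"}], [{"uri": "b"}, {"uri": "c"}]))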
ranker = rank.Encoder(
key="uri",
on=["prefLabel_text", "altLabel_text"],
encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
k=30,
).add(documents)
precision = (
retrieve.Flash(
key="uri",
on=["prefLabel", "altLabel"],
).add(documents)
+ ranker
)
recall = (
retrieve.TfIdf(
key="uri",
on=["prefLabel_text", "altLabel_text"],
documents=documents,
tfidf=TfidfVectorizer(lowercase=True, ngram_range=(3, 7), analyzer="char"),
k=30,
)
+ ranker
)
# Vote between the precision-oriented and the recall-oriented pipelines
search = precision * recall
Encoder ranker: 100%|████████| 7/7 [00:03<00:00, 2.15it/s]
evaluate.evaluation(search=search, query_answers=query_answers, hits_k=range(6))
Flash retriever: 100%|█| 314/314 [00:00<00:00, 104774.18it/ TfIdf retriever: 100%|███████| 1/1 [00:00<00:00, 27.62it/s]
{'Precision@1': '69.43%', 'Precision@2': '49.84%', 'Precision@3': '39.07%', 'Precision@4': '31.13%', 'Precision@5': '25.92%', 'Recall@1': '18.81%', 'Recall@2': '25.89%', 'Recall@3': '30.10%', 'Recall@4': '31.58%', 'Recall@5': '32.57%', 'F1@1': '29.60%', 'F1@2': '34.08%', 'F1@3': '34.00%', 'F1@4': '31.35%', 'F1@5': '28.87%', 'R-Precision': '31.99%'}
Here are the tags proposed for the ColBERTv2 paper with the best of both worlds:
search("ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction")
Flash retriever: 100%|█████| 1/1 [00:00<00:00, 1801.68it/s] TfIdf retriever: 100%|██████| 1/1 [00:00<00:00, 685.90it/s]
[{'uri': 'http://www.semanlink.net/tag/retrieval_augmented_lm', 'similarity': 1.0}, {'uri': 'http://www.semanlink.net/tag/neural_models_for_information_retrieval', 'similarity': 0.5}, {'uri': 'http://www.semanlink.net/tag/embeddings_in_ir', 'similarity': 0.3333333333333333}, {'uri': 'http://www.semanlink.net/tag/dense_passage_retrieval', 'similarity': 0.25}, {'uri': 'http://www.semanlink.net/tag/information_retrieval', 'similarity': 0.2}, {'uri': 'http://www.semanlink.net/tag/entity_discovery_and_linking', 'similarity': 0.16666666666666666}, {'uri': 'http://www.semanlink.net/tag/ranking_information_retrieval', 'similarity': 0.14285714285714285}, {'uri': 'http://www.semanlink.net/tag/retrieval_based_nlp', 'similarity': 0.125}, {'uri': 'http://www.semanlink.net/tag/active_learning', 'similarity': 0.1111111111111111}, {'uri': 'http://www.semanlink.net/tag/cognitive_search', 'similarity': 0.1}, {'uri': 'http://www.semanlink.net/tag/contrastive_learning', 'similarity': 0.09090909090909091}, {'uri': 'http://www.semanlink.net/tag/intent_classification_and_slot_filling', 'similarity': 0.08333333333333333}, {'uri': 'http://www.semanlink.net/tag/relational_inductive_biases', 'similarity': 0.07692307692307693}, {'uri': 'http://www.semanlink.net/tag/knowledge_augmented_language_models', 'similarity': 0.07142857142857142}, {'uri': 'http://www.semanlink.net/tag/thought_vector', 'similarity': 0.06666666666666667}, {'uri': 'http://www.semanlink.net/tag/aspect_detection', 'similarity': 0.0625}, {'uri': 'http://www.semanlink.net/tag/generative_adversarial_network', 'similarity': 0.058823529411764705}, {'uri': 'http://www.semanlink.net/tag/bert', 'similarity': 0.05555555555555555}, {'uri': 'http://www.semanlink.net/tag/information_extraction', 'similarity': 0.05263157894736842}, {'uri': 'http://www.semanlink.net/tag/connectionist_vs_symbolic_debate', 'similarity': 0.05}, {'uri': 'http://www.semanlink.net/tag/artificial_human_intelligence', 'similarity': 0.047619047619047616}, {'uri': 'http://www.semanlink.net/tag/good_related_work_section', 'similarity': 0.045454545454545456}, {'uri': 'http://www.semanlink.net/tag/artificial_general_intelligence', 'similarity': 0.043478260869565216}, {'uri': 'http://www.semanlink.net/tag/conscience_artificielle', 'similarity': 0.041666666666666664}, {'uri': 'http://www.semanlink.net/tag/neuroscience_and_ai', 'similarity': 0.04}, {'uri': 'http://www.semanlink.net/tag/introduction', 'similarity': 0.038461538461538464}, {'uri': 'http://www.semanlink.net/tag/constraint_satisfaction_problem', 'similarity': 0.037037037037037035}, {'uri': 'http://www.semanlink.net/tag/out_of_distribution_detection', 'similarity': 0.03571428571428571}, {'uri': 'http://www.semanlink.net/tag/rotate', 'similarity': 0.034482758620689655}, {'uri': 'http://www.semanlink.net/tag/patent_landscaping', 'similarity': 0.03333333333333333}]