TfIdf

TfIdf retriever based on cosine similarities.
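
To make "based on cosine similarities" concrete, here is a minimal pure-Python sketch of TF-IDF scoring. It is not the retriever's implementation: the helper names (tokenize, fit_idf, vectorize, cosine) are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula is an assumption borrowed from common vectorizer defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real vectorizers tokenize more carefully.
    return text.lower().split()


def fit_idf(corpus):
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    n = len(corpus)
    df = Counter(term for text in corpus for term in set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def vectorize(text, idf):
    # Term frequency times IDF, then L2-normalized; unknown terms are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    # Both vectors have unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


corpus = ["Paris Eiffel tower", "Montreal Montreal is in Canada."]
idf = fit_idf(corpus)
doc_vectors = [vectorize(text, idf) for text in corpus]
query = vectorize("paris", idf)
scores = [cosine(query, d) for d in doc_vectors]  # one score per document
```

A document that shares no term with the query scores zero, which is why such documents are not treated as matches.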

Parameters

  • key (str)

    Field identifier of each document.

  • on (Union[str, list])

    Fields to use to match the query to the documents.

  • documents (List[Dict[str, str]]) – defaults to None

    Documents in the TfIdf retriever are static. The retriever must be reset to index new documents.

  • tfidf (sklearn.feature_extraction.text.sparse.TfidfVectorizer) – defaults to None

    A sparse.TfidfVectorizer instance from Sklearn, used to create a custom TfIdf retriever.

  • k (Optional[int]) – defaults to None

    Number of documents to retrieve. If None, all documents that match the query are retrieved.

  • batch_size (int) – defaults to 1024

  • fit (bool) – defaults to True

Examples

>>> from pprint import pprint as print
>>> from cherche import retrieve
>>> from lenlp import sparse

>>> documents = [
...     {"id": 0, "title": "Paris", "article": "Eiffel tower"},
...     {"id": 1, "title": "Montreal", "article": "Montreal is in Canada."},
...     {"id": 2, "title": "Paris", "article": "Eiffel tower"},
...     {"id": 3, "title": "Montreal", "article": "Montreal is in Canada."},
... ]

>>> retriever = retrieve.TfIdf(
...     key="id",
...     on=["title", "article"],
...     documents=documents,
... )

>>> documents = [
...     {"id": 4, "title": "Paris", "article": "Eiffel tower"},
...     {"id": 5, "title": "Montreal", "article": "Montreal is in Canada."},
...     {"id": 6, "title": "Paris", "article": "Eiffel tower"},
...     {"id": 7, "title": "Montreal", "article": "Montreal is in Canada."},
... ]

>>> retriever = retriever.add(documents)

>>> print(retriever(q=["paris", "canada"], k=4))
[[{'id': 6, 'similarity': 0.5404109029445249},
  {'id': 0, 'similarity': 0.5404109029445249},
  {'id': 2, 'similarity': 0.5404109029445249},
  {'id': 4, 'similarity': 0.5404109029445249}],
 [{'id': 7, 'similarity': 0.3157669764669935},
  {'id': 5, 'similarity': 0.3157669764669935},
  {'id': 3, 'similarity': 0.3157669764669935},
  {'id': 1, 'similarity': 0.3157669764669935}]]

>>> print(retriever(["unknown", "montreal paris"], k=2))
[[],
 [{'id': 7, 'similarity': 0.7391866872635209},
  {'id': 5, 'similarity': 0.7391866872635209}]]

>>> print(retriever(q="paris"))
[{'id': 6, 'similarity': 0.5404109029445249},
 {'id': 0, 'similarity': 0.5404109029445249},
 {'id': 2, 'similarity': 0.5404109029445249},
 {'id': 4, 'similarity': 0.5404109029445249}]

Methods

__call__

Retrieve documents for a batch of queries.

Parameters

  • q (Union[str, List[str]])
  • k (Optional[int]) – defaults to None
    Number of documents to retrieve. If None, all documents that match the query are retrieved.
  • batch_size (Optional[int]) – defaults to None
  • tqdm_bar (bool) – defaults to True
  • kwargs
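
The sketch below illustrates how q and batch_size might interact, assuming queries are processed in consecutive chunks of at most batch_size. The names iter_batches, retrieve_all, and fake_search are hypothetical and are not part of the library; fake_search stands in for the real similarity search.

```python
def iter_batches(queries, batch_size):
    # Yield consecutive slices of at most batch_size queries.
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


def retrieve_all(q, search_batch, batch_size=2):
    # Accept a single string or a list of strings, as the q parameter does.
    queries = [q] if isinstance(q, str) else list(q)
    results = []
    for batch in iter_batches(queries, batch_size):
        results.extend(search_batch(batch))
    # A single string yields one result list rather than a list of lists.
    return results[0] if isinstance(q, str) else results


def fake_search(batch):
    # Hypothetical stand-in for the real search: one result list per query.
    return [[{"id": len(query)}] for query in batch]


single = retrieve_all("paris", fake_search)
many = retrieve_all(["paris", "canada", "x"], fake_search)
```

This mirrors the shapes in the examples above: a string query returns a flat list of documents, while a list of queries returns one list per query.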

add

Add new documents to the TfIdf retriever. The tfidf vectorizer is not refitted.

Parameters

  • documents (list)
    Documents in the TfIdf retriever are static. The retriever must be reset to index new documents.
  • batch_size (int) – defaults to 100000
  • tqdm_bar (bool) – defaults to False
  • kwargs
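
One likely consequence of not refitting is that added documents are encoded with the vocabulary learned at fit time, so terms unseen at that point contribute nothing. The sketch below illustrates this under that assumption. The transform helper and the hard-coded vocabulary are hypothetical, and raw counts replace TF-IDF weights to keep it short.

```python
from collections import Counter

# Vocabulary learned when the retriever was first fitted (illustrative values).
vocabulary = {"paris": 0, "eiffel": 1, "tower": 2}


def transform(text, vocabulary):
    # Count only terms already in the fitted vocabulary; others are ignored.
    counts = Counter(t for t in text.lower().split() if t in vocabulary)
    row = [0] * len(vocabulary)
    for term, count in counts.items():
        row[vocabulary[term]] = count
    return row


index = [transform("Paris Eiffel tower", vocabulary)]

# Adding a document later reuses the same vocabulary: "louvre" and "museum"
# have no column, so they cannot be matched by a query.
index.append(transform("Paris Louvre museum", vocabulary))
```

If the new documents introduce important vocabulary, rebuilding the retriever on the full collection is the safer option.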

top_k

Return the top k documents for each query.

Parameters

  • similarities (scipy.sparse._csc.csc_matrix)
  • k (int)
    Number of documents to retrieve. If None, all documents that match the query are retrieved.
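
As a rough illustration of top-k selection, the sketch below ranks dense Python lists rather than a scipy sparse matrix. The function shown is a hypothetical stand-in, not the library's top_k, and it assumes a document matches when its similarity is greater than zero.

```python
def top_k(scores, keys, k=None):
    # Keep only documents with a positive similarity, i.e. those that match.
    matches = [(score, key) for score, key in zip(scores, keys) if score > 0]
    # Highest similarity first.
    matches.sort(key=lambda m: m[0], reverse=True)
    # Slicing with k=None keeps every match.
    return [{"id": key, "similarity": score} for score, key in matches[:k]]


best_two = top_k([0.2, 0.0, 0.9, 0.5], keys=[0, 1, 2, 3], k=2)
all_matches = top_k([0.2, 0.0, 0.9, 0.5], keys=[0, 1, 2, 3])
no_match = top_k([0.0, 0.0], keys=[0, 1], k=2)
```

A query with no matching document produces an empty list, as with the "unknown" query in the examples above.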

References

  1. sklearn.feature_extraction.text.sparse.TfidfVectorizer
  2. Python: tf-idf-cosine: to find document similarity