
TfIdf

TfIdf scoring function.

Parameters

  • key (str)

  • on (list | str)

  • documents (list)

  • tfidf_nodes (sklearn.feature_extraction.text.TfidfVectorizer | None) – defaults to None

  • tfidf_documents (sklearn.feature_extraction.text.TfidfVectorizer | None) – defaults to None

  • kwargs

Attributes

  • distinct_documents_encoder

    True if documents and nodes use distinct encoders.

Examples

>>> from neural_tree import trees, scoring
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from pprint import pprint

>>> documents = [
...     {"id": 0, "text": "Paris is the capital of France."},
...     {"id": 1, "text": "Berlin is the capital of Germany."},
...     {"id": 2, "text": "Paris and Berlin are European cities."},
...     {"id": 3, "text": "Paris and Berlin are beautiful cities."},
... ]

>>> tree = trees.Tree(
...    key="id",
...    documents=documents,
...    scoring=scoring.TfIdf(key="id", on=["text"], documents=documents),
...    leaf_balance_factor=1,
...    branch_balance_factor=2,
... )

>>> print(tree)
node 1
    node 10
        leaf 100
        leaf 101
    node 11
        leaf 110
        leaf 111

>>> tree.leafs_to_documents
{'100': [0], '101': [1], '110': [2], '111': [3]}

>>> candidates = tree(
...    queries=["Paris is the capital of France.", "Paris and Berlin are European cities."],
...    k_leafs=2,
...    k=2,
... )

>>> candidates["scores"]
array([[0.99999994, 0.63854915],
       [0.99999994, 0.72823119]])

>>> candidates["leafs"]
array([['100', '101'],
       ['110', '111']], dtype='<U3')

>>> pprint(candidates["tree_scores"])
[{'10': tensor(1.0000),
  '100': tensor(1.0000),
  '101': tensor(0.6385),
  '11': tensor(0.1076)},
 {'10': tensor(0.1076),
  '11': tensor(1.0000),
  '110': tensor(1.0000),
  '111': tensor(0.7282)}]

>>> pprint(candidates["documents"])
[[{'id': 0, 'leaf': '100', 'similarity': 0.9999999999999978},
  {'id': 1, 'leaf': '101', 'similarity': 0.39941742405759667}],
 [{'id': 2, 'leaf': '110', 'similarity': 0.9999999999999978},
  {'id': 3, 'leaf': '111', 'similarity': 0.5385719658738707}]]
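
The leaf scores above (0.6385 and 0.7282) are consistent with cosine similarity between L2-normalized TF-IDF vectors under scikit-learn's default `TfidfVectorizer` settings: lowercasing, tokens of two or more word characters, and smoothed IDF. The sketch below recomputes them in plain Python to show where the numbers come from. It is an illustration under those assumptions, not the library's implementation, and the helper names (`tokenize`, `fit_idf`, `embed`, `cosine`) are hypothetical.

```python
import math
import re
from collections import Counter

# Same four texts as in the example above.
texts = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Paris and Berlin are European cities.",
    "Paris and Berlin are beautiful cities.",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(corpus):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    n = len(corpus)
    df = Counter(term for text in corpus for term in set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def embed(text, idf):
    # Raw term counts weighted by IDF, then L2-normalized.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


idf = fit_idf(texts)
vectors = [embed(text, idf) for text in texts]

print(round(cosine(vectors[0], vectors[1]), 4))  # 0.6385
print(round(cosine(vectors[2], vectors[3]), 4))  # 0.7282
```

The same computation between documents 0 and 2, which share only the term "paris", gives about 0.1076. That equals the value reported for node '11' on the first query, although how node scores are aggregated is not stated here. The `similarity` values under `candidates["documents"]` differ from these leaf scores, so they appear to come from a separate encoder.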

Methods

average

Average embeddings.

Parameters

  • embeddings (scipy.sparse._csr.csr_matrix)

convert_to_tensor

Transform sparse matrix to tensor.

Parameters

  • embeddings (scipy.sparse._csr.csr_matrix)
  • device (str)

encode_queries_for_retrieval

Encode queries for retrieval.

Parameters

  • queries (list[str])

get_retriever

Create a retriever.

leaf_scores

Return the scores of the embeddings.

Parameters

  • queries_embeddings (torch.Tensor)
  • leaf_embedding (torch.Tensor)

nodes_scores

Compute scores between query embeddings and node embeddings.

Parameters

  • queries_embeddings (torch.Tensor)
  • nodes_embeddings (torch.Tensor)

stack

Stack a list of embeddings.

Parameters

  • embeddings (list[scipy.sparse._csr.csr_matrix])

transform_documents

Transform documents to embeddings.

Parameters

  • documents (list[dict])
  • kwargs

transform_queries

Transform queries to embeddings.

Parameters

  • queries (list[str])
  • kwargs