Skip to content

ColBERT

TfIdf scoring function.

Parameters

  • key (str)

  • on (list | str)

  • documents (list)

  • model (neural_cherche.models.colbert.ColBERT) – defaults to None

  • device (str) – defaults to cpu

  • kwargs

Attributes

  • distinct_documents_encoder

    Return True if the encoder is distinct for documents and nodes.

Examples

>>> from neural_tree import trees, scoring
>>> from neural_cherche import models
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from pprint import pprint
>>> import torch

>>> _ = torch.manual_seed(42)

>>> documents = [
...     {"id": 0, "text": "Paris is the capital of France."},
...     {"id": 1, "text": "Berlin is the capital of Germany."},
...     {"id": 2, "text": "Paris and Berlin are European cities."},
...     {"id": 3, "text": "Paris and Berlin are beautiful cities."},
... ]

>>> model = models.ColBERT(
...     model_name_or_path="sentence-transformers/all-mpnet-base-v2",
...     embedding_size=128,
...     max_length_document=96,
...     max_length_query=32,
... )

>>> tree = trees.ColBERTTree(
...    key="id",
...    on="text",
...    model=model,
...    documents=documents,
...    leaf_balance_factor=1,
...    branch_balance_factor=2,
...    n_jobs=1,
... )

>>> print(tree)
node 1
    node 10
        leaf 100
        leaf 101
    node 11
        leaf 110
        leaf 111

>>> tree.leafs_to_documents
{'100': [0], '101': [1], '110': [2], '111': [3]}

>>> candidates = tree(
...    queries=["Paris is the capital of France.", "Paris and Berlin are European cities."],
...    k_leafs=2,
...    k=2,
... )

>>> candidates["scores"]
array([[28.12037659, 18.32332611],
       [29.28324509, 21.38923264]])

>>> candidates["leafs"]
array([['100', '101'],
       ['110', '111']], dtype='<U3')

>>> pprint(candidates["tree_scores"])
[{'10': tensor(28.1204),
  '100': tensor(28.1204),
  '101': tensor(18.3233),
  '11': tensor(20.9327)},
 {'10': tensor(21.6886),
  '11': tensor(29.2832),
  '110': tensor(29.2832),
  '111': tensor(21.3892)}]

>>> pprint(candidates["documents"])
[[{'id': 0, 'leaf': '100', 'similarity': 28.120376586914062},
  {'id': 1, 'leaf': '101', 'similarity': 18.323326110839844}],
 [{'id': 2, 'leaf': '110', 'similarity': 29.283245086669922},
  {'id': 3, 'leaf': '111', 'similarity': 21.389232635498047}]]

Methods

average

Average embeddings.

  • embeddings (torch.Tensor)
convert_to_tensor

Transform sparse matrix to tensor.

Parameters

  • embeddings (numpy.ndarray | torch.Tensor)
  • device (str)
encode_queries_for_retrieval

Encode queries for retrieval.

Parameters

  • queries (list[str])
get_retriever

Create a retriever

leaf_scores

Return the scores of the embeddings.

Parameters

  • queries_embeddings (torch.Tensor)
  • leaf_embedding (torch.Tensor)
nodes_scores

Score between queries and nodes embeddings.

Parameters

  • queries_embeddings (torch.Tensor)
  • nodes_embeddings (torch.Tensor)
stack

Stack list of embeddings.

Parameters

  • embeddings (list[torch.Tensor | numpy.ndarray])
transform_documents

Transform documents to embeddings.

Parameters

  • documents (list[dict])
  • batch_size (int)
  • tqdm_bar (bool)
  • kwargs
transform_queries

Transform queries to embeddings.

Parameters

  • queries (list[str])
  • batch_size (int)
  • tqdm_bar (bool)
  • kwargs