SparseEmbed
Retriever class.
Parameters
- key (str)
  Field holding each document's unique identifier.
- on (list[str])
  Field(s) holding the document texts to encode.
- model (models.SparseEmbed)
  SparseEmbed model.
- tokenizer_parallelism (str) – defaults to false
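The `tokenizer_parallelism` parameter is a string rather than a boolean, which suggests it is forwarded to the `TOKENIZERS_PARALLELISM` environment variable read by Hugging Face tokenizers. That forwarding is an assumption, not something this page states. A minimal sketch of setting the variable by hand, before any tokenizer is loaded:

```python
import os

# Assumption: the retriever forwards `tokenizer_parallelism` to this
# environment variable, which Hugging Face tokenizers consult.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer_parallelism = os.environ["TOKENIZERS_PARALLELISM"]
print(tokenizer_parallelism)
```

Disabling tokenizer parallelism is a common way to avoid fork-related warnings when a data loader spawns worker processes.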
Examples
>>> from neural_cherche import models, retrieve
>>> from pprint import pprint
>>> import torch
>>> _ = torch.manual_seed(42)
>>> device = "mps"  # Apple Silicon GPU; use "cpu" or "cuda" on other hardware
>>> model = models.SparseEmbed(
...     model_name_or_path="raphaelsty/neural-cherche-sparse-embed",
...     device=device,
...     embedding_size=64,
... )
>>> retriever = retrieve.SparseEmbed(
...     key="id",
...     on="document",
...     model=model,
... )
>>> documents = [
...     {"id": 0, "document": "Food"},
...     {"id": 1, "document": "Sports"},
...     {"id": 2, "document": "Cinema"},
... ]
>>> queries = ["Food", "Sports", "Cinema"]
>>> documents_embeddings = retriever.encode_documents(
...     documents=documents,
...     batch_size=1,
... )
>>> queries_embeddings = retriever.encode_queries(
...     queries=queries,
...     batch_size=1,
... )
>>> retriever = retriever.add(
...     documents_embeddings=documents_embeddings,
... )
>>> scores = retriever(
...     queries_embeddings=queries_embeddings,
...     batch_size=32
... )
>>> pprint(scores)
[[{'id': 0, 'similarity': 62.01531219482422},
  {'id': 1, 'similarity': 59.01810836791992},
  {'id': 2, 'similarity': 40.613182067871094}],
 [{'id': 1, 'similarity': 97.81436920166016},
  {'id': 2, 'similarity': 32.50034713745117},
  {'id': 0, 'similarity': 25.678363800048828}],
 [{'id': 2, 'similarity': 56.019283294677734},
  {'id': 1, 'similarity': 37.612735748291016},
  {'id': 0, 'similarity': 26.307708740234375}]]
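The retriever returns one list of matches per query, each match carrying only the document key and a similarity. The sketch below shows one way to map those keys back to the original documents. It hard-codes the scores from the output above (rounded for brevity) so that it runs without loading a model, and it assumes each inner list is sorted by decreasing similarity, as in that output.

```python
documents = [
    {"id": 0, "document": "Food"},
    {"id": 1, "document": "Sports"},
    {"id": 2, "document": "Cinema"},
]
queries = ["Food", "Sports", "Cinema"]

# Scores copied from the example output, rounded to two decimals.
scores = [
    [{"id": 0, "similarity": 62.02}, {"id": 1, "similarity": 59.02}, {"id": 2, "similarity": 40.61}],
    [{"id": 1, "similarity": 97.81}, {"id": 2, "similarity": 32.50}, {"id": 0, "similarity": 25.68}],
    [{"id": 2, "similarity": 56.02}, {"id": 1, "similarity": 37.61}, {"id": 0, "similarity": 26.31}],
]

# Index the documents by their key so each match can be resolved directly.
index = {doc["id"]: doc for doc in documents}

# For each query, keep the text of its highest-scoring document.
best = {
    query: index[matches[0]["id"]]["document"]
    for query, matches in zip(queries, scores)
}
print(best)
```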
Methods
__call__
Retrieve documents.
Parameters
- queries_embeddings (dict[str, scipy.sparse._csr.csr_matrix])
- k (int) – defaults to None
- batch_size (int) – defaults to 64
- tqdm_bar (bool) – defaults to True
add
Add documents embeddings and activations to the retriever.
Parameters
- documents_embeddings (dict[dict[str, torch.Tensor]])
encode_documents
Encode documents.
Parameters
- documents (list[dict])
- batch_size (int) – defaults to 32
- tqdm_bar (bool) – defaults to True
- query_mode (bool) – defaults to False
- kwargs
encode_queries
Encode queries.
Parameters
- queries (list[str])
- batch_size (int) – defaults to 32
- tqdm_bar (bool) – defaults to True
- query_mode (bool) – defaults to True
- kwargs
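Both encoding methods accept a `batch_size`. The helper below, `iter_batches`, is hypothetical and not part of neural_cherche. It only illustrates how a list of inputs might be split into chunks of at most `batch_size` items before each chunk is passed to the model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


queries = ["Food", "Sports", "Cinema"]
batches = list(iter_batches(queries, batch_size=2))
print(batches)
```

A smaller `batch_size` lowers peak memory use at the cost of more forward passes.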
top_k
Return the top k documents for each query.
Parameters
- similarities (scipy.sparse._csc.csc_matrix)
- k (int)
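As a rough illustration of selecting the top k entries from a sparse similarity matrix, here is a minimal sketch. It is not the library's implementation: `top_k_sketch` is a hypothetical name, the rows-as-queries and columns-as-documents layout is an assumption, and the matrix is converted to a dense array for simplicity, which a real implementation would likely avoid for large collections.

```python
import numpy as np
from scipy.sparse import csc_matrix


def top_k_sketch(similarities, k):
    """Return, for each row, the k column indices with the highest values.

    Assumption: rows are queries and columns are documents.
    """
    dense = similarities.toarray()
    k = min(k, dense.shape[1])
    # Sorting the negated values ascending gives a descending order of similarity.
    order = np.argsort(-dense, axis=1, kind="stable")[:, :k]
    return [
        [{"index": int(col), "similarity": float(dense[row, col])} for col in cols]
        for row, cols in enumerate(order)
    ]


similarities = csc_matrix(np.array([[3.0, 0.0, 1.0], [0.0, 2.0, 5.0]]))
result = top_k_sketch(similarities, k=2)
print(result)
```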