abacusai.document_retriever

Classes

VectorStoreConfig

Config for indexing options of a document retriever. Default values of optional arguments are heuristically selected by the Abacus.AI platform based on the underlying data.

DocumentRetrieverConfig

A config for document retriever creation.

DocumentRetrieverVersion

A version of a document retriever.

AbstractApiClass

DocumentRetriever

A vector store that stores embeddings for a list of document chunks.

Module Contents

class abacusai.document_retriever.VectorStoreConfig

Bases: abacusai.api_class.abstract.ApiClass

Config for indexing options of a document retriever. Default values of optional arguments are heuristically selected by the Abacus.AI platform based on the underlying data.

Parameters:
  • chunk_size (int) – The size of text chunks in the vector store.

  • chunk_overlap_fraction (float) – The fraction of overlap between chunks.

  • text_encoder (VectorStoreTextEncoder) – Encoder used to index texts from the documents.

  • chunk_size_factors (list) – Chunking data with multiple sizes. The specified list of factors is used to calculate more sizes, in addition to chunk_size.

  • score_multiplier_column (str) – If provided, will use the values in this metadata column to modify the relevance score of returned chunks for all queries.

  • prune_vectors (bool) – Transform vectors using SVD so that the average component of vectors in the corpus is removed.

chunk_size: int
chunk_overlap_fraction: float
text_encoder: abacusai.api_class.enums.VectorStoreTextEncoder
chunk_size_factors: list
score_multiplier_column: str
prune_vectors: bool
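
To make chunk_size and chunk_overlap_fraction concrete, here is a minimal sketch of overlapping word-based chunking. It is an illustration only, not the platform's implementation: the helper name chunk_words, the whitespace tokenization, and the rounding of the overlap are all assumptions.

```python
def chunk_words(text, chunk_size, chunk_overlap_fraction):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share about chunk_size * chunk_overlap_fraction words.
    Hypothetical helper: the real indexing logic is not documented here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap_fraction < 1:
        raise ValueError("chunk_overlap_fraction must be in [0, 1)")

    words = text.split()  # assumption: words are whitespace-separated
    overlap = int(chunk_size * chunk_overlap_fraction)
    step = max(1, chunk_size - overlap)  # always advance by at least one word

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With chunk_size=4 and chunk_overlap_fraction=0.5, each chunk in this sketch repeats the last two words of the previous one.
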
class abacusai.document_retriever.DocumentRetrieverConfig(client, chunkSize=None, chunkOverlapFraction=None, textEncoder=None, scoreMultiplierColumn=None, pruneVectors=None)

Bases: abacusai.return_class.AbstractApiClass

A config for document retriever creation.

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • chunkSize (int) – The size of chunks for vector store, i.e., maximum number of words in the chunk.

  • chunkOverlapFraction (float) – The fraction of overlap between two consecutive chunks.

  • textEncoder (str) – The text encoder used to encode texts in the vector store.

  • scoreMultiplierColumn (str) – The values in this metadata column are used to modify the relevance scores of returned chunks.

  • pruneVectors (bool) – Corpus specific transformation of vectors that applies dimensional reduction techniques to strip common components from the vectors.

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

class abacusai.document_retriever.DocumentRetrieverVersion(client, documentRetrieverId=None, documentRetrieverVersion=None, createdAt=None, status=None, deploymentStatus=None, featureGroupId=None, featureGroupVersion=None, error=None, numberOfChunks=None, embeddingFileSize=None, warnings=None, resolvedConfig={})

Bases: abacusai.return_class.AbstractApiClass

A version of a document retriever.

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • documentRetrieverId (str) – The unique identifier of the Document Retriever.

  • documentRetrieverVersion (str) – The unique identifier of the Document Retriever version.

  • createdAt (str) – When the Document Retriever was created.

  • status (str) – The status of creating the Document Retriever version.

  • deploymentStatus (str) – The status of deploying the Document Retriever version.

  • featureGroupId (str) – The feature group id associated with the document retriever.

  • featureGroupVersion (str) – The unique identifier of the feature group version at which the Document Retriever version is created.

  • error (str) – The error message when it failed to create the document retriever version.

  • numberOfChunks (int) – The number of chunks for the document retriever.

  • embeddingFileSize (int) – The size of the embedding file for the document retriever.

  • warnings (list) – The warning messages when creating the document retriever.

  • resolvedConfig (DocumentRetrieverConfig) – The resolved configurations, such as default settings, for indexing documents.

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

DocumentRetrieverVersion

describe()

Describe a document retriever version.

Parameters:

document_retriever_version (str) – A unique string identifier associated with the document retriever version.

Returns:

The document retriever version object.

Return type:

DocumentRetrieverVersion

wait_for_results(timeout=3600)

A waiting call until the document retriever version is complete.

Parameters:

timeout (int) – The waiting time given to the call to finish. If it doesn’t finish within the allotted time, the call times out.

wait_until_ready(timeout=3600)

A waiting call until the document retriever version is ready. It restarts the document retriever if it is stopped.

Parameters:

timeout (int) – The waiting time given to the call to finish. If it doesn’t finish within the allotted time, the call times out.

wait_until_deployment_ready(timeout=3600)

A waiting call until the document retriever deployment is ready to serve.

Parameters:

timeout (int) – The waiting time given to the call to finish. If it doesn’t finish within the allotted time, the call times out. The default is 3600 seconds.
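
The waiting calls above follow a poll-until-done pattern with a timeout. The sketch below shows that general pattern only; it is not the client's implementation. The helper name wait_until, the poll_interval parameter, and the status strings used in the usage note are assumptions.

```python
import time


def wait_until(get_status, done_states, timeout=3600, poll_interval=5.0):
    """Poll get_status() until it returns a value in done_states.

    Raises TimeoutError if no such value is seen within `timeout` seconds.
    Hypothetical helper illustrating the pattern, not the library's code.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        time.sleep(poll_interval)
```

A caller could pass a bound method such as version.get_status as get_status, together with whatever terminal status values the platform actually reports.
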

get_status()

Gets the status of the document retriever version.

Returns:

A string describing the status of a document retriever version (pending, complete, etc.).

Return type:

str

get_deployment_status()

Gets the deployment status of the document retriever version.

Returns:

A string describing the deployment status of a document retriever version (pending, deploying, etc.).

Return type:

str

class abacusai.document_retriever.AbstractApiClass(client, id)

__eq__(other)

Return self==value.

_get_attribute_as_dict(attribute)

class abacusai.document_retriever.DocumentRetriever(client, name=None, documentRetrieverId=None, createdAt=None, featureGroupId=None, featureGroupName=None, indexingRequired=None, latestDocumentRetrieverVersion={}, documentRetrieverConfig={})

Bases: abacusai.return_class.AbstractApiClass

A vector store that stores embeddings for a list of document chunks.

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • name (str) – The name of the document retriever.

  • documentRetrieverId (str) – The unique identifier of the vector store.

  • createdAt (str) – When the vector store was created.

  • featureGroupId (str) – The feature group id associated with the document retriever.

  • featureGroupName (str) – The feature group name associated with the document retriever.

  • indexingRequired (bool) – Whether the document retriever is required to be indexed due to changes in underlying data.

  • latestDocumentRetrieverVersion (DocumentRetrieverVersion) – The latest version of vector store.

  • documentRetrieverConfig (DocumentRetrieverConfig) – The config for vector store creation.

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

rename(name)

Updates an existing document retriever.

Parameters:

name (str) – The name to update the document retriever with.

Returns:

The updated document retriever.

Return type:

DocumentRetriever

create_version(feature_group_id=None, document_retriever_config=None)

Creates a document retriever version from the latest version of the feature group that the document retriever is associated with.

Parameters:
  • feature_group_id (str) – The ID of the feature group to update the document retriever with.

  • document_retriever_config (VectorStoreConfig) – The configuration, including chunk_size and chunk_overlap_fraction, for document retrieval.

Returns:

The newly created document retriever version.

Return type:

DocumentRetrieverVersion
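
A typical re-indexing flow chains create_version with the version's wait_for_results and get_status, all documented on this page. The sketch below is written against any object exposing those methods; the wrapper name reindex is hypothetical, and it assumes wait_for_results blocks until indexing finishes or the timeout elapses.

```python
def reindex(retriever, timeout=3600):
    """Create a new version, wait for indexing, and return its final status.

    `retriever` is expected to behave like a DocumentRetriever as described
    above. This wrapper is a hypothetical convenience, not a library function.
    """
    # Start indexing from the latest version of the associated feature group.
    version = retriever.create_version()
    # Block until the version is complete (or the timeout is reached).
    version.wait_for_results(timeout=timeout)
    # Report the status string of the new version.
    return version.get_status()
```
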

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

DocumentRetriever

describe()

Describe a Document Retriever.

Parameters:

document_retriever_id (str) – A unique string identifier associated with the document retriever.

Returns:

The document retriever object.

Return type:

DocumentRetriever

list_versions(limit=100, start_after_version=None)

List all the document retriever versions with a given ID.

Parameters:
  • limit (int) – The number of vector store versions to retrieve.

  • start_after_version (str) – An offset parameter to exclude all document retriever versions up to this specified one.

Returns:

All the document retriever versions associated with the document retriever.

Return type:

list[DocumentRetrieverVersion]
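
The limit and start_after_version parameters suggest cursor-style pagination. A minimal sketch of walking through all versions follows. The helper name iter_all_versions is hypothetical, and it assumes each returned version exposes its identifier as the attribute document_retriever_version, which this page does not state explicitly.

```python
def iter_all_versions(retriever, page_size=100):
    """Yield every version of a retriever, one page at a time.

    Assumption: version objects carry their ID in the attribute
    `document_retriever_version`; adjust if the real attribute differs.
    """
    start_after = None
    while True:
        page = retriever.list_versions(
            limit=page_size, start_after_version=start_after
        )
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        # Use the last version seen as the cursor for the next request.
        start_after = page[-1].document_retriever_version
```
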

get_document_snippet(document_id, start_word_index=None, end_word_index=None)

Get a snippet from documents in the document retriever.

Parameters:
  • document_id (str) – The ID of the document to retrieve the snippet from.

  • start_word_index (int) – If provided, will start the snippet at the index (of words in the document) specified.

  • end_word_index (int) – If provided, will end the snippet at the index (of words in the document) specified.

Returns:

The document snippet found in the document retriever.

Return type:

DocumentRetrieverLookupResult
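
As a rough picture of what word-index bounds mean, the sketch below slices a text by word position. It is not the service's logic: the helper name word_snippet, the whitespace tokenization, the zero-based indexing, and the exclusive end index are all assumptions, since this page does not specify them.

```python
def word_snippet(text, start_word_index=None, end_word_index=None):
    """Return the words of `text` between two word indices.

    Assumptions (not confirmed by the API docs): words are split on
    whitespace, indices are zero-based, and the end index is exclusive.
    """
    words = text.split()
    # A bound of None leaves that side of the slice open.
    return " ".join(words[start_word_index:end_word_index])
```
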

restart()

Restart the document retriever if it is stopped. This will start the deployment of the document retriever, but will not wait for it to be ready. You need to call wait_until_ready to wait until the deployment is ready.

Parameters:

document_retriever_id (str) – A unique string identifier associated with the document retriever.

wait_until_ready(timeout=3600)

A waiting call until the document retriever is ready. It restarts the document retriever if it is stopped.

Parameters:

timeout (int) – The waiting time given to the call to finish. If it doesn’t finish within the allotted time, the call times out. The default is 3600 seconds.

wait_until_deployment_ready(timeout=3600)

A waiting call until the document retriever deployment is ready to serve.

Parameters:

timeout (int) – The waiting time given to the call to finish. If it doesn’t finish within the allotted time, the call times out. The default is 3600 seconds.

get_status()

Gets the indexing status of the document retriever.

Returns:

A string describing the status of a document retriever (pending, complete, etc.).

Return type:

str

get_deployment_status()

Gets the deployment status of the document retriever.

Returns:

A string describing the deployment status of the document retriever (pending, deploying, active, etc.).

Return type:

str

get_matching_documents(query, filters=None, limit=None, result_columns=None, max_words=None, num_retrieval_margin_words=None, max_words_per_chunk=None, score_multiplier_column=None, min_score=None, required_phrases=None, filter_clause=None, crowding_limits=None)

Looks up the deployed document retriever and returns the documents that match the given query.

Original documents are split into chunks and stored in the document retriever. This lookup function will return the relevant chunks from the document retriever. The returned chunks could be expanded to include more words from the original documents and merged if they are overlapping, and permitted by the settings provided. The returned chunks are sorted by relevance.
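
The expansion and merging described above can be pictured with word-index spans. The sketch below is only an illustration under stated assumptions: the helper name expand_and_merge is hypothetical, spans are (start, end) pairs with an exclusive end, touching spans are merged along with overlapping ones, and relevance ordering is ignored.

```python
def expand_and_merge(spans, margin, doc_length):
    """Widen each (start, end) word span by `margin` and merge overlaps.

    Hypothetical helper: spans use an exclusive end index and are returned
    in document order, not in relevance order.
    """
    # Grow every span on both sides, clamped to the document boundaries.
    expanded = sorted(
        (max(0, start - margin), min(doc_length, end + margin))
        for start, end in spans
    )
    merged = []
    for start, end in expanded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```
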

Parameters:
  • query (str) – The query to search for.

  • filters (dict) – A dictionary mapping column names to a list of values to restrict the retrieved search results.

  • limit (int) – If provided, will limit the number of results to the value specified.

  • result_columns (list) – If provided, will limit the column properties present in each result to those specified in this list.

  • max_words (int) – If provided, will limit the total number of words in the results to the value specified.

  • num_retrieval_margin_words (int) – If provided, will add this number of words from left and right of the returned chunks.

  • max_words_per_chunk (int) – If provided, will limit the number of words in each chunk to the value specified. If the value provided is smaller than the actual chunk size on disk, which is determined during document retriever creation, the actual chunk size will be used. That is, chunks looked up from document retrievers will not be split into smaller chunks during lookup due to this setting.

  • score_multiplier_column (str) – If provided, will use the values in this column to modify the relevance score of the returned chunks. Values in this column must be numeric.

  • min_score (float) – If provided, will filter out the results with score lower than the value specified.

  • required_phrases (list) – If provided, each result will have at least one of the phrases.

  • filter_clause (str) – If provided, filter the results of the query using this sql where clause.

  • crowding_limits (dict) – A dictionary mapping metadata columns to the maximum number of results per unique value of the column. This is used to ensure diversity of metadata attribute values in the results. If a particular attribute value has already reached its maximum count, further results with that same attribute value will be excluded from the final result set.

Returns:

The relevant document results found in the document retriever.

Return type:

list[DocumentRetrieverLookupResult]
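
To illustrate how min_score and crowding_limits interact as described in the parameters above, here is a small local sketch. It does not reproduce the service's behavior exactly: the helper name apply_crowding_limits and the result layout (dicts with a "score" key plus metadata keys, already sorted by relevance) are assumptions.

```python
from collections import Counter


def apply_crowding_limits(results, crowding_limits, min_score=None):
    """Filter relevance-sorted results by score and per-value caps.

    `results` is assumed to be a list of dicts with a "score" key and
    metadata keys; `crowding_limits` maps a metadata column to the maximum
    number of results allowed per unique value of that column.
    """
    counts = {column: Counter() for column in crowding_limits}
    kept = []
    for result in results:
        if min_score is not None and result["score"] < min_score:
            continue  # below the score threshold
        if any(
            counts[column][result.get(column)] >= cap
            for column, cap in crowding_limits.items()
        ):
            continue  # this attribute value already reached its cap
        for column in crowding_limits:
            counts[column][result.get(column)] += 1
        kept.append(result)
    return kept
```

Because results are visited in relevance order, the cap in this sketch keeps the highest-scoring entries for each attribute value and drops later ones.
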