abacusai.dataset

Classes

ApplicationConnectorDatasetConfig

An abstract class for dataset configs specific to application connectors.

DatasetDocumentProcessingConfig

Document processing configuration for dataset imports.

DataType

Generic enumeration.

DocumentProcessingConfig

Document processing configuration.

ParsingConfig

Custom config for dataset parsing.

DatasetColumn

A schema description for a column

DatasetVersion

A specific version of a dataset

RefreshSchedule

A refresh schedule for an object. Defines when the next version of the object will be created

AbstractApiClass

Dataset

A dataset reference

Module Contents

class abacusai.dataset.ApplicationConnectorDatasetConfig

Bases: abacusai.api_class.dataset.DatasetConfig

An abstract class for dataset configs specific to application connectors.

Parameters:

application_connector_type (enums.ApplicationConnectorType) – The type of application connector

application_connector_type: abacusai.api_class.enums.ApplicationConnectorType
classmethod _get_builder()
class abacusai.dataset.DatasetDocumentProcessingConfig

Bases: DocumentProcessingConfig

Document processing configuration for dataset imports.

Parameters:
  • extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False.

  • ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.

  • use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.

  • remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.

  • remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.

  • convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.

  • page_text_column (str) – Name of the output column which contains the extracted text for each page. If not provided, no column will be created.

page_text_column: str = None
class abacusai.dataset.DataType

Bases: ApiEnum

Generic enumeration.

Derive from this class to define new enumerations.

INTEGER = 'integer'
FLOAT = 'float'
STRING = 'string'
DATE = 'date'
DATETIME = 'datetime'
BOOLEAN = 'boolean'
LIST = 'list'
STRUCT = 'struct'
NULL = 'null'
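
The members above are plain string-valued constants. As a minimal sketch, the stand-in below mirrors them with the standard library's Enum to show how a string can be resolved to a member. DataTypeSketch and parse_data_type are hypothetical names for illustration; they are not the library's ApiEnum-based class, whose extra behavior is not described here.

```python
from enum import Enum


class DataTypeSketch(Enum):
    """Hypothetical stand-in mirroring the documented DataType members."""
    INTEGER = 'integer'
    FLOAT = 'float'
    STRING = 'string'
    DATE = 'date'
    DATETIME = 'datetime'
    BOOLEAN = 'boolean'
    LIST = 'list'
    STRUCT = 'struct'
    NULL = 'null'


def parse_data_type(value: str) -> DataTypeSketch:
    # Look the member up by its string value; unknown strings raise ValueError.
    return DataTypeSketch(value.lower())
```
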
class abacusai.dataset.DocumentProcessingConfig

Bases: abacusai.api_class.abstract.ApiClass

Document processing configuration.

Parameters:
  • extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False.

  • ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.

  • use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.

  • remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.

  • remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.

  • convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.

extract_bounding_boxes: bool = False
ocr_mode: abacusai.api_class.enums.OcrMode
use_full_ocr: bool = None
remove_watermarks: bool = True
convert_to_markdown: bool = False
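
Several of the options above only take effect when extract_bounding_boxes is True. The sketch below illustrates that dependency with a hypothetical dataclass stand-in (DocProcessingSketch is not the library class, and ignored_options is an invented helper); it reports which OCR-dependent options were set but would be ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocProcessingSketch:
    """Hypothetical stand-in holding a subset of the documented options."""
    extract_bounding_boxes: bool = False
    use_full_ocr: Optional[bool] = None
    remove_header_footer: bool = False
    convert_to_markdown: bool = False


# Options documented as taking effect only when extract_bounding_boxes is True.
OCR_DEPENDENT = ('use_full_ocr', 'remove_header_footer', 'convert_to_markdown')


def ignored_options(cfg: DocProcessingSketch) -> List[str]:
    """Return the OCR-dependent options that are enabled but have no effect."""
    if cfg.extract_bounding_boxes:
        return []
    return [name for name in OCR_DEPENDENT if getattr(cfg, name)]
```
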
class abacusai.dataset.ParsingConfig

Bases: abacusai.api_class.abstract.ApiClass

Custom config for dataset parsing.

Parameters:
  • escape (str) – Escape character for CSV files. Defaults to '"'.

  • csv_delimiter (str) – Delimiter for CSV files. Defaults to None.

  • file_path_with_schema (str) – Path to the file with schema. Defaults to None.

escape: str
csv_delimiter: str
file_path_with_schema: str
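
To make the escape and csv_delimiter settings concrete, here is a small analogy using Python's standard csv module. It is only an illustration of what a pipe delimiter and a double-quote escape mean for a CSV file; it assumes the quote-doubling convention and does not show how the service itself parses files. parse_rows is a hypothetical helper.

```python
import csv
import io


def parse_rows(text, delimiter='|'):
    # quotechar='"' with doublequote=True treats "" inside a quoted field
    # as one literal quote character.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quotechar='"', doublequote=True)
    return list(reader)


rows = parse_rows('id|note\n1|"say ""hi"""\n')
```

Here rows holds a header row followed by one record whose second field contains embedded quotes.
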
class abacusai.dataset.DatasetColumn(client, name=None, dataType=None, detectedDataType=None, featureType=None, detectedFeatureType=None, originalName=None, validDataTypes=None, timeFormat=None, timestampFrequency=None)

Bases: abacusai.return_class.AbstractApiClass

A schema description for a column

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • name (str) – The unique name of the column.

  • dataType (str) – The underlying data type of each column.

  • detectedDataType (str) – The detected data type of the column.

  • featureType (str) – Feature type of the column.

  • detectedFeatureType (str) – The detected feature type of the column.

  • originalName (str) – The original name of the column.

  • validDataTypes (list[str]) – The valid data type options for this column.

  • timeFormat (str) – The detected time format of the column.

  • timestampFrequency (str) – The detected frequency of the timestamps in the dataset.

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

class abacusai.dataset.DatasetVersion(client, datasetVersion=None, status=None, datasetId=None, size=None, rowCount=None, fileInspectMetadata=None, createdAt=None, error=None, incrementalQueriedAt=None, uploadId=None, mergeFileSchemas=None, databaseConnectorConfig=None, applicationConnectorConfig=None, invalidRecords=None)

Bases: abacusai.return_class.AbstractApiClass

A specific version of a dataset

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • datasetVersion (str) – The unique identifier of the dataset version.

  • status (str) – The current status of the dataset version

  • datasetId (str) – A reference to the Dataset this dataset version belongs to.

  • size (int) – The size in bytes of the file.

  • rowCount (int) – Number of rows in the dataset version.

  • fileInspectMetadata (dict) – Metadata about the file’s inspection, for example the detected delimiter for CSV files.

  • createdAt (str) – The timestamp this dataset version was created.

  • error (str) – If status is FAILED, this field will be populated with an error.

  • incrementalQueriedAt (str) – If the dataset version is from an incremental dataset, this is the last entry of timestamp column when the dataset version was created.

  • uploadId (str) – If the dataset version is being uploaded, this is the reference to the Upload.

  • mergeFileSchemas (bool) – If the merge file schemas policy is enabled.

  • databaseConnectorConfig (dict) – The database connector query used to retrieve data for this version.

  • applicationConnectorConfig (dict) – The application connector used to retrieve data for this version.

  • invalidRecords (str) – Invalid records in the dataset version

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

get_metrics(selected_columns=None, include_charts=False, include_statistics=True)

Get metrics for a specific dataset version.

Parameters:
  • selected_columns (List) – A list of columns to order first.

  • include_charts (bool) – A flag indicating whether charts should be included in the response. Default is false.

  • include_statistics (bool) – A flag indicating whether statistics should be included in the response. Default is true.

Returns:

The metrics for the specified Dataset version.

Return type:

DataMetrics

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

DatasetVersion

describe()

Retrieves a full description of the specified dataset version, including its ID, name, source type, and other attributes.

Parameters:

dataset_version (str) – Unique string identifier associated with the dataset version.

Returns:

The dataset version.

Return type:

DatasetVersion

get_logs()

Retrieves the dataset import logs.

Parameters:

dataset_version (str) – The unique version ID of the dataset version.

Returns:

The logs for the specified dataset version.

Return type:

DatasetVersionLogs

wait_for_import(timeout=900)

Blocks until the dataset version is imported.

Parameters:

timeout (int) – The time allowed for the call to finish. If it doesn’t finish within the allotted time, the call times out.

wait_for_inspection(timeout=None)

Blocks until the dataset version is completely inspected.

Parameters:

timeout (int) – The time allowed for the call to finish. If it doesn’t finish within the allotted time, the call times out.

get_status()

Gets the status of the dataset version.

Returns:

A string describing the status of a dataset version (importing, inspecting, complete, etc.).

Return type:

str
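
The waiting calls and get_status() above suggest a poll-until-done pattern. The sketch below shows one way such a loop can be written; it is not the library's implementation. The function name wait_until_done, the terminal status names, the polling interval, and the reading of timeout as seconds are all assumptions made for illustration.

```python
import time


def wait_until_done(get_status, timeout=900, poll_interval=5.0,
                    terminal=('COMPLETE', 'FAILED'),
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout  # assumption: timeout is in seconds
    while True:
        status = get_status()
        if status.upper() in terminal:  # assumed terminal status names
            return status
        if clock() >= deadline:
            raise TimeoutError(f'still {status!r} after {timeout} seconds')
        sleep(poll_interval)
```

With a real object this could be called as wait_until_done(version.get_status); the clock and sleep parameters exist only so the loop can be exercised without waiting.
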

class abacusai.dataset.RefreshSchedule(client, refreshPolicyId=None, nextRunTime=None, cron=None, refreshType=None, error=None)

Bases: abacusai.return_class.AbstractApiClass

A refresh schedule for an object. Defines when the next version of the object will be created

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • refreshPolicyId (str) – The unique identifier of the refresh policy

  • nextRunTime (str) – The next run time of the refresh policy. If null, the policy is paused.

  • cron (str) – A cron-style string that describes when this refresh policy is to be executed, in UTC.

  • refreshType (str) – The type of refresh that will be run

  • error (str) – An error message for the last pipeline run of a policy

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

class abacusai.dataset.AbstractApiClass(client, id)
__eq__(other)

Return self==value.

_get_attribute_as_dict(attribute)
class abacusai.dataset.Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, extractBoundingBoxes=None, mergeFileSchemas=None, referenceOnlyDocumentset=None, schema={}, refreshSchedules={}, latestDatasetVersion={}, parsingConfig={}, documentProcessingConfig={})

Bases: abacusai.return_class.AbstractApiClass

A dataset reference

Parameters:
  • client (ApiClient) – An authenticated API Client instance

  • datasetId (str) – The unique identifier of the dataset.

  • sourceType (str) – The source of the Dataset. EXTERNAL_SERVICE, UPLOAD, or STREAMING.

  • dataSource (str) – Location of data. It may be a URI such as an s3 bucket or the database table.

  • createdAt (str) – The timestamp at which this dataset was created.

  • ignoreBefore (str) – The timestamp at which all previous events are ignored when training.

  • ephemeral (bool) – The dataset is ephemeral and not used for training.

  • lookbackDays (int) – Specific to streaming datasets, this specifies how many days’ worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.

  • databaseConnectorId (str) – The Database Connector used.

  • databaseConnectorConfig (dict) – The database connector query used to retrieve data.

  • connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.

  • featureGroupTableName (str) – The table name of the dataset’s feature group

  • applicationConnectorId (str) – The Application Connector used.

  • applicationConnectorConfig (dict) – The application connector query used to retrieve data.

  • incremental (bool) – If dataset is an incremental dataset.

  • isDocumentset (bool) – If dataset is a documentset.

  • extractBoundingBoxes (bool) – Signifies whether to extract bounding boxes out of the documents. Only valid if is_documentset is True.

  • mergeFileSchemas (bool) – If the merge file schemas policy is enabled.

  • referenceOnlyDocumentset (bool) – Signifies whether to save the data reference only. Only valid if is_documentset is True.

  • latestDatasetVersion (DatasetVersion) – The latest version of this dataset.

  • schema (DatasetColumn) – List of resolved columns.

  • refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.

  • parsingConfig (ParsingConfig) – The parsing config used for dataset.

  • documentProcessingConfig (DocumentProcessingConfig) – The document processing config used for dataset (when is_documentset is True).

__repr__()

Return repr(self).

to_dict()

Get a dict representation of the parameters in this class

Returns:

The dict value representation of the class parameters

Return type:

dict

create_version_from_file_connector(location=None, file_format=None, csv_delimiter=None, merge_file_schemas=None, parsing_config=None)

Creates a new version of the specified dataset.

Parameters:
  • location (str) – External URI to import the dataset from. If not specified, the last location will be used.

  • file_format (str) – File format to be used. If not specified, the service will try to detect the file format.

  • csv_delimiter (str) – If the file format is CSV, use a specific CSV delimiter.

  • merge_file_schemas (bool) – Signifies if the merge file schema policy is enabled.

  • parsing_config (ParsingConfig) – Custom config for dataset parsing.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion
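
A typical flow combines this method with the DatasetVersion calls documented earlier. The helper below is a minimal sketch that uses only methods listed on this page; import_new_version is a hypothetical name, and dataset is assumed to be an existing Dataset object obtained elsewhere.

```python
def import_new_version(dataset, location=None, timeout=900):
    """Create a version from the file connector, wait for it, report status."""
    # With location=None the last location is reused, per the parameter docs.
    version = dataset.create_version_from_file_connector(location=location)
    # Called for its blocking effect; no return value is documented.
    version.wait_for_import(timeout=timeout)
    return version, version.get_status()
```

The caller can then inspect the returned status string before using the version further.
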

create_version_from_database_connector(object_name=None, columns=None, query_arguments=None, sql_query=None)

Creates a new version of the specified dataset.

Parameters:
  • object_name (str) – The name/ID of the object in the service to query. If not specified, the last name will be used.

  • columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.

  • query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.

  • sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns, and query_arguments.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion

create_version_from_application_connector(dataset_config=None)

Creates a new version of the specified dataset.

Parameters:

dataset_config (ApplicationConnectorDatasetConfig) – Dataset config for the application connector. If any of the fields are not specified, the last values will be used.

Returns:

The new Dataset Version created.

Return type:

DatasetVersion

create_version_from_upload(file_format=None)

Creates a new version of the specified dataset using a local file upload.

Parameters:

file_format (str) – File format to be used. If not specified, the service will attempt to detect the file format.

Returns:

Token to be used when uploading file parts.

Return type:

Upload

create_version_from_document_reprocessing(document_processing_config=None)

Creates a new dataset version for a source docstore dataset with the provided document processing configuration. This does not re-import the data but uses the same data which is imported in the latest dataset version and only performs document processing on it.

Parameters:

document_processing_config (DatasetDocumentProcessingConfig) – The document processing configuration to use for the new dataset version. If not specified, the document processing configuration from the source dataset will be used.

Returns:

The new dataset version created.

Return type:

DatasetVersion

snapshot_streaming_data()

Snapshots the current data in the streaming dataset.

Parameters:

dataset_id (str) – The unique ID associated with the dataset.

Returns:

The new Dataset Version created by taking a snapshot of the current data in the streaming dataset.

Return type:

DatasetVersion

set_column_data_type(column, data_type)

Set a Dataset’s column type.

Parameters:
  • column (str) – The name of the column.

  • data_type (DataType) – The type of the data in the column. Note: Some ColumnMappings may restrict the options or explicitly set the DataType.

Returns:

The dataset and schema after the data type has been set.

Return type:

Dataset
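
Because the method returns the updated Dataset, several columns can be retyped in a loop. This is a minimal sketch: apply_column_types is a hypothetical helper, and the mapping values are assumed to be DataType members (or whatever the method accepts for data_type).

```python
def apply_column_types(dataset, column_types):
    """Set each column's data type in turn and return the final Dataset."""
    for column, data_type in column_types.items():
        # Each call returns the dataset with its updated schema.
        dataset = dataset.set_column_data_type(column, data_type)
    return dataset
```
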

set_streaming_retention_policy(retention_hours=None, retention_row_count=None, ignore_records_before_timestamp=None)

Sets the streaming retention policy.

Parameters:
  • retention_hours (int) – Number of hours to retain streamed data in memory.

  • retention_row_count (int) – Number of rows of streamed data to retain in memory.

  • ignore_records_before_timestamp (int) – The Unix timestamp (in seconds) to use as a cutoff to ignore all entries sent before it
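
Since ignore_records_before_timestamp is a Unix timestamp in seconds, a datetime has to be converted before it is passed in. The sketch below shows one way to do that; to_unix_seconds and apply_retention are hypothetical helpers, and dataset is assumed to be an existing streaming Dataset object.

```python
from datetime import datetime, timezone


def to_unix_seconds(moment: datetime) -> int:
    # Reject naive datetimes so the cutoff does not depend on the local zone.
    if moment.tzinfo is None:
        raise ValueError('expected a timezone-aware datetime')
    return int(moment.timestamp())


def apply_retention(dataset, cutoff: datetime, retention_hours=None):
    """Set the retention policy using a datetime cutoff."""
    dataset.set_streaming_retention_policy(
        retention_hours=retention_hours,
        ignore_records_before_timestamp=to_unix_seconds(cutoff),
    )


cutoff_seconds = to_unix_seconds(datetime(2024, 1, 1, tzinfo=timezone.utc))
```
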

get_schema()

Retrieves the column schema of a dataset.

Parameters:

dataset_id (str) – Unique string identifier of the dataset schema to look up.

Returns:

List of column schema definitions.

Return type:

list[DatasetColumn]

set_database_connector_config(database_connector_id, object_name=None, columns=None, query_arguments=None, sql_query=None)

Sets database connector config for a dataset. This method is currently only supported for streaming datasets.

Parameters:
  • database_connector_id (str) – Unique String Identifier of the Database Connector to import the dataset from.

  • object_name (str) – If applicable, the name/ID of the object in the service to query.

  • columns (str) – The columns to query from the external service object.

  • query_arguments (str) – Additional query arguments to filter the data.

  • sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns and query_arguments.

refresh()

Calls describe and refreshes the current object’s fields

Returns:

The current object

Return type:

Dataset

describe()

Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.

Parameters:

dataset_id (str) – The unique ID associated with the dataset.

Returns:

The dataset.

Return type:

Dataset

list_versions(limit=100, start_after_version=None)

Retrieves a list of all dataset versions for the specified dataset.

Parameters:
  • limit (int) – The maximum length of the list of all dataset versions.

  • start_after_version (str) – The ID of the version after which the list starts.

Returns:

A list of dataset versions.

Return type:

list[DatasetVersion]
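
The limit and start_after_version parameters allow paging through long version histories. The generator below is a minimal sketch of that pattern. iter_versions is a hypothetical helper, and it assumes each returned DatasetVersion exposes its identifier as a dataset_version attribute, which this page does not state explicitly.

```python
def iter_versions(dataset, page_size=100):
    """Yield every version of a dataset, fetching one page at a time."""
    cursor = None
    while True:
        page = dataset.list_versions(limit=page_size,
                                     start_after_version=cursor)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means there is nothing left
        # Assumed attribute name for the version's unique identifier.
        cursor = page[-1].dataset_version
```
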

delete()

Deletes the specified dataset from the organization.

Parameters:

dataset_id (str) – Unique string identifier of the dataset to delete.

wait_for_import(timeout=900)

Blocks until the dataset is imported.

Parameters:

timeout (int) – The time allowed for the call to finish. If it doesn’t finish within the allotted time, the call times out.

wait_for_inspection(timeout=None)

Blocks until the dataset is completely inspected.

Parameters:

timeout (int) – The time allowed for the call to finish. If it doesn’t finish within the allotted time, the call times out.

get_status()

Gets the status of the latest dataset version.

Returns:

A string describing the status of a dataset (importing, inspecting, complete, etc.).

Return type:

str

describe_feature_group()

Gets the feature group attached to the dataset.

Returns:

A feature group object.

Return type:

FeatureGroup

create_refresh_policy(cron)

Creates a refresh policy for a dataset.

Parameters:

cron (str) – A cron style string to set the refresh time.

Returns:

The refresh policy object.

Return type:

RefreshPolicy
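
As an illustration of the cron argument, the sketch below builds a daily schedule string and passes it to create_refresh_policy. It assumes the service accepts the conventional five-field cron syntax (minute, hour, day of month, month, day of week) interpreted in UTC; daily_cron and schedule_daily_refresh are hypothetical helpers.

```python
def daily_cron(hour_utc: int, minute: int = 0) -> str:
    """Build a five-field cron string that fires once a day."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError('hour must be 0-23 and minute 0-59')
    return f'{minute} {hour_utc} * * *'


def schedule_daily_refresh(dataset, hour_utc: int):
    # Returns whatever create_refresh_policy returns (documented as RefreshPolicy).
    return dataset.create_refresh_policy(daily_cron(hour_utc))
```
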

list_refresh_policies()

Gets the refresh policies in a list.

Returns:

A list of refresh policy objects.

Return type:

List[RefreshPolicy]