Metadata-Version: 2.1
Name: a-data-processing
Version: 0.0.1
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/kubeagi/arcadia
Author: ggservice007
Author-email: ggservice007@126.com
Keywords: PDF WORD WEB parsing preprocessing
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.9.0,<3.12
Requires-Dist: pandas ==2.1.2
Requires-Dist: numpy ==1.26.1
Requires-Dist: sanic ==23.6.0
Requires-Dist: sanic-cors ==2.2.0
Requires-Dist: aiohttp ==3.8.6
Requires-Dist: ulid ==1.1
Requires-Dist: minio ==7.1.17
Requires-Dist: zhipuai ==1.0.7
Requires-Dist: langchain ==0.0.354
Requires-Dist: spacy ==3.5.4
Requires-Dist: pypdf ==3.17.1
Requires-Dist: emoji ==2.2.0
Requires-Dist: ftfy ==6.1.1
Requires-Dist: psycopg2-binary ==2.9.9
Requires-Dist: kubernetes ==25.3.0
Requires-Dist: duckdb ==0.9.2
Requires-Dist: DBUtils ==3.0.3
Requires-Dist: pyyaml ==6.0.1
Requires-Dist: opencc ==0.2
Requires-Dist: opencc-python-reimplemented ==0.1.7
Requires-Dist: selectolax ==0.3.17
Requires-Dist: openai ==1.3.7
Requires-Dist: python-docx ==1.1.0
Requires-Dist: bs4 ==0.0.1
Requires-Dist: playwright ==1.40.0
Requires-Dist: pillow ==10.2.0
Requires-Dist: html2text ==2020.1.16

# Data Processing 

## Current Version Main Features

Data Processing processes data accessed through MinIO, databases, Web APIs, and similar sources. It handles the following data types:
- txt
- json  
- doc
- html
- excel
- csv
- pdf
- markdown
- ppt

### Current Text Type Processing

For text data, processing currently includes cleaning abnormal data, filtering, de-duplication, and anonymization.
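
The four steps above can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual implementation: the function names (`clean`, `keep`, `anonymize`, `process`), the minimum-length filter, and the email-only anonymization rule are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch; the real rules may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean(text: str) -> str:
    """Clean abnormal data: normalize Unicode, drop control characters,
    and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = CONTROL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_chars: int = 5) -> bool:
    """Filter: keep only chunks with at least `min_chars` characters
    (an assumed threshold)."""
    return len(text) >= min_chars


def anonymize(text: str) -> str:
    """Anonymize: mask email addresses (only one of many possible rules)."""
    return EMAIL_RE.sub("[EMAIL]", text)


def process(chunks):
    """Run cleaning, filtering, de-duplication, and anonymization in order."""
    seen = set()
    result = []
    for chunk in chunks:
        text = clean(chunk)
        if not keep(text):
            continue
        if text in seen:  # exact-match de-duplication on the cleaned text
            continue
        seen.add(text)
        result.append(anonymize(text))
    return result
```

De-duplication here runs on the cleaned text, so chunks that differ only in whitespace or control characters count as duplicates.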

## Design

![Design](../../docs/images/data-process.drawio.png)

## Local Development
### Software Requirements

Before setting up the local data-process environment, please make sure the following software is installed:

- Python 3.10.x
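
The package metadata declares `Requires-Python: >=3.9.0,<3.12`. A quick way to confirm that the local interpreter falls in that range is sketched below; the helper name `is_supported` is hypothetical and not part of the project.

```python
import sys


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) version is within >=3.9,<3.12.

    `version` is any tuple-like value such as (3, 10, 4); it defaults to
    the running interpreter's version.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 12)


if __name__ == "__main__":
    print("supported" if is_supported() else "unsupported")
```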

### Environment Setup

Install the Python dependencies listed in `requirements.txt`, for example with `pip install -r requirements.txt`.

### Running

Run the `server.py` file in the `src` directory.

# isort
isort is a tool that sorts the imports in your Python code alphabetically and groups them into sections. It helps keep import order consistent and clean.

## Install
```shell
pip install isort
```

## Sort a File
```shell
isort src/server.py
```

## Sort a Directory
```shell
isort .
```

