Metadata-Version: 2.1
Name: acryl-datahub
Version: 0.8.3.0
Summary: A CLI to work with DataHub metadata
Home-page: https://datahubproject.io/
Author: DataHub Committers
License: Apache License 2.0
Project-URL: Documentation, https://datahubproject.io/docs/
Project-URL: Source, https://github.com/linkedin/datahub
Project-URL: Changelog, https://github.com/linkedin/datahub/releases
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX :: Linux
Classifier: Environment :: Console
Classifier: Environment :: MacOS X
Classifier: Topic :: Software Development
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pydantic (>=1.5.1)
Requires-Dist: typing-inspect
Requires-Dist: expandvars (>=0.6.5)
Requires-Dist: toml (>=0.10.0)
Requires-Dist: avro-gen3 (==0.5.0)
Requires-Dist: click (>=6.0.0)
Requires-Dist: docker
Requires-Dist: PyYAML
Requires-Dist: mypy-extensions (>=0.4.3)
Requires-Dist: avro-python3 (>=1.8.2)
Requires-Dist: python-dateutil
Requires-Dist: entrypoints
Requires-Dist: dataclasses (>=0.6) ; python_version < "3.7"
Requires-Dist: typing-extensions (>=3.7.4) ; python_version < "3.8"
Provides-Extra: airflow
Requires-Dist: expandvars (>=0.6.5) ; extra == 'airflow'
Requires-Dist: toml (>=0.10.0) ; extra == 'airflow'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'airflow'
Requires-Dist: click (>=6.0.0) ; extra == 'airflow'
Requires-Dist: apache-airflow (>=1.10.2) ; extra == 'airflow'
Requires-Dist: docker ; extra == 'airflow'
Requires-Dist: PyYAML ; extra == 'airflow'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'airflow'
Requires-Dist: python-dateutil ; extra == 'airflow'
Requires-Dist: entrypoints ; extra == 'airflow'
Provides-Extra: all
Requires-Dist: expandvars (>=0.6.5) ; extra == 'all'
Requires-Dist: GeoAlchemy2 ; extra == 'all'
Requires-Dist: PyAthena[sqlalchemy] ; extra == 'all'
Requires-Dist: toml (>=0.10.0) ; extra == 'all'
Requires-Dist: acryl-pyhive[hive] (>=0.6.7) ; extra == 'all'
Requires-Dist: boto3 ; extra == 'all'
Requires-Dist: click (>=6.0.0) ; extra == 'all'
Requires-Dist: psycopg2-binary ; extra == 'all'
Requires-Dist: docker ; extra == 'all'
Requires-Dist: python-ldap (>=2.4) ; extra == 'all'
Requires-Dist: sql-metadata (==1.12.0) ; extra == 'all'
Requires-Dist: PyYAML ; extra == 'all'
Requires-Dist: python-dateutil ; extra == 'all'
Requires-Dist: entrypoints ; extra == 'all'
Requires-Dist: pymysql (>=1.0.2) ; extra == 'all'
Requires-Dist: cx-Oracle ; extra == 'all'
Requires-Dist: pybigquery (>=0.6.0) ; extra == 'all'
Requires-Dist: sqlalchemy-redshift ; extra == 'all'
Requires-Dist: fastavro (>=1.2.0) ; extra == 'all'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'all'
Requires-Dist: apache-airflow (>=1.10.2) ; extra == 'all'
Requires-Dist: lkml (>=1.1.0) ; extra == 'all'
Requires-Dist: requests ; extra == 'all'
Requires-Dist: looker-sdk (==21.6.0) ; extra == 'all'
Requires-Dist: confluent-kafka (>=1.5.0) ; extra == 'all'
Requires-Dist: pymongo (>=3.11) ; extra == 'all'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'all'
Requires-Dist: sqlalchemy-pytds (>=0.3) ; extra == 'all'
Requires-Dist: pydruid (>=0.6.2) ; extra == 'all'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'all'
Requires-Dist: snowflake-sqlalchemy ; extra == 'all'
Provides-Extra: athena
Requires-Dist: expandvars (>=0.6.5) ; extra == 'athena'
Requires-Dist: PyAthena[sqlalchemy] ; extra == 'athena'
Requires-Dist: toml (>=0.10.0) ; extra == 'athena'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'athena'
Requires-Dist: click (>=6.0.0) ; extra == 'athena'
Requires-Dist: docker ; extra == 'athena'
Requires-Dist: PyYAML ; extra == 'athena'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'athena'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'athena'
Requires-Dist: python-dateutil ; extra == 'athena'
Requires-Dist: entrypoints ; extra == 'athena'
Provides-Extra: base
Requires-Dist: expandvars (>=0.6.5) ; extra == 'base'
Requires-Dist: toml (>=0.10.0) ; extra == 'base'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'base'
Requires-Dist: click (>=6.0.0) ; extra == 'base'
Requires-Dist: docker ; extra == 'base'
Requires-Dist: PyYAML ; extra == 'base'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'base'
Requires-Dist: python-dateutil ; extra == 'base'
Requires-Dist: entrypoints ; extra == 'base'
Provides-Extra: bigquery
Requires-Dist: expandvars (>=0.6.5) ; extra == 'bigquery'
Requires-Dist: toml (>=0.10.0) ; extra == 'bigquery'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'bigquery'
Requires-Dist: click (>=6.0.0) ; extra == 'bigquery'
Requires-Dist: docker ; extra == 'bigquery'
Requires-Dist: PyYAML ; extra == 'bigquery'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'bigquery'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'bigquery'
Requires-Dist: python-dateutil ; extra == 'bigquery'
Requires-Dist: entrypoints ; extra == 'bigquery'
Requires-Dist: pybigquery (>=0.6.0) ; extra == 'bigquery'
Provides-Extra: datahub-kafka
Requires-Dist: expandvars (>=0.6.5) ; extra == 'datahub-kafka'
Requires-Dist: toml (>=0.10.0) ; extra == 'datahub-kafka'
Requires-Dist: fastavro (>=1.2.0) ; extra == 'datahub-kafka'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'datahub-kafka'
Requires-Dist: click (>=6.0.0) ; extra == 'datahub-kafka'
Requires-Dist: docker ; extra == 'datahub-kafka'
Requires-Dist: confluent-kafka (>=1.5.0) ; extra == 'datahub-kafka'
Requires-Dist: PyYAML ; extra == 'datahub-kafka'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'datahub-kafka'
Requires-Dist: python-dateutil ; extra == 'datahub-kafka'
Requires-Dist: entrypoints ; extra == 'datahub-kafka'
Provides-Extra: datahub-rest
Requires-Dist: expandvars (>=0.6.5) ; extra == 'datahub-rest'
Requires-Dist: toml (>=0.10.0) ; extra == 'datahub-rest'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'datahub-rest'
Requires-Dist: click (>=6.0.0) ; extra == 'datahub-rest'
Requires-Dist: requests ; extra == 'datahub-rest'
Requires-Dist: docker ; extra == 'datahub-rest'
Requires-Dist: PyYAML ; extra == 'datahub-rest'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'datahub-rest'
Requires-Dist: python-dateutil ; extra == 'datahub-rest'
Requires-Dist: entrypoints ; extra == 'datahub-rest'
Provides-Extra: dev
Requires-Dist: types-PyYAML ; extra == 'dev'
Requires-Dist: toml (>=0.10.0) ; extra == 'dev'
Requires-Dist: types-six ; extra == 'dev'
Requires-Dist: isort (>=5.7.0) ; extra == 'dev'
Requires-Dist: pytest (>=6.2.2) ; extra == 'dev'
Requires-Dist: acryl-pyhive[hive] (>=0.6.7) ; extra == 'dev'
Requires-Dist: click (>=6.0.0) ; extra == 'dev'
Requires-Dist: sql-metadata (==1.12.0) ; extra == 'dev'
Requires-Dist: flake8 (>=3.8.3) ; extra == 'dev'
Requires-Dist: PyYAML ; extra == 'dev'
Requires-Dist: build ; extra == 'dev'
Requires-Dist: requests-mock ; extra == 'dev'
Requires-Dist: python-dateutil ; extra == 'dev'
Requires-Dist: entrypoints ; extra == 'dev'
Requires-Dist: types-requests ; extra == 'dev'
Requires-Dist: pytest-docker (>=0.10.3) ; extra == 'dev'
Requires-Dist: pybigquery (>=0.6.0) ; extra == 'dev'
Requires-Dist: typing-inspect ; extra == 'dev'
Requires-Dist: twine ; extra == 'dev'
Requires-Dist: sqlalchemy-stubs ; extra == 'dev'
Requires-Dist: pymongo (>=3.11) ; extra == 'dev'
Requires-Dist: coverage (>=5.1) ; extra == 'dev'
Requires-Dist: requests ; extra == 'dev'
Requires-Dist: sqlalchemy-pytds (>=0.3) ; extra == 'dev'
Requires-Dist: mypy (>=0.901) ; extra == 'dev'
Requires-Dist: types-freezegun ; extra == 'dev'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'dev'
Requires-Dist: pydantic (>=1.5.1) ; extra == 'dev'
Requires-Dist: expandvars (>=0.6.5) ; extra == 'dev'
Requires-Dist: apache-airflow (==1.10.15) ; extra == 'dev'
Requires-Dist: black (>=19.10b0) ; extra == 'dev'
Requires-Dist: boto3 ; extra == 'dev'
Requires-Dist: docker ; extra == 'dev'
Requires-Dist: python-ldap (>=2.4) ; extra == 'dev'
Requires-Dist: mypy-extensions (>=0.4.3) ; extra == 'dev'
Requires-Dist: types-toml ; extra == 'dev'
Requires-Dist: types-PyMySQL ; extra == 'dev'
Requires-Dist: types-click (==0.1.12) ; extra == 'dev'
Requires-Dist: tox ; extra == 'dev'
Requires-Dist: pymysql (>=1.0.2) ; extra == 'dev'
Requires-Dist: freezegun ; extra == 'dev'
Requires-Dist: deepdiff ; extra == 'dev'
Requires-Dist: cx-Oracle ; extra == 'dev'
Requires-Dist: types-pkg-resources ; extra == 'dev'
Requires-Dist: apache-airflow-backport-providers-snowflake ; extra == 'dev'
Requires-Dist: fastavro (>=1.2.0) ; extra == 'dev'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'dev'
Requires-Dist: lkml (>=1.1.0) ; extra == 'dev'
Requires-Dist: looker-sdk (==21.6.0) ; extra == 'dev'
Requires-Dist: confluent-kafka (>=1.5.0) ; extra == 'dev'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'dev'
Requires-Dist: types-python-dateutil ; extra == 'dev'
Requires-Dist: types-dataclasses ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.8.1) ; extra == 'dev'
Provides-Extra: dev-airflow2
Requires-Dist: types-PyYAML ; extra == 'dev-airflow2'
Requires-Dist: toml (>=0.10.0) ; extra == 'dev-airflow2'
Requires-Dist: types-six ; extra == 'dev-airflow2'
Requires-Dist: isort (>=5.7.0) ; extra == 'dev-airflow2'
Requires-Dist: pytest (>=6.2.2) ; extra == 'dev-airflow2'
Requires-Dist: acryl-pyhive[hive] (>=0.6.7) ; extra == 'dev-airflow2'
Requires-Dist: click (>=6.0.0) ; extra == 'dev-airflow2'
Requires-Dist: apache-airflow (>=2.0.2) ; extra == 'dev-airflow2'
Requires-Dist: sql-metadata (==1.12.0) ; extra == 'dev-airflow2'
Requires-Dist: flake8 (>=3.8.3) ; extra == 'dev-airflow2'
Requires-Dist: PyYAML ; extra == 'dev-airflow2'
Requires-Dist: build ; extra == 'dev-airflow2'
Requires-Dist: requests-mock ; extra == 'dev-airflow2'
Requires-Dist: python-dateutil ; extra == 'dev-airflow2'
Requires-Dist: entrypoints ; extra == 'dev-airflow2'
Requires-Dist: types-requests ; extra == 'dev-airflow2'
Requires-Dist: pytest-docker (>=0.10.3) ; extra == 'dev-airflow2'
Requires-Dist: pybigquery (>=0.6.0) ; extra == 'dev-airflow2'
Requires-Dist: typing-inspect ; extra == 'dev-airflow2'
Requires-Dist: twine ; extra == 'dev-airflow2'
Requires-Dist: sqlalchemy-stubs ; extra == 'dev-airflow2'
Requires-Dist: pymongo (>=3.11) ; extra == 'dev-airflow2'
Requires-Dist: coverage (>=5.1) ; extra == 'dev-airflow2'
Requires-Dist: requests ; extra == 'dev-airflow2'
Requires-Dist: sqlalchemy-pytds (>=0.3) ; extra == 'dev-airflow2'
Requires-Dist: mypy (>=0.901) ; extra == 'dev-airflow2'
Requires-Dist: types-freezegun ; extra == 'dev-airflow2'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'dev-airflow2'
Requires-Dist: pydantic (>=1.5.1) ; extra == 'dev-airflow2'
Requires-Dist: expandvars (>=0.6.5) ; extra == 'dev-airflow2'
Requires-Dist: black (>=19.10b0) ; extra == 'dev-airflow2'
Requires-Dist: boto3 ; extra == 'dev-airflow2'
Requires-Dist: apache-airflow-providers-snowflake ; extra == 'dev-airflow2'
Requires-Dist: docker ; extra == 'dev-airflow2'
Requires-Dist: python-ldap (>=2.4) ; extra == 'dev-airflow2'
Requires-Dist: mypy-extensions (>=0.4.3) ; extra == 'dev-airflow2'
Requires-Dist: types-toml ; extra == 'dev-airflow2'
Requires-Dist: types-PyMySQL ; extra == 'dev-airflow2'
Requires-Dist: types-click (==0.1.12) ; extra == 'dev-airflow2'
Requires-Dist: tox ; extra == 'dev-airflow2'
Requires-Dist: pymysql (>=1.0.2) ; extra == 'dev-airflow2'
Requires-Dist: freezegun ; extra == 'dev-airflow2'
Requires-Dist: deepdiff ; extra == 'dev-airflow2'
Requires-Dist: cx-Oracle ; extra == 'dev-airflow2'
Requires-Dist: types-pkg-resources ; extra == 'dev-airflow2'
Requires-Dist: fastavro (>=1.2.0) ; extra == 'dev-airflow2'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'dev-airflow2'
Requires-Dist: lkml (>=1.1.0) ; extra == 'dev-airflow2'
Requires-Dist: looker-sdk (==21.6.0) ; extra == 'dev-airflow2'
Requires-Dist: confluent-kafka (>=1.5.0) ; extra == 'dev-airflow2'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'dev-airflow2'
Requires-Dist: types-python-dateutil ; extra == 'dev-airflow2'
Requires-Dist: types-dataclasses ; extra == 'dev-airflow2'
Requires-Dist: pytest-cov (>=2.8.1) ; extra == 'dev-airflow2'
Requires-Dist: dataclasses (>=0.6) ; (python_version < "3.7") and extra == 'dev-airflow2'
Requires-Dist: typing-extensions (>=3.7.4) ; (python_version < "3.8") and extra == 'dev-airflow2'
Requires-Dist: dataclasses (>=0.6) ; (python_version < "3.7") and extra == 'dev'
Requires-Dist: typing-extensions (>=3.7.4) ; (python_version < "3.8") and extra == 'dev'
Provides-Extra: druid
Requires-Dist: expandvars (>=0.6.5) ; extra == 'druid'
Requires-Dist: toml (>=0.10.0) ; extra == 'druid'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'druid'
Requires-Dist: click (>=6.0.0) ; extra == 'druid'
Requires-Dist: docker ; extra == 'druid'
Requires-Dist: PyYAML ; extra == 'druid'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'druid'
Requires-Dist: pydruid (>=0.6.2) ; extra == 'druid'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'druid'
Requires-Dist: python-dateutil ; extra == 'druid'
Requires-Dist: entrypoints ; extra == 'druid'
Provides-Extra: feast
Requires-Dist: expandvars (>=0.6.5) ; extra == 'feast'
Requires-Dist: toml (>=0.10.0) ; extra == 'feast'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'feast'
Requires-Dist: click (>=6.0.0) ; extra == 'feast'
Requires-Dist: docker ; extra == 'feast'
Requires-Dist: PyYAML ; extra == 'feast'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'feast'
Requires-Dist: python-dateutil ; extra == 'feast'
Requires-Dist: entrypoints ; extra == 'feast'
Provides-Extra: glue
Requires-Dist: expandvars (>=0.6.5) ; extra == 'glue'
Requires-Dist: toml (>=0.10.0) ; extra == 'glue'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'glue'
Requires-Dist: click (>=6.0.0) ; extra == 'glue'
Requires-Dist: boto3 ; extra == 'glue'
Requires-Dist: docker ; extra == 'glue'
Requires-Dist: PyYAML ; extra == 'glue'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'glue'
Requires-Dist: python-dateutil ; extra == 'glue'
Requires-Dist: entrypoints ; extra == 'glue'
Provides-Extra: hive
Requires-Dist: expandvars (>=0.6.5) ; extra == 'hive'
Requires-Dist: toml (>=0.10.0) ; extra == 'hive'
Requires-Dist: acryl-pyhive[hive] (>=0.6.7) ; extra == 'hive'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'hive'
Requires-Dist: click (>=6.0.0) ; extra == 'hive'
Requires-Dist: docker ; extra == 'hive'
Requires-Dist: PyYAML ; extra == 'hive'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'hive'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'hive'
Requires-Dist: python-dateutil ; extra == 'hive'
Requires-Dist: entrypoints ; extra == 'hive'
Provides-Extra: kafka
Requires-Dist: expandvars (>=0.6.5) ; extra == 'kafka'
Requires-Dist: toml (>=0.10.0) ; extra == 'kafka'
Requires-Dist: fastavro (>=1.2.0) ; extra == 'kafka'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'kafka'
Requires-Dist: click (>=6.0.0) ; extra == 'kafka'
Requires-Dist: docker ; extra == 'kafka'
Requires-Dist: confluent-kafka (>=1.5.0) ; extra == 'kafka'
Requires-Dist: PyYAML ; extra == 'kafka'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'kafka'
Requires-Dist: python-dateutil ; extra == 'kafka'
Requires-Dist: entrypoints ; extra == 'kafka'
Provides-Extra: kafka-connect
Requires-Dist: expandvars (>=0.6.5) ; extra == 'kafka-connect'
Requires-Dist: toml (>=0.10.0) ; extra == 'kafka-connect'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'kafka-connect'
Requires-Dist: click (>=6.0.0) ; extra == 'kafka-connect'
Requires-Dist: requests ; extra == 'kafka-connect'
Requires-Dist: docker ; extra == 'kafka-connect'
Requires-Dist: PyYAML ; extra == 'kafka-connect'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'kafka-connect'
Requires-Dist: python-dateutil ; extra == 'kafka-connect'
Requires-Dist: entrypoints ; extra == 'kafka-connect'
Provides-Extra: ldap
Requires-Dist: expandvars (>=0.6.5) ; extra == 'ldap'
Requires-Dist: toml (>=0.10.0) ; extra == 'ldap'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'ldap'
Requires-Dist: click (>=6.0.0) ; extra == 'ldap'
Requires-Dist: docker ; extra == 'ldap'
Requires-Dist: python-ldap (>=2.4) ; extra == 'ldap'
Requires-Dist: PyYAML ; extra == 'ldap'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'ldap'
Requires-Dist: python-dateutil ; extra == 'ldap'
Requires-Dist: entrypoints ; extra == 'ldap'
Provides-Extra: looker
Requires-Dist: expandvars (>=0.6.5) ; extra == 'looker'
Requires-Dist: toml (>=0.10.0) ; extra == 'looker'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'looker'
Requires-Dist: click (>=6.0.0) ; extra == 'looker'
Requires-Dist: looker-sdk (==21.6.0) ; extra == 'looker'
Requires-Dist: docker ; extra == 'looker'
Requires-Dist: PyYAML ; extra == 'looker'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'looker'
Requires-Dist: python-dateutil ; extra == 'looker'
Requires-Dist: entrypoints ; extra == 'looker'
Provides-Extra: lookml
Requires-Dist: expandvars (>=0.6.5) ; extra == 'lookml'
Requires-Dist: toml (>=0.10.0) ; extra == 'lookml'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'lookml'
Requires-Dist: click (>=6.0.0) ; extra == 'lookml'
Requires-Dist: lkml (>=1.1.0) ; extra == 'lookml'
Requires-Dist: docker ; extra == 'lookml'
Requires-Dist: sql-metadata (==1.12.0) ; extra == 'lookml'
Requires-Dist: PyYAML ; extra == 'lookml'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'lookml'
Requires-Dist: python-dateutil ; extra == 'lookml'
Requires-Dist: entrypoints ; extra == 'lookml'
Provides-Extra: mongodb
Requires-Dist: expandvars (>=0.6.5) ; extra == 'mongodb'
Requires-Dist: toml (>=0.10.0) ; extra == 'mongodb'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'mongodb'
Requires-Dist: click (>=6.0.0) ; extra == 'mongodb'
Requires-Dist: pymongo (>=3.11) ; extra == 'mongodb'
Requires-Dist: docker ; extra == 'mongodb'
Requires-Dist: PyYAML ; extra == 'mongodb'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'mongodb'
Requires-Dist: python-dateutil ; extra == 'mongodb'
Requires-Dist: entrypoints ; extra == 'mongodb'
Provides-Extra: mssql
Requires-Dist: expandvars (>=0.6.5) ; extra == 'mssql'
Requires-Dist: toml (>=0.10.0) ; extra == 'mssql'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'mssql'
Requires-Dist: click (>=6.0.0) ; extra == 'mssql'
Requires-Dist: docker ; extra == 'mssql'
Requires-Dist: PyYAML ; extra == 'mssql'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'mssql'
Requires-Dist: sqlalchemy-pytds (>=0.3) ; extra == 'mssql'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'mssql'
Requires-Dist: python-dateutil ; extra == 'mssql'
Requires-Dist: entrypoints ; extra == 'mssql'
Provides-Extra: mssql-odbc
Requires-Dist: expandvars (>=0.6.5) ; extra == 'mssql-odbc'
Requires-Dist: toml (>=0.10.0) ; extra == 'mssql-odbc'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'mssql-odbc'
Requires-Dist: click (>=6.0.0) ; extra == 'mssql-odbc'
Requires-Dist: pyodbc ; extra == 'mssql-odbc'
Requires-Dist: docker ; extra == 'mssql-odbc'
Requires-Dist: PyYAML ; extra == 'mssql-odbc'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'mssql-odbc'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'mssql-odbc'
Requires-Dist: python-dateutil ; extra == 'mssql-odbc'
Requires-Dist: entrypoints ; extra == 'mssql-odbc'
Provides-Extra: mysql
Requires-Dist: expandvars (>=0.6.5) ; extra == 'mysql'
Requires-Dist: toml (>=0.10.0) ; extra == 'mysql'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'mysql'
Requires-Dist: click (>=6.0.0) ; extra == 'mysql'
Requires-Dist: docker ; extra == 'mysql'
Requires-Dist: PyYAML ; extra == 'mysql'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'mysql'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'mysql'
Requires-Dist: python-dateutil ; extra == 'mysql'
Requires-Dist: entrypoints ; extra == 'mysql'
Requires-Dist: pymysql (>=1.0.2) ; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: expandvars (>=0.6.5) ; extra == 'oracle'
Requires-Dist: toml (>=0.10.0) ; extra == 'oracle'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'oracle'
Requires-Dist: click (>=6.0.0) ; extra == 'oracle'
Requires-Dist: cx-Oracle ; extra == 'oracle'
Requires-Dist: docker ; extra == 'oracle'
Requires-Dist: PyYAML ; extra == 'oracle'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'oracle'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'oracle'
Requires-Dist: python-dateutil ; extra == 'oracle'
Requires-Dist: entrypoints ; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: expandvars (>=0.6.5) ; extra == 'postgres'
Requires-Dist: GeoAlchemy2 ; extra == 'postgres'
Requires-Dist: toml (>=0.10.0) ; extra == 'postgres'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'postgres'
Requires-Dist: click (>=6.0.0) ; extra == 'postgres'
Requires-Dist: psycopg2-binary ; extra == 'postgres'
Requires-Dist: docker ; extra == 'postgres'
Requires-Dist: PyYAML ; extra == 'postgres'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'postgres'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'postgres'
Requires-Dist: python-dateutil ; extra == 'postgres'
Requires-Dist: entrypoints ; extra == 'postgres'
Provides-Extra: redshift
Requires-Dist: expandvars (>=0.6.5) ; extra == 'redshift'
Requires-Dist: GeoAlchemy2 ; extra == 'redshift'
Requires-Dist: sqlalchemy-redshift ; extra == 'redshift'
Requires-Dist: toml (>=0.10.0) ; extra == 'redshift'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'redshift'
Requires-Dist: click (>=6.0.0) ; extra == 'redshift'
Requires-Dist: psycopg2-binary ; extra == 'redshift'
Requires-Dist: docker ; extra == 'redshift'
Requires-Dist: PyYAML ; extra == 'redshift'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'redshift'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'redshift'
Requires-Dist: python-dateutil ; extra == 'redshift'
Requires-Dist: entrypoints ; extra == 'redshift'
Provides-Extra: snowflake
Requires-Dist: expandvars (>=0.6.5) ; extra == 'snowflake'
Requires-Dist: toml (>=0.10.0) ; extra == 'snowflake'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'snowflake'
Requires-Dist: click (>=6.0.0) ; extra == 'snowflake'
Requires-Dist: snowflake-sqlalchemy ; extra == 'snowflake'
Requires-Dist: docker ; extra == 'snowflake'
Requires-Dist: PyYAML ; extra == 'snowflake'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'snowflake'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'snowflake'
Requires-Dist: python-dateutil ; extra == 'snowflake'
Requires-Dist: entrypoints ; extra == 'snowflake'
Provides-Extra: sqlalchemy
Requires-Dist: expandvars (>=0.6.5) ; extra == 'sqlalchemy'
Requires-Dist: toml (>=0.10.0) ; extra == 'sqlalchemy'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'sqlalchemy'
Requires-Dist: click (>=6.0.0) ; extra == 'sqlalchemy'
Requires-Dist: docker ; extra == 'sqlalchemy'
Requires-Dist: PyYAML ; extra == 'sqlalchemy'
Requires-Dist: sqlalchemy (==1.3.24) ; extra == 'sqlalchemy'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'sqlalchemy'
Requires-Dist: python-dateutil ; extra == 'sqlalchemy'
Requires-Dist: entrypoints ; extra == 'sqlalchemy'
Provides-Extra: superset
Requires-Dist: expandvars (>=0.6.5) ; extra == 'superset'
Requires-Dist: toml (>=0.10.0) ; extra == 'superset'
Requires-Dist: avro-gen3 (==0.5.0) ; extra == 'superset'
Requires-Dist: click (>=6.0.0) ; extra == 'superset'
Requires-Dist: requests ; extra == 'superset'
Requires-Dist: docker ; extra == 'superset'
Requires-Dist: PyYAML ; extra == 'superset'
Requires-Dist: avro-python3 (>=1.8.2) ; extra == 'superset'
Requires-Dist: python-dateutil ; extra == 'superset'
Requires-Dist: entrypoints ; extra == 'superset'

# DataHub Metadata Ingestion

![Python version 3.6+](https://img.shields.io/badge/python-3.6%2B-blue)

This module hosts an extensible Python-based metadata ingestion system for DataHub.
It supports sending metadata to DataHub over Kafka or through the REST API.
It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.

## Getting Started

### Prerequisites

Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through [quickstart Docker images](../docker).

### Install from PyPI

The folks over at [Acryl Data](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion.

```shell
# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip uninstall datahub acryl-datahub || true  # sanity check - ok if it fails
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
```

If you run into an error, try checking the [_common setup issues_](./developing.md#Common-setup-issues).

#### Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need.

| Plugin Name   | Install Command                                            | Provides                            |
| ------------- | ---------------------------------------------------------- | ----------------------------------- |
| file          | _included by default_                                      | File source and sink                |
| console       | _included by default_                                      | Console sink                        |
| athena        | `pip install 'acryl-datahub[athena]'`                      | AWS Athena source                   |
| bigquery      | `pip install 'acryl-datahub[bigquery]'`                    | BigQuery source                     |
| feast         | `pip install 'acryl-datahub[feast]'`                       | Feast source                        |
| glue          | `pip install 'acryl-datahub[glue]'`                        | AWS Glue source                     |
| hive          | `pip install 'acryl-datahub[hive]'`                        | Hive source                         |
| mssql         | `pip install 'acryl-datahub[mssql]'`                       | SQL Server source                   |
| mysql         | `pip install 'acryl-datahub[mysql]'`                       | MySQL source                        |
| oracle        | `pip install 'acryl-datahub[oracle]'`                      | Oracle source                       |
| postgres      | `pip install 'acryl-datahub[postgres]'`                    | Postgres source                     |
| redshift      | `pip install 'acryl-datahub[redshift]'`                    | Redshift source                     |
| sqlalchemy    | `pip install 'acryl-datahub[sqlalchemy]'`                  | Generic SQLAlchemy source           |
| snowflake     | `pip install 'acryl-datahub[snowflake]'`                   | Snowflake source                    |
| superset      | `pip install 'acryl-datahub[superset]'`                    | Superset source                     |
| mongodb       | `pip install 'acryl-datahub[mongodb]'`                     | MongoDB source                      |
| ldap          | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source                         |
| looker        | `pip install 'acryl-datahub[looker]'`                      | Looker source                       |
| lookml        | `pip install 'acryl-datahub[lookml]'`                      | LookML source, requires Python 3.7+ |
| kafka         | `pip install 'acryl-datahub[kafka]'`                       | Kafka source                        |
| druid         | `pip install 'acryl-datahub[druid]'`                       | Druid source                        |
| dbt           | _no additional dependencies_                               | dbt source                          |
| datahub-rest  | `pip install 'acryl-datahub[datahub-rest]'`                | DataHub sink over REST API          |
| datahub-kafka | `pip install 'acryl-datahub[datahub-kafka]'`               | DataHub sink over Kafka             |

These plugins can be mixed and matched as desired. For example:

```shell
pip install 'acryl-datahub[bigquery,datahub-rest]'
```

You can check the active plugins:

```shell
datahub check plugins
```

[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites

#### Basic Usage

```shell
pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml
```

### Install using Docker

[![Docker Hub](https://img.shields.io/docker/pulls/linkedin/datahub-ingestion?style=plastic)](https://hub.docker.com/r/linkedin/datahub-ingestion)
[![datahub-ingestion docker](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml)

If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.

_Limitation: the datahub_docker.sh convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly._

```shell
./scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
```

### Install from source

If you'd like to install from source, see the [developer guide](./developing.md).

## Recipes

A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink).
Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.

```yaml
# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the REST API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
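
To make the recipe layout concrete, here is a minimal Python sketch that checks the top-level shape of a recipe once it has been parsed into a dictionary. The helper `check_recipe_shape` is hypothetical and not part of the `datahub` package. It assumes that `source` and `sink` are required and that `transformers` is optional, which the example above suggests but does not state.

```python
from typing import Any, Dict, List


def check_recipe_shape(recipe: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the top-level layout of a parsed recipe.

    Hypothetical helper for illustration only; DataHub performs its own
    validation, which may differ from this sketch.
    """
    problems: List[str] = []
    # Assumption: every recipe needs a source and a sink, each with a 'type'.
    for section in ("source", "sink"):
        block = recipe.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing '{section}' section")
        elif "type" not in block:
            problems.append(f"'{section}' section has no 'type'")
    # Assumption: transformers are optional, but must be a list when present.
    if not isinstance(recipe.get("transformers", []), list):
        problems.append("'transformers' must be a list")
    return problems


recipe = {
    "source": {"type": "mssql", "config": {"username": "sa", "database": "DemoData"}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}

print(check_recipe_shape(recipe))
print(check_recipe_shape({"source": {"type": "mssql"}}))
```

The first call reports no problems; the second reports the missing `sink` section.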

We automatically expand environment variables in the config,
similar to variable substitution in GNU bash or in docker-compose files. For details, see
https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution.
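
The expansion described above can be sketched with the Python standard library. This is an illustration only: it uses `os.path.expandvars`, which handles `$VAR` and `${VAR}` but not shell-style defaults such as `${VAR:-default}`, so DataHub's own behavior may be richer. The helper name `expand_strings` and the placeholder password are assumptions made for the demo.

```python
import os

# Placeholder value so the demo is self-contained; in practice the variable
# would already be set in the environment that runs the ingestion.
os.environ["MSSQL_PASSWORD"] = "example-secret"

raw_config = {
    "username": "sa",
    "password": "${MSSQL_PASSWORD}",
    "database": "DemoData",
}


def expand_strings(value):
    """Recursively expand $VAR and ${VAR} references in string values."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_strings(item) for item in value]
    return value  # numbers, booleans, None are left untouched


expanded = expand_strings(raw_config)
print(expanded["password"])  # → example-secret
```

Keeping secrets in environment variables means the recipe file itself can be committed or shared without exposing credentials.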

Running a recipe is quite easy.

```shell
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
```

A number of recipes are included in the [examples/recipes](./examples/recipes) directory.

## Sources

### Kafka Metadata `kafka`

Extracts:

- List of topics - from the Kafka broker
- Schemas associated with each topic - from the schema registry

```yml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#serde-consumer
      schema_registry_url: http://localhost:8081
      schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
```

For a full example with a number of security options, see this [example recipe](./examples/recipes/secured_kafka_to_console.yml).

### MySQL Metadata `mysql`

Extracts:

- List of databases and tables
- Column types and schema associated with each table

```yml
source:
  type: mysql
  config:
    username: root
    password: example
    database: dbname
    host_port: localhost:3306
    table_pattern:
      deny:
        # Note that the deny patterns take precedence over the allow patterns.
        - "performance_schema"
      allow:
        - "schema1.table2"
      # Although the 'table_pattern' enables you to skip everything from certain schemas,
      # having another option to allow/deny on schema level is an optimization for the case when there is a large number
      # of schemas that one wants to skip and you want to avoid the time to needlessly fetch those tables only to filter
      # them out afterwards via the table_pattern.
    schema_pattern:
      deny:
        - "garbage_schema"
      allow:
        - "schema1"
```
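
The allow/deny semantics can be sketched in a few lines of Python. This is a minimal illustration and not the framework's own implementation: the function name `is_allowed` is hypothetical, and it assumes each pattern is a regular expression matched from the start of the name. The only behavior taken from the recipe above is that deny patterns take precedence over allow patterns.

```python
import re
from typing import List


def is_allowed(name: str, allow: List[str], deny: List[str]) -> bool:
    """Return True if `name` passes the filter; deny rules win over allow rules."""
    # Any matching deny pattern rejects the name outright.
    if any(re.match(pattern, name) for pattern in deny):
        return False
    # Otherwise the name must match at least one allow pattern.
    return any(re.match(pattern, name) for pattern in allow)


allow = ["schema1.table2"]
deny = ["performance_schema"]

print(is_allowed("schema1.table2", allow, deny))  # True
print(is_allowed("performance_schema.events", allow, deny))  # False
print(is_allowed("schema1.table3", allow, deny))  # False
```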

### Microsoft SQL Server Metadata `mssql`

We have two options for the underlying library used to connect to SQL Server: (1) [python-tds](https://github.com/denisenkom/pytds) and (2) [pyodbc](https://github.com/mkleehammer/pyodbc). The TDS library is pure Python and hence easier to install, but only PyODBC supports encrypted connections.

Extracts:

- List of databases, schema, tables and views
- Column types associated with each table/view

```yml
source:
  type: mssql
  config:
    username: user
    password: pass
    host_port: localhost:1433
    database: DemoDatabase
    include_views: True # whether to include views, defaults to True
    table_pattern:
      deny:
        - "^.*\\.sys_.*" # deny all tables that start with sys_
      allow:
        - "schema1.table1"
        - "schema1.table2"
    options:
      # Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
      # See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details.
      # Many of these options are specific to the underlying database driver, so that library's
      # documentation will be a good reference for what is supported. To find which dialect is likely
      # in use, consult this table: https://docs.sqlalchemy.org/en/14/dialects/index.html.
      charset: "utf8"
    # If set to true, we'll use the pyodbc library. This requires you to have
    # already installed the Microsoft ODBC Driver for SQL Server.
    # See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
    use_odbc: False
    uri_args: {}
```

<details>
  <summary>Example: using ingestion with ODBC and encryption</summary>

This requires you to have already installed the Microsoft ODBC Driver for SQL Server.
See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15

```yml
source:
  type: mssql
  config:
    # See https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc
    use_odbc: True
    username: user
    password: pass
    host_port: localhost:1433
    database: DemoDatabase
    include_views: True # whether to include views, defaults to True
    uri_args:
      # See https://docs.microsoft.com/en-us/sql/connect/odbc/dsn-connection-string-attribute?view=sql-server-ver15
      driver: "ODBC Driver 17 for SQL Server"
      Encrypt: "yes"
      TrustServerCertificate: "Yes"
      ssl: "True"
      # Trusted_Connection: "yes"
```

</details>

### Hive `hive`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table
- Detailed table and storage information

```yml
source:
  type: hive
  config:
    # For more details on authentication, see the PyHive docs:
    # https://github.com/dropbox/PyHive#passing-session-configuration.
    # LDAP, Kerberos, etc. are supported using connect_args, which can be
    # added under the `options` config parameter.
    #scheme: 'hive+http' # set this if Thrift should use the HTTP transport
    #scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
    username: user # optional
    password: pass # optional
    host_port: localhost:10000
    database: DemoDatabase # optional, defaults to 'default'
    # table_pattern/schema_pattern is same as above
    # options is same as above
```

<details>
  <summary>Example: using ingestion with Azure HDInsight</summary>

```yml
# Connecting to Microsoft Azure HDInsight using TLS.
source:
  type: hive
  config:
    scheme: "hive+https"
    host_port: <cluster_name>.azurehdinsight.net:443
    username: admin
    password: "<password>"
    options:
      connect_args:
        http_path: "/hive2"
        auth: BASIC
    # table_pattern/schema_pattern is same as above
```

</details>

### PostgreSQL `postgres`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table
- Also supports PostGIS extensions

```yml
source:
  type: postgres
  config:
    username: user
    password: pass
    host_port: localhost:5432
    database: DemoDatabase
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
    # options is same as above
```

### Redshift `redshift`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table
- Also supports PostGIS extensions

```yml
source:
  type: redshift
  config:
    username: user
    password: pass
    host_port: example.something.us-west-2.redshift.amazonaws.com:5439
    database: DemoDatabase
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
    # options is same as above
```

### Snowflake `snowflake`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table

```yml
source:
  type: snowflake
  config:
    username: user
    password: pass
    host_port: account_name
    database: db_name
    warehouse: "COMPUTE_WH" # optional
    role: "sysadmin" # optional
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
    # options is same as above
```

### Superset `superset`

Extracts:

- List of charts and dashboards

```yml
source:
  type: superset
  config:
    username: user
    password: pass
    provider: db | ldap
    connect_uri: http://localhost:8088
    env: "PROD" # Optional, default is "PROD"
```

See the documentation for Superset's `/security/login` endpoint at https://superset.apache.org/docs/rest-api for more details on Superset's login API.

### Oracle `oracle`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table

Using the Oracle source requires that you've also installed the correct drivers; see the [cx_Oracle docs](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). The easiest one is the [Oracle Instant Client](https://www.oracle.com/database/technologies/instant-client.html).

```yml
source:
  type: oracle
  config:
    # For more details on authentication, see the documentation:
    # https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect and
    # https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings.
    username: user
    password: pass
    host_port: localhost:1521
    database: dbname
    service_name: svc # omit database if using this option
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
    # options is same as above
```

### Feast `feast`

**Note: Feast ingestion requires Docker to be installed.**

Extracts:

- List of feature tables (modeled as `MLFeatureTable`s), features (`MLFeature`s), and entities (`MLPrimaryKey`s)
- Column types associated with each feature and entity

Note: this uses a separate Docker container to extract Feast's metadata into a JSON file, which is then
parsed to DataHub's native objects. This was done because of a dependency conflict in the `feast` module.

```yml
source:
  type: feast
  config:
    core_url: localhost:6565 # default
    env: "PROD" # Optional, default is "PROD"
    use_local_build: False # Whether to build Feast ingestion image locally, default is False
```

### Google BigQuery `bigquery`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table

```yml
source:
  type: bigquery
  config:
    project_id: project # optional - can autodetect from environment
    options: # options is same as above
      # See https://github.com/mxmzdlv/pybigquery#authentication for details.
      credentials_path: "/path/to/keyfile.json" # optional
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
```

### AWS Athena `athena`

Extracts:

- List of databases and tables
- Column types associated with each table

```yml
source:
  type: athena
  config:
    username: aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    password: aws_secret_access_key # Optional.
    database: database # Optional, defaults to "default"
    aws_region: aws_region_name # e.g. "eu-west-1"
    s3_staging_dir: s3_location # "s3://<bucket-name>/prefix/"
    # The s3_staging_dir parameter is needed because Athena always writes query results to S3.
    # See https://docs.aws.amazon.com/athena/latest/ug/querying.html
    # However, the athena driver will transparently fetch these results as you would expect from any other sql client.
    work_group: athena_workgroup # "primary"
    include_views: True # whether to include views, defaults to True
    # table_pattern/schema_pattern is same as above
```

### AWS Glue `glue`

Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](./s3-ingestion.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.

Extracts:

- List of tables
- Column types associated with each table
- Table metadata, such as owner, description and parameters

```yml
source:
  type: glue
  config:
    aws_region: # aws_region_name, e.g. "eu-west-1"
    env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".

    # Filtering patterns for databases and tables to scan
    database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
    table_pattern: # Optional, to filter tables scanned, same as table_pattern above.

    # Credentials. If not specified here, these are picked up according to boto3 rules.
    # (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
    aws_access_key_id: # Optional.
    aws_secret_access_key: # Optional.
    aws_session_token: # Optional.
    aws_role: # Optional (Role chaining supported by using a sorted list).
```

### Druid `druid`

Extracts:

- List of databases, schema, and tables
- Column types associated with each table

**Note:** If you add a schema pattern, it is important to explicitly define a deny pattern for Druid's internal
databases (lookup and sys); otherwise the crawler may crash before processing the relevant databases.
This deny pattern is defined by default but is overridden by user-submitted configurations.

```yml
source:
  type: druid
  config:
    # Point to broker address
    host_port: localhost:8082
    schema_pattern:
      deny:
        - "^(lookup|sys).*"
    # options is same as above
```

### Other databases using SQLAlchemy `sqlalchemy`

The `sqlalchemy` source is useful if we don't have a pre-built source for your chosen
database system, but there is an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/14/dialects/)
defined elsewhere. In order to use this, you must `pip install` the required dialect packages yourself.

Extracts:

- List of schemas and tables
- Column types associated with each table

```yml
source:
  type: sqlalchemy
  config:
    # See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls
    connect_uri: "dialect+driver://username:password@host:port/database"
    options: {} # same as above
    schema_pattern: {} # same as above
    table_pattern: {} # same as above
    include_views: True # whether to include views, defaults to True
```
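
Usernames and passwords that contain characters such as `@` or `/` need to be URL-encoded before they are placed in `connect_uri`. The sketch below shows one way to assemble such a URI with the standard library. The helper `build_connect_uri` and the example credentials are hypothetical; the dialect and driver names must match whatever dialect package you installed.

```python
from urllib.parse import quote_plus


def build_connect_uri(dialect: str, driver: str, username: str, password: str,
                      host: str, port: int, database: str) -> str:
    """Assemble a dialect+driver://username:password@host:port/database URI."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URI.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{dialect}+{driver}://{user}:{secret}@{host}:{port}/{database}"


uri = build_connect_uri(
    "postgresql", "psycopg2", "user", "p@ss/word", "localhost", 5432, "dbname"
)
print(uri)  # postgresql+psycopg2://user:p%40ss%2Fword@localhost:5432/dbname
```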

### MongoDB `mongodb`

Extracts:

- List of databases
- List of collections in each database and infers schemas for each collection

By default, schema inference samples 1,000 documents from each collection. Setting `schemaSamplingSize: null` will scan the entire collection.

Note that `schemaSamplingSize` has no effect if `enableSchemaInference: False` is set.

```yml
source:
  type: "mongodb"
  config:
    # For advanced configurations, see the MongoDB docs.
    # https://pymongo.readthedocs.io/en/stable/examples/authentication.html
    connect_uri: "mongodb://localhost"
    username: admin
    password: password
    env: "PROD" # Optional, default is "PROD"
    authMechanism: "DEFAULT"
    options: {}
    database_pattern: {}
    collection_pattern: {}
    enableSchemaInference: True
    schemaSamplingSize: 1000
    # database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
```
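
To give a feel for what sampling-based inference means, here is a minimal sketch that inspects a limited number of documents and records the Python type names seen for each top-level field. It is an illustration only and not the source's actual algorithm: `infer_schema` and the sample documents are hypothetical, the sketch takes the first documents rather than a random sample, and `sample_size=None` mimics scanning the entire collection.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional, Set


def infer_schema(documents: Iterable[Dict[str, Any]],
                 sample_size: Optional[int] = 1000) -> Dict[str, Set[str]]:
    """Map each top-level field to the set of type names observed in the sample."""
    # islice(..., None) consumes the whole iterable, i.e. a full scan.
    sampled = islice(documents, sample_size)
    schema: Dict[str, Set[str]] = {}
    for doc in sampled:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": None, "tags": ["x"]},
    {"_id": "three"},
]
partial = infer_schema(docs, sample_size=2)  # the third document is never seen
full = infer_schema(docs, sample_size=None)  # every document is scanned
print(partial)
print(full)
```

A smaller sample is faster but can miss fields or types that only appear in later documents, as the `_id` field shows here.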

### LDAP `ldap`

Extracts:

- List of people
- Names, emails, titles, and manager information for each person
- List of groups

```yml
source:
  type: "ldap"
  config:
    ldap_server: ldap://localhost
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"
    base_dn: "dc=example,dc=org"
    filter: "(objectClass=*)" # optional field
    drop_missing_first_last_name: False # optional
```

Set `drop_missing_first_last_name` to true if you have many "headless" LDAP user accounts for devices or
services that should be excluded because they lack a first and last name. This option only affects the
ingestion of LDAP users; LDAP groups are unaffected.

### LookML `lookml`

Note! This plugin uses a package that requires Python 3.7+!

Extracts:

- LookML views from model files
- Name, upstream table names, dimensions, measures, and dimension groups

```yml
source:
  type: "lookml"
  config:
    base_folder: /path/to/model/files # Where the *.model.lkml and *.view.lkml files are stored.
    connection_to_platform_map: # mapping between connection names in the model files to platform names.
      my_snowflake_conn: snowflake
    platform_name: looker_views # Optional, default is "looker_views"
    actor: "urn:li:corpuser:etl" # Optional, default is "urn:li:corpuser:etl"
    model_pattern: {}
    view_pattern: {}
    env: "PROD" # Optional, default is "PROD"
    parse_table_names_from_sql: False # See note below.
```

Note! The integration can use [`sql-metadata`](https://pypi.org/project/sql-metadata/) to try to parse the tables the
views depend on. As these SQL queries can be complicated, and the package doesn't officially support all the SQL dialects
that Looker supports, the result might not be correct. This parsing is disabled by default, but can be enabled by setting
`parse_table_names_from_sql: True`.

### Looker dashboards `looker`

Extracts:

- Looker dashboards and dashboard elements (charts)
- Names, descriptions, URLs, chart types, input view for the charts

```yml
source:
  type: "looker"
  config:
    client_id: str # Your Looker API client ID. Ask your Looker admin.
    client_secret: str # Your Looker API client secret. Ask your Looker admin.
    base_url: str # The URL of your Looker instance: https://company.looker.com:19999 or https://looker.company.com, or similar.
    platform_name: "looker" # Optional, default is "looker"
    view_platform_name: "looker_views" # Optional, default is "looker_views". Should match the `platform_name` in the `lookml` source, if that source is also run.
    actor: "urn:li:corpuser:etl" # Optional, default is "urn:li:corpuser:etl"
    dashboard_pattern: {} # Optional, allows all dashboards by default
    chart_pattern: {} # Optional, allows all charts by default
    env: "PROD" # Optional, default is "PROD"
```

### File `file`

Pulls metadata from a previously generated file. Note that the file sink
can produce such files, and a number of samples are included in the
[examples/mce_files](examples/mce_files) directory.

```yml
source:
  type: file
  config:
    filename: ./path/to/mce/file.json
```

### dbt `dbt`

Pulls metadata from dbt artifact files:

- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
  - This file contains model, source and lineage data.
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
  - This file contains schema data.
  - dbt does not record schema data for ephemeral models. DataHub will therefore show ephemeral models in the lineage, but without an associated schema.
- target_platform:
  - The data platform you are enriching with dbt metadata.
  - [data platforms](https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/resources/DataPlatformInfo.json)
- load_schemas:
  - Load schemas from dbt catalog file, not necessary when the underlying data platform already has this data.
- node_type_pattern:
  - Use this filter to include or exclude node types via allow and deny patterns.

```yml
source:
  type: "dbt"
  config:
    manifest_path: "./path/dbt/manifest_file.json"
    catalog_path: "./path/dbt/catalog_file.json"
    target_platform: "postgres" # optional, eg "postgres", "snowflake", etc.
    load_schemas: True # or False
    node_type_pattern: # optional
      deny:
        - ^test.*
      allow:
        - ^.*
```

Note: when `load_schemas` is False, models that use [identifiers](https://docs.getdbt.com/reference/resource-properties/identifier) to reference their source tables are ingested using the model identifier as the model name to preserve the lineage.

### Kafka Connect `kafka-connect`

Extracts:

- Each Kafka Connect connector as an individual `DataFlowSnapshotClass` entity
- Individual `DataJobSnapshotClass` entities, named using the `{connector_name}:{source_dataset}` convention
- Lineage information from the source database to the Kafka topic

```yml
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://localhost:8083"
    cluster_name: "connect-cluster"
    connector_patterns:
      deny:
        - ^denied-connector.*
      allow:
        - ^allowed-connector.*
```

Current limitations:

- Currently works only for Debezium source connectors.

## Sinks

### DataHub Rest `datahub-rest`

Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.

```yml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```

### DataHub Kafka `datahub-kafka`

Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
interface is that it's asynchronous and can handle higher throughput. This requires the
DataHub mce-consumer container to be running.

```yml
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#serializingproducer
      schema_registry_url: "http://localhost:8081"
      schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
```

### Console `console`

Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.

```yml
sink:
  type: "console"
```

### File `file`

Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the file source can read files generated by this sink.

```yml
sink:
  type: file
  config:
    filename: ./path/to/mce/file.json
```

## Transformations

Beyond basic ingestion, you may sometimes need to modify the source data before passing it on to the sink.
Example use cases include adding ownership information or extra tags.

In such a scenario, you can configure a recipe with a list of transformers.

```yml
transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"
```

A transformer class needs to inherit from [`Transformer`](./src/datahub/ingestion/api/transform.py).
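
The general shape of a transformer can be sketched as follows. This is a simplified, self-contained stand-in and not the framework's real interface: `SketchTransformer`, `AddTagTransformer`, the `tag_urn` config key, and the dict-based records are all hypothetical. Check the `Transformer` base class linked above for the actual method names and signatures before writing your own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class SketchTransformer(ABC):
    """Simplified stand-in for the framework's Transformer base class."""

    @abstractmethod
    def transform(self, records: Iterable[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
        """Consume a stream of records and yield (possibly modified) records."""


class AddTagTransformer(SketchTransformer):
    """Hypothetical transformer that attaches one tag URN to every record."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.tag_urn = config["tag_urn"]

    def transform(self, records):
        for record in records:
            tags: List[str] = list(record.get("tags", []))
            if self.tag_urn not in tags:
                tags.append(self.tag_urn)
            # Yield a copy so the caller's input records are left untouched.
            yield {**record, "tags": tags}


transformer = AddTagTransformer({"tag_urn": "urn:li:tag:Legacy"})
result = list(transformer.transform([{"name": "db.table"}]))
print(result)
```

Processing records as a stream, rather than collecting them first, lets transformers be chained between source and sink without holding all metadata in memory.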

### `simple_add_dataset_ownership`

Adds a set of owners to every dataset.

```yml
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
```

:::tip

If you'd like to add more complex logic for assigning ownership, you can use the more generic [`add_dataset_ownership` transformer](./src/datahub/ingestion/transformer/add_dataset_ownership.py), which calls a user-provided function to determine the ownership of each dataset.

:::

### `simple_add_dataset_tags`

Adds a set of tags to every dataset.

```yml
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
```

:::tip

If you'd like to add more complex logic for assigning tags, you can use the more generic [`add_dataset_tags` transformer](./src/datahub/ingestion/transformer/add_dataset_tags.py), which calls a user-provided function to determine the tags for each dataset.

:::

## Using as a library

In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.

- [DataHub emitter via REST](./src/datahub/emitter/rest_emitter.py) (same requirements as `datahub-rest`). Basic usage [example](./examples/library/lineage_emitter_rest.py).
- [DataHub emitter via Kafka](./src/datahub/emitter/kafka_emitter.py) (same requirements as `datahub-kafka`). Basic usage [example](./examples/library/lineage_emitter_kafka.py).

## Lineage with Airflow

There are a couple of ways to get lineage information from Airflow into DataHub.

:::note Running ingestion on a schedule

If you're simply looking to run ingestion on a schedule, take a look at these sample DAGs:

- [`generic_recipe_sample_dag.py`](./src/datahub_provider/example_dags/generic_recipe_sample_dag.py) - reads a DataHub ingestion recipe file and runs it
- [`mysql_sample_dag.py`](./src/datahub_provider/example_dags/mysql_sample_dag.py) - runs a MySQL metadata ingestion pipeline using an inlined configuration.

:::

### Using DataHub's Airflow lineage backend (recommended)

:::caution

The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.

:::

1. First, you must configure an Airflow hook for DataHub. We support both a DataHub REST hook and a Kafka-based hook, but you only need one.

   ```shell
   # For REST-based:
   airflow connections add  --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
   # For Kafka-based (standard Kafka sink config can be passed via extras):
   airflow connections add  --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
   ```

2. Add the following lines to your `airflow.cfg` file.
   ```ini
   [lineage]
   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
   datahub_kwargs = {
       "datahub_conn_id": "datahub_rest_default",
       "capture_ownership_info": true,
       "capture_tags_info": true,
       "graceful_exceptions": true }
   # The above indentation is important!
   ```
   Configuration options:
   - `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 1.
   - `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
   - `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
   - `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
3. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](./src/datahub_provider/example_dags/lineage_backend_demo.py).
4. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
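
The indentation requirement in step 2 comes from how INI files handle multi-line values: indented lines continue the previous value, while unindented lines start new keys. The sketch below demonstrates this with Python's standard `configparser` and `json` modules on a shortened version of the settings. It is an illustration of the file format only; the helper `read_datahub_kwargs` is hypothetical and this is not how Airflow or the lineage backend is invoked.

```python
import configparser
import json

INDENTED = """
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "graceful_exceptions": true }
"""

# Same settings, but the continuation lines are not indented.
UNINDENTED = """
[lineage]
datahub_kwargs = {
"datahub_conn_id": "datahub_rest_default",
"graceful_exceptions": true }
"""


def read_datahub_kwargs(cfg_text: str) -> str:
    """Return the raw datahub_kwargs string from the [lineage] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("lineage", "datahub_kwargs")


# Indented continuation lines are joined into one value that parses as JSON.
kwargs = json.loads(read_datahub_kwargs(INDENTED))
print(kwargs["datahub_conn_id"])  # datahub_rest_default

# Without indentation, only "{" is stored and the other lines become separate keys.
broken = read_datahub_kwargs(UNINDENTED)
print(repr(broken))  # '{'
```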

### Emitting lineage via a separate operator

Take a look at this sample DAG:

- [`lineage_emission_dag.py`](./src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the DataHub hook. As with ingestion, we support a DataHub REST hook and a Kafka-based hook. See step 1 above for details.

## Developing

See the [developing guide](./developing.md) or the [adding a source guide](./adding-source.md).


