datapi (from data + API) is a Python package that lets you implement a distributed data lakehouse head made of data pods.
The package lets you define, deploy, and list data pods, and generate documentation for them.
Clone this repository and run `pip install .`.

dataPi allows application developers to specify, in a simple YAML file, the informational query their application needs (e.g. sales aggregated by quarter where region is EMEA).

- When `datapi run` is executed, it creates a dataPod: a container-based deployable unit that contains a local engine to resolve the query.
- Each dataPod exposes a REST API that, when called, asks the metastore for the data location and, after checking that permissions are in place, retrieves the data and executes the query locally in the container, without calling the Data Platform engine.
- Finally, it sends the data back to the application.
dataPi builds on top of an existing data platform. Currently there is support for:
- Lakehouse data format: Apache Iceberg
- Cloud Storage: GCS, AWS S3 and Microsoft ADLS
- Metastore: Apache Polaris
- dataPod deployment target: Google Cloud Run
- dataPod build service: Google Cloud Build
Query sources supported: Iceberg tables.
There is support for two types of dataPods: projection and reduction (see the illustrative sketch below).

- Projection dataPods support the `select` and `filters` query operators.
- Reduction dataPods support the `aggregate`, `group_by`, and `filters` query operators.
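Conceptually, these operators map to plain SQL that the dataPod's embedded engine evaluates locally. The sketch below is illustrative only: it runs DuckDB directly against an in-memory stand-in table, and the SQL that dataPi actually generates for a resource may differ.

```python
# Illustrative mapping of dataPi query operators to SQL, evaluated with DuckDB.
# sales_data is a stand-in for an Iceberg table resolved via the metastore.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE sales_data AS
    SELECT * FROM (VALUES
        ('Q1', 'EMEA', 100.0),
        ('Q1', 'AMER', 250.0),
        ('Q2', 'EMEA', 175.0)
    ) AS t(quarter, region, sales)
""")

# Reduction: aggregate + group_by + filters
print(con.sql("""
    SELECT quarter, SUM(sales) AS sales
    FROM sales_data
    WHERE region = 'EMEA'
    GROUP BY quarter
""").fetchall())

# Projection: select + filters
print(con.sql("""
    SELECT sales, quarter, region
    FROM sales_data
    WHERE region = 'EMEA'
""").fetchall())
```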
NOTE: Here is a guide on how to deploy and use Apache Polaris on Google Cloud.
To create a new datapi project, run:

```bash
datapi init [PROJECT_NAME]
```

If you don't specify a project name, it will default to `datapi_project`.
dataPi will create the following structure:

```
datapi_project
├── config.yml
├── resources
│   └── sample_resources.yml
├── deployments/
└── docs/
```

The `config.yml` file holds dataPi's general configuration. It looks like:
```yaml
# datapi configuration file
metastore_type: POLARIS
metastore_uri: 'METASTORE_URI/api/catalog'
metastore_credentials: 'CLIENT_ID:CLIENT_SECRET'
metastore_catalog: 'METASTORE_CATALOG_NAME'

# datapi datapods - Deployment settings
deployment:
  deployment_target: GCP_CLOUD_RUN
  build_service: GCP_CLOUD_BUILD
  project_id: GCP_PROJECT_ID
  registry_url: REGISTRY_URL
  region: GCP_REGION
```
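If you want to sanity-check a filled-in config before deploying, a quick script along these lines can help. This helper is not part of dataPi; it is just a sketch that assumes PyYAML is installed and that the file lives at the default `datapi_project/config.yml` location.

```python
# Hypothetical helper: verify that config.yml contains the keys dataPi expects.
import yaml

REQUIRED_TOP_LEVEL = ["metastore_type", "metastore_uri", "metastore_credentials",
                      "metastore_catalog", "deployment"]
REQUIRED_DEPLOYMENT = ["deployment_target", "build_service", "project_id",
                       "registry_url", "region"]

with open("datapi_project/config.yml") as f:
    config = yaml.safe_load(f)

missing = [k for k in REQUIRED_TOP_LEVEL if k not in config]
deployment = config.get("deployment") or {}
missing += [f"deployment.{k}" for k in REQUIRED_DEPLOYMENT if k not in deployment]
print("Missing keys:", missing or "none")
```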
The developer then fills in their dataPod specs under the `resources` folder.
For example, a reduction resource could look like:
```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: REDUCTION
aggregate: sales.sum()
group_by: quarter
filters: region = 'EMEA'
deploy: True
```

And a projection resource could look like:
```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: PROJECTION
select: sales quarter region
filters: region = 'EMEA'
deploy: True
```

Once your resources are defined, you can use the datapi CLI:
- Deploy all resources: `datapi run --all`
- Deploy a single resource: `datapi run --resource [RESOURCE_NAME]`
- List resources: `datapi show --all`
- List one resource: `datapi show --resource [RESOURCE_NAME]`
- Generate documentation: `datapi docs generate --all`
- Generate documentation for one resource: `datapi docs generate --resource [RESOURCE_NAME]`
- Serve documentation: `datapi docs serve`
Once the dataPod is deployed, it will offer a `get_data` endpoint you can query to retrieve the results.
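For example, a direct HTTP call from Python might look like the sketch below. The service URL, the exact path of the `get_data` endpoint, authentication, and the response format all depend on your deployment; the values here are hypothetical.

```python
# Minimal sketch of querying a deployed dataPod over REST.
# The URL is a hypothetical Cloud Run service URL; depending on your deployment,
# the request may also need an authentication header.
import requests

service_url = "https://my-datapod-abc123-ew.a.run.app"  # hypothetical URL
response = requests.get(f"{service_url}/get_data", timeout=60)
response.raise_for_status()
print(response.json())  # assumes the dataPod returns JSON
```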
Alternatively, you can use the Python client SDK included in the package. For example, from your application you can:

```python
from datapi import Client  # import path may differ depending on the package layout

client = Client(project_id=project_id, region=region, resource_name=resource_name)

# List the deployed dataPod services and their URLs
services = client.list_services()
print("Available services:")
for resource, url in services.items():
    print(f"- {resource}: {url}")

# Retrieve the data resolved by the dataPod
data = client.get_data()
print("Data from example_resource:", data)
```

Planned future work:

- Add more operators, including joins
- Add support for other metastores, such as Unity Catalog and the BigQuery metastore
- Add support for more container build services, such as local Docker
- Add support for more container deployment infrastructure, such as Kubernetes
- Add support for local transformations using dbt
- Allow dataPods to depend on other dataPods, not only on tables
- Add a UI for viewing the deployed dataPods and their exposed contracts
- Add support for other embedded engines, such as Polars and Fusion
- Add support for automatic generation of resources using embedded LLMs
- Add support for sending data in more formats (e.g. JSON, Arrow)
- Add support for gRPC instead of REST

