datapi (from data + API) is a Python package that lets you implement a distributed data lakehouse head made of data pods.
The package lets you define, deploy, and list data pods, and generate documentation for them.
Clone this repository and run `pip install .`.

dataPi allows application developers to specify, in a simple YAML file, the informational query their application needs (e.g. sales aggregated by quarter where region is EMEA).

- When `datapi run` is executed, it creates a dataPod: a container-based deployable unit that contains a local engine to resolve the query.
- Each dataPod exposes a REST API that, when called, asks the metastore for the data location and, after checking that permissions are in place, retrieves the data and executes the query locally in the container, without calling the Data Platform engine.
- Finally, it sends the data back to the application.
dataPi builds on top of an existing data platform. Currently there is support for:
- Lakehouse data format: Apache Iceberg
- Cloud Storage: GCS, AWS S3 and Microsoft ADLS
- Metastore: Apache Polaris
- dataPod deployment target: Google Cloud Run
- dataPod build service: Google Cloud Build
Query sources supported: Iceberg tables.
There is support for two types of dataPods: projection and reduction (see the illustrative sketch below).

- Projection dataPods support the `select` and `filters` query operators.
- Reduction dataPods support the `aggregate`, `group_by`, and `filters` query operators.
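Conceptually, these operators map to plain SQL that the dataPod's embedded engine evaluates locally. The sketch below is illustrative only: it runs DuckDB directly against an in-memory stand-in table, and the SQL that dataPi actually generates for a resource may differ.

```python
# Illustrative mapping of dataPi query operators to SQL, evaluated with DuckDB.
# sales_data is a stand-in for an Iceberg table resolved via the metastore.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE sales_data AS
    SELECT * FROM (VALUES
        ('Q1', 'EMEA', 100.0),
        ('Q1', 'AMER', 250.0),
        ('Q2', 'EMEA', 175.0)
    ) AS t(quarter, region, sales)
""")

# Reduction: aggregate + group_by + filters
print(con.sql("""
    SELECT quarter, SUM(sales) AS sales
    FROM sales_data
    WHERE region = 'EMEA'
    GROUP BY quarter
""").fetchall())

# Projection: select + filters
print(con.sql("""
    SELECT sales, quarter, region
    FROM sales_data
    WHERE region = 'EMEA'
""").fetchall())
```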
NOTE: Here is a guide on how to deploy and use Apache Polaris on Google Cloud.
To create a new datapi project, run:

```bash
datapi init [PROJECT_NAME]
```

If you don't specify a project name, it will default to `datapi_project`.
dataPi will create the following structure:

```
datapi_project
├── config.yml
├── resources
│   └── sample_resources.yml
├── deployments/
└── docs/
```

The `config.yml` file holds dataPi's general configuration. It looks like:
```yaml
# datapi configuration file
metastore_type: POLARIS
metastore_uri: 'METASTORE_URI/api/catalog'
metastore_credentials: 'CLIENT_ID:CLIENT_SECRET'
metastore_catalog: 'METASTORE_CATALOG_NAME'

# datapi datapods - Deployment settings
deployment:
  deployment_target: GCP_CLOUD_RUN
  build_service: GCP_CLOUD_BUILD
  project_id: GCP_PROJECT_ID
  registry_url: REGISTRY_URL
  region: GCP_REGION
```
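If you want to sanity-check a filled-in config before deploying, a quick script along these lines can help. This helper is not part of dataPi; it is just a sketch that assumes PyYAML is installed and that the file lives at the default `datapi_project/config.yml` location.

```python
# Hypothetical helper: verify that config.yml contains the keys dataPi expects.
import yaml

REQUIRED_TOP_LEVEL = ["metastore_type", "metastore_uri", "metastore_credentials",
                      "metastore_catalog", "deployment"]
REQUIRED_DEPLOYMENT = ["deployment_target", "build_service", "project_id",
                       "registry_url", "region"]

with open("datapi_project/config.yml") as f:
    config = yaml.safe_load(f)

missing = [k for k in REQUIRED_TOP_LEVEL if k not in config]
deployment = config.get("deployment") or {}
missing += [f"deployment.{k}" for k in REQUIRED_DEPLOYMENT if k not in deployment]
print("Missing keys:", missing or "none")
```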
The developer then fills in their dataPod specs under the `resources` folder.
For example, a reduction resource could look like:
```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: REDUCTION
aggregate: sales.sum()
group_by: quarter
filters: region = 'EMEA'
deploy: True
```

And a projection resource could look like:
```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: PROJECTION
select: sales quarter region
filters: region = 'EMEA'
deploy: True
```

Once your resources are defined, you can use the datapi CLI:
- Deploy all resources: `datapi run --all`
- Deploy a single resource: `datapi run --resource [RESOURCE_NAME]`
- List resources: `datapi show --all`
- List one resource: `datapi show --resource [RESOURCE_NAME]`
- Generate documentation: `datapi docs generate --all`
- Generate documentation for one resource: `datapi docs generate --resource [RESOURCE_NAME]`
- Serve documentation: `datapi docs serve`
Once the dataPod is deployed, it will offer a `get_data` endpoint you can query to retrieve the results.
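For example, a direct HTTP call from Python might look like the sketch below. The service URL, the exact path of the `get_data` endpoint, authentication, and the response format all depend on your deployment; the values here are hypothetical.

```python
# Minimal sketch of querying a deployed dataPod over REST.
# The URL is a hypothetical Cloud Run service URL; depending on your deployment,
# the request may also need an authentication header.
import requests

service_url = "https://my-datapod-abc123-ew.a.run.app"  # hypothetical URL
response = requests.get(f"{service_url}/get_data", timeout=60)
response.raise_for_status()
print(response.json())  # assumes the dataPod returns JSON
```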
Alternatively, you can use the Python client SDK included in the package. For example, from your application you can:

```python
from datapi import Client  # import path may differ depending on the package layout

client = Client(project_id=project_id, region=region, resource_name=resource_name)

# List the deployed dataPod services and their URLs
services = client.list_services()
print("Available services:")
for resource, url in services.items():
    print(f"- {resource}: {url}")

# Retrieve the data resolved by the dataPod
data = client.get_data()
print("Data from example_resource:", data)
```

Planned future work:

- Add more operators, including joins
- Add support for other metastores, such as Unity Catalog and the BigQuery metastore
- Add support for more container build services, such as local Docker
- Add support for more container deployment infrastructure, such as Kubernetes
- Add support for local transformations using dbt
- Allow dataPods to depend on other dataPods, not only on tables
- Add a UI for viewing the deployed dataPods and their exposed contracts
- Add support for other embedded engines, such as Polars and Fusion
- Add support for automatic generation of resources using embedded LLMs
- Add support for sending data in more formats (e.g. JSON, Arrow)
- Add support for gRPC instead of REST

