Truedat Data Dictionary

td-dd is a back-end service developed as part of the Truedat project that provides APIs for the following functionality:

Data Catalog
Data Lineage
Connector Management
Data Quality

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

Install dependencies with mix deps.get

Installing

To start your Phoenix server:

Create and migrate your database with mix ecto.create && mix ecto.migrate
Start the Phoenix endpoint with mix phx.server
The td-dd API is published on localhost:4005
The td-cx API is published on localhost:4008

Running the tests

Run all application tests with mix test

Environment variables

REDIS_AUDIT_STREAM_MAXLEN (Optional) Maximum length for Redis audit stream. Default: 100
REDIS_STREAM_MAXLEN (Optional) Maximum length for Redis stream. Default: 100

SSL Connection

DB_SSL: Boolean value to enable SSL configuration. Default is false.
DB_SSL_CACERTFILE: Path to the Certification Authority (CA) certificate file, e.g. /path/to/ca.crt.
DB_SSL_VERSION: Supported versions are tlsv1.2 and tlsv1.3. Default is tlsv1.2.
DB_SSL_CLIENT_CERT: Path to the client SSL certificate file.
DB_SSL_CLIENT_KEY: Path to the client SSL private key file.
DB_SSL_VERIFY: Specifies whether server certificates should be verified (true/false).
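
As an illustration, the SSL variables above could be exported like this before starting the service. The certificate paths are placeholders, and the final check is only a sanity test for this sketch, not something td-dd is known to run.

```shell
# Placeholder paths: point these at your own certificate files.
export DB_SSL=true
export DB_SSL_CACERTFILE=/path/to/ca.crt
export DB_SSL_VERSION=tlsv1.3
export DB_SSL_CLIENT_CERT=/path/to/client.crt
export DB_SSL_CLIENT_KEY=/path/to/client.key
export DB_SSL_VERIFY=true

# Accept only the two documented TLS versions.
case "$DB_SSL_VERSION" in
  tlsv1.2|tlsv1.3) db_ssl_version_ok=yes ;;
  *) db_ssl_version_ok=no ;;
esac
echo "DB_SSL_VERSION supported: $db_ssl_version_ok"
```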

Lineage nodes domains ids refresh

LINEAGE_NODES_DOMAINS_IDS_REFRESHER: Schedule for refreshing the domains ids column of the lineage nodes, which is needed to show the graph. Default: hourly

Elastic bulk page size configuration

BULK_PAGE_SIZE_GRANTS: default 500
BULK_PAGE_SIZE_IMPLEMENTATIONS: default 100
BULK_PAGE_SIZE_JOBS: default 100
BULK_PAGE_SIZE_RULES: default 100
BULK_PAGE_SIZE_STRUCTURES: default 1000
BULK_PAGE_SIZE_GRANT_REQUESTS: default 500

Store chunk size

GRANT_STORE_CHUNK_SIZE: default 1000
GRANT_REQUEST_STORE_CHUNK_SIZE: default 1000
STRUCTURE_STORE_CHUNK_SIZE: default 1000
DSV_STORE_CHUNK_SIZE: default 1000

If DELETE_EXISTING_INDEX is set to false, the index will not be deleted in the case that there is no index in the hot swap process.

DELETE_EXISTING_INDEX: default true

Elastic Configuration

The bulk_wait_interval variable defines the time interval between batches of bulk operations in Elasticsearch.

BULK_WAIT_INTERVAL_GRANTS: default 0

Elastic aggregations

The aggregation variables are defined as follows: AGG_<AGGREGATION_NAME>_SIZE
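
A minimal sketch of how a variable name following this pattern could be derived. The aggregation name taxonomy and the size 50 are hypothetical, and the sketch assumes the aggregation name is simply upper-cased; check the service configuration for the names it actually reads.

```shell
agg_name="taxonomy"   # hypothetical aggregation name

# Upper-case the name and wrap it in the AGG_..._SIZE pattern.
var_name="AGG_$(printf '%s' "$agg_name" | tr '[:lower:]' '[:upper:]')_SIZE"

export "${var_name}=50"   # 50 is an arbitrary example size
echo "$var_name"
```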

ElasticSearch authentication

(Optional) Basic HTTP authentication

These environment variables will add the Authorization header to each request with value Basic <ES_USERNAME>:<ES_PASSWORD>

ES_USERNAME: Username
ES_PASSWORD: Password
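
For reference, standard HTTP Basic authentication (RFC 7617) sends the username:password pair Base64-encoded in the Authorization header. The sketch below shows what that header looks like for a pair of made-up credentials; how td-dd builds the header internally is not shown here.

```shell
# Hypothetical credentials, for illustration only.
export ES_USERNAME=elastic
export ES_PASSWORD=changeme

# RFC 7617: Base64-encode "username:password" and prefix it with "Basic ".
credentials=$(printf '%s:%s' "$ES_USERNAME" "$ES_PASSWORD" | base64)
auth_header="Authorization: Basic ${credentials}"
echo "$auth_header"
```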

Disable the language-specific stemming functionality

In the long term, we should aim to filter only by keyword fields in our connectors. In the meantime, setting the variable APPLY_LANG_SETTINGS_STRUCTURES to false disables the language-specific stemming functionality provided by Elasticsearch, which may impact search accuracy.

APPLY_LANG_SETTINGS_STRUCTURES: default false

(Optional) ApiKey authentication

This environment variable will add the Authorization header to each request with value ApiKey <ES_API_KEY>

ES_API_KEY: ApiKey

(Optional) HTTP SSL Configuration (Normally required for ApiKey authentication)

These environment variables will configure CA Certificates for HTTPS requests

ES_SSL: [true | false] Required to activate the following options
ES_SSL_CACERTFILE: (Optional) Path to the CA certificate file. If not set, the CA bundle returned by :certifi.cacertfile() is used
ES_SSL_VERIFY: (Optional) [verify_peer | verify_none] Defaults to verify_none
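
A possible combination of ApiKey authentication with HTTPS verification is sketched below. The key and the certificate path are placeholders, and the header line only illustrates the value described above.

```shell
export ES_API_KEY=my-api-key                  # placeholder key
export ES_SSL=true
export ES_SSL_CACERTFILE=/path/to/es-ca.crt   # placeholder path
export ES_SSL_VERIFY=verify_peer

# Header value sent with each request when ES_API_KEY is set.
api_key_header="Authorization: ApiKey ${ES_API_KEY}"
echo "$api_key_header"
```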

ElasticSearch Force Merge Configuration

These environment variables control the force merge operation for ElasticSearch indices, which optimizes index performance by merging segments.

ES_WAIT_FOR_COMPLETION:
- Purpose: Controls whether the force merge operation should wait for completion before returning
- Default: nil (no wait)
- Usage: When set to true, the operation will wait until the force merge is complete before returning. When false or nil, the operation returns immediately and runs asynchronously
- Performance: Setting to true ensures the operation is complete but may cause longer response times
ES_MAX_NUM_SEGMENTS:
- Purpose: Specifies the maximum number of segments to merge down to
- Default: 5
- Usage: Controls how aggressively the force merge operation consolidates segments. Lower values result in fewer, larger segments
- Performance: Fewer segments generally improve search performance but may increase memory usage during the merge operation
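
To make the two settings concrete, the sketch below builds the kind of request URL the Elasticsearch force merge API accepts (max_num_segments and wait_for_completion are query parameters of that API, the latter only in recent Elasticsearch versions). The host and the index name structures are assumptions, and this is not necessarily how td-dd issues the call.

```shell
es_host="http://localhost:9200"   # assumed local Elasticsearch address
index="structures"                # hypothetical index name

ES_MAX_NUM_SEGMENTS=5
ES_WAIT_FOR_COMPLETION=true

# Always pass max_num_segments; add wait_for_completion only when it is set.
query="max_num_segments=${ES_MAX_NUM_SEGMENTS}"
if [ -n "${ES_WAIT_FOR_COMPLETION:-}" ]; then
  query="${query}&wait_for_completion=${ES_WAIT_FOR_COMPLETION}"
fi

forcemerge_url="${es_host}/${index}/_forcemerge?${query}"
# Print the command instead of running it, so the sketch has no side effects.
echo "curl -X POST '${forcemerge_url}'"
```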

Oban configuration

OBAN_DB_SCHEMA:
- Purpose: Defines the database schema where Oban will create its tables
- Default: "private"
- Usage: Configures the schema prefix for Oban tables (jobs, peers, etc.)
- Example: If set to "oban_schema", tables will be created as oban_schema.jobs, oban_schema.peers, etc.
OBAN_CREATE_SCHEMA:
- Purpose: Controls whether Oban should automatically create the database schema
- Default: "true"
- Usage: Determines if the Oban migration should create the schema specified in OBAN_DB_SCHEMA
- Valid values: "true" automatically creates the schema; "false" does not create the schema (it must exist beforehand)
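
As a sketch of the "false" case, the schema has to exist before the migrations run. The example below sets both variables and prints a standard PostgreSQL statement that could be used to create the schema beforehand; the schema name is only an example, and running the statement (for instance with psql) is left to the operator.

```shell
export OBAN_DB_SCHEMA=oban_schema   # example schema name
export OBAN_CREATE_SCHEMA=false     # the schema must already exist

# Statement an operator could run once before migrating.
create_schema_sql="CREATE SCHEMA IF NOT EXISTS ${OBAN_DB_SCHEMA};"
echo "$create_schema_sql"
```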

Oban Cron Jobs Configuration

OUTDATED_EMBEDDINGS_CRON:
- Purpose: Defines the cron schedule for the OutdatedEmbeddings worker
- Default: "0 */3 * * *" (every 3 hours)
- Usage: Controls when the system processes outdated embeddings for data structure versions
- Example: "0 2 * * *" for daily execution at 2 AM
EMBEDDINGS_DELETION_CRON:
- Purpose: Defines the cron schedule for the EmbeddingsDeletion worker
- Default: "@hourly"
- Usage: Controls when the system performs cleanup of deleted embeddings
- Example: "0 */6 * * *" for execution every 6 hours
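
The example schedules above could be set as shown below. The field count is only a quick sanity check for five-field cron expressions written for this sketch; it does not apply to nicknames such as @hourly.

```shell
# Quote the values so the shell does not expand the asterisks.
export OUTDATED_EMBEDDINGS_CRON="0 2 * * *"     # daily at 2 AM
export EMBEDDINGS_DELETION_CRON="0 */6 * * *"   # every 6 hours

# A standard cron expression has five space-separated fields.
count_fields() { printf '%s\n' "$1" | awk '{ print NF }'; }

outdated_fields=$(count_fields "$OUTDATED_EMBEDDINGS_CRON")
deletion_fields=$(count_fields "$EMBEDDINGS_DELETION_CRON")
echo "fields: $outdated_fields $deletion_fields"
```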

Oban Queue Configuration

OBAN_QUEUE_DEFAULT:
- Purpose: Sets the number of concurrent workers for the default queue
- Default: "5"
- Usage: Controls the parallelism for general background jobs
OBAN_QUEUE_XLSX_UPLOAD:
- Purpose: Sets the number of concurrent workers for Excel file upload processing
- Default: "10"
- Usage: Controls the parallelism for XLSX file upload and processing jobs
OBAN_QUEUE_DELETE_UNITS:
- Purpose: Sets the number of concurrent workers for unit deletion operations
- Default: "10"
- Usage: Controls the parallelism for data unit deletion jobs
OBAN_QUEUE_EMBEDDING_UPSERTS:
- Purpose: Sets the number of concurrent workers for embedding upsert operations
- Default: "10"
- Usage: Controls the parallelism for creating and updating embeddings
OBAN_QUEUE_EMBEDDING_DELETION:
- Purpose: Sets the number of concurrent workers for embedding deletion operations
- Default: "5"
- Usage: Controls the parallelism for embedding cleanup jobs
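
When tuning these limits it can help to know the combined ceiling. The sketch below sets the documented defaults explicitly and adds them up; treating the sum as the maximum number of jobs a single node runs at once is an assumption of this example, not a statement from the project.

```shell
# Documented defaults, set explicitly for the example.
export OBAN_QUEUE_DEFAULT=5
export OBAN_QUEUE_XLSX_UPLOAD=10
export OBAN_QUEUE_DELETE_UNITS=10
export OBAN_QUEUE_EMBEDDING_UPSERTS=10
export OBAN_QUEUE_EMBEDDING_DELETION=5

# Sum of all queue limits.
total_workers=$((OBAN_QUEUE_DEFAULT + OBAN_QUEUE_XLSX_UPLOAD + OBAN_QUEUE_DELETE_UNITS + OBAN_QUEUE_EMBEDDING_UPSERTS + OBAN_QUEUE_EMBEDDING_DELETION))
echo "Combined queue limit: $total_workers"
```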

Embedding Management

LIMIT_OUTDATED_EMBEDDINGS:
- Purpose: Controls the maximum number of data structure versions that can be processed in a single batch when updating outdated embeddings
- Default: 50000
- Usage: Used by the OutdatedEmbeddings worker (runs every 3 hours by default, see OUTDATED_EMBEDDINGS_CRON) to limit the number of data structure versions processed when finding and updating missing or outdated record embeddings
- Performance: Prevents memory issues and ensures system stability when processing large numbers of outdated embeddings
DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE:
- Purpose: Controls the batch size used when processing record embeddings for data structures
- Default: 100
- Usage: Defines how many data structure IDs are processed together in each batch when generating or updating embeddings. Used by both synchronous and asynchronous embedding operations in td-dd
- Performance: Adjusting this value can help balance memory usage and processing efficiency when handling large numbers of data structure embeddings
RECORD_EMBEDDINGS_DEFAULT_DELAY_MS:
- Purpose: Controls the default delay in milliseconds between batches when processing record embeddings asynchronously
- Default: 500
- Usage: Defines the delay applied between consecutive batches of embedding upsert jobs. Used by the upsert_from_concepts_async/2 function to schedule jobs with a delay, preventing system overload when processing large numbers of embeddings
- Performance: Adjusting this value can help control the rate of embedding processing and prevent overwhelming the system or external embedding services
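
To get a feel for how these three settings interact, the following back-of-the-envelope sketch uses the documented defaults. It assumes, purely for illustration, that one run processes LIMIT_OUTDATED_EMBEDDINGS items split into batches of DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE, and that each batch is scheduled one delay after the previous one; the actual scheduling in td-dd may differ.

```shell
LIMIT_OUTDATED_EMBEDDINGS=50000
DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE=100
RECORD_EMBEDDINGS_DEFAULT_DELAY_MS=500

# Ceiling division: number of batches needed for one full run.
batches=$(( (LIMIT_OUTDATED_EMBEDDINGS + DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE - 1) / DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE ))

# Assumption: batch n is scheduled (n - 1) delays after the first batch.
last_batch_offset_ms=$(( (batches - 1) * RECORD_EMBEDDINGS_DEFAULT_DELAY_MS ))

echo "batches: $batches, last batch scheduled after ${last_batch_offset_ms} ms"
```

Under these assumptions, halving the batch size doubles the number of batches and roughly doubles the time over which the jobs are spread.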

Deployment

Ready to run in production? Please check our deployment guides.

Built With

phoenix - A productive web framework
ecto - Elixir toolkit for database integration
postgrex - Elixir PostgreSQL driver
cowboy - An HTTP server for Erlang/OTP
httpoison - An HTTP client
credo - Static code analysis
guardian - Authentication library
bodyguard - Authorization library
ex_machina - A factory library for test data
cors_plug - Plug for CORS support
elasticsearch - Client for Elasticsearch
vaultex - Client for HashiCorp Vault

Authors

Bluetab Solutions Group, SL - Initial work - Bluetab

See also the list of contributors who participated in this project.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

In order to use this software, it is necessary that, depending on the type of functionality that you want to obtain, it is assembled with other software whose license may be governed by other terms different than the GNU General Public License version 3 or later. In that case, it will be absolutely necessary that, in order to make a correct use of the software to be assembled, you give compliance with the rules of the concrete license (of Free Software or Open Source Software) of use in each case, as well as, where appropriate, obtaining of the permits that are necessary for these appropriate purposes.
