td-dd is a back-end service developed as part of the Truedat project that supports the generation of a Data Dictionary

Truedat Data Dictionary

td-dd is a back-end service developed as part of the Truedat project. It provides APIs for the following functionality:

  • Data Catalog
  • Data Lineage
  • Connector Management
  • Data Quality

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

Install dependencies with mix deps.get

Installing

To start your Phoenix server:

  • Create and migrate your database with mix ecto.create && mix ecto.migrate
  • Start Phoenix endpoint with mix phx.server
  • The td-dd API is published on localhost:4005
  • The td-cx API is published on localhost:4008

Running the tests

Run all application tests with mix test

Environment variables

  • REDIS_AUDIT_STREAM_MAXLEN (Optional) Maximum length for Redis audit stream. Default: 100
  • REDIS_STREAM_MAXLEN (Optional) Maximum length for Redis stream. Default: 100

SSL Connection

  • DB_SSL: Boolean value to enable SSL configuration. Default is false.
  • DB_SSL_CACERTFILE: Path to the Certificate Authority (CA) certificate file, e.g. /path/to/ca.crt.
  • DB_SSL_VERSION: Supported versions are tlsv1.2 and tlsv1.3. Default is tlsv1.2.
  • DB_SSL_CLIENT_CERT: Path to the client SSL certificate file.
  • DB_SSL_CLIENT_KEY: Path to the client SSL private key file.
  • DB_SSL_VERIFY: Specifies whether server certificates should be verified (true/false).
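
As an illustration, the variables above might be exported before starting the service. This is a minimal sketch: every path below is a placeholder, not a file shipped with td-dd.

```shell
# Enable SSL for the database connection.
# All paths are placeholder assumptions; replace them with your own files.
export DB_SSL=true
export DB_SSL_CACERTFILE=/path/to/ca.crt
export DB_SSL_VERSION=tlsv1.3
export DB_SSL_CLIENT_CERT=/path/to/client.crt
export DB_SSL_CLIENT_KEY=/path/to/client.key
export DB_SSL_VERIFY=true
```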

Lineage nodes domain IDs refresh

  • LINEAGE_NODES_DOMAINS_IDS_REFRESHER: Schedule for refreshing the domain IDs column of the lineage nodes, which is used to display the graph. Default: hourly

Elastic bulk page size configuration

  • BULK_PAGE_SIZE_GRANTS: default 500
  • BULK_PAGE_SIZE_IMPLEMENTATIONS: default 100
  • BULK_PAGE_SIZE_JOBS: default 100
  • BULK_PAGE_SIZE_RULES: default 100
  • BULK_PAGE_SIZE_STRUCTURES: default 1000
  • BULK_PAGE_SIZE_GRANT_REQUESTS: default 500

Store chunk size

  • GRANT_STORE_CHUNK_SIZE: default 1000
  • GRANT_REQUEST_STORE_CHUNK_SIZE: default 1000
  • STRUCTURE_STORE_CHUNK_SIZE: default 1000
  • DSV_STORE_CHUNK_SIZE: default 1000

If this variable is set to false, the existing index will not be deleted when no index is found during the hot swap process.

  • DELETE_EXISTING_INDEX: default true

Elastic configuration

The bulk_wait_interval variable defines the time interval between batches of bulk operations in Elasticsearch.

  • BULK_WAIT_INTERVAL_GRANTS: default 0

Elastic aggregations

  • The aggregation variables are defined as follows: AGG_<AGGREGATION_NAME>_SIZE
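
The naming pattern can be sketched as follows. The aggregation name "system" and the size 50 are hypothetical values chosen for illustration, and the sketch assumes the aggregation name is simply upper-cased; check the names actually used by your deployment.

```shell
# Build a variable name following the AGG_<AGGREGATION_NAME>_SIZE pattern.
# Assumption: the aggregation name is upper-cased as-is.
agg_var_name() {
  printf 'AGG_%s_SIZE' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# "system" is a hypothetical aggregation name; 50 is an arbitrary size.
export "$(agg_var_name system)=50"
```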

ElasticSearch authentication

(Optional) Basic HTTP authentication

These environment variables will add the Authorization header to each request with the value Basic <ES_USERNAME>:<ES_PASSWORD>

  • ES_USERNAME: Username
  • ES_PASSWORD: Password
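
For reference, the standard HTTP Basic scheme (RFC 7617) sends the credentials as the Base64 encoding of username:password. The sketch below shows that encoding; it assumes td-dd follows the standard scheme, and the function name and credentials are placeholders for illustration.

```shell
# Build a Basic credential value as described in RFC 7617.
# Assumption: td-dd encodes ES_USERNAME and ES_PASSWORD the standard way.
basic_auth_value() {
  printf 'Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

# Placeholder credentials for illustration only.
ES_USERNAME=elastic
ES_PASSWORD=changeme
basic_auth_value "$ES_USERNAME" "$ES_PASSWORD"
```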

Disable the language-specific stemming functionality

In the long term, we should aim to filter only by keyword fields in our connectors. However, setting the variable APPLY_LANG_SETTINGS_STRUCTURES to false disables the language-specific stemming functionality provided by Elasticsearch, which may affect search accuracy.

  • APPLY_LANG_SETTINGS_STRUCTURES: default false

(Optional) ApiKey authentication

This environment variable will add the Authorization header to each request with the value ApiKey <ES_API_KEY>

  • ES_API_KEY: ApiKey

(Optional) HTTP SSL Configuration (Normally required for ApiKey authentication)

These environment variables will configure CA Certificates for HTTPS requests

  • ES_SSL: [true | false] required to activate following options
  • ES_SSL_CACERTFILE: (Optional) Path to the CA certificate file. If not set, the CA bundle returned by :certifi.cacertfile() is used
  • ES_SSL_VERIFY: (Optional) [verify_peer | verify_none] defaults to verify_none
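
A minimal sketch of an ApiKey setup over HTTPS with certificate verification enabled. The key value and the certificate path are placeholders.

```shell
# ApiKey authentication with SSL options activated.
# The key and the path are placeholder assumptions.
export ES_API_KEY=replace-with-your-api-key
export ES_SSL=true
export ES_SSL_CACERTFILE=/path/to/ca.crt
export ES_SSL_VERIFY=verify_peer
```

Since ES_SSL_VERIFY defaults to verify_none, it has to be set explicitly to verify_peer for the server certificate to be checked.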

ElasticSearch Force Merge Configuration

These environment variables control the force merge operation for ElasticSearch indices, which optimizes index performance by merging segments.

  • ES_WAIT_FOR_COMPLETION:

    • Purpose: Controls whether the force merge operation should wait for completion before returning
    • Default: nil (no wait)
    • Usage: When set to true, the operation will wait until the force merge is complete before returning. When false or nil, the operation returns immediately and runs asynchronously
    • Performance: Setting to true ensures the operation is complete but may cause longer response times
  • ES_MAX_NUM_SEGMENTS:

    • Purpose: Specifies the maximum number of segments to merge down to
    • Default: 5
    • Usage: Controls how aggressively the force merge operation consolidates segments. Lower values result in fewer, larger segments
    • Performance: Fewer segments generally improve search performance but may increase memory usage during the merge operation
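
To relate these two variables to Elasticsearch itself, the sketch below composes the URL of the Elasticsearch force merge REST endpoint from them. How td-dd actually issues the request is not described here; ES_URL, the function name and the index name "structures" are hypothetical, and the wait_for_completion query parameter is only accepted by Elasticsearch versions that support it on this endpoint.

```shell
# Compose an Elasticsearch force merge URL for an index.
# ES_URL and the index name "structures" are hypothetical placeholders.
forcemerge_url() {
  index="$1"
  url="${ES_URL:-http://localhost:9200}/${index}/_forcemerge?max_num_segments=${ES_MAX_NUM_SEGMENTS:-5}"
  # Only add wait_for_completion when the variable is set (default: nil).
  if [ -n "${ES_WAIT_FOR_COMPLETION:-}" ]; then
    url="${url}&wait_for_completion=${ES_WAIT_FOR_COMPLETION}"
  fi
  printf '%s\n' "$url"
}

forcemerge_url structures
```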

Oban configuration

  • OBAN_DB_SCHEMA:

    • Purpose: Defines the database schema where Oban will create its tables
    • Default: "private"
    • Usage: Configures the schema prefix for Oban tables (jobs, peers, etc.)
    • Example: If set to "oban_schema", the tables will be created as oban_schema.jobs, oban_schema.peers, etc.
  • OBAN_CREATE_SCHEMA:

    • Purpose: Controls whether Oban should automatically create the database schema
    • Default: "true"
    • Usage: Determines whether the Oban migration creates the schema specified in OBAN_DB_SCHEMA
    • Valid values: "true" (automatically creates the schema) or "false" (does not create the schema; it must exist beforehand)
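
For example, to place the Oban tables in a schema that a database administrator has already created, the two variables might be combined like this. The schema name reuses the "oban_schema" example above and is illustrative only.

```shell
# Use a pre-existing schema for the Oban tables.
# "oban_schema" is an example name; the schema must already exist
# because automatic creation is turned off.
export OBAN_DB_SCHEMA=oban_schema
export OBAN_CREATE_SCHEMA=false
```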

Oban Cron Jobs Configuration

  • OUTDATED_EMBEDDINGS_CRON:

    • Purpose: Defines the cron schedule for the OutdatedEmbeddings worker
    • Default: "0 */3 * * *" (every 3 hours)
    • Usage: Controls when the system processes outdated embeddings for data structure versions
    • Example: "0 2 * * *" for daily execution at 2 AM
  • EMBEDDINGS_DELETION_CRON:

    • Purpose: Defines the cron schedule for the EmbeddingsDeletion worker
    • Default: "@hourly"
    • Usage: Controls when the system performs cleanup of deleted embeddings
    • Example: "0 */6 * * *" for execution every 6 hours
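
The example schedules above could be applied as follows. The quotes matter, because an unquoted * would be expanded by the shell.

```shell
# Override the default cron schedules with the example values.
export OUTDATED_EMBEDDINGS_CRON="0 2 * * *"     # daily at 2 AM
export EMBEDDINGS_DELETION_CRON="0 */6 * * *"   # every 6 hours
```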

Oban Queue Configuration

  • OBAN_QUEUE_DEFAULT:

    • Purpose: Sets the number of concurrent workers for the default queue
    • Default: "5"
    • Usage: Controls the parallelism for general background jobs
  • OBAN_QUEUE_XLSX_UPLOAD:

    • Purpose: Sets the number of concurrent workers for Excel file upload processing
    • Default: "10"
    • Usage: Controls the parallelism for XLSX file upload and processing jobs
  • OBAN_QUEUE_DELETE_UNITS:

    • Purpose: Sets the number of concurrent workers for unit deletion operations
    • Default: "10"
    • Usage: Controls the parallelism for data unit deletion jobs
  • OBAN_QUEUE_EMBEDDING_UPSERTS:

    • Purpose: Sets the number of concurrent workers for embedding upsert operations
    • Default: "10"
    • Usage: Controls the parallelism for creating and updating embeddings
  • OBAN_QUEUE_EMBEDDING_DELETION:

    • Purpose: Sets the number of concurrent workers for embedding deletion operations
    • Default: "5"
    • Usage: Controls the parallelism for embedding cleanup jobs

Embedding Management

  • LIMIT_OUTDATED_EMBEDDINGS:

    • Purpose: Controls the maximum number of data structure versions that can be processed in a single batch when updating outdated embeddings
    • Default: 50000
    • Usage: Used by the OutdatedEmbeddings worker (runs every 3 hours via cron) to limit the number of data structure versions processed when finding and updating missing or outdated record embeddings
    • Performance: Prevents memory issues and ensures system stability when processing large numbers of outdated embeddings
  • DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE:

    • Purpose: Controls the batch size used when processing record embeddings for data structures
    • Default: 100
    • Usage: Defines how many data structure IDs are processed together in each batch when generating or updating embeddings. Used by both synchronous and asynchronous embedding operations in td-dd
    • Performance: Adjusting this value can help balance memory usage and processing efficiency when handling large numbers of data structure embeddings
  • RECORD_EMBEDDINGS_DEFAULT_DELAY_MS:

    • Purpose: Controls the default delay in milliseconds between batches when processing record embeddings asynchronously
    • Default: 500
    • Usage: Defines the delay applied between consecutive batches of embedding upsert jobs. Used by the upsert_from_concepts_async/2 function to schedule jobs with a delay, preventing system overload when processing large numbers of embeddings
    • Performance: Adjusting this value can help control the rate of embedding processing and prevent overwhelming the system or external embedding services
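
As a rough way to reason about these three values together, the sketch below estimates how many batches a full run would produce and how far the last batch would be offset from the first. It assumes that one run handles up to LIMIT_OUTDATED_EMBEDDINGS items and that each batch is scheduled one delay after the previous one; the actual scheduling inside td-dd may differ.

```shell
# Rough estimate under the assumptions stated above, not td-dd's real scheduler.
limit="${LIMIT_OUTDATED_EMBEDDINGS:-50000}"
batch="${DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE:-100}"
delay_ms="${RECORD_EMBEDDINGS_DEFAULT_DELAY_MS:-500}"

batches=$(( (limit + batch - 1) / batch ))   # ceiling division
spread_ms=$(( (batches - 1) * delay_ms ))    # offset of the last batch
echo "batches=$batches spread_ms=$spread_ms"
```

With the defaults this gives 500 batches and a spread of 249500 ms, a little over four minutes.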

Deployment

Ready to run in production? Please check our deployment guides.

Built With

Authors

  • Bluetab Solutions Group, SL - Initial work - Bluetab

See also the list of contributors who participated in this project.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

In order to use this software, it is necessary that, depending on the type of functionality that you want to obtain, it is assembled with other software whose license may be governed by other terms different than the GNU General Public License version 3 or later. In that case, it will be absolutely necessary that, in order to make a correct use of the software to be assembled, you give compliance with the rules of the concrete license (of Free Software or Open Source Software) of use in each case, as well as, where appropriate, obtaining of the permits that are necessary for these appropriate purposes.
