td-dd is a back-end service developed as part of the Truedat project that provides
APIs for the following functionality:
- Data Catalog
- Data Lineage
- Connector Management
- Data Quality
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Install dependencies with mix deps.get
To start your Phoenix server:
- Create and migrate your database with mix ecto.create && mix ecto.migrate
- Start Phoenix endpoint with mix phx.server
- The td-dd API is published on localhost:4005
- The td-cx API is published on localhost:4008
Run all application tests with mix test
- REDIS_AUDIT_STREAM_MAXLEN: (Optional) Maximum length for Redis audit stream. Default: 100
- REDIS_STREAM_MAXLEN: (Optional) Maximum length for Redis stream. Default: 100
- DB_SSL: Boolean value to enable SSL configuration. Default is false.
- DB_SSL_CACERTFILE: Path to the Certification Authority (CA) certificate file, e.g. /path/to/ca.crt.
- DB_SSL_VERSION: Supported versions are tlsv1.2 and tlsv1.3. Default is tlsv1.2.
- DB_SSL_CLIENT_CERT: Path to the client SSL certificate file.
- DB_SSL_CLIENT_KEY: Path to the client SSL private key file.
- DB_SSL_VERIFY: Specifies whether server certificates should be verified (true/false).
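As an illustration of how the DB_SSL variables relate to each other, here is a minimal Python sketch that turns them into an options dictionary. It is not td-dd's own (Elixir) configuration code: the function names and option keys are hypothetical, and the default of false for DB_SSL_VERIFY is an assumption, since no default is stated for it.

```python
SUPPORTED_TLS_VERSIONS = ("tlsv1.2", "tlsv1.3")


def parse_bool(value, default=False):
    """Interpret 'true' (any case) as True; a missing value yields the default."""
    if value is None:
        return default
    return value.strip().lower() == "true"


def db_ssl_options(env):
    """Return a dict of SSL options, or None when DB_SSL is not enabled."""
    if not parse_bool(env.get("DB_SSL")):
        return None

    version = env.get("DB_SSL_VERSION", "tlsv1.2")
    if version not in SUPPORTED_TLS_VERSIONS:
        raise ValueError(f"unsupported DB_SSL_VERSION: {version}")

    # Assumption: verification is off unless DB_SSL_VERIFY is "true".
    options = {"version": version, "verify": parse_bool(env.get("DB_SSL_VERIFY"))}

    # Hypothetical option keys; only include the paths that are actually set.
    for key, var in (
        ("cacertfile", "DB_SSL_CACERTFILE"),
        ("certfile", "DB_SSL_CLIENT_CERT"),
        ("keyfile", "DB_SSL_CLIENT_KEY"),
    ):
        if env.get(var):
            options[key] = env[var]
    return options
```

Calling db_ssl_options with an empty mapping returns None, which mirrors the documented default of DB_SSL being false.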
- LINEAGE_NODES_DOMAINS_IDS_REFRESHER: default hourly. Refreshes the lineage nodes' domain ids column, which is used to show the graph
- BULK_PAGE_SIZE_GRANTS: default 500
- BULK_PAGE_SIZE_IMPLEMENTATIONS: default 100
- BULK_PAGE_SIZE_JOBS: default 100
- BULK_PAGE_SIZE_RULES: default 100
- BULK_PAGE_SIZE_STRUCTURES: default 1000
- BULK_PAGE_SIZE_GRANT_REQUESTS: default 500
- GRANT_STORE_CHUNK_SIZE: default 1000
- GRANT_REQUEST_STORE_CHUNK_SIZE: default 1000
- STRUCTURE_STORE_CHUNK_SIZE: default 1000
- DSV_STORE_CHUNK_SIZE: default 1000
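The list above gives only names and defaults. As a rough sketch of how such integer settings are typically resolved (environment value if present, documented default otherwise), the following Python reads them from a mapping. The int_setting helper is hypothetical and is not part of td-dd, which is configured in Elixir.

```python
import os

# Defaults copied from the list above.
DEFAULTS = {
    "BULK_PAGE_SIZE_GRANTS": 500,
    "BULK_PAGE_SIZE_IMPLEMENTATIONS": 100,
    "BULK_PAGE_SIZE_JOBS": 100,
    "BULK_PAGE_SIZE_RULES": 100,
    "BULK_PAGE_SIZE_STRUCTURES": 1000,
    "BULK_PAGE_SIZE_GRANT_REQUESTS": 500,
    "GRANT_STORE_CHUNK_SIZE": 1000,
    "GRANT_REQUEST_STORE_CHUNK_SIZE": 1000,
    "STRUCTURE_STORE_CHUNK_SIZE": 1000,
    "DSV_STORE_CHUNK_SIZE": 1000,
}


def int_setting(name, env=os.environ):
    """Return the integer value of a setting, falling back to its default."""
    raw = env.get(name)
    if raw is None or raw == "":
        return DEFAULTS[name]
    return int(raw)
```

For example, int_setting("BULK_PAGE_SIZE_GRANTS", {}) yields the documented default of 500, while passing {"BULK_PAGE_SIZE_GRANTS": "250"} yields 250.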
If DELETE_EXISTING_INDEX is set to false, the existing index will not be deleted in the case that there is no index in the hot swap process.
- DELETE_EXISTING_INDEX: default true
The bulk_wait_interval variable defines the time interval between batches of bulk operations in Elasticsearch.
- BULK_WAIT_INTERVAL_GRANTS: default 0
- The aggregation variables are defined as follows: AGG_<AGGREGATION_NAME>_SIZE
These environment variables will add the Authorization header on each request
with value Basic <ES_USERNAME>:<ES_PASSWORD>
- ES_USERNAME: Username
- ES_PASSWORD: Password
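For reference, a standard HTTP Basic credential (RFC 7617) is the base64 encoding of username:password, sent in the Authorization header. The Python sketch below shows that derivation; it is illustrative only, the helper name and the sample credentials are hypothetical, and the Elasticsearch client used by td-dd is expected to perform the encoding itself.

```python
import base64


def basic_auth_value(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header (RFC 7617)."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials, for illustration only.
headers = {"Authorization": basic_auth_value("elastic", "changeme")}
```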
In the long term, we should aim to filter only by keyword fields in our connectors. However, setting the variable
APPLY_LANG_SETTINGS_STRUCTURES to false will disable the language-specific stemming functionality provided by Elasticsearch,
which may impact search accuracy.
- APPLY_LANG_SETTINGS_STRUCTURES: default false
This environment variable will add the Authorization header on each request
with value ApiKey <ES_API_KEY>
- ES_API_KEY: ApiKey
These environment variables will configure CA Certificates for HTTPS requests
- ES_SSL: [true | false] required to activate the following options
- ES_SSL_CACERTFILE: (Optional) Indicate the cacert file path. If not set, a certfile will be automatically generated by :certifi.cacertfile()
- ES_SSL_VERIFY: (Optional) [verify_peer | verify_none] defaults to verify_none
These environment variables control the force merge operation for Elasticsearch indices, which optimizes index performance by merging segments.
- ES_WAIT_FOR_COMPLETION:
  - Purpose: Controls whether the force merge operation should wait for completion before returning
  - Default: nil (no wait)
  - Usage: When set to true, the operation will wait until the force merge is complete before returning. When false or nil, the operation returns immediately and runs asynchronously
  - Performance: Setting to true ensures the operation is complete but may cause longer response times
- ES_MAX_NUM_SEGMENTS:
  - Purpose: Specifies the maximum number of segments to merge down to
  - Default: 5
  - Usage: Controls how aggressively the force merge operation consolidates segments. Lower values result in fewer, larger segments
  - Performance: Fewer segments generally improve search performance but may increase memory usage during the merge operation
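To make the effect of these two variables concrete, here is a minimal Python sketch that builds a force merge request path. It assumes that ES_MAX_NUM_SEGMENTS and ES_WAIT_FOR_COMPLETION map onto the max_num_segments and wait_for_completion query parameters of the Elasticsearch force merge API. The helper name and the index name are hypothetical, and td-dd's own Elixir code may build the request differently.

```python
from urllib.parse import urlencode


def forcemerge_path(index, env):
    """Build the path and query string for a force merge request on an index."""
    # Documented default: merge down to 5 segments.
    params = {"max_num_segments": int(env.get("ES_MAX_NUM_SEGMENTS", "5"))}

    # Documented default is nil: the parameter is left out when the variable is unset.
    wait = env.get("ES_WAIT_FOR_COMPLETION")
    if wait is not None:
        params["wait_for_completion"] = "true" if wait.lower() == "true" else "false"

    return f"/{index}/_forcemerge?{urlencode(params)}"
```

With no variables set, forcemerge_path("structures", {}) produces /structures/_forcemerge?max_num_segments=5, which would be sent as a POST request.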
- OBAN_DB_SCHEMA:
  - Purpose: Defines the database schema where Oban will create its tables
  - Default value: "private"
  - Usage: Configures the schema prefix for Oban tables (jobs, peers, etc.)
  - Example: If set to "oban_schema", tables will be created in the schema oban_schema.jobs, oban_schema.peers, etc.
- OBAN_CREATE_SCHEMA:
  - Purpose: Controls whether Oban should automatically create the database schema
  - Default value: "true"
  - Usage: Determines if the Oban migration should create the schema specified in OBAN_DB_SCHEMA
  - Valid values:
    - "true": Automatically creates the schema
    - "false": Does not create the schema (must exist beforehand)
- OUTDATED_EMBEDDINGS_CRON:
  - Purpose: Defines the cron schedule for the OutdatedEmbeddings worker
  - Default value: "0 */3 * * *" (every 3 hours)
  - Usage: Controls when the system processes outdated embeddings for data structure versions
  - Example: "0 2 * * *" for daily execution at 2 AM
- EMBEDDINGS_DELETION_CRON:
  - Purpose: Defines the cron schedule for the EmbeddingsDeletion worker
  - Default value: "@hourly"
  - Usage: Controls when the system performs cleanup of deleted embeddings
  - Example: "0 */6 * * *" for execution every 6 hours
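To help read the cron expressions above, the following Python sketch expands the minute and hour fields of an expression into the times of day at which it fires. It is a simplified illustration, not Oban's parser: it ignores the day-of-month, month, and day-of-week fields, and it treats @hourly as the conventional alias for "0 * * * *". The function names are hypothetical.

```python
ALIASES = {"@hourly": "0 * * * *"}


def expand_cron_field(field, low, high):
    """Expand one cron field ('*', '*/n', 'a', 'a-b', comma lists) into sorted values."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # 'a/n' runs from a to the top of the range; a bare 'a' is a single value.
            end = high if "/" in part else start
        values.update(range(start, end + 1, step))
    return sorted(values)


def daily_run_times(expression):
    """Return the HH:MM times at which the expression fires on a matching day."""
    minute, hour, *_ = ALIASES.get(expression, expression).split()
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_cron_field(hour, 0, 23)
        for m in expand_cron_field(minute, 0, 59)
    ]
```

For the default OUTDATED_EMBEDDINGS_CRON value, daily_run_times("0 */3 * * *") lists eight times, from 00:00 to 21:00 in three-hour steps.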
- OBAN_QUEUE_DEFAULT:
  - Purpose: Sets the number of concurrent workers for the default queue
  - Default value: "5"
  - Usage: Controls the parallelism for general background jobs
- OBAN_QUEUE_XLSX_UPLOAD:
  - Purpose: Sets the number of concurrent workers for Excel file upload processing
  - Default value: "10"
  - Usage: Controls the parallelism for XLSX file upload and processing jobs
- OBAN_QUEUE_DELETE_UNITS:
  - Purpose: Sets the number of concurrent workers for unit deletion operations
  - Default value: "10"
  - Usage: Controls the parallelism for data unit deletion jobs
- OBAN_QUEUE_EMBEDDING_UPSERTS:
  - Purpose: Sets the number of concurrent workers for embedding upsert operations
  - Default value: "10"
  - Usage: Controls the parallelism for creating and updating embeddings
- OBAN_QUEUE_EMBEDDING_DELETION:
  - Purpose: Sets the number of concurrent workers for embedding deletion operations
  - Default value: "5"
  - Usage: Controls the parallelism for embedding cleanup jobs
- LIMIT_OUTDATED_EMBEDDINGS:
  - Purpose: Controls the maximum number of data structure versions that can be processed in a single batch when updating outdated embeddings
  - Default: 50000
  - Usage: Used by the OutdatedEmbeddings worker (runs every 3 hours via cron) to limit the number of data structure versions processed when finding and updating missing or outdated record embeddings
  - Performance: Prevents memory issues and ensures system stability when processing large numbers of outdated embeddings
- DATA_STRUCTURE_RECORD_EMBEDDINGS_BATCH_SIZE:
  - Purpose: Controls the batch size used when processing record embeddings for data structures
  - Default: 100
  - Usage: Defines how many data structure IDs are processed together in each batch when generating or updating embeddings. Used by both synchronous and asynchronous embedding operations in td-dd
  - Performance: Adjusting this value can help balance memory usage and processing efficiency when handling large numbers of data structure embeddings
- RECORD_EMBEDDINGS_DEFAULT_DELAY_MS:
  - Purpose: Controls the default delay in milliseconds between batches when processing record embeddings asynchronously
  - Default: 500
  - Usage: Defines the delay applied between consecutive batches of embedding upsert jobs. Used by the upsert_from_concepts_async/2 function to schedule jobs with a delay, preventing system overload when processing large numbers of embeddings
  - Performance: Adjusting this value can help control the rate of embedding processing and prevent overwhelming the system or external embedding services
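The interplay between a batch size and a delay between batches can be sketched as follows. This Python example is only an illustration under stated assumptions: it assumes the first batch runs immediately and each later batch is delayed by one more interval, and it reuses the documented defaults of 100 and 500. Whether upsert_from_concepts_async/2 batches and staggers jobs in exactly this way is not stated here, and the function name is hypothetical.

```python
def schedule_batches(ids, batch_size=100, delay_ms=500):
    """Split ids into batches and pair each batch with its start delay in milliseconds."""
    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
    # Assumption: batch n starts n * delay_ms after the first one.
    return [(index * delay_ms, batch) for index, batch in enumerate(batches)]


# 250 hypothetical data structure IDs produce three batches.
plan = schedule_batches(list(range(250)))
delays = [delay for delay, _ in plan]
sizes = [len(batch) for _, batch in plan]
```

Under these assumptions, raising the delay spreads the same number of batches over a longer period, while raising the batch size reduces the number of scheduled jobs.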
Ready to run in production? Please check our deployment guides.
- phoenix - A productive web framework
- ecto - Elixir toolkit for database integration
- postgrex - Elixir PostgreSQL driver
- cowboy - An HTTP server for Erlang/OTP
- httpoison - An HTTP client
- credo - Static code analysis
- guardian - Authentication library
- bodyguard - Authorization library
- ex_machina - A factory library for test data
- cors_plug - Plug for CORS support
- elasticsearch - Client for Elasticsearch
- vaultex - Client for HashiCorp Vault
- Bluetab Solutions Group, SL - Initial work - Bluetab
See also the list of contributors who participated in this project.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
In order to use this software, it is necessary that, depending on the type of functionality that you want to obtain, it is assembled with other software whose license may be governed by other terms different than the GNU General Public License version 3 or later. In that case, it will be absolutely necessary that, in order to make a correct use of the software to be assembled, you give compliance with the rules of the concrete license (of Free Software or Open Source Software) of use in each case, as well as, where appropriate, obtaining of the permits that are necessary for these appropriate purposes.