Environment Variable Contract
Kubernetes transformers must adhere to the environment variable contract dictated by the pipeline engine. The engine provides several categories of environment variables to transformers, as follows:
[Kafka Transformers] Kafka Environment
For transformers with at least one input or output whose conn_type is INTERNAL_KAFKA, the following environment will be provided:
- KAFKA_BROKER_HOST: The Kafka endpoint to use.
- KAFKA_SASL_MECHANISM: SASL mechanism to use. Usually SCRAM-SHA-512.
- KAFKA_SASL_USERNAME: SASL username.
- KAFKA_SASL_PASSWORD: SASL password.
- KAFKA_GROUP_ID: Consumer group ID to be used. This allows for horizontal scaling of Kubernetes Jobs if a transformer’s replicas is set greater than 1.
- KAFKA_SECURITY_PROTOCOL: Security protocol used. Usually SASL_PLAINTEXT.
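As a minimal sketch of consuming these variables, the helper below maps them onto librdkafka-style configuration keys. The function name `kafka_client_config` is hypothetical, and the choice of a librdkafka-based client is an assumption; the contract itself only defines the variable names.

```python
import os
from typing import Mapping


def kafka_client_config(env: Mapping[str, str] = os.environ) -> dict:
    """Translate the engine-provided Kafka variables into librdkafka-style keys.

    Hypothetical helper: the key names follow librdkafka's configuration
    properties; adapt them if the transformer uses a different client.
    """
    return {
        "bootstrap.servers": env["KAFKA_BROKER_HOST"],
        "security.protocol": env["KAFKA_SECURITY_PROTOCOL"],
        "sasl.mechanism": env["KAFKA_SASL_MECHANISM"],
        "sasl.username": env["KAFKA_SASL_USERNAME"],
        "sasl.password": env["KAFKA_SASL_PASSWORD"],
        # Replicas of the same transformer share this group ID, so Kafka
        # can spread partitions across them.
        "group.id": env["KAFKA_GROUP_ID"],
    }
```

A transformer that only produces messages would likely drop `group.id` from the result, since it is a consumer setting.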
[MinIO Transformers] MinIO Environment
For transformers with at least one input or output whose conn_type is INTERNAL_MINIO, the following environment will be provided:
- MINIO_ENDPOINT: MinIO endpoint.
- MINIO_ACCESS_KEY: MinIO username.
- MINIO_SECRET_KEY: MinIO password.
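One way to read the MinIO variables is to load them once at start-up and fail fast if any is missing. This is a sketch only; `MinioSettings` and `load_minio_settings` are hypothetical names, not part of the pipeline engine.

```python
import os
from dataclasses import dataclass
from typing import Mapping

MINIO_VARS = ("MINIO_ENDPOINT", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")


@dataclass(frozen=True)
class MinioSettings:
    endpoint: str
    access_key: str
    secret_key: str


def load_minio_settings(env: Mapping[str, str] = os.environ) -> MinioSettings:
    """Read the three MinIO variables, raising if any is absent."""
    missing = [name for name in MINIO_VARS if name not in env]
    if missing:
        raise RuntimeError("missing MinIO variables: " + ", ".join(missing))
    return MinioSettings(
        endpoint=env["MINIO_ENDPOINT"],
        access_key=env["MINIO_ACCESS_KEY"],
        secret_key=env["MINIO_SECRET_KEY"],
    )
```

The resulting fields can then be passed to whichever S3-compatible client the transformer uses.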
[Iceberg Transformers] Iceberg Environment
For transformers with at least one input or output whose conn_type is INTERNAL_ICEBERG, the following environment will be provided (in addition to the MinIO environment above):
- ICEBERG_CATALOG_REST_HOST: The Iceberg REST catalog endpoint (e.g. Lakekeeper).
- ICEBERG_CATALOG_REST_CATALOG_PATH: The catalog path on the REST endpoint.
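The sketch below shows one possible way to combine the two Iceberg variables into a single catalog URI. It assumes that the catalog path is simply appended to the host; the contract above does not state how the two values are meant to be combined, so treat this joining rule and the helper name `iceberg_catalog_uri` as assumptions.

```python
import os
from typing import Mapping


def iceberg_catalog_uri(env: Mapping[str, str] = os.environ) -> str:
    """Join the REST host and catalog path with exactly one slash.

    Assumption: the path is appended to the host. Verify this against the
    catalog deployment before relying on it.
    """
    host = env["ICEBERG_CATALOG_REST_HOST"].rstrip("/")
    path = env["ICEBERG_CATALOG_REST_CATALOG_PATH"].strip("/")
    return f"{host}/{path}" if path else host
```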
[Dataset Transformers] Dataset Environment
For transformers with conn_type of DATASET, the engine resolves the dataset reference from the catalog and downcasts it to the underlying resource type (e.g. INTERNAL_KAFKA, INTERNAL_MINIO). The environment variables provided will match those of the resolved type.
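Because the resolved type decides which variables are present, a dataset transformer may want to check at start-up which variable groups it actually received. The following is a rough sketch of that check; `ENV_GROUPS` and `provided_groups` are hypothetical names, and the groups simply restate the variable lists from the sections above.

```python
import os
from typing import Mapping, Set

MINIO = ("MINIO_ENDPOINT", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")

ENV_GROUPS = {
    "INTERNAL_KAFKA": (
        "KAFKA_BROKER_HOST",
        "KAFKA_SASL_MECHANISM",
        "KAFKA_SASL_USERNAME",
        "KAFKA_SASL_PASSWORD",
        "KAFKA_GROUP_ID",
        "KAFKA_SECURITY_PROTOCOL",
    ),
    "INTERNAL_MINIO": MINIO,
    # Iceberg connections receive the MinIO variables as well.
    "INTERNAL_ICEBERG": MINIO + (
        "ICEBERG_CATALOG_REST_HOST",
        "ICEBERG_CATALOG_REST_CATALOG_PATH",
    ),
}


def provided_groups(env: Mapping[str, str] = os.environ) -> Set[str]:
    """Return every connection type whose variables are all present."""
    return {
        group
        for group, names in ENV_GROUPS.items()
        if all(name in env for name in names)
    }
```

Note that an Iceberg environment also satisfies the MinIO group, so the result can contain both names.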
[All Transformers] Prometheus Environment
These variables are provided to enable transformers to emit Prometheus metrics. A PodMonitor deployed in SDL automatically scrapes the port and path specified by these variables:
- PROM_METRICS_NS: Prometheus metric namespace to use. This isn’t strictly checked but should be used to help separate transformers' metric names. See [here](https://prometheus.io/docs/practices/naming/) for more details.
- PROM_METRICS_PORT: The port on which to start the Prometheus server. The pipeline engine will appropriately expose this port on the Job/Pod.
- PROM_METRICS_ROUTE: The route at which the Prometheus server should serve metrics. This is what Prometheus is configured to look for via a PodMonitor.
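A minimal sketch of reading these settings and prefixing metric names with the namespace might look as follows. The helper names are hypothetical, and normalising the route to begin with a slash is an assumption about how the value is supplied.

```python
import os
from typing import Mapping, Tuple


def load_prom_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int, str]:
    """Return (namespace, port, route) from the Prometheus variables."""
    port = int(env["PROM_METRICS_PORT"])  # raises ValueError if not numeric
    route = env["PROM_METRICS_ROUTE"]
    if not route.startswith("/"):
        # Assumption: tolerate a route given without its leading slash.
        route = "/" + route
    return env["PROM_METRICS_NS"], port, route


def metric_name(namespace: str, name: str) -> str:
    """Prefix a metric with the namespace, joined by an underscore."""
    return f"{namespace}_{name}"
```

The port and route would then be handed to whatever metrics server the transformer runs, so that the PodMonitor finds the metrics where it expects them.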
[All Transformers] Contextual Environment
These variables provide transformers with some context about the environment they’re running in when instantiated.
- PIPELINE_UID: The UID of the pipeline instance that this Transformer is part of.
- PIPELINE_NAME: The name of the pipeline instance that this Transformer is part of.
- PIPELINE_TRANSFORMER_UID: The UID corresponding to the Transformer template that was used to instantiate this Transformer.
- PIPELINE_TRANSFORMER_NAME: The instantiated name of this Transformer (i.e. the Kubernetes Job name).
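One possible use of the contextual variables is to attach them to log records or metric labels. The sketch below collects them into a dictionary; `pipeline_context` is a hypothetical helper, and falling back to an empty string for a missing variable is an assumption made so that the code also runs outside the pipeline engine.

```python
import os
from typing import Dict, Mapping

CONTEXT_VARS = (
    "PIPELINE_UID",
    "PIPELINE_NAME",
    "PIPELINE_TRANSFORMER_UID",
    "PIPELINE_TRANSFORMER_NAME",
)


def pipeline_context(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the contextual variables under lower-case keys."""
    return {name.lower(): env.get(name, "") for name in CONTEXT_VARS}
```

The returned dictionary could, for example, be passed as the `extra` argument of a `logging.LoggerAdapter` so that every log line carries the pipeline and transformer identifiers.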
[Direct Transformer <-> Transformer] Direct Environment
Used specifically for connections of type TRANSFORMER. The configuration contract is defined specifically between two transformers; no default configuration is provided by the data pipeline engine. See Transformer Connections for more information.