SDL Data Pipelines
1. Introduction
SDL Data Pipelines provide capabilities to design, configure, deploy, and manage complex data processing jobs without being restricted to underlying tools like Kubernetes.
The user documentation contains help sections for authoring Transformers and deploying them to SDL; the developer documentation provides more information on the underlying capabilities and concepts, major architectural components, and design decisions.
2. High-Level Overview
Raft built the Data Pipelines system to address several pain points. The Design section goes into more detail, but at a glance:
- Deploying and managing ETL jobs is difficult; many ETL platforms are not optimized for Kubernetes and require additional management.
- In many enterprise systems, the question "What data pipelines are running?" is non-trivial to answer. Even harder is the question "What steps comprise a data pipeline?"
- Many ETL systems are sticky; it’s difficult to maintain a system that uses more than one tool (e.g., Flink, Spark, NiFi, Airflow).
The system is built around three main concepts:
- Dataset: A logical grouping of data in SDL. For example, a Kafka topic, a bucket or key prefix on S3, or a table in PostgreSQL.
- Transformer: Any process that acts upon a Dataset (or another Transformer), producing zero or more Datasets as output.
- Pipeline: A directed, possibly cyclic graph that describes a series of Transformers and their connections to Datasets.
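The three concepts above can be sketched as plain data structures. This is a minimal illustration, not the Pipeline Engine's actual model: the class fields, the `location` string, and the method names are assumptions made for the example, and edges are assumed to point from an upstream node to a downstream node.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A logical grouping of data, e.g. a Kafka topic or a PostgreSQL table."""
    name: str
    location: str  # hypothetical locator string, not an SDL-defined scheme


@dataclass(frozen=True)
class Transformer:
    """A process that acts on a Dataset or another Transformer."""
    name: str


@dataclass
class Pipeline:
    """Nodes plus (upstream name, downstream name) edges; cycles are allowed."""
    name: str
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, node):
        """Register a Dataset or Transformer and return it for chaining."""
        self.nodes[node.name] = node
        return node

    def connect(self, upstream, downstream):
        """Add an edge; both endpoints must already be part of the pipeline."""
        for node in (upstream, downstream):
            if node.name not in self.nodes:
                raise KeyError(f"{node.name} is not part of pipeline {self.name}")
        self.edges.append((upstream.name, downstream.name))

    def steps(self):
        """Names of the Transformers in this pipeline, in insertion order."""
        return [n.name for n in self.nodes.values() if isinstance(n, Transformer)]

    def outputs_of(self, transformer):
        """Names of the Datasets a Transformer writes to (may be empty)."""
        return [
            dst
            for src, dst in self.edges
            if src == transformer.name and isinstance(self.nodes[dst], Dataset)
        ]


# Usage: one Transformer reading a Kafka topic and writing a PostgreSQL table.
pipeline = Pipeline("orders-cleanup")
raw = pipeline.add(Dataset("raw-orders", "kafka://raw-orders"))
clean = pipeline.add(Dataset("clean-orders", "postgresql://warehouse/clean_orders"))
dedupe = pipeline.add(Transformer("dedupe"))
pipeline.connect(raw, dedupe)
pipeline.connect(dedupe, clean)
```

With a structure like this, "what steps comprise a data pipeline?" becomes a lookup (`pipeline.steps()`) rather than an investigation across several tools.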
Two additional terms are important:
- Transformer Template (also called a Template): A record of a Transformer that is registered with the Pipeline Engine so that it can be used in Pipelines.
- Pipeline Template: A pre-defined scaffold of a Pipeline with options for what Transformers can be used at each step, simplifying the process of creating commonly-needed Pipelines.
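To make the relationship between the two kinds of template concrete, here is a minimal sketch under stated assumptions. The `image` field, the `slots` mapping, and the `instantiate` method are hypothetical names chosen for illustration; they are not the Pipeline Engine's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerTemplate:
    """A registered record of a Transformer that Pipelines may use."""
    name: str
    image: str  # hypothetical field: the container image that runs it


@dataclass
class PipelineTemplate:
    """A scaffold: each step lists the Transformer Templates allowed there."""
    name: str
    slots: dict  # step name -> tuple of allowed Transformer Template names

    def instantiate(self, registry, choices):
        """Resolve one registered template per step, rejecting invalid picks."""
        missing = [step for step in self.slots if step not in choices]
        if missing:
            raise ValueError(f"no choice made for steps: {missing}")
        resolved = {}
        for step, allowed in self.slots.items():
            choice = choices[step]
            if choice not in allowed:
                raise ValueError(f"{choice!r} is not allowed at step {step!r}")
            if choice not in registry:
                raise KeyError(f"{choice!r} is not a registered template")
            resolved[step] = registry[choice]
        return resolved


# Usage: a two-step scaffold with a choice of parser at the first step.
registry = {
    t.name: t
    for t in (
        TransformerTemplate("csv-parser", "example/csv-parser:1"),
        TransformerTemplate("json-parser", "example/json-parser:1"),
        TransformerTemplate("pg-writer", "example/pg-writer:1"),
    )
}
ingest = PipelineTemplate(
    "ingest-to-postgres",
    slots={"parse": ("csv-parser", "json-parser"), "write": ("pg-writer",)},
)
plan = ingest.instantiate(registry, {"parse": "json-parser", "write": "pg-writer"})
```

The scaffold constrains what may go where, so creating a commonly needed Pipeline reduces to picking one option per step.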
3. Native Transformation Path
Transformers can also convert between arbitrary formats without forcing conformity to a data model:
- Format Bridges: Convert between coalition partner formats
- Legacy Adapters: Translate protocols for systems that cannot be modified
- High-Fidelity Passthrough: Preserve source format when downstream systems require it
This path (Path 2) enables interoperability scenarios that forced conformity to a data model would break.
AXS --> AXS-to-AFATDS Transformer --> System B (Original AXS transformed to AFATDS for Multinational Operational Environments)
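The idea of converting directly between formats, with no intermediate data model, can be sketched as a lookup of bridges keyed by source and target format. This is an illustrative sketch only: the `BRIDGES` registry and the `transform` function are hypothetical, and CSV and JSON Lines stand in for the real formats, whose structure is not described here.

```python
import csv
import io
import json


def csv_to_json_lines(payload: str) -> str:
    """Format bridge: turn CSV text into one JSON object per line."""
    rows = csv.DictReader(io.StringIO(payload))
    return "\n".join(json.dumps(row) for row in rows)


# Hypothetical registry: (source format, target format) -> bridge function.
BRIDGES = {("csv", "jsonl"): csv_to_json_lines}


def transform(payload: str, source_format: str, target_format: str) -> str:
    """Convert directly between formats without a shared data model."""
    if source_format == target_format:
        # High-fidelity passthrough: hand the source payload on untouched.
        return payload
    try:
        bridge = BRIDGES[(source_format, target_format)]
    except KeyError:
        raise LookupError(
            f"no bridge registered from {source_format} to {target_format}"
        ) from None
    return bridge(payload)


# Usage: bridge one format to another, or pass the original through.
converted = transform("id,name\n1,alpha\n", "csv", "jsonl")
untouched = transform("id,name\n1,alpha\n", "csv", "csv")
```

Because each bridge maps one format straight to another, adding support for a new partner format means registering a new function rather than extending a shared model.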