System Design

Terms

  • Data Tool - Any software component, usually open-source, used for processing, storing, or recording data. Examples include Kafka, Flink, Spark, Beam, NiFi, Arroyo, and Airflow, among others.

  • Dataset - A set of connected data objects, intended to be used/processed as a unit. A dataset may be bounded or unbounded.

  • Pipeline - A series of operations performed on a dataset. A pipeline may be cyclic or acyclic and may connect to other data pipelines.

  • Stage - A single operation contained in a data pipeline.

Rationale

The Data Pipeline Engine (DPE) was conceived from a set of pain points encountered by Raft over years of working with different organizations. Those pain points are covered later; to start, a number of functional and non-functional requirements were established to guide development. They are:

  1. The solution should be deployed closely within SDL, utilizing as much of the current stack as possible. This should include things like:

    • Existing infrastructure like Kafka, MinIO, etc.

    • Existing SDL core concepts like Datasets and Datasources.

    • Kubernetes primitives, potentially including Kubernetes Operator-like capabilities.

  2. The solution should provide a pathway to integration with the SDL Data Catalog for 'closed-loop' data pipelines that contribute new datasets to the catalog.

Multiple data tools may be involved in projects

This may not be true for small, very focused, or greenfield projects, where technology selection is limited to a small set of engineers. However, this ceases to be the case as new engineers onboard to a project, user requirements emerge that weren't accounted for in the original design, and technological best practices evolve. It is assumed that nearly all successful, data-heavy projects evolve to the point where multiple data tools are used for processing.

The proliferation of data tools in recent years has happened for good reason. At sufficient scales of data or system complexity (often at lower thresholds than you might think), the design trade-offs between data tools become apparent: Kafka Streams may be preferable to Airflow, or Flink to Beam, and so on. Data engineers should therefore expect to operate under these conditions.

With this in mind, when designing the technical stack for a new system, architects have three options for visualizing, managing, and deploying data pipelines:

  1. Mandate that only a single data tool is used. This option has significant drawbacks, including being tethered to a single data tool and its community, and betting on that tool's ability to encompass all current and future use cases.

  2. Accept (implicitly or explicitly) that multiple data tools will be used, but don’t plan for a way to deal with that consequence. Projects architected with this option are typically characterized by disorganized or fragmented deployment processes, delays in implementing conceptually simple changes that span multiple tools, and an inability to visualize or manage these complex data flows except through manual methods (e.g., hand-drawn architecture diagrams).

  3. Explicitly accept this eventuality and plan for the consequences, including how to deploy, visualize, monitor, and update data pipelines encoded in multiple tools.

Pain Points

The following pain points inform much of the core functionality of SDL Data Pipelines.

Visualizing data pipelines is hard

In terms of pipeline complexity, pre-MVCR 1 (2023) CBC2 wasn’t hugely complicated. The team was managing essentially three data pipelines, each with between 2 and 5 stages. All data were streamed in via Kafka and a subset were persisted to Postgres for retrieval via a REST API. Because the stack was Kafka-heavy, most applications were written as Kubernetes `Deployment`s, some utilizing Kafka Streams.

Even with only a single data tool being used (Kafka/KStreams), visualizing the pipelines was extremely difficult. Most of the time, deciphering what a particular pipeline was doing required examining each set of Deployment objects to figure out the interconnections (Kafka Topics). This worked well enough for engineers, but anyone non-technical (or even engineers not well-versed in the system) was in the dark. The team ended up keeping an architecture diagram to track individual data flows, complete with Kafka Topics and boxes representing pipeline stages, to show internal and external stakeholders the system.
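In practice, "examining each set of Deployment objects" amounts to scraping topic names out of manifests. The sketch below shows what automating that inspection might look like, assuming a hypothetical convention where each stage declares its Kafka topics via `INPUT_TOPIC` / `OUTPUT_TOPIC` environment variables; the variable names and namespace are illustrative assumptions, not taken from CBC2.

```python
# Sketch only: reconstruct pipeline edges from Deployment manifests.
# Assumes the hypothetical INPUT_TOPIC / OUTPUT_TOPIC convention described above.
from collections import defaultdict

from kubernetes import client, config


def pipeline_edges(namespace: str = "pipelines") -> dict:
    """Map each stage (Deployment) to the stages consuming its output topic."""
    config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
    apps = client.AppsV1Api()

    producers = defaultdict(list)  # topic -> stages writing to it
    consumers = defaultdict(list)  # topic -> stages reading from it

    for dep in apps.list_namespaced_deployment(namespace).items:
        for container in dep.spec.template.spec.containers:
            for env in container.env or []:
                if env.name == "OUTPUT_TOPIC":
                    producers[env.value].append(dep.metadata.name)
                elif env.name == "INPUT_TOPIC":
                    consumers[env.value].append(dep.metadata.name)

    # Edge: a stage writing a topic feeds every stage reading that topic.
    edges = defaultdict(list)
    for topic, writers in producers.items():
        for writer in writers:
            edges[writer].extend(consumers.get(topic, []))
    return dict(edges)


if __name__ == "__main__":
    for stage, downstream in pipeline_edges().items():
        print(f"{stage} -> {', '.join(downstream) or '(sink)'}")
```

Even automated, this only recovers the topology after the fact, and only for stages that happen to live in Kubernetes and follow the assumed convention.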

Maintaining an architecture diagram by hand wasn’t the best solution, for several reasons:

  1. Because architecture diagrams usually live in dedicated graphical tools outside the codebase, they tend to drift slightly (or majorly) out of date. Updates to the data pipelines had to be intentionally reflected in the diagram.

  2. Knowledge of how to create the architecture diagram still resided with engineers. Non-technical or outside parties might have a snapshot of the data architecture at a point in time, but that snapshot had limited usefulness for anything besides record-keeping.

  3. Manually creating and updating diagrams is time-intensive and requires broad system knowledge to get right. Though broad system understanding is probably a good thing, engineering time spent on architecture diagrams should be kept to a reasonable minimum.

When a project uses a single data tool, there is sometimes an open-source or proprietary visualization tool available, which can ease this pain point somewhat. However, in many cases no such tool exists, and once another data tool is added, its benefits are significantly lessened.

Lesson Learned

Given that data pipelines are often heterogeneous with respect to data tools, they should be visualized outside the tools themselves. In other words, visualization of a data pipeline should not rely on features inside the data tools themselves.

This is a different direction from tools like Apache NiFi, which ship with built-in UIs. However, we see the benefits of those built-in UIs quickly deteriorate as parts of a pipeline are built outside those tools.

Data pipelines are hard to deploy

This problem grows considerably more difficult as more data tools are used in a given project, but it exists even with a single tool. On CBC2, pipelines were deployed as Helm charts, which had benefits (deployments are easy to templatize between environments) and drawbacks (threading pipelines together is somewhat difficult, and adding or removing stages requires many modifications).

Many popular data tools like Airflow, NiFi, and Arroyo bake deployment into the tool itself, either via UIs or their own deployment languages (like Python for Airflow).
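As a point of comparison, the minimal sketch below shows roughly what that looks like in Airflow (assuming a recent Airflow 2.x install; the DAG id, task names, and commands are placeholders, not anything from CBC2 or SDL): the pipeline's definition is also its deployment artifact.

```python
# Illustrative Airflow DAG: the pipeline definition doubles as the deployment artifact.
# All names and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_ingest_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
):
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load
```

This is convenient inside Airflow, but the resulting topology is visible only through Airflow's own UI and metadata, which is exactly the tool-local knowledge problem described in the previous pain point.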