Design Document - Monitoring

Overview

One crucial aspect of deploying a data pipeline is knowing what’s happening with it. At a minimum, we need to know whether it:

  1. Is still running without error

  2. Has completed (in the case of a discrete job)

  3. Is in a degraded state

There are plenty of systems that do similar things (Argo Workflows is a great example), so we can take inspiration from them.

Requirements

  1. The system should report Pipeline-level status:

    • Inactive - not started

    • Active - currently running

    • Degraded - one or more Transformers in the Pipeline are failing

    • Completed - all Transformers in the Pipeline have finished

  2. The system should report detailed statuses for each Transformer in a Pipeline (see the sketch of both status sets after this list):

    • Active - currently running

    • Degraded - the Transformer has failed (and possibly restarted)

    • Completed - the Transformer has finished

  3. The system should report these things asynchronously, i.e., the pipeline API shouldn’t need to aggregate Pipeline- and Transformer-level status every time a request comes in.

  4. Statuses should be abstracted above any given instantiation type. For example, we can’t assume that every Transformer is instantiated as a Kubernetes Job.

  5. A Pipeline may be instantiated in multiple underlying ways, but status information should be reported in a single place. In other words, the monitoring system should provide an abstraction over multiple instantiation types.
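
As a rough sketch, the two status sets from requirements #1 and #2 could be modeled in Go as follows. The type and constant names here are illustrative, not an existing API:

    package monitoring

    // PipelineStatus is the aggregate status of a Pipeline (requirement #1).
    type PipelineStatus string

    // TransformerStatus is the status of a single Transformer (requirement #2).
    type TransformerStatus string

    const (
        PipelineInactive  PipelineStatus = "Inactive"  // not started
        PipelineActive    PipelineStatus = "Active"    // currently running
        PipelineDegraded  PipelineStatus = "Degraded"  // one or more Transformers failing
        PipelineCompleted PipelineStatus = "Completed" // all Transformers finished

        TransformerActive    TransformerStatus = "Active"    // currently running
        TransformerDegraded  TransformerStatus = "Degraded"  // failed (and possibly restarted)
        TransformerCompleted TransformerStatus = "Completed" // finished
    )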

Abstract Solution

[Diagram: Pipeline Monitoring Overview]

This diagram shows the logical flow between the Data Pipeline Engine and the hypothetical monitoring service. Keeping the creation of the Pipeline XYZ Status separate from the REST request in Step #4 allows the flow to be asynchronous from the user’s perspective - status information is always available without scraping the pipeline objects.

Since monitoring touches the whole life cycle of a Pipeline, it’s useful to break that life cycle into parts:

  1. Pipeline submission/validation

  2. Pipeline instantiation

  3. Pipeline status updates

  4. Pipeline removal

As the alternatives below show, parts #2 and #3 can be served either by separate services or together, which has consequences for scalability and reliability.
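
One way to make this breakdown concrete is an interface that each instantiation backend would implement, which also speaks to requirements #4 and #5. This is a hypothetical sketch reusing the status types from the requirements section, not an existing SDL interface:

    package monitoring

    import "context"

    // PipelineSpec is a minimal stand-in for a submitted Pipeline.
    type PipelineSpec struct {
        Name         string
        Transformers []string
    }

    // Backend is a hypothetical abstraction over instantiation types
    // (Kubernetes Jobs, Flink, Airflow, ...). Each method corresponds to
    // one of the four life-cycle parts above.
    type Backend interface {
        // Validate checks a submitted Pipeline (part #1).
        Validate(ctx context.Context, spec PipelineSpec) error
        // Instantiate creates the underlying resources (part #2).
        Instantiate(ctx context.Context, spec PipelineSpec) error
        // Status reports Pipeline- and Transformer-level status (part #3).
        Status(ctx context.Context, name string) (PipelineStatus, []TransformerStatus, error)
        // Remove tears down the underlying resources (part #4).
        Remove(ctx context.Context, name string) error
    }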

Alternatives

A few different alternatives were explored for implementing this functionality. As a litmus test, each should perform reasonably well under the following scenario:

  • 100 users

  • 10 pipelines/user

  • 5 transformers/pipeline

1. Monitor directly in the Pipeline Engine

This is the simplest option, as it doesn’t require any additional components to be managed. The Pipeline Engine retrieves status information at request time by querying each Transformer in a Pipeline, then aggregates the results into a top-level object.
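
A minimal sketch of that request-time aggregation, reusing the status types from the requirements section; transformerStatus is a hypothetical stand-in for the real per-Transformer call:

    package monitoring

    import "context"

    // transformerStatus is a placeholder for the real per-Transformer
    // status endpoint; a real implementation would make an HTTP or
    // Kubernetes API call.
    func transformerStatus(ctx context.Context, name string) (TransformerStatus, error) {
        return TransformerActive, nil // placeholder
    }

    // aggregateStatus fans out to every Transformer in a Pipeline at
    // request time and rolls the results up into one Pipeline-level status.
    func aggregateStatus(ctx context.Context, transformers []string) (PipelineStatus, error) {
        anyActive := false
        for _, name := range transformers {
            st, err := transformerStatus(ctx, name) // one API call per Transformer
            if err != nil {
                return "", err
            }
            switch st {
            case TransformerDegraded:
                // Any failing Transformer degrades the whole Pipeline.
                return PipelineDegraded, nil
            case TransformerActive:
                anyActive = true
            }
        }
        if anyActive {
            return PipelineActive, nil
        }
        return PipelineCompleted, nil
    }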

Pros:

    • Simple deployment; no additional components.

  • Full integration with other Pipeline Engine primitives.

  • Can use caching to improve performance somewhat.

Cons:

    • Pipeline status aggregation is synchronous with API requests, even when cached.

    • Scaling is sub-optimal; in the test case above, the Pipeline Engine needs to make 10 * 5 = 50 API calls to retrieve the current Pipeline statuses for a single user.

2. Monitor via a separate service, writing to the pipeline DB

In this alternative, an sdl-pipeline-monitor service is created that handles status updates on Pipeline objects. Importantly, validation and instantiation of a Pipeline are still done in the Pipeline Engine; the monitor is only responsible for handling status changes. Writing results to the pipeline database means the Pipeline Engine only needs to retrieve information from a single source.
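
As a rough sketch, the monitor’s core loop might look like the following, assuming a Postgres-backed pipeline DB with an invented pipeline_status table and reusing the aggregation helper sketched under alternative #1:

    package monitoring

    import (
        "context"
        "database/sql"
        "time"
    )

    // pollAndPersist periodically aggregates Transformer statuses and
    // writes the result to the shared pipeline DB, so the Pipeline Engine
    // reads status from a single place. The table name and columns are
    // hypothetical.
    func pollAndPersist(ctx context.Context, db *sql.DB, pipelines map[string][]string) error {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-ticker.C:
                for name, transformers := range pipelines {
                    status, err := aggregateStatus(ctx, transformers)
                    if err != nil {
                        continue // a real monitor would record the failure
                    }
                    if _, err := db.ExecContext(ctx,
                        `UPDATE pipeline_status SET status = $1, updated_at = now() WHERE name = $2`,
                        string(status), name); err != nil {
                        return err
                    }
                }
            }
        }
    }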

Pros:

    • A single access point for data is easier to reason about.

  • Asynchronous status updates are much more scalable.

Cons:

  • Concepts like redundancy, fault tolerance, etc. need to be implemented from scratch.

    • The monitor starts to look like a Kubernetes Operator without the benefits of one.

  • Sharing data structures/table schemas between the engine and monitor will be clunky.

3. Monitor via a Kubernetes Operator

This alternative is the largest change to the current Pipeline Engine. Technically, a Kubernetes Operator could simply watch underlying resources like Jobs and Services, but that forfeits many of an operator’s benefits (self-healing, life-cycle management). Because of this, using a Kubernetes Operator would entail moving Pipeline instantiation out of the Pipeline Engine, a significant change from the current state. Instead of creating Jobs and Services directly, the Pipeline Engine would create a Pipeline CR containing those specs; the operator would then handle creating and managing them.
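
For illustration, the Pipeline CR types might look roughly like this in kubebuilder style; every field name here is an assumption, not the actual sdl-operator API:

    package v1alpha1

    import (
        batchv1 "k8s.io/api/batch/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // PipelineSpec carries the specs the Pipeline Engine previously
    // created directly; the operator now owns creating and managing them.
    type PipelineSpec struct {
        Transformers []TransformerSpec `json:"transformers"`
    }

    type TransformerSpec struct {
        Name    string             `json:"name"`
        Job     batchv1.JobSpec    `json:"job"`
        Service corev1.ServiceSpec `json:"service,omitempty"`
    }

    // PipelineStatus is the Status subresource the operator reconciles.
    type PipelineStatus struct {
        Phase        string              `json:"phase"` // Inactive, Active, Degraded, Completed
        Transformers []TransformerStatus `json:"transformers,omitempty"`
    }

    type TransformerStatus struct {
        Name     string `json:"name"`
        Phase    string `json:"phase"`
        Restarts int32  `json:"restarts,omitempty"`
    }

    // +kubebuilder:object:root=true
    // +kubebuilder:subresource:status
    type Pipeline struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              PipelineSpec   `json:"spec,omitempty"`
        Status            PipelineStatus `json:"status,omitempty"`
    }

    // +kubebuilder:object:root=true
    type PipelineList struct {
        metav1.TypeMeta `json:",inline"`
        metav1.ListMeta `json:"metadata,omitempty"`
        Items           []Pipeline `json:"items"`
    }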

Pros:

  • Kubernetes Operators built with kubebuilder have lots of functionality out of the box - redundancy, horizontal scaling, fault tolerance, etc.

    • Asynchronous instantiation and monitoring make the API more responsive.

    • sdl-operator already exists, so no net-new microservice needs to be added.

    • Existing parts of SDL already integrate with the Kubernetes API, so future integration would be straightforward.

Cons:

  • Moving instantiation out of the engine makes the system somewhat more complex.

  • Operator reconciliation logic is complex.

    • Although Kubernetes CRDs are well integrated with the API/client libraries, there is still some friction in keeping them in sync across multiple services.

    • Ties the system somewhat tightly to Kubernetes, although non-Kubernetes instantiation types (Flink, Airflow, etc.) can still be targeted in the operator logic.

  • There have been issues installing CRDs in multi-tenant environments.

Final Implementation

The team decided on alternative #3 for monitoring. As described, a Kubernetes Operator pattern is used to instantiate, reconcile, and monitor Pipelines. Pipeline- and Transformer-level status information is reflected in the Status subresource of the CR; given the litmus test above, the Pipeline Engine simply needs to list Pipeline CRs to retrieve their statuses. Although the implementation makes the system more complex, it improves scalability and reliability, and it also improves visibility (listing the CRs shows which Pipelines are currently running).
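
Under this design, retrieving statuses reduces to a single list call against the Kubernetes API. A sketch using controller-runtime against the hypothetical CR types above (the module path and label key are invented):

    package main

    import (
        "context"
        "fmt"

        "sigs.k8s.io/controller-runtime/pkg/client"

        sdlv1alpha1 "example.com/sdl/api/v1alpha1" // hypothetical path to the CR types above
    )

    // listPipelineStatuses fetches all Pipeline CRs for one user in a
    // single API call. The operator keeps each Status subresource current,
    // so no per-Transformer queries happen at request time.
    func listPipelineStatuses(ctx context.Context, c client.Client, user string) error {
        var pipelines sdlv1alpha1.PipelineList
        if err := c.List(ctx, &pipelines, client.MatchingLabels{"sdl/user": user}); err != nil {
            return err
        }
        for _, p := range pipelines.Items {
            fmt.Printf("%s: %s\n", p.Name, p.Status.Phase) // e.g. "pipeline-1: Active"
        }
        return nil
    }

In the litmus test above, this replaces the 10 * 5 = 50 per-Transformer calls per user with one list call.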