TRANSFORMER Connections
The concept of an input/output essentially defines two things:
- What kind of data a transformer expects to consume or produce
- How to access that data
For many data pipelines, especially those with high-volume and low-latency requirements, underlying storage methods are used to provide reliability, fault tolerance, and scalability. However, many potential transformers either don’t need to or are unable to connect to these underlying drivers, instead exchanging data point-to-point. A contrived example would be a Kafka HTTP Tap, a hypothetical application that consumes from a Kafka topic/partition and allows clients to pull the latest message via an HTTP request. For these connections, it doesn’t make sense to jam a 'dataset' in between two applications just to be consistent. TRANSFORMER type connections are one solution, allowing two transformers to exchange information about how to connect to each other.
In this example, the Kafka HTTP Tap (which consumes from Kafka and exposes an HTTP route to pull from) accepts a single input of type INTERNAL_KAFKA and produces a single output of type TRANSFORMER:
// ...
"outputs": {
  "ROUTE": {
    "display_name": "Route to access data from",
    "conn_type": "TRANSFORMER",
    "conn_config": {
      "static_environment": [
        {
          "name": "ROUTE",
          "value": "/latest"
        }
      ]
    }
  }
},
// ...
The corresponding sink app periodically pulls from the HTTP route. It places constraints on the upstream TRANSFORMER connection by declaring what configuration the connection must provide in order to be accepted:
{
  // ...
  "inputs": {
    "SOURCE": {
      "conn_type": "TRANSFORMER",
      "required_conn_config": [
        "ROUTE",
        "HOSTNAME"
      ]
    }
  },
  // ...
}
As seen in the outputs block of the source app, both variables are provided: ROUTE comes from the upstream conn_config, while HOSTNAME is supplied automatically. It is not specified in the upstream conn_config; rather, it comes from the engine itself (see the sections below for more details on which variables can be auto-supplied).
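The acceptance check described above can be sketched as a simple coverage test: the connection is accepted only if every key in required_conn_config is satisfied either by the upstream conn_config or by an engine-provided variable. This is a hypothetical illustration; none of the names below (unsatisfied_keys, ENGINE_PROVIDED) are part of the real engine's API.

```python
# Hypothetical sketch of how an engine might check required_conn_config
# against an upstream TRANSFORMER output; names are illustrative only.

ENGINE_PROVIDED = {"HOSTNAME"}  # variables the engine can auto-supply

def unsatisfied_keys(required_conn_config, upstream_conn_config):
    """Return required keys covered neither by the upstream conn_config
    nor by an engine-provided variable (empty set means accepted)."""
    provided = {e["name"] for e in upstream_conn_config.get("static_environment", [])}
    provided |= ENGINE_PROVIDED
    return set(required_conn_config) - provided

# The sink requires ROUTE and HOSTNAME; the tap statically provides
# ROUTE, and HOSTNAME is engine-provided, so nothing is missing.
missing = unsatisfied_keys(
    ["ROUTE", "HOSTNAME"],
    {"static_environment": [{"name": "ROUTE", "value": "/latest"}]},
)
```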
When a client builds a data pipeline consisting of this transformer and a (in this case) downstream transformer, they would provide a pipeline with these parts (some fields omitted for clarity):
{
  "pipeline": [
    {
      "id": 12345,
      "name": "Kafka HTTP Tap",
      "outputs": {
        "ROUTE": { // Nothing in this block is change-able by clients
          ...
        }
      }
      ...
    },
    {
      "id": 67890,
      "name": "Downstream HTTP Client",
      "inputs": {
        "SOURCE": {
          "conn_type": "TRANSFORMER",
          "stage": 0,
          "key": "ROUTE"
        }
      }
    }
  ]
}
Note the stage and key fields in the inputs block, in lieu of the ref that would normally be present. These tell the data pipeline engine to look in the zeroth stage of the pipeline list for an object named ROUTE in its outputs block.
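The stage/key lookup amounts to plain indexing into the pipeline list. The resolver below is a hypothetical sketch, not the engine's actual implementation:

```python
# Hypothetical resolver for stage/key references; the function name is
# illustrative, not part of the engine's API.

def resolve_transformer_input(pipeline, input_spec):
    """Look up the output object a TRANSFORMER input points at:
    `stage` indexes into the pipeline list, and `key` names an
    entry in that stage's outputs block."""
    stage = pipeline[input_spec["stage"]]
    return stage["outputs"][input_spec["key"]]

pipeline = [
    {"name": "Kafka HTTP Tap",
     "outputs": {"ROUTE": {"conn_type": "TRANSFORMER",
                           "conn_config": {"static_environment": [
                               {"name": "ROUTE", "value": "/latest"}]}}}},
    {"name": "Downstream HTTP Client",
     "inputs": {"SOURCE": {"conn_type": "TRANSFORMER",
                           "stage": 0, "key": "ROUTE"}}},
]

route_output = resolve_transformer_input(pipeline, pipeline[1]["inputs"]["SOURCE"])
```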
The pipeline engine would then supply Downstream HTTP Client with the following environment:
SOURCE_HOSTNAME=<hostname of 'Kafka HTTP Tap' Service in Kubernetes> (1)
SOURCE_ROUTE=/latest (2)
| 1 | Engine-provided configuration consisting of the hostname of Kafka HTTP Tap, which is only known by the engine when instantiating the pipeline. |
| 2 | All configuration contained in the conn_config block of the ROUTE output, as provided by the Kafka HTTP Tap template. |
The actual configuration variable names are created by concatenating the input name and the configuration key provided by the connection being used, separated by an underscore (_). Thus, SOURCE (input) + _ + HOSTNAME (from engine_provided_environment) yields SOURCE_HOSTNAME. This is intended to prevent naming collisions when multiple inputs and/or outputs are specified.
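Putting the two sources of configuration together, the environment handed to the downstream stage could be assembled roughly as follows. This is a sketch under stated assumptions: build_environment is a hypothetical name, and the service hostname is an illustrative value, since only the engine knows it at instantiation time.

```python
# Hypothetical sketch of how the engine might assemble a downstream
# environment; build_environment is an illustrative name, not real API.

def build_environment(input_name, conn_config, service_hostname):
    """Merge static_environment entries with the engine-provided
    HOSTNAME, prefixing every key with the input name plus '_'."""
    env = {}
    for entry in conn_config.get("static_environment", []):
        env[f"{input_name}_{entry['name']}"] = entry["value"]
    # HOSTNAME is engine-provided: the hostname of the upstream stage's
    # service, known only when the pipeline is instantiated.
    env[f"{input_name}_HOSTNAME"] = service_hostname
    return env

env = build_environment(
    "SOURCE",
    {"static_environment": [{"name": "ROUTE", "value": "/latest"}]},
    "kafka-http-tap",  # hypothetical Kubernetes Service hostname
)
```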
Push- and Pull-based Transformers
TRANSFORMER connections can provide configuration either upstream or downstream; that is, an input of type TRANSFORMER that has conn_config will provide it to any Transformers with the corresponding output, and vice-versa.
In data-engineering terms, a Transformer that has an output of type TRANSFORMER that provides connection information via conn_config can be thought of as pull-based. A Transformer that has an input of type TRANSFORMER that provides connection information via conn_config can be thought of as push-based. 'Normal' Transformers that provide configuration for an underlying data storage solution like Kafka are technically both, as producers produce and consumers call poll() or next().
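A pull-based consumer like the Downstream HTTP Client could be sketched as below, assuming only the two environment variables shown earlier; the default values and the plain-HTTP URL scheme are assumptions made so the sketch runs standalone.

```python
import os
import urllib.request

# Read the environment supplied by the pipeline engine; the defaults
# here are assumptions for running this sketch outside a pipeline.
host = os.environ.get("SOURCE_HOSTNAME", "kafka-http-tap")
route = os.environ.get("SOURCE_ROUTE", "/latest")
url = f"http://{host}{route}"

def poll_once():
    """Pull the latest message from the upstream tap over HTTP."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```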
Types of TRANSFORMER conn_config
Similarly to the configuration block of a Transformer, conn_config exposes a set of configurations available to provide to any Transformers that connect to this as one of their inputs or outputs.
environment
Currently, the data pipeline engine DOES NOT support providing user-settable environment for a connection of type TRANSFORMER.
static_environment
These work as expected; after they’re defined in the transformer’s template, any Transformers that connect to that connection will be provided each of these.
engine_provided_environment
Configurations like HOSTNAME aren’t known until the pipeline engine decides what to name the particular stage when it’s created (in Kubernetes, for example). Thus, they’re specified here.
The full list of supported engine_provided_environment is:
- HOSTNAME: Specify this to direct the data pipeline engine to populate it with the hostname of the Kubernetes Service pointing to this Transformer.
Transformer writers exposing inputs/outputs of type TRANSFORMER are encouraged to make fields in conn_config as generically-named as possible.