DATASET Connections
If you’re looking through the pipeline engine code or at the pipeline UI, you may see connections of type DATASET. They aren’t covered elsewhere in this README because it is aimed at transformer authors, and no transformer will ever need to accept a connection of type DATASET.
DATASET connections are an abstraction provided by the pipeline engine between itself and clients submitting pipelines. They allow clients to avoid fully specifying the interconnecting drivers between each pair of transformers, instead letting the pipeline engine infer them from the templates provided.
For example, if a pipeline connecting Transformer A to Transformer B were submitted, the pipeline engine would look at the transformer templates for Transformer A and Transformer B to determine whether the output/input pair matches. If so, the input and output would be resolved down to the underlying type, and validation/instantiation of the pipeline would continue as normal.
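The pairing check described above might be sketched as follows. This is a hypothetical illustration; `datasets_compatible` and the template shapes are assumptions, not the engine's actual internals:

```python
def datasets_compatible(upstream_output: dict, downstream_input: dict) -> bool:
    """Hypothetical check: a DATASET edge between two transformers is valid
    only if the underlying conn_types declared in their templates match."""
    return upstream_output["conn_type"] == downstream_input["conn_type"]

# Transformer A's template output and Transformer B's template input
out_a = {"conn_type": "INTERNAL_KAFKA"}
in_b = {"conn_type": "INTERNAL_KAFKA"}
print(datasets_compatible(out_a, in_b))  # True: the pair resolves to INTERNAL_KAFKA
```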
For example, take the following Transformer template:
```json
{
  "name": "My Transformer",
  "uid": "1234",
  "inputs": {
    "SOURCE_TOPIC": {
      "display_name": "Source topic",
      "conn_type": "INTERNAL_KAFKA",
    },
  },
  // ...
}
```
If a pipeline were created using this Transformer like so:
```json
{
  "name": "My Pipeline",
  "latest": [
    {
      "uid": "1234",
      "name": "My Transformer",
      "inputs": {
        "SOURCE_TOPIC": {
          "conn_type": "DATASET", // <--------- Note the conn_type
          "ref": "src-topic", // <--------- Note the raw ref
        },
      },
      "outputs": {},
      "configuration": {},
    },
  ],
}
```
The pipeline engine will see the conn_type of DATASET and attempt to reconcile it with the Transformer template. In this case, the Transformer template specifies that the SOURCE_TOPIC input must be of type INTERNAL_KAFKA. The pipeline engine will then replace the conn_type and proceed. In the case of an INTERNAL_KAFKA connection, the pipeline engine will resolve the source ref by appending a UUID to it. In the case above, the resulting pipeline would look like this:
```json
{
  "name": "My Pipeline",
  "latest": [
    {
      "uid": "1234",
      "name": "My Transformer",
      "inputs": {
        "SOURCE_TOPIC": {
          "conn_type": "INTERNAL_KAFKA", // <--------- Note the conn_type is now INTERNAL_KAFKA
          "ref": "src-topic-7fb2d3a6-57cc-48b2-b7dc-cb06f628b71a", // <--------- Note the ref has been salted
        },
      },
      "outputs": {},
      "configuration": {},
    },
  ],
}
```
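The resolution step above might be sketched as follows. This is a minimal illustration under the assumptions described in this section; `resolve_dataset_conn` and the dict shapes are hypothetical, not the engine's actual API:

```python
import uuid

def resolve_dataset_conn(conn: dict, template_input: dict) -> dict:
    """Resolve a transient DATASET connection against the transformer
    template's declared conn_type for that input (hypothetical sketch)."""
    if conn["conn_type"] != "DATASET":
        return conn  # already concrete; nothing to resolve
    concrete = template_input["conn_type"]
    if concrete == "INTERNAL_KAFKA":
        # Salt the ref by appending a UUID, as described above.
        ref = f'{conn["ref"]}-{uuid.uuid4()}'
    else:
        ref = conn["ref"]
    return {"conn_type": concrete, "ref": ref}

conn = {"conn_type": "DATASET", "ref": "src-topic"}
template_input = {"display_name": "Source topic", "conn_type": "INTERNAL_KAFKA"}
resolved = resolve_dataset_conn(conn, template_input)
print(resolved["conn_type"])  # INTERNAL_KAFKA
print(resolved["ref"])        # src-topic-<random UUID>
```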
In the case of an underlying type of INTERNAL_MINIO, the pipeline engine will not salt the ref. Instead, it will prepend an S3 storage prefix. An incoming DataConn that looks like this:

```json
"SOURCE_S3_PATH": {
  "conn_type": "DATASET", // <--------- Note the conn_type
  "ref": "src-bucket/prefix1" // <--------- Note the raw ref
}
```
Would become:
```json
"SOURCE_S3_PATH": {
  "conn_type": "INTERNAL_MINIO", // <--------- Note the conn_type
  "ref": "s3://src-bucket/prefix1" // <--------- Note the expanded ref
}
```
A conn_type of DATASET is transient; it will never be persisted by the pipeline engine, as it will always be resolved into a concrete conn_type first.