Dataset Schemas

The Dataset Schema API enables developers to retrieve and understand the structure of datasets in the catalog. This API integrates with the Schema Registry to provide both raw schema definitions and normalized field information for filtering and querying.

Overview

The Schema API provides two modes of operation:

  • Raw Schema Mode: Returns the actual schema definition as stored in the Schema Registry

  • Normalized Mode: Returns a standardized field list optimized for building filter expressions

This dual approach allows developers to:

  • Access original schema definitions for data processing

  • Build dynamic filter interfaces using normalized field information

  • Support multiple schema formats (Avro, Protobuf, JSON Schema)

  • Validate data structures before processing
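The two modes can sit behind a single helper. The following sketch assumes a `fetch`-style HTTP client and an illustrative `ids` argument shape; neither is part of the API itself:

```javascript
// Sketch of a dual-mode schema fetcher. The fetch wrapper and the
// shape of the `ids` argument are illustrative assumptions.
function schemaUrl({ datasourceId, enablementId, datasetId }, raw = false) {
  const base = `/datasources/${datasourceId}/enablements/${enablementId}` +
               `/datasets/${datasetId}/schema`;
  return raw ? `${base}?raw=true` : base;
}

async function getSchema(ids, { raw = false } = {}) {
  const res = await fetch(schemaUrl(ids, raw));
  if (!res.ok) throw new Error(`Schema request failed: ${res.status}`);
  // Raw Protobuf schemas are returned as text; Avro, JSON Schema,
  // and normalized responses are JSON.
  const type = res.headers.get('content-type') || '';
  return type.includes('application/json') ? res.json() : res.text();
}
```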

API Endpoints

Get Dataset Schema

Retrieves the schema for a specific dataset.

GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema

Query Parameters

Parameter  Type     Description
raw        boolean  When true, returns the raw schema content. When false or omitted, returns normalized field information. Default: false

Response Types

Schema Format  Raw Mode Content-Type          Normalized Mode Content-Type
Avro           application/json               application/json
Protobuf       application/x-protobuf-schema  application/json
JSON Schema    application/json               application/json

Raw Schema Responses

When using ?raw=true, the API returns the actual schema definition from the Schema Registry.

Avro Schema Example

Avro schemas are returned as JSON with full type information and documentation.

Request:

GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true

Response Headers:

Content-Type: application/json

Response Body:

{
  "type": "record",
  "name": "CustomerRecord",
  "namespace": "io.raft.datafabric.customer",
  "doc": "A record representing customer data",
  "fields": [
    {
      "name": "customerId",
      "type": "long",
      "doc": "Unique customer identifier"
    },
    {
      "name": "firstName",
      "type": "string",
      "doc": "Customer's first name"
    },
    {
      "name": "lastName",
      "type": "string",
      "doc": "Customer's last name"
    },
    {
      "name": "email",
      "type": ["null", "string"],
      "default": null,
      "doc": "Customer's email address"
    },
    {
      "name": "age",
      "type": "int",
      "doc": "Customer's age"
    },
    {
      "name": "accountBalance",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
      },
      "doc": "Customer's account balance"
    },
    {
      "name": "registrationDate",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      "doc": "Date when customer registered"
    },
    {
      "name": "preferences",
      "type": {
        "type": "map",
        "values": "string"
      },
      "doc": "Customer preferences"
    },
    {
      "name": "tags",
      "type": {
        "type": "array",
        "items": "string"
      },
      "doc": "Tags associated with the customer"
    },
    {
      "name": "status",
      "type": {
        "type": "enum",
        "name": "CustomerStatus",
        "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"]
      },
      "doc": "Customer account status"
    },
    {
      "name": "address",
      "type": ["null", {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "postalCode", "type": "string"},
          {"name": "country", "type": "string"}
        ]
      }],
      "default": null,
      "doc": "Customer's address"
    }
  ]
}
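A client can inspect an Avro field definition to determine nullability: a union type containing "null" (like the email and address fields above) marks an optional field, which is presumably how the normalized mode derives its nullable flag. A minimal sketch:

```javascript
// Determine whether an Avro field is nullable (its type is a union
// containing "null"), and extract the non-null branch of the union.
function avroFieldInfo(field) {
  const t = field.type;
  if (Array.isArray(t)) {
    return {
      nullable: t.includes('null'),
      type: t.find(branch => branch !== 'null'),
    };
  }
  return { nullable: false, type: t };
}
```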

Protobuf Schema Example

Protobuf schemas are returned in their text format with the appropriate content type.

Request:

GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true

Response Headers:

Content-Type: application/x-protobuf-schema

Response Body:

syntax = "proto3";

package io.raft.datafabric.customer;

option java_package = "io.raft.datafabric.customer.proto";
option java_outer_classname = "CustomerProto";

message CustomerRecord {
  // Unique customer identifier
  int64 customer_id = 1;

  // Customer's name
  string first_name = 2;
  string last_name = 3;

  // Optional email
  optional string email = 4;

  // Customer's age
  int32 age = 5;

  // Account balance with decimal precision
  double account_balance = 6;

  // Registration timestamp (Unix epoch milliseconds)
  int64 registration_date = 7;

  // Customer preferences as key-value pairs
  map<string, string> preferences = 8;

  // Tags associated with the customer
  repeated string tags = 9;

  // Customer status
  enum CustomerStatus {
    CUSTOMER_STATUS_UNSPECIFIED = 0;
    CUSTOMER_STATUS_ACTIVE = 1;
    CUSTOMER_STATUS_INACTIVE = 2;
    CUSTOMER_STATUS_SUSPENDED = 3;
    CUSTOMER_STATUS_DELETED = 4;
  }
  CustomerStatus status = 10;

  // Nested address message
  message Address {
    string street = 1;
    string city = 2;
    string state = 3;
    string postal_code = 4;
    string country = 5;
  }

  // Optional address
  optional Address address = 11;

  // Contact preference using oneof
  oneof contact_method {
    string phone = 12;
    string mobile = 13;
    string work_phone = 14;
  }

  // Nested repeated messages for order history
  message Order {
    string order_id = 1;
    int64 order_date = 2;
    double total_amount = 3;

    enum OrderStatus {
      ORDER_STATUS_UNSPECIFIED = 0;
      ORDER_STATUS_PENDING = 1;
      ORDER_STATUS_PROCESSING = 2;
      ORDER_STATUS_SHIPPED = 3;
      ORDER_STATUS_DELIVERED = 4;
      ORDER_STATUS_CANCELLED = 5;
    }
    OrderStatus status = 4;
  }

  repeated Order orders = 15;
}

JSON Schema Example

JSON schemas follow the JSON Schema specification and include validation rules.

Request:

GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true

Response Headers:

Content-Type: application/json

Response Body:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://raft.io/schemas/customer-record.json",
  "title": "Customer Record",
  "description": "A record representing customer data",
  "type": "object",
  "required": ["customerId", "firstName", "lastName", "age", "status"],
  "properties": {
    "customerId": {
      "type": "integer",
      "description": "Unique customer identifier",
      "minimum": 1
    },
    "firstName": {
      "type": "string",
      "description": "Customer's first name",
      "minLength": 1,
      "maxLength": 100
    },
    "lastName": {
      "type": "string",
      "description": "Customer's last name",
      "minLength": 1,
      "maxLength": 100
    },
    "email": {
      "type": ["string", "null"],
      "description": "Customer's email address",
      "format": "email",
      "default": null
    },
    "age": {
      "type": "integer",
      "description": "Customer's age",
      "minimum": 0,
      "maximum": 150
    },
    "accountBalance": {
      "type": "number",
      "description": "Customer's account balance",
      "multipleOf": 0.01
    },
    "registrationDate": {
      "type": "string",
      "description": "Date when customer registered",
      "format": "date-time"
    },
    "preferences": {
      "type": "object",
      "description": "Customer preferences",
      "additionalProperties": {
        "type": "string"
      }
    },
    "tags": {
      "type": "array",
      "description": "Tags associated with the customer",
      "items": {
        "type": "string"
      },
      "uniqueItems": true
    },
    "status": {
      "type": "string",
      "description": "Customer account status",
      "enum": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"]
    },
    "address": {
      "type": ["object", "null"],
      "description": "Customer's address",
      "default": null,
      "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "postalCode": {"type": "string", "pattern": "^[0-9]{5}(-[0-9]{4})?$"},
        "country": {"type": "string"}
      },
      "required": ["street", "city", "country"]
    }
  },
  "additionalProperties": false
}
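Before relying on server-side validation, a client can run a lightweight pre-check against the schema's required list. This is only a sketch, not a full JSON Schema validator; a complete implementation would also enforce type, format, and pattern rules:

```javascript
// Minimal pre-check: report required properties missing from a record.
// Intentionally ignores type/format/pattern constraints - use a full
// JSON Schema validator for those.
function missingRequired(schema, record) {
  return (schema.required || []).filter(key => !(key in record));
}
```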

Normalized Schema Response

When raw is false or omitted, the API returns a normalized SchemaInfo object containing field information optimized for building filter expressions.

Request:

GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema

Response:

{
  "schemaId": "123",
  "schemaType": "AVRO",
  "fields": [
    {
      "path": "customerId",
      "name": "customerId",
      "type": "LONG",
      "description": "Unique customer identifier"
    },
    {
      "path": "firstName",
      "name": "firstName",
      "type": "STRING",
      "description": "Customer's first name"
    },
    {
      "path": "lastName",
      "name": "lastName",
      "type": "STRING",
      "description": "Customer's last name"
    },
    {
      "path": "email",
      "name": "email",
      "type": "STRING",
      "nullable": true,
      "description": "Customer's email address"
    },
    {
      "path": "age",
      "name": "age",
      "type": "INTEGER",
      "description": "Customer's age"
    },
    {
      "path": "status",
      "name": "status",
      "type": "STRING",
      "enumValues": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"],
      "description": "Customer account status"
    },
    {
      "path": "address.street",
      "name": "street",
      "type": "STRING",
      "nullable": true,
      "description": "Street address"
    },
    {
      "path": "address.city",
      "name": "city",
      "type": "STRING",
      "nullable": true,
      "description": "City"
    },
    {
      "path": "tags",
      "name": "tags",
      "type": "ARRAY",
      "description": "Tags associated with the customer"
    },
    {
      "path": "tags[]",
      "name": "tags[]",
      "type": "STRING",
      "description": "Array element"
    }
  ]
}
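A common first step with the normalized response is to index the fields by path so filter builders can look them up directly, and to collect the enum-valued fields for dropdowns. A small sketch:

```javascript
// Index normalized fields by path, and list the enum-valued fields
// (useful for rendering dropdown filters).
function indexFields(schemaInfo) {
  const byPath = new Map(schemaInfo.fields.map(f => [f.path, f]));
  const enums = schemaInfo.fields.filter(f => Array.isArray(f.enumValues));
  return { byPath, enums };
}
```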

Field Type Mapping

The normalized response maps schema-specific types to a common set of field types:

Normalized Type  Avro Types              Protobuf Types                                  JSON Schema Types
STRING           string, enum            string, enum                                    string
INTEGER          int, long               int32, int64, sint32, sint64, fixed32, fixed64  integer
DOUBLE           float, double, decimal  float, double                                   number
BOOLEAN          boolean                 bool                                            boolean
ARRAY            array                   repeated fields (shown as field[] in the API)   array
MAP              map                     map                                             object (with additionalProperties)
RECORD           record                  message                                         object
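The Avro column of the mapping can be mirrored as a small client-side lookup. The table above is authoritative; this function is an illustrative assumption, not part of the API:

```javascript
// Mirror of the Avro column of the type-mapping table.
const AVRO_TO_NORMALIZED = {
  string: 'STRING', enum: 'STRING',
  int: 'INTEGER', long: 'INTEGER',
  float: 'DOUBLE', double: 'DOUBLE', decimal: 'DOUBLE',
  boolean: 'BOOLEAN',
  array: 'ARRAY', map: 'MAP', record: 'RECORD',
};

function normalizeAvroType(avroType) {
  // Complex Avro types arrive as objects with a `type` key; logical
  // types such as decimal take precedence over the physical type.
  if (typeof avroType === 'object' && avroType !== null) {
    return AVRO_TO_NORMALIZED[avroType.logicalType] ||
           AVRO_TO_NORMALIZED[avroType.type];
  }
  return AVRO_TO_NORMALIZED[avroType];
}
```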

Nested Field Access

Nested fields are flattened using dot notation in the path property:

  • address.street - Access the street field within the address object

  • orders[].orderId - Access orderId within an array of orders (the orders[] entry itself would have type ARRAY)

  • metadata.tags[] - Access an array of tags within metadata

Array fields are represented with two entries:

  1. The array field itself (e.g., tags with type ARRAY)

  2. The array element type (e.g., tags[] with the element's type, such as STRING or OBJECT)

When filtering, you can use specific array indices (e.g., tags[0], tags[5]) even though the schema only defines the general pattern tags[]. The filtering API automatically validates these indexed paths against the array element definition.
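Client code that accepts indexed paths like tags[0] can normalize them back to the schema's tags[] pattern before looking up the element definition. A sketch:

```javascript
// Convert an indexed path (e.g., orders[3].orderId) to the schema's
// generic pattern (orders[].orderId) so it can be matched against
// normalized field paths.
function toPatternPath(indexedPath) {
  return indexedPath.replace(/\[\d+\]/g, '[]');
}

// True if the indexed path corresponds to a field defined in the schema.
function isValidIndexedPath(schemaInfo, indexedPath) {
  const pattern = toPatternPath(indexedPath);
  return schemaInfo.fields.some(f => f.path === pattern);
}
```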

Error Responses

The API returns appropriate HTTP status codes and error messages for various failure scenarios.

Dataset Not Found (404)

{
  "timestamp": "2025-07-27T13:30:45.123Z",
  "status": 404,
  "error": "Not Found",
  "message": "Dataset not found: d456",
  "path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}

Schema Not Found (404)

{
  "timestamp": "2025-07-27T13:30:45.123Z",
  "status": 404,
  "error": "Not Found",
  "message": "Schema not found in registry: schema-123",
  "path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}

Dataset Has No Schema (404)

{
  "timestamp": "2025-07-27T13:30:45.123Z",
  "status": 404,
  "error": "Not Found",
  "message": "Dataset has no Kafka storage with schema",
  "path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}

Schema Registry Unavailable (503)

{
  "timestamp": "2025-07-27T13:30:45.123Z",
  "status": 503,
  "error": "Service Unavailable",
  "message": "Schema Registry is currently unavailable",
  "path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}

Invalid Schema Format (500)

{
  "timestamp": "2025-07-27T13:30:45.123Z",
  "status": 500,
  "error": "Internal Server Error",
  "message": "Failed to parse schema: Invalid Avro schema format",
  "path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}
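The 503 case is typically transient, so a client can retry with exponential backoff. This sketch assumes a `fetch`-style client; the retry count and base delay are illustrative choices:

```javascript
// Retry on 503 (Schema Registry temporarily unavailable) with
// exponential backoff. `fetchFn` is injected to keep the sketch testable.
async function fetchSchemaWithRetry(url, fetchFn, retries = 3, baseDelayMs = 200) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchFn(url);
    if (res.status !== 503 || attempt >= retries) return res;
    // Wait 200ms, 400ms, 800ms, ... before the next attempt.
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}
```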

Usage Examples

Example 1: Building a Dynamic Filter UI

When building a user interface for dataset filtering, use the normalized schema response:

// Fetch normalized schema
const response = await fetch('/datasources/ds1/enablements/e1/datasets/d1/schema');
const schemaInfo = await response.json();

// Build filter options from fields
// Note: Array element fields (those with [] in the path) are typically
// not used directly in filters - use the parent array field instead
const filterableFields = schemaInfo.fields
  .filter(field => !field.path.includes('[]'))  // Exclude array elements
  .map(field => ({
    label: field.description || field.name,
    value: field.path,
    type: field.type,
    enumValues: field.enumValues
  }));

// For array fields, you can allow users to specify indices
function getArrayFieldsWithIndices(schemaInfo) {
  const arrayFields = schemaInfo.fields
    .filter(field => field.type === 'ARRAY');

  const arrayElementFields = schemaInfo.fields
    .filter(field => field.path.includes('[]'));

  // For each array field, find its element fields
  return arrayFields.map(arrayField => {
    const elementFields = arrayElementFields
      .filter(ef => ef.path.startsWith(arrayField.path + '[]'))
      .map(ef => ({
        ...ef,
        // Allow user to specify index
        pathTemplate: ef.path.replace('[]', '[${index}]')
      }));

    return {
      arrayField,
      elementFields
    };
  });
}

// Create appropriate input based on field type
function createFilterInput(field) {
  switch(field.type) {
    case 'STRING':
      return field.enumValues
        ? createDropdown(field.enumValues)
        : createTextInput();
    case 'INTEGER':
    case 'DOUBLE':
      return createNumberInput();
    case 'BOOLEAN':
      return createCheckbox();
    default:
      return createTextInput();
  }
}

Example 2: Validating Data Before Publishing

Use the raw schema to validate data before publishing to Kafka:

// Fetch raw Avro schema (imports assumed: java.net.URI, java.net.http.*,
// org.apache.avro.Schema, org.apache.avro.generic.*, org.apache.avro.io.*)
// Note: HttpRequest requires an absolute URI, so prepend your API base URL.
HttpResponse<String> response = httpClient.send(
    HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/datasources/ds1/enablements/e1/datasets/d1/schema?raw=true"))
        .build(),
    HttpResponse.BodyHandlers.ofString()
);

// Parse the schema and build a reader for validation
Schema schema = new Schema.Parser().parse(response.body());
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

// Validate incoming data against the schema
// (jsonPayload is the JSON-encoded record to validate)
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, jsonPayload);
try {
    GenericRecord record = reader.read(null, decoder);
    // Data is valid
} catch (AvroTypeException | IOException e) {
    // Data does not match the schema
}

Example 3: Schema Evolution Monitoring

Monitor schema changes by comparing raw schemas:

# Get current schema
curl -s "/datasources/ds1/enablements/e1/datasets/d1/schema?raw=true" \
  | jq . > current_schema.json

# Compare with previous version
diff previous_schema.json current_schema.json

# Heuristic check for nullable fields (union types containing "null").
# Adding nullable fields with defaults is usually backward compatible,
# but consumers should be prepared to handle null values.
if grep -q '"type".*"null"' current_schema.json; then
  echo "Warning: nullable fields detected"
fi

Best Practices

  1. Cache Schema Information: Schemas change infrequently. Cache normalized schema responses to reduce API calls.

  2. Handle Schema Evolution: Always handle nullable fields and new fields gracefully in your applications.

  3. Use Appropriate Mode:

    • Use normalized mode (raw=false) for building UIs and filter expressions

    • Use raw mode (raw=true) for data validation and processing

  4. Error Handling: Implement retry logic for 503 errors, as Schema Registry may be temporarily unavailable.

  5. Field Path Navigation: Use the dot notation paths from normalized responses to access nested fields in your data.
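The caching recommendation (item 1) can be as simple as a TTL-keyed map. A sketch, with an illustrative one-hour TTL (not an API requirement):

```javascript
// Simple TTL cache for normalized schema responses.
// `loader` fetches the schema; `now` is injectable for testing.
function makeSchemaCache(loader, ttlMs = 60 * 60 * 1000, now = Date.now) {
  const cache = new Map();
  return async function getCached(datasetPath) {
    const hit = cache.get(datasetPath);
    if (hit && now() - hit.at < ttlMs) return hit.value;
    const value = await loader(datasetPath);
    cache.set(datasetPath, { value, at: now() });
    return value;
  };
}
```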

Schema Registry Integration

The Dataset Schema API integrates with Confluent Schema Registry to provide:

  • Centralized schema storage and versioning

  • Schema evolution and compatibility checking

  • Multiple format support (Avro, Protobuf, JSON Schema)

Datasets must have Kafka storage configured with a schema ID to use this API. The schema ID links the dataset to its schema definition in the registry.