Dataset Schemas
The Dataset Schema API enables developers to retrieve and understand the structure of datasets in the catalog. This API integrates with the Schema Registry to provide both raw schema definitions and normalized field information for filtering and querying.
Overview
The Schema API provides two modes of operation:
- **Raw Schema Mode**: Returns the actual schema definition as stored in the Schema Registry
- **Normalized Mode**: Returns a standardized field list optimized for building filter expressions
This dual approach allows developers to:
- Access original schema definitions for data processing
- Build dynamic filter interfaces using normalized field information
- Support multiple schema formats (Avro, Protobuf, JSON Schema)
- Validate data structures before processing
API Endpoints
Get Dataset Schema
Retrieves the schema for a specific dataset.
GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema
Raw Schema Responses
When called with `?raw=true`, the API returns the actual schema definition from the Schema Registry.
Avro Schema Example
Avro schemas are returned as JSON with full type information and documentation.
Request:
GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true
Response Headers:
Content-Type: application/json
Response Body:
{
"type": "record",
"name": "CustomerRecord",
"namespace": "io.raft.datafabric.customer",
"doc": "A record representing customer data",
"fields": [
{
"name": "customerId",
"type": "long",
"doc": "Unique customer identifier"
},
{
"name": "firstName",
"type": "string",
"doc": "Customer's first name"
},
{
"name": "lastName",
"type": "string",
"doc": "Customer's last name"
},
{
"name": "email",
"type": ["null", "string"],
"default": null,
"doc": "Customer's email address"
},
{
"name": "age",
"type": "int",
"doc": "Customer's age"
},
{
"name": "accountBalance",
"type": {
"type": "bytes",
"logicalType": "decimal",
"precision": 10,
"scale": 2
},
"doc": "Customer's account balance"
},
{
"name": "registrationDate",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
},
"doc": "Date when customer registered"
},
{
"name": "preferences",
"type": {
"type": "map",
"values": "string"
},
"doc": "Customer preferences"
},
{
"name": "tags",
"type": {
"type": "array",
"items": "string"
},
"doc": "Tags associated with the customer"
},
{
"name": "status",
"type": {
"type": "enum",
"name": "CustomerStatus",
"symbols": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"]
},
"doc": "Customer account status"
},
{
"name": "address",
"type": ["null", {
"type": "record",
"name": "Address",
"fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "postalCode", "type": "string"},
{"name": "country", "type": "string"}
]
}],
"default": null,
"doc": "Customer's address"
}
]
}
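A raw Avro schema like the one above can be inspected directly by consumers that need to reason about optionality. The helper below is an illustrative sketch (not part of the API) that lists fields declared as a union with `"null"`, using a trimmed copy of the example schema:

```javascript
// Sketch: detect nullable fields in a parsed Avro record schema.
// Assumes the raw schema has already been fetched and JSON-parsed.
function nullableFields(avroSchema) {
  return (avroSchema.fields || [])
    .filter(f => Array.isArray(f.type) && f.type.includes('null'))
    .map(f => f.name);
}

// Trimmed-down version of the CustomerRecord schema above:
const schema = {
  type: 'record',
  name: 'CustomerRecord',
  fields: [
    { name: 'customerId', type: 'long' },
    { name: 'email', type: ['null', 'string'], default: null },
    { name: 'address', type: ['null', { type: 'record', name: 'Address', fields: [] }], default: null }
  ]
};
console.log(nullableFields(schema)); // ['email', 'address']
```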
Protobuf Schema Example
Protobuf schemas are returned in their text format with the appropriate content type.
Request:
GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true
Response Headers:
Content-Type: application/x-protobuf-schema
Response Body:
syntax = "proto3";
package io.raft.datafabric.customer;
option java_package = "io.raft.datafabric.customer.proto";
option java_outer_classname = "CustomerProto";
message CustomerRecord {
// Unique customer identifier
int64 customer_id = 1;
// Customer's name
string first_name = 2;
string last_name = 3;
// Optional email
optional string email = 4;
// Customer's age
int32 age = 5;
// Account balance with decimal precision
double account_balance = 6;
// Registration timestamp (Unix epoch milliseconds)
int64 registration_date = 7;
// Customer preferences as key-value pairs
map<string, string> preferences = 8;
// Tags associated with the customer
repeated string tags = 9;
// Customer status
enum CustomerStatus {
CUSTOMER_STATUS_UNSPECIFIED = 0;
CUSTOMER_STATUS_ACTIVE = 1;
CUSTOMER_STATUS_INACTIVE = 2;
CUSTOMER_STATUS_SUSPENDED = 3;
CUSTOMER_STATUS_DELETED = 4;
}
CustomerStatus status = 10;
// Nested address message
message Address {
string street = 1;
string city = 2;
string state = 3;
string postal_code = 4;
string country = 5;
}
// Optional address
optional Address address = 11;
// Contact preference using oneof
oneof contact_method {
string phone = 12;
string mobile = 13;
string work_phone = 14;
}
// Nested repeated messages for order history
message Order {
string order_id = 1;
int64 order_date = 2;
double total_amount = 3;
enum OrderStatus {
ORDER_STATUS_UNSPECIFIED = 0;
ORDER_STATUS_PENDING = 1;
ORDER_STATUS_PROCESSING = 2;
ORDER_STATUS_SHIPPED = 3;
ORDER_STATUS_DELIVERED = 4;
ORDER_STATUS_CANCELLED = 5;
}
OrderStatus status = 4;
}
repeated Order orders = 15;
}
JSON Schema Example
JSON schemas follow the JSON Schema specification and include validation rules.
Request:
GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema?raw=true
Response Headers:
Content-Type: application/json
Response Body:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://raft.io/schemas/customer-record.json",
"title": "Customer Record",
"description": "A record representing customer data",
"type": "object",
"required": ["customerId", "firstName", "lastName", "age", "status"],
"properties": {
"customerId": {
"type": "integer",
"description": "Unique customer identifier",
"minimum": 1
},
"firstName": {
"type": "string",
"description": "Customer's first name",
"minLength": 1,
"maxLength": 100
},
"lastName": {
"type": "string",
"description": "Customer's last name",
"minLength": 1,
"maxLength": 100
},
"email": {
"type": ["string", "null"],
"description": "Customer's email address",
"format": "email",
"default": null
},
"age": {
"type": "integer",
"description": "Customer's age",
"minimum": 0,
"maximum": 150
},
"accountBalance": {
"type": "number",
"description": "Customer's account balance",
"multipleOf": 0.01
},
"registrationDate": {
"type": "string",
"description": "Date when customer registered",
"format": "date-time"
},
"preferences": {
"type": "object",
"description": "Customer preferences",
"additionalProperties": {
"type": "string"
}
},
"tags": {
"type": "array",
"description": "Tags associated with the customer",
"items": {
"type": "string"
},
"uniqueItems": true
},
"status": {
"type": "string",
"description": "Customer account status",
"enum": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"]
},
"address": {
"type": ["object", "null"],
"description": "Customer's address",
"default": null,
"properties": {
"street": {"type": "string"},
"city": {"type": "string"},
"state": {"type": "string"},
"postalCode": {"type": "string", "pattern": "^[0-9]{5}(-[0-9]{4})?$"},
"country": {"type": "string"}
},
"required": ["street", "city", "country"]
}
},
"additionalProperties": false
}
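Before publishing, a client might sanity-check a record against the `required` and `type` keywords of a schema like the one above. This is only a minimal sketch; a production client should use a full JSON Schema validator such as Ajv. The helper name and the trimmed schema are illustrative:

```javascript
// Minimal sketch of pre-publish validation against a JSON Schema.
// Only checks `required` and primitive `type` keywords - a real
// validator also handles formats, patterns, nesting, etc.
function checkRequiredAndTypes(schema, data) {
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in data)) errors.push(`missing required property: ${key}`);
  }
  for (const [key, prop] of Object.entries(schema.properties || {})) {
    if (!(key in data) || prop.type === undefined) continue;
    const allowed = Array.isArray(prop.type) ? prop.type : [prop.type];
    const v = data[key];
    const actual = v === null ? 'null'
      : Array.isArray(v) ? 'array'
      : typeof v === 'number' ? (Number.isInteger(v) ? 'integer' : 'number')
      : typeof v;
    // JSON Schema's "number" also accepts integer values
    if (!allowed.includes(actual) && !(actual === 'integer' && allowed.includes('number'))) {
      errors.push(`${key}: expected ${allowed.join('|')}, got ${actual}`);
    }
  }
  return errors;
}

const record = { customerId: 42, firstName: 'Ada', lastName: 'Lovelace', age: 36, status: 'ACTIVE' };
const schema = {
  required: ['customerId', 'firstName', 'lastName', 'age', 'status'],
  properties: { customerId: { type: 'integer' }, email: { type: ['string', 'null'] } }
};
console.log(checkRequiredAndTypes(schema, record)); // []
```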
Normalized Schema Response
When `raw` is false or omitted, the API returns a normalized `SchemaInfo` object containing field information optimized for building filter expressions.
Request:
GET /datasources/{datasourceId}/enablements/{enablementId}/datasets/{datasetId}/schema
Response:
{
"schemaId": "123",
"schemaType": "AVRO",
"fields": [
{
"path": "customerId",
"name": "customerId",
"type": "LONG",
"description": "Unique customer identifier"
},
{
"path": "firstName",
"name": "firstName",
"type": "STRING",
"description": "Customer's first name"
},
{
"path": "lastName",
"name": "lastName",
"type": "STRING",
"description": "Customer's last name"
},
{
"path": "email",
"name": "email",
"type": "STRING",
"nullable": true,
"description": "Customer's email address"
},
{
"path": "age",
"name": "age",
"type": "INTEGER",
"description": "Customer's age"
},
{
"path": "status",
"name": "status",
"type": "STRING",
"enumValues": ["ACTIVE", "INACTIVE", "SUSPENDED", "DELETED"],
"description": "Customer account status"
},
{
"path": "address.street",
"name": "street",
"type": "STRING",
"nullable": true,
"description": "Street address"
},
{
"path": "address.city",
"name": "city",
"type": "STRING",
"nullable": true,
"description": "City"
},
{
"path": "tags",
"name": "tags",
"type": "ARRAY",
"description": "Tags associated with the customer"
},
{
"path": "tags[]",
"name": "tags[]",
"type": "STRING",
"description": "Array element"
}
]
}
Field Type Mapping
The normalized response maps schema-specific types to a common set of field types:
| Normalized Type | Avro Types | Protobuf Types | JSON Schema Types |
|---|---|---|---|
| STRING | string, enum | string, enum | string |
| INTEGER | int, long | int32, int64, sint32, sint64, fixed32, fixed64 | integer |
| DOUBLE | float, double, decimal | float, double | number |
| BOOLEAN | boolean | bool | boolean |
| ARRAY | array | repeated fields (shown as field[] in API) | array |
| MAP | map | map | object (with additionalProperties) |
| RECORD | record | message | object |
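Client code occasionally needs the same mapping locally, for instance to interpret a raw Avro schema consistently with normalized responses. The following is a sketch of the Avro column of the table; the actual service-side rules may differ in edge cases, and unions (e.g. `["null", "string"]`) are omitted for brevity:

```javascript
// Sketch of the Avro-to-normalized type mapping from the table above.
const AVRO_TO_NORMALIZED = {
  string: 'STRING', enum: 'STRING',
  int: 'INTEGER', long: 'INTEGER',
  float: 'DOUBLE', double: 'DOUBLE', decimal: 'DOUBLE',
  boolean: 'BOOLEAN',
  array: 'ARRAY', map: 'MAP', record: 'RECORD'
};

function normalizeAvroType(avroType) {
  // Complex types arrive as objects ({ "type": "array", ... }); logical
  // decimals carry a logicalType attribute on a bytes/fixed base type.
  if (typeof avroType === 'object' && avroType !== null) {
    const key = avroType.logicalType === 'decimal' ? 'decimal' : avroType.type;
    return AVRO_TO_NORMALIZED[key];
  }
  return AVRO_TO_NORMALIZED[avroType];
}

console.log(normalizeAvroType('long'));                                    // 'INTEGER'
console.log(normalizeAvroType({ type: 'bytes', logicalType: 'decimal' })); // 'DOUBLE'
```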
Nested Field Access
Nested fields are flattened using dot notation in the `path` property:

- `address.street` - Access the street field within the address object
- `orders[].orderId` - Access orderId within an array of orders (the `orders[]` entry itself would have type ARRAY)
- `metadata.tags[]` - Access an array of tags within metadata

Array fields are represented with two entries:
1. The array field itself (e.g., `tags` with type ARRAY)
2. The array element type (e.g., `tags[]` with the element's type, such as STRING or RECORD)

When filtering, you can use specific array indices (e.g., `tags[0]`, `tags[5]`) even though the schema only defines the general pattern `tags[]`. The filtering API automatically validates these indexed paths against the array element definition.
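To read values back out of a record using these normalized paths, a client can walk the dot and index segments. The helper below is illustrative only, not part of the API:

```javascript
// Sketch: resolve a normalized field path (dot notation plus [index])
// against a plain JS object, mirroring the path rules described above.
function resolvePath(record, path) {
  let value = record;
  for (const segment of path.split('.')) {
    // Split "tags[1]" into the field name and an optional numeric index
    const match = segment.match(/^([^\[]+)(?:\[(\d+)\])?$/);
    if (!match || value == null) return undefined;
    value = value[match[1]];
    if (match[2] !== undefined) {
      if (!Array.isArray(value)) return undefined;
      value = value[Number(match[2])];
    }
  }
  return value;
}

const record = {
  address: { street: '1 Main St', city: 'Springfield' },
  tags: ['vip', 'beta']
};
console.log(resolvePath(record, 'address.street')); // '1 Main St'
console.log(resolvePath(record, 'tags[1]'));        // 'beta'
```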
Error Responses
The API returns appropriate HTTP status codes and error messages for various failure scenarios.
Dataset Not Found (404)
{
"timestamp": "2025-07-27T13:30:45.123Z",
"status": 404,
"error": "Not Found",
"message": "Dataset not found: d456",
"path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}
Schema Not Found (404)
{
"timestamp": "2025-07-27T13:30:45.123Z",
"status": 404,
"error": "Not Found",
"message": "Schema not found in registry: schema-123",
"path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}
Dataset Has No Schema (404)
{
"timestamp": "2025-07-27T13:30:45.123Z",
"status": 404,
"error": "Not Found",
"message": "Dataset has no Kafka storage with schema",
"path": "/datasources/ds123/enablements/e123/datasets/d456/schema"
}
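A client can wrap schema fetches to surface these errors and retry transient 503s (the Schema Registry may be temporarily unavailable). This is a sketch, not a prescribed client; the `fetchFn` parameter is injected purely so the logic can be demonstrated without a live endpoint:

```javascript
// Sketch: retry wrapper for schema fetches. Only 503 is retried; other
// failures (including the 404s documented above) are surfaced as errors.
async function fetchSchemaWithRetry(url, fetchFn, maxAttempts = 3, delayMs = 200) {
  let response;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    response = await fetchFn(url);
    if (response.status !== 503) break; // only 503 is worth retrying
    if (attempt < maxAttempts) {
      // Linear backoff before the next attempt
      await new Promise(resolve => setTimeout(resolve, delayMs * attempt));
    }
  }
  if (!response.ok) {
    throw new Error(`Schema request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage with a stub that returns 503 twice, then succeeds:
let attempts = 0;
const stubFetch = async () => ++attempts < 3
  ? { ok: false, status: 503 }
  : { ok: true, status: 200, json: async () => ({ schemaId: '123' }) };

fetchSchemaWithRetry('/datasources/ds1/enablements/e1/datasets/d1/schema', stubFetch, 3, 10)
  .then(info => console.log(info.schemaId)); // '123'
```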
Usage Examples
Example 1: Building a Dynamic Filter UI
When building a user interface for dataset filtering, use the normalized schema response:
// Fetch normalized schema
const response = await fetch('/datasources/ds1/enablements/e1/datasets/d1/schema');
const schemaInfo = await response.json();
// Build filter options from fields
// Note: Array element fields (those with [] in the path) are typically
// not used directly in filters - use the parent array field instead
const filterableFields = schemaInfo.fields
.filter(field => !field.path.includes('[]')) // Exclude array elements
.map(field => ({
label: field.description || field.name,
value: field.path,
type: field.type,
enumValues: field.enumValues
}));
// For array fields, you can allow users to specify indices
function getArrayFieldsWithIndices(schemaInfo) {
const arrayFields = schemaInfo.fields
.filter(field => field.type === 'ARRAY');
const arrayElementFields = schemaInfo.fields
.filter(field => field.path.includes('[]'));
// For each array field, find its element fields
return arrayFields.map(arrayField => {
const elementFields = arrayElementFields
.filter(ef => ef.path.startsWith(arrayField.path + '[]'))
.map(ef => ({
...ef,
// Allow user to specify index
pathTemplate: ef.path.replace('[]', '[${index}]')
}));
return {
arrayField,
elementFields
};
});
}
// Create appropriate input based on field type
function createFilterInput(field) {
switch(field.type) {
case 'STRING':
return field.enumValues
? createDropdown(field.enumValues)
: createTextInput();
case 'INTEGER':
case 'DOUBLE':
return createNumberInput();
case 'BOOLEAN':
return createCheckbox();
default:
return createTextInput();
}
}
Example 2: Validating Data Before Publishing
Use the raw schema to validate data before publishing to Kafka:
// Fetch raw Avro schema
HttpResponse<String> response = httpClient.send(
HttpRequest.newBuilder()
.uri(URI.create("/datasources/ds1/enablements/e1/datasets/d1/schema?raw=true"))
.build(),
HttpResponse.BodyHandlers.ofString()
);
// Parse and use for validation
Schema schema = new Schema.Parser().parse(response.body());
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
// Validate data against schema
try {
GenericRecord record = reader.read(null, decoder);
// Data is valid
} catch (AvroTypeException e) {
// Data does not match schema
}
Example 3: Schema Evolution Monitoring
Monitor schema changes by comparing raw schemas:
# Get current schema
curl -s "/datasources/ds1/enablements/e1/datasets/d1/schema?raw=true" \
| jq . > current_schema.json
# Compare with previous version
diff previous_schema.json current_schema.json
# Naive heuristic: flag union types that include null.
# This is not a real compatibility check - use the Schema Registry's
# compatibility endpoints for authoritative breaking-change detection.
if grep -q '"type".*"null"' current_schema.json; then
  echo "Warning: nullable fields detected; review for compatibility"
fi
Best Practices
- **Cache Schema Information**: Schemas change infrequently. Cache normalized schema responses to reduce API calls.
- **Handle Schema Evolution**: Always handle nullable fields and new fields gracefully in your applications.
- **Use Appropriate Mode**:
  - Use normalized mode (`raw=false`) for building UIs and filter expressions
  - Use raw mode (`raw=true`) for data validation and processing
- **Error Handling**: Implement retry logic for 503 errors, as the Schema Registry may be temporarily unavailable.
- **Field Path Navigation**: Use the dot-notation paths from normalized responses to access nested fields in your data.
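The caching practice can be sketched as a small TTL wrapper. `SchemaCache`, `loader`, and the injectable clock below are illustrative, not part of any shipped client:

```javascript
// Sketch: TTL cache for normalized schema responses. `loader` stands in
// for the actual HTTP call; `now` is injectable so expiry can be tested
// without waiting on real time.
class SchemaCache {
  constructor(loader, ttlMs = 5 * 60 * 1000, now = Date.now) {
    this.loader = loader;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  async get(datasetId) {
    const entry = this.entries.get(datasetId);
    if (entry && this.now() - entry.fetchedAt < this.ttlMs) {
      return entry.schema; // fresh enough - skip the API call
    }
    const schema = await this.loader(datasetId);
    this.entries.set(datasetId, { schema, fetchedAt: this.now() });
    return schema;
  }
}

// Usage: two reads within the TTL trigger only one fetch
let loads = 0;
const cache = new SchemaCache(async id => { loads++; return { schemaId: id }; });
cache.get('d1').then(() => cache.get('d1')).then(() => console.log(loads)); // 1
```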
Schema Registry Integration
The Dataset Schema API integrates with Confluent Schema Registry to provide:
- Centralized schema storage and versioning
- Schema evolution and compatibility checking
- Multiple format support (Avro, Protobuf, JSON Schema)
Datasets must have Kafka storage configured with a schema ID to use this API. The schema ID links the dataset to its schema definition in the registry.