Superset

1. Introduction

1.1. Purpose

A comprehensive guide to understanding and utilizing Superset’s capabilities within Raft’s Data Fabric.

1.2. Target Audience

Data Stewards, Data Analysts

1.3. Scope

This documentation provides detailed guidance on using Apache Superset for data exploration and visualization within Data Fabric. It explains how to connect to various data sources, create and customize interactive dashboards, and visualize data through a wide range of chart types. It covers essential functionality such as building queries, filtering data, and setting up permissions for different user roles, and it offers best practices for optimizing performance, configuring Superset’s architecture, and integrating Superset with other platform components. This ensures users can effectively leverage Superset to analyze and present data insights in real time.

2. Overview

2.1. Capability Description

Superset provides powerful data exploration and visualization capabilities, allowing users to analyze and interact with data through customizable dashboards and charts. It connects seamlessly to various databases, enabling real-time querying and data analysis without the need for code. Users can build complex queries through an intuitive drag-and-drop interface or write custom SQL for more advanced analytics.

2.2. Key Features

  • Visualizations: numerous chart types, including bar charts, line charts, pie charts, heatmaps, time-series graphs, and geospatial maps, enabling rich and varied data visualizations

  • No-Code Data Exploration: explore datasets and build visualizations without needing to write SQL

  • SQL Query Support: users can write custom SQL queries for more complex data analysis and reporting, allowing flexibility in how data is queried

  • Interactive Dashboards: enables users to create and share interactive, real-time dashboards that can be customized with filters, drill-down capabilities, and responsive layouts

  • Multi-Database Connectivity: supports a wide variety of databases, providing flexibility in connecting to different data sources

  • Role-Based Access Control (RBAC): granular user and role-based permissions, ensuring that data access is controlled and secure for different user groups

  • Real-Time Data Exploration: perform real-time analysis on live data, making it ideal for monitoring, operational analytics, and business intelligence

  • Collaborative Workflows: supports collaboration through the ability to share dashboards, visualizations, and insights across teams, enhancing decision-making and data democratization

3. Available Data Sources

When working with Superset, it’s important to understand the available data sources in Data Fabric and how data flows through the system. Below is an overview of the most common data sources:

3.1. df-delta

This is the main data store where all data is persisted after being ingested through Kafka. Delta is suitable for structured, analytics-friendly data and supports real-time analysis. It is the primary destination for most datasets in Data Fabric. Data remains here after the Kafka time-to-live (TTL) limit is reached.

3.2. df-kafka

Kafka is used as a message broker to stream data in real time before it is persisted in Delta. Data in Kafka is transient: it is dropped after a one-hour TTL but retained permanently in Delta. Kafka is optimal for real-time event streaming, though it is not used for long-term storage.

3.3. df-pinot

Pinot is a real-time analytics engine and OLAP store optimized for low-latency queries. It automatically pulls real-time data from Kafka streams for immediate complex analysis, making it ideal for dashboards or maps requiring real-time views, like monitoring events or detecting anomalies. While Pinot excels in fast, time-critical queries, large-scale data storage is handled by Delta Lake. Pinot is best used for scenarios requiring quick, real-time analysis.

3.4. df-postgres

PostgreSQL is used mainly for manual data uploads, allowing users to store relational data in a structured, SQL-compatible database. This source is suited for users who prefer uploading CSV files or working with custom schemas.

3.5. df-postgis

PostGIS is used for storing and querying spatial data. It supports case-by-case geospatial analyses and can be used when working with maps and spatial datasets.

3.6. df-arcadedb

ArcadeDB is a multi-model database that can handle graphs, documents, and other non-relational data models. It is used in specific cases where these data models are required, though it is not the default destination for most datasets.

4. Setup

4.1. Accessing Superset

There are two ways to navigate to Superset within Data Fabric:

  1. From the Data Fabric Home page, select the Superset card at the bottom left.

  2. From the Data Insights tab, select Visualizations.

4.2. Logging In

Log in to Superset using your Keycloak credentials. Keycloak manages access to the application and ensures that your account has the appropriate access levels.

4.3. Navigation

Upon logging in, you’ll be presented with the Superset home screen. Here you can navigate between the main features:

  • Dashboards: View and manage your existing dashboards, or create a new one.

  • Charts: Create new visualizations from datasets.

  • Datasets: Manage and explore datasets.

  • SQL Lab: Run custom SQL queries.

4.4. Settings Menu

In the upper-right corner of the home screen, you’ll find the “Settings” menu. This menu provides options for managing users, security settings, and database connections. It includes features like:

  • List Users and List Roles: For managing user accounts and access roles within the system.

  • Action Log: To view logs of user actions for auditing purposes.

  • Row Level Security: Allows you to manage access control policies based on data rows.

  • Database Connections: Configure and manage data source connections within Superset.

  • CSS Templates: Customize the appearance of dashboards and visualizations.

  • Annotation Layers: Manage and add annotations to your visualizations.

  • User Info: View your personal account details.

  • Logout: Log out of the Superset platform.

4.5. Additional Navigation Features

  1. Favorites Section: Under Dashboards and Charts, there is a "Favorites" tab that allows you to quickly access any dashboards or charts you’ve marked as favorites. This is particularly useful for regularly accessed visualizations.

  2. Sorting and Filtering Options: On the Dashboards, Charts, and Datasets pages, you can sort and filter items by various attributes (like date created or modified). This is especially helpful for finding recent or relevant work quickly when dealing with many dashboards or datasets.

  3. Recents Section: On the main Home screen, there’s a Recents section that displays your most recently viewed, edited, or created items. This makes it easy to pick up where you left off when returning to the platform.

  4. Database Connections (Settings Menu): Under the Settings menu (top-right corner), there’s an option for Database Connections. This allows admins or users with the right permissions to manage and configure new or existing connections to external databases.

  5. Action Log (Settings Menu): Also under Settings, the Action Log provides visibility into the actions performed within Superset, such as creating or modifying charts, dashboards, or datasets. This can help with auditing and tracking user actions within the platform.

  6. List Users and List Roles (Settings Menu): Under Settings, the List Users and List Roles options let you manage user accounts and roles. You can view and edit permissions, roles, and access levels for each user in the system.

  7. Annotation Layers (Settings Menu): Annotation Layers under the Settings menu allow users to create and manage annotations that can be applied to charts, helping to highlight specific data points or events on visualizations.

  8. Import Button: The Import button, available on the Dashboards and Charts pages, allows you to import previously exported visualizations from JSON files. This is useful when transferring dashboards or charts between instances or reusing visualizations.

  9. Bulk Select Button: The Bulk Select button is available on Dashboards, Charts, and Datasets pages. It allows you to select multiple items at once to perform bulk actions such as exporting, deleting, or marking items as favorites.

4.6. Adding Data Sources

Before creating dashboards and charts, ensure that Superset is connected to the correct data source. To add or verify data sources:

  1. Navigate to Datasets:

    • From the Superset home screen, click Datasets.

    • Click on + Dataset to start the process.

  2. Select a Database:

    • In the Database dropdown, select a Data Fabric-managed database from the list:

    • df-delta (for structured analytics data)

    • df-kafka (for real-time event streams)

    • df-postgres (for relational data)

    • df-postgis (for geospatial data)

    • df-pinot (for OLAP queries and real-time analytics)

    • df-arcadedb (for multi-model data, such as graphs and documents)

  3. Select a Schema: Once you’ve selected the database, choose a Schema. The schema organizes the data into categories. For example:

    • In df-delta, choose a specific schema where your data is stored after being processed from Kafka topics.

    • In df-pinot, choose a schema like metrics if you’re analyzing real-time data ingested from Kafka streams related to system performance.

  4. Select a Table: Next, choose a table from the list. The available tables depend on the selected database and schema. You can use the search box to quickly find a table by name.

  5. Save the Dataset: Once you’ve selected the table, click "Create Dataset and Create Chart" to save the dataset. You will now be able to use this dataset to build charts and dashboards.

4.7. Connecting Disparate Data Sources in Superset

In Superset, users can connect to multiple data sources from different databases and combine them in dashboards and queries for more comprehensive analysis. If you’re working with datasets that exist in different databases (e.g., relational data in df-postgres and streaming data in df-kafka), Superset enables the integration of these disparate data sources through SQL Lab or by joining them within the same dashboard.

To connect and analyze data from different sources:

  1. Connect Each Data Source: Ensure that each data source is correctly configured in Superset as described in the Adding Data Sources section. Select databases like df-delta, df-pinot, or df-postgis based on the type of data you need.

  2. Use SQL Lab for Custom Queries: In SQL Lab, you can write queries that pull data from multiple databases. If you need to perform complex analysis across multiple sources, SQL Lab allows for joining or unioning tables from different databases. For example:

SELECT a.*, b.*
FROM "df-delta.table1" a
JOIN "df-postgres.table2" b ON a.id = b.id;

This query joins data from a df-delta table with a df-postgres table, providing a unified view of both datasets.

  3. Visualize in the Same Dashboard: Even if the data cannot be joined at the SQL level, Superset allows you to create visualizations from multiple data sources and combine them into the same dashboard. For instance, you might display a time-series graph from df-kafka alongside a bar chart from df-postgres, all within one dashboard for cross-analysis.

  4. Dealing with Schema or Format Differences: When integrating data from different databases, it’s essential to ensure that the schemas align, especially for joins or comparisons. For example, make sure the data types in different sources (e.g., timestamps or IDs) are formatted consistently, or apply transformations in SQL Lab before combining the data, as sketched below.
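
A minimal sketch of such a transformation, assuming hypothetical tables whose ID columns differ in type (integer in df-delta, text in df-postgres):

SELECT a.event_time, b.region_name
FROM "df-delta.events" a
JOIN "df-postgres.regions" b
ON CAST(a.region_id AS VARCHAR) = b.region_id;  -- align integer and text IDs before joining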

4.8. Using Local Data in Superset

Data analysts who have local datasets that are not integrated into the Data Fabric can still use Superset to visualize and analyze this data. Users can upload local files (such as CSVs) into databases like df-postgres. Once the data is uploaded to a supported database, it can be connected to Superset for querying and visualization.

Steps to Upload Local Data:

  1. Upload Local Data to df-postgres: Analysts can upload their local CSV files to df-postgres, which serves as a relational database for manually uploaded data. This can be done through SQL Lab or directly via the database’s user interface. Example: A data analyst can upload a sales dataset stored locally as a CSV to df-postgres.

  2. Connect to the Uploaded Dataset in Superset: Once the data is uploaded to df-postgres, go to Datasets in Superset, select df-postgres as the database, and choose the schema where the uploaded data is stored. Save the dataset to begin creating charts and dashboards. You can verify the upload with a quick query, as sketched at the end of this section.

Benefit: This approach allows analysts to work with their own custom or processed datasets that may not be available in Data Fabric, making it easier to integrate personalized or project-specific data into Superset.
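
For example, a quick sanity check in SQL Lab after an upload, assuming a hypothetical sales_data table in df-postgres:

SELECT COUNT(*) AS row_count  -- confirm the expected number of rows landed
FROM "sales_data";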

5. Create A New Chart & Dashboard

5.1. Create a New Chart

After saving the dataset, you will be directed to the "Create a New Chart" screen. Here, you can select from a variety of chart types, depending on the kind of data you’re analyzing and how you wish to visualize it. Some available options include:

  • Bar Chart

  • Line Chart

  • Pie Chart

  • Time-Series Plot

  • Heatmap

  • Pivot Table

  • Big Number with Trendline

  • World Map (for geospatial data)

Choose a chart type that fits your data and click on "CREATE NEW CHART."

Here’s how you can create and customize three common chart types: a Table, a Scatterplot, and a Map.

5.2. Discrete vs Continuous Data

When creating charts in Superset, it’s important to understand the difference between discrete and continuous data, as this influences the choice of chart type and how the data is visualized.

Discrete Data: Represents distinct, separate values that can often be categorized or counted. Examples include categories like gender, product types, or regions. In Superset, bar charts and pie charts are commonly used to display discrete data because these chart types emphasize distinct groupings.

Example: A bar chart showing sales numbers per product category is a typical use case for discrete data. Each bar represents a distinct product category.

Continuous Data: Represents data that can take any value within a range and is typically numeric. Examples include temperature, time, and distance. For continuous data, line charts, scatter plots, and histograms are more suitable because they allow for the visualization of trends and distributions over time or range.

Example: A line chart showing daily temperatures over a month would be an appropriate use of continuous data, as it reflects changes over a continuous time period.
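
The distinction also shows in the SQL behind each chart. A minimal sketch, assuming a hypothetical sales table with category, amount, and order_date columns (date functions vary by database engine):

-- Discrete: one row (one bar) per category
SELECT category, SUM(amount) AS total_sales
FROM "sales"
GROUP BY category;

-- Continuous: one point per day along a time axis
SELECT DATE_TRUNC('day', order_date) AS day, SUM(amount) AS daily_sales
FROM "sales"
GROUP BY 1
ORDER BY 1;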

5.3. Table Customization

5.3.1. Selecting the Dataset

After creating or selecting a dataset (e.g., from df-delta), click on Charts and choose Table from the list of chart types.

5.3.2. Metrics and Dimensions

In the table customization screen, you can select the columns to display under the Dimensions section. Drag and drop your dataset columns into Dimensions to display them in the table. For example, columns like category, severity, or type could be selected for a table view.

Metrics: You can also aggregate columns by dragging a field into the Metrics section and selecting an aggregation function (e.g., COUNT or SUM).
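
Behind the scenes, the Dimensions and Metrics you drag in translate to a grouped query. A rough sketch of the SQL such a table chart issues, assuming a hypothetical events table:

SELECT category, severity, COUNT(*) AS event_count  -- Dimensions: category, severity; Metric: COUNT
FROM "events"
GROUP BY category, severity
LIMIT 1000;  -- matches the default row limit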

5.3.3. Sorting and Filters

Sort By: Choose a column to sort the table in ascending or descending order. You can enable the Sort Descending checkbox if needed.

Filters: You can apply filters to refine the table’s data. For instance, if you only want to see records where the severity is set to a certain value, apply a filter here.

5.3.4. Row Limit

By default, the row limit is set to 1000, but you can adjust this to display more or fewer records as needed.

5.3.5. Additional Settings

Show Totals: You can enable Show Totals for any columns you would like to aggregate values across, such as calculating totals for numeric columns.

5.4. Scatterplot Customization

5.4.1. Selecting the Dataset

After selecting a dataset (e.g., from df-pinot), choose Scatter Plot from the list of available chart types.

5.4.2. X and Y Axis Configuration

In the Data tab, drag a numeric column into the X Axis and Y Axis. For example:

  • X Axis: You could choose event_time if you want to see data over time.

  • Y Axis: You could choose severity or any other relevant measure.

5.4.3. Metrics

Select additional Metrics that you may want to overlay, such as calculating the average or sum of another field.

5.4.4. Scatter Plot Points

Each point on the plot represents a row in the dataset. You can customize the Size and Color of the points based on other dimensions, adding more information to the visual.

5.4.5. Filters

If you only want to display data within a specific range, use Filters. For example, filter based on category or severity to limit the points shown.

5.4.6. Styling Options

In the Customize tab, you can adjust the chart’s title, axis labels, and colors to match your preferences.

5.4.7. Finalizing the Scatter Plot

Once everything is configured, click Update Chart to preview the scatter plot, and then Save it to your dashboard.

5.5. Map Customization

5.5.1. Selecting the Dataset

After selecting a dataset (e.g., from df-postgis, which stores spatial data), choose a map-based chart type like World Map or Deck.gl Scatterplot.

5.5.2. Latitude and Longitude

In the Data tab, select the columns that represent geographic data:

  • Latitude: For example, select a column like latitude or geometry.

  • Longitude: Select a column for longitude like longitude.

5.5.3. Additional Metrics

You can also select additional Metrics that affect the size or color of points on the map. For example, the intensity of points can be based on severity or urgency metrics.

5.5.4. Filtering and Time Range

Use Filters to focus the map on specific data points. For example, filter for data only in a particular region or within a specific timeframe.

If your dataset includes temporal data, you can adjust the Time Range to show data for a specific period.

5.5.5. Customize

  • Zoom level, point size, and colors: Adjust these in the Customize tab.

  • Tooltips: Configure the map to show tooltips with detailed data when you hover over points.

5.5.6. Map Layers

You can customize and switch between different base map layers to better match your data visualization needs. Available layers include:

  • Streets

  • Parks

  • Lights

  • Satellite Streets

  • Satellite

To adjust these map layers:

  • WMS Endpoint: Add a custom WMS (Web Map Service) endpoint in the designated field to retrieve map tiles from external sources.

  • WMS Layer: Select or configure the WMS layer you wish to use.

These layers allow you to modify the base map to fit the context of your data more effectively. You can also adjust the styling of these layers, such as toggling between light or dark modes for better visibility.

5.5.7. Finalizing the Map

Once you’ve configured the map, click Update Chart to see how it looks, then click Save to add it to your dashboard.

6. Validating Data Sources via SQL Lab

After adding a new data source, it’s crucial to verify that Superset is correctly connected to the source and that the data is accessible. The following steps will guide you through this process using SQL Lab.

6.1. Step 1: Check Data Availability

  1. Go to the Datasets tab and ensure that the new data source has populated the available tables and schemas.

    • For example, if you’ve added a df-delta database, select it and review the list of tables and columns that have been loaded.

  2. Select the relevant schema and ensure it includes the expected datasets and columns.
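
If you are unsure which tables a schema exposes, you can also check directly in SQL Lab. A minimal sketch for a Postgres-backed source such as df-postgres (catalog queries vary by engine):

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';  -- list the tables visible in the selected schema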

6.2. Step 2: Open SQL Lab for Query Testing

  1. Navigate to the SQL tab at the top of the interface and click SQL Lab.

  2. Select your database from the dropdown menu.

  3. Verify the table’s schema by reviewing the columns listed on the left-hand side.

6.3. Step 3: Run a Test Query

  • In the SQL query editor, run a test query to verify that data is being returned as expected. For example, a simple query might be:

SELECT * FROM "your_table_name" LIMIT 10;

Requirement: Quotes around table names and an ending semicolon are required.

  • Click Run to execute the query. You should see the query results appear in the Results tab below the editor.

6.4. Step 4: Save or Review Query History

Once you’ve successfully run the query, you can either:

  • Save the query for future reference by clicking the Save button.

  • Review your query history to verify past successful executions.

6.5. Troubleshooting

  • Ensure the dataset is properly ingested and available within the selected schema.

  • Check for any SQL syntax errors in the query itself.

7. Usage

Common Use Cases

7.1. Real-Time Event Stream Analysis

  • Use Case: Data Fabric’s integration with Kafka allows for real-time ingestion of event streams. Analysts can monitor incoming data such as sensor readings, network traffic, or system events in real time.

  • Example: Using Superset, an operations analyst can set up a dashboard that provides real-time monitoring of security events, sensor data streams, or flight telemetry. Event surges or anomalies can trigger notifications or be visualized immediately to help with operational decision-making.

  • Benefit: Immediate action on real-time insights for mission-critical decision-making, such as threat detection or operational readiness.

7.2. Geospatial Data Analysis

  • Use Case: Geospatial data (such as geolocation from sensors, vehicles, or mapping systems) can be visualized and analyzed for mission planning, logistics, or situational awareness.

  • Example: A user can ingest and visualize geospatial data from a case-by-case source like df-postgis in Superset, layering the information with topographic maps, satellite imagery, or live location tracking for a more comprehensive view.

  • Benefit: Geospatial awareness allows for efficient mission planning and monitoring, particularly for military operations or search and rescue scenarios.

7.3. Time-Series Data Analytics

  • Use Case: Ingesting and analyzing time-series data from Kafka to Pinot or Delta allows teams to monitor trends over time for performance, status updates, or system health.

  • Example: A dashboard could be built to track performance metrics over time, such as system CPU usage or network throughput, and display these trends for IT administrators managing infrastructure (see the query sketch after this list).

  • Benefit: Time-series analysis provides insight into patterns and anomalies that help identify potential issues before they become critical, allowing for proactive management.
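
A sketch of the query behind such a trend chart, assuming a hypothetical system_metrics table with metric_time and cpu_usage columns (date functions vary by engine):

SELECT DATE_TRUNC('hour', metric_time) AS hour,
       AVG(cpu_usage) AS avg_cpu  -- hourly averages smooth out noisy samples
FROM "system_metrics"
WHERE metric_time >= NOW() - INTERVAL '7 DAYS'
GROUP BY 1
ORDER BY 1;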

7.4. Sensor Data Collection and Environmental Monitoring

  • Use Case: Ingest sensor data from external APIs or sensor hubs, such as weather data or IoT systems, to monitor environmental conditions and assess impact on operations.

  • Example: A weather monitoring dashboard that pulls in real-time temperature, humidity, and pressure data can help analysts forecast and plan operations that are weather-sensitive.

  • Benefit: Sensor data analysis informs strategic decisions in operational environments, where external factors such as weather conditions or equipment sensor statuses affect the outcome.

7.5. Aggregating and Enriching Data from Multiple Sources

  • Use Case: Multiple data types (e.g., document, relational, real-time streams) can be combined into one view, allowing for complex multi-source analysis.

  • Example: Combining relational data from df-delta (structured data) with real-time sensor streams from df-kafka allows an operations team to correlate operational metrics with external environmental factors.

  • Benefit: Aggregating data from multiple sources into a single view allows for more holistic insights, facilitating better decision-making across disparate datasets.

7.6. Exporting Dashboards

  • Navigate to the "Dashboards" tab from the top navigation bar.

  • Locate the dashboard you want to export.

  • Click on the Actions button (the icon with three dots) next to the dashboard name.

  • Select Export from the dropdown. This will download the dashboard as a JSON file, which can be used to import the dashboard into other Superset instances.

7.7. Sharing Dashboards

  • In the dashboard view, click on the three dots in the upper-right corner of the screen to open the dashboard options.

  • From the dropdown, select Share. This will generate a shareable link.

  • You can adjust the permissions of the link, deciding whether users can only view or edit the dashboard.

  • Copy the link and distribute it to other users, who can then access the dashboard based on the permissions you’ve set.

Other Sharing Options:

  • Save As: Allows you to create a copy of the dashboard under a new name.

  • Download: You can download the dashboard as an image or PDF for offline sharing.

8. Advanced Features

8.1. Real-Time Analytics with Pinot and Kafka Integration

  • Feature: Leverage Pinot’s OLAP capabilities to query real-time data from Kafka streams for fast, low-latency analysis.

  • Implementation: Users can build Superset dashboards that directly query Kafka streams through Pinot, displaying up-to-the-minute data updates on events like sensor triggers, security alerts, or system logs (see the query sketch after this list).

  • Benefit: This setup allows for high-speed, real-time decision-making and analysis, perfect for operational teams that need insights instantly.
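
A minimal sketch of the kind of low-latency query such a dashboard might issue against df-pinot; the table and column names are hypothetical, and the sketch assumes epoch-millisecond timestamps (Pinot’s now() returns the current epoch time in milliseconds):

SELECT event_type, COUNT(*) AS events
FROM "security_events"
WHERE event_time >= now() - 3600000  -- last hour, in epoch milliseconds
GROUP BY event_type
ORDER BY events DESC
LIMIT 10;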

8.2. Geospatial Customization with WMS Endpoints

  • Feature: Customize map visualizations using custom WMS (Web Map Service) endpoints and layering capabilities for advanced geospatial analytics.

  • Implementation: Users can specify WMS endpoints and choose map layers like streets, parks, lights, and satellite views, offering detailed control over the presentation of geospatial data. The ability to toggle between these layers in maps enhances the granularity and context of spatial analysis.

  • Benefit: This feature is crucial for operations that require advanced geospatial analytics, such as mission planning and zone management for SOCOM SOF AT&L.

8.3. Time-Bounded Querying in Superset with df-delta and Pinot

  • Feature: When dealing with large time-series datasets, users can take advantage of time-bound querying in Superset to display relevant time frames of data. For example, monitoring only the last 24 hours of sensor input data, or tracking weekly trends in anomaly detection.

  • Implementation: In Superset, set time-based filters for charts that pull data from Pinot or df-delta, allowing for focused, high-performance queries on time-series data, as sketched after this list.

  • Benefit: Advanced filtering allows users to optimize performance and target specific data, which is essential when handling massive volumes of time-based data.
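
For example, a time-bounded query against df-delta equivalent to a “last 24 hours” chart filter, assuming a hypothetical sensor_readings table:

SELECT *
FROM "sensor_readings"
WHERE reading_time >= NOW() - INTERVAL '24 HOURS'  -- restrict the scan to the most recent day
LIMIT 1000;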

8.4. Automated Data Lifecycle Management with Kafka TTL (Time to Live)

  • Feature: Kafka topics have a default Time to Live (TTL) of 1 hour, after which the data is automatically purged from Kafka but persisted in Delta for long-term storage.

  • Implementation: Users can configure data lifecycle rules and ensure that their streaming data in Kafka remains available for immediate analysis, while older data is stored in df-delta for historical analytics.

  • Benefit: This automated lifecycle management ensures optimal use of storage and processing resources while preserving critical data for historical analysis.

9. Troubleshooting

9.1. Issue: No Data Displayed in Chart/Dashboard

Possible Cause: The selected dataset may not have been correctly connected to Superset, or the data filters applied are too restrictive.

Solution:

  • Verify that the dataset is connected properly. You can do this by checking the Datasets tab and ensuring the schema and table are correct.

  • Go to SQL Lab and run a simple query like SELECT * FROM [table] LIMIT 10; to check if data is available.

  • Review the applied filters in the chart or dashboard. If the filters are too restrictive, broaden the selection to ensure more data is visible.

9.2. Issue: Chart Takes Too Long to Load

Possible Cause: The dataset may be too large, or query complexity is causing delays.

Solution:

  • Try reducing the data volume by applying filters to narrow down the dataset.

  • Adjust the row limit in the dataset query to a smaller number.

  • Check if you have selected the correct schema optimized for your data.

9.3. Issue: Missing or Inaccurate Data in the Dashboard

Possible Cause: There might be an issue with data ingestion, or data was dropped after Kafka’s time-to-live period.

Solution:

  • Confirm that the data source is actively ingesting data by checking with SQL Lab.

  • Ensure the data exists in Delta, where Kafka data is persisted after its one-hour TTL (a quick freshness check is sketched after this list).

  • Cross-check with team members about any changes in the data schema that could affect data visibility.
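
A quick freshness check in SQL Lab shows whether new data has stopped arriving, for example after the Kafka TTL window (event_time is a hypothetical timestamp column):

SELECT MAX(event_time) AS latest_record  -- a stale timestamp suggests ingestion has stalled
FROM "your_table_name";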

9.4. Issue: SQL Query Errors in SQL Lab

Possible Cause: Syntax errors in the query or incorrect table/schema selection.

Solution:

  • Double-check your SQL syntax, ensuring that table names are surrounded by quotes and queries end with a semicolon.

  • Ensure that the correct schema and table are selected in SQL Lab before running your query.

9.5. Issue: Map Not Rendering or Showing Incorrect Data

Possible Cause: Latitude and longitude columns might not be correctly selected, or the WMS endpoint might be incorrectly configured.

Solution:

  • Ensure the correct latitude and longitude columns are selected in the map customization panel.

  • Verify that the WMS endpoint and layer settings are properly configured in the Customize tab.

  • Zoom in or adjust the map’s viewport if points are too scattered.

9.6. Issue: Unexpected Formatting or Visualization Issues

Possible Cause: The chart type may not be suited to the data, or there may be too many metrics selected.

Solution:

  • Check if the chosen chart type (e.g., pie chart, scatter plot) is appropriate for the dataset.

  • Simplify the chart by reducing the number of metrics or dimensions. Overcomplicating visualizations can lead to poor representation or performance issues.

9.7. Issue: Dashboards Not Displaying Updates

Possible Cause: Cached data may be preventing updates from showing.

Solution:

  • Clear the cache by clicking the refresh button in Superset to see the latest data.

  • If you are still seeing old data, verify with your team if there are any delays in data ingestion or synchronization.

10. Reference Materials

Here are key reference resources for using Superset within Data Fabric:

  • Apache Superset Documentation - The official Apache Superset documentation provides a comprehensive guide to Superset’s functionality, including chart types, SQL Lab, and how to manage dashboards.

  • Apache Pinot Documentation - If using the df-pinot database for OLAP queries, review the Apache Pinot documentation for configuring queries and optimizing performance.

  • PostGIS Guide - When working with geospatial data, the PostGIS documentation provides detailed guidance on spatial queries and optimizations.

WMS Layers & Geospatial Data Resources

11. Glossary

  • Aggregation: The process of summarizing data, often by computing statistics like sum, average, or count, commonly applied to metrics in visualizations.

  • Apache Superset: An open-source data exploration and visualization platform used for building interactive dashboards and visualizing large datasets.

  • Dashboard: A collection of visualizations or charts organized into a single interface for monitoring and analyzing data in Superset.

  • Datasets: Collections of data available for querying and visualization within Superset, typically stored in databases like df-delta, df-pinot, or df-postgis.

  • df-arcadedb: A multi-model database capable of handling graph and document-based data. Typically used on a case-by-case basis within Data Fabric.

  • df-delta: A database used for structured, persisted analytics data within Data Fabric, primarily for long-term storage of ingested data from Kafka.

  • df-kafka: A messaging system that handles real-time event streams within Data Fabric. It ingests streaming data before it’s persisted in Delta.

  • df-pinot: A real-time OLAP database optimized for querying large-scale, low-latency datasets. Ideal for real-time analytics and event-driven queries.

  • df-postgis: A spatial database extender for PostgreSQL, used for handling geospatial data within Data Fabric.

  • Dimensions: Categorical values used for grouping and organizing data in Superset visualizations.

  • Filters: Constraints applied within visualizations to refine the data displayed, such as limiting by date range or specific categories.

  • Geospatial Data: Data that includes geographical components such as coordinates (latitude, longitude) used for mapping, typically stored in df-postgis.

  • Metrics: Numeric values in datasets that can be aggregated or analyzed within Superset charts.

  • OLAP (Online Analytical Processing): A category of software tools used for complex queries on large datasets, often employed by df-pinot for real-time and low-latency querying.

  • Real-Time Analytics: The processing and analysis of data as it is ingested into the system, providing near-instantaneous insights, often supported by df-kafka and df-pinot.

  • Row Limit: The maximum number of rows returned by a query, often adjusted to improve performance by limiting large result sets.

  • Schema: The structure that defines how data is organized in a database, including tables, fields, and relationships between them.

  • SQL Lab: A feature in Superset where users can write and execute SQL queries directly on datasets.

  • WMS (Web Map Service): A standard protocol used for serving georeferenced map images over the internet.

12. Appendices

12.1. Appendix A: Common SQL Queries in Superset

Here are some common SQL queries used within SQL Lab to explore datasets in Data Fabric:

View All Data:

SELECT * FROM [table_name] LIMIT 100;

Aggregate Data:

SELECT COUNT(*), [column_name]
FROM [table_name]
GROUP BY [column_name];

Filter Data by Date Range:

SELECT *
FROM [table_name]
WHERE [date_column] BETWEEN '2024-01-01' AND '2024-01-31';

Join Two Tables:

SELECT a.[column1], b.[column2]
FROM [table1] a
JOIN [table2] b ON a.[id] = b.[id];

Group and Aggregate Data:

SELECT [column1], COUNT([column2]), SUM([column3])
FROM [table_name]
GROUP BY [column1];

Sorting Data:

SELECT * FROM [table_name]
ORDER BY [column_name] DESC
LIMIT 10;

Filter Data with Multiple Conditions:

SELECT * FROM [table_name]
WHERE [column1] = 'value' AND [column2] > 1000;

Calculate Average:

SELECT AVG([column_name])
FROM [table_name];

Retrieve Data from Last 7 Days:

SELECT * FROM [table_name]
WHERE [date_column] >= NOW() - INTERVAL '7 DAYS';

Counting Distinct Values:

SELECT COUNT(DISTINCT [column_name])
FROM [table_name];

12.2. Appendix B: Role-Based Access Control (RBAC) Setup in Superset

Superset uses Keycloak for role-based access control (RBAC). Here’s how the roles typically work:

  • Admins: Have full control over datasets, dashboards, and SQL Lab, and can also manage users and roles.

  • Data Stewards & Analysts: Have access to create, modify, and share dashboards, and run queries in SQL Lab. They can also manage datasets but may have restricted access to more sensitive data.

  • Viewers: Can view dashboards and reports but cannot modify datasets or dashboards.

Each role has specific permissions based on the organization’s access policy. Ensure that your role in Keycloak matches the actions you need to perform in Superset.