Grafana

1. Introduction

1.1. Purpose

This guide explains how to use Grafana to monitor application activity, including health checks and general usage telemetry, with Prometheus for metrics and Loki for log aggregation. It covers the setup process, key functionality, and advanced features that let users monitor application and infrastructure health, performance, and usage.

1.2. Target Audience

This guide is designed for users of the Data Fabric Application who need to monitor and analyze system health, performance, and usage metrics. It is aimed at both technical and non-technical users, including developers, operators, and data analysts, who want to gain insights into the application’s activity and performance using Grafana dashboards.

1.3. Scope

This guide outlines how to set up Grafana, integrate Prometheus for metrics and Loki for logs, create dashboards, configure alerts, and troubleshoot common issues. It also highlights use cases for monitoring application performance and provides best practices for scalable and efficient dashboard design.

2. Overview

2.1. Capability Description

Grafana is a powerful tool for visualizing and monitoring data from multiple sources, like Prometheus and Loki. It allows users to create custom dashboards for real-time and historical analysis, helping track system performance and detect issues. Key features include multi-source integration, dynamic dashboards, and alerting to ensure system reliability and quick issue resolution.

2.2. Key Features

  • Metrics Monitoring with Prometheus: Collect and visualize CPU usage, memory consumption, request rates, and other telemetry data.

  • Log Aggregation with Loki: Centralize logs so they can be correlated with metrics for comprehensive monitoring.

  • Alerting: Trigger alerts when specific thresholds are met, such as high error rates or degraded system performance.

  • Multi-Source Integration: Support for multiple data sources allows you to combine metrics (from Prometheus) with logs (from Loki) in a single dashboard.

  • Custom Dashboards: Build flexible and intuitive dashboards for both real-time and historical data analysis.

3. Setup

The following steps guide you through the initial setup process to start using Grafana for monitoring and visualization within Raft’s Data Fabric:

3.1. Step 1: Accessing Grafana

There are two ways to navigate to Grafana:

  • When you first log in, you will see the Grafana tile on the Home page.

  • You can also locate Grafana under the Data Insights tab on the left.

Ensure your login credentials are set up with Keycloak for authentication.

3.2. Step 2: Configuring Authentication

Log in using your Keycloak credentials, which manage access to dashboards and data sources.

3.3. Step 3: Grafana Buttons

  • General / Home: Links to Grafana documentation, tutorials, community, and public Slack.

  • Search dashboards by name

  • Your starred dashboards

  • Dashboards: Browse, Playlists, Snapshots, Library Panels, + New dashboard, + New folder, + Import

  • Explore

  • Alerting: Alert rules, Contact points, Notification policies, Silences, Groups, Admins, + New alert rule

  • Configurations: Service accounts, API Keys, Preferences, Plugins, Teams, Users, Data sources

  • User profile: Sign out, Notification history, Preferences

The Help option is also available under General / Home.

3.4. Step 4: Adding Data Sources

After logging in, navigate to Configurations > Data sources to add a new data source.

Grafana supports multiple data sources. Data Fabric includes:

  • Prometheus for DF system metrics

  • Loki for DF log aggregation

Configure the connection with the following details:

  • Server address (URL or IP of the data source)

  • Authentication details (e.g., API key, credentials, or OAuth)

  • Query parameters as required for the specific data source

# Example server address configuration
server_address: "http://localhost:9090"
authentication: "OAUTH"
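
The same connection details can also be expressed as a request body for Grafana's data source HTTP API (POST /api/datasources). The following is a minimal sketch, not the Data Fabric provisioning mechanism: the data source names and the Loki URL are assumptions chosen for illustration, and authentication settings are omitted.

```python
import json


def datasource_payload(name: str, ds_type: str, url: str, is_default: bool = False) -> dict:
    """Build a JSON body for Grafana's data source API (POST /api/datasources)."""
    if ds_type not in ("prometheus", "loki"):
        raise ValueError(f"unsupported data source type: {ds_type}")
    return {
        "name": name,
        "type": ds_type,
        "url": url,
        "access": "proxy",  # the Grafana backend issues the requests
        "isDefault": is_default,
    }


# Hypothetical names and URLs; replace them with the values for your environment.
prometheus = datasource_payload("DF Prometheus", "prometheus", "http://localhost:9090", is_default=True)
loki = datasource_payload("DF Loki", "loki", "http://localhost:3100")
print(json.dumps([prometheus, loki], indent=2))
```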

3.5. Step 5: Verifying Data Source Connections

Once configured, use the “Test” button to ensure that Grafana can successfully connect to the data source.

If the test fails, verify your authentication credentials and data source server address.
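
When the "Test" button fails, it can help to check the data source directly. The sketch below assumes the standard readiness endpoints (/-/ready for Prometheus, /ready for Loki) are reachable from where the script runs; the base URLs are placeholders.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Readiness endpoints exposed by Prometheus and Loki themselves.
READINESS_PATHS = {"prometheus": "/-/ready", "loki": "/ready"}


def readiness_url(ds_type: str, base_url: str) -> str:
    """Join a base URL with the readiness path for the given data source type."""
    return urljoin(base_url.rstrip("/") + "/", READINESS_PATHS[ds_type].lstrip("/"))


def is_ready(ds_type: str, base_url: str, timeout: float = 3.0) -> bool:
    """Return True when the readiness endpoint answers with HTTP 200."""
    try:
        with urlopen(readiness_url(ds_type, base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


# Placeholder addresses; substitute the server address configured in Grafana.
print(readiness_url("prometheus", "http://localhost:9090"))
print(readiness_url("loki", "http://localhost:3100/"))
```

Calling is_ready("prometheus", "http://localhost:9090") performs the actual network check.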

3.6. Step 6: Initial Dashboard Setup

Grafana allows you to create a dashboard immediately after setting up your data source.

  1. Navigate to Dashboards > New Dashboard.

  2. Add a new panel and select your configured data source.

  3. Customize the panel by selecting the query, time range, and visualization type.

  4. Save your dashboard to begin real-time monitoring.
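
The steps above can also be sketched as a dashboard JSON model, which is what Grafana stores when you save. This is a minimal illustration assuming a time-series panel and a placeholder data source UID ("prometheus-uid"); the body shape follows Grafana's dashboard API (POST /api/dashboards/db).

```python
import json


def timeseries_panel(title: str, expr: str, datasource_uid: str, panel_id: int = 1) -> dict:
    """Return one time-series panel that runs a single Prometheus query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "targets": [{"refId": "A", "expr": expr}],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
    }


def dashboard_body(title: str, panels: list, overwrite: bool = False) -> dict:
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db."""
    model = {
        "id": None,  # None asks Grafana to create a new dashboard
        "uid": None,
        "title": title,
        "time": {"from": "now-6h", "to": "now"},
        "refresh": "30s",
        "panels": panels,
    }
    return {"dashboard": model, "overwrite": overwrite}


# "prometheus-uid" is a placeholder for the UID of your configured data source.
body = dashboard_body("DF Quick Start", [timeseries_panel("Targets Up", "up", "prometheus-uid")])
print(json.dumps(body, indent=2))
```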

4. Usage

4.1. Getting Started

Grafana’s flexibility makes it easy to create dashboards by connecting to pre-built integrations like Prometheus for metrics (CPU, memory, request rates) and Loki for logs, providing a comprehensive view of application activity.

Beyond these built-in integrations, Grafana supports a wide range of custom data sources, allowing connections to databases, APIs, and systems like MySQL, PostgreSQL, or custom APIs. With the right data source plugin, you can monitor real-time metrics, visualize trends, and ensure your infrastructure’s health. Additionally, you can extend Grafana in the Data Fabric App by integrating external sources like additional databases, logs, or third-party APIs. This enables centralized monitoring of both Data Fabric and external systems, offering a broader perspective on system metrics.

Grafana’s robust alerting system lets you set thresholds for key metrics and receive notifications via Mattermost, email, or webhooks, ensuring you’re alerted in real-time to any issues. You can also use annotations to mark critical events like deployments or incidents directly on your graphs, making it easier to track performance changes. With data transformations, you can filter, aggregate, and modify data within Grafana itself, creating more meaningful visualizations. Finally, multi-source dashboards enable you to combine data from multiple sources into a single, unified view, providing a holistic perspective of your infrastructure and application health.

4.2. Common Use Cases

4.2.1. Use Case 1: Comprehensive Monitoring of Dataset Query Volume, Performance, and Load Balancing

Users of the Data Fabric Application need to retrieve datasets for analysis or reporting. Monitoring both query volume and performance, while ensuring the query load is evenly distributed across datasets, is crucial for efficient data retrieval and system optimization. This helps prevent individual datasets from being overloaded while maintaining fast response times.

Example: A panel shows the number of dataset queries executed per hour across different datasets, highlighting any dataset with disproportionately high load. Another panel tracks the average time taken to execute queries across datasets, providing insights into system performance under varying load.

Step 1 - Create a New Dashboard

  • Navigate to Dashboards > + New Dashboard.

  • Click Add New Panel to begin creating the first panel.

Step 2 - Add Dataset Query Volume Panel

  • Data Source: Select Prometheus as the data source.

  • Metric Query: In the query builder, use a metric like query_requests_total{dataset=~".*"} to track the total number of dataset queries made to the Data Fabric system across all datasets.

  • Example query: sum(rate(query_requests_total{dataset=~".*"}[5m])) for the combined request rate across all datasets.

Panel Configuration:

  • Set Panel Type to Graph to visualize query volume trends over time.

  • In the Legend, name it "Dataset Query Volume."

  • Filters: Apply filters to focus on specific datasets if needed:

    • Example filter: query_requests_total{dataset="customer-data"} to focus on a specific dataset.

  • Save the Panel: Click Apply to save the panel.
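
To chart queries per hour rather than a raw running total, the counter can be wrapped in increase(). The helper below is a sketch that only builds PromQL strings; it assumes query_requests_total is a Prometheus counter carrying a dataset label, as the examples in this section suggest.

```python
def selector(metric: str, **labels: str) -> str:
    """Render metric{label="value",...}; omit the braces when there are no labels."""
    if not labels:
        return metric
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{inner}}}"


def hourly_volume(dataset: str = "") -> str:
    """PromQL for queries executed in the last hour, grouped by dataset."""
    labels = {"dataset": dataset} if dataset else {}
    series = selector("query_requests_total", **labels)
    return f"sum by (dataset) (increase({series}[1h]))"


print(hourly_volume())                 # all datasets
print(hourly_volume("customer-data"))  # one dataset
```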

Step 3 - Add Dataset Query Performance Panel

  • Add a New Panel by clicking the "+" icon.

  • Metric Query: Use a query such as query_execution_duration_seconds{dataset=~".*"} to track the time it takes to execute queries across datasets.

  • Example query: avg(query_execution_duration_seconds{dataset=~".*"}) for the average execution time across all datasets.

Panel Configuration:

  • Set Panel Type to Graph to visualize query execution time over a set period.

  • Name the panel "Query Execution Time" in the Legend field.

  • Save the Panel: Click Apply to save the panel.
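
If the duration metric is exported as a Prometheus summary or histogram, an average execution time is usually derived from its _sum and _count series. The sketch below assumes those two series exist for query_execution_duration_seconds, which this guide does not confirm; check the metric in Explore before relying on it.

```python
def average_duration(metric: str = "query_execution_duration_seconds", window: str = "5m") -> str:
    """PromQL for mean execution time per dataset: rate of _sum divided by rate of _count."""
    total_time = f"sum by (dataset) (rate({metric}_sum[{window}]))"
    executions = f"sum by (dataset) (rate({metric}_count[{window}]))"
    return f"{total_time} / {executions}"


print(average_duration())
```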

Step 4 - Add Query Load Balancing Panel

  • Add a New Panel by clicking + Add New Panel.

  • Metric Query: Use a query like query_requests_total{dataset=~".*"} to track how evenly queries are distributed across multiple datasets.

  • Example query: sum by (dataset) (query_requests_total{dataset=~".*"}) to show query volume per dataset.

Panel Configuration:

  • Set Panel Type to Bar Gauge or Table to clearly visualize how the query load is spread across datasets.

  • Name the panel "Query Load Distribution."

  • Save the Panel: Click Apply.
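
What counts as a "disproportionately high" load is a judgment call. As one possible rule of thumb, the sketch below flags any dataset whose share of queries exceeds twice an even split; the factor of 2 and the sample counts are assumptions for illustration only.

```python
def load_shares(counts: dict) -> dict:
    """Convert per-dataset query counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


def overloaded(counts: dict, factor: float = 2.0) -> list:
    """Datasets whose share exceeds `factor` times an even split."""
    if not counts:
        return []
    even_share = 1 / len(counts)
    shares = load_shares(counts)
    return sorted(name for name, share in shares.items() if share > factor * even_share)


# Made-up counts for three hypothetical datasets.
sample = {"customer-data": 700, "orders": 200, "inventory": 100}
print(load_shares(sample))
print(overloaded(sample))
```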

4.2.2. Use Case 2: Monitoring Data Ingestion Pipeline Performance

Users of the Data Fabric Application often rely on data ingestion pipelines to bring in new datasets for analysis. Monitoring the performance of these pipelines—specifically the rate at which data is ingested and the time it takes to process data—ensures that ingestion tasks are completed efficiently and without delay.

Example: One panel tracks the data ingestion rate, showing how many datasets or records are ingested per minute, while another tracks the time taken for ingestion tasks to complete. This helps users identify potential bottlenecks or performance degradation in the data ingestion process.

Step 1 - Create a New Dashboard

  • Navigate to Dashboards > + New Dashboard.

  • Click Add New Panel.

Step 2 - Add Data Ingestion Rate Panel

  • Data Source: Select Prometheus.

  • Metric Query: Use a query like ingestion_requests_total to track how many datasets or records are ingested over time.

  • Example query: ingestion_requests_total{pipeline=~".*"} to track ingestion across all pipelines.

Panel Configuration:

  • Set Panel Type to Graph to visualize the ingestion rate.

  • Name the panel "Data Ingestion Rate."

  • Save the Panel: Click Apply.

Step 3 - Add Data Ingestion Time Panel

  • Metric Query: Use a query like ingestion_duration_seconds to track the time taken to complete data ingestion tasks.

  • Example query: ingestion_duration_seconds{pipeline=~".*"} to track duration across all pipelines.

Panel Configuration:

  • Set Panel Type to Graph to visualize the ingestion time for each task.

  • Name the panel "Data Ingestion Time" in the Legend.

  • Save the Panel: Click Apply.

Step 4 - Set Alerts for Ingestion Failures

  • Create an alert to trigger when ingestion time exceeds a set threshold (e.g., >5 minutes).

  • Configure the Notification Channels to send alerts via email, Webhook, or Mattermost when ingestion tasks are slow or failing.
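
Because the metric is reported in seconds, a 5-minute threshold becomes 300 in the alert condition. The sketch below builds such a condition and applies the same check to sample values; it assumes ingestion_duration_seconds is a gauge with a pipeline label, and the pipeline names are made up.

```python
def minutes_to_seconds(minutes: float) -> int:
    """Convert a threshold in minutes to whole seconds."""
    return int(minutes * 60)


def slow_ingestion_expr(threshold_minutes: float = 5) -> str:
    """PromQL condition that is true for pipelines slower than the threshold."""
    limit = minutes_to_seconds(threshold_minutes)
    return f"max by (pipeline) (ingestion_duration_seconds) > {limit}"


def breaching(durations: dict, threshold_minutes: float = 5) -> list:
    """Apply the same threshold to {pipeline: seconds} samples."""
    limit = minutes_to_seconds(threshold_minutes)
    return sorted(name for name, seconds in durations.items() if seconds > limit)


print(slow_ingestion_expr())
print(breaching({"nightly-load": 120, "sensor-feed": 301}))  # hypothetical pipelines
```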

4.2.3. Use Case 3: Monitoring Kafka Topic Growth

Users of the Data Fabric Application may need to monitor Kafka topic growth to maintain system performance and manage storage effectively. By tracking the log size growth of topics over time, users can identify topics that consume large amounts of storage and take necessary actions, such as managing retention policies or optimizing resource allocation. Monitoring topic growth is crucial to ensuring that Kafka clusters run efficiently without unnecessary resource overuse.

Example: In this use case, users can set up a dashboard to monitor the log size for each Kafka topic over time. Panels display the current log size and highlight any trends or significant increases in storage consumption for specific topics. This helps users quickly identify which topics are growing rapidly and take action, such as increasing resource allocation or adjusting retention settings to prevent excessive storage consumption.

Step 1 - Create a New Dashboard

  • Navigate to Dashboards > + New Dashboard.

  • Click Add New Panel.

Step 2 - Add Kafka Log Size Panel

  • Data Source: Select Prometheus.

  • Metric Query: Use a query like sum by (topic) (kafka_log_log_size{topic!~"(_)strimzi."}) to track the total log size for each Kafka topic.

Step 3 - Add a Time-Series Panel for Log Growth Over Time

  • Data Source: Select Prometheus.

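  • Metric Query: Use a query like sum by (topic) (deriv(kafka_log_log_size{topic!~"(_)strimzi."}[$__rate_interval])) to track how quickly each topic’s log size grows. Log size can shrink when retention removes segments, so deriv(), which is meant for gauges, is a better fit here than rate(), which assumes a counter.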

Step 4 - Set Alerts for Excessive Kafka Topic Growth

  • Create an alert to trigger when Kafka topic log size exceeds a set threshold (e.g., >10 GB).

  • Configure the Notification Channels to send alerts via email, Webhook, or Mattermost when certain Kafka topics experience rapid growth that may affect resource allocation.
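
The log size metric is reported in bytes, so a threshold such as 10 GB has to be converted before it is used in an alert condition. The sketch below assumes binary units (1 GiB = 1024³ bytes) and adds a simple linear projection of when a topic would reach the limit; both helpers are illustrative and not part of Grafana.

```python
GIB = 1024 ** 3  # bytes per GiB (binary units are an assumption here)


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to bytes for use in an alert threshold."""
    return int(gib * GIB)


def days_until(current_bytes: float, growth_bytes_per_sec: float, limit_bytes: float):
    """Linear estimate of days until a topic reaches the limit; None if it is not growing."""
    if current_bytes >= limit_bytes:
        return 0.0
    if growth_bytes_per_sec <= 0:
        return None
    return (limit_bytes - current_bytes) / growth_bytes_per_sec / 86400


print(gib_to_bytes(10))
# A topic at 5 GiB growing by 1 GiB per day, measured against a 10 GiB limit.
print(days_until(5 * GIB, GIB / 86400, 10 * GIB))
```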

4.3. Advanced Features

4.3.1. Feature 1: Dynamic Dashboards with Variables

Grafana supports template variables for building dynamic dashboards. A single dashboard can then adjust to user input, such as the selected server or data source.

Steps:

  • Navigate to the Variables section in the dashboard settings.

  • Add a new variable (e.g., server).

  • Use the variable in your queries to dynamically change the data displayed based on user selection.
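
To make the substitution concrete, the sketch below shows a variable definition as it might appear in dashboard JSON, plus a simplified stand-in for what happens when a query is run. The interpolate function is a teaching aid, not Grafana's implementation (Grafana also escapes regex characters in the selected values), and the instance names are invented.

```python
import re

# Variable definition; label_values() is Grafana's helper for Prometheus variables.
server_variable = {
    "name": "server",
    "type": "query",
    "query": "label_values(up, instance)",
    "multi": True,
    "includeAll": True,
}


def interpolate(query: str, selected: dict) -> str:
    """Replace $name or ${name} with the selected values, joined as a regex alternation."""
    def substitute(match):
        name = match.group(1) or match.group(2)
        if name not in selected:
            return match.group(0)  # leave built-ins such as $__rate_interval untouched
        return "|".join(selected[name])

    return re.sub(r"\$\{(\w+)\}|\$(\w+)", substitute, query)


print(interpolate('up{instance=~"$server"}', {"server": ["node1", "node2"]}))
```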

4.3.2. Feature 2: Advanced Alerting

Grafana’s alerting system can be set up to trigger notifications when a monitored metric crosses a specific threshold.

Steps:

  • In the panel settings, click on Alert > Create Alert.

  • Define the condition for the alert (e.g., CPU usage > 80%).

  • Set up the notification channel (e.g., Webhook).

  • Save the alert and monitor the dashboard for real-time updates.
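
A threshold alert usually fires only after the condition has held for some time, which avoids notifications for brief spikes. The sketch below is a simplified model of that behavior under the assumption that the alert is evaluated at fixed intervals; it is not Grafana's evaluation engine, and the CPU samples are made up.

```python
def alert_states(samples: list, threshold: float, pending_evals: int) -> list:
    """Return normal, pending, or firing for each evaluation of a threshold rule."""
    states = []
    streak = 0  # consecutive evaluations above the threshold
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("normal")
        elif streak < pending_evals:
            states.append("pending")
        else:
            states.append("firing")
    return states


# CPU usage as a fraction; the rule is "above 80% for three evaluations in a row".
cpu = [0.75, 0.85, 0.90, 0.82, 0.60]
print(alert_states(cpu, 0.80, 3))
```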

4.3.3. Feature 3: Grafana Annotations

Annotations allow users to mark specific events on graphs for easier correlation between system events and metric changes.

Steps:

  • Open the dashboard settings and navigate to the "Annotations" tab.

  • Set up an annotation query based on log data or events.

  • Visualize annotations as markers on the graph, correlating them with data trends.
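
Annotations can also be created programmatically, for example from a deployment pipeline, through Grafana's annotations HTTP API (POST /api/annotations). The sketch below only builds the request body; the tag names, text, and dashboard UID are placeholders.

```python
import time


def annotation_payload(text: str, tags: list, when_ms: int = None, dashboard_uid: str = None) -> dict:
    """Build a body for POST /api/annotations; time is epoch milliseconds."""
    body = {
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": list(tags),
        "text": text,
    }
    if dashboard_uid:
        body["dashboardUID"] = dashboard_uid  # omit to create an organization-wide annotation
    return body


# Placeholder values for illustration.
deploy = annotation_payload("Deployed ingestion service", ["deployment"], when_ms=1700000000000)
print(deploy)
```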

4.3.4. Feature 4: Dashboard Playlist

Playlist mode in Grafana allows you to set up a rotating view of multiple dashboards, useful for operations centers and monitoring rooms.

Steps:

  • Create a set of dashboards that you want to monitor.

  • Open the playlist feature from the Grafana toolbar.

  • Configure the rotation interval and start the playlist mode.

5. Best Practices

5.1. General Best Practices

  • Keep dashboards simple by focusing on key metrics and avoiding an overload of panels on a single dashboard.

  • Leverage Grafana’s template variables to create flexible and reusable dashboards.

5.2. Performance Optimization

  • Ensure queries are optimized to avoid overloading data sources, especially when working with large time-series datasets.

  • Use caching mechanisms in Grafana for frequently used queries to improve dashboard performance.
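
One practical way to keep queries light is to avoid requesting more data points than a panel can draw. The sketch below illustrates the idea of deriving a query step from the time range; it is similar in spirit to how an interval is chosen from the time range and a maximum number of data points, but the defaults (1000 points, 15-second minimum) are assumptions, not Grafana settings.

```python
import math


def query_step_seconds(range_seconds: int, max_points: int = 1000, min_step: int = 15) -> int:
    """Smallest step that keeps a range query at or below max_points samples per series."""
    return max(min_step, math.ceil(range_seconds / max_points))


print(query_step_seconds(3600))        # one hour
print(query_step_seconds(7 * 86400))   # seven days
```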

5.3. Security Considerations

  • Use role-based access control (RBAC) to manage permissions effectively. Ensure only authorized users can access or modify sensitive data.

  • Ensure proper integration with Keycloak for secure authentication using OAuth mechanisms.

6. Troubleshooting

6.1. Common Issues

  • Dashboard Not Loading: Check the data source connection settings and verify that the data source is online.

  • Query Errors: Ensure that the queries used in panels are correct and compatible with the data source.

6.2. Error Messages

  • Data Source Error: Indicates an issue with the connection to your data source. Check server settings and authentication credentials.

6.3. Diagnostic Steps

  • Check the status of the data source (Prometheus, Loki, etc.) from the Data Sources tab.

  • Test the queries in the query editor to ensure they return data.
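
The same checks can be scripted against the Prometheus HTTP API (GET /api/v1/query). The sketch below builds the request URL for the built-in up metric and parses a response to list targets that are down; the sample response and instance names are fabricated for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """URL for a Prometheus instant query."""
    query_string = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def down_targets(response_text: str) -> list:
    """Instances whose `up` sample is 0 in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        sample["metric"].get("instance", "?")
        for sample in payload["data"]["result"]
        if float(sample["value"][1]) == 0.0
    )


# Fabricated response in the documented instant-query format.
sample_response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "instance": "a:9090"}, "value": [1700000000, "1"]},
        {"metric": {"__name__": "up", "instance": "b:9090"}, "value": [1700000000, "0"]},
    ]},
})
print(instant_query_url("http://localhost:9090", "up"))
print(down_targets(sample_response))
```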

7. FAQs

  • How do I add a new data source in Grafana? Go to Configurations > Data sources and select your data source type. Configure the connection with the necessary credentials and server details.

  • Can I set alerts for specific metrics? Yes, alerts can be configured in each panel. Set the condition, and Grafana will notify you when the metric crosses the defined threshold.

  • How can I resolve "Permission Denied" when accessing a dashboard? Ensure that you have been assigned the correct user role in Keycloak.

  • Why can’t I connect to my data source? This issue may arise due to incorrect server credentials or configuration settings. Double-check the server URL, authentication details, and that the data source is running. If the problem persists, test the connection using Grafana’s "Test" button under Configurations > Data sources.

  • Can I export a dashboard for use in other Grafana instances? Yes, Grafana allows you to export dashboards in JSON format. Navigate to the dashboard you want to export, click on the "Share" button, and choose "Export." This JSON file can be imported into other Grafana instances.

  • What should I do if Grafana is running slow? If Grafana’s performance is slow, check if your queries are optimized. Large data sets or inefficient queries can slow down the dashboard.
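
When an exported dashboard is imported into another instance through the HTTP API rather than the UI, the instance-specific numeric id is normally cleared first. The sketch below shows that preparation step under the assumption that the export is a plain dashboard JSON model; the sample dashboard is invented.

```python
import copy


def prepare_for_import(exported: dict, overwrite: bool = False) -> dict:
    """Wrap an exported dashboard model for POST /api/dashboards/db on another instance."""
    dashboard = copy.deepcopy(exported)  # leave the caller's copy untouched
    dashboard["id"] = None  # numeric ids are specific to the source instance
    return {"dashboard": dashboard, "overwrite": overwrite}


# Invented export for illustration.
exported = {"id": 42, "uid": "df-home", "title": "Home", "panels": []}
body = prepare_for_import(exported)
print(body)
```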

9. Glossary

  • Dashboard: A collection of panels that visualize data from one or more data sources.

  • Panel: A single visualization in a Grafana dashboard (e.g., a graph, gauge, or table).

  • Data Source: A system that provides data for visualization in Grafana (e.g., Prometheus, Loki, Elasticsearch).

10. Delivered Dashboards

10.1. API Gateway Dashboard

Description: This dashboard monitors the health and performance of the API Gateway, providing insights into key metrics like uptime, operation rates, response times, and outcomes for various API routes. It allows users to observe the performance of both fabric services and support services, helping them track and troubleshoot issues in real-time.

Data Sources: Prometheus

Key Panels:

Uptime (Top Panel): Displays the API Gateway’s uptime, sourced in seconds (process_uptime_seconds), but converted and shown in days for readability. The panel uses color thresholds where uptime over 1 hour is shown in green. No critical thresholds are defined, but the panel visually indicates uptime status.

Proxy Services (Panel Group):

Operation Rate: A time-series graph visualizing the operation rate (requests per second) for API Gateway proxy routes. The query tracks the rate of requests using the spring_cloud_gateway_requests_seconds_count metric, filtered by the namespace, job, and route ID pattern (i.e., routes starting with proxy-). If no routes match this pattern, or if no data is present for the selected time range, the panel will display "No Data."

Response Times: A time-series graph tracking the maximum response times for API requests through proxy routes. The graph shows the slowest response times for each HTTP method and route that matches the proxy- route pattern.

Outcomes: A time-series panel showing the outcomes of requests through proxy routes, such as SUCCESSFUL or CLIENT_ERROR, aggregated by HTTP method and route ID for proxy routes.
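
For readers who want to reproduce the proxy route panels in Explore, a query along the following lines is one plausible reconstruction. The label names routeId and httpMethod are assumptions based on common Spring Cloud Gateway metric tags, and the delivered dashboard's exact query may differ; $namespace and $gateway_job are the dashboard's template variables.

```python
def proxy_operation_rate(route_prefix: str = "proxy-") -> str:
    """PromQL for requests per second on routes whose ID starts with the given prefix."""
    series = (
        "spring_cloud_gateway_requests_seconds_count{"
        'namespace="$namespace",job="$gateway_job",'
        'routeId=~"' + route_prefix + '.*"}'
    )
    # routeId and httpMethod are assumed label names; verify them against your metrics.
    return "sum by (routeId, httpMethod) (rate(" + series + "[$__rate_interval]))"


print(proxy_operation_rate())
```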

Fabric Services (Panel Group):

Operation Rate: Tracks the operation rate (requests per second) for API routes such as v1-pipelines, categorized by HTTP method (e.g., DELETE, GET, POST).

Response Times: Shows the maximum response times for API requests, visualized as a time-series graph. The panel displays the slowest response times for each HTTP method and route.

Outcomes: Displays the results of API requests, such as CLIENT_ERROR and SUCCESSFUL, aggregated by HTTP method and route.

Support Services (Panel Group):

Operation Rate: Monitors the operation rate for support services such as /api/test/auth, /api/v1/auth/token, and /api/test/hello.

Response Times: Visualizes the response times for support services, providing detailed latency data for these API routes.

Outcomes: Displays the results of API requests for support services, showing success and unknown outcomes.

CPU Usage (Instance-Level Panel): A gauge displaying the CPU usage of the API Gateway, using the process_cpu_usage metric. The panel uses color-coded thresholds to indicate the CPU usage levels:

  • Green: Normal usage, below 20% CPU.

  • Yellow: Warning state, between 20% and 40% CPU usage.

  • Red: Critical state, above 40% CPU usage.
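
The threshold bands can be expressed as a small function, which is handy when the same bands are reused in scripts or alert rules. The sketch assumes process_cpu_usage is reported as a fraction between 0 and 1, which is common for this metric but should be verified against your data.

```python
def cpu_color(usage_fraction: float) -> str:
    """Map CPU usage (0.0 to 1.0) to the gauge color bands described for this panel."""
    if usage_fraction < 0.20:
        return "green"   # normal
    if usage_fraction <= 0.40:
        return "yellow"  # warning
    return "red"         # critical


for sample in (0.10, 0.30, 0.50):
    print(sample, cpu_color(sample))
```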

Additional Features:

Annotations & Alerts: Supports annotations for tracking key events. Alerts can be set up for critical metrics like operation rate or response times.

Template Variables: Utilizes variables like namespace and gateway_job to easily switch between different namespaces and jobs without modifying queries.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.2. Alerting Dashboard

Description: This dashboard monitors the status and performance of the alerting system within the Data Fabric application. It provides insights into key metrics like alerts received, alerts published, filtering, and errors encountered in the alerting pipeline. Additionally, it monitors the health and uptime of the alerting services and provides data on the REST API for alert operations.

Data Sources: Prometheus

Key Panels:

Uptime (Top Panel): Displays the uptime of the alerting service, sourced from the process_uptime_seconds metric, converted to hours for readability. The panel uses color thresholds, with uptime shown in green if greater than 1 hour.

Alerts Section:

Alerts Published (Top Panel): Displays the total number of alerts published after being processed by the alerting system. Sourced from the alerts_published_total metric, this stat panel shows "No Data" when no alerts are published. The panel is color-coded blue when data is available.

Alerts Filtered (Top Panel): Tracks the total number of alerts filtered out by the system, using the alerts_filtered_total metric. This stat panel shows "No Data" if no filters are applied or no alerts pass through the filters, with an orange color when data is present.

Alert Errors (Top Panel): Monitors errors related to alerts, sourced from the alerts_errors_total metric. This stat panel turns red when errors are detected in the alerting process, with "No Data" shown if no errors occur.

Alerts Received Section:

Alerts Received (by source): This time-series panel shows alerts received, categorized by their source. Sourced from the alerts_received_total metric, it groups alerts by the source label. When no data is available, the panel displays "No Data."

Alerts Received (by type): This time-series graph displays the received alerts grouped by their type, showing how many of each type were processed. It is sourced from the same metric (alerts_received_total) but filtered by the type label.

Alerts Published Section:

Alerts Published (by sink): This time-series graph tracks the number of alerts published, grouped by the sink (destination) to which the alerts were sent, sourced from alerts_published_total. If no alerts are published, "No Data" is shown.

Alerts Published (by type): Similar to the above panel, this one tracks the alerts published, categorized by their type, using the alerts_published_total metric.

Alert Filtering Section:

Alerts Filtered (by filter): This panel shows the total number of alerts filtered, grouped by the filters applied to the alerts. The data is sourced from the alerts_filtered_total metric and categorized by the filter label.

Alerts Filtered (by type): Tracks the alerts filtered, grouped by the type of alert being filtered. It uses the same alerts_filtered_total metric but categorized by type.

Alert Errors Section:

Filter Errors: This panel shows errors encountered during the filtering of alerts, grouped by the filters in question. The data is sourced from alerts_errors_total, specifically focusing on filter-related errors.

Publication Errors: Tracks errors encountered when publishing alerts to sinks, using the alerts_errors_total metric. The panel groups errors by the sinks and shows "No Data" if no errors occur.

REST API Section:

REST Operation Rate: This time-series panel monitors the rate of REST API operations related to alerts. It tracks the request rate using the http_server_requests_seconds_count metric, filtered for URIs matching the /api/.* pattern.

REST Response Times: Tracks the response times for the REST API operations related to alerting, using the http_server_requests_seconds_max metric. It visualizes the slowest response times for each URI that matches the /api/.* pattern.

REST Outcomes: This panel tracks the outcomes of REST API operations related to alerting, such as successful requests or errors, grouped by their status. It uses the http_server_requests_seconds_count metric.

Instances Section:

CPU Usage (Instance-Level Panel): A gauge panel displaying the CPU usage of the alerting service, using the process_cpu_usage metric. The panel uses thresholds to indicate different CPU levels:

  • Green: Normal usage (below 20%)

  • Yellow: Warning state (20%–40%)

  • Red: Critical usage (above 40%)

Additional Features:

Annotations & Alerts: Supports annotations for key event tracking. Users can configure alerts based on metrics like alert errors, published alerts, or CPU usage.

Template Variables: Uses variables like namespace and alert_api_job to easily switch between different jobs and namespaces for tailored views.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.3. Home Dashboard

Description: This dashboard serves as the central hub for monitoring the overall health of the Data Fabric application. It provides users with quick access to error logs, API dashboards, Kafka dashboards, and alert dashboards. By aggregating key dashboards in one place, it allows users to seamlessly navigate and investigate different areas of the system’s performance. It also features a quick view of error logs from Data Fabric, helping users quickly spot issues across the environment.

Data Sources:

  • Prometheus (for dashboard lists)

  • Loki (for error logs)

Key Panels:

Error Logs FD (Top Panel): A time-series panel tracking the occurrence of errors in Data Fabric using logs from Loki. The query counts log entries that contain the term "error" over time (count_over_time). This provides a quick overview of any error spikes in the environment.
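
A LogQL query of the kind described above can be sketched as follows. The stream selector {namespace="data-fabric"} is a hypothetical label chosen for illustration; use the labels your Loki instance actually exposes. The |= filter is case-sensitive, so "Error" and "ERROR" are not matched by this form.

```python
def error_count_query(stream_selector: str = '{namespace="data-fabric"}', window: str = "5m") -> str:
    """LogQL that counts log lines containing "error" per time window."""
    return f'sum(count_over_time({stream_selector} |= "error" [{window}]))'


print(error_count_query())
```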

API Dashboards (Panel Group):

  • A dashboard list panel that provides access to API-related dashboards within Data Fabric. Key dashboards include:

  • API Gateway: Focused on monitoring the health and performance of API Gateway services.

Kafka Dashboards (Panel Group):

  • A dashboard list panel focused on monitoring Kafka clusters. Available dashboards include:

  • Kafka Cluster - Brokers Overview: Monitors the health and performance of Kafka brokers.

  • Kafka Cluster - Connections: Tracks connection metrics for Kafka.

  • Kafka Cluster - Consumer Lag: Measures the lag between Kafka consumers and producers.

  • Kafka Cluster - Topics: Displays metrics about Kafka topics.

  • Kafka Cluster - Topics Comparison: Compares various metrics between Kafka topics.

  • Kafka Cluster - Topics Overview: An overall view of all Kafka topics within the cluster.

Alert Dashboards (Panel Group):

  • A dashboard list providing access to alerting-related dashboards in Data Fabric, including:

  • Alerting: The primary dashboard for monitoring alert performance in the system.

Additional Features:

Annotations & Alerts: Supports annotations for tracking key events, though no alerts are pre-configured. Users can add alerts based on logs or API performance metrics.

Navigation: The Home Dashboard provides direct links to specific dashboards within the Data Fabric environment, including Error Logs, API Gateway, Kafka (Brokers Overview, Connections, Consumer Lag, Topics), and Alerting Dashboards. This centralized access allows users to quickly switch between dashboards to monitor different parts of the system and troubleshoot issues more efficiently.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.4. Kafka Cluster – Brokers Overview Dashboard

Description: This dashboard provides an overview of key metrics related to Kafka brokers, including the number of online brokers, controller status, partition health, and traffic rates. It allows users to monitor the health and performance of Kafka clusters in real-time and quickly identify any issues related to replication, partition availability, and broker performance. The dashboard is essential for administrators who need to maintain the stability of Kafka clusters and ensure smooth message processing.

Data Sources: Prometheus (Metrics for Kafka brokers, partitions, and controllers)

Key Panels:

Top Panel

Brokers Online: Displays the current number of Kafka brokers that are online. The panel uses Prometheus metrics (kafka_controller_kafkacontroller_activebrokercount) and is updated every 5 seconds. Thresholds are color-coded:

  • Green: 2 or more brokers online

  • Yellow: 1 broker online

  • Red: 0 brokers online

Active Controllers: Shows the number of active Kafka controllers. This panel uses kafka_controller_kafkacontroller_activecontrollercount to track the active controllers in the Kafka cluster.

Unclean Leader Election Rate: Monitors the rate of unclean leader elections in Kafka, which can indicate instability or failure in the Kafka cluster. The panel uses kafka_controller_controllerstats_uncleanleaderelectionenablerateandtimems to calculate the metric. Thresholds are color-coded for alerting purposes.

Online Partitions: Displays the number of online partitions in the Kafka cluster. The kafka_server_replicamanager_partitioncount metric is used to determine the number of partitions that are online and functioning correctly.

Under Replicated Partitions: Tracks the number of under-replicated partitions using kafka_server_replicamanager_underreplicatedpartitions. If the number of under-replicated partitions rises, it indicates that some replicas are not in sync, which can lead to data loss if a broker fails.

Offline Partitions Count: Displays the number of offline partitions in the Kafka cluster. Using the kafka_controller_kafkacontroller_offlinepartitionscount metric, it highlights any partitions that are currently unavailable.

Traffic and Performance Panels:

Messages In / Second: Tracks the number of incoming messages per second using kafka_server_brokertopicmetrics_messagesin_total. This panel shows the overall message rate being processed by the Kafka brokers.

Bytes In / Second & Bytes Out / Second: These panels track the incoming and outgoing data throughput in bytes per second. The kafka_server_brokertopicmetrics_bytesin_total and kafka_server_brokertopicmetrics_bytesout_total metrics are used to monitor the rate of data flowing into and out of Kafka brokers.

Messages In / Second / Broker & Bytes In / Second / Broker: These time-series panels provide a breakdown of message and byte throughput per individual broker. Metrics are displayed for each broker (df-kafka-0, df-kafka-1, df-kafka-2) to help identify imbalances in the load distribution.

Partitions / Broker: Shows the number of partitions per broker using the kafka_server_replicamanager_partitioncount metric. This panel helps in understanding how partitioned data is distributed across brokers.

Partition Leaders / Broker: Displays the number of partition leaders per broker, which is crucial for balancing load and ensuring the efficient handling of partitions.

Under Replicated Partitions / Broker: Tracks the number of under-replicated partitions for each broker. The panel provides a broker-wise view of replication health, helping to identify which brokers are struggling to keep their replicas in sync.

Kafka Log Size by Broker: Displays the total log size of each Kafka broker using the kafka_log_log_size metric. This helps in monitoring storage usage on each broker and ensuring enough disk space is available.

JVM Panel:

JVM Metrics (Threads): Shows JVM metrics like the number of threads per broker. This provides insight into the resource utilization of the JVM running on Kafka brokers.

Additional Features:

Annotations & Alerts: Supports annotations to track key events. Alerts can be configured for broker count, partition status, replication health, and message throughput.

Threshold-Based Coloring: Color-coded thresholds for critical metrics like broker count, CPU usage, and partition replication to highlight potential issues.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.5. Kafka Cluster – Connections Dashboard

Description: This dashboard provides an overview of Kafka cluster connections, focusing on connection counts per listener and the versions of client software connected to the brokers. It helps users to monitor Kafka’s connection health, identifying potential bottlenecks or anomalies in listener activity and client distribution across software versions.

Data Source: Prometheus

Key Panels:

Connection Count / Listener (Time-Series Panel): This panel shows the count of active connections per listener (CONTROLPLANE-9090, PLAIN-9092, REPLICATION-9091) over time. The data is sourced from the kafka_server_socket_server_metrics_connections_software metric, grouping by listener. The panel visualizes the last and maximum values for each listener, allowing for real-time monitoring of connection patterns.

Client Versions (Bar Gauge Panel): This panel displays the active client versions connected to the Kafka brokers, visualizing how many connections are made by each Kafka client version. The data is sourced from the same metric, grouped by clientSoftwareName and clientSoftwareVersion. It allows the user to track the distribution of client software across the cluster, which helps identify the most common client versions in use.
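
Both panels above group the same connection series by different labels. The Python sketch below shows what such a grouping does to a set of labelled series. The connection counts and the client name and version values are hypothetical; only the label names (listener, clientSoftwareName, clientSoftwareVersion) and listener names come from the panel descriptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One series: (labels, current connection count). Values are made up.
Series = Tuple[Dict[str, str], float]


def sum_by(series: List[Series], *labels: str) -> Dict[Tuple[str, ...], float]:
    """Sum series values per unique combination of the given labels."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for series_labels, value in series:
        key = tuple(series_labels.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


connections: List[Series] = [
    ({"listener": "PLAIN-9092", "clientSoftwareName": "apache-kafka-java",
      "clientSoftwareVersion": "3.6.0"}, 12.0),
    ({"listener": "PLAIN-9092", "clientSoftwareName": "apache-kafka-java",
      "clientSoftwareVersion": "3.4.0"}, 3.0),
    ({"listener": "REPLICATION-9091", "clientSoftwareName": "apache-kafka-java",
      "clientSoftwareVersion": "3.6.0"}, 4.0),
]

per_listener = sum_by(connections, "listener")
per_client = sum_by(connections, "clientSoftwareName", "clientSoftwareVersion")
print(per_listener)
print(per_client)
```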

Additional Features:

Annotations & Alerts: Supports annotations to track key events. Alerts can be configured based on connection metrics, listener activity, and client versions.

Template Variables: Allows users to select specific listeners and client versions to filter and focus the view.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.6. Kafka Cluster – Consumer Lag Dashboard

Description: This dashboard provides an in-depth view of consumer lag within a Kafka cluster. It helps users monitor lag at both the topic and partition levels, track current offsets, and evaluate the lag per consumer group. The dashboard is designed to help pinpoint issues that may affect data consumption and processing performance.

Data Source: Prometheus

Key Panels:

Consumer Lag By Topic (Time-Series Panel): This panel shows the consumer lag for different Kafka topics over time. The data is sourced from the kafka_consumergroup_lag metric, grouped by topic and consumer group. It allows users to identify which topics are experiencing the most lag, providing insights into where delays in message consumption are occurring.

Consumer Lag By Partition (Time-Series Panel): This panel displays the consumer lag for different Kafka partitions. The data is sourced from the same kafka_consumergroup_lag metric, but it breaks down the lag at the partition level. This enables users to monitor lag on a more granular scale, helping to identify specific partitions that may be causing issues.

Consumer Lag Table (Table Panel): This panel provides a detailed table view of consumer lag, showing the topic, partition, current offset, and lag by offset. The table is essential for users who need to track exact lag metrics and see how much delay each partition is experiencing.

Consumer Lag By Group (Bar Gauge Panel): This panel visualizes consumer lag grouped by consumer group. It provides a comparative view of how different consumer groups are performing in terms of lag, helping to quickly identify groups that may be lagging.
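
Consumer lag for a partition is the difference between the latest offset written to the partition and the offset the consumer group has reached. The Python sketch below reproduces the table and the per-topic and per-group roll-ups on hypothetical data. The consumer group names and offsets are invented, and the field names are illustrative rather than taken from the exporter.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PartitionOffsets(NamedTuple):
    group: str
    topic: str
    partition: int
    current_offset: int   # offset the consumer group has reached
    log_end_offset: int   # latest offset written to the partition


def partition_lag(row: PartitionOffsets) -> int:
    """Lag in messages for one partition, never negative."""
    return max(row.log_end_offset - row.current_offset, 0)


def lag_by(rows: List[PartitionOffsets], field: str) -> Dict[str, int]:
    """Total lag grouped by a string field such as 'topic' or 'group'."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[getattr(row, field)] += partition_lag(row)
    return dict(totals)


rows = [
    PartitionOffsets("indexer", "df.meilisearch.datasets", 0, 950, 1000),
    PartitionOffsets("indexer", "df.meilisearch.datasets", 1, 400, 400),
    PartitionOffsets("auditor", "df.meilisearch.datasources", 0, 70, 100),
]

print(lag_by(rows, "topic"))
print(lag_by(rows, "group"))
```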

Additional Features:

Annotations & Alerts: Supports annotations for tracking key events. Users can configure custom alerts based on consumer lag metrics such as lag by partition, topic, or consumer group.

Detailed Consumer Lag Monitoring: Provides in-depth monitoring of consumer lag at the topic, partition, and group levels, helping to quickly identify issues.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.7. Kafka Cluster – Topics Dashboard

Description: This dashboard provides insights into the consumption metrics of Kafka topics, including topic size, partition count, message count, and I/O data rates. It is designed to help users monitor topic performance, assess data flow, and identify potential issues such as under-replicated partitions.

Data Source: Prometheus

Key Panels:

Topic Size (Stat Panel): Displays the total size of the selected Kafka topic in megabytes (MB). The value updates based on the selected time range (default 24 hours). Helps to monitor data growth for each topic.

Partition Count (Stat Panel): Shows the total number of partitions for the selected topic. Kafka partitions provide concurrency, and tracking their count can help assess the distribution of data.

Current Message Count (Stat Panel): This panel displays the current count of messages in the topic, providing a snapshot of how much data is currently present.

Total Messages Produced (Stat Panel): Displays the total number of messages produced to the topic over time. It is calculated based on the Kafka log end offset metric grouped by partition, providing visibility into the throughput of the topic.

Messages In / Sec (Stat Panel): Displays the rate of messages being produced to the selected topic in real-time (messages per second). It helps assess the load being handled by the topic.

Bytes In / Sec (Stat Panel): Shows the rate at which data is being ingested into the topic, measured in bytes per second. This metric helps to track the inbound data flow to the topic.

Bytes Out / Sec (Stat Panel): Displays the rate of outbound data from the topic in bytes per second, which reflects how much data is being consumed from the topic. The panel shows "N/A" when no data is currently being read from the topic.

Messages In / Second (Time-Series Panel): A time-series graph showing the rate of messages being ingested into the topic over the selected time frame (e.g., last 24 hours). This graph provides a visual indication of message ingestion trends and spikes.

Bytes In / Second (Time-Series Panel): A time-series graph displaying the rate of inbound data (bytes per second) over time, helping users observe changes in data flow.

Bytes Out / Second (Time-Series Panel): A time-series graph displaying the rate of outbound data (bytes per second) over time. The panel shows no data when nothing is read from the topic during the selected time frame.

Additional Features:

Annotations & Alerts: Annotations are supported to track key events. Users can configure custom alerts for critical metrics like message rates, data throughput, and partition replication status.

Template Variables: Provides a dropdown to select specific Kafka topics, allowing for easy comparison of metrics across topics.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

10.8. Kafka Cluster – Topics Comparison Dashboard

Description: This dashboard offers a comparative overview of key metrics across different Kafka topics, such as message rates and byte throughput. It enables users to analyze message production and consumption patterns, helping to identify anomalies or underutilized topics over the selected time range.

Data Source: Prometheus

Key Panels:

Messages In / Sec (Stat Panel): Displays the rate of messages being produced to the selected topic in real-time (messages per second). It allows users to assess the load being handled by the topic and monitor message throughput over time.

Bytes In / Sec (Stat Panel): Shows the rate at which data is being ingested into the topic, measured in bytes per second. This panel provides insights into the inbound data flow to the topic.

Bytes Out / Sec (Stat Panel): Displays the rate of data being consumed from the topic in bytes per second. The panel shows "N/A" when there is no outbound data during the selected time period.

Messages In / Second (Time-Series Panel): A time-series graph that tracks the rate of messages being produced to the topic over time. It provides a detailed look at message throughput, helping users visualize spikes or drops in activity.

Bytes In / Second (Time-Series Panel): A time-series graph displaying the rate of inbound data to the topic over the selected timeframe. This graph helps users understand how the data volume changes over time.

Bytes Out / Second (Time-Series Panel): A time-series graph that tracks the rate of outbound data from the topic. The panel shows no data when nothing has been consumed from the selected topic during the observed period.

Additional Features:

Annotations & Alerts: Supports annotations to track key events related to Kafka topics or infrastructure changes. These annotations help correlate Kafka metrics with external factors that might impact performance. Although no alerts are pre-configured, users can add alerts to monitor thresholds such as message throughput or byte consumption rates.

Topic Selection Dropdown: This dashboard includes a dynamic topic selection dropdown that allows users to select and compare multiple Kafka topics, such as __consumer_offsets, df.meilisearch.datasources, and others. The selected topics determine the data shown in the time-series and stat panels, enabling users to switch between topics to analyze and monitor performance.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

Dynamic Data Updates: All panels, including "Messages In / Sec," "Bytes In / Sec," and "Bytes Out / Sec," update dynamically based on the selected topic and time range. This enables continuous monitoring of Kafka performance metrics without needing manual refreshes.

10.9. Kafka Cluster – Topics Overview Dashboard

Description: This dashboard provides an overview of Kafka topics, offering insights into message rates, data flow, disk storage, and partition replication. It helps track Kafka topic activity and ensures fault tolerance by monitoring under-replicated partitions.

Data Source: Prometheus

Key Panels:

Topics (Stat Panel): Displays the total number of Kafka topics in the cluster. Useful for keeping track of the overall topic count.

_strimzi Topics (Stat Panel): Shows the number of Strimzi-managed topics, which can help monitor and manage topics created by the Strimzi Kafka operator.

All Topics (Stat Panel): Indicates the total number of topics across all categories, providing a complete view of topics being tracked.

Disk Storage (Stat Panel): Displays the total disk space used by Kafka topics, excluding Strimzi-managed topics, helping monitor overall storage consumption.

_strimzi Disk Storage (Stat Panel): Shows the disk usage for Strimzi-managed topics, which is useful for understanding the storage footprint of topics managed by the Strimzi operator.

All Disk Storage (Stat Panel): Displays the total disk space occupied by all Kafka topics, both system and user-managed, providing an overview of storage usage in the cluster.

Messages In / Second / Topic (Time-Series Panel): A time-series graph tracking the rate of messages produced per topic over time. This helps visualize throughput per topic, enabling users to detect spikes or anomalies in message production rates.

Bytes In / Second / Topic (Time-Series Panel): Monitors the rate of inbound data for each Kafka topic over time. This helps users understand the data flow for each topic, aiding in capacity planning and identifying bottlenecks.

Bytes Out / Second / Topic (Time-Series Panel): Tracks the outbound data rate per topic, indicating data consumption activity. This helps ensure topics are being consumed as expected and that consumers are working effectively.

Local Storage – Log Size by Topic (Time-Series Panel): A graph showing log size by topic over time. This panel helps identify topics that are consuming significant storage, allowing users to manage disk usage and plan for storage expansion.

Under-Replicated Partitions by Topic (Time-Series Panel): Displays the number of under-replicated partitions per topic, which is crucial for Kafka’s fault tolerance. It helps monitor if any partitions are not meeting the replication factor, highlighting potential data redundancy issues.
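
A partition counts as under-replicated when fewer replicas are in sync than are assigned to it. The Python sketch below shows that check and the per-topic count the last panel plots. The broker IDs and partition assignments are hypothetical, and the data structure is an illustration rather than the format of any Kafka API.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    replicas: List[int]          # broker IDs assigned to the partition
    in_sync_replicas: List[int]  # broker IDs currently in sync


def is_under_replicated(state: PartitionState) -> bool:
    """True when at least one assigned replica is not in sync."""
    return len(state.in_sync_replicas) < len(state.replicas)


def under_replicated_by_topic(states: List[PartitionState]) -> Dict[str, int]:
    """Count under-replicated partitions per topic (0 for healthy topics)."""
    counts: Dict[str, int] = defaultdict(int)
    for state in states:
        counts[state.topic] += int(is_under_replicated(state))
    return dict(counts)


states = [
    PartitionState("df.meilisearch.datasets", 0, [0, 1, 2], [0, 1, 2]),
    PartitionState("df.meilisearch.datasets", 1, [1, 2, 0], [1, 2]),
    PartitionState("df.meilisearch.datasources", 0, [2, 0, 1], [2, 0, 1]),
]

print(under_replicated_by_topic(states))
```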

Additional Features:

Annotations & Alerts: Supports annotations for tracking key events. Alerts can be set for metrics like under-replicated partitions, message rates, and log size.

Topic Filtering: The dashboard provides a dropdown to select specific Kafka topics for focused monitoring, allowing users to filter data by topics like df.meilisearch.datasets.

Time Selection & Auto-Refresh: Time range selector and auto-refresh features available.

Detailed Legends for Time-Series Panels: Each time-series panel comes with detailed legends that show real-time metrics such as "last" and "max" values for topics. This makes it easy to compare the relative activity of different topics and identify anomalies.

11. Appendices

11.1. Common Data Source Queries

Here are some additional example queries to get the most out of the Data Fabric’s integration with Prometheus and Loki:

Total CPU Usage for a Service:

sum(rate(process_cpu_seconds_total{job="your-service"}[5m]))

Memory Consumption by Namespace:

sum(container_memory_usage_bytes{namespace="data-fabric"}) by (pod)

Error Rate for a Specific API Endpoint:

rate(http_requests_total{status_code=~"5..", job="api-gateway", route="/v1/pipelines"}[5m])
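
The queries above can also be run outside Grafana against the Prometheus HTTP API, which is useful for checking a query before building a panel around it. The Python sketch below uses the instant-query endpoint (/api/v1/query). The base URL is a placeholder and must be replaced with the address of the Prometheus instance in your environment; run_query is only defined here, not called, because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder address: replace with your Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]


cpu_query = 'sum(rate(process_cpu_seconds_total{job="your-service"}[5m]))'
url = build_query_url(PROMETHEUS_URL, cpu_query)
print(url)
```

Calling run_query(PROMETHEUS_URL, cpu_query) against a reachable instance returns the matching series with their current values.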

11.2. Importing Dashboards in Grafana

You can easily import dashboards from JSON files or from Grafana.com to replicate pre-built dashboards.

How to Import a Dashboard:

  1. On the Dashboards page, click the New button at the top right.

  2. Select Import from the dropdown menu.

You will have three options:

  • Upload JSON File: Upload the JSON file of a dashboard you have saved.

  • Import via grafana.com: Provide the URL or ID of the dashboard from Grafana’s public repository.

  • Import via Panel JSON: Paste a JSON configuration for a specific panel to add it to your dashboard.
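
For the Upload JSON File option, the file is a dashboard definition in Grafana's JSON model. The Python sketch below generates a minimal single-panel definition. It is an illustrative skeleton, not an export from the Data Fabric dashboards: the title, schemaVersion value, and panel layout are assumptions, and a real export usually contains many more fields.

```python
import json


def build_dashboard(title: str, expr: str) -> dict:
    """Return a minimal dashboard definition with one time-series panel."""
    return {
        "title": title,
        "schemaVersion": 36,  # assumption: adjust to your Grafana version
        "time": {"from": "now-24h", "to": "now"},
        "refresh": "30s",
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": "Messages In / Second",
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                "targets": [{"refId": "A", "expr": expr}],
            }
        ],
    }


dashboard = build_dashboard(
    "Kafka Example (hypothetical)",
    "sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))",
)
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Saving the printed output to a .json file gives a file that can be selected in the import dialog; the data source for the panel is chosen during or after the import.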

11.3. Customizing Grafana with Variables

Data Fabric supports the use of template variables for dynamic dashboards. Below are examples of how to configure template variables for more flexible monitoring.

Example: Variable for Kafka Topics:

  1. Navigate to Dashboard Settings > Variables.

  2. Add a new variable:

    • Name: topic

    • Query: label_values(kafka_log_log_size, topic)

Use this variable in your queries:

Example: sum(kafka_log_log_size{topic="$topic"})

Example: Variable for Data Sources:

  • Add a new variable for switching data sources dynamically:

    • Name: datasource

    • Type: Datasource

    • Query: Prometheus or Loki

Select $datasource as the data source of a panel to switch between data sources from the dashboard's variable dropdown, without editing the panel.
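
Before a query is sent to the data source, Grafana replaces each variable reference with the value selected in the dropdown. The Python sketch below is a simplified model of that substitution for single-value variables, intended only to show what the final query looks like. It does not cover multi-value variables or Grafana's formatting options, and the interpolate function is hypothetical, not part of Grafana.

```python
import re
from typing import Dict

# Matches ${name} or $name; unknown names are left untouched.
_VARIABLE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def interpolate(query: str, variables: Dict[str, str]) -> str:
    """Replace $name and ${name} with the selected variable values."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        return str(variables.get(name, match.group(0)))

    return _VARIABLE.sub(replace, query)


selected = {"topic": "df.meilisearch.datasets"}
print(interpolate('sum(kafka_log_log_size{topic="$topic"})', selected))
```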

11.4. Alerting: Setting Up a Webhook Alert in Data Fabric

In this section, we will guide you through the steps of setting up a webhook alert in Grafana within the Data Fabric environment. Webhook alerts are useful for sending notifications to external systems like chat applications, issue trackers, or custom monitoring services when an alert condition is triggered.

Prerequisites:

  • Access to Grafana: Ensure you have access to the Grafana instance within Data Fabric.

  • Webhook URL: You will need the URL of the webhook endpoint where you want to send alerts.

  • Authentication (if required): If the webhook requires authentication, have the necessary credentials (e.g., API key or token) ready.

Step 1: Create a New Notification Channel for Webhooks:

Login to Grafana: Using your Keycloak credentials, log in to Grafana in the Data Fabric environment.

Navigate to Notification Channels:

  1. From the Grafana homepage, click the ⚙️ (gear icon) on the left sidebar.

  2. Under the Alerting section, select Notification Channels.

Create a New Notification Channel:

  1. Click the + New Channel button at the top right of the Notification Channels page.

  2. Configure the Webhook Notification Channel:

    • In the Name field, enter a name for the webhook notification channel (e.g., "Data Fabric Webhook").

    • In the Type dropdown, select Webhook.

    • Webhook URL: In the URL field, enter the webhook endpoint where the alert notifications should be sent (e.g., https://your-webhook-service.com/alert).

    • If your webhook requires authentication, add any necessary headers by clicking Add Custom HTTP Header and entering the key (e.g., Authorization) and the value (e.g., Bearer your-token-here).

Test the Webhook (optional but recommended):

  1. Click Test to send a test notification to the webhook URL.

  2. Verify that the external service receives the notification and processes it as expected.

Save the Notification Channel:

  1. Scroll down and click Save to finish setting up the webhook notification channel.
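
To try the Test button without a real external service, a small local receiver is enough. The Python sketch below accepts POST requests, checks the Authorization header, and prints a one-line summary. It assumes the JSON body carries fields such as title, ruleName, state, and message; verify the actual payload sent by your Grafana version. The token and port are placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "your-token-here"  # placeholder, must match the header value


def summarize_alert(payload: dict) -> str:
    """Build a one-line summary from an alert payload (assumed fields)."""
    title = payload.get("title") or payload.get("ruleName") or "untitled alert"
    summary = f"[{payload.get('state', 'unknown')}] {title}"
    message = payload.get("message")
    return f"{summary}: {message}" if message else summary


class AlertHandler(BaseHTTPRequestHandler):
    def _reply(self, status: int) -> None:
        self.send_response(status)
        self.end_headers()

    def do_POST(self) -> None:
        if self.headers.get("Authorization") != f"Bearer {EXPECTED_TOKEN}":
            self._reply(401)  # missing or wrong token
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self._reply(400)  # body is not a JSON object
            return
        print(summarize_alert(payload))
        self._reply(200)


def serve(port: int = 8080) -> None:
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the receiver on port 8080; the webhook URL in the notification channel then points at that host and port.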

Step 2: Configure Alerts for a Panel:

Open the Dashboard: Navigate to the dashboard you want to monitor. For example, if you want to monitor CPU usage, open the CPU Usage dashboard.

Select a Panel for Alerting:

  1. In the dashboard, identify the panel where you want to set up the alert (e.g., a graph tracking CPU usage).

  2. Click the panel title, then select Edit from the dropdown menu.

Create a New Alert:

  1. In the panel editor, navigate to the Alert tab.

  2. Click Create Alert.

Define the Alert Condition:

  1. Evaluation Interval: Set how often the alert rule is evaluated, for example every minute.

  2. Alert Condition: Define the condition for triggering the alert (e.g., WHEN avg() OF query (A, 5m) IS ABOVE 80). This will trigger an alert when the average CPU usage over a 5-minute window exceeds 80%.

  3. Adjust the time range and query if necessary.

Define Alert Behavior:

  1. Alert States: A Grafana alert is in one of three main states: OK, Pending, or Alerting. Set how long the condition must persist (the For setting) before the alert moves from Pending to Alerting.

    • Example: If CPU usage remains above 80% for 2 minutes, trigger the alert.
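
The interaction between the threshold and the persistence period can be modelled as a small state machine. The Python sketch below is a simplified illustration of the OK, Pending, and Alerting transitions for a threshold of 80 held for 2 minutes. It is not Grafana's evaluation code, and the readings are invented.

```python
from typing import Iterable, List, Optional, Tuple


def evaluate_states(
    readings: Iterable[Tuple[int, float]],
    threshold: float = 80.0,
    for_seconds: int = 120,
) -> List[str]:
    """Return the alert state after each (timestamp, value) evaluation."""
    states: List[str] = []
    breach_started: Optional[int] = None
    for timestamp, value in readings:
        if value <= threshold:
            breach_started = None  # condition cleared, reset the timer
            states.append("OK")
            continue
        if breach_started is None:
            breach_started = timestamp
        held_for = timestamp - breach_started
        states.append("Alerting" if held_for >= for_seconds else "Pending")
    return states


# Hypothetical CPU readings (percent), evaluated once per minute.
cpu = [(0, 70.0), (60, 85.0), (120, 90.0), (180, 92.0), (240, 60.0)]
print(evaluate_states(cpu))
```

In this model a short spike that clears before the persistence period elapses only ever reaches Pending, so no notification would be sent for it.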

Add Notification Channel:

  1. Under the Notifications section, click Add Notification Channel.

  2. Select the webhook notification channel you created earlier from the dropdown list (e.g., "Data Fabric Webhook").

  3. Optionally, you can add more notification channels (e.g., email, Mattermost) if you’d like to receive alerts through multiple methods.

Customizing the Alert Message (optional):

  1. You can customize the message that is sent to the webhook. Click Edit Message and enter your custom text, including variables like ${metric} to include dynamic values.

Step 3: Testing and Saving the Alert:

Test the Alert:

  1. Once the alert is set up, scroll down to the bottom of the panel editor and click Test Rule to evaluate the alert condition.

  2. This will run the alert condition and trigger a notification if the condition is met. You should verify that the webhook received the alert.

Save the Panel:

  1. Once satisfied with the alert setup, click Apply to save the alert configuration.

  2. Save the dashboard by clicking Save Dashboard to ensure the changes persist.

Step 4: Monitoring Alerts:

Monitoring Active Alerts:

  1. Navigate to the Alerting section in the Grafana menu to see the status of your alerts.

  2. Here, you can monitor all active alerts, check which conditions are being evaluated, and view their current state (OK, Pending, or Alerting).

Viewing Alert History:

  1. Click on Alert History in the same section to see past alert notifications and check if alerts were delivered as expected. This section provides insight into when and why alerts were triggered, helping you understand system behavior and assess whether alert conditions are configured appropriately.

Modifying Alerts:

  1. To make adjustments to an existing alert, return to the relevant panel and open the Alert tab in the panel settings.

  2. Adjust the conditions, notification channels, or alert thresholds as needed. Ensure you save changes to both the panel and the dashboard.

Disabling Alerts:

  1. If you need to temporarily disable an alert without deleting it, navigate to the Alert tab and toggle the enable/disable switch. This allows you to retain the alert configuration while preventing notifications from being sent.

11.5. Custom Query Examples for Advanced Users

This section provides a series of custom Prometheus and Loki queries that advanced users can leverage to enhance their Grafana dashboards in the Data Fabric environment.

Prometheus Queries:

Track Memory Usage Across All Pods in a Namespace:

`sum(container_memory_usage_bytes{namespace="data-fabric"}) by (pod)`

Monitor API Error Rates by Endpoint:

`rate(http_requests_total{status_code=~"5..", job="api-gateway", route="/v1/endpoint"}[5m])`

CPU Usage by Instance:

`avg(rate(container_cpu_usage_seconds_total{image!="", namespace="data-fabric"}[1m])) by (instance)`

Loki Queries:

Search for Specific Log Entries Containing Errors:

`{namespace="data-fabric"} |= "ERROR"`

Count of Errors Over a Time Range:

`sum(count_over_time({namespace="data-fabric"} |= "ERROR" [5m]))`

These queries can be further customized to fit specific monitoring requirements by adjusting filters, labels, and aggregation functions.
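
As a rough mental model of the two Loki queries, the line filter keeps only log lines containing a string, and count_over_time counts the remaining lines inside a time window. The Python sketch below mimics both steps on in-memory data. The log lines and timestamps are hypothetical, and the functions only imitate the behavior; they are not part of any Loki client.

```python
from typing import List, Tuple

LogEntry = Tuple[float, str]  # (timestamp in seconds, log line)


def line_filter(entries: List[LogEntry], needle: str) -> List[LogEntry]:
    """Keep entries whose line contains the needle, like |= "needle"."""
    return [(ts, line) for ts, line in entries if needle in line]


def count_over_time(entries: List[LogEntry], now: float, window_seconds: float) -> int:
    """Count entries with a timestamp in the window (now - window, now]."""
    start = now - window_seconds
    return sum(1 for ts, _ in entries if start < ts <= now)


logs: List[LogEntry] = [
    (10.0, "INFO pipeline started"),
    (200.0, "ERROR failed to connect to broker"),
    (450.0, "ERROR timeout while writing batch"),
    (480.0, "INFO retry succeeded"),
]

errors = line_filter(logs, "ERROR")
print(count_over_time(errors, now=500.0, window_seconds=300.0))
```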