JupyterHub
1. Introduction
Purpose: A comprehensive guide to understanding and using JupyterHub within Raft’s Data Fabric.
Target Audience: Data Stewards, Data Analysts
Scope: This documentation provides detailed guidance on using JupyterHub to work with data sources within Data Fabric.
2. Overview
Capability Description: JupyterHub is a scalable, multi-user platform for creating and running Jupyter notebooks on a shared Data Fabric server. Through the browser, users can store, access, and stream their data, making JupyterHub a powerful tool for data-driven projects.
Key Features:
- Interactive Data Analysis: Provides access to Jupyter notebooks for interactive data exploration, visualization, and machine learning. This is especially useful in air-gapped environments, where Data Fabric’s JupyterHub helps solve the problem of finding, accessing, and analyzing data.
- Libraries: JupyterHub allows Python libraries to be installed on the fly if there is internet connectivity. In air-gapped environments, Data Fabric can instead be connected to CDAO’s Python library repository, where most (if not all) common Python libraries are copied, so on-the-fly library installation works even without internet access (see the sketch after this list).
- Multi-user Environment: Supports multiple users with individual environments, enabling collaboration while maintaining data privacy.
- Data Storage and Streaming: Users can store, access, and stream data through their Jupyter environment, making it ideal for large-scale data processing tasks on Data Fabric’s scalable platform.
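As a quick illustration of the library workflow, the cell below shows how pip is typically used from a notebook. The package name (plotly) and the internal index URL are placeholders rather than Data Fabric specifics; in many deployments the mirror is already pre-configured, so a plain pip install works without extra flags.

```python
# Run in a notebook cell. With internet connectivity, install a library directly:
!pip install plotly

# In an air-gapped deployment, pip can be pointed at the internal library mirror instead.
# The index URL below is a placeholder supplied by your administrator; it may already be
# configured for you, in which case the plain command above is sufficient.
!pip install plotly --index-url https://<internal-pypi-mirror>/simple
```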
4. Usage
The following example notebooks are available to help users get started with interacting with Data Fabric data sources:
- kafka_consumer.ipynb (Kafka consumer example)
- kafka_producer.ipynb (Kafka producer example)
- object_store.ipynb (MinIO example)
- trino.ipynb (Trino SQL query example)
4.1. Run Example Notebooks
4.1.1. Kafka Producer and Consumer Interaction
The kafka_producer.ipynb and kafka_consumer.ipynb notebooks are complementary, demonstrating a typical Kafka message flow within Data Fabric. The producer sends messages to a Kafka topic, and the consumer listens to that topic and processes the messages in real-time.
Why Use It? These notebooks are helpful when you need to simulate a real-time data pipeline. They let you verify that messages are being sent and received properly, ensuring the data flow through Kafka works as expected.
Example Use Case
Imagine you’re working on a system that streams real-time data, like sensor readings or log files:
1. Send your data (sensor readings, logs, etc.) to a Kafka topic using the producer notebook.
2. Consume the data in real-time with the consumer notebook, ensuring that the data is flowing correctly and can be processed further.
Steps to Run the Producer and Consumer Notebooks (a minimal code sketch follows at the end of this section):
- Start the Kafka Consumer:
  - First, run the kafka_consumer.ipynb notebook to start listening for messages on the Kafka topic.
  - The consumer is configured to subscribe to test-topic. It uses the python-consumer group ID to manage offset tracking and allow parallel consumption with other consumers if necessary.
  - Run the first cell to set up the configuration, environment variables, and logging.
  - Run the subsequent cells to start the Kafka consumer. The consumer remains active and continuously listens for new messages on test-topic. You should see log output indicating the consumer is connected and ready to receive messages.
  - Example output: Listening for messages on 'test-topic'...
- Produce Messages with the Kafka Producer:
  - Next, open and run the kafka_producer.ipynb notebook. This notebook generates and sends messages to the same Kafka topic (test-topic), which the consumer is already listening to.
  - Run the first cell to set up the environment and configuration.
  - Generate messages: in the subsequent cells, a loop generates 10 messages by default and sends them to test-topic, producing a confirmation log for each message. You can customize the number of messages by changing the number_of_messages_to_generate variable.
Switch Back to the Kafka Consumer:
-
Once the producer starts sending messages, switch back to the
kafka_consumer.ipynb
notebook to observe the incoming messages.-
The consumer will print each message it receives from the Kafka topic.
-
You’ll see logs similar to the following as the messages are consumed:
-
-
-
Logging and Error Handling:
-
Both notebooks implement logging, which helps you monitor their operation. If there are any errors in producing or consuming messages, they will be captured and displayed.
-
The Kafka consumer logs errors if there are issues with message fetching, connection problems, or configuration mismatches.
-
The Kafka producer logs errors if it fails to send messages or connect to the Kafka server.
-
-
Congratulations! You just produced and consumed messages with Kafka running on Data Fabric!
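For reference, the sketch below condenses what the producer and consumer cells do into a single script using the confluent_kafka client. The broker address is a placeholder, and the real notebooks pull connection and security settings from environment variables, so treat this as an illustration of the message flow rather than a copy of the notebook code.

```python
from confluent_kafka import Consumer, Producer

# Placeholder broker address; the notebooks read connection and security settings
# from environment variables instead of hard-coding them.
conf = {"bootstrap.servers": "kafka.data-fabric.local:9092"}

# Producer side: send a handful of messages to test-topic.
producer = Producer(conf)

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    print(f"Delivery failed: {err}" if err else f"Delivered to {msg.topic()} [{msg.partition()}]")

number_of_messages_to_generate = 10
for i in range(number_of_messages_to_generate):
    producer.produce("test-topic", value=f"message {i}".encode("utf-8"), callback=delivery_report)
producer.flush()

# Consumer side: join the python-consumer group, subscribe to test-topic, and poll for messages.
consumer = Consumer({**conf, "group.id": "python-consumer", "auto.offset.reset": "earliest"})
consumer.subscribe(["test-topic"])
print("Listening for messages on 'test-topic'...")
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"Received: {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```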
4.1.2. MinIO and Kafka Avro Producer Example
This example notebook demonstrates how to interact with MinIO and produce Kafka messages using Avro schemas. It shows the integration between the MinIO object storage service and Kafka for producing messages with schema validation.
Why Use It? This notebook is useful when you need to store data in MinIO and send it as structured messages to Kafka. It ensures that the data conforms to a specific format, which is important for systems that rely on consistent and reliable data, like real-time analytics or event-driven processing.
Example Use Case
Suppose you’re a data engineer working on a project that processes sensor data:
1. Store sensor data in MinIO as JSON files.
2. Use the notebook to generate an Avro schema from this data.
3. Send the data to Kafka, where other services can pick it up for further analysis or processing.
Steps to Use the Notebook:
Below is the function of each cell, in order; a minimal sketch of the MinIO upload and schema-generation steps follows this list.
- Connecting to MinIO: The first cell establishes a connection to MinIO using environment variables for the access key and secret key. Users can manage MinIO buckets and objects once the connection is established.
- Listing MinIO Buckets: This cell lists all available MinIO buckets in the environment. Users can check whether their buckets are accessible.
- Uploading an Object to MinIO: In this cell, the notebook uploads a file (e.g., example.json) to a specific bucket (inbox-public) in MinIO. This is useful for storing data that will be used later or shared with other services.
- Listing Objects in MinIO: After uploading the file, this cell lists all objects in the specified MinIO bucket, confirming the file was uploaded successfully.
- Generating Avro Schema: These cells define the logic for generating an Avro schema based on the data being sent to Kafka. The notebook dynamically creates a schema that ensures all messages conform to a standardized format.
- Validating Missing Fields: In this cell, the notebook checks whether any fields in the data are missing from the generated schema, ensuring that all fields are accounted for.
- Configuring Kafka Producer: The Kafka producer is configured in this cell with the necessary security settings and the topic to which messages will be sent.
- Producing Messages to Kafka: This cell produces messages to Kafka using the Avro schema generated earlier. The messages are sent to the Kafka topic specified in the configuration.
- Configuring Kafka Settings: In this cell, environment variables like KAFKA_USERNAME and KAFKA_PASSWORD are set up to ensure the connection to Kafka is properly authenticated.
- Producing More Kafka Messages: Additional messages are produced to Kafka in this cell. Users can modify the fields to produce different data.
- Listing Objects in MinIO: This cell lists objects in the inbox-public bucket again to ensure everything remains accessible after Kafka message production.
- Retrieving and Reading an Object from MinIO: Finally, this cell retrieves and reads the contents of the file (example.json) from MinIO, decoding the JSON data to verify that it was stored correctly.
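The sketch below illustrates the core of this flow under a few assumptions: the MinIO endpoint and credential variable names (MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY) are placeholders, the sensor record is made up, and the schema-generation logic is a simplified stand-in for the notebook’s own code. The bucket (inbox-public) and object name (example.json) come from the steps above.

```python
import io
import json
import os

from minio import Minio

# Placeholder endpoint and credential variable names; the notebook reads its own settings.
client = Minio(
    os.environ.get("MINIO_ENDPOINT", "minio.data-fabric.local:9000"),
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=True,
)

# Upload an illustrative JSON record to the inbox-public bucket as example.json.
record = {"sensor_id": "sensor-001", "temperature": 21.4, "timestamp": "2024-01-01T00:00:00Z"}
payload = json.dumps(record).encode("utf-8")
client.put_object("inbox-public", "example.json", io.BytesIO(payload), length=len(payload))

# Derive a simple Avro schema from the record's Python types (a simplified stand-in for
# the notebook's dynamic schema generation).
AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}
schema = {
    "type": "record",
    "name": "SensorReading",
    "fields": [{"name": key, "type": AVRO_TYPES.get(type(value), "string")} for key, value in record.items()],
}

# Validate that no field in the data is missing from the generated schema.
missing = set(record) - {field["name"] for field in schema["fields"]}
print(json.dumps(schema, indent=2))
print("Missing fields:", missing or "none")
```

Producing the Avro-encoded messages to Kafka then follows the same producer pattern shown in the previous section, with the schema applied by the serializer configured in the notebook.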
4.1.3. Object Store Example
The object_store.ipynb example notebook demonstrates how to interact with MinIO for uploading files and listing objects in a MinIO bucket. It helps users manage their data in an S3-compatible object storage system within the Data Fabric.
Why Use It? This notebook is useful when you need to store and manage files in MinIO. It allows users to easily upload files (e.g., datasets, configuration files) to a shared storage system, verify the contents of MinIO buckets, and ensure data is safely stored for future use or processing.
Example Use Case
Imagine you are working on a project where you need to store documentation, datasets, or configuration files:
- Upload your file (e.g., getting-started.md) to a MinIO bucket (inbox-public) for storage and easy retrieval later.
- List objects in the MinIO bucket to verify the upload and check the other available files.
- Use the uploaded file in future data analysis tasks, configuration processes, or any other project-related activities.
Steps to Use the Notebook (a minimal code sketch follows this list):
- Connecting to MinIO: The first cell establishes a connection to MinIO using the Minio client. The access key and secret key are retrieved from environment variables to securely connect to MinIO. Once connected, users can interact with the MinIO buckets to upload, retrieve, and manage files.
- Listing MinIO Buckets: This cell lists all available buckets in MinIO, allowing users to check that the connection is successful and see which storage buckets are accessible for file uploads and management.
- Uploading an Object to MinIO: In this cell, the notebook uploads a file (getting-started.md) to the inbox-public bucket. This is useful for storing data files, configuration files, or any other data that needs to be shared or processed later.
- Listing Objects in MinIO: After uploading the file, this cell lists all objects in the inbox-public bucket. This confirms that the file has been successfully uploaded and gives users an overview of all the files stored in the bucket.
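A minimal sketch of these four cells is shown below. The MinIO endpoint and credential environment variable names are placeholders, and a local getting-started.md file is assumed to exist next to the notebook.

```python
import os

from minio import Minio

# Placeholder endpoint and credential variable names; the notebook reads its own settings
# from environment variables.
client = Minio(
    os.environ.get("MINIO_ENDPOINT", "minio.data-fabric.local:9000"),
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=True,
)

# List all buckets to confirm the connection works.
for bucket in client.list_buckets():
    print(bucket.name)

# Upload a local file to the inbox-public bucket.
client.fput_object("inbox-public", "getting-started.md", "getting-started.md")

# List the objects in the bucket to confirm the upload.
for obj in client.list_objects("inbox-public"):
    print(obj.object_name, obj.size)
```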
4.1.4. Trino Example
This example notebook demonstrates how to connect to a Trino instance using SQLAlchemy and JWT authentication. It shows users how to retrieve an authentication token from Data Fabric, establish a secure connection to Trino, and run SQL queries to interact with Trino’s catalogs.
Why Use It? This notebook is useful for users who need to run SQL queries on Trino while using secure JWT authentication. It provides an easy way to query data from various catalogs in Trino, allowing users to explore datasets stored within the Data Fabric using a Python interface.
Example Use Case
Imagine you’re working on a project where you need to query and analyze data from multiple sources within Trino:
- Authenticate using credentials to retrieve a secure token from Data Fabric.
- Connect to Trino with the obtained token, allowing you to securely run queries against various data catalogs.
- Run SQL queries to explore datasets, pull relevant data, and perform analysis or data processing directly from your notebook.
Steps to Use the Notebook (a minimal code sketch follows this list):
- Setting Up Authentication: The first cells disable warnings and set up environment variables for the username (DF_USER) and password (DF_PASS). These credentials are used to authenticate with the Data Fabric API and retrieve a JWT token that will be used to connect to Trino.
- Getting the Token: These cells send a request to the Data Fabric API to obtain a JWT authentication token. The token is returned in JSON format, and the notebook extracts the token for use in the connection to Trino.
- Connecting to Trino: The connection to Trino is established using SQLAlchemy. The JWT token obtained earlier is passed in the connection, and the notebook uses HTTPS with the option to disable SSL verification in local environments.
- Running Queries on Trino: This cell runs a sample SQL query (SHOW CATALOGS) to display the available catalogs in Trino. The results are loaded into a Pandas DataFrame for easy viewing, and users can modify the query to explore other datasets or perform further analysis.
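The sketch below shows one way these steps fit together. The token endpoint, the Trino host, and the shape of the token response are assumptions (your Data Fabric environment defines the real values, and the notebook may build the request differently); it also assumes the trino Python client is installed, which provides the SQLAlchemy dialect and JWTAuthentication.

```python
import os

import pandas as pd
import requests
from sqlalchemy.engine import create_engine
from trino.auth import JWTAuthentication

# Placeholder endpoints; your Data Fabric environment defines the real values.
TOKEN_URL = "https://data-fabric.example/api/auth/token"
TRINO_HOST = "trino.data-fabric.example"

# Request a JWT from the Data Fabric API using DF_USER / DF_PASS.
# The request payload and the "token" key in the response are assumptions; check how
# the notebook actually builds the request and parses the response.
response = requests.post(
    TOKEN_URL,
    json={"username": os.environ["DF_USER"], "password": os.environ["DF_PASS"]},
)
response.raise_for_status()
token = response.json()["token"]

# Build a SQLAlchemy engine that connects to Trino over HTTPS with the JWT.
# (In local environments the notebook may also disable SSL verification.)
engine = create_engine(
    f"trino://{os.environ['DF_USER']}@{TRINO_HOST}:443",
    connect_args={"http_scheme": "https", "auth": JWTAuthentication(token)},
)

# Run a sample query and load the results into a Pandas DataFrame.
catalogs = pd.read_sql("SHOW CATALOGS", engine)
print(catalogs)
```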
5. Best Practices
5.1. Efficient Resource Usage
- Limit Notebook Resources: Use small datasets and run lightweight operations to test functionality before executing large-scale computations. This helps prevent overloading the system and ensures efficient use of CPU and memory. For example, if working with large datasets from Delta Lake or Kafka streams, apply filters early to avoid loading unnecessary data (see the sketch after this list).
- Shut Down Idle Kernels: Notebooks left idle or running indefinitely consume system resources. Always shut down kernels when you’re done with a notebook session by using the Kernel → Shutdown option.
- Run Long Jobs in Off-Peak Hours: If your work involves long-running processes or resource-intensive tasks, consider running them during off-peak hours to avoid affecting other users.
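As an illustration of filtering early, the sketch below compares loading an entire dataset and filtering in memory against pushing the column selection and row filter into the read itself. The file and column names are placeholders; Delta Lake tables are typically backed by Parquet files, and the same principle applies to SQL queries against Trino (select only the columns and rows you need, and add a LIMIT while prototyping).

```python
import pandas as pd

# Wasteful: load everything, then filter in memory (placeholder file and column names).
# df = pd.read_parquet("readings.parquet")
# df = df[df["sensor_id"] == "sensor-001"][["sensor_id", "temperature"]]

# Better: push the column selection and row filter into the read itself so only the
# needed data is loaded (the filters argument requires the pyarrow engine).
df = pd.read_parquet(
    "readings.parquet",
    columns=["sensor_id", "temperature"],
    filters=[("sensor_id", "==", "sensor-001")],
)
```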
6. Troubleshooting
- Notebook is Running Slowly or Crashing. Cause: Your notebook may be using too much memory or CPU, especially when working with large datasets or running complex computations. Solution: Try using smaller data subsets or breaking your computations into smaller steps. Clear any variables or data that are no longer needed using %reset or del to free up memory (see the short snippet after this list). If the notebook is still unresponsive, restart the kernel from the Kernel menu.
- Kernel is Not Responding. Cause: The kernel may have become overwhelmed by a heavy computation or large data load. Solution: Restart the kernel by going to Kernel → Restart Kernel. After restarting, re-run the necessary cells to continue your work. If the problem persists, consider optimizing your code or using smaller data batches.
- Seek Help from Administrators: If the issue seems related to the underlying environment or services (like Kafka or MinIO), it might be necessary to contact your administrator for assistance. Make sure to provide any relevant error messages and details about what you were trying to do.
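As a short illustration of freeing memory in a cell (the variable name is made up):

```python
import pandas as pd

large_df = pd.DataFrame({"value": range(1_000_000)})  # stand-in for a big intermediate result

# Drop the reference so its memory can be reclaimed.
del large_df

# Or, in a notebook cell, clear the whole interactive namespace with the IPython magic
# (the -f flag skips the confirmation prompt):
# %reset -f
```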
7. FAQs
- How do I save my work in JupyterHub? JupyterHub automatically saves your notebook every few minutes, but you can also manually save by clicking File → Save and Checkpoint. It’s a good habit to save your progress regularly, especially when running long computations.
- How do I shut down a notebook that’s no longer in use? You can shut down a notebook by going to the Kernel menu at the top of the notebook and selecting Shutdown. Alternatively, you can shut down kernels from the JupyterHub dashboard by clicking the Running tab and manually stopping the notebook.
- What should I do if my notebook is running slowly? If your notebook is running slowly, try the following:
  - Shut down other notebooks that you’re not using.
  - Clear unnecessary variables or data from memory using %reset or del.
  - Restart the kernel and re-run only the cells you need.
- What should I do if my notebook kernel becomes unresponsive? If the kernel is unresponsive, you can restart it by going to Kernel → Restart Kernel. This resets the notebook without closing it, allowing you to re-run the cells. If that doesn’t work, try closing the notebook and re-opening it from the JupyterHub dashboard.
- How do I install new Python libraries in JupyterHub? In many cases, Python libraries are pre-installed. If you need to install a new library, you can try running !pip install <library-name> in a notebook cell. However, if you’re in a restricted environment, you may need to request that the library be added by an administrator.
- What should I do if I encounter a connection error when accessing Kafka or MinIO? Verify your connection details, including the host, access keys, and any authentication tokens required. Ensure the correct endpoint and bucket names are being used for MinIO, or the correct topic for Kafka. If you still have issues, reach out to the administrator.
- How do I know which libraries and tools are available in JupyterHub? You can check the available Python libraries by running !pip list in a notebook cell. For system tools or external services (like MinIO or Trino), refer to the documentation provided with your JupyterHub environment, or ask your administrator for a list of pre-configured services.
8. Reference Materials
- JupyterHub Documentation: https://jupyterhub.readthedocs.io/en/stable/
- MinIO Documentation: https://docs.min.io/
- Kafka Documentation: https://kafka.apache.org/documentation/
- Confluent Kafka Python Client: https://docs.confluent.io/platform/current/clients/python.html
- Trino Documentation: https://trino.io/docs/current/
- SQLAlchemy Documentation: https://docs.sqlalchemy.org/en/14/
- Python Requests Library: https://docs.python-requests.org/en/latest/
- Pandas Documentation: https://pandas.pydata.org/docs/
9. Glossary
- Avro: A data serialization format commonly used with Kafka to encode data into a compact, binary format. Avro schemas ensure that data is structured consistently across producers and consumers.
- Catalog (Trino): A namespace in Trino that contains multiple databases. Users can query catalogs in Trino to access datasets stored in the Data Fabric.
- Consumer (Kafka): A process or service that reads data from a Kafka topic. Consumers receive messages sent by producers and process or analyze the data in real-time.
- JWT (JSON Web Token): A token used for securely transmitting information between a client and a server. In this context, JWT tokens are used to authenticate connections to Trino.
- Kafka: A distributed streaming platform used for building real-time data pipelines. It allows users to produce (send) and consume (receive) messages to and from topics.
- Kernel: The computational engine in Jupyter notebooks that runs your code. You can restart or shut down kernels when they become unresponsive or after completing a task.
- MinIO: A distributed object storage system used to store large datasets and files. It is similar to Amazon S3 and allows users to upload, list, and retrieve files via notebooks.
- Notebook: An interactive document in JupyterHub that contains code, text, and visualizations. Notebooks are the primary tool for running code, exploring data, and documenting your analysis.
- Pandas: A Python library used for data manipulation and analysis. It is often used in Jupyter notebooks to load, filter, and analyze data in a structured format (e.g., DataFrames).
- Producer (Kafka): A process or service that writes (produces) data to a Kafka topic. Producers generate messages that are sent to Kafka topics, where they can be consumed by other services or applications.
- SQLAlchemy: A Python SQL toolkit and Object Relational Mapper (ORM) used to connect to databases and run SQL queries. It’s used in the Trino notebook to query data from the Data Fabric.
- Token (Authentication Token): A digital key used to authenticate users and authorize access to services, such as Trino or Kafka. In this context, authentication tokens are retrieved from Data Fabric and passed to secure the connection.
- Topic (Kafka): A feed or category in Kafka where messages are published. Producers send data to topics, and consumers read data from these topics for real-time processing or analysis.
- Trino: An open-source SQL query engine designed for running interactive queries across distributed data sources. It allows users to query datasets across multiple catalogs in the Data Fabric.
10. Appendices
10.1. Appendix A: Useful JupyterHub Shortcuts
Here are some useful keyboard shortcuts for JupyterHub to make your workflow more efficient:
- Run a cell: Shift + Enter
- Insert a new cell below: B
- Delete a cell: D + D (press D twice)
- Save the notebook: Ctrl + S or Cmd + S (Mac)
- Open command palette: P
10.2. Appendix B: Code Snippet – Restarting the Kernel
If your notebook becomes unresponsive or you need to free up resources, you can restart the kernel: go to the top menu and select Kernel → Restart Kernel.
10.3. Appendix C: Example Python Libraries Used
Here are some Python libraries commonly used in the example notebooks and their main purpose:
- Pandas: Used for data manipulation and analysis in the notebooks. Pandas allows users to work with tabular data efficiently.
  - Import statement: import pandas as pd
- SQLAlchemy: Used for database connections and running SQL queries against Trino.
  - Import statement: from sqlalchemy.engine import create_engine
- MinIO Python Client: Used for interacting with the MinIO object storage service.
  - Import statement: from minio import Minio
- Confluent Kafka: Used for producing and consuming Kafka messages.
  - Import statement: from confluent_kafka import Producer, Consumer
10.4. Appendix D: Sample Queries for Trino
Here are some sample SQL queries that can be run in the trino.ipynb notebook to query data catalogs:
- Show all catalogs: SHOW CATALOGS;
- Show all schemas in a catalog: SHOW SCHEMAS FROM <catalog_name>;
- List all tables in a schema: SHOW TABLES FROM <catalog_name>.<schema_name>;
- Run a simple query: SELECT * FROM <catalog_name>.<schema_name>.<table_name> LIMIT 10;