Using JupyterHub on Kubernetes
In a previous post, we discussed the advantages of running JupyterHub on Kubernetes. We also showed you how to install a local Kubernetes cluster using kind on your Mac, as well as how to install the JupyterHub Helm chart on a Kubernetes cluster.
In this post, we will focus on the experience of the developers who will be using our service to develop new models with scikit-learn or to perform calculations and transformations on large datasets with pandas. To illustrate the value that Jupyter Notebooks and JupyterHub provide in a multiuser environment, we will clone a Git repository containing two example Jupyter Notebooks that we can work with.
Each user who accesses JupyterHub will have their own workspace, complete with a single-user Jupyter Notebook server that uses the JupyterLab interface. To demonstrate the capabilities of JupyterHub and Python, we will check out two sample notebooks that we have written and executed: a KNN classification notebook that uses the scikit-learn library for Python (ml-stock-predictor-knn-v4.ipynb) and a pandas data analysis notebook (industry-revenue-analysis.ipynb).
Note: Each time a user logs in to the JupyterHub web page, an additional pod will be instantiated for that user and a 10GB persistent volume will be mapped to the user’s home directory.
❯ kubectl port-forward -n jupyter svc/proxy-public 8080:80 &
[2] 39859
Forwarding from 127.0.0.1:8080 -> 8000
Forwarding from [::1]:8080 -> 8000
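Before opening the browser, it can be handy to confirm that the forwarded port is actually accepting connections. The following is a minimal sketch using only Python's standard library; the helper name port_is_open is hypothetical, and the host and port are taken from the port-forward output above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:8080 is the local end of the kubectl port-forward shown above.
    if port_is_open("127.0.0.1", 8080):
        print("JupyterHub proxy is reachable at http://127.0.0.1:8080")
    else:
        print("Nothing is listening on port 8080; is the port-forward running?")
```

This only checks that something is listening on the port; it does not verify that the service behind it is JupyterHub.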
Once you are authenticated, you will be brought to the workspaces screen, which allows you to open a local terminal inside the container, run interactive Python in a console, or create a new Python 3 Jupyter notebook. You can also get interactive help or browse the local container’s directory structure.
git clone https://github.com/tkrausjr/data-science-demos.git data-science
To jump-start your data science learning, two sample notebooks are in the data-science directory created by your git clone operation in the previous step. The repos and sample datasets are here: https://github.com/tkrausjr/finance-analysis.git.
Navigate to the /data-science/jupyter-hub directory. To run the scikit-learn notebook, double-click the ml-stock-predictor-knn-v4.ipynb file. You can run each cell individually by clicking in the cell and then clicking the Run button or using the keyboard shortcut Shift + Enter. To run all cells from top to bottom (in other words, the entire program), go to Menu -> Run -> Run All Cells.
This notebook aims to classify companies as high growth or low growth according to their historical annual revenue growth, using supervised learning with the k-nearest neighbor (KNN) classification algorithm from the scikit-learn library.
The output of the classification report in cell 36 shows us that although the model seems to be working, it is actually just predicting the majority class, “0”, every time. To make this model work as a classification problem, we would need to do some additional work, namely undersampling the majority class or oversampling the minority class in order to create a more balanced dataset.
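The majority-class behavior described above, and the oversampling remedy, can be illustrated with a small standard-library sketch. The labels below are made-up toy data, not the notebook's dataset, and majority_baseline, recall, and oversample_minority are hypothetical helpers written for this illustration; in a real notebook a library routine such as sklearn.utils.resample would more likely be used.

```python
import random
from collections import Counter


def majority_baseline(labels):
    """Predict the most common label for every sample."""
    most_common_label = Counter(labels).most_common(1)[0][0]
    return [most_common_label] * len(labels)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen smaller-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (assumption): 90 low-growth (0) samples and 10 high-growth (1) samples.
y = [0] * 90 + [1] * 10
X = [[i] for i in range(len(y))]

pred = majority_baseline(y)
accuracy = sum(t == p for t, p in zip(y, pred)) / len(y)
print(accuracy)            # 0.9, which looks fine at first glance
print(recall(y, pred, 1))  # 0.0, no high-growth company is ever found

X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))      # both classes now have 90 samples
```

If this approach is used, oversampling should be applied only to the training split, so that duplicated rows do not leak into the test set and inflate the scores.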
To run the pandas data analysis notebook, double-click the industry-revenue-analysis.ipynb file. You can run each cell individually by clicking in the cell and then clicking the Run button or using the keyboard shortcut Shift + Enter. To run all cells from top to bottom (in other words, the entire program), go to Menu -> Run -> Run All Cells.
This notebook will do the following:
When another data scientist wants to use the platform, they just need to log in to JupyterHub, and it will take care of the rest.
When a second data scientist (in this case, me, Thomas) logs in to JupyterHub, a second persistent volume claim, persistent volume, and pod are provisioned for the new user.
$ kubectl get pvc -n jupyter
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-admin   Bound    pvc-6e02d65b-c47a-4cf4-a1ad-489dc7cc49de   10Gi       RWO            standard       45m
claim-john    Bound    pvc-3c1ee9a3-68ab-4d71-a1f4-8cf8c04fb975   10Gi       RWO            standard       109s
hub-db-dir    Bound    pvc-b0a0657d-a9f8-4a17-9002-2a3c8f2cade6   3Gi        RWO            standard       88m
$ kubectl get po -n jupyter -l component=singleuser-server
NAME            READY   STATUS    RESTARTS   AGE
jupyter-admin   1/1     Running   0          73m
jupyter-john    1/1     Running   0          2m15s
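The one-pod-and-one-claim-per-user pattern can also be checked programmatically. The sketch below parses the plain-text output shown above; it assumes the jupyter-<user> and claim-<user> naming seen in that output, and the function names first_column and users_with_prefix are hypothetical.

```python
# Sample text copied from the kubectl output above.
PVC_OUTPUT = """\
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-admin   Bound    pvc-6e02d65b-c47a-4cf4-a1ad-489dc7cc49de   10Gi       RWO            standard       45m
claim-john    Bound    pvc-3c1ee9a3-68ab-4d71-a1f4-8cf8c04fb975   10Gi       RWO            standard       109s
hub-db-dir    Bound    pvc-b0a0657d-a9f8-4a17-9002-2a3c8f2cade6   3Gi        RWO            standard       88m
"""

POD_OUTPUT = """\
NAME            READY   STATUS    RESTARTS   AGE
jupyter-admin   1/1     Running   0          73m
jupyter-john    1/1     Running   0          2m15s
"""


def first_column(text):
    """Return the NAME column of kubectl's tabular output, skipping the header row."""
    return [line.split()[0] for line in text.splitlines()[1:] if line.strip()]


def users_with_prefix(names, prefix):
    """Strip `prefix` from matching resource names to recover the user names."""
    return {name[len(prefix):] for name in names if name.startswith(prefix)}


# Assumed naming, inferred from the output above: claim-<user> and jupyter-<user>.
claim_users = users_with_prefix(first_column(PVC_OUTPUT), "claim-")
pod_users = users_with_prefix(first_column(POD_OUTPUT), "jupyter-")

print(sorted(claim_users))              # ['admin', 'john']
print(sorted(pod_users - claim_users))  # [] means every user pod has a claim
```

The hub-db-dir claim is skipped because it does not carry the claim- prefix; it belongs to the hub itself rather than to a user.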
In this post, we have shown how deploying JupyterHub on top of Kubernetes provides various benefits, including:
Hopefully you now better understand how a JupyterHub implementation running on Kubernetes can provide a scalable, simple, and powerful platform for data science teams to work with. If you are new to Jupyter Notebooks in general, I would recommend experimenting with the sample notebooks used in this post or walking through a Jupyter Notebook tutorial that goes into more detail on developing in Jupyter Notebooks with pandas, numpy, or scikit-learn.