Using JupyterHub on Kubernetes
In a previous post, we discussed the advantages of running JupyterHub on Kubernetes. We also showed you how to install a local Kubernetes cluster using kind on your Mac, as well as how to install the JupyterHub Helm chart on a Kubernetes cluster.
In this post, we will focus on the experience of the developers, who are going to be leveraging our service to develop new models using scikit-learn or perform calculations and transformations of large datasets using pandas. To illustrate the value that Jupyter Notebooks and JupyterHub provide in a multiuser environment, we will clone a Git repository containing two example Jupyter Notebooks that we can work with.
Each user who accesses JupyterHub will have their own workspace, complete with a single-user Jupyter Notebook server that uses the JupyterLab interface. To demonstrate the capabilities of JupyterHub and Python, we will check out the following sample notebooks that we have written and executed:
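industry-revenue-analysis.ipynb, which analyzes data using the pandas library for Python
ml-stock-predictor-knn-v4.ipynb, which builds a KNN classifier using the scikit-learn library for Python
Note: Each time a user logs into the JupyterHub web page, an additional pod will be instantiated for that user and a 10 GiB persistent volume will be mapped to the user’s home directory.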
❯ kubectl port-forward -n jupyter svc/proxy-public 8080:80 &
[2] 39859
Forwarding from 127.0.0.1:8080 -> 8000                                                     
Forwarding from [::1]:8080 -> 8000

With the port forward in place, JupyterHub is reachable in your browser at http://localhost:8080. Once you are authenticated, you will be brought to the workspaces screen, which will allow you to open a local terminal inside the container, run interactive Python in a console, or create a new Python 3 Jupyter notebook. You can also get interactive help or browse the local container’s directory structure.

git clone https://github.com/tkrausjr/data-science-demos.git data-science

To jumpstart your data science learning, two sample notebooks are in the data-science directory created by your git clone operation in the previous step. The repos and sample datasets are here: https://github.com/tkrausjr/finance-analysis.git.
Navigate to the /data-science/jupyter-hub directory.
To run the scikit-learn notebook, double-click the ml-stock-predictor-knn-v4.ipynb file. You can run each cell individually by clicking in the cell and then hitting the >| Run button or using keyboard shortcut Shift + Enter. To run all cells from top to bottom (in other words, the entire program) you can go to Menu –> Run –> Run All Cells.
This notebook aims to classify companies as high growth or low growth according to their historical annual revenue growth, using supervised learning with the k-nearest neighbor (KNN) classification algorithm. As such, it will:
scikit-learn library.scikit-learn library.
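For readers new to KNN, the idea behind the classifier can be sketched in a few lines of plain Python. This is not the notebook's code (the notebook uses scikit-learn); the function name `knn_predict` and the growth figures are made up for illustration. Each company is represented by three years of annual revenue growth, and a new company gets the label held by the majority of its k closest neighbors.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    ]
    # Keep only the k closest neighbors, comparing by distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: annual revenue growth (as fractions) over three years.
train_X = [
    (0.02, 0.03, 0.01),
    (0.04, 0.02, 0.05),
    (0.01, 0.00, 0.03),
    (0.35, 0.40, 0.30),
    (0.28, 0.33, 0.45),
    (0.50, 0.38, 0.41),
]
train_y = [0, 0, 0, 1, 1, 1]  # 0 = low growth, 1 = high growth

print(knn_predict(train_X, train_y, (0.03, 0.02, 0.02)))  # 0
print(knn_predict(train_X, train_y, (0.42, 0.36, 0.39)))  # 1
```

An odd k with two classes avoids tied votes, which is why the sketch defaults to k=3.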


The output of the classification report in cell 36 shows us that although the model seems to be working, it is actually just choosing the value for the majority class, which is “0” every time. To make this model work as a classification problem, we would need to do some additional work, namely to undersample from the majority class or oversample from the minority class in order to create a more balanced dataset.
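To make the imbalance problem concrete, here is a small plain-Python sketch, again with made-up data rather than the notebook's dataset. It shows why a model that always predicts the majority class can look accurate while finding no high-growth companies, and how random oversampling of the minority class evens out the label counts. The helpers `recall` and `oversample_minority` are hypothetical names written for this example.

```python
import random
from collections import Counter


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def oversample_minority(X, y, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical labels: 18 low-growth (0) and 2 high-growth (1) companies.
y = [0] * 18 + [1] * 2
X = [(i,) for i in range(len(y))]  # placeholder single-feature rows

# Always predicting the majority class scores well on accuracy...
always_zero = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, always_zero)) / len(y)
print(accuracy)                # 0.9
# ...but never identifies a single high-growth company.
print(recall(y, always_zero))  # 0.0

X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))          # Counter({0: 18, 1: 18})
```

If you try this on real data, resample only the training split so that duplicated rows do not leak into the test set.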
To run the pandas data analysis notebook, double-click the industry-revenue-analysis.ipynb file. You can run each cell individually by clicking in the cell and then hitting the >| Run button or using keyboard shortcut Shift + Enter. To run all cells from top to bottom (in other words, the entire program) you can go to Menu –> Run –> Run All Cells.

This notebook will do the following:

When another data scientist wants to use the platform, they just need to log in to JupyterHub and it will take care of the rest.


When a second data scientist (in this case, myself, Thomas) logs into JupyterHub, a second persistent volume claim, persistent volume, and pod will be provisioned for the new user.
$ kubectl get pvc -n jupyter                    
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-admin           Bound    pvc-6e02d65b-c47a-4cf4-a1ad-489dc7cc49de   10Gi       RWO            standard       45m
claim-john            Bound    pvc-3c1ee9a3-68ab-4d71-a1f4-8cf8c04fb975   10Gi       RWO            standard       109s
hub-db-dir            Bound    pvc-b0a0657d-a9f8-4a17-9002-2a3c8f2cade6   3Gi        RWO            standard       88m
$ kubectl get po -n jupyter -l component=singleuser-server            
NAME            READY   STATUS    RESTARTS   AGE
jupyter-admin   1/1     Running   0          73m
jupyter-john    1/1     Running   0          2m15s
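As a quick sanity check, the two listings above can be cross-referenced with a short Python sketch. It assumes the naming pattern visible in this output (claims named `claim-<user>` and pods named `jupyter-<user>`); that pattern is inferred from the listings rather than taken from the chart's documentation, and `first_column` is a hypothetical helper.

```python
def first_column(text):
    """Return the NAME column of a kubectl table, skipping the header row."""
    rows = [line.split() for line in text.strip().splitlines()[1:]]
    return [row[0] for row in rows if row]


# Output copied from the kubectl commands shown above.
pvc_output = """\
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-admin           Bound    pvc-6e02d65b-c47a-4cf4-a1ad-489dc7cc49de   10Gi       RWO            standard       45m
claim-john            Bound    pvc-3c1ee9a3-68ab-4d71-a1f4-8cf8c04fb975   10Gi       RWO            standard       109s
hub-db-dir            Bound    pvc-b0a0657d-a9f8-4a17-9002-2a3c8f2cade6   3Gi        RWO            standard       88m
"""

pod_output = """\
NAME            READY   STATUS    RESTARTS   AGE
jupyter-admin   1/1     Running   0          73m
jupyter-john    1/1     Running   0          2m15s
"""

# Assumed prefixes, inferred from the listings above.
CLAIM_PREFIX = "claim-"
POD_PREFIX = "jupyter-"

users_with_claim = {
    name[len(CLAIM_PREFIX):]
    for name in first_column(pvc_output)
    if name.startswith(CLAIM_PREFIX)
}
users_with_pod = {
    name[len(POD_PREFIX):]
    for name in first_column(pod_output)
    if name.startswith(POD_PREFIX)
}

# Users that have both a running server pod and a home-directory claim.
print(sorted(users_with_claim & users_with_pod))  # ['admin', 'john']
```

The `hub-db-dir` claim is skipped because it does not match the per-user prefix; in this listing it appears to belong to the hub itself rather than to a user.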
In this post, we have shown how deploying JupyterHub on top of Kubernetes provides various benefits, including:
Hopefully you now better understand how a JupyterHub implementation running on Kubernetes can provide a scalable, simple, and powerful platform for data science teams to work with. If you are new to Jupyter Notebooks in general, I would recommend experimenting with the sample notebooks used in this post or walking through a Jupyter Notebook tutorial that goes into more detail on developing in Jupyter Notebooks with pandas, NumPy, or scikit-learn.