Provides a JupyterHub service on an existing OpenStack cluster. This uses the Helm chart provided by Zero to JupyterHub.
- Features
- Limitations
- Requirements
- Deploying JupyterHub
- Customising your jupyterhub deployment
- Maintenance and Notes
- Longhorn Support
- Multiple profiles for different resource limits
- Automatic HTTPS support; an instance can be up in <1 hour (with pre-requisites in place)
- The primary worker/master flavour cannot be changed after creation
- Cannot use placeholders for optional profiles (e.g. GPU placeholder)
- Some metrics can't be selected by node name in the Grafana dashboard, as this requires reverse DNS.
The following assumes you have an Ubuntu 20.04 machine with `pip3`, `python3` and `python3-venv` already installed.
- Ansible (Installing Ansible — Ansible Documentation) - this is installed using `pip` in `py-requirements.txt`
- Helm 3 (Installing Helm)
- kubectl (Install Tools | Kubernetes)
Helm and kubectl can be installed using snap:
sudo apt-get update && sudo apt-get install -y snapd
export PATH=$PATH:/snap/bin
sudo snap install kubectl --classic
sudo snap install helm --classic
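Before continuing, it may help to confirm the tools are on your `PATH` (standard version flags for each tool; `ansible --version` will only work once the Python requirements from the deployment steps below have been installed):

```bash
# Quick sanity check that the required tooling is installed
helm version --short
kubectl version --client
ansible --version   # only available after installing the Python requirements below
```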
- Deploy a CAPI (Cluster API) cluster
- Ensure that you can access the cluster from the machine you are running this playbook from (`kubectl get no`)
- git clone this repo (`git clone https://github.com/stfc/ansible-jupyter`)
- Set up a virtual environment:
  a. Create a virtual environment `venv` using `python3 -m venv venv`
  b. Upgrade `pip3` using `pip3 install pip --upgrade` to ensure you are using the latest version of pip.
  c. Activate `venv` using `. venv/bin/activate`
  d. Install the Python dependencies: `pip3 install -r requirements.txt`
  e. If you get an error about the version of setuptools, upgrade it manually using `pip3 install setuptools --upgrade`
  f. Install the Ansible requirements: `ansible-galaxy collection install -r requirements.yml`
- Uncomment the correct line for your environment in `inventory/hosts`
- Fill in the variables for your given environment in `group_vars/<environment>/all.yaml` (see the illustrative sketch after this list):
  - `iris_iam`: If true, uses IRIS IAM groups for admin and user accounts; if false, uses JupyterHub-deployed accounts instead
  - `client_id`: Client ID from IRIS IAM
  - `client_secret`: Client secret from IRIS IAM
  - `admin_groups`: List of IRIS IAM groups to use for admins
  - `allowed_groups`: List of allowed IRIS IAM groups to use for users
  - `admin_names`: The admin usernames to be created (these will be prepended with `admin-`)
  - `number_of_users`: The number of user accounts to be created
  - `staging_cert`: Whether to use ACME to generate a staging cert
  - `nfs_ip`: The IP address of the NFS server
  - `display_name`: The name of the environment displayed to the user
  - `description`: The description of the environment displayed to the user
  - `default`: Whether the environment is the default environment or not
  - `image`: The image to use for generating the environment
  - `cpu_limit`: The maximum number of CPU cores a user instance can have
  - `cpu_guarantee`: The minimum amount of CPU a user instance can have
  - `mem_limit`: The maximum amount of memory a user instance can have
  - `mem_guarantee`: The minimum amount of memory a user instance can have
  - `use_gpus`: Whether to use GPUs
  - `number_of_gpus`: The number of GPUs to use
  - `key`: Toleration key. Usually `nvidia.com/gpu`
  - `operator`: How the key taint should be matched. Usually `Equal`
  - `effect`: Whether to schedule on a node if the key taint is not matched. Usually `NoSchedule`
  - `commands`: The commands (git clones) to run on the deployed instances/images
- Run the playbook: `ansible-playbook deploy_jhub.yml`
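For orientation only, a filled-in `group_vars/<environment>/all.yaml` might look roughly like the sketch below. The exact keys and nesting (in particular whether the per-environment fields live under a `profiles:` list) are defined by this repository's roles, so treat every name and value here as a placeholder rather than a verified template:

```yaml
# Hypothetical group_vars/<environment>/all.yaml sketch -- adjust to match the
# variables actually consumed by the roles in this repository.
iris_iam: true                     # use IRIS IAM groups rather than local accounts
client_id: "my-iam-client-id"
client_secret: "my-iam-client-secret"
admin_groups:
  - jupyter-admins
allowed_groups:
  - jupyter-users
admin_names:
  - alice                          # created as admin-alice
number_of_users: 20
staging_cert: false                # true = use the Let's Encrypt staging issuer
nfs_ip: 10.0.0.10
profiles:                          # one entry per selectable user environment
  - display_name: "Small CPU"
    description: "2 cores / 8 GiB"
    default: true
    image: jupyter/scipy-notebook:latest
    cpu_limit: 2
    cpu_guarantee: 1
    mem_limit: 8G
    mem_guarantee: 4G
    use_gpus: false
  - display_name: "GPU"
    description: "1 GPU, 4 cores / 16 GiB"
    default: false
    image: jupyter/tensorflow-notebook:latest
    cpu_limit: 4
    cpu_guarantee: 2
    mem_limit: 16G
    mem_guarantee: 8G
    use_gpus: true
    number_of_gpus: 1
    key: nvidia.com/gpu            # toleration key for GPU nodes
    operator: Equal
    effect: NoSchedule
commands:
  - "git clone https://github.com/example/notebooks"   # placeholder repo
```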
These are settings/variables to change/add to customise your JupyterHub deployment, and are optional.
Let's Encrypt is used to handle HTTPS certificates automatically. JupyterHub can have unusual problems when running in HTTP-only mode, so I would strongly advise you to run it with some level of TLS.
Simply ensure you have:
- An external IP address
- An internet-routable domain name
- An A record (and/or, optionally, AAAA records) pointing to the IP address
Update the config file with the domain name.
Alternatively, if you already have an existing certificate and don't want to expose the service externally, you can manually provide a certificate.
The primary disadvantages of this are remembering to renew the certificate annually and the associated downtime, compared to the automatic Let's Encrypt method.
A Kubernetes secret is used; instructions can be found here
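For reference, a manually supplied certificate is stored as a standard Kubernetes TLS secret; a minimal sketch is below. The secret name `jhub-manual-tls` is an arbitrary placeholder, and the chart's HTTPS settings in your config must be pointed at whatever name you choose:

```bash
# Store an existing certificate/key pair as a TLS secret in the jupyterhub namespace
kubectl create secret tls jhub-manual-tls \
  --cert=path/to/fullchain.pem \
  --key=path/to/privkey.pem \
  --namespace jupyterhub
```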
The Let's Encrypt (LE) certificate will have failed to issue initially, as the load balancer takes longer to create than the first issuance attempt. To issue your first certificate and enable automatic renewal:
As there are a limited number of attempts available (see the rate limit note below), some sanity checks help ensure we don't run out of attempts:
- Check the deployed `config.yaml` for the domain name
- Test that the domain is using the correct external IP from an external server with `dig`, e.g. `dig example.com @1.1.1.1`
- Test that the HTTP server is serving with `telnet example.com 80`
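These checks can be scripted from any external host, for example (using `dig` and `curl`; substitute your own domain):

```bash
DOMAIN=example.com                       # replace with your deployment's domain
dig +short "$DOMAIN" @1.1.1.1            # should print the load balancer's external IP
curl -sI "http://$DOMAIN" | head -n 1    # should print an HTTP status line
```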
We need to force the HTTPS issuer to retry:
- Run `kubectl get pods -n jupyterhub` and take note of the pod name containing `autohttps`
- Delete the auto HTTPS pod like so: `kubectl delete pod/autohttps-f954bb4d9-p2hnd`, with the unique suffix varying on your cluster
- Wait 1 minute. The logs can be monitored with: `watch kubectl logs service/proxy-public -n jupyterhub -c traefik`
- Warnings about implicit names can be ignored. If successful there will be no error printed after a minute.
- Go to `https://<domain>.com` and it should be encrypted.
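As an alternative to looking up the pod's unique suffix in the deletion step above, the pod can usually be targeted by label; this assumes the Zero to JupyterHub chart's usual `component=autohttps` label is applied to the pod:

```bash
# Delete the autohttps pod by label so a fresh certificate request is made
kubectl delete pod -n jupyterhub -l component=autohttps
```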
A maximum of 5 certificates will be issued to a set of domain names per week (on a 7 day rolling basis). Updating a deployment does not count towards this as Kubernetes holds the TLS secret.
However, `helm uninstall jhub` will delete the certificate, so redeploying will count towards another issuance.
The currently issued certificate(s) can be viewed at: https://crt.sh/
If you are maintaining the service there are a couple of important things to note:
JupyterHub is not designed for high availability; only a single hub pod can ever exist. Any upgrades or modifications to the service will incur a user downtime of a few minutes.
Any existing pods containing user work will not be shut down or restarted unless the profiles have changed. To be clear, redeploying the hub will cause a minor outage but will not clear existing work.
The autoscaler is the most "brittle" part of the deployment as it has to work with Heat. The logs can be monitored with:
kubectl logs deployment/cluster-autoscaler --follow -n kube-system
The maximum number of nodes can be changed with:
kubectl edit deployment/cluster-autoscaler -n kube-system
- Under image arguments the max number of instances can be changed
- Saving the file will redeploy the auto scaler with the new settings immediately.
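For illustration, the limit typically appears in the container arguments of that deployment. The exact flag depends on how the autoscaler was configured for your cloud provider integration; a common form is `--nodes=<min>:<max>:<node-group>`, though some setups use `--max-nodes-total` instead. The node-group name below is a placeholder:

```yaml
# Illustrative excerpt of the cluster-autoscaler container args (names are placeholders)
command:
  - ./cluster-autoscaler
  - --nodes=1:10:default-worker   # min:max:node-group -- raise the max here
```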
Deleting the public service endpoint does not delete the associated load balancer. You must delete the load balancer to prevent problems, as any redeployment of the service reuses the existing LB without updating the members inside. This will cause the failover to stop working as well.
The symptoms of this happening are:
- `kubectl get all -n kube-system` shows everything but the external service as completed
- The external service will be pending, but on the OpenStack GUI the LB will be active (not updating / creating)
- The service is not accessible as the old pod is still referred to.
To fix this:
- Delete the service in Kubernetes and load balancer in openstack
- Re-run the Ansible deployment playbook (see Deploying JupyterHub above); this will recreate the service.
- Associate the desired floating IP as described above
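A sketch of that cleanup, assuming the OpenStack CLI with the load balancer (Octavia) plugin is available and that the service in question is JupyterHub's `proxy-public` service; adjust names and namespaces to your deployment:

```bash
# Remove the Kubernetes service, then the orphaned Octavia load balancer
kubectl delete service proxy-public -n jupyterhub
openstack loadbalancer list                      # find the stale load balancer's ID
openstack loadbalancer delete --cascade <lb-id>  # --cascade removes listeners/pools too
```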
Longhorn's configuration is defined by the `release_values` in `roles/deploy_hub/tasks/main.yml`. By default, this creates a load balancer for the UI labelled `longhorn-frontend`, which must be associated with a prepared FIP, as described for JupyterHub's `proxy_public` load balancer.
If you are required to uninstall and reinstall Longhorn, it may be necessary to manually delete the load balancer on OpenStack and the service (`kubectl get services -n longhorn-system` will list these). You must then restart the OpenStack controller manager pods before a new Longhorn load balancer can be created successfully.
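The restart can usually be done with a rollout restart. The resource kind and name depend on how the OpenStack cloud provider integration was installed on your cluster, so check first; the DaemonSet name below is a common default rather than a guaranteed one:

```bash
# Find the OpenStack cloud controller manager pods
kubectl get pods -A | grep -i openstack
# Commonly a DaemonSet in kube-system (the name may differ on your cluster)
kubectl rollout restart daemonset/openstack-cloud-controller-manager -n kube-system
```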