This GitHub repository provides a guide for setting up and running NVIDIA Triton Inference Servers on a Slurm-managed cluster (specifically the NERSC Perlmutter supercomputer). Triton is an open-source inference serving software that simplifies the deployment of AI models at scale.
This repository provides an example of deploying a PyTorch ResNet Triton server on Perlmutter (see the diagram below for an overview). The ResNet example is taken from here. You can modify this example to quickly get started with your own Triton server deployment.
Figure 1: Overview of the example PyTorch ResNet Triton server deployment on Perlmutter. The deployment works with both gRPC and HTTP traffic.

Download this repository and edit the Slurm job script (start_triton_slurm.sh) with your Slurm and Triton configurations.
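For reference, a stripped-down launch script might look roughly like the sketch below; the queue, time limit, GPU count, and model-repository path are placeholders, and the start_triton_slurm.sh shipped with this repository is the authoritative version.

#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 01:00:00
#SBATCH --gpus-per-node=4

# Launch one Triton server per allocated node from the NGC container
# (illustrative model-repository path)
srun shifter --image=nvcr.io/nvidia/tritonserver:24.11-py3 \
    tritonserver --model-repository=$SCRATCH/model_repository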
Note
This repository uses Shifter to operate containers, but other container software will work as well. Likewise, there are many other load balancing software options that also work with gRPC and HTTP traffic.
Start up your Triton server via sbatch:
sbatch -N 2 -A elvis start_triton_slurm.sh
or interactively in a Slurm job:
salloc -N 1 -t 01:00:00 -C gpu -q interactive -A elvis
./start_triton_slurm.sh
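Once the job is running, you can look up the allocated node names (you will need them for the health check and load-balancer steps below) with standard Slurm commands, for example:

# List your running jobs and the nodes they occupy
squeue --me
# Expand the node list of a specific job into individual hostnames
scontrol show hostnames "$(squeue -h -j <triton-slurm-job-id> -o %N)"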
This starts up a Triton server from the nvcr.io/nvidia/tritonserver:24.11-py3 image and loads the ResNet50 PyTorch model (see this folder for the model and Triton configuration):
model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.pt
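The config.pbtxt tells Triton which backend to use and describes the model's input and output tensors. A simplified, illustrative configuration for a TorchScript ResNet50 might look something like the following; the tensor names, dims, and batching settings in this repository's config.pbtxt are the ones that actually apply.

name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 1000 ]
  }
]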
You can verify that the Triton servers are running correctly with a curl request to the HTTP endpoint (port 8000 by default):
$ curl <node-address>:8000/v2/health/ready
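The endpoint returns HTTP 200 once the server is ready. To check every node of a multi-node job in one go, a small loop like the one below can be used (the node names are hypothetical; substitute the hosts from your own job):

for host in nid001234 nid001235; do
    echo -n "$host: "
    curl -s -o /dev/null -w "%{http_code}\n" "$host:8000/v2/health/ready"
done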
Once the Triton servers have started up, you can deploy the load balancer with your node addresses. The start_lb.sh script automatically adds the node addresses from your Slurm job to haproxy.cfg and deploys HAProxy:
./start_lb.sh <triton-slurm-job-id>
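For context, a heavily stripped-down haproxy.cfg populated with two backend nodes might look something like the sketch below; the node names and backend port are hypothetical, and the haproxy.cfg template in this repository (filled in by start_lb.sh) is the authoritative configuration.

frontend triton_frontend
    bind *:9191
    mode tcp
    default_backend triton_servers

backend triton_servers
    mode tcp
    balance roundrobin
    server triton1 nid001234:8001 check
    server triton2 nid001235:8001 check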
Test the server with the Triton client:
# Start the PyTorch image and install the Triton client
shifter --module=none --image=nvcr.io/nvidia/pytorch:24.11-py3 /bin/bash
pip install 'tritonclient[all]'
# Download a test image
wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"
# Send an inference request through the load balancer
python3 client.py <load-balancer>:9191
Run a benchmark test with perf_analyzer:
# Run perf_analyzer from the Triton SDK image
shifter --module=none --image=nvcr.io/nvidia/tritonserver:24.11-py3-sdk \
perf_analyzer \
-m resnet50 -i grpc \
--input-data random -u <load-balancer>:9191
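perf_analyzer reports latency and throughput at a fixed client concurrency. To see how the multi-node deployment scales, you can optionally sweep the concurrency level, for example:

# Sweep client concurrency from 1 to 8 (illustrative range)
shifter --module=none --image=nvcr.io/nvidia/tritonserver:24.11-py3-sdk \
    perf_analyzer \
    -m resnet50 -i grpc \
    --input-data random -u <load-balancer>:9191 \
    --concurrency-range 1:8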