This GitHub repository provides a guide for setting up and running NVIDIA Triton Inference Servers on a Slurm-managed cluster (specifically the NERSC Perlmutter supercomputer). Triton is an open-source inference serving software that simplifies the deployment of AI models at scale.
This repository provides an example of deploying a PyTorch ResNet Triton server on Perlmutter (see the diagram below for an overview). The ResNet example is taken from here. You can modify this example to quickly get started with your own Triton server deployment.
Figure 1: Overview of the example PyTorch ResNet Triton server deployment on Perlmutter. The deployment works with both gRPC and HTTP traffic.

Download this repository and edit the Slurm job script (start_triton_slurm.sh) with your Slurm and Triton configurations.
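For reference, a stripped-down launch script might look roughly like the sketch below; the queue, time limit, GPU count, and model-repository path are placeholders, and the start_triton_slurm.sh shipped with this repository is the authoritative version.

#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 01:00:00
#SBATCH --gpus-per-node=4

# Launch one Triton server per allocated node from the NGC container
# (illustrative model-repository path)
srun shifter --image=nvcr.io/nvidia/tritonserver:24.11-py3 \
    tritonserver --model-repository=$SCRATCH/model_repository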
Note
This repository uses Shifter to operate containers, but other container software will work as well. Likewise, there are many other load balancing software options that also work with gRPC and HTTP traffic.
Start up your Triton server via sbatch:
sbatch -N 2 -A elvis start_triton_slurm.sh
or interactively in a Slurm job:
salloc -N 1 -t 01:00:00 -C gpu -q interactive -A elvis
./start_triton_slurm.sh
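Once the job is running, you can look up the allocated node names (you will need them for the health check and load-balancer steps below) with standard Slurm commands, for example:

# List your running jobs and the nodes they occupy
squeue --me
# Expand the node list of a specific job into individual hostnames
scontrol show hostnames "$(squeue -h -j <triton-slurm-job-id> -o %N)"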
This starts up a Triton server from the nvcr.io/nvidia/tritonserver:24.11-py3 image and loads the ResNet50 PyTorch model (see this folder for the model and Triton configuration):
model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.pt
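The config.pbtxt tells Triton which backend to use and describes the model's input and output tensors. A simplified, illustrative configuration for a TorchScript ResNet50 might look something like the following; the tensor names, dims, and batching settings in this repository's config.pbtxt are the ones that actually apply.

name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 1000 ]
  }
]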
You can verify that the Triton servers are running correctly with a curl request to the HTTP endpoint (port 8000 by default):
$ curl <node-address>:8000/v2/health/ready
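The endpoint returns HTTP 200 once the server is ready. To check every node of a multi-node job in one go, a small loop like the one below can be used (the node names are hypothetical; substitute the hosts from your own job):

for host in nid001234 nid001235; do
    echo -n "$host: "
    curl -s -o /dev/null -w "%{http_code}\n" "$host:8000/v2/health/ready"
done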
Once the Triton servers have started up, you can deploy the load balancer with your node addresses. The start_lb.sh script automatically adds the node addresses from your Slurm job to haproxy.cfg and deploys HAProxy:
./start_lb.sh <triton-slurm-job-id>
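For context, a heavily stripped-down haproxy.cfg populated with two backend nodes might look something like the sketch below; the node names and backend port are hypothetical, and the haproxy.cfg template in this repository (filled in by start_lb.sh) is the authoritative configuration.

frontend triton_frontend
    bind *:9191
    mode tcp
    default_backend triton_servers

backend triton_servers
    mode tcp
    balance roundrobin
    server triton1 nid001234:8001 check
    server triton2 nid001235:8001 check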
Test the server with the Triton client:
# Start the PyTorch image and install the Triton client
shifter --module=none --image=nvcr.io/nvidia/pytorch:24.11-py3 /bin/bash
pip install 'tritonclient[all]'
# Download a test image
wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"
# Send an inference request through the load balancer
python3 client.py <load-balancer>:9191
Run a benchmark test with perf_analyzer:
# Run perf_analyzer from the Triton SDK image
shifter --module=none --image=nvcr.io/nvidia/tritonserver:24.11-py3-sdk \
perf_analyzer \
-m resnet50 -i grpc \
--input-data random -u <load-balancer>:9191
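perf_analyzer reports latency and throughput at a fixed client concurrency. To see how the multi-node deployment scales, you can optionally sweep the concurrency level, for example:

# Sweep client concurrency from 1 to 8 (illustrative range)
shifter --module=none --image=nvcr.io/nvidia/tritonserver:24.11-py3-sdk \
    perf_analyzer \
    -m resnet50 -i grpc \
    --input-data random -u <load-balancer>:9191 \
    --concurrency-range 1:8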