Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack by asteny · Pull Request #316 · nebius/soperator · GitHub

Update Slurm to 24.05.5, fix MPI/PMIx bug, update README, and remove AppArmor hack #316

Merged: 5 commits, Jan 13, 2025
2 changes: 1 addition & 1 deletion Makefile
@@ -26,7 +26,7 @@ CHART_OPERATOR_CRDS_PATH = $(CHART_PATH)/soperator-crds
CHART_CLUSTER_PATH = $(CHART_PATH)/slurm-cluster
CHART_STORAGE_PATH = $(CHART_PATH)/slurm-cluster-storage

-SLURM_VERSION = 24.05.2
+SLURM_VERSION = 24.05.5
UBUNTU_VERSION = jammy
VERSION = $(shell cat VERSION)

9 changes: 4 additions & 5 deletions README.md
@@ -111,11 +111,10 @@ This helps cluster administrators and users monitor resource utilization, enforc
implemented it yet.
- **Single-partition clusters**. Slurm's ability to split clusters into several partitions isn't supported now.
- **Software versions**. The list of software versions we currently support is quite short.
-- Linux: Ubuntu [20.04](https://releases.ubuntu.com/focal/) and
-[22.04](https://releases.ubuntu.com/jammy/).
-- Slurm: versions `23.11.6` and `24.05.3`.
-- CUDA: version [12.2.2](https://developer.nvidia.com/cuda-12-2-2-download-archive).
-- Kubernetes: >= [1.29](https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/).
+- Linux: Ubuntu [22.04](https://releases.ubuntu.com/jammy/).
+- Slurm: versions `24.05.5`.
+- CUDA: version [12.4.1](https://developer.nvidia.com/cuda-12-4-1-download-archive).
+- Kubernetes: >= [1.29](https://kubernetes.io/blog/2023/12/13/kubernetes-v1-29-release/).
- Versions of some preinstalled software packages can't be changed.


9 changes: 4 additions & 5 deletions docs/limitations.md
@@ -30,11 +30,10 @@ equipped with different GPU models, use different container images, have differe

### Software versions
Our list of supported software versions is pretty short right now:
-- Linux distribution: Ubuntu [20.04](https://releases.ubuntu.com/focal/) and
-[22.04](https://releases.ubuntu.com/jammy/).
-- Slurm: versions `23.11.6` and `24.05.3`.
-- CUDA: version [12.2.2](https://developer.nvidia.com/cuda-12-2-2-download-archive).
-- Kubernetes: >= [1.28](https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/).
+- Linux distribution: Ubuntu [22.04](https://releases.ubuntu.com/jammy/).
+- Slurm: versions `24.05.5`.
+- CUDA: version [12.4.1](https://developer.nvidia.com/cuda-12-4-1-download-archive).
+- Kubernetes: >= [1.29](https://kubernetes.io/blog/2023/12/13/kubernetes-v1-29-release/).
- Versions of some preinstalled software packages can't be changed.

Other versions may also be supported, but we haven't checked it yet. It would be cool if someone from the community
18 changes: 9 additions & 9 deletions helm/slurm-cluster/values.yaml
@@ -406,13 +406,13 @@ telemetry: {}
# otelCollectorPort: 8429

images:
-slurmctld: "cr.eu-north1.nebius.cloud/soperator/controller_slurmctld:1.17.0-jammy-slurm24.05.2"
-slurmrestd: "cr.eu-north1.nebius.cloud/soperator/slurmrestd:1.17.0-jammy-slurm24.05.2"
-slurmd: "cr.eu-north1.nebius.cloud/soperator/worker_slurmd:1.17.0-jammy-slurm24.05.2"
-sshd: "cr.eu-north1.nebius.cloud/soperator/login_sshd:1.17.0-jammy-slurm24.05.2"
-munge: "cr.eu-north1.nebius.cloud/soperator/munge:1.17.0-jammy-slurm24.05.2"
-populateJail: "cr.eu-north1.nebius.cloud/soperator/populate_jail:1.17.0-jammy-slurm24.05.2"
-ncclBenchmark: "cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.17.0-jammy-slurm24.05.2"
-slurmdbd: "cr.eu-north1.nebius.cloud/soperator/controller_slurmdbd:1.17.0-jammy-slurm24.05.2"
-exporter: "cr.eu-north1.nebius.cloud/soperator/exporter:1.17.0-jammy-slurm24.05.2"
+slurmctld: "cr.eu-north1.nebius.cloud/soperator/controller_slurmctld:1.17.0-jammy-slurm24.05.5"
+slurmrestd: "cr.eu-north1.nebius.cloud/soperator/slurmrestd:1.17.0-jammy-slurm24.05.5"
+slurmd: "cr.eu-north1.nebius.cloud/soperator/worker_slurmd:1.17.0-jammy-slurm24.05.5"
+sshd: "cr.eu-north1.nebius.cloud/soperator/login_sshd:1.17.0-jammy-slurm24.05.5"
+munge: "cr.eu-north1.nebius.cloud/soperator/munge:1.17.0-jammy-slurm24.05.5"
+populateJail: "cr.eu-north1.nebius.cloud/soperator/populate_jail:1.17.0-jammy-slurm24.05.5"
+ncclBenchmark: "cr.eu-north1.nebius.cloud/soperator/nccl_benchmark:1.17.0-jammy-slurm24.05.5"
+slurmdbd: "cr.eu-north1.nebius.cloud/soperator/controller_slurmdbd:1.17.0-jammy-slurm24.05.5"
+exporter: "cr.eu-north1.nebius.cloud/soperator/exporter:1.17.0-jammy-slurm24.05.5"
mariaDB: "docker-registry1.mariadb.com/library/mariadb:11.4.3"
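
Note on the image tags above: every soperator tag appears to follow a `<version>-<ubuntu codename>-slurm<slurm version>` layout, which lines up with the `VERSION`, `UBUNTU_VERSION`, and `SLURM_VERSION` variables in the Makefile. The sketch below only illustrates that inferred layout. The helper name `image_ref` is hypothetical, and the Makefile rule that actually composes the tag is not part of this diff.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): compose an image reference
# from a component name and the three Makefile variables.
# The "<version>-<ubuntu>-slurm<slurm>" layout is inferred from values.yaml.
image_ref() {
    component="$1"; version="$2"; ubuntu="$3"; slurm="$4"
    printf 'cr.eu-north1.nebius.cloud/soperator/%s:%s-%s-slurm%s\n' \
        "$component" "$version" "$ubuntu" "$slurm"
}

# Reproduces the new slurmd entry from values.yaml.
image_ref worker_slurmd 1.17.0 jammy 24.05.5
# prints cr.eu-north1.nebius.cloud/soperator/worker_slurmd:1.17.0-jammy-slurm24.05.5
```

Because the Slurm version is baked into the tag, bumping `SLURM_VERSION` in the Makefile without touching `values.yaml` would leave the chart pointing at the old images, which is presumably why both files change in this PR.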
6 changes: 3 additions & 3 deletions images/accounting/slurmdbd.dockerfile
@@ -2,8 +2,8 @@ ARG BASE_IMAGE=ubuntu:jammy

FROM $BASE_IMAGE AS controller_slurmdbd

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1

ARG DEBIAN_FRONTEND=noninteractive

@@ -45,7 +45,7 @@ RUN apt-get update && \

# TODO: Install only necessary packages
# Download and install Slurm packages
-RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd-slurmdbd slurm-smd; do \
+RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd-slurmdbd slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
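
Note on the download loop above: the release path combines `CUDA_VERSION`, the distribution codename read from `/etc/os-release`, and `SLURM_VERSION`, so changing either version ARG changes which GitHub release the packages come from. The sketch below rebuilds the URL offline from a sample os-release file. The function names are hypothetical, and the sample file contents are an assumption about what an Ubuntu 22.04 image contains.

```shell
#!/bin/sh
# Same grep/cut pipeline as the Dockerfiles, but reading the file given as $1
# instead of /etc/os-release.
codename_from_os_release() {
    grep 'VERSION_CODENAME' "$1" | cut -d= -f2
}

# Hypothetical helper: build the .deb URL for one package.
package_url() {
    pkg="$1"; cuda="$2"; codename="$3"; slurm="$4"
    printf 'https://github.com/nebius/slurm-deb-packages/releases/download/%s-%s-slurm%s/%s_%s-1_amd64.deb\n' \
        "$cuda" "$codename" "$slurm" "$pkg" "$slurm"
}

# Sample os-release content (assumed, trimmed to the relevant keys).
sample=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="22.04"\nVERSION_CODENAME=jammy\n' > "$sample"

package_url slurm-smd 12.4.1 "$(codename_from_os_release "$sample")" 24.05.5
# prints https://github.com/nebius/slurm-deb-packages/releases/download/12.4.1-jammy-slurm24.05.5/slurm-smd_24.05.5-1_amd64.deb

rm -f "$sample"
```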
10 changes: 10 additions & 0 deletions images/common/scripts/install_openmpi.sh
@@ -0,0 +1,10 @@
#!/bin/bash

OPENMPI_VERSION=4.1.7a1-1.2310055
OFED_VERSION=23.10-2.1.3.1
DISTRO=$(. /etc/os-release; echo "$ID""$VERSION_ID")
cd /etc/apt/sources.list.d || exit
wget https://linux.mellanox.com/public/repo/mlnx_ofed/$OFED_VERSION/"$DISTRO"/mellanox_mlnx_ofed.list
wget -qO - https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | apt-key add -
apt update
apt install openmpi="$OPENMPI_VERSION"
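
Note on `DISTRO` in the script above: sourcing `/etc/os-release` and concatenating `ID` with `VERSION_ID` yields a string such as `ubuntu22.04`, which becomes a path segment of the MLNX_OFED repository URL. The sketch below reproduces that derivation against a sample file, so it needs neither network access nor root. The function names are hypothetical, and the sample os-release values are assumed for an Ubuntu 22.04 image.

```shell
#!/bin/sh
# Derive "<ID><VERSION_ID>" from an os-release style file ($1).
# The subshell keeps the sourced variables out of the caller's environment.
distro_id() {
    ( . "$1"; echo "$ID""$VERSION_ID" )
}

# Hypothetical helper: build the apt source list URL used by the script.
ofed_list_url() {
    ofed_version="$1"; distro="$2"
    printf 'https://linux.mellanox.com/public/repo/mlnx_ofed/%s/%s/mellanox_mlnx_ofed.list\n' \
        "$ofed_version" "$distro"
}

sample=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="22.04"\n' > "$sample"

ofed_list_url 23.10-2.1.3.1 "$(distro_id "$sample")"
# prints https://linux.mellanox.com/public/repo/mlnx_ofed/23.10-2.1.3.1/ubuntu22.04/mellanox_mlnx_ofed.list

rm -f "$sample"
```

The quotes around `22.04` in the file are consumed when the shell sources it, which is why the result contains no quote characters.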
11 changes: 0 additions & 11 deletions images/common/scripts/install_pmix.sh

This file was deleted.

20 changes: 12 additions & 8 deletions images/controller/slurmctld.dockerfile
@@ -2,8 +2,9 @@ ARG BASE_IMAGE=ubuntu:jammy

FROM $BASE_IMAGE AS controller_slurmctld

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1
+ARG OPENMPI_VERSION=4.1.7a1

ARG DEBIAN_FRONTEND=noninteractive

@@ -42,15 +43,18 @@ RUN apt-get update && \
lsof \
daemontools

-# Install PMIx
-COPY common/scripts/install_pmix.sh /opt/bin/
-RUN chmod +x /opt/bin/install_pmix.sh && \
-/opt/bin/install_pmix.sh && \
-rm /opt/bin/install_pmix.sh
+# Install OpenMPI
+COPY common/scripts/install_openmpi.sh /opt/bin/
+RUN chmod +x /opt/bin/install_openmpi.sh && \
+/opt/bin/install_openmpi.sh && \
+rm /opt/bin/install_openmpi.sh

+ENV LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/lib
+ENV PATH=$PATH:/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/bin

# TODO: Install only necessary packages
# Download and install Slurm packages
-RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd-slurmctld slurm-smd; do \
+RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd-slurmctld slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
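
Note on the `ENV` lines in this file: the OpenMPI version is now stated twice, as `ARG OPENMPI_VERSION=4.1.7a1` in the Dockerfile and as the pinned package version `4.1.7a1-1.2310055` in `install_openmpi.sh`. The sketch below shows how the two values relate, assuming the ARG is simply the package version with its revision suffix removed; the values in this diff are consistent with that, but the diff does not state it. The helper name `openmpi_prefix` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map an apt package version such as 4.1.7a1-1.2310055
# to the install prefix referenced by the ENV lines.
# Assumption: the directory name uses the version up to the first "-".
openmpi_prefix() {
    pkg_version="$1"
    printf '/usr/mpi/gcc/openmpi-%s\n' "${pkg_version%%-*}"
}

prefix=$(openmpi_prefix 4.1.7a1-1.2310055)
echo "LD_LIBRARY_PATH=$prefix/lib"   # prints LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.7a1/lib
echo "PATH suffix: $prefix/bin"      # prints PATH suffix: /usr/mpi/gcc/openmpi-4.1.7a1/bin
```

If the package pin in the script is bumped without updating the ARG, these paths would point at a directory that does not exist, so the two values need to change together.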
6 changes: 3 additions & 3 deletions images/exporter/exporter.dockerfile
@@ -29,8 +29,8 @@ RUN GOOS=$GOOS GOARCH=$GOARCH CGO_ENABLED=$CGO_ENABLED GO_LDFLAGS=$GO_LDFLAGS \
# Second stage: Build image for the prometheus-slurm-exporter
FROM $BASE_IMAGE AS exporter

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1

# TODO: Install only those dependencies that are required for running slurm exporter
# Install dependencies
@@ -75,7 +75,7 @@ RUN apt-get update && \

# TODO: Install only necessary packages
# Download and install Slurm packages
-RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd; do \
+RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
35 changes: 17 additions & 18 deletions images/jail/jail.dockerfile
@@ -1,5 +1,5 @@
# BASE_IMAGE defined here for second multistage build
-ARG BASE_IMAGE=ghcr.io/asteny/cuda_base:12.2.2
+ARG BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# First stage: Build the gpubench application
FROM golang:1.22 AS gpubench_builder
@@ -23,12 +23,13 @@ RUN GOOS=$GOOS GOARCH=$GOARCH CGO_ENABLED=$CGO_ENABLED GO_LDFLAGS=$GO_LDFLAGS \
#######################################################################################################################
# Second stage: Build jail image

-ARG BASE_IMAGE=ghcr.io/asteny/cuda_base:12.2.2
+ARG BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

FROM $BASE_IMAGE AS jail

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1
+ARG OPENMPI_VERSION=4.1.7a1

ARG DEBIAN_FRONTEND=noninteractive

@@ -80,7 +81,9 @@ RUN apt update && \
numactl \
htop \
rdma-core \
-ibverbs-utils
+ibverbs-utils \
+libpmix2 \
+libpmix-dev

# Install python
COPY common/scripts/install_python.sh /opt/bin/
@@ -110,15 +113,18 @@ RUN chown 0:0 /etc/enroot/enroot.conf && chmod 644 /etc/enroot/enroot.conf
# Create directory for enroot runtime data that will be mounted from the host
RUN mkdir -p -m 777 /usr/share/enroot/enroot-data

-# Install PMIx
-COPY common/scripts/install_pmix.sh /opt/bin/
-RUN chmod +x /opt/bin/install_pmix.sh && \
-/opt/bin/install_pmix.sh && \
-rm /opt/bin/install_pmix.sh
+# Install OpenMPI
+COPY common/scripts/install_openmpi.sh /opt/bin/
+RUN chmod +x /opt/bin/install_openmpi.sh && \
+/opt/bin/install_openmpi.sh && \
+rm /opt/bin/install_openmpi.sh

+ENV LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/lib
+ENV PATH=$PATH:/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/bin

# TODO: Install only necessary packages
# Download and install Slurm packages
-RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd; do \
+RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
@@ -148,13 +154,6 @@ RUN chmod +x /opt/bin/install_nvtop.sh && \
/opt/bin/install_nvtop.sh && \
rm /opt/bin/install_nvtop.sh

-# Download and install NCCL packages
-RUN wget -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/libnccl2_2.22.3-1+cuda12.2_amd64.deb && \
-wget -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/libnccl-dev_2.22.3-1+cuda12.2_amd64.deb && \
-dpkg -i /tmp/libnccl2_2.22.3-1+cuda12.2_amd64.deb && \
-dpkg -i /tmp/libnccl-dev_2.22.3-1+cuda12.2_amd64.deb && \
-rm -rf /tmp/*.deb

# Download NCCL tests executables
RUN wget -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/nccl-tests-perf.tar.gz && \
tar -xvzf /tmp/nccl-tests-perf.tar.gz -C /usr/bin && \
6 changes: 3 additions & 3 deletions images/nccl_benchmark/nccl_benchmark.dockerfile
@@ -2,8 +2,8 @@ ARG BASE_IMAGE=ubuntu:jammy

FROM $BASE_IMAGE AS nccl_benchmark

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1

ARG DEBIAN_FRONTEND=noninteractive

@@ -37,7 +37,7 @@ RUN apt-get update && \

# TODO: Install only necessary packages
# Download and install Slurm packages
-RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd; do \
+RUN for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
4 changes: 2 additions & 2 deletions images/restd/slurmrestd.dockerfile
@@ -2,8 +2,8 @@ ARG BASE_IMAGE=ubuntu:jammy

FROM $BASE_IMAGE AS slurmrestd

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1

ARG DEBIAN_FRONTEND=noninteractive

26 changes: 16 additions & 10 deletions images/worker/slurmd.dockerfile
@@ -1,9 +1,10 @@
-ARG BASE_IMAGE=ghcr.io/asteny/cuda_base:12.2.2
+ARG BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

FROM $BASE_IMAGE AS worker_slurmd

-ARG SLURM_VERSION=24.05.2
-ARG CUDA_VERSION=12.2.2
+ARG SLURM_VERSION=24.05.5
+ARG CUDA_VERSION=12.4.1
+ARG OPENMPI_VERSION=4.1.7a1

ARG DEBIAN_FRONTEND=noninteractive

@@ -51,20 +52,25 @@ RUN apt-get update && \
supervisor \
openssh-server \
rdma-core \
-ibverbs-utils
+ibverbs-utils \
+libpmix2 \
+libpmix-dev

-# Install PMIx
-COPY common/scripts/install_pmix.sh /opt/bin/
-RUN chmod +x /opt/bin/install_pmix.sh && \
-/opt/bin/install_pmix.sh && \
-rm /opt/bin/install_pmix.sh
+# Install OpenMPI
+COPY common/scripts/install_openmpi.sh /opt/bin/
+RUN chmod +x /opt/bin/install_openmpi.sh && \
+/opt/bin/install_openmpi.sh && \
+rm /opt/bin/install_openmpi.sh

+ENV LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/lib
+ENV PATH=$PATH:/usr/mpi/gcc/openmpi-${OPENMPI_VERSION}/bin

# TODO: Install only necessary packages
# Download and install Slurm packages
RUN wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/slurm-smd-torque_$SLURM_VERSION-1_all.deb && \
echo "slurm-smd-torque_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download slurm-smd-torque_$SLURM_VERSION-1_amd64.deb"; exit 1; } && \
-for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libpmi0 slurm-smd-libpmi2-0 slurm-smd-libslurm-perl slurm-smd-slurmd slurm-smd-sview slurm-smd; do \
+for pkg in slurm-smd-client slurm-smd-dev slurm-smd-libnss-slurm slurm-smd-libslurm-perl slurm-smd-slurmd slurm-smd-sview slurm-smd; do \
wget -q -P /tmp https://github.com/nebius/slurm-deb-packages/releases/download/$CUDA_VERSION-$(grep 'VERSION_CODENAME' /etc/os-release | cut -d= -f2)-slurm$SLURM_VERSION/${pkg}_$SLURM_VERSION-1_amd64.deb && \
echo "${pkg}_$SLURM_VERSION-1_amd64.deb successfully downloaded" || \
{ echo "Failed to download ${pkg}_$SLURM_VERSION-1_amd64.deb"; exit 1; }; \
5 changes: 0 additions & 5 deletions images/worker/supervisord_entrypoint.sh
@@ -86,11 +86,6 @@ if [ "$SLURM_CLUSTER_TYPE" = "gpu" ]; then
export GRES="$(nvidia-smi --query-gpu=name --format=csv,noheader | sed -e 's/ /_/g' -e 's/.*/\L&/' | sort | uniq -c | awk '{print "gpu:" $2 ":" $1}' | paste -sd ',' -)"

echo "Detected GRES is $GRES"

-echo "Create NVML symlink with the name expected by Slurm"
-pushd /usr/lib/x86_64-linux-gnu
-ln -s libnvidia-ml.so.1 libnvidia-ml.so
-popd
else
echo "Skipping GPU detection"
fi
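
Note on the `GRES` line kept as context in this hunk: it turns the GPU names reported by `nvidia-smi` into a Slurm GRES string of the form `gpu:<model>:<count>`. The sketch below runs an equivalent text pipeline on sample input so it can be tried without a GPU. It is not the entrypoint's exact code: the GNU-only `sed 's/.*/\L&/'` lowercasing is swapped for the portable `tr`, the function name `gres_from_names` is hypothetical, and the GPU names are made-up sample data.

```shell
#!/bin/sh
# Read one GPU name per line on stdin, print "gpu:<model>:<count>[,...]".
# Spaces become underscores, names are lowercased, then identical names are counted.
gres_from_names() {
    sed -e 's/ /_/g' \
        | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c \
        | awk '{print "gpu:" $2 ":" $1}' \
        | paste -sd ',' -
}

# Two identical GPUs (sample data, not real nvidia-smi output).
printf 'NVIDIA H100 80GB HBM3\nNVIDIA H100 80GB HBM3\n' | gres_from_names
# prints gpu:nvidia_h100_80gb_hbm3:2
```

A node with mixed GPU models would produce one comma-separated entry per model, in sorted order.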
14 changes: 6 additions & 8 deletions internal/render/common/apparmorprofile.go
@@ -4,6 +4,7 @@ import (
"fmt"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"nebius.ai/slurm-operator/internal/naming"
"nebius.ai/slurm-operator/internal/values"

@@ -45,9 +46,6 @@ profile %s flags=(attach_disconnected,mediate_deleted) {

/** lrixw,


-# remove [^m], when bump slurm 24.05.5 or higher

deny /usr/lib/x86_64-linux-gnu/libEGL_* w,
deny /usr/lib/x86_64-linux-gnu/libGLES* w,
deny /usr/lib/x86_64-linux-gnu/libGLX_nvidia* w,
@@ -59,23 +57,23 @@ profile %s flags=(attach_disconnected,mediate_deleted) {
deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia* w,
deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so w,
deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia w,
-deny /usr/lib/x86_64-linux-gnu/libnvidia-[^m]* w,
+deny /usr/lib/x86_64-linux-gnu/libnvidia-* w,
deny /usr/lib/x86_64-linux-gnu/libcuda.so* w,
deny /usr/lib/x86_64-linux-gnu/libcudadebugger.so* w,

-deny /lib/x86_64-linux-gnu/libnvidia-[^m]* w,
+deny /lib/x86_64-linux-gnu/libnvidia-* w,
deny /lib/x86_64-linux-gnu/libcuda.so* w,
deny /lib/x86_64-linux-gnu/libcudadebugger.so* w,

-deny /usr/local/lib/x86_64-linux-gnu/libnvidia-[^m]* w,
+deny /usr/local/lib/x86_64-linux-gnu/libnvidia-* w,
deny /usr/local/lib/x86_64-linux-gnu/libcuda.so* w,
deny /usr/local/lib/x86_64-linux-gnu/libcudadebugger.so* w,

-deny /usr/local/nvidia/lib/x86_64-linux-gnu/libnvidia-[^m]* w,
+deny /usr/local/nvidia/lib/x86_64-linux-gnu/libnvidia-* w,
deny /usr/local/nvidia/lib/x86_64-linux-gnu/libcuda.so* w,
deny /usr/local/nvidia/lib/x86_64-linux-gnu/libcudadebugger.so* w,

-deny /usr/local/nvidia/lib64/libnvidia-[^m]* w,
+deny /usr/local/nvidia/lib64/libnvidia-* w,
deny /usr/local/nvidia/lib64/libcuda.so* w,
deny /usr/local/nvidia/lib64/libcudadebugger.so* w,
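
Note on the `[^m]` removal in this file: the old pattern denied writes to every `libnvidia-` file except those whose next character is `m`, which left room for the `libnvidia-ml.so` symlink that `supervisord_entrypoint.sh` used to create. With that symlink step deleted in this PR, the exception is no longer needed and the rule becomes a plain `libnvidia-*`. The sketch below compares the two patterns on file names only. It uses shell `case` globs as a stand-in for AppArmor globbing, with the POSIX spelling `[!m]` for AppArmor's `[^m]`; the function names are hypothetical.

```shell
#!/bin/sh
# Would the OLD rule (libnvidia-[^m]*) deny a write to this file name?
old_rule_denies() {
    case "$1" in
        libnvidia-[!m]*) echo yes ;;
        *) echo no ;;
    esac
}

# Would the NEW rule (libnvidia-*) deny a write to this file name?
new_rule_denies() {
    case "$1" in
        libnvidia-*) echo yes ;;
        *) echo no ;;
    esac
}

for name in libnvidia-ml.so libnvidia-ml.so.1 libnvidia-cfg.so.1; do
    printf '%s old=%s new=%s\n' "$name" \
        "$(old_rule_denies "$name")" "$(new_rule_denies "$name")"
done
# prints:
# libnvidia-ml.so old=no new=yes
# libnvidia-ml.so.1 old=no new=yes
# libnvidia-cfg.so.1 old=yes new=yes
```

The comparison is limited to base names on purpose: shell `*` also matches `/`, which AppArmor's single `*` does not, so full paths would not be a faithful test.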

1 change: 1 addition & 0 deletions internal/render/common/configmap.go
@@ -61,6 +61,7 @@ func generateSlurmConfig(cluster *values.SlurmCluster) renderutils.ConfigFile {
if cluster.ClusterType == consts.ClusterTypeGPU {
res.AddProperty("GresTypes", "gpu")
}
+res.AddProperty("MpiDefault", "pmix")
res.AddProperty("MailProg", "/usr/bin/true")
res.AddProperty("PluginDir", "/usr/lib/x86_64-linux-gnu/"+consts.Slurm)
res.AddProperty("ProctrackType", "proctrack/cgroup")
Expand Down