
fxlin/slurm-demo

SLURM

References:

  • https://www.cs.virginia.edu/wiki/doku.php?id=compute_slurm
  • https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/scheduler-examples/

Command list

  • sbatch slurm-test-compsci.sh
  • sinfo

Example output (note the STATE column):

xl6yq@portal12 (main)[RWKV-v5]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up 4-00:00:00      8   resv cortado[02-06,08-10]
main*        up 4-00:00:00      1    mix slurm1
main*        up 4-00:00:00     10  alloc affogato[01,04-05],cortado07,lynx[08-09],slurm[2-5]
main*        up 4-00:00:00     12   idle affogato[02-03,06-10],cortado01,hydro,optane01,panther01,puma01
gpu          up 4-00:00:00     11    mix cheetah[01,04-05],jaguar[01,04-05],lynx[04,10],puma02,sds02,serval01
gpu          up 4-00:00:00     18  alloc adriatic[01-06],affogato[11-15],lynx[03,05-07,11-12],sds01
gpu          up 4-00:00:00      7   idle cheetah[02-03],jaguar[02,06],lotus,lynx[01-02]
nolim        up 20-00:00:0      1  alloc heartpiece
nolim        up 20-00:00:0      6   idle doppio[01-05],epona
gnolim       up 20-00:00:0      5    mix ai[01-04],titanx05
gnolim       up 20-00:00:0      8  alloc ai[05-10],jinx[01-02]
gnolim       up 20-00:00:0      3   idle titanx[02-04]
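
sinfo can be narrowed with its stock -p (partition) and -N (one line per node) flags; a quick example, with the gpu partition name taken from the output above:

# list only nodes in the gpu partition, one line per node
sinfo -p gpu -N
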
  • squeue

Check job status, e.g. squeue | grep $USER

Example:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
4369358      main    fxlin    xl6yq  R       0:25      1 slurm1

State codes: R = running, PD = pending. cf: https://curc.readthedocs.io/en/latest/running-jobs/squeue-status-codes.html
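
squeue can also filter by user directly via its stock -u flag, which avoids the grep:

# show only the current user's jobs
squeue -u $USER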

  • scontrol — inspect and modify jobs, nodes, and partitions
  • srun — launch tasks or interactive shells under an allocation
  • sbatch — submit a batch script (see the sketch below)
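
The batch script slurm-test-compsci.sh is in the repo and not reproduced here; a minimal sketch of a script in that shape (the job name, partition, and resource values are illustrative placeholders, not the repo's actual settings):

#!/bin/bash
#SBATCH --job-name=demo           # placeholder name
#SBATCH --partition=main          # a partition from the sinfo output above
#SBATCH --time=0-00:10:00         # wall-clock limit, D-HH:MM:SS
#SBATCH --mem=4G                  # memory request

hostname                          # trivial payload: print the node that ran the job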

Check SLURM job output

If --output is not specified, SLURM writes a file named like "slurm-4369350.out" (the number is the job ID) in the current directory. The file often appears only after a delay of a few seconds.
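
To name the output file yourself, the stock --output directive works; %j is SLURM's job-ID placeholder (the filename pattern here is just an example):

#SBATCH --output=myjob-%j.out     # %j expands to the numeric job ID

# follow a running job's output as it is written
tail -f slurm-4369350.out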

SLURM with conda

# create a demo env 
conda env create -f environment.yml

# install llama.cpp
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

# submit the demo job
sbatch demo.sh
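
demo.sh itself is in the repo and not reproduced here; a hypothetical sketch of a batch script that runs inside the conda env (the env name, conda install path, and entry-point script are all assumptions):

#!/bin/bash
#SBATCH --job-name=llama-demo     # hypothetical job name
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --time=0-00:30:00

# make conda usable in a non-interactive shell (install path is an assumption)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate slurm-demo         # env name from environment.yml is an assumption

python run_llama.py               # hypothetical entry point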

Interactive jobs

ijob -A xsel -p interactive --time=0-00:30:00 --gres=gpu:rtx2080:1 --mem=8G

xl6yq@udc-ba37-32c1[RWKV-LM]$ ijob -A xsel -p interactive --time=0-00:30:00 --gres=gpu:rtx2080:1 --mem=8G
salloc: Pending job allocation 62277848
salloc: job 62277848 queued and waiting for resources


salloc: job 62277848 has been allocated resources
salloc: Granted job allocation 62277848
salloc: Waiting for resource configuration
salloc: Nodes udc-aw37-37 are ready for job

The command above caps the session at 30 minutes (--time=0-00:30:00).
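
Once the shell comes up on the allocated node, it is worth confirming the granted resources:

# inside the interactive shell
nvidia-smi                        # should show the granted rtx2080
scontrol show job 62277848        # allocation details, using the job ID from salloc above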

Direct access to a server

This is for debugging. Do not use it as an alternative to ssh.

# access a server where your script is running
srun --nodelist ai02 --partition gnolim --pty bash -i -l
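
If you do not know which node your script landed on, the NODELIST column of squeue (shown earlier) gives the hostname; ai02 above is just an example:

# find the node first, then attach to it
squeue -u $USER                   # read the node name from NODELIST
srun --nodelist <node> --partition gnolim --pty bash -i -l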
