Stars
Dragon distributed runtime for HPC and AI applications and workflows
Orchestration for Singularity containers (under development)
Portable WDL workflows for CZ ID production pipelines
A distributed storage benchmark for file systems, object stores & block devices with support for GPUs
LD_PRELOAD library to inject O_DIRECT into file I/O
VASTPY is the official Python SDK for the VAST Management System
Persistent remote applications for X11; screen sharing for X11, macOS and MS Windows.
Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
Automatically split your PyTorch models on multiple GPUs for training & inference
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
A taxonomy of Kubernetes configuration management tools
Reference implementations of MLPerf™ training benchmarks
Reference implementations of MLPerf™ inference benchmarks
RDMA client/server for transferring files using RDMA over IB
Scaling Data-Constrained Language Models
Scripts and documentation on scaling large language model training on the LUMI supercomputer
Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.
Tool for running/managing ad hoc Spark clusters on a Slurm cluster
vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it …
Monitoring and visualization of InfiniBand Fabrics
High Performance Linpack for GPUs (Using OpenCL, CUDA, CAL)
Optimized primitives for collective multi-GPU communication
Upload EBS volume snapshots to Amazon S3/Glacier