Pi (π) Incubator is a unified RL training framework designed with RLOps in mind, giving researchers a modular, scalable foundation for training reinforcement learning (RL) policies. It integrates tools such as TorchRL, Metaflow, and Ray to support both local development and cloud-based experiments (e.g., AWS EKS). This repository includes training scripts, configuration files, and utilities for orchestrating distributed RL experiments at scale, providing frictionless, scalable, and user-friendly infrastructure that accelerates research development cycles.
Core Components
- Sim Env Zoo: Collection of simulated training environments from various high-fidelity simulators (MuJoCo, NVIDIA Isaac, etc.)
- Sim Model Zoo: Reusable simulation/dynamics models and configurations from the shared simulators for standardized scenarios
- NN Model Zoo: Neural architectures tailored for RL (CNNs, RNNs, Transformers, Mamba)
- Policy Algorithm Zoo: Range of RL algorithms (PPO, SAC, Dreamer, GRPO, etc.)
- Leverages PyTorch and TorchRL for training with GPU acceleration
- Parallel rollout workers collecting experiences from high fidelity simulators
- Dynamically orchestrates resource allocation and training workflow using Ray and Ray Tune
- Stores large-scale offline and real-world gathered experience data
- Facilitates off-policy training, data re-sampling, and memory-based RL
- Defines, executes, and tracks end-to-end training workflows (data prep, training, evaluation, deployment)
- Manages job submission to Kubernetes for auto-scaling
- Custom inference pipelines for batch evaluation with user-defined metrics
- Integrates with Ray Tune for hyperparameter tuning, metric aggregation, and pushing results to MLflow
- Automated alerts and early stopping to halt underperforming experiments or trigger adjustments based on defined performance metrics
- Stores artifacts (checkpoints, logs, metrics) for reproducibility and comparison
- Provides a dashboard for historical experiment insights
- Distills large policies into smaller, efficient models for real-world deployment
- Converts PyTorch RL policies into ONNX format for edge deployment on embedded devices (NVIDIA Jetson, etc.); see the export sketch below
- Training optimization methods such as Population-Based Training (PBT), auto-curriculum, Domain Randomization (DR), Unsupervised Environment Design (UED), and self-play
- User-defined experiment configurations (YAML/JSON) covering the training workflow, environment parameters, hyperparameters, compute resources, deployment settings, etc. (see the sketch below)
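For illustration, a baseline experiment configuration might look like the following sketch; the keys are hypothetical, and the actual schema used by the training scripts may differ:

```python
# Hypothetical experiment config sketch; the real schema used by train_ppo.py
# and torchrl_ray_train.py may differ.
import yaml  # requires PyYAML

config = {
    "experiment_name": "ppo_baseline",
    "environment": {"name": "HalfCheetah-v4", "num_envs": 8},
    "algorithm": {"name": "PPO", "lr": 3e-4, "clip_epsilon": 0.2, "gamma": 0.99},
    "compute": {"num_gpus": 1, "rollout_workers": 4},
    "deployment": {"export_onnx": True},
}

with open("configs/experiment_config_example.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```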
Together, TorchRL, Ray, and Metaflow provide an end-to-end RLOps solution, from flexible algorithm design to scalable, reproducible production experiments.
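As a concrete illustration of the ONNX edge-deployment item above, the usual PyTorch-to-ONNX path looks roughly like this; the small MLP is a stand-in for the repo's actual policy network:

```python
# Sketch only: export a toy policy to ONNX and sanity-check it with onnxruntime.
import torch
import torch.nn as nn
import onnxruntime as ort

policy = nn.Sequential(nn.Linear(17, 64), nn.Tanh(), nn.Linear(64, 6))  # placeholder MLP
dummy_obs = torch.randn(1, 17)

torch.onnx.export(
    policy,
    dummy_obs,
    "policy.onnx",
    input_names=["observation"],
    output_names=["action"],
    dynamic_axes={"observation": {0: "batch"}, "action": {0: "batch"}},
)

# Run the exported graph on CPU before shipping it to a Jetson.
session = ort.InferenceSession("policy.onnx", providers=["CPUExecutionProvider"])
action = session.run(None, {"observation": dummy_obs.numpy()})[0]
print(action.shape)  # (1, 6)
```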
- PyTorch-first: Leverages PyTorch's flexibility for custom RL algorithms.
- Modularity: Allows rapid prototyping with composable network components.
- Scalable Rollouts: Distributes high-fidelity simulation data collection seamlessly.
- Automated Tuning: Integrates hyperparameter search with third-party logging (W&B, MLflow).
- Efficient Scaling: Transitions effortlessly from local experiments to multi-node clusters.
- Orchestration: Simplifies complex workflow management and experiment tracking.
- Kubernetes Ready: Automates containerized training jobs for on-prem or cloud deployments.
Create and configure your Conda environment with the following commands:
conda create --name rlops_env python=3.11
conda activate rlops_env
conda install -y -c conda-forge glew
conda install -y -c anaconda mesa-libgl-cos6-x86_64
conda install -y -c menpo glfw3
conda env config vars set MUJOCO_GL=egl PYOPENGL_PLATFORM=egl
conda deactivate && conda activate rlops_env
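As an optional sanity check (not part of the official setup), confirm the variables are visible in the re-activated environment:

```python
import os

# Both should print "egl" after re-activating rlops_env.
print("MUJOCO_GL =", os.environ.get("MUJOCO_GL"))
print("PYOPENGL_PLATFORM =", os.environ.get("PYOPENGL_PLATFORM"))
```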
For additional functionality, install these optional packages:
pip install ray[default]==2.8.1
pip install onnx-pytorch
pip install onnxruntime
pip install metaflow-ray
To install pi_incubator:
pip install -e .
To run a baseline training locally, use one of the following commands:
python train_ppo.py --config configs/experiment_config_baseline.yaml
python torchrl_ray_train.py --config configs/experiment_config_baseline.yaml
For distributed training using torchrun, execute:
torchrun --nnodes=1 --nproc-per-node=1 --max-restarts=1 --rdzv-id=1 --rdzv-backend=c10d --rdzv-endpoint=localhost:0 train_ppo.py --config configs/experiment_config_baseline.yaml
Or to run a tuned configuration:
torchrun --nnodes=1 --nproc-per-node=1 --max-restarts=1 --rdzv-id=1 --rdzv-backend=c10d --rdzv-endpoint=localhost:0 train_ppo.py --config configs/experiment_config_tuned_PPO.yaml
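Under torchrun, each process discovers its rank through standard environment variables. The sketch below shows the usual pattern; it is an assumption about how a script like train_ppo.py might pick them up, not a verbatim excerpt:

```python
# Sketch of torchrun-style rank discovery; train_ppo.py's actual setup may differ.
import os
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun per process
world_size = int(os.environ["WORLD_SIZE"])   # total processes across all nodes

dist.init_process_group(backend="gloo")      # use "nccl" when each rank owns a GPU
print(f"rank {dist.get_rank()} of {world_size}, local_rank {local_rank}")
dist.destroy_process_group()
```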
- Deploying to AWS with Kubernetes:
As of now, please go to the terraform-aws-metaflow directory, then follow the official Outerbounds instructions to deploy these services in your AWS account: an AWS EKS cluster, Amazon S3, AWS Fargate, and Relational Database Service (RDS).
- AWS EKS:
Update your kubeconfig for your EKS cluster:
aws eks update-kubeconfig --name <cluster name>
- AWS API Gateway:
Retrieve your API key value:
aws apigateway get-api-key --api-key <YOUR_KEY_ID_FROM_CFN> --include-value | grep value
- Metaflow AWS Configuration:
Configure Metaflow to use AWS:
metaflow configure aws
Run your training flows on Kubernetes with Metaflow:
python torch_flow.py --no-pylint --environment=conda run --with kubernetes
Other experimental Metaflow runs:
python ppo_metaflow.py run
python ppo_metaflow.py run --with kubernetes
python ray_flow.py --no-pylint --environment=pypi run --with kubernetes
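For reference, these flows follow Metaflow's standard FlowSpec pattern. The skeleton below is illustrative only (not the contents of torch_flow.py or ppo_metaflow.py); a flow like it can be run locally with `python toy_train_flow.py run` or on the cluster with `run --with kubernetes`:

```python
# Illustrative Metaflow skeleton; the real flows in this repo differ.
from metaflow import FlowSpec, step

class ToyTrainFlow(FlowSpec):
    @step
    def start(self):
        self.config_path = "configs/experiment_config_baseline.yaml"
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for the actual PPO training call.
        print(f"training with {self.config_path}")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    ToyTrainFlow()
```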
If you're using Argo Workflows, you can track the running processes via port forwarding:
kubectl port-forward -n argo service/argo-argo-workflows-server 2746:2746
Investigating why collector kwargs are not being passed to the designated collector from RayCollector. Need to understand how the arguments are being passed. Will reach out to the TorchRL team.
python ray_collector_train.py
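The call under investigation looks roughly like the sketch below; the RayCollector signature follows the TorchRL docs and may differ by version, and the collector_kwargs forwarding is exactly the behavior in question:

```python
# Sketch of the RayCollector usage under investigation; approximate, not verbatim.
import torch.nn as nn
from tensordict.nn import TensorDictModule
from torchrl.collectors import SyncDataCollector
from torchrl.collectors.distributed import RayCollector
from torchrl.envs.libs.gym import GymEnv

env_maker = lambda: GymEnv("Pendulum-v1")
policy = TensorDictModule(nn.Linear(3, 1), in_keys=["observation"], out_keys=["action"])

collector = RayCollector(
    [env_maker],
    policy,
    frames_per_batch=200,
    total_frames=1000,
    collector_class=SyncDataCollector,
    # Expected to reach the per-worker SyncDataCollector; observed not to arrive.
    collector_kwargs={"max_frames_per_traj": 50},
)
for batch in collector:
    print(batch["next", "reward"].mean())
    break
collector.shutdown()
```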
Currently not able to train with RayCollector and the TorchRL trainer together. However, after applying the following fixes to the official ray_train.py example, it works without the TorchRL trainer.
RuntimeError: Setting 'advantage' via the constructor is deprecated, use .set_keys(<key>='some_key') instead.
After fixing the outdated ClipPPOLoss usage, the next error is:
RuntimeError: The distribution TanhNormal has not analytical mode. Use ExplorationMode.DETERMINISTIC to get a deterministic sample from it.
Fixed by changing set_exploration_type from ExplorationType.MODE to ExplorationType.DETERMINISTIC.
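A sketch of both fixes against the current TorchRL API, using a toy actor/critic pair in place of the modules built in ray_train.py:

```python
# Sketch of the two fixes noted above, on a toy actor/critic pair; the real
# modules in ray_train.py are built differently.
import torch
import torch.nn as nn
from tensordict import TensorDict
from tensordict.nn import NormalParamExtractor, TensorDictModule
from torchrl.envs.utils import ExplorationType, set_exploration_type
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss

obs_dim, act_dim = 3, 1
policy = ProbabilisticActor(
    module=TensorDictModule(
        nn.Sequential(nn.Linear(obs_dim, 2 * act_dim), NormalParamExtractor()),
        in_keys=["observation"],
        out_keys=["loc", "scale"],
    ),
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    return_log_prob=True,
)
critic = ValueOperator(nn.Linear(obs_dim, 1), in_keys=["observation"])

# Fix 1: don't pass 'advantage' through the ClipPPOLoss constructor; use set_keys.
loss_module = ClipPPOLoss(actor_network=policy, critic_network=critic)
loss_module.set_keys(advantage="advantage")

# Fix 2: TanhNormal has no analytical mode, so request a deterministic sample.
td = TensorDict({"observation": torch.randn(4, obs_dim)}, batch_size=[4])
with set_exploration_type(ExplorationType.DETERMINISTIC):
    policy(td)
print(td["action"].shape)  # torch.Size([4, 1])
```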
- Experiment with Ray Tune on the TorchRL Trainer.
- Experiment with Ray Collector with TorchRL Trainer.
- Investigate how collector kwargs from Ray Collector are passed into SyncDataCollector.
- Investigate the Isaac Gym wrapper.
- Investigate self-play environment setup.
- Support local Kubernetes with S3Mock.
- Set up training dependencies with a Docker registry.
- Experiment with Ray Collector with Ray Tune + TorchRL Trainer.
- Use W&B logger for improved experiment tracking.
- Run Metaflow training comparisons on TunedAdam.
This project is licensed under the MIT License - see the LICENSE file for details.