Description
Over the past year, many projects have been launched leveraging Ray to scale out post-training and RL for LLMs. From our perspective, we’d like to ensure that Ray continues to be a great fit for these use cases and address any bugs or usability gaps in Ray.
We've spoken to a variety of project creators over the last couple of weeks and have gotten great feedback.
Below is our currently identified list of issues and features that we plan to address, but we'd also be eager to hear any other feedback.
List of issues to address:
- ActorGroup / ActorMesh abstraction (cc @pcmoritz; see the ActorGroup sketch after this list)
- Address scalability issues (high heartbeat overhead at 500-node scale; ideally push to 2-3k nodes without excessive overhead)
- Fast GPU object transfer (GPU Objects): see the RFC and related issues
- Add docs for debugging SYSTEM_ERROR worker deaths (cc @jjyao; see the debugging sketch after this list). Related issues:
  - The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR (volcengine/verl#1595)
  - fix: ray worker exit with SYSTEM_ERROR caused by SIGALRM from math re… (volcengine/verl#1331 (comment))
  - "A worker died or was killed by an unexpected system error" when the training completes (volcengine/verl#1299)
  - A worker died or was killed while executing a task by an unexpected system error (volcengine/verl#472)
- Too many threads: add documentation on controlling per-worker thread counts (see the thread-limit sketch after this list)
- More observability for hanging workloads
- Improve Ray typing annotations: [core] Improving Ray Typing annotation #54149
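To make the ActorGroup / ActorMesh item concrete, here is a minimal sketch of what such a helper could look like built purely on existing Ray APIs (ray.remote, .options(), ray.get). The ActorGroup class, its constructor arguments, and its call method are hypothetical illustrations, not an existing or proposed Ray API.

```python
# Hypothetical ActorGroup sketch built only on existing Ray APIs; the class
# name, constructor arguments, and "call" method are illustrative only.
import ray

ray.init()


class ActorGroup:
    """Fan a method call out to N replicas of the same actor class."""

    def __init__(self, actor_cls, num_replicas, **actor_options):
        self._actors = [
            actor_cls.options(**actor_options).remote() for _ in range(num_replicas)
        ]

    def call(self, method_name, *args, **kwargs):
        # Invoke the same method on every replica and block on all results.
        refs = [getattr(a, method_name).remote(*args, **kwargs) for a in self._actors]
        return ray.get(refs)


@ray.remote
class RolloutWorker:
    def step(self, batch_id):
        return f"worker processed batch {batch_id}"


group = ActorGroup(RolloutWorker, num_replicas=4, num_cpus=1)
print(group.call("step", 0))
```

A built-in abstraction along these lines would presumably also need to handle placement groups, fault tolerance, and per-rank dispatch, which is the gap the item above is pointing at.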
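For the SYSTEM_ERROR debugging item, a rough starting point today is the Ray state API plus the raw logs on each node. The snippet below is a sketch assuming a recent Ray release where ray.util.state is available; the exact fields returned may vary by version.

```python
# Sketch: inspect dead actors after a SYSTEM_ERROR worker death, assuming a
# recent Ray release with the ray.util.state API. Raw worker/raylet logs also
# live under /tmp/ray/session_latest/logs/ on each node.
import ray
from ray.util.state import list_actors

ray.init(address="auto")  # connect to the running cluster

# Actors whose worker process died unexpectedly show up as DEAD; the detailed
# state entry includes a death cause describing why the worker exited.
for actor in list_actors(filters=[("state", "=", "DEAD")], detail=True):
    print(actor)
```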
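For the thread-count item, one way to cap library thread pools per worker today is the env_vars field of runtime_env. The sketch below is illustrative; the specific variables and the value of 1 are assumptions, not a recommendation from this issue.

```python
# Sketch: cap OpenMP/BLAS thread pools per Ray worker via runtime_env env_vars.
# The chosen variables and the value "1" are illustrative, not a recommendation.
import ray

ray.init()

thread_env = {
    "env_vars": {
        "OMP_NUM_THREADS": "1",
        "MKL_NUM_THREADS": "1",
        "OPENBLAS_NUM_THREADS": "1",
    }
}


@ray.remote(num_cpus=1, runtime_env=thread_env)
def rollout_step(i):
    import os

    # The worker process sees the capped values.
    return i, os.environ.get("OMP_NUM_THREADS")


print(ray.get([rollout_step.remote(i) for i in range(4)]))
```

Libraries that manage their own pools (e.g., PyTorch intra-op threads) have separate knobs; which of these Ray should document or set by default is part of what this item would cover.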
Open Questions
- Anything we should do to improve SLURM support?
Key Projects
- VeRL (cc @eric-haibin-lin)
- NemoRL (cc @terrykong)
- OpenRLHF (cc @xiaoxigua999 @hijkzzz)
- ROLL (cc @PanAndy @StephenRi)
- AReaL (cc @garrett4wade)
- SkyRL (cc @tyler-griggs @caoshiyi @lynnliu030 @DachengLi1)
We welcome folks to participate; please feel free to let us know if there are other items we should address.
cc @robertnishihara @SumanthRH @erictang000 @kouroshHakha @kevin85421 @stephanie-wang