Description
Over the past year, many projects have been launched leveraging Ray to scale out post-training and RL for LLMs. From our perspective, we’d like to ensure that Ray continues to be a great fit for these use cases and address any bugs or usability gaps in Ray.
We've spoken to a variety of project creators over the last couple of weeks and have gotten great feedback.
Below is our currently identified list of issues and features that we plan to address, but we'd also be eager to hear any other feedback.
List of issues to address:
- ActorGroup / ActorMesh abstraction (cc @pcmoritz; see the ActorGroup sketch after this list)
- Address scalability issues (high heartbeat overhead at 500-node scale; ideally push to 2-3k nodes without excessive overhead)
- Fast GPU object transfer (GPU Objects): see the RFC and related issues
- Add docs for debugging SYSTEM_ERROR worker deaths (cc @jjyao; see the debugging sketch after this list). Related issues:
  - The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR (volcengine/verl#1595)
  - fix: ray worker exit with SYSTEM_ERROR caused by SIGALRM from math re… (volcengine/verl#1331 (comment))
  - "A worker died or was killed by an unexpected system error" when the training completes (volcengine/verl#1299)
  - A worker died or was killed while executing a task by an unexpected system error (volcengine/verl#472)
- Too many threads: add documentation on controlling per-worker thread counts (see the thread-limit sketch after this list)
- More observability for hanging workloads
- Improve Ray typing annotations: [core] Improving Ray Typing annotation #54149
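To make the ActorGroup / ActorMesh item concrete, here is a minimal sketch of what such a helper could look like built purely on existing Ray APIs (ray.remote, .options(), ray.get). The ActorGroup class, its constructor arguments, and its call method are hypothetical illustrations, not an existing or proposed Ray API.

```python
# Hypothetical ActorGroup sketch built only on existing Ray APIs; the class
# name, constructor arguments, and "call" method are illustrative only.
import ray

ray.init()


class ActorGroup:
    """Fan a method call out to N replicas of the same actor class."""

    def __init__(self, actor_cls, num_replicas, **actor_options):
        self._actors = [
            actor_cls.options(**actor_options).remote() for _ in range(num_replicas)
        ]

    def call(self, method_name, *args, **kwargs):
        # Invoke the same method on every replica and block on all results.
        refs = [getattr(a, method_name).remote(*args, **kwargs) for a in self._actors]
        return ray.get(refs)


@ray.remote
class RolloutWorker:
    def step(self, batch_id):
        return f"worker processed batch {batch_id}"


group = ActorGroup(RolloutWorker, num_replicas=4, num_cpus=1)
print(group.call("step", 0))
```

A built-in abstraction along these lines would presumably also need to handle placement groups, fault tolerance, and per-rank dispatch, which is the gap the item above is pointing at.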
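For the SYSTEM_ERROR debugging item, a rough starting point today is the Ray state API plus the raw logs on each node. The snippet below is a sketch assuming a recent Ray release where ray.util.state is available; the exact fields returned may vary by version.

```python
# Sketch: inspect dead actors after a SYSTEM_ERROR worker death, assuming a
# recent Ray release with the ray.util.state API. Raw worker/raylet logs also
# live under /tmp/ray/session_latest/logs/ on each node.
import ray
from ray.util.state import list_actors

ray.init(address="auto")  # connect to the running cluster

# Actors whose worker process died unexpectedly show up as DEAD; the detailed
# state entry includes a death cause describing why the worker exited.
for actor in list_actors(filters=[("state", "=", "DEAD")], detail=True):
    print(actor)
```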
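For the thread-count item, one way to cap library thread pools per worker today is the env_vars field of runtime_env. The sketch below is illustrative; the specific variables and the value of 1 are assumptions, not a recommendation from this issue.

```python
# Sketch: cap OpenMP/BLAS thread pools per Ray worker via runtime_env env_vars.
# The chosen variables and the value "1" are illustrative, not a recommendation.
import ray

ray.init()

thread_env = {
    "env_vars": {
        "OMP_NUM_THREADS": "1",
        "MKL_NUM_THREADS": "1",
        "OPENBLAS_NUM_THREADS": "1",
    }
}


@ray.remote(num_cpus=1, runtime_env=thread_env)
def rollout_step(i):
    import os

    # The worker process sees the capped values.
    return i, os.environ.get("OMP_NUM_THREADS")


print(ray.get([rollout_step.remote(i) for i in range(4)]))
```

Libraries that manage their own pools (e.g., PyTorch intra-op threads) have separate knobs; which of these Ray should document or set by default is part of what this item would cover.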
Open Questions
- Anything we should do to improve SLURM support?
Key Projects
- VeRL (cc @eric-haibin-lin)
- NemoRL (cc @terrykong)
- OpenRLHF (cc @xiaoxigua999 @hijkzzz)
- ROLL (cc @PanAndy @StephenRi)
- AReaL (cc @garrett4wade)
- SkyRL (cc @tyler-griggs @caoshiyi @lynnliu030 @DachengLi1)
We welcome folks to participate; please feel free to let us know if there are other items we should address.
cc @robertnishihara @SumanthRH @erictang000 @kouroshHakha @kevin85421 @stephanie-wang