feat: optimize get logprobs when cp enabled. #528
base: main
Conversation
Force-pushed from 34f829d to a068b07
Signed-off-by: Jonas yang <joyang@nvidia.com>
logits = DTensor.from_local(
    local_logits,
    device_mesh=self.device_mesh["cp", "tp"],
    placements=[Shard(sequence_dim), Shard(-1)],
)
Is there any benefit to doing the redistribute like this vs. logits.redistribute(device_mesh=..., placements=...)?
Also, can this be set to async_op=True?
Accepted. :) Just want to unify full tensor/dtensor format.
assert isinstance(target, DTensor), (
    "target must be a DTensor if seq_index is provided"
)
cp_mesh = target.device_mesh
@SahilJain314 to comment on CP making its appearance in the model agnostic utilities
What does this PR do?
This PR optimizes getting logprobs when CP is enabled for FSDP2. Issue #549
Issues
In the previous PR, the logits were gathered from the sharded local tensor (shape [b, s / cp_size, v / tp_size]) into a full tensor of shape [b, s, v] before being passed to the loss function. The key reason was that we had to ensure the sequence order was correct when computing log probs.
This PR instead passes the permuted sequence to the loss function together with an additional full tensor
seq_index
which records the order of the permuted sequence and allows parallel logprob computation even when TP is also enabled.
Test Result
cp8
tp4cp2
The average step time is reduced by more than 3 seconds.
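The seq_index idea above can be sketched with plain tensors (no CP ranks involved). All names and shapes here are illustrative assumptions, not taken from the PR:

```python
import torch

# Under CP the sequence reaches the loss function permuted; a full tensor
# seq_index records which original position each permuted slot came from,
# so per-token results can be written back in order without first
# gathering the full [b, s, v] logits.
b, s = 2, 8
seq_index = torch.randperm(s)            # stand-in for the CP permutation
logprobs_in_order = torch.randn(b, s)    # per-token logprobs, original order
logprobs_permuted = logprobs_in_order[:, seq_index]  # what the CP ranks see

# Scatter back into original sequence order using seq_index.
restored = torch.empty_like(logprobs_permuted)
restored[:, seq_index] = logprobs_permuted
assert torch.equal(restored, logprobs_in_order)
```

Because the scatter is per-position, each rank's slice of the permuted sequence can be processed independently, which is what enables the parallel logprob computation even with TP sharding on the vocab dimension.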