[LLM] [Cherry-Pick] valid loss before optimizer step (#9255) by SylarTiaNII · Pull Request #9705 · PaddlePaddle/PaddleNLP · GitHub

[LLM] [Cherry-Pick] valid loss before optimizer step (#9255) #9705


Merged
merged 1 commit into PaddlePaddle:develop on Dec 27, 2024

Conversation

@SylarTiaNII (Contributor) commented Dec 26, 2024

PR types

New features

PR changes

Others

Description

CP from: #9255
Report an error when the loss becomes NaN or Inf, for integration with the LLM training platform.
[Pcard-88789]

paddle-bot bot commented Dec 26, 2024

Thanks for your contribution!

codecov bot commented Dec 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.02%. Comparing base (a0c4b4c) to head (0f556d8).
Report is 263 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9705      +/-   ##
===========================================
+ Coverage    51.99%   53.02%   +1.02%     
===========================================
  Files          726      718       -8     
  Lines       115607   112374    -3233     
===========================================
- Hits         60110    59581     -529     
+ Misses       55497    52793    -2704     


@@ -1133,6 +1133,9 @@ def fused_allreduce_gradients_no_sync(paramlist, hcg):
             if self.args.pipeline_parallel_degree <= 1 and self._enable_delay_scale_loss():
                 tr_loss /= self.args.gradient_accumulation_steps
+
+            # assert if loss is invalid
+            self._check_loss_valid(tr_loss)
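
For reference, a minimal sketch of what a NaN/Inf guard like _check_loss_valid could look like (the body below is an illustrative assumption, not the actual PaddleNLP implementation):

import paddle

def check_loss_valid(loss: paddle.Tensor) -> None:
    # Hypothetical stand-in for Trainer._check_loss_valid(tr_loss) in the diff above:
    # raise early when the accumulated loss is NaN or Inf, so the training platform
    # sees a hard failure before the optimizer step is taken.
    if not paddle.isfinite(loss).item():
        raise ValueError(f"Invalid loss before optimizer step: {loss.item()}")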

Collaborator

A question: if NaN or Inf appears in the loss during float16 training, will this raise an error directly?

Contributor Author

FP16 training will not raise an error; the function implementation below contains a check for that case.
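
To illustrate the reply above, a variant of the earlier sketch with an fp16 bypass (the flag and control flow are assumptions for illustration; the real check in PaddleNLP may key off different arguments):

import paddle

def check_loss_valid(loss: paddle.Tensor, fp16_enabled: bool) -> None:
    # Hypothetical: under fp16/AMP, a transient NaN/Inf loss is expected when the
    # loss scaler overflows and is handled by skipping the step and shrinking the
    # loss scale, so the hard error is only raised for full-precision training.
    if fp16_enabled:
        return
    if not paddle.isfinite(loss).item():
        raise ValueError(f"Invalid loss before optimizer step: {loss.item()}")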

@wawltor (Collaborator) left a comment

LGTM

@wawltor merged commit 691ae01 into PaddlePaddle:develop on Dec 27, 2024
10 of 12 checks passed