[llm]add bf16 moment adamw #9732
Conversation
Codecov Report
Attention: Patch coverage is low; most of the 1,059 new lines are not covered (+83 hits vs. +976 misses).
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9732      +/-   ##
===========================================
- Coverage    51.41%   51.03%    -0.39%
===========================================
  Files          745      745
  Lines       118351   119410    +1059
===========================================
+ Hits         60856    60939      +83
- Misses       57495    58471     +976
Thanks for your contribution!
@@ -0,0 +1,15 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
2024 -> 2025
done
# Update param
if master_weight_ptr is not None:
    tl.store(master_weight_ptr + offsets, param, mask=mask)
tl.store(param_ptr + offsets, param.to(tl.bfloat16), mask=mask)
The design here needs to take the dtype of the optimizer's original parameters into account. Should the float16 case also be handled? Some open-source models use float16.
This is handled now.
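For illustration, here is a minimal sketch (not this PR's actual kernel) of how the final store could respect the parameter's original dtype by passing a constexpr flag from the Python launcher. The kernel name, the `updated_ptr` fp32 buffer, and the `IS_BF16` flag are assumptions made for the example.

```python
import triton
import triton.language as tl


@triton.jit
def store_param_kernel(
    param_ptr,              # model parameter in its original dtype (bf16 or fp16)
    master_weight_ptr,      # optional fp32 master copy; may be passed as None
    updated_ptr,            # fp32 buffer holding the already-updated parameter values
    n_elements,
    IS_BF16: tl.constexpr,  # chosen by the launcher from the parameter's dtype
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    param = tl.load(updated_ptr + offsets, mask=mask)

    # Keep the fp32 master copy when one is maintained.
    if master_weight_ptr is not None:
        tl.store(master_weight_ptr + offsets, param, mask=mask)

    # Cast back to the model's storage dtype before writing the parameter.
    if IS_BF16:
        tl.store(param_ptr + offsets, param.to(tl.bfloat16), mask=mask)
    else:
        tl.store(param_ptr + offsets, param.to(tl.float16), mask=mask)
```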
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
Why are the offsets computed with arange here?
My understanding is that each program needs to read all elements in [0, BLOCK_SIZE) of its block and operate on them.
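For context, here is a minimal sketch of the standard Triton blocking pattern the reply describes (an illustrative kernel, not one from this PR): each program instance handles one block of BLOCK_SIZE consecutive elements, `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` gives that block's element indices, and the mask guards the tail of the tensor. The launcher would start roughly ceil(n_elements / BLOCK_SIZE) program instances.

```python
import triton
import triton.language as tl


@triton.jit
def scale_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)                                  # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # indices of this block's elements
    mask = offsets < n_elements                             # ignore lanes past the end of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)                 # read the whole block at once
    tl.store(out_ptr + offsets, x * scale, mask=mask)       # write the result back
```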
paddlenlp/utils/optimizer.py
Outdated
@@ -149,3 +154,227 @@ def adamw_python(
    beta1_pow[:], beta2_pow[:] = beta1 * beta1_pow[:], beta2 * beta2_pow[:]
    # figure out how to update this
    return


class AdamWPython(AdamW):
Isn't this name a bit odd? Wouldn't a name signaling a naive implementation, something like AdamWSlow, be more appropriate?
Renamed to AdamWCustom.
    type = core.VarDesc.VarType.DENSE_TENSOR
except:
    type = core.VarDesc.VarType.LOD_TENSOR
self._add_accumulator(
Are beta1 and beta2 float32 in the paper?
yes
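For reference, a minimal numerical sketch (not this PR's implementation) of one AdamW step in which the two moments are stored in bfloat16 while beta1_pow / beta2_pow and all intermediate arithmetic stay in float32. Function and variable names are illustrative, and treating the beta powers as plain Python floats is an assumption made to keep the example short.

```python
import paddle


def adamw_step_bf16_moments(param, grad, m_bf16, v_bf16, beta1_pow, beta2_pow,
                            lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Promote everything to float32 for the arithmetic.
    p = param.astype("float32")
    g = grad.astype("float32")
    m = m_bf16.astype("float32")
    v = v_bf16.astype("float32")

    # Decoupled weight decay (the "W" in AdamW).
    p = p * (1.0 - lr * weight_decay)

    # Moment updates.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Bias correction uses the float32 beta powers.
    m_hat = m / (1.0 - beta1_pow)
    v_hat = v / (1.0 - beta2_pow)
    p = p - lr * m_hat / (paddle.sqrt(v_hat) + eps)

    # Parameters go back to their original dtype, moments back to bfloat16;
    # the beta powers stay float32.
    return (p.astype(param.dtype), m.astype("bfloat16"), v.astype("bfloat16"),
            beta1_pow * beta1, beta2_pow * beta2)
```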
LGTM
PR types
New features
PR changes
Others
Description
To use it, simply add --optim adamw_16bit_moment.