Auto-detect bf16 support for CUDA #993
Conversation
@Mergifyio rebase
❌ Unable to rebase: user
@Mergifyio rebase
❌ Base branch update has failed
Force-pushed from d646d59 to ecc1a38
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8e3d568 to 7999e89
Force-pushed from 5f01310 to 0519a1d
@tiran what's the status on this? Thanks!
@leseb I have rebased the PR. Let's see if tests are now passing.
I'm currently putting this to the test; I'll report my results shortly.
Kindly requesting an edit to the newly added CHANGELOG.md file, since this looks like a nice improvement for CPU-only systems. Thanks!
Sharing my unimpressive results here. Without the patch, I stopped at 29% after 1h28min. With this patch, I also stopped at 29% after 1h28min, so the results were identical on my machine.
On a test system with 64 GB RAM, this memory calculation came out as 62, not 64, so check for 60 instead of 64. Obviously this is not very scientific, as we're making very rough assumptions about what is required. It would be better to enhance the code further to actually calculate a memory requirement based on the model instead of just hard-coding a rough guess. Signed-off-by: Russell Bryant <rbryant@redhat.com>
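A minimal, runnable sketch of the kind of check described in this commit message, assuming psutil is installed; the 60 GiB threshold is taken from the commit, but the print-based branching is illustrative only, not the exact code in the PR (the actual diff hunk appears below):

```python
import psutil

# Total system RAM in GiB; a nominal 64 GB machine typically reports ~62 GiB,
# so the threshold is deliberately set below 64.
total_memory = psutil.virtual_memory().total / (1024**3)

if total_memory < 60:
    # Rough guess: below ~60 GiB we assume there is not enough headroom
    # and stick with the more conservative defaults.
    print(f"Only {total_memory:.1f} GiB of RAM detected; keeping conservative defaults.")
else:
    print(f"{total_memory:.1f} GiB of RAM detected; enabling the larger-memory code path.")
```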
I spoke with @leseb on Slack and we determined that the memory check came out to ~62 GiB on a 64 GB system.
Here are the results I've been waiting for :) on the same system as commented in #993 (comment). Previously it took 1h28min to barely reach 29% of the training; now the whole training took 1h19min.
torch_dtype = "auto" if device.type == "cuda" else None
if device.type == "cpu":
    total_memory = psutil.virtual_memory().total / (1024**3)
    if total_memory < 60:
Suggested change: if total_memory < 60: → if total_memory < 62:
A system with 64 GB of RAM will report:
>>> import psutil
>>> mem = psutil.virtual_memory()
>>> mem
svmem(total=67228049408, available=31099351040, percent=53.7, used=35383861248, free=468701184, active=27983499264, inactive=37159084032, buffers=1079336960, cached=30296150016, shared=2109440, slab=1340628992)
So we have 67228049408 bytes; converted to GiB, 67228049408 / 1024**3 ≈ 62.6 GiB.
# There's more going on here and needs deeper exploration to find
# the right parameters to be checking for choosing the best
# configuration.
# Anecdotally, 64 GB seems to be enough, but this calculation
A system with 64 GB of RAM will report ~62.6 GiB, so we base our calculation on 62.
Since it's such a rough guess, 60 still seems fine? We need to actually do some math at some point ...
I'll share my math in a few :) stay tuned!
Some more numbers (see the sketch after this list):
- The training part takes ~30 GB of RAM to process; there is a very small chance that this could work on a very minimal Linux installation, by minimal I mean only system-critical services running and nothing else.
- The inference part takes ~35 GB of RAM.
Essentially, a system with 48 GB of RAM should be able to run both training and inference, although 48 GB of RAM is not very common.
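Purely as an illustration, these anecdotal numbers could be folded into a small pre-flight check; the `check_ram` helper and the 30/35 GiB thresholds are assumptions taken from this comment, not code from the PR:

```python
import psutil

# Anecdotal requirements from local testing (GiB); rough guesses, not measured limits.
TRAIN_RAM_GIB = 30
INFERENCE_RAM_GIB = 35

def check_ram(required_gib: float) -> bool:
    """Return True if total system RAM meets the rough requirement."""
    total_gib = psutil.virtual_memory().total / (1024**3)
    return total_gib >= required_gib

if not check_ram(TRAIN_RAM_GIB):
    print("Warning: less RAM than the ~30 GiB observed during training.")
if not check_ram(INFERENCE_RAM_GIB):
    print("Warning: less RAM than the ~35 GiB observed during inference.")
```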
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has merge conflicts that must be resolved before it can be merged. @tiran please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @tiran! Are you still working on this PR? We're looking to do some housekeeping and close out stale PRs, including drafts. If we don't hear back within 7 days, we will close this PR, but please know that you are more than welcome to reopen it if you'd like! Thank you!
Changes
Which issue is resolved by this Pull Request:
See #647
Description of your changes:
bf16 (bfloat16) is not available on CUDA versions older than 11.0, nor on devices with CUDA compute capability below 8.0. linux_train now detects and reports bf16 support; training on CUDA falls back to fp16 (half-precision float) when bf16 is unavailable.
Also closes #1006.
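For reference, a minimal sketch of how bf16 support can be probed on CUDA with PyTorch; the `cuda_supports_bf16` helper is hypothetical and shows the general technique (CUDA build >= 11 and compute capability >= 8.0), not necessarily the exact code added to linux_train:

```python
import torch

def cuda_supports_bf16() -> bool:
    """Rough bf16 capability probe for the current CUDA device."""
    if not torch.cuda.is_available():
        return False
    cuda_version = torch.version.cuda  # e.g. "12.1"; None on non-CUDA builds
    if cuda_version is None or int(cuda_version.split(".")[0]) < 11:
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8

# Fall back to fp16 when bf16 is not supported.
dtype = torch.bfloat16 if cuda_supports_bf16() else torch.float16
print(f"Training dtype on CUDA: {dtype}")
```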