Support pinning to local rank GPU index in Spark estimators #3737
Conversation
Unit Test Results (with flaky tests): 1 282 files (-13), 1 282 suites (-13), 13h 30m 55s ⏱️ (-6s). For more details on these failures, see this check. Results for commit 6f8491b. ± Comparison against base commit bfe96f8. ♻️ This comment has been updated with latest results.
In general LGTM, just a minor comment.
Force-pushed from 0b451d3 to c84b240 (compare)
I'd say the problem is here:
This should read
With
your executors get two GPUs assigned. With
each task running on those multi-GPU executors gets a single GPU assigned. Shouldn't
This works only if GPUs are configured in EXCLUSIVE_PROCESS mode. Unfortunately, in many clusters, GPUs are too expensive to be dedicated to a single process.
I think this solves most issues. In the beginning, I was wondering why Horovod didn't consider that available_devices could be larger than 1, so I suggested the env PR. Now it looks like, combining the two updates, Horovod can work better. Say we have 3 VMs, each with 4 GPUs. There are three kinds of Spark settings.

Case 1. Each VM has 1 Spark executor, and all 4 GPUs are assigned to that executor. We want to init 4 Horovod processes on each VM since it has 4 GPUs. This works, as available_devices will always be like ['0', '1', '2', '3'].

Case 2. Each VM has 4 Spark executors, and each executor has been assigned 1 GPU. We still want to init 4 Horovod processes on each VM. This will work.

Case 3. Each VM has 1 Spark executor but only 1 or even 0 GPUs are assigned to it. (In this case the configuration should be improved on the Spark side, but it can happen.) We still want to init 4 Horovod processes on each VM. This will do the trick.
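The selection rule behind the three cases above can be sketched in plain Python. Note this is an illustrative model, not Horovod's actual code: pick_gpu is a hypothetical helper, and the use_default_index flag stands in for the proposed env override.

```python
# Hypothetical sketch of the pinning rule the three cases imply;
# pick_gpu is an illustrative helper, not Horovod's implementation.

def pick_gpu(local_rank, available_devices, use_default_index=False):
    """Return the GPU address a process with the given local rank should use.

    available_devices: GPU addresses the Spark task sees (may be empty, one, or many).
    use_default_index: emulate the proposed USE_DEFAULT_GPU_INDEX override.
    """
    if use_default_index or not available_devices:
        # Case 3: fall back to the local rank itself (the default index).
        return str(local_rank)
    # Cases 1 and 2: index into whatever Spark assigned, by local rank.
    return available_devices[local_rank % len(available_devices)]

# Case 1: one executor per VM with all 4 GPUs -> each local rank gets its own GPU.
assert [pick_gpu(r, ["0", "1", "2", "3"]) for r in range(4)] == ["0", "1", "2", "3"]

# Case 2: four executors per VM, 1 GPU each -> each process uses its single device.
assert pick_gpu(0, ["2"]) == "2"

# Case 3: no GPUs visible to Spark -> pin to the local rank directly.
assert [pick_gpu(r, []) for r in range(4)] == ["0", "1", "2", "3"]
```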
One thing I'd like to mention is that to add … And the …
Cases 1 and 2 are understood. Case 3 looks like two cases. Case 3.1: there is one GPU assigned to the task but you want the local rank GPU to be used, which could be a completely different GPU (not visible in Spark resources). Case 3.2 is the current behaviour, right?
Yes, Case 3.2 is the current behaviour. Case 3.1 could happen too. I agree that it should be handled first on the Spark cluster side. A good thing is that our solution for Case 3.2 would also benefit Case 3.1.
I am still not convinced that a wrong GPU configuration provided to Horovod on Spark should be worked around by Horovod.
I agree with you, @EnricoMi, Horovod has no responsibility for working around others' issues. In this PR, changes for … What do you think? Thanks :-)
Hi @EnricoMi , may I ask for your decision again? Thanks. |
Force-pushed from 7e338b9 to 3933d69 (compare)
@thinkall sorry for the long delay. I have reviewed the PR and made the following changes:
Let me know what you think.
Hi @EnricoMi, no problem for the delay :-) And your changes sound great to me! Thanks.
…local rank) Signed-off-by: Li Jiang <bnujli@gmail.com>
…ling it" This reverts commit 75b59c6. Signed-off-by: Enrico Minack <github@enrico.minack.dev>
…ling it" This reverts commit d9584b2. Signed-off-by: Enrico Minack <github@enrico.minack.dev>
…method Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Force-pushed from dfe245c to 6f8491b (compare)
Fixed DCO failure and rebased on master. @EnricoMi
Signed-off-by: Li Jiang bnujli@gmail.com
Checklist before submitting
Description
When using Spark estimators (torch/lightning/keras) in GPU clusters, get_available_devices will be called (horovod/spark/task/task_info.py, lines 25 to 28 in d29bc04) in _get_assigned_gpu_or_default (horovod/spark/common/util.py, lines 766 to 774 in d29bc04). And get_available_devices will get the devices by calling _get_resources (horovod/spark/task/task_service.py, lines 98 to 107 in d29bc04).
It works well in single-GPU-per-node clusters. However, in clusters with multiple GPUs per node, a wrong GPU index may be returned.
For Spark clusters, getGpusResources.sh is usually used for discovering GPU resources.
For clusters with multiple GPUs per node, if multiple GPUs are assigned to one executor, then the resources will be like:
In this case, _get_resources will return {"name": "gpu", "addresses": ["0", "1"]}, thus _get_assigned_gpu_or_default will always return 0, which is then pinned to different Horovod processes. As a result, only one GPU (always index 0) of each node can be used by Horovod Spark estimators. If one GPU is assigned to one executor, then the resources will be like:
In this case, still only one GPU (always index 0) of each node can be used by Horovod Spark estimators.
Only when each node has multiple executors and each executor has been assigned different GPUs will all GPUs be available to Horovod Spark estimators, like below:
However, in this case, GPUs must be configured in EXCLUSIVE_PROCESS mode. Unfortunately, Spark clusters with GPUs usually also need spark-rapids, which means that if we configure GPUs in EXCLUSIVE_PROCESS mode, we can only use spark-rapids or Horovod, but not both. To solve the issue, letting us use all GPUs on all nodes and leverage both spark-rapids and Horovod seems meaningful.
In fact, it's quite easy to handle. Adding an env USE_DEFAULT_GPU_INDEX for manually pinning to the default GPU index (local rank) will do the trick. With this small change, users can manually pin to the default GPU index (local rank) just like they do for Horovod runners.
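The override described above could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's exact diff: the real _get_assigned_gpu_or_default lives in horovod/spark/common/util.py and may differ in signature and details.

```python
import os

# Illustrative sketch of the proposed USE_DEFAULT_GPU_INDEX override;
# not the actual implementation in horovod/spark/common/util.py.

def get_assigned_gpu_or_default(local_rank, available_devices):
    """Pick the GPU for this task: a Spark-assigned address, or the local rank."""
    if os.environ.get("USE_DEFAULT_GPU_INDEX"):
        # Manual override: pin to the default index (the process's local rank),
        # mirroring what users do with plain Horovod runners.
        return local_rank
    if available_devices:
        # Use the first address Spark assigned to this task.
        return int(available_devices[0])
    # No GPUs visible to Spark at all: fall back to the local rank.
    return local_rank

os.environ.pop("USE_DEFAULT_GPU_INDEX", None)
assert get_assigned_gpu_or_default(3, ["0", "1"]) == 0  # always the first address
os.environ["USE_DEFAULT_GPU_INDEX"] = "1"
assert get_assigned_gpu_or_default(3, ["0", "1"]) == 3  # pinned to local rank
```

With the env set, every process on a node picks a distinct GPU by local rank, so all GPUs get used even when Spark hands each task the same address list.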
Review process to land