Fix GPU options by nicolaschan · Pull Request #1 · ucb-rit/brc_oodapps · GitHub

Fix GPU options #1

Open. nicolaschan wants to merge 4 commits into base: main

Conversation

nicolaschan
Contributor
  • Remove manual option for CPUs (was previously --ntasks-per-core)
  • Automatically set --cpus-per-task to 2 times the number of GPUs
  • Ask users to enter only the number of GPUs, for example 2 instead of gpu:2
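
A minimal sketch of the behavior described in the list above. The helper name `gpu_slurm_args` and the constant `CPUS_PER_GPU` are hypothetical and are not taken from this PR's diff; the sketch only assumes that the form hands over the GPU count as a string.

```ruby
# Hypothetical helper, not the PR's actual code.
CPUS_PER_GPU = 2

# Turns the number of GPUs entered in the form into Slurm arguments.
def gpu_slurm_args(num_gpus)
  # Integer() is strict: input such as "gpu:2" raises ArgumentError.
  count = Integer(num_gpus.to_s, 10)
  raise ArgumentError, "number of GPUs must be positive" unless count.positive?

  ["--gres=gpu:#{count}", "--cpus-per-task=#{count * CPUS_PER_GPU}"]
end

gpu_slurm_args("2") # => ["--gres=gpu:2", "--cpus-per-task=4"]
```

Using the strict `Integer()` conversion here is a design choice for the sketch: it rejects the old `gpu:2` style of input rather than silently misreading it.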

@paciorek
Contributor

Having users specify number of gpus and the rest be automated seems great -- much more user-friendly.

Are we ok with eliminating the possibility that they could request a more complicated gres specification in return for the simplicity/friendliness? My initial reaction is yes.

@paciorek
Contributor

@nicolaschan your revised help text for the gres field says a user can specify number and type of GPU. I don't think that this will work because you then do gres_value.to_i(), which would presumably fail if given something like k80:1. I might be missing something...
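
For reference, Ruby's `String#to_i` does not raise on input like this. It parses any leading digits and returns 0 when there are none, so `k80:1` would quietly become 0 GPUs rather than fail. The sketch below shows that, along with one possible way to accept both forms; `parse_gres` is a hypothetical helper, not something in this PR.

```ruby
"2".to_i     # => 2
"k80:1".to_i # => 0 (no leading digits, and no exception)

# Hypothetical helper: accepts "COUNT" or "TYPE:COUNT" and returns [type_or_nil, count].
def parse_gres(value)
  match = /\A(?:(?<type>[A-Za-z0-9_]+):)?(?<count>\d+)\z/.match(value.to_s.strip)
  raise ArgumentError, "expected COUNT or TYPE:COUNT, got #{value.inspect}" unless match

  [match[:type], match[:count].to_i]
end

parse_gres("2")     # => [nil, 2]
parse_gres("k80:1") # => ["k80", 1]
```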

@nicolaschan
Contributor Author

Good catch, thanks Chris!

@paciorek
Contributor

I think we need equivalent changes for the MATLAB and RStudio apps.

@kmuriki
kmuriki commented May 4, 2021

@nicolaschan This is a very good improvement, but I have a few concerns. (1) We do not want to remove the CPU cores option altogether, because we need it for the HTC partitions. So you want to put the toggle_cpu_cores routine back in place and add logic there to check whether it's a GPU partition and, if so, use the *2 multiplier instead of displaying the cpu_cores field. But then again, (2) we are assuming users will need only GPUs * 2 CPU cores. What if a user wants to use 2 GPUs and 8 CPU cores? I'm not sure how common that case is. Maybe we can just improve the help text to say you have to ask for *2 or more CPU cores and leave it at that, instead of applying automatic multipliers on the backend? Comments? Thoughts?

@nicolaschan
Contributor Author

If someone requests 1 GPU but all of the CPU cores, won't that stop anyone else from using the other available GPUs? If this is the case, then perhaps this should not be allowed (that is, you need to request exactly 2*GPUs).

@kmuriki
kmuriki commented May 6, 2021

Yeah, if a user needs that weird combination of 1 GPU and all CPUs, so be it. Why should we block it? Users are charged appropriately. Ideally we should ask for the number of GPUs first in the form, and based on the number they enter, if the number-of-CPUs field is empty we should fill in a recommendation of 2 * GPUs, add a help note that it has to be at least 2 * GPUs, and still allow them to modify the number-of-CPUs field. Does that make sense?
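
A rough sketch of that suggestion, assuming 2 * GPUs is treated as a minimum (as the earlier comment about "*2 or above" implies). The helper `resolve_cpus` and its arguments are hypothetical; the real form logic may live in JavaScript or ERB instead.

```ruby
# Hypothetical helper: resolves the CPU count from the two form fields.
def resolve_cpus(num_gpus, cpus_field)
  minimum = 2 * num_gpus
  # An empty CPU field falls back to the recommended 2 * GPUs.
  return minimum if cpus_field.nil? || cpus_field.to_s.strip.empty?

  cpus = Integer(cpus_field.to_s.strip, 10)
  raise ArgumentError, "need at least #{minimum} CPUs for #{num_gpus} GPUs" if cpus < minimum

  cpus # the user's own value is kept when it meets the minimum
end

resolve_cpus(2, "")  # => 4
resolve_cpus(2, "8") # => 8
```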

@nicolaschan
Contributor Author

Ah, ok. savio3_2080ti has 32 CPU cores but only 8 GPUs. So if you want the whole node you'll need to request more CPUs than required. I've added the CPU selection option back to the form for GPU partitions.
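
The arithmetic behind that, using only the node figures quoted in the comment above (32 cores, 8 GPUs) and the 2x multiplier discussed in this thread; `cpus_left_unrequested` is a hypothetical name used for illustration.

```ruby
# Hypothetical helper: cores on a node that a fixed cores-per-GPU rule would never request.
def cpus_left_unrequested(node_cpus, node_gpus, cpus_per_gpu = 2)
  node_cpus - node_gpus * cpus_per_gpu
end

# savio3_2080ti figures from the comment: 32 CPU cores, 8 GPUs.
cpus_left_unrequested(32, 8) # => 16
```

With a hard 2x rule, a job asking for all 8 GPUs would get 16 cores, so half of the node's cores could not be requested through the form.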

@paciorek
Contributor

@nicolaschan I see that the OOD config is such that --ntasks-per-node is set. But in our standard job example for a non-OOD GPU job, we suggest setting --cpus-per-task with --ntasks=1.

Is there a reason to do it differently for OOD? I guess the effect is the same, so I suppose it doesn't really matter, but it could be that based on the non-OOD usage pattern, a user might expect SLURM_CPUS_PER_TASK to be set.
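
One way user code could cope with either convention is sketched below. It assumes that Slurm exports SLURM_CPUS_PER_TASK only when --cpus-per-task is given and SLURM_NTASKS_PER_NODE only when --ntasks-per-node is given; `cores_available` is a hypothetical helper, and the fallback to 1 is an arbitrary choice for this sketch.

```ruby
# Hypothetical helper: picks a per-node core count from the job environment.
# Prefers SLURM_CPUS_PER_TASK (non-OOD example style), then
# SLURM_NTASKS_PER_NODE (OOD config style), then falls back to 1.
def cores_available(env = ENV)
  value = env["SLURM_CPUS_PER_TASK"] || env["SLURM_NTASKS_PER_NODE"]
  value ? Integer(value, 10) : 1
end

cores_available({ "SLURM_NTASKS_PER_NODE" => "4" }) # => 4
```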

@nicolaschan
Contributor Author

You can't run multi-node jobs with --ntasks=1. See the warning below:

[nicolaschan@ln000 ~]$ srun -A ac_scsguest -p savio --ntasks=1 --nodes=2 -t 0:10:00 --pty bash -i
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: job 8704371 queued and waiting for resources

@paciorek
Contributor

Good point. I don't know how often users would have multi-node GPU jobs via OOD, but I guess it is something we want to accommodate. Our GPU scheduler example is a one-node example.

@tin6150
Collaborator
tin6150 commented Sep 24, 2021

I guess at this point we are not going to have time in the near future to debug this multi-GPU task. Maybe keep OOD simple for now, as we don't have the bandwidth for this? Converting this to a draft request for future consideration.

@paciorek
Contributor

Agreed - it's not clear what we want to do in terms of any automation of setting cpus relative to gpus.
