I know this is contrary to what we discussed on Discord but I think I see a nicer solution:
|
Won't it run locally then, instead of on the remote node?
That part of gce.sh builds the |
Got it
#!/usr/bin/env bash
set -ex

nvidia-smi -pm ENABLED || true
The || true is needed, as validators are not configured with a GPU machine even if enableGpu is true (unless leader rotation is enabled).
Just drop the set -ex, then the || true goes away too. set -x adds like no value anyway, since the script is effectively a constant.
So just a one-liner script?
#!/usr/bin/env bash
nvidia-smi -pm ENABLED
👍 The latest commit matches this.
@mvines It seems this is not working. I think that is because the main startupScript has "set -ex" at the top, and this script gets "cat"-ed into the startupScript.
oh right, sorry for the bad advice. yeah we need the || true after all
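To illustrate why the guard matters once a fragment is concatenated into a script that already runs under "set -e", here is a minimal sketch. It is not the project's startup script: run_under_errexit is a hypothetical helper, and false stands in for nvidia-smi failing on a machine without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a fragment the way a concatenated startup script
# would see it, i.e. with errexit already switched on by the parent script.
run_under_errexit() {
  bash -c "set -e; $1; echo reached-end"
}

# 'false' is a stand-in for a command that fails (assumption: nvidia-smi
# returning non-zero on a node without a GPU).
without_guard=$(run_under_errexit 'false' || true)
with_guard=$(run_under_errexit 'false || true')

echo "without guard: '${without_guard}'"
echo "with guard: '${with_guard}'"
```

Without the guard the child shell exits at the failing command and never prints reached-end; with || true the failure is swallowed and the rest of the script continues.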
* Load nvidia drivers on node startup
* added new script to enable nvidia driver persistent mode
* remove set -ex
Problem
The testnet is not running CUDA-enabled code even though the hardware has the required GPUs and software.
Summary of Changes
There was a race condition between running the Solana binaries and loading the Nvidia drivers. The scripts sometimes failed to detect the CUDA device and started non-CUDA daemons instead.
This change loads the drivers before detecting the CUDA device.
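The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: select_flavour and the NVIDIA_SMI override are hypothetical names, and probing with nvidia-smi -L is an assumed detection method (the actual scripts may detect the CUDA device differently).

```shell
#!/usr/bin/env bash
# Hypothetical override so the sketch can be exercised without a GPU.
NVIDIA_SMI=${NVIDIA_SMI:-nvidia-smi}

# Hypothetical helper: decide which daemon flavour to start.
select_flavour() {
  # Step 1: enable persistence mode first, which loads the driver.
  # The guard keeps non-GPU nodes from aborting under "set -e".
  "$NVIDIA_SMI" -pm ENABLED >/dev/null 2>&1 || true

  # Step 2: only now probe for a GPU (assumed detection method).
  if "$NVIDIA_SMI" -L >/dev/null 2>&1; then
    echo cuda
  else
    echo non-cuda
  fi
}
```

Calling select_flavour on a node where nvidia-smi is missing or fails prints non-cuda; the point of the sketch is only that the driver load happens before the probe, so the probe no longer races with driver initialisation.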