Load nvidia drivers on node startup by pgarg66 · Pull Request #2263 · solana-labs/solana · GitHub
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Load nvidia drivers on node startup #2263

Merged: 3 commits, Dec 21, 2018


pgarg66 (Contributor) commented Dec 21, 2018

Problem

The testnet is not running CUDA-enabled code even though the hardware has the required GPUs and software.

Summary of Changes

There was a race condition between starting the Solana binaries and loading the Nvidia drivers. The scripts sometimes failed to detect the CUDA device and started non-CUDA daemons instead.

This change loads the drivers before detecting the CUDA device.
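The ordering described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names, the NVIDIA_SMI variable, and the cuda/non-cuda labels are hypothetical, and detecting a GPU with nvidia-smi -L is an assumption about how detection could be done.

```shell
#!/usr/bin/env bash
# Sketch only: names below are hypothetical, not taken from the Solana scripts.

# Command used to talk to the driver; overridable so the flow can be exercised
# on a machine without a GPU.
NVIDIA_SMI=${NVIDIA_SMI:-nvidia-smi}

# Ask the driver to load and stay resident. A failure (no GPU, tool missing)
# is tolerated so that non-GPU machines still start.
load_nvidia_driver() {
  "$NVIDIA_SMI" -pm ENABLED >/dev/null 2>&1 || true
}

# Succeeds only if the tool exists and can list at least one GPU.
cuda_device_present() {
  command -v "$NVIDIA_SMI" >/dev/null 2>&1 && "$NVIDIA_SMI" -L >/dev/null 2>&1
}

# Load first, detect second, so detection never races the driver load.
select_daemon_flavor() {
  load_nvidia_driver
  if cuda_device_present; then
    echo cuda
  else
    echo non-cuda
  fi
}

echo "would start the $(select_daemon_flavor) daemon"
```

The point of the sketch is only the order of the two steps: if detection ran before the driver was loaded, the non-CUDA branch could be taken on a machine that does have a GPU.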

@pgarg66 pgarg66 requested a review from mvines December 21, 2018 18:01
mvines (Contributor) commented Dec 21, 2018

I know this is contrary to what we discussed on Discord but I think I see a nicer solution:

  • Revert the remote-node.sh changes
  • Create a new script net/scripts/enable-nvidia-peristence-mode.sh that runs the command nvidia-smi -pm ENABLED (ref)
  • Modify net/gce.sh to include enable-nvidia-peristence-mode.sh if enableGpu=true

pgarg66 (Contributor, Author) commented Dec 21, 2018

Won't that run it locally, instead of on the remote node?

mvines (Contributor) commented Dec 21, 2018

That part of gce.sh builds the instance-startup-script.sh that is run on each instance at boot, before setting the /.instance-startup-complete flag on the instance to indicate it's ready.
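A rough sketch of that idea: the startup script is assembled locally from fragments but only executed later on the instance. Everything here is illustrative: the gpu-fragment.sh file name, the build_startup_script function, and the use of touch to set the flag are assumptions, not the actual contents of gce.sh.

```shell
#!/usr/bin/env bash
# Sketch only: file and function names are hypothetical stand-ins.

enableGpu=true
workdir=$(mktemp -d)

# Stand-in for the GPU fragment that would live under net/scripts/.
cat > "$workdir/gpu-fragment.sh" <<'EOF'
nvidia-smi -pm ENABLED || true
EOF

# Concatenate the pieces into one script. Nothing is executed here; the
# result is what each instance would run at boot.
build_startup_script() {
  local out=$1
  {
    echo '#!/usr/bin/env bash'
    echo 'set -ex'
    if [ "$enableGpu" = true ]; then
      cat "$workdir/gpu-fragment.sh"
    fi
    # Assumed way of setting the readiness flag mentioned in the thread.
    echo 'touch /.instance-startup-complete'
  } > "$out"
}

build_startup_script "$workdir/instance-startup-script.sh"
cat "$workdir/instance-startup-script.sh"
```

Because the fragment is pasted in as text, it inherits whatever shell options the enclosing script sets at the top.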

pgarg66 (Contributor, Author) commented Dec 21, 2018

Got it

#!/usr/bin/env bash
set -ex

nvidia-smi -pm ENABLED || true
pgarg66 (Contributor, Author):

The || true is needed, as validators are not configured with a GPU machine even if enableGpu is true (unless leader rotation is enabled).

mvines (Contributor):

Just drop the set -ex, then the || true goes away too.

set -x adds like no value anyway, since the script is effectively a constant.

pgarg66 (Contributor, Author):

So just a one-liner script?

mvines (Contributor):

#!/usr/bin/env bash
nvidia-smi -pm ENABLED

pgarg66 (Contributor, Author):

👍 The latest commit matches this.

pgarg66 (Contributor, Author):

@mvines It seems this is not working. I think that's because the main startupScript has "set -ex" at the top, and this script gets "cat"-ed into the startupScript.

mvines (Contributor):

Oh right, sorry for the bad advice. Yeah, we need the || true after all.
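The behaviour discussed in this thread can be reproduced with a small sketch. A deliberately failing command (false) stands in for nvidia-smi on a machine without a GPU, and the run_inlined helper is hypothetical. Only set -e is used here; the -x in the real startup script just adds tracing and does not affect the outcome.

```shell
#!/usr/bin/env bash
# Sketch only: run_inlined is a hypothetical helper for illustration.

# Paste a fragment into a script that starts with set -e, the way the
# startup script inlines its pieces, and run it in a separate shell.
run_inlined() {
  bash -c "set -e
$1
echo reached-end"
}

# Without the guard, the failing command aborts the enclosing script.
run_inlined 'false' || echo "aborted with status $?"   # prints: aborted with status 1

# With the guard, the failure is swallowed and the script continues.
run_inlined 'false || true'                            # prints: reached-end
```

Dropping set -ex from the fragment itself does not help, since the option that matters is the one set by the script the fragment is pasted into.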

@pgarg66 pgarg66 merged commit 4bf797c into solana-labs:master Dec 21, 2018
@pgarg66 pgarg66 deleted the voting branch December 21, 2018 19:43
pgarg66 referenced this pull request in pgarg66/solana Dec 21, 2018
* Load nvidia drivers on node startup

* added new script to enable nvidia driver persistent mode

* remove set -ex
pgarg66 added a commit that referenced this pull request Dec 21, 2018
willhickey pushed a commit that referenced this pull request Jul 24, 2024
Szymongib pushed a commit to ChorusOne/solana that referenced this pull request Jul 29, 2024