8000 First Invasive check through external K8s Job - dcgmi by cmisale · Pull Request #8 · IBM/autopilot · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

First Invasive check through external K8s Job - dcgmi #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 19, 2024
Merged

First Invasive check through external K8s Job - dcgmi #8

merged 11 commits into from
Mar 19, 2024

Conversation

cmisale
Copy link
Collaborator
@cmisale cmisale commented Mar 6, 2024

This PR creates a singleton clientset that we can use anywhere in the code to get access to kube api object.

It might be an overkill, but for now we can keep this.

An example of usage

cset := utils.GetClientsetInstance()
pods, err := cset.Cset.CoreV1().Pods("autopilot").List(context.TODO(), metav1.ListOptions{})

I also updated golang version, it seemed like a good moment to do that.

@cmisale cmisale marked this pull request as draft March 7, 2024 18:21
@cmisale cmisale changed the title Setting up for Node API read/update [DO NOT MERGE] Setting up for Node API read/update 8000 Mar 7, 2024
cmisale added 8 commits March 15, 2024 09:46
Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
first take on node labeling on dcgm 3 failure

first take on node labeling on dcgm 3 failure

explicit set of resources, plus some minor stuff

Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
Signed-off-by: Claudia <c.misale@ibm.com>
@cmisale cmisale changed the title [DO NOT MERGE] Setting up for Node API read/update First Invasive check through external K8s Job - dcgmi Mar 15, 2024
@cmisale cmisale marked this pull request as ready for review March 15, 2024 15:29
@cmisale
Copy link
Collaborator Author
cmisale commented Mar 15, 2024

PR is updated with core for creating an external Job running dcgmi diag -r 3 IFF GPUs are idle.

Each Autopilot daemon pod is responsible for checking ONLY the resources on the node they belong to.

Each dcgmi Job:

  • Is a standalone pod assigned directly to the relevant node through NodeName.
  • It is NOT created if GPUs are currently used.
  • It requests all 8 GPUs on the node, so no other workload can take that node.
  • Is launched every 4h by default. The value can be customized through Helm. When set to 0, the feature is deactivated
  • Will self delete after 4h, made possible by TTLSecondsAfterFinished
  • Will not restart in any case, including failure (i.e., return -1 kind of scenario). Made possible by setting BackoffLimit to 0
  • Will label the relevant node with key autopilot/dcgm.level.3 and PASS or ERR value, with a UTC timestamp, failure type (GPU Memory) and the affected GPU IDs. For instance autopilot/dcgm.level.3=PASS_2024-03-15_15.11.57UTC or autopilot/dcgm.level.3=ERR_2024-03-15_15.11.57UTC_GPU_Memory.0.1.2
  • Periodic checks and invasive checks are mutually exclusive. A mutex prevents autopilot from running both at the same time.

The PR also contains a bunch of refactoring and bug fixes and improvements

  • ping test, wrong unreachable node count
  • ping test, improved output with net iface id assigned to the the unreachable IP
  • ping test, if there is a discrepancy between autopilot pod's multi-nic annotation and the actual ifaces (listed with ifconfig, for instance), ping won't be run. This is just local to the pod, i.e., there's no distributed info exchange about this. A log entry is printed (this could become a prometheus metrics, like a flag saying the pod is healthy or not), the pod should be restarted but best thing would be to restart multi-nic controller
  • pciebw test, improved error handling

@cmisale cmisale mentioned this pull request Mar 15, 2024

func GetClientsetInstance() *K8sClientset {
var lock = &sync.Mutex{}
if k8sClientset == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a code cleanliness perspective, I think we can remove this outer == nil check and acquire the lock every time

Copy link
Collaborator
@jimcadden jimcadden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @cmisale !!

@cmisale cmisale merged commit 18b8087 into main Mar 19, 2024
@cmisale cmisale deleted the dcgm branch March 19, 2024 17:47
@cmisale cmisale mentioned this pull request Apr 17, 2024
eletonia pushed a commit to eletonia/autopilot that referenced this pull request Aug 13, 2024
* update to golang version

Signed-off-by: Claudia <c.misale@ibm.com>

* add singleton clientset instance

Signed-off-by: Claudia <c.misale@ibm.com>

* changes to dependencies

Signed-off-by: Claudia <c.misale@ibm.com>

* intrusive checks enablement

Signed-off-by: Claudia <c.misale@ibm.com>

* This is a combination of 3 commits.

first take on node labeling on dcgm 3 failure

first take on node labeling on dcgm 3 failure

explicit set of resources, plus some minor stuff

Signed-off-by: Claudia <c.misale@ibm.com>

* bugfix: better error handling

Signed-off-by: Claudia <c.misale@ibm.com>

* bugfix: unreachable node count

Signed-off-by: Claudia <c.misale@ibm.com>

* Minor touches. Set invasive timer to 0 to avoid them

Signed-off-by: Claudia <c.misale@ibm.com>

* Update functions.go

Removed comments

---------

Signed-off-by: Claudia <c.misale@ibm.com>
Mete4 pushed a commit to Mete4/autopilot-dashboard that referenced this pull request Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0