[Dashboard] Add GPU component usage #52102
Conversation
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhaoch23 <c233zhao@uwaterloo.ca>
Script to test:

```python
import ray
import torch
import os
import time

# Initialize Ray, using all available GPUs
ray.init()

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")


@ray.remote(num_gpus=0.5)
class TorchGPUWorker:
    def __init__(self):
        assert torch.cuda.is_available(), "CUDA is not available"
        print(os.getenv("CUDA_VISIBLE_DEVICES"))
        self.device = torch.device("cuda")
        print(f"Worker running on device: {self.device}")

    def matrix_multiply(self, size=16384):
        # Allocate two large matrices on the GPU and warm up once
        a = torch.randn(size, size, device=self.device)
        b = torch.randn(size, size, device=self.device)
        torch.matmul(a, b)
        torch.cuda.synchronize()

        # Repeat the multiplication a few times to keep the GPU busy
        REPEATS = 6
        start = time.time()
        for _ in range(REPEATS):
            c = torch.matmul(a, b)
        torch.cuda.synchronize()
        # Return the norm of the last product so the driver has a value to print
        return torch.norm(c).item()


if __name__ == "__main__":
    # Create actors, each reserving half a GPU
    gpu_workers = [TorchGPUWorker.remote() for _ in range(4)]
    # Run GPU tasks repeatedly so the dashboard has time to sample usage
    for i in range(100):
        result = ray.get([worker.matrix_multiply.remote() for worker in gpu_workers])
        print(f"Result norm of matrix multiply on GPU: {result}")
```
expr="sum(ray_component_gpu_utilization{{{global_filters}}} / 100) by (Component, pid, GpuIndex, GpuDeviceName)", | ||
legend="{{Component}}::{{pid}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}", | ||
), | ||
], | ||
), | ||
Panel( | ||
id=46, | ||
title="Component GPU Memory Usage", | ||
description="GPU memory usage of Ray components.", | ||
unit="bytes", | ||
targets=[ | ||
Target( | ||
expr="sum(ray_component_gpu_memory_usage{{{global_filters}}}) by (Component, pid, GpuIndex, GpuDeviceName)", | ||
legend="{{Component}}::{{pid}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}", |
let's remove pid to align with the Node CPU component graph.
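For illustration, a minimal sketch of what the panel target could look like with the `pid` label dropped, assuming the same `Target` helper used in the snippet above (the exact grouping is up to the PR author):

```python
Target(
    # Group only by component and GPU, so series are not split per worker pid
    expr="sum(ray_component_gpu_utilization{{{global_filters}}} / 100) by (Component, GpuIndex, GpuDeviceName)",
    legend="{{Component}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}",
),
```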
if pid == "-": # no process on this GPU | ||
continue | ||
gpu_id = int(gpu_id) | ||
pinfo = ProcessGPUInfo( |
Can we use a different type here? `gpu_memory_usage` is of type `Megabytes`, so it's very confusing for this to be a percentage and may introduce tricky bugs later.
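One way to keep the units unambiguous is a dedicated alias next to the existing `Megabytes` one. This is only a sketch; `Percentage` and the `ProcessGPUUsage` field names are hypothetical, not code from this PR:

```python
from dataclasses import dataclass
from typing import NewType, Optional

Megabytes = NewType("Megabytes", int)
Percentage = NewType("Percentage", float)


@dataclass
class ProcessGPUUsage:
    """Illustrative per-process record; field names are assumptions."""

    pid: int
    gpu_memory_usage: Megabytes             # always megabytes
    gpu_utilization: Optional[Percentage]   # always 0-100, never a memory value
```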
```python
                    if nv_process.usedGpuMemory
                    else 0
                ),
                gpu_utilization=None,  # Not available in pynvml
```
What if we match the Ray Dashboard behavior, where we show the total GPU utilization of the GPU that the process attaches to, not necessarily the utilization exclusive to that process?
The nvidia-smi parsing is fragile; I don't know what backwards-compatibility guarantees nvidia-smi provides.
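For reference, a sketch of the pynvml-only behavior being described: attribute the device-level utilization to every process attached to that GPU. This assumes the standard `pynvml` bindings from nvidia-ml-py and is not the exact code in this PR:

```python
import pynvml


def per_process_gpu_stats():
    """Yield (gpu_index, pid, used_memory_bytes, device_utilization_percent).

    The utilization is the whole device's utilization, shared by every
    process attached to it, mirroring the existing Ray dashboard behavior.
    """
    pynvml.nvmlInit()
    try:
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                mem = proc.usedGpuMemory or 0  # may be None on some drivers
                yield idx, proc.pid, mem, util
    finally:
        pynvml.nvmlShutdown()
```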
Do you mean we give up nvidia-smi? Or do we implement a fallback strategy that uses pynvml to display the total GPU utilization if nvidia-smi is not available?
Yes, let's remove the usage of nvidia-smi and just use pynvml all the time. We can add nvidia-smi at a later time if there is enough demand, but I think for most use cases the pynvml approach should be good enough.
Discussed offline. We will be adding the nvidia-smi dependency, and we will add a test validating the output of `nvidia-smi pmon`. We will also update the Ray dashboard UI to use the GPU utilization value from nvidia-smi instead of pynvml.
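For context, a minimal sketch of how per-process utilization could be read from one `nvidia-smi pmon` sample. The column layout is assumed from typical pmon output; the actual parser in this PR may differ:

```python
import subprocess
from typing import Dict, Tuple


def sample_pmon() -> Dict[Tuple[int, int], int]:
    """Return {(gpu_index, pid): sm_utilization_percent} from one pmon sample.

    Assumes the usual `nvidia-smi pmon` columns:
    gpu, pid, type, sm, mem, enc, dec, command. A "-" in the pid column
    means no process is attached to that GPU.
    """
    out = subprocess.run(
        ["nvidia-smi", "pmon", "-c", "1"],
        capture_output=True, text=True, check=True,
    ).stdout

    usage: Dict[Tuple[int, int], int] = {}
    for line in out.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines
        fields = line.split()
        gpu_id, pid, sm = fields[0], fields[1], fields[3]
        if pid == "-":  # no process on this GPU
            continue
        usage[(int(gpu_id), int(pid))] = 0 if sm == "-" else int(sm)
    return usage


if __name__ == "__main__":
    print(sample_pmon())
```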
Signed-off-by: zhaoch23 <c233zhao@uwaterloo.ca>
Signed-off-by: zhaoch23 <c233zhao@uwaterloo.ca>
I have fixed some potential parsing errors. This is what it looks like on my side:
It's fixed by merging main, can you check it? Thanks!
@Bye-legumes could you update the PR description with what changed and attach a screenshot?
updated!
I think it is OK this time, @jjyao.
```
@@ -993,6 +1127,81 @@ def generate_worker_stats_record(self, worker_stats: List[dict]) -> List[Record]

        return records

    def generate_worker_gpu_stats_record(
```
For the `component_cpu_percentage` metric, we don't emit one record per worker process; instead we group by task/actor name (in other words, the `pid` label is not set for the core-worker `component_cpu_percentage` metric):

```python
def generate_worker_stats_record(self, worker_stats: List[dict]) -> List[Record]:
    """Generate a list of Record class for worker processes.

    This API automatically sets the component_name of record as
    the name of worker processes. I.e., ray::* so that we can report
    per task/actor (grouped by a func/class name) resource usages.
```

We should do the same thing for `component_gpu_percentage`.
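A rough sketch of what grouping the GPU records by task/actor name (rather than per pid) could look like. The field names and the surrounding Record plumbing are assumptions for illustration, not the exact code in this PR:

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_gpu_usage_by_component(
    worker_gpu_stats: List[dict],
) -> Dict[str, Dict[str, float]]:
    """Sum per-process GPU usage into one entry per component name.

    Each input dict is assumed to carry "component_name" (e.g. "ray::Actor"),
    "gpu_utilization" (percent), and "gpu_memory_usage" (megabytes).
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"gpu_utilization": 0.0, "gpu_memory_usage": 0.0}
    )
    for stat in worker_gpu_stats:
        name = stat["component_name"]  # ray::<task/actor name>, no pid label
        totals[name]["gpu_utilization"] += stat.get("gpu_utilization") or 0.0
        totals[name]["gpu_memory_usage"] += stat.get("gpu_memory_usage") or 0.0
    return totals
```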
Updated
@zhaoch23 it seems the updated code is not pushed.
Sorry, pushed just now.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Why are these changes needed?
Close #45755.
This PR addresses the need for enhanced GPU usage metrics at the task/actor level in the Ray dashboard. Currently, the Ray dashboard provides detailed CPU and memory usage metrics for individual tasks and actors, but lacks similar granularity for GPU metrics. This enhancement aims to fill that gap by introducing per-task/actor GPU utilization and memory usage metrics.
- `dashboard/agent.py`, `dashboard/modules/stats_collector.py` (falls back to `nvidia-smi --query-gpu` if NVML is not available).
- `dashboard/frontend/src/pages/node/Stats.vue`, `dashboard/frontend/src/components/ResourceIcon.tsx`: `gpu-core`, `gpu-mem` icons and tooltip helpers.
- `python/ray/dashboard/tests/test_gpu_stats.py`: `CUDA_VISIBLE_DEVICES=0` + mock NVML bindings to assert Dashboard JSON schema and time-series values.
- `pylint: disable=c-extension-no-member` guards, build-time NVML check in `setup.py`.

Related issue number

Close #45755.

Checks

- I've signed off every commit (by using the -s flag, i.e. `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.