feat: utilization/memory usage on CPU and GPU #420
Conversation
Codecov Report
@@           Coverage Diff           @@
##             main     #420   +/-   ##
=======================================
  Coverage   14.86%   14.86%
=======================================
  Files          48       48
  Lines        2832     2832
=======================================
  Hits          421      421
  Misses       2382     2382
  Partials       29       29
Left a few comments. Please consider addressing them.
if torch.cuda.is_available():
    gpu_thread.start()
elif ml_framework == MLFramework.TENSORFLOW:
Is it necessary to check which ML framework is in use for GPU stat monitoring?
I think it may be enough to revise lines 56 and 57.
At N.nvmlInit(), you can check whether there is an error (e.g., pynvml.nvml.NVMLError_LibraryNotFound).
At dcount = N.nvmlDeviceGetCount(), you can check whether dcount is zero or not.
Then you may not need to know which ML framework is in use.
Right, the reason I checked the framework was to make sure it was set up so that the GPUs could actually be accessed. Otherwise, I believe we could end up monitoring the GPUs under TensorFlow even though it isn't set up to use them.
I can add a check for dcount == 0.
Update: I will check the overhead and change this.
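For reference, a minimal sketch of the framework-agnostic check being suggested above (the helper name is illustrative; it assumes pynvml is imported as N, matching the snippet under review):

    import pynvml as N

    def gpu_monitoring_available() -> bool:
        """Return True only if NVML loads and at least one GPU is present."""
        try:
            N.nvmlInit()
        except N.NVMLError:  # e.g. NVMLError_LibraryNotFound when no NVIDIA driver
            return False
        try:
            return N.nvmlDeviceGetCount() > 0
        finally:
            N.nvmlShutdown()

With a check like this, the decision to start gpu_thread no longer depends on which ML framework is in use.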
import threading
import psutil
from collections import defaultdict
from flame.common.util import MLFramework, get_ml_framework_in_use
It's a good idea to separate standard packages from the others (third-party ones and our own). Perhaps if you apply black, it may do the reformatting.
I believe I applied black; does it usually do that for you?
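For what it's worth, to my knowledge black does not reorder imports; a tool such as isort typically handles that, and the grouping being suggested would come out roughly like this:

    # Standard library
    import threading
    from collections import defaultdict

    # Third-party
    import psutil

    # Project-local
    from flame.common.util import MLFramework, get_ml_framework_in_use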
Left some comments that could be considered, but generally LGTM.
if tensorflow.config.experimental.list_physical_devices("GPU"):
    gpu_thread.start()

def gather_gpu_stats(self, interval=1):
Do you know how heavy it is to gather GPU stats every second? psutil, which we use for CPU, seems to be known for its efficiency, but I was wondering whether gathering GPU stats every second induces any overhead. What would be an appropriate interval for this?
I'm not certain of this. So far, I haven't run any large-scale example with the metric collection added in. Maybe I should do that first.
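One rough way to answer the overhead question (a hypothetical micro-benchmark, not part of the PR) is to time a single polling pass and compare it against the chosen interval:

    import time
    import pynvml as N

    N.nvmlInit()
    handles = [N.nvmlDeviceGetHandleByIndex(i) for i in range(N.nvmlDeviceGetCount())]

    start = time.perf_counter()
    for handle in handles:
        N.nvmlDeviceGetUtilizationRates(handle)          # utilization sample
        N.nvmlDeviceGetComputeRunningProcesses(handle)   # per-process memory sample
    elapsed = time.perf_counter() - start

    print(f"one polling pass took {elapsed * 1000:.2f} ms")
    N.nvmlShutdown()

If a pass takes on the order of a few milliseconds, sampling every second should add negligible overhead; otherwise the interval could be raised.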
# GPU utilization
# TODO: implement metric gathering for process-specific utilization of the GPUs
try:
    self.stat_log[f"gpu{d}_utilization"].append(
Maybe we want to check whether GPU d is used by our pid, and if it isn't, log the utilization as 0, since any utilization would be drawn by other processes?
That would add some overhead, but I can try.
Actually, do you think it would be misleading given the way I measure memory usage of the process?
Memory usage looks fine. My point is that we may need to subtract the utilization of GPUs that aren't running any Flame processes, which we can at least detect. We could do this in post-processing by looking for 0 memory usage, but I would prefer doing it within the same loop, as checking the pid wouldn't add much overhead.
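A rough sketch of the pid check being discussed (the helper name is illustrative; handle is the NVML device handle for GPU d):

    import os
    import pynvml as N

    def sample_gpu_utilization(handle):
        """Return the device's utilization only if our process runs on it, else 0."""
        our_pid = os.getpid()
        procs = N.nvmlDeviceGetComputeRunningProcesses(handle)
        if any(p.pid == our_pid for p in procs):
            return N.nvmlDeviceGetUtilizationRates(handle).gpu
        # the device is busy only with other processes; attribute nothing to us
        return 0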
def accumulate(self, mtype, alias, value):
    key = self.get_key(mtype, alias)
    self.state_dict[key] = value + self.state_dict.get(key, 0)
    logger.debug(f"Accumulating metric state_dict[{key}] = {self.state_dict[key]}")

def save_log_statistics(self):
Can we add more than this, like quartiles (25, 50, 75), median, and std? Using a numpy array makes it a lot easier, as it has its own functions such as np.mean, np.min, np.median, np.quantile, etc.
Yeah, this was just an initial idea for the stat collection. I can change it to numpy if it's in the requirements.
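For illustration, a minimal sketch of what save_log_statistics could compute with numpy for one metric's samples (the field names are assumptions):

    import numpy as np

    def summarize(samples):
        """Summarize one metric's list of samples from stat_log."""
        arr = np.asarray(samples, dtype=float)
        return {
            "mean": float(arr.mean()),
            "std": float(arr.std()),
            "min": float(arr.min()),
            "p25": float(np.percentile(arr, 25)),
            "median": float(np.median(arr)),
            "p75": float(np.percentile(arr, 75)),
            "max": float(arr.max()),
        }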
left one comment
sum(
    [
        proc.usedGpuMemory
        for proc in N.nvmlDeviceGetComputeRunningProcesses(
nvmlDeviceGetComputeRunningProcesses is called twice (once here and once in line 72). I think it can be called only once (e.g., in line 63) to compute both memory and utilization at the same time.
Also, this may be the pythonic way, but the code structure looks quite complicated: for and if statements embedded as an argument of a function call (e.g., append), together with sum. Can you simplify the code?
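A hedged sketch of the single-call restructuring being requested, written as a helper for one device snapshot (the helper name and the pid filtering are assumptions; usedGpuMemory can be None when the driver withholds it):

    import os
    import pynvml as N

    def sample_device(handle, our_pid=None):
        """Read per-process memory and device utilization from one NVML snapshot."""
        our_pid = our_pid if our_pid is not None else os.getpid()
        procs = N.nvmlDeviceGetComputeRunningProcesses(handle)
        # memory used by our process on this device
        memory = sum(p.usedGpuMemory or 0 for p in procs if p.pid == our_pid)
        # utilization from the same snapshot, attributed only if we run on the device
        if any(p.pid == our_pid for p in procs):
            utilization = N.nvmlDeviceGetUtilizationRates(handle).gpu
        else:
            utilization = 0
        return memory, utilization

Both values could then be appended to stat_log in one place, removing the nested sum/comprehension from the append call.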
left one comment
lgtm
lgtm
Description
The metric collector now runs two additional threads to track GPU/CPU utilization and memory usage. So far, GPU utilization is still not specific to the running process.
Additionally, the metric collector was refactored to store samples in a stat_log variable. Metric names now use a period to separate the metric type and alias.
setup.py has been updated to include gpustat and psutil.
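For illustration only, the period-separated naming described above would produce keys along these lines (the order of the two parts and the example values are assumptions on my part, not taken from the diff):

    def get_key(mtype: str, alias: str) -> str:
        # e.g. get_key("cpu_utilization", "trainer") -> "cpu_utilization.trainer"
        return f"{mtype}.{alias}"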
Type of Change
Checklist