feat: utilization/memory usage on CPU and GPU by GustavBaumgart · Pull Request #420 · cisco-open/flame · GitHub

feat: utilization/memory usage on CPU and GPU #420


Merged
merged 1 commit into cisco-open:main on May 26, 2023

Conversation

GustavBaumgart
Collaborator

Description

The metric collector now runs two additional threads to track GPU/CPU utilization and memory usage. So far, GPU utilization is still not specific to the process that is running.

Additionally, the metric collector was refactored in order to store samples in a variable stat_log. Names of metrics now use a period to separate the metric type and alias.

setup.py has been updated to include gpustat and psutil.
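For context, the following is a minimal sketch of what such a background sampler with period-separated metric keys could look like. The MetricSampler class, method names, and key details here are illustrative rather than the PR's actual code; the psutil and threading calls are standard.

```python
# Minimal sketch (illustrative, not the PR's implementation) of a CPU sampler
# thread that appends samples into stat_log under "mtype.alias" keys.
import threading
import time
from collections import defaultdict

import psutil


class MetricSampler:
    def __init__(self):
        # stat_log maps "metric_type.alias" keys to lists of raw samples
        self.stat_log = defaultdict(list)
        self._stop = threading.Event()

    def get_key(self, mtype: str, alias: str) -> str:
        # metric names use a period to separate the metric type and alias
        return f"{mtype}.{alias}"

    def gather_cpu_stats(self, interval: float = 1.0) -> None:
        proc = psutil.Process()
        while not self._stop.is_set():
            self.stat_log[self.get_key("cpu", "utilization")].append(
                psutil.cpu_percent(interval=None)
            )
            self.stat_log[self.get_key("cpu", "memory_rss")].append(
                proc.memory_info().rss
            )
            time.sleep(interval)

    def start(self) -> None:
        threading.Thread(target=self.gather_cpu_stats, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()
```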

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@codecov-commenter commented May 20, 2023

Codecov Report

Merging #420 (23ec4c5) into main (f14b5b3) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #420   +/-   ##
=======================================
  Coverage   14.86%   14.86%           
=======================================
  Files          48       48           
  Lines        2832     2832           
=======================================
  Hits          421      421           
  Misses       2382     2382           
  Partials       29       29           

@myungjin (Contributor) left a comment

Left a few comments. Please consider addressing them.


    if torch.cuda.is_available():
        gpu_thread.start()
elif ml_framework == MLFramework.TENSORFLOW:
Contributor

Is it necessary to check which ML framework is in use for GPU stat monitoring?

I think it may be enough to revise lines 56 and 57.

At N.nvmlInit(), you can check whether there is an error (e.g., pynvml.nvml.NVMLError_LibraryNotFound).

At dcount = N.nvmlDeviceGetCount(), you can check whether dcount is zero.

Then you would not need to know which ML framework is in use.
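A sketch of that suggestion, assuming a hypothetical gpu_monitoring_available helper (the pynvml calls and exception type are real; gpu_thread is the thread from the diff above):

```python
import pynvml as N


def gpu_monitoring_available() -> bool:
    """Return True if NVML can be initialized and at least one GPU is present."""
    try:
        N.nvmlInit()
    except N.NVMLError:  # covers e.g. NVMLError_LibraryNotFound on machines without a driver
        return False
    return N.nvmlDeviceGetCount() > 0


if gpu_monitoring_available():
    gpu_thread.start()  # gpu_thread as defined in the surrounding code
```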

Collaborator Author

Right, the reason I checked the framework was to make sure it was set up so the GPUs could actually be accessed. Otherwise, I believe we could end up monitoring the GPUs under TensorFlow even though it isn't set up to use them.

I can add a check for dcount == 0.

Collaborator Author

Update: I will check the overhead and change this.

import threading
import psutil
from collections import defaultdict
from flame.common.util import MLFramework, get_ml_framework_in_use
Contributor

It's a good idea to separate standard-library packages from the others (third-party ones and our own). Perhaps if you apply black, it may do the reformatting.
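For reference, the grouping being suggested looks roughly like this (note that black by itself does not reorder imports; a tool like isort is what usually enforces the grouping):

```python
# standard library
import threading
from collections import defaultdict

# third-party
import psutil

# first-party / our own
from flame.common.util import MLFramework, get_ml_framework_in_use
```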

Collaborator Author

I believe I applied black; does it usually do that for you?

@jaemin-shin (Collaborator) left a comment

Left some comments that could be considered, but generally LGTM.

if tensorflow.config.experimental.list_physical_devices("GPU"):
    gpu_thread.start()

def gather_gpu_stats(self, interval=1):
Collaborator

Do you know how heavy it is to gather GPU stats every second? psutil, which we use for CPU, is known for its efficiency, but I was wondering whether gathering GPU stats every second introduces any overhead. What would be an appropriate interval for this?
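One way to answer this empirically would be a quick micro-benchmark of a single NVML query (illustrative only, not part of the PR):

```python
import time

import pynvml as N

N.nvmlInit()
handle = N.nvmlDeviceGetHandleByIndex(0)

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    N.nvmlDeviceGetUtilizationRates(handle)
    N.nvmlDeviceGetMemoryInfo(handle)
per_sample_ms = (time.perf_counter() - start) / iterations * 1000
print(f"average cost per sample: {per_sample_ms:.2f} ms")
```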

Collaborator Author

I'm not certain about this. So far, I haven't run any large-scale example with the metric collection enabled. Maybe I should do that first.

# GPU utilization
# TODO: implement metric gathering for process-specific utilization of the GPUs
try:
    self.stat_log[f"gpu{d}_utilization"].append(
Collaborator

Maybe we want to check whether GPU d is used by our PID, and if it isn't, log it as 0, since any utilization there would be drawn by other processes?

Collaborator Author

This would add some overhead, but I can try.

Collaborator Author

Actually, do you think it would be misleading given the way I measure memory usage of the process?

Collaborator

Memory usage looks fine. My point is that we may need to zero out the utilization reported for GPUs that aren't running any Flame processes, which we can at least determine. We could do this in post-processing by looking for zero memory usage, but I would prefer doing it within the same loop, since checking the PID wouldn't add much overhead.
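A sketch of that PID check, assuming nvmlInit() has already been called; sample_gpu_utilization is a hypothetical helper, and the key name follows the diff above:

```python
import os

import pynvml as N


def sample_gpu_utilization(d: int, stat_log: dict) -> None:
    """Append GPU d's utilization, or 0 if this process is not using GPU d."""
    handle = N.nvmlDeviceGetHandleByIndex(d)
    procs = N.nvmlDeviceGetComputeRunningProcesses(handle)
    if any(proc.pid == os.getpid() for proc in procs):
        util = N.nvmlDeviceGetUtilizationRates(handle).gpu
    else:
        util = 0  # any utilization on this GPU is drawn by other processes
    stat_log[f"gpu{d}_utilization"].append(util)
```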

def accumulate(self, mtype, alias, value):
    key = self.get_key(mtype, alias)
    self.state_dict[key] = value + self.state_dict.get(key, 0)
    logger.debug(f"Accumulating metric state_dict[{key}] = {self.state_dict[key]}")

def save_log_statistics(self):
Collaborator

Can we add more than this? For example, quartiles (25, 50, 75), median, and std? Using a numpy array makes this a lot easier, since it has its own functions like np.mean, np.min, np.median, np.percentile, etc.
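The kind of summary being asked for could look like this (a sketch; the summarize helper is hypothetical):

```python
import numpy as np


def summarize(samples: list) -> dict:
    """Summary statistics over one stat_log entry."""
    arr = np.asarray(samples, dtype=float)
    return {
        "mean": float(np.mean(arr)),
        "std": float(np.std(arr)),
        "min": float(np.min(arr)),
        "p25": float(np.percentile(arr, 25)),
        "median": float(np.median(arr)),
        "p75": float(np.percentile(arr, 75)),
        "max": float(np.max(arr)),
    }
```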

Collaborator Author

Yeah, this was just an initial idea for the stat collection. I can change it to numpy if it's in the requirements.

@myungjin (Contributor) left a comment

left one comment

sum(
    [
        proc.usedGpuMemory
        for proc in N.nvmlDeviceGetComputeRunningProcesses(
Contributor

nvmlDeviceGetComputeRunningProcesses is called twice (once here and again on line 72). I think it can be called only once (e.g., on line 63), computing both memory and utilization at the same time.

Also, this may be the Pythonic way, but the code structure looks complicated: for and if clauses embedded as an argument to a function call (e.g., append), wrapped in sum as well.

Can you simplify the code?
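A sketch of the suggested simplification, assuming nvmlInit() has been called; sample_device is a hypothetical helper, and the line numbers above refer to the diff, not this snippet:

```python
import os

import pynvml as N


def sample_device(d: int) -> tuple:
    """Call nvmlDeviceGetComputeRunningProcesses once per device and derive both
    this process's memory use on GPU d and whether it uses GPU d at all."""
    handle = N.nvmlDeviceGetHandleByIndex(d)
    our_pid = os.getpid()
    our_memory = 0
    uses_gpu = False
    for proc in N.nvmlDeviceGetComputeRunningProcesses(handle):
        if proc.pid == our_pid:
            uses_gpu = True
            # usedGpuMemory may be None if the driver cannot report it
            our_memory += proc.usedGpuMemory or 0
    return our_memory, uses_gpu
```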

@myungjin (Contributor) left a comment

lgtm

@GustavBaumgart merged commit 53a850e into cisco-open:main on May 26, 2023
dhruvsgarg pushed a commit to dhruvsgarg/flame that referenced this pull request Oct 18, 2024
The metric collector now runs two additional threads to track GPU/CPU utilization and memory usage.
So far, GPU utilization is still not specific to the process that is running.

Additionally, the metric collector was refactored in order to store samples in a variable stat_log.
Names of metrics now use a period to separate the metric type and alias.

setup.py has been updated to include gpustat and psutil.
Labels: none yet · Projects: none yet · 4 participants