feat: utilization/memory usage on CPU and GPU #420
Conversation
Codecov Report
@@           Coverage Diff           @@
##             main     #420   +/-   ##
=======================================
  Coverage   14.86%   14.86%
=======================================
  Files          48       48
  Lines        2832     2832
=======================================
  Hits          421      421
  Misses       2382     2382
  Partials       29       29
Left a few comments. Please consider addressing them.
if torch.cuda.is_available():
    gpu_thread.start()
elif ml_framework == MLFramework.TENSORFLOW:
Is it necessary to check which ML framework is in use for GPU stat monitoring?
I think it may be enough to revise lines 56 and 57.
At N.nvmlInit(), you can check whether there is an error (e.g., pynvml.nvml.NVMLError_LibraryNotFound).
At dcount = N.nvmlDeviceGetCount(), you can check whether dcount is zero or not.
Then you may not need to know which ML framework is in use.
Right, the reason I checked the framework was to make sure it was set up so that the GPUs could actually be accessed. Otherwise, I believe we could end up monitoring the GPUs under TensorFlow even though it isn't set up to use them.
I can add a check for dcount == 0.
Update: I will check the overhead and change this.
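For reference, a minimal sketch of the framework-agnostic check being suggested above (the helper name is illustrative; it assumes pynvml is imported as N, matching the snippet under review):

    import pynvml as N

    def gpu_monitoring_available() -> bool:
        """Return True only if NVML loads and at least one GPU is present."""
        try:
            N.nvmlInit()
        except N.NVMLError:  # e.g. NVMLError_LibraryNotFound when no NVIDIA driver
            return False
        try:
            return N.nvmlDeviceGetCount() > 0
        finally:
            N.nvmlShutdown()

With a check like this, the decision to start gpu_thread no longer depends on which ML framework is in use.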
import threading
import psutil
from collections import defaultdict
from flame.common.util import MLFramework, get_ml_framework_in_use
It's a good idea to separate standard packages from the others (third-party ones and our own). Perhaps if you apply black, it may do the reformatting.
I believe I applied black; does it usually do that for you?
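For what it's worth, to my knowledge black does not reorder imports; a tool such as isort typically handles that, and the grouping being suggested would come out roughly like this:

    # Standard library
    import threading
    from collections import defaultdict

    # Third-party
    import psutil

    # Project-local
    from flame.common.util import MLFramework, get_ml_framework_in_use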
Left some comments that could be considered, but generally LGTM.
if tensorflow.config.experimental.list_physical_devices("GPU"):
    gpu_thread.start()

def gather_gpu_stats(self, interval=1):
Do you know how heavy it is to gather GPU stats every second? psutil, which we use for CPU, seems to be known for its efficiency, but I was wondering whether gathering GPU stats every second induces any overhead. What would be an appropriate interval for this?
I'm not certain of this. So far, I haven't run any large-scale example with the metric collection added in. Maybe I should do that first.
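One rough way to answer the overhead question (a hypothetical micro-benchmark, not part of the PR) is to time a single polling pass and compare it against the chosen interval:

    import time
    import pynvml as N

    N.nvmlInit()
    handles = [N.nvmlDeviceGetHandleByIndex(i) for i in range(N.nvmlDeviceGetCount())]

    start = time.perf_counter()
    for handle in handles:
        N.nvmlDeviceGetUtilizationRates(handle)          # utilization sample
        N.nvmlDeviceGetComputeRunningProcesses(handle)   # per-process memory sample
    elapsed = time.perf_counter() - start

    print(f"one polling pass took {elapsed * 1000:.2f} ms")
    N.nvmlShutdown()

If a pass takes on the order of a few milliseconds, sampling every second should add negligible overhead; otherwise the interval could be raised.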
# GPU utilization
# TODO: implement metric gathering for process-specific utilization of the GPUs
try:
    self.stat_log[f"gpu{d}_utilization"].append(
Maybe we want to check whether GPU d is used by our pid, and if it isn't, log the utilization as 0, since any utilization would be drawn by other processes?
That would add some overhead, but I can try.
Actually, do you think it would be misleading given the way I measure memory usage of the process?
Memory usage looks fine. My point is that we may need to subtract the utilization of GPUs that aren't running any Flame processes, which we can at least detect. We could do this in post-processing by looking for 0 memory usage, but I would prefer doing it within the same loop, as checking the pid wouldn't add much overhead.
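A rough sketch of the pid check being discussed (the helper name is illustrative; handle is the NVML device handle for GPU d):

    import os
    import pynvml as N

    def sample_gpu_utilization(handle):
        """Return the device's utilization only if our process runs on it, else 0."""
        our_pid = os.getpid()
        procs = N.nvmlDeviceGetComputeRunningProcesses(handle)
        if any(p.pid == our_pid for p in procs):
            return N.nvmlDeviceGetUtilizationRates(handle).gpu
        # the device is busy only with other processes; attribute nothing to us
        return 0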
def accumulate(self, mtype, alias, value):
    key = self.get_key(mtype, alias)
    self.state_dict[key] = value + self.state_dict.get(key, 0)
    logger.debug(f"Accumulating metric state_dict[{key}] = {self.state_dict[key]}")

def save_log_statistics(self):
Can we add more than this, like quartiles (25, 50, 75), median, and std? Using a numpy array makes it a lot easier, as it has its own functions such as np.mean, np.min, np.median, np.quantile, etc.
Yeah, this was just an initial idea for the stat collection. I can change it to numpy if it's in the requirements.
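For illustration, a minimal sketch of what save_log_statistics could compute with numpy for one metric's samples (the field names are assumptions):

    import numpy as np

    def summarize(samples):
        """Summarize one metric's list of samples from stat_log."""
        arr = np.asarray(samples, dtype=float)
        return {
            "mean": float(arr.mean()),
            "std": float(arr.std()),
            "min": float(arr.min()),
            "p25": float(np.percentile(arr, 25)),
            "median": float(np.median(arr)),
            "p75": float(np.percentile(arr, 75)),
            "max": float(arr.max()),
        }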
left one comment
sum(
    [
        proc.usedGpuMemory
        for proc in N.nvmlDeviceGetComputeRunningProcesses(
nvmlDeviceGetComputeRunningProcesses is called twice (once here and once in line 72). I think it can be called only once (e.g., in line 63) to compute both memory and utilization at the same time.
Also, this may be the pythonic way, but the code structure looks quite complicated: for and if statements embedded as an argument of a function call (e.g., append), together with sum. Can you simplify the code?
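A hedged sketch of the single-call restructuring being requested, written as a helper for one device snapshot (the helper name and the pid filtering are assumptions; usedGpuMemory can be None when the driver withholds it):

    import os
    import pynvml as N

    def sample_device(handle, our_pid=None):
        """Read per-process memory and device utilization from one NVML snapshot."""
        our_pid = our_pid if our_pid is not None else os.getpid()
        procs = N.nvmlDeviceGetComputeRunningProcesses(handle)
        # memory used by our process on this device
        memory = sum(p.usedGpuMemory or 0 for p in procs if p.pid == our_pid)
        # utilization from the same snapshot, attributed only if we run on the device
        if any(p.pid == our_pid for p in procs):
            utilization = N.nvmlDeviceGetUtilizationRates(handle).gpu
        else:
            utilization = 0
        return memory, utilization

Both values could then be appended to stat_log in one place, removing the nested sum/comprehension from the append call.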
left one comment
lgtm
lgtm
Description
The metric collector now runs two additional threads to track GPU/CPU utilization and memory usage. So far, GPU utilization is still not specific to the running process.
Additionally, the metric collector was refactored to store samples in a stat_log variable. Metric names now use a period to separate the metric type and alias.
setup.py has been updated to include gpustat and psutil.
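For illustration only, the period-separated naming described above would produce keys along these lines (the order of the two parts and the example values are assumptions on my part, not taken from the diff):

    def get_key(mtype: str, alias: str) -> str:
        # e.g. get_key("cpu_utilization", "trainer") -> "cpu_utilization.trainer"
        return f"{mtype}.{alias}"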
Type of Change
Checklist