-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New plugin: gpu_nvml collects NVIDIA GPU stats. #2923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Great work, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @qdbp,
thank you very much for your pull request! The plugin appears to be straight forward and well written, I mostly have some coding style comments inline.
The lack of license information is a hard blocker for now though.
Thanks again and best regards,
—octo
src/gpu_nvml.c
Outdated
char *eptr; | ||
|
||
if (strcasecmp(key, config_keys[0]) == 0) { | ||
device_ix = strtoul(value, &eptr, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please declare device_ix
here.
https://collectd.org/review-comments#define-variables-on-first-use
src/gpu_nvml.c
Outdated
device_count = 64; | ||
} | ||
|
||
for (int ix = 0; ix < device_count; ix++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This compares signed and unsigned ints, causing compiler warnings. Please use unsigned int ix = 0
instead.
src/gpu_nvml.c
Outdated
|
||
for (int ix = 0; ix < device_count; ix++) { | ||
|
||
int is_match = ((1 << ix) & conf_match_mask) || (conf_match_mask == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicit cast to unsigned int
. Please change the variable type to unsigned int
to avoid compiler warnings.
src/gpu_nvml.c
Outdated
nvmlDevice_t dev; | ||
TRY(nvmlDeviceGetHandleByIndex(ix, &dev)); | ||
|
||
char dev_name[MAX_DEVNAME_LEN + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize right here:
char dev_name[MAX_DEVNAME_LEN + 1] = {0};
src/gpu_nvml.c
Outdated
unsigned long device_ix; | ||
char *eptr; | ||
|
||
if (strcasecmp(key, config_keys[0]) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd find this easier to read if you repeated the string literal "GPUIndex"
here. Alternatively, add a comment to config_keys
that its order matters – that is not normally the case.
src/gpu_nvml.c
Outdated
#define TRY(f) TRY_CATCH(f, catch) | ||
#define TRYOPT(f) TRY_CATCH_OPTIONAL(f, catch) | ||
|
||
#define WRAPGAUGE(x) ((value_t){.gauge = (gauge_t)(x)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass a gauge_t
to nvml_submit()
and do the conversion there, I think you can avoid this macro.
src/gpu_nvml.c
Outdated
conf_match_mask |= (1 << device_ix); | ||
} else if (strcasecmp(key, config_keys[1])) { | ||
if | ||
IS_TRUE(value) { conf_mask_is_exclude = 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conf_mask_is_exclude = IS_TRUE(value);
src/gpu_nvml.c
Outdated
if (strcasecmp(key, config_keys[0]) == 0) { | ||
device_ix = strtoul(value, &eptr, 10); | ||
if (eptr == value) { | ||
ERROR("Failed to parse GPUIndex value \"%s\"", value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please prefix error messages with the plugin name, e.g.:
ERROR(PLUGIN_NAME ": Failed …")
configure.ac
Outdated
@@ -6735,6 +6788,7 @@ AC_PLUGIN([filecount], [yes], [Count files in dire | |||
AC_PLUGIN([fscache], [$plugin_fscache], [fscache statistics]) | |||
AC_PLUGIN([gmond], [$with_libganglia], [Ganglia plugin]) | |||
AC_PLUGIN([gps], [$plugin_gps], [GPS plugin]) | |||
AC_PLUGIN([gpu_nvml], [$with_cuda], [NVIDIA GPU plugin]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "nvml" standing for? Is this a well-known acronym / abbreviation? I'd find "gpu_nvidia" more descriptive, but I have next to no experience in GPU things ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the NVIDIA gpu management library. I think claiming just "GPU" is too broad, since there are other vendors and interfaces out there.
src/gpu_nvml.c
Outdated
|
||
static int nvml_config(const char *key, const char *value) { | ||
|
||
char *eptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please declare variables when they are first used.
So, move this declaration to line 72.
https://collectd.org/review-comments#define-variables-on-first-use
src/gpu_nvml.c
Outdated
return -1; | ||
} | ||
|
||
static void nvml_submit(const char *plugin_instance, const char *type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After WRAPGAUGE
macro was removed, last arg becomes gauge_t
instead of value_t
.
Please, rename nvml_submit
to nvml_submit_gauge
. That is more consistent with exiting practice.
Without such rename, it can lead to mismatch: metric, defined by type
, can be of other type than 'gauge'.
If function name will reflect type, what error has chances to not occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @qdbp,
thank you very much for your update! I left some suggestions inline, but overall this looks good to go.
I'd like to revisit the plugin name, though: I agree that "gpu" is probably too generic. I think that "gpu_nvidia" would be a good choice, because it conveys to users that this plugin is specific to nvidia GPUs. That the library used for communicating with the GPUs is "nvml" is an implementation detail the user likely doesn't care about.
Best regards,
—octo
src/collectd.conf.in
Outdated
#<Plugin gpu_nvml> | ||
# GPUIndex 0 | ||
# GPUIndex 2 | ||
# IgnoreSelected 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be false
instead of 0
. The settings here should reflect the default the plugin uses.
src/gpu_nvml.c
Outdated
unsigned int sm_clk_mhz; | ||
TRYOPT(nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk_mhz)) | ||
if (nv_status == NVML_SUCCESS) | ||
nvml_submit_gauge(dev_name, "frequency", "sm", 1e6 * sm_clk_mhz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sm" strikes me as an acronym – could you expand it for clarity?
src/gpu_nvml.c
Outdated
unsigned int mem_clk_mhz; | ||
TRYOPT(nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clk_mhz)) | ||
if (nv_status == NVML_SUCCESS) | ||
nvml_submit_gauge(dev_name, "frequency", "mem", 1e6 * mem_clk_mhz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dito: "memory" is clearer than "mem".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @qdbp!
This is a C plugin that efficiently collects various operational parameters of NVIDIA GPUs through direct invocation of the NVML API. The targeted use case is monitoring and debugging hardware status for long-running GPGPU jobs (e.g. a BOINC daemon).
Currently following stats are collected:
The NVML header is provided by CUDA, having which is a build requirement for this module.