8000 New plugin: gpu_nvml collects NVIDIA GPU stats. by qdbp · Pull Request #2923 · collectd/collectd · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

New plugin: gpu_nvml collects NVIDIA GPU stats. #2923

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 27, 2018

Conversation

qdbp
Copy link
Contributor
@qdbp qdbp commented Sep 10, 2018

This is a C plugin that efficiently collects various operational parameters of NVIDIA GPUs through direct invocation of the NVML API. The targeted use case is monitoring and debugging hardware status for long-running GPGPU jobs (e.g. a BOINC daemon).

Currently following stats are collected:

  • fan speed
  • core temperature and power consumption
  • memory and streaming multiprocessor clocks
  • percent GPU and memory loads

The NVML header is provided by CUDA, having which is a build requirement for this module.

@ccnifo
Copy link
ccnifo commented Sep 11, 2018

Great work, thanks!
Maybe you should consider nvmlDeviceGetMemoryErrorCounter? We found this metric interesting to spot unexpected hardware errors ending in jobs being killed.

@octo octo added the New plugin label Oct 4, 2018
Copy link
Member
@octo octo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @qdbp,

thank you very much for your pull request! The plugin appears to be straight forward and well written, I mostly have some coding style comments inline.

The lack of license information is a hard blocker for now though.

Thanks again and best regards,
—octo

src/gpu_nvml.c Outdated
char *eptr;

if (strcasecmp(key, config_keys[0]) == 0) {
device_ix = strtoul(value, &eptr, 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/gpu_nvml.c Outdated
device_count = 64;
}

for (int ix = 0; ix < device_count; ix++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This compares signed and unsigned ints, causing compiler warnings. Please use unsigned int ix = 0 instead.

src/gpu_nvml.c Outdated

for (int ix = 0; ix < device_count; ix++) {

int is_match = ((1 << ix) & conf_match_mask) || (conf_match_mask == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit cast to unsigned int. Please change the variable type to unsigned int to avoid compiler warnings.

src/gpu_nvml.c Outdated
nvmlDevice_t dev;
TRY(nvmlDeviceGetHandleByIndex(ix, &dev));

char dev_name[MAX_DEVNAME_LEN + 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize right here:

char dev_name[MAX_DEVNAME_LEN + 1] = {0};

src/gpu_nvml.c Outdated
unsigned long device_ix;
char *eptr;

if (strcasecmp(key, config_keys[0]) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find this easier to read if you repeated the string literal "GPUIndex" here. Alternatively, add a comment to config_keys that its order matters – that is not normally the case.

src/gpu_nvml.c Outdated
#define TRY(f) TRY_CATCH(f, catch)
#define TRYOPT(f) TRY_CATCH_OPTIONAL(f, catch)

#define WRAPGAUGE(x) ((value_t){.gauge = (gauge_t)(x)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you pass a gauge_t to nvml_submit() and do the conversion there, I think you can avoid this macro.

src/gpu_nvml.c Outdated
conf_match_mask |= (1 << device_ix);
} else if (strcasecmp(key, config_keys[1])) {
if
IS_TRUE(value) { conf_mask_is_exclude = 1; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conf_mask_is_exclude = IS_TRUE(value);

src/gpu_nvml.c Outdated
if (strcasecmp(key, config_keys[0]) == 0) {
device_ix = strtoul(value, &eptr, 10);
if (eptr == value) {
ERROR("Failed to parse GPUIndex value \"%s\"", value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefix error messages with the plugin name, e.g.:

ERROR(PLUGIN_NAME ": Failed …")

configure.ac Outdated
@@ -6735,6 +6788,7 @@ AC_PLUGIN([filecount], [yes], [Count files in dire
AC_PLUGIN([fscache], [$plugin_fscache], [fscache statistics])
AC_PLUGIN([gmond], [$with_libganglia], [Ganglia plugin])
AC_PLUGIN([gps], [$plugin_gps], [GPS plugin])
AC_PLUGIN([gpu_nvml], [$with_cuda], [NVIDIA GPU plugin])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "nvml" standing for? Is this a well-known acronym / abbreviation? I'd find "gpu_nvidia" more descriptive, but I have next to no experience in GPU things ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the NVIDIA gpu management library. I think claiming just "GPU" is too broad, since there are other vendors and interfaces out there.

@octo octo changed the title NEW PLUGIN: gpu_nvml, collect NVIDIA GPU stats. New plugin: gpu_nvml collects NVIDIA GPU stats. Oct 22, 2018
@collectd-bot collectd-bot added this to the Features milestone Oct 22, 2018
src/gpu_nvml.c Outdated

static int nvml_config(const char *key, const char *value) {

char *eptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please declare variables when they are first used.
So, move this declaration to line 72.

https://collectd.org/review-comments#define-variables-on-first-use

src/gpu_nvml.c Outdated
return -1;
}

static void nvml_submit(const char *plugin_instance, const char *type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After WRAPGAUGE macro was removed, last arg becomes gauge_t instead of value_t.
Please, rename nvml_submit to nvml_submit_gauge. That is more consistent with exiting practice.

Without such rename, it can lead to mismatch: metric, defined by type, can be of other type than 'gauge'.
If function name will reflect type, what error has chances to not occur.

Copy link
Member
@octo octo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @qdbp,

thank you very much for your update! I left some suggestions inline, but overall this looks good to go.

I'd like to revisit the plugin name, though: I agree that "gpu" is probably too generic. I think that "gpu_nvidia" would be a good choice, because it conveys to users that this plugin is specific to nvidia GPUs. That the library used for communicating with the GPUs is "nvml" is an implementation detail the user likely doesn't care about.

Best regards,
—octo

#<Plugin gpu_nvml>
# GPUIndex 0
# GPUIndex 2
# IgnoreSelected 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be false instead of 0. The settings here should reflect the default the plugin uses.

src/gpu_nvml.c Outdated
unsigned int sm_clk_mhz;
TRYOPT(nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk_mhz))
if (nv_status == NVML_SUCCESS)
nvml_submit_gauge(dev_name, "frequency", "sm", 1e6 * sm_clk_mhz);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sm" strikes me as an acronym – could you expand it for clarity?

src/gpu_nvml.c Outdated
unsigned int mem_clk_mhz;
TRYOPT(nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clk_mhz))
if (nv_status == NVML_SUCCESS)
nvml_submit_gauge(dev_name, "frequency", "mem", 1e6 * mem_clk_mhz);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dito: "memory" is clearer than "mem".

Copy link
Member
@octo octo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @qdbp!

@octo octo merged commit 10d9557 into collectd:master Oct 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0