-
Notifications
You must be signed in to change notification settings - Fork 0
gpu_sysman plugin update to v6.0 API + new labels + "raw" metrics + output options #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
const char *type, const char *subdev, double value) { | ||
metric_family_t fam = { | ||
.type = METRIC_TYPE_COUNTER, | ||
.name = (char *)name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope these don't get written to, as in case if the incoming is truly const, writes result in undefined behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compilers puts all literal strings to read-only section by default, so in actuality they are const strings. Assigning them to const char*, and propagating them e.g. to ras_submit() as such, is the correct thing to do, regardless of the enabled compiler warnings [1].
The annoying thing here is that metric_family_t.name in collectd v6 is non-const string. This is because collectd offers also functions for allocation and freeing metric families, and for this purpose the .name member cannot be const.
My code does not use / need the metric family alloc / free functions, only the metrics processing ones, so explicitly casting the literal name + help strings to non-const is safe. Besides, the same cast happens implicitly [1] in the many places where literal strings are assigned to metric_family_t.name directly.
[1] GCC does not do const checks for literal strings even with "-Wall -Wextra", one needs to explicitly use "-Write-strings" before GCC emits warnings about their assignments. "-Write-strings" GCC option would cause a large number of warnings about "metric_family_t.name" usage elsewhere than in this function, and in all the other collectd plugins though.
Allocating all the name & help strings just to be able to enable "-Wwrite-strings" warning is not worth the overhead though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apart from the several division operations, and them const removals with casts, I think this looks fine. If you 8000 ensure that the divs can't be zero, and that them removed consts don't get written to, I'll say LGTM.
LGTM |
aff2f93
to
c0e01a4
Compare
Rebased the test code debug output change commit to one adding it initially, rebased clang-format white-space changes to appropriate commits + added comment commit explaining the const casting. |
c0e01a4
to
9e92a15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
To document my chat with Alexey: I'm leaving handling of when ADD_DEV_FILE define is enabled until upstream comments on whether they: 1) want that Kubernetes Intel GPU plugin specific feature to be disabled in upstream, 2) a configure option for enabling it, or 3) configure checks for the required POSIX functions before "dev_file" label feature gets enabled. |
a8fdb11
to
13c5598
Compare
eeb2ecb
to
1a2ecb2
Compare
The reason for last two force pushes is that I'm using the reviewed branch for upstreaming (collectd#3968), and I rebased that to:
Unfortunately contrib/format.sh changes did not satisfy whichever clang-format version / settings collectd CI itself runs => expect further pushes once upstream tells what actually should be changed in regards to code formatting. |
13c5598
to
fcdfc1d
Compare
9e48877
to
f049b82
Compare
fcdfc1d
to
c51c8c8
Compare
…c(...) Except for gpu_subarray_alloc(), all allocs are done with calloc(). This way correctness of all of them is easy to check just by grepping for calloc (especially now that clang-format does not wrap those lines any more), and reviewing gpu_subarray_alloc().
Prometheus & OpenMetrics require metric names to be suffixed by the metric unit, and ratios (0-1) to be used instead of percentages (0-100).
…names There's now also support for multiple metrics per family although they are not used yet. "sstrncpy" is not needed any more.
Following labels are used: - sub_dev: subdevice ID (unsigned integer) - location: e.g. "gpu" / "memory" - type: e.g. "request" / "actual" - direction: "read" / "write" Additionally: * Two location label values were fixed * GPU engine indeces are now per engine type (instead of single index being used for all types) * All metric family and label names have been changed to use underscores instead of dashes to separate words, as required by Prometheus i.e. collectd does not need to convert them any more: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
NOTE: providing NULL as label value to delete it is NOT supported. Test code will assert on labels with NULL values.
Also rename GPU struct "name" member to more explicit "pci_bdf". This allowed simplifying the code slightly. Sysman API supports nowadays also other devices than GPUs, so prefix is removed to to simplify code and to be more future-proof: https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t (Plugin will still query only GPU devices from Sysman though.)
- do not add "pci_pdf" to metric name for matching - fix for adding metric labels to family copies of them
* Fix memory "type" label overwrite * Replace "free" memory metric with "memory_usage_ratio" one, and rename "memory_bytes" to "memory_used_bytes" metric * Split metric value aggregate function name to a separate "function" label * Have metric family declares always in same place in code * Avoid both setting metric labels, and reporting empty metrics, when higher internal sampling rate is used or there are L0 errors
* Add "memory_usage_ratio" checks * Update validation for metrics that can be sampled at higher rate i.e. have now the new aggregate function label * With empty metrics avoided, dispatch mock-up can assert on them * With extra L0 calls being skipped when not needed, number of calls can differ between query rounds: - refactor multi-sampling test to handle count changes - change error handing checks to be done in single-sampled mode * Debug output is needed to debug triggered multisample asserts, so do that when assert would have been triggered, then abort
And document why const-qual cast is safe, and why GCC does not warn about other assignments to .name & .help members.
More powerful GPUs can have a large number of engines of given type, but user may be interested only on the higher level engine groups utilization. "DisableEngineSingle" option allows skipping individual engine metrics.
This can be used to speciify whether output metrics values will be raw, derived or both. This commit add support just for the configuration option itself, adding / changing metrics to use it happens in next commit.
This adds new counter type metrics for: * memory bandwidth * frequency throttle time * engine execution time (activity) * energy usage Because collecd internally handles counters as integers, all units cannot be ones recommended by Prometheus, but microseconds and microjoules reported by Sysman.
Zero time intervals or max bandwidth would cause div-by-zero issues and (very rare) time wrap around would cause bogus metric value. Skip all of them.
Empty label equals to a missing one, and Prometheus queries can check for non-existence of a label, so let's just skip empty / unneeded ones. Main difference to earlier is that LevelZero error categories that provide non-zero values only for uncorrectable type (according to spec), are now without a type label. Correctable i.e. zero metrics for those categories were skipped already earlier.
And contrib/format.sh include re-order. "dev_file" support is behind a define (enabled by default) because it needs functions that are only part of POSIX, not C99. Intel Kubernetes GPU plugin uses primary GPU node device file names (card0, card1...) as its GPU identifiers. This new label helps in mapping Kubernetes custom metrics to them.
This was rebased on latest upstream collectd-6.0 branch, so that "gpu_sysman-6.0" branch based on this branch, gets CI fixes from upstream => upstream's own bugs do not fail checks for upstream "gpu_sysman-6.0" branch pull request. |
And document with what GCC warning options the code is tested / passes.
While for plugin that change does not really help (as target buffer is always larger than source), for test code it is useful. And it shuts up less capabable static checking tools than GCC. As test code cannot use existing collectd functionality for this (test code needs modified versions of some collectd functions, and all collectd code does not pass GCC warnings I use), sstrncpy() is copied to test code. For test code there's also a fix to size given for snprintf(), and removal of redundant string termination for modified plugin_log() copy (vsnprintf() already terminates string).
scalloc() wraps calloc() with exit on alloc failure, similarly to what smalloc() does for malloc().
If asserts were disabled, allocation failures would result in collectd memory errors => replace alloc+assert in the plugin with collectd smalloc/scalloc wrappers that exits after logging allocation error. Downsides are that this does not invoke debugger (which could be in a different control group with plenty of memory), nor tell where / what allocation failed, like enabled assert would, so test code variants of the wrappers still do asserts.
Closing this "gpu_sysman-6.0" branch review PR, upstream finally reviewed & merged sysman plugin. |
ChangeLog: collectd: Use one write thread per write plugin The previous write thread design used a single queue with a single read head from which one of the write threads would de-queue an element and would then sequentially call each registered write callback. This meant that all write plugins would have to cooperate in order to not drop values. If for example all write threads are stalled by the same write plugin's callback function not returning, the queue will start to fill up until elements start to be dropped, even though there are other plugins that could still make progress. In addition to that, all write callbacks have to be designed to be reentrant right now, which increases complexity. This new design uses a single linked-list write queue with one read head per output plugin. Each output plugin is serviced in a dedicated write thread. Elements are freed based on a reference count, which is shown in the ASCII-Art below: +- Thread #1 Head +- Thread #2 Head +- Tail v v v +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ | 0|->| 1|->| 1|->| 1|->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ ^ +- to be free()d The changes introduced by this commit have some side-effects: - The WriteThreads config option no longer exists, as a strict 1:1 ratio of write plugins and write threads is used. - The data flow has changed. The previous data flow was: (From one of the ReadThreads) plugin_dispatch_{values,multivalue}() plugin_dispatch_metric_family() enqueue_metric_family() write_queue_enqueue() -----{Queue}----+ | (In one of the WriteThreads threads) | plugin_write_thread() | ^- plugin_write_dequeue() <-+ plugin_dispatch_metric_internal() ^- fc_process_chain(pre_cache_chain) fc_process_chain(fc_process_chain) fc_bit_write_invoke() plugin_write(NULL) / plugin_write(plugin_name) plugin callback() The data flow now is: (From one of the ReadThreads) plugin_dispatch_{values,multivalue}() plugin_dispatch_metric_family() plugin_dispatch_metric_internal() ^- fc_process_chain(pre_cache_chain) fc_process_chain(post_cache_chain) fc_bit_write_invoke() plugin_write(NULL) / plugin_write(plugin_name) write_queue_enqueue() -----{Queue}----+ | (In one of the WriteThreads threads) | plugin_write_thread() <-+ plugin callback() One result of this change is, that the behaviour of plugin_write has changed from running the plugin callback immediately and in the same thread, to always enqueueing the value and de-queing in the dedicated thread. - The behaviour of the WriteQueueLimitHigh and WriteQueueLimitLow options has changed. The Queue will be be capped to a length of LimitHigh by dropping random queue elements between the queue end and LimitLow. Setting LimitLow to a reasonably large value ensures that fast write plugins do not loose values, even in the vicinity of a slow plugin. The diagram below shows the random element selected for removal (###) in Step 1 and the queue with the element removed in Step 2. Step 1: +- Thread #1 Head | +- Thread #2 Head +- Tail v | | v v +--+| +--+ #### +--+| +--+ +--+ +--+ +--+ +--+ | 1|->| 1|-># 1#->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X +--+| +--+ #### +--+| +--+ +--+ +--+ +--+ +--+ | | | LimitHigh | LimitLow Step 2: | +- Thread #1 Head +- Thread #2 Head +- Tail | v | v v | +--+ +--+ +--+| +--+ +--+ +--+ +--+ +--+ | | 1|->| 1|->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X | +--+ +--+ +--+| +--+ +--+ +--+ +--+ +--+ | | | LimitHigh | LimitLow Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
No description provided.