8000 gpu_sysman plugin update to v6.0 API + new labels + "raw" metrics + output options by eero-t · Pull Request #1 · eero-t/collectd · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

gpu_sysman plugin update to v6.0 API + new labels + "raw" metrics + output options #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 23 commits into from

Conversation

eero-t
Copy link
Owner
@eero-t eero-t commented Jan 11, 2022

No description provided.

Copy link
@byako byako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

const char *type, const char *subdev, double value) {
metric_family_t fam = {
.type = METRIC_TYPE_COUNTER,
.name = (char *)name,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope these don't get written to, as in case if the incoming is truly const, writes result in undefined behavior

Copy link
Owner Author
@eero-t eero-t Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compilers puts all literal strings to read-only section by default, so in actuality they are const strings. Assigning them to const char*, and propagating them e.g. to ras_submit() as such, is the correct thing to do, regardless of the enabled compiler warnings [1].

The annoying thing here is that metric_family_t.name in collectd v6 is non-const string. This is because collectd offers also functions for allocation and freeing metric families, and for this purpose the .name member cannot be const.

My code does not use / need the metric family alloc / free functions, only the metrics processing ones, so explicitly casting the literal name + help strings to non-const is safe. Besides, the same cast happens implicitly [1] in the many places where literal strings are assigned to metric_family_t.name directly.

[1] GCC does not do const checks for literal strings even with "-Wall -Wextra", one needs to explicitly use "-Write-strings" before GCC emits warnings about their assignments. "-Write-strings" GCC option would cause a large number of warnings about "metric_family_t.name" usage elsewhere than in this function, and in all the other collectd plugins though.

Allocating all the name & help strings just to be able to enable "-Wwrite-strings" warning is not worth the overhead though.

Copy link
@uniemimu uniemimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from the several division operations, and them const removals with casts, I think this looks fine. If you 8000 ensure that the divs can't be zero, and that them removed consts don't get written to, I'll say LGTM.

@uniemimu
Copy link

LGTM

@eero-t
Copy link
Owner Author
eero-t commented Jan 14, 2022

Rebased the test code debug output change commit to one adding it initially, rebased clang-format white-space changes to appropriate commits + added comment commit explaining the const casting.

@eero-t
Copy link
Owner Author
eero-t commented Jan 24, 2022

@uniemimu and @byako, could you review the 2 new commits, especially the last "dev_file" label one (intended to help with k8s custom rules)?

Copy link
@uniemimu uniemimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
@byako byako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eero-t
Copy link
Owner Author
eero-t commented Jan 24, 2022

To document my chat with Alexey: I'm leaving handling of when ADD_DEV_FILE define is enabled until upstream comments on whether they: 1) want that Kubernetes Intel GPU plugin specific feature to be disabled in upstream, 2) a configure option for enabling it, or 3) configure checks for the required POSIX functions before "dev_file" label feature gets enabled.

@eero-t eero-t force-pushed the gpu_sysman-6.0-review branch from a8fdb11 to 13c5598 Compare January 27, 2022 16:48
@eero-t eero-t force-pushed the gpu_sysman-6.0 branch 2 times, most recently from eeb2ecb to 1a2ecb2 Compare January 27, 2022 17:40
@eero-t
Copy link
Owner Author
eero-t commented Jan 28, 2022

The reason for last two force pushes is that I'm using the reviewed branch for upstreaming (collectd#3968), and I rebased that to:

  • Move separate Prometheus PR (needed for testing the new gpu_sysman plugin in k8s cluster) out from a branch beneath it, to a new new dev branch above it
  • Apply contrib/format.sh changes that collectd CI told to run. I've been using local (LLVM 13) clang-format, but changes it does differ slightly from what remote clang-format (called by format.sh) does

Unfortunately contrib/format.sh changes did not satisfy whichever clang-format version / settings collectd CI itself runs => expect further pushes once upstream tells what actually should be changed in regards to code formatting.

eero-t added 7 commits May 30, 2022 17:21
…c(...)

Except for gpu_subarray_alloc(), all allocs are done with calloc().
This way correctness of all of them is easy to check just by grepping
for calloc (especially now that clang-format does not wrap those lines
any more), and reviewing gpu_subarray_alloc().
Prometheus & OpenMetrics require metric names to be suffixed by the
metric unit, and ratios (0-1) to be used instead of percentages
(0-100).
…names

There's now also support for multiple metrics per family although they
are not used yet. "sstrncpy" is not needed any more.
Following labels are used:
- sub_dev: subdevice ID (unsigned integer)
- location: e.g. "gpu" / "memory"
- type: e.g. "request" / "actual"
- direction: "read" / "write"

Additionally:

* Two location label values were fixed

* GPU engine indeces are now per engine type
  (instead of single index being used for all types)

* All metric family and label names have been changed to use
  underscores instead of dashes to separate words, as required by
  Prometheus i.e. collectd does not need to convert them any more:
  https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
NOTE: providing NULL as label value to delete it is NOT supported.
Test code will assert on labels with NULL values.
Also rename GPU struct "name" member to more explicit "pci_bdf".

This allowed simplifying the code slightly.

Sysman API supports nowadays also other devices than GPUs, so prefix
is removed to to simplify code and to be more future-proof:
https://spec.oneapi.io/level-zero/latest/core/api.html#_CPPv416ze_device_type_t

(Plugin will still query only GPU devices from Sysman though.)
- do not add "pci_pdf" to metric name for matching
- fix for adding metric labels to family copies of them
eero-t added 10 commits May 30, 2022 17:21
* Fix memory "type" label overwrite

* Replace "free" memory metric with "memory_usage_ratio" one,
  and rename "memory_bytes" to "memory_used_bytes" metric

* Split metric value aggregate function name to a separate
  "function" label

* Have metric family declares always in same place in code

* Avoid both setting metric labels, and reporting empty metrics,
  when higher internal sampling rate is used or there are L0
  errors
* Add "memory_usage_ratio" checks

* Update validation for metrics that can be sampled at higher
  rate i.e. have now the new aggregate function label

* With empty metrics avoided, dispatch mock-up can assert on them

* With extra L0 calls being skipped when not needed, number of calls
  can differ between query rounds:
  - refactor multi-sampling test to handle count changes
  - change error handing checks to be done in single-sampled mode

* Debug output is needed to debug triggered multisample asserts,
  so do that when assert would have been triggered, then abort
And document why const-qual cast is safe, and why GCC does
not warn about other assignments to .name & .help members.
More powerful GPUs can have a large number of engines of given type,
but user may be interested only on the higher level engine groups
utilization.

"DisableEngineSingle" option allows skipping individual engine metrics.
This can be used to speciify whether output metrics values will be
raw, derived or both.

This commit add support just for the configuration option itself,
adding / changing metrics to use it happens in next commit.
This adds new counter type metrics for:
* memory bandwidth
* frequency throttle time
* engine execution time (activity)
* energy usage

Because collecd internally handles counters as integers, all units
cannot be ones recommended by Prometheus, but microseconds and
microjoules reported by Sysman.
Zero time intervals or max bandwidth would cause div-by-zero issues
and (very rare) time wrap around would cause bogus metric value.
Skip all of them.
Empty label equals to a missing one, and Prometheus queries can check
for non-existence of a label, so let's just skip empty / unneeded ones.

Main difference to earlier is that LevelZero error categories that
provide non-zero values only for uncorrectable type (according to
spec), are now without a type label. Correctable i.e. zero metrics for
those categories were skipped already earlier.
And contrib/format.sh include re-order.

"dev_file" support is behind a define (enabled by default) because it
needs functions that are only part of POSIX, not C99.

Intel Kubernetes GPU plugin uses primary GPU node device file names
(card0, card1...) as its GPU identifiers.  This new label helps in
mapping Kubernetes custom metrics to them.
@eero-t
Copy link
Owner Author
eero-t commented Jun 1, 2022

This was rebased on latest upstream collectd-6.0 branch, so that "gpu_sysman-6.0" branch based on this branch, gets CI fixes from upstream => upstream's own bugs do not fail checks for upstream "gpu_sysman-6.0" branch pull request.

eero-t added 6 commits June 7, 2022 15:14
And document with what GCC warning options the code is tested / passes.
While for plugin that change does not really help (as target buffer is
always larger than source), for test code it is useful. And it shuts
up less capabable static checking tools than GCC.

As test code cannot use existing collectd functionality for this (test
code needs modified versions of some collectd functions, and all
collectd code does not pass GCC warnings I use), sstrncpy() is copied
to test code.

For test code there's also a fix to size given for snprintf(), and
removal of redundant string termination for modified plugin_log() copy
(vsnprintf() already terminates string).
scalloc() wraps calloc() with exit on alloc failure,
similarly to what smalloc() does for malloc().
If asserts were disabled, allocation failures would result in collectd
memory errors => replace alloc+assert in the plugin with collectd
smalloc/scalloc wrappers that exits after logging allocation error.

Downsides are that this does not invoke debugger (which could be in a
different control group with plenty of memory), nor tell where / what
allocation failed, like enabled assert would, so test code variants of
the wrappers still do asserts.
@eero-t
Copy link
Owner Author
eero-t commented Jun 8, 2022

Closing this "gpu_sysman-6.0" branch review PR, upstream finally reviewed & merged sysman plugin.

@eero-t eero-t closed this Jun 8, 2022
eero-t pushed a commit that referenced this pull request Mar 1, 2023
ChangeLog: collectd: Use one write thread per write plugin

The previous write thread design used a single queue with a single read head
from which one of the write threads would de-queue an element and would then
sequentially call each registered write callback.
This meant that all write plugins would have to cooperate in order to not drop
values. If for example all write threads are stalled by the same write plugin's
callback function not returning, the queue will start to fill up until elements
start to be dropped, even though there are other plugins that could still make
progress. In addition to that, all write callbacks have to be designed to be
reentrant right now, which increases complexity.

This new design uses a single linked-list write queue with one read head per
output plugin. Each output plugin is serviced in a dedicated write thread.
Elements are freed based on a reference count, which is shown in the ASCII-Art
below:

          +- Thread #1 Head       +- Thread #2 Head       +- Tail
          v                       v                       v
   +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+
   | 0|->| 1|->| 1|->| 1|->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X
   +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+  +--+
    ^
    +- to be free()d

The changes introduced by this commit have some side-effects:

- The WriteThreads config option no longer exists, as a strict 1:1 ratio of
  write plugins and write threads is used.
- The data flow has changed. The previous data flow was:

    (From one of the ReadThreads)
      plugin_dispatch_{values,multivalue}()
      plugin_dispatch_metric_family()
      enqueue_metric_family()
      write_queue_enqueue() -----{Queue}----+
                                            |
    (In one of the WriteThreads threads)    |
      plugin_write_thread()                 |
        ^- plugin_write_dequeue()         <-+
      plugin_dispatch_metric_internal()
        ^- fc_process_chain(pre_cache_chain)
      fc_process_chain(fc_process_chain)
      fc_bit_write_invoke()
      plugin_write(NULL) / plugin_write(plugin_name)
      plugin callback()

  The data flow now is:

    (From one of the ReadThreads)
      plugin_dispatch_{values,multivalue}()
      plugin_dispatch_metric_family()
      plugin_dispatch_metric_internal()
        ^- fc_process_chain(pre_cache_chain)
      fc_process_chain(post_cache_chain)
      fc_bit_write_invoke()
      plugin_write(NULL) / plugin_write(plugin_name)
      write_queue_enqueue() -----{Queue}----+
                                            |
    (In one of the WriteThreads threads)    |
      plugin_write_thread()               <-+
      plugin callback()

  One result of this change is, that the behaviour of plugin_write has changed
  from running the plugin callback immediately and in the same thread, to
  always enqueueing the value and de-queing in the dedicated thread.
- The behaviour of the WriteQueueLimitHigh and WriteQueueLimitLow options has
  changed. The Queue will be be capped to a length of LimitHigh by dropping
  random queue elements between the queue end and LimitLow.
  Setting LimitLow to a reasonably large value ensures that fast write plugins
  do not loose values, even in the vicinity of a slow plugin.
  The diagram below shows the random element selected for removal (###) in
  Step 1 and the queue with the element removed in Step 2.

  Step 1:
    +- Thread #1 Head    |  +- Thread #2 Head       +- Tail
    v  |                 |  v                       v
   +--+| +--+  ####  +--+| +--+  +--+  +--+  +--+  +--+
   | 1|->| 1|-># 1#->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X
   +--+| +--+  ####  +--+| +--+  +--+  +--+  +--+  +--+
       |                 |
       | LimitHigh       | LimitLow

  Step 2:
       |  +- Thread #1 Head +- Thread #2 Head       +- Tail
       |  v              |  v                       v
       | +--+  +--+  +--+| +--+  +--+  +--+  +--+  +--+
       | | 1|->| 1|->| 1|->| 2|->| 2|->| 2|->| 2|->| 2|->X
       | +--+  +--+  +--+| +--+  +--+  +--+  +--+  +--+
       |                 |
       | LimitHigh       | LimitLow

Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0