Add slurm plugin #3037
Conversation
Hi! Thanks for your work on Collectd!
Can you please clarify, why
Hi, thanks! Without
and
Therefore, and after finding the option here, I thought that this would be acceptable. After setting
Thank you for the detailed explanation!
Hi, you asked about 5.9 / merge. Let's discuss some implementation details in the comments...
src/slurm.c
Outdated
}

static void slurm_submit_stats(stats_info_response_msg_t *stats_resp) {
  slurm_submit("stats", "slurm_stats", "parts_packed",
The main point that catches my attention is the slurm_stats type. The idea behind collectd types is that each type describes some area, but here we see all data sent under one slurm_stats type. I think that is wrong and needs more work (and more types).
A good example is the jobs_* metrics, which you put into the type instance. From my point of view on collectd types, these should have their own type, possibly slurm_jobs if no better type is found, and submitted/started/completed/canceled/failed should be type instances. This type should be DERIVE (see cat types.db | grep _requests); I think these jobs_* values grow continuously and we are mostly interested in the change rate. Install the rrdtool plugin, set up collection3, and you will see the benefits ;-) With slurm_jobs implemented the described way, you can get one chart that shows all these metrics grouped, and, as I understand it, their sum gives 100% of jobs, so you can see the distribution.
Another note: I think you should not set the plugin instance value at all; a constant value in that field makes no sense, I think. UPD: OK, I see you send the partition name in the plugin instance... That makes this remark void, as there is no better solution right now. (I am really thinking about adding a hierarchy field in addition to plugin and type.)
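To make the proposal above concrete, here is a minimal sketch (not the plugin's actual code) of what a dedicated DERIVE type and a dispatch helper might look like. The slurm_jobs type line and the slurm_submit_derive() helper are illustrative assumptions; the dispatch goes through the standard collectd plugin API (value_list_t, plugin_dispatch_values()), and the exact headers depend on the collectd version.

```c
/* Hypothetical types.db entry, as suggested above:
 *   slurm_jobs  value:DERIVE:0:U
 */
#include "collectd.h"
#include "common.h" /* sstrncpy(); "utils/common/common.h" in newer trees */
#include "plugin.h"

/* Sketch: submit one job counter as the slurm_jobs DERIVE type, with
 * submitted/started/completed/canceled/failed as the type instance. */
static void slurm_submit_derive(const char *plugin_instance, const char *type,
                                const char *type_instance, derive_t value) {
  value_t v = {.derive = value};
  value_list_t vl = VALUE_LIST_INIT;

  vl.values = &v;
  vl.values_len = 1;
  sstrncpy(vl.plugin, "slurm", sizeof(vl.plugin));
  sstrncpy(vl.plugin_instance, plugin_instance, sizeof(vl.plugin_instance));
  sstrncpy(vl.type, type, sizeof(vl.type));
  sstrncpy(vl.type_instance, type_instance, sizeof(vl.type_instance));
  plugin_dispatch_values(&vl);
}

/* Example use (field name from stats_info_response_msg_t):
 *   slurm_submit_derive("", "slurm_jobs", "submitted",
 *                       (derive_t)stats_resp->jobs_submitted);
 */
```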
Hi, thanks a lot for the feedback!
I think you are definitely correct that some of the metrics need a different RRD type. These are all global metrics that are not bound to a specific partition, and as you pointed out some make sense as DERIVE, and some as COUNTER, like the *_sum metrics, which are monotonically increasing.
However, the granularity at which new types should be added or metrics grouped together is not clear to me (see below). I intended to group metrics together in a way that makes sense, yet does not pollute types.db with lots of new types.
I am not using collection3, but currently the data get funneled through various databases/frameworks and are displayed with Grafana at the end of the "pipeline", where they are already grouped by their metric name prefix (that is, by their type instance prefix).
While it would certainly be possible to create a slurm_job type as DERIVE, for the other metrics it is a bit less clear: not all bf_* and schedule_* metrics are of the same type, and therefore it would not be possible to create a slurm_schedule or slurm_bf type to group them together. Similarly, many other types would need to be created if the goal is to separate/plot them based on the type; in some cases, like parts_packed, server_thread_count, and agent_queue_size, each would need a new type (one type per single metric), unless they are all grouped together somehow.
So I guess my question is: what granularity should we use to group them vs creating new types?
IMHO it would make sense to use slurm_stats as plugin instance to group all these global statistics together, and then use the generic types (counter for COUNTER type metrics, derive for DERIVE type metrics, gauge for GAUGE type metrics), and keep the metric names (type instance) as they are.
This would achieve: 1) each having their correct type 2) they are still grouped together as "slurm_stats" (based on the plugin instance), leaving finer granularity grouping up to the plotting software, and 3) not polluting types.db with lots of new types.
Does this make sense from a collectd/rrd point of view? Would you prefer an implementation of this change?
Thanks!
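For illustration only, here is a rough sketch of what the grouping proposed above might look like at a call site, reusing the hypothetical slurm_submit_derive() helper sketched earlier plus an analogous slurm_submit_gauge(); the generic derive and gauge types already exist in types.db, and the field names come from stats_info_response_msg_t.

```c
/* Everything grouped under plugin_instance "slurm_stats", generic types,
 * metric name carried in the type_instance (a sketch of the proposal,
 * not the layout that was eventually merged). */
slurm_submit_derive("slurm_stats", "derive", "jobs_submitted",
                    (derive_t)stats_resp->jobs_submitted);
slurm_submit_derive("slurm_stats", "derive", "jobs_started",
                    (derive_t)stats_resp->jobs_started);
slurm_submit_gauge("slurm_stats", "gauge", "server_thread_count",
                   (gauge_t)stats_resp->server_thread_count);
```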
I think you should add new types when you believe they are needed.
If in doubt, let's discuss.
counter for COUNTER type metrics, derive for DERIVE type metrics, gauge for GAUGE type metrics
I think that is not a good solution, because one would have to do the grouping discussed above anyway.
Not all bf_* metrics and schedule_* metrics are of the same type, and therefore it would not be possible to create a slurm_schedule or slurm_bf type to group them together.
That is not needed; it is OK to split them into subgroups.
See, for example:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMemoryStatStruct
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMemoryStats
https://github.com/collectd/collectd/blob/master/src/virt.c#L1707
The method provides "memory stats". Some time ago that data was sent as one type with type instances, but after careful review, part of the data is now sent as two other types that existed before and are used by other plugins ("widely used types"):
the swap_io type (in + out) and the ps_pagefaults type (major + minor),
and the 'last_update' value is not sent at all.
leaving finer granularity grouping up to the plotting software,
Since you already use your plugin, you already do some grouping.
I also think the grouping you add will not prevent doing another grouping in the plotting software, and having that grouping could be useful. So please provide it :)
and 3) not polluting types.db with lots of new types.
I don't think that is a real problem. My point of view is to freely add new types if required.
Node flags are no longer decoupled from node states (as was the case until now), since it would not be possible to tell which node flag corresponds to which node state. Instead, the node state name now includes the node state flags (such as draining, non_responding, etc.).
I just updated the code with reworked types following our discussion. There are now 6 types in total:
I also rebased against the latest commits in master. Please let me know what you think about these improvements.
src/types.db
Outdated
@@ -231,6 +231,12 @@ serial_octets rx:DERIVE:0:U, tx:DERIVE:0:U
signal_noise value:GAUGE:U:0
signal_power value:GAUGE:U:0
signal_quality value:GAUGE:0:U
slurm_job_state value:GAUGE:0:U
slurm_stats_load server_thread_count:GAUGE:0:U, agent_thread_count:GAUGE:0:U, agent_queue_size:GAUGE:0:U, dbd_agent_queue_size:GAUGE:0:U
Ouch, a misunderstanding occurred.
Types with multiple values are not preferred; each value should go into its own type with a type_instance, as before.
Most types should have only one data source, named value.
I think adding new types with multiple data sources should be very rare.
Is there documentation about these statistics values?
Please understand that threads_count and queue_size are different metrics (different kinds of load), and they should go to different types; they should not be mixed together.
I will try to fill in a table of proposed types.
src/types.db
Outdated
slurm_node_state value:GAUGE:0:U
slurm_stats_backfill backfilled_jobs:DERIVE:0:U, backfilled_pack_jobs:DERIVE:0:U, cycle_counter:COUNTER:0:U, cycle_last:GAUGE:0:U, cycle_max:COUNTER:0:U, cycle_sum:COUNTER:0:U, cycle_last_depth:GAUGE:0:U, cycle_last_depth_try:GAUGE:0:U, depth_sum:COUNTER:0:U, depth_try_sum:COUNTER:0:U, queue_len:GAUGE:0:U, queue_len_sum:COUNTER:0:U
slurm_stats_cycles max:COUNTER:0:U, last:GAUGE:0:U, sum:COUNTER:0:U, counter:COUNTER:0:U, depth:DERIVE:0:U, queue_len:GAUGE:0:U
slurm_stats_jobs submitted:DERIVE:0:U, started:DERIVE:0:U, completed:DERIVE:0:U, canceled:DERIVE:0:U, failed:DERIVE:0:U
I think a slurm_stats_jobs type + submitted/started/completed/canceled/failed type_instance values would be good.
src/types.db
Outdated
slurm_stats_load server_thread_count:GAUGE:0:U, agent_thread_count:GAUGE:0:U, agent_queue_size:GAUGE:0:U, dbd_agent_queue_size:GAUGE:0:U
slurm_node_state value:GAUGE:0:U
slurm_stats_backfill backfilled_jobs:DERIVE:0:U, backfilled_pack_jobs:DERIVE:0:U, cycle_counter:COUNTER:0:U, cycle_last:GAUGE:0:U, cycle_max:COUNTER:0:U, cycle_sum:COUNTER:0:U, cycle_last_depth:GAUGE:0:U, cycle_last_depth_try:GAUGE:0:U, depth_sum:COUNTER:0:U, depth_try_sum:COUNTER:0:U, queue_len:GAUGE:0:U, queue_len_sum:COUNTER:0:U
slurm_stats_cycles max:COUNTER:0:U, last:GAUGE:0:U, sum:COUNTER:0:U, counter:COUNTER:0:U, depth:DERIVE:0:U, queue_len:GAUGE:0:U
depth and queue_len are not directly related to the cycle count, I think.
So those should not be mixed together either.
I think you don't really need the max, last, and count metrics.
Did you use them?
If you look into https://github.com/SchedMD/slurm/blob/master/src/sdiag/sdiag.c:
if (buf->schedule_cycle_counter > 0) {
printf("\tMean cycle: %u\n",
buf->schedule_cycle_sum / buf->schedule_cycle_counter);
printf("\tMean depth cycle: %u\n",
buf->schedule_cycle_depth / buf->schedule_cycle_counter);
}
So you can use this piece of code as an example and submit these metrics:
- average cycle duration (if I understand things correctly)
- average cycle depth (I don't know what the depth is)
If you look at another example:
if ((buf->req_time - buf->req_time_start) > 60) {
printf("\tCycles per minute: %u\n",
(uint32_t) (buf->schedule_cycle_counter /
((buf->req_time - buf->req_time_start) / 60)));
}
This shows that we need to use DERIVE to submit buf->schedule_cycle_counter; such a metric will give us a 'cycles per second' rate. Notice that we don't need to use req_time and req_time_start; collectd saves and uses the time internally.
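A hedged sketch of what this could look like in the plugin, reusing the hypothetical helpers from earlier; the plugin_instance and type_instance names are placeholders, and buf is the stats_info_response_msg_t quoted above.

```c
/* Submit the raw cycle counter as DERIVE so collectd reports a per-second
 * rate, and the mean cycle duration as a GAUGE, mirroring the sdiag.c
 * computation above. */
slurm_submit_derive("slurm_sched_stats", "derive", "schedule_cycle_counter",
                    (derive_t)buf->schedule_cycle_counter);
if (buf->schedule_cycle_counter > 0) {
  slurm_submit_gauge("slurm_sched_stats", "gauge", "schedule_mean_cycle",
                     (gauge_t)buf->schedule_cycle_sum /
                         buf->schedule_cycle_counter);
}
```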
src/types.db
Outdated
slurm_job_state value:GAUGE:0:U
slurm_stats_load server_thread_count:GAUGE:0:U, agent_thread_count:GAUGE:0:U, agent_queue_size:GAUGE:0:U, dbd_agent_queue_size:GAUGE:0:U
slurm_node_state value:GAUGE:0:U
slurm_stats_backfill backfilled_jobs:DERIVE:0:U, backfilled_pack_jobs:DERIVE:0:U, cycle_counter:COUNTER:0:U, cycle_last:GAUGE:0:U, cycle_max:COUNTER:0:U, cycle_sum:COUNTER:0:U, cycle_last_depth:GAUGE:0:U, cycle_last_depth_try:GAUGE:0:U, depth_sum:COUNTER:0:U, depth_try_sum:COUNTER:0:U, queue_len:GAUGE:0:U, queue_len_sum:COUNTER:0:U
This also mixes everything into one, just in a different form.
The goal was not to replace the type_instance with a value index.
Did you really find all these metrics useful?
I found no documentation about these values, so there is no reason to report all of them.
From my point of view we are doing engineering, so we should justify everything.
If there is no ready documentation, please fill in a table with all the values you get from the slurm_get_statistics() call and add an explanation for each value; that will justify how the subsequent type and type_instance columns are filled in.
That is the kind of thing that should be done before writing the code.
Unfortunately the SLURM API documentation is very lacking; the best you can find is the sdiag manpage and the sdiag.c source. sdiag is the official SLURM diagnostics tool, and it reports mostly the same metrics that I was aiming to include in this plugin.
I will add a reference to these in the collectd manpage and in the slurm.c module as a comment, as this would make the meaning of each metric easier to look up.
All these metrics are useful and needed, this is why the official slurm diagnosis tool reports all of those that I have included. (In my particular case some of these metrics may be constant and do not vary much or at all, but this is not a reason for excluding them, as they may be more useful for other slurm clusters.)
In theory metrics that end in _sum or _max could be derived from _last* kind of metrics, but in practice I found that these usually diverge, so keeping them seems sensible. Also, it may make it easier to have a global mean just like sdiag does by computing sum / count.
I will propose a new type listing and we can discuss. Would it be OK to re-use other types, e.g. threads for the server_thread_count, or queue_length for the agent_queue_size (this would result in type: queue_length, type_instance: agent_queue_size), etc.? Or is it preferred to duplicate the metric type with something like slurmctld_thread_count, slurm_agent_queue_size, etc.?
Would it be ok to re-use other types, e.g. threads for the server_thread_count, queue_length for the agent_queue_size, etc?
That would be ok.
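For example, reusing the pre-existing threads and queue_length types from types.db could look roughly like this (a sketch only; slurm_submit_gauge() is the hypothetical helper from earlier and the plugin_instance is a placeholder):

```c
slurm_submit_gauge("slurm_sched_stats", "threads", "server_thread_count",
                   (gauge_t)stats_resp->server_thread_count);
slurm_submit_gauge("slurm_sched_stats", "queue_length", "agent_queue_size",
                   (gauge_t)stats_resp->agent_queue_size);
```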
In theory metrics that end in _sum or _max could be derived from _last* kind of metrics, but in practice I found that these usually diverge, so keeping them seems sensible.
I think max makes no sense, as it will never decrease until a restart.
So it is better to look at a specified interval of _last, if that is needed.
Also, I think _sum should possibly go to some DERIVE type, but this is not a rule, just an opinion.
Thank you a lot for your fast and helpful responses.
Also, I think _sum should possibly go to some DERIVE type, but this is not a rule, just an opinion.

Yes, I think this makes sense. Possibly COUNTER, although it may get reset (a condition different from overflow).
I have been experimenting with DERIVE for slurm_stats_jobs and I noticed something unexpected: while GAUGE will display the following values for one of the metrics:
1559136017.745,41.000000
1559136077.575,42.000000
1559136137.578,42.000000
1559136197.575,43.000000
changing the same metric to DERIVE or COUNTER will provide the following values:
1559136416.574,0.000000
1559136476.575,2345593565836.732910
Is this expected? (This is output from the CSV plugin, and I see the same values on my Grafana dashboards.) It seems like an increment of 1 for this metric causes the derivative to jump from 0 to a very high value, while I was expecting a value of 1. I may be misunderstanding something here, but I wanted to clarify before switching everything to DERIVE types.
It seems like an increment of 1 for this metric is causing the derivative to jump from 0 to a very high value, while I was expecting a value of 1.
If I understand correctly: when you changed GAUGE to DERIVE, you probably had to restart all services, as there is internal state.
but wanted to clarify before switching everything to DERIVE types.
If a value grows monotonically, it is a candidate for DERIVE.
Possibly COUNTER, although it may get reset (a condition different from overflow).
I think your plugin should not use COUNTER; that is a rarely used and specialized type.
Use GAUGE and DERIVE.
src/slurm.c
Outdated
slurm_free_job_info_msg(job_buffer_ptr);
slurm_free_node_info_msg(node_buffer_ptr);
slurm_free_partition_info_msg(part_buffer_ptr);
ERROR("alloc_partition_states");
Suggestion: add PLUGIN_NAME to all ERRORs like:
ERROR(PLUGIN_NAME ": alloc_partition_states error.");
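A minimal sketch of that suggestion; the macro definition and error path shown are illustrative, not the PR's actual code (adjacent string literals are concatenated at compile time, so every log line gets the plugin prefix):

```c
#define PLUGIN_NAME "slurm"

/* hypothetical error path */
if (part_buffer_ptr == NULL) {
  ERROR(PLUGIN_NAME ": alloc_partition_states failed.");
  return -1;
}
```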
Hi,
Note that some types like
If you find these new types to be OK, I will submit a commit which implements the use of these new types very soon (hopefully making it before the code freeze).
Looks good to me. @mrunge would you also mind having a look?
Just to be clear:
Are these OK, or should I try to go for more generic names, dropping the
The reason I'm asking is that collectdctl listval shows very verbose/redundant naming: (extract)
I already have a code implementation of this, but since I remain a bit confused about how to name things and when/how to add new types, I wanted to make sure we get this right :)
I just pushed the code implementation of what I described above. Let me know if you want to do any renaming (or have any other concerns).
This is fine with me. @rpv-tomsk would you also mind having a look?
LGTM
"backfilled_jobs", stats_resp->bf_backfilled_jobs); | ||
slurm_submit_derive("slurm_backfill_stats", "slurm_backfilled_jobs", | ||
"backfilled_pack_jobs", | ||
stats_resp->bf_backfilled_pack_jobs); |
FYI this is broken on Slurm 20.02. They renamed this field.
Indeed, I'll open a new PR to fix this.
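A possible shape for that fix, sketched under the assumption that the field was renamed to bf_backfilled_het_jobs in Slurm 20.02 and that SLURM_VERSION_NUMBER / SLURM_VERSION_NUM are available from the Slurm headers (please verify against the actual headers before relying on this):

```c
#if SLURM_VERSION_NUMBER >= SLURM_VERSION_NUM(20, 2, 0)
  /* assumed new field name in Slurm >= 20.02 */
  slurm_submit_derive("slurm_backfill_stats", "slurm_backfilled_jobs",
                      "backfilled_pack_jobs",
                      stats_resp->bf_backfilled_het_jobs);
#else
  slurm_submit_derive("slurm_backfill_stats", "slurm_backfilled_jobs",
                      "backfilled_pack_jobs",
                      stats_resp->bf_backfilled_pack_jobs);
#endif
```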
This PR adds a new plugin called slurm that collects node, job and internal health statistics for the SLURM workload manager (see https://slurm.schedmd.com/).
In addition to the plugin implementation in src/slurm.c, I added the new corresponding types to types.db, the relevant bits for autoconf & automake, README docs, collectd.conf manpage bits, collectd.conf.in changes, and the corresponding AUTHORS file update. Please let me know if I missed anything.
Note that for the auto-generated collectd.conf, I believe it just adds the LoadPlugin slurm line, although slurm needs the Globals true option to work properly; I'm not sure whether it's possible to include the macro magic in the collectd.conf.in file to have that option on by default (a minimal manual configuration is sketched below).
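A minimal sketch of what the manual configuration could look like; the block form of LoadPlugin allows setting per-plugin options such as Globals:

```
<LoadPlugin slurm>
  Globals true
</LoadPlugin>
```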
Also note that even though I updated src/collectd.conf.pod, make does not update src/collectd.conf.5 for me; I'm not sure whether these manpage source files are still used, but adding the docs to the .pod file seemed reasonable. I manually checked with pod2man that it parsed and rendered correctly.

ChangeLog: The new SLURM plugin gathers metrics from the SLURM workload manager.