Description
- Version of collectd: 5.12.0
- Operating system / distribution: Amazon Linux 2023
- Kernel version (if applicable): 6.1.119-129.201.amzn2023.x86_64
Expected behavior
I am trying to monitor /var/log/messages using the tail plugin for a few things, including OOM kills.
Here is an abbreviated version of the config:
<Plugin tail>
  <File "/var/log/messages">
    Instance "dmesg_sensor"
    <Match>
      Regex "oom-kill:"
      DSType "GaugeInc"
      Type "gauge"
      Instance "oom_kill"
    </Match>
  </File>
</Plugin>
GaugeInc is expected to reset to 0 between intervals. This was previously raised in #2448.
Actual behavior
Running collectdctl -s /var/run/collectd-socket getval hostname/tail-dmesg_sensor/gauge-oom_kill returns value=nan when the expected behaviour is value=0.000000e+00.
I confirmed the rule is functioning by running echo "oom-kill:" > /dev/kmsg. After that, the same collectdctl command returns value=1.000000e+00 as expected, and then goes back to value=nan when the gauge is reset.
The fix is simple, and I think this case was overlooked in the previous fix. I applied the following diff on top of collectd 5.12.0, compiled it, and confirmed that it resolves the issue.
diff --git a/src/utils_tail_match.c b/src/utils_tail_match.c
index 25714c16..597a1d46 100644
--- a/src/utils_tail_match.c
+++ b/src/utils_tail_match.c
@@ -76,7 +76,7 @@ static int simple_submit_match(cu_match_t *match, void *user_data) {
if ((match_value->ds_type & UTILS_MATCH_DS_TYPE_GAUGE) &&
(match_value->values_num == 0))
- values[0].gauge = NAN;
+ values[0].gauge = (match_value->ds_type & UTILS_MATCH_CF_GAUGE_INC) ? 0 : NAN;
else
values[0] = match_value->value;
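To make the intended behaviour easier to review outside the collectd tree, here is a minimal standalone sketch of the decision the patched branch makes. It is not the collectd source: the `sketch_` names, the reduced struct, and the numeric flag values are assumptions that stand in for `UTILS_MATCH_DS_TYPE_GAUGE`, `UTILS_MATCH_CF_GAUGE_INC`, and the real match value type, and it only covers the gauge case.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-ins for UTILS_MATCH_DS_TYPE_GAUGE and
 * UTILS_MATCH_CF_GAUGE_INC; the numeric values are assumptions. */
enum {
  SKETCH_DS_TYPE_GAUGE = 0x1000,
  SKETCH_CF_GAUGE_INC = 0x10,
};

/* Hypothetical, reduced version of the per-match state. */
typedef struct {
  int ds_type;             /* bitmask of type and consolidation flags */
  unsigned int values_num; /* matches seen in the current interval */
  double gauge;            /* accumulated value for the interval */
} sketch_match_value_t;

/* Value to submit at the end of an interval with the patched behaviour:
 * an idle GaugeInc match reports 0, any other idle gauge reports NaN,
 * and a gauge that saw matches reports its accumulated value. */
static double sketch_gauge_to_submit(const sketch_match_value_t *mv) {
  assert(mv != NULL);
  if ((mv->ds_type & SKETCH_DS_TYPE_GAUGE) && (mv->values_num == 0))
    return (mv->ds_type & SKETCH_CF_GAUGE_INC) ? 0.0 : NAN;
  return mv->gauge;
}
```

With this logic, a GaugeInc match that saw no lines in the interval yields 0, while other idle gauge types keep yielding NaN, which is the only behavioural change the diff is meant to introduce.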
As a side question, could you advise how to work around this? We use CloudWatch, which doesn't accept NaN, so these metrics are currently missing for us. Would it be possible to write a temporary plugin that converts NaNs to 0s for these specific metrics? I would appreciate any recommendations on working around this issue!
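For reference, the conversion I have in mind is roughly the sketch below. It is plain C with no collectd API in it; the function names and the prefix-matching rule are hypothetical, and where such a step would be hooked in (a plugin callback, a filter, or the shipping side) is exactly what I am unsure about.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns nonzero if `identifier` starts with `prefix`. */
static int sketch_has_prefix(const char *identifier, const char *prefix) {
  return strncmp(identifier, prefix, strlen(prefix)) == 0;
}

/* Replace NaN gauges with 0 in place, but only for metrics whose identifier
 * starts with `prefix`. Returns the number of values rewritten. */
static size_t sketch_nan_to_zero(const char *identifier, const char *prefix,
                                 double *gauges, size_t gauges_num) {
  assert(identifier != NULL && prefix != NULL);
  assert(gauges != NULL || gauges_num == 0);
  if (!sketch_has_prefix(identifier, prefix))
    return 0; /* leave unrelated metrics untouched */
  size_t rewritten = 0;
  for (size_t i = 0; i < gauges_num; i++) {
    if (isnan(gauges[i])) {
      gauges[i] = 0.0;
      rewritten++;
    }
  }
  return rewritten;
}
```

Called with the identifier hostname/tail-dmesg_sensor/gauge-oom_kill and the prefix hostname/tail-dmesg_sensor/, this would turn the idle-interval NaN into 0 and leave every other metric as it is.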
Ideally we wouldn't have to distribute our own build of collectd (or a patched tail plugin) while waiting for an official version. What might the timeline for a release be? I'm hesitant because the last release was 4 years ago.
It doesn't look like we're the only ones having this issue:
awslabs/collectd-cloudwatch#78
https://sage.amazon.dev/posts/1675491
Thank you!