-
Notifications
You must be signed in to change notification settings - Fork 1.2k
cpu plugin: add linux-specific "guest" states. #2439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi! As far I remember my investigations on this, An example.
That machine is a server running several KVM VM, no other load inside the OS itself. Let's look for a first cpu counters.
While The sum of first 8 values: 155 798 619 81 An error between these values is about 1% for ~5 years of uptime. So, that sum can be correctly used as This also confirms what But if we start to reduce |
Mine 1.5 years-old version of this patch is below. That is not used by me, for unknown reasons. Sorry, I can't remember all the details about.
|
As suggested, I have added options so this change does not feel like a regression to anyone:
|
src/cpu.c
Outdated
int numfields; | ||
long long user_value, nice_value, value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
In current codestyle variable should be declared when it used at first time, if that is possible.
Thanks for your work on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, adjusted :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @xavierog, thank you very much for your patch!
In addition to the inline comments, if you make a case for keeping one or both config options, please also add them to src/collectd.conf.in
.
Best regards,
—octo
When set to B<true>, reports the "guest" and "guest_nice" CPU states. | ||
Defaults to B<false>. | ||
|
||
=item B<SubtractGuestState> B<false>|B<true> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having two options to control this behavior feels a bit like premature optimization. In other words, are users actually going to use those options? (This is an honest question, not a rhetorical one; I know nothing about the new guest counters ;)
I'd prefer to implement the behavior that ReportGuestState true; SubtractGuestState true
has and implement config options only if (and when) users ask for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented these options to address the following concern:
But if we start to reduce user and nice, that will be backward-incompatible change for people who do not ready for a such. May be that should be tunable?
I believe having these options keeps you safe from any angry user, as they enable end users to revert to whatever old behaviour they were used to. So I would argue that, yes, indeed, some people will use them. Specifically, these counters are notoriously useless within VMs: people with 10 000+ VMs at $RandomCloudCompany may not want to store 20 000+ extra useless metrics. I may even use ReportGuestState myself, depending on the host I deploy collectd to :)
However, we surely can change the default values. The actual question is: what is the policy of the collectd project regarding new metrics within existing plugins?
- Conservative?
ReportGuestState false; SubtractGuestState true
- Progressive?
ReportGuestState true; SubtractGuestState true
- Prone to delegate computations to query and graphing tools?
ReportGuestState true; SubtractGuestState false
The choice is yours :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conservative: New versions need to be backwards compatible (with the exception of new major versions). Backwards compatibility is defined by the user observable behavior, i.e. the previous config file needs to still work and all metrics that were reported previously need to still be reported.
I'd consider this a bit of an edge case, sine the previous metrics are still present. But it sounds like having users make a conscious decision to enable these metrics is reasonable.
You've made a good case for ReportGuestState
– who would use SubtractGuestState
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
who would use SubtractGuestState?
End users who only want the raw values straight out from the kernel to graph them the way they want in one or several graphs. Typically, SubtractGuestState true
is useful for people who want to stack all CPU metrics in a single graph and be done with it. But some prefer to mix stacked and non-stacked values on a single graph. Some prefer a separate graph to reflect the host vs guests CPU consumption.
I tend to focus on graph aficionados but alerting is to be considered too: some alerts and thresholds may assume that guest is included in user, others may assume that user reflects only the host-specific activity.
That being said, since you aim for a conservative policy, what will most likely happen is that end users will notice the lack of guest and guest_nice (like I did), have a look at the documentation, be glad to find they can enable them easily and hopefully learn at the same time that subtlety about them being part of user and nice (I didn't know).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, let's go with both options.
src/cpu.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please close the >= 9
block here and out-dent the >= 10
and >= 11
cases. I.e.:
if (numfields >= 9) {
/* … */
}
if (numfields >= 10) {
/* … */
}
if (numfields >= 11) {
/* … */
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I get the (numfields >= 9) block out too? It is part of the (numfields >= 8) block, and that's why I chose to nest (numfields >= 10) and (numfields >= 11).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that'd be great :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 092b2d8. If needed, I'll rebase/squash everything after the discussions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; thanks @xavierog!
I'll wait for the CI system to catch up before merging.
As requested, I have updated |
Thank you very much :) |
Thank you! |
/* Guest is included in User; optionally subtract Guest from User: */ | ||
if (subtract_guest) { | ||
user_value -= value; | ||
if (user_value < 0) user_value = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
Do you think (or, possible, know) is that condition is possible at practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kernel code:
root@hs01:~/srv01/linux-3.2.78# cat /root/srv01/linux-3.2.78/kernel/sched.c |grep -A 50 account_guest_time
static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
cputime64_t tmp;
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
tmp = cputime_to_cputime64(cputime);
/* Add guest time to process. */
p->utime = cputime_add(p->utime, cputime);
p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
account_group_user_time(p, cputime);
p->gtime = cputime_add(p->gtime, cputime);
/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
cpustat->nice = cputime64_add(cpustat->nice, tmp);
cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
} else {
cpustat->user = cputime64_add(cpustat->user, tmp);
cpustat->guest = cputime64_add(cpustat->guest, tmp);
}
}
So, that will be possible only on counter overflow, which should be practically impossible.
All is ok.
Hello,
The "cpu" plugin currently lacks two Linux-specific CPU states:
(excerpts from procfs(5))
This patch brings support for guest and guest_nice CPU states into collectd. It lacks testing on non-Linux systems though. Feedback on whatever can be improved would be greatly appreciated.