8000 Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) by ymettier · Pull Request #691 · collectd/collectd · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 8, 2014

Conversation

ymettier
Copy link
Contributor
@ymettier ymettier commented Aug 5, 2014

I have a Collectd running in an overloaded server.

PR #640 (please merge it) prevents a thread bomb.
PR #280 (already merged in collectd-5.4.0) and PR #690 (recently created, please merge it) fix most problems.

However, I had no idea how to configure the WriteQueueLimitHigh and WriteQueueLimitLow parameters introduced with PR #280. So I wrote this patch to have statistics on Write Queue and improve those parameters.

Note : we have some internal statistics on the Read Queue from the network plugin. The Write Queue is an internal queue so I prefered not to use network (as p in the h/p-pi/t-ti notation) but internal. Some may think that collectd sounds better than internal. Choose the best.


/* Cache : Nb entry in cache tree */
vl.values[0].gauge = (gauge_t) uc_get_size();
sstrncpy (vl.type, "nb_values", sizeof (vl.type));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nb_values definition in types.db seems to be missing from this patch. Would it be possible to add it, or even better: reuse an existing type (cache_size seems appropriate here) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nb_values comes from another (unmerged yet) patch and I missed that it was not in this one. Sorry.
I agree with the fact that we can reuse one, and I like cache_size too. Patch updated.

@mfournier
Copy link

So I must say I'd absolutely love to have this feature in collectd !

I'm just a bit wary about the effect it might have on the performance as it touches core parts of the code. I'm not sure I understand how frequently plugin_update_internal_statistics () would be called. Could you give some insights on this aspect ?

Ideally @octo would give his opinion on this when he finds a moment :-)

vl.values_len = 2;
vl.time = 0;
sstrncpy (vl.host, hostname_g, sizeof (vl.host));
sstrncpy (vl.plugin, "internal", sizeof (vl.plugin));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference to have "collectd" here instead of "internal".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ymettier
Copy link
Contributor Author

About performances...

just like another plugin
plugin_update_internal_statistics() is the new function and is called every time the main loop starts. It is called from plugin_read_all().

Its cost is the same as any plugin : prepare a value_list_t and enqueue it with plugin_dispatch_values().
It is disabled by default (to prevent performance degradation if you do not know about this feature).
For those 2 reasons, I consider that its performance cost is exactly the same as any plugin, or even less because it is internal.

queue_length
Queue length is an available value in the write_queue_length global var.
No additionnal performance cost.

derive-dropped
stats_values_dropped is a new global var. Because 2 or more threads can update it at the same time, there is a need for a lock.
When Collectd is working well, no values are dropped. This values is 0 and will not be updated. Performance cost = 0.
When Collectd is not working well, you drop values. There is a performance cost here. However, if you drop values, you really need this metric to know what is not working well. So here I consider that it's worth paying for this. Note that this is the exact reason I wrote this patch !

cache_size
I have developped this feature for my jsonrpc plugin (#154 currently closed until I update it to be automatically merged). It's a very useful feature to graph the result of echo LISTVAL | nc -u /var/.../collectd.sock | wc -l. I often use it to understand why the load of my server grows : the number of values also grows !

I could implement it in my jsonrpc plugin because it works with copies of the internal cache tree and there is no additionnal performance cost there.
With internal statistics, I thought it could be a good idea to share that feature without the need for jsonrpc plugin.

The cost of this feature is a call to the new uc_get_size() function that locks the cache_lock mutex to get the size of cache_tree. Then the size is directly available from the structure c_avl_tree_s->size.
The performance cost is that lock : 1 time in each iteration of the main loop.
This is to be compared with the number of metrics you have in the cache (eg, if you have 100k metrics in your cache, the cache update currently costs 100.000 locks in each iteration. With the patch, it will cost 100.001 locks.)

I hope it helps to clarify how my patch impacts core performances.

Regards,
Yves


if (check_drop_value ())
if (check_drop_value ()) {
pthread_mutex_lock(&statistics_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential contention point and only reached on busy servers. Please add a check for record_statistics around the mutex.

…llectd'; no more lock when not recording internal stats
@ymettier
Copy link
Contributor Author

Hello,

Commit b2d20f9 :

  • renamed internal to collectd (3 votes : 2 for collectd and mine is blank)
  • renamed InternalStatistics to CollectdInternalStats (I noticed that @octo did not suggest CollectdInternalStatistics. I also prefer CollectdInternalStats because its shorter but still meaningful)
  • no more lock (and of course no more counter) when CollectdInternalStats is disabled.

Regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0