Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

ymettier · 2014-08-05T08:31:28Z

I have a Collectd running in an overloaded server.

PR #640 (please merge it) prevents a thread bomb.
PR #280 (already merged in collectd-5.4.0) and PR #690 (recently created, please merge it) fix most problems.

However, I had no idea how to configure the WriteQueueLimitHigh and WriteQueueLimitLow parameters introduced with PR #280. So I wrote this patch to have statistics on Write Queue and improve those parameters.

Note : we have some internal statistics on the Read Queue from the network plugin. The Write Queue is an internal queue so I prefered not to use network (as p in the h/p-pi/t-ti notation) but internal. Some may think that collectd sounds better than internal. Choose the best.

mfournier · 2014-08-22T10:03:27Z

src/plugin.c

+
+	/* Cache : Nb entry in cache tree */
+	vl.values[0].gauge = (gauge_t) uc_get_size();
+	sstrncpy (vl.type, "nb_values", sizeof (vl.type));


nb_values definition in types.db seems to be missing from this patch. Would it be possible to add it, or even better: reuse an existing type (cache_size seems appropriate here) ?

nb_values comes from another (unmerged yet) patch and I missed that it was not in this one. Sorry.
I agree with the fact that we can reuse one, and I like cache_size too. Patch updated.

mfournier · 2014-08-22T10:28:40Z

So I must say I'd absolutely love to have this feature in collectd !

I'm just a bit wary about the effect it might have on the performance as it touches core parts of the code. I'm not sure I understand how frequently plugin_update_internal_statistics () would be called. Could you give some insights on this aspect ?

Ideally @octo would give his opinion on this when he finds a moment :-)

mfournier · 2014-08-22T10:31:25Z

src/plugin.c

+	vl.values_len = 2;
+	vl.time = 0;
+	sstrncpy (vl.host, hostname_g, sizeof (vl.host));
+	sstrncpy (vl.plugin, "internal", sizeof (vl.plugin));


Slight preference to have "collectd" here instead of "internal".

ymettier · 2014-08-25T13:15:18Z

About performances...

just like another plugin
plugin_update_internal_statistics() is the new function and is called every time the main loop starts. It is called from plugin_read_all().

Its cost is the same as any plugin : prepare a value_list_t and enqueue it with plugin_dispatch_values().
It is disabled by default (to prevent performance degradation if you do not know about this feature).
For those 2 reasons, I consider that its performance cost is exactly the same as any plugin, or even less because it is internal.

queue_length
Queue length is an available value in the write_queue_length global var.
No additionnal performance cost.

derive-dropped
stats_values_dropped is a new global var. Because 2 or more threads can update it at the same time, there is a need for a lock.
When Collectd is working well, no values are dropped. This values is 0 and will not be updated. Performance cost = 0.
When Collectd is not working well, you drop values. There is a performance cost here. However, if you drop values, you really need this metric to know what is not working well. So here I consider that it's worth paying for this. Note that this is the exact reason I wrote this patch !

cache_size
I have developped this feature for my jsonrpc plugin (#154 currently closed until I update it to be automatically merged). It's a very useful feature to graph the result of echo LISTVAL | nc -u /var/.../collectd.sock | wc -l. I often use it to understand why the load of my server grows : the number of values also grows !

I could implement it in my jsonrpc plugin because it works with copies of the internal cache tree and there is no additionnal performance cost there.
With internal statistics, I thought it could be a good idea to share that feature without the need for jsonrpc plugin.

The cost of this feature is a call to the new uc_get_size() function that locks the cache_lock mutex to get the size of cache_tree. Then the size is directly available from the structure c_avl_tree_s->size.
The performance cost is that lock : 1 time in each iteration of the main loop.
This is to be compared with the number of metrics you have in the cache (eg, if you have 100k metrics in your cache, the cache update currently costs 100.000 locks in each iteration. With the patch, it will cost 100.001 locks.)

I hope it helps to clarify how my patch impacts core performances.

Regards,
Yves

octo · 2014-08-26T10:46:55Z

src/plugin.c


-	if (check_drop_value ())
+	if (check_drop_value ()) {
+		pthread_mutex_lock(&statistics_lock);


This is a potential contention point and only reached on busy servers. Please add a check for record_statistics around the mutex.

…llectd'; no more lock when not recording internal stats

ymettier · 2014-08-26T11:09:45Z

Hello,

Commit b2d20f9 :

renamed internal to collectd (3 votes : 2 for collectd and mine is blank)
renamed InternalStatistics to CollectdInternalStats (I noticed that @octo did not suggest CollectdInternalStatistics. I also prefer CollectdInternalStats because its shorter but still meaningful)
no more lock (and of course no more counter) when CollectdInternalStats is disabled.

Regards

8000

New InternalStatistics

5a7d86e

mfournier added Feature labels Aug 5, 2014

mfournier reviewed Aug 22, 2014
View reviewed changes

nb_values missing in types.db; renaming as cache_size

c8549bc

octo reviewed Aug 26, 2014
View reviewed changes

renamed the option to CollectInternalStats; renamed 'internal' to 'co…

b2d20f9

…llectd'; no more lock when not recording internal stats

mfournier mentioned this pull request Aug 29, 2014

improve CollectInternalStats's documentation ymettier/collectd#1

Merged

mfournier self-assigned this Aug 29, 2014

mfournier merged commit b2d20f9 into collectd:master Sep 8, 2014

mfournier mentioned this pull request Dec 10, 2014

Feature Request: need for self monitoring plugin ( collectd getting metrics from itself) #849

Open

ymettier mentioned this pull request Dec 15, 2015

Improved documentation about CollectdInternalStats #1471

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

Statistics on Write Queue length (to help on WriteQueueLimitHigh tuning) #691

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!