8000 concurrency issue with apache + network plugin when gcrypt is used · Issue #513 · collectd/collectd · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

concurrency issue with apache + network plugin when gcrypt is used #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zerkms opened this issue Jan 13, 2014 · 27 comments
Closed

concurrency issue with apache + network plugin when gcrypt is used #513

zerkms opened this issue Jan 13, 2014 · 27 comments
Milestone

Comments

@zerkms
Copy link
Contributor
zerkms commented Jan 13, 2014

The latest ubuntu 12.04 LTS collectd version (4.10.1-2.1ubuntu7) crashes on curl initialization in apache plugin. Here is the stacktrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/collectd -C /etc/collectd/collectd.conf -P /var/run/collectd.pid'.
Program terminated with signal 6, Aborted.
#0  0x00dfd420 in __kernel_vsyscall ()
(gdb) backtrace full
#0  0x00dfd420 in __kernel_vsyscall ()
No symbol table info available.
#1  0x0013e1ef in raise () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00141835 in abort () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#3  0x00137095 in ?? () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#4  0x00137147 in __assert_fail () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#5  0x002f7f05 in ?? () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#6  0x0033aa3d in ?? () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#7  0x0033b0d9 in ?? () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#8  0x0033c2cb in ?? () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#9  0x0033a8c7 in ?? () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#10 0x002efe13 in gcry_create_nonce () from /lib/i386-linux-gnu/libgcrypt.so.11
No symbol table info available.
#11 0x008ddbaa in ?? () from /usr/lib/i386-linux-gnu/libgnutls.so.26
No symbol table info available.
#12 0x0087d35f in ?? () from /usr/lib/i386-linux-gnu/libgnutls.so.26
No symbol table info available.
#13 0x0086a35d in gnutls_global_init () from /usr/lib/i386-linux-gnu/libgnutls.so.26
No symbol table info available.
#14 0x00d8e862 in Curl_gtls_init () from /usr/lib/i386-linux-gnu/libcurl-gnutls.so.4
No symbol table info available.
#15 0x00d8f709 in Curl_ssl_init () from /usr/lib/i386-linux-gnu/libcurl-gnutls.so.4
No symbol table info available.
#16 0x00d80d8c in curl_global_init () from /usr/lib/i386-linux-gnu/libcurl-gnutls.so.4
No symbol table info available.
#17 0x00d80f05 in curl_easy_init () from /usr/lib/i386-linux-gnu/libcurl-gnutls.so.4
No symbol table info available.
#18 0x00de8006 in init_host (st=0x9d621e0) at apache.c:403
        credentials = '\000' <repeats 1023 times>
#19 apache_read_host (user_data=0x9d62344) at apache.c:637
        i = <optimized out>
        ptr = <optimized out>
        saveptr = <optimized out>
        lines = {0x0, 0x0, 0x9d61708 "", 0x0, 0x8056e6d "\213T$\024\205\300\017\216\227", 0x9d6c618 "\220\367\360", 0x9d6aa48 "@#\341", 0x0, 0x0, 0x0, 0x4 <Address 0x4 out of bounds>, 0x1 <Address 0x1 out of bounds>, 0x798cbf "\201\303\065\363", 0x0, 0x0, 0x0}
        lines_num = 0
        fields = {0x9d61708 "", 0x9d61708 "", 0x9d62340 "\020w\336", 0x3d0f00 "\205\300\017\205\267\371\377\377\017\266D$M\017\266L$L\213T$P\211,$\211D$8\213D$T\211L$0\211T$,\211D$4\213D$X\211D$<\213D$l\211D$\004\350\216\353"}
        fields_num = <optimized out>
        st = 0x9d621e0
        __PRETTY_FUNCTION__ = "apache_read_host"
#20 0x080525bf in plugin_read_thread (args=0x0) at plugin.c:433
        callback = <optimized out>
        rf = 0x9d62340
        status = <optimized out>
        rf_type = 1
        now = {tv_sec = 1389647376, tv_usec = 23827}
        rc = <optimized out>
        __PRETTY_FUNCTION__ = "plugin_read_thread"
#21 0x00796d4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
No symbol table info available.
#22 0x001faace in clone () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.

Some weirdness: it worked until friday, then suddenly started crashing, then worked for one day and at this moment it is still crashing on start. Nothing was installed/reconfigured on the server

@octo
Copy link
Member
octo commented Jan 14, 2014

The line reads:

        if ((st->curl = curl_easy_init ()) == NULL)

Since no argument is passed to curl_easy_init, that can hardly be the problem. However, the function is called from concurrent code.

How many hosts / Apache instances do you have configured? Do you use other plugins that use cURL?

—octo

@zerkms
Copy link
Contributor Author
zerkms commented Jan 14, 2014

@octo,

there is only single host (localhost) is set up.

And I've just noticed that it's a network plugin that conflicts with apache on that server. It doesn't use curl though (at least explicitly)

@mfournier
Copy link

Maybe completely unrelated, but this sounds similar to http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=735173

@zerkms, out of curiosity, does the problem still occur if you set the network plugin's SecurityLevel to none ?

@zerkms
Copy link
Contributor Author
zerkms commented Jan 14, 2014

@mfournier,

problem disappears for SecurityLevel "None" :-S

@mfournier
Copy link

So this could probably be the same issue.

@tokkee suggested it could be a thread-safety issue with libgcrypt. If it's the case, #156 and #88 might not be foreign to this.

According to http://www.gnupg.org/documentation/manuals/gcrypt/Multi_002dThreading.html "If your application is multi-threaded, you must set the thread support callbacks with the GCRYCTL_SET_THREAD_CBS command before any other function in the library.". This was respected before 80d8e67 but no longer since, it seems.

@ChrisLundquist, can you comment on this ?

Also, @pieterlexis you might want to follow this conversation.

@ChrisLundquist
Copy link
Contributor

It has been a while since I looked at this, but I think I was hitting a similar issue when I was using the network plugin and the curl plugin. Looking over my write up as you pointed to in #88 It seems like we're initializing the library in multiple places, or perhaps curl was.

The function gcry_check_version must be called before any other function in the library, except the GCRYCTL_SET_THREA 10000 D_CBS command (called via the gcry_control function), because it initializes the thread support subsystem in Libgcrypt.

The mailing list reply linked in the comment says

For proper initialization of the library, you must call
gcry_check_version() before calling any other functions except for a
these gcry_control operations:
GCRYCTL_SUSPEND_SECMEM_WARN
GCRYCTL_DISABLE_INTERNAL_LOCKING
GCRYCTL_ANY_INITIALIZATION_P
GCRYCTL_INITIALIZATION_FINISHED_P

So we should be safe having that if condition wrapping it. I missed the exception in my implementation as the mailing list doesn't mention the threading flag specifically.

Sounds like swapping the threading flag line and the version check line will fix this.

@ChrisLundquist
Copy link
Contributor

@zerkms would you be able to test #515 and report your findings?

@zerkms
Copy link
Contributor Author
zerkms commented Jan 15, 2014

Nope, I don't have an environment (and experience) to compile collectd :-(

@obazoud
Copy link
Contributor
obazoud commented Jan 15, 2014

@zerkms you can use my project to build collect https://github.com/obazoud/collectd-build
I build a tar.gz from master in vagrant with z debian box.

@pieterlexis
Copy link

Thanks for pinging me in this convo. If I have time I'll try to build a package today (but the package buildsystem is kind of fucky)

@zerkms
Copy link
Contributor Author
zerkms commented Jan 15, 2014

@obazoud I would be better if someone confident confirmed the issue went away. I'm not even sure how exactly to reproduce the issue :-S

@ChrisLundquist
Copy link
Contributor

Cool, ChrisLundquist@d219776 should apply cleanly to most any tag, or be trivial to reproduce. I'd feel best if someone else was able to confirm it fixes this issue. If it doesn't I'm happy to continue working through it.

Thank you all for finding this bug and helping me fix it.

@pieterlexis
Copy link

@ChrisLundquist In the debian sources for wheezy (collectd 5.1.0) it doesnt apply. The only gcry_control block is located in src/network.c on lines 3353-3356:

#if HAVE_LIBGCRYPT
        gcry_control (GCRYCTL_SET_THREAD_CBS, &gcry_threads_pthread);
        gcry_control (GCRYCTL_INIT_SECMEM, 32768, 0);
        gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);
#endif

Looking at your patch (calling GCRYCTL_SET_THREAD_CBS first), it would appear that this is already done. Albeit I must admit I know little of collectd internals and gcrypt.

@ChrisLundquist
Copy link
Contributor

I based that patch from master, which looks pretty close to collectd-5.3.1.

In 4.10.1 (the reported version of this issue) it doesn't look like we call gcry_check_version which makes use of the flag GCRYCTL_SET_THREAD_CBS. We assume that network.c is the only thing using the library and don't query the library if it has already been initialized.

It looks like my changes in #156 first appear in collectd-4.10.8

Maybe I didn't introduce this bug? Either way, from what I've read, it sounds like #515 is more correct and would love to have it tested and merged.

I wonder if we could get the ubuntu collectd package updated to at least 4.10.lastest or maybe even go crazy with 5.3.latest?

@ChrisLundquist
Copy link
Contributor

Oh right, 12.04 LTS. Makes getting a package update there difficult since 14.04 LTS is just around the corner.

FWIW (shoutout to @tokkee 💖 )

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=13.10
DISTRIB_CODENAME=saucy
DISTRIB_DESCRIPTION="Ubuntu 13.10"
$ apt-cache show collectd | grep Version
Version: 5.1.0-3.1ubuntu3

But that doesn't really help 12.04 LTS users.

@mfournier
Copy link

I've also tried switching these 2 statements like in d219776, but the problem I faced is that I was unable to build a collectd which reproduces the bug without the patch to start with :(

Collectd 5.4.0-3 from debian jessie reliably crashes with the config snippet from http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=735173 but doesn't anymore as soon as I build my own 5.4.0 on this platform.

@zerkms, can you share your configuration that makes collectd crash on 4.10.1-2.1ubuntu7 ?

@pieterlexis out of curiosity, when did the problem start for you ? After a config change or suddenly with no reason like it did for @zerkms ?

@ChrisLundquist I guess your patch doesn't apply on 5.1.0 because of 0ec776a. This patch only appeared in the tree in 5.1.3. It's trivial to backport it if needed.

@zerkms
Copy link
Contributor Author
zerkms commented Jan 15, 2014

@mfournier I will, tomorrow

@pieterlexis
Copy link

@mfournier It started when I added the apache plugin to an existing installation (where encryption was enabled).

@pieterlexis
Copy link

@mfournier @ChrisLundquist I applied the patch in 0ec776a and d219776 and built a new package from 5.1.0-3. The problem indeed doesn't occur 👍

I'll publish the dpatch/source package asap

@pieterlexis
Copy link

Patched packages for Debian Wheezy can be found here

@ChrisLundquist
Copy link
Contributor

👍

@mfournier
Copy link

I confirm @pieterlexis's dpatch solves the problem, both for debian wheezy's 5.1.0-3 and ubuntu precise's 4.10.1-2.1ubuntu7. I've been running while true; do timeout 2 collectd -f -C /vagrant/collectd-513.conf; done for several minutes on both platforms, and didn't see it crash yet. Without the patch, it crashed at least 10% of the time.

Therefore +1 for merging @ChrisLundquist's patch. I'm going to adapt it for collectd-4.10 branch as requested by @octo, and I also want to add a comment so that we don't shoot ourselves in the foot in the future.

@ChrisLundquist
Copy link
Contributor

Thank you @mfournier . I wasn't quite sure what @octo was asking. I didn't see a 4.10 branch I could PR against. If you would /cc me on the PR / commit, I'd like to see and learn.

The only question I'm not entirely clear on is how we were seeing this in collectd-4.10.1

Maybe because we weren't calling gcry_check_version after setting the flag?

@mfournier
Copy link

@zerkms, oh and here's what I did to build a patched package for ubuntu 12.04:

sudo apt-get build-dep collectd
sudo apt-get install devscripts
apt-get source collectd
cd collectd-4.10.1/
curl "http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=10;filename=correcty-init-gcrypt.dpatch;att=1;bug=735173" > debian/patches/correcty-init-gcrypt.dpatch
echo correcty-init-gcrypt.dpatch >> debian/patches/00list
# optionnally run "dch -i" to increment package release number
dpkg-buildpackage -rfakeroot
sudo dpkg -i ../collectd_4.10.1*.deb ../collectd-core_4.10.1*.deb

Alternatively, this workaround in collectd.conf seems to work too, but your collectd will probably significantly underperform:

ReadThreads 1
WriteThreads 1

The same procedure works for 5.1.0 BTW.

@mfournier
Copy link

The only question I'm not entirely clear on is how we were seeing this in collectd-4.10.1

@ChrisLundquist, the same is true for 5.1.0, as 80d8e67 was only introduced in 5.1.1/4.10.8 and 0ec776a in 5.1.3/4.10.9.

But frankly, I don't know. An explanation would be that this bug was already there before 80d8e67, and we just got mislead in assuming it was the cause by @pieterlexis's comment in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=735173

I just double checked by building a stock 5.1.0 without all the packaging overhead/dpatches/etc, and I confirm the bug is there already. I didn't manage to build 4.10.1: too outdated and broken, but the code in network.c is basically the same, regarding gcrypt's use.

So I'm pretty confortable with your patch. More out of thorough testing than perfect understanding though :-/

octo added a commit that referenced this issue Jan 15, 2014
This is a shot in the dark in trying to address #513. By calling this
from an init() callback, I hope to be initializing the curl and gcrypt
libraries before collectd becomes multi-threaded, avoiding the problems
described in the issue.
@mfournier
Copy link

@ChrisLundquist, your patch just got included in the collectd-4.10 branch as ddffda7. It will soon be merged in collectd-5.3 and collectd-5.4, and will be part of the next bugfix releases. Many thanks to all of you on this issue !

According to @octo, part of the problem was that the apache plugin initializes curl after collectd gets multithreaded, and as curl uses libgcrypt, it could get initialized too late (ie: from inside a thread). An attempt to correct this is 5f2f969.

And if these 2 patches don't solve the issue, the next step would be to initalize libgcrypt outside of the plugins, to ensure it's done before any thread gets created.

@zerkms
Copy link
Contributor Author
zerkms commented Jan 15, 2014

@mfournier

As promised here is the config (joined from multiple files and with comments and sensitive data removed) that leads to crashes on my server:

FQDNLookup true
LoadPlugin syslog
<Plugin syslog>
    LogLevel info
</Plugin>
LoadPlugin cpu
LoadPlugin df
LoadPlugin entropy
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin processes
LoadPlugin users

<Plugin disk>
#   Disk "hda"
#   Disk "/sda[23]/"
#   IgnoreSelected false
</Plugin>

LoadPlugin apache
<Plugin apache>
    <Instance "localhost">
        URL "http://localhost/server-status?auto"
        Server "apache"
    </Instance>
</Plugin>

LoadPlugin network
<Plugin network>
    <Server "192.168.xxx.xxx" "xxxxxxx">
        SecurityLevel Encrypt
        Username "###############"
        Password "*************"
    </Server>
</Plugin>

LoadPlugin dns
<Plugin dns>
    Interface "any"
    IgnoreSource "192.168.xxxx.xxx"
    SelectNumericQueryTypes false
</Plugin>

Moving out apache's section solves the issue

katzj pushed a commit to katzj/collectd that referenced this issue Jan 27, 2014
This is a shot in the dark in trying to address collectd#513. By calling this
from an init() callback, I hope to be initializing the curl and gcrypt
libraries before collectd becomes multi-threaded, avoiding the problems
described in the issue.
katzj pushed a commit to Stackdriver/collectd that referenced this issue Feb 14, 2014
This is a shot in the dark in trying to address collectd#513. By calling this
from an init() callback, I hope to be initializing the curl and gcrypt
libraries before collectd becomes multi-threaded, avoiding the problems
described in the issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants
0