collectd crashes when used nested includes · Issue #587 · collectd/collectd · GitHub

Closed
toni-moreno opened this issue Mar 19, 2014 · 13 comments

@toni-moreno
Contributor

I've configured in /opt/collectd/etc/collectd.conf

...
Include "/opt/collectd/etc/metrics/*.conf"
...

and inside the metrics configuration

a /opt/collectd/etc/metrics/apache_generic.conf with

Include "/opt/collectd/etc/metrics/apache/*.conf"

a /opt/collectd/etc/metrics/jmx_generic.conf with

Include "/opt/collectd/etc/metrics/jmx/*.conf"

a /opt/collectd/etc/metrics/oracle_generic.conf with

Include "/opt/collectd/etc/metrics/oracle/*.conf"

and so on.

While starting, the daemon crashes randomly, and I'm not able to reproduce the problem.

I've enabled core dumps and have collected a lot of them (system and Java dumps):

.root:/opt/collectd/etc/metrics > ls -l /opt/collectd/var/lib/collectd/
total 47516
-rw------- 1 root root 212897792 Mar 19 17:24 core.15494
-rw------- 1 root root 202551296 Mar 19 15:27 core.16387
-rw------- 1 root root 212770816 Mar 19 15:29 core.16497
-rw------- 1 root root 234156032 Mar 19 15:31 core.17750
-rw------- 1 root root 233734144 Mar 19 15:33 core.17882
-rw------- 1 root root 234151936 Mar 19 16:42 core.29974
-rw------- 1 root root 233721856 Mar 19 16:43 core.30292
-rw------- 1 root root 233705472 Mar 19 16:43 core.30407
-rw------- 1 root root 322539520 Mar 19 16:44 core.30748
-rw------- 1 root root 322650112 Mar 19 16:44 core.30913
-rw------- 1 root root 233578496 Mar 19 15:45 core.5504
-rw------- 1 root root 233717760 Mar 19 15:46 core.5705
-rw-r--r-- 1 root root     28245 Mar 19 17:41 hs_err_pid19973.log
-rw-r--r-- 1 root root     28399 Mar 19 17:45 hs_err_pid20321.log
-rw-r--r-- 1 root root      4300 Mar 19 17:49 hs_err_pid21489.log
-rw-r--r-- 1 root root     25920 Mar 19 16:44 hs_err_pid30748.log
-rw-r--r-- 1 root root     26127 Mar 19 16:44 hs_err_pid30913.log
-rw-r--r-- 1 root root     32405 Mar 19 16:53 hs_err_pid895.log
.root:/opt/collectd/etc/metrics >

But this makes little sense, since the backtrace is different on each crash. It looks like a concurrency problem. Any idea how to fix or work around it?

Created new plugin context.
*** glibc detected *** /opt/collectd/sbin/collectd: realloc(): invalid pointer: 0x00002aaaac062e50 ***
======= Backtrace: =========
/lib64/libc.so.6(realloc+0x381)[0x3401075661]
/opt/collectd/lib/collectd/cpu.so[0x2ae6c3ffe550]
/opt/collectd/sbin/collectd[0x40ee13]
/lib64/libpthread.so.0[0x3401c0673d]
/lib64/libc.so.6(clone+0x6d)[0x34010d44bd]


*** glibc detected *** /opt/collectd/sbin/collectd: free(): invalid next size (fast): 0x00002aaaac062e10 ***
======= Backtrace: =========
/lib64/libc.so.6[0x340107245f]
/lib64/libc.so.6(cfree+0x4b)[0x34010728bb]
/opt/collectd/lib/collectd/df.so(cu_mount_freelist+0x64)[0x2ae6c4201414]
/opt/collectd/lib/collectd/df.so[0x2ae6c42007e6]
/opt/collectd/sbin/collectd[0x40ee13]
/lib64/libpthread.so.0[0x3401c0673d]
/lib64/libc.so.6(clone+0x6d)[0x34010d44bd]


Core was generated by `/opt/collectd/sbin/collectd -f'.
Program terminated with signal 6, Aborted.
#0  0x0000003401030265 in raise () from /lib64/libc.so.6
(gdb) br
Breakpoint 1 at 0x3401030265
(gdb) bt
#0  0x0000003401030265 in raise () from /lib64/libc.so.6
#1  0x0000003401031d10 in abort () from /lib64/libc.so.6
#2  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#3  0x000000340107245f in _int_free () from /lib64/libc.so.6
#4  0x00000034010728bb in free () from /lib64/libc.so.6
#5  0x00000034010ae29e in build_trtable () from /lib64/libc.so.6
#6  0x00000034010b7cb0 in re_search_internal () from /lib64/libc.so.6
#7  0x00000034010b989a in regexec@@GLIBC_2.3.4 () from /lib64/libc.so.6
#8  0x00002b7a0dd87a62 in tr_action_invoke (act_head=0x12b8d300, buffer_in=0x12b93330 "cpu", buffer_in_size=64, may_be_empty=0) at target_replace.c:172
#9  0x00002b7a0dd87bc1 in tr_invoke (ds=<value optimized out>, vl=0x12b932d0, meta=<value optimized out>, user_data=<value optimized out>) at target_replace.c:336
#10 0x000000000040ad88 in ?? ()


Core was generated by `/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003401073155 in _int_malloc () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003401073155 in _int_malloc () from /lib64/libc.so.6
#1  0x0000003401074aad in calloc () from /lib64/libc.so.6
#2  0x00002acaa9d421e6 in disk_read () at disk.c:527
#3  0x000000000040ee13 in ?? ()



Reading symbols from /opt/collectd/lib/collectd/target_set.so...done.
Loaded symbols for /opt/collectd/lib/collectd/target_set.so
Core was generated by `/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000034010616c6 in fgets () from /lib64/libc.so.6
(gdb) bt
#0  0x00000034010616c6 in fgets () from /lib64/libc.so.6
#1  0x00002ac71914d124 in disk_read () at disk.c:508
#2  0x000000000040ee13 in ?? ()
(gdb) Quit
(gdb) exit


Core was generated by `/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid'.
Program terminated with signal 6, Aborted.
#0  0x0000003401030265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003401030265 in raise () from /lib64/libc.so.6
#1  0x0000003401031d10 in abort () from /lib64/libc.so.6
#2  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#3  0x000000340107245f in _int_free () from /lib64/libc.so.6
#4  0x00000034010728bb in free () from /lib64/libc.so.6
#5  0x0000003401060eab in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00002afe25a60393 in cpu_read () at cpu.c:630
#7  0x000000000040ee13 in ?? ()
@toni-moreno
Contributor Author

Note:
collectd (built from yesterday's latest commit) is crashing on the same machine where it was compiled.

Linux XXXXXXXXXXXXXX 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 5.5 (Tikanga)

glibc-2.5-58.el5_6.4

compiled with these options:

export JAVA_HOME=/soft/jdk1.6.0_45_sun_hotspot/
export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib/
export CFLAGS="-g"

./configure --prefix=/opt/collectd --enable-debug --disable-python --enable-oracle  --with-java=/soft/jdk1.6.0_45_sun_hotspot/   --with-libcurl=/usr/local/lib

@toni-moreno
Contributor Author

After removing one Include level (I've kept only a single Include /opt/collectd/etc/metrics/*.conf), I see no more segfaults, but I'm still randomly getting signal 6 ("Aborted").

Once started, everything seems OK.

.root:/root > ls -ltr /opt/collectd/var/lib/collectd/core*
-rw------- 1 root root 481017856 Mar 24 16:00 /opt/collectd/var/lib/collectd/core.26855
-rw------- 1 root root 480768000 Mar 24 16:00 /opt/collectd/var/lib/collectd/core.26924
-rw------- 1 root root 480894976 Mar 24 16:00 /opt/collectd/var/lib/collectd/core.26987
-rw------- 1 root root 480706560 Mar 24 16:01 /opt/collectd/var/lib/collectd/core.27042
-rw------- 1 root root 236093440 Mar 24 16:13 /opt/collectd/var/lib/collectd/core.2743
-rw------- 1 root root 477396992 Mar 24 16:17 /opt/collectd/var/lib/collectd/core.6305
-rw------- 1 root root 481660928 Mar 24 17:00 /opt/collectd/var/lib/collectd/core.23295
-rw------- 1 root root 481366016 Mar 24 17:00 /opt/collectd/var/lib/collectd/core.23349

The cores give us the following information.

==============================
/opt/collectd/var/lib/collectd/core.26924
==============================
Program terminated with signal 6, Aborted.
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
===========================================
(gdb) bt
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
#1  0x00002ae4b6cd96c6 in skgesigOSCrash () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#2  0x00002ae4b6f8af79 in kpeDbgSignalHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#3  0x00002ae4b6cd98d6 in skgesig_sigactionHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#4  <signal handler called>
#5  0x0000003401030265 in raise () from /lib64/libc.so.6
#6  0x0000003401031d10 in abort () from /lib64/libc.so.6
#7  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#8  0x0000003401075661 in realloc () from /lib64/libc.so.6
#9  0x00002ae4b4361f67 in cpu_states_grow () at cpu.c:236
#10 0x00002ae4b4362433 in submit (cpu_num=2, derives=0x4b6f9800) at cpu.c:436
#11 0x00002ae4b43628ad in cpu_read () at cpu.c:626
#12 0x0000000000410142 in ?? ()
(gdb) info threads
  36 Thread 0x2ae4b3d512e0 (LWP 26924)  0x0000003401c0e1c1 in nanosleep () from /lib64/libpthread.so.0
  35 Thread 26925  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  34 Thread 26926  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  33 Thread 26927  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  32 Thread 26928  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  31 Thread 26929  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  30 Thread 26930  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  29 Thread 26932  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  28 Thread 26933  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  27 Thread 26934  0x00000034010bb187 in sched_yield () from /lib64/libc.so.6
  26 Thread 26935  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  25 Thread 26936  0x0000003401c0db3b in accept () from /lib64/libpthread.so.0
  24 Thread 26937  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23 Thread 26938  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  22 Thread 26939  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21 Thread 26940  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20 Thread 26941  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19 Thread 26942  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18 Thread 26943  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  17 Thread 26944  0x0000003401c0cd01 in sem_wait () from /lib64/libpthread.so.0
  16 Thread 26945  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  15 Thread 26946  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14 Thread 26947  0x00002ae4b3216b91 in ciObjectFactory::get_symbol(Symbol*) () from /opt/collectd/jre/lib/amd64/server/libjvm.so
  13 Thread 26948  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 26949  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 26950  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 26951  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9 Thread 26952  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 26953  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7 Thread 26955  0x00002ae4b364746e in Metaspace::allocate(ClassLoaderData*, unsigned long, bool, MetaspaceObj::Type, Thread*) ()
   from /opt/collectd/jre/lib/amd64/server/libjvm.so
  6 Thread 26956  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 Thread 26957  0x00002ae4b5636ffe in upiini () from /opt/collectd/lib/oracle/libclntsh.so.11.1
  4 Thread 26958  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 26959  0x00000034010c678b in read () from /lib64/libc.so.6
  2 Thread 26960  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x4b6fa940 (LWP 26954)  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0

===============================
/opt/collectd/var/lib/collectd/core.26987
===============================
(gdb) bt
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
#1  0x00002ba42206f6c6 in skgesigOSCrash () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#2  0x00002ba422320f79 in kpeDbgSignalHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#3  0x00002ba42206f8d6 in skgesig_sigactionHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#4  <signal handler called>
#5  0x0000003401030265 in raise () from /lib64/libc.so.6
#6  0x0000003401031d10 in abort () from /lib64/libc.so.6
#7  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#8  0x000000340107245f in _int_free () from /lib64/libc.so.6
#9  0x00000034010728bb in free () from /lib64/libc.so.6
#10 0x0000003401060eab in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#11 0x00002ba41fb00861 in disk_read () at disk.c:681
#12 0x0000000000410142 in ?? ()
(gdb) info threads
  36 Thread 0x2ba41f0e72e0 (LWP 26987)  0x0000003401c0e1c1 in nanosleep () from /lib64/libpthread.so.0
  35 Thread 26988  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  34 Thread 26989  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  33 Thread 26990  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  32 Thread 26991  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  31 Thread 26992  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  30 Thread 26993  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  29 Thread 26994  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  28 Thread 26995  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  27 Thread 26996  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  26 Thread 26997  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  25 Thread 26998  0x0000003401c0db3b in accept () from /lib64/libpthread.so.0
  24 Thread 27000  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23 Thread 27001  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  22 Thread 27002  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21 Thread 27003  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20 Thread 27004  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19 Thread 27005  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18 Thread 27006  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  17 Thread 27007  0x0000003401c0cd01 in sem_wait () from /lib64/libpthread.so.0
  16 Thread 27008  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  15 Thread 27009  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14 Thread 27010  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 27011  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 27012  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 27013  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 27014  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9 Thread 27016  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 27017  0x00002ba41e6f18cc in frame::adjust_unextended_sp() () from /opt/collectd/jre/lib/amd64/server/libjvm.so
  7 Thread 27018  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 27019  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 Thread 27020  0x0000003405807f8a in __libc_res_nsearch () from /lib64/libresolv.so.2
  4 Thread 27021  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 27022  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 27023  0x00000034010c678b in read () from /lib64/libc.so.6
* 1 Thread 0x4a4ea940 (LWP 27015)  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0


===============================
/opt/collectd/var/lib/collectd/core.26855
==============================
Program terminated with signal 6, Aborted.
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
==========================
(gdb) bt
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
#1  0x00002ae86006d6c6 in skgesigOSCrash () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#2  0x00002ae86031ef79 in kpeDbgSignalHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#3  0x00002ae86006d8d6 in skgesig_sigactionHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#4  <signal handler called>
#5  0x0000003401030265 in raise () from /lib64/libc.so.6
#6  0x0000003401031d10 in abort () from /lib64/libc.so.6
#7  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#8  0x000000340107245f in _int_free () from /lib64/libc.so.6
#9  0x00000034010728bb in free () from /lib64/libc.so.6
#10 0x0000003401060eab in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#11 0x00002ae85d6f68d9 in cpu_read () at cpu.c:630
#12 0x0000000000410142 in ?? ()

==========================

 36 Thread 0x2ae85d0e52e0 (LWP 26855)  0x0000003401c0e1c1 in nanosleep () from /lib64/libpthread.so.0
  35 Thread 26856  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  34 Thread 26857  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  33 Thread 26858  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  32 Thread 26859  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  31 Thread 26860  0x00000034010bb187 in sched_yield () from /lib64/libc.so.6
  30 Thread 26861  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  29 Thread 26862  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  28 Thread 26863  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  27 Thread 26864  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  26 Thread 26865  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  25 Thread 26866  0x0000003401c0db3b in accept () from /lib64/libpthread.so.0
  24 Thread 26868  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23 Thread 26869  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  22 Thread 26870  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21 Thread 26871  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20 Thread 26872  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19 Thread 26873  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18 Thread 26874  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  17 Thread 26875  0x0000003401c0cd01 in sem_wait () from /lib64/libpthread.so.0
  16 Thread 26876  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  15 Thread 26877  0x00002ae85c457a76 in AdvancedThresholdPolicy::select_task(CompileQueue*) () from /opt/collectd/jre/lib/amd64/server/libjvm.so
---Type <return> to continue, or q <return> to quit---
  14 Thread 26878  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 26879  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 26880  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 26881  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 26882  0x0000003401c0d91b in read () from /lib64/libpthread.so.0
  9 Thread 26883  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 26884  0x00002ae85c5d233a in ClassFileParser::parse_method(bool, AccessFlags*, Thread*) ()
   from /opt/collectd/jre/lib/amd64/server/libjvm.so
  7 Thread 26885  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  6 Thread 26887  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 Thread 26888  0x00000034010bb187 in sched_yield () from /lib64/libc.so.6
  4 Thread 26889  0x00000034010d10ea in mmap64 () from /lib64/libc.so.6
  3 Thread 26890  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 26891  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x4c68d940 (LWP 26886)  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0


============================
/opt/collectd/var/lib/collectd/core.27042
=============================
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid'.
Program terminated with signal 6, Aborted.
===============================
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
#1  0x00002ae4266e16c6 in skgesigOSCrash () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#2  0x00002ae426992f79 in kpeDbgSignalHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#3  0x00002ae4266e18d6 in skgesig_sigactionHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#4  <signal handler called>
#5  0x0000003401030265 in raise () from /lib64/libc.so.6
#6  0x0000003401031d10 in abort () from /lib64/libc.so.6
#7  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#8  0x000000340107245f in _int_free () from /lib64/libc.so.6
#9  0x00000034010728bb in free () from /lib64/libc.so.6
#10 0x00000000004105dc in ?? ()
(gdb) info threads
  36 Thread 0x2ae4237592e0 (LWP 27042)  0x0000003401c0e1c1 in nanosleep () from /lib64/libpthread.so.0
  35 Thread 27043  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  34 Thread 27044  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  33 Thread 27045  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  32 Thread 27046  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  31 Thread 27047  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  30 Thread 27048  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  29 Thread 27049  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  28 Thread 27051  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  27 Thread 27052  0x0000003401c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  26 Thread 27053  0x0000003401c0db3b in accept () from /lib64/libpthread.so.0
  25 Thread 27055  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  24 Thread 27056  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23 Thread 27057  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  22 Thread 27058  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21 Thread 27059  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20 Thread 27060  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19 Thread 27061  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18 Thread 27062  0x0000003401c0cd01 in sem_wait () from /lib64/libpthread.so.0
  17 Thread 27063  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  16 Thread 27064  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  15 Thread 27065  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14 Thread 27066  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 27067  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 27068  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 27069  0x00000034010bb187 in sched_yield () from /lib64/libc.so.6
  10 Thread 27070  0x00000034010d10ea in mmap64 () from /lib64/libc.so.6
  9 Thread 27071  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 27072  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
  7 Thread 27073  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 27074  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 Thread 27075  0x00002ae426c525b0 in naesh1s () from /opt/collectd/lib/oracle/libclntsh.so.11.1
  4 Thread 27076  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 27077  0x00002ae423058974 in Method::make_adapters(methodHandle, Thread*) () from /opt/collectd/jre/lib/amd64/server/libjvm.so
  2 Thread 27078  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x464ce940 (LWP 27050)  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0


============================
/opt/collectd/var/lib/collectd/core.2743
============================
Core was generated by `/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid'.
Program terminated with signal 6, Aborted.
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
=============================

(gdb) bt
#0  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0
#1  0x00002b121e7c26c6 in skgesigOSCrash () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#2  0x00002b121ea73f79 in kpeDbgSignalHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#3  0x00002b121e7c28d6 in skgesig_sigactionHandler () from /opt/collectd/lib/oracle/libclntsh.so.11.1
#4  <signal handler called>
#5  0x0000003401030265 in raise () from /lib64/libc.so.6
#6  0x0000003401031d10 in abort () from /lib64/libc.so.6
#7  0x000000340106a99b in __libc_message () from /lib64/libc.so.6
#8  0x0000003401075661 in realloc () from /lib64/libc.so.6
#9  0x00002b121c057f67 in cpu_states_grow () at cpu.c:236
#10 0x00002b121c058433 in submit (cpu_num=2, derives=0x4b468800) at cpu.c:436
#11 0x00002b121c0588ad in cpu_read () at cpu.c:626
#12 0x0000000000410142 in ?? ()
(gdb) info threads
  23 Thread 0x2b121ba472e0 (LWP 2743)  0x0000003401c0e1c1 in nanosleep () from /lib64/libpthread.so.0
  22 Thread 2744  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21 Thread 2745  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20 Thread 2746  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19 Thread 2747  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18 Thread 2748  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  17 Thread 2749  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  16 Thread 2750  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  15 Thread 2751  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14 Thread 2752  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 2753  0x0000003401c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 2754  0x0000003401c0db3b in accept () from /lib64/libpthread.so.0
  11 Thread 2756  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 2757  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9 Thread 2758  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 2759  0x00000034010cb696 in poll () from /lib64/libc.so.6
  7 Thread 2760  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 2762  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 Thread 2763  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4 Thread 2764  0x00000034010c678b in read () from /lib64/libc.so.6
  3 Thread 2765  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 2767  0x0000003401c0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x4b469940 (LWP 2761)  0x0000003401c0e9dd in raise () from /lib64/libpthread.so.0

@mfournier

As the configuration is parsed just once, at startup, it seems unlikely to me that an issue with the config file parser would cause collectd to crash later on. Can you please check that by copy-pasting the files you include into your main configuration file in place of these Include statements?
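
The copy-paste test suggested above can be automated. Below is a minimal sketch, not part of collectd: the function name `flatten_config` is hypothetical, and it assumes that every Include is written on its own unindented line in the form `Include "glob"`, as in the original report. It prints a single configuration with all Include lines expanded recursively.

```sh
#!/bin/sh
# Hypothetical helper (not shipped with collectd): print a config file with
# every line of the form  Include "glob"  replaced by the contents of the
# matching files, recursively. Indented or block-style Includes are not handled.
flatten_config() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            Include\ \"*\")
                # Strip the leading  Include "  and the trailing  "
                pattern=${line#Include \"}
                pattern=${pattern%\"}
                # Unquoted on purpose so the shell expands the glob.
                for f in $pattern; do
                    if [ -f "$f" ]; then
                        flatten_config "$f"
                    fi
                done
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done < "$1"
}

# Example (paths as used in this thread; the output file name is an assumption):
# flatten_config /opt/collectd/etc/collectd.conf > /tmp/collectd-flat.conf
```

If the flattened file still triggers the crashes, the Include handling itself is unlikely to be the cause.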

As you mention, a thread concurrency issue could be the cause (according to the symptoms and the backtraces). Can you try disabling the loaded plugins one by one? This might give a hint as to which one is causing trouble. #114 and #526 ring a bell to me as possibly related.
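
Disabling plugins one at a time can be scripted when each plugin lives in its own file. The sketch below is only an illustration under that assumption: `try_without_each` is a hypothetical name, and the command passed to it is whatever you use to start collectd and decide whether it survived startup. It moves each `*.conf` aside in turn, runs the command, and restores the file.

```sh
#!/bin/sh
# Hypothetical helper: for every *.conf in DIR, move it aside, run CMD,
# report whether CMD succeeded, then put the file back.
# Usage: try_without_each DIR CMD
try_without_each() {
    dir=$1
    cmd=$2
    for conf in "$dir"/*.conf; do
        [ -f "$conf" ] || continue
        mv "$conf" "$conf.disabled"
        if sh -c "$cmd"; then status=ok; else status=failed; fi
        mv "$conf.disabled" "$conf"
        echo "without $(basename "$conf"): $status"
    done
}

# Example (the check script name is an assumption, not from this thread):
# try_without_each /opt/collectd/etc/metrics "/root/start_and_check_collectd.sh"
```

Because the crash is random, each variant would probably need several runs before a plugin can be ruled out.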

@toni-moreno
Contributor Author
toni-moreno commented Apr 2, 2014

Hi Marc.

I will do what you ask as soon as I have a bit of time (it requires a lot of changes), in a few days I hope.
But I suspect it also has to do with the java plugin, because I split the "*.conf" files at the same time I began working with libjvm.

I will report back on that soon.

@dothebart
Contributor

Maybe starting collectd under valgrind (memcheck or maybe helgrind) can give more detailed information on this?
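
For anyone wanting to try this, here is a rough sketch of such a run. The wrapper name `run_under_valgrind`, the log location under /tmp, and the `VALGRIND` / `COLLECTD` override variables are assumptions for illustration. The valgrind options used (`--tool`, `--leak-check`, `--log-file` with `%p` for the PID) are standard ones, and `-f` keeps collectd in the foreground, as in the command lines quoted earlier in this thread.

```sh
#!/bin/sh
# Hypothetical wrapper: run collectd in the foreground under a valgrind tool.
# Usage: run_under_valgrind memcheck|helgrind
run_under_valgrind() {
    tool=$1
    valgrind_bin=${VALGRIND:-valgrind}
    collectd_bin=${COLLECTD:-/opt/collectd/sbin/collectd}
    case "$tool" in
        memcheck) extra="--leak-check=full" ;;  # leak details only apply to memcheck
        *)        extra="" ;;
    esac
    # $extra is intentionally unquoted so an empty value adds no argument.
    "$valgrind_bin" --tool="$tool" $extra \
        --log-file="/tmp/collectd-$tool.%p.log" \
        "$collectd_bin" -f
}

# Example:
# run_under_valgrind helgrind
```

Expect collectd to run much slower under either tool, which may itself change the timing of a race.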

@mfournier

This looks very much like this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750440

@toni-moreno
Contributor Author

Yes! But I'm having this issue on both systems, Debian and Red Hat...

As a workaround, I've written an init.d script that starts collectd inside a loop, waits a few seconds and then checks it. It tries to start the daemon for up to 10 iterations.

It usually takes 2 or 3 restarts to start OK.
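
The retry workaround described above could be sketched roughly as follows. This is not the original init.d script: the function name `start_with_retries`, its arguments, and the example check command are assumptions, and the figure of 10 from the description is read here as an upper bound on attempts.

```sh
#!/bin/sh
# Hypothetical retry wrapper, not the author's actual init.d script.
# Usage: start_with_retries MAX_TRIES WAIT_SECONDS START_CMD CHECK_CMD
# Runs START_CMD, waits, then runs CHECK_CMD; repeats until the check
# passes or MAX_TRIES attempts have been made.
start_with_retries() {
    max_tries=$1
    wait_seconds=$2
    start_cmd=$3
    check_cmd=$4
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        # A failed start is not fatal here; the check below decides.
        sh -c "$start_cmd" || true
        sleep "$wait_seconds"
        if sh -c "$check_cmd"; then
            echo "started after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max_tries attempts" >&2
    return 1
}

# Example (paths from this thread; the 5-second wait is an assumption):
# start_with_retries 10 5 \
#   "/opt/collectd/sbin/collectd -P /opt/collectd/var/run/collectd.pid" \
#   "kill -0 \$(cat /opt/collectd/var/run/collectd.pid)"
```

A wrapper like this only hides the startup crash; it does nothing about the underlying memory corruption.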

As a side effect, all processes forked by the exec plugin are left hanging after a crash and must be killed by hand.

On the other hand, once the daemon has started OK, collectd works fine with all my plugins ("jvm", "oracle", "exec", system plugins, etc.).

@toni-moreno
Contributor Author

I've run collectd under valgrind with the following plugins (on Debian):

/opt/collectd/etc/metrics/plugin_apache.conf
/opt/collectd/etc/metrics/plugin_cpu.conf
/opt/collectd/etc/metrics/plugin_df.conf
/opt/collectd/etc/metrics/plugin_disk.conf
/opt/collectd/etc/metrics/plugin_interface.conf
/opt/collectd/etc/metrics/plugin_memory.conf
/opt/collectd/etc/metrics/plugin_ping.conf
/opt/collectd/etc/metrics/plugin_processes.conf
/opt/collectd/etc/metrics/plugin_swap.conf

Here are the results from memcheck and helgrind (for anyone who would like to check):

https://gist.github.com/toni-moreno/a2f80021535f87202de7

Summary:

* memcheck: some leak issues in cpu.c and plugin.c to review
* helgrind: some race conditions in apache.c, write_graphite.c and dl-lookup.c to review

@mfournier

@toni-moreno, could you please give dothebart@e09d935 a try and report back whether this solves the problem for you? Thanks :)

@toni-moreno
Contributor Author

@mfournier and @dothebart, I've tested your patch and it doesn't solve my crashes.

I've reviewed our crashes and noticed they are not really the same crash, because the causing signal differs: always signal 6 (Aborted) in my case, and a segfault in yours, as you can see in the title of the bug:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750440

@mfournier

@toni-moreno, can you give the current master branch a try and let us know if you're still experiencing this problem? @dothebart's patch got merged earlier today, but I'm not sure if it's the same one we talked about back in June. Thanks for your perseverance tracking this down :-)

@toni-moreno
Contributor Author

Hi @mfournier.

Sorry for the late response (I was out enjoying some vacation days).

I now have a customized version of collectd running on top of 50 production servers (and no more crashes detected).

My collectd was built from commit d76d251,

my three pull requests (#585, #577, #576),

and @dothebart's patch (dothebart@911b17c).

All has been OK for 2 months.

I will be pleased to test it once I can replace my compiled version with a more up-to-date one, but I also need my 3 still-open pull requests in the master branch.

If you can merge my 3 PRs, I will test it (on test environments first and production afterwards).

Thank you very much.

@toni-moreno
Contributor Author

Hi @mfournier. After 4 months deployed on production systems with commit d76d251, my three pull requests (#585, #577, #576) and @dothebart's patch, I think we can happily close this issue.

Many thanks to both of you.
