ipmi in combination with java plugins crashes · Issue #114 · collectd/collectd · GitHub

ipmi in combination with java plugins crashes #114


Closed
fcorneli opened this issue Aug 8, 2012 · 14 comments · Fixed by #910
Labels
Bug A genuine bug
Milestone

Comments

@fcorneli
Contributor
fcorneli commented Aug 8, 2012

When using the ipmi plugin in combination with the java plugin, collectd crashes.

[2012-08-08 13:57:42] java plugin: The JVM has been created.
[2012-08-08 13:57:42] java plugin: cjni_thread_attach: cjni_env->reference_counter = 1
[2012-08-08 13:57:42] java plugin: Loading class org/collectd/java/GenericJMX
[2012-08-08 13:57:42] java plugin: Registering new config callback: GenericJMX
[2012-08-08 13:57:42] java plugin: Registering new read callback: GenericJMX
[2012-08-08 13:57:42] java plugin: Registering new shutdown callback: GenericJMX
[2012-08-08 13:57:42] java plugin: cjni_thread_detach: cjni_env->reference_counter = 0
[2012-08-08 13:57:42] java plugin: Configuring GenericJMX
[2012-08-08 13:57:42] java plugin: cjni_thread_attach: cjni_env->reference_counter = 1
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f708559ee10, pid=27654, tid=140121236432640
#
# JRE version: 6.0_33-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.8-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x711e10][2012-08-08 13:57:42] GenericJMX plugin: config: ci = { key: Plugin; values: [GenericJMX]; children: [{ key: MBean; values: [memory_pool]; children: [{ key: ObjectName; values: [java.lang:type=MemoryPool,*]; children: []; }, { key: Instanceprefix; values: [memory_pool-]; children: []; }, { key: InstanceFrom; values: [name]; children: []; }, { key: Value; values: []; children: [{ key: Type; values: [memory]; children: []; }, { key: Table; values: [true]; children: []; }, { key: Attribute; values: [Usage]; children: []; }]; }]; }, { key: Connection; values: []; children: [{ key: Host; values: [e-contract.be]; children: []; }, { key: ServiceURL; values: [service:jmx:rmi:///jndi/rmi://localhost:1090/jmxrmi]; children: []; }, { key: Collect; values: [memory_pool]; children: []; }]; }]; };
  SR_handler(int, siginfo*, ucontext*)+0x30
#
# An error report file with more information is saved as:
# /opt/collectd/var/lib/collectd/hs_err_pid27654.log
[2012-08-08 13:57:42] GenericJMXConfMBean: child.getKey () = ObjectName
[2012-08-08 13:57:42] GenericJMXConfMBean: child.getKey () = Instanceprefix
[2012-08-08 13:57:42] GenericJMXConfMBean: child.getKey () = InstanceFrom
[2012-08-08 13:57:42] GenericJMXConfMBean: child.getKey () = Value
[2012-08-08 13:57:42] GenericJMX.putMBean: Adding memory_pool
[2012-08-08 13:57:42] GenericJMXConfConnection: e-contract.be: Add memory_pool
[2012-08-08 13:57:42] java plugin: cjni_thread_detach: cjni_env->reference_counter = 0
[2012-08-08 13:57:42] java plugin: jvm_argc = 2;
[2012-08-08 13:57:42] java plugin: java_classes_list_len = 1;
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
[2012-08-08 13:57:42] java plugin: cjni_thread_attach: cjni_env->reference_counter = 1
#
[2012-08-08 13:57:42] java plugin: cjni_thread_detach: cjni_env->reference_counter = 0
Aborted
@fcorneli
Contributor Author
fcorneli commented Aug 8, 2012

Maybe something to do with the signal handling of IPMI?

@octo
Member
octo commented Sep 12, 2012

Hi,

do you by any chance have a stack trace from such a crash?

Best regards,
—octo

@fcorneli
Contributor Author

There is no stack trace, just a JVM crash (see above).

@fcorneli
Contributor Author
fcorneli commented Mar 2, 2013

As a workaround I run two instances of collectd: one with the IPMI plugin, and one with the Java plugin.

@robb-reporo

Is this any use?

(gdb) backtrace full
#0  0x00007f3b02f10425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007f3b02f13b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f3afba1b199 in os::abort (dump_core=true)
    at /build/buildd/openjdk-7-7u51-2.4.4/build/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:1594
No locals.
#3  0x00007f3afbb9379f in VMError::report_and_die (this=0x7f3af650eca0)
    at /build/buildd/openjdk-7-7u51-2.4.4/build/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:1034
        skip_os_abort = true
        buffer = "/var/lib/collectd/hs_err_pid41608.log", '\000' <repeats 1962 times>
        out = {<outputStream> = {<ResourceObj> = {<No data fields>}, _vptr.outputStream = 0x7f3afbfe1db0, 
            _indentation = 0, _width = 80, _position = 0, _newlines = 19, _precount = 695, _stamp = {_counter = 0}}, 
          _fd = 1, _need_close = false}
        skip_OnError = false
        skip_bug_url = true
        out_done = true
        log_done = true
        transmit_report_done = true
        recursive_error_count = 0
        log = {<outputStream> = {<ResourceObj> = {<No data fields>}, _vptr.outputStream = 0x7f3afbfe1db0, 
            _indentation = 0, _width = 80, _position = 0, _newlines = 1405, _precount = 79100, _stamp = {_counter = 0}}, 
          _fd = -1, _need_close = false}
        mytid = <optimized out>
#4  0x00007f3afba22b84 in JVM_handle_linux_signal (sig=11, info=0x7f3af650ee70, ucVoid=0x7f3af650ed40, 
    abort_if_unrecognized=1)
    at /build/buildd/openjdk-7-7u51-2.4.4/build/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:531
        uc = 0x7f3af650ed40
        thread = 0x0
        stub = <optimized out>
        newset = {__val = {1024, 0 <repeats 15 times>}}
        err = {<StackObj> = {<No data fields>}, _id = 11, _message = 0x0, _detail_msg = 0x0, _thread = 0x0, 
          _pc = 0x7f3afba1816e "H\213\230", _siginfo = 0x7f3af650ee70, _context = 0x7f3af650ed40, _filename = 0x0, 
          _lineno = 0, _current_step = 0, _current_step_info = 0x7f3afbbb5c5e "", _verbose = 1, 
          static first_error = 0x7f3af650eca0, static first_error_tid = 139891217336064, static coredump_status = true, 
          static coredump_message = "/var/lib/collectd/core or core.41608", '\000' <repeats 1963 times>, _size = 0}
        t = 0x0
        shm = {<StackObj> = {<No data fields>}, _thread = 0x0}
        vmthread = 0x0
        pc = 0x7f3afba1816e "H\213\230"
#5  <signal handler called>
No symbol table info available.
#6  osthread (this=<optimized out>)
    at /build/buildd/openjdk-7-7u51-2.4.4/build/openjdk/hotspot/src/share/vm/runtime/thread.hpp:408
No locals.
#7  SR_handler (sig=<optimized out>, siginfo=0x7f3af650f4f0, context=0x7f3af650f3c0)
    at /build/buildd/openjdk-7-7u51-2.4.4/build/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:3822
        old_errno = 2
        thread = 0x0
        osthread = 0x7f3af650fde0
        current = os::SuspendResume::SR_RUNNING
#8  <signal handler called>
No symbol table info available.
#9  0x00007f3b034b0f8c in pthread_kill () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#10 0x00007f3afc753a55 in ?? () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#11 0x00007f3afc754ca3 in sel_start_timer () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#12 0x00007f3afc7531e7 in ?? () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#13 0x00007f3afc4f435c in ?? () from /usr/lib/libOpenIPMI.so.0
No symbol table info available.
#14 0x00007f3afc4f4534 in ?? () from /usr/lib/libOpenIPMI.so.0
No symbol table info available.
#15 0x00007f3afc48ed39 in ipmi_handle_rsp_item_copymsg () from /usr/lib/libOpenIPMI.so.0
No symbol table info available.
#16 0x00007f3afc4f385c in ?? () from /usr/lib/libOpenIPMI.so.0
No symbol table info available.
#17 0x00007f3afc7546c4 in ?? () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#18 0x00007f3afc754fd7 in sel_select () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#19 0x00007f3afc752977 in ?? () from /usr/lib/libOpenIPMIpthread.so.0
No symbol table info available.
#20 0x00007f3afc95a3f0 in thread_main (user_data=<optimized out>) at ipmi.c:613
        tv = {tv_sec = 1, tv_usec = 0}
#21 0x00007f3b034abe9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#22 0x00007f3b02fce3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#23 0x0000000000000000 in ?? ()
No symbol table info available.

@robb-reporo

Still present in the Debian wheezy package collectd_5.1.0-3.

@robb-reporo

Built the collectd_5.4.0-3ubuntu1_amd64.deb package from Trusty Tahr for our current server build (Precise) and the crash is still present.

@mfournier

Now that we have a backtrace for this, I wonder whether it is another of these thread concurrency issues. @tokkee, @ChrisLundquist, @katzj, are you able to comment on this? Thanks :-)

@robb-reporo

Just to add a little more detail, I have ipmi configured thus...

LoadPlugin ipmi

<Plugin ipmi>
#       Sensor "some_sensor"
#       Sensor "another_one"
#       IgnoreSelected false
#       NotifySensorAdd false
#       NotifySensorRemove true
#       NotifySensorNotPresent false
</Plugin>

and java like this...

LoadPlugin java

<Plugin "java">
  # required JVM argument is the classpath
  # JVMArg "-Djava.class.path=/installpath/collectd/share/collectd/java"
  # Since version 4.8.4 (commit c983405) the API and GenericJMX plugin are
  # provided as .jar files.
  JVMARG "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"

  <Plugin "GenericJMX">
    # Memory usage by memory pool.
    <MBean "memory_pool">
      ObjectName "java.lang:type=MemoryPool,*"
      InstancePrefix "memory_pool-"
      InstanceFrom "name"
      <Value>
        Type "memory"
        #InstancePrefix ""
        #InstanceFrom ""
        Table true
        Attribute "Usage"
      </Value>
    </MBean>

    <Connection>
      Host "svhost"
      InstancePrefix "rad-8080"
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:5000/jmxrmi"
      Collect "memory_pool"
    </Connection>
  </Plugin>
</Plugin>

Removing either java or ipmi configuration prevents the core dump.

I've tried adding "ReadThreads 1" and "WriteThreads 1" to the main section of collectd.conf, but the crash still occurs.

@ChrisLundquist
Contributor

@robmbrooks if you use matching ``` it might help clear up the formatting.
You can do literal pastes like

this

More info here

@ChrisLundquist
Contributor

@mfournier if I recall correctly, the previous threading issue I looked at had to do with multiple libraries initializing libgcrypt (and libgcrypt not getting the threadsafe flag). I'm not seeing libgcrypt in the backtrace. Though, I could see java trying to use it in some native extension.

@mfournier

Not sure if this is related, but 513a5ca, which was committed a few hours ago, seems to do a better job of cleaning up threads. @robmbrooks, if you still have access to the environment where the problem occurs, could you try cherry-picking this patch and let us know if it helps? Thanks!

@vincentbernat
Contributor

I am also hit by this problem and 513a5ca doesn't solve it. It still happens with the master branch. I'll try to debug that.

@vincentbernat
Contributor

My current understanding is that OpenIPMI uses a signal to wake its event loop. It does that when timers are changed, so that it can recompute the timeout correctly. This is a rather odd design. Moreover, it sends the signal to the very thread it is currently running on (which is not blocked on select() at that point).

I have the current bt:

Thread 12 (Thread 0x7fffefcb9700 (LWP 8942)):
#0  0x00007ffff5b8c89e in SR_handler(int, siginfo*, ucontext*) ()
   from /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#1  <signal handler called>
#2  0x00007ffff79c1f8c in pthread_kill () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff6bd46c5 in wake_sel_thread (sel=0x7fffe8000b00) at selector.c:205
#4  0x00007ffff6bd5d4f in sel_start_timer (timer=0x7fffe8011000, timeout=0x7fffefcb89e0)
    at selector.c:478
#5  0x00007ffff6bd447a in start_timer (cb_data=<optimized out>, timed_out=<optimized out>,
    timeout=<optimized out>, id=0x7fffe8010f70, handler=<optimized out>)
    at posix_thread_os_hnd.c:211
#6  start_timer (handler=<optimized out>, id=0x7fffe8010f70, timeout=<optimized out>,
    timed_out=<optimized out>, cb_data=<optimized out>) at posix_thread_os_hnd.c:188
#7  0x00007ffff69768cc in finish_connection (ipmi=0x7fffe800cc90, smi=0x7fffe800cec0)
    at ipmi_smi.c:1268
#8  0x00007ffff6976a98 in handle_dev_id (ipmi=0x7fffe800cc90, msgi=<optimized out>)
    at ipmi_smi.c:1346
#9  0x00007ffff690fdb9 in ipmi_handle_rsp_item_copymsg (ipmi=0x7fffe800cc90,
    rspi=0x7fffe80111a0, msg=<optimized out>, rsp_handler=0x7ffff69769f0 <handle_dev_id>)
    at ipmi.c:1761
#10 0x00007ffff697630c in handle_response (recv=0x7fffefcb8aa0, ipmi=0x7fffe800cc90)
    at ipmi_smi.c:701
#11 gen_recv_msg (recv=0x7fffefcb8aa0, ipmi=0x7fffe800cc90) at ipmi_smi.c:874
#12 ipmi_dev_data_handler (cb_data=0x7fffe800cc90, fd=<optimized out>, id=<optimized out>)
    at ipmi_smi.c:922
#13 ipmi_dev_data_handler (fd=<optimized out>, cb_data=0x7fffe800cc90, id=<optimized out>)
    at ipmi_smi.c:892
#14 0x00007ffff6bd576d in process_fds (sel=0x7fffe8000b00, timeout=<optimized out>,
    send_sig=<optimized out>, thread_id=<optimized out>, cb_data=<optimized out>)
    at selector.c:636
#15 0x00007ffff6bd60a6 in sel_select (sel=0x7fffe8000b00,
    send_sig=0x7ffff6bd3990 <posix_thread_send_sig>, thread_id=140737216482920,
    cb_data=0x7fffe8000a40, timeout=0x7fffefcb8eb0) at selector.c:739
#16 0x00007ffff6bd4067 in perform_one_op (os_hnd=0x7fffe80008e0, timeout=0x7fffefcb8eb0)
    at posix_thread_os_hnd.c:639
#17 0x00007ffff6ddb560 in thread_main (user_data=<optimized out>) at ipmi.c:613
#18 0x00007ffff79bce9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007ffff72d62ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()

No other thread is currently running in selector.c (where the event loop is). It would be far better if OpenIPMI used a file descriptor to wake up its event loop when needed instead of a signal, all the more so because it is a library. I have opened a bug against OpenIPMI for that:
https://sourceforge.net/p/openipmi/bugs/72/
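To illustrate the file-descriptor approach suggested above, here is a minimal Python sketch of a select()-based loop that is woken through a pipe rather than a signal. It is not OpenIPMI code: the `Waker` class and `wait_for_wakeup` function are hypothetical names made up for this example, and it assumes a Unix-like system where select() works on pipes.

```python
import os
import select
import threading


class Waker:
    """Wakes a select()-based loop by writing to a pipe (hypothetical helper)."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()

    def fileno(self):
        # Lets select.select() accept this object directly.
        return self._read_fd

    def wake(self):
        # Safe from any thread, including the one that runs the loop.
        os.write(self._write_fd, b"\0")

    def drain(self):
        # Only called after select() reported the pipe readable, so it does not block.
        os.read(self._read_fd, 4096)

    def close(self):
        os.close(self._read_fd)
        os.close(self._write_fd)


def wait_for_wakeup(waker, timeout):
    """Block in select() until woken or timed out; return True if woken."""
    readable, _, _ = select.select([waker], [], [], timeout)
    if readable:
        waker.drain()
        return True
    return False


waker = Waker()

# Waking from the loop's own thread is harmless: the byte simply waits in the pipe.
waker.wake()
woken_same_thread = wait_for_wakeup(waker, timeout=1.0)

# Waking from another thread interrupts the blocking select() call.
timer = threading.Timer(0.05, waker.wake)
timer.start()
woken_other_thread = wait_for_wakeup(waker, timeout=5.0)
timer.join()

# With nothing written, select() simply times out.
timed_out = not wait_for_wakeup(waker, timeout=0.05)
waker.close()
```

Because the wake-up travels through a file descriptor, no process-wide signal disposition is involved, so nothing here can clash with a handler installed by another library.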

The problem is that the java plugin registers a handler for the very same signal, which the JVM uses to suspend and resume its own threads. However, it is possible to change the signal to something else using _JAVA_SR_SIGNUM=10 (this will cause other bugs).
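The collision described above can be reproduced in miniature. The following Python sketch is only an analogy: the handler names and the order of registration are assumptions for illustration, and neither OpenIPMI nor the JVM is involved. It shows that a signal has one process-wide handler, so the component that registers last receives the other component's wake-up.

```python
import signal
import time

calls = []


def ipmi_wake_handler(signum, frame):
    # Stand-in for the handler a library would expect for its own wake-up signal.
    calls.append("ipmi")


def jvm_suspend_resume_handler(signum, frame):
    # Stand-in for a handler that assumes it only ever runs on its own threads.
    calls.append("jvm")


# The first component installs its handler on SIGUSR2.
original = signal.signal(signal.SIGUSR2, ipmi_wake_handler)

# The second component installs its own handler on the same signal.
# signal.signal() returns the handler that was just replaced.
replaced = signal.signal(signal.SIGUSR2, jvm_suspend_resume_handler)

# The first component now "wakes" itself, but the other handler runs instead.
signal.raise_signal(signal.SIGUSR2)
time.sleep(0.01)  # give the interpreter a chance to run the Python-level handler

# Restore whatever was installed before the demonstration.
signal.signal(signal.SIGUSR2, original if original is not None else signal.SIG_DFL)
```

After this runs, `calls` contains only `"jvm"`: the wake-up intended for the first component was delivered to a handler that knows nothing about it.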

OpenIPMI shouldn't use signals. A workaround is to export _JAVA_SR_SIGNUM=25 (which is greater than SIGBUS and SIGSEGV on all architectures, I think).
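As a rough sketch of this workaround, the Python snippet below builds an environment with `_JAVA_SR_SIGNUM` set and passes it to a child process. The child here is just a Python one-liner standing in for the real daemon, and `env_with_sr_signum` is a hypothetical helper name. The comparison against SIGBUS and SIGSEGV only checks the platform the snippet runs on.

```python
import os
import signal
import subprocess
import sys

JAVA_SR_SIGNUM = 25  # value suggested in the comment above


def env_with_sr_signum(signum, base_env=None):
    """Return a copy of the environment with _JAVA_SR_SIGNUM set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["_JAVA_SR_SIGNUM"] = str(signum)
    return env


env = env_with_sr_signum(JAVA_SR_SIGNUM)

# Check the "greater than SIGBUS and SIGSEGV" remark on the current platform only.
above_fault_signals = JAVA_SR_SIGNUM > max(signal.SIGBUS, signal.SIGSEGV)

# Stand-in child process that prints the variable it inherited.
# A real deployment would start the daemon with this environment instead.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['_JAVA_SR_SIGNUM'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
```

The variable has to be present in the environment of the process that creates the JVM, which is why it is set before the child is started rather than from inside it.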

vincentbernat added a commit to vincentbernat/collectd that referenced this issue Jan 23, 2015
Java uses SIGUSR2 to suspend/resume threads. The OpenIPMI plugins also
need a signal to resume its event loop when setting a timer. They can't
both use the same signal. We ask OpenIPMI to use SIGIO instead.

This should fix collectd#114.
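The idea behind the commit, giving each component its own signal, can be sketched in Python as follows. This is an analogy under stated assumptions rather than the patch itself: the handler names are hypothetical, and the real change lives in the C code of the ipmi plugin.

```python
import signal
import time

log = []


def jvm_suspend_resume_handler(signum, frame):
    log.append(("jvm", signum))


def ipmi_wake_handler(signum, frame):
    log.append(("ipmi", signum))


# Each component owns a different signal, so neither overwrites the other.
old_usr2 = signal.signal(signal.SIGUSR2, jvm_suspend_resume_handler)
old_io = signal.signal(signal.SIGIO, ipmi_wake_handler)

signal.raise_signal(signal.SIGIO)    # wake-up for the event loop
time.sleep(0.01)                     # let the Python-level handler run
signal.raise_signal(signal.SIGUSR2)  # suspend/resume request
time.sleep(0.01)

# Restore the previous dispositions.
signal.signal(signal.SIGUSR2, old_usr2 if old_usr2 is not None else signal.SIG_DFL)
signal.signal(signal.SIGIO, old_io if old_io is not None else signal.SIG_DFL)
```

Each signal now reaches the handler that expects it, which is the property the commit message relies on.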
@pyr pyr closed this as completed in #910 Jan 26, 2015
mfournier pushed a commit that referenced this issue Jan 26, 2015
Java uses SIGUSR2 to suspend/resume threads. The OpenIPMI plugins also
need a signal to resume its event loop when setting a timer. They can't
both use the same signal. We ask OpenIPMI to use SIGIO instead.

This should fix #114.