Random Segfaults · Issue #3453 · home-assistant/core · GitHub

Random Segfaults #3453


Closed
turbokongen opened this issue Sep 19, 2016 · 78 comments · Fixed by #3639

Comments

@turbokongen
Contributor

Make sure you are running the latest version of Home Assistant before reporting an issue.

You should only file an issue if you found a bug. Feature and enhancement requests should go in the Feature Requests section of our community forum:

Home Assistant release (hass --version):
0.29.0dev

Python release (python3 --version):
3.4.4

Component/platform:
asyncio

Description of problem:
Random segfaults. No pattern.

Expected:
Normal operation.

Problem-relevant configuration.yaml entries and steps to reproduce:

homeassistant:
  # Name of the location where Home Assistant is running
  name: 'Drømmehuset'
  # Location required to calculate the time the sun rises and sets
  latitude: !secret latitude
  longitude: !secret longitude
  elevation: 35
  # C for Celsius, F for Fahrenheit
  unit_system: metric
  # Pick yours from here: http://en.wikipedia.org/wiki/List_of_tz_database_time_zones
  time_zone: Europe/Oslo

##########################################################
# Personalization
##########################################################
  customize: !include customize.yaml

# Discover some devices automatically
# discovery:

# Pushbullet notification service
notify:
  platform: pushbullet
  api_key: !secret pushbullet_api_key
  name: 'pushmelding'

# Zwave settings
zwave:
  usb_path: /dev/ttyACM0
  config_path: /srv/hass/src/python-openzwave/openzwave/config
  polling_interval: 30000
  debug: 1
  autoheal: False

# RFXtrx433E settings
rfxtrx:
  device: /dev/serial/by-id/usb-RFXCOM_RFXtrx433_A1Z69C6G-if00-port0
  debug: True

# Enables support for tracking state changes over time.
history:
#  placeholder: true
recorder:
  purge_days: 14

# Track the sun
sun:

# Enables the frontend
frontend:

# View all events in a logbook
logbook:

logger:
  default: info
  logs:
    homeassistant.components.zwave: debug
    homeassistant.components.rfxtrx: debug

# Show links to resources in log and frontend
#introduction:

# Allows you to issue voice commands from the frontend
#conversation:

# Checks for available updates
# updater:

# Verisure Hub
verisure:
 username: !secret verisure_username
 password: !secret verisure_password
 code_digits: 6

http:
# development: 1
 api_password: !secret http_password

########################################################
# Monitors
########################################################
device_tracker:
  - platform: ddwrt
    host: 10.0.0.138
    username: !secret ddwrt_username
    password: !secret ddwrt_password
    consider_home: 36
  - platform: ddwrt
    host: 10.0.0.88
    username: !secret ddwrt_username
    password: !secret ddwrt_password

########################################################
# Groups
########################################################
group: !include group.yaml

#########################################################
# HVAC
#########################################################
#hvac:
#  - platform: demo
#########################################################
# Sensors
#########################################################
sensor: !include sensor.yaml

#########################################################
# Switches
#########################################################
switch: !include switches.yaml
#########################################################
# Light switches
#########################################################
light: !include light.yaml
#########################################################
# Thermostats
#########################################################
climate: !include climate.yaml

#########################################################
# Automation
#########################################################
automation: !include automation.yaml

####################################
# Automation scripts
####################################
script: !include script.yaml
####################################
# Media players
####################################
media_player:
  - platform: cast

######################################
# Input_select
######################################
input_select: !include input_select.yaml

########################################
# Camera
########################################
camera:
  - platform: ffmpeg
    input: !secret grillbua_input
    name: Grillbua
    ffmpeg_bin: /usr/bin/ffmpeg
  - platform: ffmpeg
    input: !secret parkering_input
    name: Parkering
    ffmpeg_bin: /usr/bin/ffmpeg
  - platform: ffmpeg
    input: !secret terrasse_input
    name: Terrasse
    ffmpeg_bin: /usr/bin/ffmpeg
  - platform: local_file
    name: 'Dørklokke'
    file_path: /home/hasss/.homeassistant/ringeklokke.jpeg
  - platform: ffmpeg
    input: !secret garasje_input
    name: Garasjen
####################################################
# Cover
####################################################
#cover:
#  - platform: demo

#################################################
# Panels
#################################################
#panel_iframe:
#  ozwcp:
#    title: 'Zwave Control Panel'
#    icon: 'mdi:router-wireless'
thermostat:
  - platform: heat_control
    name: 'Gulvvarme bad 1.etg test'
    heater: switch.aeotec_dsc18103_micro_smart_switch_2nd_edition_switch_2
    target_sensor: sensor.gulvtemp_bad_1etg
    min_temp: 20
    max_temp: 33
    target_temp: 28.5
  1. No action needed to replicate.

Traceback (if applicable):
http://hastebin.com/donuvejugu.pas
http://hastebin.com/ulicipebes.pas
http://hastebin.com/salopofaya.pas
http://hastebin.com/idereremil.pas
http://hastebin.com/jivipugibe.pas
http://hastebin.com/ixadobikul.pas
http://hastebin.com/qalozikawe.pas
http://hastebin.com/reduwuvewi.pas

Additional info:
Components and platforms:
device_tracker: ddwrt
notify: pushbullet
zwave: climate, switch, sensor, binary_sensor
rfxtrx: switch, light, sensor
verisure: alarm, sensor, switch, lock
media_player: cast
camera: ffmpeg
thermostat: heat_control
climate: generic_thermostat
switch: rfxtrx, zwave, command_line, template
light: rfxtrx
sensor: rfxtrx, template, command_line, systemmonitor, yr
Automations and scripts as well.

OS: ubuntu 16.04 64bit

@technicalpickles
Contributor
technicalpickles commented Sep 19, 2016

I've been hitting a random segfault as well starting yesterday on dev.

Some tips I've seen for debugging:

To help debug async, you can add this under line 115 in core.py:
self.loop.set_debug(True)
(from gitter)

And also using gdb:

If you can run with gdb, the exact cause can be determined: https://wiki.python.org/moin/DebuggingWithGdb
You can attach to the running process and wait for it to segfault, or run hass with gdb and wait for it.
Pull info on the threads; py-bt gets a Python stack trace.
You'll need to run the command from a terminal so that when it stops, you can interact with it (not a system-started script).
(from gitter)
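
For anyone who wants to try asyncio debug mode outside of Home Assistant first, here is a minimal standalone sketch. It is not Home Assistant code: `blocking_step` and `main` are made-up names, and it uses modern `asyncio.run`, which did not exist on the Python 3.4/3.5 versions discussed here. Setting the environment variable `PYTHONASYNCIODEBUG=1` is another documented way to turn debug mode on.

```python
import asyncio
import logging
import time

# asyncio reports debug-mode findings through the "asyncio" logger.
logging.basicConfig(level=logging.DEBUG)


async def blocking_step():
    # Deliberately blocks the loop; debug mode should log it as a slow callback.
    time.sleep(0.2)


async def main():
    loop = asyncio.get_running_loop()
    loop.set_debug(True)               # the same call as suggested for core.py
    loop.slow_callback_duration = 0.1  # seconds; 0.1 is also the documented default
    await blocking_step()
    return loop.get_debug()


debug_enabled = asyncio.run(main())
```

Debug mode will not catch a segfault by itself, but it can point at coroutines that block the loop or are never awaited.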

@turbokongen
Contributor Author

Initial output with gdb running: http://hastebin.com/uvegobuweg.pl
bt:
http://hastebin.com/zigemufaya.swift
info threads:
http://hastebin.com/vecacuhira.sql

@turbokongen
Contributor Author

Recent crash from tonight with faulthandler:
http://hastebin.com/qozubivewa.vbs
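
For others who want to collect the same kind of trace, a rough sketch of enabling `faulthandler` from the standard library follows. The temporary file and the helper name `enable_crash_tracebacks` are only for illustration; in practice you would point it at a real log file that stays open for the life of the process, or simply start Python with `-X faulthandler`.

```python
import faulthandler
import tempfile


def enable_crash_tracebacks(log_file):
    """Dump Python tracebacks of all threads to log_file on fatal signals.

    log_file must stay open for as long as faulthandler is enabled.
    """
    faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage: a temporary file stands in for a real crash log.
crash_log = tempfile.TemporaryFile(mode="w+")
enabled = enable_crash_tracebacks(crash_log)

# dump_traceback() writes the same kind of report on demand, which is
# a quick way to check that the log destination actually works.
faulthandler.dump_traceback(file=crash_log, all_threads=True)
crash_log.seek(0)
report = crash_log.read()
```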

@balloob
Member
balloob commented Sep 20, 2016

There are 47 threads running. Highlighted some things from the log that could be it. If we could get a log from another user we could compare.

16 callbacks are waiting for the event loop (searched for run_callback_threadsafe).

After the segfault, the output from ping is printed to the command line, which means that a command line sensor was running.

  File "/usr/local/lib/python3.4/subprocess.py", line 491 in _eintr_retry_call
  File "/usr/local/lib/python3.4/subprocess.py", line 1520 in _try_wait

The recorder is writing a query:

File "/home/hasss/.homeassistant/deps/sqlalchemy/engine/default.py", line 450 in do_execute

Verisure was updating your lock (search for verisure/devices/lock.py)

RFXtrx is reading from the serial connection (search for serial/serialposix.py)


If possible, could you turn the following off one by one to see if it stops the segfaulting: command line switch, rfxtrx, recorder, verisure

@technicalpickles
Contributor

Here's my config, but it doesn't look like we have much in the way of component overlap:

homeassistant: !include homeassistant.yaml

zone: !include zone.yaml

group: !include groups.yaml
scene: !include scenes.yaml

logbook:
frontend:
history:
discovery:
zeroconf:
sun:
http: !include http.yaml

mqtt: !include mqtt.yaml

device_tracker: !include device_trackers.yaml
sensor onlinedness: !include sensor-onlinedness.yaml

switch harmony hub: !include harmony-switches.yaml
switch template: !include template_switches.yaml

sensor forecast: !include forecast.yaml 
sensor speedtest: !include speedtest.yaml

notify slack: !include notify-slack.yaml

light hue: !include hue.yaml
ecobee: !include ecobee.yaml
sleepiq: !include sleepiq.yaml
#zwave: !include zwave.yaml
input_boolean: !include input_booleans.yaml
emulated_hue: !include emulated_hue.yaml
vera: !include vera.yaml

script: !include_dir_named scripts
automation: !include_dir_merge_list automations

Home Assistant release (hass --version):
0.29.0dev

Python release (python3 --version):
3.4.2

Running on a Raspberry Pi 2 with Raspbian Jessie.

I didn't see any segfaults yesterday, but I have hass running under gdb now

@turbokongen
Contributor Author
turbokongen commented Sep 21, 2016

Crash from last night. It took longer this time.
Now I have disabled all command line stuff, including systemmonitor.
http://hastebin.com/toxerobunu.vbs
Next, I'm disabling template sensors.

@turbokongen
Contributor Author

It's now been running without problems for a day and a half.
I think I will reenable the commandline stuff, and see what happens.

@balloob
Member
balloob commented Sep 22, 2016

If commandline stuff is to blame, we should port it over to use async stuff https://docs.python.org/3/library/asyncio-subprocess.html
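
As a rough idea of what such a port could look like, here is a small sketch using `asyncio.create_subprocess_exec` from the linked docs. It is not the actual command line component: `run_command` is a made-up helper, the 10 second timeout is an arbitrary assumption, and the `async`/`await` syntax needs a newer Python than the 3.4 some people in this thread are running.

```python
import asyncio
import sys


async def run_command(*args, timeout=10.0):
    """Run a command without blocking the event loop.

    Returns (returncode, decoded stdout).
    """
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a runaway child behind if the command hangs.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, stdout.decode()


async def main():
    # The current interpreter stands in for something like ping.
    return await run_command(sys.executable, "-c", "print('pong')")
```

Running `asyncio.run(main())` returns the exit code and captured output; the point is that no worker thread sits in a blocking `subprocess` wait while the child runs.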

@turbokongen
Contributor Author

Two things were disabled. First, without command line stuff: crash.
Then I disabled template stuff: running fine.
Now I have re-enabled command line stuff, and will report back ;)

@lwis
Member
lwis commented Sep 23, 2016

Looks like I'm in the same boat. dmesg is showing me:

[91069.635106] python[6012]: segfault at 10 ip 00007fec2e6f52d0 sp 00007fec2822dd20 error 4 in libpython3.5m.so.1.0[7fec2e556000+47c000]
[94613.108242] python[331]: segfault at 10 ip 00007f05660fb2d0 sp 00007f05603750d0 error 4 in libpython3.5m.so.1.0[7f0565f5c000+47c000]
[179867.340974] python[26568]: segfault at 10 ip 00007f9fa35732d0 sp 00007f9f9dd59e90 error 4 in libpython3.5m.so.1.0[7f9fa33d4000+47c000]
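
A side note on reading these lines: subtracting the mapping base (the number in square brackets) from the instruction pointer gives the offset of the faulting instruction inside libpython. The sketch below does that for the three lines above; the regex and the `fault_offset` helper are my own, not part of any tool. All three crashes come out at the same offset, 0x19f2d0, which suggests they hit the same instruction each time. With debug symbols for that exact build, the offset could be mapped to a function name.

```python
import re

DMESG_LINES = [
    "[91069.635106] python[6012]: segfault at 10 ip 00007fec2e6f52d0 sp 00007fec2822dd20 error 4 in libpython3.5m.so.1.0[7fec2e556000+47c000]",
    "[94613.108242] python[331]: segfault at 10 ip 00007f05660fb2d0 sp 00007f05603750d0 error 4 in libpython3.5m.so.1.0[7f0565f5c000+47c000]",
    "[179867.340974] python[26568]: segfault at 10 ip 00007f9fa35732d0 sp 00007f9f9dd59e90 error 4 in libpython3.5m.so.1.0[7f9fa33d4000+47c000]",
]

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>\d+) in (?P<lib>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library, offset of the faulting instruction within it)."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line: " + line)
    offset = int(match.group("ip"), 16) - int(match.group("base"), 16)
    return match.group("lib"), offset


# A set collapses identical (library, offset) pairs into one entry.
offsets = {fault_offset(line) for line in DMESG_LINES}
for lib, offset in sorted(offsets):
    print(lib, hex(offset))
```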

Closing #3484 as it's a dupe.

@Danielhiversen
Member

@lwis : Which components are you using?

@lwis
Member
lwis commented Sep 23, 2016

@Danielhiversen

mqtt
notify
logbook
frontend
updater
sun
history
ifttt
influxdb
emulated_hue
netgear
owntracks
kodi
braviatv
sonos
samsungtv
cast
proximity
fastdotcom
command_line
forecast
time_date
google_travel_time
shell_command
zones

Think I got all the components + platforms.

@Danielhiversen
Member

@lwis : Could you try to disable the command_line component?

@lwis
Member
lwis commented Sep 23, 2016

@Danielhiversen sure, what's the thought behind why that would cause a segfault?

@turbokongen
Contributor Author

No need to disable command line. It seems to trace back to the template stuff.

@lwis
Member
lwis commented Sep 23, 2016

Running gdb -ex r --args python -m homeassistant --config /config now.

@turbokongen
Contributor Author

Got a new somewhat different traceback: http://hastebin.com/ilepedefad.vbs

@technicalpickles
Contributor

Finally caught one in the act: https://gist.github.com/technicalpickles/23e097e213fcd4beb2c83c0e8cf7e06b

It's at a gdb prompt. Anything I should grab while I have it? I tried py-bt but gdb said the command wasn't found.

@bbangert
Member

If anyone with a segfault using the latest dev could pip install uvloop and restart HASS, let me know if the segfaults go away.

@technicalpickles
Contributor
$ pip install uvloop
Collecting uvloop
  Using cached uvloop-0.5.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-uyo_w28w/uvloop/setup.py", line 11, in <module>
        raise RuntimeError('uvloop requires Python 3.5 or greater')
    RuntimeError: uvloop requires Python 3.5 or greater

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-uyo_w28w/uvloop/

@bbangert it looks like uvloop requires Python 3.5. As far as I can tell, there isn't a Raspbian Jessie package for it. I can build from source to test, but wouldn't requiring it be a pretty significant change?

@bbangert
Member

@technicalpickles awwww, bummer. Yes, it's not feasible to require it since it'd raise the required Python version too far.

@balloob
Member
balloob commented Sep 24, 2016

@technicalpickles we wouldn't make it mandatory but it will help to be able to narrow down the seg faults to the default event loop implementation.

@lwis
Member
lwis commented Sep 24, 2016

@bbangert @balloob I use uvloop and still experience the segfaults.

@balloob
Member
balloob commented Sep 24, 2016

Time to get a better overview of the segfault data thus far:

| Architecture | OS | Python | Loop | User | Templates? | Command line |
|---|---|---|---|---|---|---|
| AMD64 | Ubuntu 16.04 | 3.4.4 | Default (Epoll?) | @turbokongen | V | V |
| AMD64 | Ubuntu 16.04 | 3.5.2 | uvloop | @turbokongen | V | V |
| AMD64 - Docker | Alpine | 3.5.2 | Default (Epoll?) | @lwis | V | V |
| AMD64 - Docker | Alpine | 3.5.2 | uvloop | @lwis | V | V |
| RPi (?) | Raspbian Jessie | 3.4.2 (?) | Default (Epoll?) | @technicalpickles | V | V |

@lwis
Member
lwis commented Sep 24, 2016

I can uninstall uvloop if required.

I'm running an Alpine Docker image on Ubuntu 16.04, also using an amd64 machine.

@bbangert
Member

Everyone with a segfault is using a component with the template platform? And disabling the template platform config bits makes the segfaults go away?

@lwis
Member
lwis commented Sep 24, 2016

I've not tried disabling my template configuration, but I'm happy to branch
my config to test.


@technicalpickles
Contributor

I updated to 0.29.3, and left it running overnight with Python 3.5.2 and no uvloop. I disabled the nmap device tracker, and template sensors. I still had discovery and a template switch enabled. I had another segfault in the morning 😓

I'm trying to disable discovery and template switch next.

@ktpx
ktpx commented Sep 30, 2016

Updated to latest today, and I still have random segfaults. I have disabled several things to try, but it makes no difference. The logs do not really indicate anything useful before a crash either.

@lwis
Member
lwis commented Sep 30, 2016

I don't believe anything has been done to improve the situation yet. There was some discussion on collating information on all the segfaults to better understand the configurations and platforms people are experiencing them on.

It's a shame that the occasional brief period of stability is frequently a red herring.


@tchellomello
Contributor

I had a segfault overnight on 0.29.4. I've enabled saving core files here; let's see what GDB will tell us.

@persandstrom
Contributor
persandstrom commented Sep 30, 2016

I have compiled Python with debug flags and run it under gdb. I managed to get the following information, which may be useful to someone with better knowledge of the Python garbage collector.

python3: Modules/gcmodule.c:364: update_refs: Assertion `_PyGCHead_REFS(gc) != 0' failed.

Line 364 of gcmodule.c is commented with:

        /* Python's cyclic gc should never see an incoming refcount
         * of 0:  if something decref'ed to 0, it should have been
         * deallocated immediately at that time.
         * Possible cause (if the assert triggers):  a tp_dealloc
         * routine left a gc-aware object tracked during its teardown
         * phase, and did something-- or allowed something to happen --
         * that called back into Python.  gc can trigger then, and may
         * see the still-tracked dying object.  Before this assert
         * was added, such mistakes went on to allow gc to try to
         * delete the object again.  In a debug build, that caused
         * a mysterious segfault, when _Py_ForgetReference tried
         * to remove the object from the doubly-linked list of all
         * objects a second time.  In a release build, an actual
         * double deallocation occurred, which leads to corruption
         * of the allocator's internal bookkeeping pointers.  That's
         * so serious that maybe this should be a release-build
         * check instead of an assert?

EDIT:
Backtrace: http://pastebin.com/YpMcXhra
Backtrace Full: http://pastebin.com/9X5fjCkC
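
For readers less familiar with the two mechanisms the quoted comment talks about, here is a small Python-level sketch of the normal CPython behaviour it relies on: an object outside any reference cycle is freed the moment its refcount reaches zero, while objects in a cycle are left for the cyclic collector. It cannot reproduce the double deallocation itself, which happens at the C level; `Node` and both helper functions are made up for illustration.

```python
import gc
import weakref


class Node:
    """Plain container object that the cyclic GC tracks."""

    def __init__(self):
        self.other = None


def freed_immediately_without_cycle():
    obj = Node()
    ref = weakref.ref(obj)
    del obj  # refcount drops to 0: CPython deallocates right away
    return ref() is None


def cycle_needs_collector():
    gc.disable()  # keep the automatic collector out of the way for the demo
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # build a reference cycle
        ref = weakref.ref(a)
        del a, b  # refcounts stay above 0 because of the cycle
        alive_before = ref() is not None
        gc.collect()  # the cyclic collector finds and frees the pair
        alive_after = ref() is not None
    finally:
        gc.enable()
    return alive_before, alive_after
```

The assertion in the backtrace fires when the collector meets an object whose refcount is already 0, a state that the first function shows should never be visible to it.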

@technicalpickles
Contributor

I haven't seen any more segfaults since disabling discovery and template switch.

Is there a point at which it makes sense to revert the changes to the core?

@bbangert
Member

I've pushed a branch that removes a possible issue with Python GC of the Task objects. If someone that has a segfault happen could give it a try and let me know if the segfaults persist that'd be great.

https://github.com/home-assistant/home-assistant/tree/fix/monkey-patch-asyncio

@mcradit
mcradit commented Oct 1, 2016

I have run every 0.29 release and did not have a segfault until today, when I upgraded to 0.29.5; since then I have had 2. I am running 30.dev now.

@bbangert
Member
bbangert commented Oct 1, 2016

@mcradit if that still has segfaults, try my branch, which is based on the latest dev with one tweak to remove a possible GC issue.

@rpitera
rpitera commented Oct 1, 2016

Removed most of my template sensors and moved to the core version of wunderground.py (was still using the original in custom components) but still getting segfaults.

Running 0.29.5 on Python 3.4.2 Debian Jessie

● home-assistant.service - Home Assistant
   Loaded: loaded (/etc/systemd/system/home-assistant.service; enabled)
   Active: failed (Result: signal) since Fri 2016-09-30 17:07:00 EDT; 5h 16min ago
  Process: 14089 ExecStart=/srv/hass/hass_venv/bin/hass -c /home/hass/ (code=killed, signal=SEGV)
 Main PID: 14089 (code=killed, signal=SEGV)

@bbangert
Member
bbangert commented Oct 1, 2016

@rpitera How long does it take before a seg-fault? Can you try my branch?

@mcradit
mcradit commented Oct 1, 2016

I haven't had any yet. I didn't change anything in my config. I have a few
template sensors and Wunderground.


@rpitera
rpitera commented Oct 1, 2016

@bbangert It was up for at least 6-8 hours on 29.4, but only about 2-3 under 29.5.

I've never used anything besides releases so you'd have to nursemaid me through testing your branch.

@jwl17330536

@bbangert I just installed your branch and will let you know what I see.

Here's something I've noticed though... I have an alias 'start' that I use all the time. It executes: "systemctl start home-assistant; journalctl -f -u home-assistant". As long as I keep my session open I do not see any issues and my home-assistant instance maintains. A couple minutes after my session ends or I CTRL-C out of the journalctl I lose HA...

@jwl17330536

@rpitera - I used: pip3 install git+git://github.com/home-assistant/home-assistant.git@fix/monkey-patch-asyncio

@balloob
Member
balloob commented Oct 1, 2016

For the people experiencing segfaults, are you using the discovery component?

@persandstrom
Contributor

Yes I was, will remove it and start again.

@balloob
Member
balloob commented Oct 1, 2016

@persandstrom please also use the patch by @bbangert

@tchellomello
Contributor

@bbangert I'm testing your branch monkey-patch-asyncio now. I'll post the results tomorrow.
thx

@persandstrom
Contributor

@balloob Only disabling discovery did not help. Trying to apply patch now.

@jwl17330536

I'm using discovery. @bbangert's branch has been running solidly for me for 9 hours.

@ktpx
ktpx commented Oct 1, 2016

Pretty serious issue.. there should be no code changes until this is fixed.

@tchellomello
Contributor

@bbangert @balloob Good news!! No segfaults on my environment after running https://github.com/home-assistant/home-assistant/tree/fix/monkey-patch-asyncio Awesome!! 👍

@technicalpickles
Contributor

I updated to use https://github.com/home-assistant/home-assistant/tree/fix/monkey-patch-asyncio yesterday afternoon, left it running over night, and no segfaults 🎉

@home-assistant home-assistant locked and limited conversation to collaborators Mar 17, 2017