Fleet fails to start units after restart · Issue #1090 · coreos/fleet · GitHub
This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Fleet fails to start units after restart #1090

Closed
yaronr opened this issue Jan 13, 2015 · 10 comments

Comments

yaronr commented Jan 13, 2015

Hi

3-node CoreOS Beta channel cluster. One node on CoreOS 522.5 (the problematic one), two on 522.4.

Last night, one of my nodes decided to upgrade its CoreOS version. Cool.
This morning I found that a few of the services that should be running on this node are inactive/dead.
For the purposes of this issue, I'll focus on two of the services:
https://gist.github.com/yaronr/62e70a897a5560a8cc63

weave.service 1cf0847f.../10.0.4.65 active running
weave.service 57c5b6a6.../10.0.5.237 active running
weave.service a3a566ba.../10.0.0.168 active running
zookeeper-weave-sidekick@1.service 1cf0847f.../10.0.4.65 active running
zookeeper-weave-sidekick@2.service a3a566ba.../10.0.0.168 active running
zookeeper-weave-sidekick@3.service 57c5b6a6.../10.0.5.237 inactive dead
zookeeper@1.service 1cf0847f.../10.0.4.65 active running
zookeeper@2.service a3a566ba.../10.0.0.168 active running
zookeeper@3.service 57c5b6a6.../10.0.5.237 inactive dead

registry.service is actually started directly by systemd (via cloud-init), not via fleet, but it's also up:
core@ip-10-0-5-237 ~ $ systemctl | grep registry
registry.service loaded active running Custom Docker Registry

I tried digging a bit deeper:
core@ip-10-0-5-237 ~ $ systemctl status zookeeper@3.service
zookeeper@3.service - Zookeeper 3
Loaded: loaded (/run/fleet/units/zookeeper@3.service; linked-runtime)
Active: inactive (dead)

Jan 13 05:20:44 ip-10-0-5-237.ec2.internal systemd[1]: Stopping Zookeeper 3...
Jan 13 05:20:44 ip-10-0-5-237.ec2.internal docker[9047]: zoo3
Jan 13 05:20:44 ip-10-0-5-237.ec2.internal systemd[1]: Stopped Zookeeper 3.

core@ip-10-0-5-237 ~ $ systemctl status zookeeper-weave-sidekick@3.service
zookeeper-weave-sidekick@3.service - zookeeper-weave-sidekick-3 service
Loaded: loaded (/run/fleet/units/zookeeper-weave-sidekick@3.service; linked-runtime)
Active: inactive (dead)

Interestingly, fleetctl list-unit-files gives:
zookeeper@3.service 090d52d launched launched 57c5b6a6.../10.0.5.237
even though list-units shows it as inactive/dead.

OK, so I try:
fleetctl start zookeeper@3.service

Nothing changes; systemctl status is the same (and no new logs).

sudo systemctl restart zookeeper@3.service
does the trick: both the unit and the sidekick are started.

fleetctl now shows it as 'running/active'.

Question: could this be related to the Requires= dependency on a non-fleet unit? (Even though that unit IS running, it's a plain systemd unit rather than a fleet-managed one.)

yaronr (Author) commented Jan 13, 2015

Note: another couple of units (also a unit + sidekick pair) failed in the same way, and they have no dependency on registry.service or any other non-fleet-controlled unit, so that's one less variable in the equation.

yaronr (Author) commented Jan 13, 2015

OK, one additional piece of information:
I have another unit that's not starting, even after calling fleetctl start. The unit is below (from the gist); a quick verification sketch follows it.
etcd is up, mesos-master@1 is up (%i = 1)

[Unit]
Description=%p discovery Container

Wants=etcd.service
After=etcd.service

After=mesos-master@%i.service
BindsTo=mesos-master@%i.service

[Service]
Restart=always
RestartSec=5
ExecStart=/bin/bash ....

[X-Fleet]
MachineOf=mesos-master@%i.service
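
For reference, a rough way to verify that the BindsTo/MachineOf targets above are actually satisfied (a sketch only; the unit names are taken from the file above and the grep patterns are just illustrative):

# is mesos-master@1.service scheduled and active, and on which machine?
fleetctl list-units | grep mesos-master@1
fleetctl list-machines
# on the machine that owns it: are the dependencies really up?
systemctl status mesos-master@1.service etcd.service
# fleet's own view of what it tried to do on this host
journalctl -u fleet.service --no-pager | tail -n 50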

bcwaldon (Contributor) commented Feb 9, 2015
  1. The reason your fleetctl start zookeeper@3.service appears to do nothing is a combination of #745 (idempotency of commands is confusing) and #1025 (fleetctl start != systemctl start). You've told fleet that zookeeper@3.service should be launched somewhere, and it is, so a subsequent fleetctl start is a no-op (see the stop/start sketch after this list).
  2. The random failures on machine startup could be due to #997 (fleet can start units out of order on startup): fleet may attempt to start your sidekick unit(s) first, and they fail due to dependency issues. After they fail, fleet won't try to start them again (#998, fleet apparently ignores failed attempt to start unit).
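
As a rough workaround sketch given the two points above (not fleet's documented procedure, just what typically forces a re-launch): tell fleet to stop the unit so it is no longer considered launched, then start it again, or fall back to systemctl on the owning machine as yaronr already did:

# cycle the desired state so fleet actually re-launches the unit
fleetctl stop zookeeper@3.service
fleetctl start zookeeper@3.service
# or, if fleet still refuses, restart it directly on the machine that owns it
sudo systemctl restart zookeeper@3.service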

Hopefully this information helps you figure out what's going on here.

yaronr (Author) commented Feb 11, 2015

@bcwaldon thanks for your attention.

I have another case:

Wants=etcd.service
After=etcd.service

BindsTo=wordpress.service
After=wordpress.service

Restart=always

Getting:
-- Reboot --
Feb 10 15:01:16 ip-10-0-0-171.ec2.internal systemd[1]: Cannot add dependency job for unit wordpress-sidekick.service, ignoring: Unit wordpress-sidekick.service failed to load: No such file or directory.

Is it the same thing?
(Note: I don't know how soon after the reboot this happened.)

bcwaldon (Contributor) commented:

Yes, this is likely related, if the wordpress-sidekick.service unit is started before wordpress.service makes it to the filesystem.
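
One hedged way to check for that race right after a reboot (paths as seen earlier in this thread; the grep patterns are only illustrative):

# which unit files has fleet actually written to disk so far?
ls -l /run/fleet/units/
# what did fleet and systemd try to start, and in what order, this boot?
journalctl -b -u fleet.service --no-pager | grep -i wordpress
journalctl -b --no-pager | grep 'Cannot add dependency job'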

yaronr (Author) commented Mar 7, 2015

Just an update:
I still have this issue, even on CoreOS 607.0.0.
Was this supposed to be addressed in 607? If not, is there a scheduled release?
This is very annoying.

Thanks

ericson-cepeda commented:

Same here: CoreOS stable (607.0.0)
registrator.service c62e1ed3... failed failed
skydns.service c62e1ed3... failed failed

Not even doing sudo locksmithctl reboot helps.

bcwaldon (Contributor) commented:

This bug should be fixed in all channels. Please share any fleet logs that demonstrate this issue if you are still experiencing it (not just log snippets, it all matters). The exact contents of unit files would be useful, too. Please read through #1158 as well, as that may be the root cause.
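
In case it helps, a minimal sketch of how the requested information might be collected (assuming the unit files live under /run/fleet/units, as shown earlier in this thread):

# full fleet log for the current boot, not just snippets
journalctl -b -u fleet.service --no-pager > fleet.log
# exact contents of the affected unit files
cat /run/fleet/units/*.service
fleetctl cat zookeeper@3.service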

yaronr (Author) commented Apr 12, 2015

@bcwaldon I think this issue should be re-opened.
I'm seeing the same thing on 633.1.

stop-destroy-start doesn't solve the problem.
Note the 'file not found':

marathon-weave-sidekick@1.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)

Apr 12 07:34:57 localhost systemd[1]: Cannot add dependency job for unit marathon-weave-sidekick@1.service, ignoring: Unit marathon-weave-sidekick@1.service failed to load: No such file or directory.
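
A quick diagnostic sketch for the not-found state above (unit name taken from the status output; nothing authoritative, just how one might check whether fleet ever wrote the unit file to this host):

# did fleet ever write the unit file on this machine?
ls /run/fleet/units/ | grep marathon-weave-sidekick
# does fleet itself still know about the unit?
fleetctl list-unit-files | grep marathon-weave-sidekick
fleetctl list-units | grep marathon-weave-sidekick
# any hints from fleet's journal around the failure?
journalctl -b -u fleet.service --no-pager | grep -i marathon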

bcwaldon (Contributor) commented:

fleet v0.9.2 (available in Alpha) addresses the problem you describe above.
