fleetd: keep a list of listeners across reconfiguring fleetd #1517
Conversation
Force-pushed d620ce2 to 53315cc
```go
listeners = make([]net.Listener, len(oldListeners))
for i, l := range oldListeners {
	listeners[i] = l
}
```
Why are you copying the values one by one: can't you just assign `listeners = oldListeners`?...
(This will just copy the reference rather than the actual values -- but I don't see why this would be undesirable?)
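A minimal sketch of the difference in question (strings stand in for net.Listener values; this is illustrative, not the PR's code):

```go
package main

import "fmt"

func main() {
	oldListeners := []string{"sock-a", "sock-b"}

	// Element-by-element copy: a fresh backing array, as in the patch.
	copied := make([]string, len(oldListeners))
	copy(copied, oldListeners)

	// Plain assignment: both slice headers share one backing array.
	assigned := oldListeners

	oldListeners[0] = "mutated"
	fmt.Println(copied[0])   // "sock-a" -- independent storage
	fmt.Println(assigned[0]) // "mutated" -- shared storage
}
```

Either way the listener values themselves are identical; only the slice's backing array differs, which is why the plain assignment would normally be fine here.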
Looks good aside from the one patch comment. I'd say this should come with a testcase, though?
Force-pushed 53315cc to 27e8a96
Done. I added a new functional test.
```diff
-listeners, err := activation.Listeners(false)
-if err != nil {
-	return nil, err
+var listeners []net.Listener
```
A little simpler:
```go
func New(cfg config.Config, listeners []net.Listener) (*Server, error) {
	if len(listeners) == 0 {
		listeners, err = activation.Listeners ...
	}
```
The `== nil` check is unnecessary; `len` handles it for you.
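A quick, self-contained demonstration of why the nil check is redundant (illustrative only):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	var listeners []net.Listener // nil slice

	fmt.Println(listeners == nil)    // true
	fmt.Println(len(listeners) == 0) // also true: len of a nil slice is 0
}
```

So `len(listeners) == 0` covers both the nil and the empty case in one condition.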
@jonboulle It's done. Thanks for the suggestion.
Force-pushed f61402a to 6beca61
Almost LGTM, but perhaps consider https://github.com/coreos/fleet/pull/1528/files ?
Force-pushed 0cf63e2 to a55c155
@jonboulle Done. I dropped unnecessary sudo calls, except for …
```go
// Without this sleep, the entire fleetd test always ends up succeeding
// no matter whether SIGHUP came or not.
_, _ = cluster.MemberCommand(m0, "sh", "-c", `'sleep 2'`)
```
Hmm, would it help, after SIGHUP is sent, to check the journal whether "fleet.Reloading configuration." is really there?
AIUI the problem is actually that we need to make sure fleetd is fully initialised before even sending the SIGHUP. I agree though that it would be better to wait for this by checking the logs. (Unless there is some more elegant method for checking fleetd's exact status?...)
Note that a proper check like that will also make the preceding checks using systemctl unnecessary.
> Hmm, would it help, after SIGHUP is sent, to check the journal whether "fleet.Reloading configuration." is really there?

Yes, I updated it like that.
But that change didn't allow me to remove the sleep command before sending SIGHUP. Without the sleep, the test always ends up succeeding, no matter what... :-(
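One way to wait for the log line instead of sleeping would be a small polling helper along these lines -- a sketch only, with runCmd standing in for the harness's MemberCommand, not fleet's actual test code:

```go
package fleettest

import (
	"fmt"
	"strings"
	"time"
)

// waitForJournalEntry polls the fleet journal until the wanted substring
// appears or the timeout expires.
func waitForJournalEntry(runCmd func(cmd string) (string, error), want string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := runCmd("journalctl -u fleet")
		if err == nil && strings.Contains(out, want) {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for journal entry %q", want)
}
```

Whether that would also make the sleep before sending SIGHUP removable is exactly the open question here.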
@dongsupark actually, I just realised that WaitForNMachines() will only succeed once fleet is fully up? So the systemctl checks should definitely be redundant.
Also, I still don't understand why the sleep is necessary... But I guess that's not a showstopper.
I just commented; if there are no easy ways, then don't bother, lgtm. Thank you!
Force-pushed f7ef9e6 to b05ec58
```go
err = waitForFleetdSocket(cluster, m0)
if err != nil {
	t.Fatalf("Failed to get a list of fleetd sockets: %v", err)
}
```
Do we still need this? I would assume it's no longer necessary, now that we actually check for the socket connection still working?...
I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...
> I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...

Unfortunately yes, I still need the sleep. I just did another round of tests, but it's still the same.
As mentioned above, without the sleep, the whole test always succeeds. What would be the point of testing this at all, if it always gave the same answer? That's where I'm puzzled, but for me it's more important to get the test working at all.
Force-pushed ec9f290 to a803e41
Updated.
Force-pushed a803e41 to 4fbe355
```go
// in a subtle way, so that it sometimes fails only on semaphoreci.
// - dpark 20160408
stdout, _ := cluster.MemberCommand(m0, "journalctl -u fleet")
journalfleet := strings.TrimSpace(stdout)
```
@dongsupark does it make a difference here if you do `journalctl _PID=$(pidof fleetd)`?
I don't see why `pidof` would be more reliable than `-u`? Do we expect more than one instance of fleetd to exist during the test? (And if so, is it actually appropriate to consider only the currently active one?...)
(I don't see why using grep wouldn't be reliable in the first place, though... But doing the compare in Go is fine too; so I guess this is a non-issue.)
What are the gains of using `-u`, when we just want what the PID logs, which is more precise in this context?
And yeah, I forgot to say that some developers are also just lazy, so they run the tests again and again on the same VM, grepping the old bogus journal logs...
@tixxdz there are no "old bogus journal logs" -- each test spawns fresh containers with fresh logs. And I still think the `-u` abstraction offered to us by `journalctl` is more elegant than dealing with low-level details such as PIDs ourselves...
(PIDs are in fact prone to race conditions in general -- though I guess that's not really an issue in this specific case.)
```go
// check if fleetd is still running correctly, by running fleetctl status
_, _, err = cluster.Fleetctl(m0, "status")
if err != nil {
```
@dongsu: could you please change this to `list-units`?
```
fleet $ ./bin/fleetctl list-units; echo $?
Error retrieving list of units from repository: googleapi: Error 503: fleet server unable to communicate with etcd
1
fleet $ ./bin/fleetctl status; echo $?
0
```
The second one succeeded.
That's my result of experimenting right now; at least it is the same as in the original reported bug.
Thank you!
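The transcript suggests a check along these lines in the test -- a sketch only, with fleetctl standing in for the harness's cluster.Fleetctl(m0, ...):

```go
package fleettest

import "testing"

// checkFleetdRegistry verifies that fleetd can still reach etcd:
// "list-units" has to talk to the registry, so it fails when the
// connection is lost, while "status" can still exit 0.
func checkFleetdRegistry(t *testing.T, fleetctl func(args ...string) (string, string, error)) {
	if _, _, err := fleetctl("list-units"); err != nil {
		t.Fatalf("fleetctl list-units failed after SIGHUP: %v", err)
	}
}
```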
@tixxdz this reveals that there's still a bug somewhere. I'll try to investigate soon.
@tixxdz Fixed a bug. Checking with "fleetctl list-units" is now working.
So far fleetd has lost its existing listeners after receiving SIGHUP. That happened because reconfigure() simply killed the server and started a fresh one, without taking the existing listeners into consideration. That resulted in a strange error like:
```
Failed serving HTTP on listener: @ accept unix @: accept: invalid argument
```
So let's keep the list of existing listeners before killing the server, and give it to the new server.
Fixes: coreos#1190
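In outline, the fix described in that commit message looks something like this -- simplified stand-in types, not fleet's actual server API:

```go
package fleetd

import "net"

type Server struct {
	listeners []net.Listener
}

func New(listeners []net.Listener) *Server {
	// In the real daemon an empty list would fall back to socket activation.
	return &Server{listeners: listeners}
}

func (s *Server) Listeners() []net.Listener { return s.listeners }

func (s *Server) Kill() { /* stop goroutines; leave the sockets open */ }

func reconfigure(old *Server) *Server {
	saved := old.Listeners() // keep the live sockets
	old.Kill()               // tear down the old server only
	return New(saved)        // the new server reuses the same listeners
}
```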
Fix a bug where fleetd loses its connection to etcd after handling SIGHUP. The cause was srv.Kill() closing killc unconditionally, which later caused the monitor to stop working. The solution is to close the kill channel only if it's not in a reconfigure context.
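The shape of that fix, sketched with illustrative names (not fleet's actual fields):

```go
package fleetd

type Server struct {
	killc         chan struct{}
	reconfiguring bool
}

// Kill stops the server. The monitor selects on killc, so the channel
// must stay open across a SIGHUP-triggered reconfigure; it is closed
// only on a real shutdown.
func (s *Server) Kill() {
	if !s.reconfiguring {
		close(s.killc)
	}
}
```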
A new test TestReconfigureServer() checks whether fleetd manages to keep its net.Listeners across reconfiguration, after receiving SIGHUP.
Improve TestReconfigureServer() as follows:
* Check that the fleetd socket survives handling SIGHUP, by running "fleetctl list-units" once.
* Before sending SIGHUP, check that the fleetd socket is still alive.
* After sending SIGHUP, check that the journal contains the expected logs about reconfiguration.
* Call "systemctl kill -s SIGHUP fleet" instead of "kill -HUP $(pidof fleet)".
* Call "journalctl _PID=$(pidof fleetd)" instead of "journalctl -u fleet".
* Remove unnecessary checks for systemctl show at the beginning.
Force-pushed 4fbe355 to 755ed0b
Lgtm, and since this is fixing a bug that most users want fixed, I'm merging this series. If anyone has a problem with that minor sleep in the functional test, feel free to open an issue or a PR. Sure, it helps to serialize things when the container and all the services are started, and where the logs and extra operations may have some delays... Thanks!
So far fleetd has lost its existing listeners after receiving SIGHUP. That happened because reconfigure() simply killed the server and started a fresh one, without taking the existing listeners into consideration. That resulted in a strange error like:
```
Failed serving HTTP on listener: @ accept unix @: accept: invalid argument
```
So let's keep the list of existing listeners before killing the server, and give it to the new server.
Fixes: #1190
/cc @kayrus @dalbani