fleetd: keep a list of listeners across reconfiguring fleetd #1517
Conversation
Force-pushed d620ce2 to 53315cc
```go
listeners = make([]net.Listener, len(oldListeners))
for i, l := range oldListeners {
	listeners[i] = l
}
```
Why are you copying the values one by one: can't you just assign `listeners = oldListeners`?...
(This will just copy the reference rather than the actual values -- but I don't see why this would be undesirable?)
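A minimal sketch of the difference in question (strings stand in for net.Listener values; this is illustrative, not the PR's code):

```go
package main

import "fmt"

func main() {
	oldListeners := []string{"sock-a", "sock-b"}

	// Element-by-element copy: a fresh backing array, as in the patch.
	copied := make([]string, len(oldListeners))
	copy(copied, oldListeners)

	// Plain assignment: both slice headers share one backing array.
	assigned := oldListeners

	oldListeners[0] = "mutated"
	fmt.Println(copied[0])   // "sock-a" -- independent storage
	fmt.Println(assigned[0]) // "mutated" -- shared storage
}
```

Either way the listener values themselves are identical; only the slice's backing array differs, which is why the plain assignment would normally be fine here.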
Looks good aside from the one patch comment. I'd say this should come with a testcase, though?
Force-pushed 53315cc to 27e8a96
Done. I added a new functional test.
```diff
-listeners, err := activation.Listeners(false)
-if err != nil {
-	return nil, err
+var listeners []net.Listener
```
A little simpler:
```go
func New(cfg config.Config, listeners []net.Listener) (*Server, error) {
	if len(listeners) == 0 {
		listeners, err = activation.Listeners ...
	}
```
The `== nil` check is unnecessary; `len` handles it for you.
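A quick, self-contained demonstration of why the nil check is redundant (illustrative only):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	var listeners []net.Listener // nil slice

	fmt.Println(listeners == nil)    // true
	fmt.Println(len(listeners) == 0) // also true: len of a nil slice is 0
}
```

So `len(listeners) == 0` covers both the nil and the empty case in one condition.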
@jonboulle It's done. Thanks for the suggestion.
Force-pushed f61402a to 6beca61
Almost LGTM, but perhaps consider https://github.com/coreos/fleet/pull/1528/files ?
Force-pushed 0cf63e2 to a55c155
@jonboulle Done. I dropped unnecessary sudo calls, except for …
```go
// Without this sleep, the entire fleetd test always ends up succeeding
// no matter whether SIGHUP came or not.
_, _ = cluster.MemberCommand(m0, "sh", "-c", `'sleep 2'`)
```
Hmm, would it help, after SIGHUP is sent, to check the journal whether "fleet.Reloading configuration." is really there?
AIUI the problem is actually that we need to make sure fleetd is fully initialised before even sending the SIGHUP. I agree though that it would be better to wait for this by checking the logs. (Unless there is some more elegant method for checking fleetd's exact status?...)
Note that a proper check like that will also make the preceding checks using systemctl unnecessary.
> Hmm, would it help, after SIGHUP is sent, to check the journal whether "fleet.Reloading configuration." is really there?

Yes, I updated it like that.
But that change didn't allow me to remove the sleep command before sending SIGHUP. Without the sleep, the test always ends up succeeding, no matter what... :-(
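One way to wait for the log line instead of sleeping would be a small polling helper along these lines -- a sketch only, with runCmd standing in for the harness's MemberCommand, not fleet's actual test code:

```go
package fleettest

import (
	"fmt"
	"strings"
	"time"
)

// waitForJournalEntry polls the fleet journal until the wanted substring
// appears or the timeout expires.
func waitForJournalEntry(runCmd func(cmd string) (string, error), want string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := runCmd("journalctl -u fleet")
		if err == nil && strings.Contains(out, want) {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for journal entry %q", want)
}
```

Whether that would also make the sleep before sending SIGHUP removable is exactly the open question here.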
@dongsupark actually, I just realised that WaitForNMachines() will only succeed once fleet is fully up? So the systemctl checks should definitely be redundant.
Also, I still don't understand why the sleep is necessary... But I guess that's not a showstopper.
I just commented; if there are no easy ways, then don't bother, lgtm. Thank you!
Force-pushed f7ef9e6 to b05ec58
```go
err = waitForFleetdSocket(cluster, m0)
if err != nil {
	t.Fatalf("Failed to get a list of fleetd sockets: %v", err)
}
```
Do we still need this? I would assume it's no longer necessary, now that we actually check for the socket connection still working?...
I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...
> I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...

Unfortunately yes, I still need the sleep. I just did another round of tests, but it's still the same.
As mentioned above, without the sleep, the whole test always succeeds. What would be the point of testing this at all, if it always gave the same answer? That's where I'm puzzled, but for me it's more important to get the test working at all.
Force-pushed ec9f290 to a803e41
Updated.
Force-pushed a803e41 to 4fbe355
```go
// in a subtle way, so that it sometimes fails only on semaphoreci.
// - dpark 20160408
stdout, _ := cluster.MemberCommand(m0, "journalctl -u fleet")
journalfleet := strings.TrimSpace(stdout)
```
@dongsupark does it make a difference here if you do `journalctl _PID=$(pidof fleetd)`?
I don't see why `pidof` would be more reliable than `-u`? Do we expect more than one instance of fleetd to exist during the test? (And if so, is it actually appropriate to consider only the currently active one?...)
(I don't see why using grep wouldn't be reliable in the first place, though... But doing the compare in Go is fine too; so I guess this is a non-issue.)
What are the gains of using `-u`, when we just want what the PID logs, which is more precise in this context?
And yeah, I forgot to say that some developers are also just lazy, so they run the tests again and again on the same VM, grepping the old bogus journal logs...
@tixxdz there are no "old bogus journal logs" -- each test spawns fresh containers with fresh logs. And I still think the `-u` abstraction offered to us by `journalctl` is more elegant than dealing with low-level details such as PIDs ourselves...
(PIDs are in fact prone to race conditions in general -- though I guess that's not really an issue in this specific case.)
```go
// check if fleetd is still running correctly, by running fleetctl status
_, _, err = cluster.Fleetctl(m0, "status")
if err != nil {
```
@dongsu: could you please change this to `list-units`?
```
fleet $ ./bin/fleetctl list-units; echo $?
Error retrieving list of units from repository: googleapi: Error 503: fleet server unable to communicate with etcd
1
fleet $ ./bin/fleetctl status; echo $?
0
```
The second one succeeded.
That's my result of experimenting right now; at least it is the same as in the original reported bug.
Thank you!
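The transcript suggests a check along these lines in the test -- a sketch only, with fleetctl standing in for the harness's cluster.Fleetctl(m0, ...):

```go
package fleettest

import "testing"

// checkFleetdRegistry verifies that fleetd can still reach etcd:
// "list-units" has to talk to the registry, so it fails when the
// connection is lost, while "status" can still exit 0.
func checkFleetdRegistry(t *testing.T, fleetctl func(args ...string) (string, string, error)) {
	if _, _, err := fleetctl("list-units"); err != nil {
		t.Fatalf("fleetctl list-units failed after SIGHUP: %v", err)
	}
}
```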
@tixxdz this reveals that there's still a bug somewhere. I'll try to investigate soon.
@tixxdz Fixed a bug. Checking with "fleetctl list-units" is now working.
So far fleetd has lost its existing listeners after receiving SIGHUP. That happened because reconfigure() simply killed the server and started a fresh one, without taking the existing listeners into consideration. That resulted in a strange error like:
```
Failed serving HTTP on listener: @ accept unix @: accept: invalid argument
```
So let's keep the list of existing listeners before killing the server, and give it to the new server.
Fixes: coreos#1190
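In outline, the fix described in that commit message looks something like this -- simplified stand-in types, not fleet's actual server API:

```go
package fleetd

import "net"

type Server struct {
	listeners []net.Listener
}

func New(listeners []net.Listener) *Server {
	// In the real daemon an empty list would fall back to socket activation.
	return &Server{listeners: listeners}
}

func (s *Server) Listeners() []net.Listener { return s.listeners }

func (s *Server) Kill() { /* stop goroutines; leave the sockets open */ }

func reconfigure(old *Server) *Server {
	saved := old.Listeners() // keep the live sockets
	old.Kill()               // tear down the old server only
	return New(saved)        // the new server reuses the same listeners
}
```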
Fix a bug where fleetd loses its connection to etcd after handling SIGHUP. The cause was srv.Kill() closing killc unconditionally, which later caused the monitor to stop working. The solution is to close the kill channel only if it's not in a reconfigure context.
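The shape of that fix, sketched with illustrative names (not fleet's actual fields):

```go
package fleetd

type Server struct {
	killc         chan struct{}
	reconfiguring bool
}

// Kill stops the server. The monitor selects on killc, so the channel
// must stay open across a SIGHUP-triggered reconfigure; it is closed
// only on a real shutdown.
func (s *Server) Kill() {
	if !s.reconfiguring {
		close(s.killc)
	}
}
```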
A new test TestReconfigureServer() checks whether fleetd manages to keep its net.Listeners across reconfiguration, after receiving SIGHUP.
Improve TestReconfigureServer() as follows:
* Check that the fleetd socket survives handling SIGHUP, by running "fleetctl list-units" once.
* Before sending SIGHUP, check that the fleetd socket is still alive.
* After sending SIGHUP, check that the journal contains the expected logs about reconfiguration.
* Call "systemctl kill -s SIGHUP fleet" instead of "kill -HUP $(pidof fleet)".
* Call "journalctl _PID=$(pidof fleetd)" instead of "journalctl -u fleet".
* Remove unnecessary checks for systemctl show at the beginning.
Force-pushed 4fbe355 to 755ed0b
Lgtm, and since this is fixing a bug that most users want fixed, I'm merging this series. If anyone has a problem with that minor sleep in the functional test, feel free to open an issue or a PR. Sure, it helps to serialize things when the container and all the services are started, and where the logs and extra operations may have some delays... Thanks!
So far fleetd has lost its existing listeners after receiving SIGHUP. That happened because reconfigure() simply killed the server and started a fresh one, without taking the existing listeners into consideration. That resulted in a strange error like:
```
Failed serving HTTP on listener: @ accept unix @: accept: invalid argument
```
So let's keep the list of existing listeners before killing the server, and give it to the new server.
Fixes: #1190
/cc @kayrus @dalbani