8000 fleetd: keep a list of listeners across reconfiguring fleetd by dongsupark · Pull Request #1517 · coreos/fleet · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
This repository was archived by the owner on Jan 30, 2020. It is now read-only.

fleetd: keep a list of listeners across reconfiguring fleetd #1517

Merged
merged 4 commits into from
Apr 14, 2016

Conversation

dongsupark
Copy link

So far fleetd has lost its existing listeners after receiving SIGHUP.
That happened because reconfigure() simply killed the server and started
a fresh one, without taking consideration into the existing listeners.
That resulted in a strange error like:

Failed serving HTTP on listener: @ accept unix @: accept: invalid argument

So let's keep the list of existing listeners before killing the server,
and give it to the new server.

Fixes: #1190

/cc @kayrus @dalbani

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch from d620ce2 to 53315cc Compare March 23, 2016 16:20
listeners = make([]net.Listener, len(oldListeners))
for i, l := range oldListeners {
listeners[i] = l
}
8000 Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you copying the values one by one: can't you just assign listeners = oldListeners?...

(This will just copy the reference rather than the actual values -- but I don't see why this would be undesirable?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antrik

can't you just assign listeners = oldListeners?

Correct. Done.

@antrik
Copy link
Contributor
antrik commented Mar 23, 2016

Looks good aside from the one patch comment. I'd say this should come with a testcase, though?

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch from 53315cc to 27e8a96 Compare March 24, 2016 10:22
@dongsupark
Copy link
Author

@antrik

I'd say this should come with a testcase, though?

Done. I added a new functional test TestReconfigureServer.

listeners, err := activation.Listeners(false)
if err != nil {
return nil, err
var listeners []net.Listener
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little simpler
func New(cfg config.Config, listeners []net.Listener) (*Server, error) {

if len(listeners) == 0 {
    listeners, err = activation.Listeners ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== nil check is unnecessary, len handles it for you

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonboulle It's done. Thanks for the suggestion.

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch from f61402a to 6beca61 Compare April 4, 2016 10:14
@jonboulle
Copy link
Contributor

Almost LGTM but perhaps consider https://github.com/coreos/fleet/pull/1528/files ?

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch 3 times, most recently from 0cf63e2 to a55c155 Compare April 5, 2016 09:20
@dongsupark
Copy link
Author

@jonboulle Done. I dropped unnecessary sudo calls, except for sudo kill -HUP.

// Without this sleep, the entire fleetd test always ends up succeeding
// no matter whether SIGHUP came or not.
_, _ = cluster.MemberCommand(m0, "sh", "-c", `'sleep 2'`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, would it help after SIGHUP is sent to check the journal if "fleet.Reloading configuration." is really there ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI the problem is actually that we need to make sure fleetd is fully initialised before even sending the SIGHUP. I agree though that it would be better to wait for this by checking the logs. (Unless there is some more elegant method for checking fleetd's exact status?...)

Note that a proper check like that will also make the preceding checks using systemctl unnecessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tixxdz

Hmm, would it help after SIGHUP is sent to check the journal if "fleet.Reloading configuration." is really there ?

Yes, I updated it like that.
But that change didn't allow me to remove the sleep command before sending SIGHUP. Without the sleep, test always ends up succeeding, no matter what.. :-(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongsupark actually, I just realised that WaitForNMachines() will only succeed once fleet is fully up? So the systemctl checks should definitely be redundant.

Also, I still don't understand why the sleep is necessary... But I guess that's not a showstopper.

@tixxdz
Copy link
Contributor
tixxdz commented Apr 5, 2016

I just commented, if there are no easy ways! then don't bother, lgtm.

Thank you!

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch 8 times, most recently from f7ef9e6 to b05ec58 Compare April 8, 2016 16:28
err = waitForFleetdSocket(cluster, m0)
if err != nil {
t.Fatalf("Failed to get a list of fleetd sockets: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this? I would assume it's no longer necessary, now that we actually check for the socket connection still working?...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you use it here now to check for fleet being up, which makes sense I guess. With this in place, do you still need the sleep above?...

Unfortunately yes, I still need the sleep. I just did another round of tests, but it's still the same.
As mentioned above, without the sleep, the whole test always succeeds. What would be the point testing this at all, if it would always give the same answer? That's where I'm puzzled, but for me it's more important to get the test working at all.

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch 2 times, most recently from ec9f290 to a803e41 Compare April 12, 2016 09:36
@dongsupark
Copy link
Author

Updated.

  • remove unnecessary checks for systemctl show in the beginning
  • remove unnecessary checks for fleetd socket after sending the signal
  • split out waitForReloadConfig for better readability

@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch from a803e41 to 4fbe355 Compare April 12, 2016 09:52
// in a subtle way, so that it sometimes fails only on semaphoreci.
// - dpark 20160408
stdout, _ := cluster.MemberCommand(m0, "journalctl -u fleet")
journalfleet := strings.TrimSpace(stdout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongsupark does it make a difference here if you do "journalctl _PID=$(pidof fleetd)" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why pidof would be more reliable than -u? Do we expect more than one instance of fleetd to exist during the test? (And if so, is it actually appropriate to consider only the currently active one?...)

(I don't see why using grep wouldn't be reliable in the first place, though... But doing the compare in Go is fine too; so I guess this is a non-issue.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the gains of using -u where we just want what the pid logs which is more precise in this context ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yeh I forget to say that some developers are also just lazy so they run the tests again and again on the same VM grepping the old bogus journal logs...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tixxdz there are no "old bogus journal logs" -- each test spawns fresh containers with fresh logs. And I still think the -u abstraction offered to us by journalctl is more elegant than dealing with low-level details such as PIDs ourselves...

(PIDs are in fact prone to race conditions in general -- though I guess that's not really an issue in this specific case.)


// check if fleetd is still running correctly, by running fleetctl status
_, _, err = cluster.Fleetctl(m0, "status")
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongsu: could you please change this to:
fleet $ ./bin/fleetctl list-units; echo $? Error retrieving list of units from repository: googleapi: Error 503: fleet server unable to communicate with etcd 1
fleet $ ./bin/fleetctl status; echo $? 0
Second one succeeded.
That's my result of experimenting right now, at least it is the same as in the original reported bug

Thanks you!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tixxdz this reveals that there's still a bug somewhere. I'll try to investigate soon.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tixxdz Fixed a bug. Checking with "fleetctl list-units" is now working.

Dongsu Park added 4 commits April 13, 2016 17:37
So far fleetd has lost its existing listeners after receiving SIGHUP.
That happened because reconfigure() simply killed the server and started
a fresh one, without taking consideration into the existing listeners.
That resulted in a strange error like:

  Failed serving HTTP on listener: @ accept unix @: accept: invalid argument

So let's keep the list of existing listeners before killing the server,
and give it to the new server.

Fixes: coreos#1190
Fix a bug that fleetd loses connections to etcd after handling SIGHUP.
Its reason was srv.Kill() closing killc unconditionally, which causes
later the monitor to stop working. Solution is closing kill channel
only if it's not in reconfigure context.
A new test TestReconfigureServer() checks whether fleetd managed to
keep its net.Listeners across reconfiguration of fleetd, after receiving
SIGHUP.
Improve TestReconfigureServer() like below:

* Check if fleetd socket is still there across handling SIGHUP,
  by running once "fleetctl list-units".
* Before sending SIGHUP, check for fleetd socket being still alive.
* After sending SIGHUP, check for the journal containing expected
  logs about reconfiguration.
* Call "systemctl kill -s SIGHUP fleet" instead of "kill -HUP $(pidof
  fleet)"
* Call "journalctl _PID=$(pidof fleetd)" instead of "journalctl -u
  fleet"
* Remove unnecessary checks for systemctl show in the beginning.
@dongsupark dongsupark force-pushed the dongsu/fleetd-reconfig-listeners branch from 4fbe355 to 755ed0b Compare April 13, 2016 15:48
@tixxdz
Copy link
Contributor
tixxdz commented Apr 14, 2016

Lgtm, and since this is fixing a bug that most users want I'm merging this series. If any one have a problem with that minor sleep in the functional test feel free to open an issue or a PR. Sure it helps to serialize things when the container and all the services are started and where the logs and extra operation may have some delays...

Thanks!

@tixxdz tixxdz merged commit de248c5 into coreos:master Apr 14, 2016
@dongsupark dongsupark deleted the dongsu/fleetd-reconfig-listeners branch April 19, 2016 10:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0