8000 Kill check processes after the timeout is reached by kyhavlov · Pull Request #3567 · hashicorp/consul · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Kill check processes after the timeout is reached #3567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 11, 2017
Merged

Conversation

kyhavlov
Copy link
Contributor

Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.

@kyhavlov kyhavlov requested a review from slackpad October 10, 2017 21:11
Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.
agent/check.go Outdated
errCh <- fmt.Errorf("Timed out running check '%s'", c.Script)
}()
err = <-errCh
err = <-waitCh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set err to fmt.Errorf("Timed out running check '%s'", c.Script) to make sure the check gets set to critical (otherwise if the wait returns nil it can go passing, and then we just block on the waitCh but don't set err, so:

err = fmt.Errorf("Timed out running check '%s'", c.Script)
<-waitCh

Copy link
Contributor
@slackpad slackpad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - can you add a test that runs a sleep that exceeds the timeout as a sanity check to force it through this path?


func SetSysProcAttr(cmd *exec.Cmd) {}

func KillCommandSubtree(cmd *exec.Cmd) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document this behavior difference on https://www.consul.io/docs/agent/checks.html.

if c.Timeout > 0 {
time.Sleep(c.Timeout)

timeout := 30 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have test for this - I thought saw it here a little while ago.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an existing test, I just ended up not having to modify it at all

@slackpad slackpad merged commit 106b8b0 into master Oct 11, 2017
@slackpad slackpad deleted the check-timeout-kill branch October 11, 2017 18:57
@nh2
Copy link
nh2 commented Oct 11, 2017

@slackpad @kyhavlov I'm a bit late to the party/review, but first, thanks for the quick fix, and then a couple questions:

  • I think it would be useful to document how the processes are getting killed, e.g. that you're using a process group kill, and which signal you use. That'd be good so that users don't try to do graceful exit on sigterm just to notice that they can't actually do it
  • Does this really fix Script checks can overlap #3565? What I reported there was that script checks can overlap (the fact that no killing happened was just a side observation). It seems to me that if you don't use timeout and use a fast interval, scripts can still overlap and the same issues as I described in the bug can occur. Or did I miss a change to that regard in the commit messages?

@nh2 nh2 mentioned this pull request Oct 11, 2017
@slackpad
Copy link
Contributor

Hi @nh2 we added a wait in the timeout path so the check() function will block, preventing another run of that health check until the Wait() call comes back - https://github.com/hashicorp/consul/blob/master/agent/check.go#L153. That will prevent that particular check from overlapping itself. The docs were updated as part of this change but haven't been pushed out yet.

@nh2
Copy link
nh2 commented Oct 11, 2017

we added a wait in the timeout path so the check() function will block, preventing another run of that health check until the Wait() call comes back

Ah, great.

The docs were updated as part of this change but haven't been pushed out yet.

@slackpad I can see this docs change https://github.com/hashicorp/consul/pull/3567/files#diff-736ca63f9e217827ea7d70758c40cedeR28 but it doesn't really explain to the user how their scripts will be killed.

@slackpad
Copy link
Contributor

@slackpad I can see this docs change https://github.com/hashicorp/consul/pull/3567/files#diff-736ca63f9e217827ea7d70758c40cedeR28 but it doesn't really explain to the user how their scripts will be killed.

That's fair - it says "processes" but should be clearer about process groups and which signal is sent. @kyhavlov can you update the docs to make that more clear?

michaelw pushed a commit to michaelw/consul that referenced this pull request Jan 11, 2018
Version 1.0.0

* tag 'v1.0.0': (455 commits)
  Release v1.0.0
  Puts the tree in 1.0 final release mode.
  Fixes an XSS issue with unescaped node names. (hashicorp#3578)
  Adds a note about the Raft protocol not being the same as the Consul protocol.
  Adds a 1.0 section to the upgrade guide and cleans up the change log.
  Update sentinel.html.markdown.erb
  Update dns forwarding documentation (hashicorp#3574)
  Adds a brief wait and poll period to update check status after a timeout. (hashicorp#3573)
  Cleans up some drift between the OSS and Enterprise trees.
  Clarify the docs around script check timeout behavior
  Updates the change log.
  Kill check processes after the timeout is reached (hashicorp#3567)
  Updates the change log.
  retry locks on network errors (hashicorp#3553)
  Fix example code formatting in godoc
  config: remove redundant code
  config: fix check for segment.port <= 0 and add test
  Adds check to make sure port is given so we avoid a nil bind address.
  Removes obsolete segment stub.
  agent: add option to discard health output (hashicorp#3562)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0