Grace period option to health checks. #28938
Conversation
I don't really care for the name, but otherwise SGTM.

@cpuguy83 I agree, the name could be better. I'll change it to
(force-pushed from 5c06a0d to 8f354c6)
docs/reference/builder.md (outdated)

```
@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
counted towards the maximum number of retires. However, if a health check succeeds
```
nit: typo s/retires/retries/
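For context, a minimal sketch of the Dockerfile option this passage documents (the base image and probe command are my own illustration, not taken from the PR):

```dockerfile
FROM nginx:alpine

# Failing probes during the first 60s are not counted towards the
# 3 retries; a successful probe during that window ends the grace.
HEALTHCHECK --start-period=60s --interval=10s --timeout=3s --retries=3 \
  CMD wget -q --spider http://localhost/ || exit 1
```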
I can see this being useful for services that have a long startup period, so sgtm. We can bike-shed over naming during review.

ping @aaronlehmann could you have a look if this would work for SwarmKit?

It would need to be plumbed through SwarmKit. This would involve a PR to SwarmKit to add the new flag to the protobuf definitions, and changes in Docker to add the flag to service create/update, and pass it through to the container in the executor.

@aaronlehmann Do you want me to add a PR to SwarmKit and add mapping for service to this PR? I was a bit uncertain whether you wanted this to be merged before adding it to SwarmKit, as it has a dependency on the types updated in this PR.

Not sure I'm the best one to answer that question. I think if maintainers are happy with the design of this PR, the next step would be to open a SwarmKit PR to add support there. But I don't want to get ahead of the design review.
(force-pushed from 65e453f to 14d0d4f)
What do you think about a readiness healthcheck/probe in addition to this one? If the readiness healthcheck/probe fails, the container would not yet be considered ready (related to #26664 (comment)), because operations like a database upgrade can take an unpredictable amount of time at startup.

PS: And maybe a different restart policy to apply between that state and unhealthy.
/cc @dongluochen
daemon/health.go (outdated)

```go
if shouldIncrementStreak {
	h.FailingStreak++
}
if h.FailingStreak >= retries {
```
State change should only happen when `FailingStreak` increases. It's better to move the `if h.FailingStreak >= retries` block into `if shouldIncrementStreak`.
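A runnable sketch of that suggestion (the type and the status strings are simplified stand-ins for the ones in daemon/health.go):

```go
package main

import "fmt"

// Health is a simplified stand-in for the container health state.
type Health struct {
	Status        string
	FailingStreak int
}

// applyProbeResult checks the retry threshold only when the streak
// actually increased, so probes ignored during the start period can
// never flip the status to unhealthy.
func applyProbeResult(h *Health, shouldIncrementStreak bool, retries int) {
	if shouldIncrementStreak {
		h.FailingStreak++
		if h.FailingStreak >= retries {
			h.Status = "unhealthy"
		}
	}
}

func main() {
	h := &Health{Status: "starting"}
	for i := 0; i < 3; i++ {
		applyProbeResult(h, false, 3) // failures during the start period: ignored
	}
	fmt.Println(h.Status, h.FailingStreak) // starting 0
	for i := 0; i < 3; i++ {
		applyProbeResult(h, true, 3) // failures after the start period: counted
	}
	fmt.Println(h.Status, h.FailingStreak) // unhealthy 3
}
```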
docs/reference/builder.md (outdated)

```
@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
```
Suggested wording: `start-period` provides initialization time for containers that need time to bootstrap. Probe failure during ...
Design sounds good to me.

I think a readiness test is more flexible.

My 2 cents: The title of the PR says Grace period. I kinda like that term better than start period, e.g.

I don't have a preference between the two.

We were discussing this in the maintainers meeting, and think this needs more discussion (healthcheck vs readiness check).

Thanks so much @elifa !

Great! Thanks for all your help!
```
@@ -22,6 +22,7 @@ keywords: "API, Docker, rcli, REST, documentation"
* `POST /networks/create` now supports creating the ingress network, by specifying an `Ingress` boolean field. As of now this is supported only when using the overlay network driver.
* `GET /networks/(name)` now returns an `Ingress` field showing whether the network is the ingress one.
* `GET /networks/` now supports a `scope` filter to filter networks based on the network mode (`swarm`, `global`, or `local`).
* `POST /containers/create`, `POST /service/create` and `POST /services/(id or name)/update` now takes the field `StartPeriod` as a part of the `HealthConfig` allowing for specification of a period during which the container should not be considered unealthy even if health checks do not pass.
```
"unealthy" is a typo I think.
@ijc25 created #32523 for that
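For reference, a sketch of how the new field could be sent to `POST /containers/create` (durations are in nanoseconds like the other `HealthConfig` fields; the image and probe values are illustrative):

```
$ curl --unix-socket /var/run/docker.sock \
    -H 'Content-Type: application/json' \
    -d '{
          "Image": "nginx:alpine",
          "Healthcheck": {
            "Test": ["CMD-SHELL", "exit 1"],
            "Interval": 10000000000,
            "Timeout": 3000000000,
            "Retries": 3,
            "StartPeriod": 60000000000
          }
        }' \
    http://localhost/containers/create
```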
Hello gents, I'm not sure I understand the difference between the two flags, and I'm not sure why.

Cheers!
@pascalandy They're the same; this option is only used if the image / container you're running uses a health check. Say, the service is created with:

```
$ docker service create --name=health \
    --health-cmd='exit 1' \
    --health-interval=10s \
    --health-timeout=3s \
    --health-retries=3 \
    --health-start-period=60s \
    nginx:alpine
```
If you run the above example, you can follow what's happening. During the first 60 seconds, you can inspect the container, and see that the health check is failing, and failures are logged; $ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
"Status": "starting",
"FailingStreak": 0,
"Log": [
{
"Start": "2017-05-16T09:53:21.308489492Z",
"End": "2017-05-16T09:53:21.360249491Z",
"ExitCode": 1,
"Output": ""
}
]
}
```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}
```

However, even though it fails 3 times or more in a row, the `FailingStreak` stays at 0, and the status remains `starting`;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:41.415861379Z",
      "End": "2017-05-16T09:53:41.452461489Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:54:01.492147629Z",
      "End": "2017-05-16T09:54:01.533664526Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}
```

Once the container has been running for the duration of the start period (60 seconds here), failures do start counting towards the failing streak;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 1,
  "Log": [
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      ...
```

And when it reaches the maximum number of retries (3), the container is marked as `unhealthy`;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      ...
```

At that point, Swarm takes control; stops the container/task and starts a new one to replace it;

```
$ docker service ps health
ID            NAME         IMAGE         NODE          DESIRED STATE  CURRENT STATE            ERROR  PORTS
uqaf5bbiunnf  health.1     nginx:alpine  91d5d251ecc3  Running        Starting 20 seconds ago
be45d670f023  \_ health.1  nginx:alpine  91d5d251ecc3  Shutdown       Complete 25 seconds ago
```
To add to the above; the "health start period" (or "grace period", which is also used as a term), allows you to monitor a service's health during startup (by inspecting the container's state), without failing probes counting towards the maximum number of retries. Swarm mode takes the health-state into account when routing network traffic to the task. The startup period can take less than the specified amount (e.g. the database migration took less time than expected, whoop!), at which point the container/task will start to receive network requests.

To illustrate the above; a simple example; the health-check below simulates a long-running startup. The first 40 seconds (4 healthcheck intervals), the healthcheck returns "unhealthy". Because of the 60-second start period, the task is not marked unhealthy and is not restarted during that time;

```
$ docker service create --name=health \
    --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
    --health-interval=10s \
    --health-timeout=3s \
    --health-retries=3 \
    --health-start-period=60s \
    -p8080:80 \
    nginx:alpine
```

As long as the container is not "healthy", no traffic is routed to the task;

```
$ curl localhost:8080
curl: (7) Failed to connect to localhost port 8080: Connection refused
```

After 40 seconds, the container becomes healthy, and Swarm starts to route traffic to it;

```
$ curl localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
```
This is crystal clear now @thaJeztah. Thank you so much for this deep explanation :)
@pascalandy you're welcome! I took a bit of time to write it down, because I noticed that documentation around this was largely missing, so thought it would help as a starting point for that (I opened an issue in the documentation repository: docker/docs#3282). To come back to your initial comment;
Be aware that the deploy time is separate from the "start period"; the deploy time may include pulling the image before the task/container is started. This time is not part of the start period (which starts once the task/container is actually started).
Perfectly aware, but the pulling is already done.
@thaJeztah is this already integrated with Compose v3.X? Or should I open a specific issue to have it implemented?
@vide a quick glance at the docker-compose 3.3 schema tells me it's not implemented yet; https://github.com/docker/cli/blob/master/cli/compose/schema/data/config_schema_v3.3.json#L310-L326 Can you open an issue in the https://github.com/docker/cli/issues issue tracker? (update: this was implemented in docker/cli#475 (docker cli 17.09 and up))
@thaJeztah is it possible to change non-swarm docker instances to behave similarly to a swarm service? i.e. in your example you can't access port 8080 before the service becomes healthy, but if you do

```
docker run -d --name=health \
  --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  -p 8080:80 nginx:alpine \
  && sleep 1 && curl -s localhost:8080 | head -2 \
  && docker inspect health --format '{{.State.Health.Status}}'
```

you'll get something like

```
818af365e18fb77448ec99babab260289452e030ac9f2f974867ac93ff81cf31
<!DOCTYPE html>
<html>
starting
```

instead of

```
curl: (7) Failed to connect to localhost port 8080: Connection refused
```
@xdmitry not sure that would be possible easily 🤔 there's no reconciliation loop or monitor process on non-swarm containers. Is there a reason for not using a swarm service for your use-case?
The documentation does not make this clear. Is
I just tested here, with:

```yaml
services:
  example:
    image: ubuntu
    command:
      - sh
      - -c
      - |
        set -eux
        echo starting | tee /status
        sleep 30s
        echo success | tee /status
        sleep infinity
    healthcheck:
      test: grep success /status
      interval: 5s
      timeout: 30s
      retries: 3
      start_period: 30s
```

It works when running it. That's a little odd, because, for example, the Oracle DB container takes 10 minutes to start for the first time (because it needs to create all the DB files), while subsequent starts take less than 1 minute. Any suggestion?
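A quick way to watch that transition with the file above (a sketch, assuming the Compose v2 `docker compose` CLI and the `example` service name from the YAML):

```
$ docker compose up -d
# The status flips from "starting" to "healthy" once `grep success /status` passes:
$ docker inspect --format '{{.State.Health.Status}}' "$(docker compose ps -q example)"
```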
- What I did

Added the option `--start-period` to `HEALTHCHECK` in order to allow containers with startup times to have a health check configured based on their behaviour once started. The `--start-period` flag defines a period during which health check results are not counted towards the maximum number of retries configured by the `--retries` flag. However, if a health check succeeds during the grace period, any failures from there on will be counted towards the retries. The default is to use no start period, meaning no change to current behaviour.

Additionally the `run` flag `--health-start-period` has been added to the CLI to override the value from the build (see the sketch below).

The need for this has been discussed in #26498 and #26664 and this is my suggestion of how to solve the use cases discussed there.
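A sketch of that override (the `my-app` image is hypothetical and assumed to define a `HEALTHCHECK` of its own):

```
# The start period from the image's HEALTHCHECK is replaced at run time:
$ docker run -d --name app --health-start-period=60s my-app
$ docker inspect --format '{{json .Config.Healthcheck.StartPeriod}}' app
```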
- How I did it

Based on how long it has been since the container start (`Container.State.StartedAt`) and the given `--start-period`, it is determined whether `Container.State.Health.FailingStreak` is incremented or not.
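A minimal, runnable sketch of that decision (the function and parameter names are mine, not the PR's; the real logic lives in daemon/health.go):

```go
package main

import (
	"fmt"
	"time"
)

// shouldIncrementStreak ignores probe failures inside the start period,
// unless a probe has already succeeded since the container started.
func shouldIncrementStreak(startedAt time.Time, startPeriod time.Duration, succeededOnce bool) bool {
	inStartPeriod := time.Since(startedAt) < startPeriod
	return succeededOnce || !inStartPeriod
}

func main() {
	startedAt := time.Now().Add(-30 * time.Second) // container started 30s ago

	fmt.Println(shouldIncrementStreak(startedAt, 60*time.Second, false)) // false: still in start period
	fmt.Println(shouldIncrementStreak(startedAt, 60*time.Second, true))  // true: a success ended the grace period
	fmt.Println(shouldIncrementStreak(startedAt, 10*time.Second, false)) // true: start period already elapsed
}
```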
- How to verify it

A test has been added to verify the new functionality.
- Description for the changelog

Add `--start-period` flag to `HEALTHCHECK` and `run` override flag `--health-start-period` in order to enable health checks for containers with an initial startup time.

Signed-off-by: Elias Faxö <elias.faxo@gmail.com>