Grace period option to health checks. by elifa · Pull Request #28938 · moby/moby · GitHub

Grace period option to health checks. #28938


Merged
merged 1 commit into moby:master on Apr 6, 2017

Conversation

@elifa commented Nov 29, 2016

- What I did
Added the option --start-period to HEALTHCHECK to allow containers with long startup times to have a health check configured based on their behaviour once started.

The --start-period flag defines a period during which health check results are not counted towards the maximum number of retries configured by the --retries flag. However, if a health check succeeds during the grace period, any failures from then on will be counted towards the retries.

The default is to use no start period, meaning no change to current behaviour.

Additionally, the run flag --health-start-period has been added to the CLI to override the value from the build (see the sketch below).

The need for this has been discussed in #26498 and #26664 and this is my suggestion of how to solve the use cases discussed there.
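
For illustration, a minimal sketch of how the two options fit together; the health command and image name are made up for this example and are not part of the PR:

cat > Dockerfile <<'EOF'
FROM nginx:alpine
# Bake a 30s start period into the image: check failures during this window do not count towards retries.
HEALTHCHECK --start-period=30s --interval=10s --timeout=3s --retries=3 \
  CMD wget -q -O /dev/null http://localhost/ || exit 1
EOF
docker build -t start-period-demo .

# Override the baked-in 30s start period with 90s at run time:
docker run -d --health-start-period=90s start-period-demo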

- How I did it
Based on how long it has been since the container start (Container.State.StartedAt) and the given --start-period, it is determined whether Container.State.Health.FailingStreak is incremented or not.

- How to verify it
A test has been added for verification of the new functionality.
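
For a quick manual check of the behaviour (a sketch only, not the added integration test; the always-failing health command is deliberately contrived):

docker run -d --name=startperiod-demo \
  --health-cmd='exit 1' \
  --health-interval=5s \
  --health-retries=3 \
  --health-start-period=60s \
  nginx:alpine

# During the first 60 seconds the checks fail, but the failing streak stays at 0:
docker inspect --format '{{.State.Health.Status}} {{.State.Health.FailingStreak}}' startperiod-demo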

- Description for the changelog

Add --start-period flag to HEALTHCHECK and the run override flag --health-start-period in order to enable health checks for containers with an initial startup time.

Signed-off-by: Elias Faxö elias.faxo@gmail.com

@cpuguy83
Member

I don't really care for the name, but otherwise SGTM.
Maybe start-period to make it clear this is to give time during container startup, not once it's already started.

@elifa
Author
elifa commented Nov 29, 2016

@cpuguy83 I agree, the name could be better. I'll change it to start-period instead and edit the PR.

@elifa force-pushed the master branch 2 times, most recently from 5c06a0d to 8f354c6 on November 29, 2016 15:56
@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
counted towards the maximum number of retires. However, if a health check succeeds
Member

nit: typo s/retires/retries/

@thaJeztah
Member

I can see this being useful for services that have a long startup period, so sgtm. We can bike-shed over naming during review

@thaJeztah
Member

ping @aaronlehmann could you have a look at whether this would work for SwarmKit?

@aaronlehmann
Contributor

It would need to be plumbed through SwarmKit. This would involve a PR to SwarmKit to add the new flag to the protobuf definitions, and changes in Docker to add the flag to service create/update, and pass it through to the container in the executor.

@elifa
Author
elifa commented Dec 19, 2016

@aaronlehmann Do you want me to open a PR to SwarmKit and add the service mapping to this PR? I was a bit uncertain whether you wanted this to be merged before adding it to SwarmKit, as it has a dependency on the types updated in this PR.

@aaronlehmann
Contributor

Not sure I'm the best one to answer that question. I think if maintainers are happy with the design of this PR, the next step would be to open a SwarmKit PR to add support there. But I don't want to get ahead of the design review.

@kakawait
kakawait commented Jan 23, 2017

What do you think about a readiness healthcheck/probe in addition to --health-start-period in order to optimize start-period duration?

If the readiness healthcheck/probe returns OK before --health-start-period expires, we can change the state (to unhealthy, then to healthy when the regular healthcheck is triggered). However, --health-start-period must be the maximum time: even if the readiness healthcheck/probe always returns NOT OK, it acts as a timeout.

related to #26664 (comment)

I used to write Spring Boot applications with http://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/system/EmbeddedServerPortFileWriter.html which simply writes a file with the port number when the servlet container is ready to listen for requests. I would like a way to express something like:

my container is not ready (!= healthy) until that file is present and contains a valid value,

because operations like database upgrades (Flyway) are performed before the servlet container starts listening.


PS: And maybe a different restart policy could apply between that state and unhealthy.
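
For illustration, a file-based check along these lines can already be approximated with the existing health-check flags; the image name and file path below are hypothetical:

docker run -d \
  --health-cmd='test -s /app/application.port' \
  --health-interval=5s \
  --health-retries=3 \
  --health-start-period=120s \
  my-spring-boot-app

# Failures are only counted after the 120s start period, and the check passes
# once the application has written a non-empty port file.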

@aluzzardi
Member

/cc @dongluochen

daemon/health.go Outdated

if shouldIncrementStreak {
h.FailingStreak++
}
if h.FailingStreak >= retries {
Contributor

State change should only happen when FailingStreak increases. It's better to move the `if h.FailingStreak >= retries` block into `if shouldIncrementStreak`.

@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
Contributor

start-period provides initialization time for containers that need time to bootstrap. Probe failure during ...

@dongluochen
Contributor

Design sounds good to me.

@krasi-georgiev
Contributor

I think a readiness test is more flexible.

@aluzzardi
Member

My 2 cents: The title of the PR says Grace period. I kinda like that term better than start period, e.g. --healthcheck-grace-period.

@dongluochen ?

@dongluochen
Contributor

I don't have a preference between grace-period and start-period.

@thaJeztah
Member

We were discussing this in the maintainers meeting, and think this needs more discussion (healthcheck vs readiness check)

@thaJeztah thaJeztah merged commit c4010e2 into moby:master Apr 6, 2017
@thaJeztah
Member

Thanks so much @elifa !

@elifa
Author
elifa commented Apr 7, 2017

Great! Thanks for all your help!
I have created a PR against SwarmKit (moby/swarmkit#2103) finalizing the feature.

@@ -22,6 +22,7 @@ keywords: "API, Docker, rcli, REST, documentation"
* `POST /networks/create` now supports creating the ingress network, by specifying an `Ingress` boolean field. As of now this is supported only when using the overlay network driver.
* `GET /networks/(name)` now returns an `Ingress` field showing whether the network is the ingress one.
* `GET /networks/` now supports a `scope` filter to filter networks based on the network mode (`swarm`, `global`, or `local`).
* `POST /containers/create`, `POST /service/create` and `POST /services/(id or name)/update` now takes the field `StartPeriod` as a part of the `HealthConfig` allowing for specification of a period during which the container should not be considered unealthy even if health checks do not pass.
Contributor

"unealthy" is a typo I think.

Member

@ijc25 created #32523 for that

dnephin pushed a commit to dnephin/docker that referenced this pull request Apr 17, 2017
Grace period option to health checks.
@pascalandy
pascalandy commented May 16, 2017

Hello gents,
I'm so glad to see this appearing. Since adding the healthcheck, my services take 30s to 180s to deploy.

I'm not sure I understand the difference between the two flags.
I don't want any healthcheck for the first 90 seconds. I guess I only need to use:

--health-start-period "90s"

I'm not sure why --start-period "90s" exists. Can you enlighten me?
Here is my core setup to start nginx:

docker service create \
	--name "$CTN_nginx_app" \
	--hostname "$CTN_nginx_app" \
	--network "$NTW_FRONT" \
	--replicas "1" \
	--reserve-memory "12M" \
	--limit-memory "20M" \
	--constraint node.labels.apps_accepted=="yes" \
	--mount	type="bind",src="$WWW_SRC_NGINX",target="$WWW_DST_NGINX" \
	--restart-condition "any" \
	--restart-max-attempts "55" \
	--update-delay "5s" \
	--update-parallelism "1" \
	--start-period "90s" \
	--health-start-period "90s" \
nginx:alpine

Cheers!
Pascal

@thaJeztah
Member
thaJeztah commented May 16, 2017

@pascalandy They're the same; the --start-period is not a command-line option, but an option to the HEALTHCHECK Dockerfile instruction to define the start-period as part of the image, whereas --health-start-period is a command-line option to set/override that period at runtime.

This option is only used if the image / container you're running uses a health check (your example uses the nginx:alpine image, which does not have a health check defined). When using the --health-start-period, it works roughly like this;

Say, the service is created with

$ docker service create --name=health \
  --health-cmd='exit 1' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  nginx:alpine

1. The container (task) is created and started
2. Every 10 seconds the command exit 1 is executed to check if the container is healthy
3. If the health-cmd, exit 1, fails (hint: that's always 😄), but the container has not been running for more than 60 seconds (the start period), don't count the failure, and let the container run as usual
4. Similarly, if the health-cmd takes longer than 3 seconds (--health-timeout) to complete, don't count the failure, and let the container run as usual
5. After 60 seconds, start tracking failures. If exit 1 fails (or takes longer to complete than 3 seconds) 3 times in a row, mark the container/task as "unhealthy", stop it, and start a new task to replace it.

If you run the above example, you can follow what's happening. During the first 60 seconds, you can inspect the container, and see that the health check is failing, and failures are logged;

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

However, even though it fails 3 times or more in a row, the FailingStreak remains 0, and Status remains "starting" (because we're still in the "start period", and the container hasn't reported as "healthy" yet);

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:41.415861379Z",
      "End": "2017-05-16T09:53:41.452461489Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:54:01.492147629Z",
      "End": "2017-05-16T09:54:01.533664526Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}

Once the container has been running for 60 seconds, docker starts to track failures (FailingStreak is incremented with each consecutive failure);

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 1,
  "Log": [
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
...

And when it reaches --health-retries (3), the container/task is marked as unhealthy;

$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
...

At that point, Swarm takes control; stops the container/task and starts a new one to replace it;

$ docker service ps health
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                 ERROR               PORTS
uqaf5bbiunnf        health.1            nginx:alpine        91d5d251ecc3        Running             Starting 20 seconds ago
be45d670f023         \_ health.1        nginx:alpine        91d5d251ecc3        Shutdown            Complete 25 seconds ago

@thaJeztah
Member

To add to the above: the "health start period" (or "grace period", as it is also called) allows you to monitor a service's health (by inspecting the .State.Health of the container) during the startup period. This can be helpful for services that need to perform certain tasks the first time they are started (think of a database migration), where you still want to keep track of whether the migration is running, and whether the health checks are running (you can log messages as part of the health check, which show up in .State.Health in the container inspect).

Swarm mode takes the health-state into account when routing network traffic to the task. The startup period can take less than the specified amount (e.g. the database migration took less time than expected, whoop!), at which point the container/task will start to receive network requests.

To illustrate the above with a simple example: the health check below simulates a long-running startup. For the first 40 seconds (4 healthcheck intervals), the health check fails. Because of the health-start-period the container is not terminated, but the health check is still performed every 10 seconds;

$ docker service create --name=health \
  --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  -p8080:80 \
  nginx:alpine

As long as the container is not "healthy", no traffic is routed to the task;

$ curl localhost:8080
curl: (7) Failed to connect to localhost port 8080: Connection refused

After 40 seconds, the container becomes healthy, and Swarm starts to route traffic to it;

$ curl localhost:8080

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {

@pascalandy

This is crystal clear now @thaJeztah. Thank you so much for this deep explanation :)

@thaJeztah
Member

@pascalandy you're welcome! I took a bit of time to write it down, because I noticed that documentation around this was largely missing, so thought it would help as a starting point for that (I opened an issue in the documentation repository: docker/docs#3282).

To come back to your initial comment;

my services take 30s to 180s to deploy

Be aware that the deploy time is separate from the "start period"; the deploy time may include pulling the image before the task/container is started. This time is not part of the start period (which starts once the task/container is actually started).

@pascalandy

the deploy time may include pulling the image before the task/container is started

Perfectly aware, but the pulling is already done.

@vide
vide commented May 22, 2017

@thaJeztah is this already integrated with Compose v3.X? Or should I open a specific issue to have it implemented?

@thaJeztah
Member
thaJeztah commented May 22, 2017

@vide a quick glance at the docker-compose 3.3 schema tells me it's not implemented yet; https://github.com/docker/cli/blob/master/cli/compose/schema/data/config_schema_v3.3.json#L310-L326

Can you open an issue in the https://github.com/docker/cli/issues issue tracker?

(update: this was implemented in docker/cli#475 (docker cli 17.09 and up))

@xdmitry
xdmitry commented Oct 3, 2019

@thaJeztah is it possible to change non-swarm docker instances to behave similarly to a swarm service?

i.e. in your example you can't access port 8080 before the service becomes healthy

but if you do

docker run -d --name=health \
  --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
   --health-start-period=60s \
  -p 8080:80   nginx:alpine \
&& sleep 1 && curl -s localhost:8080 | head -2 \
&& docker inspect health --format '{{.State.Health.Status}}'

you'll get something like

818af365e18fb77448ec99babab260289452e030ac9f2f974867ac93ff81cf31
<!DOCTYPE html>
<html>
starting

instead of

curl: (7) Failed to connect to localhost port 8080: Connection refused

@thaJeztah
Member

@xdmitry not sure that would be possible easily 🤔 there's no reconciliation loop or monitor process on non-swarm containers. Is there a reason for not using a swarm service for your use-case?
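
For completeness, a common client-side workaround is to poll the container's health state before sending traffic; this is only a sketch and does not replicate Swarm's routing behaviour (the container name and port match the example above):

# Wait until the container reports healthy, then send a request:
until [ "$(docker inspect --format '{{.State.Health.Status}}' health)" = "healthy" ]; do
  sleep 2
done
curl -s localhost:8080 | head -2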

@felipecrs
felipecrs commented Sep 15, 2022

The documentation does not make this clear. Is --start-period only valid for the first initialization of a container, or is it valid even when the container is restarted?

@felipecrs
felipecrs commented Sep 15, 2022

I just tested here, with docker-compose up --wait example and the following docker-compose.yaml:

services:
  example:
    image: ubuntu
    command:
      - sh
      - -c
      - |
        set -eux
        echo starting | tee /status
        sleep 30s
        echo success | tee /status
        sleep infinity
    healthcheck:
      test: grep success /status
      interval: 5s
      timeout: 30s
      retries: 3
      start_period: 30s

It works when running docker-compose up --wait example for the first time, and also works when running docker-compose restart example && docker-compose up --wait example (while it should not if start_period was only taken into account for the first initialization).

That's a little odd because, for example, the Oracle DB container takes 10 minutes to start for the first time (it needs to create all the DB files), while subsequent starts take less than 1 minute.

Any suggestion?
