Grace period option to health checks. #28938
Conversation
I don't really care for the name, but otherwise SGTM.

@cpuguy83 I agree, the name could be better. I'll change it to
(force-pushed from 5c06a0d to 8f354c6)
docs/reference/builder.md (outdated)

```
@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
counted towards the maximum number of retires. However, if a health check succeeds
```
nit: typo s/retires/retries/
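For context, a minimal sketch of the Dockerfile option this passage documents (the base image and probe command are my own illustration, not taken from the PR):

```dockerfile
FROM nginx:alpine

# Failing probes during the first 60s are not counted towards the
# 3 retries; a successful probe during that window ends the grace.
HEALTHCHECK --start-period=60s --interval=10s --timeout=3s --retries=3 \
  CMD wget -q --spider http://localhost/ || exit 1
```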
I can see this being useful for services that have a long startup period, so sgtm. We can bike-shed over naming during review.

ping @aaronlehmann could you have a look if this would work for SwarmKit?

It would need to be plumbed through SwarmKit. This would involve a PR to SwarmKit to add the new flag to the protobuf definitions, and changes in Docker to add the flag to service create/update, and pass it through to the container in the executor.

@aaronlehmann Do you want me to add a PR to SwarmKit and add mapping for service to this PR? I was a bit uncertain whether you wanted this to be merged before adding it to SwarmKit, as it has a dependency on the types updated in this PR.

Not sure I'm the best one to answer that question. I think if maintainers are happy with the design of this PR, the next step would be to open a SwarmKit PR to add support there. But I don't want to get ahead of the design review.
(force-pushed from 65e453f to 14d0d4f)
What do you think about a readiness healthcheck/probe in addition to this one? If the readiness healthcheck/probe fails, the container would not yet be considered ready (related to #26664 (comment)), because operations like a database upgrade can take an unpredictable amount of time at startup.

PS: And maybe a different restart policy to apply between that state and unhealthy.
/cc @dongluochen
daemon/health.go (outdated)

```go
if shouldIncrementStreak {
	h.FailingStreak++
}
if h.FailingStreak >= retries {
```
State change should only happen when `FailingStreak` increases. It's better to move the `if h.FailingStreak >= retries` block into `if shouldIncrementStreak`.
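A runnable sketch of that suggestion (the type and the status strings are simplified stand-ins for the ones in daemon/health.go):

```go
package main

import "fmt"

// Health is a simplified stand-in for the container health state.
type Health struct {
	Status        string
	FailingStreak int
}

// applyProbeResult checks the retry threshold only when the streak
// actually increased, so probes ignored during the start period can
// never flip the status to unhealthy.
func applyProbeResult(h *Health, shouldIncrementStreak bool, retries int) {
	if shouldIncrementStreak {
		h.FailingStreak++
		if h.FailingStreak >= retries {
			h.Status = "unhealthy"
		}
	}
}

func main() {
	h := &Health{Status: "starting"}
	for i := 0; i < 3; i++ {
		applyProbeResult(h, false, 3) // failures during the start period: ignored
	}
	fmt.Println(h.Status, h.FailingStreak) // starting 0
	for i := 0; i < 3; i++ {
		applyProbeResult(h, true, 3) // failures after the start period: counted
	}
	fmt.Println(h.Status, h.FailingStreak) // unhealthy 3
}
```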
docs/reference/builder.md (outdated)

```
@@ -1561,6 +1562,11 @@ is considered to have failed.
It takes **retries** consecutive failures of the health check for the container
to be considered `unhealthy`.

Given a **start period** the tries performed during that period will not be
```
Suggested wording: `start-period` provides initialization time for containers that need time to bootstrap. Probe failure during ...
Design sounds good to me.

I think a readiness test is more flexible.

My 2 cents: The title of the PR says Grace period. I kinda like that term better than start period, e.g.

I don't have a preference between the two.

We were discussing this in the maintainers meeting, and think this needs more discussion (healthcheck vs readiness check).

Thanks so much @elifa !

Great! Thanks for all your help!
```
@@ -22,6 +22,7 @@ keywords: "API, Docker, rcli, REST, documentation"
* `POST /networks/create` now supports creating the ingress network, by specifying an `Ingress` boolean field. As of now this is supported only when using the overlay network driver.
* `GET /networks/(name)` now returns an `Ingress` field showing whether the network is the ingress one.
* `GET /networks/` now supports a `scope` filter to filter networks based on the network mode (`swarm`, `global`, or `local`).
* `POST /containers/create`, `POST /service/create` and `POST /services/(id or name)/update` now takes the field `StartPeriod` as a part of the `HealthConfig` allowing for specification of a period during which the container should not be considered unealthy even if health checks do not pass.
```
"unealthy" is a typo I think.
@ijc25 created #32523 for that
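For reference, a sketch of how the new field could be sent to `POST /containers/create` (durations are in nanoseconds like the other `HealthConfig` fields; the image and probe values are illustrative):

```
$ curl --unix-socket /var/run/docker.sock \
    -H 'Content-Type: application/json' \
    -d '{
          "Image": "nginx:alpine",
          "Healthcheck": {
            "Test": ["CMD-SHELL", "exit 1"],
            "Interval": 10000000000,
            "Timeout": 3000000000,
            "Retries": 3,
            "StartPeriod": 60000000000
          }
        }' \
    http://localhost/containers/create
```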
Hello gents, I'm not sure I understand the difference between the two flags, and I'm not sure why.

Cheers!
@pascalandy They're the same; this option is only used if the image / container you're running uses a health check. Say, the service is created with:

```
$ docker service create --name=health \
    --health-cmd='exit 1' \
    --health-interval=10s \
    --health-timeout=3s \
    --health-retries=3 \
    --health-start-period=60s \
    nginx:alpine
```
If you run the above example, you can follow what's happening. During the first 60 seconds, you can inspect the container, and see that the health check is failing, and failures are logged; $ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
"Status": "starting",
"FailingStreak": 0,
"Log": [
{
"Start": "2017-05-16T09:53:21.308489492Z",
"End": "2017-05-16T09:53:21.360249491Z",
"ExitCode": 1,
"Output": ""
}
]
}
```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}
```

However, even though it fails 3 times or more in a row, the `FailingStreak` stays at 0, and the status remains `starting`;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-05-16T09:53:21.308489492Z",
      "End": "2017-05-16T09:53:21.360249491Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:41.415861379Z",
      "End": "2017-05-16T09:53:41.452461489Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2017-05-16T09:54:01.492147629Z",
      "End": "2017-05-16T09:54:01.533664526Z",
      "ExitCode": 1,
      "Output": ""
    }
  ]
}
```

Once the container has been running for the duration of the start period (60 seconds here), failures do start counting towards the failing streak;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "starting",
  "FailingStreak": 1,
  "Log": [
    {
      "Start": "2017-05-16T09:53:31.361873326Z",
      "End": "2017-05-16T09:53:31.415394581Z",
      ...
```

And when it reaches the maximum number of retries (3), the container is marked as `unhealthy`;

```
$ docker inspect --format '{{json .State.Health }}' be45d670f023 | jq .
{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2017-05-16T09:53:51.453103536Z",
      "End": "2017-05-16T09:53:51.490980125Z",
      ...
```

At that point, Swarm takes control; stops the container/task and starts a new one to replace it;

```
$ docker service ps health
ID            NAME         IMAGE         NODE          DESIRED STATE  CURRENT STATE            ERROR  PORTS
uqaf5bbiunnf  health.1     nginx:alpine  91d5d251ecc3  Running        Starting 20 seconds ago
be45d670f023  \_ health.1  nginx:alpine  91d5d251ecc3  Shutdown       Complete 25 seconds ago
```
To add to the above; the "health start period" (or "grace period", which is also used as a term), allows you to monitor a service's health during startup (by inspecting the container's state), without failing probes counting towards the maximum number of retries. Swarm mode takes the health-state into account when routing network traffic to the task. The startup period can take less than the specified amount (e.g. the database migration took less time than expected, whoop!), at which point the container/task will start to receive network requests.

To illustrate the above; a simple example; the health-check below simulates a long-running startup. The first 40 seconds (4 healthcheck intervals), the healthcheck returns "unhealthy". Because of the 60-second start period, the task is not marked unhealthy and is not restarted during that time;

```
$ docker service create --name=health \
    --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
    --health-interval=10s \
    --health-timeout=3s \
    --health-retries=3 \
    --health-start-period=60s \
    -p8080:80 \
    nginx:alpine
```

As long as the container is not "healthy", no traffic is routed to the task;

```
$ curl localhost:8080
curl: (7) Failed to connect to localhost port 8080: Connection refused
```

After 40 seconds, the container becomes healthy, and Swarm starts to route traffic to it;

```
$ curl localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
```
This is crystal clear now @thaJeztah. Thank you so much for this deep explanation :)
@pascalandy you're welcome! I took a bit of time to write it down, because I noticed that documentation around this was largely missing, so thought it would help as a starting point for that (I opened an issue in the documentation repository: docker/docs#3282). To come back to your initial comment;
Be aware that the deploy time is separate from the "start period"; the deploy time may include pulling the image before the task/container is started. This time is not part of the start period (which starts once the task/container is actually started).
Perfectly aware, but the pulling is already done.
@thaJeztah is this already integrated with Compose v3.X? Or should I open a specific issue to have it implemented?
@vide a quick glance at the docker-compose 3.3 schema tells me it's not implemented yet; https://github.com/docker/cli/blob/master/cli/compose/schema/data/config_schema_v3.3.json#L310-L326 Can you open an issue in the https://github.com/docker/cli/issues issue tracker? (update: this was implemented in docker/cli#475 (docker cli 17.09 and up))
@thaJeztah is it possible to change non-swarm docker instances to behave similarly to a swarm service? i.e. in your example you can't access port 8080 before the service becomes healthy, but if you do

```
docker run -d --name=health \
  --health-cmd='if [ ! -f "/count" ] ; then ctr=0; else ctr=`cat /count`; fi; ctr=`expr ${ctr} + 1`; echo "${ctr}" > /count; if [ "$ctr" -gt 4 ] ; then exit 0; else exit 1; fi' \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
  --health-start-period=60s \
  -p 8080:80 nginx:alpine \
  && sleep 1 && curl -s localhost:8080 | head -2 \
  && docker inspect health --format '{{.State.Health.Status}}'
```

you'll get something like

```
818af365e18fb77448ec99babab260289452e030ac9f2f974867ac93ff81cf31
<!DOCTYPE html>
<html>
starting
```

instead of

```
curl: (7) Failed to connect to localhost port 8080: Connection refused
```
@xdmitry not sure that would be possible easily 🤔 there's no reconciliation loop or monitor process on non-swarm containers. Is there a reason for not using a swarm service for your use-case?
The documentation does not make this clear. Is
I just tested here, with:

```yaml
services:
  example:
    image: ubuntu
    command:
      - sh
      - -c
      - |
        set -eux
        echo starting | tee /status
        sleep 30s
        echo success | tee /status
        sleep infinity
    healthcheck:
      test: grep success /status
      interval: 5s
      timeout: 30s
      retries: 3
      start_period: 30s
```

It works when running it. That's a little odd, because, for example, the Oracle DB container takes 10 minutes to start for the first time (because it needs to create all the DB files), while subsequent starts take less than 1 minute. Any suggestion?
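A quick way to watch that transition with the file above (a sketch, assuming the Compose v2 `docker compose` CLI and the `example` service name from the YAML):

```
$ docker compose up -d
# The status flips from "starting" to "healthy" once `grep success /status` passes:
$ docker inspect --format '{{.State.Health.Status}}' "$(docker compose ps -q example)"
```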
- What I did

Added the option `--start-period` to `HEALTHCHECK` in order to allow containers with startup times to have a health check configured based on their behaviour once started. The `--start-period` flag defines a period during which health check results are not counted towards the maximum number of retries configured by the `--retries` flag. However, if a health check succeeds during the grace period, any failures from there on will be counted towards the retries. The default is to use no start period, meaning no change to current behaviour.

Additionally the `run` flag `--health-start-period` has been added to the CLI to override the value from the build (see the sketch below).

The need for this has been discussed in #26498 and #26664 and this is my suggestion of how to solve the use cases discussed there.
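A sketch of that override (the `my-app` image is hypothetical and assumed to define a `HEALTHCHECK` of its own):

```
# The start period from the image's HEALTHCHECK is replaced at run time:
$ docker run -d --name app --health-start-period=60s my-app
$ docker inspect --format '{{json .Config.Healthcheck.StartPeriod}}' app
```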
- How I did it

Based on how long it has been since the container start (`Container.State.StartedAt`) and the given `--start-period`, it is determined whether `Container.State.Health.FailingStreak` is incremented or not.
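A minimal, runnable sketch of that decision (the function and parameter names are mine, not the PR's; the real logic lives in daemon/health.go):

```go
package main

import (
	"fmt"
	"time"
)

// shouldIncrementStreak ignores probe failures inside the start period,
// unless a probe has already succeeded since the container started.
func shouldIncrementStreak(startedAt time.Time, startPeriod time.Duration, succeededOnce bool) bool {
	inStartPeriod := time.Since(startedAt) < startPeriod
	return succeededOnce || !inStartPeriod
}

func main() {
	startedAt := time.Now().Add(-30 * time.Second) // container started 30s ago

	fmt.Println(shouldIncrementStreak(startedAt, 60*time.Second, false)) // false: still in start period
	fmt.Println(shouldIncrementStreak(startedAt, 60*time.Second, true))  // true: a success ended the grace period
	fmt.Println(shouldIncrementStreak(startedAt, 10*time.Second, false)) // true: start period already elapsed
}
```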
- How to verify it

A test has been added to verify the new functionality.
- Description for the changelog

Add `--start-period` flag to `HEALTHCHECK` and `run` override flag `--health-start-period` in order to enable health checks for containers with an initial startup time.

Signed-off-by: Elias Faxö <elias.faxo@gmail.com>