Proposal - Application-defined "alive probe" #21142
Comments
(continuing from #21143 (comment) -- worth reiterating that I'm definitely +1 to having "how to probe XYZ container for healthiness" as a bit of image metadata)

I'm a big +1 to the idea here -- it's very common to have a "retry" loop for dependent services (or something complex involving

As for the implementation, I'm a little bit concerned that we might be pigeonholing ourselves by only accepting URLs for "what to probe" -- would we plan to just use custom schemes like

I'm also curious about what this probing would/could be used for in the engine itself -- the proposal touches on a few potential use cases (automatic restarting of "unhealthy" containers, for example), but I don't know whether that's being left intentionally vague so that we can discuss "health status" for containers first (ie, what that means and how to calculate/gather said status) and then discuss how it would interact with other features, or if it was just an oversight and there's already a set interaction in mind. 😄

(I'd reiterate that I definitely see value in a "health" status that's separate from anything the engine is doing to the container, so I'd love to see those features be orthogonally defined -- for example, something like a restart policy of
Huge +1 on this. A few things:
I think the probe should be active immediately after container start. The grace period might be one minute for instance, but the container may be "alive" within a few seconds and we don't want to wait that long to report it as alive. What the grace period could mean is the delay after container start during which we do not increase the failure count (see the sketch below this comment). Example:
That's perhaps too aggressive: 3 might be more sensible
We played with URIs for discovery and I'm not a fan anymore. It's fragile, confusing and hard to customize (e.g. for HTTP you might want to specify a custom response code while for a script an exit code). I'd suggest having something like
IMO, the probe should be part of the container lifecycle and This implies:
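To make the grace-period point above concrete, here is a minimal shell sketch of that scheduling, assuming an HTTP probe against http://localhost:8080/health and made-up values standing in for `--probe-interval`, `--probe-retry` and `--probe-grace`; the real logic would live in the engine:

```sh
#!/bin/sh
# Hypothetical sketch of the scheduling described above -- not engine code.
INTERVAL=10   # stand-in for --probe-interval (seconds between probes)
RETRIES=3     # stand-in for --probe-retry (consecutive failures before "unhealthy")
GRACE=60      # stand-in for --probe-grace (failures in this window are not counted)

run_probe() {
  # Placeholder probe: any 2xx/3xx response from the assumed endpoint counts as success.
  curl -fsS -o /dev/null http://localhost:8080/health
}

start=$(date +%s)
failures=0
while :; do
  if run_probe; then
    # A success can be reported immediately, even during the grace period.
    failures=0
  elif [ $(( $(date +%s) - start )) -ge "$GRACE" ]; then
    failures=$((failures + 1))   # failures only count once the grace period is over
  fi
  if [ "$failures" -ge "$RETRIES" ]; then
    echo "unhealthy"
    break
  fi
  sleep "$INTERVAL"
done
```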
Channeling @crosbymichael: health checking on the same system that's running the service isn't a great healthcheck. What if there was cluster-wide knowledge of health checks that any host can perform? Checks could be backed by a driver interface, and pluggable with built-ins for simple TCP and HTTP+URL checks.
@cpuguy83 I had similar thoughts, but then I assumed, perhaps incorrectly, that most of this was to be supported by the Swarm manager and therefore wouldn't be on the same system. It's only in smaller dev/test environments that doing the check on the same system would be OK. But overall I do like the idea of adding support at some level; I just would like to see the complete picture first. For example, putting it on the "docker run" command is interesting in limited cases, but my first thought would be that it should be in a compose.yml file. There are probably a few options, and it would be good to discuss the long-term vision before we start down any particular path.
+1 on this or some sort of HEALTHCHECK instruction #21143 #7445
@cpuguy83 The health check from the Engine is not meant to be complete, but it provides value. A container failing the Engine health check shows the container is not functioning as expected. This feedback is useful for a few scenarios as described above. For example, users can stop a rolling update with this feedback to investigate. A container passing the health check may still not be reachable from the outside. Orchestration tools can add external monitoring. Combining the results from the Engine health check and external monitoring would help failure diagnosis.
@cpuguy83 I think the name healthcheck is misleading. The goal of this system is to guarantee that the container is actually running, not that it's end-to-end reachable from the outside like a load balancer health check would do - that is out of scope.
+1 for this as well. I agree it's not a full solution for checking the health of the service but a good initial data point. I also think the name is misleading although I'm not coming up with anything better atm.
+1 to @aluzzardi's idea to change to

But I don't think
I think we should consider calling it

I believe the main objection to "health check" is that this is not a true check of health at the application level. For instance, it doesn't ensure the container is reachable, it relies on the Engine to essentially check itself which is not robust against Engine crashes, etc. I agree with all those points.

However, it seems to me that this feature is a true container-level health check: it asserts that the process inside the container is running in a manner consistent with the expectations of the image's author. Because it's being specified in the context of a

(Also, I would point out that this container-level health check doesn't preclude the creation of a higher level health check within the Swarm manager.)

Are there other downsides to the name "health check" that I'm missing? What about a more accurately scoped name like
Agree with @dongluochen that running the health check inside the container being monitored is suboptimal in many cases. If the container is wedged, it may not be able to update a file to indicate it is unhealthy. Checking that a URL returns 200 may be fine if you just want to ensure a webserver is responding, but for complicated logic there should also be an ability to call a separate process outside the container being monitored. Developers can provide a separate healthcheck container for their service which does any arbitrary test that is needed, and we can examine the exit code to see if it passed or not.
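A rough sketch of that separate-container pattern with the existing CLI, assuming the monitored service runs in a container named `web` and the check ships in a hypothetical `myapp-healthcheck` image whose entrypoint exits non-zero on failure:

```sh
# Run the (hypothetical) healthcheck image inside the service container's
# network namespace so it can probe the service on localhost, then act on
# its exit code.
docker run --rm --net=container:web myapp-healthcheck
status=$?
if [ "$status" -eq 0 ]; then
  echo "web: healthy"
else
  echo "web: unhealthy (healthcheck exited with $status)"
fi
```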
This could finally allow docker compose to start dependent services in order. Also, to me "health check" means a very limited-scope test: basically, is this service ready to accept requests and does it reply correctly to a request that doesn't involve any downstream services. "Smoke tests" can be used for end-to-end testing.
On the design

I think the term probe is confusing and it should just be called a health check. The system that reads the health checks is responsible for dealing with cases where the container is incapable of performing the check. Any check that fails to run is equivalent to a check that runs and reports a failure.

I think seconds is the wrong unit for some of these. It should be milliseconds, especially for timeout.

On open questions

If it were added as a container state, changes in state should be reflected in the events feed, and a new restart policy

If it is not a change in container state, it shouldn't be part of the event stream, and shouldn't impact restart policies at all. I think it would be good to make it a new state, but that it's not absolutely necessary for V1.
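On the events point: in the implementation that eventually landed, health transitions do show up in the events stream and can be watched from the CLI (assuming an engine recent enough to include the health-check support):

```sh
# Stream health transitions such as "health_status: healthy" /
# "health_status: unhealthy" for all containers on this host.
docker events --filter 'type=container' --filter 'event=health_status'
```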
I think it is bad practice to encourage ready checks for sequencing container starts. This should be handled at the application level... e.g. connect-to-db->fail->loop(connect-to-db) (a sketch of this pattern follows below).
Makes sense, like an event broadcasting that the container is ready.
Kind of ick to put Docker in this category.
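The application-level retry mentioned above might look like the following wrapper script, assuming an `nc` that supports `-z`, a hypothetical dependency reachable at `db:5432`, and `myapp` standing in for the real service binary:

```sh
#!/bin/sh
# Keep retrying the dependency until it answers on its TCP port,
# then hand over to the real process.
until nc -z db 5432; do
  echo "waiting for db:5432..."
  sleep 1
done
exec myapp
```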
How is it different from load balancing?
In general, I am a huge fan of this idea. Providing secondary checks for a process to indicate its liveness will only help to inform the docker engine. The great thing about this concept is that it can be complementary to an external health checking system. Coupled with a plugin system, it can operate in concert with a larger system or simply as a local liveness check. Through the event API, it can be joined with remote data to inform service discovery. Remote health checks are still required to check for service access, but this will cover an important gap at the local level. I do hate the name "probe". A probe implies the measurement of a remote value, such as a voltage or where the rebel scum may be located.
How about checking the filesystem?
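One way a filesystem-based check could work, assuming the application touches a heartbeat file (here `/tmp/heartbeat`, a placeholder) at regular intervals and a stale file counts as a failure:

```sh
#!/bin/sh
# Fail if the heartbeat file is missing or older than MAX_AGE seconds.
HEARTBEAT=/tmp/heartbeat   # path the application is assumed to touch regularly
MAX_AGE=30

[ -f "$HEARTBEAT" ] || exit 1
age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT") ))   # GNU/busybox stat syntax
[ "$age" -le "$MAX_AGE" ] || exit 1
exit 0
```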
Also agree with the name "health check" over "probe." Probe is a bit vague, whereas health check (even if just container-level and not app-level) is pretty widely understood by users.
Would it make more sense for the health check specification to be defined on the container being checked, and the link/dependency itself to be defined on the dependent container, or is that too inflexible? This might not work well with a swarm, or otherwise outside of a local single-node environment; I'm naive to the docker internals. e.g. Ignore the method for health-checking, that part isn't super important, but assume
IMO for the first pass it would be valuable to focus on defining how to determine and discover the "health" status (including how to monitor it for changes), and then separately discuss how that state impacts the rest of the system (container dependencies, etc); I fear that if we try to implement both in one shot that we'll end up hamstringing our implementation of "health checking" to solve a narrow use case (or a narrow set of use cases) 😞
Consider providing a TCP/UDP check for non-HTTP(S) apps, with a user-defined request and response pattern.
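Such a user-defined request/response check can already be approximated with `nc`; for example, assuming a Redis-style service on port 6379 that answers `PING` with `+PONG` (host, port, request and expected pattern are all placeholders):

```sh
# Send the request over TCP and verify the response against the expected
# pattern; -w 2 gives nc an idle timeout so the check cannot hang.
printf 'PING\r\n' | nc -w 2 localhost 6379 | grep -q '+PONG' || exit 1
```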
Instead of a check from the outside, could it make sense to be able to tell the run command when it can automatically detach? This way, a container could be started with the usual |
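With the health state that was eventually merged, a script can approximate this today by starting the container detached and polling its health before moving on (the `web` name and `myimage` are placeholders, and the image is assumed to define a health check):

```sh
# Start detached, then block until the container reports healthy.
docker run -d --name web myimage
until [ "$(docker inspect -f '{{.State.Health.Status}}' web)" = "healthy" ]; do
  sleep 1
done
echo "web is ready"
```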
There's a pull request open to implement this, so anyone who's interested, PTAL at #22719
@thaJeztah Is this issue closable? #23218
Yup! Same as the other one 👍 Implemented in #23218
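For anyone landing here later: the merged feature is exposed as a `HEALTHCHECK` Dockerfile instruction and as `docker run` flags rather than the `--probe` options discussed in this proposal; a CLI-only example, assuming `curl` is available inside the (placeholder) image:

```sh
# Roughly the shape of the proposed --probe* options, using the flags that
# were actually merged.
docker run -d --name web \
  --health-cmd='curl -fsS http://localhost:8080/health || exit 1' \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  myimage
```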
Where is the doc and/or schema for the output of
@talex5: thanks! Hope this gets documented someday.
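The health state ends up under `.State.Health` in the inspect output; a quick way to see it for a container named `web` (a placeholder), including the status, failing streak and the log of recent probe runs:

```sh
# Dump the container's health state as JSON; Status is one of
# "starting", "healthy" or "unhealthy".
docker inspect --format '{{json .State.Health}}' web
```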
Problem statement
Docker currently doesn't provide any built-in way to determine whether a container is "alive" in the sense that the service it provides is up and running. Such a signal would be useful in many scenarios, for example:
This issue covers support for "alive probes" at the Engine level.
Proposal
Every container would support one optional container-specific probe to determine whether a service is alive. This would translate in the `docker` UX through several new command line options for the `run` sub-command (naming to be discussed):

- `--probe` (default: `""`)
- `--probe-interval` (default: `60`)
- `--probe-retry` (default: `1`)
- `--probe-timeout`
- `--probe-grace`

A container is considered alive when the probe returns `0` for a `file://` based probe, or a status code in the 200-399 range for an `http[s]://` based probe.

Examples:
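As an illustration of the proposed syntax (the endpoint and script path are placeholders, not taken from the proposal):

```sh
# HTTP probe: the container is alive while /health answers with a 200-399 status.
docker run -d --probe=http://localhost:8080/health --probe-interval=30 --probe-timeout=5 myimage

# file:// probe: the container is alive while the script inside it exits 0.
docker run -d --probe=file:///usr/local/bin/check.sh --probe-grace=60 myimage
```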
Open questions

Implementation

- Should probe status changes be reported through the `events` API?

Restart policies

- Should a failing probe trigger `--restart=always` and `--restart=on-failure` when a `--probe` is specified? That is roughly equivalent to assuming that the restart policy was always backed by a default probe whose behavior is to look for the container process being alive.

References
Ping @crosbymichael @tonistiigi @mgoelzer @aluzzardi @ehazlett