Refactor libcontainerd to minimize containerd RPCs #43564
Conversation
Force-pushed from 20851da to 3f303e1
// Fix for https://github.com/moby/moby/issues/38719.
// If the init process failed to launch, we still need to reap the
// container to avoid leaking it.
//
// Note we use the explicit exit code of 127 which is the
// Linux shell equivalent of "command not found". Windows cannot
// know ahead of time whether or not the command exists, especially
// in the case of Hyper-V containers.
ctr.Unlock()
exitedAt := time.Now()
p := &process{
	id:  libcontainerdtypes.InitProcessName,
	pid: 0,
}
c.reapContainer(ctr, p, 127, exitedAt, nil, logger)
Note for reviewers: containers will not be leaked if creating the init process fails, despite the code being removed here. The daemon handles `Start` errors by cleaning up the container with `(types.Container).Delete`, and the local_windows implementation shuts down and closes the container, same as `reapContainer` did.
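For illustration only, a minimal sketch of the cleanup pattern described in that note, using hypothetical simplified interfaces rather than the real daemon or libcontainerd types:

```go
package sketch

import (
	"context"
	"log"
)

// Hypothetical, simplified stand-ins for the real libcontainerd types;
// the names and signatures here are illustrative only.
type Task interface {
	Start(ctx context.Context) error
}

type Container interface {
	NewTask(ctx context.Context) (Task, error)
	Delete(ctx context.Context) error
}

// startOrCleanup shows the pattern: if starting the init process fails,
// the caller deletes the container so it is not leaked, instead of
// relying on libcontainerd to reap it.
func startOrCleanup(ctx context.Context, ctr Container) error {
	tsk, err := ctr.NewTask(ctx)
	if err != nil {
		return err
	}
	if err := tsk.Start(ctx); err != nil {
		if derr := ctr.Delete(ctx); derr != nil {
			log.Printf("failed to clean up container after start error: %v", derr)
		}
		return err
	}
	return nil
}
```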
Force-pushed from 3f303e1 to 91ce829
needs a rebase 😅
Force-pushed from 91ce829 to 2c1c558
Force-pushed from 666151d to 718120f
will need to do another pass on this; left some comments / thoughts
Whether a process within this container has been killed because it ran
out of memory since the container was last started.
TBH, I wasn't fully aware we also set this field if an exec or "any other process in the namespace" was OOM-killed.
IIRC (but I'd have to dig through history) the intent of this field was to learn "why did the container exit?" (which could be due to OOM), so (implicitly) this field was meant as an indicator for the container's main process being OOM-killed.
So, this change means a couple of things:
- (already the case?) OOMKilled could mean "any process" was killed (some of which could've been "non-fatal")
- 👍 setting this field immediately allows such (non-fatal) occurrences to be observed while the container is running
- ❓ but they may (potentially) be a red herring if the `OOMKilled` field is `true` but wasn't the cause of the container ultimately exiting
- 👉 that said, if processes are being killed due to OOM in the container, it could still be useful information (the container exiting because one of its child processes was terminated, and the container running in a "degraded" state)
The above is neither "good" nor "bad", but "different". The only thing I'm wondering is: would there be reasons (would there be value) in being able to differentiate those? Thinking along the lines of:
- A counter for OOM events (seeing the counter go up, but the container still running, means that it's potentially in a degraded state and/or configured with insufficient resources).
- `OOMKilled` reserved for the container's main process? (devil's advocate: the exit event itself may not be directly caused by OOM for the main process, but be a result of other processes)
- ^^ a possible solution to that would be to either deprecate `OOMKilled` (in favor of a counter?) or, on exit, set `OOMKilled = len(OOMKilledCounter) > 0`
OOMKilled always meant "any process" was killed. I have only changed when the flag is updated, and corrected the documentation.
Unfortunately, we can't reserve `OOMKilled` for the container's pid 1 because cgroups only keep track of how many times the OOM killer has been invoked on processes in a group, not which processes in the group got killed. And even if we could, things could get really confusing with `docker run --init` and containers which use a fit-for-purpose init (e.g. `s6-svscan`) as pid 1. If we had the ability to know which processes got OOM-killed, I think it would be so much more useful to surface that information as discrete events associated with the container, with timestamps, PIDs and `/proc/<pid>/cmdline`s.
Even without detailed information, I think surfacing OOMs as timestamped events would be far superior to a counter, as it would allow the fatal OOM-kills to be differentiated from the non-fatal ones, even after the container exits. (Heuristic: how much time elapsed between the OOM-kill and the container stopping.) A close runner-up would be surfacing a counter along with the timestamp of the most recent OOM-kill event received by the daemon.
If only `memory.oom.group` was enabled in containers. (AFAICT runC does not enable it.) That would clear up any ambiguity quite nicely.
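For reference (not part of this PR), a minimal sketch of how the per-cgroup OOM-kill counter can be read on a cgroup v2 host; the `memory.events` location under the cgroup directory is an assumption about the host's layout:

```go
package sketch

import (
	"bufio"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// oomKillCount returns the oom_kill counter from a cgroup v2 memory.events
// file. The kernel only reports how many times the OOM killer fired within
// the group, not which PIDs were killed, which is why the flag cannot be
// scoped to the container's pid 1.
func oomKillCount(cgroupDir string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}
```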
 // InitializeStdio is called by libcontainerd to connect the stdio.
-func (c *Config) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
+func (c *ExecConfig) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
Slightly wondering what the motivation was to move this into the `container` package (instead of keeping it as a smaller package). I'd probably need to check it locally to see if there are any issues w.r.t. non-exported fields (e.g.)?
(Also looking commit-by-commit, so perhaps there's more in the next commit(s) 😅)
I had to do it to avoid a circular import between the `./daemon/exec` and `./container` packages.
s.ctr = ctr
s.task = tsk
if tsk != nil {
	s.Pid = int(tsk.Pid())
Wondering if we should consider changing our `State.Pid` to a `uint32` to match containerd's type.
@@ -57,8 +57,11 @@ func (daemon *Daemon) CheckpointCreate(name string, config types.CheckpointCreat
 		return err
 	}
 
-	if !container.IsRunning() {
-		return fmt.Errorf("Container %s not running", name)
+	container.Lock()
(Just thinking out loud) should we have an RWMutex instead of a Mutex?
Only if profiling reveals significant lock contention, as RWMutex is slower than Mutex. See golang/go#38813, golang/go#17973.
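For anyone wanting to verify on their own hardware, a rough benchmark sketch (not part of this PR) comparing lock/unlock cost under parallel load; results will vary by Go version and machine:

```go
package sketch

import (
	"sync"
	"testing"
)

// BenchmarkMutex measures Lock/Unlock on a plain Mutex under parallel load.
func BenchmarkMutex(b *testing.B) {
	var mu sync.Mutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}

// BenchmarkRWMutexRLock measures the read-lock path of an RWMutex, which
// carries extra bookkeeping compared to a plain Mutex.
func BenchmarkRWMutexRLock(b *testing.B) {
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			mu.RUnlock()
		}
	})
}
```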
switch status {
case containerd.Paused, containerd.Pausing:
	// nothing to do
case containerd.Unknown, containerd.Stopped, "":
	log.WithField("status", status).Error("unexpected status for paused container during restore")
Slightly wondering here: `alive` is set based on containerd state above, so would we ever be able to get into this situation?
Also, currently this code looks like:
switch {
case SOME_CONDITION && alive:
case OTHER_CONDITION && alive:
}
if !alive {
// do other things
}
Instead of including the `&& alive` in the switch condition, perhaps it'd be clearer to include it in an if/else for `alive`:
if alive {
switch {
case SOME_CONDITION:
case OTHER_CONDITION:
}
} else {
// do other things
}
We could get into this situation if containerd's task state disagrees with docker's serialized container state.
perhaps it'd be clearer to...
`func (*Daemon) restore()` is in desperate need of a major refactor. It's nearly 400 lines long, with a cyclomatic complexity to match! I would love to improve it, but this PR is already huge and tough enough to review as it is. I think it would be best for everyone's sanity to refactor `restore()` in a separate PR and include your suggested change there.
Generally looks OK to me - would love to get it into HEAD and have it receive more testing. 👀
It needs a rebase, but hopefully it's not too complicated (looks like maybe just some minor struct changes?)
daemon/exec.go (outdated)
// Must use a new context since the current context is done.
ctx := context.TODO()
logrus.Debugf("Sending TERM signal to process %v in container %v", name, ec.Container.ID)
ec.Process.Kill(ctx, signal.SignalMap["TERM"])
Relevant to #43739, this could be simpler if we were just `SIGKILL`ing 👀 (this is the same bit of code, right?)
It is the same bit of code!
Looks like quite the rebase is needed 😢
I did a quick rebase in #43967 (pushed as temporary PR to have a run of CI, but if it looks good, we can reset this branch to that branch)
Force-pushed from 3b33fe3 to 16a9ec7
Discussing in the maintainers meeting; let's get #43739 in first, so that we can backport that for 22.06, then merge this one 😅
Force-pushed from 16a9ec7 to dbb8497
The daemon.containerd.Exec call does not access or mutate the container's ExecCommands store in any way, and locking the exec config is sufficient to synchronize with the event-processing loop. Locking the ExecCommands store while starting the exec process only serves to block unrelated operations on the container for an extended period of time.
Convert the Store struct's mutex to an unexported field to prevent this from regressing in the future.
Signed-off-by: Cory Snider <csnider@mirantis.com>
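A sketch of the "unexported mutex" pattern the commit message refers to, with illustrative names rather than the actual moby types: once the mutex is an unexported field, only the store's own methods can lock it, so callers can no longer hold the store locked across slow operations.

```go
package sketch

import "sync"

// ExecConfig is a placeholder for the per-exec state object.
type ExecConfig struct{ ID string }

// Store owns its mutex; code outside this type cannot lock it directly.
type Store struct {
	mu       sync.RWMutex
	commands map[string]*ExecConfig
}

func (s *Store) Add(id string, c *ExecConfig) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.commands == nil {
		s.commands = make(map[string]*ExecConfig)
	}
	s.commands[id] = c
}

func (s *Store) Get(id string) *ExecConfig {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.commands[id]
}
```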
The OOMKilled flag on a container's state has historically behaved rather unintuitively: it is updated on container exit to reflect whether or not any process within the container has been OOM-killed during the preceding run of the container. The OOMKilled flag would be set to true when the container exits if any process within the container (including execs) was OOM-killed at any time while the container was running, whether or not the OOM-kill was the cause of the container exiting. The flag is "sticky," persisting through the next start of the container, and is only cleared once the container exits without any processes having been OOM-killed during that run.
Alter the behavior of the OOMKilled flag such that it signals whether any process in the container had been OOM-killed since the most recent start of the container. Set the flag immediately upon any process being OOM-killed, and clear it when the container transitions to the "running" state.
There is an ulterior motive for this change. It reduces the amount of state the libcontainerd client needs to keep track of and clean up on container exit. It's one less place the client could leak memory if a container was to be deleted without going through libcontainerd.
Signed-off-by: Cory Snider <csnider@mirantis.com>
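A tiny sketch (hypothetical types, not the actual moby State struct) of the flag lifecycle described above: set as soon as an OOM event is received, cleared when the container transitions back to running.

```go
package sketch

import "sync"

// State is a minimal stand-in for a container's state.
type State struct {
	mu        sync.Mutex
	OOMKilled bool
}

// OnOOMEvent is called when an OOM event is received; the flag is set
// immediately rather than waiting for the container to exit.
func (s *State) OnOOMEvent() {
	s.mu.Lock()
	s.OOMKilled = true
	s.mu.Unlock()
}

// OnStart is called when the container transitions to the running state.
func (s *State) OnStart() {
	s.mu.Lock()
	s.OOMKilled = false
	s.mu.Unlock()
}
```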
The containerd client is very chatty at the best of times. Because the libcontainerd API is stateless and references containers and processes by string ID for every method call, the implementation is essentially forced to use the containerd client in a way which amplifies the number of redundant RPCs invoked to perform any operation. The libcontainerd remote implementation has to reload the containerd container, task and/or process metadata for nearly every operation. This in turn amplifies the number of context switches between dockerd and containerd to perform any container operation or handle a containerd event, increasing the load on the system which could otherwise be allocated to workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch with the containerd client so that the containerd client can be used more efficiently. Split the API out into container, task and process interfaces which the consumer is expected to retain, so that libcontainerd can retain state (especially the analogous containerd client objects) without having to manage any state-store inside the libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
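To make the shape of the new API concrete, here is a rough sketch of the kind of split described above; these interfaces are illustrative, not the exact ones added by this PR.

```go
package sketch

import "context"

// Process is a handle to a running process that the consumer retains.
type Process interface {
	Pid() uint32
	Kill(ctx context.Context, signal int) error
	Delete(ctx context.Context) (exitCode uint32, err error)
}

// Task is the container's init process plus task-level operations.
type Task interface {
	Process
	Exec(ctx context.Context, id string, spec any) (Process, error)
	Pause(ctx context.Context) error
	Resume(ctx context.Context) error
}

// Container is a handle to a created container.
type Container interface {
	NewTask(ctx context.Context) (Task, error)
	Task(ctx context.Context) (Task, error)
	Delete(ctx context.Context) error
}

// Client only creates or loads containers; it holds no per-container state.
type Client interface {
	NewContainer(ctx context.Context, id string, spec any) (Container, error)
	LoadContainer(ctx context.Context, id string) (Container, error)
}
```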
The existing logic to handle container ID conflicts when attempting to create a plugin container is not nearly as robust as the implementation in daemon for user containers. Extract and refine the logic from daemon and use it in the plugin executor.
Signed-off-by: Cory Snider <csnider@mirantis.com>

Attempting to delete the directory while another goroutine is concurrently executing a CheckpointTo() can fail on Windows due to file locking. As all callers of CheckpointTo() are required to hold the container lock, holding the lock while deleting the directory ensures that there will be no interference.
Signed-off-by: Cory Snider <csnider@mirantis.com>

Modifying the builtin Windows runtime to send the exited event immediately upon the container's init process exiting, without first waiting for the Compute System to shut down, perturbed the timings enough to make TestWaitConditions flaky on that platform. Make TestWaitConditions timing-independent by having the container wait for input on STDIN before exiting.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Force-pushed from dbb8497 to 15b8e4a
// Hold the container lock while deleting the container root directory
// so that other goroutines don't attempt to concurrently open files
Nice find 👍
Went through this one a few more times. Lots of code, but I didn't spot any issues, and of course we'll now have the opportunity to give this some "burn-in" time in CI. Looks like a great improvement over the existing code; thank you!
LGTM
This test passed in the prior CI run so it's
We have integration tests which assert the invariant that a GET /containers/{id}/json response lists only IDs of execs which are in the Running state, according to GET /exec/{id}/json. The invariant could be violated if those requests were to race the handling of the exec's task-exit event. The coarse-grained locking of the container ExecStore when starting an exec task was accidentally synchronizing (*Daemon).ProcessEvent and (*Daemon).ContainerExecInspect just enough to make it improbable for the integration tests to catch the invariant violation on execs which exit immediately. Removing the unnecessary locking made the underlying race condition more likely for the tests to hit.
Maintain the invariant by deleting the exec from its container's ExecCommands before clearing its Running flag. Additionally, fix other potential data races with execs by ensuring that the ExecConfig lock is held whenever a mutable field is read from or written to.
Signed-off-by: Cory Snider <csnider@mirantis.com>
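A sketch of the ordering described in the commit message, with hypothetical types: because the exec is deleted from the container's store before its Running flag is cleared, an inspect that can still find the exec ID will also still see it as running, preserving the invariant.

```go
package sketch

import "sync"

// execConfig is a minimal stand-in for the daemon's exec state.
type execConfig struct {
	mu      sync.Mutex
	Running bool
}

// execStore is a minimal stand-in for the container's ExecCommands store.
type execStore struct {
	mu    sync.Mutex
	execs map[string]*execConfig
}

func (s *execStore) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.execs, id)
}

// handleExecExit removes the exec from the container first, and only then
// clears Running under the exec's own lock.
func handleExecExit(store *execStore, id string, ec *execConfig) {
	store.Delete(id)

	ec.mu.Lock()
	ec.Running = false
	ec.mu.Unlock()
}
```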
containerID := container.Run(ctx, t, cli, opts...)
poll.WaitOn(t, container.IsInState(ctx, cli, containerID, "running"), poll.WithTimeout(30*time.Second), poll.WithDelay(100*time.Millisecond))
opts := append([]func(*container.TestContainerConfig){
	container.WithCmd("sh", "-c", "read -r; exit 99"),
😍 this is a much cleaner solution than `sleep`
Bombs away!
- What I did
While investigating a report of slow `docker exec`s and failing health checks on heavily-loaded systems, I instrumented `dockerd` with OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces of various container operations. I was surprised to see RPCs such as `containerd.services.containers.v1.Container/Get` being called multiple times in a row for nearly every operation. In the common case of a local containerd process, each RPC costs at least two context switches to complete, so we want to keep the number of RPCs per operation to a minimum to reduce the overhead of dockerd and containerd on the system. I overhauled libcontainerd and refactored how the daemon uses libcontainerd to do just that.
I modified the behavior of the `Container.State.OOMKilled` flag to update immediately (rather than on container exit) to simplify the code and reduce the amount of state libcontainerd needs to track, with the side benefit of the new behavior being more intuitive to users.
I benchmarked these changes against the base commit (bb88ff4) and measured a reduction in wallclock time for `docker exec` of 6.8% in my test environment.
- How I did it
The containerd client library is very chatty, generally erring on the side of refreshing metadata over referencing a locally-cached copy. (The containerd gRPC API has no concurrency control, so multiple clients could race to mutate the same resources. Aggressively refreshing metadata only shortens the race window; it does not eliminate it. Perhaps a future version of the containerd API could improve upon the situation so it can be less aggressive about refreshing. But I digress...) And further exacerbating the situation, the libcontainerd implementation is effectively forced to reload the containerd resources at the start of every operation because containers and processes are referenced by ID strings. To illustrate, consider the libcontainerd `Start()` method: in broad strokes, its implementation makes four calls to `containerd.services.containers.v1.Container/Get` in a row, all to get the exact same metadata! Two of those redundant calls can be easily eliminated by reusing the container metadata initially loaded by `c.client.LoadContainer`. And the `LoadContainer` call could also be elided if the container object returned by `c.client.NewContainer` when the container was created was retained. The number of RPCs to start a container can be cut in half without any changes to the containerd client! The containerd client resource objects need to be retained and reused to pull this off, which means more state for the Docker daemon to manage.
The Docker daemon already keeps track of state for every container and exec, with mappings from ID strings to the respective state objects. Holding persistent state in the libcontainerd layer with mappings from ID strings to state objects would be redundant, so I overhauled the libcontainerd interfaces to allow for implementations which hold no shared mutable state. Not coincidentally, this shape mirrors the containerd client's interfaces. I delegated retaining references to the libcontainerd objects to the daemon's container state and exec config structs. A rough sketch of the difference follows below.
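As an illustration only (a sketch using the containerd 1.x Go client, not code from this PR; exact signatures may vary between containerd versions):

```go
package sketch

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
)

// startByID resolves the container by string ID every time, paying a
// containers.Get RPC before the actual work can begin.
func startByID(ctx context.Context, client *containerd.Client, id string) (containerd.Task, error) {
	ctr, err := client.LoadContainer(ctx, id) // RPC: containers.Get
	if err != nil {
		return nil, err
	}
	tsk, err := ctr.NewTask(ctx, cio.NullIO)
	if err != nil {
		return nil, err
	}
	if err := tsk.Start(ctx); err != nil {
		return nil, err
	}
	return tsk, nil
}

// retained keeps the containerd.Container handle returned by
// NewContainer/LoadContainer so later operations reuse it instead of
// re-resolving the ID, avoiding the redundant Get RPCs.
type retained struct {
	ctr containerd.Container
}

func (r *retained) start(ctx context.Context) (containerd.Task, error) {
	tsk, err := r.ctr.NewTask(ctx, cio.NullIO)
	if err != nil {
		return nil, err
	}
	if err := tsk.Start(ctx); err != nil {
		return nil, err
	}
	return tsk, nil
}
```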
The local Windows (HCSv1) libcontainerd implementation was a bit of a challenge to refactor to fit the new shape of the libcontainerd interface, but it also benefited from eliminating shared mutable state.
- How to verify it
CI ✅
- Description for the changelog
The `State.OOMKilled` flag is now set to true immediately upon any container process getting OOM-killed by the kernel, and cleared to false when the container is restarted.
- A picture of a cute animal (not mandatory but encouraged)
