Refactor libcontainerd to minimize containerd RPCs #43564
Conversation
Force-pushed from 20851da to 3f303e1
// Fix for https://github.com/moby/moby/issues/38719.
// If the init process failed to launch, we still need to reap the
// container to avoid leaking it.
//
// Note we use the explicit exit code of 127 which is the
// Linux shell equivalent of "command not found". Windows cannot
// know ahead of time whether or not the command exists, especially
// in the case of Hyper-V containers.
ctr.Unlock()
exitedAt := time.Now()
p := &process{
	id:  libcontainerdtypes.InitProcessName,
	pid: 0,
}
c.reapContainer(ctr, p, 127, exitedAt, nil, logger)
Note for reviewers: containers will not be leaked if creating the init process fails, despite the code being removed here. The daemon handles `Start` errors by cleaning up the container with `(types.Container).Delete`, and the local_windows implementation shuts down and closes the container, same as `reapContainer` did.
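For illustration only, a minimal sketch of the cleanup pattern described in that note, using hypothetical simplified interfaces rather than the real daemon or libcontainerd types:

```go
package sketch

import (
	"context"
	"log"
)

// Hypothetical, simplified stand-ins for the real libcontainerd types;
// the names and signatures here are illustrative only.
type Task interface {
	Start(ctx context.Context) error
}

type Container interface {
	NewTask(ctx context.Context) (Task, error)
	Delete(ctx context.Context) error
}

// startOrCleanup shows the pattern: if starting the init process fails,
// the caller deletes the container so it is not leaked, instead of
// relying on libcontainerd to reap it.
func startOrCleanup(ctx context.Context, ctr Container) error {
	tsk, err := ctr.NewTask(ctx)
	if err != nil {
		return err
	}
	if err := tsk.Start(ctx); err != nil {
		if derr := ctr.Delete(ctx); derr != nil {
			log.Printf("failed to clean up container after start error: %v", derr)
		}
		return err
	}
	return nil
}
```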
Force-pushed from 3f303e1 to 91ce829
needs a rebase 😅
Force-pushed from 91ce829 to 2c1c558
Force-pushed from 666151d to 718120f
will need to do another pass on this; left some comments / thoughts
Whether a process within this container has been killed because it ran
out of memory since the container was last started.
TBH, I wasn't fully aware we also set this field if an exec or "any other process in the namespace" was OOM-killed.
IIRC (but I'd have to dig through history) the intent of this field was to learn "why did the container exit?" (which could be due to OOM), so (implicitly) this field was meant as an indicator for the container's main process being OOM-killed.
So, this change means a couple of things:
- (already the case?) OOMKilled could mean "any process" was killed (some of which could've been "non-fatal")
- 👍 setting this field immediately allows such (non-fatal) occurrences to be observed while the container is running
- ❓ but they may (potentially) be a red herring if the `OOMKilled` field is `true` but wasn't the cause of the container ultimately exiting
- 👉 that said, if processes are being killed due to OOM in the container, it could still be useful information (the container exiting because one of its child processes was terminated, and the container running in a "degraded" state)
The above is neither "good" nor "bad", but "different". The only thing I'm wondering is: would there be reasons (would there be value) in being able to differentiate those? Thinking along the lines of:
- A counter for OOM events (seeing the counter go up, but the container still running, means that it's potentially in a degraded state and/or configured with insufficient resources).
- `OOMKilled` reserved for the container's main process? (devil's advocate: the exit event itself may not be directly caused by OOM for the main process, but be a result of other processes)
- ^^ a possible solution to that would be to either deprecate `OOMKilled` (in favor of a counter?) or, on exit, set `OOMKilled = len(OOMKilledCounter) > 0`
OOMKilled always meant "any process" was killed. I have only changed when the flag is updated, and corrected the documentation.
Unfortunately, we can't reserve `OOMKilled` for the container's pid 1 because cgroups only keep track of how many times the OOM killer has been invoked on processes in a group, not which processes in the group got killed. And even if we could, things could get really confusing with `docker run --init` and containers which use a fit-for-purpose init (e.g. `s6-svscan`) as pid 1. If we had the ability to know which processes got OOM-killed, I think it would be so much more useful to surface that information as discrete events associated with the container, with timestamps, PIDs and `/proc/<pid>/cmdline`s.
Even without detailed information, I think surfacing OOMs as timestamped events would be far superior to a counter, as it would allow the fatal OOM-kills to be differentiated from the non-fatal ones, even after the container exits. (Heuristic: how much time elapsed between the OOM-kill and the container stopping.) A close runner-up would be surfacing a counter along with the timestamp of the most recent OOM-kill event received by the daemon.
If only `memory.oom.group` was enabled in containers. (AFAICT runC does not enable it.) That would clear up any ambiguity quite nicely.
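For reference (not part of this PR), a minimal sketch of how the per-cgroup OOM-kill counter can be read on a cgroup v2 host; the `memory.events` location under the cgroup directory is an assumption about the host's layout:

```go
package sketch

import (
	"bufio"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// oomKillCount returns the oom_kill counter from a cgroup v2 memory.events
// file. The kernel only reports how many times the OOM killer fired within
// the group, not which PIDs were killed, which is why the flag cannot be
// scoped to the container's pid 1.
func oomKillCount(cgroupDir string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}
```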
 // InitializeStdio is called by libcontainerd to connect the stdio.
-func (c *Config) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
+func (c *ExecConfig) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
Slightly wondering what the motivation was to move this into the `container` package (instead of keeping it as a smaller package). I'd probably need to check it locally to see if there are any issues w.r.t. non-exported fields (e.g.)?
(Also looking commit-by-commit, so perhaps there's more in the next commit(s) 😅)
I had to do it to avoid a circular import between the `./daemon/exec` and `./container` packages.
s.ctr = ctr
s.task = tsk
if tsk != nil {
	s.Pid = int(tsk.Pid())
Wondering if we should consider changing our `State.Pid` to a `uint32` to match containerd's type.
@@ -57,8 +57,11 @@ func (daemon *Daemon) CheckpointCreate(name string, config types.CheckpointCreat
 		return err
 	}
 
-	if !container.IsRunning() {
-		return fmt.Errorf("Container %s not running", name)
+	container.Lock()
(Just thinking out loud) should we have an RWMutex instead of a Mutex?
Only if profiling reveals significant lock contention, as RWMutex is slower than Mutex. See golang/go#38813, golang/go#17973.
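For anyone wanting to verify on their own hardware, a rough benchmark sketch (not part of this PR) comparing lock/unlock cost under parallel load; results will vary by Go version and machine:

```go
package sketch

import (
	"sync"
	"testing"
)

// BenchmarkMutex measures Lock/Unlock on a plain Mutex under parallel load.
func BenchmarkMutex(b *testing.B) {
	var mu sync.Mutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}

// BenchmarkRWMutexRLock measures the read-lock path of an RWMutex, which
// carries extra bookkeeping compared to a plain Mutex.
func BenchmarkRWMutexRLock(b *testing.B) {
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			mu.RUnlock()
		}
	})
}
```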
switch status {
case containerd.Paused, containerd.Pausing:
	// nothing to do
case containerd.Unknown, containerd.Stopped, "":
	log.WithField("status", status).Error("unexpected status for paused container during restore")
Slightly wondering here: `alive` is set based on containerd state above, so would we ever be able to get into this situation?
Also, currently this code looks like:
switch {
case SOME_CONDITION && alive:
case OTHER_CONDITION && alive:
}
if !alive {
// do other things
}
Instead of including the `&& alive` in the switch condition, perhaps it'd be clearer to include it in an if/else for `alive`:
if alive {
switch {
case SOME_CONDITION:
case OTHER_CONDITION:
}
} else {
// do other things
}
We could get into this situation if containerd's task state disagrees with docker's serialized container state.
perhaps it'd be clearer to...
`func (*Daemon) restore()` is in desperate need of a major refactor. It's nearly 400 lines long, with a cyclomatic complexity to match! I would love to improve it, but this PR is already huge and tough enough to review as it is. I think it would be best for everyone's sanity to refactor `restore()` in a separate PR and include your suggested change there.
Generally looks OK to me - would love to get it into HEAD and have it receive more testing. 👀
It needs a rebase, but hopefully it's not too complicated (looks like maybe just some minor struct changes?)
daemon/exec.go (outdated)
// Must use a new context since the current context is done.
ctx := context.TODO()
logrus.Debugf("Sending TERM signal to process %v in container %v", name, ec.Container.ID)
ec.Process.Kill(ctx, signal.SignalMap["TERM"])
Relevant to #43739, this could be simpler if we were just `SIGKILL`ing 👀 (this is the same bit of code, right?)
It is the same bit of code!
Looks like quite the rebase is needed 😢
I did a quick rebase in #43967 (pushed as temporary PR to have a run of CI, but if it looks good, we can reset this branch to that branch)
Force-pushed from 3b33fe3 to 16a9ec7
Discussing in the maintainers meeting; let's get #43739 in first, so that we can backport that for 22.06, then merge this one 😅
Force-pushed from 16a9ec7 to dbb8497
The daemon.containerd.Exec call does not access or mutate the container's ExecCommands store in any way, and locking the exec config is sufficient to synchronize with the event-processing loop. Locking the ExecCommands store while starting the exec process only serves to block unrelated operations on the container for an extended period of time.
Convert the Store struct's mutex to an unexported field to prevent this from regressing in the future.
Signed-off-by: Cory Snider <csnider@mirantis.com>
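A sketch of the "unexported mutex" pattern the commit message refers to, with illustrative names rather than the actual moby types: once the mutex is an unexported field, only the store's own methods can lock it, so callers can no longer hold the store locked across slow operations.

```go
package sketch

import "sync"

// ExecConfig is a placeholder for the per-exec state object.
type ExecConfig struct{ ID string }

// Store owns its mutex; code outside this type cannot lock it directly.
type Store struct {
	mu       sync.RWMutex
	commands map[string]*ExecConfig
}

func (s *Store) Add(id string, c *ExecConfig) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.commands == nil {
		s.commands = make(map[string]*ExecConfig)
	}
	s.commands[id] = c
}

func (s *Store) Get(id string) *ExecConfig {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.commands[id]
}
```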
The OOMKilled flag on a container's state has historically behaved rather unintuitively: it is updated on container exit to reflect whether or not any process within the container has been OOM-killed during the preceding run of the container. The OOMKilled flag would be set to true when the container exits if any process within the container (including execs) was OOM-killed at any time while the container was running, whether or not the OOM-kill was the cause of the container exiting. The flag is "sticky," persisting through the next start of the container, and is only cleared once the container exits without any processes having been OOM-killed during that run.
Alter the behavior of the OOMKilled flag such that it signals whether any process in the container had been OOM-killed since the most recent start of the container. Set the flag immediately upon any process being OOM-killed, and clear it when the container transitions to the "running" state.
There is an ulterior motive for this change. It reduces the amount of state the libcontainerd client needs to keep track of and clean up on container exit. It's one less place the client could leak memory if a container was to be deleted without going through libcontainerd.
Signed-off-by: Cory Snider <csnider@mirantis.com>
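A tiny sketch (hypothetical types, not the actual moby State struct) of the flag lifecycle described above: set as soon as an OOM event is received, cleared when the container transitions back to running.

```go
package sketch

import "sync"

// State is a minimal stand-in for a container's state.
type State struct {
	mu        sync.Mutex
	OOMKilled bool
}

// OnOOMEvent is called when an OOM event is received; the flag is set
// immediately rather than waiting for the container to exit.
func (s *State) OnOOMEvent() {
	s.mu.Lock()
	s.OOMKilled = true
	s.mu.Unlock()
}

// OnStart is called when the container transitions to the running state.
func (s *State) OnStart() {
	s.mu.Lock()
	s.OOMKilled = false
	s.mu.Unlock()
}
```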
The containerd client is very chatty at the best of times. Because the libcontainerd API is stateless and references containers and processes by string ID for every method call, the implementation is essentially forced to use the containerd client in a way which amplifies the number of redundant RPCs invoked to perform any operation. The libcontainerd remote implementation has to reload the containerd container, task and/or process metadata for nearly every operation. This in turn amplifies the number of context switches between dockerd and containerd to perform any container operation or handle a containerd event, increasing the load on the system which could otherwise be allocated to workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch with the containerd client so that the containerd client can be used more efficiently. Split the API out into container, task and process interfaces which the consumer is expected to retain, so that libcontainerd can retain state (especially the analogous containerd client objects) without having to manage any state-store inside the libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
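To make the shape of the new API concrete, here is a rough sketch of the kind of split described above; these interfaces are illustrative, not the exact ones added by this PR.

```go
package sketch

import "context"

// Process is a handle to a running process that the consumer retains.
type Process interface {
	Pid() uint32
	Kill(ctx context.Context, signal int) error
	Delete(ctx context.Context) (exitCode uint32, err error)
}

// Task is the container's init process plus task-level operations.
type Task interface {
	Process
	Exec(ctx context.Context, id string, spec any) (Process, error)
	Pause(ctx context.Context) error
	Resume(ctx context.Context) error
}

// Container is a handle to a created container.
type Container interface {
	NewTask(ctx context.Context) (Task, error)
	Task(ctx context.Context) (Task, error)
	Delete(ctx context.Context) error
}

// Client only creates or loads containers; it holds no per-container state.
type Client interface {
	NewContainer(ctx context.Context, id string, spec any) (Container, error)
	LoadContainer(ctx context.Context, id string) (Container, error)
}
```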
The existing logic to handle container ID conflicts when attempting to create a plugin container is not nearly as robust as the implementation in daemon for user containers. Extract and refine the logic from daemon and use it in the plugin executor.
Signed-off-by: Cory Snider <csnider@mirantis.com>

Attempting to delete the directory while another goroutine is concurrently executing a CheckpointTo() can fail on Windows due to file locking. As all callers of CheckpointTo() are required to hold the container lock, holding the lock while deleting the directory ensures that there will be no interference.
Signed-off-by: Cory Snider <csnider@mirantis.com>

Modifying the builtin Windows runtime to send the exited event immediately upon the container's init process exiting, without first waiting for the Compute System to shut down, perturbed the timings enough to make TestWaitConditions flaky on that platform. Make TestWaitConditions timing-independent by having the container wait for input on STDIN before exiting.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Force-pushed from dbb8497 to 15b8e4a
// Hold the container lock while deleting the container root directory
// so that other goroutines don't attempt to concurrently open files
Nice find 👍
Went through this one a few more times. Lots of code, but I didn't spot any issues, and of course we'll now have the opportunity to give this some "burn-in" time in CI. Looks like a great improvement over the existing code; thank you!
LGTM
This test passed in the prior CI run so it's
We have integration tests which assert the invariant that a GET /containers/{id}/json response lists only IDs of execs which are in the Running state, according to GET /exec/{id}/json. The invariant could be violated if those requests were to race the handling of the exec's task-exit event. The coarse-grained locking of the container ExecStore when starting an exec task was accidentally synchronizing (*Daemon).ProcessEvent and (*Daemon).ContainerExecInspect just enough to make it improbable for the integration tests to catch the invariant violation on execs which exit immediately. Removing the unnecessary locking made the underlying race condition more likely for the tests to hit.
Maintain the invariant by deleting the exec from its container's ExecCommands before clearing its Running flag. Additionally, fix other potential data races with execs by ensuring that the ExecConfig lock is held whenever a mutable field is read from or written to.
Signed-off-by: Cory Snider <csnider@mirantis.com>
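A sketch of the ordering described in the commit message, with hypothetical types: because the exec is deleted from the container's store before its Running flag is cleared, an inspect that can still find the exec ID will also still see it as running, preserving the invariant.

```go
package sketch

import "sync"

// execConfig is a minimal stand-in for the daemon's exec state.
type execConfig struct {
	mu      sync.Mutex
	Running bool
}

// execStore is a minimal stand-in for the container's ExecCommands store.
type execStore struct {
	mu    sync.Mutex
	execs map[string]*execConfig
}

func (s *execStore) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.execs, id)
}

// handleExecExit removes the exec from the container first, and only then
// clears Running under the exec's own lock.
func handleExecExit(store *execStore, id string, ec *execConfig) {
	store.Delete(id)

	ec.mu.Lock()
	ec.Running = false
	ec.mu.Unlock()
}
```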
containerID := container.Run(ctx, t, cli, opts...)
poll.WaitOn(t, container.IsInState(ctx, cli, containerID, "running"), poll.WithTimeout(30*time.Second), poll.WithDelay(100*time.Millisecond))
opts := append([]func(*container.TestContainerConfig){
	container.WithCmd("sh", "-c", "read -r; exit 99"),
😍 this is a much cleaner solution than `sleep`
Bombs away!
- What I did
While investigating a report of slow `docker exec`s and failing health checks on heavily-loaded systems, I instrumented `dockerd` with OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces of various container operations. I was surprised to see RPCs such as `containerd.services.containers.v1.Container/Get` being called multiple times in a row for nearly every operation. In the common case of a local containerd process, each RPC costs at least two context switches to complete, so we want to keep the number of RPCs per operation to a minimum to reduce the overhead of dockerd and containerd on the system. I overhauled libcontainerd and refactored how the daemon uses libcontainerd to do just that.
I modified the behavior of the `Container.State.OOMKilled` flag to update immediately (rather than on container exit) to simplify the code and reduce the amount of state libcontainerd needs to track, with the side benefit of the new behavior being more intuitive to users.
I benchmarked these changes against the base commit (bb88ff4) and measured a reduction in wallclock time for `docker exec` of 6.8% in my test environment.
- How I did it
The containerd client library is very chatty, generally erring on the side of refreshing metadata over referencing a locally-cached copy. (The containerd gRPC API has no concurrency control, so multiple clients could race to mutate the same resources. Aggressively refreshing metadata only shortens the race window; it does not eliminate it. Perhaps a future version of the containerd API could improve upon the situation so it can be less aggressive about refreshing. But I digress...) And further exacerbating the situation, the libcontainerd implementation is effectively forced to reload the containerd resources at the start of every operation because containers and processes are referenced by ID strings. To illustrate, consider the libcontainerd `Start()` method: in broad strokes, its implementation makes four calls to `containerd.services.containers.v1.Container/Get` in a row, all to get the exact same metadata! Two of those redundant calls can be easily eliminated by reusing the container metadata initially loaded by `c.client.LoadContainer`. And the `LoadContainer` call could also be elided if the container object returned by `c.client.NewContainer` when the container was created was retained. The number of RPCs to start a container can be cut in half without any changes to the containerd client! The containerd client resource objects need to be retained and reused to pull this off, which means more state for the Docker daemon to manage.
The Docker daemon already keeps track of state for every container and exec, with mappings from ID strings to the respective state objects. Holding persistent state in the libcontainerd layer with mappings from ID strings to state objects would be redundant, so I overhauled the libcontainerd interfaces to allow for implementations which hold no shared mutable state. Not coincidentally, this shape mirrors the containerd client's interfaces. I delegated retaining references to the libcontainerd objects to the daemon's container state and exec config structs. A rough sketch of the difference follows below.
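As an illustration only (a sketch using the containerd 1.x Go client, not code from this PR; exact signatures may vary between containerd versions):

```go
package sketch

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
)

// startByID resolves the container by string ID every time, paying a
// containers.Get RPC before the actual work can begin.
func startByID(ctx context.Context, client *containerd.Client, id string) (containerd.Task, error) {
	ctr, err := client.LoadContainer(ctx, id) // RPC: containers.Get
	if err != nil {
		return nil, err
	}
	tsk, err := ctr.NewTask(ctx, cio.NullIO)
	if err != nil {
		return nil, err
	}
	if err := tsk.Start(ctx); err != nil {
		return nil, err
	}
	return tsk, nil
}

// retained keeps the containerd.Container handle returned by
// NewContainer/LoadContainer so later operations reuse it instead of
// re-resolving the ID, avoiding the redundant Get RPCs.
type retained struct {
	ctr containerd.Container
}

func (r *retained) start(ctx context.Context) (containerd.Task, error) {
	tsk, err := r.ctr.NewTask(ctx, cio.NullIO)
	if err != nil {
		return nil, err
	}
	if err := tsk.Start(ctx); err != nil {
		return nil, err
	}
	return tsk, nil
}
```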
The local Windows (HCSv1) libcontainerd implementation was a bit of a challenge to refactor to fit the new shape of the libcontainerd interface, but it also benefited from eliminating shared mutable state.
- How to verify it
CI ✅
- Description for the changelog
The `State.OOMKilled` flag is now set to true immediately upon any container process getting OOM-killed by the kernel, and cleared to false when the container is restarted.
- A picture of a cute animal (not mandatory but encouraged)
