Many parallel calls to docker exec make Docker unresponsive #45595
Comments
You mention "since today"; was this working OK before on the same version of Docker Engine, or were you running a different version before today?
I did update Docker Engine today; it worked on previous versions. But I'm not sure what we updated, I just ran `apt-get upgrade`. I saw that 24.0.1 was released on May 19th, so I assumed it might have been that. We have been running these integration tests for two years and didn't have this problem before.
Thanks! Yes, mostly curious between which versions the change happened on your side. Based on the timeline, I expect that to be the 23.0 -> 24.0 upgrade; if you have a system to test on, you could try downgrading to the latest 23.0.x release (https://docs.docker.com/engine/install/ubuntu/#install-docker-engine - see the "specific version" tab in that section). I know there's been some refactoring that happened in that area between those releases.
/cc @corhere if you have immediate ideas.
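For reference, the "specific version" flow from those docs boils down to pinning the apt package versions; a rough sketch (the version string below is only an example, list the real ones on your system first):

```sh
# list the engine versions available from the apt repository
apt-cache madison docker-ce | awk '{ print $3 }'

# pick the latest 23.0.x string from that list, then install it,
# allowing apt to downgrade the already-installed packages
VERSION_STRING=5:23.0.6-1~ubuntu.22.04~jammy
sudo apt-get install --allow-downgrades \
  docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING \
  containerd.io docker-buildx-plugin docker-compose-plugin
```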
OK, I'll try it out with 23.0 and check if there's any difference. I'm pretty busy today, but I'll be able to check later today. For the downgrade, can I just uninstall Docker and then install 23.0? Also, FYI, I just checked it locally on my Windows machine; there everything works fine, but it's also on 20.10.21, so quite a bit older.
You should probably be able to use the steps outlined in that section, using the version string of the latest 23.0.x release.

I should mention that it's always best practice to verify upgrades/downgrades in a testing environment before rolling out to production, but from your description it looks like this is a testing environment (so "probably" no critical content stored in containers/volumes/etc) 😅

My initial thinking is that possibly the tmpfs mounts are a factor here. Also wondering if it would make a difference if the `docker exec` calls were run detached (`-d`).
Yeah, that's not on critical systems. It's just two dedicated servers that act as agents for the CI/CD system. We're not blocked, since we can still run everything on hosted runners or locally.

About the detached mode: would the exec still wait for the `cp` to finish before returning?
Ah, yes, it won't wait; or at least, the exec is started, but a detached exec returns without waiting for the command to complete.

Right, so your case is "roughly" multiple of these in parallel:

```
docker exec <do copy>
# wait for copy to complete
docker exec <do sql command>
```

I guess that matches your example? (I asked a colleague on my team to try and reproduce the issue, and to do a `git bisect` to find where the behaviour changed.)
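For context on the detach behaviour (a quick sketch; `mycontainer` is a placeholder name): a regular exec only returns once the command exits, while `-d` returns as soon as the exec is started:

```sh
# foreground exec: the CLI waits for the command to finish (~5s here)
time docker exec mycontainer sleep 5

# detached exec: the CLI returns immediately, the command keeps running
time docker exec -d mycontainer sleep 5
```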
Almost 😀 The only time we're calling `docker exec` is for these copy operations. And, we're running this on the Ubuntu servers described in the issue.

Roughly, the workflow is like this:
Points 4. and 6. (copying the seed files and attaching the new database) run per test, and the tests run in parallel. We can work around the issue by preventing parallel calls to the exec endpoint, serializing them with a semaphore.

And, to make it still a bit more complicated, we're not actually calling the `docker` CLI; we're talking to the Docker API from .NET. The copy step looks like this:

```csharp
public static async Task CopySeedDatabaseFilesForNewDatabase(string newDatabaseName,
    string dockerContainerId)
{
    var dockerClient = DockerContainerUtilities.GetDockerClient();
    try
    {
        // _semaphore limits how many execs may run concurrently
        await _semaphore.WaitAsync();
        // create and start an exec that copies the seed .mdf data file
        var dataFile = await dockerClient
            .Exec
            .ExecCreateContainerAsync(dockerContainerId,
                new ContainerExecCreateParameters
                {
                    Cmd = new List<string>
                    {
                        "cp",
                        $"/var/opt/mssql/data/{SqlServerDockerCollectionFixture.INITIAL_SEED_DATABASE_NAME}.mdf",
                        $"/var/opt/mssql/data/{newDatabaseName}.mdf"
                    }
                });
        await dockerClient.Exec.StartContainerExecAsync(dataFile.ID);
        // same again for the .ldf log file
        var logFile = await dockerClient
            .Exec
            .ExecCreateContainerAsync(dockerContainerId,
                new ContainerExecCreateParameters
                {
                    Cmd = new List<string>
                    {
                        "cp",
                        $"/var/opt/mssql/data/{SqlServerDockerCollectionFixture.INITIAL_SEED_DATABASE_NAME}_log.ldf",
                        $"/var/opt/mssql/data/{newDatabaseName}_log.ldf"
                    }
                });
        await dockerClient.Exec.StartContainerExecAsync(logFile.ID);
    }
    finally
    {
        _semaphore.Release();
    }
}
```

Above, the `_semaphore` is a `SemaphoreSlim` we added as a workaround; with it restricting us to a single concurrent exec, the problem goes away. The `ExecCreateContainerAsync` and `StartContainerExecAsync` calls map to the exec create and start endpoints of the Docker API.

But, I could also reproduce this on two servers with the repro steps from the issue, where I just called `docker exec` directly, so it's not specific to this client.
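The same serialize-the-execs workaround can also be sketched from a shell, for anyone hitting this with plain `docker exec`; here using `flock(1)` on an arbitrary lock file so only one exec is in flight at a time:

```sh
for i in $(seq 1 50); do
  # flock queues the parallel jobs on /tmp/exec.lock, so the
  # docker exec calls run one at a time instead of concurrently
  flock /tmp/exec.lock docker exec test-container sh -c "touch $i.txt" &
done
wait
```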
Right, I've been looking into this (thanks for the ping @thaJeztah). I haven't been able to start bisecting yet, but the gist so far is: there's no special setup needed to reproduce. If I just run

```sh
for i in {1..50}
do
  docker exec test-container sh -c "touch $i.txt" &
done
```

then I see the symptoms described by @GeorgDangl. If I check:

```
$ ps aux | grep exec
...
...
laurabr+ 23843 0.0 0.7 1105448 22648 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 22.txt
laurabr+ 23844 0.0 0.7 1105448 23220 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 23.txt
laurabr+ 23845 0.0 0.7 1179240 22748 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 24.txt
laurabr+ 23846 0.0 0.7 1179496 23596 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 25.txt
laurabr+ 23847 0.0 0.7 1179496 23624 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 26.txt
laurabr+ 23848 0.0 0.7 1105192 23380 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 27.txt
laurabr+ 23849 0.0 0.7 1253288 23416 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 28.txt
laurabr+ 23850 0.0 0.7 1179496 22460 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 29.txt
laurabr+ 23851 0.0 0.7 1179240 22136 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 30.txt
laurabr+ 23852 0.0 0.7 1179496 22400 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 31.txt
laurabr+ 23853 0.0 0.7 1253288 22808 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 32.txt
laurabr+ 23854 0.0 0.7 1179240 23452 pts/3 Sl 13:12 0:00 docker exec test-container sh -c touch 33.txt
...
```

We have a lot of hanging `docker exec` client processes. I tried killing them and still see the issue, so maybe having 50+ clients connected isn't the issue (of course, even with 1 client we could be messing up and leaving a ton of open connections).
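As a side note on the client angle: execs can also be driven from a single client process over the socket, with no `docker` CLI involved, using the raw API (a sketch; assumes `curl` and `jq`, and the v1.43 API from the report):

```sh
# create an exec instance for the container
ID=$(curl -s --unix-socket /var/run/docker.sock \
  -H 'Content-Type: application/json' \
  -d '{"Cmd":["true"]}' \
  http://localhost/v1.43/containers/test-container/exec | jq -r .Id)

# start it; the response only returns once the process has exited
curl -s --unix-socket /var/run/docker.sock \
  -H 'Content-Type: application/json' \
  -d '{"Detach":false,"Tty":false}' \
  "http://localhost/v1.43/exec/$ID/start"
```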
Using the following reproducer, the behaviour I observed is that when many execs are started concurrently, all the execs take an unreasonable amount of time to run on v24.0.1 but not v23.0.6.

```
$ docker run --rm -it --privileged docker:24.0.1-dind sh
# unset DOCKER_HOST
# dockerd -D &
# docker run --rm -d --name pause1 alpine sleep infinity
# time sh -c 'myjobs=""; for i in $(seq 1 50); do docker exec pause1 true & myjobs="$myjobs $!"; done; for pid in $myjobs; do wait $pid; done'
```

The inflection point for execs to start taking an unreasonable amount of time appears to be somewhere around 16-20 concurrent execs on my machine, though not consistently. At 20 concurrent execs I am seeing times between 1s and 7s, and at 50 execs it takes between 45 seconds and a minute.

The debug logs are very interesting, as well. The first few execs appear to complete their lifecycles almost instantaneously. Then at some point there is a flood of log messages for the remaining execs all at once.
Thanks! Ah, yes, in hindsight we probably should've started the Slack thread in the maintainers channel, not our internal one.
On a hunch, I instrumented moby/libcontainerd/remote/client.go (lines 299 to 307 at 463850e) and found that when many `TaskService().Start()` requests are in-flight concurrently for the same task, containerd takes an excessive amount of time to handle them all. That is also consistent with b752462 being the first commit the issue could be reproduced with, as that is the commit which removed the locking that had effectively serialized all exec requests for a particular container.
Yes, we came to the "previously they were (either intentionally or unintentionally) serialized" conclusion as well. But that was just before I had to head out, so I hadn't checked my Slack yet; it looks like @laurazard found more details (all credits to her for doing the bisecting / digging).
That's good news, so I wasn't imagining it 😀 I hope it's fine if I don't try it out on older versions, since it seems like you've already been able to find when the behavior changed. I'd be happy to help if there's anything else I can do.
Discussing in the maintainers call: while the containerd bug has to be looked into (no fix / root cause / ETA yet), we can add a temporary (famous last words), but this time deliberate, serialisation back.
Thank you guys for the quick resolution 😀
@corhere We have recently switched from Docker 23 to Docker 26 on our platform, and we observe a similar problem of a `docker exec` command hanging indefinitely. Docker version:
@dtronche does your version of containerd include containerd/containerd#9828? If not, that could be causing hangs.

Edit: containerd/containerd#9828 was included in containerd
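A quick way to check which containerd version is in use (a sketch; the `--format` template is my assumption of a convenient form, plain `docker version` shows the same information):

```sh
# version of the containerd binary on the host
containerd --version

# component versions as reported by the engine
docker version --format '{{range .Server.Components}}{{.Name}}: {{.Version}}{{"\n"}}{{end}}'
```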
Thank you, the update of containerd seems to have solved the issue. Is there a minimum version of docker-engine required to work with it?
Description
Since today, we've experienced a problem where our CI system could no longer run our tests properly.
After a bit of investigation, we found a way to work around it, but not the root cause. This happens on two independent servers, one on Ubuntu 22.04, the other on 18.04. Both are running Docker 24.0.1.
Essentially, we're launching an image of SQL Server using a tmpfs volume for the data files. Before the tests, we create a database with seed data. Then, for every test run, we run `docker exec` to copy the data (and log) files and attach them as a new database. This runs in parallel without any concurrency limit, and has worked before.

Now we found that when doing this, all Docker commands become unresponsive after a short time. E.g. `docker stats` no longer refreshes correctly and shows only dashes (`--`) instead of the actual information. Other Docker commands, e.g. `docker stop`, take extremely long (if they finish at all). The tests usually run in 10 minutes; now they take several hours, with most tests failing due to a timeout exception when calling Docker commands from within the test suite.

Reproduce
1. Start the SQL Server container:

   ```
   docker run -e ACCEPT_EULA=Y -e "SA_PASSWORD=yourStrong(#)Password" \
     --mount type=tmpfs,destination=/var/opt/mssql/data \
     --mount type=tmpfs,destination=/var/opt/mssql/log \
     --mount type=tmpfs,destination=/var/opt/mssql/secrets \
     dangl/mssql-tmpfs:latest
   ```

2. Run multiple `cp` operations via `docker exec` in parallel (`sh concurrent_exec.sh`; a sketch of such a script follows below), and watch the output of `docker stats` in another session.

3. `docker stats` will freeze for a time (but eventually recover). It takes longer the more parallel operations I do.

The actual behavior depends on how many parallel calls I'm making. E.g. with `5`, there's no (noticeable) interruption; with `50`, it already blocks for around a minute.
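The `concurrent_exec.sh` script itself isn't included above; a minimal sketch consistent with the description (container name and file names are placeholders):

```sh
#!/bin/sh
# run 50 cp operations through docker exec, all in parallel
CONTAINER=mssql   # placeholder for the container started in step 1
for i in $(seq 1 50); do
  docker exec "$CONTAINER" cp \
    /var/opt/mssql/data/seed.mdf \
    "/var/opt/mssql/data/copy_$i.mdf" &
done
wait   # block until every parallel exec has returned
```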
Expected behavior

No response
docker version

```
Client: Docker Engine - Community
 Version:           24.0.1
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        6802122
 Built:             Fri May 19 18:06:21 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.1
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       463850e
  Built:            Fri May 19 18:06:21 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
docker info
Additional Info
I've tried it out on a fresh server and had the same behavior there. I'd be happy to assist with any further details, but I would need some guidance on what is relevant and how to get it.