Intermittent "unexpected EOF" while downloading container layers when built with go 1.24 #49513


Open
berolinux opened this issue Feb 21, 2025 · 115 comments
Labels
kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/27.0 version/28.0

Comments

@berolinux

Description

Since updating to 28.0.0, I'm getting a lot of "unexpected EOF" errors when bringing up a series of containers.
Unfortunately this seems to happen at random, so there's no reliable reproducer. The connection is fast and stable, so a timeout seems unlikely to be involved.
I've seen it happening both with docker pull and docker compose up while pulling layers.

Setting "max-download-attempts": 5000 (or even more ridiculous values) in /etc/docker/daemon.json doesn't fix it; chances are that since the error is something other than connection timed out or so, docker doesn't recognize this as a download failure and therefore doesn't make another download attempt.

[+] Running 12/14
 ⠙ dashboard [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 6.277MB / 6.277MB Pulling                                                                                                                                                       6.1s 
   ✔ 804c8aba2cc6 Already exists                                                                                                                                                                             0.0s 
   ✔ 2ae710cd8bfe Already exists                                                                                                                                                                             0.0s 
   ✔ d462aa345367 Already exists                                                                                                                                                                             0.0s 
   ✔ 0f8b424aa0b9 Already exists                                                                                                                                                                             0.0s 
   ✔ d557676654e5 Already exists                                                                                                                                                                             0.0s 
   ✔ c8022d07192e Already exists                                                                                                                                                                             0.0s 
   ✔ d858cbc252ad Already exists                                                                                                                                                                             0.0s 
   ✔ 1069fc2daed1 Already exists                                                                                                                                                                             0.0s 
   ✔ b40161cd83fc Pull complete                                                                                                                                                                              0.5s 
   ✔ 5318d93a3a65 Pull complete                                                                                                                                                                              0.5s 
   ✔ 307c1adadb60 Pull complete                                                                                                                                                                              3.1s 
   ⠧ 258b5bb46f9a Extracting      [==================================================>]  5.791MB/5.791MB                                                                                                     4.8s 
   ✔ 51fea5d3cd54 Download complete                                                                                                                                                                          2.2s 
unexpected EOF
exit status 1

Simply running the same docker pull or docker compose command again "fixes" it most of the time (and when it doesn't, surely running it a third or fourth time does).

Reproduce

  1. "docker pull" a container, preferrable one with many layers
  2. If it succeeds, try again, at some point it will result in "unexpected EOF"
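A rough loop to automate step 2 (purely illustrative; any large multi-layer image will do -- matrixdotorg/synapse is one it happened with for me):

# keep removing and re-pulling an image with many layers until the daemon
# drops the connection and the client prints "unexpected EOF"
while docker pull matrixdotorg/synapse; do
    docker image rm matrixdotorg/synapse > /dev/null
done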

Expected behavior

it works

docker version

Client:
 Version:           28.0.0
 API version:       1.48
 Go version:        go1.24.0
 Git commit:        
 Built:             Thu Feb 20 22:16:09 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          28.0.0
  API version:      1.48 (minimum version 1.24)
  Go version:       go1.24.0
  Git commit:       b0f5bc3
  Built:            Thu Feb 20 22:15:42 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          2.0.2
  GitCommit:        .m
 runc:
  Version:          1.20 [crun]
  GitCommit:        1.20-1
 docker-init:
  Version:          0.19.0
  GitCommit:

docker info

Client:
 Version:    28.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  2.33.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 9
  Running: 9
  Paused: 0
  Stopped: 0
 Images: 10
 Server Version: 28.0.0
 Storage Driver: btrfs
  Btrfs: 
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: .m
 runc version: 1.20-1 [crun]
 init version: 
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.13.3-server-1omv2590
 Operating System: OpenMandriva Lx 25.90 (Nickel) Cooker
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.26GiB
 Name: updatetestamd.federatedcomputer.net
 ID: 5419e1b8-b71b-4351-9772-dc2646c8cd39
 Docker Root Dir: /federated/docker
 Debug Mode: false
 Username: *******
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

No response

@berolinux berolinux added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/0-triage labels Feb 21, 2025
@thaJeztah thaJeztah changed the title Intermittent "unexpected EOF" while downloading container layers 28.0: Intermittent "unexpected EOF" while downloading container layers Feb 21, 2025
@thaJeztah
Member

Thanks for reporting; unfortunately the EOF error on its own does not provide much information on what could be happening here; the EOF is printed by the client (compose in your example), and effectively means that the connection with the Docker Engine API was (unexpectedly?) disconnected.

❓ 👉 Are you able to obtain logs from the daemon when this happens? Ideally with the daemon running with debug enabled (which logs requests together with other events), so that we can correlate what happened with which request; but if something went really wrong, the logs would likely contain some information in either case.
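For completeness, enabling debug is a matter of adding the following to /etc/docker/daemon.json and restarting the daemon (minimal sketch; keep whatever settings you already have):

{
  "debug": true
}

On a systemd-based system the resulting debug-level entries should then show up in journalctl -u docker.service.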

@thaJeztah
Member
thaJeztah commented Feb 21, 2025

I should point out a couple of things I noticed in your docker info and docker version output; I suspect you're using a package of Docker built by the OpenMandriva packagers; I see that package contains a containerd version built from a "dirty" (uncommitted modifications) copy of the source code, and uses crun (instead of runc) as the runtime, which also may have been built from modified source (1.20-1 is not a tag that exists in upstream crun; https://github.com/containers/crun/tags)

 containerd:
  Version:          2.0.2
  GitCommit:        .m
 runc:
  Version:          1.20 [crun]
  GitCommit:        1.20-1

The docker daemon source commit looks to have been overridden, because b0f5bc3 is a commit from 4 years ago for a 20.10 release of docker (#42352)

  Git commit:       b0f5bc3

All of that makes it possible that OpenMandriva is shipping modifications to the source code that could be relevant; we should definitely check the daemon logs to narrow down what's happening, but it might be worth also reporting this to the OpenMandriva packagers.

@berolinux
Author

I am the OpenMandriva packager -- this error is happening while testing if our updated package is ready to go out to end users (FWIW the status is "almost ready" -- it works perfectly once the containers are installed, the seemingly random crashes while unpacking layers are the only problem).

We aren't currently applying any patches. The wrong git commit being listed is indeed an oversight (good catch!): we usually build from release tarballs (which obviously don't have a commit ID that git can see at build time); some time ago the build wouldn't work without a commit, so we told it what commit it was and then forgot to remove the workaround (or to keep the commit ID up to date). Fixing that, but it seems to be unrelated to the problem. We also aren't applying any patches to containerd; it too thinks it is "dirty modified" because it's built from a release tarball and therefore doesn't see the commit ID. We have both crun and runc, with crun being used by default because it tends to give slightly better performance. Switching to runc doesn't affect the problem.

The unexpected EOF looks like it is caused by dockerd crashing while untarring a layer. I see this in the logs when the problem happens:

Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.303578642Z" level=debug msg="Using /usr/bin/unpigz to decompress"
Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.333383654Z" level=debug msg="Start untar layer" id=8646ce26e1f6bf4c35c93d24a5d3bdc0bfdd7f631abec8bb7ab76b2b40d0a28d
Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.334790762Z" level=debug msg="Untar time: 0.001409903s" id=8646ce26e1f6bf4c35c93d24a5d3bdc0bfdd7f631abec8bb7ab76b2b40d0a28d
Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.334818234Z" level=debug msg="Applied tar sha256:af5aa97ebe6ce1604747ec1e21af7136ded391bcabe4acef882e718a87c86bcc to 8646ce26e1f6bf4c35>
Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.343126270Z" level=debug msg="Using /usr/bin/unpigz to decompress"
Feb 21 12:08:10 eagle.fedcom.net dockerd[19978]: time="2025-02-21T12:08:10.348255743Z" level=debug msg="Start untar layer" id=57e15de4c3447d30f703eab3c1c9e8ceb45d11d7a0a7784a3b5f39fb61a33ea4
Feb 21 12:08:10 eagle.fedcom.net systemd-coredump[21873]: Process 19978 (dockerd) of user 0 terminated abnormally with signal 11/SEGV, processing...

Unfortunately the backtrace doesn't look very useful (at least to me); it looks like a crash during memory allocation, with no indicator of what is being allocated.

(lldb) target create "/usr/bin/dockerd" --core "/var/tmp/coredump-qxcSvq"
Core file '/var/tmp/coredump-qxcSvq' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'dockerd', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000563381cd1c58 dockerd`runtime.mallocgcSmallNoscan + 216
    frame #1: 0x0000563381d30fd9 dockerd`runtime.mallocgc + 185
    frame #2: 0x0000563381d36429 dockerd`runtime.growslice + 1481
    frame #3: 0x0000563381d2d916 dockerd`runtime.vgetrandomPutState + 86
    frame #4: 0x0000563381d011e5 dockerd`runtime.mexit + 453
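(Side note for anyone else digging into a core like this: a Go-aware debugger will usually give more readable frames than lldb. Assuming delve is installed, something along these lines should work -- paths as in my setup above:

dlv core /usr/bin/dockerd /var/tmp/coredump-qxcSvq
(dlv) bt

Alternatively, running the daemon with GOTRACEBACK=crash should make the Go runtime print stacks for all goroutines, including runtime frames, before it aborts.)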

@thaJeztah
Member

Ah! Sorry, didn't notice that 🙈 - I've been bitten by cases where packaging changes were relevant and very subtly broke things (most recently we had issues with the Debian packages, due to breaking changes in one of our dependencies that happened to be updated in their packaging pipeline).

Thanks for trying to get more info; it's unfortunately indeed not providing a lot to go on. I do see things seem to go bad around the invocation of unpigz to handle the extraction; would you be able to check whether disabling unpigz for the extraction makes the problem go away? It expects a boolean(ish) value, so MOBY_DISABLE_PIGZ=true should do it, but it must be set on the daemon process (not the CLI);

if noPigzEnv := os.Getenv("MOBY_DISABLE_PIGZ"); noPigzEnv != "" {
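On a systemd-managed install, a drop-in along these lines should set it for the daemon (the file name/path is just an example); then run systemctl daemon-reload and systemctl restart docker:

# /etc/systemd/system/docker.service.d/disable-pigz.conf
[Service]
Environment=MOBY_DISABLE_PIGZ=true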

@berolinux
Author

Disabling pigz makes the log output slightly different, but doesn't make the problem go away.

Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.405986884Z" level=debug msg="Downloaded 307c1adadb60 to tempfile /federated/docker/tmp/GetImageBlob665704286"
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.406065041Z" level=debug msg="pulling blob \"sha256:51fea5d3cd54704d3753cc644859696c65747cb71fb3306f974d0d61fc4d0501>
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.413025019Z" level=debug msg="Downloaded b40161cd83fc to tempfile /federated/docker/tmp/GetImageBlob243858367"
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.413102564Z" level=debug msg="Use of pigz is disabled due to MOBY_DISABLE_PIGZ=true"
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.419317837Z" level=debug msg="Start untar layer" id=2a8a320384aa2b0a59d94e9e600a48abb2a7041ab3f21482edd0ad219a0de488
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.420289546Z" level=debug msg="Untar time: 0.000970238s" id=2a8a320384aa2b0a59d94e9e600a48abb2a7041ab3f21482edd0ad219>
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.420558190Z" level=debug msg="Applied tar sha256:1a73b54f556b477f0a8b939d13c504a3b4f4db71f7a09c63afbc10acb3de5849 to>
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.428565009Z" level=debug msg="Use of pigz is disabled due to MOBY_DISABLE_PIGZ=true"
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.435952919Z" level=debug msg="Start untar layer" id=6845bcb32e0bc1deb98c86557c9cb08273ec81a50c9dd28185d02073b70a26e9
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.437294602Z" level=debug msg="Untar time: 0.00134532s" id=6845bcb32e0bc1deb98c86557c9cb08273ec81a50c9dd28185d02073b7>
Feb 21 14:29:59 updatetestamd.federatedcomputer.net dockerd[26190]: time="2025-02-21T14:29:59.437326573Z" level=debug msg="Applied tar sha256:c048279a7d9f8e94b4c022b699ad8e8a0cb08b717b014ce4af15afaf375a6ac2 to>
Feb 21 14:29:59 updatetestamd.federatedcomputer.net systemd-coredump[30664]: Process 26190 (dockerd) of user 0 terminated abnormally with signal 11/SEGV, processing...

The backtrace remains the same too.

I'm starting to suspect go 1.24 may have something to do with this -- I've started bisecting this and unless I messed something up, if I rebuild the known good 27.5.0 package in today's environment, it starts showing the same breakage. The main difference between the build environment when the known good package was built and today's build environment is a go update from 1.23.something to 1.24. Will run some more checks on that.

@thaJeztah
Member

Oh! I missed that you're building with go1.24; yes, I've seen cases in other repositories/projects where go1.24 broke things. We are currently still on go1.23 (we usually wait before upgrading to the latest Go, as 9 times out of 10 we run into subtle regressions in areas where the Go maintainers didn't expect things to be used).

@thaJeztah
Member

I have a draft PR that was used to do initial testing with go1.24; we'll likely be updating our master/main branch once we have the most urgent v28.0.0 kinks fixed, but currently our master/main (and release branch) is still on go1.23;

@thaJeztah thaJeztah changed the title 28.0: Intermittent "unexpected EOF" while downloading container layers 28.0: Intermittent "unexpected EOF" while downloading container layers (SEGFAULT with go1.24?) Feb 21, 2025
@berolinux berolinux changed the title 28.0: Intermittent "unexpected EOF" while downloading container layers (SEGFAULT with go1.24?) 28.0: Intermittent "unexpected EOF" while downloading container layers when built with go 1.24 Feb 21, 2025
@berolinux
Author

I've confirmed that the problem with 28.0.0 goes away if I rebuild it with go 1.23.6 -- so this is definitely a go bug or a bad use of go APIs that is no longer possible with 1.24

@thaJeztah
Member

Thanks! Hm.. so now the challenge is to find indeed where the problem lies.

Looking to see if I can spot anything suspicious; the last log line around that issue is Applied tar, which comes from here:

moby/layer/layer_store.go

Lines 220 to 253 in 459686b

func (ls *layerStore) applyTar(tx *fileMetadataTransaction, ts io.Reader, parent string, layer *roLayer) error {
    tsw, err := tx.TarSplitWriter(true)
    if err != nil {
        return err
    }
    metaPacker := storage.NewJSONPacker(tsw)
    defer tsw.Close()
    digester := digest.Canonical.Digester()
    tr := io.TeeReader(ts, digester.Hash())
    // we're passing nil here for the file putter, because the ApplyDiff will
    // handle the extraction of the archive
    rdr, err := asm.NewInputTarStream(tr, metaPacker, nil)
    if err != nil {
        return err
    }
    applySize, err := ls.driver.ApplyDiff(layer.cacheID, parent, rdr)
    // discard trailing data but ensure metadata is picked up to reconstruct stream
    // unconditionally call io.Copy here before checking err to ensure the resources
    // allocated by NewInputTarStream above are always released
    io.Copy(io.Discard, rdr) // ignore error as reader may be closed
    if err != nil {
        return err
    }
    layer.size = applySize
    layer.diffID = DiffID(digester.Digest())
    log.G(context.TODO()).Debugf("Applied tar %s to %s, size: %d", layer.diffID, layer.cacheID, applySize)
    return nil
}

That code also involves github.com/vbatts/tar-split, which was updated from v0.11.5 to v0.11.6 in the v28.0 release in 1ef5957;

https://github.com/moby/moby/blob/v27.5.1/vendor.mod#L95
https://github.com/moby/moby/blob/v28.0.0/vendor.mod#L96

Diff; vbatts/tar-split@v0.11.5...v0.11.6

This commit looks to be in the asm package that's used there;
vbatts/tar-split@99c8914

I did notice that we're not on the latest version of github.com/vbatts/tar-split; the current release is v0.12.1, but I have not yet dug into that diff:
vbatts/tar-split@v0.11.6...v0.12.1

@berolinux
Author

I can reproduce the "unexpected EOF" issue if I rebuild 27.5.1 with go 1.24.0 (I initially thought this was a 28.0 regression because we updated go between the release of 27.5.1 and 28.0.0 -- so the bug happened to appear with the 28.0 update -- but it's actually present in 27.5.1 too), so the github.com/vbatts/tar-split 0.11.5->0.11.6 update can be ruled out as a cause.

@thaJeztah
Member

Thank you! That's a useful datapoint; much appreciated.

@thaJeztah thaJeztah changed the title 28.0: Intermittent "unexpected EOF" while downloading container layers when built with go 1.24 Intermittent "unexpected EOF" while downloading container layers when built with go 1.24 Feb 21, 2025
@vvoland
Contributor
vvoland commented Feb 24, 2025

FWIW, I tested 28.0.0 built with go1.24 on Archlinux and didn't experience this.

Does it happen with all images?

@leo9800
leo9800 commented Feb 25, 2025

I experienced exactly the same issue with docker 28.0.0 shipped in the Arch Linux official repository, which is believed to be built with go 1.24.0, as the 'last update' of go==2:1.24.0-1 was 2025-02-11 21:30 UTC and that of docker==1:28.0.0-1 was 2025-02-24 22:06 UTC.

reference:

docker version outputs

Client:
 Version:           28.0.0
 API version:       1.48
 Go version:        go1.24.0
 Git commit:        f9ced58158
 Built:             Mon Feb 24 21:55:48 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          28.0.0
  API version:      1.48 (minimum version 1.24)
  Go version:       go1.24.0
  Git commit:       af898abe44
  Built:            Mon Feb 24 21:55:48 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v2.0.2
  GitCommit:        c507a0257ea6462fbd6f5ba4f5c74facb04021f4.m
 runc:
  Version:          1.2.5
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@leo9800
leo9800 commented Feb 25, 2025

Besides, when the docker daemon (dockerd) segfaults, it seems containerd terminates all running containers, and a manual restart is required after the docker daemon comes back online.

@vvoland
Contributor
vvoland commented Feb 25, 2025

@leo9800 Can you provide the example image that this happens with? docker info could also be useful.

@berolinux
Author

It seems to happen more or less at random (which is why just running the pull/compose command on the same container a few times in a row usually "fixes" it).
I've seen it happening with matrixdotorg/synapse at least a few times.

@leo9800
leo9800 commented Feb 26, 2025

@vvoland

@leo9800 Can you provide the example image that this happens with? docker info could also be useful.

Virtually any image can cause this problem; there's no stable reproduction.

For example, I upgraded the image ollama/ollama:latest on host A and dockerd segfaulted, but after dockerd restarted, upgrading nats:alpine did not cause any problem.

Meanwhile on another host B (a virtual machine mimicking the environment of host A), pulling ollama/ollama:latest was fine, but then pulling nats:alpine resulted in a dockerd segfault.

After restoring a snapshot of host B, rebooting, and pulling ollama and nats again in the same sequence, both succeeded without any error.

Besides, when docker pull fails, it always fails while extracting, not while downloading; also, docker pull of an already up-to-date image never fails, probably because no extraction (decompression of layers) is triggered.

I'll try a chroot rebuild of the docker==28.0.0-1 Arch Linux package with go 1.23 as the build dependency later and see if this issue is gone, which would double-check what @berolinux did.

I've confirmed that the problem with 28.0.0 goes away if I rebuild it with go 1.23.6 -- so this is definitely a go bug or a bad use of go APIs that is no longer possible with 1.24


@vvoland
Contributor
vvoland commented Feb 26, 2025

@leo9800 are you on a btrfs graphdriver too?

From the docker info output provided by @berolinux:

 Storage Driver: btrfs

@nightah
nightah commented Apr 2, 2025

@sipsma ironically, the below commit is what I was testing overnight in my latest bisect, and I was unable to reproduce the crash over 12 hours of continual testing.

  1. golang/go@eb6f2c2 is the culprit

I'll try with your minimal reproduction too, though here's where I'm at so far with two more bisect steps to go:

git bisect start
# status: waiting for both good and bad commits
# bad: [3901409b5d0fb7c85a3e6730a59943cc93b2835c] [release-branch.go1.24] go1.24.0
git bisect bad 3901409b5d0fb7c85a3e6730a59943cc93b2835c
# good: [6885bad7dd86880be6929c02085e5c7a67ff2887] [release-branch.go1.23] go1.23.0
git bisect good 6885bad7dd86880be6929c02085e5c7a67ff2887
# good: [5e8a7316658c2f300a375041b6e0a606fec4c5f2] README: fix CC BY license name
git bisect good 5e8a7316658c2f300a375041b6e0a606fec4c5f2
# bad: [87a89fa45130d4406fa4d9f0882b9c5014240d03] runtime: add the checkPtraceScope to skip certain tests
git bisect bad 87a89fa45130d4406fa4d9f0882b9c5014240d03
# good: [1b5ae45181ef5274045b9b93ae0603ebb34fa811] os/user: User.GroupIds shouldn't error on users with no groups
git bisect good 1b5ae45181ef5274045b9b93ae0603ebb34fa811
# good: [ba42120723a8bb4161c4f54c93f7ab3234923473] runtime: properly compute whether PC is inside vDSO pages
git bisect good ba42120723a8bb4161c4f54c93f7ab3234923473
# bad: [0733682e5ff4cd294f5eccb31cbe87a543147bc6] internal/runtime/maps: initial swiss table map implementation
git bisect bad 0733682e5ff4cd294f5eccb31cbe87a543147bc6
# bad: [e86982c515ba4a494fb1f8e1367f4238a2b59c2e] encoding/json: add omitzero option
git bisect bad e86982c515ba4a494fb1f8e1367f4238a2b59c2e
# bad: [6536c207c2309da7c1c21e3669f8ddf491e31f5b] net: improve GODEBUG=netdns=1 debug messages
git bisect bad 6536c207c2309da7c1c21e3669f8ddf491e31f5b
# good: [5fe3b31cf898de6fbc4f8ac524e16238a9a85e66] go/types, types2: remove dependency on Scope.Contains in resolver
git bisect good 5fe3b31cf898de6fbc4f8ac524e16238a9a85e66
# good: [6ad3933e285b036137a339f598f00a21578fcbfb] go/types, types2: move go/types-only Scope methods into scopes2.go
git bisect good 6ad3933e285b036137a339f598f00a21578fcbfb
# good: [eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13] runtime: use vDSO for getrandom() on linux
git bisect good eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13

@sipsma
Contributor
sipsma commented Apr 2, 2025

@sipsma ironically, the below commit is what I was testing overnight in my latest bisect, and I was unable to reproduce the crash over 12 hours of continual testing.

It wouldn't shock me if that vdso change by itself made this "possible but rare" and then golang/go@8678196 (that your git bisect hit) made it "possible and common" since it further changed some of the parameters relevant to when growslice would get called (mildly-educated guess, take with salt).

I'll try with your minimal reproduction too, though here's where I'm at so far with two more bisect steps to go:

Thanks! @Doridian's comment here is possibly relevant too: #49513 (comment), may want to limit processes in order for it to get hit more consistently, depending on your hardware.
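For anyone who wants to poke at this without pulling images: the minimal repro isn't pasted in this thread, but a rough sketch of the kind of stress loop that exercises the same path (crypto/rand going through the vDSO getrandom on OS threads that then exit) might look like the following -- treat it as an illustration under those assumptions, not the actual reproducer:

package main

import (
    "crypto/rand"
    "fmt"
    "runtime"
    "sync"
)

func main() {
    for iter := 0; ; iter++ {
        var wg sync.WaitGroup
        for i := 0; i < 64; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Locking without a matching Unlock makes the runtime destroy
                // this OS thread when the goroutine returns (runtime.mexit),
                // which is where the backtraces above end up.
                runtime.LockOSThread()
                buf := make([]byte, 32)
                // On Linux with go1.24 this goes through the vDSO getrandom path.
                if _, err := rand.Read(buf); err != nil {
                    panic(err)
                }
            }()
        }
        wg.Wait()
        if iter%1000 == 0 {
            fmt.Println("iteration", iter)
        }
    }
}

Limiting the CPUs available to the process may make the thread churn, and thus the crash, more likely, as suggested above.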

@thaJeztah
Member

OHMAN, Y'all make me happy here! Thanks everyone here for helping out on this one 🤞 hope that's indeed gonna help fix the issue, but this looks REALLY hopeful!

@nightah
nightah commented Apr 2, 2025

It wouldn't shock me if that vdso change by itself made this "possible but rare" and then golang/go@8678196 (that your git bisect hit) made it "possible and common" since it further changed some of the parameters relevant to when growslice would get called (mildly-educated guess, take with salt).

I've just managed to reproduce it with Docker against eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13 (it seems it was very rare here using Docker pulls as the reproduction; however, there could be other influences in play, like what @Doridian has suggested).

Updated the bisect below, I'll close this off by testing both with Docker and the minimal repro from here on.

git bisect start
# status: waiting for both good and bad commits
# bad: [3901409b5d0fb7c85a3e6730a59943cc93b2835c] [release-branch.go1.24] go1.24.0
git bisect bad 3901409b5d0fb7c85a3e6730a59943cc93b2835c
# good: [6885bad7dd86880be6929c02085e5c7a67ff2887] [release-branch.go1.23] go1.23.0
git bisect good 6885bad7dd86880be6929c02085e5c7a67ff2887
# good: [5e8a7316658c2f300a375041b6e0a606fec4c5f2] README: fix CC BY license name
git bisect good 5e8a7316658c2f300a375041b6e0a606fec4c5f2
# bad: [87a89fa45130d4406fa4d9f0882b9c5014240d03] runtime: add the checkPtraceScope to skip certain tests
git bisect bad 87a89fa45130d4406fa4d9f0882b9c5014240d03
# good: [1b5ae45181ef5274045b9b93ae0603ebb34fa811] os/user: User.GroupIds shouldn't error on users with no groups
git bisect good 1b5ae45181ef5274045b9b93ae0603ebb34fa811
# good: [ba42120723a8bb4161c4f54c93f7ab3234923473] runtime: properly compute whether PC is inside vDSO pages
git bisect good ba42120723a8bb4161c4f54c93f7ab3234923473
# bad: [0733682e5ff4cd294f5eccb31cbe87a543147bc6] internal/runtime/maps: initial swiss table map implementation
git bisect bad 0733682e5ff4cd294f5eccb31cbe87a543147bc6
# bad: [e86982c515ba4a494fb1f8e1367f4238a2b59c2e] encoding/json: add omitzero option
git bisect bad e86982c515ba4a494fb1f8e1367f4238a2b59c2e
# bad: [6536c207c2309da7c1c21e3669f8ddf491e31f5b] net: improve GODEBUG=netdns=1 debug messages
git bisect bad 6536c207c2309da7c1c21e3669f8ddf491e31f5b
# good: [5fe3b31cf898de6fbc4f8ac524e16238a9a85e66] go/types, types2: remove dependency on Scope.Contains in resolver
git bisect good 5fe3b31cf898de6fbc4f8ac524e16238a9a85e66
# good: [6ad3933e285b036137a339f598f00a21578fcbfb] go/types, types2: move go/types-only Scope methods into scopes2.go
git bisect good 6ad3933e285b036137a339f598f00a21578fcbfb
# bad: [eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13] runtime: use vDSO for getrandom() on linux
git bisect bad eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13

@nightah
nightah commented Apr 2, 2025

I'm fairly confident that @sipsma is correct.

I had to do a local replace of x/sys as I couldn't get the minimal repro to build with Go<1.24 due to a build constraint specifically for Go >=1.24 on the vgetrandom call.

Anything before golang/go@eb6f2c2 will obviously fail to compile.

Also, the mildly educated guess, which is tied to what I initially thought to be the culprit commit, makes sense too.
It certainly seems like that performance increase amplifies the problem, as it's much more commonly reproducible from that commit onwards, just with Docker pulls.

Final bisect below:

git bisect start
# status: waiting for both good and bad commits
# bad: [3901409b5d0fb7c85a3e6730a59943cc93b2835c] [release-branch.go1.24] go1.24.0
git bisect bad 3901409b5d0fb7c85a3e6730a59943cc93b2835c
# good: [6885bad7dd86880be6929c02085e5c7a67ff2887] [release-branch.go1.23] go1.23.0
git bisect good 6885bad7dd86880be6929c02085e5c7a67ff2887
# good: [5e8a7316658c2f300a375041b6e0a606fec4c5f2] README: fix CC BY license name
git bisect good 5e8a7316658c2f300a375041b6e0a606fec4c5f2
# bad: [87a89fa45130d4406fa4d9f0882b9c5014240d03] runtime: add the checkPtraceScope to skip certain tests
git bisect bad 87a89fa45130d4406fa4d9f0882b9c5014240d03
# good: [1b5ae45181ef5274045b9b93ae0603ebb34fa811] os/user: User.GroupIds shouldn't error on users with no groups
git bisect good 1b5ae45181ef5274045b9b93ae0603ebb34fa811
# good: [ba42120723a8bb4161c4f54c93f7ab3234923473] runtime: properly compute whether PC is inside vDSO pages
git bisect good ba42120723a8bb4161c4f54c93f7ab3234923473
# bad: [0733682e5ff4cd294f5eccb31cbe87a543147bc6] internal/runtime/maps: initial swiss table map implementation
git bisect bad 0733682e5ff4cd294f5eccb31cbe87a543147bc6
# bad: [e86982c515ba4a494fb1f8e1367f4238a2b59c2e] encoding/json: add omitzero option
git bisect bad e86982c515ba4a494fb1f8e1367f4238a2b59c2e
# bad: [6536c207c2309da7c1c21e3669f8ddf491e31f5b] net: improve GODEBUG=netdns=1 debug messages
git bisect bad 6536c207c2309da7c1c21e3669f8ddf491e31f5b
# good: [5fe3b31cf898de6fbc4f8ac524e16238a9a85e66] go/types, types2: remove dependency on Scope.Contains in resolver
git bisect good 5fe3b31cf898de6fbc4f8ac524e16238a9a85e66
# good: [6ad3933e285b036137a339f598f00a21578fcbfb] go/types, types2: move go/types-only Scope methods into scopes2.go
git bisect good 6ad3933e285b036137a339f598f00a21578fcbfb
# bad: [eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13] runtime: use vDSO for getrandom() on linux
git bisect bad eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13
# good: [677b6cc17544e5e667d4bb67d063f5d775c69e32] test: simplify issue 69434 test
git bisect good 677b6cc17544e5e667d4bb67d063f5d775c69e32
# first bad commit: [eb6f2c24cd17c0ca1df7e343f8d9187eef7d6e13] runtime: use vDSO for getrandom() on linux

@Foxboron
Foxboron commented Apr 3, 2025

Arch will push a patched go package and rebuild docker in a few hours.

@Foxboron
Foxboron commented Apr 3, 2025

Rebuilt docker and dagger packages, and a patched go, are in the Arch [extra-testing] repository. The reproducer fails to reproduce the crash with the patched version, as expected.

λ go » time go run .
signal: segmentation fault (core dumped)
go run .  1.21s user 0.14s system 514% cpu 0.262 total
λ go » sudo pacman -U /var/cache/pacman/pkg/go-2:1.24.1-2-x86_64.pkg.tar.zst
[...snip...]
λ go » time go run .
^Csignal: interrupt
go run .  72.98s user 0.22s system 1808% cpu 4.046 total

@leo9800
leo9800 commented Apr 5, 2025

Feedback: applied docker=1:28.0.4-2 (rebuilt with patched golang) on 2 Arch Linux based servers for several days and no more unexpected EOF errors have occurred. ;-)

@danieletorelli

Unfortunately it just happened again to me while pulling an image on an Aarch64 machine running ArchLinux ARM with docker 28.0.4-2.

@mib1982
mib1982 commented Apr 15, 2025

Unfortunately it just happened again to me while pulling an image on an Aarch64 machine running ArchLinux ARM with docker 28.0.4-2.

Happens to me quite frequently as well on the same setup. Can we be sure that docker 28.0.4-2 on Aarch64 is actually built against a patched go-version? If not, this would explain why we're still experiencing this problem.

@ironsmile

One can verify which version of Go was used to build a particular binary by using go version, if they have the Go toolchain installed:

go version $(which dockerd)

Will give 28.0.4-2 a try some time later.

@danieletorelli
danieletorelli commented Apr 15, 2025

One can verify which version of Go was used to build a particular binary by using go version, if they have the Go toolchain installed:

go version $(which dockerd)

Will give 28.0.4-2 a try some time later.

$ go version $(which dockerd)
/usr/bin/dockerd: go1.24.1
$ docker version
Client:
 Version:           28.0.4
 API version:       1.48
 Go version:        go1.24.1
 Git commit:        b8034c0ed7
 Built:             Thu Apr  3 23:49:16 2025
 OS/Arch:           linux/arm64
 Context:           default

Server:
 Engine:
  Version:          28.0.4
  API version:      1.48 (minimum version 1.24)
  Go version:       go1.24.1
  Git commit:       6430e49a55
  Built:            Thu Apr  3 23:49:16 2025
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          v2.0.4
  GitCommit:        1a43cb6a1035441f9aca8f5666a9b3ef9e70ab20.m
 runc:
  Version:          1.2.6
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@Altycoder

Seems to be working for me at the moment; the next release of Immich will be the acid test, as that's what failed previously. I'll report back later this week, as they tend to run on a weekly/bi-weekly release schedule.

Many of my other images/containers have been pulling fine so far.

@Foxboron

Happens to me quite frequently as well on the same setup. Can we be sure that docker 28.0.4-2 on Aarch64 is actually built against a patched go-version? If not, this would explain why we're still experiencing this problem.

ALARM is not Arch, so I would not bet on that being the case. You can check by running:

λ ~ » bsdtar -xOf /var/cache/pacman/pkg/docker-1:28.0.4-2-*zst .BUILDINFO | grep "go-2"
installed = go-2:1.24.1-2-x86_64

or something similar.

@danieletorelli
danieletorelli commented Apr 15, 2025

Happens to me quite frequently as well on the same setup. Can we be sure that docker 28.0.4-2 on Aarch64 is actually built against a patched go-version? If not, this would explain why we're still experiencing this problem.

ALARM is not Arch, so I would not bet on that being the case. You can check by running.

λ ~ » bsdtar -xOf /var/cache/pacman/pkg/docker-1:28.0.4-2-*zst .BUILDINFO | grep "go-2"
installed = go-2:1.24.1-2-x86_64

or something similar.

$ bsdtar -xOf /var/cache/pacman/pkg/docker-1\:28.0.4-2-aarch64.pkg.tar.xz .BUILDINFO | grep "go-2"
installed = go-2:1.24.1-1-aarch64

@Foxboron

go-2:1.24.1-1-aarch64 means it's not rebuilt. So this is not an upstream issue per se.

@Xyaren
Xyaren commented Apr 27, 2025

Not sure if it's totally related, but I experience crashes of dockerd as well when pulling some images (not 100% sure which ones exactly).
This unfortunately also causes all containers to stop.

Apr 27 13:01:17 lx0001 dockerd[1295]: time="2025-04-27T13:01:17.375138705+02:00" level=debug msg="Applied tar sha256:705a020407cee71d8469fb3ad868f679f1130574cc657ba1f30c6c709ea709ef to 50be82696951304dd4aeb29d6f80a46a6a92eb7e654824d81e>
Apr 27 13:01:17 lx0001 dockerd[1295]: time="2025-04-27T13:01:17.426652168+02:00" level=debug msg="Using /usr/bin/unpigz to decompress"
Apr 27 13:01:17 lx0001 dockerd[1295]: time="2025-04-27T13:01:17.446758524+02:00" level=debug msg="Applying tar in /var/lib/docker/overlay2/7db906385d1c811f0909d4307fa4e8ddbffaf21a5890fb72fb617bb1999a2e22/diff" storage-driver=overlay2
Apr 27 13:01:18 lx0001 systemd[1]: docker.service: Main process exited, code=dumped, status=11/SEGV
Apr 27 13:01:18 lx0001 systemd[1]: docker.service: Failed with result 'core-dump'.
Apr 27 13:01:18 lx0001 systemd[1]: docker.service: Consumed 1min 11.435s CPU time, 999M memory peak, 8.4M memory swap peak.
Apr 27 13:01:18 lx0001 bash[11531]: unexpected EOF
Apr 27 13:01:18 lx0001 systemd[1]: docker.service: Scheduled restart job immediately on client request, restart counter is at 1.
Apr 27 13:01:18 lx0001 systemd[1]: Starting docker.service - Docker Application Container Engine...
Apr 27 13:01:18 lx0001 dockerd[39517]: time="2025-04-27T13:01:18.837408099+02:00" level=info msg="Starting up"
Apr 27 13:01:18 lx0001 dockerd[39517]: time="2025-04-27T13:01:18.837458617+02:00" level=warning msg="Running experimental build"

Docker Info

Client:
 Version:    27.5.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.35.1
    Path:     /usr/local/lib/docker/cli-plugins/docker-compose

Server:
 Containers: 56
  Running: 24
  Paused: 0
  Stopped: 32
 Images: 52
 Server Version: 27.5.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.14.0-15-generic
 Operating System: Ubuntu 25.04
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.56GiB
 Name: lx0001
 ID: 2dab5412-cfed-4ba0-b5aa-751cdba4a48b
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 341
  Goroutines: 254
  System Time: 2025-04-27T13:14:00.216965029+02:00
  EventsListeners: 1
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 10.1.0.0/16, Size: 24
   Base: fd00:1337:ffff::/64, Size: 64
   Base: fd00:1337:ffff:1::/64, Size: 64
   Base: fd00:1337:ffff:2::/64, Size: 64
   Base: fd00:1337:ffff:3::/64, Size: 64
   Base: fd00:1337:ffff:4::/64, Size: 64
   Base: fd00:1337:ffff:5::/64, Size: 64
   Base: fd00:1337:ffff:6::/64, Size: 64
   Base: fd00:1337:ffff:7::/64, Size: 64
   Base: fd00:1337:ffff:8::/64, Size: 64
   Base: fd00:1337:ffff:9::/64, Size: 64
   Base: fd00:1337:ffff:a::/64, Size: 64
   Base: fd00:1337:ffff:b::/64, Size: 64
   Base: fd00:1337:ffff:c::/64, Size: 64
   Base: fd00:1337:ffff:d::/64, Size: 64
   Base: fd00:1337:ffff:e::/64, Size: 64
   Base: fd00:1337:ffff:f::/64, Size: 64
   Base: fd00:1337:ffff:10::/64, Size: 64
   Base: fd00:1337:ffff:11::/64, Size: 64
   Base: fd00:1337:ffff:12::/64, Size: 64
   Base: fd00:1337:ffff:13::/64, Size: 64
   Base: fd00:1337:ffff:14::/64, Size: 64
   Base: fd00:1337:ffff:15::/64, Size: 64
   Base: fd00:1337:ffff:16::/64, Size: 64
   Base: fd00:1337:ffff:17::/64, Size: 64
   Base: fd00:1337:ffff:18::/64, Size: 64
   Base: fd00:1337:ffff:19::/64, Size: 64
   Base: fd00:1337:ffff:1a::/64, Size: 64
   Base: fd00:1337:ffff:1b::/64, Size: 64
   Base: fd00:1337:ffff:1c::/64, Size: 64
   Base: fd00:1337:ffff:1d::/64, Size: 64
   Base: fd00:1337:ffff:1e::/64, Size: 64
   Base: fd00:1337:ffff:1f::/64, Size: 64

Re-running the command after docker restarted works.

The kernel log shows segfaults when the service crashes:

(Collection of segfaults during the last days)

Apr 25 09:52:57 lx0001 kernel: sh[2798344]: segfault at 7ffd272aaf08 ip 00007dbde51223ad sp 00007ffd272aaf00 error 6 in ld-musl-x86_64.so.1[4f3ad,7dbde50e7000+4c000] likely on CPU 2 (core 2, socket 0)
Apr 25 10:55:39 lx0001 kernel: sh[128816]: segfault at 7ffdbe4f9ff8 ip 000077db734079e4 sp 00007ffdbe4fa000 error 6 in ld-musl-x86_64.so.1[4e9e4,77db733cd000+4c000] likely on CPU 1 (core 1, socket 0)
Apr 27 12:48:43 lx0001 kernel: sh[1294496]: segfault at 7ffe6f3b6f28 ip 000076dce56783ad sp 00007ffe6f3b6f20 error 6 in ld-musl-x86_64.so.1[4f3ad,76dce563d000+4c000] likely on CPU 1 (core 1, socket 0)
Apr 27 12:51:40 lx0001 kernel: sh[1300594]: segfault at 7ffc9a24dfe0 ip 00007020b68569f8 sp 00007ffc9a24dfe0 error 6 in ld-musl-x86_64.so.1[4e9f8,7020b681c000+4c000] likely on CPU 0 (core 0, socket 0)
Apr 27 13:01:20 lx0001 kernel: sh[3354]: segfault at 7fffaefb7f28 ip 00007416cd9ce3ad sp 00007fffaefb7f20 error 6 in ld-musl-x86_64.so.1[4f3ad,7416cd993000+4c000] likely on CPU 0 (core 0, socket 0)
Apr 27 13:22:16 lx0001 kernel: sh[43851]: segfault at 7ffc00869f90 ip 00007988909ce9f8 sp 00007ffc00869f90 error 6 in ld-musl-x86_64.so.1[4e9f8,798890994000+4c000] likely on CPU 3 (core 3, socket 0)

Edit: Filed a bug on the Ubuntu docker.io package: https://bugs.launchpad.net/ubuntu/+source/docker.io/+bug/2109499

@partizanna
partizanna commented Apr 28, 2025

Does it only happen when pulling or in other cases too?

For those who are able to reproduce: could you enable "debug": true in your daemon.json and post your daemon log after it crashes?

Also, what CPU does your system have? (starting to wonder if it's some optimization the Go compiler performs that produces weird results on some microarchitectures)

FWIW, I tested it on multiple machines and wasn't able to reproduce:

  • Ryzen 5800X + Archlinux + official Docker 28.0.1
  • Ryzen 5800X + Archlinux + custom Docker 28.0.0 (before it was packaged officially by arch)
  • Apple M1 + Docker Desktop
  • Intel N100 minipc + Debian with Docker Engine static packages (built directly from this repository)

My system is running a Ryzen 5 3600 with around 40 different docker images; it happened randomly on any of the container pull extractions. The system basically acts as a home server with dockerized services like Nextcloud, Mailcow, Immich, Jellyfin, Paperless, Vaultwarden, Homeassistant and many more... So I rolled back to the 27.5.1-1 package at the time, as I could not afford outages of essential services...

But at that time it occasionally also happened on my desktop, which uses a Ryzen 7 5800X3D, though I don't use docker there often.

I just updated the home server to docker 28.1.1-1 and the latest packages, also updated the whole system, and set docker debug to true; I'll report back if I have issues during updates.

Update:

Updated all the available images; so far no crashes. The new package also seems to use Go version 1.24.2:

Client:
 Version:           28.1.1
 API version:       1.49
 Go version:        go1.24.2
 Git commit:        4eba377327
 Built:             Mon Apr 21 13:12:23 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          28.1.1
  API version:      1.49 (minimum version 1.24)
  Go version:       go1.24.2
  Git commit:       01f442b84d
  Built:            Mon Apr 21 13:12:23 2025
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v2.0.5
  GitCommit:        fb4c30d4ede3531652d86197bf3fc9515e5276d9.m
 runc:
  Version:          1.2.6
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@thaJeztah thaJeztah marked this as a duplicate of #49886 Apr 28, 2025
@sebrhex
sebrhex commented May 12, 2025

Is there any indication (apart from the packages existing in e.g. Arch Linux) whether it's safe to upgrade to Docker 28 on amd64? As far as I understand the comments, there were some users who still experienced the EOF?

@englut
englut commented May 15, 2025

I just downgraded to 27.5.1 again after noticing one of my containers was intermittently throwing this error again.

@kolyshkin
Contributor
kolyshkin commented May 15, 2025

To anyone who wants to add a comment saying they still have this bug -- check that your Docker is compiled with go1.24.3 (by running e.g. docker version | grep Go).

If you see a Go 1.24 version older than go1.24.3, file a bug with your distro vendor.

@middleagedman

I think Arch Linux is the exception. They are at 1.24.2 but they patched it in.

@englut
englut commented May 15, 2025

I think Arch Linux is the exception. They are at 1.24.2 but they patched it in.

I'm on Arch Linux, and still seeing the issue (latest docker version, and it's built with 1.24.2). I've commented over there to see if the maintainer can update the build.

@middleagedman
middleagedman commented May 15, 2025

@englut Really? Hmm I had the issue but haven't had one since. I'm on docker 1:28.1.1-1
See https://bbs.archlinux.org/viewtopic.php?pid=2230042#p2230042

@Altycoder

Same here, I'm on Arch and I had the issue but since the patch it's solved it for me.

According to docker version I'm on Docker 28.1.1 with Go 1.24.2.
