fix: docker inspect timeout by Takuka0311 · Pull Request #2269 · alibaba/loongcollector · GitHub

fix: docker inspect timeout #2269


Merged
merged 1 commit into from
Jun 26, 2025

Takuka0311 (Collaborator) commented Jun 24, 2025

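Problem description

This comes from a real production issue.

The operator running on the same node as loongcollector crashed, which triggered a bug in loongcollector's container discovery module: when loongcollector fetched the metadata of the crashed container, the call blocked indefinitely, so logs from containers updated afterwards were no longer collected. (Container discovery runs in its own goroutine, which reads and writes a cache map that holds container information. The docker/containerd requests and the internal cache map reads and writes are not guarded by the same lock, so the hang does not cause endless waiting on lock contention. It only stalls the subsequent periodic container discovery logic, so container information can no longer be updated.)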

Warning Unhealthy 9m50s (x5970 over 24d) kubelet Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 3m50s (x5971 over 24d) kubelet Readiness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Problem analysis

The pprof goroutine stack shows that it is stuck in ContainerInspect.

1 @ 0x7fc5a83fac96 0x7fc5a840b35c 0x7fc5a8707774 0x7fc5a86faf5a 0x7fc5a86e1e59 0x7fc5a86a1097 0x7fc5a86a08fb 0x7fc5a86a2d3c 0x7fc5a8b1d8aa 0x7fc5a8b1d894 0x7fc5a8b1d513 0x7fc5a8b029ce 0x7fc5a8b02972 0x7fc5a9570d4d 0x7fc5a9570626 0x7fc5a9564f7f 0x7fc5a9564f79 0x7fc5a956cacf 0x7fc5a842d6c1
# 0x7fc5a8707773 net/http.(*persistConn).roundTrip+0x973 /usr/local/go/src/net/http/transport.go:2620
# 0x7fc5a86faf59 net/http.(*Transport).roundTrip+0x7b9 /usr/local/go/src/net/http/transport.go:595
# 0x7fc5a86e1e58 net/http.(*Transport).RoundTrip+0x18 /usr/local/go/src/net/http/roundtrip.go:17
# 0x7fc5a86a1096 net/http.send+0x5f6 /usr/local/go/src/net/http/client.go:251
# 0x7fc5a86a08fa net/http.(*Client).send+0x9a /usr/local/go/src/net/http/client.go:175
# 0x7fc5a86a2d3b net/http.(*Client).do+0x8fb /usr/local/go/src/net/http/client.go:715
# 0x7fc5a8b1d8a9 net/http.(*Client).Do+0x169 /usr/local/go/src/net/http/client.go:581
# 0x7fc5a8b1d893 github.com/docker/docker/client.(*Client).doRequest+0x153 /opt/go/pkg/mod/github.com/docker/docker@v20.10.23+incompatible/client/request.go:125
# 0x7fc5a8b1d512 github.com/docker/docker/client.(*Client).sendRequest+0xd2 /opt/go/pkg/mod/github.com/docker/docker@v20.10.23+incompatible/client/request.go:113
# 0x7fc5a8b029cd github.com/docker/docker/client.(*Client).get+0x10d /opt/go/pkg/mod/github.com/docker/docker@v20.10.23+incompatible/client/request.go:37
# 0x7fc5a8b02971 github.com/docker/docker/client.(*Client).ContainerInspect+0xb1 /opt/go/pkg/mod/github.com/docker/docker@v20.10.23+incompatible/client/container_inspect.go:18
# 0x7fc5a9570d4c github.com/alibaba/ilogtail/pkg/helper.(*DockerCenter).fetchAll.func2+0xac /workspaces/alibaba_ilogtail/pkg/helper/docker_center.go:1053
# 0x7fc5a9570625 github.com/alibaba/ilogtail/pkg/helper.(*DockerCenter).fetchAll+0x785 /workspaces/alibaba_ilogtail/pkg/helper/docker_center.go:1055
# 0x7fc5a9564f7e github.com/alibaba/ilogtail/pkg/helper.(*ContainerDiscoverManager).fetchDocker+0x99e /workspaces/alibaba_ilogtail/pkg/helper/container_discover_controller.go:111
# 0x7fc5a9564f78 github.com/alibaba/ilogtail/pkg/helper.(*ContainerDiscoverManager).Init+0x998 /workspaces/alibaba_ilogtail/pkg/helper/container_discover_controller.go:255
# 0x7fc5a956cace github.com/alibaba/ilogtail/pkg/helper.getDockerCenterInstance.func1.1+0x6e /workspaces/alibaba_ilogtail/pkg/helper/docker_center.go:672
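
For reference, a goroutine dump in this `count @ addresses` format is what Go's standard `runtime/pprof` package writes at debug level 1. The sketch below shows one way to produce such a dump from inside a process. The helper name `goroutineDump` is hypothetical, and how the dump in this PR was actually collected is not stated.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump returns the current goroutine profile as text.
// Debug level 1 groups identical stacks and prints them as
// "<count> @ <pc> <pc> ..." followed by "# <pc> <func> <file:line>" lines.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := goroutineDump()
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Print(out)
}
```

A goroutine that appears in the same frame across dumps taken minutes apart, as with `ContainerInspect` here, is a strong hint that the call is blocked rather than slow.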

A Netflix article describes a similar problem and attributes it to a Linux kernel issue: https://netflixtechblog.com/debugging-a-fuse-deadlock-in-the-linux-kernel-c75cd7989b6d

Solution

The problem is fixed by adding a timeout to the ContainerInspect call.


This PR reviewed all docker and containerd calls and added a timeout to each of them, to avoid similar problems.

@Takuka0311 Takuka0311 merged commit b546004 into alibaba:main Jun 26, 2025
17 checks passed
xiongyunn pushed a commit to xiongyunn/loongcollector that referenced this pull request Jun 27, 2025
linrunqi08 pushed a commit that referenced this pull request Jul 1, 2025
3 participants