fix: docker inspect timeout #2269
Merged
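Problem description
This comes from a real-world issue.
An operator running on the same node as loongcollector crashed, which triggered a bug in loongcollector's container discovery module: when loongcollector fetched the metadata of the crashed container, the call blocked indefinitely, so logs from containers updated afterwards were no longer collected. (Container discovery runs in its own goroutine, which reads and writes a cache map that holds container information. The docker/containerd requests and the internal cache map reads and writes do not share a lock, so the hang does not turn into an endless wait on lock contention; it only stalls the subsequent periodic container discovery, so container information can no longer be updated.)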
Problem analysis
The pprof stack shows the goroutine stuck in ContainerInspect.
A Netflix article describes a similar problem and attributes it to a Linux kernel issue: https://netflixtechblog.com/debugging-a-fuse-deadlock-in-the-linux-kernel-c75cd7989b6d
Solution
The problem is solved by adding a timeout to ContainerInspect.
This PR also reviews all docker and containerd calls and adds a timeout to each of them, to avoid similar problems.