8000 [1.15] daemon: Cleanup after a failed start by antonipp · Pull Request #36801 · cilium/cilium · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[1.15] daemon: Cleanup after a failed start #36801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

antonipp
Copy link
Contributor

Backport #32673 to 1.15

This required backporting this commit as well: f21f303

Fixes: #32647

cilium-agent: Fix crash due to skipped resource cleanup when agent is stopping due to failed start.

[ upstream commit f21f303 ]

Return the error from startDaemon through the start hook to abort
shutdown cleanly rather than using log.Fatal.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
[ upstream commit edc9cba ]

If the daemon start failed due to e.g. invalid configuration the cleanup was not
performed as it was only done from the stop function that was not invoked since
the daemon start had failed.

This in some situations lead to confusing crashes as hive did stop all the dependencies
of the legacy daemon, for example this crash was reported:

   level=fatal msg="failed to start: daemon creation failed: failed to detect devices: ...
   ... bunch of "stopping" messages omitted ...
   panic: runtime error: invalid memory address or nil pointer dereference
   [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1dfa041]
   goroutine 460 [running]:
     github.com/cilium/cilium/pkg/node.(*LocalNodeStore).Observe(0x3f4cfb0? ...
     	<autogenerated>:1 +0x21

The crash was caused by the LocalNodeStore stopping and clearing the global variable
in the node package (to make sure for example our tests don't leak data across them
and to make sure we do cleanup correctly).

To fix this, run the cleanup functions also when "newDaemon" or "startDaemon" fails
with an error.

Fixes: f21f303 ("daemon: bubble up error from startDaemon instead of fataling")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
@antonipp antonipp requested a review from a team as a code owner December 27, 2024 09:56
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Dec 27, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Dec 27, 2024
@antonipp antonipp mentioned this pull request Dec 27, 2024
3 tasks
@hemanthmalla
Copy link
Member

/test

Sorry, something went wrong.

@joestringer
Copy link
Member

Could you speak a bit about whether you can reproduce the issue you see and how much testing you have performed with this change? The bug being fixed here does not seem to be catastrophic or security-relevant (criteria), so I'm not sure whether applying this patch to v1.15 is worth the risk. Startup/shutdown ordering issues can be racy and hard to follow, so I'd be concerned about solving one issue then creating another different issue.

@antonipp
Copy link
Contributor Author
antonipp commented Jan 7, 2025

Yeah, tbh it's an intermittent issue that we've seen only a handful of times and I haven't managed to reliably reproduce it:
#32647 (comment)

Since I saw that the original PR had the "backport needed" label, I thought that I might as well do it but I agree with you that it's not a critical bug. So if you think it's not worth it, I am fine with closing the PR.

@christarazi
Copy link
Member

Yeah, tbh it's an intermittent issue that we've seen only a handful of times and I haven't managed to reliably reproduce it: #32647 (comment)

Since I saw that the original PR had the "backport needed" label, I thought that I might as well do it but I agree with you that it's not a critical bug. So if you think it's not worth it, I am fine with closing the PR.

@antonipp Given the backport criteria and your above statement, it seems like this PR doesn't meet the requirements. If you agree, feel free to close the PR.

@antonipp antonipp closed this Jan 20, 2025
@antonipp antonipp deleted the ai/backport-pr-32673 6D7B branch January 20, 2025 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. kind/community-contribution This was a contribution made by a community member.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0