hubble: accurately report startup failure reason from cilium status #37567
That's a fantastic improvement for users troubleshooting Hubble issues 🚀
And thanks for creating more specific/actionable error messages as well!
486e8d1 to 038c75e
Thanks for the PR @devodev, overall LGTM. I'm slightly uncomfortable to have several "status signals" combined with atomic / threading. As this PR adds a new signal for Hubble startup status (i.e.
What do you think?
We still need the observer to retrieve the server status, right? But I guess we don't need it to be an atomic pointer anymore if we use the launchError pointer as a gate. I will make the change.
038c75e to e2498b6
I renamed launchError to launchWarning and tried to group all usage of the pointer under a new Finally I return the observer from @kaworu What do you think of these changes? |
Thanks for the update 🙏
> We still need observer to retrieve the server status right?
Ah yes, thanks for pointing it out, I completely missed it 👍
> But I guess we dont need it to be an atomic pointer anymore if we use the launchError pointer as gate.
I think your reasoning is correct but I'd prefer to keep it synchronized, it feels more future/bug proof and the overhead is negligible.
Sure I can put it back.
e2498b6 to bf73c02
The Hubble cell is a large subsystem of the Cilium daemon that can fail to start for many reasons. It is currently not considered a critical Cilium component, so a startup failure is only logged, which makes Hubble-related issues harder to troubleshoot. Update the launch and probe mechanisms to correctly record and report the failure reason so it can be accurately displayed in the cilium-cli and cilium-dbg status output.
Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
bf73c02 to 4451367
Thanks @devodev LGTM!
/test
Got a flake in
Description
The Hubble cell is a large subsystem of the Cilium daemon that can fail to start for many reasons. It is currently not considered a critical Cilium component, so a startup failure is only logged, which makes Hubble-related issues harder to troubleshoot.
This PR updates the launch and probe mechanisms to record and report the failure reason so it can be accurately displayed in the cilium-cli and cilium-dbg status output.
A follow-up to this PR will be to update the Cilium documentation to add a troubleshooting guide for Hubble.
Do note that more work is coming towards migrating Hubble sub-systems to cells, and hopefully this will allow us to delegate most of the health reporting to Hive.
Related: #37023
Example
Tested by misconfiguring metrics:
Before
cilium-cli status
cilium-dbg status
$ kubectl exec -n kube-system -c cilium-agent ds/cilium -- cilium status
...
Hubble: Warning Server not initialized
...
Modules Health: Stopped(0) Degraded(2) OK(81)
After:
cilium-cli status
cilium-dbg status