
hubble: accurately report startup failure reason from cilium status #37567


Merged: 1 commit merged into cilium:main from pr/devodev/hubble-report-launch-failure on Feb 18, 2025

Conversation

devodev (Contributor) commented Feb 11, 2025

Description

The Hubble cell is a large subsystem of the Cilium daemon that can fail to start for many reasons. At the moment, it is not considered a critical Cilium component and merely outputs logs whenever a startup issue occurs, which makes troubleshooting Hubble-related issues less intuitive than it could be.

This PR updates the launch and probe mechanisms to record and report the failure reason so it can be accurately displayed in the cilium-cli and cilium-dbg status output.

A follow-up to this PR will be to update the Cilium documentation to add a troubleshooting guide for Hubble.

Note that more work is underway to migrate Hubble sub-systems to cells; hopefully this will allow us to delegate most of the health reporting to Hive.

Related: #37023
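
For illustration, a minimal Go sketch of the idea: the integration records the startup failure so the status probe can surface the real reason instead of a generic message. The type and field names follow the review discussion below, but the signatures, the status strings, and the statusMessage helper are illustrative, not the actual Cilium code:

package hubble

import (
	"context"
	"sync/atomic"
)

// hubbleIntegration is a simplified stand-in for the real integration type.
type hubbleIntegration struct {
	// launchError holds the most recent startup failure, or nil once the
	// Hubble server has started successfully.
	launchError atomic.Pointer[error]
}

// Launch starts Hubble and records any startup failure instead of only
// logging it.
func (h *hubbleIntegration) Launch(ctx context.Context) {
	if err := h.launch(ctx); err != nil {
		h.launchError.Store(&err)
		return
	}
	h.launchError.Store(nil)
}

// statusMessage is what the status probe would surface to cilium-cli and
// cilium-dbg: the recorded failure reason rather than "Server not initialized".
func (h *hubbleIntegration) statusMessage() string {
	if warn := h.launchError.Load(); warn != nil {
		return "Warning " + (*warn).Error()
	}
	return "Ok"
}

// launch holds the actual startup logic (metrics setup, observer, servers),
// elided here.
func (h *hubbleIntegration) launch(ctx context.Context) error { return nil }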

Example

Tested by misconfiguring metrics:

hubble:
  metrics:
    enabled:
      - unknown-metric

Before

cilium-cli status


Warnings:    cilium    cilium-gdbmw    Hubble: Server not initialized

cilium-dbg status

$ kubectl exec -n kube-system -c cilium-agent ds/cilium -- cilium status
...
Hubble:                  Warning Server not initialized
...
Modules Health:          Stopped(0) Degraded(2) OK(81)

After

cilium-cli status


Warnings:    cilium    cilium-t6fwb    Hubble: failed to setup metrics: metric 'unknown-metric' does not exist

cilium-dbg status

$ kubectl exec -n kube-system -c cilium-agent ds/cilium -- cilium status
...
Hubble:                  Warning failed to setup metrics: metric 'unknown-metric' does not exist
...
Modules Health:          Stopped(0) Degraded(2) OK(79)

@devodev devodev requested a review from a team as a code owner February 11, 2025 23:46
@devodev devodev requested a review from rolinh February 11, 2025 23:46
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label Feb 11, 2025
@github-actions github-actions bot added the kind/community-contribution label Feb 11, 2025
@rolinh rolinh added the release-note/minor, sig/hubble, and area/hubble labels Feb 12, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label Feb 12, 2025
rolinh (Member) left a comment

That's a fantastic improvement for users troubleshooting Hubble issues 🚀
And thanks for creating more specific/actionable error messages as well!

kaworu (Member) commented Feb 12, 2025

Thanks for the PR @devodev, overall LGTM.

I'm slightly uncomfortable having several "status signals" combined with atomics / threading. Since this PR adds a new signal for Hubble startup status (i.e. hubbleIntegration.launchError), we could then clean up and remove the previous "signal" (i.e. hubbleIntegration.observer). For example:

  1. before calling launch() set h.launchError = errors.New("Starting")
  2. Use the return value from launch() to set h.launchError (as this PR already does when err != nil, but also do it when err == nil to "reset" the "Starting" error status)
  3. Remove hubbleIntegration.observer and usage.

What do you think?
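
A minimal Go sketch of the flow suggested in the steps above, reusing the simplified hubbleIntegration type from the Description sketch (the field names match the thread; the exact signatures are illustrative, and as discussed below the final PR keeps the observer field rather than removing it):

// Replaces the Launch method of the earlier sketch; additionally needs the
// "errors" package.
func (h *hubbleIntegration) Launch(ctx context.Context) {
	// Step 1: record a "Starting" sentinel before calling launch(), so the
	// status probe reports that Hubble is still starting up.
	starting := errors.New("Starting")
	h.launchError.Store(&starting)

	// Step 2: overwrite the sentinel with the result of launch(): a failure
	// is recorded, and success resets the error status to nil.
	if err := h.launch(ctx); err != nil {
		h.launchError.Store(&err)
		return
	}
	h.launchError.Store(nil)
}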

devodev (Contributor, Author) commented Feb 12, 2025

> Thanks for the PR @devodev, overall LGTM.
>
> I'm slightly uncomfortable having several "status signals" combined with atomics / threading. Since this PR adds a new signal for Hubble startup status (i.e. hubbleIntegration.launchError), we could then clean up and remove the previous "signal" (i.e. hubbleIntegration.observer). For example:
>
>   1. before calling launch() set h.launchError = errors.New("Starting")
>   2. Use the return value from launch() to set h.launchError (as this PR already does when err != nil, but also do it when err == nil to "reset" the "Starting" error status)
>   3. Remove hubbleIntegration.observer and usage.
>
> What do you think?

We still need observer to retrieve the server status, right? But I guess we don't need it to be an atomic pointer anymore if we use the launchError pointer as a gate. I will make the change.

@devodev devodev force-pushed the pr/devodev/hubble-report-launch-failure branch from 038c75e to e2498b6 on February 12, 2025 17:38
devodev (Contributor, Author) commented Feb 12, 2025

I renamed launchError to launchWarning and tried to group all usage of the pointer under a new Launch() method, which I also added to the HubbleIntegration interface. Even though we don't need it per se, I like that it documents what Hubble offers from the cell perspective. It's currently in the same package, but in theory it could live outside.

Finally, I return the observer from launch() so we can assign it close to where we clear launchWarning, making it explicit what the pointer gates.

@kaworu What do you think of these changes?
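
A rough Go sketch of the shape described above; the interface, method, and field names follow the comment, but the observer type and the signatures are illustrative stand-ins rather than the actual Cilium code:

// Illustrative types; not the actual Cilium definitions.
type observerServer struct{} // stand-in for the Hubble observer

type hubbleIntegration struct {
	launchWarning atomic.Pointer[error]
	observer      *observerServer
}

// HubbleIntegration documents, from the cell's perspective, what Hubble
// offers (simplified).
type HubbleIntegration interface {
	Launch(ctx context.Context)
}

// Launch groups all usage of the launchWarning pointer. launch() now also
// returns the observer so it can be assigned right where the warning is
// cleared, making explicit what the pointer gates.
func (h *hubbleIntegration) Launch(ctx context.Context) {
	obs, err := h.launch(ctx)
	if err != nil {
		h.launchWarning.Store(&err)
		return
	}
	h.observer = obs
	h.launchWarning.Store(nil)
}

// launch returns the started observer or a startup error; details elided.
func (h *hubbleIntegration) launch(ctx context.Context) (*observerServer, error) {
	return &observerServer{}, nil
}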

@rolinh rolinh requested review from kaworu and removed request for rolinh February 13, 2025 07:57
kaworu (Member) left a comment

Thanks for the update 🙏

> We still need observer to retrieve the server status, right?

Ah yes, thanks for pointing it out, I completely missed it 👍

> But I guess we don't need it to be an atomic pointer anymore if we use the launchError pointer as a gate.

I think your reasoning is correct, but I'd prefer to keep it synchronized; it feels more future-proof and bug-proof, and the overhead is negligible.

devodev (Contributor, Author) commented Feb 14, 2025

> Thanks for the update 🙏
>
> > We still need observer to retrieve the server status, right?
>
> Ah yes, thanks for pointing it out, I completely missed it 👍
>
> > But I guess we don't need it to be an atomic pointer anymore if we use the launchError pointer as a gate.
>
> I think your reasoning is correct, but I'd prefer to keep it synchronized; it feels more future-proof and bug-proof, and the overhead is negligible.

Sure I can put it back.

@devodev devodev force-pushed the pr/devodev/hubble-report-launch-failure branch from e2498b6 to bf73c02 on February 14, 2025 14:04
The Hubble cell is a large subsystem of the Cilium daemon that can
fail to start for many reasons. At the moment, it is not considered
a critical Cilium component and merely outputs logs whenever a
startup issue occurs, which makes troubleshooting Hubble-related
issues less intuitive than it could be.

Update the launch and probe mechanisms to correctly record and
report the failure reason so it can be accurately displayed in the
cilium-cli and cilium-dbg status output.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
@devodev devodev force-pushed the pr/devodev/hubble-report-launch-failure branch from bf73c02 to 4451367 on February 14, 2025 15:22
kaworu (Member) left a comment

Thanks @devodev LGTM!

kaworu (Member) commented Feb 17, 2025

/test

devodev (Contributor, Author) commented Feb 17, 2025

Got a flake in Cilium E2E Upgrade (ci-e2e-upgrade) 😿
If we can re-run that test, hopefully it'll go through @kaworu

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge label Feb 18, 2025
@kaworu kaworu added this pull request to the merge queue Feb 18, 2025
Merged via the queue into cilium:main with commit 54e084f Feb 18, 2025
64 checks passed
@pchaigno pchaigno added the needs-backport/1.17 label Mar 21, 2025
@viktor-kurchenko viktor-kurchenko mentioned this pull request Mar 26, 2025
@viktor-kurchenko viktor-kurchenko added the backport-pending/1.17 label and removed the needs-backport/1.17 label Mar 26, 2025
@github-actions github-actions bot added the backport-done/1.17 label and removed the backport-pending/1.17 label Mar 28, 2025
@devodev devodev mentioned this pull request May 31, 2025
Labels
area/hubble: Impacts hubble server or relay
backport-done/1.17: The backport for Cilium 1.17.x for this PR is done.
kind/community-contribution: This was a contribution made by a community member.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.
5 participants