8000 hubble/relay: Fix certificate reloading in PeerManager by glrf · Pull Request #31376 · cilium/cilium · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

hubble/relay: Fix certificate reloading in PeerManager #31376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 19, 2024

Conversation

glrf
Copy link
Contributor
@glrf glrf commented Mar 13, 2024

This change fixes an issue with the PeerManager that can lead to Relay being unable to reconnect to some or all peers when the client certificate expires or the Certificate Authority is replaced.

Before this change, when the client certificate changes, we did not redial or update the exiting gRPC ClientConns. When the old certificate becomes invalid, (expiring, changed CA, or revoked) The connection will eventually fail with a certificate error.
However, the gRPC ClientConn is not closed, but treats the certificate error as a transient failure and will retry connecting with the old credentials indefinitely.
In most cases this will cause the relay health checks to fail. Relay will restart and successfully reconnect to all peers. However, if a new peer joins between the certificate being updated and the connections failing, Relay may keep on running in a degraded state.

This issue was introduced by #28595. Before that change, Relay aggressively closed and re-dialed ClientConns on any error, mitigating this problem.

We fix this issue by wrapping the provided gRPC transport credentials and updating the TLS configuration whenever a new TLS connection is established. This means every TLS connection will use up-to-date certificates and gRPC ClientConns will be able to recover when their certificate changes.

Fixes: aca4d42 ("hubble/relay: Fix connection leak when reconnecting to peer service")

Please ensure your pull request adheres to the following guidelines:

  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.

@maintainer-s-little-helper maintainer-s-little-helper bot added the 8000 dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 13, 2024
@glrf glrf force-pushed the hubble-relay/fix-cert-reload branch from d0f9ea8 to a4a2d10 Compare March 14, 2024 08:43
@glrf glrf force-pushed the hubble-relay/fix-cert-reload branch from a4a2d10 to b98794b Compare March 14, 2024 09:04
@glrf glrf marked this pull request as ready for review March 14, 2024 09:21
@glrf glrf requested a review from a team as a code owner March 14, 2024 09:21
@glrf glrf requested a review from kaworu March 14, 2024 09:21
@rolinh rolinh added sig/hubble release-note/bug This PR fixes an issue in a previous release of Cilium. labels Mar 15, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 15, 2024
@glrf glrf force-pushed the hubble-relay/fix-cert-reload branch from b98794b to 143d5a5 Compare March 15, 2024 09:40
@glrf
Copy link
Contributor Author
glrf commented Mar 15, 2024

/test

Copy link
Member
@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rolinh rolinh added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Mar 15, 2024
This change fixes an issue with the PeerManager that can lead to Relay
being unable to reconnect to some or all peers when the client
certificate expires or the Certificate Authority is replaced.

Before this change, when the client certificate changes, we did not
redial or update the exiting gRPC ClientConns. When the old certificate
becomes invalid, (expiring, changed CA, or revoked) The connection will
eventually fail with a certificate error.
However, the gRPC ClientConn is not closed, but treats the certificate
error as a transient failure and will retry connecting with the old
credentials indefinitely.
In most cases this will cause the relay health checks to fail. Relay
will restart and successfully reconnect to all peers. However, if a new
peer joins between the certificate being updated and the connections
failing, Relay may keep on running in a degraded state.

This issue was introduced by cilium#28595. Before that change, Relay
aggressively closed and re-dialed ClientConns on any error, mitigating
this problem.

We fix this issue by wrapping the provided gRPC transport credentials
and updating the TLS configuration whenever a new TLS connection is
established. This means every TLS connection will use up-to-date
certificates and gRPC ClientConns will be able to recover when their
certificate changes.

Fixes: aca4d42 ("hubble/relay: Fix connection leak when reconnecting to peer service")

Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
@glrf
Copy link
Contributor Author
glrf commented Mar 19, 2024

Hit flake that should be fixed by #31381 Rebasing on main

@glrf glrf force-pushed the hubble-relay/fix-cert-reload branch from 143d5a5 to b2035f6 Compare March 19, 2024 14:39
@glrf
Copy link
Contributor Author
glrf commented Mar 19, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 19, 2024
@aditighag aditighag added this pull request to the merge queue Mar 19, 2024
Merged via the queue into cilium:main with commit 7162892 Mar 19, 2024
@sayboras sayboras mentioned this pull request Mar 24, 2024
6 tasks
@sayboras sayboras added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 24, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

6 participants
0