8000 Add zk error handling and logging by wilkermichael · Pull Request #1762 · Altinity/clickhouse-operator · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add zk error handling and logging #1762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: 0.25.2
Choose a base branch
from

Conversation

wilkermichael
Copy link
@wilkermichael wilkermichael commented Jul 3, 2025

Summary

This PR improves error handling and observability for Keeper operations by:

  • Surfacing Keeper errors that were previously silently ignored
  • Combining retry attempts and logging all errors together to reduce log noise
  • Adding comprehensive tests for Keeper connections

Previously, Keeper connection failures during retries were not visible, making it difficult to diagnose deployment issues.

Motivation

When deploying the operator in a separate namespace from the ClickHouse and Keeper clusters, we experienced a consistent ~10 minute delay for ClickHouse pods to start after Keeper pods were ready.

The root cause was that cross-namespace deployments require fully qualified DNS names (including namespace) in the ClickHouse configuration to properly resolve Keeper endpoints. However, the existing retry logic silently swallowed connection errors, making this configuration issue nearly impossible to diagnose.

This change surfaces these errors to help operators quickly identify and resolve similar DNS/connectivity issues.

Testing

Tested in a local kind cluster:

Before fix: With incorrect DNS configuration, no error logs were visible despite connection failures. It would just log a warning about retrying the connection.

After fix: Clear error messages now appear showing the specific connection failures, making the DNS misconfiguration immediately apparent
Screenshot 2025-07-03 at 10 57 02 AM

Once the DNS configuration was corrected with fully qualified names, ClickHouse pods started immediately after Keeper became available 🎆.

From Altinity

Thanks for taking the time to contribute to clickhouse-operator!

Please, read carefully instructions on how to make a Pull Request.

This will help a lot for maintainers to adopt your Pull Request.

Important items to consider before making a Pull Request

Please check items PR complies to:

  • All commits in the PR are squashed. More info
  • The PR is made into dedicated next-release branch, not into master branch1. More info
  • The PR is signed. More info

--

1 If you feel your PR does not affect any Go-code or any testable functionality (for example, PR contains docs only or supplementary materials), PR can be made into master branch, but it has to be confirmed by project's maintainer.

@wilkermichael wilkermichael marked this pull request as ready for review July 3, 2025 18:33
@wilkermichael
Copy link
Author
wilkermichael commented Jul 3, 2025

Please let me know if this is the right branch to merge into, I can fix which branch I branched off of

@wilkermichael wilkermichael force-pushed the mw/add-more-zk-error-handling branch from 6a11858 to 9970359 Compare July 3, 2025 18:44
@wilkermichael wilkermichael changed the title Mw/add more zk error handling Add zk error handling and logging Jul 3, 2025
@wilkermichael wilkermichael force-pushed the mw/add-more-zk-error-handling branch from 9970359 to 910be59 Compare July 3, 2025 18:45
- mock out zk connection for testing
- surface errors that were being swallowed up/ignored
- make retry delay configurable (useful for testing)
- add logging to pathmanager when checking for path existence

Signed-off-by: Michael Wilkerson <wilker.michael@gmail.com>
@wilkermichael wilkermichael force-pushed the mw/add-more-zk-error-handling branch from 910be59 to d49362b Compare July 3, 2025 18:45
@sunsingerus sunsingerus changed the base branch from 0.25.1 to 0.25.2 July 4, 2025 07:45
@sunsingerus
Copy link
Collaborator

PR is quite big, will take a look later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0