Add zk error handling and logging #1762
Open
Summary
This PR improves error handling and observability for Keeper operations by surfacing the connection errors that occur during retries.
Previously, Keeper connection failures during retries were not visible, making it difficult to diagnose deployment issues.
Motivation
When deploying the operator in a separate namespace from the ClickHouse and Keeper clusters, we experienced a consistent ~10 minute delay for ClickHouse pods to start after Keeper pods were ready.
The root cause was that cross-namespace deployments require fully qualified DNS names (including namespace) in the ClickHouse configuration to properly resolve Keeper endpoints. However, the existing retry logic silently swallowed connection errors, making this configuration issue nearly impossible to diagnose.
This change surfaces these errors to help operators quickly identify and resolve similar DNS/connectivity issues.
Testing
Tested in a local kind cluster:
Before fix: With an incorrect DNS configuration, no error logs were visible despite connection failures. The only output was a warning about retrying the connection.
After fix: Clear error messages now show the specific connection failures, making the DNS misconfiguration immediately apparent.

Once the DNS configuration was corrected with fully qualified names, ClickHouse pods started immediately after Keeper became available 🎆.
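For reference, a fully qualified in-cluster Service name follows the Kubernetes convention `<service>.<namespace>.svc.<cluster-domain>`. The sketch below builds such a name. `keeperFQDN` is a hypothetical helper, the names `keeper` and `clickhouse` are placeholders, and `cluster.local` is assumed as the cluster domain (the usual default, which a given cluster may override).

```go
package main

import "fmt"

// keeperFQDN is a hypothetical helper that builds the fully qualified
// in-cluster DNS name of a Service. An empty clusterDomain falls back to
// "cluster.local", the common Kubernetes default.
func keeperFQDN(service, namespace, clusterDomain string) string {
	if clusterDomain == "" {
		clusterDomain = "cluster.local"
	}
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

func main() {
	// Placeholder service and namespace names.
	fmt.Println(keeperFQDN("keeper", "clickhouse", ""))
	// prints "keeper.clickhouse.svc.cluster.local"
}
```

A short name such as `keeper` typically resolves only from within the Service's own namespace, which is why the namespace-qualified form matters for cross-namespace deployments.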
From Altinity
Thanks for taking the time to contribute to clickhouse-operator!
Please read the instructions on how to make a Pull Request carefully.
This will help maintainers adopt your Pull Request.
Important items to consider before making a Pull Request
Please check the items the PR complies with:
next-release branch, not into master branch (1).
--
(1) If you feel your PR does not affect any Go code or any testable functionality (for example, the PR contains only docs or supplementary materials), the PR can be made into the master branch, but this has to be confirmed by the project's maintainer.