Description
I ran into an issue where certs stop reloading after they are rotated in our Kubernetes environment. We imported this small library to handle TLS secret rotations, but after the second rotation the replica sets that came up got TLS errors. After the first rotation, the certman logs stop appearing. These are the logs from that rotation:
2022/02/03 03:57:53 certman: watch event: "/run/secrets/tls/tls.key": REMOVE
2022/02/03 03:57:53 certman: certificate and key loaded
2022/02/03 03:57:53 certman: watch event: "/run/secrets/tls/tls.crt": CHMOD
2022/02/03 03:57:53 certman: certificate and key loaded
2022/02/03 03:57:53 certman: watch event: "/run/secrets/tls/tls.crt": REMOVE
2022/02/03 03:57:53 certman: certificate and key loaded
The error was the following:
Error creating: Internal error occurred: failed calling webhook "janus.mutating.custom-admission-webhooks<redacted>": Post "https://janus.ns-team-janus.svc:443/janus/v1/sidecar?timeout=2s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "<redacted>")
We noticed that the cert mounted as a volume in our deployment didn't match the one our service was actually serving at runtime.
We consume our secret via a Kubernetes secret volume:
volumes:
  - name: janus-webhook-tls
    secret:
      secretName: janus-webhook-tls
We verified this by grabbing the served cert with the following command. Its validity dates were older than those of the rotated cert on the volume mount.
/ # echo | openssl s_client -servername <service-endpoint> -connect <service-endpoint>:<port>
The cert mounted via the volume was the new one:
Validity
Not Before: Feb 8 12:27:12 2022 GMT
Not After : Aug 8 12:27:12 2022 GMT
vs.
The cert exposed via the openssl command was the old one:
Validity
Not Before: Feb 2 03:56:21 2022 GMT
Not After : Aug 2 03:56:20 2022 GMT
There is already a PR filed that seems to address the same issue.
#1
The issue seemed to be the timing of the load relative to the event being triggered. We also changed the fsnotify watch to monitor the directory instead of the individual files. We had to tweak that PR to get it to work in our application:
https://github.com/ts3ng/certman/tree/fix-reload
LMK if you want me to open a PR, as this lib doesn't look like it's been maintained for a while now.