-
Notifications
You must be signed in to change notification settings - Fork 29
SEAB-6225/4825: Add Liquibase lock and Elasticsearch consistency checks #5843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
results.entrySet().stream() | ||
.filter(result -> !result.getValue().isHealthy()) | ||
.forEach(result -> LOG.error("Health check '{}' failed with error: {}", result.getKey(), result.getValue().getMessage())); | ||
String failedHealthCheckNames = results.entrySet().stream() | ||
.filter(result -> !result.getValue().isHealthy()) | ||
.map(result -> String.format("'%s'", result)) | ||
.map(result -> String.format("'%s'", result.getKey())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe t 8000 his comment to others. Learn more.
The original code looks like it was trying to create a list of health check names, but it was calling result.toString
, which also included the result message and some other information. In some cases, we don't want the details of a health check failure to be public, thus the above change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of thoughts before my 3 day weekend. :)
- How will this be invoked? Via Uptime Robot as had been discussed in Slack?
- Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the
4
threshold in the other PR?
Not user facing, so, per our policy, no review required.
This should be reviewed.
See description of https://github.com/dockstore/dockstore-deploy/pull/762 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build failures look like #5841
Can cherry-pick to find out
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5843 +/- ##
=============================================
+ Coverage 74.46% 74.52% +0.06%
- Complexity 5248 5260 +12
=============================================
Files 366 368 +2
Lines 18975 19018 +43
Branches 2021 2025 +4
=============================================
+ Hits 14130 14174 +44
+ Misses 3888 3883 -5
- Partials 957 961 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, but see my question in the accompanying deploy PR -- I think it will be fine, but was just being paranoid there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor question
CountRequest countRequest = new CountRequest(index); | ||
CountResponse countResponse = client.count(countRequest, RequestOptions.DEFAULT); | ||
if (countResponse.status().getStatus() != HttpStatus.SC_OK) { | ||
throw new RuntimeException("Non-OK response to Elasticsearch request"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be CustomWebApplicationException
with an appropriate error code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the health check system we use runs each health check, it catches any exception that the health check throws and maps it to an unhealthy Result. These Results are then mapped to a response by our endpoint code. If we throw a CustomWebApplicationException
here, that implies that it's gonna emerge from the webservice with the attached status code, which isn't going to happen. So, I'd probably lean against it. It'd be nice to throw a specialization of RuntimeException
that was more specific to the particular error, but I couldn't find a good match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InternalException
is meant to mirror an internal JDK error, see the package documentation: https://docs.oracle.com/en/java/javase/17/docs/api/jdk.jdi/com/sun/jdi/package-summary.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also declare your own
...ice/src/main/java/io/dockstore/webservice/resources/ElasticsearchConsistencyHealthCheck.java
Show resolved
Hide resolved
|
Description
This PR adds health checks to the webservice that detect when:
Kathy convinced me that these are indeed health checks, so they're run and reported via the existing
/metadata/health
endpoint and associated machinery. They do differ from some of the existing health checks: although they signal a condition that's not entirely healthy, their failure indicates a non-fatal condition, and the webservice should continue to run, it need not be stopped/replaced/etc. That's ok, because currently, our monitoring software only replaces the webservice task when theconnectionPool
health check fails.We calculate how long the Liquibase lock has been held by comparing the current time against when it was last granted, per the database table. If the lock has been held more than 10 minutes, we declare it held too long.
Initially, I tried to manage the required
Sessions
"manually" viaSessionFactory.openSession
andManagedSessionContext.bind
.However, for unknown reasons, this screwed up other Sessions in subsequent unrelated requests, causing them to malfunction with
IllegalStateExceptions
etc. So, instead, I usedUnitOfWorkAwareProxyFactory
to wrap thecheck()
methods, which is cleaner and worked as advertised. I cribbed the subsequently-rejected manual session management code from https://github.com/dockstore/dockstore/blob/develop/dockstore-webservice/src/main/java/io/dockstore/webservice/DockstoreWebserviceApplication.java#L526, so its continued presence worries me a little.When a health check fails, the resource method logs an
ERROR
level message containing the health check name. We use this log entry to create a Cloudwatch alarm in companion PR https://github.com/dockstore/dockstore-deploy/pull/762Review Instructions
Trigger the exceptional conditions on qa and confirm that the alarms happen.
Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6225
https://ucsc-cgl.atlassian.net/browse/SEAB-4825
Security and Privacy
No unusual concerns.
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install
@RolesAllowed
annotation