SEAB-6225/4825: Add Liquibase lock and Elasticsearch consistency checks #5843

svonworl · 2024-03-15T22:42:24Z

Description
This PR adds health checks to the webservice that detect when:

the Liquibase migration lock has been held for too long.
the number of published non-checker tools, workflows, and notebooks does not match the number of documents in the corresponding Elasticsearch indexes.
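A minimal sketch of the count comparison described above, assuming the database counts and the Elasticsearch document counts have already been fetched. The class name, method name, and the index names used in `main` are hypothetical and are not taken from the PR; the real check may compare the counts differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not the PR's ElasticsearchConsistencyHealthCheck.
final class CountConsistencySketch {
    private CountConsistencySketch() { }

    // Returns one description per index whose document count differs from the database count.
    static List<String> mismatches(Map<String, Long> dbCounts, Map<String, Long> esCounts) {
        List<String> out = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order for the report.
        for (Map.Entry<String, Long> entry : new TreeMap<>(dbCounts).entrySet()) {
            Long esCount = esCounts.get(entry.getKey());
            // A missing index (null) also counts as a mismatch.
            if (!entry.getValue().equals(esCount)) {
                out.add(String.format("%s: database=%d, elasticsearch=%s",
                        entry.getKey(), entry.getValue(), esCount));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Index names here are assumptions for illustration only.
        Map<String, Long> db = Map.of("tools", 3L, "workflows", 5L);
        Map<String, Long> es = Map.of("tools", 3L, "workflows", 4L);
        System.out.println(mismatches(db, es));
    }
}
```

An empty list would correspond to a healthy result; any entry would make the check report unhealthy.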

Kathy convinced me that these are indeed health checks, so they're run and reported via the existing /metadata/health endpoint and associated machinery. They do differ from some of the existing health checks: they signal a condition that's not entirely healthy, but their failure is non-fatal, so the webservice should continue to run and need not be stopped or replaced. That's ok, because our monitoring software currently replaces the webservice task only when the connectionPool health check fails.

We calculate how long the Liquibase lock has been held by comparing the current time against the time it was last granted, as recorded in the database table. If the lock has been held for more than 10 minutes, we declare it held too long.
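The lock-age rule can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes the grant time has already been read from the lock table (in a standard Liquibase setup, the LOCKGRANTED column of DATABASECHANGELOGLOCK) and that a null grant time means the lock is not held.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; the PR's LiquibaseLockHealthCheck may be structured differently.
final class LiquibaseLockAgeSketch {
    // The 10-minute limit comes from the PR description.
    static final Duration MAX_HELD = Duration.ofMinutes(10);

    private LiquibaseLockAgeSketch() { }

    // lockGranted: when the lock was last granted; null is assumed to mean "not held".
    static boolean heldTooLong(Instant lockGranted, Instant now) {
        if (lockGranted == null) {
            return false;
        }
        // Strictly more than the limit counts as too long.
        return Duration.between(lockGranted, now).compareTo(MAX_HELD) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-03-15T22:42:24Z");
        System.out.println(heldTooLong(now.minus(Duration.ofMinutes(11)), now));
        System.out.println(heldTooLong(now.minus(Duration.ofMinutes(5)), now));
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the method, keeps the rule easy to test.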

Initially, I tried to manage the required Sessions "manually" via SessionFactory.openSession and ManagedSessionContext.bind.
However, for unknown reasons, this broke other Sessions in subsequent unrelated requests, causing them to fail with IllegalStateExceptions and the like. So, instead, I used UnitOfWorkAwareProxyFactory to wrap the check() methods, which is cleaner and worked as advertised. I cribbed the subsequently-rejected manual session management code from https://github.com/dockstore/dockstore/blob/develop/dockstore-webservice/src/main/java/io/dockstore/webservice/DockstoreWebserviceApplication.java#L526, so its continued presence there worries me a little.

When a health check fails, the resource method logs an ERROR-level message containing the health check name. We use this log entry to create a CloudWatch alarm in the companion PR https://github.com/dockstore/dockstore-deploy/pull/762.

Review Instructions
Trigger the exceptional conditions on qa and confirm that the alarms happen.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6225
https://ucsc-cgl.atlassian.net/browse/SEAB-4825

Security and Privacy

No unusual concerns.

Security and Privacy assessed

Please make sure that you've checked the following before submitting your pull request. Thanks!

Check that you pass the basic style checks and unit tests by running mvn clean install
Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
Do not serve user-uploaded binary images through the Dockstore API
Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
Do not create cookies, although this may change in the future
If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

svonworl · 2024-03-15T23:34:20Z

dockstore-webservice/src/main/java/io/dockstore/webservice/resources/MetadataResource.java

            results.entrySet().stream()
                    .filter(result -> !result.getValue().isHealthy())
                    .forEach(result -> LOG.error("Health check '{}' failed with error: {}", result.getKey(), result.getValue().getMessage()));
            String failedHealthCheckNames = results.entrySet().stream()
                    .filter(result -> !result.getValue().isHealthy())
-                    .map(result -> String.format("'%s'", result))
+                    .map(result -> String.format("'%s'", result.getKey()))


The original code looks like it was trying to create a list of health check names, but it was calling result.toString, which also included the result message and some other information. In some cases, we don't want the details of a health check failure to be public, thus the above change.
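To illustrate the difference, here is a small self-contained sketch. The `Result` class is a simplified, hypothetical stand-in for the health check result type, the check names are made up, and joining with ", " is an assumption because the diff above is cut off after the `map` call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FailedCheckNamesSketch {
    // Simplified stand-in for a health check result (hypothetical).
    static final class Result {
        private final boolean healthy;
        private final String message;

        Result(boolean healthy, String message) {
            this.healthy = healthy;
            this.message = message;
        }

        boolean isHealthy() {
            return healthy;
        }

        @Override
        public String toString() {
            return "Result{healthy=" + healthy + ", message=" + message + "}";
        }
    }

    private FailedCheckNamesSketch() { }

    // Formats the whole Map.Entry, so the entry's toString() (key=value) leaks the message.
    static String entryBased(Map<String, Result> results) {
        return results.entrySet().stream()
                .filter(result -> !result.getValue().isHealthy())
                .map(result -> String.format("'%s'", result))
                .collect(Collectors.joining(", "));
    }

    // Formats only the key, so just the health check name is exposed.
    static String keyBased(Map<String, Result> results) {
        return results.entrySet().stream()
                .filter(result -> !result.getValue().isHealthy())
                .map(result -> String.format("'%s'", result.getKey()))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        Map<String, Result> results = new LinkedHashMap<>();
        results.put("checkA", new Result(true, null));
        results.put("checkB", new Result(false, "internal detail"));
        System.out.println(entryBased(results));
        System.out.println(keyBased(results));
    }
}
```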

coverbeck

A couple of thoughts before my 3 day weekend. :)

How will this be invoked? Via Uptime Robot as had been discussed in Slack?
Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the 4 threshold in the other PR?

Not user facing, so, per our policy, no review required.

This should be reviewed.

svonworl · 2024-03-16T04:12:17Z

A couple of thoughts before my 3 day weekend. :)

How will this be invoked? Via Uptime Robot as had been discussed in Slack?

Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the 4 threshold in the other PR?

See description of https://github.com/dockstore/dockstore-deploy/pull/762

denis-yuen

Build failures look like #5841
Can cherry-pick to find out

codecov · 2024-03-19T18:03:18Z

Codecov Report

Attention: Patch coverage is 91.11111%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.52%. Comparing base (aaaf076) to head (f3f570c).
Report is 2 commits behind head on develop.

Files	Patch %	Lines
...resources/ElasticsearchConsistencyHealthCheck.java	88.46%	1 Missing and 2 partials ⚠️
...webservice/resources/LiquibaseLockHealthCheck.java	90.90%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             develop    #5843      +/-   ##
=============================================
+ Coverage      74.46%   74.52%   +0.06%     
- Complexity      5248     5260      +12     
=============================================
  Files            366      368       +2     
  Lines          18975    19018      +43     
  Branches        2021     2025       +4     
=============================================
+ Hits           14130    14174      +44     
+ Misses          3888     3883       -5     
- Partials         957      961       +4

Flag	Coverage Δ
bitbuckettests	`27.10% <35.55%> (+0.01%)`	⬆️
integrationtests	`58.49% <91.11%> (+0.09%)`	⬆️
languageparsingtests	`11.00% <35.55%> (+0.05%)`	⬆️
localstacktests	`21.55% <35.55%> (+0.03%)`	⬆️
toolintegrationtests	`30.46% <35.55%> (+0.01%)`	⬆️
unit-tests_and_non-confidential-tests	`28.90% <35.55%> (+0.01%)`	⬆️
workflowintegrationtests	`38.70% <35.55%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


coverbeck

Approved, but see my question in the accompanying deploy PR -- I think it will be fine, but was just being paranoid there.

denis-yuen

minor question

denis-yuen · 2024-03-19T20:02:27Z

...ice/src/main/java/io/dockstore/webservice/resources/ElasticsearchConsistencyHealthCheck.java

+        CountRequest countRequest = new CountRequest(index);
+        CountResponse countResponse = client.count(countRequest, RequestOptions.DEFAULT);
+        if (countResponse.status().getStatus() != HttpStatus.SC_OK) {
+            throw new RuntimeException("Non-OK response to Elasticsearch request");


Should this be CustomWebApplicationException with an appropriate error code?

When the health check system we use runs each health check, it catches any exception that the health check throws and maps it to an unhealthy Result. These Results are then mapped to a response by our endpoint code. If we throw a CustomWebApplicationException here, that implies that it's gonna emerge from the webservice with the attached status code, which isn't going to happen. So, I'd probably lean against it. It'd be nice to throw a specialization of RuntimeException that was more specific to the particular error, but I couldn't find a good match.

Makes sense, maybe https://docs.oracle.com/en/java/javase/17/docs/api/jdk.jdi/com/sun/jdi/InternalException.html

InternalException is meant to mirror an internal JDK error, see the package documentation: https://docs.oracle.com/en/java/javase/17/docs/api/jdk.jdi/com/sun/jdi/package-summary.html

Could also declare your own
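A rough sketch of that last option, under stated assumptions: `ElasticsearchRequestException` is a hypothetical name, and `runCheck` is a simplified stand-in for the behavior described above, where the health check system catches whatever a check throws and maps it to an unhealthy result. It is not the real health check library's API.

```java
// Hypothetical illustration; none of these names come from the PR.
final class HealthCheckExceptionSketch {
    // A RuntimeException specialization that names the specific failure.
    static final class ElasticsearchRequestException extends RuntimeException {
        private static final long serialVersionUID = 1L;

        ElasticsearchRequestException(String message) {
            super(message);
        }
    }

    private HealthCheckExceptionSketch() { }

    // Simplified stand-in for the framework: any exception becomes an unhealthy result,
    // so no HTTP status code attached to the exception would reach the client.
    static String runCheck(Runnable check) {
        try {
            check.run();
            return "healthy";
        } catch (RuntimeException e) {
            return "unhealthy: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runCheck(() -> {
            throw new ElasticsearchRequestException("Non-OK response to Elasticsearch request");
        }));
        System.out.println(runCheck(() -> { }));
    }
}
```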

sonarqubecloud · 2024-03-20T21:28:55Z

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
91.2% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

svonworl added 12 commits March 13, 2024 14:09

rough draft

ba37773

convert to health check

6ba768a

remove endpoint, add es consistency health check

0a3c092

checkstyle, tweak

cdc015b

cleanup, tie some loose ends

07412ee

tweak health check failure response, fix test

4484024

tweak test

ce70eb8

tweak it, one more attempt

e0640c3

add ITs for failed health checks, comment

ca3a499

refactor

4d2d445

use UnitOfWorkAwareProxyFactory to handle sessions

690d4b5

cleanup build issues

fda82e5

svonworl commented Mar 15, 2024

svonworl requested review from kathy-t, denis-yuen, coverbeck, david4096 and hyunnaye March 15, 2024 23:36

svonworl self-assigned this Mar 15, 2024

coverbeck reviewed Mar 16, 2024

denis-yuen reviewed Mar 18, 2024

merge develop

b516855

svonworl requested review from denis-yuen and coverbeck March 19, 2024 18:00

coverbeck approved these changes Mar 19, 2024

apply some sonarcloud recommendations

da616c0

denis-yuen reviewed Mar 19, 2024

kathy-t approved these changes Mar 20, 2024

add comment, mollify codacy a little

f3f570c

svonworl requested a review from denis-yuen March 20, 2024 20:59

denis-yuen approved these changes Mar 20, 2024

david4096 approved these changes Mar 21, 2024

svonworl merged commit ab664cb into develop Mar 25, 2024

svonworl deleted the feature/seab-6225/liquibase-lock-check branch March 25, 2024 23:51
