SEAB-6225/4825: Add Liquibase lock and Elasticsearch consistency checks by svonworl · Pull Request #5843 · dockstore/dockstore · GitHub

SEAB-6225/4825: Add Liquibase lock and Elasticsearch consistency checks #5843


Merged
svonworl merged 15 commits into develop from feature/seab-6225/liquibase-lock-check
Mar 25, 2024

Conversation

svonworl
Contributor
@svonworl svonworl commented Mar 15, 2024

Description
This PR adds health checks to the webservice that detect when:

  • the Liquibase migration lock has been held for too long.
  • the number of published non-checker tools, workflows, and notebooks does not match the number of documents in the corresponding Elasticsearch indexes.

Kathy convinced me that these are indeed health checks, so they're run and reported via the existing /metadata/health endpoint and associated machinery. They do differ from some of the existing health checks: although a failure signals a condition that's not entirely healthy, that condition is non-fatal, so the webservice should continue to run and need not be stopped or replaced. That's OK, because our monitoring software currently replaces the webservice task only when the connectionPool health check fails.

We calculate how long the Liquibase lock has been held by comparing the current time against when it was last granted, per the database table. If the lock has been held more than 10 minutes, we declare it held too long.
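The comparison described above can be sketched in a few lines. This is a minimal illustration, not the PR's LiquibaseLockHealthCheck: the class and method names are hypothetical, and it assumes the caller has already read the lock's grant time from the database (null meaning the lock is not currently held).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not the code added by this PR.
final class LockAgeSketch {
    // The 10-minute limit comes from the PR description.
    static final Duration MAX_HELD = Duration.ofMinutes(10);

    // lockGranted: when the lock was last granted, or null if it is not held (assumption).
    static boolean heldTooLong(Instant lockGranted, Instant now) {
        if (lockGranted == null) {
            return false; // no lock held, nothing to report
        }
        // "Too long" means strictly more than the limit.
        return Duration.between(lockGranted, now).compareTo(MAX_HELD) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-03-15T12:00:00Z");
        System.out.println(heldTooLong(now.minus(Duration.ofMinutes(11)), now)); // true
        System.out.println(heldTooLong(now.minus(Duration.ofMinutes(5)), now));  // false
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the check easy to test.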

Initially, I tried to manage the required Sessions "manually" via SessionFactory.openSession and ManagedSessionContext.bind. However, for unknown reasons, this interfered with other Sessions in subsequent unrelated requests, causing them to fail with IllegalStateExceptions and similar errors. So, instead, I used UnitOfWorkAwareProxyFactory to wrap the check() methods, which is cleaner and worked as advertised. I cribbed the subsequently-rejected manual session management code from https://github.com/dockstore/dockstore/blob/develop/dockstore-webservice/src/main/java/io/dockstore/webservice/DockstoreWebserviceApplication.java#L526, so its continued presence there worries me a little.

When a health check fails, the resource method logs an ERROR level message containing the health check name. We use this log entry to create a CloudWatch alarm in companion PR https://github.com/dockstore/dockstore-deploy/pull/762.

Review Instructions
Trigger the exceptional conditions on qa and confirm that the alarms happen.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6225
https://ucsc-cgl.atlassian.net/browse/SEAB-4825

Security and Privacy

No unusual concerns.

  • Security and Privacy assessed

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

  results.entrySet().stream()
      .filter(result -> !result.getValue().isHealthy())
      .forEach(result -> LOG.error("Health check '{}' failed with error: {}", result.getKey(), result.getValue().getMessage()));
  String failedHealthCheckNames = results.entrySet().stream()
      .filter(result -> !result.getValue().isHealthy())
-     .map(result -> String.format("'%s'", result))
+     .map(result -> String.format("'%s'", result.getKey()))
Contributor Author
@svonworl svonworl Mar 15, 2024

The original code looks like it was trying to create a list of health check names, but it was calling result.toString, which also included the result message and some other information. In some cases, we don't want the details of a health check failure to be public, thus the above change.
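To illustrate the difference, here is a small self-contained sketch. The Result record is a stand-in for the health check library's result type, and the check name "liquibaseLock" and its message are made up for the example. Formatting a Map.Entry with %s calls its toString, which yields key=value, so the message leaks into the string; formatting getKey() yields only the name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration only; Result is a stand-in type, not the library's class.
final class FailedNamesSketch {
    record Result(boolean healthy, String message) { }

    // Formats the whole entry: the output includes the value's toString.
    static String namesViaEntry(Map<String, Result> results) {
        return results.entrySet().stream()
            .filter(e -> !e.getValue().healthy())
            .map(e -> String.format("'%s'", e))
            .collect(Collectors.joining(", "));
    }

    // Formats only the key: the output contains just the check name.
    static String namesViaKey(Map<String, Result> results) {
        return results.entrySet().stream()
            .filter(e -> !e.getValue().healthy())
            .map(e -> String.format("'%s'", e.getKey()))
            .collect(Collectors.joining(", "));
    }

    static Map<String, Result> sample() {
        Map<String, Result> results = new LinkedHashMap<>();
        results.put("connectionPool", new Result(true, null));
        results.put("liquibaseLock", new Result(false, "held too long")); // hypothetical name and message
        return results;
    }

    public static void main(String[] args) {
        System.out.println(namesViaEntry(sample())); // includes "held too long"
        System.out.println(namesViaKey(sample()));   // 'liquibaseLock'
    }
}
```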

Collaborator
@coverbeck coverbeck left a comment

A couple of thoughts before my 3 day weekend. :)

  • How will this be invoked? Via Uptime Robot as had been discussed in Slack?
  • Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the 4 threshold in the other PR?

Not user facing, so, per our policy, no review required.

This should be reviewed.

@svonworl
Contributor Author

> A couple of thoughts before my 3 day weekend. :)
>
>   • How will this be invoked? Via Uptime Robot as had been discussed in Slack?
>   • Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the 4 threshold in the other PR?

See description of https://github.com/dockstore/dockstore-deploy/pull/762

Member
@denis-yuen denis-yuen left a comment

Build failures look like #5841
Can cherry-pick to find out

codecov bot commented Mar 19, 2024

Codecov Report

Attention: Patch coverage is 91.11111%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.52%. Comparing base (aaaf076) to head (f3f570c).
Report is 2 commits behind head on develop.

Files                                                   Patch %   Lines
...resources/ElasticsearchConsistencyHealthCheck.java   88.46%    1 Missing and 2 partials ⚠️
...webservice/resources/LiquibaseLockHealthCheck.java   90.90%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #5843      +/-   ##
=============================================
+ Coverage      74.46%   74.52%   +0.06%     
- Complexity      5248     5260      +12     
=============================================
  Files            366      368       +2     
  Lines          18975    19018      +43     
  Branches        2021     2025       +4     
=============================================
+ Hits           14130    14174      +44     
+ Misses          3888     3883       -5     
- Partials         957      961       +4     
Flag                                    Coverage Δ
bitbuckettests                          27.10% <35.55%> (+0.01%) ⬆️
integrationtests                        58.49% <91.11%> (+0.09%) ⬆️
languageparsingtests                    11.00% <35.55%> (+0.05%) ⬆️
localstacktests                         21.55% <35.55%> (+0.03%) ⬆️
toolintegrationtests                    30.46% <35.55%> (+0.01%) ⬆️
unit-tests_and_non-confidential-tests   28.90% <35.55%> (+0.01%) ⬆️
workflowintegrationtests                38.70% <35.55%> (-0.01%) ⬇️


Collaborator
@coverbeck coverbeck left a comment

Approved, but see my question in the accompanying deploy PR -- I think it will be fine, but was just being paranoid there.

Member
@denis-yuen denis-yuen left a comment

minor question

CountRequest countRequest = new CountRequest(index);
CountResponse countResponse = client.count(countRequest, RequestOptions.DEFAULT);
if (countResponse.status().getStatus() != HttpStatus.SC_OK) {
    throw new RuntimeException("Non-OK response to Elasticsearch request");
}
Member

Should this be CustomWebApplicationException with an appropriate error code?

Contributor Author
@svonworl svonworl Mar 19, 2024

When the health check system we use runs each health check, it catches any exception that the health check throws and maps it to an unhealthy Result. These Results are then mapped to a response by our endpoint code. If we throw a CustomWebApplicationException here, that implies that it's gonna emerge from the webservice with the attached status code, which isn't going to happen. So, I'd probably lean against it. It'd be nice to throw a specialization of RuntimeException that was more specific to the particular error, but I couldn't find a good match.
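The behavior described here can be sketched without any framework. The types below (Result, Check, run) are simplified stand-ins, not the actual health check library's classes; the point is only that an exception thrown inside a check is converted to an unhealthy result by the runner, so its type never reaches the HTTP response.

```java
// Simplified stand-in for a health check runner; all names here are hypothetical.
final class HealthRunnerSketch {
    record Result(boolean healthy, String message) { }

    @FunctionalInterface
    interface Check {
        Result check() throws Exception;
    }

    // Runs a check and maps any thrown exception to an unhealthy Result.
    static Result run(Check check) {
        try {
            return check.check();
        } catch (Exception e) {
            return new Result(false, e.getMessage());
        }
    }

    public static void main(String[] args) {
        Result r = run(() -> {
            throw new RuntimeException("Non-OK response to Elasticsearch request");
        });
        // The caller only ever sees a Result, never the exception itself.
        System.out.println(r.healthy() + ": " + r.message());
    }
}
```

Because the runner swallows the exception, a status code attached to it (as with CustomWebApplicationException) would have no effect on what the endpoint returns.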

Member

Contributor Author

InternalException is meant to mirror an internal JDK error, see the package documentation: https://docs.oracle.com/en/java/javase/17/docs/api/jdk.jdi/com/sun/jdi/package-summary.html

Member

Could also declare your own
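For reference, declaring a dedicated unchecked exception is only a few lines. This is a sketch of the suggestion, not code from the PR; the exception name and the requireOk helper are hypothetical.

```java
// Hypothetical names; a sketch of the "declare your own" suggestion.
final class OwnExceptionSketch {
    // A RuntimeException specialization specific to failed Elasticsearch requests.
    static class ElasticsearchRequestException extends RuntimeException {
        ElasticsearchRequestException(String message) {
            super(message);
        }
    }

    // Throws the dedicated exception when the HTTP status is not 200 (OK).
    static void requireOk(int status) {
        if (status != 200) {
            throw new ElasticsearchRequestException(
                "Non-OK response to Elasticsearch request: " + status);
        }
    }

    public static void main(String[] args) {
        requireOk(200); // passes silently
        try {
            requireOk(503);
        } catch (ElasticsearchRequestException e) {
            System.out.println(e.getMessage());
        }
    }
}
```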

@svonworl svonworl requested a review from denis-yuen March 20, 2024 20:59
@svonworl svonworl merged commit ab664cb into develop Mar 25, 2024
@svonworl svonworl deleted the feature/seab-6225/liquibase-lock-check branch March 25, 2024 23:51