SEAB-6148: Log more diagnostic information by svonworl · Pull Request #5821 · dockstore/dockstore · GitHub

SEAB-6148: Log more diagnostic information #5821


Merged: 39 commits merged from feature/seab-6148/log-more-diagnostic-information into develop on Apr 24, 2024

Conversation

@svonworl (Contributor) commented Feb 23, 2024

Description
This PR introduces a new DiagnosticsHelper class that the webservice application can wire into itself to log various information:

  • Periodic "Global" info about running processes, the filesystem, the database, and memory pools, logged via a Timer daemon thread.
  • Request-level info about a request and associated thread and Hibernate Session, logged at the start and finish of each request, two log entries per request.

In this iteration, I dropped the thread dump from the periodic logging output. At 25kB a pop, it didn't seem worth it.

Diagnostic logging is enabled using the diagnosticsConfig configuration file property. Periodic and request logging can be enabled/disabled separately, and the logging period can be set. By default, diagnostic logging is disabled. To enable both periodic and request logging, add the following to your config file:

diagnosticsConfig:
  logRequests: true
  logPeriodic: true
  periodSeconds: 60

To review the output, pipe the webservice stdout into tee /tmp/output and view the resulting file. Search for DiagnosticsHelper to find the new log entries.

A big concern is that, since we're logging data pulled out of the guts of our runtime environment, we'll accidentally log a secret, especially via the ps output, which includes the arguments to commands. We purposefully do not log environment variables, which are a known vector for secrets.

To reduce the chance of a secret spilling, all of the information that is logged - literally the entirety of each log message - passes through censoring code that looks for runs of characters that appear to be hex/base64, does some analysis on the character distributions, and X's them out if they appear to be keys. The crux here is that file paths resemble base64 strings because they tend to contain the same characters, so we need a way to let them through uncensored. The assumption is that a key will be well-encoded and contain a random jumble of characters, whereas a path (or other structured non-key data) will contain longer runs of the same character classes and/or contain text that resembles English. The resulting censoring scheme works pretty well; see the comments on the CensorHelper class for more information.
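
To make the heuristic concrete, here's a rough sketch of the kind of pass described above (illustrative only, not the actual CensorHelper code; the class name, regex, and threshold below are assumptions):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch: find longish runs of base64/hex-like characters and X them out
// if they look like random keys rather than paths or English-like text.
public class CensorSketch {

    // Candidate runs: 20+ consecutive characters from the base64/hex-ish alphabet.
    private static final Pattern CANDIDATE = Pattern.compile("[A-Za-z0-9+/=_-]{20,}");

    public static String censor(String message) {
        Matcher matcher = CANDIDATE.matcher(message);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String run = matcher.group();
            String replacement = looksLikeKey(run) ? "X".repeat(run.length()) : run;
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    // Keys are a random jumble, so they switch between character classes often;
    // paths and English-like text have longer runs of the same class.
    private static boolean looksLikeKey(String run) {
        int transitions = 0;
        for (int i = 1; i < run.length(); i++) {
            if (charClass(run.charAt(i)) != charClass(run.charAt(i - 1))) {
                transitions++;
            }
        }
        return (double) transitions / run.length() > 0.5;
    }

    private static int charClass(char c) {
        if (Character.isLowerCase(c)) {
            return 0;
        }
        if (Character.isUpperCase(c)) {
            return 1;
        }
        if (Character.isDigit(c)) {
            return 2;
        }
        return 3;
    }
}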

Another concern is that these changes could introduce instability into the webservice; keep an eye out for that type of problem. We're purposefully doing the periodic logging via a Timer/thread so that, if it hangs or throws, the normal functions of the webservice are less likely to be impacted.
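
For illustration, that isolation can be achieved with a daemon Timer whose task swallows its own exceptions (a minimal sketch assuming plain java.util.Timer; the actual wiring in DiagnosticsHelper may differ):

import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.TimeUnit;

public class PeriodicLoggingSketch {

    public static void start(long periodSeconds) {
        // The second argument makes the Timer's thread a daemon, so it won't block JVM shutdown.
        Timer timer = new Timer("diagnostics", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    logGlobalInfo();
                } catch (RuntimeException e) {
                    // Must not let an exception escape: an uncaught exception thrown from a
                    // TimerTask cancels the Timer, ending all future periodic logging.
                    System.err.println("periodic diagnostics failed: " + e);
                }
            }
        }, 0, TimeUnit.SECONDS.toMillis(periodSeconds));
    }

    private static void logGlobalInfo() {
        // placeholder for the process/filesystem/database/memory dumps described above
    }
}

Since the task runs on the Timer's own thread, a hang or slow dump delays only future periodic entries, not request handling.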

We can easily add a filter that will exclude certain requests from diagnostic logging (for example, hits to the health check endpoint).
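
Such a filter could be little more than a path check at the start of the request-logging hook; a tiny sketch (the path and class name here are hypothetical, and the real exclusion list would presumably come from the config):

import java.util.Set;

public class RequestLoggingFilterSketch {

    // Hypothetical exclusion list; the actual health check path is an assumption.
    private static final Set<String> EXCLUDED_PATHS = Set.of("metadata/health");

    public boolean shouldLog(String requestPath) {
        return !EXCLUDED_PATHS.contains(requestPath);
    }
}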

Review Instructions
View the CloudWatch logs for qa, and confirm that diagnostic log entries are being created before and after each request, as well as periodically. Search for "DiagnosticsHelper" to find the entries.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6148

Security and Privacy
Several security concerns; see the description.

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

svonworl marked this pull request as draft February 23, 2024 16:30
svonworl changed the title from "Feature/seab 6148/log more diagnostic information" to "SEAB-6148: Log more diagnostic information" Feb 23, 2024
@denis-yuen (Member):

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file
https://www.dropwizard.io/en/stable/manual/configuration.html#logging
that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules
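
For reference, a separate appender can be attached to just the diagnostic logger in the Dropwizard config, roughly like this (a sketch; the logger name and file paths are assumptions):

logging:
  loggers:
    io.dockstore.webservice.helpers.DebugHelper:
      level: INFO
      additive: false
      appenders:
        - type: file
          currentLogFilename: /var/log/dockstore/diagnostics.log
          archivedLogFilenamePattern: /var/log/dockstore/diagnostics-%d.log.gz
          archivedFileCount: 7

With additive: false, these entries would go only to the separate file rather than the main log.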

@coverbeck (Collaborator) left a comment:

Conceptually, looks good.

Do you have some sample output, just to get a sense of what it might look like and how large it would be?

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file https://www.dropwizard.io/en/stable/manual/configuration.html#logging that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules

On the one hand, yes, it could make the "main" log really noisy and might be worth separating. On the other hand, it would make it easier to correlate things if we're researching an issue; e.g., seeing the thread dump right before/after a request is logged would be useful. I wonder if CloudWatch has a way to "merge" log groups when querying?


// Request started.
case START:
logStarted(request);
Collaborator comment on the diff above:

This will be very useful; right now the request is only logged when completed. If we die (rare, but it's happened) processing a request, it's hard to piece together what the request was.

@denis-yuen (Member):

I wonder if CloudWatch has a way to "merge" log groups when querying?

Looks like, yes, via the insights queries
https://aws.amazon.com/about-aws/whats-new/2019/07/cloudwatch-logs-insights-adds-cross-log-group-querying/
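
For example, after selecting both log groups in Logs Insights, a query along these lines should pull the diagnostic entries from all of them in one result set (a sketch):

fields @timestamp, @logStream, @message
| filter @message like /DebugHelper/
| sort @timestamp asc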

@svonworl (Contributor, Author) commented Feb 28, 2024:

Conceptually, looks good.

Do you have some sample output, just to get a sense of what it might look like and how large it would be?

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file https://www.dropwizard.io/en/stable/manual/configuration.html#logging that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules

On the one hand, yes, it could make the "main" log really noisy and might be worth separating. On the other hand, it would make it easier to correlate things if we're researching an issue; e.g., seeing the thread dump right before/after a request is logged would be useful. I wonder if CloudWatch has a way to "merge" log groups when querying?

Some thoughts:

  • It's relatively difficult to send separate streams of information to different log groups from a container running in ECS. The typical setup is to use the cloudwatch agent to send stdout/stderr to a group.
  • We could possibly modify our logging to condense each multi-line log entry to a single expandable entry (see the sketch after this list): https://stackoverflow.com/questions/67396522/cloudwatch-multiline-log-messages-from-containerized-app-runnning-on-ecs-ec2 This would greatly reduce the visual impact of the logging entries that appear in this PR, as well as collapse the giant stack traces that currently appear in our logs.
  • It is most useful to display the debugging information interleaved temporally with our other log output. The harder it is to do that, the less useful the debug information is.
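
Regarding the second bullet above, the usual approach with the awslogs log driver is a multiline pattern in the ECS task definition's log configuration, something like this sketch (the log group, region, and the pattern matching the start of a Dropwizard log line are assumptions):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "dockstore-webservice",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "webservice",
    "awslogs-multiline-pattern": "^(TRACE|DEBUG|INFO|WARN|ERROR)\\s+\\["
  }
}

Lines that don't match the pattern get appended to the previous CloudWatch event, so a multi-line thread dump or stack trace becomes one expandable entry.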

Regarding size, there are two groups of information being logged. The first, thread/request-level info, is pretty small:

INFO  [2024-02-28 20:50:59,095] io.dockstore.webservice.helpers.DebugHelper: debug.started by thread "dw-45 - GET /organizations/9/starredUsers" (45):
GET "organizations/9/starredUsers"
INFO  [2024-02-28 20:50:59,109] io.dockstore.webservice.helpers.DebugHelper: debug.session by thread "dw-45 - GET /organizations/9/starredUsers" (45):
SessionStatistics[entity count=1,collection count=4]
INFO  [2024-02-28 20:50:59,131] io.dockstore.webservice.helpers.DebugHelper: debug.finished by thread "dw-45 - GET /organizations/9/starredUsers" (45):
GET "organizations/9/starredUsers"
INFO  [2024-02-28 20:50:59,131] io.dockstore.webservice.helpers.DebugHelper: debug.thread by thread "dw-45 - GET /organizations/9/starredUsers" (45):
allocated: 0.60 MB
cpu-time: 0.012 sec
user-time: 0.011 sec
elapsed-time: 0.036 sec

The periodic "global" information is larger:

INFO  [2024-02-28 20:51:05,045] io.dockstore.webservice.helpers.DebugHelper: debug.threads by thread "diagnostics" (18):
"Reference Handler" daemon prio=10 Id=2 RUNNABLE
        at java.base@17.0.7/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
        at java.base@17.0.7/java.lang.ref.Reference.processPendingReferences(Reference.java:253)
        at java.base@17.0.7/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:215)
[...]

INFO  [2024-02-28 20:51:05,045] io.dockstore.webservice.helpers.DebugHelper: debug.database by thread "diagnostics" (18):
pool-size: 8
pool-active: 0
pool-idle: 8

INFO  [2024-02-28 20:51:05,046] io.dockstore.webservice.helpers.DebugHelper: debug.memory by thread "diagnostics" (18):
HEAP: init = 268435456(262144K) used = 70545688(68892K) committed = 144703488(141312K) max = 4294967296(4194304K)
NON-HEAP: init = 7667712(7488K) used = 121830168(118974K) committed = 123928576(121024K) max = -1(-1K)
POOL: CodeHeap 'non-nmethods', Non-heap memory
current: init = 2555904(2496K) used = 1457792(1423K) committed = 2555904(2496K) max = 5840896(5704K)
peak: init = 2555904(2496K) used = 1520384(1484K) committed = 2555904(2496K) max = 5840896(5704K)
collection: null
POOL: Metaspace, Non-heap memory
current: init = 0(0K) used = 84102304(82131K) committed = 84738048(82752K) max = -1(-1K)
peak: init = 0(0K) used = 84102304(82131K) committed = 84738048(82752K) max = -1(-1K)
collection: null
[...]

I've elided most of the thread dump. On my development box, it's around 10 lines per thread, and 34 threads are listed, totaling approximately 25kB. The memory pool dump is the next biggest entry, about 2.5kB total.

The ps output is not included here. It's a ton of output on my development box, but it's relatively small when running in ECS, just a few lines.

My feeling is that, although it's intriguing to ponder logging this stuff to a different group, it might be more trouble than it's worth if we can format the debug info so that it doesn't overwhelm the other log info.

@denis-yuen (Member):

I think this could work, in addition to swapping most of these outputs from INFO to TRACE.
The other thing we could presumably do is add some kind of filter to produce more than one log stream.
The downside, I think, is that we'll need to consider log expiry, but we probably need a ticket on that either way.

@denis-yuen (Member) left a comment:

cleaning list

public final class DebugHelper {

    private static final Logger LOG = LoggerFactory.getLogger(DebugHelper.class);
    private static final long LOG_PERIOD_MS = 10000L;
Member comment on the code above:

For reviewers, every ten seconds if curious

@david4096 (Contributor) left a comment:

Awesome to see new diagnostic information in the logs! It would be a nice-to-have if the portions that are quantifiable had metrics defined in dockstore-deploy so we could track these over time/per container.

In my understanding, the logs have the same access restrictions as the container and DB, so censoring the logs doesn't seem critical. For maintainability I would consider not including that part; however, it is good to have.

It might be preferable to use an existing library (which I couldn't immediately find) or something like https://github.com/mazen160/secrets-patterns-db/blob/master/db/rules-stable.yml or https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml as linked by Denis.
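
As a rough illustration of what applying published rules could look like (a sketch; the two patterns below are simplified versions of commonly published rules, not copied verbatim from either ruleset):

import java.util.Map;
import java.util.regex.Pattern;

public class SecretPatternsSketch {

    // Simplified example patterns in the spirit of secrets-patterns-db / gitleaks rules.
    private static final Map<String, Pattern> RULES = Map.of(
        "github-token", Pattern.compile("gh[pousr]_[A-Za-z0-9]{36,}"),
        "aws-access-key-id", Pattern.compile("(AKIA|ASIA|AGPA|AROA)[A-Z0-9]{16}"));

    public static boolean containsSecret(String message) {
        return RULES.values().stream().anyMatch(p -> p.matcher(message).find());
    }
}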

@denis-yuen (Member) left a comment:

Hmmm, I think that puts a finger on two things that have been bugging me about this. I think I'll have to type up something longer in #dockstore-devs


Quality Gate failed

Failed conditions
2.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@svonworl (Contributor, Author):

OK, I made a few changes; the current state is the proposed final version.

Part of my original "move it forward" plan was to remove the custom censoring code and replace it with the AWS CloudWatch "data protection" feature. However, upon deeper inspection, I learned that the AWS feature probably wouldn't catch the type of secrets that might leak, either because it's not looking for certain types of secrets (GitHub tokens, for example), or because it requires that they appear in certain contexts to detect them (for example, an AWS key needs to appear in proximity to the keyword aws_secret_access_key (or similar) for the AWS feature to censor it, which is unlikely to happen in our case).

By far, the most likely place for secrets to leak is the ps output. So, to totally neutralize that scenario, I removed the ps output from this PR. Now, the probability of a secret leaking is greatly reduced, much closer to the (low) background probability of leaking a secret in the rest of the logs, and censoring is much less important (either via our own code or the AWS feature).

So, I've removed the original webservice-side censoring code, and since the AWS data protection feature doesn't perform very well in our use case, I recommend that, at this juncture, we don't spend the time to enable it.

I should note that, left to my own devices, I would include the original webservice-side censoring code, but having dropped the ps output, I can merge this PR without it and not feel bad/worried/irresponsible/etc.

svonworl requested a review from denis-yuen April 11, 2024 16:52
@denis-yuen (Member):

because it's not looking for certain types of secrets (GitHub tokens, for example)

There is a custom data identifier feature https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL-custom-data-identifiers.html
I think you should be able to add the GitHub tokens since they start with gho, ghr, etc. https://github.blog/2021-04-05-behind-githubs-new-authentication-token-formats/
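
If we go that route, a custom data identifier regex along these lines should cover the documented prefixes (a sketch based on the token format post above; not tested against the CloudWatch feature):

gh[pousr]_[A-Za-z0-9]{36,255}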

@denis-yuen (Member):

we don't spend the time to enable it.

I think it should only be a handful of steps (https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/), but you can create a follow-up ticket.

@denis-yuen (Member) left a comment:

Would prefer to have at least the follow-up ticket for data protection

denis-yuen self-requested a review April 11, 2024 17:12
svonworl merged commit 63e9928 into develop Apr 24, 2024
svonworl deleted the feature/seab-6148/log-more-diagnostic-information branch April 24, 2024 22:06