SEAB-6148: Log more diagnostic information by svonworl · Pull Request #5821 · dockstore/dockstore · GitHub

SEAB-6148: Log more diagnostic information #5821


Merged: 39 commits merged from feature/seab-6148/log-more-diagnostic-information into develop on Apr 24, 2024

Conversation

@svonworl (Contributor) commented Feb 23, 2024

Description
This PR introduces a new DiagnosticsHelper class that the webservice application can wire into itself to log various information:

  • Periodic "Global" info about running processes, the filesystem, the database, and memory pools, logged via a Timer daemon thread.
  • Request-level info about a request and associated thread and Hibernate Session, logged at the start and finish of each request, two log entries per request.

In this iteration, I dropped the thread dump from the periodic logging output. At 25kB a pop, it didn't seem worth it.

Diagnostic logging is enabled using the diagnosticsConfig configuration file property. Periodic and request logging can be enabled/disabled separately, and the logging period can be set. By default, diagnostic logging is disabled. To enable both periodic and request logging, add the following to your config file:

diagnosticsConfig:
  logRequests: true
  logPeriodic: true
  periodSeconds: 60

To review the output, pipe the webservice stdout into tee /tmp/output and view the resulting file. Search for DiagnosticsHelper to find the new log entries.

A big concern is that, since we're logging data pulled out of the guts of our runtime environment, we'll accidentally log a secret, especially via the ps output, which includes the arguments to commands. We purposefully do not log environment variables, which are a known vector for secrets.

To reduce the chance of a secret spilling, all of the information that is logged - literally the entirety of each log message - passes through censoring code that looks for runs of characters that appear to be hex/base64, does some analysis on the character distributions, and X's them out if they appear to be keys. The crux here is that file paths resemble base64 strings because they tend to contain the same characters, so we need a way to let them through uncensored. The assumption is that a key will be well-encoded and contain a random jumble of characters, whereas a path (or other structured non-key data) will contain longer runs of the same character classes and/or contain text that resembles English. The resulting censoring scheme works pretty well; see the comments on the CensorHelper class for more information.
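
To make the heuristic concrete, here's a rough sketch of the kind of pass described above (illustrative only, not the actual CensorHelper code; the class name, regex, and threshold below are assumptions):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch: find longish runs of base64/hex-like characters and X them out
// if they look like random keys rather than paths or English-like text.
public class CensorSketch {

    // Candidate runs: 20+ consecutive characters from the base64/hex-ish alphabet.
    private static final Pattern CANDIDATE = Pattern.compile("[A-Za-z0-9+/=_-]{20,}");

    public static String censor(String message) {
        Matcher matcher = CANDIDATE.matcher(message);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String run = matcher.group();
            String replacement = looksLikeKey(run) ? "X".repeat(run.length()) : run;
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    // Keys are a random jumble, so they switch between character classes often;
    // paths and English-like text have longer runs of the same class.
    private static boolean looksLikeKey(String run) {
        int transitions = 0;
        for (int i = 1; i < run.length(); i++) {
            if (charClass(run.charAt(i)) != charClass(run.charAt(i - 1))) {
                transitions++;
            }
        }
        return (double) transitions / run.length() > 0.5;
    }

    private static int charClass(char c) {
        if (Character.isLowerCase(c)) {
            return 0;
        }
        if (Character.isUpperCase(c)) {
            return 1;
        }
        if (Character.isDigit(c)) {
            return 2;
        }
        return 3;
    }
}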

Another concern is that these changes could introduce instability into the webservice; keep an eye out for that type of problem. We're purposefully doing the periodic logging via a Timer/thread so that, if it hangs or throws, the normal functions of the webservice are less likely to be impacted.
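
For illustration, that isolation can be achieved with a daemon Timer whose task swallows its own exceptions (a minimal sketch assuming plain java.util.Timer; the actual wiring in DiagnosticsHelper may differ):

import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.TimeUnit;

public class PeriodicLoggingSketch {

    public static void start(long periodSeconds) {
        // The second argument makes the Timer's thread a daemon, so it won't block JVM shutdown.
        Timer timer = new Timer("diagnostics", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    logGlobalInfo();
                } catch (RuntimeException e) {
                    // Must not let an exception escape: an uncaught exception thrown from a
                    // TimerTask cancels the Timer, ending all future periodic logging.
                    System.err.println("periodic diagnostics failed: " + e);
                }
            }
        }, 0, TimeUnit.SECONDS.toMillis(periodSeconds));
    }

    private static void logGlobalInfo() {
        // placeholder for the process/filesystem/database/memory dumps described above
    }
}

Since the task runs on the Timer's own thread, a hang or slow dump delays only future periodic entries, not request handling.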

We can easily add a filter that will exclude certain requests from diagnostic logging (for example, hits to the health check endpoint).
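
Such a filter could be little more than a path check at the start of the request-logging hook; a tiny sketch (the path and class name here are hypothetical, and the real exclusion list would presumably come from the config):

import java.util.Set;

public class RequestLoggingFilterSketch {

    // Hypothetical exclusion list; the actual health check path is an assumption.
    private static final Set<String> EXCLUDED_PATHS = Set.of("metadata/health");

    public boolean shouldLog(String requestPath) {
        return !EXCLUDED_PATHS.contains(requestPath);
    }
}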

Review Instructions
View the CloudWatch logs for qa, and confirm that diagnostic log entries are being created before and after each request, as well as periodically. Search for "DiagnosticsHelper" to find the entries.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6148

Security and Privacy
Several security concerns; see the description.

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

svonworl marked this pull request as draft February 23, 2024 16:30
svonworl changed the title from "Feature/seab 6148/log more diagnostic information" to "SEAB-6148: Log more diagnostic information" Feb 23, 2024
@denis-yuen (Member):

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file
https://www.dropwizard.io/en/stable/manual/configuration.html#logging
that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules
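
For reference, a separate appender can be attached to just the diagnostic logger in the Dropwizard config, roughly like this (a sketch; the logger name and file paths are assumptions):

logging:
  loggers:
    io.dockstore.webservice.helpers.DebugHelper:
      level: INFO
      additive: false
      appenders:
        - type: file
          currentLogFilename: /var/log/dockstore/diagnostics.log
          archivedLogFilenamePattern: /var/log/dockstore/diagnostics-%d.log.gz
          archivedFileCount: 7

With additive: false, these entries would go only to the separate file rather than the main log.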

@coverbeck (Collaborator) left a comment:

Conceptually, looks good.

Do you have some sample output, just to get a sense of what it might look like and how large it would be?

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file https://www.dropwizard.io/en/stable/manual/configuration.html#logging that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules

On the one hand, yes, it could make the "main" log really noisy and might be worth separating. On the other hand, it would make it easier to correlate things if we're researching an issue; e.g., seeing the thread dump right before/after a request is logged would be useful. I wonder if CloudWatch has a way to "merge" log groups when querying?


// Request started.
case START:
logStarted(request);
Collaborator comment on the diff above:

This will be very useful; right now the request is only logged when completed. If we die (rare, but it's happened) processing a request, it's hard to piece together what the request was.

@denis-yuen (Member):

I wonder if CloudWatch has a way to "merge" log groups when querying?

Looks like, yes, via the insights queries
https://aws.amazon.com/about-aws/whats-new/2019/07/cloudwatch-logs-insights-adds-cross-log-group-querying/
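
For example, after selecting both log groups in Logs Insights, a query along these lines should pull the diagnostic entries from all of them in one result set (a sketch):

fields @timestamp, @logStream, @message
| filter @message like /DebugHelper/
| sort @timestamp asc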

@svonworl (Contributor, Author) commented Feb 28, 2024:

Conceptually, looks good.

Do you have some sample output, just to get a sense of what it might look like and how large it would be?

Basically, this PR introduces a new DebugHelper class which wires itself into the infrastructure and logs various information. "Global" info about threads, processes, the database, and memory is logged periodically via a Timer, and request-level info about a handling thread and Hibernate Session is logged at the end of each request.

Seems like a lot of data, not having looked at the code yet, wonder if it should be logged to a separate appender/file https://www.dropwizard.io/en/stable/manual/configuration.html#logging that could eventually end up in a log group so we can more easily choose what to view and/or have different expiry/deletion rules

On the one hand, yes, it could make the "main" log really noisy and might be worth separating. On the other hand, it would make it easier to correlate things if we're researching an issue; e.g., seeing the thread dump right before/after a request is logged would be useful. I wonder if CloudWatch has a way to "merge" log groups when querying?

Some thoughts:

  • It's relatively difficult to send separate streams of information to different log groups from a container running in ECS. The typical setup is to use the cloudwatch agent to send stdout/stderr to a group.
  • We could possibly modify our logging to condense each multi-line log entry to a single expandable entry (see the sketch after this list): https://stackoverflow.com/questions/67396522/cloudwatch-multiline-log-messages-from-containerized-app-runnning-on-ecs-ec2 This would greatly reduce the visual impact of the logging entries that appear in this PR, as well as collapse the giant stack traces that currently appear in our logs.
  • It is most useful to display the debugging information interleaved temporally with our other log output. The harder it is to do that, the less useful the debug information is.
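
Regarding the second bullet above, the usual approach with the awslogs log driver is a multiline pattern in the ECS task definition's log configuration, something like this sketch (the log group, region, and the pattern matching the start of a Dropwizard log line are assumptions):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "dockstore-webservice",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "webservice",
    "awslogs-multiline-pattern": "^(TRACE|DEBUG|INFO|WARN|ERROR)\\s+\\["
  }
}

Lines that don't match the pattern get appended to the previous CloudWatch event, so a multi-line thread dump or stack trace becomes one expandable entry.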

Regarding size, there are two groups of information being logged. The first, thread/request-level info, is pretty small:

INFO  [2024-02-28 20:50:59,095] io.dockstore.webservice.helpers.DebugHelper: debug.started by thread "dw-45 - GET /organizations/9/starredUsers" (45):
GET "organizations/9/starredUsers"
INFO  [2024-02-28 20:50:59,109] io.dockstore.webservice.helpers.DebugHelper: debug.session by thread "dw-45 - GET /organizations/9/starredUsers" (45):
SessionStatistics[entity count=1,collection count=4]
INFO  [2024-02-28 20:50:59,131] io.dockstore.webservice.helpers.DebugHelper: debug.finished by thread "dw-45 - GET /organizations/9/starredUsers" (45):
GET "organizations/9/starredUsers"
INFO  [2024-02-28 20:50:59,131] io.dockstore.webservice.helpers.DebugHelper: debug.thread by thread "dw-45 - GET /organizations/9/starredUsers" (45):
allocated: 0.60 MB
cpu-time: 0.012 sec
user-time: 0.011 sec
elapsed-time: 0.036 sec

The periodic "global" information is larger:

INFO  [2024-02-28 20:51:05,045] io.dockstore.webservice.helpers.DebugHelper: debug.threads by thread "diagnostics" (18):
"Reference Handler" daemon prio=10 Id=2 RUNNABLE
        at java.base@17.0.7/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
        at java.base@17.0.7/java.lang.ref.Reference.processPendingReferences(Reference.java:253)
        at java.base@17.0.7/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:215)
[...]

INFO  [2024-02-28 20:51:05,045] io.dockstore.webservice.helpers.DebugHelper: debug.database by thread "diagnostics" (18):
pool-size: 8
pool-active: 0
pool-idle: 8

INFO  [2024-02-28 20:51:05,046] io.dockstore.webservice.helpers.DebugHelper: debug.memory by thread "diagnostics" (18):
HEAP: init = 268435456(262144K) used = 70545688(68892K) committed = 144703488(141312K) max = 4294967296(4194304K)
NON-HEAP: init = 7667712(7488K) used = 121830168(118974K) committed = 123928576(121024K) max = -1(-1K)
POOL: CodeHeap 'non-nmethods', Non-heap memory
current: init = 2555904(2496K) used = 1457792(1423K) committed = 2555904(2496K) max = 5840896(5704K)
peak: init = 2555904(2496K) used = 1520384(1484K) committed = 2555904(2496K) max = 5840896(5704K)
collection: null
POOL: Metaspace, Non-heap memory
current: init = 0(0K) used = 84102304(82131K) committed = 84738048(82752K) max = -1(-1K)
peak: init = 0(0K) used = 84102304(82131K) committed = 84738048(82752K) max = -1(-1K)
collection: null
[...]

I've elided most of the thread dump. On my development box, it's around 10 lines per thread, and 34 threads are listed, totaling approximately 25kB. The memory pool dump is the next biggest entry, about 2.5kB total.

The ps output is not included here. It's a ton of output on my development box, but it's relatively small when running in ECS, just a few lines.

My feeling is that, although it's intriguing to ponder logging this stuff to a different group, it might be more trouble than it's worth if we can format the debug info so that it doesn't overwhelm the other log info.

@denis-yuen (Member):

I think this could work, in addition to swapping most of these outputs from INFO to TRACE.
The other thing we could presumably do is add some kind of filter to produce more than one log stream.
The downside, I think, is that we'll need to consider log expiry, but we probably need a ticket on that either way.

@denis-yuen (Member) left a comment:

cleaning list

public final class DebugHelper {

    private static final Logger LOG = LoggerFactory.getLogger(DebugHelper.class);
    private static final long LOG_PERIOD_MS = 10000L;
Member comment on the code above:

For reviewers, every ten seconds if curious

@david4096 (Contributor) left a comment:

Awesome to see new diagnostic information in the logs! It would be a nice-to-have if the portions that are quantifiable had metrics defined in dockstore-deploy so we could track these over time/per container.

In my understanding, the logs have the same access restrictions as the container and DB, so censoring the logs doesn't seem critical. For maintainability I would consider not including that part; however, it is good to have.

It might be preferable to use an existing library (which I couldn't immediately find) or something like https://github.com/mazen160/secrets-patterns-db/blob/master/db/rules-stable.yml or https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml as linked by Denis.
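
As a rough illustration of what applying published rules could look like (a sketch; the two patterns below are simplified versions of commonly published rules, not copied verbatim from either ruleset):

import java.util.Map;
import java.util.regex.Pattern;

public class SecretPatternsSketch {

    // Simplified example patterns in the spirit of secrets-patterns-db / gitleaks rules.
    private static final Map<String, Pattern> RULES = Map.of(
        "github-token", Pattern.compile("gh[pousr]_[A-Za-z0-9]{36,}"),
        "aws-access-key-id", Pattern.compile("(AKIA|ASIA|AGPA|AROA)[A-Z0-9]{16}"));

    public static boolean containsSecret(String message) {
        return RULES.values().stream().anyMatch(p -> p.matcher(message).find());
    }
}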

@denis-yuen (Member) left a comment:

Hmmm, I think that puts a finger on two things that have been bugging me about this. I think I'll have to type up something longer in #dockstore-devs


Quality Gate failed

Failed conditions
2.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@svonworl (Contributor, Author):

OK, I made a few changes; the current state is the proposed final version.

Part of my original "move it forward" plan was to remove the custom censoring code and replace it with the AWS CloudWatch "data protection" feature. However, upon deeper inspection, I learned that the AWS feature probably wouldn't catch the type of secrets that might leak, either because it's not looking for certain types of secrets (GitHub tokens, for example), or because it requires that they appear in certain contexts to detect them (for example, an AWS key needs to appear in proximity to the keyword aws_secret_access_key (or similar) for the AWS feature to censor it, which is unlikely to happen in our case).

By far, the most likely place for secrets to leak is the ps output. So, to totally neutralize that scenario, I removed the ps output from this PR. Now, the probability of a secret leaking is greatly reduced, much closer to the (low) background probability of leaking a secret in the rest of the logs, and censoring is much less important (either via our own code or the AWS feature).

So, I've removed the original webservice-side censoring code, and since the AWS data protection feature doesn't perform very well in our use case, I recommend that, at this juncture, we don't spend the time to enable it.

I should note that, left to my own devices, I would include the original webservice-side censoring code, but having dropped the ps output, I can merge this PR without it and not feel bad/worried/irresponsible/etc.

svonworl requested a review from denis-yuen April 11, 2024 16:52
@denis-yuen (Member):

because it's not looking for certain types of secrets (GitHub tokens, for example)

There is a custom data identifier feature https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL-custom-data-identifiers.html
I think you should be able to add the GitHub tokens since they start with gho, ghr, etc. https://github.blog/2021-04-05-behind-githubs-new-authentication-token-formats/
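
If we go that route, a custom data identifier regex along these lines should cover the documented prefixes (a sketch based on the token format post above; not tested against the CloudWatch feature):

gh[pousr]_[A-Za-z0-9]{36,255}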

@denis-yuen (Member):

we don't spend the time to enable it.

I think it should only be a handful of steps (https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/), but you can create a follow-up ticket.

@denis-yuen (Member) left a comment:

Would prefer to have at least the follow-up ticket for data protection

denis-yuen self-requested a review April 11, 2024 17:12
svonworl merged commit 63e9928 into develop Apr 24, 2024
svonworl deleted the feature/seab-6148/log-more-diagnostic-information branch April 24, 2024 22:06