8000 SEAB-6488: Modify SourceFile to flag errors by svonworl · Pull Request #5934 · dockstore/dockstore · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

SEAB-6488: Modify SourceFile to flag errors #5934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 18, 2024

Conversation

svonworl
Copy link
Contributor
@svonworl svonworl commented Jul 10, 2024

This PR modifies SourceFile so that we can conclusively differentiate a successfully downloaded file from one which failed to meet our requirements (for example, a file that could not be stored because it was binary.) See this PR for some motivation, search for "in a followup PR": #5893

Specifically, this PR:

  • adds a state property to SourceFile, which describes the representation of the SourceFile, and currently has two values:
  1. COMPLETE: the file was successfully downloaded and the SourceFile's content represents the file body.
  2. NOT_STORED: the SourceFile's content contains a message, rather than the actual file body, that explains why the body isn't stored. For example, "Dockstore does not store binary files".
  3. STUB: the SourceFile is a stub and has null content.
  • changes the "limited" SourceFile builder to set the new properties appropriately.
  • changes the SourceFile copy/duplicate helper methods to copy the new properties.

Later, we could add other values to the new enum. For example, if we wanted to try to handle big files by storing a smaller alternate representation of the content (instead of totally jettisoning it), we could have new state values like:

  • SUMMARY: the content represents a summarized version of the file content.
  • HEAD: the content represents the leading portion of the file content.

The above might be useful to support the registration of AI models, for example, which have files containing gigabytes of weights.

The PR also contains some db migration fixups for the existing sourcefile values.

There will be a companion UI PR, which, at a minimum, will tweak the UI so it builds and change the notebook preview code to not attempt to render not-COMPLETE content as a notebook.

Review Instructions

Register an entry that contains files that are binary and/or too large, and confirm that the resulting sourcefiles have the correct "state" values, both in the db and in the API responses.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6488

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

  • Security and Privacy assessed

e.g. Does this change...

  • Any user data we collect, or data location?
  • Access control, authentication or authorization?
  • Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@svonworl svonworl self-assigned this Jul 10, 2024
Copy link
codecov bot commented Jul 10, 2024

Codecov Report

Attention: Patch coverage is 17.24138% with 24 lines in your changes missing coverage. Please review.

Project coverage is 74.21%. Comparing base (7022dd4) to head (03098f5).

Files Patch % Lines
...tore/webservice/core/LimitedSourceFileBuilder.java 0.00% 19 Missing ⚠️
.../java/io/dockstore/webservice/core/SourceFile.java 50.00% 5 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #5934      +/-   ##
=============================================
- Coverage      74.27%   74.21%   -0.06%     
- Complexity      5374     5376       +2     
=============================================
  Files            376      376              
  Lines          19462    19484      +22     
  Branches        2032     2032              
=============================================
+ Hits           14455    14460       +5     
- Misses          4033     4050      +17     
  Partials         974      974              
Flag Coverage Δ
bitbuckettests 26.89% <17.24%> (-0.02%) ⬇️
hoverflytests 27.37% <17.24%> (-0.01%) ⬇️
integrationtests 56.78% <17.24%> (-0.04%) ⬇️
languageparsingtests 11.07% <6.89%> (-0.01%) ⬇️
localstacktests 21.59% <17.24%> (+<0.01%) ⬆️
toolintegrationtests 30.29% <17.24%> (-0.01%) ⬇️
unit-tests_and_non-confidential-tests 25.92% <13.79%> (-0.01%) ⬇️
workflowintegrationtests 38.20% <17.24%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May pre-date this PR, but given how core this is to Dockstore, the 0% patch test coverage on LimitedSourceFileBuilder is worth a look. Wonder if there's something going on with unit tests vs integration tests

Some minor comments/questions

// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER
// https://www.ascii-code.com/character/%E2%90%80
limitedContent = "Dockstore does not store binary files";
limitedForm = SourceFile.FormEnum.ERROR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly an error in this case since it is working as intended.
Maybe "incomplete" or "placeholder"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly an error in this case since it is working as intended. Maybe "incomplete" or "placeholder"?

I agree that this is not an error, it's the design, and I think it should be distinguishable from an actual error. The conversation is moving and some of this has been addressed the comments and in your PR description, so this is probably repetitive, but...

  • I also think it's worth distinguishing the cases where we don't have have the data because it's binary vs. the case where the content is too large (and your TRUNCATED proposal addresses that).

// https://www.ascii-code.com/character/%E2%90%80
if (Bytes.indexOf(bytes, Byte.decode("0x00")) != -1) {
limitedContent = "Dockstore does not store binary files";
limitedForm = SourceFile.FormEnum.ERROR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Borderline (see previous comment), but seems more okay

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate, I'd almost imagine that an error should really be something unexpected like a 500 on GitHub's side and we'd have states for large files or binary files rather than merging them into errors in-case those two states are useful someday.
i.e. maybe we'd link to binaries or large files on github?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is similar to your description

Later, we could add other values to the "form" enum. For example, if we wanted to try to handle big files by storing some portion of the content (instead of totally jettisoning it), we could have values like:

Copy link
Contributor Author
@svonworl svonworl Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd have states for large files or binary files

IMHO, the following two things are different concepts:

  • whether a file is binary (or not).
  • how we choose to represent the file's content on our end.

I'd recommend they are stored in separate fields (one for isBinary, another that describes the representation we've chosen for the content (which could be "none")).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I think as long as they're not merged into the error state, being even more descriptive is totally cool

@@ -371,4 +385,9 @@ public static class VerificationInformation {

// There is no dbupdatedate because it doesn't work with @Embeddable nor @ElementCollection
}

public enum FormEnum {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably code comment the enum and the enum values.
Don't feel strongly about this, but Form doesn't really convey much to me aside from Google Forms, web forms, etc.
Perhaps "status" or "state"?

Copy link
Contributor Author
@svonworl svonworl Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this one is tough. I'm not super-hot on "form", either, but it feels more apt than most of the alternatives.
"Status" and "state" are both vague, and suggest the possibility future change. "Mode" is a previously-loaded term when it comes to files. Maybe "representation", which could have values COMPLETE and NONE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like representation better than form at least

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not wild about Form either, although I can see why you chose it. I don't think I like Representation either -- I may be biased against it from its meaning in REST.

SyncState or SyncStatus? But that's not really applicable to hosted entries. ContentState?

// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER
// https://www.ascii-code.com/character/%E2%90%80
limitedContent = "Dockstore does not store binary files";
limitedForm = SourceFile.FormEnum.ERROR;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly an error in this case since it is working as intended. Maybe "incomplete" or "placeholder"?

I agree that this is not an error, it's the design, and I think it should be distinguishable from an actual error. The conversation is moving and some of this has been addressed the comments and in your PR description, so this is probably repetitive, but...

  • I also think it's worth distinguishing the cases where we don't have have the data because it's binary vs. the case where the content is too large (and your TRUNCATED proposal addresses that).

</addColumn>
<sql dbms="postgresql">
<comment>Set appropriate content and form for sourcefiles with null content</comment>
UPDATE sourcefile SET form = 'ERROR', content = 'Dockstore could not retrieve this file' WHERE content IS NULL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know that this is always the reason why the content is null? Just checking... And there are no null content source files for hosted entries?

Copy link
Contributor Author
@svonworl svonworl Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, upon further investigation, turns out that we do intentionally create SourceFile stubs, in EntryVersionHelper at least. I've added a STUB state and changed the migrations to define all existing null-content SourceFiles as stubs. The webservice might not have intentionally created all of them as stubs, but that's effectively what they are now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Editorial: the "stub" concept introduces another state into our system, which makes it more difficult to reason about. If we can simplify our system by someday eliminating stubs, that probably would be good,

@@ -371,4 +385,9 @@ public static class VerificationInformation {

// There is no dbupdatedate because it doesn't work with @Embeddable nor @ElementCollection
}

public enum FormEnum {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not wild about Form either, although I can see why you chose it. I don't think I like Representation either -- I may be biased against it from its meaning in REST.

SyncState or SyncStatus? But that's not really applicable to hosted entries. ContentState?

@svonworl svonworl requested review from coverbeck and denis-yuen July 12, 2024 17:54
Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better, questions

STUB
}

public enum Reason {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to considering the two enum approach. The weird part in my eyes is we would need constraints or a validator (?) to enforce illegal combinations. Like a complete + binary error for example.

What is the argument against one enum with more states?

Copy link
Contributor Author
@svonworl svonworl Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the argument against one enum with more states?

Each enum contains a different type of information: State describes how the SourceFile is stored, and Reason describes why we stored the SourceFile that way. When the values are separate, the code that uses them is simpler, and needs less modification when we add values, we don't need as many ifs. The UI can look at State to figure out how to display the SourceFile, and other code can examine Reason to generate statistics, etc.

What's the advantage to "one enum" (that encodes both values) besides eliminating the illegal combos?

It would be easy to add a constraint that made sure "reason" was null for the COMPLETE and STUB states.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence on this one. Some thoughts:

Would you also need to add a constraint that if the state is MESSAGE, then reason has to be set?

And if you add more states, you'd have to constrain those relationships too. Could it start to get a little complex?

On the other hand, if you have one enum, you could end up with a lot of values.

Digression on other possible states -- truncated, which you proposed. Not sure what else, maybe quarantined, lol -- we find or are informed about a file that might be have malware or PII?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the "one" and "two" enum solutions seem to have their pros and cons. How's about, in the interest of 10000 forward progress, we go with the "two enum" solution for now. Maybe its pros will be more apparent when in use, and if we don't like it, we can easily change to the other later. Agile!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me.

Copy link
Contributor Author
@svonworl svonworl Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also note that I'm personally a proponent of the "one enum" solution with values that only encode the representation, with the "reason" being recorded as an English message, stored in the content field (for now). Clean, simple, no partial 2d grid of enum values to have to keep consistent. We wouldn't have access to the "reason" as an abstract enum value, but if we wanted to know how often it happens, we could log something...

I separated the Reason field into a separate enum because a conversation with a key team member implied that they might insist upon it...

Copy link
Member
@denis-yuen denis-yuen Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not wedded to the one enum approach, but thanks for explaining.

That said, like in https://github.com/dockstore/dockstore/blob/develop/dockstore-common/src/main/java/io/dockstore/common/DescriptorLanguage.java#L45 we can put some smarts in the enum, I think that can deal with some of the concerns about having too many states.

Regardless of what approach we end up with, I would prefer constraints or whatever in this PR to ensure we don't forget and that we never have to deal with inconsistent data.

Copy link
Collaborator
@coverbeck coverbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated in-line, on the fence on 1 vs 2 enums. Agreed with comments on the enums.

STUB
}

public enum Reason {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence on this one. Some thoughts:

Would you also need to add a constraint that if the state is MESSAGE, then reason has to be set?

And if you add more states, you'd have to constrain those relationships too. Could it start to get a little complex?

On the other hand, if you have one enum, you could end up with a lot of values.

Digression on other possible states -- truncated, which you proposed. Not sure what else, maybe quarantined, lol -- we find or are informed about a file that might be have malware or PII?

@denis-yuen denis-yuen self-requested a review July 16, 2024 20:40
Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't care too much about the enum approach, do prefer constraints or what not up front and not as a follow-up.

@svonworl
Copy link
Contributor Author

Ok, I reverted to a simplified version which is similar to the original implementation, with a single enum that represents the state, some naming improvements, an additional STUB enum value, and some logging so we can determine how often sourcefiles are limited in the various ways. The dual enums are history, and discussions about their logical consistency and how to maintain it are now moot. w00t!

@svonworl svonworl requested a review from denis-yuen July 17, 2024 01:54
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
13.5% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

// Thus, Dockstore cannot currently store binary files.
// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER
// https://www.ascii-code.com/character/%E2%90%80
file.setContent("Dockstore does not store binary files");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, if we wanted to get a list of binary files or truncated files we would need to do a SQL query with this string, so should at least be a constant.
A hex/reason code or something quick to look up may be helpful too if we end up with a bunch of these.

e.g. "R01 - Dockstore does not store binary files", "R02 - Dockstore does not store files of this type ", etc.

@svonworl svonworl merged commit 3c672b6 into develop Jul 18, 2024
17 of 19 checks passed
@svonworl svonworl deleted the feature/seab-6488/modify-sourcefile-to-flag-errors branch July 18, 2024 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0