-
Notifications
You must be signed in to change notification settings - Fork 29
SEAB-6488: Modify SourceFile to flag errors #5934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SEAB-6488: Modify SourceFile to flag errors #5934
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5934 +/- ##
=============================================
- Coverage 74.27% 74.21% -0.06%
- Complexity 5374 5376 +2
=============================================
Files 376 376
Lines 19462 19484 +22
Branches 2032 2032
=============================================
+ Hits 14455 14460 +5
- Misses 4033 4050 +17
Partials 974 974
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May pre-date this PR, but given how core this is to Dockstore, the 0% patch test coverage on LimitedSourceFileBuilder is worth a look. Wonder if there's something going on with unit tests vs integration tests
Some minor comments/questions
// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER | ||
// https://www.ascii-code.com/character/%E2%90%80 | ||
limitedContent = "Dockstore does not store binary files"; | ||
limitedForm = SourceFile.FormEnum.ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly an error in this case since it is working as intended.
Maybe "incomplete" or "placeholder"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly an error in this case since it is working as intended. Maybe "incomplete" or "placeholder"?
I agree that this is not an error, it's the design, and I think it should be distinguishable from an actual error. The conversation is moving and some of this has been addressed the comments and in your PR description, so this is probably repetitive, but...
- I also think it's worth distinguishing the cases where we don't have have the data because it's binary vs. the case where the content is too large (and your
TRUNCATED
proposal addresses that).
// https://www.ascii-code.com/character/%E2%90%80 | ||
if (Bytes.indexOf(bytes, Byte.decode("0x00")) != -1) { | ||
limitedContent = "Dockstore does not store binary files"; | ||
limitedForm = SourceFile.FormEnum.ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Borderline (see previous comment), but seems more okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate, I'd almost imagine that an error should really be something unexpected like a 500 on GitHub's side and we'd have states for large files or binary files rather than merging them into errors in-case those two states are useful someday.
i.e. maybe we'd link to binaries or large files on github?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is similar to your description
Later, we could add other values to the "form" enum. For example, if we wanted to try to handle big files by storing some portion of the content (instead of totally jettisoning it), we could have values like:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd have states for large files or binary files
IMHO, the following two things are different concepts:
- whether a file is binary (or not).
- how we choose to represent the file's content on our end.
I'd recommend they are stored in separate fields (one for isBinary
, another that describes the representation we've chosen for the content (which could be "none")).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I think as long as they're not merged into the error state, being even more descriptive is totally cool
@@ -371,4 +385,9 @@ public static class VerificationInformation { | |||
|
|||
// There is no dbupdatedate because it doesn't work with @Embeddable nor @ElementCollection | |||
} | |||
|
|||
public enum FormEnum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably code comment the enum and the enum values.
Don't feel strongly about this, but Form
doesn't really convey much to me aside from Google Forms, web forms, etc.
Perhaps "status" or "state"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this one is tough. I'm not super-hot on "form", either, but it feels more apt than most of the alternatives.
"Status" and "state" are both vague, and suggest the possibility future change. "Mode" is a previously-loaded term when it comes to files. Maybe "representation", which could have values COMPLETE
and NONE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like representation better than form at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not wild about Form
either, although I can see why you chose it. I don't think I like Representation
either -- I may be biased against it from its meaning in REST.
SyncState
or SyncStatus
? But that's not really applicable to hosted entries. ContentState
?
// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER | ||
// https://www.ascii-code.com/character/%E2%90%80 | ||
limitedContent = "Dockstore does not store binary files"; | ||
limitedForm = SourceFile.FormEnum.ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly an error in this case since it is working as intended. Maybe "incomplete" or "placeholder"?
I agree that this is not an error, it's the design, and I think it should be distinguishable from an actual error. The conversation is moving and some of this has been addressed the comments and in your PR description, so this is probably repetitive, but...
- I also think it's worth distinguishing the cases where we don't have have the data because it's binary vs. the case where the content is too large (and your
TRUNCATED
proposal addresses that).
</addColumn> | ||
<sql dbms="postgresql"> | ||
<comment>Set appropriate content and form for sourcefiles with null content</comment> | ||
UPDATE sourcefile SET form = 'ERROR', content = 'Dockstore could not retrieve this file' WHERE content IS NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know that this is always the reason why the content is null? Just checking... And there are no null content source files for hosted entries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, upon further investigation, turns out that we do intentionally create SourceFile stubs, in EntryVersionHelper
at least. I've added a STUB
state and changed the migrations to define all existing null-content SourceFiles as stubs. The webservice might not have intentionally created all of them as stubs, but that's effectively what they are now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Editorial: the "stub" concept introduces another state into our system, which makes it more difficult to reason about. If we can simplify our system by someday eliminating stubs, that probably would be good,
@@ -371,4 +385,9 @@ public static class VerificationInformation { | |||
|
|||
// There is no dbupdatedate because it doesn't work with @Embeddable nor @ElementCollection | |||
} | |||
|
|||
public enum FormEnum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not wild about Form
either, although I can see why you chose it. I don't think I like Representation
either -- I may be biased against it from its meaning in REST.
SyncState
or SyncStatus
? But that's not really applicable to hosted entries. ContentState
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking better, questions
dockstore-webservice/src/main/java/io/dockstore/webservice/core/SourceFile.java
Show resolved
Hide resolved
STUB | ||
} | ||
|
||
public enum Reason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to considering the two enum approach. The weird part in my eyes is we would need constraints or a validator (?) to enforce illegal combinations. Like a complete + binary error for example.
What is the argument against one enum with more states?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the argument against one enum with more states?
Each enum contains a different type of information: State
describes how the SourceFile is stored, and Reason
describes why we stored the SourceFile that way. When the values are separate, the code that uses them is simpler, and needs less modification when we add values, we don't need as many ifs
. The UI can look at State
to figure out how to display the SourceFile, and other code can examine Reason
to generate statistics, etc.
What's the advantage to "one enum" (that encodes both values) besides eliminating the illegal combos?
It would be easy to add a constraint that made sure "reason" was null for the COMPLETE
and STUB
states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence on this one. Some thoughts:
Would you also need to add a constraint that if the state
is MESSAGE
, then reason
has to be set?
And if you add more states, you'd have to constrain those relationships too. Could it start to get a little complex?
On the other hand, if you have one enum, you could end up with a lot of values.
Digression on other possible states -- truncated, which you proposed. Not sure what else, maybe quarantined, lol -- we find or are informed about a file that might be have malware or PII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the "one" and "two" enum solutions seem to have their pros and cons. How's about, in the interest of 10000 forward progress, we go with the "two enum" solution for now. Maybe its pros will be more apparent when in use, and if we don't like it, we can easily change to the other later. Agile!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should also note that I'm personally a proponent of the "one enum" solution with values that only encode the representation, with the "reason" being recorded as an English message, stored in the content
field (for now). Clean, simple, no partial 2d grid of enum values to have to keep consistent. We wouldn't have access to the "reason" as an abstract enum value, but if we wanted to know how often it happens, we could log something...
I separated the Reason
field into a separate enum because a conversation with a key team member implied that they might insist upon it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not wedded to the one enum approach, but thanks for explaining.
That said, like in https://github.com/dockstore/dockstore/blob/develop/dockstore-common/src/main/java/io/dockstore/common/DescriptorLanguage.java#L45 we can put some smarts in the enum, I think that can deal with some of the concerns about having too many states.
Regardless of what approach we end up with, I would prefer constraints or whatever in this PR to ensure we don't forget and that we never have to deal with inconsistent data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated in-line, on the fence on 1 vs 2 enums. Agreed with comments on the enums.
STUB | ||
} | ||
|
||
public enum Reason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence on this one. Some thoughts:
Would you also need to add a constraint that if the state
is MESSAGE
, then reason
has to be set?
And if you add more states, you'd have to constrain those relationships too. Could it start to get a little complex?
On the other hand, if you have one enum, you could end up with a lot of values.
Digression on other possible states -- truncated, which you proposed. Not sure what else, maybe quarantined, lol -- we find or are informed about a file that might be have malware or PII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't care too much about the enum approach, do prefer constraints or what not up front and not as a follow-up.
Ok, I reverted to a simplified version which is similar to the original implementation, with a single enum that represents the state, some naming improvements, an additional |
|
// Thus, Dockstore cannot currently store binary files. | ||
// https://www.postgresql.org/docs/current/datatype-character.html#DATATYPE-CHARACTER | ||
// https://www.ascii-code.com/character/%E2%90%80 | ||
file.setContent("Dockstore does not store binary files"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, if we wanted to get a list of binary files or truncated files we would need to do a SQL query with this string, so should at least be a constant.
A hex/reason code or something quick to look up may be helpful too if we end up with a bunch of these.
e.g. "R01 - Dockstore does not store binary files", "R02 - Dockstore does not store files of this type ", etc.
This PR modifies SourceFile so that we can conclusively differentiate a successfully downloaded file from one which failed to meet our requirements (for example, a file that could not be stored because it was binary.) See this PR for some motivation, search for "in a followup PR": #5893
Specifically, this PR:
state
property to SourceFile, which describes the representation of the SourceFile, and currently has two values:COMPLETE
: the file was successfully downloaded and the SourceFile'scontent
represents the file body.NOT_STORED
: the SourceFile'scontent
contains a message, rather than the actual file body, that explains why the body isn't stored. For example, "Dockstore does not store binary files".STUB
: the SourceFile is a stub and has null content.Later, we could add other values to the new enum. For example, if we wanted to try to handle big files by storing a smaller alternate representation of the content (instead of totally jettisoning it), we could have new
state
values like:SUMMARY
: the content represents a summarized version of the file content.HEAD
: the content represents the leading portion of the file content.The above might be useful to support the registration of AI models, for example, which have files containing gigabytes of weights.
The PR also contains some db migration fixups for the existing sourcefile values.
There will be a companion UI PR, which, at a minimum, will tweak the UI so it builds and change the notebook preview code to not attempt to render not-
COMPLETE
content as a notebook.Review Instructions
Register an entry that contains files that are binary and/or too large, and confirm that the resulting sourcefiles have the correct "state" values, both in the db and in the API responses.
Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6488
Security and Privacy
If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.
e.g. Does this change...
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install
@RolesAllowed
annotation