-
Notifications
You must be signed in to change notification settings - Fork 29
SEAB-6464: Infer .dockstore.yml #5939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly pending vfs discussion
|
/** | ||
* Maps file paths to saved file content. | ||
*/ | ||
private final Map<String, String> filePathToContent = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionally looks like a Guava Loading cache
https://github.com/google/guava/wiki/cachesexplained#example
There may be other cache implementations floating around our classpath.
Ditto with dirPathToFiles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While at it, it's a bit confusing because the map is from string to string.
Consider using a Path https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/file/Path.html as the key so things are more typed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misc observations:
-
Yes, it's a very simple cache based on a
Map
, also on our classpath. The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS), but at the expense of complexity, performance, and a larger chance that we'll inadvertently use it incorrectly. If someone feels really strongly, we can do it, though... -
Path
has system-dependent behavior. For example, the file separator changes from system to system. Because of that, would actually recommend, in general, thatPath
be avoided when possible, except for purposes such as accessing the local file system. Aspirational, of course... -
While the semantics of
Path
and the semantics of workflow file paths seem to overlap each other 1:1 (more or less), it's not implausible that there's some weird file syntax or structure that we'd want to support someday thatPath
doesn't. -
In our codebase, you will find
Map<String, String>
all over the place, and we very often useString
to refer to paths. So, it meets the standard of our existing code, at least. Regarding this particular situation, don't think the advantage ofPath
in the type outweighs the ick of converting the incoming value, which is aString
, to aPath
in each method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it meets the standard of our existing code, at least.
FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.
it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.
Seems unlikely, but in that case it seems like you can implement Path
anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html
Path has system-dependent behavior.
I think I'm ok with Dockstore always running on Linux or Linux-like systems?
The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)
On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.
dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/FileTree.java
Outdated
Show resolved
Hide resolved
dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/GitHubSourceCodeRepo.java
Show resolved
Hide resolved
*/ | ||
public class SyntheticFileTree implements FileTree { | ||
|
||
private Map<String, String> pathToContent = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is also functionally a cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't pull anything in, and everything needs to persist, so not so very much like a cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.
By default cache instances created by CacheBuilder will not perform any type of eviction.
https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html
They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile
?
https://www.baeldung.com/guava-cache#preload-the-cache
I feel less strongly about this one than CachingFileTree
since it looks like the intent here is to use this for testing.
dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java
Outdated
Show resolved
Hide resolved
/** | ||
* Maps directory paths to saved directory contents. | ||
*/ | ||
private final Map<String, List<String>> dirPathToFiles = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for Paths
...bservice/src/main/java/io/dockstore/webservice/helpers/infer/DescriptorLanguageInferrer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be longer discussion
dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/GitHubSourceCodeRepo.java
Show resolved
Hide resolved
/** | ||
* Maps file paths to saved file content. | ||
*/ | ||
private final Map<String, String> filePathToContent = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it meets the standard of our existing code, at least.
FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.
it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.
Seems unlikely, but in that case it seems like you can implement Path
anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html
Path has system-dependent behavior.
I think I'm ok with Dockstore always running on Linux or Linux-like systems?
The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)
On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.
*/ | ||
public class SyntheticFileTree implements FileTree { | ||
|
||
private Map<String, String> pathToContent = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.
By default cache instances created by CacheBuilder will not perform any type of eviction.
https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html
They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile
?
https://www.baeldung.com/guava-cache#preload-the-cache
I feel less strongly about this one than CachingFileTree
since it looks like the intent here is to use this for testing.
.filter(path -> path.startsWith(normalizedDirPath)) | ||
.map(path -> path.substring(normalizedDirPath.length())) | ||
.filter(path -> !path.isEmpty()) | ||
.map(path -> path.split(SLASH)[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think I still find this hard to understand, would be nice to simplify/standardize the code
Assuming I did understand, there's probably stuff like Path.getParent that can simplify https://stackoverflow.com/questions/1099859/how-to-split-a-path-platform-independent#comment914553_1099868
|
Ok, I've made a set of changes that should get this over the hump... A note about one of the changes:
One of the mandated changes, promulgated by a reviewer who must approve any PR before I am able to merge it, was to change
There are other reasons, the above is not an exhaustive list. I would have preferred that we continue to use our historical path representation in this PR, and then later, as a team, considered a good path representation and how we might use it to improve the webservice. A possibility might be an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into Caffeine and while I understand and respect your preference to work with Strings rather than Paths, I think this is a good step of many in helping us get rid of our mix of java.io.File and unstructured Strings.
public List<String> listFiles(Path dirPath) { | ||
Set<String> files = new HashSet<>(); | ||
for (Path filePath: pathToContent.keySet()) { | ||
if (filePath.startsWith(dirPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I know this wasn't your preference but I find it easier to understand getNameCount
and startsWith
than working with split and string manipulation.
I think the lambda version of this would be fine, but no need to change back
|
||
@Test | ||
void testNonSpaceSeparatorsPattern() { | ||
Assertions.assertArrayEquals(new String[]{"/a.wdl", "b.cwl", "c.cwl"}, DescriptorLanguageInferrer.NON_SPACE_SEPARATORS.split("/a.wdl,'b.cwl',\"c.cwl\"")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When repeated like this, Assertions
can be static import for readability
Description
This PR implements the foundations of the "infer .dockstore.yml" feature, which inspects the file tree corresponding to a GitHub repo/ref, tries to determine the entries within that a user would probably want to register, and generates a corresponding
.dockstore.yml
. Storage/delivery of the resulting.dockstore.yml
will be implemented in a future PR.The new functionality is implemented by a new helper
InferrerHelper
, which has two key methods:infer
: read a specified repo/ref and infer a list of entries from it.toDockstoreYml
: given a list of inferred entries, produce the body of a corresponding.dockstore.yml
We've split the functionality into separate methods as above so that later, we have the freedom to do things like use the inferred entries to take a certain action, generate a
.dockstore.yml
from user input, etc.InferrerHelper.infer
delegates to instances ofInferrer
, which encapsulate the inference logic for various types of files. The code is designed to make it easy to add other types ofInferrers
in the future, for example, to support a particular type of manifest file.Currently, the code can find CWL, WDL, Nextflow, Galaxy, and Jupyter entries.
To allow experimentation, I've created a temporary
inferEntries
endpoint that you can use to run the code on a specified repo/ref. It accesses GitHub as the authorized user (and with their rate limit), and will respond with a suggested.dockstore.yml
. See invocations below. I've marked the endpoint asDeprecated
, and we should remove it before the 1.16 release.To infer the list of entries, the
DescriptorLanguageInferrer.infer
method:As a general design goal, this code employs heuristics that require as little parsing as possible, because we're trying to create a
.dockstore.yml
, not to validate the entries that it describes. When the user attempts to register the entries using the generated.dockstore.yml
, our existing language processing code will scrutinize their contents. Step 4 above exists to eliminate secondary descriptors from further consideration, and similarly, it does not do any sophisticated parsing, but rather, finds path-like strings using a regular expression.The code reads the ref/repo file tree via a
FileTree
instance. which roughly mirrors our existing pluginFileReader
interface (later, we might unify the two). We add alistPaths
method that returns a list of all paths in the file tree, so that we don't need enumerate them incrementally via the existingGitHubSourceCodeRepo.listFiles
method, which would take a lot of time (and rate limit) if a repository contained many directories. The includedGitHubFileTree
implementation downloads a Zip repo/ref archive at initialization time, and then extracts the paths and file contents from it on demand, with some fallbacks to existingGitHubSourceCodeRepo
code. Thusly, most of the GitHub API file retrieval calls are eliminated.By abstracting the file access code, we can easily:
FileTree
that reads from a Zip file, GitLab, GitHub, the GitHub "full tree" endpoint, filesystem (rooted at particular directory), some other source we can't imagine yet, etc.FileTree
, allowing us to construct/use file trees in test code, instead being forced to create GitHub repos (or mock GitHub) to test.FileTree
wrappers to implement file caching (seeCachedFileTree
), download limits, etc.We should refactor all of our code similarly.
My preliminary impression is that this implementation works pretty good. It aces the
nf-core/rnaseq
,bactopia-wdl
, andcancerit/workflow-seq-import
repos:It does an ok job on the
bio-cwl-tools
repo, and mediocre on thewarp
repo (includes lots of helper wdls, and some testing workflows, rather than the workflows being tested):This PR includes some basic testing. When we wire it into the webservice for public consumption, we should add integration tests that confirm that it works in real-world conditions.
Team members suggested that we could use "does this file import other files" as a primary descriptor determination heuristic. I tried it out, and it works ok for big, mature workflows, but fails on small/embyronic workflows that are described by a single file. So, it's not included.
Review Instructions
Use the
inferEntries
endpoint to infer a.dockstore.yml
and confirm that it looks sane and well-formed.Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6464
Security and Privacy
If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.
e.g. Does this change...
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install
@RolesAllowed
annotation