SEAB-6464: Infer .dockstore.yml #5939

svonworl · 2024-07-18T14:52:59Z

Description
This PR implements the foundations of the "infer .dockstore.yml" feature, which inspects the file tree corresponding to a GitHub repo/ref, tries to determine the entries within that a user would probably want to register, and generates a corresponding .dockstore.yml. Storage/delivery of the resulting .dockstore.yml will be implemented in a future PR.

The new functionality is implemented by a new helper InferrerHelper, which has two key methods:

infer: read a specified repo/ref and infer a list of entries from it.
toDockstoreYml: given a list of inferred entries, produce the body of a corresponding .dockstore.yml

We've split the functionality into separate methods as above so that later, we have the freedom to do things like use the inferred entries to take a certain action, generate a .dockstore.yml from user input, etc.

InferrerHelper.infer delegates to instances of Inferrer, which encapsulate the inference logic for various types of files. The code is designed to make it easy to add other types of Inferrers in the future, for example, to support a particular type of manifest file.

Currently, the code can find CWL, WDL, Nextflow, Galaxy, and Jupyter entries.

To allow experimentation, I've created a temporary inferEntries endpoint that you can use to run the code on a specified repo/ref. It accesses GitHub as the authorized user (and with their rate limit), and will respond with a suggested .dockstore.yml. See invocations below. I've marked the endpoint as Deprecated, and we should remove it before the 1.16 release.

To infer the list of entries, the DescriptorLanguageInferrer.infer method:

Gets a list of all file paths in the file tree corresponding to the repo/ref.
Calculates the subset of paths that could describe a primary descriptor, by inspecting the path string.
Removes paths that likely corresponding to tests and other files that the author probably doesn't want to register, by inspecting the path string.
Removes paths that are referenced by other files, by inspecting file content.
Determines the entry type for each remaining path, and if successful, adds the inferred entry to the list.

As a general design goal, this code employs heuristics that require as little parsing as possible, because we're trying to create a .dockstore.yml, not to validate the entries that it describes. When the user attempts to register the entries using the generated .dockstore.yml, our existing language processing code will scrutinize their contents. Step 4 above exists to eliminate secondary descriptors from further consideration, and similarly, it does not do any sophisticated parsing, but rather, finds path-like strings using a regular expression.

The code reads the ref/repo file tree via a FileTree instance. which roughly mirrors our existing plugin FileReader interface (later, we might unify the two). We add a listPaths method that returns a list of all paths in the file tree, so that we don't need enumerate them incrementally via the existing GitHubSourceCodeRepo.listFiles method, which would take a lot of time (and rate limit) if a repository contained many directories. The included GitHubFileTree implementation downloads a Zip repo/ref archive at initialization time, and then extracts the paths and file contents from it on demand, with some fallbacks to existing GitHubSourceCodeRepo code. Thusly, most of the GitHub API file retrieval calls are eliminated.

By abstracting the file access code, we can easily:

Support multiple entry sources: we can create a FileTree that reads from a Zip file, GitLab, GitHub, the GitHub "full tree" endpoint, filesystem (rooted at particular directory), some other source we can't imagine yet, etc.
Create a synthetic in-memory implementation of FileTree, allowing us to construct/use file trees in test code, instead being forced to create GitHub repos (or mock GitHub) to test.
Lock the filetree to a specific commit, so that bad things don't happen if the ref/repo changes underneath us.
Create FileTree wrappers to implement file caching (see CachedFileTree), download limits, etc.
Keep the low level file access code separate from the entry registration logic.

We should refactor all of our code similarly.

My preliminary impression is that this implementation works pretty good. It aces the nf-core/rnaseq, bactopia-wdl, and cancerit/workflow-seq-import repos:

curl http://localhost:8080/workflows/github/infer/nf-core/rnaseq/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/bactopia/bactopia-wdl/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/cancerit/workflow-seq-import/refs%2Fheads%2Fmain -H 'Authorization: bearer '`cat /tmp/token`

It does an ok job on the bio-cwl-tools repo, and mediocre on the warp repo (includes lots of helper wdls, and some testing workflows, rather than the workflows being tested):

curl http://localhost:8080/workflows/github/infer/common-workflow-library/bio-cwl-tools/refs%2Fheads%2Frelease -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/broadinstitute/warp/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`

This PR includes some basic testing. When we wire it into the webservice for public consumption, we should add integration tests that confirm that it works in real-world conditions.

Team members suggested that we could use "does this file import other files" as a primary descriptor determination heuristic. I tried it out, and it works ok for big, mature workflows, but fails on small/embyronic workflows that are described by a single file. So, it's not included.

Review Instructions
Use the inferEntries endpoint to infer a .dockstore.yml and confirm that it looks sane and well-formed.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6464

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

Security and Privacy assessed

e.g. Does this change...

Any user data we collect, or data location?
Access control, authentication or authorization?
Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

Check that you pass the basic style checks and unit tests by running mvn clean install
Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
Do not serve user-uploaded binary images through the Dockstore API
Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
Do not create cookies, although this may change in the future
If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

denis-yuen

Mainly pending vfs discussion

sonarqubecloud · 2024-08-09T02:31:28Z

Quality Gate failed

Failed conditions
1.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

denis-yuen · 2024-12-03T21:06:34Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/CachingFileTree.java

+    /**
+     * Maps file paths to saved file content.
+     */
+    private final Map<String, String> filePathToContent = new HashMap<>();


This functionally looks like a Guava Loading cache
https://github.com/google/guava/wiki/cachesexplained#example

There may be other cache implementations floating around our classpath.
Ditto with dirPathToFiles

While at it, it's a bit confusing because the map is from string to string.
Consider using a Path https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/file/Path.html as the key so things are more typed.

Misc observations:

Yes, it's a very simple cache based on a Map, also on our classpath. The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS), but at the expense of complexity, performance, and a larger chance that we'll inadvertently use it incorrectly. If someone feels really strongly, we can do it, though...

Path has system-dependent behavior. For example, the file separator changes from system to system. Because of that, would actually recommend, in general, that Path be avoided when possible, except for purposes such as accessing the local file system. Aspirational, of course...

While the semantics of Path and the semantics of workflow file paths seem to overlap each other 1:1 (more or less), it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

In our codebase, you will find Map<String, String> all over the place, and we very often use String to refer to paths. So, it meets the standard of our existing code, at least. Regarding this particular situation, don't think the advantage of Path in the type outweighs the ick of converting the incoming value, which is a String, to a Path in each method.

So, it meets the standard of our existing code, at least.

FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.

it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

Seems unlikely, but in that case it seems like you can implement Path anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html

Path has system-dependent behavior.

I think I'm ok with Dockstore always running on Linux or Linux-like systems?

The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)

On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/FileTree.java

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/GitHubSourceCodeRepo.java

denis-yuen · 2024-12-03T21:09:23Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java

+ */
+public class SyntheticFileTree implements FileTree {
+
+    private Map<String, String> pathToContent = new HashMap<>();


I think this is also functionally a cache.

Doesn't pull anything in, and everything needs to persist, so not so very much like a cache.

I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.

By default cache instances created by CacheBuilder will not perform any type of eviction.

https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html

They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile?

https://www.baeldung.com/guava-cache#preload-the-cache

I feel less strongly about this one than CachingFileTree since it looks like the intent here is to use this for testing.

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java

denis-yuen · 2024-12-03T21:15:24Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/CachingFileTree.java

+    /**
+     * Maps directory paths to saved directory contents.
+     */
+    private final Map<String, List<String>> dirPathToFiles = new HashMap<>();


Ditto for Paths

...bservice/src/main/java/io/dockstore/webservice/helpers/infer/DescriptorLanguageInferrer.java

denis-yuen

may be longer discussion

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/GitHubSourceCodeRepo.java

denis-yuen · 2024-12-04T16:17:24Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/CachingFileTree.java

+    /**
+     * Maps file paths to saved file content.
+     */
+    private final Map<String, String> filePathToContent = new HashMap<>();


So, it meets the standard of our existing code, at least.

FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.

it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

Seems unlikely, but in that case it seems like you can implement Path anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html

Path has system-dependent behavior.

I think I'm ok with Dockstore always running on Linux or Linux-like systems?

The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)

On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.

denis-yuen · 2024-12-04T16:31:39Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java

+ */
+public class SyntheticFileTree implements FileTree {
+
+    private Map<String, String> pathToContent = new HashMap<>();


I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.

By default cache instances created by CacheBuilder will not perform any type of eviction.

https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html

They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile?

https://www.baeldung.com/guava-cache#preload-the-cache

I feel less strongly about this one than CachingFileTree since it looks like the intent here is to use this for testing.

denis-yuen · 2024-12-04T16:44:15Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java

+            .filter(path -> path.startsWith(normalizedDirPath))
+            .map(path -> path.substring(normalizedDirPath.length()))
+            .filter(path -> !path.isEmpty())
+            .map(path -> path.split(SLASH)[0])


Think I still find this hard to understand, would be nice to simplify/standardize the code

Assuming I did understand, there's probably stuff like Path.getParent that can simplify https://stackoverflow.com/questions/1099859/how-to-split-a-path-platform-independent#comment914553_1099868

sonarqubecloud · 2024-12-11T03:21:00Z

Quality Gate failed

Failed conditions
1.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

svonworl · 2024-12-11T17:12:29Z

Ok, I've made a set of changes that should get this over the hump...

A note about one of the changes:

FileTree was intentionally designed to mirror the signature of the existing methods in SourceCodeRepoInterface, from which it was split. Those methods use String as the type for file paths, as does most of the webservice.

One of the mandated changes, promulgated by a reviewer who must approve any PR before I am able to merge it, was to change FileTree to use Path instead. I make this change under duress, because:

it makes it harder and more error-prone to refactor existing code to use FileTr E377 ee.
it more permanently wires the system-dependent behavior of Path into the webservice. Yes, we can hope we're ok just running on systems where Path happens to work the right way, but there's plausible scenarios wherein we want to run on other systems, or a subtle difference in Path's behavior breaks the webservice.
one of our [informal] goals is to use absolute paths throughout the webservice, but a Path can represent either an absolute or relative path, so it doesn't buy us anything with respect to that.
it introduces more points of failure into an already-large PR, making it harder to diagnose problems and more difficult to partially roll back.

There are other reasons, the above is not an exhaustive list.

I would have preferred that we continue to use our historical path representation in this PR, and then later, as a team, considered a good path representation and how we might use it to improve the webservice. A possibility might be an AbsolutePath object, which would only represent absolute paths. Upon construction, it would verify absoluteness, and then anywhere we used it, we'd be sure the path was absolute, no need to isAbsolute or anything like that. AbsolutePath would also provide a small set of operations that would centralize the path-manipulation code that is currently scattered across the webservice, and generally spit out results that were also AbsolutePaths, maintaining the "is it absolute" type safety. Initially, for expediency sake, the ops could be delegated to Path, and we could easily replace them with platform-independent code later.

denis-yuen

Thanks for looking into Caffeine and while I understand and respect your preference to work with Strings rather than Paths, I think this is a good step of many in helping us get rid of our mix of java.io.File and unstructured Strings.

denis-yuen · 2024-12-11T21:10:39Z

dockstore-webservice/src/main/java/io/dockstore/webservice/helpers/SyntheticFileTree.java

+    public List<String> listFiles(Path dirPath) {
+        Set<String> files = new HashSet<>();
+        for (Path filePath: pathToContent.keySet()) {
+            if (filePath.startsWith(dirPath)) {


Thanks, I know this wasn't your preference but I find it easier to understand getNameCount and startsWith than working with split and string manipulation.
I think the lambda version of this would be fine, but no need to change back

denis-yuen · 2024-12-11T21:13:57Z

...vice/src/test/java/io/dockstore/webservice/helpers/infer/DescriptorLanguageInferrerTest.java

+
+    @Test
+    void testNonSpaceSeparatorsPattern() {
+        Assertions.assertArrayEquals(new String[]{"/a.wdl", "b.cwl", "c.cwl"}, DescriptorLanguageInferrer.NON_SPACE_SEPARATORS.split("/a.wdl,'b.cwl',\"c.cwl\""));


When repeated like this, Assertions can be static import for readability

svonworl added 30 commits June 13, 2024 16:38

first draft

22a18ec

add entry type inference

b96bb7f

add crude testing endpoint

155a6c0

add .dockstore.yml synthesis code

4e94e72

pacify checkstyle

21423a9

fix oopsie

8a7c101

add GitHubFileTree

f4400e2

improve GitHubFileTree

215f5fa

refactor

02d41e2

move files

1d9ad4a

add some untracked files

093f7e6

major refactor part 1

0dafcbf

make compile

570c95b

moar refactor

2148728

continue refactoring

3c9e3dd

add descriptor language subclass

8ad9441

add synthetic and cached file trees

094adc0

add partial unit test, refactor

872e7c6

add more tests, refactor

6629b88

remove some unused code

e1ed49c

address checksyle and add package-info

dffa0d0

eliminate trailing whitespace

e201d05

address TODOs

160ff0a

tweak

c0ef9a7

merge develop

b3f52d9

tweaks

8ab00fb

improve logging, eliminate potential npe

2cfe791

naming change

d1d81d8

deprecate test endpoint

66a68fc

add copyright headers

ae2578c

svonworl added 2 commits July 31, 2024 16:45

enhance description of inferEntries ref param

1ad1ea9

fix openapi.yaml

c2bf38f

svonworl requested a review from denis-yuen August 1, 2024 15:52

svonworl added 2 commits August 1, 2024 11:55

retool referenced path determination

a5b53e2

merge develop

3c34f6c

denis-yuen reviewed Aug 2, 2024

View reviewed changes

svonworl added 3 commits August 5, 2024 14:27

tweak

aa8e5fa

pr feedback

bcaee2a

add test file

3eaa548

pr feedback

bcaee2a

add test file

3eaa548

svonworl added 2 commits September 16, 2024 17:22

merge develop

29ae825

add control chars to non-space separators, tests, comments

b895d08

svonworl requested a review from denis-yuen December 3, 2024 00:48

denis-yuen reviewed Dec 3, 2024

View reviewed changes

some review feedback

883c35e

svonworl requested a review from denis-yuen December 4, 2024 01:23

denis-yuen requested changes Dec 4, 2024

View reviewed changes

svonworl added 7 commits December 9, 2024 16:39

add some javadoc

c7f3990

change code to be more imperative, add tests

a415b5c

convert CachingFileTree to use caffeine LoadingCache

962f3ec

merge develop

d32795d

sync pom

245882c

adjust quay image sha

b50e2b0

refactor to Path

39f2a54

svonworl requested a review from denis-yuen December 11, 2024 17:12

denis-yuen approved these changes Dec 11, 2024

View reviewed changes

svonworl merged commit 13ff51c into develop Dec 11, 2024
21 of 24 checks passed

svonworl deleted the feature/seab-6464/infer-dockstore-yml branch December 11, 2024 23:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SEAB-6464: Infer .dockstore.yml #5939

SEAB-6464: Infer .dockstore.yml #5939

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SEAB-6464: Infer .dockstore.yml #5939

SEAB-6464: Infer .dockstore.yml #5939

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Quality Gate failed

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Quality Gate failed

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!