8000 SEAB-6464: Infer .dockstore.yml by svonworl · Pull Request #5939 · dockstore/dockstore · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

SEAB-6464: Infer .dockstore.yml #5939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 72 commits into from
Dec 11, 2024
Merged

Conversation

svonworl
Copy link
Contributor
@svonworl svonworl commented Jul 18, 2024

Description
This PR implements the foundations of the "infer .dockstore.yml" feature, which inspects the file tree corresponding to a GitHub repo/ref, tries to determine the entries within that a user would probably want to register, and generates a corresponding .dockstore.yml. Storage/delivery of the resulting .dockstore.yml will be implemented in a future PR.

The new functionality is implemented by a new helper InferrerHelper, which has two key methods:

  • infer: read a specified repo/ref and infer a list of entries from it.
  • toDockstoreYml: given a list of inferred entries, produce the body of a corresponding .dockstore.yml

We've split the functionality into separate methods as above so that later, we have the freedom to do things like use the inferred entries to take a certain action, generate a .dockstore.yml from user input, etc.

InferrerHelper.infer delegates to instances of Inferrer, which encapsulate the inference logic for various types of files. The code is designed to make it easy to add other types of Inferrers in the future, for example, to support a particular type of manifest file.

Currently, the code can find CWL, WDL, Nextflow, Galaxy, and Jupyter entries.

To allow experimentation, I've created a temporary inferEntries endpoint that you can use to run the code on a specified repo/ref. It accesses GitHub as the authorized user (and with their rate limit), and will respond with a suggested .dockstore.yml. See invocations below. I've marked the endpoint as Deprecated, and we should remove it before the 1.16 release.

To infer the list of entries, the DescriptorLanguageInferrer.infer method:

  1. Gets a list of all file paths in the file tree corresponding to the repo/ref.
  2. Calculates the subset of paths that could describe a primary descriptor, by inspecting the path string.
  3. Removes paths that likely corresponding to tests and other files that the author probably doesn't want to register, by inspecting the path string.
  4. Removes paths that are referenced by other files, by inspecting file content.
  5. Determines the entry type for each remaining path, and if successful, adds the inferred entry to the list.

As a general design goal, this code employs heuristics that require as little parsing as possible, because we're trying to create a .dockstore.yml, not to validate the entries that it describes. When the user attempts to register the entries using the generated .dockstore.yml, our existing language processing code will scrutinize their contents. Step 4 above exists to eliminate secondary descriptors from further consideration, and similarly, it does not do any sophisticated parsing, but rather, finds path-like strings using a regular expression.

The code reads the ref/repo file tree via a FileTree instance. which roughly mirrors our existing plugin FileReader interface (later, we might unify the two). We add a listPaths method that returns a list of all paths in the file tree, so that we don't need enumerate them incrementally via the existing GitHubSourceCodeRepo.listFiles method, which would take a lot of time (and rate limit) if a repository contained many directories. The included GitHubFileTree implementation downloads a Zip repo/ref archive at initialization time, and then extracts the paths and file contents from it on demand, with some fallbacks to existing GitHubSourceCodeRepo code. Thusly, most of the GitHub API file retrieval calls are eliminated.

By abstracting the file access code, we can easily:

  1. Support multiple entry sources: we can create a FileTree that reads from a Zip file, GitLab, GitHub, the GitHub "full tree" endpoint, filesystem (rooted at particular directory), some other source we can't imagine yet, etc.
  2. Create a synthetic in-memory implementation of FileTree, allowing us to construct/use file trees in test code, instead being forced to create GitHub repos (or mock GitHub) to test.
  3. Lock the filetree to a specific commit, so that bad things don't happen if the ref/repo changes underneath us.
  4. Create FileTree wrappers to implement file caching (see CachedFileTree), download limits, etc.
  5. Keep the low level file access code separate from the entry registration logic.

We should refactor all of our code similarly.

My preliminary impression is that this implementation works pretty good. It aces the nf-core/rnaseq, bactopia-wdl, and cancerit/workflow-seq-import repos:

curl http://localhost:8080/workflows/github/infer/nf-core/rnaseq/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/bactopia/bactopia-wdl/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/cancerit/workflow-seq-import/refs%2Fheads%2Fmain -H 'Authorization: bearer '`cat /tmp/token`

It does an ok job on the bio-cwl-tools repo, and mediocre on the warp repo (includes lots of helper wdls, and some testing workflows, rather than the workflows being tested):

curl http://localhost:8080/workflows/github/infer/common-workflow-library/bio-cwl-tools/refs%2Fheads%2Frelease -H 'Authorization: bearer '`cat /tmp/token`
curl http://localhost:8080/workflows/github/infer/broadinstitute/warp/refs%2Fheads%2Fmaster -H 'Authorization: bearer '`cat /tmp/token`

This PR includes some basic testing. When we wire it into the webservice for public consumption, we should add integration tests that confirm that it works in real-world conditions.

Team members suggested that we could use "does this file import other files" as a primary descriptor determination heuristic. I tried it out, and it works ok for big, mature workflows, but fails on small/embyronic workflows that are described by a single file. So, it's not included.

Review Instructions
Use the inferEntries endpoint to infer a .dockstore.yml and confirm that it looks sane and well-formed.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-6464

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

  • Security and Privacy assessed

e.g. Does this change...

  • Any user data we collect, or data location?
  • Access control, authentication or authorization?
  • Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@svonworl svonworl requested a review from denis-yuen August 1, 2024 15:52
Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly pending vfs discussion

Copy link
sonarqubecloud bot commented Aug 9, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
1.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@svonworl svonworl requested a review from denis-yuen December 3, 2024 00:48
/**
* Maps file paths to saved file content.
*/
private final Map<String, String> filePathToContent = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionally looks like a Guava Loading cache
https://github.com/google/guava/wiki/cachesexplained#example

There may be other cache implementations floating around our classpath.
Ditto with dirPathToFiles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While at it, it's a bit confusing because the map is from string to string.
Consider using a Path https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/file/Path.html as the key so things are more typed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misc observations:

  • Yes, it's a very simple cache based on a Map, also on our classpath. The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS), but at the expense of complexity, performance, and a larger chance that we'll inadvertently use it incorrectly. If someone feels really strongly, we can do it, though...

  • Path has system-dependent behavior. For example, the file separator changes from system to system. Because of that, would actually recommend, in general, that Path be avoided when possible, except for purposes such as accessing the local file system. Aspirational, of course...

  • While the semantics of Path and the semantics of workflow file paths seem to overlap each other 1:1 (more or less), it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

  • In our codebase, you will find Map<String, String> all over the place, and we very often use String to refer to paths. So, it meets the standard of our existing code, at least. Regarding this particular situation, don't think the advantage of Path in the type outweighs the ick of converting the incoming value, which is a String, to a Path in each method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it meets the standard of our existing code, at least.

FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.

it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

Seems unlikely, but in that case it seems like you can implement Path anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html

Path has system-dependent behavior.

I think I'm ok with Dockstore always running on Linux or Linux-like systems?

The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)

On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.

*/
public class SyntheticFileTree implements FileTree {

private Map<String, String> pathToContent = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is also functionally a cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't pull anything in, and everything needs to persist, so not so very much like a cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.

By default cache instances created by CacheBuilder will not perform any type of eviction.

https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html

They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile?

https://www.baeldung.com/guava-cache#preload-the-cache

I feel less strongly about this one than CachingFileTree since it looks like the intent here is to use this for testing.

/**
* Maps directory paths to saved directory contents.
*/
private final Map<String, List<String>> dirPathToFiles = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for Paths

@svonworl svonworl requested a review from denis-yuen December 4, 2024 01:23
Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be longer discussion

/**
* Maps file paths to saved file content.
*/
private final Map<String, String> filePathToContent = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it meets the standard of our existing code, at least.

FWIW, while it is true existing code is like this, this is one of the regrets I've had with the existing code, so if we're improving things or writing a new path, I'd like the make things more typesafe/more easily readable.

it's not implausible that there's some weird file syntax or structure that we'd want to support someday that Path doesn't.

Seems unlikely, but in that case it seems like you can implement Path anew, " WARNING: This interface is only intended to be implemented by those developing custom file system implementations. Methods may be added to this interface in future releases. "
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html

Path has system-dependent behavior.

I think I'm ok with Dockstore always running on Linux or Linux-like systems?

The main advantage to using a full-blown loading cache probably would be to reduce the attack surface slightly (harder to DOS)

On the contrary, I think the main advantage is that we'd avoid rolling our own cache implementation. We'd gain access to other pre-made cache functionality in Guava (or similar) if we need to tweak things, and it's easier to understand because instead of it being our own cache implementation it's a well-known one.

*/
public class SyntheticFileTree implements FileTree {

private Map<String, String> pathToContent = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be a misunderstanding. Guava caches at least by default don't have a maximum size so everything persists.

By default cache instances created by CacheBuilder will not perform any type of eviction.

https://guava.dev/releases/24.0-jre/api/docs/com/google/common/cache/CacheBuilder.html

They don't also have to pull in things themselves. You can put things in yourself which it seems like you're doing via addFile?

https://www.baeldung.com/guava-cache#preload-the-cache

I feel less strongly about this one than CachingFileTree since it looks like the intent here is to use this for testing.

.filter(path -> path.startsWith(normalizedDirPath))
.map(path -> path.substring(normalizedDirPath.length()))
.filter(path -> !path.isEmpty())
.map(path -> path.split(SLASH)[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think I still find this hard to understand, would be nice to simplify/standardize the code

Assuming I did understand, there's probably stuff like Path.getParent that can simplify https://stackoverflow.com/questions/1099859/how-to-split-a-path-platform-independent#comment914553_1099868

Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
1.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@svonworl
Copy link
Contributor Author
svonworl commented Dec 11, 2024

Ok, I've made a set of changes that should get this over the hump...

A note about one of the changes:

FileTree was intentionally designed to mirror the signature of the existing methods in SourceCodeRepoInterface, from which it was split. Those methods use String as the type for file paths, as does most of the webservice.

One of the mandated changes, promulgated by a reviewer who must approve any PR before I am able to merge it, was to change FileTree to use Path instead. I make this change under duress, because:

  • it makes it harder and more error-prone to refactor existing code to use FileTr E377 ee.
  • it more permanently wires the system-dependent behavior of Path into the webservice. Yes, we can hope we're ok just running on systems where Path happens to work the right way, but there's plausible scenarios wherein we want to run on other systems, or a subtle difference in Path's behavior breaks the webservice.
  • one of our [informal] goals is to use absolute paths throughout the webservice, but a Path can represent either an absolute or relative path, so it doesn't buy us anything with respect to that.
  • it introduces more points of failure into an already-large PR, making it harder to diagnose problems and more difficult to partially roll back.

There are other reasons, the above is not an exhaustive list.

I would have preferred that we continue to use our historical path representation in this PR, and then later, as a team, considered a good path representation and how we might use it to improve the webservice. A possibility might be an AbsolutePath object, which would only represent absolute paths. Upon construction, it would verify absoluteness, and then anywhere we used it, we'd be sure the path was absolute, no need to isAbsolute or anything like that. AbsolutePath would also provide a small set of operations that would centralize the path-manipulation code that is currently scattered across the webservice, and generally spit out results that were also AbsolutePaths, maintaining the "is it absolute" type safety. Initially, for expediency sake, the ops could be delegated to Path, and we could easily replace them with platform-independent code later.

@svonworl svonworl requested a review from denis-yuen December 11, 2024 17:12
Copy link
Member
@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into Caffeine and while I understand and respect your preference to work with Strings rather than Paths, I think this is a good step of many in helping us get rid of our mix of java.io.File and unstructured Strings.

public List<String> listFiles(Path dirPath) {
Set<String> files = new HashSet<>();
for (Path filePath: pathToContent.keySet()) {
if (filePath.startsWith(dirPath)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I know this wasn't your preference but I find it easier to understand getNameCount and startsWith than working with split and string manipulation.
I think the lambda version of this would be fine, but no need to change back


@Test
void testNonSpaceSeparatorsPattern() {
Assertions.assertArrayEquals(new String[]{"/a.wdl", "b.cwl", "c.cwl"}, DescriptorLanguageInferrer.NON_SPACE_SEPARATORS.split("/a.wdl,'b.cwl',\"c.cwl\""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When repeated like this, Assertions can be static import for readability

@svonworl svonworl merged commit 13ff51c into develop Dec 11, 2024
21 of 24 checks passed
@svonworl svonworl deleted the feature/seab-6464/infer-dockstore-yml branch December 11, 2024 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0