Define the cache
key for v1 multi-output recipes
#102
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# CEP for the cache output in v1 recipes / rattler-build | ||
|
||
<table> | ||
<tr><td> Title </td><td> The cache output in v1 recipes / rattler-build </td></tr> | ||
<tr><td> Status </td><td> In Discussion </td></tr> | ||
<tr><td> Author(s) </td><td> Wolf Vollprecht <w.vollprecht@gmail.com> </td></tr> | ||
<tr><td> Created </td><td> Nov 27, 2024</td></tr> | ||
<tr><td> Updated </td><td> </td></tr> | ||
<tr><td> Discussion </td><td> </td></tr> | ||
<tr><td> Implementation </td><td> rattler-build </td></tr> | ||
</table> | ||
|
||
## Abstract | ||
|
||
This CEP aims to define the cache output for v1 multi-output recipes. | ||
|
||
## Background | ||
|
||
Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as a shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build. | ||
|
||
There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level). | ||
|
||
For the v1 spec we are attempting to formalize the workings of the "top-level" build. For this, we introduce a new top-level key `cache`, that has the same values as a regular output. | ||
|
||
## Specification | ||
|
||
The top-level `cache` key looks as follows: | ||
|
||
```yaml | ||
cache: | ||
  source: | ||
    - url: https://foo.bar/source.tar.bz | ||
      sha256: ... | ||
|
||
  requirements: | ||
    build: | ||
      - ${{ compiler('c') }} | ||
      - cmake | ||
      - ninja | ||
    host: | ||
      - libzlib | ||
      - libfoo | ||
    # the `run` and `run_constraints` sections are not allowed here | ||
    ignore_run_exports: | ||
      by_name: | ||
        - libfoo | ||
|
||
  build: | ||
    # only the script key is allowed here | ||
    script: build.sh | ||
``` | ||
|
||
<details> | ||
<summary>script build.sh default discussion</summary> | ||
We had some debate whether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.
In a multi-output recipe, do the outputs default to

Yes, that is currently the case. We're thinking of deprecating this though and require users to be explicit about build scripts everywhere. What would you prefer? |
||
</details> | ||
|
||
When computing variants and used variables, rattler-build looks at the union of a given output and the cache. That means, even if an output does not define any requirements, the cache would still add a variant for the `c_compiler`. | ||
|
||
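The union rule above can be sketched in a few lines. This is only an illustration: the function name `used_variables` and the variable sets are hypothetical, and the sketch does not reflect how rattler-build stores or discovers used variables internally.

```python
def used_variables(output_vars, cache_vars):
    """Return the variables that drive variants for one output.

    Assumption: both arguments are plain sets of variant key names.
    """
    return set(output_vars) | set(cache_vars)


# Hypothetical example: the cache uses a C compiler, the output uses nothing.
cache_vars = {"c_compiler", "target_platform"}
headers_vars = set()

# The output still picks up the cache's variables.
print(sorted(used_variables(headers_vars, cache_vars)))
```

Even though the hypothetical `foo-headers` output lists no requirements, its variant set still contains `c_compiler` because the cache does.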
When rattler-build executes the recipe, it will start by building the cache output that is appropriate for the current variant. This is computed by looking at all "used-variables" for the cache output and computing a "hash" for the cache. The build itself is executed in the same way as any other build. | ||
|
||
The variant keys that are injected at build time are the subset used by the cache output. | ||
|
||
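As a rough sketch of the two paragraphs above, the cache can be keyed by hashing only the variant entries that the cache output uses. The helper names `cache_variant` and `cache_hash`, the JSON serialization, and the truncated SHA-256 digest are all assumptions made for illustration; the text does not specify the actual hash function or its inputs.

```python
import hashlib
import json


def cache_variant(variant, cache_used_vars):
    """Restrict the full variant to the keys the cache output uses."""
    return {k: v for k, v in variant.items() if k in cache_used_vars}


def cache_hash(variant, cache_used_vars):
    """Hash the restricted variant (hypothetical scheme: sorted JSON + SHA-256)."""
    subset = cache_variant(variant, cache_used_vars)
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


# Two variants that differ only in a key the cache does not use.
v1 = {"c_compiler": "gcc", "python": "3.11"}
v2 = {"c_compiler": "gcc", "python": "3.12"}
print(cache_hash(v1, {"c_compiler"}) == cache_hash(v2, {"c_compiler"}))
```

Under this scheme, variants that differ only in keys the cache ignores map to the same cache build, so the cache is built once and reused.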
When the cache build is done, the newly created files are moved outside of the `host-prefix`. Post-processing is not performed on the files beyond recording which files contain the `$PREFIX` (which is later replaced in binaries and text files with the actual build-prefix). | ||
|
||
The cache restores files that were added to the prefix (conda-build also restored source files). | ||
The cache work dir folder, including the cache sources, is also recreated at the "dirty" state of the end of the cache build. The individual outputs can add additional source files into the cached source folder. | ||
|
||
Any new files in the prefix (from the cache) can then be used in the outputs with the `build.files` key: | ||
|
||
```yaml | ||
outputs: | ||
  - package: | ||
      name: foo-headers | ||
|
||
    build: | ||
      files: | ||
        - include/** | ||
  - package: | ||
      name: libfoo | ||
    build: | ||
      files: | ||
        - lib/** | ||
    # Review comment: Are you going to support negative globs in
    #   - package:
    #       name: foo-dev
    #     build:
    #       files:
    #         include:
    #           - lib/**
    #         exclude:
    #           - lib/**.so.*
    #   - package:
    #       name: libfoo${{ soname }}
    #     build:
    #       files:
    #         include:
    #           - lib/**.so.*
    # Would be used to separate runtime libraries from development artifacts because only versioned shared objects are needed at runtime. More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to lib/
    #
    # Review comment: I think we should go further and only allow
    #   build:
    #     files:
    #       include:
    #         - ...
    # and forbid the plain More importantly, |
||
|
||
  - package: | ||
      name: foo-devel | ||
    requirements: | ||
      run: | ||
        - ${{ pin_subpackage("libfoo") }} | ||
        - ${{ pin_subpackage("foo-headers") }} | ||
``` | ||
|
||
The glob list syntax can also be a dictionary with `include / exclude` keys, e.g. | ||
|
||
```yaml | ||
files: | ||
  include: | ||
    - include/** | ||
  exclude: | ||
    - lib/** | ||
``` | ||
|
||
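To make the two forms of `files` concrete, here is a minimal sketch of include/exclude selection over the files restored from the cache. It is an approximation: the helpers `normalize_files` and `select_files` are hypothetical, and Python's `fnmatch` lets `*` match across `/`, which may differ from the glob semantics rattler-build actually implements.

```python
from fnmatch import fnmatchcase


def normalize_files(spec):
    """Accept either a plain glob list or a dict with include/exclude keys."""
    if isinstance(spec, dict):
        return list(spec.get("include", [])), list(spec.get("exclude", []))
    return list(spec), []


def select_files(paths, spec):
    """Keep paths matching any include glob and no exclude glob."""
    include, exclude = normalize_files(spec)

    def matches(path, patterns):
        return any(fnmatchcase(path, pattern) for pattern in patterns)

    return [p for p in paths if matches(p, include) and not matches(p, exclude)]


# Hypothetical list of files that the cache build added to the prefix.
new_files = ["include/foo.h", "lib/libfoo.so", "lib/libfoo.so.1", "share/doc/foo.txt"]

print(select_files(new_files, ["include/**"]))
print(select_files(new_files, {"include": ["lib/**"], "exclude": ["lib/**.so.*"]}))
```

The second call mirrors the split raised in review: the unversioned `lib/libfoo.so` stays with the development output, while versioned shared objects are excluded from it.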
Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements: in effect, the host and build dependencies of the cache are injected into the outputs and attach their run exports to each output. | ||
I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.

I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

The downside is that the outputs would then have to set up a build / host environment for nothing (just to add run exports). What would you think of a more complex algorithm where we only attach the run exports to the lowest level output? E.g.

I see the problem with the overdepending warning.

Yeah, and this gets easily forgotten. I understand the issue with over-imbuing outputs with dependencies, but in this case I think the safer default is to inherit the run-exports, unless the output already has its own build/host environment (because then it's safe to assume that this should determine the actual REs for that output). Part of my thinking on that is that, in many cases, superfluous REs are pretty benign, e.g. it's basically irrelevant whether a metapackage that does

So while I generally dislike recipes having to opt out of implicit things, I think in this case it's the safer/saner default.

Can you explain the algorithm in a sentence? I think we need to have understandable rules. So far I think "inherit REs unless output has its own build/host env" is relatively easy to explain (aside from checking the technical boxes from my POV), and a more complicated algorithm would need to provide a pretty big improvement in comparison to be worth it IMO.

The background section of this proposal specifically calls out implicit behavior as a "downside" of the top-level package. In my experience, confusion about multi-output packages comes from unexpected behavior when the top-level shares a name as one of the output packages so they aren't behaving independently. This is why I maintain that there should be no run_exports across package boundaries from the cache package to output package because creating more interaction between output packages and the top-level package will increase confusion. Adding a non-trivial run_export tracking algorithm from the cache package to the output packages will not further the design goal of making top-level package cache easier to understand and use. It will only make recipes less verbose. It will be adding complexity and adding implicit behaviors which this proposal claims to dislike. If the goal of this proposal is to make the recipe format more convenient or less verbose, then the background section should claim that instead of claiming that the problem is reader comprehension and implicit behavior. In my opinion, it's easier for users to understand if all outputs inherit none of the run_exports from the cache package. The easiest way to explain the cache package would be that the cache package is a binary redistribution that you are unpacking for each of the output packages.

Re-enumerating host/build dependencies is verbose, but it's easy to conceptualize as "If I was building only the things in this specific output, what would I need?" I think I've made all my points about this topic, so I will stop trying to argue in favor of it.

This is a CEP, so the appropriate meeting is the Conda Community Sync? I can attend this week. Feb 12 @ 11am Chicago.

Yes!

From the side-lines since I was just bitten by this wanting to reduce duplication while splitting the recipe. (Trying to cargo cult from a recipe that was very conscious about the split requirements to get the right dependencies.) I can understand that implicitly adding it to all dependencies built on the cache is also odd even if the failure mode is more graceful. Maybe that doesn't fit the pattern at all, but how about forcing users to be explicit about it? So the user must either indicate which named outputs they want to export the cache's dependencies to or indicate that they want to export it to none. You could go as far as allowing

(I doubt this is valid yaml, but you get the point) EDIT: And of course some indicator if you have no entry (and an error if indicating nothing or names are wrong).

That is not a bad idea! We could have an

@carterbox what are your feelings about this solution? |
||
|
||
To ignore certain run exports, the usual `ignore_run_exports` specifiers can be used in each output. | ||
|
||
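The run-export rule as the text currently states it (still under discussion in review) can be sketched as follows. Everything here is illustrative: the `RUN_EXPORTS` table, its version specs, and the helper names are hypothetical, and the sketch assumes `by_name` filters on the name of the exported package.

```python
# Hypothetical run-export table; real values come from each package's metadata.
RUN_EXPORTS = {
    "libzlib": ["libzlib >=1.3,<2"],
    "libfoo": ["libfoo >=1.0,<2"],
}


def spec_name(spec):
    """Return the package name of a match spec such as 'libzlib >=1.3,<2'."""
    return spec.split()[0]


def run_exports_for_output(output_deps, cache_deps, ignore_by_name=()):
    """Collect run exports from the union of cache and output dependencies."""
    # Union of requirements, keeping first-seen order and dropping duplicates.
    merged = list(dict.fromkeys([*cache_deps, *output_deps]))
    exports = []
    for dep in merged:
        for spec in RUN_EXPORTS.get(dep, []):
            if spec_name(spec) not in ignore_by_name and spec not in exports:
                exports.append(spec)
    return exports


# An output with no requirements of its own still inherits from the cache.
print(run_exports_for_output([], ["libzlib", "libfoo"], ignore_by_name=["libfoo"]))
```

In this sketch an output without its own build or host environment still receives the `libzlib` run export from the cache, while `libfoo` is dropped by the ignore list.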
> [!NOTE] | ||
> We have pondered other logic for attaching run exports. We could have a more complicated algorithm that attaches the run exports only to the lowest package in a chain of packages connected by `pin_subpackage(..., exact=True)`. However, duplicating the same dependencies should not matter much to the solver. |
Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172