Define the `cache` key for v1 multi-output recipes by wolfv · Pull Request #102 · conda/ceps · GitHub

Define the cache key for v1 multi-output recipes #102

108 changes: 108 additions & 0 deletions cep-cache-output.md
@@ -0,0 +1,108 @@
# CEP for the cache output in v1 recipes / rattler-build

<table>
<tr><td> Title </td><td> The cache output in v1 recipes / rattler-build </td></tr>
<tr><td> Status </td><td> In Discussion </td></tr>
<tr><td> Author(s) </td><td> Wolf Vollprecht &lt;w.vollprecht@gmail.com&gt; </td></tr>
<tr><td> Created </td><td> Nov 27, 2024</td></tr>
<tr><td> Updated </td><td> </td></tr>
<tr><td> Discussion </td><td> </td></tr>
<tr><td> Implementation </td><td> rattler-build </td></tr>
</table>

## Abstract

This CEP aims to define the cache output for v1 multi-output recipes.

## Background

Sometimes it is very useful to build some code once and then split it into multiple build artifacts (such as a shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.

There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).

Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172


For the v1 spec we are attempting to formalize the workings of the "top-level" build. For this, we introduce a new top-level key `cache`, that has the same values as a regular output.

## Specification

The top-level `cache` key looks as follows:

```yaml
cache:
  source:
    - url: https://foo.bar/source.tar.bz
      sha256: ...

  requirements:
    build:
      - ${{ compiler('c') }}
      - cmake
      - ninja
    host:
      - libzlib
      - libfoo
    # the `run` and `run_constraints` sections are not allowed here
    ignore_run_exports:
      by_name:
        - libfoo

  build:
    # only the script key is allowed here
    script: build.sh
```

<details>
<summary>script build.sh default discussion</summary>
We had some debate about whether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.

In a multi-output recipe, do the outputs default to build.sh?

Contributor Author

Yes, that is currently the case. We're thinking of deprecating this though and require users to be explicit about build scripts everywhere. What would you prefer?

</details>

When computing variants and used variables, rattler-build looks at the union of a given output and the cache. That means, even if an output does not define any requirements, the cache would still add a variant for the `c_compiler`.
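The union rule can be pictured with a minimal sketch. It assumes that used variables are modelled as plain sets of variant key names; the function name and the example keys are hypothetical and are not rattler-build's actual API.

```python
def used_variables_for_output(output_vars: set[str], cache_vars: set[str]) -> set[str]:
    """Return the variant keys relevant to an output: its own plus the cache's."""
    return output_vars | cache_vars


# Hypothetical example: the cache uses a C compiler, the output declares nothing.
cache_vars = {"c_compiler", "target_platform"}
output_vars: set[str] = set()

combined = used_variables_for_output(output_vars, cache_vars)
```

Even though the output itself has no requirements, `combined` still contains `c_compiler`, so the output is built once per compiler variant.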

When rattler-build executes the recipe, it will start by building the cache output that is appropriate for the current variant. This is computed by looking at all "used-variables" for the cache output and computing a "hash" for the cache. The build itself is executed in the same way as any other build.
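One way to think about the cache "hash" is sketched below. The text only says that a hash is computed from the used variables; the choice of SHA-256, the JSON serialization, and the name `cache_key` are assumptions made for illustration.

```python
import hashlib
import json


def cache_key(variant: dict[str, str], used_variables: set[str]) -> str:
    """Hash only the variant entries that the cache output actually uses."""
    subset = {k: variant[k] for k in sorted(used_variables) if k in variant}
    # Serialize deterministically so equal subsets always hash identically.
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical variants that differ only in a key the cache does not use.
variant_a = {"c_compiler": "gcc", "python": "3.12"}
variant_b = {"c_compiler": "gcc", "python": "3.13"}
used = {"c_compiler"}

key_a = cache_key(variant_a, used)
key_b = cache_key(variant_b, used)
```

Because `python` is not a used variable of the cache in this example, both variants map to the same key and could share one cache build.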

The variant keys that are injected at build time are the subset used by the cache output.

When the cache build is done, the newly created files are moved outside of the `host-prefix`. Post-processing is not performed on the files beyond memoizing which files contain the `$PREFIX` (which is later replaced in binaries and text files with the actual build-prefix).
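The prefix bookkeeping could look roughly like the following sketch. It assumes a simple byte search over every file in the cached tree; the helper name and the example paths are hypothetical, and the real implementation may detect prefixes differently.

```python
import tempfile
from pathlib import Path


def files_containing_prefix(root: Path, prefix: str) -> list[str]:
    """Return relative paths of files under `root` whose bytes contain `prefix`."""
    needle = prefix.encode("utf-8")
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and needle in path.read_bytes():
            hits.append(path.relative_to(root).as_posix())
    return hits


# Build a tiny throwaway tree to demonstrate the scan.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib").mkdir()
    (root / "lib" / "foo.pc").write_text("prefix=/build/host_env\n")
    (root / "lib" / "plain.txt").write_text("no prefix here\n")
    recorded = files_containing_prefix(root, "/build/host_env")
```

Only the recorded files would later need their embedded prefix rewritten when an output is packaged.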

The cache restores files that were added to the prefix (conda-build also restored source files).
The cache work dir folder, including the cache sources, is also recreated at the "dirty" state of the end of the cache build. The individual outputs can add additional source files into the cached source folder.
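Determining which files were "added to the prefix" can be sketched as a set difference between two listings of the prefix. This is an illustration under the assumption that listings are sets of relative paths; the function name and file names are hypothetical.

```python
def new_files(before: set[str], after: set[str]) -> list[str]:
    """Files present after the cache build that were not in the prefix before."""
    return sorted(after - before)


# Hypothetical listings: host dependencies first, then the state after the build.
before = {"lib/libz.so", "include/zlib.h"}
after = before | {"lib/libfoo.so", "include/foo.h"}

added = new_files(before, after)
```

Files that were already installed by host dependencies (here the zlib files) are not part of the cache contents.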

Any new files in the prefix (from the cache) can then be used in the outputs with the `build.files` key:

```yaml
outputs:
  - package:
      name: foo-headers

    build:
      files:
        - include/**
  - package:
      name: libfoo
    build:
      files:
        - lib/**

Are you going to support negative globs in outputs.build.files selection? For example:

- package:
    name: foo-dev
  build:
    files:
      include:
        - lib/**
      exclude:
        - lib/**.so.*
- package:
    name: libfoo${{ soname }}
  build:
    files:
      include:
        - lib/**.so.*

Would be used to separate runtime libraries from development artifacts because only versioned shared objects are needed at runtime.

More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to lib/

I think we should go further and only allow

  build:
    files:
      include:
        - ...

and forbid the plain build.files. Forcing the include: to be present costs essentially nothing, but avoids the fallback logic ("is include/exclude present?"), gives a simpler schema, and makes it all-but-obvious that there's an exclude: variant. IMO that's way better than the single extra line it costs.

More importantly, build.files.include must respect the fundamental snapshotting mechanism ("was this file in host already? if so, it's not part of the output whose content we're in the process of determining"), unless overridden explicitly with something like always_include_files. This is essential for slicing a big build into several interrelated chunks. I make this argument in more detail here: conda/conda-build#5455


  - package:
      name: foo-devel
    requirements:
      run:
        - ${{ pin_subpackage("libfoo") }}
        - ${{ pin_subpackage("foo-headers") }}
```

The glob list syntax can also be a dictionary with `include / exclude` keys, e.g.

```yaml
files:
  include:
    - include/**
  exclude:
    - lib/**
```
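The effect of `include` / `exclude` globs can be illustrated with a small sketch. It uses Python's `fnmatch`, whose `*` also matches `/`, so it only approximates whatever glob semantics rattler-build applies; the function name, the file list, and the patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(paths, include, exclude=()):
    """Keep paths matching any include glob and no exclude glob."""
    def matches(path, patterns):
        return any(fnmatchcase(path, pattern) for pattern in patterns)

    return [p for p in paths if matches(p, include) and not matches(p, exclude)]


# Hypothetical cache contents: split versioned shared objects from the rest.
cache_files = ["include/foo.h", "lib/libfoo.so", "lib/libfoo.so.1", "lib/libfoo.a"]
selected = select_files(cache_files, include=["lib/**"], exclude=["lib/**.so.*"])
```

Here `selected` keeps the unversioned library and the static archive, while the versioned `lib/libfoo.so.1` is left for another output.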

Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.

I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.

@carterbox, Dec 11, 2024

I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in cache.requirements.host.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

Contributor Author

The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?

E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

I see the problem with the overdepending warning.

> The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

Yeah, and this gets easily forgotten. I understand the issue with over-imbuing outputs with dependencies, but in this case I think the safer default is to inherit the run-exports, unless the output already has its own build/host environment (because then it's safe to assume that this should determine the actual REs for that output).

Part of my thinking on that is that, in many cases, superfluous REs are pretty benign, e.g. it's basically irrelevant whether a metapackage that does pin_subpackage(pkg) has its own REs, when pkg already has those run-exports anyway.

So while I generally dislike recipes having to opt out of implicit things, I think in this case it's the safer/saner default.

> What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?
>
> E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

Can you explain the algorithm in a sentence? I think we need to have understandable rules. So far I think "inherit REs unless output has its own build/host env" is relatively easy to explain (aside from checking the technical boxes from my POV), and a more complicated algorithm would need to provide a pretty big improvement in comparison to be worth it IMO.

The background section of this proposal specifically calls out implicit behavior as a "downside" of the top-level package. In my experience, confusion about multi-output packages comes from unexpected behavior when the top-level shares a name with one of the output packages, so they aren't behaving independently. This is why I maintain that there should be no run_exports across package boundaries from the cache package to output package, because creating more interaction between output packages and the top-level package will increase confusion.

Adding a non-trivial run_export tracking algorithm from the cache package to the output packages will not further the design goal of making top-level package cache easier to understand and use. It will only make recipes less verbose. It will be adding complexity and adding implicit behaviors which this proposal claims to dislike. If the goal of this proposal is to make the recipe format more convenient or less verbose, then the background section should claim that instead of claiming that the problem is reader comprehension and implicit behavior.

In my opinion, it's easier for users to understand if all outputs inherit none of the run_exports from the cache package. The easiest way to explain the cache package would be that the cache package is a binary redistribution that you are unpacking for each of the output packages. Re-enumerating host/build dependencies is verbose, but it's easy to conceptualize as "If I was building only the things in this specific output, what would I need?"

I think I've made all my points about this topic, so I will stop trying to argue in favor of it.

This is a CEP, so the appropriate meeting is the Conda Community Sync? I can attend this week. Feb 12 @ 11am Chicago.

Contributor Author

Yes!

@seberg, Feb 13, 2025

From the side-lines since I was just bitten by this wanting to reduce duplication while splitting the recipe. (Trying to cargo cult from a recipe that was very conscious about the split requirements to get the right dependencies.)

I can understand that implicitly adding it to all dependencies build on the cache is also odd even if the failure mode is more graceful.

Maybe that doesn't fit the pattern at all, but how about forcing users to be explicit about it? So the user must either indicate which named outputs they want to export the cache's dependencies to, or indicate that they want to export them to none.

You could go as far as allowing ignore_run_exports that are specific for a named output here (to keep them paired with the dependencies):

- export_run_exports:
  split_name1
  split_name2
- export_run_exports:
  split_name3
  - ignore_run_exports:
    # maybe even allow ignoring of the ones that come from cache here?

(I doubt this is valid yaml, but you get the point)

EDIT: And of course some indicator if you have no entry (and an error if indicating nothing or names are wrong).

Contributor Author

That is not a bad idea! We could have a `cache.attach_run_exports_to: [ ... ]` field and require it to have at least one entry (at least by linting).

Contributor Author

@carterbox what are your feelings about this solution?


To ignore certain run exports, the usual `ignore_run_exports` specifiers can be used in each output.
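The inheritance rule as currently written in this text (cache run exports are attached to every output unless ignored) can be sketched as follows. This is only an illustration: the function name and the package specs are hypothetical, and the rule itself is still under discussion in the review thread.

```python
def run_exports_for_output(cache_exports, output_exports, ignore_by_name=()):
    """Union of cache and output run exports, minus names the output ignores."""
    ignored = set(ignore_by_name)
    merged = []
    for spec in [*cache_exports, *output_exports]:
        name = spec.split()[0]  # package name is the first token of the spec
        if name not in ignored and spec not in merged:
            merged.append(spec)
    return merged


# Hypothetical run exports collected from the cache's build and host dependencies.
cache_exports = ["libgcc >=13", "libzlib >=1.3,<2", "libfoo >=1.0"]
attached = run_exports_for_output(cache_exports, [], ignore_by_name=["libfoo"])
```

An output with no requirements of its own still receives the compiler and zlib run exports here, while `libfoo` is dropped because the output ignores it by name.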

> [!NOTE]
> We have pondered other logic for attaching run exports. We could have a more complicated algorithm that attaches the run exports only to the lowest package in a chain of packages connected by `pin_subpackage(..., exact=True)`. However, duplicating the same dependencies should not really matter much to the solver.