Define the `cache` key for v1 multi-output recipes #102

wolfv · 2024-11-27T13:51:20Z

This starts the definition for the new cache output in v1 recipes.

One can try the cache with --experimental in rattler-build. The documentation can be found here: https://prefix-dev.github.io/rattler-build/latest/multiple_output_cache/

h-vetinari

Thank you for writing this up, excited to see this moving forward! 🙏 😊

h-vetinari · 2024-11-27T23:12:24Z

cep-cache-output.md

+
+Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.
+
+There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).


Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172

cep-cache-output.md

h-vetinari · 2024-11-27T23:21:58Z

cep-cache-output.md

+Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
+</details>


To double-check, this would save $SRC_DIR to the scratchpad in between operations, and restore it for the installation of any output that needs it?

I would go as far as saying that cache: implies this behaviour. I don't want to cache stuff in $PREFIX (then I could just install it and call that an output), I want to cache it without installation.

h-vetinari · 2024-11-27T23:23:04Z

cep-cache-output.md

+        - ${{ pin_subpackage("foo-headers") }}
+```
+
+Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.


I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.

I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in cache.requirements.host.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?

E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

I see the problem with the overdepending warning.

The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

Yeah, and this gets easily forgotten. I understand the issue with over-imbuing outputs with dependencies, but in this case I think the safer default is to inherit the run-exports, unless the output already has its own build/host environment (because then it's safe to assume that this should determine the actual REs for that output).

Part of my thinking on that is that, in many cases, superfluous REs are pretty benign, e.g. it's basically irrelevant whether a metapackage that does pin_subpackage(pkg) has its own REs, when pkg already has those run-exports anyway.

So while I generally dislike recipes having to opt out of implicit things, I think in this case it's the safer/saner default.

What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?

E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

Can you explain the algorithm in a sentence? I think we need to have understandable rules. So far I think "inherit REs unless output has its own build/host env" is relatively easy to explain (aside from checking the technical boxes from my POV), and a more complicated algorithm would need to provide a pretty big improvement in comparison to be worth it IMO.

The background section of this proposal specifically calls out implicit behavior as a "downside" of the top-level package. In my experience, confusion about multi-output packages comes from unexpected behavior when the top-level shares a name as one of the output packages so they aren't behaving independently. This is why I maintain that there should be no run_exports across package boundaries from the cache package to output package because creating more interaction between output packages and the top-level package will increase confusion.

Adding a non-trivial run_export tracking algorithm from the cache package to the output packages will not further the design goal of making top-level package cache easier to understand and use. It will only make recipes less verbose. It will be adding complexity and adding implicit behaviors which this proposal claims to dislike. If the goal of this proposal is to make the recipe format more convenient or less verbose, then the background section should claim that instead of claiming that the problem is reader comprehension and implicit behavior.

In my opinion, it's easier for users to understand if all outputs inherit none of the run_exports from the cache package. The easiest way to explain the cache package would be that the cache package is a binary redistribution that you are unpacking for each of the output packages. Renumerating host/build dependencies is verbose, but it's easy to conceptualize as "If I was building only the things in this specific output, what would I need?"

I think I've made all my points about this topic, so I will stop trying to argue in favor of it.

This is a CEP, so the appropriate meeting is the Conda Community Sync? I can attend this week. Feb 12 @ 11am Chicago.

From the side-lines since I was just bitten by this wanting to reduce duplication while splitting the recipe. (Trying to cargo cult from a recipe that was very conscious about the split requirements to get the right dependencies.)

I can understand that implicitly adding it to all dependencies build on the cache is also odd even if the failure mode is more graceful.

Maybe that doesn't fit the pattern at all, but how about forcing users to be explicit about it? So the user must either indicate which named outputs they want to export the caches dependencies to or indicate that they want to export it to none

You could go as far as allowing ignore_run_exports that are specific for a named output here (to keep them paired with the dependencies):

- export_run_exports: split_name1 split_name2 - export_run_exports: split_name3 - ignore_run_exports: # maybe even allow ignoring of the ones that come from cache here?

(I doubt this is valid yaml, but you get the point)

EDIT: And of course some indicator if you have no entry (and an error if indicating nothing or names are wrong).

That is not a bad idea! We could have an cache.attach_run_exports_to: [ ... ] field and we can require it to have at least one entry (at least by linting).

@carterbox what are your feelings about this solution?

h-vetinari · 2024-11-27T23:38:57Z

cep-cache-output.md

+Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
+</details>
+
+The new files can then be used in the outputs with the `build.files` key:


How do I denote a dependence of a given output on the cache: phase? For example, we had a bunch of discussion in pytorch over conda-forge/pytorch-cpu-feedstock@981a04d (later reverted conda-forge/pytorch-cpu-feedstock@54fa422), because it is undocumented under what conditions a build will "fan" out (e.g. build multiple python versions on top of the same library; rather than generate a job per python version)

In other words, we need to have some clarity whether an output like

cache: requirements: build: - ${{ compiler('c') }} host: - # no python here outputs: - package: name: foo requirements: host: - # depends on cache implicitly? - python

together with a CBC of

python: - 3.9 - 3.10 - 3.11 - 3.12

will actually "fold" the python dimension into foo or not.

In conda-build, all outputs implicitly depend on the global build stage; perhaps we should make this explicit? build.files already does this partway (for outputs that only use files:). On the other hand, it might be more intuitive to keep the implication that cache: affects everything else - see discussion about skip:).

In the example repo you can try: https://github.com/wolfv/rattler-build-cache-test

It creates one libcalculator and two py-calculators based on the variants.yaml.

Also, right now, if there is a cache, all outputs are affected.

h-vetinari · 2024-11-27T23:44:33Z

cep-cache-output.md

+  build:
+    # note: in the current implementation, all other keys for build
+    #       are also valid, but unused
+    script: build.sh


One thing that's been a permanent annoyance with the way things have grown in conda-build is that a "global" build.skip: ... does not mean the entire build will get cancelled, because - in theory - the various outputs could do their own thing that doesn't rely on the package-level build. So in order to actually skip a given combinatio 6D47 n, it needs to be skipped on all outputs (which quickly gets annoying in large multi-output recipes, especially for quick debugging).

If we go with the "all outputs implicitly depend on cache:" approach, then I think it would be good to specify that a skip: true on the cache: level completely skips the build, without having to repeat it per output.

Hmm, I don't think the cache should have anything to do with skipping.

The top-level can skip fine (which is intersected with the outputs).

Ie. would skip everything on win:

build: skip: [win] outputs: - ...

wolfv · 2024-11-29T10:37:36Z

@h-vetinari I played around with also caching the "source" in prefix-dev/rattler-build#1226 and the example in https://github.com/wolfv/rattler-build-cache-test.

I now think that the cache should have its own source section. Then everything else will become quite straight-forward.

The example would then look like:

recipe:
  name: calculator
  version: 1.0.0

cache:
  source:
    path: ../

  requirements:
    build:
      - ${{ compiler('cxx') }}
      - cmake
      - ninja
  build:
    script:
      # make sure that `alternative_name.md` is not present
      - test ! -f ./alternative_name.md
      - mkdir build
      - cd build
      - cmake $SRC_DIR -GNinja ${CMAKE_ARGS}
      - ninja install

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator

    requirements:
      run_exports:
        - ${{ pin_subpackage('libcalculator') }}

  - package:
      name: py-calculator
    source:
      - path: ../README.md
        file_name: alternative_name.md

    requirements:
      build:
        - ${{ compiler('cxx') }}
        - cmake
        - ninja
      host:
        - pybind11
        - python
        - libcalculator

    build:
      script:
        # assert that the README.md file is present
        - test -f ./alternative_name.md
        - cd build
        - cmake $SRC_DIR -GNinja ${CMAKE_ARGS} -DBUILD_PYTHON_BINDINGS=ON
        - ninja install

jaimergp · 2024-12-04T15:58:46Z

That's so close to the idea of cache being just a type of output! 🥳

outputs:
  - cache:
      name: optional, only needed if there are several caches
  - package:
      name: my-output
    build:
      needs-cache: bool # or cache.name

wolfv · 2024-12-04T16:14:45Z

Yeah it could be done, but I think we can let it wait for v2 of the recipe format! :)

h-vetinari · 2024-12-04T16:47:51Z

I now think that the cache should have its own source section. Then everything else will become quite straight-forward.

That sounds fine in principle. Could you explain a bit more what the thinking behind that is though? If there's already a source: on recipe-level, would the cache: have to repeat something?

That's so close to the idea of cache being just a type of output! 🥳

That would sound even better! More powerful and yet less divergence. (no opinion on whether this is v1 or v2 stuff)

However, this I don't like

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator

I think this should never be automatic, and always need an explicit opt-in. Either the output says files: (in which case it could pick up parts of what cache: installed), or it actually does the ninja install itself (rather than during cache:). This would also be a better match for what the example is already doing for py-calculator.

carterbox · 2024-12-10T22:25:40Z

@carterbox, please review

carterbox · 2024-12-11T17:39:29Z

cep-cache-output.md

+    # run: ...
+    # run_constraints: ...
+    # ignore_run_exports: ...


It doesn't make sense that sections besides cache.build.script, cache.requirements.build and cache.requirements.host are allowed because this is a cache that would never be installed to an end-user environment.

The runtime dependencies don't need to be resolved for the cache because all of the relevant dependencies should be explicitly repeated in the actual outputs packages.

You are right. ignore_run_exports makes sense though.

ignore_run_exports makes sense though.

Why? Could you please explain why you want run_exports from the cache build to all of the outputs?

I elaborated on other options further below. We need to attach run exports from the cache to at least 1 output that uses the cache, otherwise there will be bugs.

carterbox · 2024-12-11T18:05:10Z

cep-cache-output.md

+
+<details> 
+  <summary>script build.sh default discussion</summary>
+We had some debate wether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.


In a multi-output recipe, do the outputs default to build.sh?

Yes, that is currently the case. We're thinking of deprecating this though and require users to be explicit about build scripts everywhere. What would you prefer?

cep-cache-output.md

carterbox · 2024-12-11T18:32:37Z

cep-cache-output.md

+      name: libfoo
+    build:
+      files:
+        - lib/**


Are you going to support negative globs in outputs.build.files selection? For example:

- package: name: foo-dev build: files: include: - lib/** exclude: - lib/**.so.* - package: name: libfoo${{ soname }} build: files: include: - lib/**.so.*

Would be used to separate runtime libraries from development artifacts because only versioned shared objects are needed at runtime.

More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to lib/

I think we should go further and only allow

build: files: include: - ...

and forbid the plain build.files. Forcing the include: to be present costs essentially nothing, but avoids the fallback logic ("is include/exclude present?"), gives a simpler schema, and makes it all-but-obvious that there's an exclude: variant. IMO that's way better than the single extra line it costs.

More importantly, build.files.include must respect the fundamental snapshotting mechanism ("was this file in host already? if so, it's not part of the output whose content we're in the process of determining"), unless overridden explicitly with something like always_include_files. This is essential for slicing a big build into several interrelated chunks. I make this argument in more detail here: conda/conda-build#5455

carterbox · 2024-12-11T18:43:43Z

cep-cache-output.md

+        - ${{ pin_subpackage("foo-headers") }}
+```
+
+Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.


I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in cache.requirements.host.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

wolfv · 2025-03-13T16:36:12Z

We just had a discussion at the build tools meeting with new ideas for the name "cache".

One idea was "prepare" / "prepare-step" / or something along these lines.
I was also thinking of "pre-build-step", or "pre_build" or something like this.

If anyone has another idea, would love to hear it. Cache is a bit too overloaded :)

wolfv · 2025-03-13T16:37:23Z

These are ideas from my trusty assistant:

prepare
preprocess
common_build
shared_step
foundation
base_build
intermediate
scaffold
staging
preflight
common_prep
shared_build
core_build

carterbox · 2025-03-13T17:45:51Z

What about something that includes the word "outputs" or "multioutput" so it's more obvious that you would only use this section when creating a multi-output recipe. For example:

recipe:
  name: ...
  version: ...

outputs_staging: ...

outputs:
  - package: ...
  - package: ...

"prepare_outputs", "pre_outputs", "multioutput_cache", "outputs_prebuild", "outputs_scaffold", "outputs_staging", ...

jaimergp · 2025-03-13T19:08:57Z

Yea, I like where @carterbox is going. Some more: before_outputs. Also, since we are talking about "preparing the workspace for the later builds", what about prepare_workspace, prepare_output, output_setup or similar?

h-vetinari · 2025-03-13T22:46:26Z

How about shared_build or perhaps outputs_shared_build...? Actually, if we want to emphasi 3D11 se the association with outputs, we could go one step further, and put it under outputs: directly:

recipe:
  name: ...
  version: ...

outputs:
  - shared:  # or common, shared_build, ...
  - package: ...
  - package: ...

wolfv · 2025-03-14T08:23:21Z

I think moving it under outputs is a nice idea. Main difficulty I see is that people might expect to be able to create more than 1 of those which is not possible right now (maybe later we can allow naming them and having multiple, though). But we can error on that, too ...

carterbox · 2025-03-14T23:11:04Z

The cache is not itself an output though, so we would need to choose a name which indicates as such and doesn't end with the word "build". Otherwise, we would have an outputs.build.build section.

recipe:
  name: ...
  version: ...

outputs:
  - definitely_not_an_output_but_used_to_make_outputs:
      source: ...
      requirements:
        build: ...
        host: ...
      build: ...
  - package: ...
  - package: ...

😆 But seriously... something like setup, staging, scaffold, precursor, bootstrap, mock.

add cep for the v1 recipe cache output

b527b4c

wolfv changed the title ~~add first version of channel registry cep~~ Define the cache output for v1 multi-output recipes Nov 27, 2024

wolfv changed the title ~~Define the cache output for v1 multi-output recipes~~ Define the cache key for v1 multi-output recipes Nov 27, 2024

h-vetinari reviewed Nov 27, 2024

View reviewed changes

carterbox reviewed Dec 11, 2024

View reviewed changes

wolfv mentioned this pull request Jan 20, 2025

No way to pass --experimental to rattler-build conda-forge/conda-smithy#2057

Open

carterbox mentioned this pull request Jan 29, 2025

Explore recipe format v1 compatibility conda-forge/cuda-feedstock#68

Open

wolfv added 2 commits February 5, 2025 13:09

update cep

aea349c

improve the CEP

558c784

This was referenced Mar 2, 2025

add ipython output _ipython_tests conda-forge/admin-requests#1412

Open

Allow specifying override channel targets on a per-output basis conda-forge/conda-smithy#2263

Open

olivier-roussel mentioned this pull request Mar 20, 2025

Add SOFA recipe conda-forge/staged-recipes#23085

Open

10 tasks


		Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.

		There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).

		Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
		</details>

Define the cache key for v1 multi-output recipes #102

Are you sure you want to change the base?

Define the cache key for v1 multi-output recipes #102

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Define the `cache` key for v1 multi-output recipes #102

Define the `cache` key for v1 multi-output recipes #102