8000 Define the `cache` key for v1 multi-output recipes by wolfv · Pull Request #102 · conda/ceps · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Define the cache key for v1 multi-output recipes #102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

wolfv
Copy link
Contributor
@wolfv wolfv commented Nov 27, 2024

This starts the definition for the new cache output in v1 recipes.

One can try the cache with --experimental in rattler-build. The documentation can be found here: https://prefix-dev.github.io/rattler-build/latest/multiple_output_cache/

@wolfv wolfv changed the title add first version of channel registry cep Define the cache output for v1 multi-output recipes Nov 27, 2024
@wolfv wolfv changed the title Define the cache output for v1 multi-output recipes Define the cache key for v1 multi-output recipes Nov 27, 2024
Copy link
@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this up, excited to see this moving forward! 🙏 😊


Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.

There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172

Comment on lines 71 to 72
Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
</details>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To double-check, this would save $SRC_DIR to the scratchpad in between operations, and restore it for the installation of any output that needs it?

I would go as far as saying that cache: implies this behaviour. I don't want to cache stuff in $PREFIX (then I could just install it and call that an output), I want to cache it without installation.

- ${{ pin_subpackage("foo-headers") }}
```

Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.

Copy link
@carterbox carterbox Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in cache.requirements.host.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?

E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

I see the problem with the overdepending warning.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is that the outputs would then have to setup a build / host environment for nothing (just to add run exports).

Yeah, and this gets easily forgotten. I understand the issue with over-imbuing outputs with dependencies, but in this case I think the safer default is to inherit the run-exports, unless the output already has its own build/host environment (because then it's safe to assume that this should determine the actual REs for that output).

Part of my thinking on that is that, in many cases, superfluous REs are pretty benign, e.g. it's basically irrelevant whether a metapackage that does pin_subpackage(pkg) has its own REs, when pkg already has those run-exports anyway.

So while I generally dislike recipes having to opt out of implicit things, I think in this case it's the safer/saner default.

What would you think of a more complex algorithm where we only attach the run exports to the lowest level output?

E.g. cache -> pkg -> pkg-devel where pkg-devel -> pin_subpackage(pkg, exact=True)?

Can you explain the algorithm in a sentence? I think we need to have understandable rules. So far I think "inherit REs unless output has its own build/host env" is relatively easy to explain (aside from checking the technical boxes from my POV), and a more complicated algorithm would need to provide a pretty big improvement in comparison to be worth it IMO.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The background section of this proposal specifically calls out implicit behavior as a "downside" of the top-level package. In my experience, confusion about multi-output packages comes from unexpected behavior when the top-level shares a name as one of the output packages so they aren't behaving independently. This is why I maintain that there should be no run_exports across package boundaries from the cache package to output package because creating more interaction between output packages and the top-level package will increase confusion.

Adding a non-trivial run_export tracking algorithm from the cache package to the output packages will not further the design goal of making top-level package cache easier to understand and use. It will only make recipes less verbose. It will be adding complexity and adding implicit behaviors which this proposal claims to dislike. If the goal of this proposal is to make the recipe format more convenient or less verbose, then the background section should claim that instead of claiming that the problem is reader comprehension and implicit behavior.

In my opinion, it's easier for users to understand if all outputs inherit none of the run_exports from the cache package. The easiest way to explain the cache package would be that the cache package is a binary redistribution that you are unpacking for each of the output packages. Renumerating host/build dependencies is verbose, but it's easy to conceptualize as "If I was building only the things in this specific output, what would I need?"

I think I've made all my points about this topic, so I will stop trying to argue in favor of it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a CEP, so the appropriate meeting is the Conda Community Sync? I can attend this week. Feb 12 @ 11am Chicago.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

Copy link
@seberg seberg Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the side-lines since I was just bitten by this wanting to reduce duplication while splitting the recipe. (Trying to cargo cult from a recipe that was very conscious about the split requirements to get the right dependencies.)

I can understand that implicitly adding it to all dependencies build on the cache is also odd even if the failure mode is more graceful.

Maybe that doesn't fit the pattern at all, but how about forcing users to be explicit about it? So the user must either indicate which named outputs they want to export the caches dependencies to or indicate that they want to export it to none

You could go as far as allowing ignore_run_exports that are specific for a named output here (to keep them paired with the dependencies):

- export_run_exports:
  split_name1
  split_name2
- export_run_exports:
  split_name3
  - ignore_run_exports:
    # maybe even allow ignoring of the ones that come from cache here?

(I doubt this is valid yaml, but you get the point)

EDIT: And of course some indicator if you have no entry (and an error if indicating nothing or names are wrong).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not a bad idea! We could have an cache.attach_run_exports_to: [ ... ] field and we can require it to have at least one entry (at least by linting).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carterbox what are your feelings about this solution?

Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
</details>

The new files can then be used in the outputs with the `build.files` key:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I denote a dependence of a given output on the cache: phase? For example, we had a bunch of discussion in pytorch over conda-forge/pytorch-cpu-feedstock@981a04d (later reverted conda-forge/pytorch-cpu-feedstock@54fa422), because it is undocumented under what conditions a build will "fan" out (e.g. build multiple python versions on top of the same library; rather than generate a job per python version)

In other words, we need to have some clarity whether an output like

cache:
  requirements:
    build:
      - ${{ compiler('c') }}
    host:
      - # no python here
 
outputs:
  - package:
      name: foo
    requirements:
      host:
        - # depends on cache implicitly?
        - python

together with a CBC of

python:
  - 3.9
  - 3.10
  - 3.11
  - 3.12

will actually "fold" the python dimension into foo or not.

In conda-build, all outputs implicitly depend on the global build stage; perhaps we should make this explicit? build.files already does this partway (for outputs that only use files:). On the other hand, it might be more intuitive to keep the implication that cache: affects everything else - see discussion about skip:).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example repo you can try: https://github.com/wolfv/rattler-build-cache-test

It creates one libcalculator and two py-calculators based on the variants.yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, right now, if there is a cache, all outputs are affected.

Comment on lines 41 to 44
build:
# note: in the current implementation, all other keys for build
# are also valid, but unused
script: build.sh

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's been a permanent annoyance with the way things have grown in conda-build is that a "global" build.skip: ... does not mean the entire build will get cancelled, because - in theory - the various outputs could do their own thing that doesn't rely on the package-level build. So in order to actually skip a given combinatio 6D47 n, it needs to be skipped on all outputs (which quickly gets annoying in large multi-output recipes, especially for quick debugging).

If we go with the "all outputs implicitly depend on cache:" approach, then I think it would be good to specify that a skip: true on the cache: level completely skips the build, without having to repeat it per output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think the cache should have anything to do with skipping.

The top-level can skip fine (which is intersected with the outputs).

Ie. would skip everything on win:

build:
  skip: [win]

outputs:
  - ... 

@wolfv
Copy link
Contributor Author
wolfv commented Nov 29, 2024

@h-vetinari I played around with also caching the "source" in prefix-dev/rattler-build#1226 and the example in https://github.com/wolfv/rattler-build-cache-test.

I now think that the cache should have its own source section. Then everything else will become quite straight-forward.

The example would then look like:

recipe:
  name: calculator
  version: 1.0.0

cache:
  source:
    path: ../

  requirements:
    build:
      - ${{ compiler('cxx') }}
      - cmake
      - ninja
  build:
    script:
      # make sure that `alternative_name.md` is not present
      - test ! -f ./alternative_name.md
      - mkdir build
      - cd build
      - cmake $SRC_DIR -GNinja ${CMAKE_ARGS}
      - ninja install

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator

    requirements:
      run_exports:
        - ${{ pin_subpackage('libcalculator') }}

  - package:
      name: py-calculator
    source:
      - path: ../README.md
        file_name: alternative_name.md

    requirements:
      build:
        - ${{ compiler('cxx') }}
        - cmake
        - ninja
      host:
        - pybind11
        - python
        - libcalculator

    build:
      script:
        # assert that the README.md file is present
        - test -f ./alternative_name.md
        - cd build
        - cmake $SRC_DIR -GNinja ${CMAKE_ARGS} -DBUILD_PYTHON_BINDINGS=ON
        - ninja install

@jaimergp
Copy link
Contributor
jaimergp commented Dec 4, 2024

That's so close to the idea of cache being just a type of output! 🥳

outputs:
  - cache:
      name: optional, only needed if there are several caches
  - package:
      name: my-output
    build:
      needs-cache: bool # or cache.name

@wolfv
Copy link
Contributor Author
wolfv commented Dec 4, 2024

Yeah it could be done, but I think we can let it wait for v2 of the recipe format! :)

@h-vetinari
Copy link

I now think that the cache should have its own source section. Then everything else will become quite straight-forward.

That sounds fine in principle. Could you explain a bit more what the thinking behind that is though? If there's already a source: on recipe-level, would the cache: have to repeat something?

That's so close to the idea of cache being just a type of output! 🥳

That would sound even better! More powerful and yet less divergence. (no opinion on whether this is v1 or v2 stuff)

However, this I don't like

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator

I think this should never be automatic, and always need an explicit opt-in. Either the output says files: (in which case it could pick up parts of what cache: installed), or it actually does the ninja install itself (rather than during cache:). This would also be a better match for what the example is already doing for py-calculator.

@carterbox
Copy link

@carterbox, please review

Comment on lines 38 to 40
# run: ...
# run_constraints: ...
# ignore_run_exports: ...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make sense that sections besides cache.build.script, cache.requirements.build and cache.requirements.host are allowed because this is a cache that would never be installed to an end-user environment.

The runtime dependencies don't need to be resolved for the cache because all of the relevant dependencies should be explicitly repeated in the actual outputs packages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. ignore_run_exports makes sense though.

Copy link
@carterbox carterbox Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore_run_exports makes sense though.

Why? Could you please explain why you want run_exports from the cache build to all of the outputs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I elaborated on other options further below. We need to attach run exports from the cache to at least 1 output that uses the cache, otherwise there will be bugs.


<details>
<summary>script build.sh default discussion</summary>
We had some debate wether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a multi-output recipe, do the outputs default to build.sh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is currently the case. We're thinking of deprecating this though and require users to be explicit about build scripts everywhere. What would you prefer?

name: libfoo
build:
files:
- lib/**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to support negative globs in outputs.build.files selection? For example:

- package:
    name: foo-dev
  build:
    files:
      include:
        - lib/**
      exclude:
        - lib/**.so.*
- package:
    name: libfoo${{ soname }}
  build:
    files:
      include:
        - lib/**.so.*

Would be used to separate runtime libraries from development artifacts because only versioned shared objects are needed at runtime.

More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to lib/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should go further and only allow

  build:
    files:
      include:
        - ...

and forbid the plain build.files. Forcing the include: to be present costs essentially nothing, but avoids the fallback logic ("is include/exclude present?"), gives a simpler schema, and makes it all-but-obvious that there's an exclude: variant. IMO that's way better than the single extra line it costs.

More importantly, build.files.include must respect the fundamental snapshotting mechanism ("was this file in host already? if so, it's not part of the output whose content we're in the process of determining"), unless overridden explicitly with something like always_include_files. This is essential for slicing a big build into several interrelated chunks. I make this argument in more detail here: conda/conda-build#5455

- ${{ pin_subpackage("foo-headers") }}
```

Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.
Copy link
@carterbox carterbox Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of this because it is not an explicit behavior and not every output will need the run_exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in cache.requirements.host.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior which is that each output must explicitly enumerate their dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.

@wolfv
Copy link
Contributor Author
wolfv commented Mar 13, 2025

We just had a discussion at the build tools meeting with new ideas for the name "cache".

One idea was "prepare" / "prepare-step" / or something along these lines.
I was also thinking of "pre-build-step", or "pre_build" or something like this.

If anyone has another idea, would love to hear it. Cache is a bit too overloaded :)

@wolfv
Copy link
Contributor Author
wolfv commented Mar 13, 2025

These are ideas from my trusty assistant:

prepare
preprocess
common_build
shared_step
foundation
base_build
intermediate
scaffold
staging
preflight
common_prep
shared_build
core_build

@carterbox
Copy link
carterbox commented Mar 13, 2025

What about something that includes the word "outputs" or "multioutput" so it's more obvious that you would only use this section when creating a multi-output recipe. For example:

recipe:
  name: ...
  version: ...

outputs_staging: ...

outputs:
  - package: ...
  - package: ...

"prepare_outputs", "pre_outputs", "multioutput_cache", "outputs_prebuild", "outputs_scaffold", "outputs_staging", ...

@jaimergp
Copy link
Contributor
jaimergp commented Mar 13, 2025

Yea, I like where @carterbox is going. Some more: before_outputs. Also, since we are talking about "preparing the workspace for the later builds", what about prepare_workspace, prepare_output, output_setup or similar?

@h-vetinari
Copy link

How about shared_build or perhaps outputs_shared_build...? Actually, if we want to emphasi 3D11 se the association with outputs, we could go one step further, and put it under outputs: directly:

recipe:
  name: ...
  version: ...

outputs:
  - shared:  # or common, shared_build, ...
  - package: ...
  - package: ...

@wolfv
Copy link
Contributor Author
wolfv commented Mar 14, 2025

I think moving it under outputs is a nice idea. Main difficulty I see is that people might expect to be able to create more than 1 of those which is not possible right now (maybe later we can allow naming them and having multiple, though). But we can error on that, too ...

@carterbox
AC7D
Copy link

The cache is not itself an output though, so we would need to choose a name which indicates as such and doesn't end with the word "build". Otherwise, we would have an outputs.build.build section.

recipe:
  name: ...
  version: ...

outputs:
  - definitely_not_an_output_but_used_to_make_outputs:
      source: ...
      requirements:
        build: ...
        host: ...
      build: ...
  - package: ...
  - package: ...

😆 But seriously... something like setup, staging, scaffold, precursor, bootstrap, mock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0