seccomp: add support for "clone3" syscall in default policy #42681
If no seccomp policy is requested, then the built-in default policy in dockerd applies. This has no rule for "clone3" defined, nor any default errno defined. So when runc receives the config it attempts to determine a default errno, using logic defined in its commit:

opencontainers/runc@7a8d716

As explained in the above commit message, runc uses a heuristic to decide which errno to return by default:

[quote]
The solution applied here is to prepend a "stub" filter which returns -ENOSYS if the requested syscall has a larger syscall number than any syscall mentioned in the filter. The reason for this specific rule is that syscall numbers are (roughly) allocated sequentially and thus newer syscalls will (usually) have a larger syscall number -- thus causing our filters to produce -ENOSYS if the filter was written before the syscall existed.
[/quote]

Unfortunately clone3 appears to be one of the edge cases that does not result in use of ENOSYS, instead ending up with the historical EPERM errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use clone3 by default. If it sees ENOSYS then it will automatically fall back to using clone. Any other errno is treated as a fatal error. Thus when the docker seccomp policy triggers EPERM from clone3, no fallback occurs and programs are unable to spawn threads.

The clone3 syscall is much more complicated than clone; most notably, its flags are no longer exposed as a direct argument. Instead they are hidden inside a struct. This means that seccomp filters are unable to apply policy based on values seen in flags. Thus we can't directly replicate the current "clone" filtering for "clone3". We can at least ensure "clone3" returns the ENOSYS errno, to trigger fallback to "clone", at which point we can filter on flags.

Fixes: moby#42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
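The interaction described in the commit message can be modelled in a few lines. This is only an illustrative sketch, not runc's or glibc's actual code: the function names are hypothetical, and the set of syscalls in the example profile is an assumption chosen to show the edge case. The syscall numbers are the x86_64 values.

```python
# Illustrative model only; not runc's or glibc's actual code.
EPERM, ENOSYS = 1, 38  # Linux errno values

# x86_64 syscall numbers. Which of them a given profile mentions is an
# assumption made for this example.
SYSCALL_NR = {"clone": 56, "clone3": 435, "faccessat2": 439}


def default_errno(nr, nrs_in_filter):
    """runc-style stub: ENOSYS only for numbers above everything in the filter."""
    return ENOSYS if nr > max(nrs_in_filter) else EPERM


def spawn_thread(clone3_errno):
    """glibc-style fallback: only ENOSYS from clone3 triggers a retry with clone."""
    if clone3_errno == ENOSYS:
        return "fell back to clone"
    return "fatal error"


# Hypothetical profile that mentions clone and faccessat2 but has no clone3 rule.
profile = {SYSCALL_NR["clone"], SYSCALL_NR["faccessat2"]}
result = default_errno(SYSCALL_NR["clone3"], profile)
print(result, spawn_thread(result))  # 1 fatal error
```

Because the hypothetical profile mentions a syscall numbered higher than clone3, the stub does not apply and the historical EPERM is returned, which the fallback logic treats as fatal.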
LGTM, thanks!
@AkihiroSuda I see you marked this for backporting, but doing so would also require #42005 to be included to add support for |
SGTM |
The current Docker version on Ubuntu 20.04 used by GH Actions suffers from an incompatibility with the newer glibc [0] used by Fedora Rawhide, causing Rawhide containers in CI to fail with:

```
Errors during downloading metadata for repository 'fedora-cisco-openh264':
  - Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=fedora-cisco-openh264-rawhide&arch=x86_64 [getaddrinfo() thread failed to start]
```

glibc 2.34 and later tries to use the clone3 syscall (for hardware-assisted security hardening on x86_64), and falls back to clone on ENOSYS. However, with the current seccomp profile Docker returns EPERM instead, which is considered a "hard" fail.

A fix [1] has been merged upstream, but until it is released let's run the CI Docker containers without any seccomp profiles to allow Rawhide jobs to do their job.

(I tried to disable seccomp only for the Rawhide jobs, but I couldn't procure any solution which wouldn't make my eyes bleed...)

[0] moby/moby#42680
[1] moby/moby#42681
@gotmax23 It must be a coincidence, but today I have actually been working on a backport of this for Ubuntu 21.04, because Ubuntu 21.10 containers now also require this fix: they have started to use Glibc >= 2.34 as well. I have backported both this fix and #42005 (with some slight modifications for Ubuntu) as mentioned in this PR, and it seems to do the trick for me. If I understand correctly, Docker 20.10.8 still doesn't contain this fix. Also, the milestone of this PR is 21.xx. Personally I think it's important enough to backport it (or release the next version very soon 😉).
@pascallj, you are right! I have updated my comment accordingly. Could you please add support for Ubuntu 20.04 to your PPA? |
Thank you, @pascallj! I tested the |
I had no previous experience with buildx or BuildKit, so I'm not quite sure. These are way too complicated for my Docker use cases. I did some testing, but I might be completely wrong. It seems to depend on which driver your builder instance uses. If your builder instance uses the docker driver, it uses the docker daemon and therefore works fine (with the ppa packages). If your builder instance uses the docker-container driver instead, it loads a BuildKit container and therefore you completely depend on the capabilities of this container. This issue is not present in everything after and including commit moby/buildkit@8021a3e. By default buildx uses the latest stable BuildKit image which is at the moment two months old and therefore does not contain said commit. However if you create a builder instance with an image tag (
So if I'm right, the problem is also fixed in BuildKit (which is used by buildx), but is just not released to stable. |
See moby/moby#42681 Signed-off-by: AeroStun <24841307+AeroStun@users.noreply.github.com>
I admit I'm not well-versed in the details around this syscall, but we do allow |
The Blocking |
We're looking if we can backport this to the 20.10 branch; we previously tried to do so, but it also would include a (rather large) refactor, so perhaps we should have an implementation of this that targets the 20.10 branch (before the refactor) |
See moby/moby#42681 [skip ci]
It appears that some container deity somewhere has fixed the Docker issue [1] that prevented us from upgrading beyond F34, but there was another gotcha introduced in the meanwhile on Fedora side: glibc-gconv-extras is now needed for our UTF-8 encoding check to work. While at it, optimize the dnf side a bit: get rid of modularity repos entirely so they don't come back via updates, and disable the H.264 repo too, we don't need *that* for building or testing rpm... [1] moby/moby#42681
It appears that some container deity somewhere has fixed the Docker issue [1] that prevented us from upgrading beyond F34, but there was another gotcha introduced in the meanwhile on Fedora side: glibc-gconv-extras is now needed for our UTF-8 encoding check to work. While at it, optimize the dnf side a bit: get rid of modularity repos entirely so they don't come back via updates, and disable the H.264 repo too, we don't need *that* for building or testing rpm... [1] moby/moby#42681 (cherry picked from commit 6761c39)
Seems like a combination of `ubuntu-latest` and/or the move to f37 glibc is causing `createrepo_c` to hit the classic `clone3` Docker seccomp issue: moby/moby#42681 Hack around this by running the container in privileged mode.
Ubuntu archived short-term release 21.10 and moved it to the old-releases.ubuntu.com site. We still have to use it because older Docker versions are affected by moby/moby#42681 To fix the build switch apt sources to old-releases before installing packages. Change-Id: I0432cd0002b4e955399539a5b0ddaba21b4535cc Reviewed-on: https://cos-review.googlesource.com/c/cos/tools/+/36309 Reviewed-by: Arnav Kansal <rnv@google.com> Tested-by: Oleksandr Tymoshenko <ovt@google.com> Cloud-Build: GCB Service account <228075978874@cloudbuild.gserviceaccount.com>
On Docker versions < 20.10.9, `apt update` fails due to the use of syscall `clone3` by `Glibc >= 2.34`. This change upgrades the base distribution used by Travis to `jammy`, which contains Docker engine 20.10.12. See https://docs.travis-ci.com/user/reference/jammy/#docker and moby/moby#42681 for reference.
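The two version thresholds quoted in that commit message can be turned into a quick check. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the default seccomp profile is assumed to be in use, and only plain dotted numeric versions are handled (suffixes such as distribution tags are not).

```python
def parse_version(text):
    """Turn '20.10.8' into (20, 10, 8) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))


def hits_clone3_issue(docker_version, glibc_version):
    """True when the engine predates 20.10.9 and the image's glibc is >= 2.34."""
    return (parse_version(docker_version) < parse_version("20.10.9")
            and parse_version(glibc_version) >= parse_version("2.34"))


print(hits_clone3_issue("20.10.8", "2.34"))   # True
print(hits_clone3_issue("20.10.12", "2.35"))  # False
```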
* Removed duplicate utimes
* Added additional syscalls, matching what Docker/Containerd allow. Source: https://github.com/containerd/containerd/blob/main/contrib/seccomp/seccomp_default.go
* Refactored the list of syscalls by not using the AllowSyscall() func. This was a carry-over from following Garden/Guardian's implementation. I personally find it easier to manage the list of syscalls directly in the specs.LinuxSyscall struct. I left the syscalls that we specify args for as-is since I think the call to AllowSyscall() does make them easier to read.
* Added clone3 and have it always return ENOSYS, which is how users of it know to fall back to clone. See [here](containerd/containerd#5982) and [here](moby/moby#42681) for details

Signed-off-by: Taylor Silva <dev@taydev.net>
- What I did
Modified the default seccomp profile so that clone3 is explicitly requested to give ENOSYS instead of the default EPERM, when CAP_SYS_ADMIN is unset.
If CAP_SYS_ADMIN is set, then clone3 is simply allowed unconditionally
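As a rough illustration of the two cases above, the rules could look like the following when written in the seccomp profile JSON format. This is a sketch, not the exact change in this PR: the field names follow the Docker seccomp profile format as I understand it, and splitting clone3 into its own allow rule is an assumption made to keep the example short.

```python
import json

ENOSYS = 38  # Linux errno value

# Two rules modelled on the description above (layout is an assumption).
clone3_rules = [
    {   # Without CAP_SYS_ADMIN: fail with ENOSYS so glibc retries with clone.
        "names": ["clone3"],
        "action": "SCMP_ACT_ERRNO",
        "errnoRet": ENOSYS,
        "excludes": {"caps": ["CAP_SYS_ADMIN"]},
    },
    {   # With CAP_SYS_ADMIN: allow clone3 unconditionally.
        "names": ["clone3"],
        "action": "SCMP_ACT_ALLOW",
        "includes": {"caps": ["CAP_SYS_ADMIN"]},
    },
]

print(json.dumps(clone3_rules, indent=2))
```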
- How to verify it
Test by using
$ docker run registry.fedoraproject.org/fedora:rawhide curl google.com
It should dump the HTML if seccomp is correctly triggering the fallback from clone3 to clone.
- Description for the changelog
Explicitly set the clone3 syscall to return ENOSYS to ensure glibc will correctly fall back to using clone. This fixes the ability to spawn threads in Fedora 35 rawhide container images, which now default to clone3. The default errno of EPERM results in a fatal error, making the images unusable when seccomp is enabled.
Fixes #42680
fixes #42963
fixes #42876