-
Notifications
You must be signed in to change notification settings - Fork 18.7k
RFE: Reworking Seccomp Profiles to ENOSYS Default #42871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Just to make it perfectly clear how the current runc heuristic works for the context of this dicussion: Runc prepends a tiny eBPF stub to every seccomp profile it is instructed to create. This program checks the syscall number and if the syscall number is larger than any syscall specified in the profile, it returns ENOSYS. Otherwise it executes the libseccomp-generated eBPF program. This means that adding any new syscalls to the seccomp profile must be done with extreme caution because if you skip a syscall, that syscall will now return EPERM and glibc will break in 6 months time as a result. Unfortunately there wasn't a better solution, because libseccomp doesn't allow you to create conditionals based on syscall number, nor can you construct an ENOSYS-default profile using just the EPERM-default profile because there are limitations on what kinds of boolean logic you can use with libseccomp, and it is non-trivial to make the cut-off for returning ENOSYS smaller than the largest syscall in the profile because doing so would require rewriting every return (and thus every jump) in the profile which would be far more invasive and hard to verify correct than our current stub. So the core issue is that we need to change what profiles are being provided to runc, because the above approach -- while workable in theory -- in practice hasn't solved the issue, and there's nothing more we can do purely in runc to solve the problem. Instead, Docker (as well as containerd, podman, crio, and everyone else using runc) needs to change their profiles to use -ENOSYS as the default errno. This is supported by the latest version of the runtime-spec (thanks to @guiseppe for opencontainers/runtime-spec#1087 -- in retrospect it's quite obvious that the larger changes to the seccomp landscape I discussed were a bit too ambitious to block us solving this with the default errno setup). The main issue with this switch is that due to limitations in libseccomp, we cannot create explicit EPERM rules for every existing syscall rule. For instance, any rule which contains more than one rule allowing a syscall based on a given argument cannot be inverted because boolean OR inversion requires AND and libseccomp doesn't allow using AND on the same argument. SCMP_MASKED_NEQ doesn't exist yet so SCMP_MASKED_EQ has a non-trivial inversion that requires exponential instructions based on how many bits are set in the mask despite the fact that eBPF has an inverse operation that could be trivially used. This means that we need to figure out out whether anything will break if some syscalls return ENOSYS even though the syscall exists and can be used with a different set of arguments. |
Thanks for opening; some preparation work was done to at least have support for the
...
Does that apply to the "current" situation (where runc does the magic rewriting, and Docker uses |
The conditional limitations are a general problem, which means that swapping to a defaultErrnoRet of ENOSYS will result in some syscalls returning ENOSYS when they currently return EPERM (even if you try to write explicit rules to generate EPERM for syscalls you want to block like |
I think we need to rework the approach towards seccomp because the stub workaround I came up earlier this year in opencontainers/runc#2750 seems to not be doing a great job of solving the problem (maybe it's unclear how it works, or maybe the fact you need to be very careful with updating the syscall set in your filter is not being enforced by upstream projects or being missed somehow). In any case, it seems that the setup is just as fragile as it was before but with the added complication that it seems to only break sometimes (because it depends on whether you've included a new syscall without mentioning an old one).
Since Docker is one of the the primary users of runc (alongside containerd, crio, podman, etc) we should probably sit down and unify the rules for seccomp profiles. The only reason we didn't switch to ENOSYS by default is because it would cause existing Docker profiles to change their behaviour. However, with
defaultErrnoRet
each user of runc can on their own terms switch to a setup where they must explicitly set EPERM for the syscalls they wish to block rather than having this ill-advised setup where syscalls that are not mentioned give you EPERM. The idea was that the ENOSYS-for-syscalls-larger-than-the-largest-specified-in-the-profile behaviour implemented in opencontainers/runc#2750 was meant to be an interim solution until Docker could move to explicitly specifying which syscalls they want to EPERM and the rest should be ENOSYS. Unfortunately that didn't happen.Basically someone needs to sit down and go through all the syscalls in the Docker profile which are currently not explicitly blocked and add a new EPERM rule for them. There are a few syscalls where this will be complicated or impossible (such as those that have multiple argument matches -- which podman has -- or the clone flag mask case) but we should probably have a discussion whether returning ENOSYS in those cases rather than EPERM is going to be a deal breaker or not. The point is that hopefully we should move all OCI users to
defaultErrnoRet: 38 // ENOSYS
sooner rather than later if possible.Originally posted by @cyphar in #42681 (comment)
The text was updated successfully, but these errors were encountered: