RFE: Add support for maximum supported kernel version #457

drakenclimber · 2025-02-12T22:40:22Z

This patchset proposes to solve issue #11 - RFE: support "maximum kernel version".

Significant changes in this patchset

- Updates syscalls.csv with the kernel versions in which each syscall was added for x86, x86_64, and x32. (See the discussion heading below for why I only did these three architectures.)
- Adds two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVERMAX, for managing the maximum supported kernel version and what to do with syscalls that are newer than that version.
- If this feature is enabled by the user, then libseccomp will add a rule for every single known syscall up to the maximum supported kernel version. These rules will perform the DEFAULT action. (See the discussion below for more info.)
- Adds supporting documentation and a test.

Finally, I am hoping to discuss this issue at Linux Security Summit 2025 in Denver, Colorado USA on June 26th and 27th. I would love to get community feedback about the problem, the proposed solution, etc.

hrw · 2025-02-13T07:26:10Z

According to my system calls table, there are holes in the syscall numbering on several architectures (I looked at arm64, arm, armoabi, x86-64, x32 and i386). New-style architectures share syscall numbering, and new entries are added at the end of the table.

Your syscalls.csv showed me that I missed the "parisc64" architecture. I will have to add support for it. (Edit: DONE)

When it comes to LTS/stable kernels, I think one of their rules is "no new stuff", which in this case means no new system calls. Distribution kernels may add them, and many have done so in the past, so the "is the syscall present" check may need to be more complex than "is the kernel version high enough".

As you already have syscall.tbl support for the x86 variants, it can be expanded to other architectures as a start. It will not cover all system calls, but you will get data for many.

I used these scripts for a quick check with my syscalls-table project:
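A minimal sketch of why a bare version comparison may fall short, assuming versions are plain "major.minor" strings. The function names, the `backported` parameter, and the idea of a per-vendor backport list are hypothetical illustrations, not part of libseccomp or syscalls-table.

```python
def parse_kver(text):
    """Turn "6.12" into (6, 12) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def syscall_expected(name, added_in, running, backported=frozenset()):
    """Guess whether a syscall exists on the running kernel.

    name       -- syscall name, e.g. "openat2"
    added_in   -- upstream version that introduced it, e.g. "5.6"
    running    -- upstream base version of the running kernel, e.g. "5.4"
    backported -- hypothetical set of syscall names that a distribution
                  kernel added on top of its upstream base version
    """
    if parse_kver(running) >= parse_kver(added_in):
        return True
    # The version alone says "too old"; a vendor backport can still provide it.
    return name in backported


# openat2() appeared upstream in 5.6.
print(syscall_expected("openat2", "5.6", "5.4"))                          # False
print(syscall_expected("openat2", "5.6", "5.4", backported={"openat2"}))  # True
```

Comparing tuples rather than strings matters here: as strings, "6.9" sorts after "6.12".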

#!/bin/bash

KERNELDIR=~/devel/sources/linux/

# Walk every mainline release and snapshot the generated tables for each one.
for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
do
    echo "$kernel_version"
    # Only check out the tag if the cd succeeded.
    (cd "$KERNELDIR" && git checkout "v${kernel_version}")
    bash scripts/update-tables.sh "$KERNELDIR"
    pip install .
    python examples/tables-to-yaml.py "$kernel_version"
    cp -r data/tables "data/tables-${kernel_version}"
    cp syscalls.yml "syscalls-${kernel_version}.yml"
done

The examples/tables-to-yaml.py script:

#!/usr/bin/python3

import sys
import system_calls
import yaml

kernel_version = ""

if len(sys.argv) > 1:
    kernel_version = sys.argv[1]

syscalls = system_calls.syscalls()

with open("syscalls.yml", "r") as yf:
    yml = yaml.safe_load(yf)

for syscall_name in yml["syscalls"]:

    # Record the first kernel version in which the syscall shows up at all.
    if not yml["syscalls"][syscall_name]["from"]:
        yml["syscalls"][syscall_name]["from"] = kernel_version

    for arch in syscalls.archs():
        try:
            number = syscalls.get(syscall_name, arch)
        except system_calls.NotSupportedSystemCall:
            # The syscall does not exist on this architecture in this kernel.
            number = ""
        yml["syscalls"][syscall_name]["archs"][arch]["number"] = number
        # Compare against "" so that syscall number 0 still counts as present.
        if number != "" and not yml["syscalls"][syscall_name]["archs"][arch]["from"]:
            yml["syscalls"][syscall_name]["archs"][arch]["from"] = kernel_version


with open("syscalls.yml", "w") as yf:
    yaml.dump(yml, yf)

I have not checked the results for correctness yet.

coveralls · 2025-02-13T18:10:23Z

coverage: 90.771% (+0.5%) from 90.252%
when pulling ac42438 on drakenclimber:issues/11
into 7db46d7 on seccomp:main.

Promote the scmp_kver enumeration to the public header file, seccomp.h.in. Add enumerations for all kernel versions from 4.0 to 6.12.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

A placeholder, KV_UNDEF, was added for when each syscall was added to the kernel for each architecture, but the C code has defined this enum value as SCMP_KV_UNDEF. Find and replace all instances of KV_UNDEF with SCMP_KV_UNDEF.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

drakenclimber · 2025-02-18T20:15:37Z

Moved the discussion list to the v3 comment

Here's a side-by-side diff between v1 of this patchset's syscalls.csv and v2's syscalls.csv

hrw · 2025-02-19T11:04:00Z

Isn't it the case that 'kernel wide' new system calls are added at the end and 'new on this architecture' ones are added where they were supposed to be?

I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.

hrw · 2025-02-19T11:08:56Z

Please note that "afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg, gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg, putpmsg, security, stty, tuxcall, ulimit, vserver" are officially unimplemented system calls. My syscalls-table has them on its ignore list, which may be why you see some differences.

The problem with x32 is that you need the x32 headers installed on the system to get those syscalls handled properly. Otherwise you get the x86-64 ones. My GitHub action that updates the syscalls-table data has an extra step to make sure that they are present.

hrw · 2025-02-19T11:14:09Z

Posted on mastodon about it: https://society.oftrolls.com/@hrw/114030254556485861 as some other people may find it useful too.

drakenclimber · 2025-02-19T14:16:33Z

> Isn't it the case that 'kernel wide' new system calls are added at the end and 'new on this architecture' ones are added where they were supposed to be?
>
> I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.

Yes, that was my recollection as well, but I wanted data to back it up. I expect this model to continue going forward.

For libseccomp I think that means that we can't rely on a "less than" rule for unknown syscalls. We'll either need an explicit rule for each syscall or a series of ranges.
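To illustrate the "series of ranges" alternative, here is a minimal sketch that collapses a set of known syscall numbers into inclusive ranges and shows where a single "less than" check would go wrong. The helper names and the sample numbers are made up for illustration; this is not how libseccomp generates its filter.

```python
def to_ranges(numbers):
    """Collapse syscall numbers into sorted, inclusive (first, last) ranges."""
    ranges = []
    for nr in sorted(set(numbers)):
        if ranges and nr == ranges[-1][1] + 1:
            # Extend the current range by one.
            ranges[-1] = (ranges[-1][0], nr)
        else:
            # A hole in the numbering starts a new range.
            ranges.append((nr, nr))
    return ranges


def in_ranges(nr, ranges):
    """Return True if nr falls inside any of the ranges."""
    return any(first <= nr <= last for first, last in ranges)


known = [0, 1, 2, 3, 5, 6, 10]   # illustrative numbers with holes at 4 and 7-9
ranges = to_ranges(known)
print(ranges)                    # [(0, 3), (5, 6), (10, 10)]
print(4 <= max(known))           # True: a bare "less than" rule accepts the hole
print(in_ranges(4, ranges))      # False: the range check rejects it
```

The number of ranges grows with the number of holes, so a table with many gaps gains little over one rule per syscall.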

Thanks for the verification, @hrw

hrw · 2025-02-19T14:38:31Z

https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html lets you disable and reorder columns, which can be handy when you want to compare numbers between architectures.

I recommend sorting by the arm64 or riscv64 column to see how new system calls are present on each architecture.

Note that everything from 'avr32' to the right does not exist in the current Linux kernel; those columns are kept for historical purposes.

drakenclimber · 2025-02-19T19:09:21Z

@hrw

Changes for v3:

Fixed the x32 syscall numbers. Thanks to @hrw for the guidance here

Moved the discussion list to the v4 comment. Here's a side-by-side diff of before and after this patchset (v3)

hrw · 2025-02-19T19:10:59Z

> There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR

Like I wrote above: afs_syscall() and a bunch of others are listed in system call tables in kernel but are not implemented. My code ignores them.

drakenclimber · 2025-02-19T19:12:24Z

> Like I wrote above: afs_syscall() and a bunch of others are listed in system call tables in kernel but are not implemented. My code ignores them.

Ack. That's on my todo list :)

cyphar · 2025-02-28T10:58:29Z

> It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?

A lot of work went into unifying the syscall tables for newer syscalls a few years ago. For all future syscalls (barring a few esoteric architectures) the syscall numbers will match between architectures and so any holes should be expected to be kept (except maybe for arch-specific syscalls, I don't know if there's a proper policy around that).

For completeness though, it might be necessary to have a more complicated rule. In runc we just do the hacky solution, which is okay in general but is not theoretically correct.

Add a tool to populate the syscalls.csv table. It parses the data output from the syscalls-table [1] tool. The following script was used to build the directories and files with the relevant syscall data:

    #!/bin/bash

    KERNELDIR=~/devel/sources/linux/

    for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
    do
        echo $kernel_version
        (cd $KERNELDIR; git checkout v${kernel_version})
        bash scripts/update-tables.sh $KERNELDIR
        pip install .
        python examples/tables-to-yaml.py $kernel_version
        cp -r data/tables data/tables-${kernel_version}
        cp syscalls.yml syscalls-${kernel_version}.yml
    done

(Note that the above script takes quite a bit of time to run :)

[1] https://github.com/hrw/syscalls-table

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Using the script from the previous commit, populate the syscalls.csv table for all architectures.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add a tool, scmp_get_max_syscall_num.py, that can calculate the largest current syscall number. As of this commit, the largest syscall number is 547 via pwritev2() in the x32 architecture.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
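As a rough idea of what such a calculation involves, the sketch below scans a small CSV for the largest syscall number. It assumes a simplified layout (one name column, one number column per architecture, "PNR" for syscalls that an architecture lacks); the real syscalls.csv has more columns, and this is not the code of scmp_get_max_syscall_num.py.

```python
import csv
import io

# Simplified, illustrative excerpt; the column layout is an assumption.
SAMPLE = """\
syscall,x86,x86_64,x32
read,3,0,0
socketcall,102,PNR,PNR
pwritev2,379,328,547
"""


def max_syscall(csv_text):
    """Return (number, syscall, arch) for the largest number in the table."""
    best = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop("syscall")
        for arch, value in row.items():
            if value == "PNR":
                # The syscall is not present on this architecture.
                continue
            if best is None or int(value) > best[0]:
                best = (int(value), name, arch)
    return best


print(max_syscall(SAMPLE))  # (547, 'pwritev2', 'x32')
```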

Add two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVER. When SCMP_FLTATR_CTL_KVERMAX is set, then libseccomp will handle syscalls as follows:

* syscalls with explicit actions set by the user will behave as before
* syscalls that are not explicitly called out by the user's filter but are valid for the specified kernel version will return the default filter action (SCMP_FLTATR_ACT_DEFAULT).
* syscalls that are newer than the specified kernel version will return the unknown filter action (SCMP_FLTATR_ACT_ENOSYS)

Note that setting the SCMP_FLTATR_CTL_KVERMAX can result in large seccomp BPF filters. It's recommended to also enable the binary tree optimization (SCMP_FLTATR_CTL_OPTIMIZE = 2) to speed up filter traversal in the kernel.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add support for an application to specify the maximum kernel version it currently supports. Any syscalls that have been added to a kernel version newer than this specified version will return the unknown action. The unknown action defaults to returning ENOSYS, but it can be overridden via the filter attribute SCMP_FLTATR_ACT_ENOSYS.

When the maximum supported kernel version is enabled, libseccomp will create a filter as follows:

* Users explicitly declare rules for syscalls. No changes here from previous behavior
* The default action provided via seccomp_init() will still be used for all syscalls that existed as of the user-specified supported kernel
* Any syscalls that did not exist at the time of the user-specified supported kernel will return the unknown action. By default libseccomp sets this to return ENOSYS, but it can be overridden via the filter attribute SCMP_FLTATR_ACT_ENOSYS.

Below is a rough pseudo-code outline of a typical usage of this feature:

    seccomp_init()
    seccomp_add_rules()
    (optional but recommended) seccomp_attr_set( binary tree )
    seccomp_attr_set( max supported kernel version, e.g. SCMP_KV_6_5 )
    (optional) seccomp_attr_set( default unknown action )
    seccomp_load()
    seccomp_release()

Fixes: seccomp#11

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
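The three cases in this commit message can be modeled in a few lines of Python. This is only a sketch of the decision the generated filter is described as making, not libseccomp code: the function names, the action strings, and the small `added_in` table are illustrative, and treating a syscall that is missing from the table as "unknown" is an assumption.

```python
def parse_kver(text):
    """Turn "6.12" into (6, 12) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def resolve_action(name, explicit_rules, added_in, kver_max,
                   default_action, unknown_action="ERRNO(ENOSYS)"):
    """Pick the action for one syscall under a maximum kernel version."""
    # 1. Rules the user added explicitly behave as before.
    if name in explicit_rules:
        return explicit_rules[name]
    # 2. Syscalls that already existed in kver_max get the default action.
    introduced = added_in.get(name)
    if introduced is not None and parse_kver(introduced) <= parse_kver(kver_max):
        return default_action
    # 3. Anything newer (or, by assumption, not in the table) is "unknown".
    return unknown_action


added_in = {"openat2": "5.6", "cachestat": "6.5", "mseal": "6.10"}
rules = {"openat2": "ALLOW"}

for name in ("openat2", "cachestat", "mseal"):
    print(name, resolve_action(name, rules, added_in, "6.5", "KILL"))
# openat2 ALLOW
# cachestat KILL
# mseal ERRNO(ENOSYS)
```

With a plain ENOSYS default action and no maximum version, cases 2 and 3 collapse into one; the split only matters when the two should behave differently.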

Add a test, 63-sim-kernel_version.[c|py], to test the kernel version logic.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add documentation for SCMP_FLTATR_ACT_UNKNOWN and SCMP_FLTATR_CTL_KVER.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

drakenclimber · 2025-02-28T21:02:31Z

Changes for v4:

Add handling for special syscalls like afs_syscall() in scmp_populate_syscalls_csv.py
Fix a bug in scmp_populate_syscalls_csv.py where it wasn't handling x32 read() properly
Regenerate syscalls.csv with this new info
All automated tests now pass :)

I think this is ready for more in-depth review

Discussion

~~Should we support every architecture from the start?~~
- This patchset only adds kernel versions for x86, x86_64 (e2b42b6), and x32. They have had a consistent syscall.tbl since 2015 (kernel version 4.0), so they were an easy initial candidate to prove out the logic. I would prefer to support all architectures from the start, but I'm not certain how easy/hard it will be to flesh out the remainder of syscalls.csv.
~~libseccomp has been around since kernel version 3.7.10 or so. Do we need to go that far back with our kernel version table?~~
- ~~This patchset only goes back to 2015 (linux kernel version 4.0)~~
- Patch ~~55bf2ea~~ ~~b424f57~~ 6f34216 now lists kernel versions all the way back to kernel v3.0
One thing that has kept me up at night with this patchset - did I get the correct kernel versions in which a syscall was added?
- I wrote a simple Python script to populate the x86-ish syscall kernel versions, and I'm reasonably confident the numbers are right, but "reasonably confident" is insufficient when security is concerned. @hrw has written a tool to determine syscall kernel versions, and it could be used to populate our table (or perhaps verify my numbers)
- Patch ~~005280d~~ ~~9b285ef~~ 22854ce uses the syscalls-table tool to populate syscalls.csv. libseccomp's kernel versions (prior to this patch) ~~align very, very closely to the output from the syscalls-table tool with the exception of x32~~ now match to the best of my knowledge.
  - Here's a side-by-side diff of before and after this patchset (v4)
  - ~~There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR~~
  - ~~As mentioned above, we need to figure out what's up with x32~~
  - ~~x32 syscall numbers now largely match our previous numbers~~
  - x32 syscall numbers now match our previous numbers
Can we simplify the logic and shrink the filter? I don't think so
- @pcmoore has wondered if we could simplify the logic to only return -ENOSYS for syscalls greater than the maximum supported number. (Again, this patchset explicitly creates a rule for every known syscall rather than a single if syscall_num > max_num rule.) Note that most (all?) architectures have several holes in their syscall table. It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?
- Running this script as follows ./tools/scmp_populate_syscalls_csv.py -d ~/git/other/syscalls-table/data -v shows that syscalls have been added in the middle 112 times since kernel v3.0. arm, s390, x86_64, parisc, x32, and more have all historically done it. Unfortunately, I don't think we can safely rely on new syscalls being added to the end of the list :(
- In an earlier comment on this PR (#457), @cyphar shared that future syscalls should be added at the end, but unfortunately that doesn't solve older kernels. I'm leaning toward adding an explicit rule for each known syscall, as this is guaranteed to work on older kernels and will work on newer kernels regardless of what the kernel community does or doesn't do. Thoughts?
As written, SCMP_FLTATR_CTL_KVERMAX must be set at the end of creating the libseccomp context. Any seccomp_arch_add() after setting the maximum kernel version will result in -EINVAL.
- Aside - libseccomp doesn't allow overwriting of existing rules, and (regardless of this patchset) silently ignores the "new" rule and doesn't add it to the filter. Thus as currently implemented, we must populate the known rules logic at the very end of the filter construction.
- Do we consider changing the existing behavior of silently ignoring new rules, and instead overwrite the existing rules? That would simplify this patchset
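For readers who want to reproduce the "added in the middle" count discussed in this list, the check could look roughly like the sketch below. It assumes one name-to-number table per kernel release for a single architecture, in release order. The function name, version labels, and syscall names are hypothetical, and this is not the logic used by scmp_populate_syscalls_csv.py.

```python
def mid_table_additions(tables):
    """Find syscalls that appeared below the previous release's highest number.

    tables -- non-empty list of (version, {syscall_name: number}) tuples for
              one architecture, ordered from oldest to newest release
    """
    found = []
    previous = tables[0][1]
    for version, current in tables[1:]:
        previous_max = max(previous.values())
        for name, nr in sorted(current.items(), key=lambda item: item[1]):
            # New in this release, yet numbered below the old end of the table.
            if name not in previous and nr < previous_max:
                found.append((version, name, nr))
        previous = current
    return found


# Hypothetical data: "qux" fills a hole, "quux" is appended at the end.
tables = [
    ("X", {"foo": 0, "bar": 1, "baz": 3}),
    ("X+1", {"foo": 0, "bar": 1, "qux": 2, "baz": 3, "quux": 4}),
]
print(mid_table_additions(tables))  # [('X+1', 'qux', 2)]
```

A "syscall_num > max_num" rule built for release X would treat "qux" as a known syscall, which is the case the explicit per-syscall rules are meant to avoid.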

kees · 2025-03-12T15:38:57Z

What is the benefit of this over having an ENOSYS default action?

drakenclimber · 2025-03-17T22:31:28Z

> What is the benefit of this over having an ENOSYS default action?

Good question. Some users have requested different behavior for an invalid syscall vs. an unsupported syscall.

But if an application is content without having such a distinction, then an ENOSYS default should work quite well for those users.

drakenclimber added the enhancement label Feb 12, 2025

drakenclimber added this to the v2.7.0 milestone Feb 12, 2025

drakenclimber self-assigned this Feb 12, 2025

drakenclimber requested a review from pcmoore February 12, 2025 22:40

drakenclimber added 2 commits February 14, 2025 16:17

all: Promote kernel version enum to the public header file

d79cbad

drakenclimber force-pushed the issues/11 branch from 78a0a6e to 9db8052 Compare February 18, 2025 20:00

drakenclimber force-pushed the issues/11 branch from 9db8052 to 425defc Compare February 19, 2025 19:05

drakenclimber added 7 commits February 28, 2025 20:39

syscalls: Populate the kernel versions for all arches

6f34216

tests: Add test for kernel version attribute

6add6b0

doc: Add documentation for max kernel version attributes

ac42438

drakenclimber force-pushed the issues/11 branch from 425defc to ac42438 Compare February 28, 2025 20:48

kolyshkin mentioned this pull request May 1, 2025

Move functionality from github.com/docker/docker/profiles/seccomp moby/sys#189

Open
