RFE: Add support for maximum supported kernel version by drakenclimber · Pull Request #457 · seccomp/libseccomp · GitHub

RFE: Add support for maximum supported kernel version #457


Open
wants to merge 9 commits into
base: main
from

Conversation

drakenclimber
Member
@drakenclimber drakenclimber commented Feb 12, 2025

This patchset proposes to solve issue #11 - RFE: support "maximum kernel version".

Significant changes in this patchset

  • Updates syscalls.csv with the kernel versions that syscalls were added for x86, x86_64, and x32. (See the discussion heading below for why I only did these three architectures.)
  • Adds two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVERMAX, for managing the maximum supported kernel version and what to do with syscalls that are newer than that version
  • If this feature is enabled by the user, then libseccomp will add a rule for every single known syscall up to the maximum supported kernel version. These rules will perform the DEFAULT action. (See the discussion below for more info.)
  • Adds supporting documentation and a test
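The three-way decision described in the list above can be modelled in a few lines. This is only a sketch of the intended behaviour, not libseccomp's implementation: the `ADDED_IN` table, the syscall names, and the version tuples are hypothetical placeholders rather than values taken from syscalls.csv.

```python
from enum import Enum


class Verdict(Enum):
    USER_RULE = "action from the user's explicit rule"
    DEFAULT = "default filter action"
    ENOSYS = "unknown-syscall action"


# Hypothetical table: syscall name -> (major, minor) kernel version in which
# it was added. Names and versions are placeholders for illustration only.
ADDED_IN = {
    "old_call": (3, 0),
    "mid_call": (5, 1),
    "new_call": (6, 8),
}


def classify(name, user_rules, kver_max):
    """Return which action applies to a syscall under a max kernel version."""
    if name in user_rules:
        # Explicit user rules behave exactly as before.
        return Verdict.USER_RULE
    added = ADDED_IN.get(name)
    if added is not None and added <= kver_max:
        # Known syscall that existed at or before the supported kernel.
        return Verdict.DEFAULT
    # Newer than the supported kernel, or not known at all.
    return Verdict.ENOSYS
```

With a maximum version of `(6, 5)`, `mid_call` falls through to the default action while `new_call` gets the unknown-syscall action, which is the distinction the new attributes are meant to provide.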

Fixes: #11
CC: @kolyshkin @cyphar

Finally, I am hoping to discuss this issue at Linux Security Summit 2025 in Denver, Colorado USA on June 26th and 27th. I would love to get community feedback about the problem, the proposed solution, etc.

@drakenclimber drakenclimber added this to the v2.7.0 milestone Feb 12, 2025
@drakenclimber drakenclimber self-assigned this Feb 12, 2025
@hrw
Contributor
hrw commented Feb 13, 2025

According to my system calls table, there are holes in the syscall numbering on several architectures (I looked at arm64, arm, armoabi, x86-64, x32 and i386). New-style architectures share syscall numbering, and new entries are added at the end of the table.

Your syscalls.csv showed me that I had missed the "parisc64" architecture. I will have to add support for it. (Edit: DONE)

When it comes to LTS/stable kernels, I think one of their rules is "no new stuff", which in this case means no new system calls. Distribution kernels may add them, and many have done so in the past, so an "is the syscall present" check may need to be more complex than "is the kernel version high enough".

Since you already support syscall.tbl for the x86 variants, that support could be extended to other architectures as a start. It will not cover every system call, but you will get data for many.

I used these scripts for a quick check with my syscalls-table project:

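#!/bin/bash

KERNELDIR=~/devel/sources/linux/

# Walk every mainline release from v3.0 through v6.13
for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
do
    echo "$kernel_version"
    # Check out the release tag; skip the checkout if the cd fails
    (cd "$KERNELDIR" && git checkout "v${kernel_version}")
    bash scripts/update-tables.sh "$KERNELDIR"
    pip install .
    python examples/tables-to-yaml.py "$kernel_version"
    # Keep a per-version copy of the generated tables and YAML
    cp -r data/tables "data/tables-${kernel_version}"
    cp syscalls.yml "syscalls-${kernel_version}.yml"
done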

The examples/tables-to-yaml.py script:

#!/usr/bin/python3

import sys
import system_calls
import yaml

kernel_version = ""

if len(sys.argv) > 1:
    kernel_version = sys.argv[1]

syscalls = system_calls.syscalls()

with open("syscalls.yml", "r") as yf:
    yml = yaml.safe_load(yf)

for syscall_name in yml["syscalls"]:
    entry = yml["syscalls"][syscall_name]

    # Record the first kernel version processed for this syscall
    if not entry["from"]:
        entry["from"] = kernel_version

    for arch in syscalls.archs():
        try:
            number = syscalls.get(syscall_name, arch)
        except system_calls.NotSupportedSystemCall:
            # Not available on this architecture in this kernel
            number = ""
        entry["archs"][arch]["number"] = number
        # Compare against "" so that syscall number 0 still counts as present
        if number != "" and not entry["archs"][arch]["from"]:
            entry["archs"][arch]["from"] = kernel_version


with open("syscalls.yml", "w") as yf:
    yaml.dump(yml, yf)

I have not checked the results for correctness yet.
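For readers without the syscalls-table checkout, the per-syscall layout that the script reads and writes can be inferred from its dictionary accesses. The sketch below applies the same update logic to a hand-built entry, and compares against the empty string so that a syscall number of 0 still counts as present. The function name `record_first_seen`, the architecture names, and the numbers are hypothetical and chosen only for illustration.

```python
def record_first_seen(entry, numbers, kernel_version):
    """Update one syscall entry with the numbers seen in one kernel version.

    entry   -- {"from": str, "archs": {arch: {"number": int or "", "from": str}}}
    numbers -- arch -> syscall number, or "" when the arch lacks the syscall
    """
    if not entry["from"]:
        entry["from"] = kernel_version
    for arch, number in numbers.items():
        slot = entry["archs"][arch]
        slot["number"] = number
        # Compare against "" rather than truthiness: number 0 is valid.
        if number != "" and not slot["from"]:
            slot["from"] = kernel_version
    return entry


# Hypothetical entry for one syscall on two made-up architectures.
entry = {
    "from": "",
    "archs": {
        "arch_a": {"number": "", "from": ""},
        "arch_b": {"number": "", "from": ""},
    },
}

# Process two kernel versions in order, oldest first.
record_first_seen(entry, {"arch_a": 0, "arch_b": ""}, "5.0")
record_first_seen(entry, {"arch_a": 0, "arch_b": 17}, "5.2")
```

After both calls, `arch_a` keeps its first-seen version from the earlier kernel, while `arch_b` is first recorded in the later one.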

@coveralls
coveralls commented Feb 13, 2025

Coverage Status

coverage: 90.771% (+0.5%) from 90.252%
when pulling ac42438 on drakenclimber:issues/11
into 7db46d7 on seccomp:main.

Promote the scmp_kver enumeration to the public header file,
seccomp.h.in.  Add enumerations for all kernel versions from 4.0 to 6.12

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

A placeholder, KV_UNDEF, was added for when each syscall was added to
the kernel for each architecture, but the C code has defined this enum
value as SCMP_KV_UNDEF.  Find and replace all instances of KV_UNDEF with
SCMP_KV_UNDEF.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
@drakenclimber
Member Author
drakenclimber commented Feb 18, 2025

Moved the discussion list to the v3 comment

Here's a side-by-side diff between v1 of this patchset's syscalls.csv and v2's syscalls.csv

@hrw
Contributor
hrw commented Feb 19, 2025

Isn't it the case that new 'kernel wide' system calls are added at the end, while 'new on this architecture' ones are added where they were supposed to be?

I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.
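One way to check this pattern against real data is to compare consecutive per-version tables and flag any new syscall whose number is below the previous version's maximum. The sketch below assumes a simple in-memory layout (a list of `(version, {name: number})` pairs for a single architecture); the helper name `added_in_middle` and the sample data are hypothetical.

```python
def added_in_middle(tables):
    """Find syscalls that first appear below the previous table's maximum.

    tables -- list of (version, {name: number}) ordered oldest to newest,
              all for the same architecture.
    Returns a list of (version, name, number) tuples.
    """
    findings = []
    prev = {}
    for version, table in tables:
        if prev:
            prev_max = max(prev.values())
            # Visit new entries in numeric order for stable output.
            for name, number in sorted(table.items(), key=lambda kv: kv[1]):
                if name not in prev and number < prev_max:
                    findings.append((version, name, number))
        prev = table
    return findings


# Hypothetical history: "c" arrives late and fills the hole at number 2.
history = [
    ("X", {"a": 0, "b": 1, "d": 3}),
    ("X+1", {"a": 0, "b": 1, "d": 3, "e": 4}),
    ("X+2", {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}),
]
middle = added_in_middle(history)
```

Here `e` is appended at the end and is not flagged, while `c` lands in an existing hole and is reported for version `X+2`.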

@hrw
Contributor
hrw commented Feb 19, 2025

Please note that "afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg, gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg, putpmsg, security, stty, tuxcall, ulimit, vserver" are officially unimplemented system calls. My syscalls-table has them on its ignore list, which may be why you see some differences.

The problem with x32 is that you need the x32 headers installed on the system for its syscalls to be handled properly; otherwise you get the x86-64 ones. My GitHub Action that updates the syscalls-table data has an extra step to make sure they are present.

@hrw
Contributor
hrw commented Feb 19, 2025

I posted about it on Mastodon, as other people may find it useful too: https://society.oftrolls.com/@hrw/114030254556485861

@drakenclimber
Member Author

Isn't it the case that new 'kernel wide' system calls are added at the end, while 'new on this architecture' ones are added where they were supposed to be?

I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel wide' system calls were added in the meantime, it looked like some were added in the middle of the table.

Yes, that was my recollection as well, but I wanted data to back it up. I expect this model to continue going forward.

For libseccomp I think that means that we can't rely on a "less than" rule for unknown syscalls. We'll either need an explicit rule for each syscall or a series of ranges.
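To make the trade-off concrete, here is a minimal sketch of the "series of ranges" alternative, next to the single "less than" comparison it would replace. The syscall numbers are made up, and the helper names `to_ranges` and `in_ranges` are hypothetical; nothing here reflects how libseccomp generates BPF.

```python
def to_ranges(numbers):
    """Collapse syscall numbers into sorted, inclusive (start, end) ranges."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            # Extend the current range by one.
            ranges[-1] = (ranges[-1][0], n)
        else:
            # Start a new range after a hole.
            ranges.append((n, n))
    return ranges


def in_ranges(n, ranges):
    """True if n falls inside any of the inclusive ranges."""
    return any(start <= n <= end for start, end in ranges)


# Hypothetical set of syscall numbers known at the supported kernel version,
# with holes at 4 and 7 through 9.
known = [0, 1, 2, 3, 5, 6, 10]
ranges = to_ranges(known)

# A single "less than or equal to the maximum" rule treats holes as known.
naive_accepts_hole = 4 <= max(known)
ranges_accept_hole = in_ranges(4, ranges)
```

The naive comparison accepts number 4 even though it was never assigned, so a syscall later added into that hole would slip through. The range check rejects it, at the cost of one comparison pair per contiguous run.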

Thanks for the verification, @hrw

@hrw
Contributor
hrw commented Feb 19, 2025

https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html lets you disable and reorder columns, which can be handy when you want to compare numbers between architectures.

I recommend sorting by the arm64 or riscv64 column to see how new system calls are present on each architecture.

Note that everything from 'avr32' to the right does not exist in the current Linux kernel; those columns are kept for historical purposes.

@drakenclimber
Member Author
drakenclimber commented Feb 19, 2025

Changes for v3:

  • Fixed the x32 syscall numbers. Thanks to @hrw for the guidance here

Moved the discussion list to the v4 comment. Here's a side-by-side diff of before and after this patchset (v3)

@hrw
Contributor
hrw commented Feb 19, 2025

There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR

As I wrote above, afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

@drakenclimber
Member Author

As I wrote above, afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

Ack. That's on my todo list :)

@cyphar
cyphar commented Feb 28, 2025

It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?

A lot of work went into unifying the syscall tables for newer syscalls a few years ago. For all future syscalls (barring a few esoteric architectures) the syscall numbers will match between architectures and so any holes should be expected to be kept (except maybe for arch-specific syscalls, I don't know if there's a proper policy around that).

For completeness though, it might be necessary to have a more complicated rule. In runc we just do the hacky solution, which is okay in general but is not theoretically correct.

Add a tool to populate the syscalls.csv table.  It parses the data
output from the syscalls-table [1] tool.  The following script was used
to build the directories and files with the relevant syscall data:

	#!/bin/bash

	KERNELDIR=~/devel/sources/linux/

	for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
	do
			echo $kernel_version
			(cd $KERNELDIR; git checkout v${kernel_version})
			bash scripts/update-tables.sh $KERNELDIR
			pip install .
			python examples/tables-to-yaml.py $kernel_version
			cp -r data/tables data/tables-${kernel_version}
			cp syscalls.yml syscalls-${kernel_version}.yml
	done

(Note that the above script takes quite a bit of time to run :)

[1] https://github.com/hrw/syscalls-table

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Using the script from the previous commit, populate the syscalls.csv
table for all architectures.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add a tool, scmp_get_max_syscall_num.py, that can calculate the largest
current syscall number.

As of this commit, the largest syscall number is 547 via pwritev2() in
the x32 architecture.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and
SCMP_FLTATR_CTL_KVERMAX.  When SCMP_FLTATR_CTL_KVERMAX is set, then
libseccomp will handle syscalls as follows:

* syscalls with explicit actions set by the user will behave as
  before
* syscalls that are not explicitly called out by the user's filter
  but are valid for the specified kernel version will return the
  default filter action (SCMP_FLTATR_ACT_DEFAULT).
* syscalls that are newer than the specified kernel version will
  return the unknown filter action (SCMP_FLTATR_ACT_ENOSYS)

Note that setting the SCMP_FLTATR_CTL_KVERMAX can result in large
seccomp BPF filters.  It's recommended to also enable the binary
tree optimization (SCMP_FLTATR_CTL_OPTIMIZE = 2) to speed up
filter traversal in the kernel.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add support for an application to specify the maximum kernel version it
currently supports.  Any syscalls that have been added to a kernel
version newer than this specified version will return the unknown
action.  The unknown action defaults to returning ENOSYS, but it can be
overridden via the filter attribute SCMP_FLTATR_ACT_ENOSYS.

When the maximum supported kernel version is enabled, libseccomp will
create a filter as follows:
	* Users explicitly declare rules for syscalls.  No changes here
	  from previous behavior
	* The default action provided via seccomp_init() will still be
	  used for all syscalls that existed as of the user-specified
	  supported kernel
	* Any syscalls that did not exist at the time of the
	  user-specified supported kernel will return the unknown
	  action.  By default libseccomp sets this to return ENOSYS, but
	  it can be overridden via the filter attribute
	  SCMP_FLTATR_ACT_ENOSYS.

Below is a rough pseudo-code outline of a typical usage of this feature:
	seccomp_init()
	seccomp_add_rules()

	(optional but recommended) seccomp_attr_set( binary tree )
	seccomp_attr_set( max supported kernel version, e.g. SCMP_KV_6_5 )
	(optional) seccomp_attr_set( default unknown action )

	seccomp_load()
	seccomp_release()

Fixes: seccomp#11
Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add a test, 63-sim-kernel_version.[c|py], to test the kernel version
logic.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

Add documentation for SCMP_FLTATR_ACT_UNKNOWN and SCMP_FLTATR_CTL_KVER.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>

@drakenclimber
Member Author
drakenclimber commented Feb 28, 2025

Changes for v4:

  • Add handling for special syscalls like afs_syscall() in scmp_populate_syscalls_csv.py
  • Fix a bug in scmp_populate_syscalls_csv.py where it wasn't handling x32 read() properly
  • Regenerate syscalls.csv with this new info
  • All automated tests now pass :)

I think this is ready for more in-depth review

Discussion

  • Should we support every architecture from the start?
    • This patchset only adds kernel versions for x86, x86_64, and x32. They have had a consistent syscall.tbl since 2015 (kernel version 4.0), so they were an easy initial candidate to prove out the logic. I would prefer to support all architectures from the start, but I'm not certain how easy or hard it will be to flesh out the remainder of syscalls.csv.
  • libseccomp has been around since kernel version 3.7.10 or so. Do we need to go that far back with our kernel version table?
    • This patchset only goes back to 2015 (linux kernel version 4.0)
    • Patch 55bf2ea b424f57 6f34216 now lists kernel versions all the way back to kernel v3.0
  • One thing that has kept me up at night with this patchset - did I get the correct kernel versions in which a syscall was added?
    • I wrote a simple Python script to populate the x86-ish syscall kernel versions, and I'm reasonably confident the numbers are right, but "reasonably confident" is insufficient when security is concerned. @hrw has written a tool to determine syscall kernel versions, and it could be used to populate our table (or perhaps verify my numbers)
    • Patch 005280d 9b285ef 22854ce uses the syscalls-table tool to populate syscalls.csv. To the best of my knowledge, libseccomp's kernel versions (prior to this patch) now match the output from the syscalls-table tool.
      • Here's a side-by-side diff of before and after this patchset (v4)
      • There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR
      • As mentioned above, we need to figure out what's up with x32
      • x32 syscall numbers now largely match our previous numbers
      • x32 syscall numbers now match our previous numbers
  • Can we simplify the logic and shrink the filter? I don't think so
    • @pcmoore has wondered if we could simplify the logic to only return -ENOSYS for syscalls greater than the maximum supported number. (Again, this patchset explicitly creates a rule for every known syscall rather than a single if syscall_num > max_num rule.) Note that most (all?) architectures have several holes in their syscall table. It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?
    • Running this script as follows ./tools/scmp_populate_syscalls_csv.py -d ~/git/other/syscalls-table/data -v shows that syscalls have been added in the middle 112 times since kernel v3.0. arm, s390, x86_64, parisc, x32, and more have all historically done it. Unfortunately, I don't think we can safely rely on new syscalls being added to the end of the list :(
    • In comment RFE: Add support for maximum supported kernel version #457 (comment), @cyphar shared that future syscalls should be added at the end, but unfortunately that doesn't solve older kernels. I'm leaning toward adding an explicit rule for each known syscall as this is guaranteed to work on older kernels and will work on newer kernels regardless of what the kernel community does or doesn't do. Thoughts?
  • As written, SCMP_FLTATR_CTL_KVERMAX must be set at the end of creating the libseccomp context. Any seccomp_arch_add() after setting the maximum kernel version will result in -EINVAL.
    • Aside - libseccomp doesn't allow overwriting of existing rules, and (regardless of this patchset) silently ignores the "new" rule and doesn't add it to the filter. Thus as currently implemented, we must populate the known rules logic at the very end of the filter construction.
    • Do we consider changing the existing behavior of silently ignoring new rules, and instead overwrite the existing rules? That would simplify this patchset

@kees
Contributor
kees commented Mar 12, 2025

What is the benefit of this over having an ENOSYS default action?

@drakenclimber
Member Author

What is the benefit of this over having an ENOSYS default action?

Good question. Some users have requested different behavior for an invalid syscall vs. an unsupported syscall.

But if an application is content without having such a distinction, then an ENOSYS default should work quite well for those users.


Successfully merging this pull request may close these issues.

RFE: support "maximum kernel version"
5 participants