Seccomp and deep argument inspection
source link: https://lwn.net/Articles/822256/
Kees Cook has been doing some thinking about plans for new seccomp features to work on soon. There were four separate areas that he was interested in, which he detailed in a lengthy mid-May message on the linux-kernel mailing list. One of those features, deep argument inspection, has been covered here before, but it would seem that we are getting closer to a resolution on how that all will work.
Deep arguments
Seccomp filtering (or "seccomp mode 2") allows a process to filter which system calls can be made by it or its threads—it can be used to "sandbox" a program such that it cannot make calls that it shouldn't. Those filters use the "classic" BPF (cBPF) language to specify which system calls and argument values to allow or disallow. The seccomp() system call is used to enable filtering mode or to load a cBPF filtering program. Those programs only have access to the values of the arguments passed to the system call; if those arguments are pointers, they cannot be dereferenced by seccomp, which means that accepting or rejecting the system call cannot depend on, for example, values in structures that are passed to system calls via pointers—or even string values.
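To make the limitation concrete, here is a minimal sketch of a cBPF seccomp filter of the kind described above. It assumes a little-endian x86-64 or arm64 build with the usual kernel headers installed; the names sketch_filter, sketch_filter_len(), and install_sketch_filter() are invented for illustration, and the policy (fail any openat() that is not read-only with EACCES) is arbitrary. The filter can test the openat() flags because they arrive by value in a register; for openat2(), the corresponding slot would only hold the address of a structure, which the filter cannot follow.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only knows about x86-64 and arm64"
#endif

/* Illustrative policy: openat() must be read-only; everything else is allowed. */
struct sock_filter sketch_filter[] = {
    /* [0] Load the architecture and kill the process on a mismatch. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] Load the system call number; anything but openat() jumps to ALLOW. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 4),
    /* [5] Low 32 bits of args[2] (the flags); valid on little-endian only. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
    BPF_STMT(BPF_ALU | BPF_AND | BPF_K, O_ACCMODE),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, O_RDONLY, 1, 0),
    /* [8] Not read-only: fail the call with EACCES. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EACCES),
    /* [9] Everything else is allowed. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Number of instructions in the filter above. */
unsigned short sketch_filter_len(void)
{
    return sizeof(sketch_filter) / sizeof(sketch_filter[0]);
}

/* Load the filter into the calling thread; returns 0 on success, -1 on error. */
int install_sketch_filter(void)
{
    struct sock_fprog prog = {
        .len = sketch_filter_len(),
        .filter = sketch_filter,
    };

    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

Everything the filter reads comes from struct seccomp_data: the system call number, the architecture, and the six raw argument values. Nothing in that structure gives access to memory that an argument points to.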
The reason that seccomp cannot dereference the pointers is to avoid the time-of-check-to-time-of-use (TOCTTOU) race condition, where user space can change the value of what is being pointed to between the time that the kernel checks it and the time that the value gets used. But certain system calls, especially newer ones like clone3() and openat2(), have some important arguments passed in structures via pointers. These new system calls are designed with an eye toward easily adding new arguments and flags by redefining the structure that gets passed; in his email, Cook called these "extensible argument" (or EA) system calls.
It does not make sense for seccomp to provide a mechanism to inspect the pointer arguments of every system call, he said: "[...] the grudging consensus was reached that having seccomp do this for ALL syscalls was likely going to be extremely disruptive for very little gain". But for the EA system calls (or perhaps only a subset of those), seccomp could copy the structure pointed to and make it available to the BPF program via its struct seccomp_data. That would mean that seccomp would need to change to perform that copy, which would require a copy_from_user() call, and affected system calls would need to be seccomp-aware so that they can use the cached copy if seccomp creates one.
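The copy-once idea can be modeled in user space. The following is only a sketch under stated assumptions: struct how_sketch mirrors the three 64-bit fields that the openat2() man page lists for struct open_how, while struct arg_cache_sketch and all of the function names are hypothetical and do not correspond to any existing or proposed kernel interface. The memcpy() stands in for the copy_from_user() call.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Same three u64 fields as struct open_how in the openat2() man page. */
struct how_sketch {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Hypothetical per-call cache that seccomp would fill and the syscall reuse. */
struct arg_cache_sketch {
    bool valid;
    struct how_sketch how;
};

/* Stand-in for the single copy_from_user(): user memory is fetched once. */
void cache_fill_sketch(struct arg_cache_sketch *cache,
                       const struct how_sketch *user_ptr)
{
    memcpy(&cache->how, user_ptr, sizeof(cache->how));
    cache->valid = true;
}

/* The filter's decision looks only at the cached copy (arbitrary policy). */
bool filter_allows_sketch(const struct arg_cache_sketch *cache)
{
    return cache->valid && (cache->how.flags & O_ACCMODE) == O_RDONLY;
}

/* A seccomp-aware system call prefers the cached copy over user memory. */
uint64_t syscall_flags_sketch(const struct arg_cache_sketch *cache,
                              const struct how_sketch *user_ptr)
{
    if (cache->valid)
        return cache->how.flags;
    return user_ptr->flags; /* no filter ran, so there is no copy to reuse */
}
```

If another thread rewrites the user-space structure after the filter has run, syscall_flags_sketch() still returns the value that was checked. That is the property that closes the TOCTTOU window, and it only holds if the system call really does consume the cached copy.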
There are some other wrinkles to the problem, of course. The size of the structure passed to the EA system calls may grow over time in order to add new features. If one side (user space or the kernel) uses a larger structure than the other expects, the "extra" space must either contain zeroes or be filled with them; zero is specifically defined to mean that those new features are unused (the openat2() man page has some good information on how this is meant to work). Since user space and the kernel do not have to be in lockstep, that will allow newer user-space programs to call into an older kernel and vice versa. But that also means that seccomp needs to be prepared to handle argument sizes larger (or smaller) than "expected" and ensure that the zero-filling is done correctly.
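The size rules can be written down in a few lines. This is a user-space sketch of the semantics that the openat2() man page describes, not kernel code; copy_struct_sketch() is a made-up name, and plain memory access stands in for the user-space copy primitives that the kernel would use.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied structure of usize bytes into a buffer of ksize
 * bytes, where ksize is the size this "kernel" knows about.
 * Returns 0 on success, or -E2BIG if the caller set bytes that this
 * side does not understand.
 */
int copy_struct_sketch(void *dst, size_t ksize, const void *src, size_t usize)
{
    const unsigned char *s = src;
    size_t common = usize < ksize ? usize : ksize;

    if (usize > ksize) {
        /* Newer caller: the unknown tail must be all zeroes. */
        for (size_t i = ksize; i < usize; i++)
            if (s[i] != 0)
                return -E2BIG;
    } else if (usize < ksize) {
        /* Older caller: fields it does not know about read as zero. */
        memset((unsigned char *)dst + usize, 0, ksize - usize);
    }

    memcpy(dst, src, common);
    return 0;
}
```

A caller with a 4-byte structure and a "kernel" that knows 8 bytes ends up with the last four bytes zeroed; a caller with 12 bytes succeeds only if bytes 8 through 11 are zero. A seccomp copy would need to apply the same rules so that the filter and the system call see the same structure.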
It gets even more complicated because different threads might have different ideas of what the EA structure size is, Cook said.
He had suggestions of a few different possibilities to solve the problem, but seemed to prefer the zero-fill option.
Others commenting also seemed to prefer that option, though Jann Horn noted that there is no need to zero-fill beyond the size that the kernel knows about.
Implementing that new operation would require changes to cBPF, however, which is not going to happen, according to BPF maintainer Alexei Starovoitov: "cbpf is frozen." An alternative would be for seccomp to switch to extended BPF (eBPF) for its filters. Using eBPF would allow the filters to perform that operation themselves without adding any new opcodes, but switching to eBPF is something that Cook hopes to avoid. As he explained in a message back in 2018, eBPF is something of a fast-moving target, which worries him from a security standpoint: "[...] I want absolutely zero surprises when it comes to seccomp". Beyond that, eBPF would add a lot more code for the seccomp filter to interact with in potentially dangerous ways.
Aleksa Sarai, who is the developer behind the EA scheme, generally agreed with Cook's plan for handling those structures, but he raised another point. The structures may contain pointers—those cannot be dereferenced by seccomp either, of course. Should something be done so that the filters can access that data as well? When these "nested pointers" came up in another discussion, Linus Torvalds made it abundantly clear that he thinks that is not a problem that the kernel should deal with at all.
Less-deep arguments
A few days after his original post, Cook posted an item on the ksummit-discuss mailing list to suggest that there be a session at the (virtual) Kernel Summit in August to discuss these seccomp issues. Torvalds acknowledged that this kind of system call exists, but did not think there was much to discuss with regard to seccomp:
[...] And if you have some actual and imminent real security issue, you mention _that_ and explain _that_, and accept that maybe you need to do that expensive emulation (because the kernel people just don't care about your private hang-ups) or you need to explain why it's a real issue and why the kernel should help with your odd special case.
Cook seemed somewhat relieved in his response.
Christian Brauner, who has also been doing a lot of development in these areas, agreed that the filters could likely live without the ability to chase pointers any further than the top level. Sarai would like there to be at least a path forward if requirements of that sort do arise, but seemed willing to keep things simple for now, perhaps forever.
io_uring
In his message on linux-kernel, Horn raised an interesting point for seccomp developers: handling io_uring. Since its introduction in early 2019, io_uring has rapidly added features that effectively allow routing around the normal system-call entry path, while still performing the actions that a seccomp filter might be trying to prevent.
Obviously, the filters could simply disallow the io_uring system calls entirely, but that may be problematic down the road. Sarai agreed that it is something that may need some attention. Cook said that he needed to look more closely at io_uring: "I thought this was strictly for I/O ... like it's named". Trying to filter based on the arguments to the io_uring system calls will be a difficult problem to solve, since the actual commands and their arguments are buried inside a ring buffer that lives in an mmap() region shared between the kernel and user space. Chasing pointers in that environment seems likely to require eBPF—or even stronger medicine.
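A toy model may help show why argument-based filtering gets no traction here. Everything in this sketch is invented for illustration: the structures, the opcode names, and the SKETCH_NR_ENTER number do not match the real io_uring ABI, and only the leading arguments of the "enter" call (ring descriptor, then number of entries to submit) loosely follow io_uring_enter().

```c
#include <stdint.h>

/* Invented opcodes and call number; not the real io_uring values. */
enum { SKETCH_OP_NOP, SKETCH_OP_READ, SKETCH_OP_OPENAT };
#define SKETCH_NR_ENTER 1000
#define SKETCH_RING_SLOTS 8

/* Toy submission entry: the operation lives here, in shared memory. */
struct sqe_sketch {
    uint8_t opcode;
    int32_t fd;
    uint64_t addr;
};

struct ring_sketch {
    struct sqe_sketch sqes[SKETCH_RING_SLOTS];
    unsigned int tail;
};

/* What a seccomp filter gets to look at: a number and six raw values. */
struct seen_by_filter_sketch {
    int nr;
    uint64_t args[6];
};

/* Queue an operation by writing to the ring; no system call is made. */
int ring_queue_sketch(struct ring_sketch *ring, uint8_t opcode,
                      int32_t fd, uint64_t addr)
{
    if (ring->tail >= SKETCH_RING_SLOTS)
        return -1;
    ring->sqes[ring->tail++] =
        (struct sqe_sketch){ .opcode = opcode, .fd = fd, .addr = addr };
    return 0;
}

/* The one system call: its arguments say how many entries, not what they are. */
struct seen_by_filter_sketch ring_enter_sketch(int ring_fd,
                                               unsigned int to_submit)
{
    struct seen_by_filter_sketch seen = {
        .nr = SKETCH_NR_ENTER,
        .args = { (uint64_t)ring_fd, to_submit, 0, 0, 0, 0 },
    };
    return seen;
}
```

In this model, a ring holding one read and a ring holding one open produce identical values in seen_by_filter_sketch. Telling them apart would mean walking the shared ring itself, which is a far larger job than inspecting one structure behind one pointer.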
It would seem that a reasonable path for inspecting the first level of structure "arguments" to some system calls has been identified. clone3() and openat2() are obvious candidates, since their flag arguments, which will help seccomp filters determine if the call is "reasonable" under the rules of the sandbox, live in such structures. On the other hand, complex, multiplexing system calls like ioctl() and bpf() were specifically mentioned as calls for which it would not make sense to try to add the pointer-chasing feature. Though Cook did not put any timetable on his plans, one might think we will see this feature sometime before the end of the year.