

splice() and the ghost of set_fs()
source link: https://lwn.net/SubscriberLink/896267/c0e34849dd84010c/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Welcome to LWN.net
The following subscription-only content has been made available to you by an LWN subscriber. Thousands of subscribers depend on LWN for the best news from the Linux and free software communities. If you enjoy this article, please consider subscribing to LWN. Thank you for visiting LWN.net!
The normal rule of kernel development is that the creation of user-space regressions is not allowed; a patch that breaks a previously working application must be either fixed or reverted. There are exceptions, though, including a 5.10 patch that has been turning up regressions ever since. The story that emerges here shows what can happen when the goals of stability, avoiding security problems, and code cleanup run into conflict.
The set_fs() function was added to the kernel early in its history; it was not in the initial 0.01 release, but was added before the 0.10 release in late 1991. Normally, kernel code that is intended to access user-space memory will generate an error if it attempts to access kernel space instead; this restriction prevents, for example, attempts by an attacker to access kernel memory via system calls. A call to set_fs(KERNEL_DS) can be used to lift the restriction when the need arises; a common use case for set_fs() is to be able to perform file I/O from within the kernel. Calling set_fs(USER_DS) puts the restriction back.
The problem with set_fs() is that it turns out to be easy to forget the second set_fs() call to restore the protection of kernel space, leading directly to the "total compromise" scenario that kernel developers will normally take some pains to avoid. Numerous such bugs have been fixed over the years, but it had long been clear that the real solution was to just get rid of set_fs() entirely and adopt safer ways of accessing kernel memory when needed.
Developers (and Christoph Hellwig in particular) got more serious about this objective in 2020 and made a determined push to eliminate set_fs() entirely. Much of this work went into 5.10, though the final bits of the set_fs() infrastructure were only removed in 5.18. Back in 2020, though, one question that provoked some discussion was what should be done about splice().
The splice() system call will connect an open file descriptor to a pipe, then move data between the two for as long as the data stream lasts. This movement happens entirely within the kernel, potentially eliminating the need for large numbers of system calls; in some settings, it can provide a significant performance improvement. By its nature, splice() often has to move data to and from buffers that are in kernel space; to make that possible, it used set_fs().
Hellwig duly came up with a new implementation that would keep splice() working in the absence of set_fs(), but Linus Torvalds rejected it, saying that he didn't like the "complexity and messiness" of the implementation. But he also made it clear that he didn't feel the need to guarantee that splice() would keep working at all; he felt that making splice() work by default on most file types led to a number of security issues. Later in 2020, for example, he said:
I'd rather limit splice (and kernel_read too, for that matter) as much as possible. It was a mistake originally to allow it everywhere, and it's come back to bite us.So I'd rather have people notice these odd corner cases and get them fixed one by one than just say "anything goes".
So the patches that went into 5.10 ended up breaking splice() for any file type that did not have explicit support for the new way of doing things; the idea was that the important cases would be noticed and fixed over time. That has indeed happened; if one looks for patches committed as explicit fixes to the disabling of splice() support, one finds fixes for the AFS filesystem, the 9p filesystem, the orangefs filesystem, /proc/mountinfo, the TTY subsystem, kernfs, sendfile(), the nilfs2 filesystem, and the JFFS2 filesystem.
Most recently, Jens Axboe reported that splice() no longer worked on /dev/random or /dev/urandom; he included a patch to fix the problem as well. These patches were later reworked by random-number-generator maintainer Jason Donenfeld and were applied to the mainline during the 5.19 merge window.
Along the way, Donenfeld observed that the necessary changes resulted in a performance regression of about 3% when reading from /dev/urandom. That led him to ask whether the fix was something that was needed at all; after some discussion, Axboe gave him the lecture on regressions:
If you have an application that is written using eg splice from /dev/urandom, then it should indeed be safe to expect that it will indeed continue working. If we have one core tenet in the kernel it's that you should ALWAYS be able to upgrade your kernel and not have any breakage in terms of userspace ABI. Obviously that can happen sometimes, but I think this one is exactly the poster child of breakage that should NOT happen. We took away a feature that someone depended on.
That is the sort of breakage that did indeed happen but, in this case, a change was made knowing that this kind of problem would result. Hellwig said in response to Axboe's patch set that "compared to my initial fears the fallout actually isn't that bad", but a perusal of the above list of fixes might lead one to a different conclusion.
The removal of set_fs() is, in many ways, a model for what the kernel development process can do. A fundamental piece of low-level structure that had been deeply wired into the kernel since the beginning was replaced with a much safer alternative without breaking the project's pace of a stable release every nine or ten weeks. The steady stream of regressions resulting from this change, though, is not what the project sets out to do — and it seems certain that this particular gift has not yet stopped giving.
The decision to take this path was driven by a fear of security problems, based on the past history of the splice() system call. If those fears are still justified (and they might well be; consider, for example, that splice() was a part of the "Dirty Pipe" vulnerability reported earlier this year), then refusing to make all existing splice() implementations just work without set_fs() may have prevented far worse regressions than the ones we have seen. Having to fix a filesystem is annoying; having to endure yet another security drill for a branded vulnerability with a silly name is rather more so.
There is no way of knowing whether that is how things would have gone in this case. But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline. It does not take too many of those to create breakage to the project's reputation that is hard to splice back together.
(Log in to post comments)
Recommend
-
34
fastcat - A Faster `cat` Implementation Using Splice Lots of people asked me to write another piece a...
-
37
JavaScript’s splice is a function that is used to insert, remove and replace items in an array. You use it by choosing a position in the array to work with, choosing how many items to delete, and the new...
-
19
Tuning MySQL and the Ghost of Index Merge Intersection November 17, 2020 1981 words ...
-
8
Linux 中的零拷贝技术 splice 2015-05-07 15:26:00 http://hi.baidu.com/renguihuashi/item/ef71f8e28d74f5f22b09a415 ...
-
6
AI: Ghost workers demand to be seen and heardBy Jane WakefieldTechnology reporter Published8 hours agoimage copyrightGetty ImagesArtificial int...
-
6
JS使用Splice()函数操作数组 2019-11-16 9 在js的使用过程中,有一次需要对数组进行各种操作,一时间迫使我想要去使用链表。后来通过查阅资料,总结了下面的一些方法,主...
-
4
Getting Started Writing and managing content in Ghost, an advanced guide A full overview of all the features built into the Ghost editor, including powerful workf...
-
7
Difference between SLICE & SPLICE in JavaScript Hello Devs, In this article, we will discuss what's the difference between the two most important methods of Array in JavaScript (i.e Slice and Splice) The...
-
8
I recently came across a twitter post where the user challenged others if they knew the difference between the js methods .slice() and .splice() without googling them. A lot of people mixed them up. So I've created this simplified cheat sh...
-
4
Rethinking splice() LWN.net needs you!Without subscribers, LWN would simply not exist. Please consider signing...
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK