62

An illustration of why it's hard to port outside code into the Linux ker...

 5 years ago
source link: https://www.tuicool.com/articles/hit/vqqmaqa
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Sometimes, people show up with useful kernel-side things that were originally written for other systems and try to put them into Linux as (GPLv2) donations. Often these have been filesystems (SGI and XFS, IBM and JFS), but there have been other attempts at code drops. Most recently, Oracle made DTrace available under GPLv2 and integrated it into their kernel . I've said before that this is not an easy thing to do and can take years to actually get the code in to the Linux kernel (eg, XFS). Part of that is that the Linux kernel developers are picky about the code that they accept and require it to follow Linux kernel conventions (because they'll be supporting it for years), but part of that is because joining up code written for another environment with the Linux kernel is a hard, long process, with lots of subtle issues that can be overlooked despite very good people with the best of intentions.

As it happens, I have an illustration of this. Before I start, I want to say explicitly that I think ZFS on Linux is a very solid and well done project. In fact, that it's so solid and well done makes this case all the more useful as an illustration, because it shows how a bunch of very good people, working very hard (and for a long time) and doing a very good job, can still have something slip by in the interface between the Linux kernel and outside code. So here is the story.

Recently, someone on the ZFS on Linux mailing list reported a kernel panic related to snapshots (and also and the ZoL issue report ). The first background detail required is that under some circumstances, accessing snapshots can get ELOOP from the kernel's general filesystem code . The specific crash happened because when the kernel converted aNFS filehandle that was (apparently) for a file in a ZFS snapshot into a kernel dentry, the dentry pointer it got back from ZFS had the value of 0x28 (decimal 40). Although very low faulting addresses usually have causes related to NULL pointers , 40 happens to be the errno value for ELOOP , which is suspicious.

Internally, the Linux kernel has a somewhat odd way of handling errors. Probably in order to avoid confusing them with valid values that might be returned normally, errors are usually returned from functions as negative errno values; for example, if a pathname lookup fails because there's no such name in the directory, the relevant functions will return -ENOENT (and there's a whole infrastructure to smuggle these negative errno values in and out of what are normally pointers). Much of the rest of the world has kernel functions that return positive errnos to signal errors, and in particular the original Solaris/Illumos ZFS code uses positive errnos to return errors, so a ZFS function that wants to signal 'no such name' will return ENOENT .

The ZFS on Linux code tries to stay as close to the upstream Illumos ZFS code as possible to make it easier to port fixes and changes back and forth. As part of this, it has not changed to using Linux style negative errnos internally; instead, it converts from ZFS positive errnos (in its own code) to Linux kernel negative errnos when it returns results to the kernel, such as when it is converting a NFS filehandle into a ZFS dnode and then a kernel dentry, which may fail with errors like ESTALE . This conversion is done by negating the internal ZFS errno , turning a positive ZFS errno into a negative kernel errno (which is then smuggled inside a pointer).

All of this is fine except that there turns out to be a point in converting NFS filehandles where the ZFS on Linux code calls a kernel function to do path lookups and returns its error result unaltered . Since this is a kernel function, it returns negative errnos, which are passed up through the ZFS on Linux call stack and then carefully negated by ZoL before it returns them to the kernel. This careful negation turns what was a negative kernel errno into a positive number that the kernel thinks is not an error, but a dentry pointer. Things do not go well from here.

All of the code involved looks innocent and good on a normal inspection; you're calling functions, you're checking for errors, you're returning errors if there are any, everything looks normal and standard. You need a bunch of contextual knowledge to know that this one function call is special and returns a dangerously different result from everything else (if it encounters an error at all, which it usually doesn't), and it needs special handling. The commit that added this code is over a year old and was reviewed by one of the most experienced ZoL developers, and the code has passed tests and been used in production (where it worked because errors from this kernel function are very rare in this context).

This error is not in ZFS code and it is not in kernel code; it's at the seam between the two, where one world must be carefully converted to the other. Here, one little spot was missed and joined imperfectly, and the result was a kernel panic a year later. And that's part of why porting outside code into the Linux kernel is hard and takes a long time, one way or another.

(It also makes a good illustration of why the Linux kernel developers generally insist that outside code be converted over to use kernel idioms before they'll accept it into the kernel tree.)


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK