
The Unix process API is unreliable and unsafe

source link: http://catern.com/process.html

1 It's easy for processes to leak

What do I mean by a "leaked process"?

I mean a process that is running without any other entity on the system which is both:

  1. Responsible for killing the process, and
  2. Knows the identity of the process and can kill it precisely with no risk of collateral damage

As soon as a process is orphaned (that is, its parent dies), it's leaked, because only the parent of a process is able to safely kill it. Merely knowing the pid is insufficient to safely kill a process, as we'll discuss in the section on process ids. So while an orphaned process might have some other entity on the system with property 1, that entity can never satisfy property 2.

Once some processes have been leaked, you have two options:

  • Hope that they will exit on their own at some point.
  • Look at the list of processes, fire off some signals, and hope you didn't just kill the wrong thing.

Neither is particularly satisfying, so we'd hope that it's hard to make an orphaned process. But, unfortunately, it's quite easy.

A process is orphaned when its parent process exits. If process A starts process B, which starts process C, and then process B exits, process C will be orphaned.

This is as simple as:

sh -c '{ sleep inf & } &'

'sh' is our process A; it forks off another copy of itself to perform the outer '&', which is our process B; then 'sleep inf' is our process C.

The parent process B is able to robustly track the lifetime of its child process C, through the mechanisms Linux provides for parent processes, and ensure that process C exits. If and when C is orphaned, that mechanism is no longer easily usable; it can be used by the init process, but that's not typically accessible to us.
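The same orphaning is easy to reproduce from Python. This is a sketch assuming Linux; the pipe and the short sleep are only scaffolding so that A can observe who C's parent becomes after B dies:

```python
import os
import time

# A (this process) forks B; B forks C and exits immediately, orphaning C.
r, w = os.pipe()
pid_b = os.fork()
if pid_b == 0:                      # in B
    if os.fork() == 0:              # in C
        os.close(r)
        time.sleep(0.5)             # give B ample time to exit
        # After B dies, C is reparented to init (or a subreaper).
        os.write(w, str(os.getppid()).encode())
        os._exit(0)
    os._exit(0)                     # B exits without waiting for C

os.close(w)
os.waitpid(pid_b, 0)                # A can reap B, but has no handle on C
new_parent = int(os.read(r, 32))
print(new_parent)                   # not B's pid: C's original parent is gone
```

By the time C checks, its parent is pid 1 (or some subreaper), so the only process that could have tracked C's lifetime robustly no longer exists.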

1.1 Flawed solutions

1.1.1 Make sure B always cleans up on exit and kills C

B should just always make sure to kill process C before it exits. That way no processes will be orphaned, so no processes will be leaked, and we'll be fine.

Well, what are the possible ways B might exit and need to clean up?

B might choose to exit, possibly by throwing an exception or panicking. In those cases, it's possible for B to kill process C immediately before exiting.

Or B might receive a signal. B might be signaled for conventional reasons, such as a user pressing Ctrl-C, in which case B can still clean up, as long as the programmer or runtime takes care to catch every kind of signal.

Or B might be signaled for some more unconventional reasons, such as a segmentation fault. It's still possible for B to clean up in this case, but it may be very tricky to do, and the programmer or runtime may need to take great care to make sure that the pid of C is still accessible even while handling a segfault.

Or B might receive SIGKILL. Unfortunately, this case prevents this solution from working. It's not possible for B to clean up when it receives SIGKILL, so C will be unavoidably leaked.

We might want to say, "never send SIGKILL". But that is impossible, both for a conventional reason and an ironic reason. The conventional reason is that B might have a bug, and hang, and SIGKILL might be the only way to kill it. The ironic reason is that the only way for B to clean up and exit in guaranteed finite time is for it to SIGKILL its own children, so that if they have bugs they will not just hang forever. So B would be SIGKILL'd by its own parent, implementing the same strategy.

So, in summary, it's not possible to guarantee that B cleans up and kills C when it exits, because it might be SIGKILL'd. Even in the case where B isn't SIGKILL'd, it's tricky for a complicated program to always make sure to kill off any child processes when it exits.
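The SIGKILL limitation is not a matter of programmer diligence: the kernel refuses to let any process install a handler for it. A quick illustration in Python:

```python
import signal

# B can catch ordinary signals and kill C before exiting, e.g. SIGTERM:
signal.signal(signal.SIGTERM, lambda signum, frame: None)  # placeholder handler

# But no process may catch, block, or ignore SIGKILL; the kernel rejects it:
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
    caught = False
except OSError:
    caught = True
print(caught)   # True: there is no way for B to clean up on SIGKILL
```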

1.1.2 Use PR_SET_PDEATHSIG to kill C when B exits

We can use the Linux-specific feature PR_SET_PDEATHSIG on process C, to ensure that process C will receive SIGKILL (or another signal) whenever process B exits for any reason, including if process B exits uncleanly due to a bug or SIGKILL.

The issue is that this only works one level down the tree. If C forks off its own process D, the death signal will kill off C but not D.

Extending it to work over an entire tree of processes requires that the entire tree be using PR_SET_PDEATHSIG (and using it correctly). If we can make that guarantee, this technique will work. But in practice, most large systems can't make that guarantee since they are made up of a large number of programs from many different developers. As one specific example, many applications run subcommands through Unix shells, which don't use PR_SET_PDEATHSIG.

Even in smaller systems where we control all involved programs, this technique isn't perfect, since even programs we control can always have bugs and fail to use PR_SET_PDEATHSIG. We'd prefer a guarantee that relies only on the program at the root of the tree, and doesn't require us to reason about and debug all the programs involved.
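For reference, here is roughly how a process arms the death signal on itself; B would run this in the child between fork() and exec(). This is a sketch that assumes glibc on Linux; the prctl constants come from <linux/prctl.h>:

```python
import ctypes
import signal

PR_SET_PDEATHSIG = 1   # constants from <linux/prctl.h>
PR_GET_PDEATHSIG = 2

libc = ctypes.CDLL("libc.so.6", use_errno=True)   # assumes glibc

# Ask the kernel to send this process SIGKILL when its parent exits.
# Note: the setting is cleared on fork, so a child D started later does
# not inherit it -- the "one level down" limitation described above.
ret = libc.prctl(PR_SET_PDEATHSIG, signal.SIGKILL, 0, 0, 0)
assert ret == 0

# Read the setting back to confirm it took effect.
got = ctypes.c_int()
libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(got), 0, 0, 0)
print(got.value == signal.SIGKILL)   # True

# Disarm again, since this demo set it on the current process.
libc.prctl(PR_SET_PDEATHSIG, 0, 0, 0, 0)
```

In a real program this would typically go in subprocess's preexec hook (or its C equivalent), so the child is armed before it execs.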

1.1.3 Always write down the pid of every process you start, or otherwise coordinate between A and B

B could make sure to always write down the pid of every process it starts, so that we can at least make an attempt to kill any orphaned processes, even if that attempt isn't robust. More generally, B could coordinate with A, and somehow tell A about every process B starts. Then A (which we might trust to be correctly implemented) can handle cleaning up the processes that B starts. This will fail if there's a bug in B, or if B is killed just after starting a process but before telling A, but perhaps it's good enough?

This has the same flaw as PR_SET_PDEATHSIG, in that it only allows for avoiding leaks at a single level. Like PR_SET_PDEATHSIG, all programs involved would need to use our mechanism. And that's infeasible in practice in any large system.
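A minimal sketch of the coordination idea, using a hypothetical protocol in which B simply prints each child's pid on stdout before continuing, and A reads it:

```python
import os
import signal
import subprocess
import sys

# B: a hypothetical intermediate program that reports every child it
# starts by writing the pid to stdout, then exits without cleaning up,
# deliberately orphaning the child.
b_source = """
import subprocess, sys
c = subprocess.Popen(["sleep", "60"])
print(c.pid, flush=True)   # tell A about C before doing anything else
sys.exit(0)                # B dies; C is orphaned
"""

# A: start B, learn C's pid from B, and clean up after B dies.
b = subprocess.Popen([sys.executable, "-c", b_source], stdout=subprocess.PIPE)
c_pid = int(b.stdout.readline())
b.wait()
# A kills the orphan -- but only by trusting a pid that might, in
# principle, have been reused by an unrelated process by now.
os.kill(c_pid, signal.SIGKILL)
```

Note the two failure windows the text describes: B can die between starting C and printing the pid, and the pid A eventually kills is not guaranteed to still name C.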

1.1.4 A should run B inside a container

If A runs B inside a Linux container technology, such as a Docker container, then no matter how many processes B starts, A will be able to terminate them all by just stopping the container, and we'll be fine.

Ignoring the other merits of containers, if we're trying to solve the problem of "it is too easy for processes to leak", containers have three main flaws.

  1. It's not easy to run a container. Python has a "subprocess.run" function in its standard library, for starting a subprocess. Python has no "container.run" function in its standard library, to start a subprocess inside a container, and in the current container landscape that seems unlikely to change.

    Shell scripts make starting processes trivial, but it's almost unthinkable that, say, bash, would integrate functionality for starting containers, so that every process is started in a container. Leaving aside the issues of which container technology to use, it would be quite complex to implement.

  2. Containers require root or running inside a user namespace. The root requirement obviously can't be satisfied by most users. Fortunately, it's possible to start a container without being root by using user namespaces. Unfortunately, user namespaces introduce a number of quirks, such as breaking gdb (by breaking ptrace), so they also can't be used by most users.
  3. It's pretty heavyweight to require literally every child process to run in a separate container. Robust usage of pid namespaces (the relevant part of Linux containers) requires that we start up an init process for each pid namespace, separate from the other processes running in the container. This init process will do nothing but increase the load on the system, and it will prevent us from directly monitoring the started processes.

So, running every child process in a separate container isn't a viable solution. We still have no way to easily prevent child processes from leaking.

1.1.5 Use process groups or controlling terminals

Process groups and controlling terminals are two features which can be used to terminate a group of processes. Such a group of processes is usually called a "job", since Unix shells use these features and use that terminology. When processes start children, they start out in the same job, and they can all be terminated at once. So if process A put process B in a job, process A could avoid process C leaking by terminating the job.

Unfortunately, neither of these job mechanisms is nestable. If a process puts itself or its children into a new process group or gives itself a new controlling terminal, it completely replaces the old process group or controlling terminal. So that process will no longer be terminated when its original job is terminated!

In other words: if process A puts process B in a job, process B then puts process C in a new job of its own, and process B neglects to terminate process C, then process C is no longer in the job that process A knows about, so process C will leak!

So, ironically, if a child process tries to use these features to prevent its own child processes from leaking, it can inadvertently cause them to leak. This is certainly unsuitable.
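For reference, this is how A can put B into its own process group and tear down the whole job. A sketch: calling os.setpgrp in the pre-exec hook is essentially what a shell does for each job it starts:

```python
import os
import signal
import subprocess

# A starts B in a fresh process group; B's children inherit the group.
# Here B is a shell that starts a background sleep (our process C).
b = subprocess.Popen(["sh", "-c", "sleep 60 & wait"],
                     preexec_fn=os.setpgrp)   # B's pgid == B's pid

# One killpg() terminates every process still in the job: B and C alike.
os.killpg(b.pid, signal.SIGKILL)
print(b.wait() == -signal.SIGKILL)   # True

# The flaw: had C itself called os.setpgrp(), it would have moved into a
# new group, and the killpg() above would never have reached it.
```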

1.1.6 Use Windows 8 nested job objects

Windows 8 added support for nested job objects. Child processes (and all their transitive children) can be associated with a job, and they will all be terminated when the owner of the job exits (or deliberately kills them). Child processes can create their own jobs and assign their own children to those jobs, without interfering with or being aware of their parent job.

Unfortunately, we're using Linux, not Windows. :)

