
Using Linux's memfd_secret syscall from the JVM with JEP-419

 1 year ago
source link: https://blog.arkey.fr/2022/05/16/linux_memfd_secret_with_jep-419/

What is a system call (syscall) ?

Before jumping to memfd_secret, let’s first understand how to make a system call. And even before that, let’s see what a system call is.

Those not interested in this part can jump to the memfd_secret section.

In order to do something useful a program has to interact with some resources: memory, disk, network, terminal, etc. On a computer, these resources are handled by a very complex and critical piece of software, the Operating System.

In order to use these resources, a program has to make system calls like read, wait, write, exit, etc. Even malloc, the standard native allocator, ultimately has to place a request to the OS to get memory, for instance via the mmap syscall.

Figure: malloc requesting memory via mmap
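To make the malloc/mmap relationship more concrete, here is a minimal C sketch that asks the kernel for memory directly, without going through malloc. The function names (os_alloc, os_free, mmap_round_trip) are made up for this illustration, and a real allocator is far more elaborate (glibc's malloc, for instance, also uses brk for small requests).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Ask the kernel for `len` bytes of private, zero-filled memory.
 * Returns NULL on failure. Roughly what an allocator does when it
 * needs fresh pages from the OS. (Hypothetical helper name.) */
void *os_alloc(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hand the memory back to the kernel. Returns 0 on success. */
int os_free(void *p, size_t len) {
    return munmap(p, len);
}

/* Map 4 KiB, write to it, read it back, unmap it. Returns 0 on success. */
int mmap_round_trip(void) {
    size_t len = 4096;
    char *mem = os_alloc(len);
    if (mem == NULL) return -1;
    assert(mem[0] == 0);              /* anonymous mappings start zero-filled */
    strcpy(mem, "hello");
    int ok = strcmp(mem, "hello") == 0;
    return (os_free(mem, len) == 0 && ok) ? 0 : -1;
}
```

Both mmap and munmap used here are themselves thin libc wrappers around the syscalls of the same name.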

As expected the JVM does plenty of syscalls too, e.g. when logging something on stdout or persisting a (unified) log file.

Essentially,

a system call is a way of requesting the kernel to do something for the program.

Why do system calls have to be in the kernel and not in user space, like a standard library? As mentioned earlier, system calls are a way to interact with, or involve, a resource such as devices, the file system, the network, processes, etc. These resources are managed by privileged software: the OS, or kernel.

When a system call happens, the program doesn’t simply invoke a method whose code resides at some address. A system call actually makes the CPU switch to kernel mode, because the kernel is privileged software.

On most modern processors there is a security model that limits the scope of what a program can do. In particular, on Intel-based CPUs, the model is known as processor protection rings (or hierarchical protection domains).

Figure: protection rings, from Ring 0 (kernel, kernel space, highest privileges) through Ring 1 and Ring 2 (device drivers) to Ring 3 (applications, user space, lowest privileges)

It seems that Rings 1 and 2 are rarely used because paging (the way the OS handles memory, see my blog post on [off-heap memory]) only has the concept of privileged and unprivileged, which minimizes the actual benefit of those rings, according to Evan Teran's answer on SO.

When a processor executes some code (in a thread), it knows the current mode. This way the processor is able to gate memory accesses, e.g. a Ring 3 user-land program cannot access memory from Ring 0, the kernel. This is yet another feature of the virtual memory abstraction. The processor can also restrict some instructions and registers to the software running in Ring 0.

Out of scope: there are even negative rings on some CPU architectures, for hypervisors or CPU system management, down to Ring -3.

These restrictions are enforced by the CPU, so in order to fulfill its purpose a user-land program needs to place a request to the kernel. This mechanism is called a syscall; it allows the transition between rings.

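Figure: Syscall ring transitions. A process thread executing in Ring 3 (user land) issues a syscall; a mode switch enters Ring 0, where the kernel executes the syscall while the thread is idle; a second mode switch returns control to the process thread.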

During mode switches a lot is happening: saving and restoring registers, putting the CPU in a specific mode (user vs kernel), etc. And of course the reverse happens once the request is handled, whether it succeeded or failed.

Privilege context switches are sufficiently costly that most libraries try to avoid them. For example, reading 8 KiB at a time instead of 256 bytes is a good idea, as it drastically reduces the number of syscalls and therefore of mode switches.
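The effect of the buffer size on the number of syscalls can be sketched in a few lines of C. The helper name count_reads and the choice of /dev/zero as an always-readable source are assumptions made for this illustration only.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read `total` bytes from /dev/zero using a buffer of `buf_size` bytes.
 * Returns how many read() syscalls were needed, or -1 on error.
 * (Hypothetical helper, for illustration.) */
long count_reads(size_t total, size_t buf_size) {
    assert(buf_size > 0);
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) return -1;
    char *buf = malloc(buf_size);
    if (buf == NULL) { close(fd); return -1; }

    long calls = 0;
    size_t done = 0;
    while (done < total) {
        size_t left = total - done;
        size_t want = left < buf_size ? left : buf_size;
        ssize_t n = read(fd, buf, want);  /* one user -> kernel -> user round trip */
        if (n <= 0) { calls = -1; break; }
        done += (size_t)n;
        calls++;
    }
    free(buf);
    close(fd);
    return calls;
}
```

For 64 KiB of data, count_reads(65536, 256) needs 256 round trips into the kernel, while count_reads(65536, 8192) needs only 8.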

What does the documentation say about syscalls ?

Now let’s get practical.

Looking at man 2 syscall, the manpage sheds some light on how to make the call, specifically in the Architecture calling conventions section. Those details are in assembly, e.g.

  • processor interrupt 0x80 for i386 processors (32 bits), then specific registers

  • syscall instruction for x86_64 processors (64 bits), then specific registers

The calling conventions of other architectures are also described, e.g. on ARM processors the system call is performed by a swi 0x0 instruction, on aarch64 by svc #0.

People who are not aware of what exactly a calling convention is should read at least this Wikipedia article on x86 calling conventions. In short, a calling convention defines how and where parameters should be placed in order to call the code: how parameters are passed (registers and/or stack), how values are returned, etc.

This manual page also points out an important difference with regular functions: while we look up system calls by their names (write, read, execve, exit, mmap, memfd_create, etc.), programs and the kernel actually know them by numbers.

Why numbers? The reason is that syscalls are like messages that are passed down, and these numbers are somewhat like enum ordinals indicating the type of message. These numbers are part of the syscall ABI (Application Binary Interface), and as such they are stable for a given CPU architecture, although the set is unbounded (new syscalls can be added).
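As a small illustration of "syscalls are known by numbers", the following C sketch bypasses the named libc wrappers and passes the number explicitly through the generic syscall() function from libc. The SYS_* constants come from <sys/syscall.h> and expand to the numbers of the architecture being compiled for; the two function names are invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_* constants: per-architecture syscall numbers */
#include <unistd.h>        /* syscall(), getpid() */

/* Ask for the process id by syscall number instead of calling getpid(). */
long getpid_by_number(void) {
    return syscall(SYS_getpid);
}

/* Write `len` bytes to `fd` by syscall number instead of calling write(). */
long write_by_number(int fd, const void *buf, size_t len) {
    assert(fd >= 0);
    return syscall(SYS_write, fd, buf, len);
}
```

getpid_by_number() returns the same value as getpid(), and write_by_number(1, "ok\n", 3) prints ok on stdout, exactly like write(1, "ok\n", 3) would.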

Outside of this scope: not all syscalls are made equal nowadays. Some syscalls, usually the most used ones, are exported in user space memory to avoid the cost of switching to kernel mode. In practice, the vDSO (virtual dynamic shared object) is like a library: it is loaded in memory so that it can be accessed from the program's memory (glibc knows about this memory region and will use it).

Running pmap -X {pid} shows the vDSO as an 8 KiB segment.

To read more about it, see the relevant manual page (man 7 vdso). In particular, this page lists the exported syscalls, e.g. __vdso_clock_gettime, which is called by clock_gettime, defined in the standard libc (man 3 clock_gettime).
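A program can locate its own vDSO without any external tool. The sketch below, assuming a glibc-like libc that provides getauxval, reads the AT_SYSINFO_EHDR entry of the auxiliary vector (documented in man 7 vdso as the base address of the vDSO) and checks that an ELF image really sits there. The function name is hypothetical.

```c
#include <assert.h>
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

/* Returns 1 if a vDSO is mapped and starts with the ELF magic bytes,
 * 0 if the kernel did not map a vDSO into this process,
 * -1 if something is mapped there but does not look like ELF. */
int vdso_looks_like_elf(void) {
    unsigned long base = getauxval(AT_SYSINFO_EHDR);
    if (base == 0) return 0;
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0 ? 1 : -1;
}
```

On a typical Linux system this returns 1, which shows that the vDSO is an ordinary shared object that the kernel placed in the process address space.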

The syscall numbers are different between architectures! On Linux one can look at their definition in the /include/asm-/unistd-.h files.

From the syscall manpage, the syscall calling convention on Intel CPUs is:

Example 1. 64-bit programs
Set the registers
  1. rax ← System Call number

  2. rdi ← First argument

  3. rsi ← Second argument

  4. rdx ← Third argument

Make the syscall
  • execute syscall processor instruction

The actual syscall numbers (for 64-bit programs) are usually defined in /usr/include/asm/unistd_64.h
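The register assignment listed above can be sketched from C with a single inline syscall instruction. This assumes GCC or Clang extended asm syntax; raw_write is a made-up name, and on anything other than x86_64 Linux the sketch simply falls back to the generic syscall() function from libc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* write(2) issued by hand. On x86_64 the registers are loaded as listed
 * above (rax = number, rdi/rsi/rdx = arguments) before `syscall` runs.
 * On x86_64 the result is the byte count or a negative errno value;
 * the fallback path returns -1 and sets errno instead. */
long raw_write(int fd, const void *buf, unsigned long len) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                 /* result comes back in rax      */
        : "a"((long)SYS_write),     /* rax <- syscall number         */
          "D"((long)fd),            /* rdi <- first argument         */
          "S"(buf),                 /* rsi <- second argument        */
          "d"(len)                  /* rdx <- third argument         */
        : "rcx", "r11", "memory");  /* the instruction clobbers rcx and r11 */
    return ret;
#else
    /* Other architectures: use the generic libc entry point. */
    return syscall(SYS_write, fd, buf, len);
#endif
}
```

Calling raw_write(1, "hi\n", 3) prints hi and returns 3; with an invalid file descriptor the result is negative.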

My first syscall

In order to quickly practice a syscall, let’s do a very simple hello world. The example will be in assembly; I promise this is the only source snippet in assembly, and after that I’ll be back with Java and Panama.

  • /usr/include/asm/unistd_64.h

Example 3. 64-bits (with syscall instruction)
Example 4. 32-bits (with an interrupt)
hello_syscall.asm (x86_64)
  1. This register holds the number of the selected syscall. Note that the number comes from /usr/include/asm/unistd_64.h.
  2. The syscall arguments are placed in the next registers.
  3. Make the syscall with interrupt 0x80.

Note the elf64 format for 64 bits.

When looking at this very simplistic code, something immediately stands out: from the application's point of view (user land), a syscall is just like an atomic pseudo machine instruction. I believe this example is more striking than the figure above on syscall ring transitions.

We saw what exactly a syscall is and how to make one using assembly. In general though, it’s rare to invoke a syscall directly, as the standard library exposes wrappers that handle everything for most syscalls.

Figure: syscall wrappers in the standard library. The program calls printf(); in libc, printf() issues syscall(SYS_write,…); the kernel handles SYS_write.

Because the memfd_secret syscall was added only recently, there is no wrapper function for it in the standard library, hence we’ll need to make the system call ourselves.
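For comparison with what comes next on the JVM side, here is what "making the system call ourselves" could look like in C, going through the generic syscall() function. This is only a sketch: memfd_secret_raw is an invented name, and the fallback number 447 is an assumption that holds for x86_64 and aarch64 when the installed headers predate the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447  /* assumption: x86_64 / aarch64 value, for older headers */
#endif

/* Calls memfd_secret(flags) by number, since libc offers no wrapper.
 * Returns a file descriptor (>= 0) on success, or a negative errno
 * value, e.g. -ENOSYS when the running kernel lacks the syscall. */
long memfd_secret_raw(unsigned int flags) {
    long fd = syscall(SYS_memfd_secret, (unsigned long)flags);
    if (fd >= 0) return fd;
    assert(errno > 0);          /* syscall() reports failures through errno */
    return -(long)errno;
}
```

A caller should be prepared for failure: memfd_secret_raw(0) may well return -ENOSYS on kernels where the feature is missing or disabled, and a returned descriptor has to be closed with close() when no longer needed.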

