Using Linux's memfd_secret syscall from the JVM with JEP-419

What is a system call (syscall)?

Before jumping to memfd_secret, let's first understand how to make a system call. And even before that, let's see what a system call is.

Readers not interested in this part can jump to the memfd_secret section.

In order to do something useful, a program has to interact with resources: memory, disk, network, terminal, etc. On a computer, these resources are managed by a very complex and critical piece of software, the operating system.

To use these resources, a program has to make system calls like read, wait, write, exit, etc. Even the standard malloc, the native allocator, actually has to ask the OS for memory, for instance via an mmap syscall.

Figure: malloc obtaining memory from the kernel through mmap
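To make the malloc and mmap relationship concrete, here is a minimal C sketch that asks the kernel for anonymous memory directly, which is roughly what an allocator does behind the scenes for large requests. The helper names os_alloc and os_free are hypothetical, chosen for illustration only; the sketch assumes a Linux system.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel for `len` bytes of zero-filled, private memory.
 * Returns NULL on failure. Hypothetical helper, not a libc API. */
void *os_alloc(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give the memory back to the kernel; returns 0 on success. */
int os_free(void *p, size_t len) {
    return munmap(p, len);
}
```

A real allocator adds a lot on top of this (free lists, size classes, thread caches) precisely so that it does not have to go to the kernel for every allocation.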

As expected the JVM does plenty of syscalls too, e.g. when logging something on stdout or persisting a (unified) log file.

Essentially,

a system call is a way of requesting the kernel to do something for the program.

Why do system calls have to live in the kernel and not in user space, like a standard library? As mentioned earlier, system calls are a way to interact with, or involve, a resource such as devices, the file system, the network, processes, etc. These resources are managed by privileged software: the OS, or kernel.

When a system call happens, the program doesn't simply invoke a function whose code resides at some address. A system call actually makes the CPU switch to kernel mode, because the kernel is privileged software.

Most modern processors implement a security model that limits the scope of what a program can do. In particular, on Intel-based CPUs, the model is known as processor protection rings (or hierarchical protection domains).

Figure: protection rings, from Ring 3 (user space, applications, lowest privileges) through Ring 2 and Ring 1 (device drivers) to Ring 0 (kernel space, kernel, highest privileges)

It seems that Ring 1 and Ring 2 are rarely used because paging (the way the OS handles memory, see my blog post on [off-heap memory]) only has the concept of privileged and unprivileged, which minimizes the actual benefit of those rings, according to Evan Teran's answer on SO.

When a processor executes some code (in a thread), it knows the current mode, so it is able to gate memory accesses, e.g. a Ring 3 user-land program cannot access memory from Ring 0, the kernel. This is yet another feature of the virtual memory abstraction. The processor can also restrict some instructions and registers to the software running in Ring 0.

Out of scope: there are even negative rings on some CPU architectures for the hypervisor or CPU system management, down to Ring -3.

These restrictions are enforced by the CPU, so in order to fulfil its purpose a user-land program needs to place a request to the kernel. This mechanism is called a syscall, and it allows transitioning between rings.

Figure: Syscall ring transitions. A process thread executing in Ring 3 (user land) issues a syscall; a mode switch hands control to the kernel in Ring 0, which executes the syscall while the thread is idle; a second mode switch returns to the thread.

During mode switches a lot happens: saving and restoring registers, putting the CPU in a specific mode (user vs kernel), etc. And of course the reverse is done once the request is handled, either with success or with a failure.

Privilege mode switches are sufficiently costly that most libraries try to avoid them. For example, reading 8 KiB at a time instead of 256 bytes is a good idea, as it drastically reduces the number of syscalls and therefore of mode switches.
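The effect of the buffer size on the number of syscalls can be counted with a small C sketch. The helper count_read_calls is hypothetical and only counts calls; it does not measure time. It assumes the source always returns as many bytes as requested, which holds for /dev/zero on Linux.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read `total` bytes from `path` in chunks of `chunk` bytes and return how
 * many read(2) calls were needed, or -1 on error. Hypothetical helper. */
long count_read_calls(const char *path, size_t total, size_t chunk) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    char *buf = malloc(chunk);
    if (buf == NULL) { close(fd); return -1; }

    long calls = 0;
    size_t done = 0;
    while (done < total) {
        size_t want = total - done < chunk ? total - done : chunk;
        ssize_t n = read(fd, buf, want);   /* one syscall per iteration */
        if (n <= 0) break;                 /* error or end of file */
        done += (size_t)n;
        calls++;
    }
    free(buf);
    close(fd);
    return calls;
}
```

Reading 8 KiB from /dev/zero takes 32 calls with a 256-byte buffer and a single call with an 8 KiB buffer, so 31 round trips into the kernel are saved.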

What does the documentation say about syscalls?

Now let’s get practical.

Looking at man 2 syscall, the manpage sheds some light on how to make the call, specifically in the Architecture calling conventions section. Those details are given at the assembly level, e.g.

processor interrupt 0x80 for i386 processors (32 bits), then specific registers
syscall instruction for x86_64 processors (64 bits), then specific registers

The calling conventions of other architectures are also described, e.g. on ARM processors the system call is performed by a swi 0x0 instruction, and on aarch64 by svc #0.

Readers not aware of what exactly a calling convention is should read at least this wikipedia article on x86 calling conventions. In short, a calling convention defines how and where parameters should be placed in order to call the code: how parameters are passed (registers and/or stack), how values are returned, etc.

This manual page also highlights an important difference with regular functions: while we look up system calls by their names (write, read, execve, exit, mmap, memfd_create, etc.), programs and the kernel actually know them by numbers.

Why numbers? The reason is that syscalls are like messages that are passed down, and these numbers are somewhat like enum ordinals indicating the type of message. These numbers are part of the syscall ABI (Application Binary Interface) and as such they are stable for a given CPU architecture, although the set is unbounded (new syscalls can be added).
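The name-to-number mapping is visible from C through the SYS_* constants of <sys/syscall.h>. The sketch below wraps a handful of them in a tiny lookup table; the function syscall_number is a hypothetical helper, and the table is deliberately incomplete.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>

/* Return the number the kernel uses for a few well-known syscalls on the
 * architecture this file is compiled for, or -1 if the name is not in this
 * (deliberately tiny) table. Hypothetical helper. */
long syscall_number(const char *name) {
    static const struct { const char *name; long nr; } table[] = {
        { "read",   SYS_read   },
        { "write",  SYS_write  },
        { "execve", SYS_execve },
        { "exit",   SYS_exit   },
        { "getpid", SYS_getpid },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i].name, name) == 0) return table[i].nr;
    }
    return -1;
}
```

The values are fixed at compile time for the target architecture: for instance write is 1 on x86_64 but 4 on i386, which is why the same binary cannot simply be reused across architectures.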

Beyond the scope of this article: not all syscalls are made equal nowadays. Some syscalls, usually the most frequently used ones, are exported into user-space memory to avoid the cost of switching to kernel mode. In practice, the vDSO (virtual dynamic shared object) is like a library: it is mapped into the process memory so that it can be called from the program (glibc knows about this memory region and will use it).

The vDSO segment (8 KiB here) shows up in the memory map of a process, e.g. with pmap -X {pid}.

To read more about it, one should read the relevant manual page (man 7 vdso). Typically, this page lists the exported syscalls.

E.g. `__vdso_clock_gettime`, which is called by clock_gettime defined in the standard libc (man 3 clock_gettime).
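The two routes to the same clock can be put side by side in C: the libc function, which on Linux usually resolves to the vDSO implementation, and the generic syscall(2) entry point, which always enters the kernel. This is a sketch assuming a 64-bit Linux system; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Goes through libc, which on Linux usually resolves to the vDSO
 * implementation and so avoids entering the kernel. */
int now_via_libc(struct timespec *ts) {
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Forces a real system call, bypassing the vDSO. */
int now_via_syscall(struct timespec *ts) {
    return (int)syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}
```

Both return 0 on success and fill the same structure; the difference is only in how the value is obtained, which is what makes the vDSO transparent to programs.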

The syscall numbers are different between architectures! On Linux one can look at their definition in the /include/asm-/unistd-.h files.

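From the syscall manpage, the syscall calling convention on Intel CPUs is:

Example 1. 64-bit programs

Set the registers

rax ← System call number
rdi ← First argument
rsi ← Second argument
rdx ← Third argument

Make the syscall

Execute the syscall processor instruction.

Example 2. 32-bit programs

Set the registers

eax ← System call number
ebx ← First argument
ecx ← Second argument
edx ← Third argument

Make the syscall

Raise the interrupt 0x80.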

The actual syscall numbers (for 64-bit programs) are usually defined in /usr/include/asm/unistd_64.h
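The 64-bit register assignment can be spelled out from C with GCC/Clang inline assembly. This is a sketch, not how libc is implemented: raw_syscall3 is a hypothetical helper, the inline assembly path only applies to x86_64, and other architectures fall back to the generic syscall(2) function.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Make a three-argument system call. Hypothetical helper. */
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                            /* rax: return value      */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)   /* rax, rdi, rsi, rdx     */
        : "rcx", "r11", "memory"               /* clobbered by `syscall` */
    );
    return ret;   /* raw kernel value: a negative errno on failure */
#else
    /* Other architectures: let libc pick the right registers. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the same value as getpid(). Note that the x86_64 path returns the raw kernel result, whereas the libc syscall function translates failures into -1 and errno.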

My first syscall

In order to quickly practice a syscall, let’s do a very simple hello world. The example will be in assembler, I promise this is the only source snippet in assembly and after that I’ll be back with Java and Panama.

Both variants of hello_syscall.asm, 64 bits (with the syscall instruction) and 32 bits (with an interrupt), follow the same steps:

1	A register holds the number of the selected syscall. For x86_64 the number comes from `/usr/include/asm/unistd_64.h`.
2	The syscall arguments are placed in the next registers.
3	The syscall is made, with the `syscall` instruction on 64 bits or with interrupt `0x80` on 32 bits.

Note the `elf64` output format when assembling for 64 bits.

When looking at this very simple code, something immediately stands out: from the application's point of view (user land), a syscall is just like an atomic pseudo machine instruction. I believe this example is more striking than the figure above on syscall ring transitions.

We saw exactly what a syscall is and how to make one using assembly. In general though, it's rare to invoke a syscall directly, as the standard library exposes wrappers that handle everything for most syscalls.

Figure: syscall wrappers in the standard library. The program calls printf(), libc turns it into syscall(SYS_write,…), and the kernel handles SYS_write.
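The difference between a dedicated wrapper and the generic entry point can be sketched in C with the same hello world written twice. The function names and the message are made up for the illustration; write and syscall are the regular libc functions.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

static const char MSG[] = "Hello, syscall!\n";

/* Usual route: the libc wrapper named after the syscall. */
long hello_via_wrapper(int fd) {
    return write(fd, MSG, sizeof MSG - 1);
}

/* Generic route: syscall(2) takes the syscall number, then its arguments.
 * This is the route needed when libc has no dedicated wrapper. */
long hello_via_syscall(int fd) {
    return syscall(SYS_write, fd, MSG, sizeof MSG - 1);
}
```

Called with file descriptor 1, both print the message on stdout and return the number of bytes written. The generic route works for any syscall number, at the price of losing the type checking that a dedicated wrapper gives.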

Because the memfd_secret syscall has been added only recently, there is no wrapper function for it in the standard library, hence we'll need to make the system call ourselves.
