Reverse Taint Analysis Using Binary Ninja

Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

by Henry Wildermuth, Horace Mann High School

We open-sourced a set of static analysis tools, KRFAnalysis , that analyze and triage output from our system call (syscall) fault injection tool KRF . Now you can easily figure out where and why, KRF crashes your programs.

During my summer internship at Trail of Bits, I worked on KRF, a fuzzer that directly faults syscalls to cause crashes. KRF works extremely well and pumps out core dumps like nobody’s business. However, it is difficult to determine which faulted syscall caused a particular crash since there could be hundreds of faulted syscalls in a single run. Manually tracing the cause of the crash through the source or disassembled binary is tedious, tiring, and prone to errors.

I set out to solve this problem using a technique I call Reverse Taint Analysis, and implemented my solution using the Binary Ninja API. The script gives a short list of possible causes of the crash, drastically limiting the amount of manual work required. Here, I describe the process I went through to create the algorithm and script, and give a brief overview of the additional tools built to ease its use.

Human in the Loop

How can we reliably determine the source of a crash? Well, how would a human determine the cause of a crash? First, we would look at the stack trace and figure out where the crash occurred. Let’s take a look at this vulnerable example program:

```#include
#include

void fillBuffer(char * string, unsigned len) {
for (unsigned i = 0; i < len; ++i) {
string[i] = 'A'; // if string = NULL, segfaults on invalid write
}
}

int main() {
char *str;

// Memory allocated
str = (char *) malloc(16); // if malloc fails, str = NULL
fillBuffer(str, 16); // str not checked for malloc errors!
free(str);
return 0;
}```

Running KRF against this program caused a fault. In this case, we can easily guess why the crash occurred—a faulted `brk` or `mmap` caused `malloc` to return `NULL` , which produced a segfault when `fillBuffer` tried to write to `NULL` . But let’s figure out the cause of the crash for certain, acting as if we didn’t have access to the source code.

First, let’s open up the core dump’s stack trace with gdb and see what caused the crash:

```(gdb) bt
#0 0x00005555555546a8 in fillBuffer ()
#1 0x00005555555546e1 in main ()```

Next, let’s take a look at our memory maps for the process so we can find the instruction in the binaries:

```(gdb) info proc mappings
[output truncated]```

Now, cross-referencing the address from the stack trace, we see that the instructions of the top stack frame, at `0x555555554000` , are in the binary `/vagrant/shouldve_gone_for_the_head` . We can calculate the instruction pointer’s offset in the binary by subtracting its location in the mapped address space from the beginning of the memory-mapped objfile and adding the offset:

`0x00005555555546a8 - 0x555555554000 + 0x0 = 0x6a8` .

Great! Now we can examine the binary itself in our disassembler of choice (Binary Ninja) and see what went wrong.

Here, we can see the disassembly of the `fillBuffer()` function, with the instruction that causes the segfault highlighted in red. This instruction sets the byte pointed to by `rax` to the character code for `A` . So, the issue must be an invalid value of `rax` . Looking back, we see that `rax = rax + rdx` , which are both previously set to the local variables `string` and `i` , respectively. We can see in the instruction at `0x68e` that string was originally stored in `rdi` , which is the first argument to the function. `i` is initialized to zero and is only incremented, so we can ignore it, since we know it could not have been tainted by a function call or the function’s arguments.

Knowing that the first argument to `fillBuffer()` is tainted, we can go to the next frame in the stack trace and see what happened. We perform the same subtraction with the memory map to the address in the stack trace, `0x00005555555546e1` , and get the actual address:

`0x00005555555546e1 - 0x555555554000 + 0x0 = 0x6e1` .​

This address is going to be one instruction after the function call to `fillBuffer()` since it is the return address. So, we want to examine the instruction directly before the one at `0x6e1` . Let’s open it up in Binary Ninja!

Here, we have the instruction at `0x6e1` highlighted in blue, and the previous instruction highlighted in red. We know from our manual analysis of `fillBuffer` that the first parameter is stored in `rdi` , so we should track the data being stored in `rdi` . In the instruction before, we see that `rdi` is set to `rax` , and above that, there is a call to malloc, which stores its return value in `rax` .

Great! Now we know that the output of `malloc` gets passed into `fillBuffer` , where it causes the segfault. We’ve figured it out! But that was really annoying. If only there were a better way…

Enter MLIL Static Single Assignment

Well, it turns out there is a better way! Binary Ninja can decompile code into something called Medium Level IL (MLIL), which is a more human-readable form of assembly. It can then convert that MLIL into a form called Static Single Assignment (SSA), where every variable is assigned exactly once. This becomes really useful, because we don’t need to worry about things changing a variable other than its definition. As an example of SSA, consider this pseudocode function:

```def f(a):
if a < 5:
a = a * 2
else:
a = a - 5
return a```

In SSA form is:

```def f(a<sub>0</sub>):
if a<sub>0</sub> < 5:
a<sub>1</sub> = a<sub>0</sub> * 2
else:
a<sub>2</sub> = a<sub>0</sub> - 5
a<sub>3</sub> = Φ(a<sub>1</sub>, a<sub>2</sub>) // meaning “a<sub>3</sub> is either a<sub>1</sub> or a<sub>2</sub>”
return a<sub>3</sub>```

So, let’s look at our same example again through the lens of SSA MLIL. Here’s fillBuffer in SSA MLIL:

Here, we can easily trace `rax_2#4` to `rax_1#3 + rdx_1#2` , then trace `rax_1#3` to `string#1` , which we see is `arg1` . We can also easily trace back `i` and see that it is set to `0` . We have once again discovered that the first argument to `fillBuffer` is the source of the crash. So now, let’s look at `main` .

This is where we really see the benefits of SSA MLIL over regular disassembly. It lets us see what arguments are passed into `fillBuffer` , and what values are returned by `malloc` , making the analysis much easier. By tracing the sources of `rdi#1` backwards, we again see that `malloc` is tainting the first argument of `fillBuffer` and, therefore, causing the crash.

We’re in the endgame now

So now that we’ve realized (for the second time) that malloc is the cause of our issues, let’s write out the process we’ve been applying, so we can easily convert it to code:

```1. Make an empty stack.
2. Push the crashing instruction to the stack.
3. While the stack is not empty:
4.   Pop an instruction off the stack.
5.   If it is a MLIL function call instruction:
6.     The return value of that function call may be cause of crash
7.   Otherwise:
8.     For each SSA variable used in the MLIL instruction:
9.        If it’s not assigned in this function:
10.         # It’s a function argument
11.         We will have to go another frame up our stack trace.
12.         # The same as going to main after finding arg1 was tainted
13.       Otherwise:
14.         Add the instruction assigning SSA variable to the stack.```

This is going to be easy! We just have to write it out in Python using the Binary Ninja API. We need to write a function that takes our instruction’s address and a `BinaryView` (a class holding information on the binary), and prints out the taint sources of the instruction.

```def checkFunction(self, inst_addr, bv):
# Get MLILFunction obj for the function containing the instruction

# Get the MLILInstruction obj for instruction at inst_addr

# Convert MLILFunction to SSA form
func = func.ssa_form

# Keep track of what is seen
visited_instructions = set()

# Variables we are interested in
var_stack = []

# Add the vars used by first instruction to stack
var_stack.append(v)

# Continuously run analysis while elements are in the stack
while len(var_stack) > 0:
var = var_stack.pop()
if var not in visited_instructions:
# Add to list of things seen

# Get variable declaration
decl = func.get_ssa_var_definition(var)

# Check if its an argument
if decl is None:
print("Argument " + var.var.name + " tainted from function call")
continue

# Check if its a function call
if decl.operation == MediumLevelILOperation.MLIL_CALL_SSA:
# If direct call
if decl.dest.value.is_constant:
# Get MLILFunction object of callee from address
func_called = bv.get_function_at(decl.dest.value.value)
print("Tainted by call to", func_called.name, "(" + hex(decl.dest.value.value) + ")")
else:
# Indirect calls
print("Tainted by indirect call at instruction", hex(decl.address))
continue

# If not an argument or call, add variables used in instruction to the stack. Constants are filtered out
var_stack.append(v)```

The power of SSA is used in the `vars_read` and `get_ssa_var_definition` methods. MLIL makes detecting calls easy using `decl.operation == MediumLevelILOperation.MLIL_CALL_SSA` .

Extending the script

We can expand on a lot here with error handling, edge cases, automatically analyzing the frame above in the stack trace, automatically extracting information from the stack trace, etc. Thankfully, I’ve already done some of that with a set of python scripts .

`python3 main.py binary coredump1 [coredump2] …`

Automatically extracts the needed information from the core dumps, then inserts that information and binaries into a tarball to be copied to another computer, including libraries that are called in the stack trace.

`gdb.py`

Uses GDB Python API to extract data from each core dump. It’s called by `main.py` , so they must be in the same directory.

`python3 analyze.py tarball.tar.gz`

Takes a tarball output by main.py and automatically runs reverse taint analysis on each core dump in it, automatically cascading tainted arguments to the next frame. It uses `krf.py` to run the analysis, so they must be in the same directory.

`krf.py` contains the analysis code, which is a more featured version of the script written in this blog post. (Requires the Binary Ninja API.)

Let’s try them on our test binary:

```\$ # Linux VM with KRF
Produced tar archive krfanalysis-shouldve_gone_for_the_head.tar.gz in /vagrant

\$ # Machine with Binary Ninja
Done
Tainted by call to malloc (0x560)
All paths checked```

Conclusion

Writing this analysis script has shown me the Binary Ninja API is amazing. The versatility and automatic analysis it allows is incredible, especially considering it acts directly on binaries, and its intermediate languages are easy to use and understand.

I’d also like to mention LLVM, another framework for static analysis, which has a very similar API to Binary Ninja. It has many benefits over Binary Ninja, including better access to debug and type information, being free, having a more mature codebase, and always-perfect analysis of calling conventions. Its downside is that it needs the source code or LLVM IR of what you are analyzing.

Three LLVM passes are available in the KRFAnalysis repository to run static analysis: one detecting race conditions caused by checking the state of a system before use (i.e. time-of-check, time-of-use or TOC/TOU), another detecting unchecked errors from standard library calls, and a third reimplementing reverse taint analysis.

My summer: A small price to pay for salvation

I am incredibly grateful to everyone at Trail of Bits for my internship. I gained some amazing technical experience and got the chance to work with the Linux Kernel, FreeBSD Kernel, and LLVM—codebases I had previously considered to be mystical.

Some of my highlights:

• I ported KRF to FreeBSD
• Added the ability for KRF to target processes by PID, GID, UID, or if it had a specific file open
• Wrote LLVM passes for static analysis
• Upstreamed LLVM changes
• Learned how to use Binary Ninja and its API
• Picked up good coding practices
• Gained a sense of the security industry

I also met some incredible people. I would like to give special thanks to my mentor Will Woodruff ( @8x5clPW2 ), who was always willing to talk over an implementation, idea, or review my pull requests. I can’t wait to apply what I’ve learned at Trail of Bits as I move forward in my career.