Basic disassembly with libopcodes

May 18, 2019

Tags: programming, reverse-engineering, security

This will be a brief post on using libopcodes to disassemble a raw (i.e., not in object format) buffer of machine code. The examples will all be for AMD64, but libopcodes should work with most bfd_arch_*- and bfd_mach_*-specified machines.

Some background

For the unfamiliar, libopcodes is a part of the GNU binutils. Coupled with libbfd for object format parsing, it provides the core disassembly functionality used by tools like objdump.

It's also very old (the header in my dis-asm.h credits Cygnus Support and dates to 1993) and barely documented: outside of header file comments, the only real reference for it is this random page from 2009 on someone's self-hosted Wiki. opdis and xdisasm appear to use libopcodes, but both also appear to be unmaintained.

Why?

Honestly, there aren't very many good reasons to use libopcodes: Intel's XED is almost certainly more correct, Capstone has a pretty nice API (including decent Python bindings), and Zydis boasts performance and no dependencies as project goals. LLVM also provides disassembler functionality via the MC subproject; Ray Wang has a great blog post on using it.

However, sometimes you just need to do something a particular way. In this case, I needed to use libopcodes. Since there were no other decent resources on it, I figured I'd share what I've learned.

To BFD or not to BFD

libopcodes uses many of libbfd's constants, but can also be populated with a bfd * directly.

This post is not going to cover usage with a BFD handle, since libbfd doesn't do anything for us when disassembling raw bytes directly from an in-memory buffer.

Getting started

Seeing how we're using libopcodes, you'll need to have it installed.

On Debian and Ubuntu, apt install binutils-dev will fetch everything for you.

The syntax for linking to libopcodes is identical to every other library: just pass -lopcodes to your linker.

All code samples below assume that dis-asm.h is included.

Creating a disassembler

Almost all libopcodes functionality revolves around two types: disassembler_ftype and struct disassemble_info.

disassembler_ftype is a typedef'd function pointer, which the user obtains and then calls to disassemble a single instruction. dis-asm.h provides some forward declarations for predefined disassembler_ftypes as print_insn_*, but neglects to publicly expose the internal AMD64 disassembler_ftype. As such, we'll need to obtain it ourselves through a lookup function.

disassemble_info provides the basic context for feeding data into the user's disassembler_ftype: the stream and callback(s) to use for disassembled output and error reporting, as well as a callback for feeding data into the disassembler.

`disassemble_info`

Creating a new disassemble_info is a multi-step process:

/* disassemble_info has quite a few fields, and we won't be populating all of them.
 *
 * We zero-initialize here so that the libopcodes routines won't try to use
 * garbage data. ({0} is used rather than {} because an empty initializer
 * is a compiler extension before C23.)
 */
struct disassemble_info disasm_info = {0};


/* init_disassemble_info takes three arguments:
 *  1. a pointer to our disassemble_info
 *  2. a void pointer to a "stream", which gets fed to...
 *  3. a function pointer to an fprintf-like function
 *     see fprintf_ftype in dis-asm.h for the exact prototype
 *
 * (binutils 2.39 and later add a fourth argument, a styled variant of
 * the fprintf-like callback; check your dis-asm.h.)
 */
init_disassemble_info(&disasm_info, stdout, (fprintf_ftype) fprintf);

We'll replace stdout and fprintf above with our own stream and function in the full example at the end of the post, so that we can capture the disassembly instead of outputting it directly.
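As a rough illustration of what "our own stream and function" can look like, here is a minimal sketch that appends formatted output to a fixed-size buffer. The names capture_stream, capture_init, and capture_fprintf are made up for this sketch and are not part of libopcodes; the only thing taken from the library is the fprintf-like shape of the callback (an opaque stream pointer, a format string, and varargs).

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical capture target: a caller-owned, fixed-size text buffer. */
typedef struct {
  char *data;      /* storage, kept NUL-terminated after capture_init */
  size_t capacity; /* total bytes available in data, including the NUL */
  size_t length;   /* characters written so far, excluding the NUL */
} capture_stream;

static void capture_init(capture_stream *cs, char *storage, size_t capacity) {
  assert(capacity > 0); /* we always need room for the terminating NUL */
  cs->data = storage;
  cs->capacity = capacity;
  cs->length = 0;
  cs->data[0] = '\0';
}

/* Same shape as fprintf: opaque stream, format string, varargs.
 * Appends to the buffer, silently truncating when the text doesn't fit.
 * Returns the number of characters actually appended, or -1 on a
 * formatting error.
 */
static int capture_fprintf(void *stream, const char *fmt, ...) {
  capture_stream *cs = (capture_stream *)stream;
  size_t room = cs->capacity - cs->length; /* always >= 1 (the NUL) */

  va_list args;
  va_start(args, fmt);
  int wanted = vsnprintf(cs->data + cs->length, room, fmt, args);
  va_end(args);

  if (wanted < 0) {
    return -1;
  }

  /* vsnprintf reports the untruncated length; clamp to what was stored. */
  size_t appended = (size_t)wanted < room ? (size_t)wanted : room - 1;
  cs->length += appended;
  return (int)appended;
}
```

With libopcodes, a pointer to such a struct and the callback would take the place of stdout and fprintf in the init_disassemble_info call. A fixed buffer trades completeness for simplicity: long output is truncated rather than reallocated.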

Confusingly, init_disassemble_info is not enough to fully initialize our disassemble_info structure. We also need to fill in some fields manually, and call a separate initialization function:

/* Specify our disassembly target. These constants are also required when
 * we create the actual disassembly function later, so I'm not 100% sure
 * if/why they're necessary here.
 */
disasm_info.arch = bfd_arch_i386;
disasm_info.mach = bfd_mach_x86_64;

/* Optionally change the output format to Intel,
 * over the default of AT&T.
 */
disasm_info.disassembler_options = "intel-mnemonic";

/* Tell our disassembler how and where to get its raw bytes.
 * libopcodes provides the buffer_read_memory function;
 * the buffer and its length are our input.
 */
disasm_info.read_memory_func = buffer_read_memory;
disasm_info.buffer = input_buffer;
disasm_info.buffer_vma = 0;
disasm_info.buffer_length = input_buffer_length;

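Observe that we set disasm_info.buffer_vma to 0. You can change that to whatever you want your starting VMA to be; just make sure to do your address relocations correctly, so that every program counter you later pass to the disassembler falls within the range that starts at that address.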
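To make the role of buffer_vma concrete, here is a simplified model of a buffer-backed read callback. This is not the libopcodes implementation of buffer_read_memory; code_window and window_read are hypothetical names, and the error value is an assumption made for the sketch. It only shows the arithmetic: a requested address is turned into a buffer offset by subtracting the base VMA, and anything outside the buffer is rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the buffer-related fields of disassemble_info. */
typedef struct {
  const uint8_t *buffer; /* raw machine code */
  uint64_t buffer_vma;   /* address that buffer[0] is treated as living at */
  size_t buffer_length;  /* number of bytes in buffer */
} code_window;

/* Copy `length` bytes starting at virtual address `memaddr` into `out`.
 * Returns 0 on success, or -1 if any requested byte lies outside
 * the window.
 */
static int window_read(const code_window *w, uint64_t memaddr,
                       uint8_t *out, size_t length) {
  assert(w != NULL && out != NULL);

  if (memaddr < w->buffer_vma) {
    return -1; /* address is below the start of the buffer */
  }

  uint64_t offset = memaddr - w->buffer_vma;
  if (offset > w->buffer_length || length > w->buffer_length - offset) {
    return -1; /* read would run past the end of the buffer */
  }

  memcpy(out, w->buffer + offset, length);
  return 0;
}
```

Under this model, a buffer placed at 0x401000 must be disassembled starting from pc 0x401000, not from 0; a pc of 0 would be rejected as out of range.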

Finally, we call one last function:

disassemble_init_for_target(&disasm_info);

Our disassemble_info is now ready for use.

`disassembler_ftype`

As mentioned above, disassembler_ftype is a typedef'd function pointer, one that we will call post-creation to disassemble our buffer instruction-by-instruction.

libopcodes provides a disassembler function that returns a suitable function:

disassembler_ftype disasm;

/* disassembler takes 4 arguments:
 *  1. The target architecture, same as disasm_info.arch
 *  2. The endianness (true = big, false = little)
 *  3. The target machine, same as disasm_info.mach
 *  4. An optional pointer to a BFD handle
 *
 * Returns NULL if libopcodes can't find a suitable disassembly function.
 */
disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);

disasm can now be called.

Disassembling an instruction

To disassemble a single instruction, we pass a program counter and our disassemble_info to our disasm function. Internally, this (presumably) causes libopcodes to call its read_memory_func with that program counter as the address to read from.

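/* Our start pc. This should be adjusted per disasm_info.buffer_vma.
 */
bfd_vma pc = 0;

/* disasm() returns the number of bytes consumed during instruction decoding,
 * or a negative value if decoding failed (e.g., on a memory read error).
 * It returns int, so don't store the result in an unsigned type.
 */
int insn_size = disasm(pc, &disasm_info);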

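After a successful call, the stream that we passed to init_disassemble_info (stdout, in the snippets above) should have received a string representation of the disassembled instruction through our fprintf-like callback. The callback may be invoked several times for a single instruction. Note that no newline is appended; it's up to the programmer to insert separators between calls so that the output stays human-readable.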
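Since the return value doubles as an error indicator, a decode loop should check it before advancing. The sketch below shows that control flow in isolation. decode_fn is a hypothetical callback type that mirrors the contract described above (bytes consumed, or negative on failure), and toy_decode is a deliberately fake decoder, not x86, that exists only so the loop can be run without libopcodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder callback: returns the number of bytes consumed
 * at `pc`, or a negative value on failure. */
typedef int (*decode_fn)(uint64_t pc, void *ctx);

/* Walk [start, start + length) one instruction at a time, stopping early
 * if the decoder fails or makes no progress.
 * Returns the number of instructions decoded and stores the pc at which
 * the walk ended in *stopped_at. */
static size_t walk_instructions(decode_fn decode, void *ctx, uint64_t start,
                                size_t length, uint64_t *stopped_at) {
  assert(decode != NULL && stopped_at != NULL);

  uint64_t pc = start;
  uint64_t end = start + length;
  size_t count = 0;

  while (pc < end) {
    int consumed = decode(pc, ctx);
    if (consumed <= 0) {
      break; /* never advance by a negative or zero size */
    }
    pc += (uint64_t)consumed;
    count++;
  }

  *stopped_at = pc;
  return count;
}

/* Backing store for the toy decoder below. */
typedef struct {
  const uint8_t *bytes;
  size_t length;
} toy_code;

/* Toy decoder, NOT x86: the first byte of each "instruction" is its own
 * total length, and a length of 0 (or one that overruns) is invalid. */
static int toy_decode(uint64_t pc, void *ctx) {
  const toy_code *code = (const toy_code *)ctx;
  if (pc >= code->length) {
    return -1;
  }
  uint8_t insn_length = code->bytes[pc];
  if (insn_length == 0 || insn_length > code->length - pc) {
    return -1;
  }
  return insn_length;
}
```

Keeping the result in a signed int is the important part: assigning a negative return to a size_t would turn it into a very large offset.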

Putting it all together

Here's how we can disassemble a raw buffer into a string of assembly:

#define _GNU_SOURCE /* asprintf, vasprintf */

#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <dis-asm.h>

typedef struct {
  char *insn_buffer;
  bool reenter;
} stream_state;

/* This approach isn't very memory efficient, but it
 * avoids external size/buffer tracking in this
 * example.
 */
static int dis_fprintf(void *stream, const char *fmt, ...) {
  stream_state *ss = (stream_state *)stream;

  va_list arg;
  va_start(arg, fmt);
  if (!ss->reenter) {
    vasprintf(&ss->insn_buffer, fmt, arg);
    ss->reenter = true;
  } else {
    char *tmp;
    vasprintf(&tmp, fmt, arg);

    char *tmp2;
    asprintf(&tmp2, "%s%s", ss->insn_buffer, tmp);
    free(ss->insn_buffer);
    free(tmp);
    ss->insn_buffer = tmp2;
  }
  va_end(arg);

  return 0;
}

char *disassemble_raw(uint8_t *input_buffer, size_t input_buffer_size) {
  char *disassembled = NULL;
  stream_state ss = {0};

  disassemble_info disasm_info = {0};
  init_disassemble_info(&disasm_info, &ss, dis_fprintf);
  disasm_info.arch = bfd_arch_i386;
  disasm_info.mach = bfd_mach_x86_64;
  disasm_info.read_memory_func = buffer_read_memory;
  disasm_info.buffer = input_buffer;
  disasm_info.buffer_vma = 0;
  disasm_info.buffer_length = input_buffer_size;
  disassemble_init_for_target(&disasm_info);

  disassembler_ftype disasm;
  disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);
  if (disasm == NULL) {
    /* libopcodes has no disassembler for this arch/mach pair. */
    return NULL;
  }

  size_t pc = 0;
  while (pc < input_buffer_size) {
    /* disasm() returns an int: bytes consumed, or negative on error. */
    int insn_size = disasm(pc, &disasm_info);

    /* Append whatever the callback captured for this instruction. */
    if (ss.insn_buffer != NULL) {
      if (disassembled == NULL) {
        asprintf(&disassembled, "%s", ss.insn_buffer);
      } else {
        char *tmp;
        asprintf(&tmp, "%s\n%s", disassembled, ss.insn_buffer);
        free(disassembled);
        disassembled = tmp;
      }
    }

    /* Reset the stream state after each instruction decode.
     */
    free(ss.insn_buffer);
    ss.insn_buffer = NULL;
    ss.reenter = false;

    /* Stop on failure instead of advancing by a bogus size. */
    if (insn_size <= 0) {
      break;
    }
    pc += (size_t)insn_size;
  }

  return disassembled;
}

int main(int argc, char const *argv[]) {
  uint8_t input_buffer[] = {
      0x55,             /* push rbp */
      0x48, 0x89, 0xe5, /* mov rbp, rsp */
      0x89, 0x7d, 0xfc, /* mov DWORD PTR [rbp-0x4], edi */
      0x8b, 0x45, 0xfc, /* mov eax, DWORD PTR [rbp-0x4] */
      0x0f, 0xaf, 0xc0, /* imul eax, eax */
      0x5d,             /* pop rbp */
      0xc3,             /* ret */
  };
  size_t input_buffer_size = sizeof(input_buffer);

  char *disassembled = disassemble_raw(input_buffer, input_buffer_size);
  puts(disassembled);
  free(disassembled);

  return 0;
}

Which, when compiled and run:

clang test.c -lopcodes -o test
./test

Should produce:

push   %rbp
mov    %rsp,%rbp
mov    %edi,-0x4(%rbp)
mov    -0x4(%rbp),%eax
imul   %eax,%eax
pop    %rbp
retq

Other stuff

The above covers the very basics of using libopcodes, but there's a lot of other stuff you can do via disassemble_info:

For some targets (not x86, unfortunately), the decoder will set insn_info_valid. If set, the branch_delay_insns, data_size, insn_type, target, and target2 fields can all be accessed. See the header for more information about each.
Memory error reporting can be controlled via memory_error_func. libopcodes provides perror_memory as a default choice.
Address printing can be controlled via print_address_func. libopcodes provides generic_print_address as a default choice.
Symbol resolution can be controlled via symbol_at_address_func and symbol_is_valid. libopcodes provides generic_symbol_at_address and generic_symbol_is_valid as default choices.
