Basic disassembly with libopcodes
source link: https://www.tuicool.com/articles/2uQVBnQ
May 18, 2019
Tags: programming, reverse-engineering, security
This will be a brief post on using libopcodes to disassemble a raw (i.e., not in object format) buffer of machine code. The examples will all be for AMD64, but libopcodes should work with most bfd_arch_* and bfd_mach_*-specified machines.
Some background
For the unfamiliar, libopcodes is a part of GNU binutils. Coupled with libbfd for object format parsing, it provides the core disassembly functionality used by tools like objdump.
It's also very old (the header in my dis-asm.h credits Cygnus Support and dates to 1993) and barely documented: outside of header file comments, the only real reference for it is this random page from 2009 on someone's self-hosted wiki. opdis and xdisasm appear to use libopcodes, but both also appear to be unmaintained.
Why?
Honestly, there aren't very many good reasons to use libopcodes: Intel's XED is almost certainly more correct, Capstone has a pretty nice API (including decent Python bindings), and Zydis boasts performance and no dependencies as project goals. LLVM also provides disassembler functionality via the MC subproject; Ray Wang has a great blog post on using it.
However, sometimes you just need to do something a particular way. In this case, I needed to use libopcodes. Since there were no other decent resources on it, I figured I'd share what I've learned.
To BFD or not to BFD
libopcodes uses many of libbfd's constants, but can also be populated with a bfd * directly.
This post is not going to cover usage with a BFD handle, since libbfd doesn't do anything for us when disassembling raw bytes directly from an in-memory buffer.
Getting started
Since we're using libopcodes, you'll need to have it installed.
On Debian and Ubuntu, apt install binutils-dev will fetch everything for you.
The syntax for linking to libopcodes is identical to every other library: just pass -lopcodes to your linker.
All code samples below assume that dis-asm.h is included.
Creating a disassembler
Almost all libopcodes functionality revolves around two types: disassembler_ftype and struct disassemble_info.
disassembler_ftype is a typedef'd function pointer, which the user creates and then calls to disassemble a single instruction. dis-asm.h provides some forward declarations for predefined disassembler_ftypes as print_insn_*, but neglects to publicly expose the internal AMD64 disassembler_ftype. As such, we'll need to construct it ourselves.
disassemble_info provides the basic context for feeding data into the user's disassembler_ftype: the stream and callback(s) to use for disassembled output and error reporting, as well as a callback for feeding data into the disassembler.
disassemble_info
Creating a new disassemble_info is a multi-step process:
    /* disassemble_info has quite a few fields, and we won't be populating all of them.
     *
     * We empty-initialize here so that the libopcodes routines won't try to use
     * garbage data.
     */
    struct disassemble_info disasm_info = {};

    /* init_disassemble_info takes three arguments:
     *  1. a pointer to our disassemble_info
     *  2. a void pointer to a "stream", which gets fed to...
     *  3. a function pointer to an fprintf-like function
     *     see fprintf_ftype in dis-asm.h for the exact prototype
     */
    init_disassemble_info(&disasm_info, stdout, (fprintf_ftype) fprintf);
We'll replace stdout and fprintf above with our own stream and function in the full example at the end of the post, so that we can capture the disassembly instead of outputting it directly.
Confusingly, init_disassemble_info is not enough to fully initialize our disassemble_info structure. We also need to fill in some fields manually, and call a separate initialization function:
    /* Specify our disassembly target. These constants are also required when
     * we create the actual disassembly function later, so I'm not 100% sure
     * if/why they're necessary here.
     */
    disasm_info.arch = bfd_arch_i386;
    disasm_info.mach = bfd_mach_x86_64;

    /* Optionally change the output format to Intel,
     * over the default of AT&T.
     */
    disasm_info.disassembler_options = "intel-mnemonic";

    /* Tell our disassembler how and where to get its raw bytes.
     * libopcodes provides the buffer_read_memory function;
     * the buffer and its length are our input.
     */
    disasm_info.read_memory_func = buffer_read_memory;
    disasm_info.buffer = input_buffer;
    disasm_info.buffer_vma = 0;
    disasm_info.buffer_length = input_buffer_length;
Observe that we set disasm_info.buffer_vma to 0; you can change that to whatever you want your starting VMA to be. Just make sure to do your address relocations correctly.
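To make the buffer_vma bookkeeping concrete, here is a minimal sketch of what a buffer-backed read callback has to do: translate the requested address into an offset from the start of the buffer and refuse anything out of range. The struct raw_region and region_read names are hypothetical stand-ins invented for this illustration; they are not libopcodes types, and this is not the library's buffer_read_memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the buffer-related fields of disassemble_info. */
struct raw_region {
    const uint8_t *buffer;  /* raw machine code */
    uint64_t buffer_vma;    /* address assigned to buffer[0] */
    size_t buffer_length;   /* number of bytes in buffer */
};

/* Copy `length` bytes starting at address `memaddr` into `out`.
 * Returns 0 on success, or -1 if any part of the requested range
 * lies outside the region. */
static int region_read(const struct raw_region *r, uint64_t memaddr,
                       uint8_t *out, size_t length)
{
    if (memaddr < r->buffer_vma) {
        return -1; /* address is below the start of the buffer */
    }

    /* Addresses are relative to buffer_vma, not to zero. */
    uint64_t offset = memaddr - r->buffer_vma;
    if (offset > r->buffer_length || length > r->buffer_length - offset) {
        return -1; /* range runs past the end of the buffer */
    }

    memcpy(out, r->buffer + offset, length);
    return 0;
}
```

With a region whose buffer_vma is 0x401000, a read at 0x401001 fetches buffer[1] onward, while a read at address 1 fails. The same arithmetic applies to the program counter you hand to the disassembler: it is an address in the VMA space, not an index into the buffer.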
Finally, we call one last function:
    disassemble_init_for_target(&disasm_info);
Our disassemble_info is now ready for use.
disassembler_ftype
As mentioned above, disassembler_ftype is a typedef'd function pointer, one that we will call after creation to disassemble our buffer instruction-by-instruction.
libopcodes provides a disassembler function that returns a suitable function:
    disassembler_ftype disasm;

    /* disassembler takes 4 arguments:
     *  1. The target architecture, same as disasm_info.arch
     *  2. The endianness (true = big, false = little)
     *  3. The target machine, same as disasm_info.mach
     *  4. An optional pointer to a BFD handle
     *
     * Returns NULL if libopcodes can't find a suitable disassembly function.
     */
    disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);
disasm can now be called.
Disassembling an instruction
To disassemble a single instruction, we pass a program counter and our disassemble_info to our disasm function. Internally, this (presumably) causes libopcodes to call its read_memory_func with that program counter as the address to read from.
    /* Our start pc. This should be adjusted per disasm_info.buffer_vma. */
    bfd_vma pc = 0;

    /* disasm() returns the number of bytes consumed during instruction decoding,
     * or a negative value if decoding failed.
     */
    int insn_size = disasm(pc, &disasm_info);
After a successful call, the stream we passed to init_disassemble_info will have received a string representation of the disassembled instruction through the fprintf-like callback (disasm_info.buffer is only the input and is left untouched). Note that no newline is appended; it's up to the programmer to separate the output of successive calls.
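Because the callback may be invoked several times per instruction (mnemonic, operands, and separators can arrive in separate calls), the stream has to accumulate text until you reset it. Below is a minimal sketch of one way to do that with a fixed-size buffer. The insn_stream type and both function names are hypothetical, and the 128-byte capacity is an arbitrary assumption; the only thing taken from libopcodes is the callback shape, int (*)(void *, const char *, ...).

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed-size stream; not a libopcodes type. */
typedef struct {
    char text[128]; /* accumulated text for one instruction */
    size_t used;    /* characters stored so far, excluding the NUL */
} insn_stream;

/* fprintf-like callback: appends formatted output to the stream,
 * silently truncating once the buffer is full. */
static int insn_stream_printf(void *stream, const char *fmt, ...)
{
    insn_stream *s = stream;
    va_list args;

    va_start(args, fmt);
    int n = vsnprintf(s->text + s->used, sizeof(s->text) - s->used, fmt, args);
    va_end(args);
    if (n < 0) {
        return n; /* formatting error; stream left as it was */
    }

    /* vsnprintf reports the untruncated length, so clamp it
     * to the space that was actually available. */
    size_t room = sizeof(s->text) - s->used - 1;
    s->used += ((size_t)n < room) ? (size_t)n : room;
    return n;
}

/* Call between instructions so the next one starts from an empty string. */
static void insn_stream_reset(insn_stream *s)
{
    s->used = 0;
    s->text[0] = '\0';
}
```

In use, a pointer to an insn_stream and insn_stream_printf would take the place of stdout and fprintf when initializing the disassemble_info; after each decode you would copy or print text, add your own newline, and call insn_stream_reset. The trade-off against a heap-growing approach is simplicity at the cost of truncating unusually long lines.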
Putting it all together
Here's how we can disassemble a raw buffer into a string of assembly:
    #define _GNU_SOURCE /* asprintf, vasprintf */
    #include <stdarg.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #include <dis-asm.h>

    typedef struct {
      char *insn_buffer;
      bool reenter;
    } stream_state;

    /* This approach isn't very memory efficient, but it
     * avoids the need for external size/buffer tracking in this
     * example.
     */
    static int dis_fprintf(void *stream, const char *fmt, ...) {
      stream_state *ss = (stream_state *)stream;

      va_list arg;
      va_start(arg, fmt);
      if (!ss->reenter) {
        /* First call for this instruction: start a fresh string. */
        vasprintf(&ss->insn_buffer, fmt, arg);
        ss->reenter = true;
      } else {
        /* Later calls: append to what we already have. */
        char *tmp;
        vasprintf(&tmp, fmt, arg);

        char *tmp2;
        asprintf(&tmp2, "%s%s", ss->insn_buffer, tmp);
        free(ss->insn_buffer);
        free(tmp);
        ss->insn_buffer = tmp2;
      }
      va_end(arg);

      return 0;
    }

    char *disassemble_raw(uint8_t *input_buffer, size_t input_buffer_size) {
      char *disassembled = NULL;
      stream_state ss = {};

      disassemble_info disasm_info = {};
      init_disassemble_info(&disasm_info, &ss, dis_fprintf);
      disasm_info.arch = bfd_arch_i386;
      disasm_info.mach = bfd_mach_x86_64;
      disasm_info.read_memory_func = buffer_read_memory;
      disasm_info.buffer = input_buffer;
      disasm_info.buffer_vma = 0;
      disasm_info.buffer_length = input_buffer_size;
      disassemble_init_for_target(&disasm_info);

      disassembler_ftype disasm;
      disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);

      bfd_vma pc = 0;
      while (pc < input_buffer_size) {
        int insn_size = disasm(pc, &disasm_info);
        if (insn_size <= 0) {
          /* Decoding failed; stop rather than loop forever. */
          free(ss.insn_buffer);
          break;
        }
        pc += insn_size;

        if (disassembled == NULL) {
          asprintf(&disassembled, "%s", ss.insn_buffer);
        } else {
          char *tmp;
          asprintf(&tmp, "%s\n%s", disassembled, ss.insn_buffer);
          free(disassembled);
          disassembled = tmp;
        }

        /* Reset the stream state after each instruction decode. */
        free(ss.insn_buffer);
        ss.insn_buffer = NULL;
        ss.reenter = false;
      }

      return disassembled;
    }

    int main(int argc, char const *argv[]) {
      uint8_t input_buffer[] = {
          0x55,             /* push rbp */
          0x48, 0x89, 0xe5, /* mov rbp, rsp */
          0x89, 0x7d, 0xfc, /* mov DWORD PTR [rbp-0x4], edi */
          0x8b, 0x45, 0xfc, /* mov eax, DWORD PTR [rbp-0x4] */
          0x0f, 0xaf, 0xc0, /* imul eax, eax */
          0x5d,             /* pop rbp */
          0xc3,             /* ret */
      };
      size_t input_buffer_size = sizeof(input_buffer);

      char *disassembled = disassemble_raw(input_buffer, input_buffer_size);
      if (disassembled != NULL) {
        puts(disassembled);
        free(disassembled);
      }

      return 0;
    }
Which, when compiled and run:
    clang test.c -lopcodes -o test
    ./test
Should produce:
    push   %rbp
    mov    %rsp,%rbp
    mov    %edi,-0x4(%rbp)
    mov    -0x4(%rbp),%eax
    imul   %eax,%eax
    pop    %rbp
    retq
Other stuff
The above covers the very basics of using libopcodes, but there's a lot of other stuff you can do via disassemble_info:
- For some targets (not x86, unfortunately), the decoder will set insn_info_valid. If set, the branch_delay_insns, data_size, insn_type, target, and target2 fields can all be accessed. See the header for more information about each.
- Memory error reporting can be controlled via memory_error_func. libopcodes provides perror_memory as a default choice if the user does not set one.
- Address printing can be controlled via print_address_func. libopcodes provides generic_print_address as a default choice.
- Symbol resolution can be controlled via symbol_at_address_func and symbol_is_valid. libopcodes provides generic_symbol_at_address and generic_symbol_is_valid as default choices.