
binread: Declarative Rust Binary Parsing

 3 years ago
source link: https://jam1.re/blog/binread-a-declarative-rust-binary-parsing-library

(If you want to skip the explanation of "why" and get straight to the "what", jump to "The idea, in action".)

Why Rust?

Rust, for those who aren't already aware, is a programming language. The best, in fact (kidding, although it is certainly my favorite). Rust is similar to languages such as C++ in that it is highly efficient, compiles down to machine code, and has essentially no runtime. This makes both languages a great fit for everything from operating systems to browsers, where you need low-level control to get the best performance possible.

One important feature that sets Rust apart from C++ for these use cases is its memory safety. Without it, even a simple mistake in a language like C++ will usually hand code execution to an experienced attacker (a generalization, but the point is that it takes an inhuman level of careful programming to prevent a vulnerability).

(I swear I'm almost at the point) When I'm attacking a system and looking for vulnerabilities, parsers are my go-to for a reason—they often process user-controlled data and they often have memory vulnerabilities. This dangerous combination makes Rust a great tool for doing the job properly. And, better yet, since parsers are rather modular components of programs they can be written in Rust even if the rest of the program is written in, say, C++.

Previous Works

Plenty of work has already been done on parsing in Rust. Some notable examples:

  • nom - a parser-combinator style parsing library. Works for string parsing, bit parsing and byte parsing, and has solid performance thanks to being zero-copy. I would highly recommend it.
  • byteorder - a great library that extends the Read/Write traits to allow byteorder-configurable parsing (and writing) of integer/float types. A lot of the inspiration for binread came from byteorder!
  • std - Rust's std now provides from_be_bytes/from_le_bytes, which can be used somewhat similarly to byteorder.
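To make the std option concrete, here is a minimal sketch that parses a pair of little-endian u16 values using nothing but from_le_bytes. The parse_point helper and its return shape are made up for illustration; they are not part of std or binread.

```rust
// Hypothetical helper (not part of std or binread): reads a little-endian
// (u16, u16) pair from the front of a byte slice and returns the rest.
fn parse_point(input: &[u8]) -> Option<((u16, u16), &[u8])> {
    // A point needs 4 bytes; bail out on short input instead of panicking.
    if input.len() < 4 {
        return None;
    }
    let x = u16::from_le_bytes([input[0], input[1]]);
    let y = u16::from_le_bytes([input[2], input[3]]);
    Some(((x, y), &input[4..]))
}
```

This works, but all the bounds checking and slicing is manual, which is exactly the bookkeeping the libraries above take off your hands.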

(If you know of others, let me know on twitter - @jam1garner)

Design Philosophy

Admittedly, I'm not great at articulating this sort of thing, so don't expect a concise ordered list of values. The main "benefit" of binread, in my opinion, is that it is declarative. Binary parsing is ultimately just defining a structure and then applying transformations to convert it into a more usable form. Which raises the question: why are declarations of structure being written as imperative code?

Let's take a look at what I mean. Here's how different libraries handle the same (simple) structure:

nom:

use nom::{number::complete::{le_u16, le_u32}, sequence::tuple, IResult};

fn take_point(input: &[u8]) -> IResult<&[u8], (u16, u16)> {
    tuple((le_u16, le_u16))(input)
}

fn take_rect(input: &[u8]) -> IResult<&[u8], ((u16, u16), (u16, u16))> {
    tuple((take_point, take_point))(input)
}

fn take_sprite(input: &[u8]) -> IResult<&[u8], Sprite> {
    // Parse the three fields in order, then build the struct from them
    let (input, (sprite_bounds, size, texture_index)) =
        tuple((take_rect, take_point, le_u32))(input)?;

    Ok((input, Sprite { sprite_bounds, size, texture_index }))
}

While this is fast, accurate, and plenty capable/extensible, it... is awfully verbose. Even though we're only defining a simple structure of consecutive primitives this is viciously long to type out and required checking documentation more than I'd like.

Some other criticisms:

  • The nature of implementing a parser/combinator system purely using Rust's type system creates an error handling system that is, at least to me, unintuitive.
  • nom's dedication to supporting any type of stream (character, byte, bit, etc.) further complicates things
  • Just to write the simplest parser, you either have to blindly follow the examples or have a solid grasp of functions that return closures, which isn't the most beginner-friendly.

byteorder:

use std::io;
use byteorder::{LittleEndian, ReadBytesExt};

fn read_point<R: io::Read>(reader: &mut R) -> io::Result<(u16, u16)> {
    Ok((
        reader.read_u16::<LittleEndian>()?,
        reader.read_u16::<LittleEndian>()?,
    ))
}

// A rect is two points, so the return type is a pair of pairs
fn read_rect<R: io::Read>(reader: &mut R) -> io::Result<((u16, u16), (u16, u16))> {
    Ok((read_point(reader)?, read_point(reader)?))
}

fn read_sprite<R: io::Read>(reader: &mut R) -> io::Result<Sprite> {
    Ok(Sprite {
        sprite_bounds: read_rect(reader)?,
        size: read_point(reader)?,
        texture_index: reader.read_u32::<LittleEndian>()?,
    })
}

Similarly, also a bit verbose for my taste. And heavy on the oft-awkward turbofish syntax. And, frankly, writing reader for everything hurts. Extremely redundant, even if I agree with the design decisions that led to this.

Why declarative?

Now, the point I am making is: imperative/functional-style APIs for parsing will, by nature, be overly verbose. So, I propose a third paradigm: declarative. Here's an example of a (primarily) declarative parsing scheme—010 editor binary templates:

struct Point {
    u16 x;
    u16 y;
};

struct Rect {
    Point pos;
    Point size;
};

struct Sprite {
    Rect bounds;
    Point size;
    u32 texture_index;
} sprite;

This is all that is needed to parse it—a structure definition. And, optionally, 010 editor supports things like using if statements to gate declarations and other fancy things. But at its core, it is just a struct definition, which we already needed to define for the Rust code anyway (and the parsing code on top of that!). And so, my solution is to use the struct definition as the primary mechanism for writing the parser using a derive macro.

The idea, in action

use binread::{BinRead, BinReaderExt};
use std::fs::File;

#[derive(BinRead)]
#[br(little)]
struct Point(u16, u16);

#[derive(BinRead)]
struct Rect(Point, Point);

#[derive(BinRead)]
struct Sprite {
    sprite_bounds: Rect,
    size: Point,
    texture_index: u32,
}

fn main() {
    // The reader must be mutable, since reading advances it
    let mut file = File::open(/*...*/).unwrap();

    // BinRead provides a reader-extension, similar to byteorder
    let sprite: Sprite = file.read_be().unwrap();
}

These are the basics of binread: define a structure and it generates the parser for you. On its own, though, this only works for fixed-size data composed of primitives. binread takes the concept further by letting you use attributes to control how each field is parsed.
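For a rough idea of the kind of code a derive macro saves you from writing, here is a hand-written, std-only sketch of a parser for the same three structs. The ReadLe trait is a stand-in invented for this illustration; it is an assumption, not binread's actual trait or the code its macro really generates.

```rust
use std::io::{self, Read};

// Hypothetical trait standing in for what a derive macro could implement.
// binread's real trait and generated code are different.
trait ReadLe: Sized {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self>;
}

impl ReadLe for u16 {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0u8; 2];
        reader.read_exact(&mut buf)?;
        Ok(u16::from_le_bytes(buf))
    }
}

impl ReadLe for u32 {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        Ok(u32::from_le_bytes(buf))
    }
}

#[derive(Debug, PartialEq)]
struct Point(u16, u16);

#[derive(Debug, PartialEq)]
struct Rect(Point, Point);

#[derive(Debug, PartialEq)]
struct Sprite {
    sprite_bounds: Rect,
    size: Point,
    texture_index: u32,
}

// Each impl just reads the fields in declaration order, which is the
// mechanical part a derive macro can produce from the struct definition.
impl ReadLe for Point {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self> {
        Ok(Point(u16::read_le(reader)?, u16::read_le(reader)?))
    }
}

impl ReadLe for Rect {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self> {
        Ok(Rect(Point::read_le(reader)?, Point::read_le(reader)?))
    }
}

impl ReadLe for Sprite {
    fn read_le<R: Read>(reader: &mut R) -> io::Result<Self> {
        Ok(Sprite {
            sprite_bounds: Rect::read_le(reader)?,
            size: Point::read_le(reader)?,
            texture_index: u32::read_le(reader)?,
        })
    }
}
```

Every struct impl here repeats information the struct definition already contains, which is why generating it from the definition is attractive.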

Some examples of this:

  • the parse_with attribute is an escape hatch of sorts to allow for providing a custom parser function
  • enums can be used to try multiple parsers until one parses successfully
  • the if attribute can be used to only parse an Option<T> if a condition (which can involve previously read fields from the same struct!) is true
  • the count attribute can be used to provide an expression to decide how many items to read in a Vec
  • the provided FilePtr wrapper type can be used to read an offset and then read the struct it points to
  • the assert attribute allows for providing sanity checks for error handling
  • and more! see the full list of all the attributes here
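As a rough mental model for a few of these attributes, the sketch below hand-writes what count, if, and assert style behavior could look like using only std. The Record layout, its field names, and the helper functions are all hypothetical, and binread's generated code will differ.

```rust
use std::io::{self, Read};

fn read_u8<R: Read>(r: &mut R) -> io::Result<u8> {
    let mut buf = [0u8; 1];
    r.read_exact(&mut buf)?;
    Ok(buf[0])
}

fn read_u16_le<R: Read>(r: &mut R) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    r.read_exact(&mut buf)?;
    Ok(u16::from_le_bytes(buf))
}

// Hypothetical record: a flags byte, a count byte, `count` u16 items, and
// a trailing u16 that is only present when bit 0 of `flags` is set.
#[derive(Debug, PartialEq)]
struct Record {
    flags: u8,
    count: u8,
    items: Vec<u16>,
    extra: Option<u16>,
}

fn read_record<R: Read>(r: &mut R) -> io::Result<Record> {
    let flags = read_u8(r)?;

    // "assert"-style sanity check: only bit 0 is allowed to be set.
    if (flags & !1) != 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unknown flag bits"));
    }

    let count = read_u8(r)?;

    // "count"-style field: an earlier field decides how many items to read.
    let mut items = Vec::with_capacity(count as usize);
    for _ in 0..count {
        items.push(read_u16_le(r)?);
    }

    // "if"-style field: only parsed when a condition on earlier fields holds.
    let extra = if (flags & 1) != 0 {
        Some(read_u16_le(r)?)
    } else {
        None
    };

    Ok(Record { flags, count, items, extra })
}
```

With attributes, each of those three hand-written blocks shrinks to an annotation on the field it affects.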

Here's what a more complex binread parser looks like:

use binread::{BinRead, FilePtr32, NullString};

#[derive(BinRead, Debug)]
#[br(big, magic = b"TEST")]
struct TestFile {
    // parse_with being used for a custom parser to read an offset from the file, then read a
    // null-terminated UTF-8 string from that offset
    #[br(parse_with = FilePtr32::parse)]
    string: NullString,

    len: u32,

    #[br(count = len - 1)]
    value: Vec<f32>,

    #[br(calc = string.len())] // calc parses nothing but stores the result of an expression
    ptr_len: usize,

    #[br(try)] // try attempts to parse, stores None if it fails
    test: Option<[u64; 30]>,
}
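The string field is the least obvious part of that example, so here is a std-only sketch of the general idea of following a file pointer: read an offset, seek to it, read a null-terminated string, then seek back. The helper name and the choice to restore the reader position afterwards are assumptions made for illustration, not a description of binread's internals.

```rust
use std::io::{self, Read, Seek, SeekFrom};

// Hypothetical helper: reads a big-endian u32 offset at the current
// position, reads NUL-terminated bytes at that offset, then returns the
// reader to just after the offset field.
fn read_string_at_ptr<R: Read + Seek>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    let offset = u64::from(u32::from_be_bytes(buf));

    // Remember where to resume parsing the rest of the struct.
    let resume = r.stream_position()?;
    r.seek(SeekFrom::Start(offset))?;

    // Collect bytes until the NUL terminator (the terminator is dropped).
    let mut out = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        r.read_exact(&mut byte)?;
        if byte[0] == 0 {
            break;
        }
        out.push(byte[0]);
    }

    r.seek(SeekFrom::Start(resume))?;
    Ok(out)
}
```

Needing Seek as well as Read is the price of pointer-following: the parser has to jump around the input rather than consume it front to back.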

List of features:

  • no_std compatibility (just disable the std feature)
  • support for both derive-macro and imperative usage
  • parser for all built-in types
  • minimal dependencies (proc_macro2, quote, syn, and nothing else)
  • support for generating an 010 binary template to help debug parsers in a hex editor with type names and highlighting
  • ability to pass data to child structs when parsing
  • error handling
  • support for structs and enums
  • helper types such as FilePtr, PosValue, and NullString

Want to learn more or try it for yourself? Check out the documentation.

Github: https://github.com/jam1garner/binread (issues/PRs welcome)

Crates.io: http://crates.io/crates/binread

