Optimizing a Parser Combinator into a memcpy

posted by Craig Gidney on July 16, 2013

In this post: I describe how to dynamically optimize high level descriptions of parsing. In some cases the parsing can be optimized down to a memcpy, easily crushing typical hand-rolled parsers in performance without sacrificing safety or succinctness.

The source code of a library that does these optimizations is available on GitHub, though I make no guarantees about correctness in general since I mostly wrote it over the weekend.

Parser Combinators

Parser combinators are a concept I’ve mentioned in passing several times in previous posts. They allow you to combine basic parsers, using methods called combinators, to create extremely elaborate parsers without having to deal with the low level details.

Although parser combinators are typically applied to parsing textual data, I actually stumbled onto them when dealing with binary WarCraft 3 packets. I was getting tired of writing code to parse packets from data into a usable form, and then writing the same basic thing in reverse to go from the usable form back to data. Eventually I realized that a high level description of a packet, like “an int called ‘X’, then an int called ‘Y’, then a float called ‘duration'” or “a byte called ‘group index’, then a short that determines how many times a GameObjectId is repeated called ‘targets'”, could be used to do both parsing and packing.

(At the time I had no idea these things already existed. I just noticed code duplication and realized there was a beautiful way to solve it. I called the descriptions ‘jars’ since, as a pythonista would say, they were used to ‘pickle’ values. Story of my life: come up with neat idea, find out it was already discovered in the 70’s and given a cooler name to boot.)

Parsing would be done by interpreting the description as instructions for reading data, and packing would be done by interpreting the description as instructions for writing data. As a nice bonus, writing the descriptions was a lot less error prone than writing the parsing/packing code directly. Even better, the descriptions were significantly easier to read and understand. I really can’t stress how much more readable parser combinators are: I no longer needed separate notes on the binary formats, because the code was clearer than my notes!

Parser combinators have their tradeoffs, of course. Because you’re wiring them up at runtime, they tend to involve a large number of virtual calls. If your compiler/jitter isn’t specialized for optimizing that sort of thing (see: C#), parser combinators can easily be an order of magnitude slower than the equivalent hand-written code.

However… we don’t actually have to rely on the compiler to do the optimization, right? I’ve mentioned before that, in .Net, you can use expression trees to perform your own optimizations at runtime. Why not apply the same technique to parser combinators?

Getting Started: Linq-To-Parsers

Before we get into optimizing, it’s useful to have some really basic functionality to compare against. Parsers are monads and C#’s query syntax makes working with monads really nice, so lets start with just parsing numbers and using the query syntax for applying parsers in a sequence.

We need to define what a parser is. The simplest possible thing is probably “takes data, produces a value, and tells you how much data is consumed”, so that’s what we’ll do:

interface IParser<T> {
    ParsedValue<T> Parse(ArraySegment<byte> data);
}
struct ParsedValue<T> {
    public readonly T Value;
    public readonly int Consumed;
    public ParsedValue(T value, int consumed) {
        this.Value = value;
        this.Consumed = consumed;
    }
}

We’ll need some basic parsers. Here’s a parser for 32-bit integers:

struct Int32Parser2 : IParser<Int32> {
    public ParsedValue<Int32> Parse(ArraySegment<byte> data) {
        if (data.Count < 4) throw new ArgumentException();
        // warning: we're not dealing with endian-ness, but this is just an example
        var value = BitConverter.ToInt32(data.Array, data.Offset);
        return new ParsedValue<Int32>(value, 4);
    }
}

We also need a way to combine these parsers. Since we want to use the query syntax, we have to implement Select and SelectMany. Here’s an implementation of the Select method, using an anonymous implementation class:

static IParser<TOut> Select<TIn, TOut>(this IParser<TIn> parser, Func<TIn, TOut> projection) {
    return new AnonymousParser<TOut>(data => {
        var r = parser.Parse(data);
        return new ParsedValue<TOut>(projection(r.Value), r.Consumed);
}

(I’ll just link to the slightly more complicated SelectMany method)

Breather.

We now have a parser for 32-bit integers, and a way to combine parsers into a sequence by using the query syntax. This is enough to start making, combining, and using parsers.

Here’s what usage looks like right now:

IParser<int> intParser = new Int32Parser();

// a serialized Point3 is three contiguous serialized ints in x,y,z order
var point3Parser = from x in intParser
                   from y in intParser
                   from z in intParser
                   select new Point3(x, y, z);

var data = new ArraySegment<byte>(new byte[] { 1,0,0,0, 2,0,0,0, 3,0,0,0 });
var p = point3Parser.Parse(data).Value;
// p is now a Point3 with X=1,Y=2,Z=3

I’d really like to benchmark parsing performance over large numbers of contiguous points, so I’m also going to define a ‘RepeatUntilEndOfData’ combinator. When you apply it to a parser, it will parse values into a list and won’t stop until all data has been consumed.

Here’s what usage of RepeatUntilEndOfData looks like:

IParser<IReadOnlyList<Point3>> bulkPoint3Parser = point3Parser.RepeatUntilEndOfData()

// twelve bytes repeated ten thousand times
var repeatCount = 10000;
var data = new ArraySegment<byte>(
    Enumerable.Repeat(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }, 10000)
    .SelectMany(e => e)
    .ToArray());

var result = bulkPoint3Parser.Parse(data).Value;

The above bulk parser will be our baseline for performance. On my machine, I benchmark it as parsing at about 30 MB/s. More than enough for most purposes, but let’s try to do better.

Hand rolling

How much slower is the query’d parser combinator, compared to a manually written parser? I made a hand rolled implementation:

private static ParsedValue<IReadOnlyList<Point3>> HandrolledParse(ArraySegment<byte> data) {
    if (data.Count % 12 != 0) throw new ArgumentException();
    var count = data.Count/12;
    var r = new Point3[count];
    var j = data.Offset;
    var a = data.Array;
    for (var i = 0; i < count; i++) {
        r[i] = new Point3(
            BitConverter.ToInt32(a, j + 0),
            BitConverter.ToInt32(a, j + 4),
            BitConverter.ToInt32(a, j + 8));
        j += 12;
    }
    return new ParsedValue<IReadOnlyList<Point3>>(r, data.Count);
}

I benchmarked the above parser on the same data. It parses at about 400MB/s, which is over an order of magnitude faster than the (unoptimized) query’d parser combinator.

We’ve got a lot of ground to make up.

Reflection and Expression Trees

As I mentioned earlier, I’ve previously discussed optimizing at runtime using expression trees. I won’t bother covering the basics again here. Suffice it say that instead of using a method that parses data, we’re going to use a method that generates a method to parse data. The generated method can be specialized based on the parser, because all the relevant information about what needs to be parsed and how to do it is available to work with.

One issue I have with expression tree code is that it’s gross. There’s definitely a better way to do this, but this is the tool I have for now. I’m tempted to just link to the relevant code, but that would feel like a bit of a cop-out since the optimization is the focus of the post. Thus, here is the code used to augment an item parser into a bulk parser capable of (hopefully quickly) parsing many items:

var dataArray = Expression.Parameter(typeof(byte[]), "dataArray");
var dataOffset = Expression.Parameter(typeof(int), "dataOffset");
var dataCount = Expression.Parameter(typeof(int), "dataCount");
var itemCount = Expression.Parameter(typeof(int), "itemCount");
var parameters = new[] { dataArray, dataOffset, dataCount, itemCount };

var resultArray = Expression.Variable(typeof(T[]), "resultArray");
var resultConsumed = Expression.Variable(typeof(int), "totalConsumed");
var loopIndex = Expression.Variable(typeof(int), "i");

var inlinedParseInfo = _itemParser.MakeParseFromDataExpression(dataArray, Expression.Add(dataOffset, resultConsumed), Expression.Subtract(dataCount, resultConsumed));
var inlinedParsePerform = inlinedParseInfo.Item1;
var inlinedParseResultVariables = inlinedParseInfo.Item2;
var parsedItem = Expression.Variable(inlinedParsePerform.Type, "parsed");
var parsedItemValue = _itemParser.MakeGetValueFromParsedExpression(parsedItem);
var parsedItemConsumed = _itemParser.MakeGetConsumedFromParsedExpression(parsedItem);

var locals = new[] { resultArray, resultConsumed, loopIndex, parsedItem };
var initStatements = Expression.Block(
    Expression.Assign(resultArray, Expression.NewArrayBounds(typeof (T), itemCount)),
    Expression.Assign(resultConsumed, Expression.Constant(0)),
    Expression.Assign(loopIndex, Expression.Constant(0)));
            
var loopExit = Expression.Label();
var loopStatements = Expression.Loop(
    Expression.Block(
        inlinedParseResultVariables,
        Expression.IfThen(Expression.GreaterThanOrEqual(loopIndex, itemCount), Expression.Break(loopExit)),
        Expression.Assign(parsedItem, inlinedParsePerform),
        Expression.AddAssign(resultConsumed, parsedItemConsumed),
        Expression.Assign(
            Expression.ArrayAccess(
                resultArray, 
                Expression.PostIncrementAssign(loopIndex)), 
            parsedItemValue)),
    loopExit);

var result = Expression.New(
    typeof(ParsedValue<IReadOnlyList<T>>).GetConstructor(new[] { typeof(IReadOnlyList<T>), typeof(int) }).NotNull(),
    resultArray,
    resultConsumed);

var body = Expression.Block(
    locals,
    initStatements,
    loopStatements,
    result);

var method = Expression.Lambda<Func<byte[], int, int, int, ParsedValue<IReadOnlyList<T>>>>(
    body,
    parameters);

return method.Compile();

Ok, that’s a pretty long method. It’s also really hard to split apart, because all the later stuff references all the earlier stuff. The basic idea is that we’re creating a method that calls an item parser a given number of times, with the caveat that it may inline the item parser into the loop. Inlining is particularly useful when the parser always consumes the same length (e.g. 32-bit integers are always 4 bytes), because it allows the code to avoid constantly creating and disassembling instances of ParsedValue.

There’s another large make-compiled-method method, but for combining several parsers for the fields of a class into a parser for the class. CompiledReflectionParser is a lot more involved, since it must do things like match up named parsers with fields and constructor parameters and so forth, so I’ll settle for just linking it.

Incidentally, it turns out that getting the names of variables out of a query expression is extremely error prone and dependent-on-compiler-details-that-might-change-so-don’t-do-it. In order to make the names available, I switched to using the collection initializer syntax. Here’s what that looks like:

var compiledParser =
    new Parse.Builder<Point3> {
        {"y", Parse.Int32LittleEndian}, // <-- this y is out of order to prevent a further optimization
        {"x", Parse.Int32LittleEndian}, // <-- I'll talk about it in a minute
        {"z", Parse.Int32LittleEndian}
    }.Build()
     .RepeatUntilEndOfData();

The Build and RepeatUntilEndOfData methods recognize when reflection and expression trees can be applied, and do so. At runtime the above parser combinator gets automatically optimized and its components inlined to create (basically) this parsing code:

Point3[] resultArray;
int totalConsumed;
int i;
Point3 parsed;

resultArray = new Point3[itemCount];
totalConsumed = 0;
i = 0;
while (true) {
    int total,;
    Point3 result;
    if (i >= itemCount) break;
    
    int y;
    int x;
    int z;
    total = 0;
    y = BitConverter.ToInt32(dataArray, dataOffset + totalConsumed + total);
    total += 4;
    x = BitConverter.ToInt32(dataArray, dataOffset + totalConsumed + total);
    total += 4;
    z = BitConverter.ToInt32(dataArray, dataOffset + totalConsumed + total);
    total += 4;
    result = new Point3(x, y, z);
    
    parsed = result;
    totalConsumed += total;
    resultArray[i++] = result;
}
return new ParsedValue<IReadOnlyList<Point3>>(resultArray, totalConsumed);

The above code looks kind of dumb in a few places, and does some unnecessary additions, but is pretty close to what I wrote by hand.

How fast is the compiled combinator? On my machine it parses at about 300 MB/s. That's an order of magnitude faster than the unoptimized query combinator, and almost as fast as the hand-written parser! There's still some ground to make up, but we're not done yet.

Seat Belts Off

I haven't actually shown the declaration of the Point3 type we've been parsing this whole time. Here it is:

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct Point3 {
    public readonly int X;
    public readonly int Y;
    public readonly int Z;
    public Point3(int x, int y, int z) {
        X = x;
        Y = y;
        Z = z;
    }
}

The important bit to notice is the StructLayout attribute.

When you decorate a struct with StructLayout and specify LayoutKind.Sequential, the struct's represention in memory is guaranteed to follow the declared order of fields. Adding Pack=1 further guarantees there's no padding between the fields. So the struct will be represented in memory as three contiguous integers. Furthermore, arrays of the struct should also place the values side by side, so that the memory of the array essentially goes XYZXYZXYZXYZ...

Hey, isn't that the same representation described by the following parser combinator?

var blitParser =
    new Parse.Builder {
        {"x", Parse.Int32LittleEndian},
        {"y", Parse.Int32LittleEndian},
        {"z", Parse.Int32LittleEndian}
    }.Build()
     .RepeatUntilEndOfData();

It is! Furthermore, we can write code that detects when this happens:

bool CanBlitParseWith(IReadOnlyList<IFieldParser> fieldParsers) {
    if (fieldParsers == null) throw new ArgumentNullException("fieldParsers");

    // type has blittable representation?
    if (!Util.IsBlittable<T>()) return false;

    // all parsers have same constant length representation as value in memory?
    if (fieldParsers.Any(e => !e.AreMemoryAndSerializedRepresentationsOfValueGuaranteedToMatch())) return false;
    if (fieldParsers.Any(e => !e.OptionalConstantSerializedLength().HasValue)) return false;

    // type has no padding?
    var structLayout = typeof(T).StructLayoutAttribute;
    if (structLayout == null) return false;
    if (structLayout.Value != LayoutKind.Sequential) return false;
    if (structLayout.Pack != 1) return false;

    // parsers and struct fields have matching canonical names?
    var serialNames = fieldParsers.Select(e => e.CanonicalName);
    var fieldNames = typeof(T).GetFields().Select(e => e.CanonicalName());
    if (!serialNames.HasSameSetOfItemsAs(fieldNames)) return false;

    // offsets implied by parser ordering matches offsets of the struct's fields?
    var memoryOffsets =
        typeof(T).GetFields().ToDictionary(
            e => e.CanonicalName(),
            e => (int?)typeof(T).FieldOffsetOf(e));
    var serialOffsets =
        fieldParsers
            .StreamZip((int?)0, (a, e) => a + e.OptionalConstantSerializedLength())
            .ToDictionary(e => e.Item1.CanonicalName, e => e.Item2 - e.Item1.OptionalConstantSerializedLength());
    if (!serialOffsets.HasSameKeyValuesAs(memoryOffsets)) return false;

    return true;
}

Then, after we cross our fingers and promise to be careful, we can even use ILEmit to generate unsafe code that allocates an array and copies memory over its contents:

public static UnsafeArrayBlitParser<T> MakeUnsafeArrayBlitParser<T>() {
    var d = new DynamicMethod(
        name: "BlitParseArray" + typeof(T),
        returnType: typeof(T[]),
        parameterTypes: new[] { typeof(byte[]), typeof(int), typeof(int), typeof(int) },
        m: Assembly.GetExecutingAssembly().ManifestModule);

    // ____(byte[] array, int count, int offset, int length)
    var g = d.GetILGenerator();

    // T[] result;
    g.DeclareLocal(typeof(T[]));

    // void* resultPtr;
    g.DeclareLocal(typeof(void*), true);

    // result = new T[count];
    g.Emit(OpCodes.Ldarg_1);
    g.Emit(OpCodes.Newarr, typeof(T));
    g.Emit(OpCodes.Stloc_0);

    // fixed (void* resultPtr = result)
    g.Emit(OpCodes.Ldloc_0);
    g.Emit(OpCodes.Ldc_I4_0);
    g.Emit(OpCodes.Ldelema, typeof(T));
    g.Emit(OpCodes.Stloc_1);

    // Marshal.Copy(array, offset, (IntPtr)resultPtr, length);
    g.Emit(OpCodes.Ldarg_0);
    g.Emit(OpCodes.Ldarg_2);
    g.Emit(OpCodes.Ldloc_1);
    g.Emit(OpCodes.Conv_I);
    g.EmitCall(OpCodes.Call, typeof(IntPtr).GetMethod("op_Explicit", new[] { typeof(void*) }), null);
    g.Emit(OpCodes.Ldarg_3);
    g.EmitCall(OpCodes.Call, typeof(Marshal).GetMethod("Copy", new[] { typeof(byte[]), typeof(int), typeof(IntPtr), typeof(int) }), null);

    // return result
    g.Emit(OpCodes.Ldloc_0);
    g.Emit(OpCodes.Ret);

    return (UnsafeArrayBlitParser<T>)d.CreateDelegate(typeof(UnsafeArrayBlitParser<T>));
}

I think you see where this is going.

When the memory representation of the array is guaranteed to match the representation of the serialized data, we can steal a trick from C programmers and 'parse' by just bliting the bits. This is fast, and it works despite the fact that the fields are 'readonly'.

How fast is a blit parser? Instead of parsing at the baseline 30 MB/s set by the unoptimized query'd parser, the blit parser proceeds at 3 GB/s. That's two orders of magnitude faster and, despite the fact that we started from a high level description, an order of magnitude faster than the hand written C# parser!

This is an example of the benefits of abstraction. Parser combinators describe things at such a high level that we can easily optimize them and, suddenly, making a super-efficient parser is as easy as applying an attribute and being careful about the order of your fields. Plus, if you get things wrong, it doesn't all blow up in your face: the optimization is simply not applied. (Well.. depending on the application that might count as blowing up in your face, but at least it doesn't give wrong results or scribble over random bits of memory.)

Summary

Parser combinators describe how to parse data at a high level. Normally they incur a speed penalty but, by optimizing and compiling them at runtime (analogous to the typical usage of regular expressions), we can regain most of that loss. In some cases, we can even end up with a parser that runs faster than anything safe you'd write by hand.

If you want to delve into the details or try things out, relevant source code is available on GitHub.

If you're interesting in using parser combinators in practice, you can check out FParseq for F#, attoparsec for Haskell, or Scala's Parsers. Unfortunately, I don't know a well established library for C# (Update: a commenter mentioned NParsec and Sprache for C#).

Update: Fixed links to source. They now point to the blob instead of the path, so me renaming things shouldn't break them again.

Discuss on Reddit

Comments are closed.

Twisted Oak Studios offers consulting and development on high-tech interactive projects. Check out our portfolio, or Give us a shout if you have anything you think some really rad engineers should help you with.

Optimizing a Parser Combinator into a memcpy