
Optimizing Just in Time with Expression Trees


posted by Craig Gidney on February 19, 2013

One of the steps for logging into Battle.net involves the client hashing some game files in a basic challenge-response (a.k.a. proof of knowledge) scheme referred to as the “revision check”. I implemented this functionality, years ago, as part of working on Tinker (a WarCraft 3 game hosting bot) and optimized my implementation by generating specialized code at runtime.

What’s interesting about WarCraft 3’s revision check is that the challenge sent by the server is the hash function itself, as opposed to a prefix/suffix (Blizzard’s crypto is always “interesting”, sometimes with hilarious results). Each time a client connects, the server sends a description of a different hash function to apply. For example, you might receive the challenge “A=443747131 B=3328179921 C=1040998290 4 A=A^S B=B-C C=C^A A=A+B”, which describes how to initialize the hash function’s state (A,B,C) and the four operations to apply for each value of S read from the game files.

There are other details relevant to computing the revision check… but this isn’t a post about the peculiarities of WarCraft 3 (though I could talk a lot about that).

In this post I’ll discuss how I used .Net’s expression trees to dynamically generate optimized code for hash functions not known until runtime (cutting multiple seconds off of Tinker’s login time).

Switch Inside

The simplest way to evaluate a hash function described at runtime is by interpreting each instruction. When you have to apply an operation, switch over the possible types of operation, and apply the right thing in each case. That is to say, you implement a super-basic interpreter.

For example, suppose we have a hash function described by the following structure:

struct DynamicHasher {
    public int[] InitialState; // first index (0) used for input, last index used for output
    public Step[] Steps;
    
    public enum Operation {
        Add,
        Multiply,
        Subtract,
        Xor,
        Or,
        And
    }
    public struct Step {
        public int LeftInputIndex;
        public int RightInputIndex;
        public int OutputIndex;
        public Operation Operation;
    }
}

To compute the result of the hash function, we loop over the input data, feeding it into the first value of the state array. For each input value, we loop over the steps, determine what each one does, and apply its effect. Once we’re done, the result is the last value in the state array. For reference, the inner loop of the full interpret method looks like this:

// read
var lhs = state[step.LeftInputIndex];
var rhs = state[step.RightInputIndex];

// eval
int result;
switch (step.Operation) {
case Operation.Add:
    unchecked {
        result = lhs + rhs;
    }
    break;
case Operation.Multiply:
    unchecked {
        result = lhs * rhs;
    }
    break;
case Operation.Subtract:
    unchecked {
        result = lhs - rhs;
    }
    break;
case Operation.Xor:
    result = lhs ^ rhs;
    break;
case Operation.Or:
    result = lhs | rhs;
    break;
case Operation.And:
    result = lhs & rhs;
    break;
default:
    throw new InvalidOperationException();
}

// write
state[step.OutputIndex] = result;

The above code applies each operation by switching to the appropriate code to apply. This is the strategy I initially used to compute WC3’s revision check. Unfortunately, at the time, the equivalent of the above code took several seconds to compute (dominating the time required to login).
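
For context, here’s a minimal sketch of the full interpret method that the inner loop above sits inside. The IntStream type, with an int Read(int[] buffer) method that returns the number of values read, is an assumption based on the decompiled output shown later in this post:

public int Interpret(IntStream stream) {
    var state = (int[])InitialState.Clone();
    var buffer = new int[0x1000];
    while (true) {
        // read the next chunk of input; a zero-length read means we're done
        var readCount = stream.Read(buffer);
        if (readCount == 0) return state[state.Length - 1]; // last index is the output
        for (var i = 0; i < readCount; i++) {
            state[0] = buffer[i]; // first index is the input
            foreach (var step in Steps) {
                // ... the read/switch/write code shown above ...
            }
        }
    }
}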

The problem here is the inner loop. We’re branching many times, unnecessarily, for every single input value. The jit compiler is smart enough to optimize the switch statement into a jump table (I checked. See the disassembly.), but not smart enough to take advantage of the fact that the same branches will be taken again and again and again in an exploitable pattern.

If the number of possible branches in the inner loop were small, we could pull the branching outside of the loop. Instead of switching over possible operations inside the loop, we would switch over each possible operation (outside of the loop) and then, within each case, loop over the data while applying the right operations. We can’t do this here because there are too many branch possibilities in the inner loop. We’d need hundreds of loops to cover all the combinations of, say, four operations. Writing them all out ahead of time would be absurd. Instead, we will wait until the operations are known (at runtime) and then write them out just in time.

Using Expression Trees

.Net has always supported emitting code at runtime (At least, I think it has. The documentation for Using Reflection Emit goes back to v1.1). However, things got a lot easier in .Net 3.5 with the introduction of expression trees. Instead of directly emitting instructions in the Common Intermediate Language, you could now describe higher level constructs and have the framework emit appropriate CIL for you.

Most of the types related to expression trees are in the System.Linq.Expressions namespace, although you’ll also use types from System.Reflection. Expressions are created via the static factory methods on the System.Linq.Expressions.Expression class (instead of with constructors), so that’s by far the best place to start ‘dotting around’. Still, there are a lot of little details that are useful to know (a small example follows the list):

  • Use Expression.Constant to convert a ‘normal’ value into an expression.
  • The value of an expression block is the last expression in the block.
  • The value returned from a function is the last expression in its body.
  • You can define variables/parameters with Expression.Variable/Parameter, but remember to pass them into the scope of an enclosing block or lambda if you want to actually use them.
  • Exiting a loop requires an Expression.Label (pass it to the loop expression and also to the break statement expression).
  • Calling a method requires getting its MethodInfo via the standard reflection API.
  • Use Expression.Lambda<DelegateType> to create an expression you can compile into a delegate.
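
To make those points concrete, here’s a small self-contained example (mine, not from the original post) that compiles a delegate computing 0 + 1 + … + (n-1); it exercises parameters, variables, block scoping, a loop, and a labeled break:

using System;
using System.Linq.Expressions;

static Func<int, int> MakeSummer() {
    var n = Expression.Parameter(typeof(int), "n");  // the lambda's parameter
    var i = Expression.Variable(typeof(int), "i");   // locals start at default(int), i.e. 0
    var total = Expression.Variable(typeof(int), "total");
    var done = Expression.Label(typeof(int));        // exit label; its type becomes the loop's value type
    var body = Expression.Block(
        new[] { i, total },                          // pass the variables into the block's scope
        Expression.Loop(
            Expression.IfThenElse(
                Expression.LessThan(i, n),
                Expression.Block(
                    Expression.AddAssign(total, i),
                    Expression.PostIncrementAssign(i)),
                Expression.Break(done, total)),      // exit the loop, yielding total as its value
            done));
    return Expression.Lambda<Func<int, int>>(body, n).Compile();
}

For example, MakeSummer()(5) returns 10.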

Let’s create an expression that pulls that switch statement out of the loop. This is easy: just return the right type of expression, based on the type of operation. Instead of describing how to branch over each possible operation type, we’re describing how to apply only the exact operation that will be used:

switch (step.Operation) {
case Operation.Add: return Expression.Add(lhs, rhs);
case Operation.Multiply: return Expression.Multiply(lhs, rhs);
case Operation.Subtract: return Expression.Subtract(lhs, rhs);
case Operation.Xor: return Expression.ExclusiveOr(lhs, rhs);
case Operation.Or: return Expression.Or(lhs, rhs);
case Operation.And: return Expression.And(lhs, rhs);
default: throw new InvalidOperationException();
}

This is a neat trick, and we can take it even further. We know exactly what operations need to be applied, so why should we loop over a collection of them? We can just hard code the operators to apply in each iteration. This unrolls the loop into a statement block without any branches:

var runAllStepsBlock = Expression.Block(Steps.Select(step => {
    var lhs = stateVars[step.LeftInputIndex];
    var rhs = stateVars[step.RightInputIndex];
    Expression result = ...; // the switch over operation types, shown above
    return Expression.Assign(stateVars[step.OutputIndex], result);
}));

We’ve transformed our inner loop from a large set of switch cases into straight-line code: it isn’t even a loop anymore.

You might have noticed that the lhs/rhs expressions were determined by accessing an array, instead of describing an access to an array. This is another optimization, made possible because we know the size of the hash function’s state when generating the expression. There’s no need to use an array anymore, because we can define exactly the right number of local variables. This optimization will elide almost all the dereferences occurring in the inner loop (and looks a bit like “pulling out the call to ToArray”):

var stateVars = hash.InitialState.Select(_ => Expression.Variable(typeof(int))).ToArray();
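
Those variables then have to be initialized to the hash function’s known starting values, which we can bake in as constants. A minimal sketch (the initStateBlock name is mine):

var initStateBlock = Expression.Block(
    hash.InitialState.Zip(stateVars, (value, variable) =>
        (Expression)Expression.Assign(variable, Expression.Constant(value))));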

With these optimizations in hand, we can implement the rest of the ‘run hash’ method expression. This is relatively straightforward, at least in hindsight, involving describing the loops and method invocations without any clever optimizations at all. The Specialize method is a bit too long to comfortably inline here, but does make a good example of how to create a delegate via an expression tree.
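
The actual method isn’t reproduced here, but the following condensed sketch shows roughly what it looks like. This is my reconstruction rather than the post’s code: the IntStream type (with an int Read(int[] buffer) method returning the number of values read) is an assumption based on the decompiled output below, and MakeOperation is just the operation switch from the previous section wrapped in a helper method:

public Func<IntStream, int> Specialize() {
    var stream = Expression.Parameter(typeof(IntStream), "stream");
    var stateVars = InitialState.Select(_ => Expression.Variable(typeof(int))).ToArray();
    var buffer = Expression.Variable(typeof(int[]), "buffer");
    var readCount = Expression.Variable(typeof(int), "readCount");
    var i = Expression.Variable(typeof(int), "i");
    var done = Expression.Label(typeof(int)); // outer loop exits with the hash result
    var innerDone = Expression.Label();

    // the unrolled, branch-free step block from earlier
    var runAllStepsBlock = Expression.Block(Steps.Select(step =>
        (Expression)Expression.Assign(
            stateVars[step.OutputIndex],
            MakeOperation(step.Operation,
                          stateVars[step.LeftInputIndex],
                          stateVars[step.RightInputIndex]))));

    var body = Expression.Block(
        stateVars.Concat(new[] { buffer, readCount, i }),
        // initialize the state variables to their known constant starting values
        Expression.Block(InitialState.Zip(stateVars,
            (value, variable) => (Expression)Expression.Assign(variable, Expression.Constant(value)))),
        Expression.Assign(buffer, Expression.NewArrayBounds(typeof(int), Expression.Constant(0x1000))),
        Expression.Loop(
            Expression.Block(
                // readCount = stream.Read(buffer); when the stream runs dry,
                // exit the loop yielding the last state variable as the result
                Expression.Assign(readCount,
                    Expression.Call(stream, typeof(IntStream).GetMethod("Read"), buffer)),
                Expression.IfThen(
                    Expression.Equal(readCount, Expression.Constant(0)),
                    Expression.Break(done, stateVars.Last())),
                Expression.Assign(i, Expression.Constant(0)),
                Expression.Loop(
                    Expression.Block(
                        Expression.IfThen(
                            Expression.GreaterThanOrEqual(i, readCount),
                            Expression.Break(innerDone)),
                        // feed the next input value into the first state variable,
                        // run the unrolled steps, and advance
                        Expression.Assign(stateVars[0], Expression.ArrayAccess(buffer, i)),
                        runAllStepsBlock,
                        Expression.PostIncrementAssign(i)),
                    innerDone)),
            done));

    return Expression.Lambda<Func<IntStream, int>>(body, stream).Compile();
}

static Expression MakeOperation(Operation op, Expression lhs, Expression rhs) {
    switch (op) {
    case Operation.Add: return Expression.Add(lhs, rhs);
    case Operation.Multiply: return Expression.Multiply(lhs, rhs);
    case Operation.Subtract: return Expression.Subtract(lhs, rhs);
    case Operation.Xor: return Expression.ExclusiveOr(lhs, rhs);
    case Operation.Or: return Expression.Or(lhs, rhs);
    case Operation.And: return Expression.And(lhs, rhs);
    default: throw new InvalidOperationException();
    }
}

Note that Expression.Add, Subtract, and Multiply emit unchecked arithmetic by default (the checked variants are separate factories like AddChecked), so the generated code matches the interpreter’s unchecked blocks for free.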

Benchmarking

When I originally performed this optimization, the benefits were significant. Unfortunately, I don’t remember exactly how large the improvement was (somewhere around 2x as fast). To actually demonstrate the benefits of this optimization, let’s measure how long it takes to interpret a simple hash function vs compiling and running a specialized method. In fact, let’s use the hash function I mentioned as an example at the start of the post as our test case (which I actually did receive from Battle.net):

var example = new DynamicHasher {
    InitialState = new[] { 0, 443747131, 332817992, 1040998290 },
    Steps = new[] {
        // A ^= input
        new DynamicHasher.Step {
            LeftInputIndex = 1, 
            RightInputIndex = 0, 
            Operation = DynamicHasher.Operation.Xor, 
            OutputIndex = 1
        },
        // B -= C
        new DynamicHasher.Step {
            LeftInputIndex = 2, 
            RightInputIndex = 3, 
            Operation = DynamicHasher.Operation.Subtract, 
            OutputIndex = 2
        },
        // C ^= A
        new DynamicHasher.Step {
            LeftInputIndex = 3, 
            RightInputIndex = 1, 
            Operation = DynamicHasher.Operation.Xor, 
            OutputIndex = 3
        },
        // A += B
        new DynamicHasher.Step {
            LeftInputIndex = 1, 
            RightInputIndex = 2, 
            Operation = DynamicHasher.Operation.Add, 
            OutputIndex = 1
        }
    }
};

Running the Specialize method on this hash function (and then writing the resulting expression to an assembly and decompiling it with ReSharper) produces very compact code:

public static int RunExample(IntStream stream1) {
    var num2 = 0;
    var num3 = 0x1a730b3b;
    var num4 = 0x13d66648;
    var num5 = 0x3e0c5f92;
    var buffer = new int[0x1000];
    while (true) {
        var num = stream1.Read(buffer);
        if (num == 0) {
            return num5;
        }
        var num6 = 0;
        while (num6 < num) {
            var index = num6;
            num6 = index + 1;
            num2 = buffer[index];
            num3 ^= num2;
            num4 -= num5;
            num5 ^= num3;
            num3 += num4;
        }
    }
}

Well… that’s definitely simpler than the sixty-line interpret function! It’s not perfect, but it’s definitely going to be faster. The question is, will the difference in speed be enough to offset the time spent compiling it? Also, how long did it take to compile this method?

To answer those questions, I fed the numbers in [0, 2^24) through the interpreted function and through the specialized function. I captured the x86 disassembly of the interpreted function and the disassembly of the specialized function, and of course timed how long operations took:

Generating dynamically specialized code...
Done after 45ms
Timing interpretation vs dynamically generated code of example hash...
Interpreted: 1622ms, Specialized: 1109ms
Interpreted: 1536ms, Specialized: 1106ms
Interpreted: 1532ms, Specialized: 1104ms
Interpreted: 1538ms, Specialized: 1105ms
Interpreted: 1540ms, Specialized: 1103ms
Interpreted: 1531ms, Specialized: 1105ms
Interpreted: 1542ms, Specialized: 1104ms
Interpreted: 1534ms, Specialized: 1105ms
Interpreted: 1547ms, Specialized: 1103ms
Interpreted: 1545ms, Specialized: 1102ms

Note that the 45ms delay to generate the code is misleading. It includes a ton of static initialization and warmup costs. The next function to be compiled takes less than a millisecond to complete. However, since I cared about time-to-first-login at the time, I’m going to count the inflated value.

The time taken to compute when interpreting is approximately 1550ms, and the time taken to compute the specialization (including compiling it) is about 1150ms. That’s an approximately 25% reduction in cost (the benefit increases as the number of hashing steps increases).
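
For reference, the harness behind numbers like these can be as simple as the following sketch. RangeIntStream, an IntStream that yields the integers in [0, 2^24), is a hypothetical stand-in for whatever the actual benchmark used:

var hasher = example;                   // the DynamicHasher defined above
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var specialized = hasher.Specialize();  // compile once, outside the timing loop
Console.WriteLine("Done after {0}ms", stopwatch.ElapsedMilliseconds);
for (var trial = 0; trial < 10; trial++) {
    stopwatch.Restart();
    hasher.Interpret(new RangeIntStream(1 << 24));
    var interpretedMs = stopwatch.ElapsedMilliseconds;

    stopwatch.Restart();
    specialized(new RangeIntStream(1 << 24));
    var specializedMs = stopwatch.ElapsedMilliseconds;

    Console.WriteLine("Interpreted: {0}ms, Specialized: {1}ms", interpretedMs, specializedMs);
}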

Interestingly, a 25% reduction is significantly less than I remember (~50%) from applying this optimization to Tinker. There are a few reasons this might have happened:

  • Different optimizations applied by C# now vs VB then. The C# compiler optimizes the switch block into a multi-way branch. Maybe the VB compiler didn’t, since the Select Case block is more general, resulting in the interpreted version paying for many branches.
  • C# makes it easy to ask for unchecked addition, with an unchecked block, whereas VB requires either turning off checks for the entire assembly or placing the operation in another assembly (which is what I did). That cost doesn’t apply to the language-independent expressions.
  • Different computer, different version of visual studio, different version of .Net, different build settings (optimizations enabled? release mode? lack of code contracts?).
  • Faulty human memory.

In any case, a 25% reduction is nothing to laugh at.

Other Applications

Runtime code generation is a very widely applicable optimization. Whenever you’re doing things conditionally and repeatedly, based on data, you can benefit from generating code specialized to the task.

For example, consider parser combinators, a concept I also stumbled on as part of working on Tinker. A parser combinator replaces explicit code, used to parse various packets and protocols, with a description of the format. The parsing/serialization is then done by interpreting the format description. This creates much clearer code (I’m pretty sure my code is still the best reference for the in-game protocol), but of course is slower than hand-rolled parsing methods. That speed penalty can be significantly reduced by generating specialized parsing code at runtime.

Another good example is LINQ-to-SQL. When you write a query like “from x in someDatabaseTable where x.id == someArgument”, you are not pulling all values from the table in order to filter them through the Enumerable.Where method. That would be horribly inefficient. Instead, the C# compiler turns the LINQ query into an expression tree. At runtime, the tree will be optimized into SQL to be sent to the database, so that it can take advantage of indexes and other optimizations before returning only matching results.
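
The split is visible right in the type system. A small illustration (mine, not from the post) of the same lambda captured both ways:

Func<int, bool> asDelegate = x => x == 42;           // compiled directly to IL
Expression<Func<int, bool>> asTree = x => x == 42;   // reified as an expression tree
// A query provider can walk asTree.Body (here a BinaryExpression comparing
// a parameter to a constant) and translate it into, say, a SQL WHERE clause,
// rather than executing it.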

Summary

Generating code at runtime allows you to make optimizations that would otherwise be impossible. Expression trees (and, soon, Roslyn) make this relatively easy to do.
