14

Does Inlined Mean Streamlined? Part 1: Escape Analysis

 4 years ago
source link: https://richardstartin.github.io/posts/does-inlined-mean-streamlined-part-1-escape-analysis
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Does Inlined Mean Streamlined? Part 1: Escape Analysis

Feb 24, 2019 | Richard Startin | javaanalysis

There’s a lot of folklore about the importance of inlining in the JVM. Undoubtedly, inlining can improve performance by removing the overhead of function calls, but, more importantly, various optimisations are disabled or reduced in scope when it can’t happen. However, I think the importance of inlining is often overstated, especially considering the trade off between flexibility and ability to inline. This post is the first in a series where I use JMH to run simple experiments to assess the impact of failure to inline on C2’s ability to optimise programs. This post is about how inlining affects escape analysis, and whether you should care.

Inlining is the process of replacing function calls with the function’s code, much like denormalisation of databases. Just as database denormalisation can eliminate the overhead of performing joins at the expense of increasing the level of data duplication and therefore database size, inlining removes the overhead of function calls, at the expense of the amount of space required to represent the program. The analogy breaks down because copying the function’s code into the call site also aids an optimising compiler like C2 by increasing the scope of what can be optimised within a method, so C2 does this aggressively. It’s well known that there are two ways to confound inlining: code size (InlineSmallCode sets the limit of what can be inlined to 2KB by default), and having lots of polymorphism. Failure to inline can also be provoked by the JMH annotation @CompilerControl(DONT_INLINE).

In the first benchmark, I will look at a contrived example of the kind of small method you may find in Java code written in a functional style. Functional programming exploits monads, which represent a generic computation as a wrapper type, a wrapping operation known as the unit function, and a way to compose functions applied to the wrapper type, known as the bind function. You can also think of them as burritos. Some monadic types common in functionally tinged Java are Either (contains an instance of one type or another), Try (produces an output or an exception) and Optional which exists in the JDK. One drawback of monadic types in Java is that the wrapper type needs to be materialised (rather than exist only as a figment of the compiler’s imagination) and risks being allocated.

Here is an interface exposing a method returning an Optional intended to safely map a potentially null value of type S to type Optional<T> via a mapping between the unwrapped types S and T. To avoid measuring the cost of different implementations, it is implemented the same way three times to reach the threshold where Hotspot will give up on inlining calls to the escapee.

public interface Escapee<T> {
  <S> Optional<T> map(S value, Function<S, T> mapper);
}

public class Escapee1<T> implements Escapee<T> {
  @Override
  public <S> Optional<T> map(S value, Function<S, T> mapper) {
    return Optional.ofNullable(value).map(mapper);
  }
}

In the benchmark, we can simulate conditions where we call between one and four implementations. We should probably expect the benchmark to behave differently when the input value is null because a different branch will be taken. To isolate the difference in throughput just for taking the other branch, the same function, which allocates an Instant, is evaluated on either branch. No attempt is made to make the branch unpredictable since it’s beside the point. Instant.now() is chosen because it is volatile and impure, meaning that its evaluation shouldn’t be eliminated by some other optimisation.

  @State(Scope.Benchmark)
  public static class InstantEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<Instant>[] escapees;
    int size = 4;
    String input;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValue(InstantEscapeeState state, Blackhole bh) {
    for (Escapee<Instant> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> Instant.now()).orElseGet(Instant::now));
    }
  }

Based on common knowledge about C2’s inlining capabilities, we should expect scenarios THREE and FOUR not to inline, whereas ONE should be inlined, and TWO should be inlined with a conditional. Verifying this well known outcome by printing inlining with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining is trivial. See Aleksey Shipilёv’s authoritative post for reference.

The benchmark is run with the following arguments. Tiered compilation is disabled to bypass C1. A large heap is allocated to avoid measuring garbage collection pauses, and the low overhead SerialGC is selected to minimise interference from instrumented write barriers.

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee.csv -prof gc 
-jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValue$

Despite there being little absolute difference in throughput (the scenarios where we expect inlining to occur have slightly higher throughputs than when we expect inlining not to take place), the results are quite interesting.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario mapValue thrpt 1 15 24.013132 0.459482 ops/us true ONE mapValue thrpt 1 15 22.448583 0.430733 ops/us true TWO mapValue thrpt 1 15 20.291617 0.898656 ops/us true THREE mapValue thrpt 1 15 20.651088 0.552091 ops/us true FOUR mapValue thrpt 1 15 24.625237 0.535002 ops/us false ONE mapValue thrpt 1 15 24.039407 0.432007 ops/us false TWO mapValue thrpt 1 15 21.976675 0.741998 ops/us false THREE mapValue thrpt 1 15 22.183469 0.43514 ops/us false FOUR

The megamorphic cases are slightly faster when the input value is null, which highlights how easy it would be to not capture the relevant effects at all. When the input value is always null, and when there is only one implementation and the input value is not null, the normalised allocation rate are all 24B/op, and just over half that of the non-null input multi implementation scenarios, which are all about the same at 40B/op.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op true ONE mapValue:·gc.alloc.rate.norm thrpt 1 15 40.000018 0.000001 B/op true TWO mapValue:·gc.alloc.rate.norm thrpt 1 15 40.00002 0.000001 B/op true THREE mapValue:·gc.alloc.rate.norm thrpt 1 15 40.00002 0.000001 B/op true FOUR mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op false ONE mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000017 0.000001 B/op false TWO mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000019 0.000001 B/op false THREE mapValue:·gc.alloc.rate.norm thrpt 1 15 24.000019 0.000001 B/op false FOUR

24B/op is the size of instances of the Instant class (when a simple garbage collector like SerialGC is used), which contains an 8 byte number of seconds since 1970 and a 4 byte number of nanoseconds, plus a 12 byte object header. So the wrapper type can’t have been allocated in these cases! 40B/op includes the 16 bytes taken up by the materialised Optional (12 bytes for the header and 4 bytes for a compressed reference to the Instant). The difference is caused by the limitations of escape analysis: it gives up trying to prove allocation is unnecessary whenever the allocating method can’t be inlined, and incidentally gives up when the allocation takes place within a conditional statement. In scenario TWO, a conditional statement is introduced by inlining two possible implementations, which means each operation allocates the 16 bytes required for the optional.

The signal is fairly weak in this benchmark, and is almost entirely masked by the fact the benchmark will allocate a 24 byte Instant per invocation. To accentuate the difference, we can isolate background allocation from the benchmark and track the same metrics.

  @State(Scope.Benchmark)
  public static class StringEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    Escapee<String>[] escapees;
    int size = 4;
    String input;
    String ifPresent;
    String ifAbsent;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      scenario.fill(escapees);
      ifPresent = UUID.randomUUID().toString();
      ifAbsent = UUID.randomUUID().toString();
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapValueNoAllocation(StringEscapeeState state, Blackhole bh) {
    for (Escapee<String> escapee : state.escapees) {
      bh.consume(escapee.map(state.input, x -> state.ifPresent).orElseGet(() -> state.ifAbsent));
    }
  }
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-string.csv -prof gc 
-jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapValueNoAllocation

While even the cost of very low intensity realistic work (allocating a timestamp) is enough to mollify failure to inline, when the virtual call is a no-op we can make its impact look quite severe. ONE and TWO are much faster because they at least eliminate the virtual function call in each case, no matter whether the input is null or not.

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario mapValueNoAllocation thrpt 1 15 206.913491 3.003555 ops/us true ONE mapValueNoAllocation thrpt 1 15 162.014816 4.353872 ops/us true TWO mapValueNoAllocation thrpt 1 15 77.959095 2.174789 ops/us true THREE mapValueNoAllocation thrpt 1 15 77.845562 3.592952 ops/us true FOUR mapValueNoAllocation thrpt 1 15 202.016045 2.830117 ops/us false ONE mapValueNoAllocation thrpt 1 15 198.241125 2.351662 ops/us false TWO mapValueNoAllocation thrpt 1 15 88.187145 3.908423 ops/us false THREE mapValueNoAllocation thrpt 1 15 89.715024 2.234652 ops/us false FOUR

It’s easy to imagine that allocation has been curtailed, only to be caught out by the limitations of escape analysis in the presence of polymorphism. In scenario ONE, there is never any allocation: escape analysis must have worked. In scenario TWO, because of the inlined conditional, the 16 byte Optional is allocated once per invocation with non-null input, and when the input is always null, there are fewer allocations. However, when the inlining doesn’t work in scenarios THREE and FOUR, an extra 16 bytes is allocated once per invocation, but it’s not related to inlining. The unintentional 16 bytes comes from capturing the variable in each case (a 12 byte header and 4 byte compressed reference to the String), but how often do you check your benchmarks to ensure you are measuring what you think you are?

Benchmark Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op true ONE mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000003 0 B/op true TWO mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 32.000005 0 B/op true THREE mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 32.000005 0 B/op true FOUR mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op false ONE mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 0.000002 0 B/op false TWO mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000005 0 B/op false THREE mapValueNoAllocation:·gc.alloc.rate.norm thrpt 1 15 16.000005 0 B/op false FOUR

It’s not the sort of thing that can be exploited in real programs, but it looks as if allocations are better eliminated when the method, be it virtual or inlined, only ever sees a null value. Actually, Optional.empty() always returns the same instance, so there were no allocations in the first place.

Having contrived a case to accentuate the effect, it’s worth noting that the impact of failure to inline is smaller than the difference in the cost of allocating an instance and storing the value with different garbage collectors, which is a cost some developers seem to be unaware of.

  @State(Scope.Benchmark)
  public static class InstantStoreEscapeeState {
    @Param({"ONE", "TWO", "THREE", "FOUR"})
    Scenario scenario;
    @Param({"true", "false"})
    boolean isPresent;
    int size = 4;
    String input;
    Escapee<Instant>[] escapees;
    Instant[] target;

    @Setup(Level.Trial)
    public void init() {
      escapees = new Escapee[size];
      target = new Instant[size];
      scenario.fill(escapees);
      input = isPresent ? "" : null;
    }
  }

  @Benchmark
  @OperationsPerInvocation(4)
  public void mapAndStoreValue(InstantStoreEscapeeState state, Blackhole bh) {
    for (int i = 0; i < state.escapees.length; ++i) {
      state.target[i] = state.escapees[i].map(state.input, x -> Instant.now()).orElseGet(Instant::now);
    }
    bh.consume(state.target);
  }

I run the same benchmark in two modes:

taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-serial.csv 
-prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseSerialGC -mx8G" EscapeeBenchmark.mapAndStoreValue$
taskset -c 0 java -jar target/benchmarks.jar -wi 5 -w 1 -r 1 -i 5 -f 3 -rf CSV -rff escapee-store-g1.csv 
-prof gc -jvmArgs="-XX:-TieredCompilation -XX:+UseG1GC -mx8G" EscapeeBenchmark.mapAndStoreValue$

The cost of changing the garbage collector when triggering the write barriers (simple in the case of the serial collector and complex in the case of G1) is about as large as the cost of missing out on inlining. Note that this is not an argument that garbage collector overhead is unacceptable!

Benchmark GC Mode Threads Samples Score Score Error (99.9%) Unit Param: isPresent Param: scenario mapAndStoreValue SerialGC thrpt 1 15 23.739993 0.297493 ops/us true ONE mapAndStoreValue SerialGC thrpt 1 15 22.41715 0.502928 ops/us true TWO mapAndStoreValue SerialGC thrpt 1 15 21.096494 0.629228 ops/us true THREE mapAndStoreValue SerialGC thrpt 1 15 20.656528 0.604725 ops/us true FOUR mapAndStoreValue SerialGC thrpt 1 15 24.098976 0.479819 ops/us false ONE mapAndStoreValue SerialGC thrpt 1 15 23.759017 0.460972 ops/us false TWO mapAndStoreValue SerialGC thrpt 1 15 21.473803 0.411786 ops/us false THREE mapAndStoreValue SerialGC thrpt 1 15 21.524173 0.393322 ops/us false FOUR mapAndStoreValue G1GC thrpt 1 15 20.522258 0.463444 ops/us true ONE mapAndStoreValue G1GC thrpt 1 15 18.520677 0.229133 ops/us true TWO mapAndStoreValue G1GC thrpt 1 15 18.359042 0.276809 ops/us true THREE mapAndStoreValue G1GC thrpt 1 15 18.446654 0.272189 ops/us true FOUR mapAndStoreValue G1GC thrpt 1 15 20.768856 0.496087 ops/us false ONE mapAndStoreValue G1GC thrpt 1 15 20.277051 0.411466 ops/us false TWO mapAndStoreValue G1GC thrpt 1 15 18.875519 0.399535 ops/us false THREE mapAndStoreValue G1GC thrpt 1 15 18.824234 0.657469 ops/us false FOUR

Inlining makes escape analysis possible, but is only effective when only one implementation is used. The marginal benefit decreases in the presence of even trivial allocation, but can be expected to increase with the size of the eliminated allocation. The difference can even be smaller than the runtime cost of write barriers in some garbage collectors. My benchmarks are on github, they were run with OpenJDK 11+28 on Ubuntu 18.04.2 LTS.

Perhaps this analysis is facile; many optimisations more powerful than escape analysis depend on inlining. The next post in the series will be on the benefits, or lack thereof, of inlining a reduction operation such as a hash code.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK