
JVM Anatomy Park #16: Megamorphic Virtual Calls

source link: https://shipilev.net/jvm-anatomy-park/16-megamorphic-virtual-calls/

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: [email protected]

Question

I have heard that megamorphic virtual calls are so bad, they end up being handled by the interpreter rather than the optimizing compiler. Is that true?

Theory

If you have read numerous articles about virtual call optimization in Hotspot, you may have been left with the belief that megamorphic calls are pure evil, because they invoke the slowpath handling and do not enjoy compiler optimizations. If you try to comprehend what OpenJDK does when it fails to devirtualize a call, you might think it crashes and burns performance-wise. But consider that JVMs work decently well even with baseline compilers, and in some cases even the interpreter performance is okay (and it matters for time-to-performance).

So, would it be premature to conclude that the runtime just gives up?

Practice

Let us try and see how the virtual call slowpath looks. For that, we make an artificial megamorphic call site in a JMH benchmark: three subclasses visiting the same call site:
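The original benchmark listing is not reproduced here; below is a minimal sketch of what such a JMH benchmark might look like. The class names (Doer, DoerA/B/C), the array size, and the work done per call are illustrative assumptions, not the original code; only the shape matters: one virtual call site that observes either a single receiver type ("mono") or three receiver types ("mega"), depending on the mode parameter.

    package org.openjdk;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    import java.util.concurrent.TimeUnit;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(VirtualCall.COUNT) // report time per element, not per loop
    public class VirtualCall {

        static final int COUNT = 10_000;

        // Hypothetical hierarchy: one abstract base, three concrete subclasses.
        static abstract class Doer {
            abstract int work(int x);
        }
        static class DoerA extends Doer { int work(int x) { return x + 1; } }
        static class DoerB extends Doer { int work(int x) { return x + 2; } }
        static class DoerC extends Doer { int work(int x) { return x + 3; } }

        @Param({"mono", "mega"})
        public String mode;

        Doer[] doers;

        @Setup
        public void setup() {
            doers = new Doer[COUNT];
            for (int i = 0; i < COUNT; i++) {
                if (mode.equals("mono")) {
                    doers[i] = new DoerA();                     // single receiver type
                } else {
                    switch (i % 3) {                            // three receiver types mixed
                        case 0:  doers[i] = new DoerA(); break;
                        case 1:  doers[i] = new DoerB(); break;
                        default: doers[i] = new DoerC(); break;
                    }
                }
            }
        }

        @Benchmark
        public void test(Blackhole bh) {
            for (Doer d : doers) {
                bh.consume(d.work(42)); // the virtual call site under study
            }
        }
    }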

To simplify things for analysis, we invoke this with -XX:LoopUnrollLimit=1 -XX:-TieredCompilation: this keeps loop unrolling from complicating the disassembly, and disabling tiered compilation guarantees compilation with the final optimizing compiler. We do not care about performance numbers all that much, but let us have them to frame the discussion:
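One way to pin those JVM flags for the whole run (an assumption about the setup, not necessarily how the original was invoked) is JMH's @Fork annotation; passing them on the JMH command line via -jvmArgs works just as well:

    // Assumed setup: the same flags could instead be passed on the JMH command line.
    @Fork(value = 1, jvmArgsAppend = {"-XX:LoopUnrollLimit=1", "-XX:-TieredCompilation"})
    public class VirtualCall {
        // ... benchmark body as sketched above ...
    }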

To give you a taste of what happens if we do not use the optimizing compiler on test, run with -XX:CompileCommand=exclude,org.openjdk.VirtualCall::test:

So, the megamorphic call does indeed cost something, but it is definitely not interpreter-bad performance. The difference between "mono" and "mega" in the optimized case is basically the call overhead: we spend 3 ns per element in the "mega" case, while spending only 1 ns per element in the "mono" case.

How does "mega" case look like in perfasm? Like this, with many things pruned for brevity:

So the benchmarking loop calls into something we can assume is the virtual call handler, and it ends up in a VtableStub, which supposedly does what every other runtime does with virtual calls: jumps to the actual method with the help of the Virtual Method Table (VMT).[1]

But wait a minute, this does not compute! The disassembly says we are actually calling into 0x…0bf60, not into the VtableStub that is at 0x…59bf0?! And that call is hot, so the call target should also be hot, right? This is where the runtime itself plays tricks on us. Even if the compiler bails out of optimizing the virtual call, the runtime can handle the "pessimistic" cases on its own. To diagnose this better, we need to get a fastdebug OpenJDK build and supply a tracing option for Inline Caches (IC): -XX:+TraceIC. Additionally, we want to save the Hotspot log to a file with -prof perfasm:saveLog=true.
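A sketch of wiring these options together through the JMH runner API (the class name RunIt and the exact invocation are assumptions; the article's original command line is not reproduced here):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class RunIt {
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                    .include(VirtualCall.class.getSimpleName())
                    // -XX:+TraceIC is only available in a fastdebug build of the JVM
                    .jvmArgsAppend("-XX:LoopUnrollLimit=1", "-XX:-TieredCompilation", "-XX:+TraceIC")
                    // save the Hotspot/perfasm log to a file for later inspection
                    .addProfiler("perfasm", "saveLog=true")
                    .build();
            new Runner(opt).run();
        }
    }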

Lo and behold!

Okay, it says the inline cache had acted for the call site at 0x00007fac4fcb428b. What is that? This is our Java call!

But what was the address in that Java call? This is the resolving runtime stub:

This guy basically called into the runtime, figured out which method we want to call, and then asked the IC to patch the call to point to the newly resolved address! Since that is a one-time action, no wonder we do not see it as hot code. The IC action line mentions changing the entry to another address, which is, by the way, our actual VtableStub:

In the end, no runtime/compiler calls are needed to dispatch the resolved call: the call site just calls the VtableStub that does the VMT dispatch, never leaving the generated machine code. The IC machinery handles virtual monomorphic and static calls in a similar way, pointing to a stub/address that does not do the VMT dispatch.

What we see in the initial JMH perfasm output is the generated code as it looked after compilation, but before execution and potential runtime patching.[2]

Observations

Just because the compiler failed to optimize for the best case does not mean the worst case is abysmally bad. True, you give up some optimizations, but the overhead is not so devastating that you would need to avoid virtual calls altogether. This rhymes with the conclusion of "The Black Magic of (Java) Method Dispatch": unless you care very much, you do not have to worry about call performance.

