Graal vs C2 who runs Kotlin faster?

So how can we give Graal a run? Since Java 10 you can activate the Graal compiler on any JDK simply by adding two VM options: -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. Unfortunately, the Graal compiler bundled with the JDK is not always the most up-to-date version, so a better solution is to switch to GraalVM as your JDK.
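
To check whether the switch actually took effect, you can rely on -Dgraal.ShowConfiguration=info (which also appears in the JMH settings later in this article); another rough option, sketched below as an assumption of mine, is to print what the running JVM reports about itself. The exact strings vary between JVM builds, so treat the output as indicative only.

import java.lang.management.ManagementFactory

fun main() {
    // On GraalVM the VM name usually mentions GraalVM; on a stock JDK it does not.
    println(System.getProperty("java.vm.name"))
    // Name of the JIT compiler as reported by the platform MBean.
    println(ManagementFactory.getCompilationMXBean().name)
}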

If you have SdkMan installed, you just need three commands:

sdk list java                        // choose the right GraalVM version
sdk install java 20.0.0.r11-grl      // install it
sdk use java 20.0.0.r11-grl          // select it as current JDK

If you don’t have SdkMan installed you can download Graal from the Oracle site, unpack it somewhere and set up JAVA_HOME… but really, why not just install SdkMan instead?

GraalVM comes in two flavors: CE and EE. You can read about it here, but the gist is that CE is fully open source, while EE is free only for evaluation and non-production use. EE is based on CE with some additional optimizations, which may or may not be important for you, as we will see.

To download the Enterprise Edition you need to register with Oracle and sign the license agreement.

I have prepared three algorithms to test the compilers: a Mandelbrot Set generator, a Knapsack solver and a Modular Algebra calculator. The full code is available on GitHub.

I’ve tried to write the code in the most idiomatic Kotlin way, taking advantage of extension functions, data classes, context programming and forEach loops. Not only is the resulting code very easy to read, but its performance is also not significantly slower than a C-like syntax.

I have chosen these three algorithms to have realistic examples of very different CPU-intensive tasks, and the code is entirely in Kotlin, without using any Java library.

Mandelbrot Set

The Mandelbrot Set is probably the most famous fractal shape and you may have seen it even if you don’t know the name.

Mathematically, it is defined as the set of all points c in the complex plane for which the function z ← z² + c does not diverge when iterated. It is quite easy to generate the set by iterating the function for a grid of points on the complex plane and creating an image out of the results.

Since the goal here is performance and not graphics, I kept things simple by using text output.

Let’s start looking at the code for the Mandelbrot Set. Here is the definition of a type to represent the Complex numbers:

data class Complex(val r: Double, val i: Double) {
    operator fun times(other: Complex) =
        Complex(
            r = this.r * other.r - this.i * other.i,
            i = this.i * other.r + this.r * other.i
        )
    operator fun plus(other: Complex) =
        Complex(r = this.r + other.r, i = this.i + other.i)
    operator fun minus(other: Complex) =
        Complex(r = this.r - other.r, i = this.i - other.i)

    fun squared() = this * this
    fun squaredModule() = r * r + i * i

    fun Double.toComplex() = Complex(r = this, i = 0.0)
}

Then the actual calculation for each point of the Set:

fun mandelSet(initZ: Complex, c: Complex, maxIter: Int): Int {
    var z = initZ
    (1..maxIter).forEach {
        z = z.squared() + c
        if (z.squaredModule() >= 4)
            return it
    }
    return maxIter
}

You can see here how using operator overloading and a data class to represent complex numbers can really simplify the code and make it easier to understand.

Once we have defined the rules for complex-number operations in the Complex class, the mandelSet function only has to check whether the iteration z ← z² + c is “escaping” or not and, if so, after how many iterations it exceeds the threshold of 4.
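
The actual rendering code lives in the GitHub repository and is not reproduced in the article; the following is only a minimal sketch of how such a text renderer could look, reusing the Complex and mandelSet definitions above. The function name, the plane bounds and the character ramp are my own choices.

// A sketch of a text renderer for the Mandelbrot Set: darker characters mark points
// that escape later, '@' marks points that never escape within maxIter iterations.
fun renderMandelbrot(width: Int = 60, height: Int = 30, maxIter: Int = 100): String {
    val chars = " .:-=+*#%@"
    return (0 until height).joinToString("\n") { row ->
        (0 until width).map { col ->
            // Map the character cell to a point of the complex plane,
            // roughly covering the real axis [-2, 1] and the imaginary axis [-1.25, 1.25].
            val c = Complex(
                r = -2.0 + 3.0 * col / width,
                i = -1.25 + 2.5 * row / height
            )
            val iter = mandelSet(Complex(0.0, 0.0), c, maxIter)
            if (iter == maxIter) '@' else chars[iter * chars.length / maxIter]
        }.joinToString("")
    }
}

fun main() {
    println(renderMandelbrot())
}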

Here you can see the characteristic cardioid shape of the Mandelbrot Set in the output, rendered as ASCII art:

[Image: ASCII-art rendering of the Mandelbrot Set]

Knapsack problem

The Knapsack problem can be defined in several ways. Imagine a thief who has just broken into a watch shop. He can steal as many watches as he wants, provided that he doesn’t exceed the maximum weight he can carry in his knapsack.

Being a rational thief, he definitely wants to maximize the value of the watches in his knapsack. Every watch has a price and a weight attached, so we need an algorithm to find the set of watches with the maximum total price for a given maximum weight.

Real-world applications of this algorithm include optimizing cuts and materials for CNC machines and marketing strategies for allocating advertising budget.

For example, let’s start with a shop with only 3 watches defined like this:

val shop = Knapsack.shop(
    Watch(weight = 1, price = 1),
    Watch(weight = 3, price = 2),
    Watch(weight = 1, price = 3)
)

If we have a maximum weight of 1, it’s better for us to pick up the third watch, rather than the first, because the value is higher.

If we have a maximum weight of 3, we can choose watch 2 (price 2) or watches 1 and 3 (price 1 + 3). In this case, it’s better to pick up 1 and 3 even if their combined weight is less than the maximum.

These are the complete solutions for this shop:

assertEquals(3, selectWatches(shop, maxWeight = 1))
assertEquals(4, selectWatches(shop, maxWeight = 2))
assertEquals(4, selectWatches(shop, maxWeight = 3))
assertEquals(5, selectWatches(shop, maxWeight = 4))
assertEquals(6, selectWatches(shop, maxWeight = 5))

As you can see, as the number of available watches increases, the number of possible choices grows very, very quickly. It is a classic NP-hard problem.
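
To get a feel for that growth: with n watches there are 2^n possible subsets to consider. The tiny illustration below is mine, not part of the article’s code.

// Number of possible subsets of n watches: 2^n.
fun subsetCount(n: Int): Long = 1L shl n

fun main() {
    println(subsetCount(10))   // 1024
    println(subsetCount(40))   // 1099511627776
}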

To solve it in a reasonable time we need to cheat a bit and use dynamic programming: we build a map with the already-computed optimal solution for each set of watches, so we can avoid recalculating them each time.

The general algorithm is an exhaustive search based on recursion. This is the Kotlin code to solve it, separated into a memoization function and the recursive search for the maximum value.

typealias Memoizer = MutableMap<String, Int>

fun priceAddingElement(memo: Memoizer, shop: Set<Watch>, choice: Set<Watch>, maxWeight: Int, priceSum: Int): Int =
    shop.filter { it !in choice && it.weight <= maxWeight }
        .map {
            selectWatches(
                memo,
                shop,
                maxWeight - it.weight,
                choice + it,
                priceSum + it.price)
        }
        .filter { it > priceSum }
        .max() ?: priceSum

fun selectWatches(memo: Memoizer, shop: Set<Watch>, maxWeight: Int, choice: Set<Watch>, priceSum: Int): Int =
    memoization(memo, generateKey(choice)) {
        priceAddingElement(memo, shop, choice, maxWeight, priceSum)
    }

private fun memoization(memo: Memoizer, key: String, f: () -> Int): Int =
    when (val w = memo[key]) {
        null -> f().also { memo[key] = it }
        else -> w
    }
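
The code above relies on a generateKey helper that is not shown in the article. The following minimal sketch is purely my assumption about what such a helper could look like: it turns the set of chosen watches into a stable string, so that the same choice always maps to the same memoization entry regardless of insertion order.

// Hypothetical key generator for the memoization map (not the article's original code).
fun generateKey(choice: Set<Watch>): String =
    choice.map { "${it.weight}x${it.price}" }
        .sorted()
        .joinToString("|")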

I really like how Kotlin allows expressing the intention clearly without having to repeat yourself.

Modular Algebra

Modular algebra is a very useful and interesting branch of mathematics. It’s also called “clock algebra” because it resembles operations on the hours of a clock: for example, 7 + 6 is normally equal to 13, but if we are talking about time it is equal to 1. Here the clock is a modulo-12 algebra, but we can generalize it to any number.

It has a lot of interesting properties. For example, repeatedly adding a number that is coprime with the modulus generates a sequence containing all the numbers up to the modulus. This means that if we add a number coprime with 12 in our clock algebra we will generate a sequence with all the hours. For example, 5 is the smallest number greater than 1 that is coprime with 12, and the sequence of 5 modulo 12 is: 5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 0, after which it repeats itself.
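
A quick way to check this property in Kotlin, as a small sketch of mine using plain Int arithmetic rather than the ModularNumber class defined below:

// Repeatedly adding `step` modulo `modulo` visits every residue exactly once
// when step and modulo are coprime.
fun additiveSequence(step: Int, modulo: Int): List<Int> =
    (1..modulo).map { (it * step) % modulo }

fun main() {
    println(additiveSequence(5, 12))                       // [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 0]
    println(additiveSequence(5, 12).toSet().size == 12)    // true: all residues reached
}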

This is the class that defines a modular number and its operations. Note that division is quite problematic in modular arithmetic, so I used a brute-force approach as a shortcut:

data class ModularNumber(val num: Int, val modulo: Int) : Comparable<ModularNumber> {

    override fun compareTo(other: ModularNumber): Int = num.compareTo(other.num)

    operator fun plus(other: ModularNumber) = (num + other.num).toModularNumber()
    operator fun minus(other: ModularNumber) = (num - other.num).toModularNumber()
    operator fun times(other: ModularNumber) = (num * other.num).toModularNumber()
    operator fun div(other: ModularNumber) =
        ((1..modulo).first { (it * other.num) % modulo == (this.num % modulo) }).toModularNumber()
    operator fun inc() = (num + 1).toModularNumber()

    fun squared() = (num * num).toModularNumber()

    fun Int.toModularNumber(): ModularNumber =
        ModularNumber(this % modulo, modulo)
}

What can we do with such a charming algebra? We can draw nice patterns! To produce them we only need to compare the squares of two numbers inside the modular algebra.

data class PlaneGrid(val size: Int, val initPredicate: (Int) -> Boolean) {
    private val grid = BooleanArray(size * size, initPredicate)

    operator fun get(x: Int, y: Int): Boolean = grid[coord(x, y)]

    private fun coord(x: Int, y: Int) = (y - 1) * size + x - 1

    fun count(): Int = grid.count { it }
}

val compareSquares: (ModularNumber, ModularNumber) -> Boolean =
    { x, y -> x.squared() >= y.squared() }

fun compareSquares(size: Int, modulo: Int): Int =
    ModularField(modulo).applyFunction(compareSquares)(size).count()

fun sumOfFunction(from: Int, to: Int, f: (Int) -> Int) =
    (from..to)
        .map { f(it) }
        .sum()

data class ModularField(val modulo: Int) {

    fun applyFunction(f: (ModularNumber, ModularNumber) -> Boolean): (Int) -> PlaneGrid =
        { size ->
            PlaneGrid(size) { index ->
                val xMod = (index % size).toModularNumber()
                val yMod = (index / size).toModularNumber()
                f(xMod, yMod)
            }
        }

    fun Int.toModularNumber(): ModularNumber =
        ModularNumber(this % modulo, modulo)
}

These are some examples of modular patterns generated by the above functions:

Modulo 5              Modulo 15            Modulo 12
@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@@@ @@@@@ @@@@@ @
@@ @@ @@ @@ @@ @@@@@@ @@ @@ @@@ @@@ @@@
@@ @@ @@ @@ @ @ @ @ @ @ @ @
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@ @@@ @@@
@@@@@@@@@@@@@@@@@@@@ @ @ @@@@@ @@@@@ @@@@@ @
@@@@ @@@@ @@@@ @@@@ @ @@ @@ @ @ @@@@@@@@@@@@@@@@@@@@
@@ @@ @@ @@ @@ @@@@@@ @@ @@ @@@@@ @@@@@ @@@@@ @
@@ @@ @@ @@ @@ @@@@@@ @@ @@ @@@ @@@ @@@
@@@@ @@@@ @@@@ @@@@ @ @@ @@ @ @ @ @ @
@@@@@@@@@@@@@@@@@@@@ @ @ @@@ @@@ @@@
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@@@ @@@@@ @@@@@ @
@@ @@ @@ @@ @ @ @ @ @ @@@@@@@@@@@@@@@@@@@@
@@ @@ @@ @@ @@ @@@@@@ @@ @@ @@@@@ @@@@@ @@@@@ @
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@ @@@ @@@
@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@ @ @ @
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@ @@@ @@@
@@ @@ @@ @@ @@ @@@@@@ @@ @@ @@@@@ @@@@@ @@@@@ @
@@ @@ @@ @@ @ @ @ @ @ @@@@@@@@@@@@@@@@@@@@
@@@@ @@@@ @@@@ @@@@ @@@@@@@@@@@@@@ @@@@ @@@@@ @@@@@ @@@@@ @
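
The article does not include the code that prints these grids. The sketch below is my own guess at how it could be done with PlaneGrid’s get operator, printing '@' where the predicate holds and a space where it does not; the name renderGrid is an assumption.

// Hypothetical helper that renders a PlaneGrid as ASCII art.
fun renderGrid(grid: PlaneGrid): String =
    (1..grid.size).joinToString("\n") { y ->
        (1..grid.size).joinToString("") { x -> if (grid[x, y]) "@" else " " }
    }

fun main() {
    // Modulo-12 pattern on a 20x20 grid, like the rightmost example above.
    val grid = ModularField(12).applyFunction(compareSquares)(20)
    println(renderGrid(grid))
}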

Benchmarks

And now the part you were waiting for all along. Let’s compare the performance of the two compilers!

Let’s remember that Graal is written in Java and takes advantage of recent research in the compiler space, but it is still relatively young. C2, on the other hand, is extremely well-tuned and mature.

Now a word about testing methodology: all tests ran on my notebook with an i7 at 4.4 GHz and 32 GB of RAM. I ran the tests several times with different parameters and the results have been consistent. The actual values in the graphs were taken in the same session, without other applications running.

I run a test called PerformanceTest in each project, which executes an end-to-end benchmark continuously. In this way, the compiler cannot optimize the code for a specific case. Then I take the fastest result, assuming it is the one without GC and OS pauses. All tests are single-threaded.
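
The PerformanceTest classes themselves are in the GitHub repository; purely as an illustration of the idea, a continuous “take the best run” measurement could be sketched like this (the function name measureBest and its parameters are my own):

// Runs the block repeatedly and returns the fastest observed time in microseconds,
// on the assumption that the best run is the one least disturbed by GC and OS pauses.
fun measureBest(repetitions: Int = 100, block: () -> Unit): Long {
    var best = Long.MAX_VALUE
    repeat(repetitions) {
        val start = System.nanoTime()
        block()
        val elapsed = (System.nanoTime() - start) / 1_000
        if (elapsed < best) best = elapsed
    }
    return best
}

fun main() {
    val best = measureBest { compareSquares(1000, 12) }
    println("best run: $best microseconds")
}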

Here are the results in microseconds as an average of many runs:

[Image: chart of the continuous-run benchmark results in microseconds]

These results are consistent in showing a definite performance gain using Graal over HotSpot. The main reason is Graal’s more sophisticated inlining and escape-analysis optimizations.

In some cases, Graal EE’s special optimizations (like loop unrolling) give it a further advantage.

As a methodology, I always suggest starting with an end-to-end continuous running benchmark, as it is the safest way to collect performance metrics. It will give us the best approximation of production performance.

If we want to understand better where the bottlenecks are, we need to create more specific micro-benchmarks. Fortunately for us, there is an incredibly good library to do exactly this.

Enter the Java Microbenchmark Harness. JMH is a library that simplifies the task of implementing Java microbenchmarks correctly. JMH was created by JVM developers and is used for the JVM’s internal performance tests. We can take this as a good indication of its quality.

How to set up JMH using Gradle (and Kotlin)?
You just need to edit your build.gradle.kts file. First, add the dependencies:

testImplementation("org.openjdk.jmh:jmh-core:1.23")
testImplementation("org.openjdk.jmh:jmh-generator-annprocess:1.23")

Then, add the Gradle JMH plugin:

plugins {
    kotlin("jvm") version "1.3.70"
    id("me.champeau.gradle.jmh") version "0.5.0"
}

Finally, add a section with the properties you want to set. You can find the full list here.

jmh {
    jmhVersion = "1.23"
    jvmArgs = listOf(
        "-Xms6g",
        "-Xmx6g",
        "-Dgraal.ShowConfiguration=info",
        "-XX:+AlwaysPreTouch",
        "-XX:+UnlockExperimentalVMOptions",
        "-XX:+UseJVMCICompiler"
    )
    warmupForks = 0
    fork = 2
    warmupIterations = 5
    iterations = 5
}

How to create a test for JMH?

To create a new benchmark we only need to put an annotation on a function.

@Benchmark
fun knapsack(blackhole: Blackhole) {
    blackhole.consume(Knapsack.selectWatches(shop, 199))
}

There are many optional parameters for the benchmark annotation; the default mode is Throughput, which counts how many operations complete in a second, but there are many other possible ways to run the test.
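
For example, the same kind of benchmark can be switched to average-time mode and a different time unit with a couple of extra annotations. This is only a sketch of mine, not the article’s actual settings; the class name and the chosen point and iteration count are assumptions.

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.BenchmarkMode
import org.openjdk.jmh.annotations.Mode
import org.openjdk.jmh.annotations.OutputTimeUnit
import org.openjdk.jmh.infra.Blackhole

// Measures the average time per operation in microseconds instead of throughput.
// The class must be open so that JMH can generate its wrapper code.
open class MandelbrotBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    fun mandelbrot(blackhole: Blackhole) {
        blackhole.consume(mandelSet(Complex(0.0, 0.0), Complex(-0.75, 0.1), 10_000))
    }
}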

In this test, you can see we are using the Blackhole class to prevent the compiler from treating the benchmark as “dead code”, since we are not doing anything with the result of the calculation.

Let’s see the results:

[Image: chart of the JMH benchmark results]

From this graph, we can see a bigger difference in raw CPU performance than in the continuous run test.

One difference is that JMH uses some techniques to avoid garbage-collection cycles, while the continuous run, after a while (usually from a few seconds to a few minutes), fills up the memory and requires a full GC cycle.

Graal’s garbage collection is currently not as good as HotSpot’s, so this can partly explain the results. On the positive side, Red Hat is currently investing in bringing their great Shenandoah garbage collector to GraalVM. Expect new benchmarks when this happens!

