Faster R with FastR

 5 years ago
source link: https://www.tuicool.com/articles/hit/NVvQFrE
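[Screenshot: example Java Swing application demonstrating the K-means algorithm, with results computed and visualized in R]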

R is a programming language for statistical computing known for its rich ecosystem of third-party libraries called “packages”. The R language is typically used for statistics, data mining, analysis and visualization, machine learning, and similar tasks. However, R remains, at least in essence, a general-purpose language that can also be used, for example, to develop reactive web applications with the Shiny framework.

FastR is a GraalVM-based implementation of the R language that provides significant improvements in the performance of R code, the embedding of the R execution engine into Java applications, and the ability to interface with other GraalVM and JVM languages including Python, Java, and Scala. FastR is also able to utilize the tooling support provided by the GraalVM ecosystem.

To try out FastR, you can download GraalVM and install FastR using:

$GRAALVM_HOME/bin/gu install R

You can start the R interactive console by running:

$GRAALVM_HOME/bin/R

Compatibility with GNU-R

The goal of FastR is to be a drop-in replacement for GNU-R, the reference implementation of the R language. So far, FastR is capable of running binary R packages built for GNU-R as long as those packages properly use the R extensions C API. However, for best results, it is recommended to install R packages from source. FastR supports R graphics via the grid package and packages based on grid (lattice and ggplot2). We are currently working towards support for the base graphics package.

FastR currently works with many of the popular R packages, such as

  • ggplot2
  • jsonlite
  • testthat
  • assertthat
  • knitr
  • Shiny
  • Rcpp
  • rJava
  • quantmod
  • and more…

Moreover, support for dplyr and data.table is on the way. We are actively monitoring and improving FastR support for the most popular packages published on CRAN, including all the tidyverse packages. However, keep in mind that FastR is still experimental: some packages are not compatible yet, and a complex R application may stumble on them.

Note: The most straightforward assessment of whether a package “works” with FastR would seem to be running its own tests with FastR. However, some packages have only trivial tests, while others have a large number of tests with bulletproof code coverage, which makes it very difficult to pass every last exotic corner case. The packages listed in this article either pass all their own tests or pass our tests, which we wrote to cover the common use cases.

High performance R code execution: do we still need Fortran?

The performance of R is a continuing sticking point for the R community, and the bane of many R developers is when they are forced to reimplement their R code, due to performance, in C, C++, or Fortran.

In the article “Throwing Shade at the Cartographer Illuminati: Raytracing the Washington Monument in R”, the author Tyler Morgan-Wall shows an R code snippet for a raytracing algorithm and concludes: “This method works, but it’s un-usably slow for all but small datasets”. What if we benchmark this example with FastR? The following plot shows the speedup, i.e. higher is better, of FastR over GNU-R after 5 iterations of warm-up.

[Plot: speedup of FastR over GNU-R on the raytracing example]
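For readers who want to reproduce this kind of measurement, here is a minimal sketch of a warm-up-then-measure harness. It is written in Python for illustration only and is not the harness used for the plots; `measure`, `speedup`, and `workload` are hypothetical names, and the default of 5 timed runs is an assumption (the article only states 5 warm-up iterations).

```python
import time


def measure(fn, warmup=5, runs=5):
    """Call fn `warmup` times untimed, then return the mean time of `runs` timed calls."""
    for _ in range(warmup):
        fn()  # lets a JIT-compiling runtime reach steady state before timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate over the baseline; above 1 means the candidate is faster."""
    return baseline_seconds / candidate_seconds


def workload():
    # Hypothetical stand-in for the benchmarked code
    return sum(i * i for i in range(10_000))


mean_seconds = measure(workload)
```

The speedup shown in such a plot is then the baseline time divided by the candidate time, each measured in its own runtime.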

FastR is about 3 times faster. Although this may look like a great result, the Graal compiler powering FastR’s high performance execution usually does much better when compared to GNU-R. We can use the built-in CPU sampler to find out that the bottleneck is in the following function call:

akima::bilinear(volc$x,volc$y,volc$z,x0=xcoord,y0=ycoord)$z

After a closer inspection, we determined that this is a simple wrapper for a Fortran function that carries out the actual computation. Conventional wisdom suggests that executing a call to a Fortran function on a better R engine can only improve it so much, but is this true for FastR?

One of the tenets for each of the languages built upon GraalVM is that developers should be able to focus on their problem domain using the language best suited for the task, no matter the abstraction level of the language. It is the job of the compiler, and not the programmer, to remove the abstractions and run the code as close as possible to the equivalent hand-optimized C code. So is it possible to rewrite that Fortran algorithm in R and get the same performance? If we do so, we finally get results that are more in line with what we expected of FastR, because the call into native code no longer creates a boundary for the optimizing compiler. Again, the following plot shows the speedup, i.e. higher is better, of FastR over GNU-R after 5 iterations of warm-up.

[Plot: speedup of FastR over GNU-R with the Fortran routine rewritten in R]
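To give a sense of how small the rewritten routine can be, here is a sketch of textbook bilinear interpolation on a regular grid. It is shown in Python for illustration and is neither the akima Fortran code nor our R rewrite; the function name, argument layout, and the decision to reject points outside the grid are all assumptions.

```python
from bisect import bisect_right


def bilinear(xs, ys, z, x0, y0):
    """Interpolate z at (x0, y0), where z[i][j] is the value at (xs[i], ys[j]).

    xs and ys must be strictly increasing with at least two entries each,
    and (x0, y0) must lie inside the grid.
    """
    if not (xs[0] <= x0 <= xs[-1] and ys[0] <= y0 <= ys[-1]):
        raise ValueError("point outside grid")
    # Lower-left corner of the enclosing cell, clamped so i + 1 and j + 1 stay valid
    i = min(bisect_right(xs, x0) - 1, len(xs) - 2)
    j = min(bisect_right(ys, y0) - 1, len(ys) - 2)
    # Relative position of the point inside the cell, each in [0, 1]
    tx = (x0 - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y0 - ys[j]) / (ys[j + 1] - ys[j])
    # Weighted average of the four corner values
    return ((1 - tx) * (1 - ty) * z[i][j]
            + tx * (1 - ty) * z[i + 1][j]
            + (1 - tx) * ty * z[i][j + 1]
            + tx * ty * z[i + 1][j + 1])


grid_x = [0.0, 1.0]
grid_y = [0.0, 1.0]
grid_z = [[0.0, 1.0], [2.0, 3.0]]  # z = 2 * x + y at the grid corners
center = bilinear(grid_x, grid_y, grid_z, 0.5, 0.5)
```

When a loop like this lives in the same language runtime as its caller, the compiler can inline and optimize across the call, which is exactly what a native-code boundary prevents.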

We are excited to introduce a world where R code is no longer considered a second class citizen within its own applications. A world where R code is as fast as (or even faster than) equivalent C, C++, or Fortran code.

One of the long-term goals for FastR is to execute the native code via the GraalVM LLVM support. This would not only remove the optimization barrier between R and Fortran code but can also provide additional security guarantees and the ability to have a fully sandboxed execution of R packages.

Did we mention that all plots in this article were generated from FastR using ggplot2? There was no simple way of embedding SVG images on Medium, but FastR also contains a built-in SVG device. It differs from the GNU-R built-in SVG device and is closer to svglite in that it produces smaller and more web-friendly SVGs.

Polyglot programming: want an R package in Python?

Being a citizen of the GraalVM ecosystem brings many useful features, such as the ability to interoperate with other GraalVM-based languages. The one language that naturally comes to mind in the context of R is Python. There have been increasing discussions on which of the two is a better language for data science, which has a better debugger, which has better libraries, and so on. All those discussions become moot in the context of GraalVM, as we can simply use Python libraries from R and vice versa, and maybe throw in some JavaScript just for fun! Let’s take a look at an example where we use an R package from the Python 3 implementation in GraalVM.

The following algorithm, written in Python, is a simplistic way of generating random numbers drawn from the exponential distribution.
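A minimal sketch of such a generator, assuming the simplistic approach is inverse transform sampling; the function name, the `rate` parameter, and the fixed seed are illustrative choices, not taken from the original snippet.

```python
import math
import random


def exponential_sample(rate, rng=random):
    """Draw one value from Exp(rate) by inverting the CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random()  # uniform in [0, 1)
    return -math.log(1.0 - u) / rate  # 1 - u lies in (0, 1], so the log is defined


rng = random.Random(42)  # seeded only to make the sketch repeatable
samples = [exponential_sample(2.0, rng) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate = 0.5
```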

Now if we want to test the sample we generated is from this distribution, we can use the Kolmogorov-Smirnov test from the R base library. The following snippet is Python code that uses the built-in interoperability feature to call a function inside an R package:

Finally, let’s visualize the data with R’s lattice package:

[Figure: Using lattice plots from Python]
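To see what the Kolmogorov-Smirnov test mentioned above actually measures, here is a plain-Python sketch of the one-sample statistic against the exponential distribution. It is not R’s ks.test and does not compute a p-value; the function name and the sample sizes are assumptions made for illustration.

```python
import math
import random


def ks_statistic_exponential(samples, rate):
    """Largest gap between the empirical CDF of `samples` and the Exp(rate) CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(1)
exponential_data = [rng.expovariate(1.0) for _ in range(2000)]
uniform_data = [rng.random() for _ in range(2000)]  # deliberately the wrong distribution

d_exponential = ks_statistic_exponential(exponential_data, 1.0)
d_uniform = ks_statistic_exponential(uniform_data, 1.0)
```

A small statistic means the sample is consistent with the distribution; the uniform sample produces a much larger gap than the exponential one.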

Note: the awt function opens a window with the plot instead of plotting into an image file. Use the following code structure to get the SVG code of the plot or to save it as a PNG image.

If you’d like to run the Python example yourself, be sure to install Python support into GraalVM, and don’t forget to enable polyglot access for the graalpython command like so:

$GRAALVM_HOME/bin/gu install python
$GRAALVM_HOME/bin/graalpython --jvm --polyglot ...

At this point, a frequently asked question is: what is the performance overhead of such complex language interoperability? Remember how we rewrote that Fortran function into an R function in the previous section? This removed the boundary between GraalVM and the native code, but if we stay within GraalVM we can implement the function in any language without introducing any boundary for the optimizing compiler. Therefore, we can, for example, rewrite the function in Python. The plot below shows that rewriting it in Python does incur a small overhead, but we are working towards reducing this overhead, and in theory it is possible to remove it completely. Nonetheless, the R+Python version is still an order of magnitude faster than the pure R version on GNU-R! Once FastR starts using GraalVM LLVM support to run the R packages’ native code, we’ll remove even the boundary for Fortran and C code.

[Plot: performance of the R+Python version on FastR compared with the pure R versions]

To make the analysis of that benchmark complete, here is a plot with warm-up curves and absolute times, i.e. lower is better, of all the variants.

[Plot: warm-up curves and absolute times of all the variants]

The seamless and efficient interoperability opens up many interesting possibilities. Galaaz is a programming language, or, if you prefer, an embedded DSL, that lets you use Ruby syntax to access the richness of the R ecosystem, including R graphics. Under the hood, Galaaz uses the interop features of TruffleRuby. With Galaaz you can, for example, create ggplot2 visualizations in Ruby using familiar Ruby-like syntax.

Embedding R into your JVM applications

Another use case of the GraalVM ecosystem for R is the seamless embedding of R into Java applications. FastR or any GraalVM-based language can be embedded using the Graal SDK. Calling an R function from Java is as simple as this:

How can you pass more complex data from Java to R? Imagine an existing Java application with a class User that holds the data of a user. We have a collection of such objects and we want to run an R script that does some analysis of that data. With FastR, the only thing we need to do is implement a few Java proxy classes that transform the data so that it has the expected shape of an R data frame, which is a “by column” data structure like a relational database table. We can then pass the Java objects directly to the R script. This process removes the need to copy data between two vastly different platforms. The Java classes may be implemented as follows:

In the code below, we execute an R script that filters out users with an id greater than 2 from an R data frame. The data frame is in fact backed by the Java classes from the code snippet above and there is no unnecessary copying or marshaling between Java and FastR.

The full example is available on GitHub.
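The “by column” proxy idea can also be illustrated independently of Java. The Python sketch below is only an analogy and does not use the Graal SDK proxy API; `ColumnView`, `UsersTable`, and the `name` field are hypothetical, and only the User class with an id comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str  # hypothetical field, added for illustration


class ColumnView:
    """Read-only view exposing one attribute of each row object as a column."""

    def __init__(self, rows, attribute):
        self._rows = rows  # keeps a reference to the rows, no copy is made
        self._attribute = attribute

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        return getattr(self._rows[index], self._attribute)


class UsersTable:
    """Presents a list of User objects in a column-oriented, data-frame-like shape."""

    def __init__(self, users):
        self.columns = {
            "id": ColumnView(users, "id"),
            "name": ColumnView(users, "name"),
        }


users = [User(1, "Ann"), User(2, "Bob"), User(3, "Cid"), User(4, "Dee")]
table = UsersTable(users)

# Select the rows whose id is greater than 2, reading through the column view
ids = table.columns["id"]
selected_names = [table.columns["name"][i] for i in range(len(ids)) if ids[i] > 2]
```

Because each column reads through to the original objects, a change to a User is visible in the table immediately, which is the property that makes copying unnecessary.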

FastR is also able to redirect R’s graphical output to a given Java Graphics2D object. The screenshot at the top of this article was taken from an example application that demonstrates the K-means algorithm. The results are computed and visualized in R but displayed in a Java Swing desktop application. The full code is again available on GitHub.

Speaking of Java, the traditional way of interacting with Java from R is the rJava package. FastR includes its own implementation of the rJava API, which is an order of magnitude faster than rJava on GNU-R. If you use FastR’s native Java interoperability, which has almost the same syntax as rJava, the code can be orders of magnitude faster than rJava on GNU-R. The following code snippet, taken from this example, shows the usage of rJava.

And the following plot shows the warm-up curves, i.e. lower is better, of rJava on GNU-R, FastR and the native Java interoperability in FastR.

[Plot: warm-up curves of rJava on GNU-R, rJava on FastR, and the native Java interoperability in FastR]

Conclusion

In this article we presented FastR, an experimental implementation of the R programming language that aims to be fully compatible with GNU-R, can offer significant performance improvements over the reference implementation of R, and is a member of the GraalVM ecosystem. We are excited to bring many new and interesting features to the R community, including, but not limited to, interoperability with other languages, development tools, and Java embedding.

FastR can be installed into a GraalVM distribution, and if you are using R, please try it out! If you find any incompatibilities with the reference implementation or the ecosystem of the packages, let us know. We’d be delighted to figure it out!

The future of FastR

Are you considering using FastR for your next project? Which R packages are the most important to you? Are there any extra features that you would like to see in FastR? We would welcome any feedback and suggestions in the comments section below, in our GitHub repo, on our mailing list: [email protected], or you can email me personally.

Besides working on the compatibility and performance we are also currently investigating the following areas:

  • Use FastR Java embedding to run R on Spark more efficiently.
  • Provide a FastR-specific backend for the “future” package to leverage FastR’s in-process parallel execution abilities, especially in the context of Shiny server applications.
  • Run the Shiny web application framework on an enterprise Java Servlet container instead of httpuv.
  • Provide a way to gradually move parts of existing R code to FastR. Example: execute this hot loop on FastR, but run the rest of the application in GNU-R for better compatibility. Note that FastR can operate on the raw data of GNU-R’s vectors without needing to copy them.
  • Provide a fully sandboxed mode for the Java embedding use case.
