Playing Map() and Reduce() in R – Subsetting

In the previous post ( https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation ), I showed how to apply the MapReduce pattern to by-group statistics. The same divide-and-conquer strategy applies to other use cases as well, one of which is subsetting.

The example below again uses the iris data for demonstration. In R, the most convenient way to subset is probably the subset() function, which scans the entire data.frame for rows meeting the condition stored in the “expr” expression below.

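data(iris)
# Store the filter condition as an unevaluated expression so it can be reused
expr <- expression(Sepal.Length > 7 & Sepal.Width > 3)
# Evaluate the condition against every row of the data.frame
subset(iris, eval(expr))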
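To make explicit what subset() is doing with the stored condition, here is a minimal sketch that evaluates the same expression by hand and uses plain row indexing. The names keep, by_index and by_subset are illustrative only; everything else is base R.

```r
data(iris)
expr <- expression(Sepal.Length > 7 & Sepal.Width > 3)

# Evaluate the stored condition with the data.frame's columns in scope.
# The result is a logical vector with one element per row.
keep <- eval(expr, iris)

# Plain row indexing on the positions where the condition holds
by_index <- iris[which(keep), ]

# The subset() call from the post, for comparison
by_subset <- subset(iris, eval(expr))

# Check that both approaches select the same rows
identical(rownames(by_index), rownames(by_subset))
```

Keeping the condition in an expression object means the very same filter can later be evaluated against any piece of the data, which is what the chunked version relies on.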

Once the data.frame is partitioned into multiple pieces, the row search fits naturally into the MapReduce paradigm, as described in the logic flow below.

1. First, the iris data is divided into chunks with an equal number of rows, e.g. two chunks in the example.

2. Next, a Map() function performs the row search within each chunk.

3. Once each chunk has returned the rows meeting the criteria, a Reduce() function combines all outcomes.

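# Number of chunks
n <- 2
# Assign each row a chunk id (0 to n - 1) and split into contiguous chunks
lst <- split(iris, sort((1:nrow(iris)) %% n))
# Map: filter rows within each chunk; Reduce: stack the matches
Reduce(rbind, Map(function(x) x[with(x, which(eval(expr))), ], lst))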
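As a sanity check on the three steps, the sketch below wraps them in a small helper and compares the outcome with a direct subset() call. The helper name subset_by_chunk and its arguments are hypothetical, introduced only for illustration. Values and row names are compared separately so that the check does not depend on how row names happen to be stored internally.

```r
data(iris)
expr <- expression(Sepal.Length > 7 & Sepal.Width > 3)

# Hypothetical helper: split df into n contiguous chunks,
# filter each chunk with the stored condition, and recombine.
subset_by_chunk <- function(df, cond, n = 2) {
  grp <- sort(seq_len(nrow(df)) %% n)                 # chunk id per row
  chunks <- split(df, grp)                            # divide
  hits <- Map(function(x) x[which(eval(cond, x)), ], chunks)  # map
  Reduce(rbind, hits)                                 # reduce
}

chunked <- subset_by_chunk(iris, expr, n = 2)
direct <- subset(iris, eval(expr))

# Same rows, same values
identical(rownames(chunked), rownames(direct))
all.equal(chunked, direct, check.attributes = FALSE)
```

Because sort() keeps the chunk ids in ascending order, each chunk covers a contiguous block of rows, so stacking the chunk results preserves the original row order.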

Note that the Map() call above still runs sequentially and does not leverage multiple CPUs. The CPU usage shows that only one CPU is busy while the rest are idle.

[Figure: CPU usage during the sequential Map() run, with only one CPU busy]

Similar to the by-group summary, the by-chunk row search doesn’t have to run sequentially; the chunks can be processed simultaneously across multiple CPUs with the mcMap() function from the parallel package, as outlined below.

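1. Again, it starts with the data partition, with two differences in this example. First, the data is split based on the number of CPUs reported by the detectCores() function. Second, each chunk is wrapped in a future, so mcMap() receives a list of future abstractions, e.g. “flst” in the code snippet, rather than the data.frames themselves. Note that split() still builds the chunks in memory before they are wrapped, and under the default sequential plan of the future package each future is resolved as soon as it is created.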

2. In the second step, the mcMap() function retrieves the value of each future, i.e. the partitioned data, and then performs the row search within each chunk. Since mcMap() relies on forking, it cannot use more than one core on Windows.

3. Finally, the Reduce() function collects and combines all outcomes.

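# parallel provides detectCores() and mcMap(); future provides future() and value()
pkgs <- c("parallel", "future")
mapply(function(x) require(x, character.only = TRUE), pkgs)
# One chunk per detected CPU
n <- detectCores()
# Wrap each chunk in a future
flst <- Map(function(x) future({x}), split(iris, sort((1:nrow(iris)) %% n)))
# Resolve each future once, filter its rows in a forked worker, then stack the matches
Reduce(rbind, mcMap(function(x) { d <- value(x); d[with(d, which(eval(expr))), ] }, flst, mc.cores = n))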

Looking at the CPU usage again, we can see that all CPUs are now in use.

[Figure: CPU usage during the mcMap() run, with all CPUs busy]
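For readers who want to try the parallel step without the future package, the following is a simplified variant rather than the code used in this post: it passes the chunks to mcMap() directly and guards the core count. It assumes the chunks fit comfortably in memory. The cap of four workers is an arbitrary choice for this sketch, and the names chunks and parallel_hits are illustrative.

```r
library(parallel)

data(iris)
expr <- expression(Sepal.Length > 7 & Sepal.Width > 3)

# detectCores() may return NA, and forking is not available on Windows,
# so fall back to a single core in both cases.
n <- detectCores()
if (is.na(n) || .Platform$OS.type == "windows") n <- 1L
n <- min(n, 4L)  # arbitrary cap for this sketch (an assumption, not from the post)

# Divide: one contiguous chunk per worker
chunks <- split(iris, sort(seq_len(nrow(iris)) %% n))

# Map in parallel, then reduce
parallel_hits <- Reduce(
  rbind,
  mcMap(function(x) x[which(eval(expr, x)), ], chunks, mc.cores = n)
)

# Check that the parallel result selects the same rows as subset()
identical(rownames(parallel_hits), rownames(subset(iris, eval(expr))))
```

With a data set as small as iris, the overhead of forking will likely outweigh any gain; the pattern is meant for data large enough that the row search dominates the run time.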
