71

Playing Map() and Reduce() in R – Subsetting

 5 years ago
source link: https://www.tuicool.com/articles/hit/bMFZzqZ
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

In the previous post ( https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation ), I’ve shown how to employ the MapReduce when calculating by-group statistics. Actually, the same Divide-n-Conquer strategy can be applicable to other use cases, one of which is the subsetting operation.

In the example below, let’s still use the same iris data for the demonstration purpose. In R, the most convenient way to perform the subsetting might be the subset() function, which would search for rows meeting the condition described in the “expr” expression below throughout the entire data.frame.

data(iris)
expr = expression(Sepal.Length > 7 & Sepal.Width > 3)
subset(iris, eval(expr))

With the whole data.frame partitioned into multiple pieces, the row searching operation can perfectly fit into the MapReduce paradigm, as described in the logic flow below.

1. First of all, the iris data is divided into chucks with equal number of rows, e.g. two chunks in the example.

2. Next, a Map() function is used to perform the row searching operation within each chuck.

3. Upon the return of rows meeting the criteria from each chuck, a Reduce() function is employed to combine all outcomes together.

n <- 2
lst <- split(iris, sort((1:nrow(iris)) %% n))
Reduce(rbind, Map(function(x) x[with(x, which(eval(expr))), ], lst))

It is noted that the above map operation is still performed sequentially without leveraging the computing power of multiple CPUs. In the CPU usage, we can see that only one CPU is used and the rest are idle.

UR3aeqQ.png!web

Similar to the by-group summary, the by-chunk operation of row searching doesn’t have to be in the sequential order and can be distributed simultaneously across multiple CPUs by using the mcMap() function, as outlined below.

1. Again, it starts with the data partition. However, there are two caveats in the example. Firstly, the data is split based upon the number of CPUs captured by the detectCores() function. Secondly, the partitioned data is NOT stored physically in the memory but reflected logically by a list of future abstractions, e.g. “flst” in the code snippet.

2. In the second step, the mcMap() function is used to evaluate each future abstraction, return the partitioned data, and then perform the row searching within each chunk.

3. At last, the Reduce() function collects and combines all outcomes.

pkgs <- c("parallel", "future")
mapply(function(x) require(x, character.only = T), pkgs)
n <- detectCores()
flst <- Map(function(x) future({x}), split(iris, sort((1:nrow(iris)) %% n)))
Reduce(rbind, mcMap(function(x) value(x)[with(value(x), which(eval(expr))), ], flst, mc.cores = n))

If we take a look at the CPU usage again, it is now shown that all CPUs are utilized.

JJ3i6jm.png!web


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK