01/11/2021

What is parallel processing?

Parallel processing is a type of computation in which many calculations or processes are carried out simultaneously.

Parallel Packages in R

How do we do this?

Multi-core processors. Most computers today have several cores, and the following packages let R use them:

  • parallel [base R]
  • Rmpi
  • future
  • foreach
  • etc.

The parallel package

The parallel package is now part of the core distribution of R. It includes a number of different mechanisms that let you exploit parallelism, using the multiple cores in your processor(s) as well as compute resources distributed across a network as a cluster of machines.

However, in this talk, we will stick to making the most of the resources available on the machine on which you are running R.

Some steps

How many cores do you have?

# Load the package
library(parallel)
detectCores() 
## [1] 8
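
A common next step is to size the cluster from this number instead of hard-coding it. The lines below are a minimal sketch, assuming you want to leave one core free for the rest of the system. The names `n_cores` and `n_workers` are illustrative and not part of the package. `detectCores()` can return `NA` on some platforms, so the sketch guards against that.

```r
library(parallel)

n_cores <- detectCores()
# detectCores() may return NA if the count cannot be determined
if (is.na(n_cores)) n_cores <- 1L

# Assumption: keep one core free, but never go below one worker
n_workers <- max(1L, n_cores - 1L)
n_workers
```

The result can then be passed to `makeCluster()` in place of a fixed number.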

Starting clusters

cl <- makeCluster(2) 

Sending libraries to clusters

clusterEvalQ(cl, {
  library(tidyverse)
})
## [[1]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
## [13] "utils"     "datasets"  "methods"   "base"

Sending variables and functions to clusters

a <- 2
square <- function(num) num^2

clusterExport(cl, c("a", "square")) 

# To check that they were received, run another clusterEvalQ()
clusterEvalQ(cl, {
  print(c(a, square(a)))
})
## [[1]]
## [1] 2 4
## 
## [[2]]
## [1] 2 4
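
Once the workers have the variable and the function, they can be used inside a parallel apply. The following is a small self-contained sketch of that idea. It starts its own two-worker cluster, named `cl_demo` here purely for illustration, so that it does not interfere with `cl` above. Passing `envir = environment()` is an assumption made so the export also works when the code is not run from the global environment.

```r
library(parallel)

a <- 2
square <- function(num) num^2

cl_demo <- makeCluster(2)

# Copy `a` and `square` from the current environment to every worker
clusterExport(cl_demo, c("a", "square"), envir = environment())

# Each worker evaluates square(x) + a using its exported copies
res <- parSapply(cl_demo, 1:4, function(x) square(x) + a)

stopCluster(cl_demo)
res
## [1]  3  6 11 18
```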

Stopping a cluster

stopCluster(cl)
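
If the parallel code fails before `stopCluster()` is reached, the worker processes can be left running. One way to guard against this is to create the cluster inside a function and register the shutdown with `on.exit()`. The helper below, `with_cluster()`, is a hypothetical name for this sketch and not a function from the parallel package.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each element of `xs` on a temporary
# local cluster, and always shut the cluster down afterwards.
with_cluster <- function(xs, fun, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)  # runs even if parLapply() errors
  parLapply(cl, xs, fun)
}

res <- with_cluster(1:3, function(x) x * 10)
unlist(res)
## [1] 10 20 30
```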

Example of time saved

Make the computer sleep for 3 seconds, and repeat this 5 times…

Running in series

ptm <- proc.time()
for (i in 1:5) Sys.sleep(3) 
print(proc.time()-ptm)
##    user  system elapsed 
##    0.02    0.00   15.23

Running in parallel

library(parallel)
ptm <- proc.time()
cl <- makeCluster(8) 
invisible(parSapply(cl, rep(3,5), Sys.sleep)) # invisible() just hides the list of NULLs that parSapply() returns
stopCluster(cl)
print(proc.time()-ptm)
##    user  system elapsed 
##    0.05    0.11    5.39
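
The same comparison can be written with `system.time()`, while also checking that the serial and parallel runs return the same values. This is a sketch under stated assumptions: `slow_square()` is a made-up function that sleeps briefly to stand in for real work, and the cluster is created before timing so that start-up cost is not counted. No timings are shown because they depend on the machine.

```r
library(parallel)

# Made-up stand-in for an expensive computation
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

# Serial run
serial_time <- system.time(serial <- sapply(1:4, slow_square))

# Parallel run on two local workers
cl <- makeCluster(2)
parallel_time <- system.time(par_res <- parSapply(cl, 1:4, slow_square))
stopCluster(cl)

# Both approaches should give the same values
all.equal(serial, par_res)

# Compare elapsed wall-clock time
serial_time["elapsed"]
parallel_time["elapsed"]
```

With only a few short tasks, the cost of starting workers and sending data to them can outweigh the gain, so it is worth timing both versions on your own workload.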

Real example

Further reading