01/11/2021

A good programming practice typically proceeds in the following steps:

  • making the code run (debugging)
  • making the code right
  • profiling the code
  • making the code fast

Keep in mind that…

Premature optimization is the root of all evil
Donald Knuth

Profiling R code

Profiling is a systematic way to measure how much time is spent in each part of a program, which is often necessary for complex projects and/or handling big data. Once the profiling procedure successfully detect the bottlenecks in a code (i.e., the parts running slowly), one can move on to think about how to optimize the code and improve its performance. R has a built-in profiler to help a programmer speed up his/her code.

Profiling vs Benchmarking

Profiling

  • If you’ve decided you need your code to perform better, profile first.
  • Profiling helps isolate hot spots.
  • Time spent here will likely yield best return on your investment (usually time).

Benchmarking

  • With hot spots in hand, examine the code and propose alternatives.
  • While ensuring the results are the same, ask which performs best.

R profiling tools

R has builtin support for profiling, but there are additional packages available:

  • proftools

  • profvis

    (RStudio integration)

Basic profiling with proftools

f <- function(a) { g1(a) + g2(2 * a) }

g1 <- function(a) { h(a) }

g2 <- function(a) { sqrt(a) }

h <- function(a) {
  b <- double(length(a))
  for (i in seq_along(a)) {
    b[i] <- sqrt(a[i])
  }
  b
}

Basic profiling with proftools

x <- 1:1000000
Rprof('prof.out')
for (i in 1:10) {
  y <- f(x)
}
Rprof(NULL)
summaryRprof("prof.out")$by.self
##                    self.time self.pct total.time total.pct
## "h"                     2.00    77.52       2.12     82.17
## "g2"                    0.28    10.85       0.34     13.18
## "double"                0.10     3.88       0.10      3.88
## "f"                     0.04     1.55       2.52     97.67
## "cmp"                   0.02     0.78       0.06      2.33
## "*"                     0.02     0.78       0.02      0.78
## "constantFoldCall"      0.02     0.78       0.02      0.78
## "findCenvVar"           0.02     0.78       0.02      0.78
## "grepl"                 0.02     0.78       0.02      0.78
## "lazyLoadDBfetch"       0.02     0.78       0.02      0.78
## "Rprof"                 0.02     0.78       0.02      0.78
## "sqrt"                  0.02     0.78       0.02      0.78

Basic profiling with profvis

Can also do this in RStudio, e.g. Profile -> Start Profile

library(profvis)
## Warning: package 'profvis' was built under R version 4.0.5
profvis({
for (i in 1:10) {
  y <- f(x)
}
})

Benchmarking

Knowing where code is slow via profiling, use benchmarking tools.

  • Put problem code into a functions.
  • Benchmark different versions of code for comparison.
  • system.time is useful for long running code.
  • microbenchmark package is useful for analyzing short running code.

system.time()

The system.time() function takes an arbitrary R expression as input (can be wrapped in curly braces) and computes the time (in seconds) needed to execute the expression, and if there’s an error, gives the time until the error occurred.

r <- 100
system.time({while (r < 200 && r >= 1) {
  dr <- rbinom(1, 1, .5)
  if (dr == 1) {r <- r + 1} 
  else {r <- r -1}}})
##    user  system elapsed 
##    0.10    0.02    0.13

Are for loops in R slow?

  • Not all for loops are bad
  • Most common mistakes involve for loops.
  • The classic mistake is not preallocating a result vector.

Example

Create a vector of length n where all values are x

Example 1: a bad for-loop

bad.for <- function(n,x) {
  result <- NULL
  for (i in 1:n) {
    result[i] <- x
  }
  result
}
  • Large number of iterations
  • Tiny amount of computation per iteration
  • Item result vector is reallocated and copied on each iteration
  • Triggering garbage collection periodically

Example 1: a better for-loop

okay.for <- function(n,x) {
  result <- double(n)
  for (i in 1:n) {
    result[i] <- x
  }
  result
}

Improvement over the previous example, but it’s still slow because of the many tiny iterations.

Example 1: a puzzle loop

strange.for <- function(n, x) {
  result <- NULL
  for (i in n:1) {
    result[i] <- x
  }
  result
}

Is this loop faster or slower than the previous two?

Example 1: using a vector function

# use of vector assignment
vector.assn <- function(n, x) {
  result <- double(n)
  result[] <- x
  result
}

We can also use vector assignment

Example 1: using R built-in function

built.in <- function(n, x) {
  rep(x, n)
}

Or, we could read the fine manual and use a built-in function

Example 1: testing

Make sure functions produce identical output

n <- 10000
x <- 7
bad.result        <- bad.for(n, x)
okay.result       <- okay.for(n, x)
strange.result    <- strange.for(n, x)
vector.result     <- vector.assn(n, x)
built.result      <- built.in(n, x)
c(identical(bad.result, okay.result),
identical(bad.result, strange.result),
identical(bad.result, vector.result),
identical(bad.result, built.result))
## [1] TRUE TRUE TRUE TRUE

Example 1: benchmark results

library("microbenchmark")
## Warning: package 'microbenchmark' was built under R version 4.0.5
library(knitr)
res <- microbenchmark(bad=bad.for(n,x), okay=okay.for(n,x), strange=strange.for(n,x),
                      vector=vector.assn(n,x), builtin=built.in(n,x))
kable(summary(res, unit="relative"))
expr min lq mean median uq max neval cld
bad 137.938202 136.269044 78.510648 146.933862 128.444243 6.6135560 100 c
okay 33.005618 21.969206 13.864978 25.017196 25.301904 0.8315226 100 b
strange 32.775281 22.119935 15.100756 28.675926 27.078876 1.0748824 100 b
vector 1.926966 1.940032 2.510999 2.174603 1.945603 2.7339711 100 a
builtin 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a

Example 1: benchmark plot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
autoplot(res)

Further reading