Profiling R code

01/11/2021

A good programming practice typically proceeds in the following steps:

making the code run (debugging)
making the code right
profiling the code
making the code fast

Keep in mind that…

Premature optimization is the root of all evil
Donald Knuth

Profiling R code

Profiling is a systematic way to measure how much time is spent in each part of a program, which is often necessary for complex projects and/or handling big data. Once the profiling procedure successfully detect the bottlenecks in a code (i.e., the parts running slowly), one can move on to think about how to optimize the code and improve its performance. R has a built-in profiler to help a programmer speed up his/her code.

Profiling vs Benchmarking

Profiling

If you’ve decided you need your code to perform better, profile first.
Profiling helps isolate hot spots.
Time spent here will likely yield best return on your investment (usually time).

Benchmarking

With hot spots in hand, examine the code and propose alternatives.
While ensuring the results are the same, ask which performs best.

R profiling tools

R has builtin support for profiling, but there are additional packages available:

proftools
profvis
(RStudio integration)

Basic profiling with proftools

f <- function(a) { g1(a) + g2(2 * a) }

g1 <- function(a) { h(a) }

g2 <- function(a) { sqrt(a) }

h <- function(a) {
  b <- double(length(a))
  for (i in seq_along(a)) {
    b[i] <- sqrt(a[i])
  }
  b
}

Basic profiling with proftools

x <- 1:1000000
Rprof('prof.out')
for (i in 1:10) {
  y <- f(x)
}
Rprof(NULL)
summaryRprof("prof.out")$by.self

##                    self.time self.pct total.time total.pct
## "h"                     2.00    77.52       2.12     82.17
## "g2"                    0.28    10.85       0.34     13.18
## "double"                0.10     3.88       0.10      3.88
## "f"                     0.04     1.55       2.52     97.67
## "cmp"                   0.02     0.78       0.06      2.33
## "*"                     0.02     0.78       0.02      0.78
## "constantFoldCall"      0.02     0.78       0.02      0.78
## "findCenvVar"           0.02     0.78       0.02      0.78
## "grepl"                 0.02     0.78       0.02      0.78
## "lazyLoadDBfetch"       0.02     0.78       0.02      0.78
## "Rprof"                 0.02     0.78       0.02      0.78
## "sqrt"                  0.02     0.78       0.02      0.78

Basic profiling with profvis

Can also do this in RStudio, e.g. Profile -> Start Profile

library(profvis)

## Warning: package 'profvis' was built under R version 4.0.5

profvis({
for (i in 1:10) {
  y <- f(x)
}
})

Benchmarking

Knowing where code is slow via profiling, use benchmarking tools.

Put problem code into a functions.
Benchmark different versions of code for comparison.
system.time is useful for long running code.
microbenchmark package is useful for analyzing short running code.

system.time()

The system.time() function takes an arbitrary R expression as input (can be wrapped in curly braces) and computes the time (in seconds) needed to execute the expression, and if there’s an error, gives the time until the error occurred.

r <- 100
system.time({while (r < 200 && r >= 1) {
  dr <- rbinom(1, 1, .5)
  if (dr == 1) {r <- r + 1} 
  else {r <- r -1}}})

##    user  system elapsed 
##    0.10    0.02    0.13

Are for loops in R slow?

Not all for loops are bad
Most common mistakes involve for loops.
The classic mistake is not preallocating a result vector.

Example

Create a vector of length n where all values are x

Example 1: a bad for-loop

bad.for <- function(n,x) {
  result <- NULL
  for (i in 1:n) {
    result[i] <- x
  }
  result
}

Large number of iterations
Tiny amount of computation per iteration
Item result vector is reallocated and copied on each iteration
Triggering garbage collection periodically

Example 1: a better for-loop

okay.for <- function(n,x) {
  result <- double(n)
  for (i in 1:n) {
    result[i] <- x
  }
  result
}

Improvement over the previous example, but it’s still slow because of the many tiny iterations.

Example 1: a puzzle loop

strange.for <- function(n, x) {
  result <- NULL
  for (i in n:1) {
    result[i] <- x
  }
  result
}

Is this loop faster or slower than the previous two?

Example 1: using a vector function

# use of vector assignment
vector.assn <- function(n, x) {
  result <- double(n)
  result[] <- x
  result
}

We can also use vector assignment

Example 1: using R built-in function

built.in <- function(n, x) {
  rep(x, n)
}

Or, we could read the fine manual and use a built-in function

Example 1: testing

Make sure functions produce identical output

n <- 10000
x <- 7
bad.result        <- bad.for(n, x)
okay.result       <- okay.for(n, x)
strange.result    <- strange.for(n, x)
vector.result     <- vector.assn(n, x)
built.result      <- built.in(n, x)
c(identical(bad.result, okay.result),
identical(bad.result, strange.result),
identical(bad.result, vector.result),
identical(bad.result, built.result))

## [1] TRUE TRUE TRUE TRUE

Example 1: benchmark results

library("microbenchmark")

## Warning: package 'microbenchmark' was built under R version 4.0.5

library(knitr)
res <- microbenchmark(bad=bad.for(n,x), okay=okay.for(n,x), strange=strange.for(n,x),
                      vector=vector.assn(n,x), builtin=built.in(n,x))
kable(summary(res, unit="relative"))

expr	min	lq	mean	median	uq	max	neval	cld
bad	137.938202	136.269044	78.510648	146.933862	128.444243	6.6135560	100	c
okay	33.005618	21.969206	13.864978	25.017196	25.301904	0.8315226	100	b
strange	32.775281	22.119935	15.100756	28.675926	27.078876	1.0748824	100	b
vector	1.926966	1.940032	2.510999	2.174603	1.945603	2.7339711	100	a
builtin	1.000000	1.000000	1.000000	1.000000	1.000000	1.0000000	100	a

Example 1: benchmark plot

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

autoplot(res)

A good programming practice typically proceeds in the following steps:

Profiling R code

Profiling vs Benchmarking

R profiling tools

Basic profiling with proftools

Basic profiling with proftools

Basic profiling with profvis

Benchmarking

system.time()

Are for loops in R slow?

Example

Example 1: a bad for-loop

Example 1: a better for-loop

Example 1: a puzzle loop

Example 1: using a vector function

Example 1: using R built-in function

Example 1: testing

Example 1: benchmark results

Example 1: benchmark plot

Further reading