- making the code run (debugging)
- making the code right
- profiling the code
- making the code fast
Keep in mind that…
Premature optimization is the root of all evil
Donald Knuth
01/11/2021
Keep in mind that…
Premature optimization is the root of all evil
Donald Knuth
Profiling is a systematic way to measure how much time is spent in each part of a program, which is often necessary for complex projects and/or handling big data. Once the profiling procedure successfully detect the bottlenecks in a code (i.e., the parts running slowly), one can move on to think about how to optimize the code and improve its performance. R has a built-in profiler to help a programmer speed up his/her code.
Profiling
Benchmarking
R has builtin support for profiling, but there are additional packages available:
proftools
profvis
(RStudio integration)f <- function(a) { g1(a) + g2(2 * a) } g1 <- function(a) { h(a) } g2 <- function(a) { sqrt(a) } h <- function(a) { b <- double(length(a)) for (i in seq_along(a)) { b[i] <- sqrt(a[i]) } b }
x <- 1:1000000 Rprof('prof.out') for (i in 1:10) { y <- f(x) } Rprof(NULL) summaryRprof("prof.out")$by.self
## self.time self.pct total.time total.pct ## "h" 2.00 77.52 2.12 82.17 ## "g2" 0.28 10.85 0.34 13.18 ## "double" 0.10 3.88 0.10 3.88 ## "f" 0.04 1.55 2.52 97.67 ## "cmp" 0.02 0.78 0.06 2.33 ## "*" 0.02 0.78 0.02 0.78 ## "constantFoldCall" 0.02 0.78 0.02 0.78 ## "findCenvVar" 0.02 0.78 0.02 0.78 ## "grepl" 0.02 0.78 0.02 0.78 ## "lazyLoadDBfetch" 0.02 0.78 0.02 0.78 ## "Rprof" 0.02 0.78 0.02 0.78 ## "sqrt" 0.02 0.78 0.02 0.78
Can also do this in RStudio, e.g. Profile -> Start Profile
library(profvis)
## Warning: package 'profvis' was built under R version 4.0.5
profvis({ for (i in 1:10) { y <- f(x) } })
Knowing where code is slow via profiling, use benchmarking tools.
The system.time() function takes an arbitrary R expression as input (can be wrapped in curly braces) and computes the time (in seconds) needed to execute the expression, and if there’s an error, gives the time until the error occurred.
r <- 100 system.time({while (r < 200 && r >= 1) { dr <- rbinom(1, 1, .5) if (dr == 1) {r <- r + 1} else {r <- r -1}}})
## user system elapsed ## 0.10 0.02 0.13
Create a vector of length n where all values are x
bad.for <- function(n,x) { result <- NULL for (i in 1:n) { result[i] <- x } result }
okay.for <- function(n,x) { result <- double(n) for (i in 1:n) { result[i] <- x } result }
Improvement over the previous example, but it’s still slow because of the many tiny iterations.
strange.for <- function(n, x) { result <- NULL for (i in n:1) { result[i] <- x } result }
Is this loop faster or slower than the previous two?
# use of vector assignment vector.assn <- function(n, x) { result <- double(n) result[] <- x result }
We can also use vector assignment
built.in <- function(n, x) { rep(x, n) }
Or, we could read the fine manual and use a built-in function
Make sure functions produce identical output
n <- 10000 x <- 7 bad.result <- bad.for(n, x) okay.result <- okay.for(n, x) strange.result <- strange.for(n, x) vector.result <- vector.assn(n, x) built.result <- built.in(n, x) c(identical(bad.result, okay.result), identical(bad.result, strange.result), identical(bad.result, vector.result), identical(bad.result, built.result))
## [1] TRUE TRUE TRUE TRUE
library("microbenchmark")
## Warning: package 'microbenchmark' was built under R version 4.0.5
library(knitr) res <- microbenchmark(bad=bad.for(n,x), okay=okay.for(n,x), strange=strange.for(n,x), vector=vector.assn(n,x), builtin=built.in(n,x)) kable(summary(res, unit="relative"))
expr | min | lq | mean | median | uq | max | neval | cld |
---|---|---|---|---|---|---|---|---|
bad | 137.938202 | 136.269044 | 78.510648 | 146.933862 | 128.444243 | 6.6135560 | 100 | c |
okay | 33.005618 | 21.969206 | 13.864978 | 25.017196 | 25.301904 | 0.8315226 | 100 | b |
strange | 32.775281 | 22.119935 | 15.100756 | 28.675926 | 27.078876 | 1.0748824 | 100 | b |
vector | 1.926966 | 1.940032 | 2.510999 | 2.174603 | 1.945603 | 2.7339711 | 100 | a |
builtin | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0000000 | 100 | a |
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
autoplot(res)
Optimising code by Hadley Wickham