Consistently Infrequent

December 8, 2011

Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!

Filed under: R — Tags: , , — Tony Breyal @ 1:41 pm

Introduction

The old r-wiki optimisation challenge describes a string generation problem which I have bloged about previously both here and here.

The Objective

To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

Solutions One Through Thirteen

A variety of different approaches are illustrated on the r-wiki page which show the performance benefits of things like vectorisation, variable initialisation, linking through to a compiled programming language, reducing a problem to its component parts, etc.

The Fourteenth Solution

The main speed improvement here comes from replacing the function “paste” by “file.path”. This use of “file.path” with parameter fsep=”” only works correctly here because there is never a character vector of length 0 for it to deal with. I only learned about this approach when I happened to see this tweet on twitter with hashtag #rstats and reading the associated help file where it says that it is faster than paste.

generateIndex14 <- function(n) {
  # initialise vectors
  s <- (mode = "character", length = n)

  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # paste strings together
  unlist(lapply(1:(n-1), function(i) file.path("i", s[i], ".", s[(i+1):n], fsep = "") ), use.names = FALSE)
}

Timings:

               test  elapsed    n replications
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Fifteenth Solution: Rcpp

This solution comes from Romain Francois and is based on the tenth solution but implemented in C++ using the R package Rcpp. See his blog for the implementation. This is the sort of thing I would love to learn to do myself but just need to find the time to re-learn C++, though I doubt that’ll happen any time soon as I’m hoping to start my MSc in Statistics next year. This is a great solution though.

Timings:

               test  elapsed    n replications
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Sixteenth Solution

When I was writing up this post I thought up a sixteenth solution (as seems to be the pattern with me on this blog!). This solution gets its speed up by generating the largest set of strings which start i001.xxx first and then replacing the “001” part with “002”, “003”, “004”, etc., for each increment up to and including n-1.


generateIndex16 <- function(n) {
  # initialise vectors
  str <- vector("list", length = n-1)
  s <- vector(mode = "character", length = n)

  # set up strings
  s <- sprintf("%03d", seq_len(n))
  str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "")

  # generate string sequences
  str[2:(n-1)] <- lapply(2:(n-1), function(i) sub("001", s[i], str[[1]][i:(n-1)], fixed=TRUE))
  unlist(str)
}

The above requires matching the “001” part first and then replacing it. However, we know that “001” will ALWAYS be in character positions 2, 3 and 4, and so there may be a way to avoid the matching part altogether (i.e. replace a fixed position substring with another string of equal or larger length) but I could not work out how to do that outside of a regular expression. Sadface.

Timings:

               test  elapsed    n replications
 generateIndex16(n) 20.77200 2000           50
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

Solutions Comparisons For Different N

I like ggplot2 charts and so ran my computer overnight to generate data for the speed performance of the last several solutions over different N:

Final Thoughts

I’m pretty sure that any more speed improvements will come from some or all of the follwing:

  • doing the heavy lifting in a compiled language and interfacing with R
  • running in parallel (I actually got this to work on linux by replacing lapply with mclapply from the parallel R package but the downside was that one has to use much more memory for larger values of N, plus it’s only works in serial fashion on Windows
  • working out an efficient way of replacing a fixed positioned substring with a string of equal or great length
  • compiling the function into R bytecodes using the compiler package function cmpfun

It would also be interesting to profile the memory usage of each funciton.

This was a fun challenge – if you find some spare time why not try your hand at it, you might come up with something even better!  🙂

November 2, 2011

Code Optimization: One R Problem, Ten Solutions – Now Eleven!

Filed under: R — Tags: , , , , , , , , , , — Tony Breyal @ 11:47 pm

Earlier this year I came across a rather interesting page about optimisation in R from rwiki. The goal was to find the most efficient code to produce strings which follow the pattern below given a single integer input n:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

From this we can see that the general pattern for n is:

# pseudo code
FOR j = 1 to n-1
    FOR k = j+1 to n
        PRINT "i" + format(j, digits = 3) + "." + format(k, digits = 3)
        k = k + 1
    END FOR
    j = j + 1
END FOR
# this produces (n-1)*n/2 strings.

It is rather heart warming to go though that rwiki page and see how we can sequentially optimise the algorithm in R to more efficiently produce the desired string sequence. I learned quite a lot from this page about R and how fun these types of challenges can be!

Looking at the tenth solution, it achieves its speed by recognising that there are a total of n unique strings (e.g. “001”, “002”) to the pattern. These can be assigned to a character vector which we subsequently call by working out the sequence of indices we require and passing them to the character vector:

generateIndex10 <- function(n) {
  s <- sprintf("%03d", seq_len(n));
  paste(
    "i",
    rep(s[1:(n-1)], (n-1):1),
    ".",
    unlist(lapply(2:n, function(k) s[k:n]), use.names=FALSE),
    sep=""
  )
}
generateIndex10(7)
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i001.007" "i002.003" "i002.004"
 [9] "i002.005" "i002.006" "i002.007" "i003.004" "i003.005" "i003.006" "i003.007" "i004.005"
[17] "i004.006" "i004.007" "i005.006" "i005.007" "i006.007"

Playing around with the solution above, I noticed that a speed up would be possible if the following were implemented:

  1. rep on a string vector is slower than simply using rep.int to work out the indices first and then passing those into the character vector.
  2. Initialise the vectors to the correct size. This seems to produce a speed up though I’m not sure why it would in a vectorised solution. I could see the benefit if I were creating the vectors using an explicit R loop and concatenating the results but that is not the case here.
  3. The functions could be compiled into bytecodes using the new compiler package (NB given that tenth solution is from 2007, this option would not have been available back then).
  4. This problem could be set up to run in parallel. I didn’t do this because I’m still within running distance of being an R noob 😀

Given the first three points above, this is the solution I came up with:

generateIndex11 <- function(n) {
  # initialise vectors
  len <- (n-1)*n/2
  s <- vector(mode = "character", length = n)
  ind.1 <- vector(mode = "integer", length = len)
  ind.2 <- vector(mode = "integer", length = len)

  # set up strings
  s <- sprintf("%03d", seq_len(n))

  # calculate indices
  ind.1 <- rep.int(1:(n-1), (n-1):1)
  ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)

  # paste strings together
  return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}

But is it faster? We can benchmark it and see!


# load packages
library(compiler)
library(rbenchmark)

# compile functions
compiled_generateIndex10 <- cmpfun(generateIndex10)
compiled_generateIndex11 <- cmpfun(generateIndex11)

# set size
n = 5000

# run benchmark
benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative",
          replications = 10)

###--- TIMINGS ---###
                          test replications elapsed relative
4 compiled_generateIndex11(n)            10   51.46 1.000000
3          generateIndex11(n)            10   51.67 1.004081
2  compiled_generateIndex10(n)           10   54.35 1.056160
1           generateIndex10(n)           10   61.20 1.189273

So it seems that there is a speed up between generateIndex10 and generateIndex11.   If we call the benchmark for different values of n and plot using the ggplot2 pack we get the following:


library(ggplot2)
get_bm <- function(n) {
  gc()
  df = benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
            columns = c("test", "elapsed"),
            order = "elapsed",
            replications = 10)
  df$n = n
  return(df)

}

df.list <- lapply(seq(1000, 9000, by = 250), get_bm)
df <- do.call("rbind", df.list)
ggplot(df, aes(x=n, y=elapsed, colour=test, shape=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")

Which produces the graph below. I don’t know what’s happening at the extreme values but I suspect that it has something to do with memory because on my 8GB machine, the RAM was mostly at 99% for the largest values of n. I also like how stable the compiled versions of the functions are:

In conclusion, whilst the compiled eleventh solution is faster for larger values of n, for values below 1000 the tenth solution is sometimes faster.

Writing this post, I noticed something and actually came up with an even faster twelfth  solution. When I get time in the next couple of days I’ll make a post about it and also submit it to rwiki.

Thanks for reading, this was my first proper R blog post, constructive comments are most welcome! Also, please kindly bare in mind that I am not an R expert but rather just trying to improve my knowledge of the language 🙂

Blog at WordPress.com.