Consistently Infrequent

November 7, 2011

Code Optimization: One R Problem, Eleven Solutions – Now Thirteen!

Filed under: R — Tags: , , , , — BD @ 11:35 am

Following up from my previous post “Code Optimisation: One R Problem, Ten Solutions – Now Eleven!” I figured out a twelfth solution after writing that blog post. Furthermore, half way through writing this blog post I figured out a thirteenth solution too. As a recap, the problem is taken from rwiki where the goal is to find the most efficient code to produce strings in the following pattern, given a single integer input n:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

The solution I had come up with was as follows:

generateIndex11 <- function(n) {
  # initialise vectors
  len <- (n-1)*n/2
  s <- vector(mode = "character", length = n)
  ind.1 <- vector(mode = "integer", length = len)
  ind.2 <- vector(mode = "integer", length = len)

  # set up strings
  s <- sprintf("%03d", seq_len(n))

  # calculate indicies
  ind.1 <- rep.int(1:(n-1), (n-1):1)
  ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)

  # paste strings together
  return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}

However, once I had finished writing that blog post, I realised that a further improvement was possible by not having to calculate ind.1 at all. This is because ind.2 has all the information we need:

n=5; lapply(2:n, ":", n)
[[1]]
[1] 2 3 4 5

[[2]]
[1] 3 4 5

[[3]]
[1] 4 5

[[4]]
[1] 5

Did you notice it? The pattern we need is effectively “1 with 2,3,4,5 “, “2 with 3,4,5”, “3 with 4, 5” and “4 with 5”.  All of that information is available in the one list above. Therefore, simplifying, a twelfth solution is:

generateIndex12 <- function(n) {
  s <- sprintf("%03d", seq_len(n))
  unlist(lapply(1:(n-1),
         function(i) {
             paste("i", s[i], ".", s[seq.int(i+1, n)], sep = "")
         }), use.names = FALSE)
}

# check
identical(generateIndex11(500), generateIndex12(500)
[1] TRUE

I would love to use the Rcpp package to produce a C++ version of the code above but unfortuantly that is beyound my current skill set but I’m sure it would result is a significant speed up. Perhaps.

Furthermore, a slightly faster thirteenth solution if we remember that initialising vectors and calculating indices independently before using lapply adds some speed too:


generateIndex13 <- function(n) {
  # initialise vectors
  s <- vector(mode = "character", length = n)
  ind <- vector(mode = "integer", length = (n-1)*n/2)
  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # calculate sequence of indicies for second part of strings e.g. "002" "003" etc in "i001.002" "i001.003"...
  ind <- lapply(2:n, ":", n)

  # paste strings together
  unlist(lapply(1:(n-1), function(i) paste("i", s[i], ".", s[ind[[i]]], sep = "") ), use.names = FALSE)
}
# check
identical(generateIndex11(500), generateIndex13(500)
[1] TRUE

Benchmarking the above (without compiled versions because of constraints on time)

# load packages
library(ggplot2)
library(rbenchmark)

get_bm <- function(n) {
  print(n)
  gc()
  df = benchmark(generateIndex10(n), generateIndex11(n), generateIndex12(n), generateIndex13(n),
                 columns = c("test", "elapsed", "relative"),
                 order = "elapsed",
                 replications = 100)
  df$n = n
  return(df)
}

# get benchmarks for various values of n
df.list <- lapply(seq(1000, 10000, by = 200), get_bm)
df <- do.call("rbind", df.list)

# plot results
ggplot(df.compiled, aes(x=n, y=elapsed, colour=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")

Bearing in mind that each algorithm is replicated 100 times at each value of n, the thirteenth solution produces a speed up at larger values of n.

Benchmarks performed on R version 2.13.0 on a Windows Server 2008 R2 x64 Workstation PC, 12GB RAM, Intel Xeon CPU X5650 @ 2.66GHz.

So, anyone have a Fouteenth solution? I’d love to see a solution that is paralleled into independent threads as I think that is where the next speed up will come from. Or how about a version which makes use of the inline and Rcpp packages with the thirteenth solution re-written C++? I have to admit that evolving an algorithm in R  is much easier than in C++ (not that I’ve used the latter in nearly 8 years).

Happy coding! 🙂

11 Comments »

  1. How about using expand.grid and vectorized row selection?

    sol14 <- function(n) {
    x <- expand.grid(1:n, 1:n)
    x x[, 1], ]
    x <- x[order(x[, 1], x[, 2]), ]
    paste(sprintf("i%03d", x[, 1]), sprintf("%03d", x[, 2]), sep = ".")
    }

    Didn't benchmark it though…

    Greets,
    Andrej

    Comment by Andrej Spiess — November 7, 2011 @ 6:30 pm

  2. Code was chopped off. Hopefully not now:

    sol14 <- function(n) {
      x <- expand.grid(1:n, 1:n)
      x  x[, 1], ]
      x <- x[order(x[, 1], x[, 2]), ]
      paste(sprintf("i%03d", x[, 1]), sprintf("%03d", x[, 2]), sep = ".")
    }
    

    Comment by Andrej Spiess — November 7, 2011 @ 6:32 pm

  3. Hmm, what am I doing wrong?
    3rd row in code always chopped off.

    Comment by Andrej Spiess — November 7, 2011 @ 6:37 pm

    • I’d never even heard of expand.grid() before, awesome, learn something new every day! Would love to know what that third line of code is 🙂

      Try inserting the code between the following tags:

      [.sourcecode language=”r”]
      # r code
      [./sourcecode]

      BUT without the periods “.” 🙂

      Comment by Tony Breyal — November 7, 2011 @ 9:30 pm

  4. Yep, I tried with expand.grid when I saw the first post but it didn’t improve the best solutions…

    Comment by FrogWave — November 7, 2011 @ 9:53 pm

  5. How about using matrix.
    My method:

    geneString <- function(n) {
      tmp <- matrix(rep(1:n, n), ncol = n)
      l <- sort(tmp[upper.tri(tmp)])
      r <- tmp[lower.tri(tmp)]
      out <- paste(sprintf("i%03d", l), sprintf("%03d", r), sep = ".")
      return(out)
    }
    

    Comment by chunhung — November 8, 2011 @ 6:27 am

    • One of the earlier solution on the r-wiki page used matrix too but allocating an entire matrix when you only need the upper triangle is inefficient. Interestingly I think most people would try the matrix route on the first attempt to solve this type of problem 🙂

      identical(generateIndex13(100), geneString(100))
      #[1] TRUE
      
      library(rbenchmark)
      n <- 1000
      benchmark(generateIndex13(n), geneString(n),
                       columns = c("test", "replications", "elapsed", "relative"),
                       order = "relative",
                       replications = 30)
      
      #                 test replications elapsed relative
      # 1 generateIndex13(n)           30   13.56 1.000000
      # 2      geneString(n)           30   36.80 2.713864
      

      Comment by Tony Breyal — November 8, 2011 @ 10:26 am

  6. Hopefully now:

    sol14 <- function(n) {
     x <- expand.grid(1:n, 1:n)
     x <- x[x[, 1] < x[, 2], ]
     x <- x[order(x[, 1], x[, 2]), ]
     paste(sprintf("i%03d", x[, 1]), sprintf("%03d", x[, 2]), sep = ".")
    }
    

    Hmm, benchmarked it, doesn’t seem too fast…

    Comment by A.N. Spiess — November 8, 2011 @ 7:20 am

    • Innovative! I think part of the problem here comes from using sprintf twice when really you only need it once as follows:

      sol14_b <- function(n) {
        s <- sprintf("%03d", seq_len(n))
        x <- expand.grid(1:n, 1:n)
        x <- x[x[, 1] < x[, 2], ]
        x <- x[order(x[, 1], x[, 2]), ]
        paste("i", s[x[,1]], ".", s[x[,2]], sep = "")
      }
      
      
      identical(generateIndex13(100), sol14(100))
      # [1] TRUE
      identical(generateIndex13(100), sol14_b(100))
      # [1] TRUE
      
      library(rbenchmark)
      n <- 1000
      benchmark(generateIndex13(n), sol14(n), sol14_b(n),
                       columns = c("test", "replications", "elapsed", "relative"),
                       order = "relative",
                       replications = 10)
      
      #                 test replications elapsed relative
      # 1 generateIndex13(n)           10    4.40 1.000000
      # 3         sol14_b(n)           10   12.68 2.881818
      # 2           sol14(n)           10   17.96 4.081818
      

      Comment by Tony Breyal — November 8, 2011 @ 10:42 am

  7. Every weekend i used to pay a visit this site, as i
    want enjoyment, since this this web site conations genuinely
    good funny material too.

    Comment by heart disease — May 11, 2013 @ 11:23 am

  8. Helpful information. Fortunate me I found your web site accidentally, and I’m stunned why this twist of fate didn’t came
    about earlier! I bookmarked it.

    Comment by Friv2 — August 6, 2013 @ 3:07 pm


RSS feed for comments on this post. TrackBack URI

Leave a reply to Friv2 Cancel reply

Create a free website or blog at WordPress.com.