Following up from my previous post “Code Optimisation: One R Problem, Ten Solutions – Now Eleven!” I figured out a twelfth solution after writing that blog post. Furthermore, half way through writing this blog post I figured out a thirteenth solution too. As a recap, the problem is taken from rwiki where the goal is to find the most efficient code to produce strings in the following pattern, given a single integer input n
:
# n = 4 [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004" # n = 5 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004" [9] "i003.005" "i004.005" # n = 6 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005" [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"
The solution I had come up with was as follows:
generateIndex11 <- function(n) { # initialise vectors len <- (n-1)*n/2 s <- vector(mode = "character", length = n) ind.1 <- vector(mode = "integer", length = len) ind.2 <- vector(mode = "integer", length = len) # set up strings s <- sprintf("%03d", seq_len(n)) # calculate indicies ind.1 <- rep.int(1:(n-1), (n-1):1) ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE) # paste strings together return(paste("i", s[ind.1], ".", s[ind.2], sep = "")) }
However, once I had finished writing that blog post, I realised that a further improvement was possible by not having to calculate ind.1 at all. This is because ind.2 has all the information we need:
n=5; lapply(2:n, ":", n) [[1]] [1] 2 3 4 5 [[2]] [1] 3 4 5 [[3]] [1] 4 5 [[4]] [1] 5
Did you notice it? The pattern we need is effectively “1 with 2,3,4,5 “, “2 with 3,4,5”, “3 with 4, 5” and “4 with 5”. All of that information is available in the one list above. Therefore, simplifying, a twelfth solution is:
generateIndex12 <- function(n) { s <- sprintf("%03d", seq_len(n)) unlist(lapply(1:(n-1), function(i) { paste("i", s[i], ".", s[seq.int(i+1, n)], sep = "") }), use.names = FALSE) } # check identical(generateIndex11(500), generateIndex12(500) [1] TRUE
I would love to use the Rcpp package to produce a C++ version of the code above but unfortuantly that is beyound my current skill set but I’m sure it would result is a significant speed up. Perhaps.
Furthermore, a slightly faster thirteenth solution if we remember that initialising vectors and calculating indices independently before using lapply adds some speed too:
generateIndex13 <- function(n) { # initialise vectors s <- vector(mode = "character", length = n) ind <- vector(mode = "integer", length = (n-1)*n/2) # set up n unique strings s <- sprintf("%03d", seq_len(n)) # calculate sequence of indicies for second part of strings e.g. "002" "003" etc in "i001.002" "i001.003"... ind <- lapply(2:n, ":", n) # paste strings together unlist(lapply(1:(n-1), function(i) paste("i", s[i], ".", s[ind[[i]]], sep = "") ), use.names = FALSE) } # check identical(generateIndex11(500), generateIndex13(500) [1] TRUE
Benchmarking the above (without compiled versions because of constraints on time)
# load packages library(ggplot2) library(rbenchmark) get_bm <- function(n) { print(n) gc() df = benchmark(generateIndex10(n), generateIndex11(n), generateIndex12(n), generateIndex13(n), columns = c("test", "elapsed", "relative"), order = "elapsed", replications = 100) df$n = n return(df) } # get benchmarks for various values of n df.list <- lapply(seq(1000, 10000, by = 200), get_bm) df <- do.call("rbind", df.list) # plot results ggplot(df.compiled, aes(x=n, y=elapsed, colour=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")
Bearing in mind that each algorithm is replicated 100 times at each value of n, the thirteenth solution produces a speed up at larger values of n.
Benchmarks performed on R version 2.13.0 on a Windows Server 2008 R2 x64 Workstation PC, 12GB RAM, Intel Xeon CPU X5650 @ 2.66GHz.
So, anyone have a Fouteenth solution? I’d love to see a solution that is paralleled into independent threads as I think that is where the next speed up will come from. Or how about a version which makes use of the inline and Rcpp packages with the thirteenth solution re-written C++? I have to admit that evolving an algorithm in R is much easier than in C++ (not that I’ve used the latter in nearly 8 years).
Happy coding! 🙂
How about using expand.grid and vectorized row selection?
sol14 <- function(n) {
x <- expand.grid(1:n, 1:n)
x x[, 1], ]
x <- x[order(x[, 1], x[, 2]), ]
paste(sprintf("i%03d", x[, 1]), sprintf("%03d", x[, 2]), sep = ".")
}
Didn't benchmark it though…
Greets,
Andrej
Comment by Andrej Spiess — November 7, 2011 @ 6:30 pm
Code was chopped off. Hopefully not now:
Comment by Andrej Spiess — November 7, 2011 @ 6:32 pm
Hmm, what am I doing wrong?
3rd row in code always chopped off.
Comment by Andrej Spiess — November 7, 2011 @ 6:37 pm
I’d never even heard of expand.grid() before, awesome, learn something new every day! Would love to know what that third line of code is 🙂
Try inserting the code between the following tags:
[.sourcecode language=”r”]
# r code
[./sourcecode]
BUT without the periods “.” 🙂
Comment by Tony Breyal — November 7, 2011 @ 9:30 pm
Yep, I tried with expand.grid when I saw the first post but it didn’t improve the best solutions…
Comment by FrogWave — November 7, 2011 @ 9:53 pm
How about using matrix.
My method:
Comment by chunhung — November 8, 2011 @ 6:27 am
One of the earlier solution on the r-wiki page used matrix too but allocating an entire matrix when you only need the upper triangle is inefficient. Interestingly I think most people would try the matrix route on the first attempt to solve this type of problem 🙂
Comment by Tony Breyal — November 8, 2011 @ 10:26 am
Hopefully now:
Hmm, benchmarked it, doesn’t seem too fast…
Comment by A.N. Spiess — November 8, 2011 @ 7:20 am
Innovative! I think part of the problem here comes from using sprintf twice when really you only need it once as follows:
Comment by Tony Breyal — November 8, 2011 @ 10:42 am
Every weekend i used to pay a visit this site, as i
want enjoyment, since this this web site conations genuinely
good funny material too.
Comment by heart disease — May 11, 2013 @ 11:23 am
Helpful information. Fortunate me I found your web site accidentally, and I’m stunned why this twist of fate didn’t came
about earlier! I bookmarked it.
Comment by Friv2 — August 6, 2013 @ 3:07 pm