Consistently Infrequent

January 4, 2012

Plotting Doctor Who Ratings (1963-2011) with R

Filed under: R — Tags: , , — Tony Breyal @ 1:52 am

Introduction

First day back to work after New Year celebrations and my brain doesn’t really want to think too much. So I went out for lunch and had a nice walk in the park. Still had 15 minutes to kill before my lunch break was over and so decided to kill some time with a quick web scraping exercise in R.

Objective

Download the last 49 years of British TV ratings data for the programme Doctor Who (the longest-running science fiction television show in the world and which is also the most successful science fiction series of all time, in terms of its overall broadcast ratings, DVD and book sales and iTunes traffic) and make a simple plot of it.

Method

Ratings are available from doctorwhonews.net as a series of page separated tables. This means that we can use the RCurl and XML packages to download the first seed webpage, extract the table of ratings, and use XPath to get the weblink to the next page of ratings. Due to time constraints I’m not going to optimise any of this (though given the small data set it probably doesn’t need optimisation anyway).

Solution

get_doctor_who_ratings <- function() {
  # load packages
  require(RCurl)
  require(XML)

  # return Title, Date and Rating
  format_df <- function(df) {
    data.frame(Date = as.POSIXlt(df$Date, format = "%a %d %b %Y"),
               Title = df$Title,
               Rating = as.numeric(gsub("(\\s+).*", "\\1", df$Rating)),
               stringsAsFactors = FALSE)
  }

  # scrape data from web
  get_ratings <- function(u) {
    df.list <- list()
    i <- 1
    while(!is.null(u)) {
      html <- getURL(u)
      doc <- htmlParse(u)
      df.list[[i]] <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
      u.next <- as.vector(xpathSApply(doc, "//div[@class='nav']/a[text()='NEXT']/@href"))
      if(is.null(u.next)) {
        return(df.list)
      }
      u <- sub("info.*", u.next, u)
      i <- i + 1
    }
    return(df.list)
  }

  ### main function code ###
  # Step 1: get tables of ratings for each page that is avaiable
  u <- "http://guide.doctorwhonews.net/info.php?detail=ratings"
  df.list <- get_ratings(u)

  # Step 2: format ratings into a single data.frame
  df <- do.call("rbind", df.list)
  df <- format_df(df)

  # Step 3: return data.frame
  return(df)
}

Using the above, we can pull the ratings into a single data.frame as follows:


# get ratings database
ratings.df <- get_doctor_who_ratings()
head(ratings.df)

# Date Title Rating
# 1 1979-10-20 City of Death - Episode 4 16.1
# 2 1979-10-13 City of Death - Episode 3 15.4
# 3 1979-09-22 Destiny of the Daleks - Episode 4 14.4
# 4 1979-10-06 City of Death - Episode 2 14.1
# 5 1979-09-15 Destiny of the Daleks - Episode 3 13.8
# 6 1975-02-01 The Ark In Space - Episode 2 13.6

&nbsp;

Plot

We can plot this data very easily using the Hadley Wickman’s ggplot2 package:


# do a raw plot
require(ggplot2)
ggplot(ratings.df, aes(x=Date, y=Rating)) + geom_point() + xlab("Date") + ylab("Ratings (millions)") + opts(title = "Doctor Who Ratings (1963-Present) without Context")

The gap in the data is due to the show having been put on permanent hiatus between 1989 and 2005 with the exception of the american episode in 1996.

CAUTION 

This was just a fun coding exercise to quickly pass some time.

The chart above should not be directly interpreted without the proper context as it would be very misleading to suggest that that show was more popular in earlier years than in later years. Bear in mind that TV habits have changed dramatically over the past 50 odd years (I myself barely watch TV live any more and instead make use of catchup services like BBC iplayer which the ratings above to do not account for), that there were fewer channels back in 1963 in Britain, the way BARB collect ratings, and that the prestige of the show has changed over time (once an embarrassment for the BBC with all of it’s criminally low budgets and wobbly sets, to now being one of it’s top flagship shows).

A final note

Although I was part of the generation during which Doctor Who was taken off the air, I do vaguely remember some episodes from my childhood where The Doctor was played by Sylvester McCoy, who to this day is still “my doctor” (as the saying goes) and I would put him right up there with Tennent and Smith as being one of the greats. Best. Show. Ever.

You can find a quick review of series six (i.e. the sixth series of episodes since the show’s return in 2005) right here, and because I love the trailer so much I’ll embed it below:

Advertisements

November 7, 2011

Code Optimization: One R Problem, Eleven Solutions – Now Thirteen!

Filed under: R — Tags: , , , , — Tony Breyal @ 11:35 am

Following up from my previous post “Code Optimisation: One R Problem, Ten Solutions – Now Eleven!” I figured out a twelfth solution after writing that blog post. Furthermore, half way through writing this blog post I figured out a thirteenth solution too. As a recap, the problem is taken from rwiki where the goal is to find the most efficient code to produce strings in the following pattern, given a single integer input n:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

The solution I had come up with was as follows:

generateIndex11 <- function(n) {
  # initialise vectors
  len <- (n-1)*n/2
  s <- vector(mode = "character", length = n)
  ind.1 <- vector(mode = "integer", length = len)
  ind.2 <- vector(mode = "integer", length = len)

  # set up strings
  s <- sprintf("%03d", seq_len(n))

  # calculate indicies
  ind.1 <- rep.int(1:(n-1), (n-1):1)
  ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)

  # paste strings together
  return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}

However, once I had finished writing that blog post, I realised that a further improvement was possible by not having to calculate ind.1 at all. This is because ind.2 has all the information we need:

n=5; lapply(2:n, ":", n)
[[1]]
[1] 2 3 4 5

[[2]]
[1] 3 4 5

[[3]]
[1] 4 5

[[4]]
[1] 5

Did you notice it? The pattern we need is effectively “1 with 2,3,4,5 “, “2 with 3,4,5”, “3 with 4, 5” and “4 with 5”.  All of that information is available in the one list above. Therefore, simplifying, a twelfth solution is:

generateIndex12 <- function(n) {
  s <- sprintf("%03d", seq_len(n))
  unlist(lapply(1:(n-1),
         function(i) {
             paste("i", s[i], ".", s[seq.int(i+1, n)], sep = "")
         }), use.names = FALSE)
}

# check
identical(generateIndex11(500), generateIndex12(500)
[1] TRUE

I would love to use the Rcpp package to produce a C++ version of the code above but unfortuantly that is beyound my current skill set but I’m sure it would result is a significant speed up. Perhaps.

Furthermore, a slightly faster thirteenth solution if we remember that initialising vectors and calculating indices independently before using lapply adds some speed too:


generateIndex13 <- function(n) {
  # initialise vectors
  s <- vector(mode = "character", length = n)
  ind <- vector(mode = "integer", length = (n-1)*n/2)
  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # calculate sequence of indicies for second part of strings e.g. "002" "003" etc in "i001.002" "i001.003"...
  ind <- lapply(2:n, ":", n)

  # paste strings together
  unlist(lapply(1:(n-1), function(i) paste("i", s[i], ".", s[ind[[i]]], sep = "") ), use.names = FALSE)
}
# check
identical(generateIndex11(500), generateIndex13(500)
[1] TRUE

Benchmarking the above (without compiled versions because of constraints on time)

# load packages
library(ggplot2)
library(rbenchmark)

get_bm <- function(n) {
  print(n)
  gc()
  df = benchmark(generateIndex10(n), generateIndex11(n), generateIndex12(n), generateIndex13(n),
                 columns = c("test", "elapsed", "relative"),
                 order = "elapsed",
                 replications = 100)
  df$n = n
  return(df)
}

# get benchmarks for various values of n
df.list <- lapply(seq(1000, 10000, by = 200), get_bm)
df <- do.call("rbind", df.list)

# plot results
ggplot(df.compiled, aes(x=n, y=elapsed, colour=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")

Bearing in mind that each algorithm is replicated 100 times at each value of n, the thirteenth solution produces a speed up at larger values of n.

Benchmarks performed on R version 2.13.0 on a Windows Server 2008 R2 x64 Workstation PC, 12GB RAM, Intel Xeon CPU X5650 @ 2.66GHz.

So, anyone have a Fouteenth solution? I’d love to see a solution that is paralleled into independent threads as I think that is where the next speed up will come from. Or how about a version which makes use of the inline and Rcpp packages with the thirteenth solution re-written C++? I have to admit that evolving an algorithm in R  is much easier than in C++ (not that I’ve used the latter in nearly 8 years).

Happy coding! 🙂

November 2, 2011

Code Optimization: One R Problem, Ten Solutions – Now Eleven!

Filed under: R — Tags: , , , , , , , , , , — Tony Breyal @ 11:47 pm

Earlier this year I came across a rather interesting page about optimisation in R from rwiki. The goal was to find the most efficient code to produce strings which follow the pattern below given a single integer input n:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

From this we can see that the general pattern for n is:

# pseudo code
FOR j = 1 to n-1
    FOR k = j+1 to n
        PRINT "i" + format(j, digits = 3) + "." + format(k, digits = 3)
        k = k + 1
    END FOR
    j = j + 1
END FOR
# this produces (n-1)*n/2 strings.

It is rather heart warming to go though that rwiki page and see how we can sequentially optimise the algorithm in R to more efficiently produce the desired string sequence. I learned quite a lot from this page about R and how fun these types of challenges can be!

Looking at the tenth solution, it achieves its speed by recognising that there are a total of n unique strings (e.g. “001”, “002”) to the pattern. These can be assigned to a character vector which we subsequently call by working out the sequence of indices we require and passing them to the character vector:

generateIndex10 <- function(n) {
  s <- sprintf("%03d", seq_len(n));
  paste(
    "i",
    rep(s[1:(n-1)], (n-1):1),
    ".",
    unlist(lapply(2:n, function(k) s[k:n]), use.names=FALSE),
    sep=""
  )
}
generateIndex10(7)
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i001.007" "i002.003" "i002.004"
 [9] "i002.005" "i002.006" "i002.007" "i003.004" "i003.005" "i003.006" "i003.007" "i004.005"
[17] "i004.006" "i004.007" "i005.006" "i005.007" "i006.007"

Playing around with the solution above, I noticed that a speed up would be possible if the following were implemented:

  1. rep on a string vector is slower than simply using rep.int to work out the indices first and then passing those into the character vector.
  2. Initialise the vectors to the correct size. This seems to produce a speed up though I’m not sure why it would in a vectorised solution. I could see the benefit if I were creating the vectors using an explicit R loop and concatenating the results but that is not the case here.
  3. The functions could be compiled into bytecodes using the new compiler package (NB given that tenth solution is from 2007, this option would not have been available back then).
  4. This problem could be set up to run in parallel. I didn’t do this because I’m still within running distance of being an R noob 😀

Given the first three points above, this is the solution I came up with:

generateIndex11 <- function(n) {
  # initialise vectors
  len <- (n-1)*n/2
  s <- vector(mode = "character", length = n)
  ind.1 <- vector(mode = "integer", length = len)
  ind.2 <- vector(mode = "integer", length = len)

  # set up strings
  s <- sprintf("%03d", seq_len(n))

  # calculate indices
  ind.1 <- rep.int(1:(n-1), (n-1):1)
  ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)

  # paste strings together
  return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}

But is it faster? We can benchmark it and see!


# load packages
library(compiler)
library(rbenchmark)

# compile functions
compiled_generateIndex10 <- cmpfun(generateIndex10)
compiled_generateIndex11 <- cmpfun(generateIndex11)

# set size
n = 5000

# run benchmark
benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative",
          replications = 10)

###--- TIMINGS ---###
                          test replications elapsed relative
4 compiled_generateIndex11(n)            10   51.46 1.000000
3          generateIndex11(n)            10   51.67 1.004081
2  compiled_generateIndex10(n)           10   54.35 1.056160
1           generateIndex10(n)           10   61.20 1.189273

So it seems that there is a speed up between generateIndex10 and generateIndex11.   If we call the benchmark for different values of n and plot using the ggplot2 pack we get the following:


library(ggplot2)
get_bm <- function(n) {
  gc()
  df = benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
            columns = c("test", "elapsed"),
            order = "elapsed",
            replications = 10)
  df$n = n
  return(df)

}

df.list <- lapply(seq(1000, 9000, by = 250), get_bm)
df <- do.call("rbind", df.list)
ggplot(df, aes(x=n, y=elapsed, colour=test, shape=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")

Which produces the graph below. I don’t know what’s happening at the extreme values but I suspect that it has something to do with memory because on my 8GB machine, the RAM was mostly at 99% for the largest values of n. I also like how stable the compiled versions of the functions are:

In conclusion, whilst the compiled eleventh solution is faster for larger values of n, for values below 1000 the tenth solution is sometimes faster.

Writing this post, I noticed something and actually came up with an even faster twelfth  solution. When I get time in the next couple of days I’ll make a post about it and also submit it to rwiki.

Thanks for reading, this was my first proper R blog post, constructive comments are most welcome! Also, please kindly bare in mind that I am not an R expert but rather just trying to improve my knowledge of the language 🙂

Blog at WordPress.com.