Consistently Infrequent

January 4, 2012

Plotting Doctor Who Ratings (1963-2011) with R

Filed under: R — Tony Breyal @ 1:52 am

Introduction

It's my first day back at work after the New Year celebrations and my brain doesn’t really want to think too much. So I went out for lunch and had a nice walk in the park. I still had 15 minutes left before my lunch break was over, so I decided to pass the time with a quick web scraping exercise in R.

Objective

Download the last 49 years of British TV ratings data for the programme Doctor Who (the longest-running science fiction television show in the world and also the most successful science fiction series of all time in terms of overall broadcast ratings, DVD and book sales, and iTunes traffic) and make a simple plot of it.

Method

Ratings are available from doctorwhonews.net as a series of tables split across several pages. This means that we can use the RCurl and XML packages to download the first seed webpage, extract the table of ratings, and use XPath to get the link to the next page of ratings, repeating until there is no next page. Due to time constraints I’m not going to optimise any of this (though given the small data set it probably doesn’t need optimisation anyway).
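
The page-walking idea can be tried offline before pointing it at the real site. Below is a minimal sketch that parses a small inline HTML snippet with the XML package, reads its table, and pulls out the NEXT link with XPath. The snippet's markup (the nav div, the href value, the sample row) is an assumption made for illustration, modelled on the XPath used in this post; the real page may differ.

```r
library(XML)

# Hypothetical page markup, made up for illustration only.
page <- paste0(
  "<html><body>",
  "<table>",
  "<tr><th>Date</th><th>Title</th><th>Rating</th></tr>",
  "<tr><td>Sat 20 Oct 1979</td><td>City of Death - Episode 4</td><td>16.1</td></tr>",
  "</table>",
  "<div class='nav'><a href='info.php?page=2'>NEXT</a></div>",
  "</body></html>"
)

# Parse the text directly (asText = TRUE) rather than fetching a URL.
doc <- htmlParse(page, asText = TRUE)

# First table on the page as a data.frame.
tbl <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)

# href of the NEXT link; zero-length when there is no such link.
u.next <- as.vector(xpathSApply(doc, "//div[@class='nav']/a[text()='NEXT']/@href"))

# Build the next page's URL by swapping the tail of the current one.
u <- "http://guide.doctorwhonews.net/info.php?detail=ratings"
if (length(u.next) > 0) {
  u <- sub("info.*", u.next[1], u)
}
print(u)
```

Checking length(u.next) rather than is.null(u.next) covers both a NULL and an empty result, which seems the safer test for "no more pages".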

Solution

get_doctor_who_ratings <- function() {
  # load packages (library() stops with an error if one is missing)
  library(RCurl)
  library(XML)

  # return Date, Title and Rating with proper types
  format_df <- function(df) {
    data.frame(Date = as.POSIXlt(df$Date, format = "%a %d %b %Y"),
               Title = df$Title,
               # keep only the number before the first whitespace
               Rating = as.numeric(sub("\\s.*", "", df$Rating)),
               stringsAsFactors = FALSE)
  }

  # scrape data from web, following the NEXT link from page to page
  get_ratings <- function(u) {
    df.list <- list()
    i <- 1
    while(!is.null(u)) {
      # download the page once, then parse the downloaded text
      html <- getURL(u)
      doc <- htmlParse(html, asText = TRUE)
      df.list[[i]] <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
      u.next <- as.vector(xpathSApply(doc, "//div[@class='nav']/a[text()='NEXT']/@href"))
      # no NEXT link (NULL or empty result) means this was the last page
      if(length(u.next) == 0) {
        return(df.list)
      }
      u <- sub("info.*", u.next[1], u)
      i <- i + 1
    }
    return(df.list)
  }

  ### main function code ###
  # Step 1: get tables of ratings for each page that is available
  u <- "http://guide.doctorwhonews.net/info.php?detail=ratings"
  df.list <- get_ratings(u)

  # Step 2: format ratings into a single data.frame
  df <- do.call("rbind", df.list)
  df <- format_df(df)

  # Step 3: return data.frame
  return(df)
}
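
The formatting step can be checked on its own, without touching the network. Here is a rough base R sketch of what format_df does to the raw strings. The raw values are hypothetical (in particular, the text following the number in Rating is a guess), and it uses as.Date rather than as.POSIXlt purely to keep the example short.

```r
# Hypothetical raw values as they might come out of readHTMLTable();
# the exact text after the number is an assumption.
raw <- data.frame(
  Date   = c("Sat 20 Oct 1979", "Sat 01 Feb 1975"),
  Title  = c("City of Death - Episode 4", "The Ark In Space - Episode 2"),
  Rating = c("16.1 m", "13.6 m"),
  stringsAsFactors = FALSE
)

# %a and %b are matched against the current locale's day and month names,
# so switch to the C locale (English names) while parsing, then restore.
old.locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))
dates <- as.Date(raw$Date, format = "%a %d %b %Y")
invisible(Sys.setlocale("LC_TIME", old.locale))

clean <- data.frame(
  Date   = dates,
  Title  = raw$Title,
  # drop everything from the first whitespace onwards, then convert
  Rating = as.numeric(sub("\\s.*", "", raw$Rating)),
  stringsAsFactors = FALSE
)
str(clean)
```

The locale switch matters on machines set to a non-English locale, where "Sat" and "Oct" would otherwise fail to parse and give NA dates.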

Using the get_doctor_who_ratings() function, we can pull the ratings into a single data.frame as follows:


# get ratings database
ratings.df <- get_doctor_who_ratings()
head(ratings.df)

#         Date                             Title Rating
# 1 1979-10-20         City of Death - Episode 4   16.1
# 2 1979-10-13         City of Death - Episode 3   15.4
# 3 1979-09-22 Destiny of the Daleks - Episode 4   14.4
# 4 1979-10-06         City of Death - Episode 2   14.1
# 5 1979-09-15 Destiny of the Daleks - Episode 3   13.8
# 6 1975-02-01      The Ark In Space - Episode 2   13.6
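
Note that the rows printed above are not in broadcast order (they appear to run from the highest rating down), so it can help to sort by date before doing anything time-based. A minimal base R sketch, using three of the rows printed above as a stand-in for the full ratings.df:

```r
# Stand-in for ratings.df, built from rows shown in the output above.
sample.df <- data.frame(
  Date   = as.Date(c("1979-10-20", "1979-10-13", "1975-02-01")),
  Title  = c("City of Death - Episode 4",
             "City of Death - Episode 3",
             "The Ark In Space - Episode 2"),
  Rating = c(16.1, 15.4, 13.6),
  stringsAsFactors = FALSE
)

# order() returns the row indices that put Date in ascending order.
by.date <- sample.df[order(sample.df$Date), ]
rownames(by.date) <- NULL
print(by.date)
```

The same indexing should work on the real data as ratings.df[order(ratings.df$Date), ].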

Plot

We can plot this data very easily using Hadley Wickham’s ggplot2 package:


# do a raw plot
library(ggplot2)
ggplot(ratings.df, aes(x = Date, y = Rating)) + geom_point() + xlab("Date") + ylab("Ratings (millions)") + ggtitle("Doctor Who Ratings (1963-Present) without Context")

The gap in the data is due to the show having been off the air between 1989 and 2005, with the exception of the American episode in 1996.
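
One way to make that gap explicit on the chart is to shade it. The sketch below is only an illustration: it uses a few of the rows printed earlier as a stand-in for ratings.df, and the two boundary dates (the last broadcast of 1989 and the 2005 relaunch) are my own additions rather than something taken from the scraped data.

```r
library(ggplot2)

# Stand-in for ratings.df, built from rows printed earlier in this post.
sample.df <- data.frame(
  Date   = as.Date(c("1979-10-20", "1979-09-22", "1975-02-01")),
  Rating = c(16.1, 14.4, 13.6)
)

# Assumed hiatus boundaries: last 1989 broadcast and 2005 relaunch.
hiatus <- as.Date(c("1989-12-06", "2005-03-26"))

p <- ggplot(sample.df, aes(x = Date, y = Rating)) +
  geom_point() +
  # translucent band across the full height of the panel
  annotate("rect", xmin = hiatus[1], xmax = hiatus[2],
           ymin = -Inf, ymax = Inf, alpha = 0.2) +
  xlab("Date") +
  ylab("Ratings (millions)")
print(p)
```

In the ratings.df returned by get_doctor_who_ratings(), Date ends up as a date-time column, so there the boundaries would need as.POSIXct() instead of as.Date().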

CAUTION 

This was just a fun coding exercise to quickly pass some time.

The chart above should not be interpreted directly without proper context, as it would be very misleading to suggest that the show was more popular in earlier years than in later years. Bear in mind that TV habits have changed dramatically over the past 50-odd years (I myself barely watch live TV any more and instead use catch-up services like BBC iPlayer, which the ratings above do not account for), that there were fewer channels in Britain back in 1963, how BARB collects ratings, and that the prestige of the show has changed over time (once an embarrassment for the BBC with all of its criminally low budgets and wobbly sets, it is now one of its top flagship shows).

A final note

Although I was part of the generation during which Doctor Who was taken off the air, I do vaguely remember some episodes from my childhood where The Doctor was played by Sylvester McCoy, who to this day is still “my doctor” (as the saying goes), and I would put him right up there with Tennant and Smith as one of the greats. Best. Show. Ever.

You can find a quick review of series six (i.e. the sixth series of episodes since the show’s return in 2005) right here, and because I love the trailer so much I’ll embed it below:

11 Comments »

  1. The last line to perform the plot doesn’t work for me. Changing it from ggplot(df, aes…) to ggplot(ratings.df, aes…) fixed it. I’m using RStudio if that makes a difference.

    Other than that nice article, makes a change from the finance ones on R-Bloggers.

    Comment by Vincent Kolosowski — January 3, 2012 @ 10:28 pm

    • Thanks, Vincent, have corrected it now. I should remind myself to run the code in a fresh R session before posting the code to my blog (as you can probably guess, I had originally called the ratings.df data.frame just df before deciding it needed a more descriptive name). It is indeed nice to have a bit of variety, and coding for fun is the best way to learn in my opinion :)

      Comment by Tony Breyal — January 3, 2012 @ 11:09 pm

  2. Vincent, I really appreciate your posts and learn a lot from them. I would ask you to avoid identifying your posts with the phrase Quick-R. I have been maintaining a popular R tutorial and blog site called Quick-R (www.statmethods.net) for five years. I will also be contributing to R-bloggers under the Quick-R name and logo. I really want to avoid confusion regarding our contributions. Thank you for your consideration in this, and I look forward to continuing to read your posts.

    Comment by Rob Kabacoff, Ph.D. — January 4, 2012 @ 1:18 am

    • My humblest apologies, Rob (I think you meant that comment for me, not Vincent). I did not realise that “Quick-R” was already in use and I was only using it here to imply that this blog post was a quick exercise in R. I have changed the title and look forward to using your site (which from what I’ve just seen is pretty awesome).

      Comment by Tony Breyal — January 4, 2012 @ 1:27 am

  3. A very illustrative and valuable post, Tony!
    Especially, the way you skip through pages with xpathSApply() is slick!!

    Best,
    Kay

    Comment by gimoya — January 4, 2012 @ 9:54 am

    • Thanks, Kay! Yeah it always amazes me just how useful the XML package and the XPath method are for scraping data off the web! :)

      Comment by Tony Breyal — January 4, 2012 @ 10:07 am

  4. Excellent post, until the last paragraph – Sylvester McCoy! Very disappointing ;)

    Comment by csgillespie — January 5, 2012 @ 5:26 pm

    • lol, yeah I often get laughed at for my admiration of McCoy — but he was my first and only real introduction to Doctor Who as a wee kid. Whilst I don’t remember any of the stories that well (apart from something with an explosion), I do remember watching the programme with my family and that to me is kind of cool. Plus Ace was a fun companion :D

      Comment by Tony Breyal — January 5, 2012 @ 5:32 pm

  5. Nice work! I made a stab at R+DW here, trying to give myself an idea of the timescales of each Doctor – how long they were on TV for, and when: http://pbett.webs.com/stats/stats.html#DW

    Comment by Phil Bett — January 11, 2012 @ 9:52 pm

    • Thanks for the link – very interesting! Looking at your code it seems you produced the graphs in R which is very surprising to me because I’ve not seen R charts like that before. I will have to find some time to try and understand it. BTW, your .csv data file uses a broken link. Good to know that there are other R users who enjoy a bit of Doctor Who :)

      Comment by Tony Breyal — January 11, 2012 @ 11:08 pm

