Consistently Infrequent

January 13, 2012

R: A Quick Scrape of Top Grossing Films from boxofficemojo.com

Filed under: R — Tags: — BD @ 11:55 am

 

Introduction

I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and wondered what kind of graphs I could come up with if I had that data. I still don’t know what I’d construct beyond a simple barplot, but I figured I’d at least get the basics done and, if I feel motivated enough, revisit this in the future.

Objective

Scrape the information available on http://boxofficemojo.com/alltime/world into R and make a simple barplot.

Solution

This is probably one of the easier scraping challenges. The function readHTMLTable() from the XML package does all the hard work: we just pass it the URL of the page we’re interested in and it pulls out every table on that page as a list of data.frames. We then pick the data.frame we want. Here’s a single wrapper function:

box_office_mojo_top <- function(num.pages) {
  # load required packages
  require(XML)

  # local helper functions
  get_table <- function(u) {
    # the third table on the page is the one holding the rankings
    table <- readHTMLTable(u)[[3]]
    names(table) <- c("Rank", "Title", "Studio", "Worldwide.Gross", "Domestic.Gross", "Domestic.pct", "Overseas.Gross", "Overseas.pct", "Year")
    # drop the header row and convert the factor columns to character
    df <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors = FALSE)
    return(df)
  }
  clean_df <- function(df) {
    # strip out the "$", "%", "," and "^" (footnote marker) characters so
    # that the columns can later be converted to numeric
    clean <- function(col) {
      col <- gsub("$", "", col, fixed = TRUE)
      col <- gsub("%", "", col, fixed = TRUE)
      col <- gsub(",", "", col, fixed = TRUE)
      col <- gsub("^", "", col, fixed = TRUE)
      return(col)
    }

    df <- sapply(df, clean)
    df <- as.data.frame(df, stringsAsFactors = FALSE)
    return(df)
  }

  # Main
  # Step 1: construct URLs
  urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")

  # Step 2: scrape website
  df <- do.call("rbind", lapply(urls, get_table))

  # Step 3: clean dataframe
  df <- clean_df(df)

  # Step 4: set column types (Rank plus the money, percentage and year columns)
  s <- c(1, 4:9)
  df[, s] <- sapply(df[, s], as.numeric)
  df$Studio <- as.factor(df$Studio)

  # Step 5: return dataframe
  return(df)
}

We use it as follows:

num.pages <- 5
df <- box_office_mojo_top(num.pages)

head(df)
# Rank Title Studio Worldwide.Gross Domestic.Gross Domestic.pct Overseas.Gross Overseas.pct Year
# 1 1 Avatar Fox 2782.3 760.5 27.3 2021.8 72.7 2009
# 2 2 Titanic Par. 1843.2 600.8 32.6 1242.4 67.4 1997
# 3 3 Harry Potter and the Deathly Hallows Part 2 WB 1328.1 381.0 28.7 947.1 71.3 2011
# 4 4 Transformers: Dark of the Moon P/DW 1123.7 352.4 31.4 771.4 68.6 2011
# 5 5 The Lord of the Rings: The Return of the King NL 1119.9 377.8 33.7 742.1 66.3 2003
# 6 6 Pirates of the Caribbean: Dead Man's Chest BV 1066.2 423.3 39.7 642.9 60.3 2006

str(df)
# 'data.frame': 475 obs. of 9 variables:
# $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
# $ Title : chr "Avatar" "Titanic" "Harry Potter and the Deathly Hallows Part 2" "Transformers: Dark of the Moon" ...
# $ Studio : Factor w/ 35 levels "Art.","BV","Col.",..: 7 20 33 19 16 2 2 2 2 33 ...
# $ Worldwide.Gross: num 2782 1843 1328 1124 1120 ...
# $ Domestic.Gross : num 760 601 381 352 378 ...
# $ Domestic.pct : num 27.3 32.6 28.7 31.4 33.7 39.7 39 23.1 32.6 53.2 ...
# $ Overseas.Gross : num 2022 1242 947 771 742 ...
# $ Overseas.pct : num 72.7 67.4 71.3 68.6 66.3 60.3 61 76.9 67.4 46.8 ...
# $ Year : num 2009 1997 2011 2011 2003 ...

We can even do a simple barplot of the top 50 films by worldwide gross (in millions):


require(ggplot2)
df2 <- subset(df, Rank <= 50)
ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
  geom_bar(stat = "identity") +
  opts(axis.text.x = theme_text(angle = 0)) +
  opts(axis.text.y = theme_text(angle = 0)) +
  coord_flip() +
  ylab("Worldwide Gross (USD $ millions)") +
  xlab("Title") +
  opts(title = "TOP 50 FILMS BY WORLDWIDE GROSS")
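
Note that opts() and theme_text() reflect the ggplot2 API at the time of writing; later releases replaced them with theme() and element_text(). If you are running a newer ggplot2, something along these lines should be a rough equivalent (treat it as a sketch rather than a tested drop-in replacement):

# rough equivalent for newer ggplot2 releases: geom_col() maps y directly
# and ggtitle() replaces opts(title = ...)
require(ggplot2)
df2 <- subset(df, Rank <= 50)
ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
  geom_col() +
  coord_flip() +
  xlab("Title") +
  ylab("Worldwide Gross (USD $ millions)") +
  ggtitle("TOP 50 FILMS BY WORLDWIDE GROSS")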

12 Comments »

  1. Great post! I really like the readHTMLTable function of the XML package. A friend pointed me to the ‘which’ argument of readHTMLTable, which is quite neat. Say you want to get the third table, then you can say readHTMLTable(…, which=3).
    Sometimes on Windows I notice that I have to wrap readHTMLTable around readLines to make it work, e.g. readHTMLTable(readLines(url)). Have you ever come across that issue?

    Comment by Markus — January 13, 2012 @ 7:57 pm

    • Ahh, not sure how I missed the “which” parameter but thanks for pointing it out!

      I have also had to make use of the readLines() work-around when using htmlParse() from the XML package in almost all of my XPath web scraping code, unless I know that a specific website doesn’t need it (like boxofficemojo.com). However, readLines() will fail on https weblinks, so in those cases I just end up using getURL() from the RCurl package to grab the underlying html code instead.

      Comment by Tony Breyal — January 13, 2012 @ 11:39 pm
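
      For reference, a quick sketch of the approaches mentioned in this exchange; the table index (3) is just the one used in the post above, so treat it as an assumption rather than a fixed rule:

      # three ways to grab the same table, as discussed above
      require(XML)
      require(RCurl)
      u <- "http://boxofficemojo.com/alltime/world/"
      tbl <- readHTMLTable(u, which = 3)                                    # 'which' picks out the third table directly
      tbl <- readHTMLTable(readLines(u), which = 3)                         # the readLines() work-around
      tbl <- readHTMLTable(htmlParse(getURL(u), asText = TRUE), which = 3)  # RCurl route for https links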

  2. Great web scraping example, at least from HTML tables. Thank you.

    Comment by mstrtweaks — January 13, 2012 @ 7:58 pm

  3. It would be interesting to see the results normalized to today’s dollars since you have the year of release.

    Comment by Larry (IEOR Tools) — January 13, 2012 @ 9:15 pm

    • There’s a table for gross adjusted for inflation at the following link (the R code could be easily adapted to work with it) – http://boxofficemojo.com/alltime/adjusted.htm

      There’s always controversy about adjusting for inflation, but even so it’s still quite impressive to see Gone With The Wind at the top of that list.

      Comment by Tony Breyal — January 13, 2012 @ 11:44 pm

        • It’s even more controversial to not adjust for inflation, given that the dollar is worth a tiny fraction of what it was before 1913’s creation of the Federal Reserve: http://pix.cs.olemiss.edu/depress

        Comment by Matthew Barney — January 14, 2012 @ 2:19 am

          • Interesting link. I do wonder what a good metric for the gross of a film might be in order to allow meaningful comparisons. Even looking just at the number of tickets bought has its own problems (e.g. cost of the ticket, whether the economy was struggling at the time, etc.).

          Comment by Tony Breyal — January 14, 2012 @ 10:22 am

  4. How would you modify this code if you wanted to add a column to “df” indicating which page each row of data came from? You are scraping all five pages and rbind()-ing them together, but there could be a scenario where the page source is important.

    Comment by Bill — July 15, 2014 @ 7:16 am
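
    One possible way (a sketch, not part of the original post) would be to pass the page number into get_table() and stamp it on each row before the rbind(), for example:

    # sketch: record the source page as an extra column
    require(XML)
    get_table <- function(pagenum) {
      u <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", pagenum, "&p=.htm", sep = "")
      table <- readHTMLTable(u)[[3]]
      names(table) <- c("Rank", "Title", "Studio", "Worldwide.Gross", "Domestic.Gross",
                        "Domestic.pct", "Overseas.Gross", "Overseas.pct", "Year")
      df <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors = FALSE)
      df$Page <- pagenum   # which page this row came from
      return(df)
    }

    num.pages <- 5
    df <- do.call("rbind", lapply(1:num.pages, get_table))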

