Consistently Infrequent

August 24, 2014

R: Word Stem Text Blocks in Parallel

Filed under: R — Tags: , , , — Tony Breyal @ 11:02 pm

Objective

I recently needed to stem every word in a block of text i.e. reduce each word to a root form.

Problem

The stemmer I was using would only stem the last word in each block of text e.g.the roo

require(SnowballC)

wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] 'walk walks walked walking walker walk';

Solution

I wrote a function which splits a block of text into individual words, stems each word, and then recombines the words together into a block of text

require(SnowballC) # stemmer
require(parallel)  # parallel processing
require(tau)       # tokenise function

stem_text<- function(text, language = 'porter', mc.cores = 1) {
  # stem each word in a block of text
  stem_string <- function(str, language) {
    str <- tokenize(x = str)
    str <- wordStem(str, language = language)
    str <- paste(str, collapse = "")
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language, mc.cores = mc.cores)

  # return stemed text blocks
  return(unlist(x))
}

This works as follows:

# Blocks of text
sentences <- c('walk walks walked walking walker walkers?',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence.')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] 'walk walk walk walk walker walker?';                                                
# [2] 'Never ignor coincid unless of cours you are busi In which case alway ignor coincid.'

The argument “mc.cores” refers to the number of processing cores on your processor. Under Windows this will always be one. Under Ubuntu Linux, you can set it to however many cores you have (though it’s probably only worthwhile if you have lots of text vectors).

January 13, 2012

R: A Quick Scrape of Top Grossing Films from boxofficemojo.com

Filed under: R — Tags: — Tony Breyal @ 11:55 am

 

Introduction

I was looking at a list of the top grossing films of all time (available from boxofficemojo.com) and was wondering what kind of graphs I would come up with if I had that kind of data. I still don’t know what kind of graphs I’d construct other than a simple barplot but figured I’d at least get the basics done and then if I feel motivated enough I could revisit this in the future.

Objective

Scrape the information available on http://boxofficemojo.com/alltime/world into R and make a simple barplot.

Solution

This is probably one of the easier scraping challenges. The function readHTMLTable() from the XML package does all the hard work. We just point the url of the page we’re interested in and feed it into the function. The function then pulls out all tables on the webpage as a list of data.frames. We then choose which data.frame we want. Here’s a single wrapper function:

box_office_mojo_top <- function(num.pages) {
  # load required packages
  require(XML)

  # local helper functions
  get_table <- function(u) {
    table <- readHTMLTable(u)[[3]]
    names(table) <- c("Rank", "Title", "Studio", "Worldwide.Gross", "Domestic.Gross", "Domestic.pct", "Overseas.Gross", "Overseas.pct", "Year")
    df <- as.data.frame(lapply(table[-1, ], as.character), stringsAsFactors=FALSE)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }
  clean_df <- function(df) {
    clean <- function(col) {
      col <- gsub("$", "", col, fixed = TRUE)
      col <- gsub("%", "", col, fixed = TRUE)
      col <- gsub(",", "", col, fixed = TRUE)
      col <- gsub("^", "", col, fixed = TRUE)
      return(col)
    }

    df <- sapply(df, clean)
    df <- as.data.frame(df, stringsAsFactors=FALSE)
    return(df)
  }

  # Main
  # Step 1: construct URLs
  urls <- paste("http://boxofficemojo.com/alltime/world/?pagenum=", 1:num.pages, "&p=.htm", sep = "")

  # Step 2: scrape website
  df <- do.call("rbind", lapply(urls, get_table))

  # Step 3: clean dataframe
  df <- clean_df(df)

  # Step 4: set column types
  s <- c(1, 4:9)
  df[, s] <- sapply(df[, s], as.numeric)
  df$Studio <- as.factor(df$Studio)

  # step 5: return dataframe
  return(df)
}

Which we use as follows:

num.pages <- 5
df <- box_office_mojo_top(num.pages)

head(df)
# Rank Title Studio Worldwide.Gross Domestic.Gross Domestic.pct Overseas.Gross Overseas.pct Year
# 1 1 Avatar Fox 2782.3 760.5 27.3 2021.8 72.7 2009
# 2 2 Titanic Par. 1843.2 600.8 32.6 1242.4 67.4 1997
# 3 3 Harry Potter and the Deathly Hallows Part 2 WB 1328.1 381.0 28.7 947.1 71.3 2011
# 4 4 Transformers: Dark of the Moon P/DW 1123.7 352.4 31.4 771.4 68.6 2011
# 5 5 The Lord of the Rings: The Return of the King NL 1119.9 377.8 33.7 742.1 66.3 2003
# 6 6 Pirates of the Caribbean: Dead Man's Chest BV 1066.2 423.3 39.7 642.9 60.3 2006

str(df)
# 'data.frame': 475 obs. of 9 variables:
# $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
# $ Title : chr "Avatar" "Titanic" "Harry Potter and the Deathly Hallows Part 2" "Transformers: Dark of the Moon" ...
# $ Studio : Factor w/ 35 levels "Art.","BV","Col.",..: 7 20 33 19 16 2 2 2 2 33 ...
# $ Worldwide.Gross: num 2782 1843 1328 1124 1120 ...
# $ Domestic.Gross : num 760 601 381 352 378 ...
# $ Domestic.pct : num 27.3 32.6 28.7 31.4 33.7 39.7 39 23.1 32.6 53.2 ...
# $ Overseas.Gross : num 2022 1242 947 771 742 ...
# $ Overseas.pct : num 72.7 67.4 71.3 68.6 66.3 60.3 61 76.9 67.4 46.8 ...
# $ Year : num 2009 1997 2011 2011 2003 ...

We can even do a simple barplot of the top 50 films by worldwide gross (in millions) :


 require(ggplot2)
 df2 <- subset(df, Rank<=50)
 ggplot(df2, aes(reorder(Title, Worldwide.Gross), Worldwide.Gross)) +
   geom_bar() +
   opts(axis.text.x=theme_text(angle=0)) +
   opts(axis.text.y=theme_text(angle=0)) +
   coord_flip() +
   ylab("Worldwise Gross (USD $ millions)") +
   xlab("Title") +
   opts(title = "TOP 50 FILMS BY WORLDWIDE GROSS")

January 6, 2012

R: Web Scraping R-bloggers Facebook Page

Filed under: R — Tags: , — Tony Breyal @ 8:50 pm

 

Introduction

R-bloggers.com is a blog aggregator maintained by Tal Galili. It is a great website for both learning about R and keeping up-to-date with the latest developments (because someone will probably, and very kindly, post about the status of some R related feature). There is also an R-bloggers facebook page where a number of articles from R-bloggers are linked into its feed. These can then be liked, commented upon and shared by other facebook users. I was curious if anyone had commented on any of my R posts which had been linked into this facebook feed but it is a very tedious process to have to manually and continually click the ‘load more’ button to load more posts into the facebook wall page and scan for one of my posts.

Objective

Automatically scrape the content off of the R-bloggers facebook wall page via XPath and structure it into a dataframe in order to see if anyone has made any comments on one of my posts, or liked it or shared it.

Initial Thoughts

I have posted previously about using the Facebook Explorer API to get data from facebook. However there is a issue whereby a set of random posts may not be returned by the API. Given that I’m specifically interested in a small subset of posts, this issue makes it unsuitable for me to use the API as there is a chance I might miss something interesting. (My feeling is this has something to do with privacy issues but I’m not sure because then surely I wouldn’t be able to see a private post at all whether it’s through the facebook wall or Graph API, unless the API is more strict about privacy).

I could try logging directly into Facebook using RCurl and doing things like setting cookies but that would require me having to first learn HOW to set cookies in RCurl (and feeling motivated enough to spend the extra time required to do it). Seeing as I really want to spend the majority of my spare programming time learning python, I’m going to give this one a miss for now.

Therefore I want to do this scraping of data using the skills I already have (which is a rather basic understanding of XPath via the XML package). I was tempted to learn about setting cookies with RCurl but it’s Friday and that means I just want the weekend to start already…

Limitations

Links to blog posts on the Facebook wall often do not give information about the original author of the blog. This is rather annoying because it means that some web-crawling is necessary to find out who wrote the post instead of that information being readily available in the first instance. I’m going to limit my code to only crawling for extra information from R-bloggers.com links because it is very easy to scrape data off that website via XPath (and saves me writing lots of code to try and work with other types of websites).

The R-bloggers facebook page has wall posts going back to January 2010. Prior to September 2011 blog posts pointed to the “notes” page on facebook. This prevents me getting extra data about the blog post because I can’t automatically navigate to those facebook pages. From Septermeber 2011 onwards however the blog posts point to R-bloggers.com and so these can be scraped for further information rather easily. Luckily I only started posting in November 2011 so this isn’t an issue for me.

Not all wall posts indicate how many comments they have if there are only a few comments. Not sure how to get round this, might have to write “at least 1 comment” for this situation maybe.

Most of the wall posts are made up of links to R-bloggers.com and various messages by Facebook users. Instead of filtering out, I’m just going to grab AS MUCH INFORMATION off of the wall feed as I can and then filter at the end. I’ll put the unfiltered information into a csv file for anyone that may want it and post it up on github.

Method

The easiest method would be to log into Facebook via the browser, navigate to the R-bloggers Facebook page, use the socialfixer.com browser add-on “Better Facebook” to automatically and painlessly load all posts in the R-bloggers feed going back to January 2010 and then save that page to the hard drive using, in google chrome browser terminology, the “Web Page, Complete” option (NOT the “Web Page, HTML Only” option because for some reason that won’t work well with my code).

Once the data is in a html file, use XPath expressions via Duncan Temple Lang’s XML package to extract whatever information I can in the first instance and store into a data.frame.

Once this initial data is in place, I will crawl any posts which link to R-bloggers.com and extract extra information about the post (e.g. Author, original publication date, post title, etc.). I will merge this data with the already constructed data.frame above.

I will then save this data.frame to a .csv file in case anyone else wishes to analyse it (thus saving them some time). Finally I will subset the data.frame to only posts that link to one of my blog posts and inspect the output.

Solution

source_https <- function(url, ...)  {
  # load package
  require(RCurl)

  source_script <- function(u) {
    # read script lines from website using a security certificate
    script <- getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

    # parse lines and evaluate in the global environement
    eval(parse(text = script), envir= .GlobalEnv)
  }

  # source each script
  sapply(c(url, ...), source_script)
}

Following the procedure describe in the Method section above:

  1. Log into facebook
  2. Naviagate to the R-bloggers facebook wall
  3. Load data as far back as you like. I used the Better Facebook browser add-on tool to automatically load data right back to January 2010.
  4. Save this webpage as a “complete” html file.
  5. Run the following code, selecting the location of the html file when prompted:
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R")
df <- rbloggersFBXScraper()

Depending on your internet connection this could take quite some time to complete because it has to crawl the R-bloggers website for extra information about links posted since September 2011. To save you some time I’ve saved ALL the data which I have scraped into a single csv file. Here’s how to use it:

library(RCurl)
csv.location <- "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv"
txt <- getURL(csv.location, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)

It’s then a simple case of subsetting to find posts by a specific author:

find_posts <- function(df, my.name) {
  subset(df, author == my.name)
}

df2 <- find_posts(df, "Tony Breyal")
t(df2[2,])

#                   30
# timestamp         "Wednesday, December 14, 2011 at 10:29pm"
# num.likes         "6 people like this"
# num.comments      "At least 1 comment"
# num.shares        "0"
# posted.by         "R bloggers"
# message           "I love these things :)http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link      "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link.text "Introduction\n I was asked by a friend how to find the full final address of an URL \nwhich had been shortened via a shortening service (e.g., Twitter’s t.co,\n Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, \nOw.ly, etc.). I replied I had no idea and maybe he should have a look \nover on ..."
# sample.comments   "Kai Feng Chew Yes! It's really cool! I changed a little bit to make it 2 lines to use the shorten function: load(\"unshort.Rdata\") unshort(\"ANY_SHORTEN_URL\") Example:http://cloudst.at/index.php?do=%2Fkafechew%2Fblog%2Funshorten-url-function%2FWednesday, December 14, 2011 at 10:34pm · LikeUnlike ·  1ReplyTony Breyal ‎@Kai\n you might want to use the code from the updated version of the code on \nmy blog because it now handles both https. It won't work with \"http://1.cloudst.at/myeg\" however because that one require the user to be registered (and I'll admit I had not thought of that use case)Thursday, December 15, 2011 at 12:03am · LikeUnlike ·  1Reply"
# rbloggers.link    "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# title             "Unshorten any URL with R"
# first.published   "December 13, 2011"
# author            "Tony Breyal"
# blog.name         " Consistently Infrequent » R"
# blog.link         "http://tonybreyal.wordpress.com/2011/12/13/unshorten-any-url-created-using-url-shortening-services-decode_shortened_url/"
# tags              "dft.ba, R, RCurl, rstats, tinurl, url"

So this tells me that my post entitled “Unshorten any URL with R” got six likes and at least one comment on facebook. Nice. The “sample.comments” field shows what was commented, and that I posted a reply (based on that person’s comment I was able to improve the code and realise that it wouldn’t work with shortened link which requires a user to logged in first). Awesome stuff.

Final Thoughts

So now I have this data I am not quite sure what to do with it. I could do a sorted bar chart with each blog entry on the x-axis and number of facebook likes on the y-axis . I was thinking of doing some sentiment analysis on the sampled comments (I could only scrape visable comments, not the ones you have to press a button to load more for) but I don’t have the time to read up on that type analysis. Maybe in the future :)

R code: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R
csv file: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv

January 4, 2012

Plotting Doctor Who Ratings (1963-2011) with R

Filed under: R — Tags: , , — Tony Breyal @ 1:52 am

Introduction

First day back to work after New Year celebrations and my brain doesn’t really want to think too much. So I went out for lunch and had a nice walk in the park. Still had 15 minutes to kill before my lunch break was over and so decided to kill some time with a quick web scraping exercise in R.

Objective

Download the last 49 years of British TV ratings data for the programme Doctor Who (the longest-running science fiction television show in the world and which is also the most successful science fiction series of all time, in terms of its overall broadcast ratings, DVD and book sales and iTunes traffic) and make a simple plot of it.

Method

Ratings are available from doctorwhonews.net as a series of page separated tables. This means that we can use the RCurl and XML packages to download the first seed webpage, extract the table of ratings, and use XPath to get the weblink to the next page of ratings. Due to time constraints I’m not going to optimise any of this (though given the small data set it probably doesn’t need optimisation anyway).

Solution

get_doctor_who_ratings <- function() {
  # load packages
  require(RCurl)
  require(XML)

  # return Title, Date and Rating
  format_df <- function(df) {
    data.frame(Date = as.POSIXlt(df$Date, format = "%a %d %b %Y"),
               Title = df$Title,
               Rating = as.numeric(gsub("(\\s+).*", "\\1", df$Rating)),
               stringsAsFactors = FALSE)
  }

  # scrape data from web
  get_ratings <- function(u) {
    df.list <- list()
    i <- 1
    while(!is.null(u)) {
      html <- getURL(u)
      doc <- htmlParse(u)
      df.list[[i]] <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
      u.next <- as.vector(xpathSApply(doc, "//div[@class='nav']/a[text()='NEXT']/@href"))
      if(is.null(u.next)) {
        return(df.list)
      }
      u <- sub("info.*", u.next, u)
      i <- i + 1
    }
    return(df.list)
  }

  ### main function code ###
  # Step 1: get tables of ratings for each page that is avaiable
  u <- "http://guide.doctorwhonews.net/info.php?detail=ratings"
  df.list <- get_ratings(u)

  # Step 2: format ratings into a single data.frame
  df <- do.call("rbind", df.list)
  df <- format_df(df)

  # Step 3: return data.frame
  return(df)
}

Using the above, we can pull the ratings into a single data.frame as follows:


# get ratings database
ratings.df <- get_doctor_who_ratings()
head(ratings.df)

# Date Title Rating
# 1 1979-10-20 City of Death - Episode 4 16.1
# 2 1979-10-13 City of Death - Episode 3 15.4
# 3 1979-09-22 Destiny of the Daleks - Episode 4 14.4
# 4 1979-10-06 City of Death - Episode 2 14.1
# 5 1979-09-15 Destiny of the Daleks - Episode 3 13.8
# 6 1975-02-01 The Ark In Space - Episode 2 13.6

&nbsp;

Plot

We can plot this data very easily using the Hadley Wickman’s ggplot2 package:


# do a raw plot
require(ggplot2)
ggplot(ratings.df, aes(x=Date, y=Rating)) + geom_point() + xlab("Date") + ylab("Ratings (millions)") + opts(title = "Doctor Who Ratings (1963-Present) without Context")

The gap in the data is due to the show having been put on permanent hiatus between 1989 and 2005 with the exception of the american episode in 1996.

CAUTION 

This was just a fun coding exercise to quickly pass some time.

The chart above should not be directly interpreted without the proper context as it would be very misleading to suggest that that show was more popular in earlier years than in later years. Bear in mind that TV habits have changed dramatically over the past 50 odd years (I myself barely watch TV live any more and instead make use of catchup services like BBC iplayer which the ratings above to do not account for), that there were fewer channels back in 1963 in Britain, the way BARB collect ratings, and that the prestige of the show has changed over time (once an embarrassment for the BBC with all of it’s criminally low budgets and wobbly sets, to now being one of it’s top flagship shows).

A final note

Although I was part of the generation during which Doctor Who was taken off the air, I do vaguely remember some episodes from my childhood where The Doctor was played by Sylvester McCoy, who to this day is still “my doctor” (as the saying goes) and I would put him right up there with Tennent and Smith as being one of the greats. Best. Show. Ever.

You can find a quick review of series six (i.e. the sixth series of episodes since the show’s return in 2005) right here, and because I love the trailer so much I’ll embed it below:

December 13, 2011

Unshorten (almost) any URL with R

Filed under: R — Tags: , , , , , — Tony Breyal @ 6:57 pm

Introduction

I was asked by a friend how to find the full final address of an URL which had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied I had no idea and maybe he should have a look over on StackOverflow.com or, possibly, the R-help list, and if that didn’t turn up anything to try an online unshortening service like http://unshort.me.

Two minutes later he came back with this solution from Stack Overflow which, surpsingly to me, contained an answer I had provided about 1.5 years ago!

This has always been my problem with programming, that I learn something useful and then completely forget it. I’m kind of hoping that by having this blog it will aid me in remembering these sorts of things.

The Objective

I want to decode a shortened URL to reveal it’s full final web address.

The Solution

The basic idea is to use the getURL function from the RCurl package and telling it to retrieve the header of the webpage it’s connection too and extract the URL location from there.

decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
  decode <- function(u) {
    Sys.sleep(0.5)
    x <- try( getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) )
    if(inherits(x, 'try-error') | length(grep(".*Location: (\\S+).*", x))<1) {
      return(u)
    } else {
      return(gsub('.*Location: (\\S+).*', '\\1', x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs
  urls <- c(url, ...)
  l <- vector(mode = "list", length = length(urls))
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}

And here’s how we use it:

# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"

You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R

Limitations

A comment on the R-bloggers facebook page for this blog post made me realise that this doesn’t work with every shortened URL such as when you need to be logged in for a service, e.g.,

http://1.cloudst.at/myeg

decode_short_url("http://tinyurl.com/adcd",
"http://www.google.com",
"http://1.cloudst.at/myeg")

# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"

I still don’t know why this might be a useful thing to do but hopefully it’s useful to someone out there :)

December 8, 2011

Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!

Filed under: R — Tags: , , — Tony Breyal @ 1:41 pm

Introduction

The old r-wiki optimisation challenge describes a string generation problem which I have bloged about previously both here and here.

The Objective

To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

Solutions One Through Thirteen

A variety of different approaches are illustrated on the r-wiki page which show the performance benefits of things like vectorisation, variable initialisation, linking through to a compiled programming language, reducing a problem to its component parts, etc.

The Fourteenth Solution

The main speed improvement here comes from replacing the function “paste” by “file.path”. This use of “file.path” with parameter fsep=”” only works correctly here because there is never a character vector of length 0 for it to deal with. I only learned about this approach when I happened to see this tweet on twitter with hashtag #rstats and reading the associated help file where it says that it is faster than paste.

generateIndex14 <- function(n) {
  # initialise vectors
  s <- (mode = "character", length = n)

  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # paste strings together
  unlist(lapply(1:(n-1), function(i) file.path("i", s[i], ".", s[(i+1):n], fsep = "") ), use.names = FALSE)
}

Timings:

               test  elapsed    n replications
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Fifteenth Solution: Rcpp

This solution comes from Romain Francois and is based on the tenth solution but implemented in C++ using the R package Rcpp. See his blog for the implementation. This is the sort of thing I would love to learn to do myself but just need to find the time to re-learn C++, though I doubt that’ll happen any time soon as I’m hoping to start my MSc in Statistics next year. This is a great solution though.

Timings:

               test  elapsed    n replications
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Sixteenth Solution

When I was writing up this post I thought up a sixteenth solution (as seems to be the pattern with me on this blog!). This solution gets its speed up by generating the largest set of strings which start i001.xxx first and then replacing the “001” part with “002”, “003”, “004”, etc., for each increment up to and including n-1.


generateIndex16 <- function(n) {
  # initialise vectors
  str <- vector("list", length = n-1)
  s <- vector(mode = "character", length = n)

  # set up strings
  s <- sprintf("%03d", seq_len(n))
  str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "")

  # generate string sequences
  str[2:(n-1)] <- lapply(2:(n-1), function(i) sub("001", s[i], str[[1]][i:(n-1)], fixed=TRUE))
  unlist(str)
}

The above requires matching the “001” part first and then replacing it. However, we know that “001” will ALWAYS be in character positions 2, 3 and 4, and so there may be a way to avoid the matching part altogether (i.e. replace a fixed position substring with another string of equal or larger length) but I could not work out how to do that outside of a regular expression. Sadface.

Timings:

               test  elapsed    n replications
 generateIndex16(n) 20.77200 2000           50
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

Solutions Comparisons For Different N

I like ggplot2 charts and so ran my computer overnight to generate data for the speed performance of the last several solutions over different N:

Final Thoughts

I’m pretty sure that any more speed improvements will come from some or all of the follwing:

  • doing the heavy lifting in a compiled language and interfacing with R
  • running in parallel (I actually got this to work on linux by replacing lapply with mclapply from the parallel R package but the downside was that one has to use much more memory for larger values of N, plus it’s only works in serial fashion on Windows
  • working out an efficient way of replacing a fixed positioned substring with a string of equal or great length
  • compiling the function into R bytecodes using the compiler package function cmpfun

It would also be interesting to profile the memory usage of each funciton.

This was a fun challenge – if you find some spare time why not try your hand at it, you might come up with something even better!  :)

December 7, 2011

Installing Rcpp on Windows 7 for R and C++ integration

Filed under: R — Tags: , , , , — Tony Breyal @ 5:49 pm

Introduction

Romain Francois presented an Rcpp solution on his blog to an old r-wiki optimisation challenge which I had also presented R solutions for previously on my blog.

The Rcpp package provides a method for integrating R and C++. This allows for faster execution of an R project by recoding the slower R parts into C+ and thus providing potential performance enhancements.

The two main attractions, for me personally, of the Rcpp package are (a) help me to re-learn C++ because I’ve not used it in over 10 years and (b) write R packages which interface to existing C++ libraries for the purposes of natural language processing or some other library which does something cool.

The objective

Install Rcpp on Windows 7 Pro x64 to test out Romain’s Rccp solution to the old r-wiki optimisation challenge

The Problem

I had zero problems installing and running Rcpp code on Ubuntu Linux 11.10 but just could not get the same to work on Windows 7 Pro x64, often getting the following warning and errors:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created!

cygwin warning:
MS-DOS style path detected: C:/PROGRA~1/R/R-214~1.0/etc/x64/Makeconf
Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-214~1.0/etc/x64/Makeconf
CYGWIN environment variable option “nodosfilewarning” turns off this warning.
Consult the user’s guide for more details about POSIX paths:

http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

x86_64-w64-mingw32-g++.exe: C:/Program: No such file or directory
x86_64-w64-mingw32-g++.exe: Files/R/R-2.14.0/library/Rcpp/lib/x64/libRcpp.a: No such file or directory

The Solution

It turns out that there’s two issues here:

(1) the cygwin warning which we can ignore as it has nothing to do with Rcpp and is caused by using the 64bit version of MinGW from Rtools when adding it to our PATH variable (easily solved by choosing the 32bit version instead, shown below in step 5 part viii)

(2) the actual Rcpp issue itself which happened because I had originally installed R on a path with a space in it. In fact (and this is something I didn’t previously know) it’s officially recommended that one does not install R on Windows with a path containing a space (see rw-FAQ 2.2).

So armed with this knowledge, here is how I got Rcpp working from beginning to end (a lot of the sub-steps below are blindingly obvious but I included them anyway for the sake of completeness in case other people with similar problems come across this post):

  1. Install R on a path that does not contain a space.
    (i)  download R from: http://cran.r-project.org/bin/windows/base/
    (ii) double click the downloaded .exe file
    (iii) press Run when the security warning appears
    (iv) press Yes when the User Account Control message box appears
    (v) Choose your language (I choose English)
    (vi) when prompted for where to install R, choose a location without a space. I choose “C:\R\R-2.14.0″
    (vii) click Next on all following screens to finish installation (I left the defaults unaltered).
    .
  2. Install Rtools for components necessary to buld R (this has the tool chain require for C++ code compilation)
    (i) Go to this webpage: http://www.murdoch-sutherland.com/Rtools/
    (ii) In the downloads section, choose a compatible version for your version of R. I’m using R-2.14.0 and thus chose Rtools214.exe
    (iii) press Run when the security warning appears
    (iv) press Yes when the User Account Control message box appears
    (v) choose your language (I choose English)
    (vi) choose where to install (I chose “C:\R\Rtools”)
    (vii) click Next on all following screens to finish installation (I left the defaults unaltered).
    .
  3. Download batchfiles to always point to the latest version of R on your system when running R from the command line (I have found that this saves a lot of time in the long run)
    (i) go to: http://cran.r-project.org/contrib/extra/batchfiles/
    (ii) choose the latest version (I chose batchfiles_0.6-6.zip)
    (iii) when downloaded, right click on file and click Extract All.
    (iv) click Browse and choose where you want to put the extracted folder (I chose “C:\R\”)
    .
  4. Download Redmond Path Utility to alter PATH variables in a very user friendly fashion:
    (i) go to: http://download.cnet.com/Redmond-Path/3000-2094_4-10811594.html
    (ii) click on “Download Now CNET Secure Download”
    (iii) double click the downloaded .exe file
    (iv) press Run when the security warning appears
    (iv) press Yes when the User Account Control message box appears
    (v) press Next
    (vi) press Decline (unless you want the annoying advertising extra).
    (vii) click open
    (viii) An explorer window will open with RedmondPath.zip
    (ix) click Extract All.
    (x) click Browse and choose where you want to put the extracted folder (I chose “C:\R\”)
    .
  5. Edit PATH variable to allow system wide access to the current version of R on the computer and components of Rtools
    (i) double click the Redmond Path Utility from step 4 above (mine is in: “C:\R\RedmondPath\Redmond Path.exe”)
    (ii) click Yes when the User Account Control message box appears
    (iii) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (iv)  In the window which pops open, navigate to “C:\R\batchfiles_0.6-6″ and click OK
    (v) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (vi)  In the window which pops open, navigate to “C:\R\Rtools\bin” and click OK
    (vii) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (viii)  In the window which pops open, navigate to “C:\R\Rtools\MinGW\bin” and click OK
    .
  6. Restart your computer (this solved an issue where the edits to the PATH variable above had not taken immediate affect)
    .
  7. Open R and run the following code
    # install packages
    install.packages(c("Rcpp", "rbenchmark", "inline", "Runit"))
    
    # load main two packages
    library(Rcpp)
    library(inline)
    
    # do something with Rcpp to quickly check that it works
    body <- '
    NumericVector xx(x);
    return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));'
    
    add <- cxxfunction(signature(x = "numeric"), body, plugin = "Rcpp")
    
    x <- 1
    y <- 2
    res <- add(c(x, y))
    res
    #[1] 3
    
    

And there you have it, Rcpp working on Windows! I was able to run Romain’s Rcpp code (see his blog for the code) without any problems. Awesome stuff.

Useful resources I read in order to work out how to correctly install Rcpp:

Rcpp-FAQ

http://lists.r-forge.r-project.org/pipermail/rcpp-devel/

http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset

 

November 29, 2011

outersect(): The opposite of R’s intersect() function

Filed under: R — Tony Breyal @ 12:57 pm

The Objective

To find the non-duplicated elements between two or more vectors (i.e. the ‘yellow sections of the diagram above)

The Problem

I needed the opposite of R’s intersect() function, an “outersect()“. The closest I found was setdiff() but the order of the input vectors produces different results, e.g.


x = letters[1:3]
#[1] "a" "b" "c"
y = letters[2:4]
#[1] "b" "c" "d"

# The desired result is
# [1] "a" "d"

setdiff(x, y)
#[1] "a"

setdiff(y, x)
#[1] "d"

setdiff() produces all elements of the first input vector without any matching elements from the second input vector (i.e. is asymmetric). Not quite what I’m after. I’m looking for the ‘yellow’ set of elements as in the picture at the top of the page.

The Solution

Concatenating the results of setdiff() with input vectors in both combinations works a treat:

outersect <- function(x, y) {
  sort(c(setdiff(x, y),
         setdiff(y, x)))
}

x = letters[1:3]
#[1] "a" "b" "c"
y = letters[2:4]
#[1] "b" "c" "d"

outersect(x, y)
#[1] "a" "d"

outersect(y, x)
#[1] "a" "d"

Alternative solution

An equivalent alternative would be to use

outersect <- function(x, y) {
  sort(c(x[!x%in%y],
         y[!y%in%x]))
}

but by using setdiff() in the first solution it makes it easier to read I think.

Further Development

It would be nice to extend this to a variable number of input vectors. This final task turns out to be rather simple:


outersect <- function(x, y, ...) {
  big.vec <- c(x, y, ...)
  duplicates <- big.vec[duplicated(big.vec)]
  setdiff(big.vec, unique(duplicates))
}

# desired result is c(1, 2, 3, 6, 9, 10)
outersect(1:5, 4:8, 7:10)
#[1] 1 2 3 6 9 10

Awesome.

November 24, 2011

source_https(): Sourcing an R Script from github over HTTPS

Filed under: R — Tags: , , , , , — Tony Breyal @ 12:21 pm

The Objective

I wanted to source R scripts hosted on my github repository for use in my blog (i.e. a github version of ?source). This would make it easier for anyone wishing to test out my code snippets on their own computers without having to manually go to my github repo and retrieve a series of R scripts themselves to make it run.

The Problem

The base R function source() fails with HTTPS links on Windows 7. There may be a way around this by starting R using –internet2 from the command line (search for CMD in windows) but that would just be another inconvenience like having to download an R script through your browser in the first place.

An easier approach would be to use RCurl:getURL() by setting either ssl.veryifypeer=FALSE or cainfo to a SSL certificates file. That’s easy enough to achieve but I wanted to wrap the code in a function for convenience as follows:


source_github <- function(u) {
  # load package
  require(RCurl)

  # read script lines from website
  script <- getURL(u, ssl.verifypeer = FALSE)

  # parase lines and evealuate in the global environement
  eval(parse(text = script))
}

source("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")

The problem with the code above was that the functions sourced from the desired R script file only existed locally in source_github() and not globally to the rest of the R session. Sadface.

The Solution

Asking on Stack Overflow produced an answer from the mighty Spacedman who added envir=.GlobalEnv as a parameter to eval. This means that the evaluation is done in the global environment and thus all the contents of the R script are available for the entire R session.

Furthermore, it occurred to me that I could make the function generic to work with any R script that is hosted over a HTTPS connection. To this end, I added a couple of lines of code to download a security certificates text file from the curl website.

source_https <- function(u, unlink.tmp.certs = FALSE) {
  # load package
  require(RCurl)

  # read script lines from website using a security certificate
  if(!file.exists("cacert.pem")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
  script <- getURL(u, followlocation = TRUE, cainfo = "cacert.pem")
  if(unlink.tmp.certs) unlink("cacert.pem")

  # parase lines and evealuate in the global environement
  eval(parse(text = script), envir= .GlobalEnv)
}

source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R", unlink.tmp.certs = TRUE)

Using unlink.tmp.certs = TRUE will delete the security certificates text file that source_https downloads and is an optional parameter (probably best to use it only on the final call of source_https to avoid downloading the same certificates file multiple times).

UPDATE

Based on Kay’s comments, here’s a vectorised version with cross-platform SSL certificates:

source_https <- function(url, ...) {
  # load package
  require(RCurl)

  # parse and evaluate each .R script
  sapply(c(url, ...), function(u) {
    eval(parse(text = getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))), envir = .GlobalEnv)
  })
}

# Example
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R",
             "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")

 

November 18, 2011

htmlToText(): Extracting Text from HTML via XPath

Filed under: R — Tags: , , , , , , — Tony Breyal @ 4:31 pm

Converting HTML to plain text usually involves stripping out the HTML tags whilst preserving the most basic of formatting. I wrote a function to do this which works as follows (code can be found on github):

# load packages
library(RCurl)
library(XML)

# assign input (could be a html file, a URL, html text, or some combination of all three is the form of a vector)
input <- "http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904"

# evaluate input and convert to text
txt <- htmlToText(input)
txt

#r project - Google Search Web Images Videos Maps News Shopping Gmail More Translate Books Finance Scholar Blogs YouTube Calendar Photos Documents Sites Groups Reader Even more » Account Options Sign in Search settings Web History Advanced Search Results 1 - 10 of about 336,000,000 for r project . Everything More Search Options Show options... Web The R Project for Statistical Computing R , also called GNU S, is a strongly functional language and environment to statistically explore data sets, make many graphical displays of data from custom ... www. r - project .org/ - Cached - Similar [Trunc...]

The above uses an XPath approach to achieve it’s goal. Another approach would be to use a regular expression. These two approaches are briefly discussed below:

Regular Expressions

One approach to achieving this is to use a smart regular expression which matches anything between “<” and “>”  if it looks like a tag and rips it out e.g.,

# html code
txt <- "
<html>
  <body>
  This is some random text.
    <p>This is some text in a paragraph.</p>
    <p>This is a statement which says that 2 < 3 = TRUE, 4 < 5 = TRUE and 10 > 9 = TRUE.</p>
  </body>
</html>"

# parse html
pattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
plain.text <- gsub(pattern, "\\1", txt)
cat(plain.text)

#   This is some random text.
#     This is some text in a paragraph.
#     This is a statement which says that 2 < 3 = TRUE, 4 < 5 = TRUE and 10 > 9 = TRUE

I got the regular expression in “pattern” in the code above from a quick google search which gave this webpage from 2004. It’s a pretty smart regex because it recognises the difference between “<” and “> which are used for a HTML tag and “<” and “>” which are used as a natural part of the plain text we want.

I’m still learning regex and I must confess to finding this one slightly intimidating. There seems like there could be a lot of pitfalls with this approach such as what to do about <script></script> tags which hold programming code for the browser between them? The code is plain text because it’s outside of the pointed brackets and would thus be extracted by the regex. However, it is meant for the browser to tell it how to do something – it’s not meant to be displayed in the web browser for the end user to see and thus is not something we want to include in our html-to-text conversion.

This approach would require building more and more sophsiticated regular expressions, or filtering through a series of different regular expressions, to get the desired result when taking into account these diversions. The code above would not give the desired result on the real world example I give below.

XPath

Another approach is to use XPath. The typical technique used it seems to me is to only extract the text between paragraph tags “<p>” and “</p>” as follows:

# load packages
library(RCulr)
library(XML)

# download html
html <- getURL("http://tonybreyal.wordpress.com/2011/11/17/cool-hand-luke-aldwych-theatre-london-2011-production/", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))

# I just got back from watching a production of Cool Hand Luke at the Aldwych Theatre in Central London. Given that [TRUNC...]

That’s a great approach for most webpages such as blogs because of the way they are designed. However, there are cases where it would not work so well, such as if you wanted all the text off of a google search page (though it applies to other pages too of course):

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))

# r project linux
# stats package r
# Search Help

It returned only three lines. So we need to be more liberal by using “//text()” which will return all text outside of HTML tags which is what the regex approach above might give. However, we also need to account for text we don’t want such as style and script codes, which we can do as follows:

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
cat(paste(plain.text, collapse = " "))

#r project - Google Search Web Images Videos Maps News Shopping Gmail More Translate Books Finance Scholar Blogs YouTube Calendar Photos Documents Sites Groups Reader Even more » Account Options Sign in Search settings Web History Advanced Search Results 1 - 10 of about 336,000,000 for r project . Everything More Search Options Show options... Web The R Project for Statistical Computing R , also called GNU S, is a strongly functional language and environment to statistically explore data sets, make many graphical displays of data from custom ... www. r - project .org/ - Cached - Similar [Trunc...]

This second version of the XPath approach seems to work rather well – it feels more robust than a regular expression approach and returns more information that the typical “//p” XPath approach too, thus returning more information for a greater variety of webpages.

Full code for my htmlToText() R function can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/htmlToText/htmlToText.R

P.S. part of the reason I wrote this function is so that I can plug it into my *XScraper functions to provide an extra field of more detailed information using a webCrawl = TRUE option maybe. I may have to write a more sophisticated web crawler though to handle errors for websites it can’t download correctly through RCurl. I’m not an expert in cURL and so it will probably just have a bunch of try() statements, I might try something simple like that for my next post…

Older Posts »

The Shocking Blue Green Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 76 other followers