Consistently Infrequent

December 13, 2011

Unshorten (almost) any URL with R

Filed under: R — Tony Breyal @ 6:57 pm

Introduction

I was asked by a friend how to find the full final address of an URL which had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied I had no idea and maybe he should have a look over on StackOverflow.com or, possibly, the R-help list, and if that didn’t turn up anything to try an online unshortening service like http://unshort.me.

Two minutes later he came back with this solution from Stack Overflow which, surprisingly to me, contained an answer I had provided about 1.5 years ago!

This has always been my problem with programming, that I learn something useful and then completely forget it. I’m kind of hoping that by having this blog it will aid me in remembering these sorts of things.

The Objective

I want to decode a shortened URL to reveal its full final web address.

The Solution

The basic idea is to use the getURL function from the RCurl package, telling it to retrieve only the header of the webpage it’s connecting to, and then extract the URL from the Location field of that header.

decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
  decode <- function(u) {
    Sys.sleep(0.5)
    x <- try( getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) )
    if(inherits(x, 'try-error') | length(grep(".*Location: (\\S+).*", x))<1) {
      return(u)
    } else {
      return(gsub('.*Location: (\\S+).*', '\\1', x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs
  urls <- c(url, ...)
  l <- vector(mode = "list", length = length(urls))
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}

And here’s how we use it:

# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"

You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R

Limitations

A comment on the R-bloggers facebook page for this blog post made me realise that this doesn’t work with every shortened URL, such as when you need to be logged in to a service, e.g.,

http://1.cloudst.at/myeg

decode_short_url("http://tinyurl.com/adcd",
"http://www.google.com",
"http://1.cloudst.at/myeg")

# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"

I still don’t know why this might be a useful thing to do but hopefully it’s useful to someone out there :)

December 8, 2011

Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!

Filed under: R — Tony Breyal @ 1:41 pm

Introduction

The old r-wiki optimisation challenge describes a string generation problem which I have blogged about previously both here and here.

The Objective

To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

Solutions One Through Thirteen

A variety of different approaches are illustrated on the r-wiki page which show the performance benefits of things like vectorisation, variable initialisation, linking through to a compiled programming language, reducing a problem to its component parts, etc.

The Fourteenth Solution

The main speed improvement here comes from replacing the function “paste” with “file.path”. This use of “file.path” with the parameter fsep = “” only works correctly here because there is never a zero-length character vector for it to deal with. I only learned about this approach when I happened to see a tweet with the #rstats hashtag and read the associated help file, which says that it is faster than paste.
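
To make the swap concrete, here is a tiny comparison (my own illustration, not from the original post); with fsep = "" the two calls build exactly the same string:

paste("i", "001", ".", "002", sep = "")      # [1] "i001.002"
file.path("i", "001", ".", "002", fsep = "") # [1] "i001.002"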

generateIndex14 <- function(n) {
  # initialise vectors
  s <- vector(mode = "character", length = n)

  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # paste strings together
  unlist(lapply(1:(n-1), function(i) file.path("i", s[i], ".", s[(i+1):n], fsep = "") ), use.names = FALSE)
}

Timings:

               test  elapsed    n replications
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Fifteenth Solution: Rcpp

This solution comes from Romain Francois and is based on the tenth solution but implemented in C++ using the R package Rcpp. See his blog for the implementation. This is the sort of thing I would love to learn to do myself but just need to find the time to re-learn C++, though I doubt that’ll happen any time soon as I’m hoping to start my MSc in Statistics next year. This is a great solution though.

Timings:

               test  elapsed    n replications
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Sixteenth Solution

When I was writing up this post I thought up a sixteenth solution (as seems to be the pattern with me on this blog!). This solution gets its speed-up by first generating the largest set of strings, the ones which start “i001.xxx”, and then replacing the “001” part with “002”, “003”, “004”, etc., for each increment up to and including n-1.


generateIndex16 <- function(n) {
  # initialise vectors
  str <- vector("list", length = n-1)
  s <- vector(mode = "character", length = n)

  # set up strings
  s <- sprintf("%03d", seq_len(n))
  str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "")

  # generate string sequences
  str[2:(n-1)] <- lapply(2:(n-1), function(i) sub("001", s[i], str[[1]][i:(n-1)], fixed=TRUE))
  unlist(str)
}

The above requires matching the “001” part first and then replacing it. However, we know that “001” will ALWAYS be in character positions 2, 3 and 4, and so there may be a way to avoid the matching part altogether (i.e. replace a fixed position substring with another string of equal or larger length) but I could not work out how to do that outside of a regular expression. Sadface.
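
For what it is worth, one non-regex option that occurred to me afterwards (a sketch of my own, not benchmarked) is the substr<- replacement function, which overwrites a fixed character range in place, although only with a string of equal length, so it covers the “001” to “002” case here but not a replacement of larger length:

# positions 2-4 always hold the prefix, so no pattern matching is needed
x <- c("i001.003", "i001.004")
substr(x, 2, 4) <- "002"
x
# [1] "i002.003" "i002.004"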

Timings:

               test  elapsed    n replications
 generateIndex16(n) 20.77200 2000           50
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

Solutions Comparisons For Different N

I like ggplot2 charts and so ran my computer overnight to generate data for the speed performance of the last several solutions over different N:

Final Thoughts

I’m pretty sure that any more speed improvements will come from some or all of the following:

  • doing the heavy lifting in a compiled language and interfacing with R
  • running in parallel (I actually got this to work on linux by replacing lapply with mclapply from the parallel R package, but the downside was that one has to use much more memory for larger values of N, plus it only works in serial fashion on Windows)
  • working out an efficient way of replacing a fixed-position substring with a string of equal or greater length
  • compiling the function into R bytecodes using the compiler package function cmpfun
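
For the byte-compilation and parallel points above, here is a rough sketch of what I mean (my own, with untested timings):

# byte-compile the sixteenth solution (the compiler package ships with base R)
library(compiler)
generateIndex16c <- cmpfun(generateIndex16)

# and on linux the lapply inside generateIndex16 could be swapped for
# parallel::mclapply, at the cost of extra memory, e.g.
# str[2:(n-1)] <- parallel::mclapply(2:(n-1), function(i)
#   sub("001", s[i], str[[1]][i:(n-1)], fixed = TRUE), mc.cores = 4)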

It would also be interesting to profile the memory usage of each function.

This was a fun challenge – if you find some spare time why not try your hand at it, you might come up with something even better!  :)

December 7, 2011

Installing Rcpp on Windows 7 for R and C++ integration

Filed under: R — Tony Breyal @ 5:49 pm

Introduction

Romain Francois presented an Rcpp solution on his blog to an old r-wiki optimisation challenge which I had also presented R solutions for previously on my blog.

The Rcpp package provides a method for integrating R and C++. This allows for faster execution of an R project by recoding the slower R parts into C++ and thus providing potential performance enhancements.

The two main attractions of the Rcpp package, for me personally, are that (a) it will help me to re-learn C++ because I’ve not used it in over 10 years, and (b) it will let me write R packages which interface to existing C++ libraries for the purposes of natural language processing, or some other library which does something cool.

The objective

Install Rcpp on Windows 7 Pro x64 to test out Romain’s Rcpp solution to the old r-wiki optimisation challenge.

The Problem

I had zero problems installing and running Rcpp code on Ubuntu Linux 11.10 but just could not get the same to work on Windows 7 Pro x64, often getting the following warning and errors:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created!

cygwin warning:
MS-DOS style path detected: C:/PROGRA~1/R/R-214~1.0/etc/x64/Makeconf
Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-214~1.0/etc/x64/Makeconf
CYGWIN environment variable option “nodosfilewarning” turns off this warning.
Consult the user’s guide for more details about POSIX paths:

http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

x86_64-w64-mingw32-g++.exe: C:/Program: No such file or directory
x86_64-w64-mingw32-g++.exe: Files/R/R-2.14.0/library/Rcpp/lib/x64/libRcpp.a: No such file or directory

The Solution

It turns out that there are two issues here:

(1) the cygwin warning which we can ignore as it has nothing to do with Rcpp and is caused by using the 64bit version of MinGW from Rtools when adding it to our PATH variable (easily solved by choosing the 32bit version instead, shown below in step 5 part viii)

(2) the actual Rcpp issue itself which happened because I had originally installed R on a path with a space in it. In fact (and this is something I didn’t previously know) it’s officially recommended that one does not install R on Windows with a path containing a space (see rw-FAQ 2.2).
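
A quick check from within R (my own addition, not part of the original post): if either of these prints a path containing a space, expect exactly the compiler errors shown above.

R.home()    # where R itself is installed
.libPaths() # where packages such as Rcpp get installed; a space anywhere here spells trouble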

So armed with this knowledge, here is how I got Rcpp working from beginning to end (a lot of the sub-steps below are blindingly obvious but I included them anyway for the sake of completeness in case other people with similar problems come across this post):

  1. Install R on a path that does not contain a space.
    (i)  download R from: http://cran.r-project.org/bin/windows/base/
    (ii) double click the downloaded .exe file
    (iii) press Run when the security warning appears
    (iv) press Yes when the User Account Control message box appears
    (v) Choose your language (I chose English)
    (vi) when prompted for where to install R, choose a location without a space. I chose “C:\R\R-2.14.0”
    (vii) click Next on all following screens to finish installation (I left the defaults unaltered).
  2. Install Rtools for the components necessary to build R packages from source (this has the tool chain required for C++ code compilation)
    (i) Go to this webpage: http://www.murdoch-sutherland.com/Rtools/
    (ii) In the downloads section, choose a compatible version for your version of R. I’m using R-2.14.0 and thus chose Rtools214.exe
    (iii) press Run when the security warning appears
    (iv) press Yes when the User Account Control message box appears
    (v) choose your language (I chose English)
    (vi) choose where to install (I chose “C:\R\Rtools”)
    (vii) click Next on all following screens to finish installation (I left the defaults unaltered).
  3. Download batchfiles to always point to the latest version of R on your system when running R from the command line (I have found that this saves a lot of time in the long run)
    (i) go to: http://cran.r-project.org/contrib/extra/batchfiles/
    (ii) choose the latest version (I chose batchfiles_0.6-6.zip)
    (iii) when downloaded, right click on file and click Extract All.
    (iv) click Browse and choose where you want to put the extracted folder (I chose “C:\R\”)
  4. Download Redmond Path Utility to alter PATH variables in a very user friendly fashion:
    (i) go to: http://download.cnet.com/Redmond-Path/3000-2094_4-10811594.html
    (ii) click on “Download Now CNET Secure Download”
    (iii) double click the downloaded .exe file
    (iv) press Run when the security warning appears
    (v) press Yes when the User Account Control message box appears
    (vi) press Next
    (vii) press Decline (unless you want the annoying advertising extra).
    (viii) click open
    (ix) An explorer window will open with RedmondPath.zip
    (x) click Extract All.
    (xi) click Browse and choose where you want to put the extracted folder (I chose “C:\R\”)
  5. Edit PATH variable to allow system wide access to the current version of R on the computer and components of Rtools
    (i) double click the Redmond Path Utility from step 4 above (mine is in: “C:\R\RedmondPath\Redmond Path.exe”)
    (ii) click Yes when the User Account Control message box appears
    (iii) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (iv)  In the window which pops open, navigate to “C:\R\batchfiles_0.6-6″ and click OK
    (v) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (vi)  In the window which pops open, navigate to “C:\R\Rtools\bin” and click OK
    (vii) click the green “+” icon in the top left corner so we can add elements to the PATH variable
    (viii)  In the window which pops open, navigate to “C:\R\Rtools\MinGW\bin” and click OK
  6. Restart your computer (this solved an issue where the edits to the PATH variable above had not taken immediate effect)
  7. Open R and run the following code
    # install packages
    install.packages(c("Rcpp", "rbenchmark", "inline", "RUnit"))
    
    # load main two packages
    library(Rcpp)
    library(inline)
    
    # do something with Rcpp to quickly check that it works
    body <- '
    NumericVector xx(x);
    return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));'
    
    add <- cxxfunction(signature(x = "numeric"), body, plugin = "Rcpp")
    
    x <- 1
    y <- 2
    res <- add(c(x, y))
    res
    #[1] 3
    
    

And there you have it, Rcpp working on Windows! I was able to run Romain’s Rcpp code (see his blog for the code) without any problems. Awesome stuff.
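
If it still does not work for you, a quick sanity check from within R (my own addition, not one of the original steps) is to confirm that the Rtools toolchain is actually visible on the PATH that R sees:

Sys.which("gcc") # should return full paths under C:\R\Rtools; an empty string means R cannot find the compiler
Sys.which("g++")
Sys.which("sh")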

Useful resources I read in order to work out how to correctly install Rcpp:

Rcpp-FAQ

http://lists.r-forge.r-project.org/pipermail/rcpp-devel/

http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset

 

November 24, 2011

source_https(): Sourcing an R Script from github over HTTPS

Filed under: R — Tony Breyal @ 12:21 pm

The Objective

I wanted to source R scripts hosted on my github repository for use in my blog (i.e. a github version of ?source). This would make it easier for anyone wishing to test out my code snippets on their own computers without having to manually go to my github repo and retrieve a series of R scripts themselves to make it run.

The Problem

The base R function source() fails with HTTPS links on Windows 7. There may be a way around this by starting R with the --internet2 flag from the command line (search for CMD in Windows), but that would just be another inconvenience, like having to download an R script through your browser in the first place.

An easier approach would be to use RCurl::getURL() by setting either ssl.verifypeer = FALSE or cainfo to an SSL certificates file. That’s easy enough to achieve, but I wanted to wrap the code in a function for convenience as follows:


source_github <- function(u) {
  # load package
  require(RCurl)

  # read script lines from website
  script <- getURL(u, ssl.verifypeer = FALSE)

  # parse lines and evaluate
  eval(parse(text = script))
}

source("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")

The problem with the code above was that the functions sourced from the desired R script file only existed locally in source_github() and not globally to the rest of the R session. Sadface.

The Solution

Asking on Stack Overflow produced an answer from the mighty Spacedman who added envir=.GlobalEnv as a parameter to eval. This means that the evaluation is done in the global environment and thus all the contents of the R script are available for the entire R session.

Furthermore, it occurred to me that I could make the function generic to work with any R script that is hosted over a HTTPS connection. To this end, I added a couple of lines of code to download a security certificates text file from the curl website.

source_https <- function(u, unlink.tmp.certs = FALSE) {
  # load package
  require(RCurl)

  # read script lines from website using a security certificate
  if(!file.exists("cacert.pem")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
  script <- getURL(u, followlocation = TRUE, cainfo = "cacert.pem")
  if(unlink.tmp.certs) unlink("cacert.pem")

  # parse lines and evaluate in the global environment
  eval(parse(text = script), envir = .GlobalEnv)
}

source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R", unlink.tmp.certs = TRUE)

Using unlink.tmp.certs = TRUE will delete the security certificates text file that source_https downloads and is an optional parameter (probably best to use it only on the final call of source_https to avoid downloading the same certificates file multiple times).

UPDATE

Based on Kay’s comments, here’s a vectorised version with cross-platform SSL certificates:

source_https <- function(url, ...) {
  # load package
  require(RCurl)

  # parse and evaluate each .R script
  sapply(c(url, ...), function(u) {
    eval(parse(text = getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))), envir = .GlobalEnv)
  })
}

# Example
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R",
             "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")

 

November 18, 2011

htmlToText(): Extracting Text from HTML via XPath

Filed under: R — Tony Breyal @ 4:31 pm

Converting HTML to plain text usually involves stripping out the HTML tags whilst preserving the most basic of formatting. I wrote a function to do this which works as follows (code can be found on github):

# load packages
library(RCurl)
library(XML)

# assign input (could be an html file, a URL, html text, or some combination of all three in the form of a vector)
input <- "http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904"

# evaluate input and convert to text
txt <- htmlToText(input)
txt

#r project - Google Search Web Images Videos Maps News Shopping Gmail More Translate Books Finance Scholar Blogs YouTube Calendar Photos Documents Sites Groups Reader Even more » Account Options Sign in Search settings Web History Advanced Search Results 1 - 10 of about 336,000,000 for r project . Everything More Search Options Show options... Web The R Project for Statistical Computing R , also called GNU S, is a strongly functional language and environment to statistically explore data sets, make many graphical displays of data from custom ... www. r - project .org/ - Cached - Similar [Trunc...]

The above uses an XPath approach to achieve its goal. Another approach would be to use a regular expression. These two approaches are briefly discussed below:

Regular Expressions

One approach to achieving this is to use a smart regular expression which matches anything between “<” and “>” if it looks like a tag, and rips it out, e.g.,

# html code
txt <- "
<html>
  <body>
  This is some random text.
    <p>This is some text in a paragraph.</p>
    <p>This is a statement which says that 2 < 3 = TRUE, 4 < 5 = TRUE and 10 > 9 = TRUE.</p>
  </body>
</html>"

# parse html
pattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
plain.text <- gsub(pattern, "\\1", txt)
cat(plain.text)

#   This is some random text.
#     This is some text in a paragraph.
#     This is a statement which says that 2 < 3 = TRUE, 4 < 5 = TRUE and 10 > 9 = TRUE

I got the regular expression in “pattern” in the code above from a quick google search which gave this webpage from 2004. It’s a pretty smart regex because it recognises the difference between “<” and “>” which are used for an HTML tag, and “<” and “>” which are used as a natural part of the plain text we want.

I’m still learning regex and I must confess to finding this one slightly intimidating. It seems like there could be a lot of pitfalls with this approach, such as what to do about <script></script> tags which hold programming code for the browser between them. The code is plain text because it’s outside of the angle brackets and would thus be extracted by the regex. However, it is meant for the browser, to tell it how to do something – it’s not meant to be displayed in the web browser for the end user to see, and thus is not something we want to include in our html-to-text conversion.
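
To make that pitfall concrete, here is a toy example of my own using the same “pattern” as above; the tags are stripped but the JavaScript between them survives:

txt2 <- "<html><body>
  <script>var x = 1; // code meant for the browser, not the reader</script>
  <p>Visible text.</p>
</body></html>"
cat(gsub(pattern, "\\1", txt2))
#   var x = 1; // code meant for the browser, not the reader
#   Visible text.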

This approach would require building more and more sophisticated regular expressions, or filtering through a series of different regular expressions, to get the desired result once these diversions are taken into account. The code above would not give the desired result on the real-world example I give below.

XPath

Another approach is to use XPath. The typical technique, it seems to me, is to extract only the text between paragraph tags “<p>” and “</p>” as follows:

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("http://tonybreyal.wordpress.com/2011/11/17/cool-hand-luke-aldwych-theatre-london-2011-production/", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))

# I just got back from watching a production of Cool Hand Luke at the Aldwych Theatre in Central London. Given that [TRUNC...]

That’s a great approach for most webpages such as blogs because of the way they are designed. However, there are cases where it would not work so well, such as if you wanted all the text off of a google search page (though it applies to other pages too of course):

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))

# r project linux
# stats package r
# Search Help

It returned only three lines. So we need to be more liberal by using “//text()”, which will return all text outside of HTML tags (roughly what the regex approach above might give). However, we also need to account for text we don’t want, such as style and script code, which we can do as follows:

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("http://www.google.co.uk/search?gcx=c&sourceid=chrome&ie=UTF-8&q=r+project#pq=%22hello+%3C+world%22&hl=en&cp=5&gs_id=3r&xhr=t&q=phd+comics&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=phd+c&aq=0&aqi=g4&aql=&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=27ff09b2758eb4df&biw=1599&bih=904", followlocation = TRUE)

# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
cat(paste(plain.text, collapse = " "))

#r project - Google Search Web Images Videos Maps News Shopping Gmail More Translate Books Finance Scholar Blogs YouTube Calendar Photos Documents Sites Groups Reader Even more » Account Options Sign in Search settings Web History Advanced Search Results 1 - 10 of about 336,000,000 for r project . Everything More Search Options Show options... Web The R Project for Statistical Computing R , also called GNU S, is a strongly functional language and environment to statistically explore data sets, make many graphical displays of data from custom ... www. r - project .org/ - Cached - Similar [Trunc...]

This second version of the XPath approach seems to work rather well – it feels more robust than a regular expression approach and returns more information than the typical “//p” XPath approach, so it works for a greater variety of webpages.

Full code for my htmlToText() R function can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/htmlToText/htmlToText.R

P.S. Part of the reason I wrote this function is so that I can plug it into my *XScraper functions to provide an extra field of more detailed information, maybe via a webCrawl = TRUE option. I may have to write a more sophisticated web crawler though, to handle errors for websites it can’t download correctly through RCurl. I’m not an expert in cURL and so it will probably just have a bunch of try() statements. I might try something simple like that for my next post…

November 16, 2011

fgui: Automatically Creating Widgets for Arguments of a Function – A Quick Example

Filed under: R — Tony Breyal @ 11:14 am

Here’s something I came across by accident, an R package called fgui which has the ability to automatically create a widget just by passing it a function with parameters, e.g.:

# load packages
require(fgui)

# add two number together and return the value
add <- function(x1,  x2) {
  return(x1 + x2)
}

# execute function through GUI
y <- guiv(add)
# [1] 5

The GUI produced from the code above looks like this:

I love how easy that was to do, very cool, and useful too! The package is a wrapper to the base R package tcltk.

November 14, 2011

GScholarXScraper: Hacking the GScholarScraper function with XPath

Filed under: R — Tony Breyal @ 12:36 am

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which, when given a search string, will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I had posted an independent Google Scholar scraper function get_google_scholar_df() which does a similar job to the scraping part of Kay’s function, but using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on that webpage into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essentially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

  • Restructure the internal code of GScholarScraper into a series of local functions which each do a separate job (this made it easier for me to hack because I understood what was doing what and why).
  • As far as possible, strip out Regular Expressions and replace with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other but for me personally, I find XPath shorter and quicker to code but either is a good approach for web scraping like this (note to self: I really need to learn more about regular expressions because they’re awesome!)
  • Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
  • Make use of getURL from the RCurl package (I was getting some multibyte string problems originally when using readLines but this approach automatically fixed it for me).
  • Add option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results
  • Added stemming via the Rstem package because I couldn’t get the Snowball package to install with my version of java. This was important to me because I was getting word clouds with variations of the same word in them e.g. “game”, “games”, “gaming”.
  • Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.
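
To illustrate that last point, URLencode() percent-encodes the characters that would otherwise break the generated URL (a small aside of my own; whether reserved = TRUE is needed depends on how the query string is built):

URLencode("Baldur's Gate")                  # [1] "Baldur's%20Gate"
URLencode("Baldur's Gate", reserved = TRUE) # [1] "Baldur%27s%20Gate"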

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

#EXAMPLE 1: produces a word cloud based on the 'title' field of Google Scholar search results and an input search string
GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)

#              word freq
# game         game   71
# comput     comput   22
# video       video   13
# learn       learn   11
# [TRUNC...]
#
#
# Number of titles submitted = 210
#
# Number of results as retrieved from first webpage = 267
#
# Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

I think that’s kind of cool (sorry about the resolution clarity as I can’t seem to add .svg images on here) and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :) The following is produced if we look at the ‘description’ field instead of the ‘title’ field:

# EXAMPLE 2: produces a word cloud based on the 'description' field of Google Scholar search results and an input search string
GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)

#                word freq
# page           page  147
# gate           gate  132
# game           game  130
# baldur       baldur  129
# roleplay   roleplay   21
# [TRUNC...]
#
# Number of titles submitted = 210
#
# Number of results as retrieved from first webpage = 267
#
# Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

Not bad, and better than the ‘title’ field. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

One of the drawbacks of the ‘title’ and ‘description’ fields is that they are truncated. It would be nice to crawl to the webpage of each result URL, scrape the text from there, and add that as an ‘abstract’ field for more useful results. If I get time I might add that.

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set. I had fun doing this and improved my other *XScraper functions in the process!

Code:

Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GScholarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R


November 11, 2011

Web Scraping Yahoo Search Page via XPath

Filed under: R — Tony Breyal @ 12:25 am

Seeing as I’m on a bit of an XPath kick as of late, I figured I’d continue on scraping search results but this time from Yahoo.com

Rolling my own version of xpathSApply to handle NULL elements seems to have done the trick and so far it’s been relatively easy to do the scraping. I’ve created an R function which will scrape information from a Yahoo Search page (with the user supplying the Yahoo Search URL) and will extract as much information as it can whilst maintaining the data frame structure (full source code at end of post). For example:

# load packages
library(RCurl)
library(XML)

# user provides url and the function extracts relevant information into a data frame as follows
u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592664708059042&mkt=en-GB&setlang=en-GB&w=48d4b732,65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"

I’ve only tested these on web results. The idea of these posts is to get basic functionality and then if I feel it might be fun, to expand the functionality in the future.

It’s nice having an online blog where I can keep these functions I’ve come up with during coding exercises. Maybe if I make enough of these Web Search Engine scrapers I can go ahead and make my first R package. Though the downside of web scraping is that if the structure/entities of the HTML code change then the scrapers may stop working. That could make the package difficult to maintain. I can’t really think of how the package itself might be useful to anyone apart from teaching me personally how to build a package.

Maybe that’ll be worth it in and of itself. Ha, version 2.0 could be just a collection of the self contained functions, version 3.0 could have the functions converted to S3 (which I really want to learn), version 4.0 could have them converted to S4 (again, something I’d like to learn) and version 5.0 could have reference classes (I still don’t know what those things are). Just thinking out loud, could be a good way to learn more R. Doubt I’ll do it though but we’ll see. I have to find time to start learning Python so might have to put R on the back burner soon!

Full source code here (function is self-contained, just copy and paste):

# load packages
library(RCurl)
library(XML)

get_yahoo_search_df <- function(u) {
  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub( path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
    if(!is.null(FUN2)) xx <- FUN2(xx)
    xx[sapply(xx, length)<1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # download html and parse into tree structure
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html)

  # path to nodes of interest
  path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"

  # construct data frame
  df <- data.frame(
    title = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a", xmlValue),
    url = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a[@href]", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[2])),
    description = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div", xmlValue),
    cached = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/a[@href][text()='Cached']", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[1])),
    recorded = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/span[@id='resultTime']", xmlValue),
    stringsAsFactors = FALSE)

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1:5, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592664708059042&mkt=en-GB&setlang=en-GB&w=48d4b732,65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"
#             2
# title       "WIL WHEATON DOT NET"
# url         "http://www.wilwheaton.net/coollinks.php"
# description "Wil Wheaton - Don't be a dick! - Writer and Actor - Your Mom - I'm Wil Wheaton. I'm an author (that's why I'm wilwheatonbooks), an actor, and a lifelong geek."
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592836504520824&mkt=en-GB&setlang=en-GB&w=eaeb9364,4a4e7c54&icp=1&.intl=uk&sig=VC7eV8GUMXVuu9apHagYNg--"
# recorded    "2 days ago"
#             3
# title       "this is one hell of a geeky weekend - WWdN: In Exile"
# url         "http://wilwheaton.typepad.com/wwdnbackup/2008/05/this-is-one-hel.html"
# description "WIL WHEATON DOT NET2 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4559391600545150&mkt=en-GB&setlang=en-GB&w=90d3ee39,34d4424b&icp=1&.intl=uk&sig=ZN.UpexVV4pm3yn7XiEURw--"
# recorded    "2 days ago"
#             4
# title       "Wil Wheaton - Google+ - I realized today that when someone ..."
# url         "https://plus.google.com/108176814619778619437/posts/ENTkBMZKeGY"
# description ">Cool Sites. Okay, I'm talking to the guys here: do you ever get \"the sigh\"? You know what I'm talking about...you're really into some cool website, and your ..."
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4718764947541872&mkt=en-GB&setlang=en-GB&w=9bca6e9a,dba19826&icp=1&.intl=uk&sig=jGaKkuIFOINEBBfBwarrgg--"
# recorded    "6 days ago"
#             5
# title       "The Hot List: Dwight Slade, Back Fence PDX, Wil Wheaton vs ..."
# url         "http://www.oregonlive.com/movies/index.ssf/2011/11/the_hot_list_dwight_slade_back.html"
# description "this is one hell of a geeky weekend - WWdN: In Exile2 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=414191857143&mkt=en-GB&setlang=en-GB&w=3081364,e585aa21&icp=1&.intl=uk&sig=KufdBZ_Thr1Mm8.SnjpMUQ--"
# recorded    "4 hours ago"

UPDATE: I’ve created a github account and the above code can be found at: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/get_yahoo_search_df.R

November 10, 2011

Facebook Graph API Explorer with R (on Windows)

Filed under: R — Tony Breyal @ 2:16 pm

I wanted to play around with the Facebook Graph API using the Graph API Explorer page as a coding exercise. This facility allows one to use the API with a temporary authorisation token. Now, I don’t know how to make an R package for the proper API, where you have to register for an API key and do some OAuth stuff, because that is above my current skill set, but the Explorer page itself is a nice middle ground.

Therefore I’ve come up with a self-contained R function which allows me to do just that (full code at end of post):


# load packages
library(RCurl)
library(RJSONIO)

# get facebook data
df <- Facebook_Graph_API_Explorer()
t(df[7,])

# post.id                      "127031120644257_319044381442929"
# from.name                    "Doctor Who"
# from.id                      "127031120644257"
# to.name                      "Doctor Who"
# to.id                        "127031120644257"
# to.category                  "Tv show"
# created.time                 "2011-11-10 11:13:42"
# message                      "Has it ever been found out who blew up the TARDIS?"
# type                         "status"
# likes.count                  NA
# comments.count               "3"
# sample.comments              "Did the tardis blow up I haven't seen all of sesion 6&7 [next>>] \"7\" ??? [next>>] the pandorica was obsorbin earth so he blew it up with the tardis"
# sample.comments.from.name    "Alex Nomikos [next>>] Paul Morris [next>>] Vivienne Leigh Bruen"
# sample.comments.from.id      "100001033497348 [next>>] 595267764 [next>>] 100000679940192"
# sample.comments.created.time "2011-11-10 11:23:36 [next>>] 2011-11-10 11:29:56 [next>>] 2011-11-10 13:04:53"

In the above, I’m using “[next>>]” as a way of separating entities in the same cell in order to keep the data frame structure. The order is maintained across cells, i.e. the first entity of each cell of the sample.comments.from.name column corresponds to the first entity of each cell of the sample.comments.from.id column, etc, etc.
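
If you later want the individual entries back out of one of those cells, a one-liner like this does it (my own note, using row 7 from the example above):

strsplit(as.character(df$sample.comments.from.name[7]), " [next>>] ", fixed = TRUE)[[1]]
# [1] "Alex Nomikos"         "Paul Morris"          "Vivienne Leigh Bruen"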

The main problem I experienced, and have been experiencing for a long time with R, is dealing with a list which has a NULL as one of its elements and then un-listing it whilst still maintaining the same length. For example:

mylist <- list(a=1, b=NULL, c="hello")
unlist(mylist, use.names = FALSE)
# [1] "1"     "hello"

Whereas what I really want is for the NULL to be converted to NA and thus have the length of the resulting vector be the same as that of the original list, e.g.

mylist <- list(a=1, b=NULL, c="hello")
mylist[sapply(mylist, is.null)] <- NA
unlist(mylist, use.names = FALSE)
# [1] "1"     NA      "hello"

But I don’t know of any automatic way of doing that and so have to do it manually each time. I tell you, these NULL elements in lists are really causing me headaches when it comes to using unlist!
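
The manual fix-up can at least be wrapped into a little helper so it only has to be written once (my own sketch, not from the original post):

unlist_with_na <- function(l) {
  # convert NULL elements to NA first so unlist keeps the original length
  l[vapply(l, is.null, logical(1))] <- NA
  unlist(l, use.names = FALSE)
}

unlist_with_na(list(a = 1, b = NULL, c = "hello"))
# [1] "1"     NA      "hello"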

Anyway, back to the Facebook_Graph_API_Explorer() function, there are a couple of points to bear in mind:

  1. This will only work on Windows because I don’t know what a cross platform version of winDialogString is. I’m guessing the tcltk package has something but I can’t see what it would be.
  2. You must already be signed into Facebook (i.e. you must have an account and be signed in) before you call my Facebook_Graph_API_Explorer()
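
Regarding point 1, a rough cross-platform fallback (a sketch of my own, without the nice dialog boxes) would be to drop back to readline(), which works in any interactive R session:

prompt_string <- function(message, default = "") {
  # readline() only works in interactive sessions, but so do the dialogs
  ans <- readline(paste(message, " [", default, "]: ", sep = ""))
  if (nzchar(ans)) ans else default
}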

The function will guide you through the process with dialogue boxes so it should be easy to use for anyone. I think next time I’ll try a web scraping exercise on the HTML of a facebook wall page using XPath, depends on how much time I get!

Tony Breyal

P.S. Full code is below:


library(RCurl)
library(RJSONIO)

Facebook_Graph_API_Explorer <- function() {
  get_json_df <- function(data) {
    l <- list(
        post.id = lapply(data, function(post) post$id),
        from.name = lapply(data, function(post) post$from$name),
        from.id = lapply(data, function(post) post$from$id),
        to.name = lapply(data, function(post) post$to$data[[1]]$name),
        to.id = lapply(data, function(post) post$to$data[[1]]$id),
        to.category = lapply(data, function(post) post$to$data[[1]]$category),
        created.time = lapply(data, function(post) as.character(as.POSIXct(post$created_time, origin="1970-01-01", tz="GMT"))),
        message = lapply(data, function(post) post$message),
        type = lapply(data, function(post) post$type),
        likes.count = lapply(data, function(post) post$likes$count),
        comments.count = lapply(data, function(post) post$comments$count),
        sample.comments = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$message), collapse = " [next>>] ")),
        sample.comments.from.name = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$from$name), collapse = " [next>>] ")),
        sample.comments.from.id = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$from$id), collapse = " [next>>] ")),
        sample.comments.created.time = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) as.character(as.POSIXct(comment$created_time, origin="1970-01-01", tz="GMT"))), collapse = " [next>>] "))
        )
    # replace all occurrences of NULL with NA
    df = data.frame(do.call("cbind", lapply(l, function(x) sapply(x, function(xx) ifelse(is.null(xx), NA, xx)))))
    return(df)
  }

  # STEP 1: Get certs so we can access https links (we'll delete it at the end of the script)
  if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")

  # STEP 2: Get facebook token to access data. I need a cross-platform version of winDialog and winDialogString, otherwise this only works on Windows
  winDialog(type = "ok", "Make sure you have already signed into Facebook.\n\nWhen the browser opens, please click 'Get Access Token' twice. You DO NOT need to select/check any boxes for a public feed.\n\nAfter pressing OK, switch over to your now open browser.")
  browseURL("http://developers.facebook.com/tools/explorer/?method=GET&path=100002667499585")
  token <- winDialogString("When the browser opens, please click 'Get Access Token' twice and copy/paste the token below", "")

  # STEP 3: Get facebook ID. This can be a fanpage or whatever e.g. https://www.facebook.com/DoctorWho
  ID <- winDialogString("Please enter FB name id below:", "https://www.facebook.com/DoctorWho")
  ID <- gsub(".*com/", "", ID)

  # STEP 4: Construct Facebook Graph API URL
  u <- paste("https://graph.facebook.com/", ID, "/feed", "?date_format=U", "&access_token=", token, sep = "")

  # STEP 5: How far back do you want to get data for? Format should be YYYY-MM-DD
  user.last.date <- try(as.Date(winDialogString("Please enter a date for how roughly far back to gather data from using this format: yyyy-mm-dd", "")), silent = TRUE)
  current.last.date <- user.last.date + 1

  # Get data
  df.list <- list()
  i <- 1
  while(current.last.date > user.last.date) {
    # Download the JSON feed
    json <- getURL(u, cainfo = "cacert.perm")
    json <- fromJSON(json, simplify = FALSE)
    data <- json$data
    stopifnot(!is.null(data))

    # Get json Data Frame
    df.list[[i]] <- get_json_df(data)
    i <- i + 1

    # variables for while loop
    current.last.date <- as.Date(as.POSIXct(json$data[[length(json$data)]]$created_time, origin="1970-01-01", tz="GMT"))
    print(paste("Current batch of dates being processed is:", current.last.date, "(loading more...)"))
    u <- json$paging$`next`
  }

  # delete the security certificates file we downloaded earlier for https sites.
  file.remove("cacert.perm")
  # return data frame
  df <- do.call("rbind", df.list)
  return(df)
}

df <- Facebook_Graph_API_Explorer()
t(df[4,])
# post.id                      "127031120644257_319062954774405"
# from.name                    "Torchwood"
# from.id                      "119328091441982"
# to.name                      "Torchwood"
# to.id                        "119328091441982"
# to.category                  "Tv show"
# created.time                 "2011-11-10 12:05:21"
# message                      "If you're missing Torchwood & Doctor Who and are after some good, action-packed science fiction, why not check out FOX's awesome prehistoric romp, Terra Nova? It's carried in the UK on Sky TV and is well worth catching up with & following! The idea - The Earth is dying, it's in its final years. Life's intolerable & getting worse. Scientists take advantage of a rift in time & space to set up a 'fresh start' colony on Terra Nova - the earth, 60 million years ago. The adventure then begins..."
# type                         "link"
# likes.count                  NA
# comments.count               "0"
# sample.comments              ""
# sample.comments.from.name    ""
# sample.comments.from.id      ""
# sample.comments.created.time ""

UPDATE: Based on a suggestion from @BrockTibert I’ve now set up a github account and the above code can be found here: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/facebook_Graph_API_Explorer/facebook_Graph_API_Explorer.R

UPDATE 2: An alternative web-scraping method to bypass the API with R: http://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

November 8, 2011

Web Scraping Google Scholar: Part 2 (Complete Success)

Filed under: R — Tony Breyal @ 11:47 pm

THIS CODE IS NO LONGER MAINTAINED AND WILL NOT WORK

(I’ve left it here for my own reference)

UPDATE: This function has been superseded by googleScholarXScraper()

This is a followup to a post I uploaded earlier today about web scraping data off Google Scholar. In that post I was frustrated because I’m not smart enough to use xpathSApply to get the kind of results I wanted. However, fast-forward to the evening: whilst having dinner with a friend, as a passing remark, she told me how she had finally figured out how to pass a function to another function in R, e.g.

example <- function(x, FUN1, FUN2) {
  a <- sapply(x, FUN1)
  b <- sapply(a, FUN2)
  return(b)
}

example(c(-16,-9,-4,0,4,9,16), abs, sqrt)
# [1] 4 3 2 0 2 3 4

Now that might be a little thing to others, but to me that is amazing because I had never figured it out before! Anyway, using this new piece of knowledge I was able to take another shot at the scraping problem by rolling my own meta version of xpathSApply and was thus able to successfully complete the task!

# load packages
library(RCurl)
library(XML)

# One function to rule them all...
get_google_scholar_df <- function(u) {
  # get web page html
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    nodes.len <- length(xpathSApply(doc, "/html/body/div[@class='gs_r']"))
    paths <- sapply(1:nodes.len, function(i) gsub( "/html/body/div[@class='gs_r']", paste("/html/body/div[@class='gs_r'][", i, "]", sep = ""), path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    xx[sapply(xx, length)<1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # construct data frame
  df <- data.frame(
          footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
          title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
          type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
          publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
          description = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font", xmlValue),
          cited_by = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]/text()", xmlValue),
          cited_ref = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]", xmlAttrs),
          title_url = GS_xpathSApply(doc,  "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/a", xmlAttrs),
          view_as_html = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'View as HTML')]", xmlAttrs),
          view_all_versions = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,' versions')]", xmlAttrs),
          from_domain = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_ggs gs_fl']/a", xmlValue),
          related_articles = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Related articles')]", xmlAttrs),
          library_search = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Library Search')]", xmlAttrs),
          result_set = xpathSApply(doc, "/html/body/form/table/tr/td[2]", xmlValue),
          stringsAsFactors = FALSE)

  # Clean up extracted text
  df$title <- sub(".*\\] ", "", df$title)
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
  df$cited_by <- as.integer(gsub("Cited by ", "", df$cited_by, fixed = TRUE))

  # remove footer as it is now redundant after doing clean up
  df <- df[,-1]

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

Then, given a google scholar url, we can scrape the following information for each search result:

u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"
df <- get_google_scholar_df(u)

t(df[1, ])

# title             "Baldur's gate and history: Race and alignment in digital role playing games"
# type              "PDF"
# publication       "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description       "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# cited_by          "8"
# cited_ref         "/scholar?cites=13835674724285845934&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII&num=20"
# title_url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# view_as_html      "http://scholar.googleusercontent.com/scholar?q=cache:rpHocNswAsAJ:scholar.google.com/+baldur%27s+gate+2&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# view_all_versions "/scholar?cluster=13835674724285845934&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# from_domain       "[PDF] from digra.org"
# related_articles  "/scholar?q=related:rpHocNswAsAJ:scholar.google.com/&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# library_search    NA
# result_set        "Results 1 - 20 of about 404.   (0.29 sec) "

I think that’s kind of cool. Everything is wrapped into one function, which I rather like. This could be extended further by having a function to construct a series of Google Scholar URLs with whatever parameters you require, including which pages of results you desire, and then putting it into a loop. The resulting data frames could then be merged and there you have it! You have a nice little database to do whatever you want with. Not sure what you might want to do with it, but there it is all the same. This was a fun little XPath exercise and even though I didn’t learn how to achieve what I wanted with xpathSApply, it was nice to meta-hack a version of my own to still get the results I wanted. Awesome stuff.
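
For what it is worth, here is a rough sketch of that page-looping idea (my own, and given the note at the top that this scraper is no longer maintained, treat it purely as an illustration; the start parameter and URL layout are assumptions about Google Scholar's paging):

get_google_scholar_pages <- function(query, pages = 1:3, num = 20) {
  # one URL per page of results, then scrape each and stack the data frames
  urls <- paste("http://scholar.google.com/scholar?q=", URLencode(query, reserved = TRUE),
                "&num=", num, "&start=", (pages - 1) * num, sep = "")
  do.call("rbind", lapply(urls, get_google_scholar_df))
}

# df.all <- get_google_scholar_pages("baldur's gate 2", pages = 1:3)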
