Consistently Infrequent

November 8, 2011

Web Scraping Google Scholar (Partial Success)

Filed under: R | Tony Breyal @ 2:32 pm

UPDATE: This function has been superseded by googleScholarXScraper()

I wanted to scrape the information returned by a Google Scholar web search into an R data frame as a quick XPath exercise. The following extracts the ‘title’, ‘url’, ‘publication’, ‘description’, ‘type’ and ‘footer’ of each result. If any of these fields is not available, as in the case of a citation, the corresponding cell in the data frame will be NA.

# load packages
library(XML)
library(RCurl)

get_google_scholar_df <- function(u, omit.citation = TRUE) {
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # make data frame from available information on page
  df <- data.frame(
    title = xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    url = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href'))),
    publication = xpathSApply(doc, "//html//body//div[@class='gs_r']//font//span[@class='gs_a']", xmlValue),
    description = xpathSApply(doc, "//html//body//div[@class='gs_r']//font", xmlValue),
    type = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) xmlValue(xmlChildren(x)$span)),
    footer = xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
    stringsAsFactors=FALSE)

  # Clean up
  df$title <- sub(".*\\] ", "", df$title)
  df$description <- sapply(seq_len(nrow(df)), function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(seq_len(nrow(df)), function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))

  # free doc from memory
  free(doc)

  # optionally drop rows containing NA (e.g. citations, which have no url)
  if (omit.citation) na.omit(df) else df
}

u <- "http://scholar.google.com/scholar?hl=en&q=baldur's+gate+2&btnG=Search&as_sdt=0,5&as_ylo=&as_vis=0"
df <- get_google_scholar_df(u, omit.citation = TRUE)
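Because the live page layout can change at any time, the same XPath-plus-NA idea can be tried offline on a small inline snippet. This is only a sketch: the markup below is made up (loosely shaped like what the queries above expect, not copied from a real Google Scholar page), and get_href() is a hypothetical helper, not part of the function above.

```r
library(XML)

# Made-up markup (an assumption): one result with a link, one citation without.
html <- "<html><body>
  <div class='gs_r'><h3><span>[PDF]</span> <a href='http://example.org/a.pdf'>First paper</a></h3></div>
  <div class='gs_r'><h3><span>[CITATION]</span> Second paper</h3></div>
</body></html>"

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# one <h3> node per search result
h3_nodes <- getNodeSet(doc, "//div[@class='gs_r']//h3")

# hypothetical helper: return the href of the <a> child, or NA if there is none
get_href <- function(node) {
  a <- xmlChildren(node)$a
  if (is.null(a)) NA_character_ else xmlGetAttr(a, "href")
}

toy <- data.frame(
  title = sapply(h3_nodes, xmlValue),
  url = sapply(h3_nodes, get_href),
  stringsAsFactors = FALSE
)

# dropping rows with NA removes the citation, as omit.citation = TRUE does
toy_no_citations <- na.omit(toy)

free(doc)
```

The citation row keeps its place in toy with an NA url, which is what lets na.omit() filter it out afterwards.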

The above will produce results as follows:

df$url
# [1] "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# [2] "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQBlc&sig=5WujxIs3fN8W74kw3rYSM4PEw0Y"
# [3] "http://www.itu.dk/stud/projects_f2003/moebius/Burn/Ragelse/Andet/Den%20skriftlige%20opgave/Tekster/Hancock.doc"
# [4] "http://www.aaai.org/Papers/AIIDE/2006/AIIDE06-006.pdf"
# [5] "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.597&rep=rep1&type=pdf"
# [6] "http://www.google.com/patents?hl=en&lr=&vid=USPAT7249121&id=Up-AAAAAEBAJ&oi=fnd&dq=baldur%27s+gate+2&printsec=abstract"

Or the full data frame (using t() for display purposes):

t(df[1,])

# title       "Baldur's gate and history: Race and alignment in digital role playing games"
# url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# publication "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# type        "PDF"
# footer      "Cited by 8 - Related articles - View as HTML - All 10 versions"
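The clean ‘title’ and ‘type’ values above come from the clean-up step in the function. A minimal base R sketch of that step follows; the raw heading strings are assumptions about what the heading text looks like before cleaning (a bracketed tag followed by the title), and the second one is invented to show the no-tag case.

```r
# Assumed raw heading text: a "[PDF] " style tag in front of the title.
raw_heading <- c(
  "[PDF] Baldur's gate and history: Race and alignment in digital role playing games",
  "A heading with no tag"
)

# title: drop everything up to and including the closing bracket and space
title <- sub(".*\\] ", "", raw_heading)

# type: keep only the leading bracketed tag (NA when there is none),
# then strip the brackets the same way the function does
type_raw <- ifelse(grepl("^\\[", raw_heading),
                   sub("^(\\[[^]]*\\]).*", "\\1", raw_heading),
                   NA)
type <- gsub("\\]", "", gsub("\\[", "", type_raw))
```

Note that the greedy ".*\\] " pattern removes up to the last "] " in the string, so a title that itself contains a bracketed phrase would be cut short.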

That was the most information I could pull off a Google Scholar search using XPath, though I have no doubt someone with more knowledge could pull more elements out! Trying to get more elements out just didn’t seem to work for me. Many thanks to John Colby for helping me out with my question over on stackoverflow.com, which made the above possible.
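Some extra fields can be derived without any further XPath, simply by processing the ‘footer’ string that is already captured. The sketch below uses base R only; extract_count() is a hypothetical helper, and the second footer value is invented to show what happens when a count is missing.

```r
# First value is the footer shown above; the second is a made-up example
# of a footer with no counts in it.
footer <- c(
  "Cited by 8 - Related articles - View as HTML - All 10 versions",
  "Related articles"
)

# hypothetical helper: pattern must contain one capture group around the digits
extract_count <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  # each element is c(full match, captured digits), or character(0) if no match
  as.integer(sapply(hits, function(hit) if (length(hit) >= 2) hit[2] else NA_character_))
}

cited_by <- extract_count(footer, "Cited by ([0-9]+)")
versions <- extract_count(footer, "All ([0-9]+) versions")
```

Inside the function these could be added as extra columns, e.g. df$cited_by, although rows without a count would then be dropped by na.omit() when omit.citation = TRUE.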
