UPDATE: This function has been superseded by googleScholarXScraper()
I wanted to scrape the information returned by a Google Scholar web search into an R data frame as a quick XPath exercise. The following extracts the ‘title’, ‘url’, ‘publication’ and ‘description’ of each result, along with its ‘type’ tag and ‘footer’ line. If any of these fields is not available, as in the case of a citation, the corresponding cell in the data frame will contain NA.
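As a minimal sketch of how a missing field ends up as NA, the snippet below parses a small hand-written HTML fragment and runs an XPath query whose callback returns NA when a heading has no link. The markup and the example.org URL are assumptions made up for illustration, not real Google Scholar output, and `xmlGetAttr()` is used here simply as one way to read a single attribute.

```r
library(XML)

# Hypothetical markup: the second result is a citation with no link
html <- "<html><body>
  <div class='gs_r'><h3><a href='http://example.org/a.pdf'>First paper</a></h3></div>
  <div class='gs_r'><h3>[CITATION] Second paper</h3></div>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# One value per <h3>: the href of its <a> child, or NA if there is none
urls <- xpathSApply(doc, "//div[@class='gs_r']//h3", function(x) {
  a <- xmlChildren(x)$a
  if (is.null(a)) NA_character_ else xmlGetAttr(a, "href")
})

# The text of every heading, linked or not
titles <- xpathSApply(doc, "//div[@class='gs_r']//h3", xmlValue)

free(doc)
```

Because both queries walk the same set of `<h3>` nodes, `urls` and `titles` have the same length and can sit side by side in a data frame.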
# load packages
library(XML)
library(RCurl)
get_google_scholar_df <- function(u, omit.citation = TRUE) {
  # download the search results page
  html <- getURL(u)
  # parse HTML into tree structure
  doc <- htmlParse(html)
  # make data frame from available information on page
  df <- data.frame(
    title = xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    url = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href'))),
    publication = xpathSApply(doc, "//html//body//div[@class='gs_r']//font//span[@class='gs_a']", xmlValue),
    description = xpathSApply(doc, "//html//body//div[@class='gs_r']//font", xmlValue),
    type = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) xmlValue(xmlChildren(x)$span)),
    footer = xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
    stringsAsFactors = FALSE)
  # Clean up
  # strip the leading "[TYPE] " tag from each title
  df$title <- sub(".*\\] ", "", df$title)
  # remove the publication line and the footer from the description text
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  # drop the square brackets around the type tag
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
  # free doc from memory
  free(doc)
  # optionally drop rows with missing fields (e.g. citations, which have no url)
  if (omit.citation) na.omit(df) else df
}
u <- "http://scholar.google.com/scholar?hl=en&q=baldur's+gate+2&btnG=Search&as_sdt=0,5&as_ylo=&as_vis=0"
df <- get_google_scholar_df(u, omit.citation = TRUE)
The above will produce results as follows:
df$url
# [1] "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# [2] "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQBlc&sig=5WujxIs3fN8W74kw3rYSM4PEw0Y"
# [3] "http://www.itu.dk/stud/projects_f2003/moebius/Burn/Ragelse/Andet/Den%20skriftlige%20opgave/Tekster/Hancock.doc"
# [4] "http://www.aaai.org/Papers/AIIDE/2006/AIIDE06-006.pdf"
# [5] "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.597&rep=rep1&type=pdf"
# [6] "http://www.google.com/patents?hl=en&lr=&vid=USPAT7249121&id=Up-AAAAAEBAJ&oi=fnd&dq=baldur%27s+gate+2&printsec=abstract"
Or the full data frame (using t() for display purposes):
t(df[1,])
# title       "Baldur's gate and history: Race and alignment in digital role playing games"
# url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# publication "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# type        "PDF"
# footer      "Cited by 8 - Related articles - View as HTML - All 10 versions"
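To see what the clean-up step does in isolation, here is a small base-R sketch applied to hand-made strings. The values are illustrative assumptions modelled on the first result above (shortened, and with a "[PDF] " prefix added to the title), not scraped data.

```r
# Illustrative values, modelled on the first result shown above
title       <- "[PDF] Baldur's gate and history"
publication <- "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
footer      <- "Cited by 8 - Related articles"
description <- paste0(publication, "... It is argued that games like Baldur's Gate ...", footer)

# Drop everything up to and including the "] " that closes the type tag
clean_title <- sub(".*\\] ", "", title)

# fixed = TRUE treats the pattern as a literal string, so the parentheses
# in the publication line are not interpreted as regex syntax
clean_desc <- gsub(publication, "", description, fixed = TRUE)
clean_desc <- gsub(footer, "", clean_desc, fixed = TRUE)

# The type tag itself, without its brackets
type <- gsub("\\]", "", gsub("\\[", "", "[PDF]"))
```

The literal matching matters because publication lines often contain characters such as parentheses, which would otherwise be read as regular expression syntax.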
That was the most information I could pull from a Google Scholar search using XPath. My attempts to extract further elements did not work, though I have no doubt someone with more XPath knowledge could get more out of the page. Many thanks to John Colby for helping me with my question over on stackoverflow.com, which made the above possible.
