Consistently Infrequent

November 8, 2011

Web Scraping Google Scholar: Part 2 (Complete Success)

Filed under: R — Tony Breyal @ 11:47 pm

THIS CODE IS NO LONGER MAINTAINED AND WILL NOT WORK

(I’ve left it here for my own reference)

UPDATE: This function has been superseded by googleScholarXScraper()

This is a follow-up to a post I uploaded earlier today about web scraping data off Google Scholar. In that post I was frustrated because I’m not smart enough to use xpathSApply to get the kind of results I wanted. However, fast-forward to the evening: whilst having dinner with a friend, she mentioned as a passing remark that she had finally figured out how to pass a function to another function in R, e.g.

example <- function(x, FUN1, FUN2) {
  a <- sapply(x, FUN1)
  b <- sapply(a, FUN2)
  return(b)
}

example(c(-16,-9,-4,0,4,9,16), abs, sqrt)
# [1] 4 3 2 0 2 3 4

Now that might be a little thing to others, but to me that is amazing because I had never figured it out before! Anyway, using this new piece of knowledge I was able to take another shot at the scraping problem by rolling my own meta version of xpathSApply and was thus able to successfully complete the task!

# load packages
library(RCurl)
library(XML)

# One function to rule them all...
get_google_scholar_df <- function(u) {
  # get web page html
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    xx[sapply(xx, length)<1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # construct data frame
  df <- data.frame(
          footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
          title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
          type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
          publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
          description = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font", xmlValue),
          cited_by = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]/text()", xmlValue),
          cited_ref = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]", xmlAttrs),
          title_url = GS_xpathSApply(doc,  "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/a", xmlAttrs),
          view_as_html = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'View as HTML')]", xmlAttrs),
          view_all_versions = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,' versions')]", xmlAttrs),
          from_domain = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_ggs gs_fl']/a", xmlValue),
          related_articles = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Related articles')]", xmlAttrs),
          library_search = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Library Search')]", xmlAttrs),
          result_set = xpathSApply(doc, "/html/body/form/table/tr/td[2]", xmlValue),
          stringsAsFactors = FALSE)

  # Clean up extracted text
  df$title <- sub(".*\\] ", "", df$title)
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
  df$cited_by <- as.integer(gsub("Cited by ", "", df$cited_by, fixed = TRUE))

  # remove footer as it is now redundant after doing clean up
  df <- df[,-1]

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

Then, given a Google Scholar URL, we can scrape the following information for each search result:

u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"
df <- get_google_scholar_df(u)

t(df[1, ])

# title             "Baldur's gate and history: Race and alignment in digital role playing games"
# type              "PDF"
# publication       "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description       "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# cited_by          "8"
# cited_ref         "/scholar?cites=13835674724285845934&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII&num=20"
# title_url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# view_as_html      "http://scholar.googleusercontent.com/scholar?q=cache:rpHocNswAsAJ:scholar.google.com/+baldur%27s+gate+2&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# view_all_versions "/scholar?cluster=13835674724285845934&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# from_domain       "[PDF] from digra.org"
# related_articles  "/scholar?q=related:rpHocNswAsAJ:scholar.google.com/&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# library_search    NA
# result_set        "Results 1 - 20 of about 404.   (0.29 sec) "

I think that’s kind of cool, and everything is wrapped into one function, which I rather like. This could be extended further by writing a function which constructs a series of Google Scholar URLs with whatever parameters you require, including which pages of results you want, and then looping over them; the resulting data frames could then be merged together (a rough sketch of that idea is below) and there you have it! You would have a nice little database to do whatever you want with. Not sure what you might want to do with it, but there it is all the same. This was a fun little XPath exercise, and even though I didn’t learn how to achieve what I wanted with xpathSApply itself, it was nice to meta-hack a version of my own and still get the results I wanted. Awesome stuff.
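
For example, here is a minimal, untested sketch of that idea; the num and start parameters are assumptions about how Google Scholar paginates its results, and make_scholar_urls() is just a hypothetical helper, not part of the function above:

# hypothetical helper: build a series of Google Scholar results-page URLs
# (assumes Scholar pages its results via the 'start' parameter, 'per.page' results at a time)
make_scholar_urls <- function(query, pages = 1:3, per.page = 20) {
  starts <- (pages - 1) * per.page
  paste("http://scholar.google.com/scholar?q=", URLencode(query, reserved = TRUE),
        "&num=", per.page, "&start=", starts, "&hl=en", sep = "")
}

# scrape each results page and stack the data frames into one
urls <- make_scholar_urls("baldur's gate 2", pages = 1:3)
df.all <- do.call("rbind", lapply(urls, get_google_scholar_df))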

31 Comments »

  1. Nice post about XPath. That will be nice for my next web scraping ideas :-) Although I like BeautifulSoup (Python lib) better.

    Google Scholar lets you download BibTeX (look in the preferences, where you can specify the reference format). To get that URL, you need to send a cookie. Here is some Python code to generate such a cookie:

    google_id = hashlib.md5(str(random.random())).hexdigest()[:16]
    _ID_COOKIE = "GSP=ID=%s:CF=4" % google_id
    

    No idea how to send a cookie with R.

    One problem is that GS only allows something like 100-1000 connections per day from one IP before asking you to prove you are human and not a bot… So no real datasets can be gathered with GS :-( I found it easier to download data from Web of Science or Scopus, either by downloading 1000s of complete (i.e. including references, keywords and abstracts) BibTeX files or by using the WoS API (which, as far as I have tried, does not have a hit limit).

    Comment by Jan — November 9, 2011 @ 12:26 pm

    • Thanks Jan. One of the reasons I want to learn Python is so that I can use BeautifulSoup, because I’ve heard many good things about it in terms of screen/web scraping content. Plus it’d be nice to have another scripting language under my belt. I know that the RCurl package can handle cookies but that’s beyond my current skill level. The WoS API sounds interesting, maybe the rOpenSci project will add a package for that one day :)

      Comment by Tony Breyal — November 9, 2011 @ 8:57 pm
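
      (For reference, a minimal and untested sketch of how such a cookie could be sent from R with RCurl, mirroring Jan’s Python snippet above; the GSP cookie format is taken from that comment, and the random id generation is an assumption.)

      library(RCurl)

      # generate a random 16-character hex id, similar to the Python example above
      google.id <- paste(sample(c(0:9, letters[1:6]), 16, replace = TRUE), collapse = "")

      # pass the cookie along with the request via RCurl's 'cookie' curl option
      u <- "http://scholar.google.com/scholar?q=microfinance&hl=en"
      html <- getURL(u, cookie = paste("GSP=ID=", google.id, ":CF=4", sep = ""))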

  2. Thanks for posting this Tony!

    I think this could be a very useful program. I tried to replicate it using a Google Scholar search URL

    "http://scholar.google.com/scholar?q=microfinance&amp;hl=en&amp;btnG=Search&amp;as_sdt=1%2C19&amp;as_sdtp=on"
    

    however, it returns the following error:

    Error in data.frame(footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']",  : 
      arguments imply differing number of rows: 10, 11
    

    _____

    Anyone have any idea how I can work the kinks out of this?

    I look forward to putting this into practice!

    Ian

    Comment by Ian — November 9, 2011 @ 8:10 pm

    • Cheers Ian, it seems the problem was in the line:

      df$cited_by <- as.integer(gsub("Cited by ", "", df$cited, fixed = TRUE))
      

      When it should have been

      df$cited_by <- as.integer(gsub("Cited by ", "", df$cited_by, fixed = TRUE))
      

      Works fine now! :)

      Comment by Tony Breyal — November 9, 2011 @ 8:50 pm

      • Hey Tony,

        I was able to replicate your example (above). However, when I use a Google Scholar URL for a different search it produces the error I posted above.

        Are there any elements in the code that need to be altered for different scholar searches?

        I’m new to R (and loving it)!

        So, my apologies if I am missing the obvious here. :)

        Comment by Ian — November 10, 2011 @ 4:38 pm

        • What specifically was the URL of the Google Scholar search you were using? I suppose it’s possible that the elements are different but I’d like to investigate further. All the searches I’ve tried have worked as expected thus far so maybe there’s a different way of using google scholar which I hadn’t anticipated. Also, I still consider myself to be within jogging distance of being an R noob myself so no worries on that account, mate :)

          Comment by Tony Breyal — November 10, 2011 @ 5:43 pm

        • Excellent!

          Here is the link I have been using “http://scholar.google.com/scholar?q=microfinance&hl=en&btnG=Search&as_sdt=1%2C19&as_sdtp=on”

          The scholar search is done using the search term “microfinance”.

          I’m wondering if perhaps I don’t have an important package loaded? I have the XML and RCurl libraries loaded, but perhaps there is another I need for this code?

          I greatly appreciate your willingness to troubleshoot this! I spoke with a colleague and we believe this technique will allow us to shave a decent amount of time off our current project.

          Thanks

          ###— INLINE REPLY BY TONY BREYAL —###

          @Ian I can’t seem to reply to comments on here if the depth is more than 4, so I’m going to reply by editing your post here instead. The microfinance URL you supplied works fine for me:

          u = "http://scholar.google.com/scholar?q=microfinance&hl=en&btnG=Search&as_sdt=1%2C19&as_sdtp=on"
          df <- get_google_scholar_df(u)
          t(df[1:2, ])
          
          #                   1
          # title             "Regulation and supervision of microfinance institutions: Experience from Latin America, Asia and Africa"
          # type              "CITATION"                                                                                               
          # publication       "S Berenbach, C Churchill… - 1997 - MicroFinance Network"                                                
          # description       ""                                                                                                       
          # cited_by          "  54"                                                                                                   
          # cited_ref         "/scholar?cites=9948072009448909748&as_sdt=8000005&sciodt=0,19&hl=en&oe=ASCII"                           
          # title_url         NA                                                                                                       
          # view_as_html      NA                                                                                                       
          # view_all_versions NA                                                                                                       
          # from_domain       NA                                                                                                       
          # related_articles  "/scholar?q=related:tGs1E8mmDooJ:scholar.google.com/&hl=en&oe=ASCII&as_sdt=0,19"                         
          # library_search    "http://www.worldcat.org/oclc/502519960"                                                                 
          #                   2                                                                                                                                                                                                                                                                              
          # title             "The microfinance promise"                                                                                                                                                                                                                                                     
          # type              NA                                                                                                                                                                                                                                                                             
          # publication       "J Morduch - Journal of economic Literature, 1999 - JSTOR"                                                                                                                                                                                                                     
          # description       "1 Princeton University. JMorduch@Princeton. Edu. I have benefited from comments from Harold \nAlderman, Anne Case, Jonathan Conning, Peter Fidler, Karla Hoff, Margaret Madajewicz, John \nPencavel, Mark Schreiner, Jay Rosengard, JD von Pischke, and three anonymous  ... "
          # cited_by          "1209"                                                                                                                                                                                                                                                                         
          # cited_ref         "/scholar?cites=13322647613403306524&as_sdt=8000005&sciodt=0,19&hl=en&oe=ASCII"                                                                                                                                                                                                
          # title_url         "http://www.jstor.org/stable/2565486"                                                                                                                                                                                                                                          
          # view_as_html      NA                                                                                                                                                                                                                                                                             
          # view_all_versions "/scholar?cluster=13322647613403306524&hl=en&oe=ASCII&as_sdt=0,19"                                                                                                                                                                                                             
          # from_domain       "[PDF] from kobe-u.ac.jp"                                                                                                                                                                                                                                                      
          # related_articles  "/scholar?q=related:HBqcAGuN47gJ:scholar.google.com/&hl=en&oe=ASCII&as_sdt=0,19"                                                                                                                                                                                               
          # library_search    NA       
          
          

          This was tested on:

          # Ubuntu 11.10 x64
          > sessionInfo()
          R version 2.14.0 (2011-10-31)
          Platform: x86_64-pc-linux-gnu (64-bit)
          
          locale:
           [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=C                 LC_NAME=C                 
           [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
          
          attached base packages:
          [1] stats     graphics  grDevices utils     datasets  methods   base     
          
          other attached packages:
          [1] XML_3.4-3      RCurl_1.6-10   bitops_1.0-4.1
          
          loaded via a namespace (and not attached):
          [1] tools_2.14.0
          

          Are you using a different version of R and/or the packages above?

          Comment by Ian — November 10, 2011 @ 7:15 pm

  3. Hi Tony,

    thanks for the contribution. I’ve been looking for exercises to get familiar with xpath and scraping.

    I’m trying to go through each line of your code and, from the get-go, the first few lines crash my [R] session. Do you know why?

    u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en&quot;

    html <- getURL(u)
    doc <- htmlParse(html)

    Those 3 run fine.
    However, when I try to return the "doc" object, [R] crashes

    Comment by MT — November 9, 2011 @ 9:12 pm

    • realized it’ll work with
      htmlTreeParse()

      but crashes every time with
      htmlParse()

      weird

      Comment by MT — November 9, 2011 @ 9:29 pm

      • Ahh, ignore my other comment with the code, glad you sorted it. Not sure why htmlParse() would cause a crash. I’m guessing R will crash on your PC if you do:

        htmlTreeParse(html, useInternalNodes = FALSE)
        

        The downside of using the default htmlTreeParse() is that it’s slower on larger HTML, though it probably doesn’t matter here :)

        Comment by Tony Breyal — November 9, 2011 @ 9:35 pm

    • What’s your sessionInfo() and operating system? How far into the following can you get?

      # load packages
      library(RCurl)
      library(XML)
      
      # get html tree structure
      u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"
      html <- getURL(u)
      doc <- htmlParse(html)
      
      # get path to nodes of title nodes
      path.base <- "/html/body/div[@class='gs_r']"
      nodes.len <- length(xpathSApply(doc, "/html/body/div[@class='gs_r']"))
      path.to.title <- "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3"
      paths <- sapply(1:nodes.len, function(i) gsub( "/html/body/div[@class='gs_r']", paste("/html/body/div[@class='gs_r'][", i, "]", sep = ""), path.to.title, fixed = TRUE))
      
      # extract titles
      titles <- sapply(paths, function(xpath) xpathSApply(doc, xpath, xmlValue), USE.NAMES = FALSE)
      
      # take care of missing or NULL values
      titles[sapply(titles, length)<1] <- NA
      titles <- as.vector(unlist(titles))
      
      # output
      titles
      #  [1] "[PDF] Baldur's gate and history: Race and alignment in digital role playing games"                                                       
      #  [2] "[CITATION] Baldur's Gate II: The Anatomy of a Sequel"                                                                                    
      #  [3] "[BOOK] AI game programming wisdom"                                                                                                       
      #  [4] "[DOC] Better game design through cutscenes"                                                                                              
      #  [5] "[CITATION] Translation and Localisation: Characteristics of computer games localisation by the example of Baldur's Gate II and Fallout 2"
      #  [6] "Improved heuristics for optimal path-finding on game maps"                                                                               
      #  [7] "[CITATION] Replayability, part 2: game mechanics"                                                                                        
      #  [8] "[CITATION] Contexts, pleasures and preferences: girls playing computer games"                                                            
      #  [9] "[PDF] Comparison of different grid abstractions for pathfinding on maps"                                                                 
      # [10] "Identification of semantic units from within a search query"                                                                             
      # [11] "[HTML] MIT, 1998"                                                                                                                        
      # [12] "[CITATION] I used to treat all the boys and girls the same: Gender and literacy"                                                         
      # [13] "[BOOK] Myst: the book of Atrus"                                                                                                          
      # [14] "Code generation for AI scripting in computer role-playing games"                                                                         
      # [15] "[PDF] Game challenges and difficulty levels: lessons learned From RPGs"                                                                  
      # [16] "[CITATION] Mass Effect: Revelation"                                                                                                      
      # [17] "[CITATION] Mass Effect: Ascension"                                                                                                       
      # [18] "Baldur's Gate 2: Shadows of Amn-Official Strategy Guide"                                                                                 
      # [19] "Baldur's Gate 2: Throne of Bhaal: Official Strategy Guide"                                                                               
      # [20] "[DOC] The Making of a Monster: Creating Baldur's Gate"
      

      Comment by Tony Breyal — November 9, 2011 @ 9:30 pm

      • So after spending time at work playing around with the XML package,
        I think a nice touch to your program would be outputting the number of search results:

        Results 1 – 20 of about 404. (0.26 sec) # using the baldurs gate query

        I tried to navigate all of the nodes using xpathSApply() but failed, still a novice at this
        Realized that the information of interest is stored in a table node and we could just use

        tables = readHTMLTable(u) # with results in tables[[2]][2,2]
        num.results <- (tables[[2]])[2,2]

        Take it for what it is.

        Comment by MT — November 10, 2011 @ 1:21 am

        • Using XPath directly, you can get that value using this:

          xpathSApply(doc, "/html/body/form/table/tr/td[2]", xmlValue)
          # [1] "Results 1 - 20 of about 404.   (0.08 sec)Â "
          

          The key to getting the correct XPath query seems to be to build it up bit by bit. I find the general XPath to the piece of information on a web page I am interested in by using the Google Chrome add-on “XPath Helper” and then playing around with the specifics in R, e.g.

          xpathSApply(doc, "/html/body/form/table", xmlValue)
          xpathSApply(doc, "/html/body/form/table/tr", xmlValue)
          xpathSApply(doc, "/html/body/form/table/tr/td", xmlValue)
          xpathSApply(doc, "/html/body/form/table/tr/td/[2]", xmlValue)
          

          I’ll add this information to the dataframe, thanks for the feedback!

          Comment by Tony Breyal — November 10, 2011 @ 9:55 am

  4. Hi Tony

    I just gave your code a try (it’s *extremely* useful for a project I’m doing now BTW). I tried the example you gave in your code but got errors. I wonder if you could shed some light on it? (I notice from the comments that someone had a similar error earlier, but you were able to correct it):
    —-
    > u df <- googleScholarXScraper(u)
    Error in data.frame(footer = xpathLVApply(doc, xpath.base, "/font/span[@class='gs_fl']", :
    arguments imply differing number of rows: 2, 0
    —–

    Thanks!

    Comment by Kaushik Krishnan — July 13, 2012 @ 1:44 am

    • This is the problem with using XPath in this way – as soon as Google changes the HTML structure of a page it no longer works, because none of the XPath expressions are valid any longer. Anyway, the following will work with the current HTML structure of the page (though when Google change the structure again this won’t work either):

      googleScholarXScraper <- function(input) {
        ###--- PACKAGES ---###
        # load packages
        require(RCurl)
        require(XML)
        
        
        ###--- LOCAL FUNCTIONS ---###  
        # I added a wrapper around xpathSApply to deal with cases that return NULL and would thus be removed during the list-to-vector conversion process. This function ensures the NULLs are replaced by NA
        xpathLVApply <- function(doc, xpath.base, xpath.ext, FUN, FUN2 = NULL) {
          # get xpaths to each child node of interest
          nodes.len <- length(xpathSApply(doc, xpath.base))
          paths <- sapply(1:nodes.len, function(i) paste(xpath.base, "[", i, "]", xpath.ext, sep = ""))
          
          # extract child nodes
          xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
          
          # perform extra processing if required
          if(!is.null(FUN2)) xx <- FUN2(xx)
          
          # convert NULL to NA in list
          xx[sapply(xx, length)<1] <- NA
          
          # return node values as a vector
          return(as.vector(unlist(xx)))
        }
        
        # Determine how to grab html for each element of input
        evaluate_input <- function(input) {
          # determine which elements of input are files (assumed to contain valid HTML) and which are not (assumed to be valid URLs)
          is.file <- file.exists(input)
          
          # stop if input does not seem to be URLS and/or files
          if(sum(is.file) < 1 && length(input) > 1) stop("'input' to googleScholarXScraper() could not be processed.")
          
          # read html from each file
          html.files <- lapply(input[is.file], readLines, warn = FALSE)
          
          # read html from each URL
          html.webpages <- lapply(input[!is.file], getURL, followlocation = TRUE)
          
          # return all html data as list
          return(c(html.files, html.webpages))
        }
        
        # construct data frame from the html of a single Google Scholar search page
        get_google_scholar_df <- function(html) {
          # parse html into tree structure
          doc <- htmlParse(html)
          
          # construct data frame
          xpath.base <- "//div[@class='gs_r']"
          df <- data.frame(
            footer = xpathLVApply(doc, xpath.base, "//div[@class='gs_fl']", xmlValue),
            title = xpathLVApply(doc, xpath.base, "//h3", xmlValue),
            type = xpathLVApply(doc, xpath.base, "//h3/span", xmlValue),
            publication = xpathLVApply(doc, xpath.base, "//div[@class='gs_a']", xmlValue),
            description = xpathLVApply(doc, xpath.base, "//div[@class='gs_rs']", xmlValue),
            cited.by = xpathLVApply(doc, xpath.base, "//a[contains(.,'Cited by')]/text()", xmlValue),
            cited.ref = xpathLVApply(doc, xpath.base, "//a[contains(.,'Cited by')]", xmlAttrs),
            title.url = xpathLVApply(doc,  xpath.base, "//h3/a", xmlAttrs),
            view.as.html = xpathLVApply(doc, xpath.base, "//a[contains(.,'View as HTML')]", xmlAttrs),
            view.all.versions = xpathLVApply(doc, xpath.base, "//a[contains(.,' versions')]", xmlAttrs),
            from.domain = xpathLVApply(doc, xpath.base, "//div[@class='gs_ggs gs_fl']/a", xmlValue),
            related.articles = xpathLVApply(doc, xpath.base, "//a[contains(.,'Related articles')]", xmlAttrs),
            library.search = xpathLVApply(doc, xpath.base, "//a[contains(.,'Library Search')]", xmlAttrs),
            result.set = xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue),
            stringsAsFactors = FALSE)
          # free doc from memory
          free(doc)
          
          # Clean up extracted text
          df$title <- sub(".*\\] ", "", df$title)
          df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
          df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
          df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
          df$cited.by <- as.integer(gsub("Cited by ", "", df$cited.by, fixed = TRUE))
          
          # remove footer as it is now redundant after doing clean up  and return dataframe
          return(df[,-1])
        }
        
        
        ###--- MAIN ---###
        # STEP 1: Determine input type(s) and grab html accordingly
        doc.list <- evaluate_input(input)
        
        # STEP 2: get google scholar data frame.
        df <- do.call("rbind", lapply(doc.list, get_google_scholar_df))
        return(df)
      }
      
      
      # ###--- EXAMPLES ---###
      # # example 1: A single URL
       u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"
       df <- googleScholarXScraper(u)
       t(df[1, ])
      
      

      Comment by Tony Breyal — July 14, 2012 @ 11:22 am

      • Hi Tony

        Thanks for the help. I’ve run into the same problem before. I wish we could find a more permanent solution but that would make things too easy for us :)

        Comment by Kaushik Krishnan — July 15, 2012 @ 6:59 pm

        • If you know Python (I don’t), then I hear that BeautifulSoup is really good at screenscraping and saves a lot of time in coding: http://www.crummy.com/software/BeautifulSoup/

          Comment by Tony Breyal — July 16, 2012 @ 9:52 pm

          • Thanks again Tony,

            BTW I stumbled across your Stack Overflow post on html2txt() in R (http://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r). I wonder how one would do the reverse, i.e. take a string (say “hello: world’s & foo”) and encode it into HTML? I thought I’d build a wrapper around this code to take input strings and perform automated searches.

            Comment by Kaushik Krishnan — July 16, 2012 @ 10:43 pm

            • For an automated search, you probably want to encode the URL itself with your search term pasted in, e.g. URLencode(paste("https://www.google.co.uk/#q=", "hello: world’s & foo", sep = "")) – otherwise there are several packages on CRAN which might be of use, such as “hwriter” or “R2HTML”

              Comment by Tony Breyal — July 18, 2012 @ 7:36 pm

              • I tried URLencode(). The problem with it is that it leaves characters like : and ‘ as they are rather than the way Google wants them (ie as hex codes).

                Comment by Kaushik Krishnan — July 18, 2012 @ 7:38 pm

                • How about the following using the RCurl function curlEscape() — I think I got the myURLdecode function off of Stack Overflow but can’t remember (the following was all in a file on my hard drive I just came across the other day). If not, then I’m at a loss I’m afraid:

                  myURLdecode <- function(URL) {
                    x <- charToRaw(URL)
                    pc <- charToRaw("%")
                    out <- raw(0L)
                    i <- 1L
                    len <- length(x)
                    while(i <= len) {
                      if(x[i] != pc || i + 2 > len) {
                        out <- c(out, x[i])
                        i <- i + 1L
                      } else {
                        y <- as.integer(x[i + 1L:2L])
                        y[y > 96L] <- y[y > 96L] - 32L # a-f -> A-F
                        y[y > 57L] <- y[y > 57L] - 7L  # A-F
                        if (y[1] > 48L) {
                          y <- sum((y - 48L) * c(16L, 1L))
                          out <- c(out, as.raw(as.character(y)))
                          i <- i + 3L
                        } else {
                          out <- c(out, pc)
                          i <- i + 1L
                        }
                      }
                    }
                    rawToChar(out)
                  }
                  
                  URLencode(myURLdecode("“hello: world’s & foo”")) # didn't encode the ":"
                  [1] "%e2%80%9chello:%20world%e2%80%99s%20&%20foo%e2%80%9d"
                  
                  > RCurl::curlEscape(myURLdecode("“hello: world’s & foo”")) # using RCurl instead of the base R function
                  [1] "%E2%80%9Chello%3A%20world%E2%80%99s%20%26%20foo%E2%80%9D"
                  

                  Comment by Tony Breyal — July 18, 2012 @ 7:51 pm

  5. Just desire to say your article is as astonishing. The clearness in your
    submit is just nice and i could think you’re an expert in this subject. Well along with your permission let me to clutch your feed to keep up to date with forthcoming post. Thank you a million and please keep up the gratifying work.

    Comment by dumpster service rental — December 27, 2012 @ 1:27 am

  6. I constantly spent my half an hour to read this weblog’s articles or reviews daily along with a mug of coffee.

    Comment by massachusetts virtual office — May 1, 2013 @ 12:14 pm

  7. I would like to thank you for the efforts you’ve put in writing this blog. I’m hoping to see the same
    high-grade content by you later on as well.
    In truth, your creative writing abilities has inspired me
    to get my own, personal website now ;)

    Comment by Reputation Management — May 3, 2013 @ 3:51 am

  8. Hi there, I read your new stuff like every week. Your story-telling
    style is witty, keep it up!

    Comment by Larue — May 6, 2013 @ 4:49 pm

  9. An outstanding share! I’ve just forwarded this onto a coworker who was conducting a little homework on this. And he actually bought me lunch because I stumbled upon it for him… lol. So let me reword this…. Thanks for the meal!! But yeah, thanx for spending time to talk about this topic here on your web site.

    Comment by On the main page — May 17, 2013 @ 5:01 pm

  10. I would have loved to have the dataset saved as a .csv file. Though the code runs without an error, I can’t seem to find where the resultant dataset is saved. Could anyone help? What am I missing?

    Comment by Delomel James — May 23, 2013 @ 2:50 pm

    • This code is no longer maintained and shouldn’t work because Google have changed the HTML structure they use. Also, the code does not save the results to disk at all. I suggest you look up http://thebiobucket.blogspot.co.uk/ and good luck :)

      Comment by Tony Breyal — June 7, 2013 @ 9:36 pm

