Consistently Infrequent

November 11, 2011

Web Scraping Yahoo Search Page via XPath

Filed under: R — Tony Breyal @ 12:25 am

Seeing as I’m on a bit of an XPath kick as of late, I figured I’d continue scraping search results, but this time from Yahoo.com.

Rolling my own version of xpathSApply to handle NULL elements seems to have done the trick, and so far the scraping has been relatively easy. I’ve created an R function which will scrape information from a Yahoo Search page (with the user supplying the Yahoo Search URL) and will extract as much information as it can whilst maintaining the data frame structure (full source code at end of post). For example:

# load packages
library(RCurl)
library(XML)

# user provides url and the function extracts relevant information into a data frame as follows
u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592664708059042&mkt=en-GB&setlang=en-GB&w=48d4b732,65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"

I’ve only tested this on web results. The idea of these posts is to get basic functionality working and then, if I feel it might be fun, to expand that functionality in the future.

It’s nice having an online blog where I can keep these functions I’ve come up with during coding exercises. Maybe if I make enough of these web search engine scrapers I can go ahead and make my first R package. The downside of web scraping, though, is that if the structure/entities of the HTML code change then the scrapers may stop working, which could make the package difficult to maintain. I can’t really think of how the package itself might be useful to anyone apart from teaching me personally how to build a package.
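
One small mitigation (just a sketch of my own, not something the function below currently does) would be to fail loudly as soon as the expected page structure disappears, rather than silently returning an empty data frame:

# sketch: stop early if the base XPath no longer matches anything,
# i.e. the search page layout has changed since the scraper was written
check_structure <- function(doc, path.base) {
  if (length(xpathSApply(doc, path.base)) < 1)
    stop("XPath '", path.base, "' matched no nodes - has the page layout changed?")
}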

Maybe that’ll be worth it in and of itself. Ha, version 2.0 could be just a collection of the self-contained functions, version 3.0 could have the functions converted to S3 (which I really want to learn), version 4.0 could have them converted to S4 (again, something I’d like to learn) and version 5.0 could have reference classes (I still don’t know what those things are). Just thinking out loud, but it could be a good way to learn more R. I doubt I’ll do it, though; we’ll see. I have to find time to start learning Python, so I might have to put R on the back burner soon!

Full source code here (function is self-contained, just copy and paste):

# load packages
library(RCurl)
library(XML)

get_yahoo_search_df <- function(u) {
  # my own version of xpathSApply to deal with cases that return NULL:
  # rather than querying all result nodes at once (which silently drops any
  # node whose sub-element is missing), it queries each result node by index
  # so that missing elements become NA and every column stays aligned
  xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
    # number of result nodes on the page
    nodes.len <- length(xpathSApply(doc, path.base))
    # one indexed XPath per result node, e.g. ".../ol/li[3]/...";
    # seq_len() avoids the 1:0 trap when a page has no results at all
    paths <- sapply(seq_len(nodes.len), function(i) gsub(path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    # evaluate FUN against each indexed path in turn
    xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
    # nodes which returned nothing become NA *before* any post-processing,
    # otherwise empty results could be silently dropped and misalign the rows
    xx[sapply(xx, length) < 1] <- NA
    # optional post-processing step, e.g. extracting one attribute from xmlAttrs
    if (!is.null(FUN2)) xx <- FUN2(xx)
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # download html and parse into tree structure
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html)

  # path to nodes of interest
  path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"

  # construct data frame (note: FUN2 picks the href attribute out by name
  # rather than by position, so a change in attribute order can't break it)
  df <- data.frame(
    title = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a", xmlValue),
    url = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a[@href]", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x["href"])),
    description = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div", xmlValue),
    cached = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/a[@href][text()='Cached']", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x["href"])),
    recorded = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/span[@id='resultTime']", xmlValue),
    stringsAsFactors = FALSE)

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1:5, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592664708059042&mkt=en-GB&setlang=en-GB&w=48d4b732,65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"
#             2
# title       "WIL WHEATON DOT NET"
# url         "http://www.wilwheaton.net/coollinks.php"
# description "Wil Wheaton - Don't be a dick! - Writer and Actor - Your Mom - I'm Wil Wheaton. I'm an author (that's why I'm wilwheatonbooks), an actor, and a lifelong geek."
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4592836504520824&mkt=en-GB&setlang=en-GB&w=eaeb9364,4a4e7c54&icp=1&.intl=uk&sig=VC7eV8GUMXVuu9apHagYNg--"
# recorded    "2 days ago"
#             3
# title       "this is one hell of a geeky weekend - WWdN: In Exile"
# url         "http://wilwheaton.typepad.com/wwdnbackup/2008/05/this-is-one-hel.html"
# description "WIL WHEATON DOT NET2 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4559391600545150&mkt=en-GB&setlang=en-GB&w=90d3ee39,34d4424b&icp=1&.intl=uk&sig=ZN.UpexVV4pm3yn7XiEURw--"
# recorded    "2 days ago"
#             4
# title       "Wil Wheaton - Google+ - I realized today that when someone ..."
# url         "https://plus.google.com/108176814619778619437/posts/ENTkBMZKeGY"
# description ">Cool Sites. Okay, I'm talking to the guys here: do you ever get \"the sigh\"? You know what I'm talking about...you're really into some cool website, and your ..."
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=4718764947541872&mkt=en-GB&setlang=en-GB&w=9bca6e9a,dba19826&icp=1&.intl=uk&sig=jGaKkuIFOINEBBfBwarrgg--"
# recorded    "6 days ago"
#             5
# title       "The Hot List: Dwight Slade, Back Fence PDX, Wil Wheaton vs ..."
# url         "http://www.oregonlive.com/movies/index.ssf/2011/11/the_hot_list_dwight_slade_back.html"
# description "this is one hell of a geeky weekend - WWdN: In Exile2 days ago"
# cached      "http://87.248.112.8/search/srpcache?ei=UTF-8&p=Wil+Wheaton&rd=r1&fr=yfp-t-702&u=http://cc.bingj.com/cache.aspx?q=Wil+Wheaton&d=414191857143&mkt=en-GB&setlang=en-GB&w=3081364,e585aa21&icp=1&.intl=uk&sig=KufdBZ_Thr1Mm8.SnjpMUQ--"
# recorded    "4 hours ago"
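
To see why the NULL handling matters, here is a minimal self-contained illustration (toy HTML of my own, not Yahoo’s markup): the second <li> has no <span>, so a plain xpathSApply silently drops it and the results no longer line up with the list items, whereas querying each node by index lets the missing element become an NA.

# load package
library(XML)

# toy document: only the first list item contains a <span>
doc <- htmlParse("<html><body><ol><li><span>first</span></li><li></li></ol></body></html>", asText = TRUE)

# plain xpathSApply: one value comes back, with no way to tell which <li> it belongs to
xpathSApply(doc, "//ol/li/span", xmlValue)
# [1] "first"

# querying each <li> by index instead: the missing element becomes NA
sapply(1:2, function(i) {
  x <- xpathSApply(doc, paste("//ol/li[", i, "]/span", sep = ""), xmlValue)
  if (length(x) < 1) NA else x
})
# [1] "first" NA

# free doc from memory
free(doc)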

UPDATE: I’ve created a github account and the above code can be found at: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/get_yahoo_search_df.R

6 Comments

  1. I found your blog via r-bloggers.com and like the work you presented in this and the other posts about XPath and R. It is very interesting how much data is available publicly. At the same time it is a little bit scary when I imagine how much info I could already gather just by combining the different sources you have written web scraping scripts for…

    But one thing I realized while reading was that your code is not very DRY. Since all your XPath paths are more or less similar, you could save a lot of characters. I posted my suggested changes at http://pastebin.com/2cnNcvBy. Maybe it helps you on your way to package 1.0 :-)

    To get better insights into the topic of packages etc. maybe the following video might be interesting for you: http://www.youtube.com/watch?v=TER-rQoVs0k

    Comment by Philipp Riemer (@philipp_riemer) — November 12, 2011 @ 12:03 pm

    • Philipp,

      Thank you kindly for your positive response. http://pastebin.com is brilliant; I’d not heard of it before, but it looks very useful and I can already see how I might start using it. I appreciate the code contribution you’ve made and will have a deeper look at it next week when I get a chance, but my initial thought was “Damn, why didn’t I think of that!”, because it also makes the code much more readable. When I get a chance to test it, I’ll update my code on github. And because my other XPath scraper functions are kind of similar, I should be able to update those functions too (plus incorporate it into a hack I’m writing of Kay Cichini’s GScholarScraper() word-cloud function!)
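
      For my own reference, the rough shape of the idea (my paraphrase; the actual pastebin version may differ) is to state each field as a short suffix relative to path.base instead of repeating the full absolute path five times:

      # sketch of the DRY refactor: one base path plus short relative suffixes
      # (url and cached would still need their xmlAttrs/FUN2 treatment)
      suffixes <- c(
        title       = "/div/div/h3/a",
        description = "/div/div",
        recorded    = "/div/div/span[@id='resultTime']"
      )
      values <- lapply(suffixes, function(s) xpathSNullApply(doc, path.base, paste(path.base, s, sep = ""), xmlValue))
      df <- data.frame(values, stringsAsFactors = FALSE)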

      I really do want to learn how to make R packages and have bookmarked the youtube video you kindly linked to so that I can watch it later (hopefully easier than just reading the documentation on its own)! Not sure if there would be much interest in a WebScraper package but I think it would be a good coding exercise in and of itself :)

      Comment by Tony Breyal — November 12, 2011 @ 12:27 pm

  2. I did something very similar to this, also in R. Just an FYI: if you start doing it in batch, Yahoo will cut you off after ~1000 searches. Better to use the Bing search API, which is unlimited.

    Comment by Noah — November 13, 2011 @ 12:53 am

    • This blog’s R category is really just a series of little coding and analysis exercises. In fact I’ve already learned quite a lot and am improving my code on my github repo. Whilst I am aware of the limit, I doubt I’ll hit it as I don’t need a data sample that large, though I appreciate the point you’re making. At some point on this blog I hope to start writing API R script functions, but I don’t feel equal to that kind of task yet. I didn’t realise that the Bing API is unlimited, which is pretty damn amazing (I’m currently writing an XPath scraper for Bing search but eventually hope to look at the API).

      Is the code you used to perform similar web-scraping tasks publicly available somewhere? I might be able to pick up some tips! :)

      Comment by Tony Breyal — November 13, 2011 @ 1:09 am

  3. That’s a handy utility! Here’s one vote for turning it into a package. Ten years ago I wrote a simple Perl-based wrapper to Yahoo and other search engines that proved to be very useful for a variety of tasks (e.g., computing word co-occurrence frequency analysis). True, it’s a hassle to update when the search engine changes the HTML interface, but that doesn’t happen too often.

    Comment by Tom O'Hara — December 1, 2011 @ 10:57 pm

  4. [...] Web Scraping Yahoo Search Engine via XPath  Posted by GRS at 13:10  Tagged with: Datos, Encuestas, política, R, web scraping [...]

    Pingback by Ejemplo de web scraping: indicadores de confianza política » G. R. Serrano — September 4, 2012 @ 12:10 pm

