Consistently Infrequent

November 11, 2011

Web Scraping Yahoo Search Page via XPath

Filed under: R — Tony Breyal @ 12:25 am

Seeing as I’m on a bit of an XPath kick as of late, I figured I’d continue scraping search results, but this time from Yahoo Search.

Rolling my own version of xpathSApply to handle NULL elements seems to have done the trick, and so far the scraping has been relatively easy. I’ve created an R function which scrapes information from a Yahoo Search page (with the user supplying the Yahoo Search URL) and extracts as much information as it can whilst maintaining the data frame structure (full source code at end of post). For example:

# load packages
library(RCurl)
library(XML)

# user provides url and the function extracts relevant information into a data frame as follows
u <- ";_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1, ])

#             1
# title       "Wil Wheaton - Google+"
# url         ""
# description "Wil Wheaton - Google+6 days ago"
# cached      ",65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"

I’ve only tested this on web results. The idea of these posts is to get basic functionality working first and then, if it seems fun, to expand the functionality in the future.
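To illustrate the NULL problem the custom apply function works around: a plain xpathSApply silently drops result nodes that lack a sub-element, so the returned vector is shorter than the number of results and the data frame columns fall out of alignment. A minimal sketch with toy HTML (hypothetical node names, not Yahoo’s actual markup):

```r
# load packages
library(XML)

# toy results page: the second item has no <span> date
html <- "<ol>
           <li><h3>First</h3><span>2 days ago</span></li>
           <li><h3>Second</h3></li>
         </ol>"
doc <- htmlParse(html, asText = TRUE)

titles <- xpathSApply(doc, "//li/h3", xmlValue)   # length 2
dates  <- xpathSApply(doc, "//li/span", xmlValue) # length 1 - misaligned!

# querying each <li> individually lets us pad the gaps with NA
dates2 <- sapply(1:2, function(i) {
  x <- xpathSApply(doc, sprintf("//li[%d]/span", i), xmlValue)
  if (length(x) < 1) NA else x
})
# dates2 is now c("2 days ago", NA), aligned with titles
```

That per-node padding is exactly what xpathSNullApply below does, just generalised to any base path.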

It’s nice having an online blog where I can keep the functions I’ve come up with during these coding exercises. Maybe if I make enough of these web search engine scrapers I can go ahead and make my first R package. The downside of web scraping, though, is that if the structure/entities of the HTML code change then the scrapers may stop working, which could make the package difficult to maintain. I can’t really think of how the package itself might be useful to anyone apart from teaching me personally how to build a package.
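Breakage of this kind usually shows up as zero matched nodes, so a cheap maintenance guard is to fail loudly rather than silently return an empty data frame. A hedged sketch of the idea (check_scraper_health is a hypothetical helper; path.base is the same base XPath used in the function below):

```r
# load packages
library(XML)

# warn if the base results path matches nothing, which is the usual
# symptom of the search engine having changed its HTML structure
check_scraper_health <- function(doc, path.base) {
  n <- length(xpathSApply(doc, path.base))
  if (n == 0) warning("No result nodes found - has the page HTML changed?")
  n
}

# quick demonstration on toy HTML
doc <- htmlParse("<ol><li>result</li></ol>", asText = TRUE)
n.ok   <- check_scraper_health(doc, "//ol/li")  # 1, no warning
n.bad  <- suppressWarnings(check_scraper_health(doc, "//div[@id='web']"))  # 0, warns
```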

Maybe that’ll be worth it in and of itself. Ha, version 2.0 could be just a collection of the self-contained functions, version 3.0 could have the functions converted to S3 (which I really want to learn), version 4.0 could have them converted to S4 (again, something I’d like to learn), and version 5.0 could use reference classes (I still don’t know what those things are). Just thinking out loud; it could be a good way to learn more R. I doubt I’ll do it though, but we’ll see. I have to find time to start learning Python, so I might have to put R on the back burner soon!
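For what it’s worth, the S3 step is smaller than it sounds: S3 is just a class attribute plus method dispatch on generic functions. A toy sketch (yahoo_results is a hypothetical class name, not part of the code below):

```r
# a constructor that tags a result data frame with an S3 class
yahoo_results <- function(df) {
  structure(list(results = df), class = "yahoo_results")
}

# a print method; print() dispatches here via the class attribute
print.yahoo_results <- function(x, ...) {
  cat("Yahoo search results:", nrow(x$results), "rows\n")
  invisible(x)
}

res <- yahoo_results(data.frame(title = c("a", "b"), stringsAsFactors = FALSE))
print(res)  # prints "Yahoo search results: 2 rows"
```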

Full source code here (function is self-contained, just copy and paste):

# load packages
library(RCurl)
library(XML)

get_yahoo_search_df <- function(u) {
  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
    if(!is.null(FUN2)) xx <- FUN2(xx)
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # download html and parse into tree structure
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html)

  # path to nodes of interest
  path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"

  # construct data frame
  df <- data.frame(
    title = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a", xmlValue),
    url = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a[@href]", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[2])),
    description = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div", xmlValue),
    cached = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/a[@href][text()='Cached']", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[1])),
    recorded = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/span[@id='resultTime']", xmlValue),
    stringsAsFactors = FALSE)

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

u <- ";_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1:5, ])

#             1
# title       "Wil Wheaton - Google+"
# url         ""
# description "Wil Wheaton - Google+6 days ago"
# cached      ",65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"
#             2
# title       "WIL WHEATON DOT NET"
# url         ""
# description "Wil Wheaton - Don't be a dick! - Writer and Actor - Your Mom - I'm Wil Wheaton. I'm an author (that's why I'm wilwheatonbooks), an actor, and a lifelong geek."
# cached      ",4a4e7c54&icp=1&.intl=uk&sig=VC7eV8GUMXVuu9apHagYNg--"
# recorded    "2 days ago"
#             3
# title       "this is one hell of a geeky weekend - WWdN: In Exile"
# url         ""
# description "WIL WHEATON DOT NET2 days ago"
# cached      ",34d4424b&icp=1&.intl=uk&sig=ZN.UpexVV4pm3yn7XiEURw--"
# recorded    "2 days ago"
#             4
# title       "Wil Wheaton - Google+ - I realized today that when someone ..."
# url         ""
# description ">Cool Sites. Okay, I'm talking to the guys here: do you ever get \"the sigh\"? You know what I'm talking're really into some cool website, and your ..."
# cached      ",dba19826&icp=1&.intl=uk&sig=jGaKkuIFOINEBBfBwarrgg--"
# recorded    "6 days ago"
#             5
# title       "The Hot List: Dwight Slade, Back Fence PDX, Wil Wheaton vs ..."
# url         ""
# description "this is one hell of a geeky weekend - WWdN: In Exile2 days ago"
# cached      ",e585aa21&icp=1&.intl=uk&sig=KufdBZ_Thr1Mm8.SnjpMUQ--"
# recorded    "4 hours ago"

UPDATE: I’ve created a github account and the above code can be found at:


  1. I found your blog and like the work you presented in this and the other posts about XPath and R. Very interesting how much data is available publicly. At the same time it is a little bit scary when I imagine how much info I could already get just by combining the different sources you have written web scraping scripts for…

    But one thing I realized while reading was that your code is not very DRY. Since all your XPath paths are more or less similar, you could save a lot of characters. I have posted my suggested changes; maybe they help you on your way to package 1.0 🙂

    To get better insights into the topic of packages etc. maybe the following video might be interesting for you:

    Comment by Philipp Riemer (@philipp_riemer) — November 12, 2011 @ 12:03 pm

    • Philipp,

      Thank you kindly for your positive response. That’s brilliant; I’d not heard of it before, but it looks very useful and I can already see how I might start using it. I appreciate the code contribution you’ve made and will have a deeper look next week when I get a chance, but my initial thought is “Damn, why didn’t I think of that!” because it also makes the code much more readable. When I get a chance to test it, I’ll update my code on GitHub. And because my other XPath scraper functions are kind of similar, I should be able to update those too (plus incorporate it into a hack I’m writing of Kay Cichini’s GScholarScraper() word-cloud function!)

      I really do want to learn how to make R packages and have bookmarked the youtube video you kindly linked to so that I can watch it later (hopefully easier than just reading the documentation on its own)! Not sure if there would be much interest in a WebScraper package but I think it would be a good coding exercise in and of itself 🙂

      Comment by Tony Breyal — November 12, 2011 @ 12:27 pm

  2. Did something very similar to this, also in R. Just an FYI: if you start doing it in batch, Yahoo will cut you off after ~1000 searches. Better to use the Bing search API, which is unlimited.

    Comment by Noah — November 13, 2011 @ 12:53 am

    • This blog’s R category is really just a series of little coding and analysis exercises. In fact I’ve already learned quite a lot and am improving my code on my GitHub repo. Whilst I am aware of the limit, I doubt I’ll hit it as I don’t need a data sample that large, though I appreciate the point you’re making. At some point on this blog I hope to start writing API R script functions, but I don’t feel equal to that kind of task yet. I didn’t realise that the Bing API is unlimited, which is pretty damn amazing (I’m currently writing an XPath scraper for Bing search but eventually hope to look at the API).

      Is the code you used to perform similar web-scraping tasks publicly available somewhere? I might be able to pick up some tips! 🙂

      Comment by Tony Breyal — November 13, 2011 @ 1:09 am

  3. That’s a handy utility! Here’s one vote for turning it into a package. Ten years ago I wrote a simple Perl-based wrapper to yahoo and other search engines that proved to very useful for a variety of tasks (e.g., computing word co-occurrence frequency analysis). True it’s a hassle to update when the search engine changes the HTML interface, but that doesn’t happen too often.

    Comment by Tom O'Hara — December 1, 2011 @ 10:57 pm

  4. […] Web Scraping Yahoo Search Engine via XPath  Posted by GRS at 13:10  Tagged with: Datos, Encuestas, política, R, web scraping […]

    Pingback by Ejemplo de web scraping: indicadores de confianza política » G. R. Serrano — September 4, 2012 @ 12:10 pm

