Consistently Infrequent

November 7, 2011

Web Scraping Google URLs

Filed under: R — Tony Breyal @ 2:18 pm

UPDATE: This function has now been improved, see googleSearchXScraper()

Last Thursday Google slightly changed the HTML it uses for hyperlinks on its search pages, which caused one of my scripts to stop working. Thankfully, this is easily fixed in R thanks to the XML package and the power and simplicity of XPath expressions:

# load packages
library(RCurl)
library(XML)

get_google_page_urls <- function(u) {
  # read in page contents
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # extract the href of each result link using XPath. Originally I had used "//a[@href][@class='l']" until the Google code change.
  # xmlGetAttr asks for "href" by name rather than relying on it being the first attribute
  links <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")

  # free doc from memory
  free(doc)

  # keep only urls that start with "http://" or "https://" to avoid Google references to the search page
  links <- grep("^https?://", unlist(links), value = TRUE)
  return(links)
}

u <- "http://www.google.co.uk/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=r+project"
get_google_page_urls(u)

# [1] "http://www.r-project.org/"
# [2] "http://en.wikipedia.org/wiki/R_(programming_language)"
# [3] "http://www.rseek.org/"
# [4] "http://www.gutenberg.org/browse/authors/r"
# [5] "http://sciviews.org/_rgui/"
# [6] "http://www.itc.nl/~rossiter/teach/R/RIntro_ITC.pdf"
# [7] "http://stat.ethz.ch/CRAN/"
# [8] "http://hughesbennett.co.uk/RProject"
# [9] "http://www.warwick.ac.uk/statsdept/user-2011/"
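
To see what the XPath expression picks out without hitting Google at all, here is a minimal sketch that runs it against a small hand-written HTML fragment. The fragment is made up for illustration (it is not real Google markup), and the variable names are my own; the only assumption carried over from above is that result links sit inside h3 elements.

```r
library(XML)

# A made-up page fragment (an assumption, not real Google output):
# two links inside <h3> elements and one navigation link outside
html <- '<html><body>
  <h3 class="r"><a class="l" href="http://www.r-project.org/">The R Project</a></h3>
  <h3 class="r"><a href="/search?q=r+project&amp;tbm=isch">Images for r project</a></h3>
  <p><a href="http://www.google.co.uk/preferences">Settings</a></p>
</body></html>'

# parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(html, asText = TRUE)

# every <a> that has an href and sits somewhere under an <h3>
hrefs <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")
free(doc)

# drop relative links that point back at the search page
result_links <- grep("^https?://", hrefs, value = TRUE)
print(result_links)
```

The link in the paragraph is skipped because it is not under an h3, and the relative "/search?..." link is dropped by the final filter, so only the R project URL should survive.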

Lovely jubbly! :)

P.S. I know that there is an API of some sort for Google search but I don’t think anyone has made an R package for it. Yet. (I feel my skill set is insufficient to do it myself!)


17 Comments »

  1. Very nice!

    If you’re curious, there is an API for Google Search, but it’s severely rate limited. I explained how to access it via R here: http://stackoverflow.com/questions/5187685/r-search-google-for-a-string-and-return-number-of-hits/5188468#5188468

    Comment by Noah — November 7, 2011 @ 5:54 pm

  2. Excellent script. The only missing part to me was the ability to read beyond the first page of results. This can be achieved by adding “&start=x” at the end of the query, x being 10*(pageNumber-1), thus 0 for page 1, 10 for page 2 and so on. Here’s an ugly copy/paste of my ugly WIP.

    # convert the page number into the result offset Google expects
    NbPageResults <- 10 * (NbPageResults - 1)
    search.page <- paste("&start=", NbPageResults, sep = "")

    GoogleURL <- paste("http://www.google", domain,
                       "/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=",
                       search.term, search.page, sep = "")

    # read in page contents
    html <- getURL(GoogleURL)

    Comment by SnRf — December 2, 2011 @ 5:10 am
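
The pagination idea in comment 2 can be wrapped up as a small helper. This is only a sketch: `google_search_url` and its arguments are hypothetical names, the default domain is an assumption, and it takes the commenter's figure of 10 results per page at face value.

```r
# Hypothetical helper: build the search URL for a given results page.
# Assumes 10 results per page, as described in the comment above.
google_search_url <- function(search.term, page = 1, domain = ".co.uk") {
  stopifnot(page >= 1)

  # page 1 -> start = 0, page 2 -> start = 10, and so on
  start <- 10 * (page - 1)

  # percent-encode the search term so spaces and symbols are safe in a URL
  query <- URLencode(search.term, reserved = TRUE)

  paste0("http://www.google", domain,
         "/search?ie=UTF-8&q=", query,
         "&start=", start)
}

# URLs for the first three result pages
urls <- vapply(1:3, function(p) google_search_url("r project", page = p),
               character(1))
print(urls)
```

Each of these URLs could then be passed to get_google_page_urls() in turn, ideally with a pause between requests.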




