Consistently Infrequent

November 7, 2011

Web Scraping Google URLs

Filed under: R — BD @ 2:18 pm

UPDATE: This function has now been improved, see googleSearchXScraper()

Google slightly changed the HTML it uses for hyperlinks on its search pages last Thursday, which broke one of my scripts. Thankfully, the fix is easy in R thanks to the XML package and the power and simplicity of XPath expressions:

# load packages
library(RCurl)
library(XML)

get_google_page_urls <- function(u) {
  # read in page contents
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # extract URL nodes using XPath. Before the Google code change this
  # was "//a[@href][@class='l']"
  links <- unlist(xpathApply(doc, "//h3//a[@href]", xmlGetAttr, "href"))

  # free doc from memory
  free(doc)

  # ensure urls start with "http" to avoid google references to the search page
  links <- grep("http://", links, fixed = TRUE, value=TRUE)
  return(links)
}

u <- "http://www.google.co.uk/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=r+project"
get_google_page_urls(u)

# [1] "http://www.r-project.org/"
# [2] "http://en.wikipedia.org/wiki/R_(programming_language)"
# [3] "http://www.rseek.org/"
# [4] "http://www.gutenberg.org/browse/authors/r"
# [5] "http://sciviews.org/_rgui/"
# [6] "http://www.itc.nl/~rossiter/teach/R/RIntro_ITC.pdf"
# [7] "http://stat.ethz.ch/CRAN/"
# [8] "http://hughesbennett.co.uk/RProject"
# [9] "http://www.warwick.ac.uk/statsdept/user-2011/"

Lovely jubbly! 🙂
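As an aside, Google's search URL accepts a `start` offset for results beyond the first page (0 for page 1, 10 for page 2, and so on). Here's a rough sketch of a multi-page wrapper around the function above; the parameter layout is an assumption, and `google_search_url` / `google_search_urls` are just illustrative names of my own:

```r
# load packages
library(RCurl)
library(XML)

# build a search URL for a given 1-based page number; the "&start="
# offset convention is an assumption (0 = page 1, 10 = page 2, ...)
google_search_url <- function(query, page = 1, domain = "co.uk") {
  paste("http://www.google.", domain,
        "/search?q=", URLencode(query, reserved = TRUE),
        "&start=", 10 * (page - 1), sep = "")
}

# scrape several result pages by reusing get_google_page_urls() from above
google_search_urls <- function(query, pages = 1:3) {
  urls <- lapply(pages, function(p) {
    Sys.sleep(1)  # be polite between requests
    get_google_page_urls(google_search_url(query, p))
  })
  unique(unlist(urls))
}
```

Something like `google_search_urls("r project", pages = 1:2)` should then collect the first twenty or so hits, though Google may well throttle or reshape results for offset queries.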

P.S. I know that there is an API of some sort for Google search but I don’t think anyone has made an R package for it yet. (I feel my skill set is insufficient to do it myself!)
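For the record, that API (Google's old AJAX Web Search API) could be hit directly from R, though it was deprecated in 2010 and is heavily rate limited. A rough sketch, assuming the `responseData$results` / `unescapedUrl` JSON layout that API used, with RJSONIO doing the parsing:

```r
# load packages
library(RCurl)
library(RJSONIO)

# query the (deprecated, rate-limited) AJAX web search API and return
# the result URLs; endpoint and field names follow the old API docs
google_api_urls <- function(query) {
  u <- paste("http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=",
             URLencode(query, reserved = TRUE), sep = "")
  res <- fromJSON(getURL(u))
  vapply(res$responseData$results,
         function(r) r[["unescapedUrl"]],
         character(1))
}
```

e.g. `google_api_urls("r project")` should return the top handful of hits, subject to the API's quota.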

Comments

  1. Very nice!

    If you’re curious, there is an API for Google Search, but it’s severely rate limited. I explained how to access it via R here: http://stackoverflow.com/questions/5187685/r-search-google-for-a-string-and-return-number-of-hits/5188468#5188468

    Comment by Noah — November 7, 2011 @ 5:54 pm

  2. Excellent script. The only part missing for me was the ability to read beyond the first page of results. This can be achieved by adding “&start=x” to the end of the query, x being 10*(pageNumber-1): 0 for page 1, 10 for page 2, and so on. Here’s an ugly copy/paste of my ugly WIP.


    NbPageResults = 10*(NbPageResults - 1)
    search.page <- paste("&start=", NbPageResults, sep="")

    GoogleURL <- paste("http://www.google", domain,
                       "/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=",
                       search.term, search.page, sep="")

    # read in page contents
    html <- getURL(GoogleURL)

    Comment by SnRf — December 2, 2011 @ 5:10 am

  12. When I initially commented I clicked the “Notify me when new comments are added”
    checkbox and now each time a comment is added I get three e-mails
    with the same comment. Is there any way you can remove people from that service?
    Appreciate it!

    Comment by Black Mould Removal Toronto — May 30, 2013 @ 2:22 am

    • I have no idea how to do that mate, that’s a wordpress issue. 😦

      Comment by Tony Breyal — June 7, 2013 @ 9:27 pm

  14. That’s quite nice for smaller amounts of scraping. If you need to scrape large amounts of data I’d like to share my projects; they are open source and completely free!
    Google search scraper: http://scraping.compunect.com/?scrape-google-search
    Google suggest scraper: http://scrape-google-suggest.compunect.com/?scrape-google-suggest
    Google Finance scraper: http://scrape-google-finance.compunect.com/?scrape-google-finance

    I think those projects fit quite well with this article.

    Comment by Scraping Google — April 3, 2014 @ 4:10 pm

  17. Could someone elaborate on how to get the number of pages of search results? In addition, when I add “&start=…” to the end of the URL, Google seems to shrink the number of results. Is that because Google detects me manipulating it? Thank you so much!

    Comment by alicecongcong — October 18, 2016 @ 7:33 am
