Consistently Infrequent

November 7, 2011

Web Scraping Google URLs

Filed under: R | Tony Breyal @ 2:18 pm

UPDATE: This function has now been improved, see googleSearchXScraper()

Google slightly changed the html code it uses for hyperlinks on search pages last Thursday, thus causing one of my scripts to stop working. Thankfully, this is easily solved in R thanks to the XML package and the power and simplicity of XPath expressions:

# load packages
library(RCurl)
library(XML)

get_google_page_urls <- function(u) {
  # read in page contents
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # extract the href of each result link using XPath. Originally I had used
  # "//a[@href][@class='l']" until the google code change.
  # xmlGetAttr() reads href by name, so attribute order in the tag does not matter.
  links <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")

  # free doc from memory
  free(doc)

  # keep only urls that start with "http://" or "https://" to avoid google references to the search page
  links <- grep("^https?://", links, value = TRUE)
  return(links)
}

u <- "http://www.google.co.uk/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=r+project"
get_google_page_urls(u)

# [1] "http://www.r-project.org/"
# [2] "http://en.wikipedia.org/wiki/R_(programming_language)"
# [3] "http://www.rseek.org/"
# [4] "http://www.gutenberg.org/browse/authors/r"
# [5] "http://sciviews.org/_rgui/"
# [6] "http://www.itc.nl/~rossiter/teach/R/RIntro_ITC.pdf"
# [7] "http://stat.ethz.ch/CRAN/"
# [8] "http://hughesbennett.co.uk/RProject"
# [9] "http://www.warwick.ac.uk/statsdept/user-2011/"

Lovely jubbly! 🙂
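
For anyone who wants to try the XPath step without hitting Google, here is a minimal sketch that runs the same expression on a small hand-written HTML fragment. The fragment is made up for illustration and is not real Google markup. It also shows why reading the href attribute by name is safer than taking the first attribute: the first link carries a class attribute before its href.

```r
library(XML)

# Made-up fragment standing in for a results page (an assumption, not real Google markup)
html <- '<html><body>
  <h3><a class="r" href="http://www.r-project.org/">The R Project</a></h3>
  <h3><a href="/search?q=related">a link back to the search page</a></h3>
  <p><a href="http://example.com/">a link outside any h3 heading</a></p>
</body></html>'

# parse the string into a tree structure
doc <- htmlParse(html, asText = TRUE)

# anchors nested inside h3 headings; read the href attribute by name
links <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")

# free doc from memory
free(doc)

# keep only absolute urls
links <- grep("^https?://", links, value = TRUE)
print(links)
```

Only the first link should survive: the second is relative, and the third is not inside an h3.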

P.S. I know that there is an API of some sort for google search but I don’t think anyone has made an R package for it. Yet. (I feel my skill set is insufficient to do it myself!)


36 Comments »

  1. Very nice!

    If you’re curious, there is an API for Google Search, but it’s severely rate limited. I explained how to access it via R here: http://stackoverflow.com/questions/5187685/r-search-google-for-a-string-and-return-number-of-hits/5188468#5188468

    Comment by Noah — November 7, 2011 @ 5:54 pm

  2. Excellent script. The only part missing for me was the ability to read beyond the first page of results. This can be achieved by adding “&start=x” to the end of the query, x being 10*(pageNumber-1): 0 for page 1, 10 for page 2 and so on. Here’s an ugly copy/paste of my ugly WIP.


    NbPageResults = 10*(NbPageResults - 1)
    search.page <- paste("&start=", NbPageResults, sep="")

    GoogleURL <- paste("http://www.google", domain,
    "/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=",
    search.term, search.page, sep="")

    # read in page contents
    html <- getURL(GoogleURL)

    Comment by SnRf — December 2, 2011 @ 5:10 am

  12. When I initially commented I clicked the “Notify me when new comments are added”
    checkbox and now each time a comment is added I get three e-mails
    with the same comment. Is there any way you can remove people from that service?
    Appreciate it!

    Comment by Black Mould Removal Toronto — May 30, 2013 @ 2:22 am

    • I have no idea how to do that mate, that’s a wordpress issue. 😦

      Comment by Tony Breyal — June 7, 2013 @ 9:27 pm

  14. That’s quite nice for smaller amounts of scraping. If you need to scrape large amounts of data, I’d like to share my projects; they are open source and completely free!
    Google search scraper: http://scraping.compunect.com/?scrape-google-search
    Google suggest scraper: http://scrape-google-suggest.compunect.com/?scrape-google-suggest
    Google Finance scraper: http://scrape-google-finance.compunect.com/?scrape-google-finance

    I think those large projects fit quite well to this article.

    Comment by Scraping Google — April 3, 2014 @ 4:10 pm

  18. Could someone elaborate on how to get the number of pages of search results? In addition, when I add “&start=…” to the end of the url, it seems that Google shrinks the number of results. Is that because Google detects me manipulating it? Thank you so much!!!

    Comment by alicecongcong — October 18, 2016 @ 7:33 am
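
On the paging idea from comments 2 and 18, here is a minimal sketch of building one search URL per results page. It assumes, as comment 2 describes, ten results per page and a “&start=” offset of 10*(page-1). The helper name google_search_url() and its arguments are hypothetical and are not part of the original script, and the sketch only builds the URLs without fetching anything.

```r
# Hypothetical helper: build the search URL for a given results page.
# Assumes ten results per page, so page p starts at offset 10 * (p - 1).
google_search_url <- function(search.term, page = 1, domain = ".co.uk") {
  start <- 10 * (page - 1)
  paste0("http://www.google", domain,
         "/search?ie=UTF-8&q=", URLencode(search.term, reserved = TRUE),
         "&start=", start)
}

# urls for the first three pages of results
urls <- vapply(1:3,
               function(p) google_search_url("r project", page = p),
               character(1))
print(urls)
```

Each URL could then be passed to get_google_page_urls() in turn, ideally with a pause between requests.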

