Consistently Infrequent

November 7, 2011

Web Scraping Google URLs

Filed under: R — Tony Breyal @ 2:18 pm

UPDATE: This function has now been improved, see googleSearchXScraper()

Google slightly changed the HTML it uses for hyperlinks on its search pages last Thursday, which broke one of my scripts. Thankfully, this is easily fixed in R with the XML package and the power and simplicity of XPath expressions:

# load packages
library(RCurl)
library(XML)

get_google_page_urls <- function(u) {
  # read in page contents
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # extract url nodes using XPath. Originally I had used "//a[@href][@class='l']" until the google code change.
  # read href by name: xmlAttrs(x)[[1]] only works when href happens to be the first attribute
  links <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")

  # free doc from memory
  free(doc)

  # keep only urls that start with "http://" or "https://" to avoid google references to the search page
  links <- grep("^https?://", links, value = TRUE)
  return(links)
}

u <- "http://www.google.co.uk/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=r+project"
get_google_page_urls(u)

# [1] "http://www.r-project.org/"
# [2] "http://en.wikipedia.org/wiki/R_(programming_language)"
# [3] "http://www.rseek.org/"
# [4] "http://www.gutenberg.org/browse/authors/r"
# [5] "http://sciviews.org/_rgui/"
# [6] "http://www.itc.nl/~rossiter/teach/R/RIntro_ITC.pdf"
# [7] "http://stat.ethz.ch/CRAN/"
# [8] "http://hughesbennett.co.uk/RProject"
# [9] "http://www.warwick.ac.uk/statsdept/user-2011/"

Lovely jubbly! 🙂
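The XPath step can also be tried offline. The sketch below assumes a small made-up HTML string (not real Google markup) in which result links sit inside h3 headings; the first anchor carries a class attribute before href, which is the case where reading the first attribute by position would return the class instead of the URL.

```r
library(XML)

# Hypothetical markup standing in for a results page (an assumption, not real Google HTML)
html <- paste0(
  "<html><body>",
  "<h3><a class='l' href='http://www.r-project.org/'>R Project</a></h3>",
  "<h3><a href='/search?q=r+project'>More results</a></h3>",
  "<p><a href='http://example.com/not-a-result'>Sidebar link</a></p>",
  "</body></html>"
)

# parse the string rather than fetching a page
doc <- htmlParse(html, asText = TRUE)

# anchors nested inside <h3> headings; read the href attribute by name
hrefs <- xpathSApply(doc, "//h3//a[@href]", xmlGetAttr, "href")

# free doc from memory
free(doc)

# keep absolute urls only, dropping the relative link back to the search page
result_links <- grep("^https?://", hrefs, value = TRUE)
print(result_links)
# [1] "http://www.r-project.org/"
```

The sidebar link is skipped because it is not inside an h3, and the relative link is dropped by the final filter.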

P.S. I know that there is an API of some sort for Google search but I don’t think anyone has made an R package for it. Yet. (I feel my skill set is insufficient to do it myself!)


24 Comments »

  1. Very nice!

    If you’re curious, there is an API for Google Search, but it’s severely rate limited. I explained how to access it via R here: http://stackoverflow.com/questions/5187685/r-search-google-for-a-string-and-return-number-of-hits/5188468#5188468

    Comment by Noah — November 7, 2011 @ 5:54 pm

  2. Excellent script. The only part missing for me was the ability to read beyond the first page of results. This can be achieved by adding “&start=x” at the end of the query, x being 10*(pageNumber-1), thus 0 for page 1, 10 for page 2 and so on. Here’s an ugly copy/paste of my ugly WIP.


    NbPageResults = 10*(NbPageResults - 1)
    search.page <- paste("&start=", NbPageResults, sep="")

    GoogleURL <- paste("http://www.google", domain,
    "/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=",
    search.term, search.page, sep="")

    # read in page contents
    html <- getURL(GoogleURL)

    Comment by SnRf — December 2, 2011 @ 5:10 am
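The offset arithmetic in the comment above can be wrapped in a small helper. This is only a sketch: the function name google_page_url, its arguments and the trimmed query string are assumptions made for illustration, and it takes the 10-results-per-page rule from the comment at face value.

```r
# Hypothetical helper: build the search url for a given results page,
# assuming 10 results per page as described in the comment above.
google_page_url <- function(search.term, page = 1, domain = ".co.uk") {
  # page 1 -> start=0, page 2 -> start=10, and so on
  start <- 10 * (page - 1)
  paste0("http://www.google", domain, "/search?ie=UTF-8&q=",
         URLencode(search.term, reserved = TRUE),
         "&start=", start)
}

# urls for the first three result pages
urls <- vapply(1:3, function(p) google_page_url("r project", p), character(1))
print(urls)
```

Each element of urls could then be passed to get_google_page_urls() in turn, ideally with a pause between requests.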

  12. When I initially commented I clicked the “Notify me when new comments are added”
    checkbox and now each time a comment is added I get three e-mails
    with the same comment. Is there any way you can remove people from that service?
    Appreciate it!

    Comment by Black Mould Removal Toronto — May 30, 2013 @ 2:22 am

    • I have no idea how to do that, mate; that’s a WordPress issue. 😦

      Comment by Tony Breyal — June 7, 2013 @ 9:27 pm

  14. That’s quite nice for small amounts of scraping. If you need to scrape large amounts of data, I’d like to share my projects; they are open source and completely free!
    Google search scraper: http://scraping.compunect.com/?scrape-google-search
    Google suggest scraper: http://scrape-google-suggest.compunect.com/?scrape-google-suggest
    Google Finance scraper: http://scrape-google-finance.compunect.com/?scrape-google-finance

    I think those larger projects fit this article quite well.

    Comment by Scraping Google — April 3, 2014 @ 4:10 pm

  18. Could someone elaborate on how to get the number of pages of search results? In addition, when I add “&start=…” to the end of the URL, Google seems to return fewer results. Is that because Google detects me manipulating it? Thank you so much!!!

    Comment by alicecongcong — October 18, 2016 @ 7:33 am

