Consistently Infrequent

November 12, 2011

googlePlusXScraper(): Web Scraping Google+ via XPath

Filed under: R — Tony Breyal @ 12:01 am

Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn’t bother me too much, but I’ve been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post of a status update page. I was originally going to do one for Facebook, but this just seemed more interesting, so maybe I’ll leave that for next week if I get time. Anyway, here’s how it works (full code link at end of post):

input <- "https://plus.google.com/110286587261352351537/posts"
df <- googlePlusXScraper(input)
t(df[2, ])

# posted.by             "Felicia Day"
# ID                    "110286587261352351537"
# message               "Um, I cannot wait.  So what class are you gonna be?!?!  I want to be a wood elf now, the Kajit are cool but I think I want a pretty humanoid to play for 100+ hours haha."
# message.embeded.names " [NEXT>>] Felicia Day [NEXT>>] Post date: 2011-11-10 [NEXT>>] VGW Review: The Elder Scrolls V: Skyrim [NEXT>>] Matthias Fussenegger"
# message.embeded.links "./110286587261352351537 [NEXT>>] 110286587261352351537/posts/Vo21AWjk1BK [NEXT>>] http://videogamewriters.com/review-the-elder-scrolls-v-skyrim-28778 [NEXT>>] ./100878219610349033014"
# post.date             "2011-11-10"
# pluses                "785"
# comments              "476"
# comments.by           "Matthias Fussenegger, Ezekiel Rage, Jake Sharman, Walter Swanevelder, Johan Lorentzon and 1 more"
# sample.comments       "Matthias Fussenegger  -  On Xbox? Nah. Just think of the great PC mods.    "
# shares                "45"
# shares.by             "Achmad Soemiardhita, Adam Pace, Alexis Bane, Allen Carrigan, Amanda Troutman and 40 more"
# type                  "Public"
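For anyone curious about the underlying approach, below is a rough sketch of the kind of XPath extraction involved, using the RCurl and XML packages. This is only an illustration, not the actual GitHub code, and the XPath class names are placeholders; the real Google+ markup is different (and changes often).

# a minimal sketch of XPath-based extraction with the RCurl and XML packages;
# the XPath class names below are placeholders, not the real Google+ markup
library(RCurl)
library(XML)

scrapePostsSketch <- function(input) {
  # download the page (ssl.verifypeer = FALSE sidesteps certificate issues on some setups)
  html <- getURL(input, ssl.verifypeer = FALSE)
  doc <- htmlParse(html, asText = TRUE)

  # pull out one character vector per field of interest
  posted.by <- xpathSApply(doc, "//div[@class='post-author']", xmlValue)
  message <- xpathSApply(doc, "//div[@class='post-message']", xmlValue)
  post.date <- xpathSApply(doc, "//span[@class='post-date']", xmlValue)

  # one row per post
  data.frame(posted.by, message, post.date, stringsAsFactors = FALSE)
}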

You simply supply the function with a Google+ posts page URL and it scrapes whatever information it can from each post on the page. It doesn’t load more posts after the initial set because I don’t really understand how to trigger that. The HTML element which loads more posts is:

<span role="button" title="Load more posts" tabindex="0" style="">More</span>

but how one would use that is beyond me. I think it probably involves JavaScript, which R has no way of executing as far as I know, and I don’t know JavaScript either. This makes the function of limited use on its own. One way around this limitation, however (and it’s something I’m doing with my Facebook wall post page scraper), is to save the Google+ posts page as an .html file on your disk after pressing the ‘More’ button as many times as you desire, and then give that file path directly to the googlePlusXScraper() function, which will automatically do the rest.

# save a Google+ posts page as a complete html file on your local disk, then select it when prompted by file.choose()
input <- file.choose()
df <- googlePlusXScraper(input)
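Because the same function accepts either a live URL or a saved .html file, the parsing step inside it presumably branches on the type of input. Here’s a hypothetical sketch of how that dispatch might look (the helper name and details are mine, not the actual GitHub code):

# hypothetical sketch: branch on whether the input is a URL or a local .html file;
# the real googlePlusXScraper() may handle this differently
library(RCurl)
library(XML)

parseInputSketch <- function(input) {
  if (grepl("^https?://", input)) {
    # live page: download the html, then parse it
    html <- getURL(input, ssl.verifypeer = FALSE)
    doc <- htmlParse(html, asText = TRUE)
  } else {
    # local file saved from the browser after clicking 'More' as many times as needed
    doc <- htmlParse(input)
  }
  doc
}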

The full code can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googlePlusXScraper/googlePlusXScraper.R

P.S. I’m new to GitHub as someone who uploads code, but it does seem very useful. And cool. Bow-tie cool. Yeah.



