Consistently Infrequent

November 12, 2011

googlePlusXScraper(): Web Scraping Google+ via XPath

Filed under: R — Tony Breyal @ 12:01 am

Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn't bother me much, but I've been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post on a status-update page. I was originally going to do one for Facebook, but this seemed more interesting, so maybe I'll leave that for next week if I get time. Anyway, here's how it works (full code link at the end of the post):

input <- "https://plus.google.com/110286587261352351537/posts"
df <- googlePlusXScraper(input)
t(df[2, ])

# posted.by             "Felicia Day"
# ID                    "110286587261352351537"
# message               "Um, I cannot wait.  So what class are you gonna be?!?!  I want to be a wood elf now, the Kajit are cool but I think I want a pretty humanoid to play for 100+ hours haha."
# message.embeded.names " [NEXT>>] Felicia Day [NEXT>>] Post date: 2011-11-10 [NEXT>>] VGW Review: The Elder Scrolls V: Skyrim [NEXT>>] Matthias Fussenegger"
# message.embeded.links "./110286587261352351537 [NEXT>>] 110286587261352351537/posts/Vo21AWjk1BK [NEXT>>] http://videogamewriters.com/review-the-elder-scrolls-v-skyrim-28778 [NEXT>>] ./100878219610349033014"
# post.date             "2011-11-10"
# pluses                "785"
# comments              "476"
# comments.by           "Matthias Fussenegger, Ezekiel Rage, Jake Sharman, Walter Swanevelder, Johan Lorentzon and 1 more"
# sample.comments       "Matthias Fussenegger  -  On Xbox? Nah. Just think of the great PC mods.    "
# shares                "45"
# shares.by             "Achmad Soemiardhita, Adam Pace, Alexis Bane, Allen Carrigan, Amanda Troutman and 40 more"
# type                  "Public"

You simply supply the function with a Google+ posts page URL and it scrapes whatever information it can from each post on the page. It doesn't load more posts after the initial set because I don't really understand how to do that. The HTML element which refers to loading more posts is:

<span role="button" title="Load more posts" tabindex="0" style="">More</span>

but how one would use that is beyond me. I think it probably involves JavaScript, which, as far as I know, R has no way of executing (and I don't know JavaScript anyway). This limits the function's usability. One way around the limitation, however (and it's something I'm doing with my Facebook wall-post page scraper), is to supply the HTML file of the Google+ posts page directly: save the page as an .html file on your disk after pressing the 'More' button as many times as you desire, then give that file path to the googlePlusXScraper() function, which will automatically do the rest.

# save a Google+ posts page as a complete .html file to your local disk,
# then choose that file when prompted
input <- file.choose()
df <- googlePlusXScraper(input)
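For the curious, the extraction itself boils down to parsing the page with the XML package and pulling out fields with XPath, which is all the full function linked below does on a larger scale. Here is a minimal sketch of that idea; note that the HTML snippet, class names, and XPath expressions are made up for illustration, since Google+'s real markup is more convoluted and liable to change:

```r
# A minimal, self-contained illustration of the XPath-scraping approach.
# NOTE: the HTML structure and class names below are hypothetical, purely
# for demonstration -- Google+'s actual markup differs.
library(XML)

html <- '
<html><body>
  <div class="post">
    <span class="author">Felicia Day</span>
    <span class="pluses">785</span>
  </div>
  <div class="post">
    <span class="author">Matthias Fussenegger</span>
    <span class="pluses">12</span>
  </div>
</body></html>'

# parse the raw HTML string into a document tree
doc <- htmlParse(html, asText = TRUE)

# one XPath expression per field, applied across all post nodes
authors <- xpathSApply(doc, "//div[@class='post']/span[@class='author']", xmlValue)
pluses  <- xpathSApply(doc, "//div[@class='post']/span[@class='pluses']", xmlValue)

# assemble the fields into a data frame, one row per post
df <- data.frame(posted.by = authors, pluses = pluses, stringsAsFactors = FALSE)
print(df)
```

Each column in the data frame returned by googlePlusXScraper() is built in essentially this way: one XPath expression per field, evaluated against every post node found on the page.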

The full code can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googlePlusXScraper/googlePlusXScraper.R

P.S. I'm new to GitHub as someone who uploads code, but it does seem very useful. And cool. Bow-tie cool. Yeah.


