Consistently Infrequent

January 6, 2012

R: Web Scraping R-bloggers Facebook Page

Filed under: R. Posted by Tony Breyal @ 8:50 pm

 

Introduction

R-bloggers.com is a blog aggregator maintained by Tal Galili. It is a great website both for learning about R and for keeping up to date with the latest developments (because someone will probably, and very kindly, post about the status of some R-related feature). There is also an R-bloggers Facebook page, where a number of articles from R-bloggers are linked into its feed. These can then be liked, commented upon and shared by other Facebook users. I was curious whether anyone had commented on any of my R posts which had been linked into this feed, but it is very tedious to keep manually clicking the ‘load more’ button to load more posts into the wall page and then scan for one of my posts.

Objective

Automatically scrape the content of the R-bloggers Facebook wall page via XPath and structure it into a data.frame in order to see whether anyone has commented on, liked or shared one of my posts.

Initial Thoughts

I have posted previously about using the Facebook Explorer API to get data from Facebook. However, there is an issue whereby a seemingly random set of posts may not be returned by the API. Given that I’m specifically interested in a small subset of posts, this makes the API unsuitable for me, as there is a chance I might miss something interesting. (My feeling is that this has something to do with privacy settings, but I’m not sure: surely I then wouldn’t be able to see a private post at all, whether through the Facebook wall or the Graph API, unless the API is stricter about privacy.)

I could try logging directly into Facebook using RCurl and doing things like setting cookies, but that would require me to first learn HOW to set cookies in RCurl (and to feel motivated enough to spend the extra time required to do it). Seeing as I really want to spend the majority of my spare programming time learning Python, I’m going to give this one a miss for now.

Therefore I want to do this scraping of data using the skills I already have (which is a rather basic understanding of XPath via the XML package). I was tempted to learn about setting cookies with RCurl but it’s Friday and that means I just want the weekend to start already…

Limitations

Links to blog posts on the Facebook wall often do not give information about the original author of the blog. This is rather annoying because it means that some web-crawling is necessary to find out who wrote the post instead of that information being readily available in the first instance. I’m going to limit my code to only crawling for extra information from R-bloggers.com links because it is very easy to scrape data off that website via XPath (and saves me writing lots of code to try and work with other types of websites).
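As a rough sketch of that restriction, the links could be filtered with a regular expression before any crawling happens. The function name is_rbloggers_link and the sample links are made up for illustration (the facebook.com notes URL in particular is a placeholder); this is not necessarily how the actual scraper does it.

```r
# Hypothetical sample of links; the real ones would come from the scraped wall page
links <- c(
  "http://www.r-bloggers.com/unshorten-any-url-with-r/",
  "http://www.facebook.com/notes/r-bloggers/some-note/123",  # placeholder URL
  "https://r-bloggers.com/another-post/",                    # placeholder URL
  NA
)

# TRUE only for links hosted on r-bloggers.com (with or without "www.");
# missing links are treated as "do not crawl"
is_rbloggers_link <- function(x) {
  !is.na(x) & grepl("^https?://(www\\.)?r-bloggers\\.com/", x, ignore.case = TRUE)
}

to_crawl <- links[is_rbloggers_link(links)]
print(to_crawl)
```

Anchoring the pattern at the start of the string avoids picking up other sites that merely mention r-bloggers.com somewhere in their URL.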

The R-bloggers Facebook page has wall posts going back to January 2010. Prior to September 2011, blog posts pointed to the “notes” page on Facebook. This prevents me from getting extra data about those blog posts because I can’t automatically navigate to those Facebook pages. From September 2011 onwards, however, the blog posts point to R-bloggers.com and so these can be scraped for further information rather easily. Luckily I only started posting in November 2011, so this isn’t an issue for me.

Wall posts with only a few comments do not always indicate how many comments they have. I’m not sure how to get round this; I might have to record “at least 1 comment” in this situation.
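One way that fallback might look is sketched below. The function label_comments and its two arguments are hypothetical names, and recording "0" when nothing at all is visible is my own assumption rather than something Facebook reports.

```r
# count_text: the scraped comment-count text, or NA when the wall shows no count
# has_visible_comments: TRUE when at least one comment was scraped for the post
label_comments <- function(count_text, has_visible_comments) {
  has_count <- !is.na(count_text) & nzchar(count_text)
  ifelse(has_count,
         count_text,
         ifelse(has_visible_comments, "At least 1 comment", "0"))
}

# Hypothetical inputs: one post with a count, one with comments but no count,
# and one with neither
print(label_comments(c("12 comments", NA, NA), c(TRUE, TRUE, FALSE)))
```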

Most of the wall posts are made up of links to R-bloggers.com and various messages by Facebook users. Instead of filtering as I go, I’m just going to grab AS MUCH INFORMATION off the wall feed as I can and then filter at the end. I’ll put the unfiltered information into a CSV file for anyone who may want it and post it up on GitHub.

Method

The easiest method is to log into Facebook via the browser, navigate to the R-bloggers Facebook page, and use the socialfixer.com browser add-on “Better Facebook” to automatically and painlessly load all posts in the R-bloggers feed going back to January 2010. Then save that page to the hard drive using what Google Chrome calls the “Web Page, Complete” option (NOT the “Web Page, HTML Only” option, because for some reason that won’t work well with my code).

Once the data is in an HTML file, I use XPath expressions via Duncan Temple Lang’s XML package to extract whatever information I can in the first instance and store it in a data.frame.
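To give a flavour of that step, here is a minimal sketch using the XML package on a tiny stand-in page. The markup, the class names ("post", "time", "link"), the second timestamp and the helper first_or_na are all invented for illustration; the real Facebook markup is far messier and the actual scraper uses different XPath expressions.

```r
library(XML)

# A tiny stand-in for the saved wall page (invented markup, not Facebook's)
html <- '
<html><body>
  <div class="post">
    <span class="time">Wednesday, December 14, 2011 at 10:29pm</span>
    <a class="link" href="http://www.r-bloggers.com/unshorten-any-url-with-r/">Unshorten any URL with R</a>
  </div>
  <div class="post">
    <span class="time">Thursday, December 15, 2011 at 9:00am</span>
  </div>
</body></html>'

doc   <- htmlParse(html, asText = TRUE)
posts <- getNodeSet(doc, "//div[@class='post']")

# First match of a relative XPath inside one post, or NA when there is none
first_or_na <- function(node, xpath, extract) {
  res <- xpathSApply(node, xpath, extract)
  if (length(res) == 0) NA_character_ else as.character(res[[1]])
}

# One row per wall post; posts without an embedded link get NA
wall <- data.frame(
  timestamp    = sapply(posts, first_or_na, xpath = ".//span[@class='time']",
                        extract = xmlValue),
  embeded.link = sapply(posts, first_or_na, xpath = ".//a[@class='link']",
                        extract = function(n) xmlGetAttr(n, "href")),
  stringsAsFactors = FALSE
)
print(wall)
```

The leading "." in each relative XPath keeps the search inside the current post node, so a post with no link gets NA instead of borrowing a link from a neighbouring post. For a page saved to disk, htmlParse() also accepts a file path instead of a string.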

Once this initial data is in place, I will crawl any posts which link to R-bloggers.com and extract extra information about the post (e.g. Author, original publication date, post title, etc.). I will merge this data with the already constructed data.frame above.
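The merge itself could be sketched with base R’s merge(), keeping every wall post whether or not it was crawled. The two small data frames below are hand-made stand-ins (the Facebook notes link and its timestamp are placeholders), so treat this as an illustration under those assumptions rather than the scraper’s actual code.

```r
# Stand-in for the data scraped from the wall page
wall <- data.frame(
  timestamp    = c("Wednesday, December 14, 2011 at 10:29pm",
                   "Tuesday, March 2, 2010 at 1:00pm"),          # placeholder
  embeded.link = c("http://www.r-bloggers.com/unshorten-any-url-with-r/",
                   "http://www.facebook.com/notes/example"),     # placeholder
  stringsAsFactors = FALSE
)

# Stand-in for the extra information crawled from R-bloggers.com
crawled <- data.frame(
  rbloggers.link = "http://www.r-bloggers.com/unshorten-any-url-with-r/",
  title          = "Unshorten any URL with R",
  author         = "Tony Breyal",
  stringsAsFactors = FALSE
)

# all.x = TRUE is a left join: wall posts that were never crawled get NA
merged <- merge(wall, crawled,
                by.x = "embeded.link", by.y = "rbloggers.link",
                all.x = TRUE)
print(merged)
```

Note that merge() sorts the result by the key column and keeps only the by.x name for it, so rows should be looked up by link rather than by their original position.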

I will then save this data.frame to a .csv file in case anyone else wishes to analyse it (thus saving them some time). Finally I will subset the data.frame to only posts that link to one of my blog posts and inspect the output.

Solution

source_https <- function(url, ...)  {
  # load package
  require(RCurl)

  source_script <- function(u) {
    # read script lines from website using a security certificate
    script <- getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

    # parse lines and evaluate in the global environment
    eval(parse(text = script), envir = .GlobalEnv)
  }

  # source each script
  sapply(c(url, ...), source_script)
}

Following the procedure described in the Method section above:

  1. Log into Facebook
  2. Navigate to the R-bloggers Facebook wall
  3. Load data as far back as you like. I used the Better Facebook browser add-on tool to automatically load data right back to January 2010.
  4. Save this webpage as a “complete” HTML file.
  5. Run the following code, selecting the location of the HTML file when prompted:
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R")
df <- rbloggersFBXScraper()

Depending on your internet connection this could take quite some time to complete because it has to crawl the R-bloggers website for extra information about links posted since September 2011. To save you some time I’ve saved ALL the data which I have scraped into a single csv file. Here’s how to use it:

library(RCurl)
csv.location <- "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv"
txt <- getURL(csv.location, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)

It’s then a simple case of subsetting to find posts by a specific author:

find_posts <- function(df, my.name) {
  subset(df, author == my.name)
}

df2 <- find_posts(df, "Tony Breyal")
t(df2[2,])

#                   30
# timestamp         "Wednesday, December 14, 2011 at 10:29pm"
# num.likes         "6 people like this"
# num.comments      "At least 1 comment"
# num.shares        "0"
# posted.by         "R bloggers"
# message           "I love these things :)http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link      "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link.text "Introduction\n I was asked by a friend how to find the full final address of an URL \nwhich had been shortened via a shortening service (e.g., Twitter’s t.co,\n Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, \nOw.ly, etc.). I replied I had no idea and maybe he should have a look \nover on ..."
# sample.comments   "Kai Feng Chew Yes! It's really cool! I changed a little bit to make it 2 lines to use the shorten function: load(\"unshort.Rdata\") unshort(\"ANY_SHORTEN_URL\") Example:http://cloudst.at/index.php?do=%2Fkafechew%2Fblog%2Funshorten-url-function%2FWednesday, December 14, 2011 at 10:34pm · LikeUnlike ·  1ReplyTony Breyal ‎@Kai\n you might want to use the code from the updated version of the code on \nmy blog because it now handles both https. It won't work with \"http://1.cloudst.at/myeg\" however because that one require the user to be registered (and I'll admit I had not thought of that use case)Thursday, December 15, 2011 at 12:03am · LikeUnlike ·  1Reply"
# rbloggers.link    "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# title             "Unshorten any URL with R"
# first.published   "December 13, 2011"
# author            "Tony Breyal"
# blog.name         " Consistently Infrequent » R"
# blog.link         "http://tonybreyal.wordpress.com/2011/12/13/unshorten-any-url-created-using-url-shortening-services-decode_shortened_url/"
# tags              "dft.ba, R, RCurl, rstats, tinurl, url"

So this tells me that my post entitled “Unshorten any URL with R” got six likes and at least one comment on Facebook. Nice. The “sample.comments” field shows what was commented, and that I posted a reply (based on that person’s comment I was able to improve the code and realise that it wouldn’t work with a shortened link which requires a user to be logged in first). Awesome stuff.

Final Thoughts

So now that I have this data, I am not quite sure what to do with it. I could do a sorted bar chart with each blog entry on the x-axis and the number of Facebook likes on the y-axis. I was thinking of doing some sentiment analysis on the sampled comments (I could only scrape visible comments, not the ones you have to press a button to load), but I don’t have the time to read up on that type of analysis. Maybe in the future :)
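For what it’s worth, the bar chart idea might be sketched like this. The num.likes column holds text such as "6 people like this", so the number has to be pulled out first. The sample data and the helper likes_to_int are hypothetical, and I am assuming that any string without digits should be treated as missing rather than guessed at.

```r
# Hypothetical sample; with the real data this would be the df read from data.csv
df <- data.frame(
  title     = c("Post A", "Post B", "Post C"),
  num.likes = c("6 people like this", "0", "12 people like this"),
  stringsAsFactors = FALSE
)

# Extract the first run of digits from each string; NA when there are none
likes_to_int <- function(x) {
  suppressWarnings(as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", x)))
}

df$likes <- likes_to_int(df$num.likes)

# Drop posts whose like count could not be parsed, then sort by likes
plot_df <- df[!is.na(df$likes), ]
plot_df <- plot_df[order(plot_df$likes, decreasing = TRUE), ]

barplot(plot_df$likes, names.arg = plot_df$title, las = 2,
        ylab = "Facebook likes")
```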

R code: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R
csv file: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv
