Consistently Infrequent

January 6, 2012

R: Web Scraping R-bloggers Facebook Page


Introduction

R-bloggers.com is a blog aggregator maintained by Tal Galili. It is a great website for both learning about R and keeping up-to-date with the latest developments (because someone will probably, and very kindly, post about the status of some R related feature). There is also an R-bloggers facebook page where a number of articles from R-bloggers are linked into its feed. These can then be liked, commented upon and shared by other facebook users. I was curious whether anyone had commented on any of my R posts which had been linked into this facebook feed, but it is a very tedious process to manually and continually click the ‘load more’ button to load more posts into the facebook wall page and scan for one of my posts.

Objective

Automatically scrape the content off of the R-bloggers facebook wall page via XPath and structure it into a dataframe in order to see if anyone has made any comments on one of my posts, or liked it or shared it.

Initial Thoughts

I have posted previously about using the Facebook Explorer API to get data from facebook. However there is an issue whereby a random set of posts may not be returned by the API. Given that I’m specifically interested in a small subset of posts, this issue makes the API unsuitable for me because there is a chance I might miss something interesting. (My feeling is this has something to do with privacy settings, but I’m not sure, because then surely I wouldn’t be able to see a private post at all, whether through the facebook wall or the Graph API, unless the API is stricter about privacy.)

I could try logging directly into Facebook using RCurl and doing things like setting cookies but that would require me having to first learn HOW to set cookies in RCurl (and feeling motivated enough to spend the extra time required to do it). Seeing as I really want to spend the majority of my spare programming time learning python, I’m going to give this one a miss for now.

Therefore I want to do this scraping of data using the skills I already have (which is a rather basic understanding of XPath via the XML package). I was tempted to learn about setting cookies with RCurl but it’s Friday and that means I just want the weekend to start already…

Limitations

Links to blog posts on the Facebook wall often do not give information about the original author of the blog. This is rather annoying because it means that some web-crawling is necessary to find out who wrote the post instead of that information being readily available in the first instance. I’m going to limit my code to only crawling for extra information from R-bloggers.com links because it is very easy to scrape data off that website via XPath (and saves me writing lots of code to try and work with other types of websites).

The R-bloggers facebook page has wall posts going back to January 2010. Prior to September 2011, blog posts pointed to the “notes” page on facebook. This prevents me getting extra data about those blog posts because I can’t automatically navigate to those facebook pages. From September 2011 onwards, however, the blog posts point to R-bloggers.com and so these can be scraped for further information rather easily. Luckily I only started posting in November 2011 so this isn’t an issue for me.

Not all wall posts indicate how many comments they have if there are only a few. I’m not sure how to get round this; I might just have to record “at least 1 comment” for that situation.

Most of the wall posts are made up of links to R-bloggers.com and various messages by Facebook users. Instead of filtering out, I’m just going to grab AS MUCH INFORMATION off of the wall feed as I can and then filter at the end. I’ll put the unfiltered information into a csv file for anyone that may want it and post it up on github.

Method

The easiest method would be to log into Facebook via the browser, navigate to the R-bloggers Facebook page, use the socialfixer.com browser add-on “Better Facebook” to automatically and painlessly load all posts in the R-bloggers feed going back to January 2010, and then save that page to the hard drive using, in Google Chrome terminology, the “Web Page, Complete” option (NOT the “Web Page, HTML Only” option because for some reason that won’t work well with my code).

Once the data is in an HTML file, use XPath expressions via Duncan Temple Lang’s XML package to extract whatever information I can in the first instance and store it in a data.frame.
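
To give a flavour of that step, here is a minimal sketch of parsing the saved wall page with the XML package. The XPath expressions are illustrative guesses at Facebook’s markup rather than the exact ones used in the final rbloggersFBXScraper() function:

# load package
library(XML)

# parse the saved "Web Page, Complete" html file from disk
doc <- htmlParse(file.choose(), encoding = "UTF-8")

# text of each wall post (the class name below is a hypothetical example)
messages <- xpathSApply(doc, "//div[contains(@class, 'uiStreamMessage')]", xmlValue)

# every embedded link in the feed
links <- as.character(xpathSApply(doc, "//a/@href"))

# release the parsed document from memory
free(doc)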

Once this initial data is in place, I will crawl any posts which link to R-bloggers.com and extract extra information about the post (e.g. Author, original publication date, post title, etc.). I will merge this data with the already constructed data.frame above.
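
Again only as a rough sketch, crawling a single R-bloggers.com link for its extra details might look something like the following (the XPath expressions are assumptions about the site’s markup rather than the exact ones in my final code):

# load package
library(XML)

# hypothetical helper: pull the page title and byline text from one R-bloggers post
get_rbloggers_info <- function(u) {
  doc <- htmlParse(u, encoding = "UTF-8")
  title <- xpathSApply(doc, "//title", xmlValue)[1]
  byline <- xpathSApply(doc, "//div[@class='meta']", xmlValue)[1]
  free(doc)
  data.frame(rbloggers.link = u, title = title, byline = byline, stringsAsFactors = FALSE)
}

get_rbloggers_info("http://www.r-bloggers.com/unshorten-any-url-with-r/")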

I will then save this data.frame to a .csv file in case anyone else wishes to analyse it (thus saving them some time). Finally I will subset the data.frame to only posts that link to one of my blog posts and inspect the output.

Solution

source_https <- function(url, ...)  {
  # load package
  require(RCurl)

  source_script <- function(u) {
    # read script lines from website using a security certificate
    script <- getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

    # parse lines and evaluate in the global environment
    eval(parse(text = script), envir = .GlobalEnv)
  }

  # source each script
  sapply(c(url, ...), source_script)
}

Following the procedure described in the Method section above:

  1. Log into Facebook
  2. Navigate to the R-bloggers Facebook wall
  3. Load data as far back as you like. I used the Better Facebook browser add-on tool to automatically load data right back to January 2010.
  4. Save this webpage as a “complete” html file.
  5. Run the following code, selecting the location of the html file when prompted:
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R")
df <- rbloggersFBXScraper()

Depending on your internet connection this could take quite some time to complete because it has to crawl the R-bloggers website for extra information about links posted since September 2011. To save you some time I’ve saved ALL the data which I have scraped into a single csv file. Here’s how to use it:

library(RCurl)
csv.location <- "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv"
txt <- getURL(csv.location, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)

It’s then a simple case of subsetting to find posts by a specific author:

find_posts <- function(df, my.name) {
  subset(df, author == my.name)
}

df2 <- find_posts(df, "Tony Breyal")
t(df2[2,])

#                   30
# timestamp         "Wednesday, December 14, 2011 at 10:29pm"
# num.likes         "6 people like this"
# num.comments      "At least 1 comment"
# num.shares        "0"
# posted.by         "R bloggers"
# message           "I love these things :)http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link      "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link.text "Introduction\n I was asked by a friend how to find the full final address of an URL \nwhich had been shortened via a shortening service (e.g., Twitter’s t.co,\n Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, \nOw.ly, etc.). I replied I had no idea and maybe he should have a look \nover on ..."
# sample.comments   "Kai Feng Chew Yes! It's really cool! I changed a little bit to make it 2 lines to use the shorten function: load(\"unshort.Rdata\") unshort(\"ANY_SHORTEN_URL\") Example:http://cloudst.at/index.php?do=%2Fkafechew%2Fblog%2Funshorten-url-function%2FWednesday, December 14, 2011 at 10:34pm · LikeUnlike ·  1ReplyTony Breyal ‎@Kai\n you might want to use the code from the updated version of the code on \nmy blog because it now handles both https. It won't work with \"http://1.cloudst.at/myeg\" however because that one require the user to be registered (and I'll admit I had not thought of that use case)Thursday, December 15, 2011 at 12:03am · LikeUnlike ·  1Reply"
# rbloggers.link    "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# title             "Unshorten any URL with R"
# first.published   "December 13, 2011"
# author            "Tony Breyal"
# blog.name         " Consistently Infrequent » R"
# blog.link         "https://tonybreyal.wordpress.com/2011/12/13/unshorten-any-url-created-using-url-shortening-services-decode_shortened_url/"
# tags              "dft.ba, R, RCurl, rstats, tinurl, url"

So this tells me that my post entitled “Unshorten any URL with R” got six likes and at least one comment on facebook. Nice. The “sample.comments” field shows what was commented, and that I posted a reply (based on that person’s comment I was able to improve the code and realise that it wouldn’t work with a shortened link which requires the user to be logged in first). Awesome stuff.

Final Thoughts

So now I have this data I am not quite sure what to do with it. I could do a sorted bar chart with each blog entry on the x-axis and the number of facebook likes on the y-axis. I was thinking of doing some sentiment analysis on the sampled comments (I could only scrape visible comments, not the ones you have to press a button to load more of) but I don’t have the time to read up on that type of analysis. Maybe in the future 🙂
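
For reference, a quick sketch of that bar chart idea might look something like this (assuming the data.frame df from above; num.likes is stored as text such as “6 people like this”, so the leading number has to be pulled out first):

# extract the leading number from strings such as "6 people like this"
likes <- as.numeric(gsub("[^0-9].*$", "", df$num.likes))
likes[is.na(likes)] <- 0

# sorted bar chart of facebook likes per post
ord <- order(likes, decreasing = TRUE)
barplot(likes[ord], names.arg = df$title[ord], las = 2, cex.names = 0.6, ylab = "Facebook likes")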

R code: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R
csv file: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv

November 10, 2011

Facebook Graph API Explorer with R (on Windows)


I wanted to play around with the Facebook Graph API using the Graph API Explorer page as a coding exercise. This facility allows one to use the API with a temporary authorisation token. Now, I don’t know how to make an R package for the proper API where you have to register for an API key and do some OAuth stuff because that is above my current skill set, but the Explorer page itself is a nice middle ground.

Therefore I’ve come up with a self-contained R function which allows me to do just that (full code at end of post):


# load packages
library(RCurl)
library(RJSONIO)

# get facebook data
df <- Facebook_Graph_API_Explorer()
t(df[7,])

# post.id                      "127031120644257_319044381442929"
# from.name                    "Doctor Who"
# from.id                      "127031120644257"
# to.name                      "Doctor Who"
# to.id                        "127031120644257"
# to.category                  "Tv show"
# created.time                 "2011-11-10 11:13:42"
# message                      "Has it ever been found out who blew up the TARDIS?"
# type                         "status"
# likes.count                  NA
# comments.count               "3"
# sample.comments              "Did the tardis blow up I haven't seen all of sesion 6&7 [next>>] \"7\" ??? [next>>] the pandorica was obsorbin earth so he blew it up with the tardis"
# sample.comments.from.name    "Alex Nomikos [next>>] Paul Morris [next>>] Vivienne Leigh Bruen"
# sample.comments.from.id      "100001033497348 [next>>] 595267764 [next>>] 100000679940192"
# sample.comments.created.time "2011-11-10 11:23:36 [next>>] 2011-11-10 11:29:56 [next>>] 2011-11-10 13:04:53"

In the above, I’m using “[next>>]” as a way of separating entities in the same cell in order to keep the data frame structure. The order is maintained across cells, i.e. the first entity of each cell of the sample.comments.from.name column corresponds to the first entity of each cell of the sample.comments.from.id column, etc.
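
For example, splitting one of those cells back into a vector is just a strsplit() on the separator (shown here on the sample.comments.from.name column of the data frame above):

strsplit(as.character(df$sample.comments.from.name[7]), " [next>>] ", fixed = TRUE)[[1]]
# [1] "Alex Nomikos"         "Paul Morris"          "Vivienne Leigh Bruen"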

The main problem I experienced, and have been experiencing for a long time with R, is dealing with a list which has a NULL as one of its elements and then un-listing it whilst still maintaining the same length. For example:

mylist <- list(a=1, b=NULL, c="hello")
unlist(mylist, use.names = FALSE)
# [1] "1"     "hello"

Whereas what I really want is for the NULL to be converted to NA and thus have the length of the resulting vector be the same as that of the original list, e.g.

mylist <- list(a=1, b=NULL, c="hello")
mylist[sapply(mylist, is.null)] <- NA
unlist(mylist, use.names = FALSE)
# [1] "1"     NA      "hello"

But I don’t know of any automatic way of doing that and so have to do it manually each time. I tell you, these NULL elements in a list are really causing me headaches when it comes to using unlist!
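
That manual fix-up is at least easy to wrap in a small helper so it only has to be written once (a convenience sketch, not part of the original function below):

# replace NULL elements of a list with NA so unlist() keeps the original length
null_to_na <- function(x) {
  x[sapply(x, is.null)] <- NA
  x
}

mylist <- list(a=1, b=NULL, c="hello")
unlist(null_to_na(mylist), use.names = FALSE)
# [1] "1"     NA      "hello"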

Anyway, back to the Facebook_Graph_API_Explorer() function, there are a couple of points to bear in mind:

  1. This will only work on Windows because I don’t know what a cross-platform version of winDialogString is. I’m guessing the tcltk package has something but I can’t see what it would be.
  2. You must already be signed into Facebook (i.e. you must have an account and be signed in) before you call my Facebook_Graph_API_Explorer() function.

The function will guide you through the process with dialogue boxes so it should be easy to use for anyone. I think next time I’ll try a web scraping exercise on the HTML of a facebook wall page using XPath, depends on how much time I get!

Tony Breyal

P.S. Full code is below:


library(RCurl)
library(RJSONIO)

Facebook_Graph_API_Explorer <- function() {
  get_json_df <- function(data) {
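    # convert one batch of JSON post data into a data.frame, one row per post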
    l <- list(
        post.id = lapply(data, function(post) post$id),
        from.name = lapply(data, function(post) post$from$name),
        from.id = lapply(data, function(post) post$from$id),
        to.name = lapply(data, function(post) post$to$data[[1]]$name),
        to.id = lapply(data, function(post) post$to$data[[1]]$id),
        to.category = lapply(data, function(post) post$to$data[[1]]$category),
        created.time = lapply(data, function(post) as.character(as.POSIXct(post$created_time, origin="1970-01-01", tz="GMT"))),
        message = lapply(data, function(post) post$message),
        type = lapply(data, function(post) post$type),
        likes.count = lapply(data, function(post) post$likes$count),
        comments.count = lapply(data, function(post) post$comments$count),
        sample.comments = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$message), collapse = " [next>>] ")),
        sample.comments.from.name = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$from$name), collapse = " [next>>] ")),
        sample.comments.from.id = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) comment$from$id), collapse = " [next>>] ")),
        sample.comments.created.time = lapply(data, function(post) paste(sapply(post$comments$data, function(comment) as.character(as.POSIXct(comment$created_time, origin="1970-01-01", tz="GMT"))), collapse = " [next>>] "))
        )
    # replace all occurrences of NULL with NA
    df = data.frame(do.call("cbind", lapply(l, function(x) sapply(x, function(xx) ifelse(is.null(xx), NA, xx)))))
    return(df)
  }

  # STEP 1: Get certs so we can access https links (we'll delete it at the end of the script)
  if(!file.exists("cacert.pem")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

  # STEP 2: Get Facebook token to access data. I need a cross-platform version of winDialog and winDialogString, otherwise this only works on Windows
  winDialog(type = "ok", "Make sure you have already signed into Facebook.\n\nWhen the browser opens, please click 'Get Access Token' twice. You DO NOT need to select/check any boxes for a public feed.\n\nAfter pressing OK, switch over to your now open browser.")
  browseURL("http://developers.facebook.com/tools/explorer/?method=GET&path=100002667499585")
  token <- winDialogString("When the browser opens, please click 'Get Access Token' twice and copy/paste the token below", "")

  # STEP 3: Get facebook ID. This can be a fanpage or whatever e.g. https://www.facebook.com/DoctorWho
  ID <- winDialogString("Please enter FB name id below:", "https://www.facebook.com/DoctorWho")
  ID <- gsub(".*com/", "", ID)

  # STEP 4: Construct Facebook Graph API URL
  u <- paste("https://graph.facebook.com/", ID, "/feed", "?date_format=U", "&access_token=", token, sep = "")

  # STEP 5: How far back do you want get data for? Format should be YYYY-MM-DD
  user.last.date <- try(as.Date(winDialogString("Please enter a date for how roughly far back to gather data from using this format: yyyy-mm-dd", "")), silent = TRUE)
  current.last.date <- user.last.date + 1

  # Get data: page backwards through the feed, following the API's 'next' URL until we pass the requested cut-off date
  df.list <- list()
  i <- 1
  while(current.last.date > user.last.date) {
    # Download the JSON feed
    json <- getURL(u, cainfo = "cacert.pem")
    json <- fromJSON(json, simplify = FALSE)
    data <- json$data
    stopifnot(!is.null(data))

    # Get json Data Frame
    df.list[[i]] <- get_json_df(data)
    i <- i + 1

    # variables for while loop
    current.last.date <- as.Date(as.POSIXct(json$data[[length(json$data)]]$created_time, origin="1970-01-01", tz="GMT"))
    print(paste("Current batch of dates being processed is:", current.last.date, "(loading more...)"))
    u <- json$paging$`next`
  }

  # delete the security certificate we downloaded earlier for https sites
  file.remove("cacert.pem")
  # return data frame
  df <- do.call("rbind", df.list)
  return(df)
}

df <- Facebook_Graph_API_Explorer()
t(df[4,])
# post.id                      "127031120644257_319062954774405"
# from.name                    "Torchwood"
# from.id                      "119328091441982"
# to.name                      "Torchwood"
# to.id                        "119328091441982"
# to.category                  "Tv show"
# created.time                 "2011-11-10 12:05:21"
# message                      "If you're missing Torchwood & Doctor Who and are after some good, action-packed science fiction, why not check out FOX's awesome prehistoric romp, Terra Nova? It's carried in the UK on Sky TV and is well worth catching up with & following! The idea - The Earth is dying, it's in its final years. Life's intolerable & getting worse. Scientists take advantage of a rift in time & space to set up a 'fresh start' colony on Terra Nova - the earth, 60 million years ago. The adventure then begins..."
# type                         "link"
# likes.count                  NA
# comments.count               "0"
# sample.comments              ""
# sample.comments.from.name    ""
# sample.comments.from.id      ""
# sample.comments.created.time ""

UPDATE: Based on a suggestion from @BrockTibert I’ve now set up a github account and the above code can be found here: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/facebook_Graph_API_Explorer/facebook_Graph_API_Explorer.R

UPDATE 2: An alternative web-scraping method to bypass the API with R: https://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/
