Consistently Infrequent

January 6, 2012

R: Web Scraping R-bloggers Facebook Page

Filed under: R — Tags: , — Tony Breyal @ 8:50 pm


Introduction

R-bloggers is a blog aggregator maintained by Tal Galili. It is a great website for both learning about R and keeping up to date with the latest developments (because someone will probably, and very kindly, post about the status of some R-related feature). There is also an R-bloggers Facebook page where a number of articles from R-bloggers are linked into its feed. These can then be liked, commented upon and shared by other Facebook users. I was curious whether anyone had commented on any of my R posts which had been linked into this Facebook feed, but it is a very tedious process to manually and continually click the ‘load more’ button on the Facebook wall page and scan for one of my posts.


Objective

Automatically scrape the content off of the R-bloggers Facebook wall page via XPath and structure it into a data frame, in order to see if anyone has made any comments on one of my posts, or liked or shared it.

Initial Thoughts

I have posted previously about using the Facebook Graph API Explorer to get data from Facebook. However, there is an issue whereby a random set of posts may not be returned by the API. Given that I’m specifically interested in a small subset of posts, this makes the API unsuitable for me because there is a chance I might miss something interesting. (My feeling is that this has something to do with privacy settings, but I’m not sure, because then surely I wouldn’t be able to see a private post at all, whether through the Facebook wall or the Graph API, unless the API is simply stricter about privacy.)
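For reference, the kind of Graph API call I mean looks something like the sketch below; the page name “Rbloggers” and the exact access-token requirements are assumptions on my part:

library(RCurl)
library(RJSONIO)

# fetch the public feed of a page via the Graph API (may need an access_token parameter)
json <- getURL("https://graph.facebook.com/Rbloggers/posts",
               cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
feed <- fromJSON(json)

# each element of feed$data is one wall post; pull out the messages
sapply(feed$data, function(post) if (is.null(post$message)) NA else post$message)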

I could try logging directly into Facebook using RCurl and doing things like setting cookies, but that would require me to first learn HOW to set cookies in RCurl (and to feel motivated enough to spend the extra time required to do it). Seeing as I really want to spend the majority of my spare programming time learning python, I’m going to give this one a miss for now.
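For the record, the rough shape of that cookie approach would be something like this untested sketch (the login URL and the form field names “email” and “pass” are assumptions):

library(RCurl)

# a curl handle which stores and resends cookies across requests
curl <- getCurlHandle(cookiefile = "", cookiejar = "cookies.txt",
                      followlocation = TRUE,
                      cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

# post the login form through the handle so the session cookies get captured
postForm("https://www.facebook.com/login.php", style = "POST",
         email = "me@example.com", pass = "my_password", curl = curl)

# subsequent requests through the same handle would then be authenticated
html <- getURL("https://www.facebook.com/Rbloggers", curl = curl)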

Therefore I want to do this scraping of data using the skills I already have (which is a rather basic understanding of XPath via the XML package). I was tempted to learn about setting cookies with RCurl but it’s Friday and that means I just want the weekend to start already…
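To make concrete the kind of XPath I mean, here is the basic pattern on a toy document:

library(XML)

# parse a snippet of html and pull out node values with XPath expressions
html <- '<html><body><div class="post"><a href="http://example.com/foo">A Post Title</a></div></body></html>'
doc  <- htmlParse(html, asText = TRUE)

xpathSApply(doc, "//div[@class='post']/a", xmlValue)            # "A Post Title"
xpathSApply(doc, "//div[@class='post']/a", xmlGetAttr, "href")  # "http://example.com/foo"

free(doc)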


Limitations

Links to blog posts on the Facebook wall often do not give information about the original author of the blog. This is rather annoying because it means that some web crawling is necessary to find out who wrote each post, instead of that information being readily available in the first instance. I’m going to limit my code to only crawling r-bloggers.com links for extra information, because it is very easy to scrape data off that website via XPath (and it saves me writing lots of code to try to work with other types of websites).

The R-bloggers Facebook page has wall posts going back to January 2010. Prior to September 2011, blog posts pointed to the “notes” page on Facebook. This prevents me getting extra data about those blog posts because I can’t automatically navigate to those Facebook pages. From September 2011 onwards, however, the blog posts point to r-bloggers.com and so these can be scraped for further information rather easily. Luckily I only started posting in November 2011, so this isn’t an issue for me.

Not all wall posts indicate how many comments they have if there are only a few. I’m not sure how to get round this; I might just have to write “at least 1 comment” in that situation, along the lines of the sketch below.
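A hypothetical helper for that fallback (the real scraper may handle this differently):

# fall back to a conservative label when facebook hides the exact count
format_num_comments <- function(x) {
  if (is.na(x) || x == "") "At least 1 comment" else x
}

format_num_comments("")            # "At least 1 comment"
format_num_comments("12 comments") # "12 comments"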

Most of the wall posts are made up of links to r-bloggers.com and various messages by Facebook users. Instead of filtering as I go, I’m just going to grab AS MUCH INFORMATION off of the wall feed as I can and then filter at the end. I’ll put the unfiltered information into a csv file for anyone who may want it and post it up on GitHub.


Method

The easiest method would be to log into Facebook via the browser, navigate to the R-bloggers Facebook page, use the browser add-on “Better Facebook” to automatically and painlessly load all posts in the R-bloggers feed going back to January 2010, and then save that page to the hard drive using, in Google Chrome terminology, the “Web Page, Complete” option (NOT the “Web Page, HTML Only” option, because for some reason that won’t work well with my code).

Once the data is in an html file, use XPath expressions via Duncan Temple Lang’s XML package to extract whatever information I can in the first instance and store it in a data.frame.
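In outline, that step looks like this (the XPath expression is illustrative; Facebook’s real class names differ):

library(XML)

# parse the locally saved wall page (file.choose() pops up a file-selection dialog)
doc <- htmlParse(file.choose())

# illustrative only: pull the text of each wall post node
posts <- xpathSApply(doc, "//div[contains(@class, 'post')]", xmlValue)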

Once this initial data is in place, I will crawl any posts which link to r-bloggers.com and extract extra information about each post (e.g. author, original publication date, post title, etc.). I will merge this data with the already constructed data.frame above.
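A sketch of that crawling step, with guessed XPath expressions and a placeholder URL (the real expressions are in the code linked at the end of this post):

library(XML)

# visit one r-bloggers post page and pull out some extra fields
post_info <- function(u) {
  doc  <- htmlParse(u)
  info <- c(title  = xpathSApply(doc, "//h1[@class='entry-title']", xmlValue)[1],
            author = xpathSApply(doc, "//a[@rel='author']", xmlValue)[1])
  free(doc)
  info
}

post_info("http://www.r-bloggers.com/some-post-url/")  # placeholder URL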

I will then save this data.frame to a .csv file in case anyone else wishes to analyse it (thus saving them some time). Finally I will subset the data.frame to only posts that link to one of my blog posts and inspect the output.
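Those last two steps only need a couple of lines once the data.frame exists; a sketch, assuming the scraper produces an author column (the file name is mine):

# save the full scrape for anyone who wants it, then keep only my own posts
write.csv(df, file = "rbloggers_facebook_data.csv", row.names = FALSE)
df.mine <- subset(df, author == "Tony Breyal")

First, though, a small helper is needed to source the scraper script from GitHub, because raw GitHub files are served over HTTPS, which base source() can’t read directly: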


source_https <- function(url, ...)  {
  # load package
  require(RCurl)

  source_script <- function(u) {
    # read script lines from website using a security certificate
    script <- getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

    # parse lines and evaluate in the global environment
    eval(parse(text = script), envir = .GlobalEnv)
  }

  # source each script
  invisible(sapply(c(url, ...), source_script))
}
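With that helper in place, the rbloggersFBXScraper() function itself can be sourced straight from GitHub; the path below is a placeholder, the real link is given at the end of this post:

source_https("https://raw.github.com/.../rbloggersFBXScraper.R")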

Following the procedure described in the Method section above:

  1. Log into Facebook
  2. Navigate to the R-bloggers Facebook wall
  3. Load data as far back as you like. I used the Better Facebook browser add-on tool to automatically load data right back to January 2010.
  4. Save this webpage as a “complete” html file.
  5. Source the rbloggersFBXScraper() function (R code link at the end of this post), then run the following, selecting the location of the html file when prompted:
df <- rbloggersFBXScraper()

Depending on your internet connection this could take quite some time to complete because it has to crawl the R-bloggers website for extra information about links posted since September 2011. To save you some time I’ve saved ALL the data which I have scraped into a single csv file. Here’s how to use it:

csv.location <- ""  # URL of the csv file on github (link at the end of this post)
txt <- getURL(csv.location, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)

It’s then a simple case of subsetting to find posts by a specific author:

find_posts <- function(df, name) {
  subset(df, author == name)
}

df2 <- find_posts(df, "Tony Breyal")
t(df2)  # transpose so the single matching row is easier to read

#                    30
# timestamp          "Wednesday, December 14, 2011 at 10:29pm"
# num.likes          "6 people like this"
# num.comments       "At least 1 comment"
# num.shares         "0"
# posted.by          "R bloggers"
# message            "I love these things :)"
# embedded.link      ""
# embedded.link.text "Introduction\n I was asked by a friend how to find the full final address of an URL \nwhich had been shortened via a shortening service (e.g., Twitter’s,\n Google’s, Facebook’s,,, TinyURL,, \, etc.). I replied I had no idea and maybe he should have a look \nover on ..."
# sample.comments    "Kai Feng Chew Yes! It's really cool! I changed a little bit to make it 2 lines to use the shorten function: load(\"unshort.Rdata\") unshort(\"ANY_SHORTEN_URL\") Example:, December 14, 2011 at 10:34pm · LikeUnlike ·  1ReplyTony Breyal ‎@Kai\n you might want to use the code from the updated version of the code on \nmy blog because it now handles both https. It won't work with \"\" however because that one require the user to be registered (and I'll admit I had not thought of that use case)Thursday, December 15, 2011 at 12:03am · LikeUnlike ·  1Reply"
# rbloggers.link     ""
# title              "Unshorten any URL with R"
# first.published    "December 13, 2011"
# author             "Tony Breyal"
# blog.name          " Consistently Infrequent » R"
# blog.link          ""
# tags               ", R, RCurl, rstats, tinurl, url"

So this tells me that my post entitled “Unshorten any URL with R” got six likes and at least one comment on Facebook. Nice. The “sample.comments” field shows what was commented, and that I posted a reply (based on that person’s comment I was able to improve the code, and I realised that it wouldn’t work with a shortened link which requires a user to be logged in first). Awesome stuff.

Final Thoughts

So now I have this data I am not quite sure what to do with it. I could do a sorted bar chart with each blog entry on the x-axis and the number of Facebook likes on the y-axis. I was thinking of doing some sentiment analysis on the sampled comments (I could only scrape visible comments, not the ones you have to press a button to load more of) but I don’t have the time to read up on that type of analysis. Maybe in the future 🙂
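For what it’s worth, the bar chart idea would only take a few lines; a rough sketch, assuming num.likes strings like “6 people like this” can be reduced to their number:

# reduce "6 people like this" to the number 6, then plot sorted likes per post
likes <- as.numeric(gsub("\\D", "", df$num.likes))
names(likes) <- df$title

op <- par(mar = c(12, 4, 2, 1))  # widen the bottom margin for long post titles
barplot(sort(likes, decreasing = TRUE), las = 2, ylab = "Facebook likes")
par(op)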

R code:
csv file:




  1. What about finding the most active users of our facebook page? Who posts the most comments? Any suggestions?

    Comment by Rico — April 7, 2012 @ 5:16 am

    • If it’s a page you’ve created then I’m sure Facebook has some tools to display this information.

      If it’s a page someone else has created, then figuring out ‘who’ is posting the most comments is a bit beyond me at present using just XPath and therefore your best bet would be to use the official API. I have posted one blog entry with a working example of using the Facebook API, plus others have done the same on their blogs (I can’t remember who they were off the top of my head but their code was easy to follow).

      Comment by Tony Breyal — April 7, 2012 @ 12:28 pm


  2. Hi, I would like to get consult on retrieving all the post information and write into csv file, and after which combining all of csv files. However, some of the posts do not have comments. May I consult on how to get rid of posts that do not have comments, and write into csv files for the posts that have comments? Thanks.

    Comment by Cheng Qian Yi — October 5, 2015 @ 4:16 am


