Consistently Infrequent

January 7, 2012

2011 in review

Filed under: Unclassified — Tony Breyal @ 5:02 pm

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

A San Francisco cable car holds 60 people. This blog was viewed about 3,000 times in 2011. If it were a cable car, it would take about 50 trips to carry that many people.

Click here to see the complete report.

January 6, 2012

R: Web Scraping R-bloggers Facebook Page

Filed under: R — Tony Breyal @ 8:50 pm

 

Introduction

R-bloggers.com is a blog aggregator maintained by Tal Galili. It is a great website both for learning about R and for keeping up-to-date with the latest developments (because someone will probably, and very kindly, post about the status of some R-related feature). There is also an R-bloggers facebook page where a number of articles from R-bloggers are linked into its feed. These can then be liked, commented upon and shared by other facebook users. I was curious whether anyone had commented on any of my R posts linked into this facebook feed, but manually and repeatedly clicking the ‘load more’ button on the facebook wall page and scanning for my posts is a very tedious process.

Objective

Automatically scrape the content of the R-bloggers facebook wall page via XPath and structure it into a dataframe, in order to see whether anyone has commented on, liked or shared any of my posts.

Initial Thoughts

I have posted previously about using the Facebook Graph API Explorer to get data from facebook. However, there is an issue whereby a random set of posts may not be returned by the API. Given that I’m specifically interested in a small subset of posts, this makes the API unsuitable for me, as there is a chance I might miss something interesting. (My feeling is that this has something to do with privacy settings, but I’m not sure, because then surely I wouldn’t be able to see a private post at all, whether through the facebook wall or the Graph API, unless the API is stricter about privacy.)

I could try logging directly into Facebook using RCurl, doing things like setting cookies, but that would require me to first learn HOW to set cookies in RCurl (and feeling motivated enough to spend the extra time required to do it). Seeing as I really want to spend the majority of my spare programming time learning python, I’m going to give this one a miss for now.

Therefore I want to do this scraping of data using the skills I already have (which is a rather basic understanding of XPath via the XML package). I was tempted to learn about setting cookies with RCurl but it’s Friday and that means I just want the weekend to start already…

Limitations

Links to blog posts on the Facebook wall often do not give information about the original author of the blog. This is rather annoying because it means that some web-crawling is necessary to find out who wrote each post, instead of that information being readily available in the first instance. I’m going to limit my code to crawling for extra information from R-bloggers.com links only, because it is very easy to scrape data off that website via XPath (and it saves me writing lots of code to try to handle other types of website).
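
To give a flavour of that crawling step, here is a rough sketch; the XPath pattern is an illustrative guess at R-bloggers’ markup rather than the exact expression the final script uses:

# a rough sketch of crawling an R-bloggers link for its author -- the XPath
# pattern below is an illustrative assumption about the site's markup
require(RCurl)
require(XML)

get_rbloggers_author <- function(u) {
  # fetch the page, following any redirects
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html, asText = TRUE)
  # pull out whichever node carries the author's name
  xpathSApply(doc, "//a[@rel='author']", xmlValue)
}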

The R-bloggers facebook page has wall posts going back to January 2010. Prior to September 2011, blog posts pointed to the “notes” page on facebook. This prevents me getting extra data about those posts because I can’t automatically navigate to those facebook pages. From September 2011 onwards, however, the posts point to R-bloggers.com and so can be scraped for further information rather easily. Luckily I only started posting in November 2011 so this isn’t an issue for me.

Not all wall posts indicate how many comments they have if there are only a few. I’m not sure how to get around this; I might just have to record “at least 1 comment” in that situation.

Most of the wall posts are made up of links to R-bloggers.com and various messages by Facebook users. Instead of filtering as I go, I’m just going to grab AS MUCH INFORMATION off the wall feed as I can and then filter at the end. I’ll put the unfiltered information into a csv file and post it up on github for anyone who may want it.

Method

The easiest method is to log into Facebook via the browser, navigate to the R-bloggers Facebook page, use the socialfixer.com browser add-on “Better Facebook” to automatically and painlessly load all posts in the R-bloggers feed going back to January 2010, and then save that page to the hard drive using, in google chrome terminology, the “Web Page, Complete” option (NOT the “Web Page, HTML Only” option, because for some reason that doesn’t work well with my code).

Once the data is in a html file, use XPath expressions via Duncan Temple Lang’s XML package to extract whatever information I can in the first instance and store it in a data.frame.
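
As a flavour of this step, the sketch below parses the saved file; the class test in the XPath expression is an illustrative assumption about Facebook’s markup at the time, not the exact expression the full script uses:

# a minimal sketch of parsing the saved wall page; the class name in the
# XPath expression is an illustrative assumption about Facebook's markup
require(XML)

# pick the saved "Web Page, Complete" html file when prompted
doc <- htmlParse(file.choose())
posts <- xpathSApply(doc, "//div[contains(@class, 'uiStreamMessage')]", xmlValue)
head(posts)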

Once this initial data is in place, I will crawl any posts which link to R-bloggers.com and extract extra information about the post (e.g. Author, original publication date, post title, etc.). I will merge this data with the already constructed data.frame above.

I will then save this data.frame to a .csv file in case anyone else wishes to analyse it (thus saving them some time). Finally I will subset the data.frame to only posts that link to one of my blog posts and inspect the output.

Solution

source_https <- function(url, ...)  {
  # load package
  require(RCurl)

  source_script <- function(u) {
    # read script lines from website using a security certificate
    script <- getURL(u, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

    # parse lines and evaluate in the global environment
    eval(parse(text = script), envir = .GlobalEnv)
  }

  # source each script
  sapply(c(url, ...), source_script)
}

Following the procedure described in the Method section above:

  1. Log into facebook
  2. Navigate to the R-bloggers facebook wall
  3. Load data as far back as you like. I used the Better Facebook browser add-on tool to automatically load data right back to January 2010.
  4. Save this webpage as a “complete” html file.
  5. Run the following code, selecting the location of the html file when prompted:
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R")
df <- rbloggersFBXScraper()

Depending on your internet connection this could take quite some time to complete because it has to crawl the R-bloggers website for extra information about links posted since September 2011. To save you some time I’ve saved ALL the data which I have scraped into a single csv file. Here’s how to use it:

library(RCurl)
csv.location <- "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv"
txt <- getURL(csv.location, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)

It’s then a simple case of subsetting to find posts by a specific author:

find_posts <- function(df, my.name) {
  subset(df, author == my.name)
}

df2 <- find_posts(df, "Tony Breyal")
t(df2[2,])

#                   30
# timestamp         "Wednesday, December 14, 2011 at 10:29pm"
# num.likes         "6 people like this"
# num.comments      "At least 1 comment"
# num.shares        "0"
# posted.by         "R bloggers"
# message           "I love these things :)http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link      "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# embeded.link.text "Introduction\n I was asked by a friend how to find the full final address of an URL \nwhich had been shortened via a shortening service (e.g., Twitter’s t.co,\n Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, \nOw.ly, etc.). I replied I had no idea and maybe he should have a look \nover on ..."
# sample.comments   "Kai Feng Chew Yes! It's really cool! I changed a little bit to make it 2 lines to use the shorten function: load(\"unshort.Rdata\") unshort(\"ANY_SHORTEN_URL\") Example:http://cloudst.at/index.php?do=%2Fkafechew%2Fblog%2Funshorten-url-function%2FWednesday, December 14, 2011 at 10:34pm · LikeUnlike ·  1ReplyTony Breyal ‎@Kai\n you might want to use the code from the updated version of the code on \nmy blog because it now handles both https. It won't work with \"http://1.cloudst.at/myeg\" however because that one require the user to be registered (and I'll admit I had not thought of that use case)Thursday, December 15, 2011 at 12:03am · LikeUnlike ·  1Reply"
# rbloggers.link    "http://www.r-bloggers.com/unshorten-any-url-with-r/"
# title             "Unshorten any URL with R"
# first.published   "December 13, 2011"
# author            "Tony Breyal"
# blog.name         " Consistently Infrequent » R"
# blog.link         "http://tonybreyal.wordpress.com/2011/12/13/unshorten-any-url-created-using-url-shortening-services-decode_shortened_url/"
# tags              "dft.ba, R, RCurl, rstats, tinurl, url"

So this tells me that my post entitled “Unshorten any URL with R” got six likes and at least one comment on facebook. Nice. The “sample.comments” field shows what was commented, and that I posted a reply (based on that person’s comment I was able to improve the code, and to realise that it wouldn’t work with a shortened link which requires a user to be logged in first). Awesome stuff.

Final Thoughts

So now I have this data I am not quite sure what to do with it. I could do a sorted bar chart with each blog entry on the x-axis and the number of facebook likes on the y-axis. I was also thinking of doing some sentiment analysis on the sampled comments (I could only scrape visible comments, not the ones you have to press a button to load), but I don’t have the time to read up on that type of analysis. Maybe in the future :)
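
For what it’s worth, the sorted bar chart would only take a few lines. A sketch, assuming the data.frame returned by rbloggersFBXScraper() above, with its num.likes strings of the form “6 people like this”:

# a sketch of the sorted bar chart idea (assumes df from rbloggersFBXScraper())
require(ggplot2)

# pull the leading digits out of strings like "6 people like this"
df$likes <- as.numeric(gsub("^(\\d+).*", "\\1", df$num.likes))

# top 20 posts by likes, plotted as a flipped bar chart
top20 <- head(df[order(-df$likes), ], 20)
ggplot(top20, aes(x = reorder(title, likes), y = likes)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Blog post") + ylab("Facebook likes")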

R code: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/rbloggersFBXScraper.R
csv file: https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/RBloggersFBXScraper/data.csv

January 4, 2012

Doctor Who (2005), Series Six plus Christmas Special

Filed under: TV — Tony Breyal @ 1:16 pm

Premise

“No, look, there’s a blue box. It’s bigger on the inside than it is on the outside. It can go anywhere in time and space and sometimes even where it’s meant to go. And when it turns up, there’s a bloke in it called The Doctor and there will be stuff wrong and he will do his best to sort it out and he will probably succeed ’cause he’s awesome. Now sit down, shut up, and watch ‘Blink’.” — Neil Gaiman

Series Six Plot

The series this year revolved around two central points which were (1) the death of The Doctor and (2) the revelation of who River Song really is. Or was. Or will be. Tenses are difficult when it comes to DW! And then when that was all resolved, we had a Christmas Special inspired by the C. S. Lewis novel The Lion, The Witch and The Wardrobe.

Verdict

First off, the opening two episodes were brilliant, with the opening set of scenes introducing us to The Doctor’s death in the future. Just absolutely bloody brilliant. We are introduced to the Silence, a group of aliens whom you can only remember whilst looking at them, because once you turn away you completely forget they were ever there. This is semi-creepy in its own right but gets bumped up to a whole new level when Amy has to mark her own skin each time she encounters a Silent, to remind herself that she’s seen one and needs to get the hell out of wherever she is – but because she forgets, she ends up covered in more and more marks, which tells her that she is in serious trouble. Very creepy indeed and therefore very awesome. What makes the Silence an even more deadly foe is that anything they say acts as a subliminal message: the person hearing it acts on it but has no idea why they’ve done what they’ve just done. This latter point is ultimately the Silence’s undoing, because The Doctor tricks one Silent into saying “you should kill us on sight”, records it, and then plants it into the one piece of footage every human in the future is likely to see at some point in their lifetime – the video footage of the moon landing. Nobody will remember seeing the Silent but they will act on that one message and not even realise why they’re doing it. Bloody writing genius, that is, and that’s just one of the reasons why Steven Moffat is my favourite script writer.

Other notable episodes, outside the series story arc, are Neil Gaiman’s “The Doctor’s Wife”, in which we meet Idris, a physical manifestation of The Doctor’s time-travelling spaceship (the TARDIS), and Tom MacRae’s “The Girl Who Waited”, in which we see an aged version of Amy who has been living in seclusion for many years. Both are excellent episodes with tons of re-watch factor.

Back to the series story arc: we get a revelation of who River is in the fabulous episode “A Good Man Goes To War”, one of the fastest-paced episodes I can remember and one which delivers on almost all fronts. Special mention to Rory, who proves in this episode, and yet again in the series, why he may just be the most bad-ass companion of them all – love me a bit of Rory I do! This episode is then followed by one of the best titles of any Doctor Who episode, “Let’s Kill Hitler”, in which we get more River revelations.

Then we have the finale, in which The Doctor escapes death. While a lot of fun, it seemed like a big cheat given the build-up to that moment. I know it makes sense and I have no real problem with that; it’s just that when you start the series with The Doctor’s death, you can’t help but hope that there’s going to be a bigger pay-off in how that is dealt with. But still, a fun episode.

Overall it was a most excellent series and I feel Doctor Who has never been better since its return in 2005.

Oh, and the Christmas Special. I wasn’t too impressed by it (I much preferred last year’s, which took inspiration from Dickens’ “A Christmas Carol” and which I still maintain is among the best ever Doctor Who episodes), but it had some great jokes and was entertaining at least. I just didn’t feel for any of the characters that much, and I really need that connection in order to enjoy a story. However, the final scene, in which The Doctor visits Amy & Rory unannounced for Christmas dinner and is told that they have a place set for him, and in fact always set a place aside for him… well, I’ll admit that I had a couple of tears running down my cheeks because of his private reaction to knowing that he has a place there and that he can still feel such emotions.

Bring on series seven!

Edit

BTW, my previous post has a chart of British TV ratings for the last 50-odd years of first-run Doctor Who episodes.

Plotting Doctor Who Ratings (1963-2011) with R

Filed under: R — Tony Breyal @ 1:52 am

Introduction

First day back to work after New Year celebrations and my brain doesn’t really want to think too much. So I went out for lunch and had a nice walk in the park. Still had 15 minutes to kill before my lunch break was over and so decided to kill some time with a quick web scraping exercise in R.

Objective

Download the last 49 years of British TV ratings data for the programme Doctor Who (the longest-running science fiction television show in the world and, in terms of overall broadcast ratings, DVD and book sales and iTunes traffic, the most successful science fiction series of all time) and make a simple plot of it.

Method

Ratings are available from doctorwhonews.net as a series of paginated tables. This means we can use the RCurl and XML packages to download the first seed webpage, extract the table of ratings, and use XPath to get the weblink to the next page of ratings. Due to time constraints I’m not going to optimise any of this (though given the small data set it probably doesn’t need optimisation anyway).

Solution

get_doctor_who_ratings <- function() {
  # load packages
  require(RCurl)
  require(XML)

  # return Title, Date and Rating
  format_df <- function(df) {
    data.frame(Date = as.POSIXct(df$Date, format = "%a %d %b %Y"),
               Title = df$Title,
               Rating = as.numeric(gsub("(\\s+).*", "\\1", df$Rating)),
               stringsAsFactors = FALSE)
  }

  # scrape data from web
  get_ratings <- function(u) {
    df.list <- list()
    i <- 1
    while(!is.null(u)) {
      html <- getURL(u)
      doc <- htmlParse(html, asText = TRUE)
      df.list[[i]] <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
      u.next <- as.vector(xpathSApply(doc, "//div[@class='nav']/a[text()='NEXT']/@href"))
      if(is.null(u.next)) {
        return(df.list)
      }
      u <- sub("info.*", u.next, u)
      i <- i + 1
    }
    return(df.list)
  }

  ### main function code ###
  # Step 1: get tables of ratings for each page that is available
  u <- "http://guide.doctorwhonews.net/info.php?detail=ratings"
  df.list <- get_ratings(u)

  # Step 2: format ratings into a single data.frame
  df <- do.call("rbind", df.list)
  df <- format_df(df)

  # Step 3: return data.frame
  return(df)
}

Using the above, we can pull the ratings into a single data.frame as follows:


# get ratings database
ratings.df <- get_doctor_who_ratings()
head(ratings.df)

# Date Title Rating
# 1 1979-10-20 City of Death - Episode 4 16.1
# 2 1979-10-13 City of Death - Episode 3 15.4
# 3 1979-09-22 Destiny of the Daleks - Episode 4 14.4
# 4 1979-10-06 City of Death - Episode 2 14.1
# 5 1979-09-15 Destiny of the Daleks - Episode 3 13.8
# 6 1975-02-01 The Ark In Space - Episode 2 13.6


Plot

We can plot this data very easily using Hadley Wickham’s ggplot2 package:


# do a raw plot
require(ggplot2)
ggplot(ratings.df, aes(x=Date, y=Rating)) + geom_point() + xlab("Date") + ylab("Ratings (millions)") + ggtitle("Doctor Who Ratings (1963-Present) without Context")

The gap in the data is due to the show having been put on hiatus between 1989 and 2005, with the exception of the American TV movie in 1996.
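
If you want that gap made explicit on the chart, shading the hiatus period only takes an annotate layer. A quick sketch, using the classic run’s final broadcast and the revival’s first episode as the boundary dates:

# a sketch that shades the 1989-2005 hiatus on the plot
require(ggplot2)
ggplot(ratings.df, aes(x = Date, y = Rating)) +
  annotate("rect", xmin = as.POSIXct("1989-12-06"), xmax = as.POSIXct("2005-03-26"),
           ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_point() +
  xlab("Date") + ylab("Ratings (millions)")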

CAUTION 

This was just a fun coding exercise to quickly pass some time.

The chart above should not be interpreted without the proper context, as it would be very misleading to suggest that the show was more popular in earlier years than in later years. Bear in mind that TV habits have changed dramatically over the past 50-odd years (I myself barely watch TV live any more and instead use catch-up services like BBC iPlayer, which the ratings above do not account for), that there were fewer channels in Britain back in 1963, that the way BARB collects ratings has changed, and that the prestige of the show has changed over time (from once being an embarrassment for the BBC, with its criminally low budgets and wobbly sets, to now being one of its top flagship shows).

A final note

Although I was part of the generation during which Doctor Who was taken off the air, I do vaguely remember some episodes from my childhood where The Doctor was played by Sylvester McCoy, who to this day is still “my doctor” (as the saying goes), and I would put him right up there with Tennant and Smith as one of the greats. Best. Show. Ever.

You can find a quick review of series six (i.e. the sixth series of episodes since the show’s return in 2005) right here, and because I love the trailer so much I’ll embed it below:

December 20, 2011

Dexter, Series Six

Filed under: TV — Tony Breyal @ 10:22 am

Premise

After seeing his mother butchered in front of him while only a toddler, Dexter is adopted by a police officer who tries to bring some sense of belonging and family to his life. However, as Dexter ages he starts to display tendencies towards psychopathy, and so his adoptive father, realising that he can’t stop this, instead refocuses Dexter’s energy into living by a code whereby he can only kill someone if they’ve already committed a murder and are likely to do so again. The series thus revolves around Dexter as an adult who has become a serial killer, but one who will only kill murderers likely to kill again.

Plot

The sixth series story arc revolves around several ideas: (i) Dexter questioning his need to kill, because all he really wants is to be a better father for his son and he thinks that may be possible after meeting another killer who has turned his own life around; (ii) the main antagonists of the series, a couple of religious fanatics who want to end the world; and (iii) Dexter’s younger sister Deb, who is not a blood relation because Dexter was adopted, realising that she has romantic feelings towards him.

Verdict

I’ve always enjoyed Dexter, not only because of its fantastic writing and performances but also because it has an anti-hero as the lead character. I can’t think of many other shows which have, essentially, a ‘bad’ guy as the main focus whom the audience is also rooting for. I put ‘bad’ in quotes because even though he is killing other murderers, and thus in the process preventing potential future murders, at the end of the day he is still killing people, and that is wrong. Even so, I find myself cheering him on because it’s not difficult to see the good he achieves.

I think the show is having a hard time coming up with a good enough rival for Dexter after the Ice Truck Killer from series one and Trinity from series four. Having said that, it was still somewhat interesting to see Dexter battle the religious fanatics this series, not because they were any kind of real match for him but because of how he grew as a character. At one point he realises that the best thing to do is to call the police instead of trying to satisfy his dark passenger, and that to me is the reason why Dexter is such a likeable character – he may be a killer but he also has a code, and he realises that some things are more important than himself.

The one plot point this series that I’m not really behind, however, is Deb realising that she has romantic feelings towards Dexter. I know they’re not blood-related, but they were brought up together and that just doesn’t sit right with me. I can see where the logic comes from within the series, given the relationship and shared experiences these two have, but there’s still that initial ‘yuck’ factor for me. I’m actually interested to see how Dexter will react to this revelation in some future series. I know it’s not technically incest but it feels like it would be. Interestingly enough, as a side note, if one removes the issue around having children then it actually becomes very difficult to form a logical argument for why incest between two consenting adults should be wrong. Just thought I’d add that last point because it is interesting to think about.

The final scene this series matches my shock from the end of series four, when we found out that Trinity had killed Rita, Dexter’s wife. Having Deb walk in as Dexter is about to kill the religious fanatic, and ending series six there, was a magnificent moment. I honestly thought that, given how she hadn’t worked it out over the first five series, she probably wouldn’t this series either. I really hope series seven picks up at the scene where series six ended, because I really want to see how she handles the realisation that her brother is a killer. Series seven is going to rock!

December 19, 2011

Python: Hello, World!

Filed under: Python — Tony Breyal @ 9:57 pm

Introduction

Stanford is running a series of open online courses this January. One of these courses is about Text Mining (aka Natural Language Processing, or NLP for short). The course has a prerequisite of being able to program in either Java or Python.

I was going to spend my Christmas break re-learning C++, but as I really want to try this course out I’m instead going to try to learn Python by following this online Google class, because it’s a language I often hear about from other R users. Having done the first two modules of that google course, I thought I should code a quick ‘hello world’ program on my blog, for the sake of geekery if nothing else.

Objective

Write some python code which will print out “Hello, world!”.

Solution

Ubuntu Linux already comes with python pre-installed by the looks of it, so I didn’t need to do anything special. I downloaded the Spyder IDE because it’s the closest thing to RStudio (which I now use when coding in R) that I could see, and it comes highly recommended on the various websites I visited. Anyway, here’s the code I entered into the script window of the Spyder IDE. To run it, I pressed F5, which prompted me to save the file, after which “Hello, world!” was printed to the integrated console:

def main():
  print 'Hello, world!'

if __name__ == '__main__':
  main()

Line 1 tells us that we have defined [def] a function called main() and that its body starts after the colon [:].

Line 2 is indented to show that it belongs to main(). This is VERY important because, unlike some other programming languages, python does not have curly braces “{” and “}” to tell us where a function starts and ends, but instead uses indentation to mark the boundaries (so this formatting is not optional). I’m not sold on this concept yet, though I suppose it does save a bit on having to type the curly braces explicitly, because I would normally indent my code anyway.

Lines 4 and 5 tell us that this file (lines 1-5) can be used either as a module for import into another python module or as a stand-alone program. This idiom seems to appear in every python file, so I guess I had better get used to it. When I run this file it is recognised as a standalone program, which starts off by calling the main() function via the call on line 5.

December 18, 2011

30 Rock, Series Two

Filed under: TV — Tony Breyal @ 1:56 pm

Premise

The premise of the show seems to have evolved since series one: from being about a head writer for an SNL (Saturday Night Live) type comedy sketch show trying to keep the wheels of the machine rolling whilst attempting to have some kind of personal life outside of work, to being a show about the people who work behind the scenes and the everyday drama they have to deal with (or, in some cases, cause).

Plot

30 Rock has an episodic format where each episode is for the most part self-contained, so the audience can jump in and out at any point. As far as the series two arc is concerned, the main story that comes to mind is mostly about Donaghy trying to become the new chairman of the network.

Verdict

I tend to watch 30 Rock whilst doing my weekly ironing, as it’s not really a show which requires my full attention; if I miss something because, say, I’m hanging up a shirt I’ve just finished removing the creases from, then it’s not at all a big deal and I can usually still figure out what’s happening.

Kenneth is by far my favourite character; his innocence makes any scene he’s involved in just that much more entertaining. I like all the other characters too, but I’d say Kenneth and Donaghy make the show very watchable for me.

To be honest I can’t seem to recall much of what happened in series two but I know I enjoyed it enough to want to continue on to series three as my primary show to watch whilst I do my ironing.

December 13, 2011

Unshorten (almost) any URL with R

Filed under: R — Tony Breyal @ 6:57 pm

Introduction

I was asked by a friend how to find the full final address of an URL which had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied I had no idea and maybe he should have a look over on StackOverflow.com or, possibly, the R-help list, and if that didn’t turn up anything to try an online unshortening service like http://unshort.me.

Two minutes later he came back with this solution from Stack Overflow which, surprisingly to me, contained an answer I had provided about 1.5 years ago!

This has always been my problem with programming: I learn something useful and then completely forget it. I’m kind of hoping that having this blog will aid me in remembering these sorts of things.

The Objective

I want to decode a shortened URL to reveal its full final web address.

The Solution

The basic idea is to use the getURL function from the RCurl package, telling it to retrieve only the header of the webpage it is connecting to, and then to extract the URL from the “Location” field of that header.

decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
  decode <- function(u) {
    Sys.sleep(0.5)
    x <- try( getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) )
    if(inherits(x, 'try-error') | length(grep(".*Location: (\\S+).*", x))<1) {
      return(u)
    } else {
      return(gsub('.*Location: (\\S+).*', '\\1', x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs
  urls <- c(url, ...)
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}

And here’s how we use it:

# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"

You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R

Limitations

A comment on the R-bloggers facebook page for this blog post made me realise that this doesn’t work with every shortened URL such as when you need to be logged in for a service, e.g.,

http://1.cloudst.at/myeg

decode_short_url("http://tinyurl.com/adcd",
"http://www.google.com",
"http://1.cloudst.at/myeg")

# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"

I still don’t know why this might be a useful thing to do but hopefully it’s useful to someone out there :)

December 8, 2011

A Round-Heeled Woman, Aldwych Theatre, London (2011 Production)

Filed under: Theatre — Tony Breyal @ 6:20 pm

The Plot

A 66-year-old woman who has been celibate for 30 years decides to put an advertisement in the paper for a man she can both like and have sex with. Lots of sex.

The Theatre

For this production all the seats from Row M in the Stalls to the back of the theatre have been partitioned off by a series of fake walls to create a more intimate atmosphere. We were sat in J6 and J7, which have a decent view of the entire stage when the people in front either slouch or lean to one side, with J6 having the added bonus of also being an aisle seat in a row with limited legroom. The row directly in front is HH, with its seats not well staggered against row J and very little rake between them.

The Verdict

I wasn’t expecting to like this play and so was pleasantly surprised to find myself both laughing along to most of the jokes and actually caring about the protagonist, maybe even admiring her somewhat for having the guts to go out and get what she wants. The last time I was surprised to find myself enjoying a show I had low expectations for was The Drowsy Chaperone, and the fact it didn’t run longer in the West End is a tragedy in my opinion.

The opening joke was a particular highlight and pretty funny, with her masturbating while starting to have phone sex and then telling the man on the other end that of course she is alone and would never do anything like that in front of an audience – at which point she breaks the fourth wall and talks directly to the audience while acting slightly embarrassed. My friend and I had an interesting conversation during the interval about this brief scene and the difference between how male and female masturbation is often portrayed on stage, in movies, on TV and in literature – for women it’s usually shown as a sensual act with a deep emotional connection, whereas for men it’s usually shown as a purely animal urge for sexual release. Now of course there are exceptions, but it’s still sad that these differences are what is so often portrayed when in reality, for both men and women, it can be all of those things. Also, I was surprised she didn’t have any tissues with her.

A low point was when a memory is played out of her taking her father to a strip-bar so he could have a lap dance (this was when she was younger and wanted his approval). During the lapdance one of the actresses exposes her breasts (large fake plastic ones which have been placed over her real chest) and it is really quite seedy and off-putting to watch – but then again, on reflection, I suppose maybe that was kind of the point being made.

One of the most interesting parts of the play is how the leading lady doesn’t just hop into bed with anyone, but has standards and tries to find the attractive qualities in the men she meets. One man she meets is 30 years her junior and this causes her friends to feel embarrassed for her. Again, this is one of those things I’ve never really understood, because why should anyone judge what two consenting adults get up to in the privacy of their lives if they’re not hurting anyone? Why hinder their happiness – it’s their lives after all.

Overall, a surprisingly fun play made watchable not only by the performances of the actors, who are superb, but also by the fact that it’s not just about sex but about being alone and needing to feel desired and touched by another human being, and what kind of situations that can lead to. An interesting way to spend a couple of hours, that’s for sure, with lots of jokes along the way and some heartbreak.

Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!

Filed under: R — Tony Breyal @ 1:41 pm

Introduction

The old r-wiki optimisation challenge describes a string generation problem which I have blogged about previously both here and here.

The Objective

To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

Solutions One Through Thirteen

A variety of different approaches are illustrated on the r-wiki page which show the performance benefits of things like vectorisation, variable initialisation, linking through to a compiled programming language, reducing a problem to its component parts, etc.

The Fourteenth Solution

The main speed improvement here comes from replacing the function “paste” with “file.path”. This use of “file.path” with parameter fsep=”” only works correctly here because there is never a zero-length character vector for it to deal with. I only learned about this approach when I happened to see this tweet on twitter with hashtag #rstats and read the associated help file, where it says that it is faster than paste.

generateIndex14 <- function(n) {
  # initialise vectors
  s <- vector(mode = "character", length = n)

  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # paste strings together
  unlist(lapply(1:(n-1), function(i) file.path("i", s[i], ".", s[(i+1):n], fsep = "") ), use.names = FALSE)
}

Timings:

               test  elapsed    n replications
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Fifteenth Solution: Rcpp

This solution comes from Romain Francois and is based on the tenth solution but implemented in C++ using the R package Rcpp. See his blog for the implementation. This is the sort of thing I would love to learn to do myself but just need to find the time to re-learn C++, though I doubt that’ll happen any time soon as I’m hoping to start my MSc in Statistics next year. This is a great solution though.

Timings:

               test  elapsed    n replications
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

The Sixteenth Solution

When I was writing up this post I thought up a sixteenth solution (as seems to be the pattern with me on this blog!). This solution gets its speed-up by generating the largest set of strings, those starting “i001.”, first, and then replacing the “001” part with “002”, “003”, “004”, etc., for each increment up to and including n-1.


generateIndex16 <- function(n) {
  # initialise vectors
  str <- vector("list", length = n-1)
  s <- vector(mode = "character", length = n)

  # set up strings
  s <- sprintf("%03d", seq_len(n))
  str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "")

  # generate string sequences
  str[2:(n-1)] <- lapply(2:(n-1), function(i) sub("001", s[i], str[[1]][i:(n-1)], fixed=TRUE))
  unlist(str)
}

The above requires matching the “001” part first and then replacing it. However, we know that “001” will ALWAYS be in character positions 2, 3 and 4, and so there may be a way to avoid the matching part altogether (i.e. replace a fixed-position substring with another string of equal or larger length), but I could not work out how to do that outside of a regular expression. Sadface.
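
Having said that, for the equal-length replacements in this particular problem, R’s substr<- assignment does overwrite a fixed-position substring without any pattern matching, so something along these lines might work (an untimed sketch, with a hypothetical generateIndex17, not one of the numbered solutions above):

# an untimed sketch: substr<- overwrites character positions 2-4 in place,
# which only works here because every replacement string has the same length
generateIndex17 <- function(n) {
  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # largest set first, as in solution sixteen
  str <- vector("list", length = n - 1)
  str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "")

  # overwrite the fixed-position "001" block instead of pattern matching
  str[2:(n-1)] <- lapply(2:(n-1), function(i) {
    x <- str[[1]][i:(n-1)]
    substr(x, 2, 4) <- s[i]
    x
  })
  unlist(str)
}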

Timings:

               test  elapsed    n replications
 generateIndex16(n) 20.77200 2000           50
 generateIndex15(n) 23.30100 2000           50
 generateIndex14(n) 27.27500 2000           50
 generateIndex13(n) 33.09300 2000           50
 generateIndex12(n) 35.31344 2000           50
 generateIndex11(n) 36.32900 2000           50

Solution Comparisons For Different N

I like ggplot2 charts and so ran my computer overnight to generate timing data for the last several solutions over different N.

Final Thoughts

I’m pretty sure that any further speed improvements will come from some or all of the following:

  • doing the heavy lifting in a compiled language and interfacing with R
  • running in parallel (I actually got this to work on linux by replacing lapply with mclapply from the parallel R package, but the downside was having to use much more memory for larger values of N; plus it only works in serial fashion on Windows)
  • working out an efficient way of replacing a fixed positioned substring with a string of equal or great length
  • compiling the function into R bytecodes using the compiler package function cmpfun (a minimal sketch follows below)
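
The last item in that list is at least trivial to try; a minimal sketch (untimed here):

# byte-compile the sixteenth solution with cmpfun
require(compiler)
generateIndex16c <- cmpfun(generateIndex16)
identical(generateIndex16(100), generateIndex16c(100))  # should be TRUE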

It would also be interesting to profile the memory usage of each function.

This was a fun challenge – if you find some spare time, why not try your hand at it? You might come up with something even better! :)
