Consistently Infrequent

December 13, 2011

Unshorten (almost) any URL with R

Filed under: R, by Tony Breyal @ 6:57 pm

Introduction

A friend asked me how to find the full final address of a URL that had been shortened by a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied that I had no idea and suggested he look on StackOverflow.com or, possibly, the R-help list and, if that didn’t turn up anything, try an online unshortening service like http://unshort.me.

Two minutes later he came back with a solution from Stack Overflow which, to my surprise, contained an answer I had provided about 1.5 years ago!

This has always been my problem with programming: I learn something useful and then completely forget it. I’m hoping that keeping this blog will help me remember these sorts of things.

The Objective

I want to decode a shortened URL to reveal its full final web address.

The Solution

The basic idea is to use the getURL function from the RCurl package, telling it to retrieve only the header of the page it connects to (without following the redirect), and then to extract the destination URL from the Location field of that header.

decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
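  # Request only the headers for one URL and return the redirect target from
  # the Location field; return the input unchanged if the request fails or
  # no redirect is reported.
  decode <- function(u) {
    Sys.sleep(0.5)  # pause between requests
    x <- try( getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) )
    if(inherits(x, 'try-error') || !grepl("Location: \\S+", x)) {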
      return(u)
    } else {
      return(gsub('.*Location: (\\S+).*', '\\1', x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs
  urls <- c(url, ...)
  l <- vector(mode = "list", length = length(urls))
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}

And here’s how we use it:

# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"

You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R
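
To see what the extraction step does without making a network request, the same regular expression can be applied to a canned header string. This is only a sketch: the header text is made up for illustration, and extract_location is a hypothetical helper that is not part of the function above. It matches the field name case-insensitively, since HTTP header names are not case-sensitive and a server may send a lowercase location: field.

```r
# Made-up response header, similar in shape to what
# getURL(..., header = TRUE, nobody = TRUE) returns for a redirecting URL.
header <- paste0("HTTP/1.1 301 Moved Permanently\r\n",
                 "Location: http://www.r-project.org/\r\n",
                 "Content-Type: text/html\r\n\r\n")

# The extraction used in decode_short_url(): keep only the captured target.
target <- gsub(".*Location: (\\S+).*", "\\1", header)

# Hypothetical variant: case-insensitive, and NA when there is no redirect.
extract_location <- function(header) {
  m <- regmatches(header,
                  regexpr("location:\\s*\\S+", header, ignore.case = TRUE))
  if (length(m) == 0) return(NA_character_)
  sub("^location:\\s*", "", m, ignore.case = TRUE)
}

print(target)
print(extract_location(header))
print(extract_location("HTTP/1.1 200 OK\r\n\r\n"))
```

The NA return value is a design choice for the sketch; decode_short_url itself falls back to returning the input URL instead.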

Limitations

A comment on the R-bloggers Facebook page for this blog post made me realise that this doesn’t work with every shortened URL, for example when you need to be logged in to a service:

http://1.cloudst.at/myeg

decode_short_url("http://tinyurl.com/adcd",
                 "http://www.google.com",
                 "http://1.cloudst.at/myeg")

# $`http://tinyurl.com/adcd`
# [1] "http://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"
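
Another case the function does not handle, raised in the comments below, is a URL that has been shortened more than once: only the first Location header is read. A minimal sketch of the idea suggested there, resolving repeatedly until two successive results agree, might look like the following. The names follow_redirects, resolve_once and max_hops are hypothetical, and the example uses a made-up lookup table in place of real network requests so that it runs offline.

```r
# Hypothetical helper: apply a one-step resolver repeatedly until the URL
# stops changing, or until max_hops steps have been taken (guards against
# redirect loops).
follow_redirects <- function(url, resolve_once, max_hops = 10) {
  current <- url
  for (i in seq_len(max_hops)) {
    nxt <- resolve_once(current)
    if (identical(nxt, current)) break  # two successive results agree
    current <- nxt
  }
  current
}

# Offline stand-in for a single network hop (made-up addresses).
fake_hops <- c("http://a.example/1" = "http://b.example/2",
               "http://b.example/2" = "http://www.example.org/")
fake_resolve <- function(u) {
  if (u %in% names(fake_hops)) unname(fake_hops[u]) else u
}

final <- follow_redirects("http://a.example/1", fake_resolve)
print(final)
```

With the real function, the resolver could presumably be function(u) decode_short_url(u)[[1]], at the cost of one extra request (and one pause) per hop.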

I still don’t know why this might be a useful thing to do but hopefully it’s useful to someone out there :)


14 Comments »

  1. Since no one else wrote it – I wanted to say that this is a good post – thanks for putting it together :)

    Comment by Tal Galili — December 14, 2011 @ 1:34 pm

  2. Thank you for this, this is something that I was looking for. Unfortunately, it doesn’t resolve the final URL when it’s been shortened twice.

    Comment by Aleksei Beloshytski (@LadderRunner) — February 4, 2012 @ 7:53 pm

    • Running it recursively would probably work, with the stopping condition being two successive recursions which resolve to the same final URL.

      Comment by Tony Breyal — October 14, 2012 @ 3:37 pm

  3. I mean that ideally it would also check (optionally) whether the URL has been shortened several times. However, it could be run recursively :)

    Comment by Aleksei Beloshytski (@LadderRunner) — February 4, 2012 @ 7:57 pm

  5. Great post, but I’m having problems running the function on URL lists with more than 1K links. R either halts or crashes while attempting to access unmapped memory. Perhaps we could add a pause to the function?

    Comment by Marco T. Bastos — November 16, 2012 @ 3:52 pm

    • Does the update to the code improve the situation? I’ve added a pause of 0.25 seconds between requests plus I’ve preallocated memory for the list object so hopefully if there’s a memory issue this will identify it early on. Other than that I would simply use a for-loop, something along the following lines (untested):

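      out <- vector(mode = "list", length = length(urls))
      names(out) <- urls
      for(i in seq_along(urls)) {
        # decode_short_url() returns a one-element list, so take its content
        out[[i]] <- decode_short_url(urls[i])[[1]]
      }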
      

      Comment by Tony Breyal — November 17, 2012 @ 12:31 pm

      • Not really. R still crashes when resolving more than 1500 URLs, despite the 0.5 seconds between requests and the preallocated memory. Check it out:

        > urls.resolved
        > urls.resolved <- decode_short_url(urls.shortened[1:2000,])

        *** caught segfault ***
        address (nil), cause 'memory not mapped'

        Traceback:
        1: .Call("R_curlMultiPerform", curl, as.logical(multiple), PACKAGE = "RCurl")
        2: curlMultiPerform(multiHandle)
        3: getURIAsynchronous(url, ..., .opts = .opts, write = write, curl = curl)
        4: getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
        5: doTryCatch(return(expr), name, parentenv, handler)
        6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
        7: tryCatchList(expr, classes, parentenv, handlers)
        8: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste(prefix, "\n ", sep = "") } else prefix <- "Error : " msg <- paste(prefix, conditionMessage(e), "\n", sep = "") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
        9: try(getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
        10: FUN(X[[1L]], ...)
        11: lapply(urls, decode)
        12: decode_short_url(br.shortened[1:2000, ])

        Possible actions:
        1: abort (with core dump, if enabled)
        2: normal R exit
        3: exit R without saving workspace
        4: exit R saving workspace
        Selection:

        Comment by Marco T. Bastos — November 17, 2012 @ 1:42 pm

        • By the way I’ve tested the code running R 2.15 both in Linux (Debian) and Windows 7 64.

          Comment by Marco T. Bastos — November 17, 2012 @ 1:44 pm

          • I think you’re going to have to ask about this on http://stackoverflow.com/questions/tagged/r because, to be completely honest with you mate, I don’t understand why that is happening. Sorry I couldn’t be of more help.

            Comment by Tony Breyal — November 17, 2012 @ 2:24 pm

            • No probs, Tony. I’ll play around with the function and try to parse the list in smaller blocks. I’ll get back to you if I find a workaround. Thanks for all the help.

              Comment by Marco T. Bastos — November 17, 2012 @ 2:49 pm

  6. Just checking: the function does not work correctly for Twitter-shortened URLs. For example: “http://t.co/pYeb0wQew8”

    Comment by http://redheadedstepdata.io — June 10, 2014 @ 9:23 pm

  7. Reblogged this on IT Today and commented:
    The decode_short_url function does the trick: it gives you the full URL of the shortened URL.

    Comment by leonwangechi — August 7, 2014 @ 6:44 am
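
The workaround mentioned in the comments above, processing a long list of URLs in smaller blocks, might be sketched as follows. Whether this avoids the crash reported there is not established; it only shows the splitting and recombining. The names decode_in_blocks, decoder and block_size are hypothetical, and fake_decoder is an offline stand-in that mimics the shape of decode_short_url's return value (a list named by its input URLs).

```r
# Hypothetical helper: run a decoder over a long vector of URLs in blocks.
decode_in_blocks <- function(urls, decoder, block_size = 100) {
  # Assign each URL to a block number: 1, 1, ..., 2, 2, ...
  block_id <- ceiling(seq_along(urls) / block_size)
  blocks <- split(urls, block_id)
  results <- lapply(blocks, decoder)
  # Drop the block labels and join the per-block lists into one list.
  do.call(c, unname(results))
}

# Offline stand-in for decode_short_url(): returns a list named by its input.
fake_decoder <- function(u) {
  out <- as.list(paste0(u, "/resolved"))
  names(out) <- u
  out
}

urls <- paste0("http://short.example/", 1:5)
res <- decode_in_blocks(urls, fake_decoder, block_size = 2)
print(names(res))
```

Passing decode_short_url as the decoder should work in the same way, since it accepts a character vector and returns a named list; saving each block's result to disk before moving on would be a natural extension if a crash part-way through is a concern.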

