Consistently Infrequent

November 29, 2011

outersect(): The opposite of R’s intersect() function

Filed under: R — Tony Breyal @ 12:57 pm

The Objective

To find the non-duplicated elements between two or more vectors (i.e. the ‘yellow’ sections of the diagram above).

The Problem

I needed the opposite of R’s intersect() function, an “outersect()”. The closest I found was setdiff(), but its result depends on the order of the input vectors, e.g.


x = letters[1:3]
#[1] "a" "b" "c"
y = letters[2:4]
#[1] "b" "c" "d"

# The desired result is
# [1] "a" "d"

setdiff(x, y)
#[1] "a"

setdiff(y, x)
#[1] "d"

setdiff() returns the elements of the first input vector that have no match in the second input vector (i.e. it is asymmetric). Not quite what I’m after. I’m looking for the ‘yellow’ set of elements as in the picture at the top of the page.

The Solution

Concatenating the results of setdiff() with input vectors in both combinations works a treat:

outersect <- function(x, y) {
  sort(c(setdiff(x, y),
         setdiff(y, x)))
}

x = letters[1:3]
#[1] "a" "b" "c"
y = letters[2:4]
#[1] "b" "c" "d"

outersect(x, y)
#[1] "a" "d"

outersect(y, x)
#[1] "a" "d"

Alternative solution

An alternative, which gives the same result whenever neither input contains repeated values, would be to use

outersect <- function(x, y) {
  sort(c(x[!x%in%y],
         y[!y%in%x]))
}

but I think using setdiff(), as in the first solution, makes it easier to read.
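One place where the two versions can differ is when an input contains repeated values: setdiff() returns unique values, whereas indexing with %in% keeps the repeats. A small sketch of this (the names outersect_setdiff and outersect_in are made up here purely to hold both versions side by side):

```r
# Version 1: built on setdiff(), which drops repeated values
outersect_setdiff <- function(x, y) {
  sort(c(setdiff(x, y), setdiff(y, x)))
}

# Version 2: built on %in%, which keeps repeated values
outersect_in <- function(x, y) {
  sort(c(x[!x %in% y], y[!y %in% x]))
}

x <- c("a", "a", "b", "c")
y <- c("b", "c", "d")

outersect_setdiff(x, y)
#[1] "a" "d"

outersect_in(x, y)
#[1] "a" "a" "d"
```

If the inputs are already free of repeats, as in the letters example, both versions should agree.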

Further Development

It would be nice to extend this to a variable number of input vectors. This final task turns out to be rather simple:


outersect <- function(x, y, ...) {
  big.vec <- c(x, y, ...)
  duplicates <- big.vec[duplicated(big.vec)]
  setdiff(big.vec, unique(duplicates))
}

# desired result is c(1, 2, 3, 6, 9, 10)
outersect(1:5, 4:8, 7:10)
#[1] 1 2 3 6 9 10

Awesome.
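One caveat worth keeping in mind: because duplicated() runs over the concatenated vector, a value repeated within a single input is treated as if it were shared between inputs and gets dropped. If that matters for your data, one possible sketch is to de-duplicate each input before combining them (outersect_sets is a hypothetical name used only for this illustration):

```r
outersect_sets <- function(...) {
  # Treat each input as a set by removing within-vector repeats first
  big.vec <- unlist(lapply(list(...), unique))
  # Anything still duplicated must appear in more than one input
  duplicates <- big.vec[duplicated(big.vec)]
  setdiff(big.vec, duplicates)
}

# 1 is repeated inside the first vector only, so it is kept
outersect_sets(c(1, 1, 2), c(2, 3))
#[1] 1 3

# same answer as before when no input has internal repeats
outersect_sets(1:5, 4:8, 7:10)
#[1]  1  2  3  6  9 10
```

This assumes plain atomic vectors of a common type as inputs; it has not been checked against factors or lists.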


20 Comments »

  1. Very cool, but creating a new word (‘outersect?!’ yech!) where one already exists: xor (short for eXclusive OR, as in set A or set B, but not both).

    Comment by Alexis — November 29, 2011 @ 4:57 pm

    • @Alexis – Fair point. I was going to call it symSetDiff (“symmetric set difference”) but the only way I’ll ever remember to use the function in my own code is by thinking in terms of easy to recall named opposites so the opposite of intersect (“intersections”) becomes outersect (“outer sections”). I forget many R functions and this is a good way for me to recall it.

      And yes, “xor” would’ve been much better, but it’s not something I’ll instantly remember when I need the opposite of an intersect operation (plus there’s already an xor function in R which doesn’t quite achieve what I want above). :)

      Comment by Tony Breyal — November 29, 2011 @ 5:38 pm

      • The operation you implemented is called symmetric difference of the set: http://en.wikipedia.org/wiki/Symmetric_difference

        It can be defined as union of two set differences:

        c(setdiff(a,b),setdiff(b,a))

        So we can implement it in R by following mathematical definitions, as it should be.

        Comment by vzemlys — December 5, 2011 @ 9:09 am

        • @Vzemlys Yeah, that’s what I used in my first solution above. The final solution in the post is extended to cope with a variable set of input vectors which is what I usually deal with in my own work (and I suspect others would too). Although I did say that I was originally going to call the function symSetDiff (“symmetric set difference”), looking at the link you posted makes me think that I would have been wrong to do so based on the second image:

          Comment by Tony Breyal — December 5, 2011 @ 10:08 am

      • Dear vzemlys,

        Since the intersection of 1,2,3,4,5,6 and 1,2,3,7,8,9 is 1,2,3, the correct answer for the symmetric difference set is 4,5,6,7,8,9.
        Any idea why your code c(setdiff(a,b),setdiff(b,a)) gave two answers, both wrong?

        > tab123456 tab123789 c(setdiff(tab123456,tab123789),setdiff(tab123789,tab123456))

        $V1
        [1] 1 2 3 4 5 6

        $V1
        [1] 1 2 3 7 8 9

        Comment by Richard — December 19, 2012 @ 1:55 am

        • What is a tab123456? This is what I get when I calculate the symmetric difference between sets 1,2,3,4,5,6 and 1,2,3,7,8,9:

          > a b c(setdiff(a,b),setdiff(b,a))

          [1] 4 5 6 7 8 9

          Comment by vzemlys — December 19, 2012 @ 3:22 am

          • OK, wordpress is eating R code, Here is another try:

             > a b c(setdiff(a,b),setdiff(b,a)) 
            [1] 4 5 6 7 8 9

            Comment by vzemlys — December 19, 2012 @ 3:23 am

            • Nope, still not good. How about a=1:6, b=c(1:3,7:9)

              Comment by vzemlys — December 19, 2012 @ 3:24 am

              • hey thanks for helping out.

                tab123456 is a read.table from a single column tab delimited textfile containing integers 1,2,3,4,5,6
                tab123789 is same of 1,2,3,7,8,9.

                both look okay in RSTUDIO

                running:
                c(setdiff(tab123456,tab123789),setdiff(tab123789,tab123456))

                R gives:
                $V1
                [1] 1 2 3 4 5 6

                $V1
                [1] 1 2 3 7 8 9

                which aint the desired 456789

                r

                Comment by Richard — December 19, 2012 @ 3:55 am

                • If you read in the table, then you have a data.frame. You need to pass vectors, not data.frames. Try using a=tab123456[,1], b=tab123789[,1] and then using the code.

                  Comment by vzemlys — December 19, 2012 @ 4:00 am

            • did just now:

              > a = 1:6
              > b = c(1:3,7:9)
              > c(setdiff(a,b),setdiff(b,a))
              [1] 4 5 6 7 8 9

              so, yes, this of course works!
              but I want to compare tables/vectors/dataframes etc
              r

              Comment by Richard — December 19, 2012 @ 3:58 am

              • You can use this for vectors only, since only then is the set mathematically well defined. So you can use it on rows and columns of tables and data.frames, but not on whole tables and data.frames, unless you do not care about the rectangular structure, in which case just coerce the object to a vector.

                Comment by vzemlys — December 19, 2012 @ 4:02 am

                • got it/that now.
                  thanks.
                  r

                  Comment by Richard — December 19, 2012 @ 4:08 am

          • I figured it out- have to specify the Variable name with $…:

            > a = tab123456
            > b = tab123789
            > c(setdiff(a$V1,b$V1),setdiff(b$V1,a$V1))

            [1] 4 5 6 7 8 9

            thanks

            r

            Comment by Richard — December 19, 2012 @ 4:05 am

  2. Well, as long as we’re making up new words, I’m going to use your code for a function called UnionMinusIntersection(). ;)

    Comment by Alexis — November 30, 2011 @ 1:25 am

    • lol, if it helps you to remember it, go for it mate ;)

      Comment by Tony Breyal — November 30, 2011 @ 9:45 am

  3. > outersect x y outersect (x,y)

    Error in sort.int(x, na.last = na.last, decreasing = decreasing, …) :
    ‘x’ must be atomic
    ————————————————————–
    WHY DOES IT GIVE THIS ERROR?
    The two lists are simply 1,2,3,4,5,6 and 1,2,3,7,8,9 so the outersect is 4,5,6,7,8,9.

    Comment by Richard — December 19, 2012 @ 3:43 am

  4. There’s an issue with this function when the two lists have different lengths; the output is incorrect.

    Comment by jmcontreras — November 5, 2013 @ 7:57 pm

  5. Wouldn’t you like to list the unique results?

    outersect <- function(x, y) {
      sort(c(unique(x[!x%in%y],
             y[!y%in%x])))
    }

    Comment by Maurizio — February 27, 2014 @ 3:26 pm

    • Scratch that…

      Comment by Maurizio — February 27, 2014 @ 3:35 pm

