Consistently Infrequent

August 24, 2014

R: Word Stem Text Blocks in Parallel

Filed under: R — Tony Breyal @ 11:02 pm

Objective

I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.

Problem

The stemmer I was using would only stem the last word in each block of text, e.g. the word “walkers” in the string below is the only one reduced to its root form:

require(SnowballC)

wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] "walk walks walked walking walker walk"
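
Note, however, that wordStem() does stem every element when it is handed a character vector with one word per element, which is what the fix below relies on. A minimal illustration (the exact stems may vary by Snowball version):

require(SnowballC)

# one word per element, so each element is stemmed individually
wordStem(c('walk', 'walks', 'walked', 'walking', 'walker', 'walkers'), language = 'en')
# e.g. 'walks' and 'walked' are both reduced to 'walk'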

Solution

I wrote a function which splits a block of text into individual words, stems each word, and then recombines them into a single block of text:

require(SnowballC) # stemmer
require(parallel)  # parallel processing
require(tau)       # tokenise function

stem_text <- function(text, language = 'porter', mc.cores = 1) {
  # stem each word in a single block of text
  stem_string <- function(str, language) {
    str <- tokenize(x = str)                   # split into word, space and punctuation tokens
    str <- wordStem(str, language = language)  # stem each token
    str <- paste(str, collapse = "")           # stick the tokens back together
    return(str)
  }

  # stem each text block in turn (in parallel if mc.cores > 1)
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

This works as follows:

# Blocks of text
sentences <- c('walk walks walked walking walker walkers?',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence.')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] "walk walk walk walk walker walker?"
# [2] "Never ignor coincid unless of cours you are busi In which case alway ignor coincid."

The “mc.cores” argument sets how many processor cores mclapply() will use. Under Windows this must be 1, because the fork-based mclapply() is not supported there. Under Linux (e.g. Ubuntu) you can set it to however many cores you have, though it is probably only worthwhile if you have lots of text vectors.
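
If you are stuck on Windows, or just want something cross-platform, a socket cluster is the usual workaround. Below is a minimal sketch of the same idea using parLapply() instead of mclapply(); the function name stem_text_cluster() is my own and not part of the original post, and it assumes SnowballC and tau are installed so the workers can reach them via :: calls.

require(parallel)

stem_text_cluster <- function(text, language = 'porter', cores = 1) {
  # start a socket cluster (works on Windows as well as Linux)
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))

  # tokenise, stem and reassemble each block of text on the workers
  x <- parLapply(cl, text, function(str, language) {
    str <- tau::tokenize(x = str)
    str <- SnowballC::wordStem(str, language = language)
    paste(str, collapse = "")
  }, language = language)

  unlist(x)
}

# stem_text_cluster(sentences, language = 'en', cores = 2)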

4 Comments »

  1. The Porter algorithm in SnowballC is too aggressive and tends to lower the quality of the sentiment analysis work I am doing. Is there an implementation of a less aggressive algorithm like KStem in R?

    Comment by ankush — July 2, 2015 @ 9:09 pm

  2. Great! This was pretty useful to use with text blocks within a data frame 😛

    Comment by Iris G — October 14, 2015 @ 7:47 pm

  3. Thanks, this was very useful. I have a similar kind of problem: I want to stem each word of a corpus into its root/base word. I have a corpus of 5000 tweets; how can I stem each word of every tweet to its base word? Kindly help!
    Thanks

    Comment by Kavya — April 19, 2016 @ 12:37 pm

  4. Thanks a million! The other 5 stemming solutions I tried didn’t work. And as a bonus this solution lets you work with a character vector as opposed to needing to create a corpus.

    Comment by simone — February 24, 2017 @ 4:31 pm

