Consistently Infrequent

August 24, 2014

R: Word Stem Text Blocks in Parallel

I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.


The stemmer I was using would only stem the last word in each block of text. For example, in the string of words below, “walkers” is the only word reduced to its root form:


wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] 'walk walks walked walking walker walk'
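
This appears to happen because wordStem expects a character vector with one word per element, so a whole sentence passed as a single string is treated as one long word and only its ending gets stemmed. A minimal sketch of the workaround, assuming SnowballC is installed and that splitting on single spaces is good enough for the input (the variable names are only for illustration):

```r
library(SnowballC)

# one block of text, as in the example above
block <- 'walk walks walked walking walker walkers'

# split on spaces so each word becomes its own vector element
words <- strsplit(block, split = ' ', fixed = TRUE)[[1]]

# wordStem now sees six separate words rather than one long string
stems <- wordStem(words, language = 'en')

# glue the stems back together into a single block of text
paste(stems, collapse = ' ')
```

Splitting on spaces leaves punctuation attached to the neighbouring word, which is one reason to prefer a proper tokeniser for real text.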


I wrote a function which splits a block of text into individual words, stems each word, and then recombines the words into a block of text:

require(SnowballC) # stemmer
require(parallel)  # parallel processing
require(tau)       # tokenise function

stem_text <- function(text, language = 'porter', mc.cores = 1) {
  # stem each word in a block of text
  stem_string <- function(str, language) {
    str <- tokenize(x = str)                   # split the block into tokens
    str <- wordStem(str, language = language)  # stem each token
    str <- paste(str, collapse = "")           # rejoin the tokens into one string
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

This works as follows:

# Blocks of text
sentences <- c('walk walks walked walking walker walkers?',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence.')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] 'walk walk walk walk walker walker?'
# [2] 'Never ignor coincid unless of cours you are busi In which case alway ignor coincid.'

The argument “mc.cores” sets how many processor cores mclapply may use. Under Windows it must be left at one, because mclapply relies on forking, which Windows does not support. Under Ubuntu Linux, you can set it to however many cores you have (though it’s probably only worthwhile if you have lots of text blocks).
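
One way to choose a value for mc.cores without hard-coding it is sketched below. It relies only on the parallel package that ships with R; the helper name pick_cores is made up for this example and is not part of any package, and leaving one core free is just a convention, not a requirement.

```r
library(parallel)

# hypothetical helper: choose a safe value for mc.cores
pick_cores <- function() {
  # mclapply relies on forking, which Windows lacks, so stay with one core there
  if (.Platform$OS.type == 'windows') {
    return(1L)
  }
  n <- detectCores()
  # detectCores() can return NA when the count cannot be determined
  if (is.na(n)) {
    return(1L)
  }
  # leave one core free for the rest of the system
  max(1L, n - 1L)
}

# toy stand-in for the stemming step, applied to each element in parallel
res <- mclapply(X = c('a', 'b', 'c'), FUN = toupper, mc.cores = pick_cores())
unlist(res)
```

The same value could then be passed to the function from this post, e.g. stem_text(sentences, language = 'en', mc.cores = pick_cores()).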

