# R: loops and performance

This week I was trying to optimize some R code. The code involved a lot of loops to traverse a large data frame, so I thought it was a good idea to first look at all my options and then decide on the best way to implement such loops.

There are many ways to loop across datasets in R. Often it is not even necessary to write a loop at all: to sum across the columns of a data frame, or to get the mean, standard deviation, etc., there are functions such as `colSums()`, `summary()` and `colMeans()`, among others. However, let's look at the looping options next.

# apply family of functions

This family of functions can be used when you want to apply a function to the elements of a list, data frame or matrix. Here, I describe a few of them:
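Before going through them one by one, note that the family is larger than what I cover here; for instance, `mapply()` applies a function over several vectors in parallel (element 1 of each, then element 2, and so on):

```r
# Add corresponding elements of two vectors.
mapply(function(a, b) a + b, 1:3, 4:6)
# [1] 5 7 9
```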

## lapply()

`lapply()` can be used when we want to apply a function to each element of a list. For example, using the *iris* dataset, let's calculate the mean of each numeric column:

```r
data(iris)
lapply(iris[, 1:4], mean)
# $Sepal.Length
# [1] 5.843333
#
# $Sepal.Width
# [1] 3.057333
#
# $Petal.Length
# [1] 3.758
#
# $Petal.Width
# [1] 1.199333
```

This is just a dummy example. As mentioned above, R has the `colMeans()` function for this type of operation, i.e. `colMeans(iris[, 1:4])` or, more generally, `summary(iris)`.

## sapply()

`sapply()` (simplified apply) is very similar to `lapply()`: it is also used to apply a function to each element of a list. The difference is that `sapply()` will try to simplify the output (e.g. to a vector or matrix) when possible, rather than always returning a list.

```r
sapply(iris, mean)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#     5.843333     3.057333     3.758000     1.199333           NA
```

A similar result could have been obtained with `lapply()` and `unlist()`:

```r
identical(sapply(iris, mean), unlist(lapply(iris, mean)))
# [1] TRUE
```

`sapply()` will return a list if the output cannot be simplified, i.e. it will produce the same output as `lapply()`. For example:

```r
identical(sapply(iris, summary), lapply(iris, summary))
# [1] TRUE
```

## vapply()

`vapply()` is also very similar to `sapply()`. The difference is that you tell R in advance the type and size of each result. This can speed things up when you know what the function will return, since R doesn't have to figure out how to simplify the output, and it adds a safety check: if a result doesn't match the declared template, `vapply()` throws an error instead of silently simplifying.

```r
vapply(iris[, 1:4], mean, numeric(1))  # Expect a single numeric value per column
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#     5.843333     3.057333     3.758000     1.199333

vapply(iris[, 1:4], summary, numeric(6))  # Expect 6 numeric values per column
#         Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min.        4.300000    2.000000        1.000    0.100000
# 1st Qu.     5.100000    2.800000        1.600    0.300000
# Median      5.800000    3.000000        4.350    1.300000
# Mean        5.843333    3.057333        3.758    1.199333
# 3rd Qu.     6.400000    3.300000        5.100    1.800000
# Max.        7.900000    4.400000        6.900    2.500000
```
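The template does not have to be numeric. A small sketch using only base R, asking for exactly one character value per column:

```r
# Return the (first) class of each column of iris; Species is a factor.
vapply(iris, function(col) class(col)[1], character(1))
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
```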

## apply()

`apply()` is used to apply a function to the rows or columns of a matrix. In contrast to a data frame, all the elements inside a matrix must be of the same type (e.g. all characters or all numbers). Note that when using `apply()` on a data frame, R won't return an error; instead it will convert the data frame into a matrix, coercing all values to a single type.

For example, using `apply()` to get a summary of each of the columns of the *iris* data frame turns every column into a character vector, because the "Species" column is not numeric:

```r
apply(iris, 2, summary)
#        Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species
# Length "150"        "150"        "150"        "150"        "150"
# Class  "character"  "character"  "character"  "character"  "character"
# Mode   "character"  "character"  "character"  "character"  "character"
```

To avoid this behavior, we have to make sure that anything passed to `apply()` that is not already a matrix, such as a data frame, contains values of a single type.

```r
apply(iris[, 1:4], 2, summary)  # Selecting only the numeric columns
#         Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min.        4.300000    2.000000        1.000    0.100000
# 1st Qu.     5.100000    2.800000        1.600    0.300000
# Median      5.800000    3.000000        4.350    1.300000
# Mean        5.843333    3.057333        3.758    1.199333
# 3rd Qu.     6.400000    3.300000        5.100    1.800000
# Max.        7.900000    4.400000        6.900    2.500000
```

## tapply()

`tapply()` is one of the functions I use most often. It allows you to apply a function to subsets of your data defined by a factor. For example, to calculate the mean `Petal.Length` per species:

```r
tapply(iris$Petal.Length, iris$Species, mean)
#     setosa versicolor  virginica
#      1.462      4.260      5.552
```
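For intuition, `tapply()` with a single grouping factor behaves much like splitting the data by that factor and then applying the function to each piece:

```r
# Equivalent result via split() + sapply(): split the vector by species,
# then compute the mean of each group.
by_species <- split(iris$Petal.Length, iris$Species)
sapply(by_species, mean)
#     setosa versicolor  virginica
#      1.462      4.260      5.552
```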

# Performance

The question I really wanted to answer while inspecting the many ways to loop through a data frame in R was which of these functions performs best, and how they compare to a more traditional *for loop*. I did a series of experiments, but I'll show here what I learnt with a couple of trivial examples.

## First experiment

I wanted to see how fast each of these functions could compute something as trivial as the mean of each column of a somewhat large data frame / matrix:

```r
library(doFuture)
library(future.apply)
library(reshape2)
library(ggplot2)

registerDoFuture()
plan(multicore)

times <- matrix(NA, 10, 10)
colnames(times) <- c("for", "foreach", "sapply", "lapply", "vapply", "apply",
                     "foreach_dopar", "future_lapply", "for_matrix", "foreach_matrix")

for (i in 1:10) {
  print(i)
  x <- matrix(rnorm(100000000), 10000, 10000)
  x_df <- as.data.frame(x)

  times[i, "vapply"] <- system.time({ y <- vapply(x_df, mean, numeric(1)) })[3]
  times[i, "sapply"] <- system.time({ y <- sapply(x_df, mean) })[3]
  times[i, "lapply"] <- system.time({ y <- lapply(x_df, mean) })[3]
  times[i, "apply"]  <- system.time({ y <- apply(x, 2, mean) })[3]

  times[i, "for"] <- system.time({
    for (h in 1:10000) {
      mean(x_df[, h])
    }
  })[3]

  times[i, "foreach"] <- system.time({
    h <- foreach(h = x_df, .combine = "rbind") %do% {
      mean(h)
    }
  })[3]

  times[i, "for_matrix"] <- system.time({
    for (h in 1:10000) {
      mean(x[, h])
    }
  })[3]

  times[i, "foreach_matrix"] <- system.time({
    h <- foreach(h = x, .combine = "rbind") %do% {
      mean(h)
    }
  })[3]

  times[i, "foreach_dopar"] <- system.time({
    h <- foreach(h = x_df, .combine = "rbind") %dopar% {
      mean(h)
    }
  })[3]

  times[i, "future_lapply"] <- system.time({ future_lapply(x_df, mean) })[3]
}

colMeans(times)
#           for       foreach        sapply        lapply        vapply
#        0.4265        1.7775        0.2105        0.2074        0.2525
#         apply foreach_dopar future_lapply    for_matrix foreach_matrix
#        1.8848        4.1796        2.3966        1.0637        2.9653
```

In the code above I created a matrix of 10k × 10k random numbers. I tested the performance of the `apply()` family of functions, `for()` and `foreach()` using either a data frame (*x_df*) or a matrix (*x*) as input, as well as the multicore functions `future_lapply()` and `foreach() %dopar%`. To get some distribution of elapsed times, I ran the whole thing 10 times (see the boxplots below for a nicer summary). For this very trivial example, we can see that `sapply()`, `lapply()` and `vapply()` performed best.

I was very surprised that using matrices instead of a data frame (i.e. *for_matrix*, *foreach_matrix* and *apply*) led to much worse performance. A likely reason is that when you operate on a subset of a matrix (e.g. a column), R first has to copy that column into a new vector, which creates quite an overhead for simple operations. Extracting a column from a data frame, in contrast, just returns the vector already stored in the underlying list, so that overhead goes away.
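A quick way to see this copy overhead in isolation (a rough sketch; absolute timings will vary by machine) is to time bare column extraction on both structures:

```r
# Compare the cost of pulling columns out of a matrix vs. a data frame.
x <- matrix(rnorm(1e6), 1000, 1000)
x_df <- as.data.frame(x)

system.time(for (j in 1:1000) x[, j])     # matrix: each x[, j] allocates a fresh copy
system.time(for (j in 1:1000) x_df[[j]])  # data frame: returns the stored vector as-is
```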

I also observed that, for this simple operation, the multicore functions `foreach() %dopar%` and `future_lapply()` were considerably slower. I attribute this to the overhead of copying the data to the worker processes and then combining the results.

## Second experiment

For this part, I decided to use a "more complex" function inside the loops. The intuition is that now the overhead of some of these functions (e.g. copying the data) will be negligible compared to the actual time needed to carry out the operations.

```r
complexFunction <- function(x) {
  # Do a bunch of work
  sqrt(x)
  median(x)
  sd(x)
  scale(x)
  if (mean(x) >= 0)
    return(-1)
  return(1)
}

times_complex <- matrix(NA, 10, 10)
colnames(times_complex) <- c("for", "foreach", "sapply", "lapply", "vapply", "apply",
                             "foreach_dopar", "future_lapply", "for_matrix", "foreach_matrix")

for (i in 1:10) {
  print(i)
  x <- matrix(rnorm(100000000), 10000, 10000)
  x_df <- as.data.frame(x)

  times_complex[i, "vapply"] <- system.time({ y <- vapply(x_df, complexFunction, numeric(1)) })[3]
  times_complex[i, "sapply"] <- system.time({ y <- sapply(x_df, complexFunction) })[3]
  times_complex[i, "lapply"] <- system.time({ y <- lapply(x_df, complexFunction) })[3]
  times_complex[i, "apply"]  <- system.time({ y <- apply(x, 2, complexFunction) })[3]

  times_complex[i, "for"] <- system.time({
    for (h in 1:10000) {
      complexFunction(x_df[, h])
    }
  })[3]

  times_complex[i, "foreach"] <- system.time({
    h <- foreach(h = x_df, .combine = "rbind") %do% {
      complexFunction(h)
    }
  })[3]

  times_complex[i, "for_matrix"] <- system.time({
    for (h in 1:10000) {
      complexFunction(x[, h])
    }
  })[3]

  times_complex[i, "foreach_matrix"] <- system.time({
    h <- foreach(h = x, .combine = "rbind") %do% {
      complexFunction(h)
    }
  })[3]

  times_complex[i, "foreach_dopar"] <- system.time({
    h <- foreach(h = x_df, .combine = "rbind") %dopar% {
      complexFunction(h)
    }
  })[3]

  times_complex[i, "future_lapply"] <- system.time({ future_lapply(x_df, complexFunction) })[3]
}

colMeans(times_complex)
#           for       foreach        sapply        lapply        vapply
#       17.2348       19.6356       17.2059       16.8529       17.0792
#         apply foreach_dopar future_lapply    for_matrix foreach_matrix
#       19.1194        7.7400        6.2917       17.8592       20.8166
```

As expected, the performance is now comparable between functions that use the matrix (*x*) as input and those that use the data frame (*x_df*). It is also nice to see that `for()` is not really worse than the `apply()` family of functions. Something I found surprising was that `vapply()` does not seem to perform any better than `sapply()` and `lapply()`, as I thought specifying the type of output would lead to better performance. Maybe it does, but not in these trivial examples.

Finally, given that the operations themselves now take most of the time, we do see quite an improvement in speed with the multicore functions `foreach() %dopar%` and `future_lapply()`.
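A side note on portability (my own aside, not part of the benchmark above): `plan(multicore)` relies on forking the R process, which is unavailable on Windows and disabled by default in RStudio, where the code would silently fall back to running sequentially. The usual substitute is a `multisession` plan of background R sessions:

```r
library(future)

# Use background R sessions instead of forked processes; works on all platforms.
plan(multisession, workers = 2)

# Future-aware code (e.g. future.apply::future_lapply) now runs on these workers.
```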

This is the overall result of this simple benchmarking:

```r
simple <- melt(times)
simple$type <- "simple"
complex <- melt(times_complex)
complex$type <- "complex"
all <- rbind(simple, complex)

png("elapsed_time_t.png", width = 800, height = 800, res = 100)
ggplot(all, aes(x = Var2, y = value, fill = type)) +
  ylab("Elapsed time") + xlab("Function") + ggtitle("Elapsed time") +
  geom_boxplot() + coord_flip()
dev.off()
```

In conclusion, the best way to loop through data depends on the complexity of the operations. In my case, I'll probably use the `apply()` functions when they help the readability of the code, and `for()` when there are a lot of control statements around the operations that need to happen. Whenever there are a lot of operations and I can easily parallelize the work, I will use the multicore functions.
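And one last caveat worth repeating from the start of this post: when a dedicated vectorized function exists for the task, it will usually beat every loop discussed here. For the column-means example:

```r
x <- matrix(rnorm(1e6), 1000, 1000)

# colMeans() computes the same result as looping over columns, in optimized C code.
all.equal(colMeans(x), apply(x, 2, mean))
# [1] TRUE
```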