Sign in

R: loops and performance

This week I was trying to optimize some R code. The code involved a lot of loops to traverse a large data frame, so I thought was a good idea to first look at all my options and then decide what is the best way for me to implement such loops.

There are many ways to loops across datasets in R. Many times it is not even necessary to write a loop. For example if you just want to sum across columns in a data.frame, or get the mean, s.d. etc, there are functions such as colSums() , summary() , colMeans() among others. However, lets look at the options next.

apply family of functions

These type of functions can be used when you want to apply a function to elements of a list, data frame or matrix. Here, I describe a few of them:

lapply() can be used when we are looking to apply a function to each element in a list. For example, using the iris dataset, lets calculate the mean of each numeric column:

data(iris)
lapply(iris[,c(1:4)], mean)
# $Sepal.Length
# [1] 5.843333
# $Sepal.Width
# [1] 3.057333
# $Petal.Length
# [1] 3.758
# $Petal.Width
# [1] 1.199333

This is just a dummy example. As mentioned above, R has the colMeans() function to do this type of operation i.e. colMeans(iris[,1:4]) or, more generally, summary(iris) .

sapply() (simplified apply) is very similar to lapply. It is also used to apply a function to each element of a list. The difference is that the function will try to simplify the output (e.g. to a vector) if possible rather than returning a list.

sapply(iris, mean)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 5.843333 3.057333 3.758000 1.199333 NA

A similar result could have been obtained by lapply() andunlist():

identical(sapply(iris, mean), unlist(lapply(iris,mean)))
# [1] TRUE

sapply() will return a list if the output cannot be simplified, i.e. will have the same output as lapply(). For example:

identical(sapply(iris, summary), lapply(iris,summary))
#[1] TRUE

vapply() is also very similar to sapply(). The difference is that you are telling R that the output will be of a certain type and size. This function could be used to speed up processes if you know the type of data and dimension the function will return, as R won’t have to figure out how to cast or simplify the output.

vapply(iris[,1:4], mean, numeric(1)) #Expect a 1 numeric value
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 5.843333 3.057333 3.758000 1.199333
vapply(iris[,1:4], summary, numeric(6)) #Expect 6 numeric values
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. 4.300000 2.000000 1.000 0.100000
# 1st Qu. 5.100000 2.800000 1.600 0.300000
# Median 5.800000 3.000000 4.350 1.300000
# Mean 5.843333 3.057333 3.758 1.199333
# 3rd Qu. 6.400000 3.300000 5.100 1.800000
# Max. 7.900000 4.400000 6.900 2.500000

apply() is used to apply a function to the rows or columns of a matrix. In contrast to a data frame all the elements inside a matrix must be of the same type (e.g. characters, numbers). Note that when using apply() on a data frame, it is likely that R won’t return an error and will instead transform the data frame into a matrix of values of the same type.

For example, using apply()to get a summary of each of the columns in the iris data frame would transform every column to a vector of characters as the “Species” column is of type character:

apply(iris, 2, summary)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# Length "150" "150" "150" "150" "150"
# Class "character" "character" "character" "character" "character"
# Mode "character" "character" "character" "character" "character"

In order to avoid this behavior, we have to make sure that if we are not using a matrix as an input for apply(), the data frame has to have a single type of values.

apply(iris[,1:4], 2, summary) #Selecting only numeric columns
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. 4.300000 2.000000 1.000 0.100000
1st Qu. 5.100000 2.800000 1.600 0.300000
Median 5.800000 3.000000 4.350 1.300000
Mean 5.843333 3.057333 3.758 1.199333
3rd Qu. 6.400000 3.300000 5.100 1.800000
Max. 7.900000 4.400000 6.900 2.500000

tapply(), is one of the functions I use more often. It allows to apply a function to a subset of your data defined by a factor. For example, in order to calculate the mean Petal.Length per species, we can do:

tapply(iris$Petal.Length, iris$Species, mean)
# setosa versicolor virginica
# 1.462 4.260 5.552

Performance

The question I really wanted to answer while inspecting the many ways to loop through a data frame in R was which of these functions gives a better performance, and how do these compare to a more traditional for loop. I did a series of experiments, but I’ll show here what I learnt with a couple of trivial examples.

I wanted to see how fast each of the functions to compute something as trivial as the mean for each of the columns of a somehow large data frame / matrix:

library(doFuture)
library(future.apply)
library(reshape2)
library(ggplot2)
registerDoFuture()
plan(multicore)
times<-matrix(NA,10,10)
colnames(times)<-c("for","foreach","sapply","lapply","vapply","apply","foreach_dopar","future_lapply","for_matrix","foreach_matrix")
for(i in 1:10) {
print(i)
x<- matrix(rnorm(100000000),10000,10000)
x_df<-as.data.frame(x)
times[i,"vapply"]<-system.time({ y<-vapply(x_df,mean, numeric(1)) })[3]
times[i,"sapply"]<-system.time({ y<-sapply(x_df,mean) })[3]
times[i,"lapply"]<-system.time({ y<-lapply(x_df,mean) })[3]
times[i,"apply"]<-system.time({ y<-apply(x,2,mean) })[3]
times[i,"for"]<-system.time({
for(h in 1:10000) {
mean(x_df[,h])
}
})[3]
times[i,"foreach"]<-system.time({
h <- foreach(h = x_df, .combine="rbind") %do% {
mean(h)
}
})[3]
times[i,"for_matrix"]<-system.time({
for(h in 1:10000) {
mean(x[,h])
}
})[3]
times[i,"foreach_matrix"]<-system.time({
h <- foreach(h = x, .combine="rbind") %do% {
mean(h)
}
})[3]
times[i,"foreach_dopar"]<-system.time({
h <- foreach(h = x_df, .combine="rbind") %dopar% {
mean(h)
}
})[3]
times[i,"future_lapply"]<-system.time({future_lapply(x_df,mean) })[3]}colMeans(times)
#for foreach sapply lapply vapply
#0.4265 1.7775 0.2105 0.2074 0.2525
#apply foreach_dopar future_lapply for_matrix foreach_matrix
1.8848 4.1796 2.3966 1.0637 2.9653

In the code above I created a matrix with 10k x 10k random numbers. I tested the performance of apply() family of functions, for() and foreach()using a data frame (x_df) or a matrix (x) as input, as well as the multicore functions future_lapply() and foreach() %dopar% . In order to get some distribution of elapsed time, I ran this 10 times (see the boxplots below for a nicer summary). For this very trivial example, we can see that sapply() , lapply() and vapply() performed the best.

I was very surprised that using matrices instead of data frame (i.e. for_matrix, foreach_matrix and apply) leads to a very bad performance. It is likely that the reason for this is that when you are performing an operation in a subset of a matrix (e.g. a column), R has to make a copy of that column first which creates quite an overhead for simple operations. The latter is not needed for data frames, so that overhead goes away.

I also observed that for this simple operation, the multicore functions foreach() %dopar% and future_lapply() were considerably slower. I attribute this to the overhead caused while copying the data across processes and then combining the results.

For this part, I decided to use a “more complex” function inside the loops. The intuition is that now the overhead (e.g. copying the data) of some of these functions will be negligible compared to the actual time to carry out the operations.

complexFunction <- function(x) {
#Do a bunch of stuff
sqrt(x)
median(x)
sd(x)
scale(x)
if(mean(x) >= 0)
return(-1)
return(1)
}
times_complex<-matrix(NA,10,10)
colnames(times)<-c("for","foreach","sapply","lapply","vapply","apply","foreach_dopar","future_lapply","for_matrix","foreach_matrix")
for(i in 1:10) {
print(i)
x<- matrix(rnorm(100000000),10000,10000)
x_df<-as.data.frame(x)
times_complex[i,"vapply"]<-system.time({ y<-vapply(x_df,complexFunction, numeric(1)) })[3]
times_complex[i,"sapply"]<-system.time({ y<-sapply(x_df,complexFunction) })[3]
times_complex[i,"lapply"]<-system.time({ y<-lapply(x_df,complexFunction) })[3]
times_complex[i,"apply"]<-system.time({ y<-apply(x,2,complexFunction) })[3]
times_complex[i,"for"]<-system.time({
for(h in 1:10000) {
complexFunction(x_df[,h])
}
})[3]
times_complex[i,"foreach"]<-system.time({
h <- foreach(h = x_df, .combine="rbind") %do% {
complexFunction(h)
}
})[3]
times_complex[i,"for_matrix"]<-system.time({
for(h in 1:10000) {
complexFunction(x[,h])
}
})[3]
times_complex[i,"foreach_matrix"]<-system.time({
h <- foreach(h = x, .combine="rbind") %do% {
complexFunction(h)
}
})[3]
times_complex[i,"foreach_dopar"]<-system.time({
h <- foreach(h = x_df, .combine="rbind") %dopar% {
complexFunction(h)
}
})[3]
times_complex[i,"future_lapply"]<-system.time({future_lapply(x_df,complexFunction) })[3]}colMeans(times_complex)
#for foreach sapply lapply vapply
#17.2348 19.6356 17.2059 16.8529 17.0792
#apply foreach_dopar future_lapply for_matrix foreach_matrix
#19.1194 7.7400 6.2917 17.8592 20.8166

As expected, I observed that now the performance is comparable between functions that use the matrix (x) as input and those that use a data frame (x_df). It is also nice to see that for() is not really worse that the apply() family of functions. Something I found surprising was that vapply() seems not to perform any better that sapply() and lapply() as I thought specifying the type of output would lead to better performance. Maybe it does, but not in these trivial examples.

Finally, given that now the operations are what take the most time, we do see quite an improvement on speed with the multicore functions foreach() %dopar% and future_lapply().

This is the overall result of this simple benchmarking:

simple<-melt(times)
simple$type<-"simple"
complex<-melt(times_complex)
complex$type<-"complex"
all<-rbind(simple, complex)
png("elapsed_time_t.png", width=800, height=800, res=100)
ggplot(all, aes(x=Var2, y=value, fill=type)) +
ylab("Elapsed time") + xlab("Function") + ggtitle("Elapsed time") +
geom_boxplot() + coord_flip()
dev.off()

In conclusion, the best way to loop through data depends on the complexity of operations. In my case, I’ll probably use apply() functions when it helps readability of the code and for() when there is a lot of control statements for the operations that need to happen. Whenever there is a lot of operations, and I can easily parallelize the work, I will be using the multicore functions.

Code for the benchmarking.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store