Manipulating data in R: a brief benchmark report

Gabriel
8 min read · Mar 20, 2022

Recently I started to use dplyr. Although I have used R for years, I had been reluctant to learn dplyr’s grammar for data manipulation. However, nowadays I find it hard to come across scripts that do not use this library, so I thought it was time to just learn it.

There are many fantastic blog posts and websites with tutorials about the different dplyr functions (including the one from the main authors, https://dplyr.tidyverse.org/). As such, I wanted to use this small space to share some of the benchmarks I ran, in case someone else finds them useful.

I have normally used the vanilla R functions along with the data.table library for data manipulation. Before adapting some code, I wanted to see how the dplyr verbs and the different data structures performed in comparison.

Content

  1. Creating data and benchmarking functions
    - Creating the toy dataset
    - Plot benchmark function
  2. Filtering data
    - Filtering data by a column of class factor (type integer)
    - Filtering data by a column of type character
    - Filtering by a column of type double
    - Speeding up things with dtplyr
  3. Extracting data from a column
  4. Ordering data
    - Speeding up ordering in data.table with setkey
  5. Modifying data
  6. Conclusions

Creating data and benchmarking functions

Creating the toy dataset

library(tidyverse)
library(microbenchmark)
library(data.table)
library(bench)
library(ggpubr)

microbenchmark(
  my_tibble <- tibble(
    x = rnorm(1e6, 0, 1),
    y = rnorm(1e6, 0, 1),
    letters = as.factor(sample(letters, 1e6, replace = TRUE)),
    colors = sample(colors(100), 1e6, replace = TRUE)),
  my_data_table <- data.table(
    x = rnorm(1e6, 0, 1),
    y = rnorm(1e6, 0, 1),
    letters = as.factor(sample(letters, 1e6, replace = TRUE)),
    colors = sample(colors(100), 1e6, replace = TRUE)),
  my_data_frame <- data.frame(
    x = rnorm(1e6, 0, 1),
    y = rnorm(1e6, 0, 1),
    letters = as.factor(sample(letters, 1e6, replace = TRUE)),
    colors = sample(colors(100), 1e6, replace = TRUE)),
  times = 20)

#Unit: milliseconds
#      min       lq     mean   median       uq      max neval
# 170.1918 177.0231 192.8869 182.8019 189.5524 251.2096    20
# 172.0108 182.1905 205.4384 192.2405 218.6700 292.1585    20
# 169.8048 176.6235 192.4018 184.7838 193.5924 271.9795    20

Here I am generating datasets with a million rows containing two numerical vectors (x and y), one vector of class “factor” (letters) and another one of type character (colors). I generate a traditional data.frame, a data.table and a tibble, as I will be comparing the performance of common operations on these different objects.

In this particular example, creating the dataset as a data.table took slightly longer than creating a tibble or a traditional data.frame.

data.table (from the data.table package) is an enhanced version of a data.frame. It optimizes operations internally, making it very fast and memory efficient. tibble (part of the tidyverse) is also an enhanced version of the data.frame, with better print methods that make it easier to inspect data. As shown below, both tibble and data.table inherit the data.frame class:

class(my_tibble)
#[1] "tbl_df" "tbl" "data.frame"
class(my_data_frame)
#[1] "data.frame"
class(my_data_table)
#"data.table" "data.frame"

Plot benchmark function

Before starting the comparison, here is the plot_benchmark function that I used:

plot_benchmark <- function(mark_results) {
  # Plot allocated memory
  mem_plot <- ggplot(mark_results,
                     aes(x = as.character(expression),
                         y = as.numeric(mem_alloc) / 1024^2)) +
    geom_bar(stat = "identity") +
    ylab("Allocated memory (Mb)") +
    xlab("Expression") +
    coord_flip() +
    theme(text = element_text(size = 16))
  # Plot median time in ms
  time_plot <- ggplot(mark_results,
                      aes(x = as.character(expression),
                          y = as.numeric(median) * 1000)) +
    geom_bar(stat = "identity") +
    ylab("Median time (ms)") +
    xlab("Expression") +
    coord_flip() +
    theme(text = element_text(size = 16))

  ggarrange(mem_plot, time_plot,
            labels = c("Memory", "Time"),
            nrow = 2)
}

Filtering data

Filtering data by a column of class factor (type integer)

As a first experiment, I wanted to see how much time and memory each of these data objects uses when filtering data. I compare the traditional masking approach [condition, ] and dplyr’s filter().

Here, I filter by “letters”, which is a factor (stored internally as an integer):

filter_benchmark <- mark(
  my_tibble %>% filter(letters == 'a'),
  my_tibble[my_tibble$letters == 'a', ],
  my_data_table %>% filter(letters == 'a'),
  my_data_table[letters == 'a'],
  my_data_frame %>% filter(letters == 'a'),
  my_data_frame[my_data_frame$letters == 'a', ],
  iterations = 1, check = FALSE, time_unit = "ms")

plot_benchmark(filter_benchmark)

In this experiment, filtering the tibble with the mask my_tibble$letters == 'a' used the least memory and was the second fastest approach, while the filter() function performed worse across all three data objects.
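
Besides the plots, the object returned by bench::mark() can be inspected directly; this is a minimal sketch (expression, median and mem_alloc are columns of the tibble that bench returns):

# Raw benchmark results, one row per expression
filter_benchmark[, c("expression", "median", "mem_alloc")]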

Filtering data by a column of type character

So far, tibble appears to be winning in terms of overall performance. Now, I wanted to see whether I would observe the same when filtering on a column of type “character”.

filter_benchmark <- mark(
  my_tibble %>% filter(colors == 'skyblue3'),
  my_tibble[my_tibble$colors == 'skyblue3', ],
  my_data_table %>% filter(colors == 'skyblue3'),
  my_data_table[colors == 'skyblue3'],
  my_data_frame %>% filter(colors == 'skyblue3'),
  my_data_frame[my_data_frame$colors == 'skyblue3', ],
  iterations = 10, check = FALSE, time_unit = "ms")

plot_benchmark(filter_benchmark)

tibble with a mask still appears to be the most memory efficient. However, filtering the data.table on a column of type character was about 10 times faster than any other approach!
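
My guess (an assumption I have not fully verified) is data.table’s auto-indexing: the first == or %in% filter on a column builds a secondary index that subsequent filters reuse, which would pay off over the 10 iterations here. A sketch of how to check this:

indices(my_data_table)                 # lists any indices built so far
options(datatable.auto.index = FALSE)  # disable auto-indexing to compare
system.time(my_data_table[colors == 'skyblue3'])
options(datatable.auto.index = TRUE)   # restore the default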

Filtering by a column of type double

Now, how does this look when I filter on a column of type “double” (e.g., x)?

filter_benchmark <- mark(
  my_tibble %>% filter(x > 1),
  my_tibble[my_tibble$x > 1, ],
  my_data_table %>% filter(x > 1),
  my_data_table[x > 1],
  my_data_frame %>% filter(x > 1),
  my_data_frame[my_data_frame$x > 1, ],
  iterations = 10, check = FALSE, time_unit = "ms")

plot_benchmark(filter_benchmark)

In this case, the results appeared very similar to those obtained when filtering by a “factor”: data.table was still faster, but used more memory than a tibble.

Speeding up things with dtplyr

So far, it seems that using dplyr’s filter() function is both slower and less memory efficient than using a simple mask. Moreover, it seems to hurt the filtering speed of the data.table.

Enter dtplyr, a library that allows you to use the dplyr grammar and automagically translates the code to the equivalent (but faster) data.table code!
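
A quick illustration of how the translation works, using dtplyr’s lazy_dt() wrapper and show_query() (a sketch; the exact `_DT1` name printed will differ on your machine):

library(dtplyr)
lazy_tbl <- lazy_dt(my_data_table)          # wrap the data.table lazily
lazy_tbl %>% filter(x > 1) %>% show_query() # print the generated call
#`_DT1`[x > 1]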

Now, let’s load that library and see whether filter() performs better. For this I’ll reuse the type-“double” filtering example:

library(dtplyr)
filter_benchmark <- mark(
  my_tibble %>% filter(x > 1),
  my_tibble[my_tibble$x > 1, ],
  my_data_table %>% filter(x > 1),
  my_data_table[x > 1],
  my_data_frame %>% filter(x > 1),
  my_data_frame[my_data_frame$x > 1, ],
  iterations = 10, check = FALSE, time_unit = "ms")

plot_benchmark(filter_benchmark)

Now the filter() function gets translated to the most efficient code for data.table filtering. At first glance, it would appear that my_data_table %>% filter() now achieves much better performance. However, when inspecting the object it returns, here is what we see:

my_data_table %>% filter(x>2)
#Source: local data table [22,961 x 4]
#Call:   `_DT1249`[x > 2]
#
#      x      y letters colors
#  <dbl>  <dbl> <fct>   <chr>
#1  2.04 -0.104 y       bisque
#2  2.99  0.453 m       gray30
#3  2.43  0.821 r       darkolivegreen1
#4  2.48  1.83  q       antiquewhite1
#5  3.97  2.26  m       lightgoldenrodyellow
#6  3.27 -0.374 e       red
## … with 22,955 more rows
## Use as.data.table()/as.data.frame()/as_tibble() to access results

This looks like an unfair comparison! To make the result usable, it has to be converted back to a data.table. So, to make things more objective, I added as.data.table() to the benchmark:

filter_benchmark <- mark(
  my_tibble %>% filter(x > 1),
  my_tibble[my_tibble$x > 1, ],
  as.data.table(my_data_table %>% filter(x > 1)),
  my_data_table[x > 1],
  my_data_frame %>% filter(x > 1),
  my_data_frame[my_data_frame$x > 1, ],
  iterations = 10, check = FALSE, time_unit = "ms")

plot_benchmark(filter_benchmark)
identical(as.data.table(my_data_table %>% filter(x>1)), my_data_table[x>1])
#[1] TRUE

And now things look as they did previously, with the mask [x>1] achieving better performance.

Extracting data from a column

Now, let’s see what performs best when we want to extract a column of our data as a vector, using dplyr’s pull() function vs the traditional $ selector:

pull_column_benchmark <- mark(
  my_tibble %>% pull(colors),
  my_tibble$colors,
  my_data_table %>% pull(colors),
  my_data_table$colors,
  my_data_frame %>% pull(colors),
  my_data_frame$colors,
  iterations = 100, check = FALSE, time_unit = "ms")

plot_benchmark(pull_column_benchmark)
identical(my_data_table %>% pull(colors),my_data_table$colors)
#[1] TRUE

Not surprisingly, executing this barely needs to allocate memory (<1 Kb in all cases). Surprisingly, the pull() function was much slower (relatively speaking, the units are still milliseconds) than the $ selector.
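
For reference, pull() extracts the same vector as [[ or $, and it also accepts column positions (including negative ones), which $ does not. A small sketch:

identical(my_tibble %>% pull(colors), my_tibble[["colors"]])
#[1] TRUE
my_tibble %>% pull(-1)  # the last column; -1 is pull's default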

What about selecting a column from the data with select() and its equivalent [, "colors"]?

select_column_benchmark <- mark(
  my_tibble %>% select(colors),
  my_tibble[, "colors"],
  my_data_table %>% select(colors),
  my_data_table[, "colors"],
  my_data_frame %>% select(colors),
  my_data_frame[, "colors", drop = FALSE],
  iterations = 100, check = FALSE, time_unit = "ms")
plot_benchmark(select_column_benchmark)

Similarly to pull(), select() performed worse than vanilla R’s []. Interestingly, data.table appears to struggle here in terms of both memory and speed. Note to self: Investigate why.
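
One hypothesis (unverified) is that data.frame-style selection with a quoted column name is simply not data.table’s native idiom; its documented list-style forms could be benchmarked against it, as in this sketch:

mark(
  my_data_table[, "colors"],
  my_data_table[, .(colors)],     # data.table's list-style selection
  my_data_table[, list(colors)],  # the same, spelled out
  iterations = 100, check = FALSE, time_unit = "ms")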

Ordering data

Next, I wanted to see if there was any significant performance difference between sorting data with dplyr’s arrange() and base R’s order() across the different data structures.

For the first experiment, I ordered the data based on “y”, which is a column of type “double”:

arrange_benchmark <- mark(
  my_tibble %>% arrange(y),
  my_tibble[order(my_tibble$y), ],
  my_data_table[order(y)],
  as.data.table(my_data_table %>% arrange(y)),
  my_data_frame[order(my_data_frame$y), ],
  my_data_frame %>% arrange(y),
  iterations = 5, check = FALSE, time_unit = "ms")
plot_benchmark(arrange_benchmark)

Surprisingly, it appears that data.table performs worse than tibble and data.frame. It is interesting to see that, for data.table, casting with as.data.table() + arrange() has the same performance as order().

What about ordering data based on a column of type character such as “colors”?

arrange_benchmark <- mark(
  my_tibble %>% arrange(colors),
  my_tibble[order(my_tibble$colors), ],
  my_data_table[order(colors)],
  as.data.table(my_data_table %>% arrange(colors)),
  my_data_frame[order(my_data_frame$colors), ],
  my_data_frame %>% arrange(colors),
  iterations = 2, check = FALSE, time_unit = "ms")
plot_benchmark(arrange_benchmark)

Interestingly, data.table is considerably faster when sorting by a column of type “character”! Note to self: Investigate why.
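
A plausible explanation (an assumption, not verified here): base order() only auto-selects the fast radix sort for numeric, integer, logical and factor vectors, falling back to shell sort for character vectors, whereas data.table always radix-sorts internally (in the C locale). Forcing radix sort in base R should narrow the gap, as in this sketch:

mark(
  my_data_frame[order(my_data_frame$colors, method = "radix"), ],
  my_data_table[order(colors)],
  iterations = 2, check = FALSE, time_unit = "ms")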

Speeding up ordering in data.table with setkey()

I was very unhappy when looking at the performance of data.table when sorting the data by a number. Luckily, data.table has the functions setkey() and setorder(), which can be used to sort the data. These functions order the data by reference, which makes the task considerably faster, so it is probably unfair to compare them against the other methods. This is how the different functions compare, though:

my_data_table2 <- data.table(
  x = rnorm(1e6, 0, 1),
  y = rnorm(1e6, 0, 1),
  letters = as.factor(sample(letters, 1e6, replace = TRUE)),
  colors = sample(colors(100), 1e6, replace = TRUE))

arrange_benchmark <- mark(
  my_tibble %>% arrange(y),
  my_tibble[order(my_tibble$y), ],
  my_data_table[order(y)],
  as.data.table(my_data_table %>% arrange(y)),
  setkey(my_data_table2, y),
  my_data_frame[order(my_data_frame$y), ],
  my_data_frame %>% arrange(y),
  iterations = 1, check = FALSE, time_unit = "ms")
plot_benchmark(arrange_benchmark)

I knew it was fast, but I didn’t know it was this fast! This insight will really lead me to change quite a bit of the code that I have around.
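
A side note: setorder() also sorts by reference, but without setting a key, and it supports descending order (a minimal sketch):

setorder(my_data_table2, -y)  # in-place sort by y, descending
head(my_data_table2)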

Modifying data

Finally, I checked the performance of changing (mutating) the data. There are many ways to achieve this, but the following expressions are the ones I used:

mutate_benchmark <- mark(
  my_tibble <- my_tibble %>% mutate(x2 = x + 2),
  my_tibble$x2 <- my_tibble$x + 2,
  my_data_table <- as.data.table(my_data_table %>% mutate(x2 = x + 2)),
  my_data_table[, x2 := x + 2],
  my_data_frame <- my_data_frame %>% mutate(x2 = x + 2),
  my_data_frame$x2 <- my_data_frame$x + 2,
  iterations = 10, check = FALSE, time_unit = "ms")
plot_benchmark(mutate_benchmark)

Again, data.table did not disappoint (nor did the humble base R data.frame).
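
For completeness, data.table also provides set(), which adds or modifies a column by reference while skipping most of the [.data.table overhead (a sketch; x3 is just an illustrative new column):

set(my_data_table, j = "x3", value = my_data_table$x + 3)
head(my_data_table)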

Conclusions

I really wish I had come away with a more concise takeaway from these experiments. All in all, the dplyr functions appear to be slower than base R functions such as order() and the [] and $ selectors. However, these functions do improve the readability of the code, and in many cases the differences in performance are probably irrelevant.

I was surprised that there was no clear winner in terms of performance between tibble, data.frame and data.table for the few scenarios shown here. If I had to pick one, I would say that data.table had the best performance, at least for sorting data by a column of type “character” or when using setkey().

Anyway, I learnt quite a bit by playing around with the different functions. If you read this and know how to improve the performance of any of these tasks, I’ll be happy to hear about it.
