To data.table or dplyr? That is the question.
Programming languages are ultimately human constructs. They hold sway when they are utilised en masse, like Python for machine learning, and user factions can emerge when very different ways of doing the same thing coexist.
In R, this can manifest in the form of data.table versus dplyr debates.
Both R packages contain a comprehensive stack of functions for data wrangling. The tidyverse dplyr approach emphasises code readability, whilst data.table scales complex manipulations of large datasets very efficiently. You can compare the efficiency of data.table versus other data wrangling packages on large datasets here.

Whilst I prefer to use dplyr on small datasets, where data.table efficiency gains are negligible, I recommend using data.table when working with large datasets (for example, when grouping on datasets with more than ~0.5 million rows).
Let’s explore this for ourselves.
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               lobstr,         # Trace objects in memory
               ids,            # Generate random ids
               DT,             # Create interactive tables
               tidyverse,
               data.table,
               microbenchmark)
Imagine you have a dataset describing how students are engaging with online courses:
We can interactively examine the first 20 rows of the dataset using the R package DT.

Note: The code used to create this dataset can be accessed from my GitHub repository here.
Convert into a data.table object

In R, datasets exist as data.frame type objects. To apply data.table functions on a dataset, we need to convert a data.frame into a data.table object using setDT().

This function is flexible as it converts a data.frame by reference (i.e. without creating a duplicate data.table copy) and assigns both data.table and data.frame classes to the converted object.
# Convert data frame into data.table -------------------------------------------
class(student_courses)
#> [1] "tbl_df" "tbl" "data.frame"
mem_used()
#> 222,639,352 B
# Track object assignment in memory
tracemem(student_courses)
#> [1] "<0000022B07B7AB70>"
setDT(student_courses) # data.table is assigned to a new location in memory
#> tracemem[0x0000022b07b7ab70 -> 0x0000022b047a8478]: as.list.data.frame as.list vapply vapply_1i setDT
untracemem(student_courses)
mem_used()
#> 242,674,176 B
# Note that computer memory has not doubled following setDT()
class(student_courses)
#> [1] "data.table" "data.frame"
data.table query structure

A data.table query is structured in the form DT[i, j, by], where:

- i is the rows placeholder, used to subset or reorder rows.
- j is the columns placeholder, used to select columns or compute on them.
- by is the grouping placeholder, used to group operations by one or more columns.
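For example, a single query can combine all three placeholders at once. The sketch below uses a small toy stand-in for student_courses (the real dataset comes from the post's repository):

```r
library(data.table)

# Toy stand-in for the student_courses data.table (hypothetical values)
student_courses <- data.table(
  id       = 1:6,
  platform = c("A", "A", "B", "A", "C", "B"),
  course   = c("R_beginner", "R_beginner", "R_advanced",
               "Python_beginner", "R_beginner", "Python_advanced")
)

# i filters rows, j computes a new column, by groups:
# count rows (.N) per course, for platform "A" only
student_courses[platform == "A", .(n_students = .N), by = course]
```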
data.table efficiency gains

There are a few reasons why data.table operations are fast:

- data.table automatically creates a secondary index (or key) of the columns used to subset data, so that subsequent operations on the same column are much faster.
- data.table has a much faster order() function, which is also utilised for the evaluation of groupings.
- You can use := to add, delete or modify columns in place, which is a faster alternative to R's default copy-on-modify behaviour.

# Create data.frame and data.table objects -------------------------------------
df <- data.frame(id = 1:5,
                 letter = letters[1:5])
dt <- as.data.table(df)
# Update data.frame column using copy-on-modify --------------------------------
df_2 <- df %>%
mutate(letter = toupper(letter))
# The new data frame is a shallow copy of the original data frame as
# only modified columns are newly created in memory.
ref(df, df_2)
#> o [1:0x22b03b6c998] <df[,2]>
#> +-id = [2:0x22b7f3f6350] <int>
#> \-letter = [3:0x22b0c54b328] <chr>
#> o [4:0x22b03b90e78] <df[,2]>
#> +-id = [2:0x22b7f3f6350]
#> \-letter = [5:0x22b05f9b890] <chr>
# Update data.table column by reference ----------------------------------------
obj_addr(dt)
#> [1] "0x22b016ed030"
dt[, letter := toupper(letter)]
obj_addr(dt)
#> [1] "0x22b016ed030"
Note: You do not need to assign datasets to names when modifying by reference using the := operator.
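A corollary is that := mutates the original object. If you need to keep the original intact, data.table provides copy() to take a deep copy first. A minimal sketch, using a toy data.table:

```r
library(data.table)

dt <- data.table(id = 1:3, letter = c("a", "b", "c"))

dt_backup <- copy(dt)            # deep copy, stored at a new memory address
dt[, letter := toupper(letter)]  # modifies dt in place, no assignment needed

dt$letter        # "A" "B" "C"
dt_backup$letter # "a" "b" "c" - the copy is unaffected
```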
Filtering data using dplyr

The basic syntax for filtering data is very similar for dplyr and data.table.
# Filter student_courses using dplyr -------------------------------------------
# Filter by platform A
student_courses %>%
filter(platform == "A")
# Filter by all platforms excepting A
student_courses %>%
filter(platform != "A")
# Filter by platforms A and C
student_courses %>%
filter(platform %in% c("A", "C"))
# Using a comma is a substitute for the condition 'and'
student_courses %>%
filter(id == "00007f23",
between(platform_start_date, "2017-01-01", "2017-12-31"))
# Filter by a variable using regex
student_courses %>%
filter(str_detect(course, "^R_"))
In dplyr version >= 1.0.0, the functions if_any() and if_all() can be incorporated to filter across multiple columns.
# Filter across multiple columns using dplyr -----------------------------------
# Filter across all date columns where all columns are TRUE
student_courses %>%
filter(if_all(ends_with("_date"), ~between(., "2017-01-01", "2017-12-31")))
# Filter across all date columns where at least one column is TRUE
student_courses %>%
filter(if_any(ends_with("_date"), ~between(., "2017-01-01", "2017-12-31")))
Note: The function between() is equivalent to lower <= x & x <= upper when incbounds = TRUE and lower < x & x < upper when incbounds = FALSE.
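A quick illustration of the incbounds argument of data.table's between():

```r
library(data.table)

# Bounds are included by default (incbounds = TRUE)
between(5, 1, 5)                     # TRUE:  1 <= 5 & 5 <= 5
between(5, 1, 5, incbounds = FALSE)  # FALSE: strict inequality excludes the upper bound
```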
Filtering data using data.table

A minor difference is that data.table also contains a list of helper functions with optimised performance for filtering on specific data types like characters or integers.
# Filter student_courses using data.table --------------------------------------
# Filter by platform A
student_courses[platform == "A"]
# Filter by all platforms excepting A
student_courses[platform != "A"]
# Operator %chin% is equivalent to but faster than %in%
student_courses[platform %chin% c("A", "C")]
# Operator %between% or data.table::between() searches across a range of values
student_courses[id == "00007f23" &
platform_start_date %between% c("2017-01-01", "2017-12-31")]
# Operator %like% allows you to search for a pattern in a character vector
student_courses[course %like% "R_"]
Currently, dplyr offers more versatility than data.table in terms of filtering across multiple columns. A readable data.table equivalent to if_all() does not currently exist, and the equivalent to if_any() can be constructed using get() to return the values of multiple columns.
# Filter across multiple columns using data.table ------------------------------
# Filter across all date columns where at least one column is TRUE
date_cols <- str_subset(colnames(student_courses), "_date$")
student_courses[get(date_cols) %between% c("2017-01-01", "2017-12-31"), .SDcols = date_cols]
# get(date_cols) == condition is equivalent to if_any(condition) == TRUE
You can use the R package microbenchmark to measure code performance. The function microbenchmark() runs each expression 100 times by default (via the argument times = 100L) and outputs summary statistics on how long it takes to evaluate a single expression.
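For instance, the filter comparison can be benchmarked like this. The sketch below builds a toy stand-in for student_courses so it runs on its own:

```r
library(data.table)
library(dplyr)
library(microbenchmark)

# Toy stand-in for student_courses (the real dataset lives in the post's repo)
student_courses <- data.table(id       = 1:1000,
                              platform = sample(c("A", "B", "C"), 1000, replace = TRUE))

benchmark <- microbenchmark(
  dplyr      = filter(student_courses, platform == "A"),
  data.table = student_courses[platform == "A"],
  times = 100L
)
summary(benchmark)  # min, lq, mean, median, uq, max per expression
```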
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
filter(student_courses, platform == "A") | 9.0241 | 9.89670 | 17.45452 | 10.77120 | 13.19065 | 188.7689 | 100 |
student_courses[platform == "A"] | 8.4162 | 9.12350 | 11.79773 | 9.60965 | 10.48005 | 35.0547 | 100 |
filter(student_courses, platform %in% c("A", "C")) | 12.1439 | 13.05705 | 20.04014 | 14.65670 | 33.03505 | 41.0345 | 100 |
student_courses[platform %chin% c("A", "C")] | 13.4523 | 14.65795 | 19.35104 | 15.47045 | 16.69780 | 46.3068 | 100 |
filter(student_courses, str_detect(course, "^R_")) | 83.2341 | 88.54035 | 97.67838 | 91.93555 | 102.11460 | 214.3023 | 100 |
student_courses[course %like% "R_"] | 53.7119 | 57.01275 | 62.96518 | 58.66400 | 62.27945 | 224.8972 | 100 |
dplyr_filter_if_any | 34.7018 | 37.33390 | 49.85545 | 41.14285 | 59.23520 | 215.8179 | 100 |
data.table_filter_get_cols | 13.6371 | 14.82760 | 21.16597 | 15.69680 | 17.09135 | 186.9853 | 100 |
Sorting data using dplyr

Sorting a data frame can be computationally expensive when multiple variables need to be ranked. This is why I recommend sorting your dataset once, right after basic data cleaning operations have been performed.
# Sort student_courses using dplyr ---------------------------------------------
student_courses %>%
arrange(course_start_date)
student_courses %>%
arrange(platform,
id,
desc(platform_start_date))
# Sorting by a descending date ranks the most recent date as first
Sorting data using data.table

In data.table, sorting is also performed inside i of DT[i, j, by]. Prefixing a variable with the - operator sorts it in descending order.
# Sort student_courses using data.table ----------------------------------------
student_courses[order(course_start_date)]
student_courses[order(platform,
id,
-platform_start_date)]
# You can also order columns in place using setorder()
setorder(student_courses,
id,
platform_start_date,
course_start_date)
The benchmark below shows that order() from data.table sorts data much faster than arrange() when ranking multiple variables, although arrange() is slightly faster for a single-column sort.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
arrange(student_courses, course_start_date) | 27.2534 | 35.2252 | 44.51070 | 49.7276 | 52.7478 | 61.6616 | 25 |
student_courses[order(course_start_date)] | 54.8829 | 62.0147 | 70.32617 | 69.7907 | 72.0628 | 96.8317 | 25 |
arrange(student_courses, platform, id, desc(platform_start_date)) | 5605.0036 | 5654.6456 | 5702.51392 | 5682.9645 | 5712.8049 | 6182.9437 | 25 |
student_courses[order(platform, id, -platform_start_date)] | 119.1436 | 125.0282 | 134.31395 | 130.4740 | 144.8021 | 160.8236 | 25 |
Selecting columns using dplyr

In dplyr, performing operations on a tibble will always return another data frame, unless you explicitly use pull() to extract a column as a vector.
# Select column(s) using dplyr -------------------------------------------------
student_courses %>%
select(id)
student_courses %>%
select(c(id, platform, course))
# Select column(s) using regex
student_courses %>%
select(contains("date", ignore.case = F))
# Output data.frame with select() ----------------------------------------------
student_courses %>%
select(id) %>%
class()
#> [1] "data.table" "data.frame"
# Output vector with pull() ----------------------------------------------------
student_courses %>%
pull(id) %>%
class()
#> [1] "character"
student_courses %>%
.[["id"]] %>%
class()
#> [1] "character"
Selecting columns using data.table

In data.table, column selection is performed inside j of DT[i, j, by], and returns a data.table if you wrap column names inside a list.
# Select column(s) using data.table --------------------------------------------
student_courses[,
.(id)]
student_courses[,
.(id,
platform,
course)]
# Select column(s) using regex
str_subset(colnames(student_courses), "date")
#> [1] "platform_start_date" "platform_end_date" "course_start_date" "course_end_date"
student_courses[,
str_subset(colnames(student_courses), "date"),
with = F]
# Output data frame by wrapping column names inside a list ---------------------
class(student_courses[, .(id)])
#> [1] "data.table" "data.frame"
# Output vector with [[x]] -----------------------------------------------------
class(student_courses[, id])
#> [1] "character"
class(student_courses[["id"]])
#> [1] "character"
Note: In data.table, .(cols) is just a shorthand for list(cols) and is used to list columns in the j placeholder.
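The two spellings are interchangeable, as this minimal sketch with a toy data.table shows:

```r
library(data.table)

dt <- data.table(id = 1:3, letter = c("a", "b", "c"))

# .(letter) is shorthand for list(letter); both return a one-column data.table
dt[, .(letter)]
dt[, list(letter)]
```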
Interestingly, the benchmark below shows that dplyr is slightly faster than data.table for column selections.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
select(student_courses, c(id, platform, course)) | 1.8944 | 2.32380 | 2.633437 | 2.48395 | 2.82525 | 5.3393 | 100 |
student_courses[, .(id, platform, course)] | 4.3514 | 5.12335 | 14.512512 | 7.25425 | 24.44105 | 136.2222 | 100 |
select(student_courses, contains("date", ignore.case = F)) | 1.2884 | 1.55195 | 1.789318 | 1.72585 | 1.97540 | 3.0404 | 100 |
student_courses[, grep("date", names(student_courses), value = T), with = F] | 1.3837 | 1.75865 | 7.824120 | 2.72795 | 3.90910 | 128.7633 | 100 |
Creating columns using dplyr

As dplyr objects are data frames, a shallow data.frame copy is created whenever a column is modified using mutate().
# Create column(s) using dplyr -------------------------------------------------
# Create new columns from existing variables
student_courses %>%
mutate(platform_dwell_length = platform_end_date - platform_start_date,
platform_start_year = as.numeric(str_extract(platform_start_date, "^.{4}(?!//-)")))
# Create column(s) with multiple conditions using dplyr ------------------------
str_subset(unique(student_courses$course), "^R_")
#> [1] "R_beginner" "R_advanced" "R_intermediate"
str_subset(unique(student_courses$course), "^Python_")
#> [1] "Python_intermediate" "Python_advanced" "Python_beginner"
student_courses %>%
mutate(studied_programming = case_when(str_detect(course, "^R_") ~ "Studied R",
str_detect(course, "^Python_") ~ "Studied Python",
TRUE ~ "No"))
In dplyr version >= 1.0.0, you can use mutate() in combination with across() to apply transformations across one or multiple columns.
# Create multiple columns using dplyr ------------------------------------------
# across() accepts function names or ~function(.x) syntax
student_courses %>%
mutate(across(c(id, platform, course), ~toupper(.x)))
# where() can be used inside across() to select columns by type
student_courses %>%
mutate(across(where(is.character), ~toupper(.x)))
In contrast to mutate(), transmute() only returns the transformed column(s) of interest.
Creating columns using data.table

Data frame outputs are slightly different in data.table:

- Use := to add, delete or modify columns in place.
- Use = inside .() (instead of :=) to extract only the columns transformed inside j of DT[i, j, by].

# Create column(s) using data.table --------------------------------------------
student_courses[,
`:=` (platform_dwell_length = platform_end_date - platform_start_date,
platform_start_year = str_extract(platform_start_date, "^.{4}(?!//-)"))]
student_courses[,
platform_start_year := as.numeric(platform_start_year)]
# Create column(s) with multiple conditions using data.table -------------------
student_courses[,
studied_programming := fcase(
str_detect(course, "^R_"), "Studied R",
str_detect(course, "^Python_"), "Studied Python",
default = "No")]
# Remove newly created columns using data.table --------------------------------
# Column(s) can be removed by assignment as NULL variable(s)
student_courses[,
c("platform_dwell_length",
"platform_start_year",
"studied_programming") := NULL]
In data.table, the solution to creating multiple columns is to specify the columns of interest in .SDcols and then loop through each column using lapply().
# Create multiple columns using data.table -------------------------------------
# (col_vector) or c("x", "y") := lapply(...) modifies all columns in place
cols <- c("id", "platform", "course")
student_courses[,
(cols) := lapply(.SD, toupper),
.SDcols = cols]
# Equivalent code
student_courses[,
c("id", "platform", "course") := lapply(.SD, toupper),
.SDcols = c("id", "platform", "course")]
# Use subassignment with `=` to return only transformed columns ----------------
dt_subset <- student_courses[,
.(id = toupper(id),
platform = toupper(platform),
course = toupper(course))]
ncol(dt_subset)
#> [1] 3
# Equivalent code using .SDcols and lapply()
dt_subset <- student_courses[,
lapply(.SD, toupper),
.SDcols = cols]
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
mutate(student_courses, platform_dwell_length = platform_end_date - platform_start_date) | 5.0602 | 6.90770 | 25.55252 | 8.26925 | 9.68500 | 173.8566 | 100 |
student_courses[, `:=` ("platform_dwell_length", platform_end_date - platform_start_date)] | 3.6370 | 5.19895 | 25.54544 | 6.33755 | 7.82895 | 171.3265 | 100 |
dplyr_case_when | 225.6839 | 249.70800 | 269.89679 | 267.90020 | 279.05350 | 458.1452 | 100 |
data.table_fcase | 195.9269 | 208.36720 | 221.09485 | 213.40350 | 227.35105 | 377.5954 | 100 |
dplyr_mutate_across | 371.1676 | 385.88140 | 404.99245 | 396.43500 | 407.44485 | 499.2400 | 100 |
data.table_lapply_SDcols | 366.4740 | 386.23370 | 405.55840 | 392.76045 | 411.39315 | 497.5348 | 100 |
Summarising data using dplyr

Summarising by group is also where data.table significantly outperforms dplyr. A grouping is specified using the group_by() function in dplyr.
# Find total number of courses per student via dplyr ---------------------------
student_courses %>%
group_by(id) %>%
summarise(total_courses = n()) %>%
ungroup()
# Code above is also equivalent to using count()
student_courses %>%
count(id)
# Find total number of distinct courses per student via dplyr ------------------
student_courses %>%
group_by(id) %>%
summarise(total_distinct_courses = n_distinct(course)) %>%
ungroup()
# Find the first course studied per student and platform via dplyr -------------
student_courses %>%
group_by(id, platform) %>% # Group by two variables
filter(row_number() == 1L) %>% # Return the first row from each group
ungroup()
Summarising data using data.table

In data.table, a grouping is specified inside the by placeholder of DT[i, j, by].
# Find total number of courses per student via data.table ----------------------
student_courses[,
.(total_courses = .N),
by = id]
# Find total number of distinct courses per student via data.table -------------
student_courses[,
.(total_distinct_courses = length(unique(course))),
by = id]
# uniqueN(x) is a data.table function equivalent to length(unique(x))
student_courses[,
.(total_distinct_courses = uniqueN(course)),
by = id]
# Find the first course studied per student and platform via data.table --------
student_courses[,
.SD[1L],
by = .(id, platform)]
Group by operations are significantly faster in data.table than in dplyr.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
student_courses %>% group_by(id) %>% summarise(total_courses = n()) | 2253.7473 | 2416.8814 | 2848.64304 | 2498.8589 | 3270.4622 | 4162.8841 | 10 |
student_courses[, .(total_courses = .N), by = id] | 46.7541 | 50.0241 | 60.37993 | 53.6803 | 70.2789 | 83.9196 | 10 |
student_courses %>% group_by(id) %>% summarise(total_distinct_courses = n_distinct(course)) | 4025.6523 | 4237.2057 | 4937.93044 | 4647.9254 | 5331.3878 | 6326.4315 | 10 |
student_courses[, .(total_distinct_courses = length(unique(course))), by = id] | 847.1269 | 879.6714 | 1034.19212 | 971.1114 | 1218.1229 | 1356.3699 | 10 |
student_courses[, .(total_distinct_courses = uniqueN(course)), by = id] | 6262.0236 | 6343.8344 | 7207.72692 | 6594.7426 | 7264.5609 | 10044.3720 | 10 |
student_courses %>% group_by(id, platform) %>% filter(row_number() == 1L) | 7872.6713 | 8103.4138 | 9778.25568 | 9260.2453 | 10907.8334 | 13687.8050 | 10 |
student_courses[, .SD[1L], by = .(id, platform)] | 69.1016 | 71.2565 | 78.52564 | 74.1029 | 78.2506 | 120.5737 | 10 |
Note: Not all data.table functions outperform their base R or dplyr equivalents. The data.table uniqueN(x) function is much slower than length(unique(x)).
Most data.table operations significantly outperform their dplyr equivalents in computational speed. I use data.table when grouping on large datasets (i.e. datasets with more than ~0.5 million rows) and dplyr for day-to-day analyses of smaller datasets.
- A Stack Overflow discussion about the best use cases for data.table versus dplyr.
- A great side-by-side comparison of data.table versus dplyr functions from a blog post by Atrebas.
- A list of advanced data.table operations and tricks from a blog post by Andrew Brooks.
- An explanation of how data.table modifies by reference from a blog post by Tyson Barrett.
- A benchmark of dplyr versus data.table functions from a blog post by Tyson Barrett.