Things get querysome and querysome.
This post is an extension of my comparison of simple R `data.table` versus `dplyr` functions.
```r
# Load required packages --------------------------------------------------------
if (!require("pacman")) install.packages("pacman")

pacman::p_load(here,
               ids, # Generate random IDs
               tidyverse,
               data.table,
               microbenchmark,
               DT)
```
Imagine that you have a dataset describing how students are engaging with online courses. The first 12 rows of the dataset can be examined interactively below.

Note: The code used to create this dataset can be accessed from my GitHub repository here.
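If you want a runnable stand-in while following along, a minimal sketch of the dataset structure is below. The column names match those used in the queries, but the generation logic is my assumption; the real code lives in the repository linked above.

```r
# A minimal sketch of the dataset structure; treat the values as illustrative
# only, as the real generation code lives in the linked repository
set.seed(111)
n <- 1000

student_courses <- data.table(
  id = sample(ids::random_id(100, bytes = 4), n, replace = TRUE),
  platform = sample(LETTERS[1:5], n, replace = TRUE),
  course = sample(c("fitness_training", "website_design", "metal_welding"),
                  n, replace = TRUE),
  platform_start_date = as.Date("2020-01-01") + sample(0:365, n, replace = TRUE))

student_courses[, platform_end_date := platform_start_date + sample(30:90, n, replace = TRUE)]

setorder(student_courses, id, platform_start_date) # Later queries assume this sort order
```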
#### dplyr

Imagine that you would like to subset the fitness training courses from platforms C and E, and then create a column to denote that these were discounted courses. In `dplyr`, this can be written as a single block of code, using the `%>%` pipe to separate each functional step.
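A sketch of that single block is below; the object name `dplyr_query_1` is an assumption, chosen to match the `dplyr_query_1_code` entry in the benchmark table further down.

```r
# Filter and create a new column using dplyr -------------------------------------
dplyr_query_1 <- student_courses %>%
  filter(course == "fitness_training",
         platform %in% c("C", "E")) %>%
  mutate(percentage_discount = 5)
```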
#### data.table
In `data.table`, performing these two operations in separate steps or in a single step produces different outputs.¹
```r
# Filter and create a new column using data.table in separate steps --------------
# %chin% is data.table's fast match operator for character vectors
dt_query_1 <- student_courses[(course == "fitness_training")
                              & (platform %chin% c("C", "E"))] %>%
  .[, percentage_discount := 5]
```
id | platform | course | percentage_discount |
---|---|---|---|
00007f23 | E | fitness_training | 5 |
0007ca33 | E | fitness_training | 5 |
000899ab | E | fitness_training | 5 |
0011ed00 | C | fitness_training | 5 |
In `data.table`, applying the filter and the column transformation in a single step retains all dataset rows and only applies the transformation to rows where the filtering condition is `TRUE`.
```r
# Filter and create a new column using data.table in a single step ---------------
dt_query_1_wrong <- student_courses[(course == "fitness_training")
                                    & (platform %chin% c("C", "E")),
                                    percentage_discount := 5]
```
id | platform | course | percentage_discount |
---|---|---|---|
00007f23 | D | metal_welding | NA |
00007f23 | E | website_design | NA |
00007f23 | E | fitness_training | 5 |
00007f23 | A | metal_welding | NA |
There is only a slight speed advantage in using `data.table` over `dplyr` here, as our queries do not involve sorting or group by operations.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_1_code | 14.4775 | 15.24285 | 19.671913 | 16.54145 | 18.86065 | 53.0508 | 100 |
dt_query_1_code | 7.4791 | 8.11775 | 9.768044 | 8.59610 | 9.42310 | 39.4557 | 100 |
dt_query_1_wrong_code | 6.0523 | 6.56240 | 8.491838 | 7.18205 | 7.80735 | 30.4918 | 100 |
A simple introduction to group by operations has already been covered here. This post further explores how different outputs can be obtained by modifying `data.table` group by operations in different ways.

Imagine that you are interested in the total number of days each student has spent on an online platform. Could you obtain this by grouping on `id` and summing the total number of days spent on a platform?
#### dplyr
```r
# Calculate total platform_length per student using dplyr ------------------------
student_courses <- student_courses %>%
  mutate(platform_length = platform_end_date - platform_start_date,
         platform_length = as.integer(platform_length))

dplyr_query_2 <- student_courses %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()
```
Note: In `dplyr`, group by operations should be closed with `ungroup()` to remove the object metadata that marks row groupings.
#### data.table
In `data.table`, you can choose which variable(s) to group by using `by` or `keyby`. The additional effect of `keyby` is that it also orders the results and creates a secondary key for faster subsequent subsetting. This is useful if you intend to create multiple features from the same grouping.
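A sketch of the grouped query is below; the object name `dt_query_2` is an assumption, chosen to match the `dt_query_2_code` entry in the benchmark table further down.

```r
# Calculate total platform_length per student using data.table -------------------
dt_query_2 <- student_courses[,
                              .(total_days = sum(platform_length),
                                min_days = min(platform_length),
                                median_days = median(platform_length),
                                max_days = max(platform_length)),
                              keyby = id]
```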
The problem is that this solution overestimates the total number of days spent on an online platform per student, as some students take multiple courses on the same platform. These student records will contain rows with the same platform dwell length but different course start and end dates.²
To remove duplicate `platform_length` values, we first identify them by concatenating `id`, `platform` and `platform_start_date`, and counting the total number of rows per concatenation.
#### dplyr
```r
# Identify duplicate rows using dplyr ---------------------------------------------
student_courses %>%
  mutate(platform_key = str_c(id, platform, platform_start_date, sep = "-")) %>%
  count(platform_key, name = "row_number") %>%
  count(row_number, name = "total_students")
```
#### data.table
In `data.table`, `.SD` means 'subset of data' and is used to reference the current sub-table of interest.
```r
# Identify duplicate rows using data.table ----------------------------------------
student_courses[,
                .(platform_key = do.call(str_c, c(.SD, sep = "-"))),
                .SDcols = c("id", "platform", "platform_start_date")] %>%
  .[,
    .(row_number = .N),
    by = platform_key] %>%
  .[,
    .(total_students = .N),
    keyby = row_number] # Use keyby as we also want to sort by row_number
```
Note: The base R function `do.call()` constructs and executes a function call from a name or a function and a list of function arguments.
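For example, the two calls below are equivalent; the literal strings are just illustrative inputs.

```r
# do.call() builds the call str_c("a", "b", "c", sep = "-") from a list of arguments
do.call(str_c, list("a", "b", "c", sep = "-"))
#> [1] "a-b-c"

str_c("a", "b", "c", sep = "-")
#> [1] "a-b-c"
```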
Given that duplicate `platform_length` records exist, you would use a two-step process to extract the total `platform_length` per student:

1. Group by `id`, `platform` and `platform_start_date`, and extract the first row of each group.
2. Group by `id` and aggregate for the `platform_length` calculations.

#### dplyr
```r
# Calculate total platform_length per student using dplyr ------------------------
dplyr_query_3 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()

summary(dplyr_query_3 == dplyr_query_2)
#>     id          total_days      min_days        median_days     max_days
#>  Mode:logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
#>  TRUE:144676    FALSE:14674     FALSE:22        FALSE:9745      FALSE:24
#>                 TRUE :130002    TRUE :144654    TRUE :134931    TRUE :144652
```
#### data.table
```r
# Calculate total platform_length per student using data.table -------------------
dt_query_3 <- student_courses[,
                              .SD[1L],
                              by = .(id, platform, platform_start_date)] %>%
  .[,
    .(total_days = sum(platform_length),
      min_days = min(platform_length),
      median_days = median(platform_length),
      max_days = max(platform_length)),
    keyby = id]

summary(dt_query_3 == setDT(dplyr_query_3))
#>     id         total_days     min_days       median_days    max_days
#>  Mode:logical  Mode:logical   Mode:logical   Mode:logical   Mode:logical
#>  TRUE:144676   TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676
```
The `data.table` solutions are significantly faster here, as these queries require group by operations.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_2_code | 5246.0321 | 7057.7399 | 7109.48259 | 7170.6375 | 7452.5706 | 7857.1419 | 25 |
dt_query_2_code | 62.8657 | 65.1054 | 81.67138 | 88.1212 | 92.6909 | 104.0582 | 25 |
dplyr_query_3_code | 14688.6743 | 15368.2905 | 18818.33617 | 19835.0295 | 20920.1685 | 21933.0659 | 25 |
dt_query_3_code | 194.3412 | 277.8851 | 349.00902 | 295.4189 | 340.8037 | 783.2029 | 25 |
Imagine that you are interested in how the length of time spent on a platform varies per student and per platform. This solution is similar to the one above, except that transformations are applied to selected variable(s) and more than one variable is used for grouping.
#### dplyr
In `dplyr`, this selection is facilitated by using `across()` inside `summarise()`, which allows you to apply the same function or list of functions to a single column or across multiple columns.
```r
# Remove duplicate rows and summarise across platform_length using dplyr ---------
# Solution for a single function
dplyr_query_4_1 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean, na.rm = TRUE),
            .groups = "drop") # Replaces the need to ungroup() after summarise()
```
id | platform | platform_length |
---|---|---|
00007f23 | A | 53 |
00007f23 | C | 50 |
00007f23 | D | 42 |
00007f23 | E | 48 |
```r
# Remove duplicate rows and summarise across platform_length using dplyr ---------
# Solution for a list of functions
# Supply a named list to summarise(across(), .groups = "drop")
mean_sd <- list(mean = ~mean(.x, na.rm = T),
                sd = ~sd(.x, na.rm = T))

dplyr_query_4_2 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean_sd),
            .groups = "drop")
```
id | platform | platform_length_mean | platform_length_sd |
---|---|---|---|
00007f23 | A | 53 | 19.79899 |
00007f23 | C | 50 | NA |
00007f23 | D | 42 | NA |
00007f23 | E | 48 | 12.72792 |
#### data.table
In `data.table`, the equivalent method is to list the columns of interest inside `.SDcols` and apply aggregations using `lapply(.SD, ...)`.
```r
# Remove duplicate rows and lapply() across platform_length using data.table -----
# Solution for a single function
dt_query_4_1 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    lapply(.SD, mean, na.rm = T),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_1 == setDT(dplyr_query_4_1))
#>     id          platform       platform_length
#>  Mode:logical   Mode:logical   Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621
```
```r
# Remove duplicate rows and lapply() across platform_length using data.table -----
# Solution for a list of functions
dt_query_4_2 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    unlist(lapply(.SD,
                  function(x) list(mean = mean(x),
                                   sd = sd(x))),
           recursive = F),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_2 == setDT(dplyr_query_4_2))
#>     id          platform       platform_length.mean   platform_length.sd
#>  Mode:logical   Mode:logical   Mode:logical           Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621            TRUE:118638
#>                                                       NA's:217983
```
Note: The use of `unlist()` inside the `j` placeholder to convert individual list elements into individual columns is explained here.
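In short, `unlist(..., recursive = FALSE)` flattens one level of a nested list and joins the names with a `.`, which is where the `platform_length.mean` and `platform_length.sd` column names come from. A toy example with made-up values:

```r
# Flatten one level of a nested named list; names are joined with "."
unlist(list(platform_length = list(mean = 1, sd = 2)), recursive = FALSE)
#> $platform_length.mean
#> [1] 1
#>
#> $platform_length.sd
#> [1] 2
```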
As expected, `data.table` operations run faster than `dplyr` operations, although there is a non-linear performance decrease when multiple functions are evaluated for a subset of columns using `unlist()`.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_4_1 | 44909.0833 | 46373.5223 | 48032.2252 | 49279.872 | 49431.1745 | 50167.4736 | 5 |
dt_query_4_1 | 226.2467 | 246.2729 | 277.0908 | 279.916 | 292.9285 | 340.0899 | 5 |
dplyr_query_4_2 | 53664.0369 | 54031.1293 | 55658.0703 | 56744.099 | 56923.1472 | 56927.9389 | 5 |
dt_query_4_2 | 17300.8488 | 17699.8367 | 18300.5890 | 18465.812 | 18525.7296 | 19510.7179 | 5 |
## `lag` or `lead` operations

Finally, what if you were interested in the number of times a student switches between platforms? Obtaining this insight is a multi-step process:

1. Group by `id` and create `lag_platform`, which denotes the preceding online platform. Evaluating the first row using `lag()` outputs `NA`.
2. Create `is_new_platform` using a `case_when()` condition, which outputs `1` when the preceding platform is `NA` or different to the current platform, and outputs `0` when the preceding platform is the same as the current platform.
3. Group by `id` and sum `is_new_platform` to obtain the total number of times a student has switched between platforms.

#### dplyr
```r
# Calculate total platform switches per student using dplyr ----------------------
# student_courses must already be sorted by id AND platform_start_date
dplyr_query_5 <- student_courses %>%
  group_by(id) %>%
  mutate(lag_platform = lag(platform, 1L),
         is_new_platform = case_when(is.na(lag_platform) ~ 1,
                                     platform != lag_platform ~ 1,
                                     TRUE ~ 0)) %>%
  summarise(platform_switch = sum(is_new_platform),
            .groups = "drop")
```
#### data.table
```r
# Calculate total platform switches per student using data.table -----------------
dt_query_5 <- student_courses[,
                              lag_platform := shift(platform, 1L, type = "lag"),
                              keyby = id] %>%
  .[,
    is_new_platform := fcase(
      is.na(lag_platform), 1,
      platform != lag_platform, 1,
      default = 0),
    by = id] %>%
  .[,
    .(platform_switch = sum(is_new_platform)),
    by = id]

summary(dt_query_5 == setDT(dplyr_query_5))
#>     id         platform_switch
#>  Mode:logical  Mode:logical
#>  TRUE:144676   TRUE:144676
```
From the benchmark of just the `lag()` code component below, you can see that the lag operation is much faster in `data.table` than in `dplyr`, although it remains a computationally expensive operation overall.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_lag | 4.681914 | 4.887524 | 5.076612 | 5.016075 | 5.231911 | 5.632238 | 25 |
dt_lag | 2.194414 | 2.340580 | 2.450061 | 2.425271 | 2.528855 | 2.954475 | 25 |
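For reference, a benchmark of just the lag component could be set up along the lines below; the expression bodies are my assumptions based on the queries above, not the exact code used to produce these timings.

```r
# A sketch of the lag-only benchmark (expression bodies are assumed)
lag_benchmark <- microbenchmark(
  dplyr_lag = student_courses %>%
    group_by(id) %>%
    mutate(lag_platform = lag(platform, 1L)) %>%
    ungroup(),
  dt_lag = student_courses[,
                           lag_platform := shift(platform, 1L, type = "lag"),
                           by = id],
  times = 25)
```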
## `data.table` with `dplyr` verbs

Whilst I've become familiar with the `data.table` syntax, there is still a way to retain the readability of `dplyr` combined with the performance of `data.table`: implementing `data.table` lazy translations using the `dtplyr` package.

You can use `lazy_dt()` on `data.frame` objects to convert them into `lazy_dt` objects. Using `dplyr` code on a `data.table` object will also automatically convert that object into a `lazy_dt` object.
Note: I am currently experiencing a package dependency error caused by `dtplyr` which prevents `.list()` functions from evaluating inside `data.table` objects, so the code below exists for demonstration purposes only.
```r
# Install and load dtplyr ---------------------------------------------------------
install.packages("dtplyr")
library("dtplyr")

# Create data.frame and convert into lazy_dt object -------------------------------
set.seed(111)
test_df <- tibble(id = sample(seq(1, 10), size = 500, replace = T),
                  var_1 = rnorm(500, 0, 1),
                  var_2 = rnorm(500, 2, 1),
                  var_3 = rnorm(500, 0, 2))

test_dt <- lazy_dt(test_df)
```
The function `show_query()` can also be used to output the generated `data.table` code translation.
```r
# Use show_query() to copy data.table translations --------------------------------
test_dt %>%
  arrange(desc(id)) %>%
  mutate(var_sum = sum(var_1, var_2, var_3)) %>%
  show_query()
#> `_DT1`[order(desc(id))][, `:=`(var_sum = sum(var_1, var_2, var_3))]

test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  show_query()
#> `_DT2`[, .(var_1 = mean(var_1), var_2 = mean(var_2), var_3 = mean(var_3)),
#>     keyby = .(id)]
```
Computation on `lazy_dt` objects is only performed when `as.data.frame()`, `as.data.table()` or `as_tibble()` is explicitly called.
```r
# Perform lazy evaluation to return a data.frame object ---------------------------
test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  as_tibble() # Outputs a tibble
```
## Other resources

- A great side-by-side comparison of `data.table` versus `dplyr` functions from a blog post by Atrebas.
- A list of advanced `data.table` operations and tricks from a blog post by Andrew Brooks.
- A Stack Overflow discussion about why group by operations are much faster using `data.table`.
- An R publication by Jose with tips on using `.SD` in `data.table`.
- A blog post by JMount which benchmarks base R, `dplyr` and `data.table` versus `lazy_dt` code performance.