Things get querysome and querysome.
This post is an extension of my comparison of simple R data.table versus dplyr functions.
```r
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")

pacman::p_load(here,
               ids, # Generate random IDs
               tidyverse,
               data.table,
               microbenchmark,
               DT)
```
Imagine that you have a dataset describing how students are engaging with online courses:
The first 12 rows of the dataset can be examined interactively below.
Note: The code used to create this dataset can be accessed from my github repository here.
**dplyr**

Imagine that you would like to subset the fitness training courses from platforms C and E and then create a column to denote that these were discounted courses.
In dplyr, this can be written as a single block of code using the %>% pipe to separate each functional step.
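The dplyr version of this query is a filter followed by a mutate. A minimal sketch on a toy stand-in for `student_courses` (illustrative values only; the real dataset is generated in the linked repository):

```r
library(dplyr)

# Toy stand-in for student_courses (illustrative values only)
student_courses <- tibble(
  id       = c("00007f23", "0007ca33", "0011ed00", "0018a1eb"),
  platform = c("C", "E", "A", "C"),
  course   = c("fitness_training", "fitness_training",
               "fitness_training", "website_design")
)

# Filter, then create the new column, in a single piped block
dplyr_query_1 <- student_courses %>%
  filter(course == "fitness_training",
         platform %in% c("C", "E")) %>%
  mutate(percentage_discount = 5)
```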
**data.table**

In data.table, performing these two operations in separate steps or in a single step produces different outputs. 1
```r
# Filter and create a new column using data.table in separate steps ------------
dt_query_1 <- student_courses[(course == "fitness_training")
                              & (platform %chin% c("C", "E"))] %>%
  .[, percentage_discount := 5]
```
| id | platform | course | percentage_discount |
|---|---|---|---|
| 00007f23 | E | fitness_training | 5 |
| 0007ca33 | E | fitness_training | 5 |
| 000899ab | E | fitness_training | 5 |
| 0011ed00 | C | fitness_training | 5 |
In data.table, applying filtering and a column transformation in a single step retains all dataset rows and only applies the transformation to rows where the filtering condition is TRUE.
```r
# Filter and create a new column using data.table in a single step -------------
dt_query_1_wrong <- student_courses[(course == "fitness_training")
                                    & (platform %chin% c("C", "E")),
                                    percentage_discount := 5]
```
| id | platform | course | percentage_discount |
|---|---|---|---|
| 00007f23 | D | metal_welding | NA |
| 00007f23 | E | website_design | NA |
| 00007f23 | E | fitness_training | 5 |
| 00007f23 | A | metal_welding | NA |
There is only a slight speed advantage to data.table over dplyr here, as the query involves neither sorting nor group by operations.
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| dplyr_query_1_code | 14.4775 | 15.24285 | 19.671913 | 16.54145 | 18.86065 | 53.0508 | 100 |
| dt_query_1_code | 7.4791 | 8.11775 | 9.768044 | 8.59610 | 9.42310 | 39.4557 | 100 |
| dt_query_1_wrong_code | 6.0523 | 6.56240 | 8.491838 | 7.18205 | 7.80735 | 30.4918 | 100 |
A simple introduction to group by operations has already been covered here. This post further explores how different outputs can be obtained by modifying data.table group by operations in different ways.
Imagine that you are interested in the total number of days each student has spent on an online platform. Could you obtain this by grouping on id and summing the total number of days spent on a platform?
**dplyr**

```r
# Calculate total platform_length per student using dplyr ----------------------
student_courses <- student_courses %>%
  mutate(platform_length = platform_end_date - platform_start_date,
         platform_length = as.integer(platform_length))

dplyr_query_2 <- student_courses %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()
```
Note: In dplyr, group by operations should be closed using ungroup() to remove object metadata that marks row groupings.
**data.table**

In data.table, you can choose which variable(s) to group by using `by` or `keyby`. The additional effect of `keyby` is that it also orders the results and sets a key on the grouping columns for faster subsequent subsetting. This is useful if you intend to create multiple features from the same grouping.
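The data.table code for this aggregation is not shown above (it is labelled `dt_query_2_code` in the benchmark further below); a minimal sketch using `keyby`, on a toy table with `id` and `platform_length` columns:

```r
library(data.table)

# Toy table with the id and platform_length columns used in the query
toy_courses <- data.table(
  id = c("a1", "a1", "a2"),
  platform_length = c(10L, 20L, 5L)
)

# Group by id with keyby, which also sorts the output and keys it on id
dt_query_2 <- toy_courses[,
                          .(total_days  = sum(platform_length),
                            min_days    = min(platform_length),
                            median_days = median(platform_length),
                            max_days    = max(platform_length)),
                          keyby = id]
```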
The problem is that this solution overestimates the total number of days spent on an online platform per student, as some students take multiple courses on the same platform. These student records will contain rows with the same platform dwell length but different course start and end dates. 2
To remove duplicate platform_length values, we first identify them by concatenating id, platform and platform_start_date and counting the total number of rows per concatenation.
**dplyr**

```r
# Identify duplicate rows using dplyr ------------------------------------------
student_courses %>%
  mutate(platform_key = str_c(id, platform, platform_start_date, sep = "-")) %>%
  count(platform_key, name = "row_number") %>%
  count(row_number, name = "total_students")
```
**data.table**

In data.table, `.SD` means 'subset of data' and is used to reference the current sub-table of interest.
```r
# Identify duplicate rows using data.table -------------------------------------
student_courses[,
                .(platform_key = do.call(str_c, c(.SD, sep = "-"))),
                .SDcols = c("id", "platform", "platform_start_date")] %>%
  .[,
    .(row_number = .N),
    by = platform_key] %>%
  .[,
    .(total_students = .N),
    keyby = row_number] # Use keyby as we also want to sort by row_number
```
Note: The base R function do.call() constructs and executes a function call from a name or function and a list of function arguments.
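As a quick base R illustration of this mechanism (using `paste()` here instead of `stringr::str_c()`):

```r
# do.call() builds and executes a call from a function and a list of
# arguments, so these two expressions are equivalent:
paste("id1", "C", "2020-01-01", sep = "-")
do.call(paste, list("id1", "C", "2020-01-01", sep = "-"))
# Both return "id1-C-2020-01-01"
```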
Given that duplicate platform_length records exist, you would use a two-step process to extract the total platform_length per student:
1. Group by id, platform and platform_start_date and extract the first row of each group.
2. Group by id and aggregate for the platform_length calculations.

**dplyr**

```r
# Calculate total platform_length per student using dplyr ----------------------
dplyr_query_3 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()

summary(dplyr_query_3 == dplyr_query_2)
#> id             total_days      min_days        median_days     max_days
#> Mode:logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
#> TRUE:144676    FALSE:14674     FALSE:22        FALSE:9745      FALSE:24
#>                TRUE :130002    TRUE :144654    TRUE :134931    TRUE :144652
```
**data.table**

```r
# Calculate total platform_length per student using data.table -----------------
dt_query_3 <- student_courses[,
                              .SD[1L],
                              by = .(id, platform, platform_start_date)] %>%
  .[,
    .(total_days = sum(platform_length),
      min_days = min(platform_length),
      median_days = median(platform_length),
      max_days = max(platform_length)),
    keyby = id]

summary(dt_query_3 == setDT(dplyr_query_3))
#> id             total_days     min_days       median_days    max_days
#> Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical
#> TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676
```
The data.table solutions are significantly faster here, as group by operations are involved.
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| dplyr_query_2_code | 5246.0321 | 7057.7399 | 7109.48259 | 7170.6375 | 7452.5706 | 7857.1419 | 25 |
| dt_query_2_code | 62.8657 | 65.1054 | 81.67138 | 88.1212 | 92.6909 | 104.0582 | 25 |
| dplyr_query_3_code | 14688.6743 | 15368.2905 | 18818.33617 | 19835.0295 | 20920.1685 | 21933.0659 | 25 |
| dt_query_3_code | 194.3412 | 277.8851 | 349.00902 | 295.4189 | 340.8037 | 783.2029 | 25 |
Imagine that you are interested in how the length of time spent on a platform varies per student per platform. This solution is similar to the one above, except that transformations are applied to selected variable(s) and the grouping is performed on more than one variable.
**dplyr**

In dplyr, this selection is facilitated by using `across()` inside `summarise()`, which allows you to apply the same function or list of functions to a single column or across multiple columns.
```r
# Remove duplicate rows and summarise across platform_length using dplyr -------
# Solution for a single function
dplyr_query_4_1 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean, na.rm = TRUE),
            .groups = "drop") # Replaces the need to ungroup() after summarise()
```
| id | platform | platform_length |
|---|---|---|
| 00007f23 | A | 53 |
| 00007f23 | C | 50 |
| 00007f23 | D | 42 |
| 00007f23 | E | 48 |
```r
# Remove duplicate rows and summarise across platform_length using dplyr -------
# Solution for a list of functions
# Supply a named list to summarise(across(), .groups = "drop")
mean_sd <- list(mean = ~mean(.x, na.rm = T),
                sd = ~sd(.x, na.rm = T))

dplyr_query_4_2 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean_sd),
            .groups = "drop")
```
| id | platform | platform_length_mean | platform_length_sd |
|---|---|---|---|
| 00007f23 | A | 53 | 19.79899 |
| 00007f23 | C | 50 | NA |
| 00007f23 | D | 42 | NA |
| 00007f23 | E | 48 | 12.72792 |
**data.table**

In data.table, the equivalent method is to list the columns of interest in `.SDcols` and apply aggregations using `lapply(.SD, ...)`.
```r
# Remove duplicate rows and lapply() across platform_length using data.table ---
# Solution for a single function
dt_query_4_1 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    lapply(.SD, mean, na.rm = T),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_1 == setDT(dplyr_query_4_1))
#> id             platform       platform_length
#> Mode:logical   Mode:logical   Mode:logical
#> TRUE:336621    TRUE:336621    TRUE:336621
```
```r
# Remove duplicate rows and lapply() across platform_length using data.table ---
# Solution for a list of functions
dt_query_4_2 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    unlist(lapply(.SD,
                  function(x) list(mean = mean(x),
                                   sd = sd(x))),
           recursive = F),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_2 == setDT(dplyr_query_4_2))
#> id             platform       platform_length.mean   platform_length.sd
#> Mode:logical   Mode:logical   Mode:logical           Mode:logical
#> TRUE:336621    TRUE:336621    TRUE:336621            TRUE:118638
#>                                                      NA's:217983
```
Note: The use of unlist() inside the j placeholder to convert individual list elements into individual columns is explained here.
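The mechanism can be seen with plain base R: `unlist()` with `recursive = FALSE` flattens one level of nesting and pastes the outer and inner names together, which is what turns each list element into its own column inside `j`.

```r
# One column's worth of results, shaped like the output of lapply(.SD, ...)
results <- list(platform_length = list(mean = 45, sd = 12))

# Flatten one level; names become platform_length.mean, platform_length.sd
flat <- unlist(results, recursive = FALSE)
names(flat)
# "platform_length.mean" "platform_length.sd"
```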
As expected, data.table operations run faster than dplyr operations, although there is a marked, non-linear performance decrease when multiple functions are evaluated over a subset of columns using unlist().
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| dplyr_query_4_1 | 44909.0833 | 46373.5223 | 48032.2252 | 49279.872 | 49431.1745 | 50167.4736 | 5 |
| dt_query_4_1 | 226.2467 | 246.2729 | 277.0908 | 279.916 | 292.9285 | 340.0899 | 5 |
| dplyr_query_4_2 | 53664.0369 | 54031.1293 | 55658.0703 | 56744.099 | 56923.1472 | 56927.9389 | 5 |
| dt_query_4_2 | 17300.8488 | 17699.8367 | 18300.5890 | 18465.812 | 18525.7296 | 19510.7179 | 5 |
## Lag or lead operations

Finally, what if you were interested in the number of times a student switches between platforms? Obtaining this insight is a multi-step process:
1. Group by id and create lag_platform, which denotes the preceding online platform. Evaluating the first row using lag() outputs NA.
2. Create is_new_platform using a case_when() condition, which outputs 1 when the preceding platform is NA or different to the current platform, and 0 when the preceding platform is the same as the current platform.
3. Group by id and sum is_new_platform to obtain the total number of times a student has switched between platforms.

**dplyr**

```r
# Calculate total platform switches per student using dplyr --------------------
# student_courses must already be sorted by id AND platform_start_date
dplyr_query_5 <- student_courses %>%
  group_by(id) %>%
  mutate(lag_platform = lag(platform, 1L),
         is_new_platform = case_when(is.na(lag_platform) ~ 1,
                                     platform != lag_platform ~ 1,
                                     TRUE ~ 0)) %>%
  summarise(platform_switch = sum(is_new_platform),
            .groups = "drop")
```
**data.table**

```r
# Calculate total platform switches per student using data.table ---------------
dt_query_5 <- student_courses[,
                              lag_platform := shift(platform, 1L, type = "lag"),
                              keyby = id] %>%
  .[,
    is_new_platform := fcase(
      is.na(lag_platform), 1,
      platform != lag_platform, 1,
      default = 0),
    by = id] %>%
  .[,
    .(platform_switch = sum(is_new_platform)),
    by = id]

summary(dt_query_5 == setDT(dplyr_query_5))
#> id             platform_switch
#> Mode:logical   Mode:logical
#> TRUE:144676    TRUE:144676
```
From the benchmark of just the lag() code component below, you can see that the lag operation is much faster in data.table (via shift()) than in dplyr, although it remains relatively computationally expensive overall.
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| dplyr_lag | 4.681914 | 4.887524 | 5.076612 | 5.016075 | 5.231911 | 5.632238 | 25 |
| dt_lag | 2.194414 | 2.340580 | 2.450061 | 2.425271 | 2.528855 | 2.954475 | 25 |
## Using data.table with dplyr verbs

Whilst I've become familiar with the data.table syntax, there is still a way to retain the readability of dplyr combined with the performance of data.table, by implementing data.table lazy translations using the dtplyr package.
You can use lazy_dt() on data.frame objects to convert them into lazy_dt objects. Using dplyr code on data.table objects will also automatically convert that object into a lazy_dt object.
Note: I am currently experiencing a package dependency error caused by dtplyr which prevents .list() functions from evaluating inside data.table objects, so the code below exists for demonstration purposes only.
```r
# Install and load dtplyr ------------------------------------------------------
install.packages("dtplyr")
library("dtplyr")

# Create data.frame and convert into lazy_dt object ----------------------------
set.seed(111)
test_df <- tibble(id = sample(seq(1, 10), size = 500, replace = T),
                  var_1 = rnorm(500, 0, 1),
                  var_2 = rnorm(500, 2, 1),
                  var_3 = rnorm(500, 0, 2))

test_dt <- lazy_dt(test_df)
```
The function show_query() can also be used to output the generated data.table code translation.
```r
# Use show_query() to copy data.table translations -----------------------------
test_dt %>%
  arrange(desc(id)) %>%
  mutate(var_sum = sum(var_1, var_2, var_3)) %>%
  show_query()
#> `_DT1`[order(desc(id))][, `:=`(var_sum = sum(var_1, var_2, var_3))]

test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  show_query()
#> `_DT2`[, .(var_1 = mean(var_1), var_2 = mean(var_2), var_3 = mean(var_3)),
#>   keyby = .(id)]
```
Computation of lazy_dt objects is only performed if as.data.frame(), as.data.table(), or as_tibble() is explicitly called.
```r
# Perform lazy evaluation to return a data.frame object ------------------------
test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  as_tibble() # Outputs a tibble
```
- A great side-by-side comparison of data.table versus dplyr functions from a blog post by Atrebas.
- A list of advanced data.table operations and tricks from a blog post by Andrew Brooks.
- A stack overflow discussion about why group by operations are much faster using data.table.
- An R publication by Jose with tips on using .SD in data.table.
- A blog post by JMount which benchmarks base R, dplyr, data.table versus lazy_dt code performance.