Things get querysome and querysome.
This post is an extension of my comparison of simple R `data.table` versus `dplyr` functions.
```r
# Load required packages --------------------------------------------------------
if (!require("pacman")) install.packages("pacman")

pacman::p_load(here,
               ids, # Generate random IDs
               tidyverse,
               data.table,
               microbenchmark,
               DT)
```
Imagine that you have a dataset describing how students are engaging with online courses. The first 12 rows of the dataset can be examined interactively below.

Note: The code used to create this dataset can be accessed from my GitHub repository here.
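If you want a runnable stand-in while following along, a minimal sketch of the dataset structure is below. The column names match those used in the queries, but the generation logic is my assumption; the real code lives in the repository linked above.

```r
# A minimal sketch of the dataset structure; treat the values as illustrative
# only, as the real generation code lives in the linked repository
set.seed(111)
n <- 1000

student_courses <- data.table(
  id = sample(ids::random_id(100, bytes = 4), n, replace = TRUE),
  platform = sample(LETTERS[1:5], n, replace = TRUE),
  course = sample(c("fitness_training", "website_design", "metal_welding"),
                  n, replace = TRUE),
  platform_start_date = as.Date("2020-01-01") + sample(0:365, n, replace = TRUE))

student_courses[, platform_end_date := platform_start_date + sample(30:90, n, replace = TRUE)]

setorder(student_courses, id, platform_start_date) # Later queries assume this sort order
```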
#### dplyr

Imagine that you would like to subset the fitness training courses from platforms C and E, and then create a column to denote that these were discounted courses. In `dplyr`, this can be written as a single block of code, using the `%>%` pipe to separate each functional step.
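A sketch of that single block is below; the object name `dplyr_query_1` is an assumption, chosen to match the `dplyr_query_1_code` entry in the benchmark table further down.

```r
# Filter and create a new column using dplyr -------------------------------------
dplyr_query_1 <- student_courses %>%
  filter(course == "fitness_training",
         platform %in% c("C", "E")) %>%
  mutate(percentage_discount = 5)
```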
#### data.table
In `data.table`, performing these two operations in separate steps or in a single step produces different outputs.¹
```r
# Filter and create a new column using data.table in separate steps --------------
# %chin% is data.table's fast match operator for character vectors
dt_query_1 <- student_courses[(course == "fitness_training")
                              & (platform %chin% c("C", "E"))] %>%
  .[, percentage_discount := 5]
```
id | platform | course | percentage_discount |
---|---|---|---|
00007f23 | E | fitness_training | 5 |
0007ca33 | E | fitness_training | 5 |
000899ab | E | fitness_training | 5 |
0011ed00 | C | fitness_training | 5 |
In `data.table`, applying the filter and the column transformation in a single step retains all dataset rows and only applies the transformation to rows where the filtering condition is `TRUE`.
```r
# Filter and create a new column using data.table in a single step ---------------
dt_query_1_wrong <- student_courses[(course == "fitness_training")
                                    & (platform %chin% c("C", "E")),
                                    percentage_discount := 5]
```
id | platform | course | percentage_discount |
---|---|---|---|
00007f23 | D | metal_welding | NA |
00007f23 | E | website_design | NA |
00007f23 | E | fitness_training | 5 |
00007f23 | A | metal_welding | NA |
There is only a slight speed advantage in using `data.table` over `dplyr` here, as our queries do not involve sorting or group by operations.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_1_code | 14.4775 | 15.24285 | 19.671913 | 16.54145 | 18.86065 | 53.0508 | 100 |
dt_query_1_code | 7.4791 | 8.11775 | 9.768044 | 8.59610 | 9.42310 | 39.4557 | 100 |
dt_query_1_wrong_code | 6.0523 | 6.56240 | 8.491838 | 7.18205 | 7.80735 | 30.4918 | 100 |
A simple introduction to group by operations has already been covered here. This post further explores how different outputs can be obtained by modifying `data.table` group by operations in different ways.

Imagine that you are interested in the total number of days each student has spent on an online platform. Could you obtain this by grouping on `id` and summing the total number of days spent on a platform?
#### dplyr
```r
# Calculate total platform_length per student using dplyr ------------------------
student_courses <- student_courses %>%
  mutate(platform_length = platform_end_date - platform_start_date,
         platform_length = as.integer(platform_length))

dplyr_query_2 <- student_courses %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()
```
Note: In `dplyr`, group by operations should be closed with `ungroup()` to remove the object metadata that marks row groupings.
#### data.table
In `data.table`, you can choose which variable(s) to group by using `by` or `keyby`. The additional effect of `keyby` is that it also orders the results and creates a secondary key for faster subsequent subsetting. This is useful if you intend to create multiple features from the same grouping.
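A sketch of the grouped query is below; the object name `dt_query_2` is an assumption, chosen to match the `dt_query_2_code` entry in the benchmark table further down.

```r
# Calculate total platform_length per student using data.table -------------------
dt_query_2 <- student_courses[,
                              .(total_days = sum(platform_length),
                                min_days = min(platform_length),
                                median_days = median(platform_length),
                                max_days = max(platform_length)),
                              keyby = id]
```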
The problem is that this solution overestimates the total number of days spent on an online platform per student, as some students take multiple courses on the same platform. These student records will contain rows with the same platform dwell length but different course start and end dates.²
To remove duplicate `platform_length` values, we first identify them by concatenating `id`, `platform` and `platform_start_date`, and counting the total number of rows per concatenation.
#### dplyr
```r
# Identify duplicate rows using dplyr ---------------------------------------------
student_courses %>%
  mutate(platform_key = str_c(id, platform, platform_start_date, sep = "-")) %>%
  count(platform_key, name = "row_number") %>%
  count(row_number, name = "total_students")
```
#### data.table
In `data.table`, `.SD` means 'subset of data' and is used to reference the current sub-table of interest.
```r
# Identify duplicate rows using data.table ----------------------------------------
student_courses[,
                .(platform_key = do.call(str_c, c(.SD, sep = "-"))),
                .SDcols = c("id", "platform", "platform_start_date")] %>%
  .[,
    .(row_number = .N),
    by = platform_key] %>%
  .[,
    .(total_students = .N),
    keyby = row_number] # Use keyby as we also want to sort by row_number
```
Note: The base R function `do.call()` constructs and executes a function call from a name or a function and a list of function arguments.
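For example, the two calls below are equivalent; the literal strings are just illustrative inputs.

```r
# do.call() builds the call str_c("a", "b", "c", sep = "-") from a list of arguments
do.call(str_c, list("a", "b", "c", sep = "-"))
#> [1] "a-b-c"

str_c("a", "b", "c", sep = "-")
#> [1] "a-b-c"
```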
Given that duplicate `platform_length` records exist, you would use a two-step process to extract the total `platform_length` per student:

1. Group by `id`, `platform` and `platform_start_date`, and extract the first row of each group.
2. Group by `id` and aggregate for the `platform_length` calculations.

#### dplyr
```r
# Calculate total platform_length per student using dplyr ------------------------
dplyr_query_3 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()

summary(dplyr_query_3 == dplyr_query_2)
#>     id          total_days      min_days        median_days     max_days
#>  Mode:logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
#>  TRUE:144676    FALSE:14674     FALSE:22        FALSE:9745      FALSE:24
#>                 TRUE :130002    TRUE :144654    TRUE :134931    TRUE :144652
```
#### data.table
```r
# Calculate total platform_length per student using data.table -------------------
dt_query_3 <- student_courses[,
                              .SD[1L],
                              by = .(id, platform, platform_start_date)] %>%
  .[,
    .(total_days = sum(platform_length),
      min_days = min(platform_length),
      median_days = median(platform_length),
      max_days = max(platform_length)),
    keyby = id]

summary(dt_query_3 == setDT(dplyr_query_3))
#>     id         total_days     min_days       median_days    max_days
#>  Mode:logical  Mode:logical   Mode:logical   Mode:logical   Mode:logical
#>  TRUE:144676   TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676
```
The `data.table` solutions are significantly faster here, as these queries require group by operations.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_2_code | 5246.0321 | 7057.7399 | 7109.48259 | 7170.6375 | 7452.5706 | 7857.1419 | 25 |
dt_query_2_code | 62.8657 | 65.1054 | 81.67138 | 88.1212 | 92.6909 | 104.0582 | 25 |
dplyr_query_3_code | 14688.6743 | 15368.2905 | 18818.33617 | 19835.0295 | 20920.1685 | 21933.0659 | 25 |
dt_query_3_code | 194.3412 | 277.8851 | 349.00902 | 295.4189 | 340.8037 | 783.2029 | 25 |
Imagine that you are interested in how the length of time spent on a platform varies per student and per platform. This solution is similar to the one above, except that transformations are applied to selected variable(s) and more than one variable is used for grouping.
#### dplyr
In `dplyr`, this selection is facilitated by using `across()` inside `summarise()`, which allows you to apply the same function or list of functions to a single column or across multiple columns.
```r
# Remove duplicate rows and summarise across platform_length using dplyr ---------
# Solution for a single function
dplyr_query_4_1 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean, na.rm = TRUE),
            .groups = "drop") # Replaces the need to ungroup() after summarise()
```
id | platform | platform_length |
---|---|---|
00007f23 | A | 53 |
00007f23 | C | 50 |
00007f23 | D | 42 |
00007f23 | E | 48 |
```r
# Remove duplicate rows and summarise across platform_length using dplyr ---------
# Solution for a list of functions
# Supply a named list to summarise(across(), .groups = "drop")
mean_sd <- list(mean = ~mean(.x, na.rm = T),
                sd = ~sd(.x, na.rm = T))

dplyr_query_4_2 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean_sd),
            .groups = "drop")
```
id | platform | platform_length_mean | platform_length_sd |
---|---|---|---|
00007f23 | A | 53 | 19.79899 |
00007f23 | C | 50 | NA |
00007f23 | D | 42 | NA |
00007f23 | E | 48 | 12.72792 |
#### data.table
In `data.table`, the equivalent method is to list the columns of interest inside `.SDcols` and apply aggregations using `lapply(.SD, ...)`.
```r
# Remove duplicate rows and lapply() across platform_length using data.table -----
# Solution for a single function
dt_query_4_1 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    lapply(.SD, mean, na.rm = T),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_1 == setDT(dplyr_query_4_1))
#>     id          platform       platform_length
#>  Mode:logical   Mode:logical   Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621
```
```r
# Remove duplicate rows and lapply() across platform_length using data.table -----
# Solution for a list of functions
dt_query_4_2 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    unlist(lapply(.SD,
                  function(x) list(mean = mean(x),
                                   sd = sd(x))),
           recursive = F),
    .SDcols = grep("length", colnames(student_courses)),
    keyby = .(id, platform)]

summary(dt_query_4_2 == setDT(dplyr_query_4_2))
#>     id          platform       platform_length.mean   platform_length.sd
#>  Mode:logical   Mode:logical   Mode:logical           Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621            TRUE:118638
#>                                                       NA's:217983
```
Note: The use of `unlist()` inside the `j` placeholder to convert individual list elements into individual columns is explained here.
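In short, `unlist(..., recursive = FALSE)` flattens one level of a nested list and joins the names with a `.`, which is where the `platform_length.mean` and `platform_length.sd` column names come from. A toy example with made-up values:

```r
# Flatten one level of a nested named list; names are joined with "."
unlist(list(platform_length = list(mean = 1, sd = 2)), recursive = FALSE)
#> $platform_length.mean
#> [1] 1
#>
#> $platform_length.sd
#> [1] 2
```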
As expected, `data.table` operations run faster than `dplyr` operations, although there is a non-linear performance decrease when multiple functions are evaluated for a subset of columns using `unlist()`.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_query_4_1 | 44909.0833 | 46373.5223 | 48032.2252 | 49279.872 | 49431.1745 | 50167.4736 | 5 |
dt_query_4_1 | 226.2467 | 246.2729 | 277.0908 | 279.916 | 292.9285 | 340.0899 | 5 |
dplyr_query_4_2 | 53664.0369 | 54031.1293 | 55658.0703 | 56744.099 | 56923.1472 | 56927.9389 | 5 |
dt_query_4_2 | 17300.8488 | 17699.8367 | 18300.5890 | 18465.812 | 18525.7296 | 19510.7179 | 5 |
## `lag` or `lead` operations

Finally, what if you were interested in the number of times a student switches between platforms? Obtaining this insight is a multi-step process:

1. Group by `id` and create `lag_platform`, which denotes the preceding online platform. Evaluating the first row using `lag()` outputs `NA`.
2. Create `is_new_platform` using a `case_when()` condition, which outputs `1` when the preceding platform is `NA` or different to the current platform, and outputs `0` when the preceding platform is the same as the current platform.
3. Group by `id` and sum `is_new_platform` to obtain the total number of times a student has switched between platforms.

#### dplyr
```r
# Calculate total platform switches per student using dplyr ----------------------
# student_courses must already be sorted by id AND platform_start_date
dplyr_query_5 <- student_courses %>%
  group_by(id) %>%
  mutate(lag_platform = lag(platform, 1L),
         is_new_platform = case_when(is.na(lag_platform) ~ 1,
                                     platform != lag_platform ~ 1,
                                     TRUE ~ 0)) %>%
  summarise(platform_switch = sum(is_new_platform),
            .groups = "drop")
```
#### data.table
```r
# Calculate total platform switches per student using data.table -----------------
dt_query_5 <- student_courses[,
                              lag_platform := shift(platform, 1L, type = "lag"),
                              keyby = id] %>%
  .[,
    is_new_platform := fcase(
      is.na(lag_platform), 1,
      platform != lag_platform, 1,
      default = 0),
    by = id] %>%
  .[,
    .(platform_switch = sum(is_new_platform)),
    by = id]

summary(dt_query_5 == setDT(dplyr_query_5))
#>     id         platform_switch
#>  Mode:logical  Mode:logical
#>  TRUE:144676   TRUE:144676
```
From the benchmark of just the `lag()` code component below, you can see that the lag operation is much faster in `data.table` than in `dplyr`, although it remains a computationally expensive operation overall.
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
dplyr_lag | 4.681914 | 4.887524 | 5.076612 | 5.016075 | 5.231911 | 5.632238 | 25 |
dt_lag | 2.194414 | 2.340580 | 2.450061 | 2.425271 | 2.528855 | 2.954475 | 25 |
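For reference, a benchmark of just the lag component could be set up along the lines below; the expression bodies are my assumptions based on the queries above, not the exact code used to produce these timings.

```r
# A sketch of the lag-only benchmark (expression bodies are assumed)
lag_benchmark <- microbenchmark(
  dplyr_lag = student_courses %>%
    group_by(id) %>%
    mutate(lag_platform = lag(platform, 1L)) %>%
    ungroup(),
  dt_lag = student_courses[,
                           lag_platform := shift(platform, 1L, type = "lag"),
                           by = id],
  times = 25)
```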
## `data.table` with `dplyr` verbs

Whilst I've become familiar with the `data.table` syntax, there is still a way to retain the readability of `dplyr` combined with the performance of `data.table`: implementing `data.table` lazy translations using the `dtplyr` package.

You can use `lazy_dt()` on `data.frame` objects to convert them into `lazy_dt` objects. Using `dplyr` code on a `data.table` object will also automatically convert that object into a `lazy_dt` object.
Note: I am currently experiencing a package dependency error caused by `dtplyr` which prevents `.list()` functions from evaluating inside `data.table` objects, so the code below exists for demonstration purposes only.
```r
# Install and load dtplyr ---------------------------------------------------------
install.packages("dtplyr")
library("dtplyr")

# Create data.frame and convert into lazy_dt object -------------------------------
set.seed(111)
test_df <- tibble(id = sample(seq(1, 10), size = 500, replace = T),
                  var_1 = rnorm(500, 0, 1),
                  var_2 = rnorm(500, 2, 1),
                  var_3 = rnorm(500, 0, 2))

test_dt <- lazy_dt(test_df)
```
The function `show_query()` can also be used to output the generated `data.table` code translation.
```r
# Use show_query() to copy data.table translations --------------------------------
test_dt %>%
  arrange(desc(id)) %>%
  mutate(var_sum = sum(var_1, var_2, var_3)) %>%
  show_query()
#> `_DT1`[order(desc(id))][, `:=`(var_sum = sum(var_1, var_2, var_3))]

test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  show_query()
#> `_DT2`[, .(var_1 = mean(var_1), var_2 = mean(var_2), var_3 = mean(var_3)),
#>     keyby = .(id)]
```
Computation on `lazy_dt` objects is only performed when `as.data.frame()`, `as.data.table()` or `as_tibble()` is explicitly called.
```r
# Perform lazy evaluation to return a data.frame object ---------------------------
test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  as_tibble() # Outputs a tibble
```
## Other resources

- A great side-by-side comparison of `data.table` versus `dplyr` functions from a blog post by Atrebas.
- A list of advanced `data.table` operations and tricks from a blog post by Andrew Brooks.
- A Stack Overflow discussion about why group by operations are much faster using `data.table`.
- An R publication by Jose with tips on using `.SD` in `data.table`.
- A blog post by JMount which benchmarks base R, `dplyr` and `data.table` versus `lazy_dt` code performance.