Advanced data.table operations

data cleaning data.table dplyr R

Things get querysome and querysome.

Erika Duan
02-16-2021

Introduction

This post is an extension of my comparison of simple R data.table versus dplyr functions.

# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               ids, # Generate random IDs
               tidyverse,
               data.table,
               microbenchmark,
               DT)

Imagine that you have a dataset describing how students are engaging with online courses:

The first 12 rows of the dataset can be examined interactively below.

Note: The code used to create this dataset can be accessed from my github repository here.

Code sequence impact

Using dplyr

Imagine that you would like to subset fitness training courses from platforms C and E and then create a column denoting that these were discounted courses.

In dplyr, this can be written as a single block of code using the %>% pipe to separate each functional step.

# Filter and create a new column using dplyr %>% pipes -------------------------
dplyr_query_1 <- student_courses %>%
  filter(course == "fitness_training",
         platform %in% c("C", "E")) %>%
  mutate(percentage_discount = 5)

Using data.table

In data.table, performing these two operations in separate steps or a single step produces different outputs. 1

# Filter and create a new column using data.table in separate steps ------------
dt_query_1 <- student_courses[(course == "fitness_training")
                              & (platform %chin% c("C", "E"))] %>%
  .[, percentage_discount := 5]

id       platform course           percentage_discount
00007f23 E        fitness_training 5
0007ca33 E        fitness_training 5
000899ab E        fitness_training 5
0011ed00 C        fitness_training 5

In data.table, applying filtering and a column transformation in a single step retains all dataset rows and only applies the transformation to rows where the filtering condition is TRUE.
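In miniature, this behaviour can be sketched on a toy table (the column names here are invented for illustration):

```r
library(data.table)

# Toy table to illustrate := combined with a filter in i
dt <- data.table(x = 1:4)
dt[x > 2, flag := TRUE] # modifies dt by reference

# All four rows are retained; flag is NA wherever the condition was FALSE
print(dt)
```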

# Filter and create a new column using data.table in a single step -------------
dt_query_1_wrong <- student_courses[(course == "fitness_training")
                                    & (platform %chin% c("C", "E")),
                                    percentage_discount := 5]

id       platform course           percentage_discount
00007f23 D        metal_welding    NA
00007f23 E        website_design   NA
00007f23 E        fitness_training 5
00007f23 A        metal_welding    NA

Benchmark data operations

There is only a slight speed advantage in using data.table over dplyr, as our operations do not involve sorting or group by operations.

Table 1: Units: milliseconds
expr                   min      lq        mean       median    uq        max      neval
dplyr_query_1_code     14.4775  15.24285  19.671913  16.54145  18.86065  53.0508  100
dt_query_1_code         7.4791   8.11775   9.768044   8.59610   9.42310  39.4557  100
dt_query_1_wrong_code   6.0523   6.56240   8.491838   7.18205   7.80735  30.4918  100

Aggregate by group

A simple introduction to group by operations has already been covered here. This post further explores how different outputs can be obtained by modifying data.table group by operations in different ways.

Imagine that you are interested in the total number of days each student has spent on an online platform. Could you obtain this by grouping on id and summing the total number of days spent on a platform?

Using dplyr

# Calculate total platform_length per student using dplyr ----------------------
student_courses <- student_courses %>%
  mutate(platform_length = platform_end_date - platform_start_date,
         platform_length = as.integer(platform_length))

dplyr_query_2 <- student_courses %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()

Note: In dplyr, group by operations should be closed using ungroup() to remove object metadata that marks row groupings.
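A minimal sketch of why this matters, using an invented toy tibble: verbs applied after a grouped mutate() still operate per group until ungroup() is called.

```r
library(dplyr)

# Toy grouped tibble (invented for illustration)
df <- tibble(g = c("a", "a", "b"), x = 1:3)
grouped <- df %>%
  group_by(g) %>%
  mutate(g_mean = mean(x))

# slice() still acts per group: one row for each of the 2 groups
grouped %>% slice(1L) %>% nrow()               # 2

# After ungroup(), slice() acts on the whole table: 1 row
grouped %>% ungroup() %>% slice(1L) %>% nrow() # 1
```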

Using data.table

In data.table, you can choose which variable(s) to group by using by or keyby. The additional effect of keyby is that it also sorts the result by the grouping variable(s) and sets them as the key of the result, allowing faster subsequent subsetting. This is useful if you intend to create multiple features from the same grouping.
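The difference can be sketched on a toy table (names invented for illustration):

```r
library(data.table)

dt <- data.table(g = c("b", "a", "b"), x = 1:3)

by_res    <- dt[, .(n = .N), by = g]    # groups in order of first appearance
keyby_res <- dt[, .(n = .N), keyby = g] # groups sorted; key set on the result

by_res$g       # "b" "a"
keyby_res$g    # "a" "b"
key(keyby_res) # "g"
key(by_res)    # NULL
```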

# Calculate total platform_length per student using data.table -----------------
dt_query_2 <- student_courses[,
                              .(total_days = sum(platform_length),
                                min_days = min(platform_length),
                                median_days = median(platform_length),
                                max_days = max(platform_length)),
                              by = id]

The problem is that this solution overestimates the total number of days spent on an online platform per student, as some students take multiple courses on the same platform. These student records will contain rows with the same platform dwell length but different course start and end dates. 2

Identify duplicate rows

To remove duplicate platform_length values, we first identify them by concatenating id, platform and platform_start_date and counting the total number of rows per concatenation.

Using dplyr

# Identify duplicate rows using dplyr ------------------------------------------
student_courses %>%
  mutate(platform_key = str_c(id, platform, platform_start_date, sep = "-")) %>%
  count(platform_key, name = "row_number") %>%
  count(row_number, name = "total_students")

Using data.table

In data.table, .SD means ‘subset of data’ and is used to reference the current sub-table of interest.
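A small sketch of .SD on an invented toy table: inside j, .SD is itself a data.table holding the current group's rows for all non-grouping columns.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# Apply the same function to every non-grouping column, per group
dt[, lapply(.SD, sum), by = g]
#>    g x y
#> 1: a 3 9
#> 2: b 3 6

# Take the first row of each group
dt[, .SD[1L], by = g]
#>    g x y
#> 1: a 1 4
#> 2: b 3 6
```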

# Identify duplicate rows using data.table ------------------------------------
student_courses[,
                .(platform_key = do.call(str_c, c(.SD, sep = "-"))),
                .SDcols = c("id", "platform", "platform_start_date")] %>%
  .[, 
    .(row_number = .N),
    by = platform_key] %>%
  .[,
    .(total_students = .N),
    keyby = row_number] # Use keyby as we also want to sort by row_number

Note: The base R function do.call() constructs and executes a function call from a name or function and a list of function arguments.
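For example, these two calls are equivalent (the values are invented for illustration):

```r
# do.call() builds the call paste("id1", "A", sep = "-") from a list of arguments
do.call(paste, c(list("id1", "A"), sep = "-")) # "id1-A"

# which is the same as calling the function directly
paste("id1", "A", sep = "-")                   # "id1-A"
```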

Remove duplicate rows and aggregate by group

Given that duplicate platform_length records exist, you would use a two-step process to extract the total platform_length per student:

Using dplyr

# Calculate total platform_length per student using dplyr ----------------------
dplyr_query_3 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id) %>%
  summarise(total_days = sum(platform_length),
            min_days = min(platform_length),
            median_days = median(platform_length),
            max_days = max(platform_length)) %>%
  ungroup()

summary(dplyr_query_3 == dplyr_query_2)
#>     id          total_days       min_days       median_days      max_days
#>  Mode:logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
#>  TRUE:144676    FALSE:14674     FALSE:22        FALSE:9745      FALSE:24
#>                 TRUE :130002    TRUE :144654    TRUE :134931    TRUE :144652

Using data.table

# Calculate total platform_length per student using data.table -----------------
dt_query_3 <- student_courses[,
                              .SD[1L],
                              by = .(id, platform, platform_start_date)] %>%
  .[,
    .(total_days = sum(platform_length),
      min_days = min(platform_length),
      median_days = median(platform_length),
      max_days = max(platform_length)),
    keyby = id]

summary(dt_query_3 == setDT(dplyr_query_3))
#>     id          total_days     min_days       median_days    max_days
#>  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical
#>  TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676    TRUE:144676

Benchmark data operations

Because these queries involve group by operations, the data.table solutions are significantly faster.

Table 2: Units: milliseconds
expr                min         lq          mean         median      uq          max         neval
dplyr_query_2_code   5246.0321   7057.7399   7109.48259   7170.6375   7452.5706   7857.1419  25
dt_query_2_code        62.8657     65.1054     81.67138     88.1212     92.6909    104.0582  25
dplyr_query_3_code  14688.6743  15368.2905  18818.33617  19835.0295  20920.1685  21933.0659  25
dt_query_3_code       194.3412    277.8851    349.00902    295.4189    340.8037    783.2029  25

Summarise across multiple variables

Imagine that you are interested in how the length of time spent on a platform varies per student per platform. This solution is similar to the one above, except that transformations are applied only to selected variable(s) and more than one variable is specified in the grouping.

Using dplyr

In dplyr, this selection is facilitated by using across() inside summarise(), which allows you to apply the same function, or a named list of functions, to one or several columns.

# Remove duplicate rows and summarise across platform_length using dplyr -------
# Solution for a single function
dplyr_query_4_1 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   ~ mean(.x, na.rm = TRUE)),
            .groups = "drop") # Replaces the need to ungroup() after summarise()

id       platform platform_length
00007f23 A        53
00007f23 C        50
00007f23 D        42
00007f23 E        48

# Remove duplicate rows and summarise across platform_length using dplyr -------
# Solution for a list of functions

# Supply a named list of functions to across() inside summarise()
mean_sd <- list(mean = ~mean(.x, na.rm = T),
                sd = ~sd(.x, na.rm = T))

dplyr_query_4_2 <- student_courses %>%
  group_by(id, platform, platform_start_date) %>%
  filter(row_number() == 1L) %>%
  ungroup() %>%
  group_by(id, platform) %>%
  summarise(across(contains("length"),
                   mean_sd),
            .groups = "drop")

id       platform platform_length_mean platform_length_sd
00007f23 A        53                   19.79899
00007f23 C        50                   NA
00007f23 D        42                   NA
00007f23 E        48                   12.72792

Using data.table

In data.table, the equivalent method is to list columns of interest inside .SDcols and apply aggregations using lapply(.SD, ...).

# Remove duplicate rows and lapply() across platform_length using data.table ---
# Solution for a single function
dt_query_4_1 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    lapply(.SD, mean, na.rm = T),
    .SDcols = patterns("length"), # Safer than numeric indices from grep(), as column positions shift after grouping
    keyby = .(id, platform)]

summary(dt_query_4_1 == setDT(dplyr_query_4_1))
#>     id          platform       platform_length
#>  Mode:logical   Mode:logical   Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621
# Remove duplicate rows and lapply() across platform_length using data.table ---
# Solution for a list of functions
dt_query_4_2 <- student_courses[,
                                .SD[1L],
                                by = .(id, platform, platform_start_date)] %>%
  .[,
    unlist(lapply(.SD,
                  function(x) list(mean = mean(x),
                                   sd = sd(x))),
           recursive = F),
    .SDcols = patterns("length"),
    keyby = .(id, platform)]

summary(dt_query_4_2 == setDT(dplyr_query_4_2))
#>     id          platform       platform_length.mean platform_length.sd
#>  Mode:logical   Mode:logical   Mode:logical         Mode:logical
#>  TRUE:336621    TRUE:336621    TRUE:336621          TRUE:118638
#>                                                     NA's:217983

Note: The use of unlist() inside the j placeholder to convert individual list elements into individual columns is explained here.
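In brief, with an invented toy table: lapply(.SD, ...) returning a named list per column yields list elements, and unlist(..., recursive = FALSE) splices them into separate, suffixed columns.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = c(1, 3, 5))

# Each list element becomes its own column, named <col>.<fn>
dt[, unlist(lapply(.SD, function(v) list(mean = mean(v), max = max(v))),
            recursive = FALSE),
   .SDcols = "x",
   keyby = g]
#>    g x.mean x.max
#> 1: a      2     3
#> 2: b      5     5
```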

Benchmark data operations

As expected, the data.table operations run faster than their dplyr equivalents, although evaluating multiple functions over a subset of columns with unlist() incurs a disproportionately large slowdown.

Table 3: Units: milliseconds
expr             min         lq          mean        median     uq          max         neval
dplyr_query_4_1  44909.0833  46373.5223  48032.2252  49279.872  49431.1745  50167.4736  5
dt_query_4_1       226.2467    246.2729    277.0908    279.916    292.9285    340.0899  5
dplyr_query_4_2  53664.0369  54031.1293  55658.0703  56744.099  56923.1472  56927.9389  5
dt_query_4_2     17300.8488  17699.8367  18300.5890  18465.812  18525.7296  19510.7179  5

Use lag or lead operations

Finally, what if you were interested in the number of times a student switches between platforms? Obtaining this insight is a multi-step process:
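The core lag operation itself is simple: data.table exposes it through shift() and dplyr through lag() and lead(). A minimal sketch with invented values:

```r
library(data.table)

v <- c(10, 20, 30, 40)

shift(v, 1L, type = "lag")  # NA 10 20 30: each value moved one position later
shift(v, 1L, type = "lead") # 20 30 40 NA: each value moved one position earlier

# Comparing a column to its lag flags positions where the value changed
platform <- c("A", "A", "B", "A")
platform != shift(platform, 1L) # NA FALSE TRUE TRUE
```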

Using dplyr

# Calculate total platform switches per student using dplyr --------------------
# student_courses must already be sorted by id AND platform_start_date
dplyr_query_5 <- student_courses %>%
  group_by(id) %>%
  mutate(lag_platform = lag(platform, 1L),
         is_new_platform = case_when(is.na(lag_platform) ~ 1,
                                     platform != lag_platform ~ 1,
                                     TRUE ~ 0)) %>%
  summarise(platform_switch = sum(is_new_platform),
            .groups = "drop")

Using data.table

# Calculate total platform switches per student using data.table ---------------
dt_query_5 <- student_courses[,
                              lag_platform := shift(platform, 1L, type = "lag"),
                              keyby = id] %>%
  .[,
    is_new_platform := fcase(
      is.na(lag_platform), 1,
      platform != lag_platform, 1,
      default = 0),
    by = id] %>%
  .[,
    .(platform_switch = sum(is_new_platform)),
    by = id]

summary(dt_query_5 == setDT(dplyr_query_5))
#>     id          platform_switch
#>  Mode:logical   Mode:logical
#>  TRUE:144676    TRUE:144676

Benchmark data operations

From the benchmark of just the lag code component below, you can see that data.table's shift() is roughly twice as fast as dplyr's lag(), although lag operations are computationally expensive in both packages (note that the units are now seconds).

Table 4: Units: seconds
expr       min       lq        mean      median    uq        max       neval
dplyr_lag  4.681914  4.887524  5.076612  5.016075  5.231911  5.632238  25
dt_lag     2.194414  2.340580  2.450061  2.425271  2.528855  2.954475  25

Lazy data.table with dplyr verbs

Whilst I’ve become familiar with data.table syntax, there is still a way to retain the readability of dplyr alongside the performance of data.table: lazy data.table translations via the dtplyr package.

You can use lazy_dt() on data.frame objects to convert them into lazy_dt objects. Using dplyr code on data.table objects will also automatically convert that object into a lazy_dt object.

Note: I am currently experiencing a package dependency error caused by dtplyr which prevents .list() functions from evaluating inside data.table objects, so the code below exists for demonstration purposes only.

# Install and load dtplyr ------------------------------------------------------
install.packages("dtplyr")
library("dtplyr")

# Create data.frame and convert into lazy_dt object ----------------------------
set.seed(111)
test_df <- tibble(id = sample(seq(1, 10), size = 500, replace = T),
                  var_1 = rnorm(500, 0, 1),
                  var_2 = rnorm(500, 2, 1),
                  var_3 = rnorm(500, 0, 2))

test_dt <- lazy_dt(test_df)

The function show_query() can also be used to output the generated data.table code translation.

# Use show_query() to copy data.table translations -----------------------------
test_dt %>%
  arrange(desc(id)) %>%
  mutate(var_sum = sum(var_1, var_2, var_3)) %>%
  show_query()

#> `_DT1`[order(desc(id))][, `:=`(var_sum = sum(var_1, var_2, var_3))]

test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  show_query()
#> `_DT2`[, .(var_1 = mean(var_1), var_2 = mean(var_2), var_3 = mean(var_3)),
#>        keyby = .(id)]

Computation of lazy_dt objects is only performed if as.data.frame(), as.data.table(), or as_tibble() is explicitly called.

# Perform lazy evaluation to return a data.frame object ------------------------
test_dt %>%
  group_by(id) %>%
  summarise(across(starts_with("var_"), mean)) %>%
  as_tibble() # Outputs a tibble

Other resources


  1. Performing the two functions in separate steps is equivalent to the dplyr approach above.↩︎

  2. It is simpler to calculate and aggregate based on the course dwell length, which is a unique value. But then I wouldn’t be able to demonstrate any interesting code.↩︎