Introduction to data.table

Tags: data cleaning, data.table, dplyr, R

To data.table or dplyr? That is the question.

Erika Duan
01-30-2021

Introduction

Programming languages are human constructs. They hold sway when they are used en masse, as Python is for machine learning, and user factions can emerge when very different ways of doing the same thing coexist.

In R, this can manifest in the form of data.table versus dplyr debates. 1

Both R packages contain a comprehensive stack of functions for data wrangling. The tidyverse dplyr approach emphasises code readability whilst data.table scales complex manipulations of large datasets very efficiently. You can compare the efficiency of data.table versus other data wrangling packages on large datasets here.

Whilst I prefer to use dplyr on small datasets, where data.table efficiency gains are negligible, I recommend using data.table when working with large datasets (i.e. with more than ~0.5 million rows), where its speed advantages become substantial.

Let’s explore this for ourselves.

# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               lobstr, # Trace objects in memory 
               ids, # Generate random ids
               DT, # Create interactive tables  
               tidyverse,
               data.table,
               microbenchmark)

Create a test dataset

Imagine you have a dataset describing how students are engaging with online courses:

We can interactively examine the first 20 rows of the dataset using the R package DT.

Note: The code used to create this dataset can be accessed from my GitHub repository here.

Principles of data.table

In R, datasets exist as data.frame type objects. To apply data.table functions on a dataset, we need to convert a data.frame into a data.table object using setDT().

This function is flexible as it converts a data.frame by reference (i.e. without creating a duplicate data.table copy) and assigns both data.table and data.frame classes to the converted object.

# Convert data frame into data.table -------------------------------------------
class(student_courses)
#> [1] "tbl_df"     "tbl"        "data.frame"

mem_used()
#> 222,639,352 B

# Track object assignment in memory
tracemem(student_courses)
#> [1] "<0000022B07B7AB70>"

setDT(student_courses) # data.table is assigned to a new location in memory
#> tracemem[0x0000022b07b7ab70 -> 0x0000022b047a8478]: as.list.data.frame as.list vapply vapply_1i setDT
untracemem(student_courses)

mem_used()
#> 242,674,176 B

# Note that computer memory has not doubled following setDT()

class(student_courses)
#> [1] "data.table" "data.frame"

data.table query structure

A data.table query is structured in the form DT[i, j, by] where:

- i is used to filter, select or reorder rows,
- j is used to select, compute on or modify columns, and
- by specifies the column(s) to group by.
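As a minimal sketch of how the three placeholders combine in one query (using a hypothetical toy table, not the student_courses dataset):

```r
library(data.table)

# Toy table for illustration only
dt <- data.table(grp = c("a", "a", "b"), x = c(1, 2, 10))

# i filters rows, j computes a grouped summary, by defines the groups
dt[x > 1, .(mean_x = mean(x)), by = grp]
#>    grp mean_x
#> 1:   a      2
#> 2:   b     10
```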

data.table efficiency gains

There are a few reasons why data.table operations are fast. A key one, demonstrated below, is that data.table modifies columns by reference, whereas data frame columns are updated via R's copy-on-modify behaviour.

# Create data.frame and data.table objects -------------------------------------
df <- data.frame(id = seq(1:5),
                 letter = letters[1:5])

dt <- as.data.table(df)

# Update data.frame column using copy-on-modify --------------------------------
df_2 <- df %>%
  mutate(letter = toupper(letter))

# The new data frame is a shallow copy of the original data frame as
# only modified columns are newly created in memory.

ref(df, df_2)
#> o [1:0x22b03b6c998] <df[,2]>
#> +-id = [2:0x22b7f3f6350] <int>
#> \-letter = [3:0x22b0c54b328] <chr>

#> o [4:0x22b03b90e78] <df[,2]>
#> +-id = [2:0x22b7f3f6350]
#> \-letter = [5:0x22b05f9b890] <chr>

# Update data.table column by reference ----------------------------------------
obj_addr(dt)
#> [1] "0x22b016ed030"

dt[, letter := toupper(letter)]

obj_addr(dt)
#> [1] "0x22b016ed030"

Note: You do not need to assign datasets to names when modifying by reference using the := operator.

Filter data

Using dplyr

The basic syntax for filtering data is very similar for dplyr and data.table.

# Filter student_courses using dplyr -------------------------------------------
# Filter by platform A
student_courses %>%
  filter(platform == "A")

# Filter by all platforms excepting A
student_courses %>%
  filter(platform != "A")

# Filter by platforms A and C
student_courses %>%
  filter(platform %in% c("A", "C"))

# A comma between conditions is equivalent to the '&' (and) operator
student_courses %>%
  filter(id == "00007f23",
         between(platform_start_date, "2017-01-01", "2017-12-31"))

# Filter by a variable using regex
student_courses %>%
  filter(str_detect(course, "^R_"))

In dplyr version >= 1.0.0, the functions if_any() and if_all() can be used inside filter() to apply a condition across multiple columns.

# Filter across multiple columns using dplyr -----------------------------------
# Keep rows where the condition is TRUE for ALL date columns
student_courses %>%
  filter(if_all(ends_with("_date"), ~between(., "2017-01-01", "2017-12-31")))

# Keep rows where the condition is TRUE for at least one date column
student_courses %>%
  filter(if_any(ends_with("_date"), ~between(., "2017-01-01", "2017-12-31")))

Note: The function between() is equivalent to lower <= x & x <= upper when incbounds = TRUE and lower < x & x < upper when incbounds = FALSE.
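To make the incbounds behaviour concrete, here is a minimal sketch using between() from data.table:

```r
library(data.table)

x <- c(1, 5, 10)

# incbounds = TRUE (the default) includes both endpoints
between(x, 1, 10, incbounds = TRUE)
#> [1] TRUE TRUE TRUE

# incbounds = FALSE excludes both endpoints
between(x, 1, 10, incbounds = FALSE)
#> [1] FALSE  TRUE FALSE
```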

Using data.table

A minor difference is that data.table also contains a list of helper functions with optimised performance for filtering on specific data types like characters or integers.

# Filter student_courses using data.table --------------------------------------
# Filter by platform A
student_courses[platform == "A"]

# Filter by all platforms excepting A
student_courses[platform != "A"]

# Operator %chin% is equivalent to but faster than %in%
student_courses[platform %chin% c("A", "C")]

# Operator %between% or data.table::between() searches across a range of values
student_courses[id == "00007f23" &
                platform_start_date %between% c("2017-01-01", "2017-12-31")]

# Operator %like% allows you to search for a pattern in a character vector
student_courses[course %like% "R_"]

Currently, dplyr offers more versatility than data.table for filtering across multiple columns. A readable data.table equivalent of if_all() does not currently exist, and the equivalent of if_any() can be achieved using get() to return the values of multiple columns.
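That said, an if_all()-style filter can be approximated by combining Reduce() and lapply() over .SD. A minimal sketch on a hypothetical toy table (not the student_courses dataset):

```r
library(data.table)

dt <- data.table(a = c(1, 5, 20), b = c(2, 30, 3))
cols <- c("a", "b")

# The inner query builds one logical vector per column and Reduce()
# combines them with `&`; the result then subsets rows in i of the outer query
dt[dt[, Reduce(`&`, lapply(.SD, function(x) x %between% c(1, 10))), .SDcols = cols]]
#>    a b
#> 1: 1 2
```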

# Filter across multiple columns using data.table ------------------------------
# Keep rows where the condition is TRUE for at least one date column
date_cols <- str_subset(colnames(student_courses), "_date$")
student_courses[get(date_cols) %between% c("2017-01-01", "2017-12-31"), .SDcols = date_cols]

# get(date_cols) == condition is equivalent to if_any(condition) == TRUE

Benchmark data filtering

You can use the R package microbenchmark to measure code performance. 2

The function microbenchmark() runs each expression 100 times by default with the argument times = 100L. It outputs summary statistics on how long it takes to evaluate a single expression.
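As a minimal sketch of how such a benchmark is set up (the expressions below are illustrative, not the ones timed in this post):

```r
library(microbenchmark)

# Compare two equivalent ways of computing square roots
res <- microbenchmark(
  sqrt_fun = sqrt(1:10000),
  power_op = (1:10000) ^ 0.5,
  times = 100L # each expression is evaluated 100 times
)

print(res) # prints min, lq, mean, median, uq and max per expression
```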

Table 1: Units: milliseconds
expr min lq mean median uq max neval
filter(student_courses, platform == "A") 9.0241 9.89670 17.45452 10.77120 13.19065 188.7689 100
student_courses[platform == "A"] 8.4162 9.12350 11.79773 9.60965 10.48005 35.0547 100
filter(student_courses, platform %in% c("A", "C")) 12.1439 13.05705 20.04014 14.65670 33.03505 41.0345 100
student_courses[platform %chin% c("A", "C")] 13.4523 14.65795 19.35104 15.47045 16.69780 46.3068 100
filter(student_courses, str_detect(course, "^R_")) 83.2341 88.54035 97.67838 91.93555 102.11460 214.3023 100
student_courses[course %like% "R_"] 53.7119 57.01275 62.96518 58.66400 62.27945 224.8972 100
dplyr_filter_if_any 34.7018 37.33390 49.85545 41.14285 59.23520 215.8179 100
data.table_filter_get_cols 13.6371 14.82760 21.16597 15.69680 17.09135 186.9853 100

Sort data

Using dplyr

Sorting a data frame can be computationally expensive when multiple variables need to be ranked. This is why I recommend sorting your dataset once, right after basic data cleaning operations have been performed.

# Sort student_courses using dplyr ---------------------------------------------
student_courses %>%
  arrange(course_start_date)

student_courses %>%
  arrange(platform,
          id,
          desc(platform_start_date))

# Sorting by a descending date ranks the most recent date as first

Using data.table

In data.table, sorting is also performed inside i of DT[i, j, by]. Placing the operator - in front of a variable sorts it in descending order.

# Sort student_courses using data.table ----------------------------------------
student_courses[order(course_start_date)]

student_courses[order(platform,
                      id,
                      -platform_start_date)]

# You can also order columns in place using setorder()
setorder(student_courses,
         id,
         platform_start_date,
         course_start_date)

Benchmark data sorting

You can see that order() from data.table sorts data much faster than arrange() from dplyr when multiple variables are involved, although arrange() is slightly faster for a single variable.

Table 2: Units: milliseconds
expr min lq mean median uq max neval
arrange(student_courses, course_start_date) 27.2534 35.2252 44.51070 49.7276 52.7478 61.6616 25
student_courses[order(course_start_date)] 54.8829 62.0147 70.32617 69.7907 72.0628 96.8317 25
arrange(student_courses, platform, id, desc(platform_start_date)) 5605.0036 5654.6456 5702.51392 5682.9645 5712.8049 6182.9437 25
student_courses[order(platform, id, -platform_start_date)] 119.1436 125.0282 134.31395 130.4740 144.8021 160.8236 25

Select data columns

Using dplyr

In dplyr, performing operations on a tibble will always return another data frame, unless you explicitly use pull() to extract a column as a vector.

# Select column(s) using dplyr -------------------------------------------------
student_courses %>%
  select(id)

student_courses %>%
  select(c(id, platform, course))

# Select columns(s) using regex
student_courses %>%
  select(contains("date", ignore.case = F))

# Output data.frame with select() ----------------------------------------------
student_courses %>%
  select(id) %>%
  class()
#> [1] "data.table" "data.frame"

# Output vector with pull() ----------------------------------------------------
student_courses %>%
  pull(id) %>%
  class()
#> [1] "character"

student_courses %>%
  .[["id"]] %>%
  class()
#> [1] "character"

Using data.table

In data.table, column selection is performed inside j of DT[i, j, by] and returns a data.table if you wrap column names inside a list.

# Select column(s) using data.table --------------------------------------------
student_courses[,
                .(id)]

student_courses[,
                .(id,
                  platform,
                  course)]

# Select column(s) using regex
str_subset(colnames(student_courses), "date")
#> [1] "platform_start_date" "platform_end_date"   "course_start_date"   "course_end_date"    

student_courses[,
                str_subset(colnames(student_courses), "date"),
                with = F] 

# Output data.table by wrapping column names inside a list ---------------------
class(student_courses[, .(id)])
#> [1] "data.table" "data.frame"

# Output vector with [, col] or [["col"]] --------------------------------------
class(student_courses[, id])
#> [1] "character"

class(student_courses[["id"]])
#> [1] "character"

Note: In data.table, .(cols) is just a shorthand for list(cols) and is used to list columns in the j placeholder.
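A quick way to convince yourself of this equivalence:

```r
library(data.table)

dt <- data.table(x = 1:3, y = letters[1:3])

# .(x) and list(x) return identical data.tables in j
identical(dt[, .(x)], dt[, list(x)])
#> [1] TRUE
```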

Benchmark column selection

Interestingly, the benchmark below shows that dplyr is slightly faster than data.table for column selections.

Table 3: Units: milliseconds
expr min lq mean median uq max neval
select(student_courses, c(id, platform, course)) 1.8944 2.32380 2.633437 2.48395 2.82525 5.3393 100
student_courses[, .(id, platform, course)] 4.3514 5.12335 14.512512 7.25425 24.44105 136.2222 100
select(student_courses, contains("date", ignore.case = F)) 1.2884 1.55195 1.789318 1.72585 1.97540 3.0404 100
student_courses[, grep("date", names(student_courses), value = T), with = F] 1.3837 1.75865 7.824120 2.72795 3.90910 128.7633 100

Column creation

Using dplyr

As dplyr objects are data frames, a shallow data.frame copy is created whenever a column is modified using mutate().

# Create column(s) using dplyr -------------------------------------------------
# Create new columns from existing variables
student_courses %>%
  mutate(platform_dwell_length = platform_end_date - platform_start_date,
         platform_start_year = as.numeric(str_extract(platform_start_date, "^.{4}(?=-)")))

# Create column(s) with multiple conditions using dplyr ------------------------
str_subset(unique(student_courses$course), "^R_")
#> [1] "R_beginner"     "R_advanced"     "R_intermediate"

str_subset(unique(student_courses$course), "^Python_")
#> [1] "Python_intermediate" "Python_advanced"     "Python_beginner"

student_courses %>%
  mutate(studied_programming = case_when(str_detect(course, "^R_") ~ "Studied R",
                                         str_detect(course, "^Python_") ~ "Studied Python",
                                         TRUE ~ "No"))

In dplyr version >= 1.0.0, you can use mutate() in combination with across() to apply transformations across one or multiple columns.

# Create multiple columns using dplyr ------------------------------------------
# across() accepts function names or ~function(.x) syntax
student_courses %>%
  mutate(across(c(id, platform, course), ~toupper(.x)))

# where() can be used inside across() to select columns by type
student_courses %>%
  mutate(across(where(is.character), ~toupper(.x)))

In contrast to mutate(), transmute() only returns the transformed column(s) of interest.

# Create and only return column(s) using dplyr ---------------------------------
student_courses %>%
  transmute(across(c(id, platform, course), ~toupper(.x))) %>%
  ncol()
#> [1] 3

student_courses %>%
  mutate(across(c(id, platform, course), ~toupper(.x))) %>%
  ncol()
#> [1] 7

Using data.table

Column creation works slightly differently in data.table:

# Create column(s) using data.table --------------------------------------------
student_courses[,
                `:=` (platform_dwell_length = platform_end_date - platform_start_date,
                      platform_start_year = str_extract(platform_start_date, "^.{4}(?=-)"))]

student_courses[,
                platform_start_year := as.numeric(platform_start_year)]

# Create column(s) with multiple conditions using data.table -------------------
student_courses[,
                studied_programming := fcase(
                  str_detect(course, "^R_"), "Studied R",
                  str_detect(course, "^Python_"), "Studied Python",
                  default = "No")]

# Remove newly created columns using data.table --------------------------------
# Column(s) can be removed by assignment as NULL variable(s)
student_courses[,
                c("platform_dwell_length",
                  "platform_start_year",
                  "studied_programming") := NULL]

In data.table, the way to create multiple columns at once is to specify the columns of interest in .SDcols and then loop over each column using lapply().

# Create multiple columns using data.table -------------------------------------
# (col_vector) or c("x", "y") := lapply(...) modifies all columns in place
cols <- c("id", "platform", "course")
student_courses[,
                (cols) := lapply(.SD, toupper),
                .SDcols = cols]

# Equivalent code
student_courses[,
                c("id", "platform", "course") := lapply(.SD, toupper),
                .SDcols = c("id", "platform", "course")]

# Use `=` inside .() to return only the transformed columns --------------------
dt_subset <- student_courses[,
                             .(id = toupper(id),
                               platform = toupper(platform),
                               course = toupper(course))]

ncol(dt_subset)
#> [1] 3

# Equivalent code using .SDcols and lapply()
dt_subset <- student_courses[,
                             lapply(.SD, toupper),
                             .SDcols = cols]

Benchmark column creation

Table 4: Units: milliseconds
expr min lq mean median uq max neval
mutate(student_courses, platform_dwell_length = platform_end_date - platform_start_date) 5.0602 6.90770 25.55252 8.26925 9.68500 173.8566 100
student_courses[, :=("platform_dwell_length", platform_end_date - platform_start_date)] 3.6370 5.19895 25.54544 6.33755 7.82895 171.3265 100
dplyr_case_when 225.6839 249.70800 269.89679 267.90020 279.05350 458.1452 100
data.table_fcase 195.9269 208.36720 221.09485 213.40350 227.35105 377.5954 100
dplyr_mutate_across 371.1676 385.88140 404.99245 396.43500 407.44485 499.2400 100
data.table_lapply_SDcols 366.4740 386.23370 405.55840 392.76045 411.39315 497.5348 100

Simple group by operations

Using dplyr

Summarising by group is also where data.table significantly outperforms dplyr. A grouping is specified using the group_by() function in dplyr.

# Find total number of courses per student via dplyr ---------------------------
student_courses %>%
  group_by(id) %>%
  summarise(total_courses = n()) %>%
  ungroup()

# Code above is also equivalent to using count()
student_courses %>%
  count(id)

# Find total number of distinct courses per student via dplyr ------------------
student_courses %>%
  group_by(id) %>%
  summarise(total_distinct_courses = n_distinct(course)) %>%
  ungroup()

# Find the first course studied per student and platform via dplyr -------------
student_courses %>%
  group_by(id, platform) %>% # Group by two variables
  filter(row_number() == 1L) %>% # Return the first row from each group
  ungroup()

Using data.table

In data.table, a grouping is specified inside the by placeholder of DT[i, j, by].

# Find total number of courses per student via data.table ----------------------
student_courses[,
                .(total_courses = .N),
                by = id]

# Find total number of distinct courses per student via data.table -------------
student_courses[,
                .(total_distinct_courses = length(unique(course))),
                by = id]

# uniqueN(x) is a data.table function equivalent to length(unique(x))
student_courses[,
                .(total_distinct_courses = uniqueN(course)),
                by = id]

# Find the first course studied per student and platform via data.table --------
student_courses[,
                .SD[1L],
                by = .(id, platform)]

Benchmark simple group by operations

Group by operations are significantly faster in data.table than dplyr.

Table 5: Units: milliseconds
expr min lq mean median uq max neval
student_courses %>% group_by(id) %>% summarise(total_courses = n()) 2253.7473 2416.8814 2848.64304 2498.8589 3270.4622 4162.8841 10
student_courses[, .(total_courses = .N), by = id] 46.7541 50.0241 60.37993 53.6803 70.2789 83.9196 10
student_courses %>% group_by(id) %>% summarise(total_distinct_courses = n_distinct(course)) 4025.6523 4237.2057 4937.93044 4647.9254 5331.3878 6326.4315 10
student_courses[, .(total_distinct_courses = length(unique(course))), by = id] 847.1269 879.6714 1034.19212 971.1114 1218.1229 1356.3699 10
student_courses[, .(total_distinct_courses = uniqueN(course)), by = id] 6262.0236 6343.8344 7207.72692 6594.7426 7264.5609 10044.3720 10
student_courses %>% group_by(id, platform) %>% filter(row_number() == 1L) 7872.6713 8103.4138 9778.25568 9260.2453 10907.8334 13687.8050 10
student_courses[, .SD[1L], by = .(id, platform)] 69.1016 71.2565 78.52564 74.1029 78.2506 120.5737 10

Note: Not all data.table functions outperform their base R or dplyr equivalents. The data.table uniqueN(x) function is much slower than length(unique(x)).

Summary

Most data.table operations significantly outperform their dplyr equivalents in computational speed. I use data.table when grouping on large datasets (i.e. on datasets with greater than ~ 0.5 million rows) and use dplyr for day-to-day analyses of smaller datasets.

Other resources


  1. I am extremely thankful for encountering these Twitter debates, as they helped draw more attention to data.table usage.

  2. For accurate benchmarking, you need to separately run, save and print microbenchmark() outputs rather than directly knitting results.