This post covers some common techniques for cleaning free-text comment fields using string manipulation and regular expressions in R.
Comment fields sit somewhere in between tidy tabular data entries and large text files (i.e. documents) in terms of wrangling effort. They require human nuance to decode, and the quality and completeness of comments vary between individual entries.
This makes it hard to gauge whether cleaning comment fields is a worthwhile endeavour (especially when you have multiple other data sources that need examining). Luckily, some knowledge of string manipulations and regular expressions can help simplify this process.
Let’s get started.
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               tidyverse,
               microbenchmark)
Let’s imagine that my local chocolate company, Haighs Chocolates, wants to understand what food critics versus Haighs fans think about their newest product. They send out a bag of free samples with a link to an online survey that asks individuals to rate their chocolates (on a scale of 1 to 10) and provide additional comments.
Note: The code used to create this survey can be accessed from my github repository here.
# Examine the first 6 rows of data ---------------------------------------------
survey %>%
  head(6)
# A tibble: 6 x 3
respondee rating comment_field
<chr> <chr> <chr>
1 expert_1 8 "<textarea name=\"comment\" form=\"1\"> <Grade ~
2 expert_2 7 "<textarea name=\"comment\" form=\"1\"> <Grade ~
3 expert_3 8 "<textarea name=\"comment\" form=\"1\"> <Grade ~
4 expert_4 10 "<textarea name=\"comment\" form=\"1\"> <Grade ~
5 expert_5 7 "<textarea name=\"comment\" form=\"1\"> <Grade ~
6 fan_1 9 "<textarea name=\"comment\" form=\"1\"> Delicious ~
Oh dear, it looks like we will first need to use regular expressions to remove all the html tags embedded within survey$comment_field.
Regular expressions, or regex, can be thought of as a separate syntax for handling patterns in strings. In R, regular expressions can be directly enclosed inside quotes like character strings, or explicitly referenced inside regex(). For convenience, I prefer the former approach, but the latter approach can help increase code readability.
# Call regular expressions in R ------------------------------------------------
many_apples <- c("Apple", "apple", "APPLE", "apples")
str_extract(many_apples, "apples?")
#> [1] NA "apple" NA "apples"
# Call regular expressions in R using regex() ----------------------------------
# regex() provides additional arguments
str_extract(many_apples, regex("apples?", ignore_case = T))
#> [1] "Apple" "apple" "APPLE" "apples"
Some sequences of characters have specific meanings. For example, s refers to the letter "s" but \s refers to any type of white space. To call whitespace in R, a second backslash is required to escape the special character behaviour, i.e. \\s.
# Examples of special character sequences --------------------------------------
words_and_spaces <- c(" a cat",
"acat",
"a cat",
"a\ncat",
"a\\ncat")
# "a\\s+cat" calls variations of "a...cat" separated by one or more whitespaces
# The string "a\ncat" also counts because \n refers to a new line
str_extract(words_and_spaces, "a\\s+cat")
#> [1] "a cat" NA "a cat" "a\ncat" NA
# "\\S+" refers to everything that is not white space (starting from left to right)
str_extract(words_and_spaces, "\\S+")
#> [1] "a" "acat" "a" "a" "a\\ncat"
Note: The special characters \s versus \S allow the extraction of opposite pattern types. In another example, lowercase \w refers to any word character whilst uppercase \W and the negated class [^\w] both refer to anything that is not a word character.
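A quick sketch of this opposite behaviour using \w versus \W:
# Extract opposite pattern types with \\w versus \\W -----------------------------
str_extract_all("ice-cream!", "\\w+")[[1]] # Extract runs of word characters
#> [1] "ice"   "cream"
str_extract_all("ice-cream!", "\\W")[[1]] # Extract each non-word character
#> [1] "-" "!"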
I feel that the goal of writing good regex is to be as specific as possible. This is why character anchors are useful (i.e. using ^ and $ to denote the start and end of a string respectively).
If we re-visit the example above, we can see that the presence or absence of character anchors produces very different outputs.
# Investigate impact of character anchors --------------------------------------
words_and_spaces <- c(" a cat",
"acat",
"a cat",
"a\ncat",
"a\\ncat")
# "\\S+" refers to all non-white space read from left to right
str_extract(words_and_spaces, "\\S+")
#> [1] "a" "acat" "a" "a" "a\\ncat"
str_extract(words_and_spaces, "^\\S+")
#> [1] NA "acat" "a" "a" "a\\ncat"
str_extract(words_and_spaces, "\\S+$")
#> [1] "cat" "acat" "cat" "cat" "a\\ncat"
Character classes and groupings are handy for extracting specific letter and/or digit combinations. Some special characters found inside character classes and groupings are:
- or, represented by | i.e. [a|c]
- range, represented by - i.e. [a-z]
- excludes, represented by ^ i.e. [^a-c]

Note: Representation of a single character is denoted by [] and representation of a grouping (i.e. a combination of characters) is denoted by ().
# Extract patterns using character classes [] ----------------------------------
strange_fruits <- c("apple1",
"bapple2",
"capple3",
"dapple4",
"epple5",
"aggle0")
str_extract(strange_fruits, "[a-d]")
#> [1] "a" "b" "c" "d" NA "a"
str_extract(strange_fruits, "[a-d][^p]")
#> [1] NA "ba" "ca" "da" NA "ag"
# [a-d][^p] refers to a chr between a to d followed by a chr that is not p
str_extract(strange_fruits, "[0|4-9]")
#> [1] NA NA NA "4" "5" "0"
# [0|4-9] refers to a number that is zero or a number between 4 to 9
# Extract patterns using groupings () ------------------------------------------
strange_fruits <- c("apple1",
"bapple2",
"capple3",
"dapple4",
"epple5",
"aggle1")
str_extract(strange_fruits, "a(pp|gg)le")
#> [1] "apple" "apple" "apple" "apple" NA "aggle"
# Groups can be referenced by their order of appearance i.e. \\1 = first group
str_extract(strange_fruits, "(a)(p|g)\\2")
#> [1] "app" "app" "app" "app" NA "agg"
# Group 1 contains (a) and can be called using \\1
# Group 2 contains (p|g) and can be called using \\2
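Backreferences also work inside replacement strings. A small sketch using the same vector, where \\1 re-inserts group 1 and "mm" replaces the doubled letter captured by group 2:
# Reference groups inside replacement strings ------------------------------------
str_replace(strange_fruits, "(a)(p|g)\\2", "\\1mm")
#> [1] "ammle1"  "bammle2" "cammle3" "dammle4" "epple5"  "ammle1"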
In R, a regex token matches a single character by default. Adding the quantifiers * and + greedily extracts zero or more and one or more characters respectively. Quantifiers are greedy by default, i.e. they match as much of the string as possible; appending ? to a quantifier makes it lazy instead.
Combining a quantifier with an anchor allows you to extract just the first characters before a white space or punctuation mark. This is useful for trimming strings or extracting file or object names.
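A quick sketch of greedy versus lazy matching:
# Compare greedy versus lazy quantifiers -----------------------------------------
str_extract("Thursday", "\\w+") # Greedy matches as many characters as possible
#> [1] "Thursday"
str_extract("Thursday", "\\w+?") # Lazy matches as few characters as possible
#> [1] "T"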
# Examples of greedy matches ---------------------------------------------------
messy_dates <- c("Thursday 24th May",
"Thursday 24th May ",
" May",
"May ")
str_extract(messy_dates, "^\\w")
#> [1] "T" "T" NA "M"
# Greedily extract the first word in the string
str_extract(messy_dates, "^\\w+")
#> [1] "Thursday" "Thursday" NA "May"
# The quantifiers + and {1,} are equivalent
str_extract(messy_dates, "^\\w{1,}")
#> [1] "Thursday" "Thursday" NA "May"
str_extract(messy_dates, "^(\\S+)")
#> [1] "Thursday" "Thursday" NA "May"
# Examples of non-greedy matches -----------------------------------------------
str_replace_all(messy_dates, "\\s" , "-") # Replaces each individual whitespace
#> [1] "Thursday-24th-May" "Thursday--24th-May--" "-May" "May----"
str_replace_all(messy_dates, "\\s{1,2}" , "-")
#> [1] "Thursday-24th-May" "Thursday-24th-May-" "-May" "May--"
# Use look-arounds to replace the whitespace(s) after the first word
str_replace_all(messy_dates, "(?<=^\\w{1,2})\\s{1,2}" , "-")
#> [1] "Thursday-24th May" "Thursday-24th May " " May" "May- "
Note: For a deeper explanation of the regex syntax in the last example, read this Stack Overflow post.
Look-around operations are useful when you are unsure of the pattern itself, but you know exactly what its preceding or following pattern is. I've found that the clearest explanation of look-around operations comes from the RStudio cheatsheet on stringr, which summarises them as: a(?=c) looks ahead for "c", a(?!c) looks ahead for anything but "c", (?<=b)a looks behind for "b", and (?<!b)a looks behind for anything but "b".
# Examples of different types of look arounds ----------------------------------
recipes <- c("croissant recipes",
"apple pie recipe",
"chocolate cake recipe", # Extra space
"cookie receipe", # Deliberate typo
"secret KFC-recipe",
"very secret McDonalds soft-serve recipe") # Extra space
# Use positive look-ahead (?=...) to extract the preceding word
str_extract(recipes, "\\S+(?=\\s*recipes?)")
#> [1] "croissant" "pie" "cake" NA "KFC-" "soft-serve"
# Use positive look-behind (?<=) on "secret" to identify the secret recipes
str_extract(recipes, "(?<=secret\\s{1,10})\\S+.+")
#> [1] NA NA NA
#> [4] NA "KFC-recipe" "McDonalds soft-serve recipe"
Note: Positive look-behinds require defined boundary specifications, i.e. the quantifier + needs to be converted into a bounded form like {1,1000}.
With regex revised, let us return to our Haighs chocolate survey. The first thing we can see is that html tags have been retained inside the comment field and that this field is very long (i.e. difficult to read).
We can improve the readability of the survey by:
- removing the html tags and html code embedded inside comment_field.
- separating the single long comment field into multiple smaller columns using separate().

# Examine survey data ----------------------------------------------------------
survey %>%
  head(5)
# A tibble: 5 x 3
respondee rating comment_field
<chr> <chr> <chr>
1 expert_1 8 "<textarea name=\"comment\" form=\"1\"> <Grade ~
2 expert_2 7 "<textarea name=\"comment\" form=\"1\"> <Grade ~
3 expert_3 8 "<textarea name=\"comment\" form=\"1\"> <Grade ~
4 expert_4 10 "<textarea name=\"comment\" form=\"1\"> <Grade ~
5 expert_5 7 "<textarea name=\"comment\" form=\"1\"> <Grade ~
# Remove html tags -------------------------------------------------------------
# Remove html tags with <[^>]+>
# Remove html code with \\&\\w+\\;
survey <- survey %>%
  mutate(comment_field = str_replace_all(comment_field, "<[^>]+>", ""),
         comment_field = str_replace_all(comment_field, "\\&\\w+\\;", ""))
# Examine cleaned comment_field ------------------------------------------------
survey %>%
  select(comment_field) %>%
  head(5)
# A tibble: 5 x 1
comment_field
<chr>
1 " Grade A beans. Easily melts. Smooth chocolate shell, with a crunc~
2 " Grade A beans with subtle caramel hints. Melts well. Smooth exter~
3 " Grade a beans. Caramel and vanilla undertones complement the bit~
4 " Grade A cocoa beans. Melts easily. Smooth dark chocolate contrast~
5 " Grade A beans, likely of Ecuador origin. Smooth dark chocolate co~
We can then split the single long comment field into multiple smaller columns.
# Separate comment_field into individual columns -------------------------------
# Separate on punctuation or conjunctions
nmax <- max(str_count(survey$comment_field, "[[:punct:]]|and|with|against")) + 1
survey <- survey %>%
  separate(comment_field,
           into = paste0("Field", seq_len(nmax)),
           sep = "[[:punct:]]|and|with|against",
           remove = F,
           extra = "warn",
           fill = "right")
# Examine comment_fields -------------------------------------------------------
survey %>%
  select(starts_with("Field")) %>%
  head(5)
After separating the comment field into smaller fields, we see references to:
- cocoa bean grade
- chocolate flavour (i.e. caramel or vanilla) and smoothness
- chocolate texture (i.e. whether the chocolate melts)
- chocolate sweetness

Information about cocoa bean grade is highly structured. This means that extracting the letter following the word "Grade" is sufficient. A similar logic can be applied to extract whether caramel or vanilla flavour or chocolate smoothness was mentioned.
# Extract information about cocoa bean grade, flavour and smoothness -----------
tidy_survey <- survey %>%
  select(respondee,
         comment_field) %>%
  mutate(cocoa_grade = str_extract(comment_field, "(?<=[G|g]rade\\s{0,10})[A-C|a-c]"),
         is_caramel = case_when(str_detect(comment_field, "[C|c]aramel") ~ "yes",
                                TRUE ~ NA_character_),
         is_vanilla = case_when(str_detect(comment_field, "[V|v]anilla") ~ "yes",
                                TRUE ~ NA_character_),
         is_smooth = case_when(str_detect(comment_field, "[S|s]mooth") ~ "yes",
                               TRUE ~ NA_character_))
# We cannot assign TRUE ~ NA inside case_when as NA is of logical type
For more descriptive fields such as whether the chocolate melts, I find it easier to first extract a matrix of fields.
# Extract information about chocolate texture ----------------------------------
melt_matrix <- survey %>%
  select(respondee,
         starts_with("Field")) %>%
  mutate(across(starts_with("Field"),
                ~replace(.x, !(str_detect(.x, ".*\\b[M|m]elt.*\\b.*")), NA)))
# Convert fields which do not contain "melt" into NA and unite fields
melt_cols <- str_which(colnames(melt_matrix), "^Field.+")
melt_status <- melt_matrix %>%
  unite("is_melty",
        all_of(melt_cols),
        sep = "",
        remove = T,
        na.rm = T) # Make sure to remove NAs
# Convert responses into factors and re-code factor levels ---------------------
melt_status$is_melty <- factor(melt_status$is_melty)
levels(melt_status$is_melty)
#> [1] "" " Easily melts" " Melts easily"
#> [4] " melts in your mouth" " Melts well"
melt_status <- melt_status %>%
  mutate(is_melty = fct_collapse(is_melty,
                                 "yes" = c(" Easily melts",
                                           " Melts well",
                                           " Melts easily",
                                           " melts in your mouth"),
                                 "NA" = ""))
# Left join melt_status to tidy_survey -----------------------------------------
tidy_survey <- tidy_survey %>%
  left_join(melt_status,
            by = "respondee")
This process is repeated for chocolate sweetness.
# Extract information about chocolate sweetness --------------------------------
sweetness_matrix <- survey %>%
  select(respondee,
         starts_with("Field")) %>%
  mutate(across(starts_with("Field"),
                ~replace(.x, !(str_detect(.x, ".*\\b[S|s](weet|ugar).*\\b.*")), NA)))
# Convert fields which do not contain "sweet" or "sugar" into NA and unite fields
sweetness_cols <- str_which(colnames(sweetness_matrix), "^Field.+")
sweetness_status <- sweetness_matrix %>%
  unite("is_sweet",
        all_of(sweetness_cols),
        sep = "",
        remove = T,
        na.rm = T) # Make sure to remove NAs
# Convert responses into factors and re-code factor levels ---------------------
sweetness_status$is_sweet <- factor(sweetness_status$is_sweet)
levels(sweetness_status$is_sweet)
#> [1] "" " low sugar content " " not so sweet I enjoyed this"
#> [4] "filled core may be too sweet for some"
sweetness_status <- sweetness_status %>%
  mutate(is_sweet = fct_collapse(is_sweet,
                                 "yes" = c("filled core may be too sweet for some"),
                                 "no" = c(" low sugar content ",
                                          " not so sweet I enjoyed this"),
                                 "NA" = ""))
# Left join sweetness_status to tidy_survey ------------------------------------
tidy_survey <- tidy_survey %>%
  left_join(sweetness_status,
            by = "respondee")
Note: This method of converting topics into tabular variables works well when we are not dealing with too many factors (i.e. when recoding factors is not too cumbersome).
A reason why we might be interested in converting unstructured comment fields into structured variables is to generate data features for machine learning. For instance, we might be interested in whether there is a relationship between survey topics, whether the comment comes from a critic or chocolate fan, and the chocolate rating.
# Create final tidy_survey -----------------------------------------------------
survey_rating <- survey %>%
  select(respondee,
         rating)

tidy_survey <- tidy_survey %>%
  select(-comment_field) %>%
  left_join(survey_rating,
            by = "respondee") %>%
  mutate(respondee = str_extract(respondee, ".+(?=\\_[0-9]+)"))
set.seed(123) # Sample reproducibly
tidy_survey %>%
  sample_n(5)
Base R versus stringr functions

In R, string manipulation can be performed using either base R functions or functions from the stringr package. A key difference between base R and stringr functions is the order in which the string and the pattern are specified. The pattern, not the string, is specified first inside base R functions, which is not a pipe-friendly argument order.
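This difference is easiest to see inside a pipe, as sketched below:
# Compare base R versus stringr argument order inside pipes ----------------------
c("dark chocolate", "fruit") %>%
  str_detect("chocolate") # The piped vector lands in the first argument
#> [1]  TRUE FALSE

c("dark chocolate", "fruit") %>%
  grepl("chocolate", .) # Base R needs the . placeholder as the pattern comes first
#> [1]  TRUE FALSE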
# Examples using grep() --------------------------------------------------------
desserts <- c("chocolate",
"chocolate cake",
"chocolate tart",
"chocolate icecream",
"chocolate cookies",
"dark chocolate fudge",
"fruit",
"fruit tart",
"fruit sorbet")
grep(".*\\bchocolate\\b.*", desserts, value = F) # Default is value = FALSE
#> [1] 1 2 3 4 5 6
# grep(value = FALSE) only extracts the position of matching elements in the vector
str_which(desserts, ".*\\bchocolate\\b.*")
#> [1] 1 2 3 4 5 6
grep(".*\\bchocolate\\b.*", desserts, value = T)
#> [1] "chocolate" "chocolate cake" "chocolate tart" "chocolate icecream"
#> [5] "chocolate cookies" "dark chocolate fudge"
# grep(value = TRUE) extracts the matching elements in the vector
str_subset(desserts, ".*\\bchocolate\\b.*")
#> [1] "chocolate" "chocolate cake" "chocolate tart" "chocolate icecream"
#> [5] "chocolate cookies" "dark chocolate fudge"
# The function str_subset() is a wrapper around x[str_detect(x, pattern)]
# Examples using grepl() -------------------------------------------------------
desserts <- c("chocolate",
"chocolate cake",
"chocolate tart",
"chocolate icecream",
"chocolate cookies",
"dark chocolate fudge",
"fruit",
"fruit tart",
"fruit sorbet")
grepl(".*\\bchocolate\\b.*", desserts)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
str_detect(desserts, ".*\\bchocolate\\b.*")
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
desserts[str_detect(desserts, ".*\\bchocolate\\b.*")]
#> [1] "chocolate" "chocolate cake" "chocolate tart" "chocolate icecream"
#> [5] "chocolate cookies" "dark chocolate fudge"
# Examples using gsub() --------------------------------------------------------
desserts <- c("chocolate",
"chocolate cake",
"chocolate tart",
"chocolate icecream",
"chocolate cookies",
"dark chocolate fudge",
"fruit",
"fruit tart",
"fruit sorbet")
gsub("(dark )?chocolate", "vanilla", desserts)
#> [1] "vanilla" "vanilla cake" "vanilla tart" "vanilla icecream"
#> [5] "vanilla cookies" "vanilla fudge" "fruit" "fruit tart" "fruit sorbet"
str_replace_all(desserts, "(dark )?chocolate", "vanilla")
#> [1] "vanilla" "vanilla cake" "vanilla tart" "vanilla icecream"
#> [5] "vanilla cookies" "vanilla fudge" "fruit" "fruit tart" "fruit sorbet"
baser_vs_stringr <- microbenchmark(
  grep = grep(".*\\bchocolate\\b.*", desserts, value = F),
  str_which = str_which(desserts, ".*\\bchocolate\\b.*"),
  gsub = gsub("chocolate", "vanilla", desserts),
  str_replace_all = str_replace_all(desserts, "chocolate", "vanilla"),
  grepl = grepl(".*\\bchocolate\\b.*", desserts),
  str_detect = str_detect(desserts, ".*\\bchocolate\\b.*"),
  times = 1000)
autoplot(baser_vs_stringr)
Note: Base R functions are significantly faster than their stringr equivalents.
Other useful resources include the stringr package documentation and the RStudio stringr cheatsheet.