Cleaning free text and wrangling strings

data cleaning regex R

This post covers some common approaches for cleaning free text fields and wrangling strings in R.

Erika Duan
2020-12-31

Introduction

Comment fields sit somewhere in between tidy tabular data entries and large text files (i.e. documents) in terms of wrangling effort. They require human nuance to decode, and the quality and completeness of comments vary between individual entries.

This makes it hard to gauge whether cleaning comment fields is a worthwhile endeavour (especially when you have multiple other data sources that need examining). Luckily, some knowledge of string manipulations and regular expressions can help simplify this process.

Let’s get started.

# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,  
               tidyverse,  
               microbenchmark) 

Create a test dataset

Let’s imagine that my local chocolate company, Haighs Chocolates, wants to understand what food critics versus Haighs fans think about their newest product. They send out a bag of free samples with a link to an online survey that asks individuals to rate their chocolates (on a scale of 1 to 10) and provide additional comments.

Note: The code used to create this survey can be accessed from my github repository here.
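
For illustration, a minimal mock-up of the expected survey structure might look like the following (mock_survey and its values are invented placeholders, not the real survey data):

# Sketch the expected survey structure ------------------------------------------
mock_survey <- tibble(respondee = c("expert_1", "fan_1"),
                      rating = c("8", "9"),
                      comment_field = c("<textarea name=\"comment\" form=\"1\"> &lt;Grade A beans. Easily melts. </textarea>",
                                        "<textarea name=\"comment\" form=\"1\"> Delicious and melts in your mouth. </textarea>"))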

# Examine the first 6 rows of data ---------------------------------------------
survey %>%
  head(6)  
# A tibble: 6 x 3
  respondee rating comment_field                                      
  <chr>     <chr>  <chr>                                              
1 expert_1  8      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
2 expert_2  7      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
3 expert_3  8      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
4 expert_4  10     "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
5 expert_5  7      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
6 fan_1     9      "<textarea name=\"comment\" form=\"1\"> Delicious ~

Oh dear, it looks like we will first need to use regular expressions to remove all the html tags embedded within survey$comment_field.

Introduction to regular expressions

Regular expressions, or regex, can be thought of as a separate syntax for describing patterns in strings. In R, regular expressions can be written directly inside quotes, like ordinary character strings, or wrapped explicitly inside regex(). For convenience, I prefer the former approach, but the latter approach can help increase code readability.

# Call regular expressions in R ------------------------------------------------
many_apples <- c("Apple", "apple", "APPLE", "apples")

str_extract(many_apples, "apples?")
#> [1] NA       "apple"  NA       "apples"

# Call regular expressions in R using regex() ----------------------------------
# regex() provides additional arguments

str_extract(many_apples, regex("apples?", ignore_case = T))
#> [1] "Apple"  "apple"  "APPLE"  "apples"

Match characters

Some sequences of characters have specific meanings. For example, s refers to the letter "s" but \s refers to any type of white space. Because the backslash is itself an escape character inside R strings, a second backslash is required to write \s in R, i.e. \\s.
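
Note: writeLines() is a handy way to preview the string that the regex engine actually receives.

# Preview the pattern sent to the regex engine ----------------------------------
writeLines("\\s")
#> \s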

# Examples of special character sequences --------------------------------------
words_and_spaces <- c(" a cat",
                      "acat",
                      "a   cat",
                      "a\ncat",
                      "a\\ncat")

# "a\\s+cat" calls variations of "a...cat" separated by one or more whitespaces
# The string "a\ncat" also counts because \n refers to a new line

str_extract(words_and_spaces, "a\\s+cat")
#> [1] "a cat"       NA        "a   cat"       "a\ncat"      NA

# "\\S+" refers to everything that is not white space (starting from left to right)

str_extract(words_and_spaces, "\\S+")
#> [1] "a"       "acat"       "a"        "a"       "a\\ncat"

Note: The special characters \s versus \S allow the extraction of opposite pattern types. In another example, lowercase \w refers to any word character whilst uppercase \W and lowercase [^\w] both refer to anything that is not a word character.
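
For example, a minimal sketch contrasting the two:

# Compare \\w+ against \\W+ -----------------------------------------------------
str_extract_all("ice-cream!", "\\w+")[[1]]
#> [1] "ice"   "cream"

str_extract_all("ice-cream!", "\\W+")[[1]]
#> [1] "-" "!"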

Character anchors

I feel that the goal of writing good regex is to be as specific as possible. This is why character anchors are useful (i.e. using ^ and $ to denote the start and end of your string respectively).

If we re-visit the example above, we can see that the presence or absence of character anchors produces very different outputs.

# Investigate impact of character anchors --------------------------------------
words_and_spaces <- c(" a cat",
                      "acat",
                      "a   cat",
                      "a\ncat",
                      "a\\ncat")

# "\\S+" refers to all non-white space read from left to right

str_extract(words_and_spaces, "\\S+")
#> [1] "a"       "acat"    "a"       "a"       "a\\ncat"

str_extract(words_and_spaces, "^\\S+")  
#> [1] NA       "acat"    "a"       "a"       "a\\ncat"

str_extract(words_and_spaces, "\\S+$")
#> [1] "cat"     "acat"    "cat"     "cat"     "a\\ncat"

Character classes and groupings

Character classes and groupings are handy for extracting specific letter and/or digit combinations. Some special characters found inside character classes and groupings are ^ (negation, when placed first inside []), - (a range, inside []) and | (alternation, inside ()).

Note: A character class [] matches a single character from a set, whilst a grouping () represents a combination (i.e. sequence) of characters.

# Extract patterns using character classes [] ----------------------------------
strange_fruits <- c("apple1",
                    "bapple2",
                    "capple3",
                    "dapple4",
                    "epple5",
                    "aggle0")

str_extract(strange_fruits, "[a-d]")
#> [1] "a" "b" "c" "d" NA  "a"

str_extract(strange_fruits, "[a-d][^p]")
#> [1] NA   "ba" "ca" "da" NA   "ag"

# [a-d][^p] refers to a character between a and d followed by a character that is not p

str_extract(strange_fruits, "[0|4-9]")
#> [1] NA  NA  NA  "4" "5" "0"

# [0|4-9] refers to a number that is zero or a number between 4 to 9
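A quick check that | is treated literally inside a character class:

# Test | behaviour inside [] ----------------------------------------------------
str_extract("pipe|symbol", "[0|4-9]")
#> [1] "|"
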
# Extract patterns using groupings () ------------------------------------------
strange_fruits <- c("apple1",
                    "bapple2",
                    "capple3",
                    "dapple4",
                    "epple5",
                    "aggle1")

str_extract(strange_fruits, "a(pp|gg)le")
#> [1] "apple"      "apple"      "apple"      "apple"     NA      "aggle"

# Groups can be referenced by their order of appearance i.e. \\1 = first group

str_extract(strange_fruits, "(a)(p|g)\\2")
#> [1] "app"     "app"     "app"     "app"     NA     "agg"

# Group 1 contains (a) and can be called using \\1
# Group 2 contains (p|g) and can be called using \\2

Greedy versus lazy matches

In regular expressions, matching is greedy by default: the quantifiers * and + consume as many characters as possible whilst extracting zero or more and one or more characters respectively. Appending ? to a quantifier (e.g. +?) makes the match lazy, so it consumes as few characters as possible.

Combining quantifiers with anchors or look-arounds allows you to extract just the first characters before a white space or punctuation mark. This is useful for trimming strings or extracting file or object names.
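
As a quick illustration of the difference (a minimal sketch):

# Compare greedy versus lazy quantifiers ----------------------------------------
str_extract("Thursday", "^\\w+") # Greedy: consumes as many chrs as possible
#> [1] "Thursday"

str_extract("Thursday", "^\\w+?") # Lazy: consumes as few chrs as possible
#> [1] "T"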

# Examples of greedy matches ---------------------------------------------------
messy_dates <- c("Thursday 24th May",
                 "Thursday  24th May  ",
                 " May",
                 "May    ")

str_extract(messy_dates, "^\\w")
#> [1] "T" "T" NA  "M"

# Greedily extract the first word in the string

str_extract(messy_dates, "^\\w+")
#> [1] "Thursday" "Thursday" NA      "May"

# The quantifier + and {1,} are equivalent

str_extract(messy_dates, "^\\w{1,}")
#> [1] "Thursday" "Thursday" NA      "May"

str_extract(messy_dates, "^(\\S+)")
#> [1] "Thursday" "Thursday" NA      "May"

# Examples of non-greedy matches -----------------------------------------------
str_replace_all(messy_dates, "\\s" , "-") # Replaces each individual whitespace
#> [1] "Thursday-24th-May"    "Thursday--24th-May--" "-May"            "May----"

str_replace_all(messy_dates, "\\s{1,2}" , "-")
#> [1] "Thursday-24th-May"  "Thursday-24th-May-" "-May"                  "May--"

# Use look-arounds to replace the whitespace(s) after the first word

str_replace_all(messy_dates, "(?<=^\\w{1,2})\\s{1,2}" , "-")
#> [1] "Thursday-24th May"   "Thursday-24th May  " " May"               "May-  "

Note: For a deeper explanation of the regex syntax for the last example, read this stack overflow post.
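
For reference, the look-around pattern above can be broken down as follows:

# Break down "(?<=^\\w{1,20})\\s{1,2}" ------------------------------------------
# (?<= ... )  positive look-behind i.e. check what precedes the match
# ^\\w{1,20}  1 to 20 word characters anchored to the start of the string
# \\s{1,2}    one or two whitespaces (the actual match to be replaced)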

Look arounds

Look around operations are useful when you are unsure of the pattern itself, but you know exactly what its preceding or following pattern is. I’ve found that the clearest explanation of look around operations comes from the RStudio cheatsheet on stringr, as depicted below.

Figure 1: Taken from the RStudio stringr cheatsheet

# Examples of different types of look arounds ----------------------------------
recipes <- c("croissant recipes",
             "apple pie recipe",
             "chocolate cake  recipe", # Extra space
             "cookie receipe",  # Deliberate typo
             "secret KFC-recipe", 
             "very secret  McDonalds soft-serve recipe") # Extra space

# Use positive look-ahead (?=...) to extract the preceding word

str_extract(recipes, "\\S+(?=\\s*recipes?)")
#> [1] "croissant"   "pie"      "cake"       NA        "KFC-"       "soft-serve"

# Use positive look-behind (?<=) on "secret" to identify the secret recipes

str_extract(recipes, "(?<=secret\\s{1,10})\\S+.+")
#> [1] NA                 NA                        NA
#> [4] NA                "KFC-recipe"              "McDonalds soft-serve recipe"

Note: The ICU regex engine used by stringr requires look-behinds to have a bounded maximum length, i.e. an unbounded quantifier like + needs to be converted into a bounded form like {1,1000}.
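
A minimal check of this behaviour (the exact error message may vary with the stringr and ICU versions installed):

# Test look-behind boundary requirements ----------------------------------------
str_extract("secret recipe", "(?<=secret\\s+)\\w+")
#> Error: Look-Behind pattern matches must have a bounded maximum length

str_extract("secret recipe", "(?<=secret\\s{1,10})\\w+")
#> [1] "recipe"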

Improve comment field readability

With regex revised, let us return to our Haighs chocolate survey. The first thing we can see is that html tags have been retained inside the comment field and that this field is very long (i.e. difficult to read).

We can improve the readability of the survey by removing the embedded html tags and html character codes, and by splitting the single long comment field into multiple smaller columns.

# Examine survey data ----------------------------------------------------------
survey %>%
  head(5)
# A tibble: 5 x 3
  respondee rating comment_field                                      
  <chr>     <chr>  <chr>                                              
1 expert_1  8      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
2 expert_2  7      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
3 expert_3  8      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
4 expert_4  10     "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~
5 expert_5  7      "<textarea name=\"comment\" form=\"1\"> &lt;Grade ~

# Remove html tags -------------------------------------------------------------
# Remove html tags with <[^>]+>
# Remove html code with \\&\\w+\\;
survey <- survey %>%
  mutate(comment_field = str_replace_all(comment_field, "<[^>]+>", ""),
         comment_field = str_replace_all(comment_field, "\\&\\w+\\;", ""))

# Examine cleaned comment_field ------------------------------------------------
survey %>%
  select(comment_field) %>%
  head(5)
# A tibble: 5 x 1
  comment_field                                                       
  <chr>                                                               
1 " Grade A beans. Easily melts. Smooth chocolate shell, with a crunc~
2 " Grade A beans with subtle caramel hints. Melts well. Smooth exter~
3 " Grade a beans.  Caramel and vanilla undertones complement the bit~
4 " Grade A cocoa beans. Melts easily. Smooth dark chocolate contrast~
5 " Grade A beans, likely of Ecuador origin. Smooth dark chocolate co~

We can then split the single long comment field into multiple smaller columns. 1

# Separate comment_field into individual columns -------------------------------
# Separate on punctuation or conjunctions
nmax <- max(str_count(survey$comment_field, "[[:punct:]]|and|with|against")) + 1

survey <- survey %>%
  separate(comment_field,
           into = paste0("Field", seq_len(nmax)),
           sep = "[[:punct:]]|and|with|against",
           remove = F,
           extra = "warn",
           fill = "right")

# Examine comment_fields -------------------------------------------------------
survey %>%
  select(starts_with("Field")) %>%
  head(5)

Manually extract topics of interest

After separating the comment field into smaller fields, we see references to cocoa bean grade, caramel and vanilla flavours, chocolate smoothness, melting texture and sweetness.

Information about cocoa bean grade is highly structured. This means that extracting the letter following the word “Grade” is sufficient. A similar logic can be applied to extract whether caramel or vanilla flavour or chocolate smoothness was mentioned.

# Extract information about cocoa bean grade, flavour and smoothness -----------
tidy_survey <- survey %>%
  select(respondee,
         comment_field) %>%
  mutate(cocoa_grade = str_extract(comment_field, "(?<=[Gg]rade\\s{0,10})[A-Ca-c]"),
         is_caramel = case_when(str_detect(comment_field, "[Cc]aramel") ~ "yes",
                                TRUE ~ NA_character_),
         is_vanilla = case_when(str_detect(comment_field, "[Vv]anilla") ~ "yes",
                                TRUE ~ NA_character_),
         is_smooth = case_when(str_detect(comment_field, "[Ss]mooth") ~ "yes",
                               TRUE ~ NA_character_))

# We cannot assign TRUE ~ NA inside case_when as NA is of logical type

For more descriptive fields such as whether the chocolate melts, I find it easier to first extract a matrix of fields.

# Extract information about chocolate texture ----------------------------------
melt_matrix <- survey %>%
  select(respondee,
         starts_with("Field")) %>%
  mutate(across(starts_with("Field"),
                ~replace(.x, !(str_detect(.x, ".*\\b[Mm]elt.*\\b.*")), NA)))

# Convert fields which do not contain "melt" into NA and unite fields

melt_cols <- str_which(colnames(melt_matrix), "^Field.+")

melt_status <- melt_matrix %>%
  unite("is_melty",
        all_of(melt_cols),
        sep = "",
        remove = T,
        na.rm = T) # Make sure to remove NAs

# Convert responses into factors and re-code factor levels ---------------------
melt_status$is_melty <- factor(melt_status$is_melty)

levels(melt_status$is_melty)
#> [1] ""        " Easily melts"        " Melts easily"
#> [4] " melts in your mouth"         " Melts well"

melt_status <- melt_status %>%
  mutate(is_melty = fct_collapse(is_melty,
                                 "yes" = c(" Easily melts",
                                           " Melts well",
                                           " Melts easily",
                                           " melts in your mouth"),
                                 "NA" = ""))

# Left join tidy_survey to melt_status -----------------------------------------
tidy_survey <- tidy_survey %>%
  left_join(melt_status,
            by = "respondee")

This process is repeated for chocolate sweetness. 2

# Extract information about chocolate sweetness --------------------------------
sweetness_matrix <- survey %>%
  select(respondee,
         starts_with("Field")) %>%
  mutate(across(starts_with("Field"),
                ~replace(.x, !(str_detect(.x, ".*\\b[Ss](weet|ugar).*\\b.*")), NA)))

# Convert fields which do not contain "sweet" or "sugar" into NA and unite fields

sweetness_cols <- str_which(colnames(sweetness_matrix), "^Field.+")

sweetness_status <- sweetness_matrix %>%
  unite("is_sweet",
        all_of(sweetness_cols),
        sep = "",
        remove = T,
        na.rm = T) # Make sure to remove NAs

# Convert responses into factors and re-code factor levels ---------------------
sweetness_status$is_sweet <- factor(sweetness_status$is_sweet)

levels(sweetness_status$is_sweet)
#> [1] ""         " low sugar content "          " not so sweet  I enjoyed this"
#> [4] "filled core may be too sweet for some"

sweetness_status <- sweetness_status %>%
  mutate(is_sweet = fct_collapse(is_sweet,
                                 "yes" = c("filled core may be too sweet for some"),
                                 "no" = c(" low sugar content ",
                                          " not so sweet  I enjoyed this"),
                                 "NA" = ""))

# Left join tidy_survey to sweetness_status ------------------------------------
tidy_survey <- tidy_survey %>%
  left_join(sweetness_status,
            by = "respondee")

Note: This method of converting topics into tabular variables works well when we are not dealing with too many factors (i.e. when recoding factors is not too cumbersome).
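
Following footnote 2, the repeated melt and sweetness steps above could be wrapped into a reusable function. Below is a minimal sketch under the assumptions of this post (extract_topic is a hypothetical name and the Field column prefix comes from the earlier separate() step):

# Sketch a reusable topic extraction helper -------------------------------------
extract_topic <- function(data, pattern, into) {
  field_cols <- str_which(colnames(data), "^Field.+")

  data %>%
    mutate(across(starts_with("Field"),
                  ~replace(.x, !(str_detect(.x, pattern)), NA))) %>%
    unite("topic", all_of(field_cols), sep = "", remove = T, na.rm = T) %>%
    rename(!!into := topic)
}

# Hypothetical usage replicating the melt extraction above:
# melt_status <- survey %>%
#   select(respondee, starts_with("Field")) %>%
#   extract_topic(".*\\b[Mm]elt.*\\b.*", "is_melty")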

Extract a machine learning friendly dataset

A reason why we might be interested in converting unstructured comment fields into structured variables is to generate data features for machine learning. For instance, we might be interested in whether there is a relationship between survey topics, whether the comment comes from a critic or chocolate fan, and the chocolate rating.

# Create final tidy_survey -----------------------------------------------------
survey_rating <- survey %>%
  select(respondee,
         rating)

tidy_survey <- tidy_survey %>%
  select(-comment_field) %>%
  left_join(survey_rating,
            by = "respondee") %>%
  mutate(respondee = str_extract(respondee, ".+(?=\\_[0-9]+)"))

set.seed(123) # Sample reproducibly
tidy_survey %>%
  sample_n(5)

Differences between base R and stringr functions

In R, string manipulation can be performed using either base R functions or functions from the stringr package. A key difference is the argument order: base R functions take the pattern first and the string second, which is not a pipe-friendly order, whilst stringr functions take the string first.
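
As a quick illustration (a minimal sketch):

# Compare argument orders inside a pipe -----------------------------------------
c("chocolate", "fruit") %>% str_subset("chocolate") # String first
#> [1] "chocolate"

c("chocolate", "fruit") %>% grep(pattern = "chocolate", value = T) # Needs a named pattern
#> [1] "chocolate"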

# Examples using grep() --------------------------------------------------------
desserts <- c("chocolate",
              "chocolate cake",
              "chocolate tart",
              "chocolate icecream",
              "chocolate cookies",
              "dark chocolate fudge",
              "fruit",
              "fruit tart",
              "fruit sorbet")

grep(".*\\bchocolate\\b.*", desserts, value = F) # Default is value = FALSE
#> [1] 1 2 3 4 5 6

# grep(value = FALSE) only extracts the position of matching elements in the vector

str_which(desserts, ".*\\bchocolate\\b.*")
#> [1] 1 2 3 4 5 6

grep(".*\\bchocolate\\b.*", desserts, value = T)
#> [1] "chocolate"    "chocolate cake"    "chocolate tart"    "chocolate icecream"
#> [5] "chocolate cookies"    "dark chocolate fudge"

# grep(value = TRUE) extracts the matching elements in the vector

str_subset(desserts, ".*\\bchocolate\\b.*")
#> [1] "chocolate"    "chocolate cake"    "chocolate tart"    "chocolate icecream"
#> [5] "chocolate cookies"    "dark chocolate fudge"

# The function str_subset() is a wrapper around x[str_detect(x, pattern)]

# Examples using grepl() -------------------------------------------------------
desserts <- c("chocolate",
              "chocolate cake",
              "chocolate tart",
              "chocolate icecream",
              "chocolate cookies",
              "dark chocolate fudge",
              "fruit",
              "fruit tart",
              "fruit sorbet")

grepl(".*\\bchocolate\\b.*", desserts)
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

str_detect(desserts, ".*\\bchocolate\\b.*")
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

desserts[str_detect(desserts, ".*\\bchocolate\\b.*")]
#> [1] "chocolate"    "chocolate cake"   "chocolate tart"    "chocolate icecream"
#> [5] "chocolate cookies"    "dark chocolate fudge"
# Examples using gsub() --------------------------------------------------------
desserts <- c("chocolate",
              "chocolate cake",
              "chocolate tart",
              "chocolate icecream",
              "chocolate cookies",
              "dark chocolate fudge",
              "fruit",
              "fruit tart",
              "fruit sorbet")

gsub("(dark )?chocolate", "vanilla", desserts)
#> [1] "vanilla"          "vanilla cake"     "vanilla tart"     "vanilla icecream"
#> [5] "vanilla cookies"  "vanilla fudge"    "fruit"    "fruit tart"   "fruit sorbet"

str_replace_all(desserts, "(dark )?chocolate", "vanilla")
#> [1] "vanilla"          "vanilla cake"     "vanilla tart"     "vanilla icecream"
#> [5] "vanilla cookies"  "vanilla fudge"    "fruit"    "fruit tart"   "fruit sorbet"

# Benchmark base R versus stringr functions -------------------------------------
baser_vs_stringr <- microbenchmark(grep = grep(".*\\bchocolate\\b.*", desserts, value = F),
                                   str_which = str_which(desserts, ".*\\bchocolate\\b.*"),
                                   gsub = gsub("chocolate", "vanilla", desserts),
                                   str_replace_all = str_replace_all(desserts, "chocolate", "vanilla"),
                                   grepl = grepl(".*\\bchocolate\\b.*", desserts),
                                   str_detect = str_detect(desserts, ".*\\bchocolate\\b.*"),
                                   times = 1000)

autoplot(baser_vs_stringr)

Note: In this benchmark, the base R functions are noticeably faster than their stringr equivalents (at least on small character vectors like desserts).
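
If speed matters and your pattern is a literal string rather than a regular expression, wrapping it in stringr's fixed() bypasses the regex engine (a minimal sketch):

# Use fixed() for literal string matching ---------------------------------------
str_detect(desserts, fixed("chocolate"))
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE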

Other resources


  1. Many R functions require R regex classes to be wrapped in a second set of [ ], e.g. [[:punct:]].↩︎

  2. As a repetitive step within the workflow, we might want to rewrite the generic part of this cleaning step as a function stored in a separate R script for maximal readability.↩︎