Four approaches to feature engineering with regular expressions in R
Update May 22, 2019: Thanks to @hglanz for noting that I could have used `pull` instead of the `.` as a placeholder.
library(tidyverse) # general use
library(titanic) # to get titanic data set
The `Name` variable in the `titanic` data set has all unique values. To get started, let’s visually inspect a few values.
titanic_train %>%
select(Name) %>%
head(10)
Name
1 Braund, Mr. Owen Harris
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3 Heikkinen, Miss. Laina
4 Futrelle, Mrs. Jacques Heath (Lily May Peel)
5 Allen, Mr. William Henry
6 Moran, Mr. James
7 McCarthy, Mr. Timothy J
8 Palsson, Master. Gosta Leonard
9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
10 Nasser, Mrs. Nicholas (Adele Achem)
In this brief printout, each passenger’s title is consistently located between `,` and `.`. Luckily, this holds true for all observations! With this consistency, we can somewhat easily extract a title from `Name`. Depending on your field, this operation may be referred to as something along the lines of data preparation or feature engineering.
Approach | Function(s) | Regular expression(s) |
---|---|---|
1 | str_locate + str_sub | "," + "\\." |
2 | str_match | "(.*)(, )(.*)(\\.)(.*)" |
3 | str_extract + str_sub | "([A-z]+)\\." |
4 | str_replace_all | "(.*, )|(\\..*)" |
Read on for explanations!
Regular expressions can be used to parse character strings; you can think of them as a key to unlock string patterns. The trick is to identify the right regular expression + function combination. Let’s demo four ways to tackle the challenge using functions from the stringr package; each method specifies a different string pattern to match.
library(stringr)
ICYMI, the double bracket in `titanic_train[["Name"]]` is used to extract a named variable vector from a data frame, which has some benefits over the more commonly used dollar sign (i.e., `titanic_train$Name`). Now onward.
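The post does not spell out which benefits it has in mind. One commonly cited difference, offered here as an assumption about what is meant, is that `$` partially matches column names on a base data frame while `[[` matches exactly, and that `[[` accepts a column name stored in a variable. The toy data frame `df` and the variable `col` below are made up for illustration.

```r
# Toy data frame, made up for illustration (not the titanic data)
df <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"),
  stringsAsFactors = FALSE
)

# `$` partially matches column names on a base data.frame,
# so an abbreviated name silently returns the Name column
partial <- df$Na

# `[[` requires an exact match by default, so the same typo returns NULL
exact <- df[["Na"]]

# `[[` also works with a column name held in a variable
col <- "Name"
by_variable <- df[[col]]

print(partial)
print(is.null(exact))
print(by_variable)
```

Tibbles behave differently from base data frames with `$`, so this sketch only speaks to the base `data.frame` case.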
The str_locate function produces the starting and ending position of a specified pattern. If we consider the comma to be a pattern, we can figure out where it is located in each name. Here, the starting and ending value is the same because the comma is only one character.
titanic_train[["Name"]] %>%
str_locate(",") %>%
head()
start end
[1,] 7 7
[2,] 8 8
[3,] 10 10
[4,] 9 9
[5,] 6 6
[6,] 6 6
Knowing this, we can identify the positions of the comma and the period and then extract the text in between. Some notes here:
- Because the `str_locate` function returns a matrix, we use `.` as a placeholder in `.[,1]` to access the first column of values.
- Because `.` is a special character in regular expressions, we use the double backslash in `"\\."` to escape it.
comma_pos <- titanic_train[["Name"]] %>%
str_locate(",") %>%
.[,1]
period_pos <- titanic_train[["Name"]] %>%
str_locate("\\.") %>%
.[,1]
Now we can use str_sub to extract substrings from the character vector based on their physical position. To exclude the punctuation and white space, we can add two to the comma position and subtract one from the period position to get the title only.
titanic_train[["Name"]] %>%
str_sub(comma_pos + 2, period_pos - 1) %>%
head()
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
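As a self-contained check of the position arithmetic, here is a minimal sketch on three names copied from the printout above. Indexing the matrix by its `"start"` column name is one alternative to the `.` placeholder; the vector `x` and the variable `titles` are names chosen for this example only.

```r
library(stringr)

# Three names copied from the printout earlier in the post
x <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# str_locate() returns a matrix with "start" and "end" columns;
# indexing by column name avoids the `.` placeholder
comma_pos  <- str_locate(x, ",")[, "start"]
period_pos <- str_locate(x, "\\.")[, "start"]

# Skip the comma and the space after it; stop just before the period
titles <- str_sub(x, comma_pos + 2, period_pos - 1)
print(titles)
```

For the first name the comma sits at position 7 and the period at position 11, so `str_sub` keeps characters 9 through 10, which is `"Mr"`.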
Super!
The str_match function creates a character matrix with one column for each group matched in the specified pattern. With the correct regular expression, `str_match` returns the complete match (in the first column) in addition to each matched group. Here’s a quick example:
# ----------5 groups-->>>----1---2---3----4---5----
str_match("XXX, YYY. ZZZ", "(.*)(, )(.*)(\\.)(.*)")
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "XXX, YYY. ZZZ" "XXX" ", " "YYY" "." " ZZZ"
Let’s break down this regular expression pattern.
- `()` indicate a grouping.
- `(, )` only has one modification from the pattern used in the first example - here we include a space after the comma.
- `(\\.)` remains unchanged from the first example.
- `(.*)` combines two pieces: `.` means any character, and `*` means match zero or more times.

To execute this, we’ll grab the 4th column to catch our title.
titanic_train[["Name"]] %>%
str_match("(.*)(, )(.*)(\\.)(.*)") %>%
.[,4] %>%
head()
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
All right, we got it again!
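One thing to keep in mind with this pattern is that `.*` is greedy. The sketch below uses a hypothetical name with a second period after the title to show that the greedy third group runs to the last period, while a lazy `(.*?)` stops at the first. Whether any name in the data actually has this shape is not checked here; the name and the lazy variant are assumptions for illustration, not part of the original approach.

```r
library(stringr)

# Hypothetical name with a second period after the title
x <- "Doe, Mr. John Jr. (hypothetical)"

# Greedy third group: (.*) grabs as much as it can,
# so it runs up to the LAST period in the string
greedy <- str_match(x, "(.*)(, )(.*)(\\.)(.*)")[, 4]

# Lazy third group: (.*?) grabs as little as it can,
# so it stops at the FIRST period after the comma
lazy <- str_match(x, "(.*)(, )(.*?)(\\.)(.*)")[, 4]

print(greedy)
print(lazy)
```

If the greedy version ever returned more than a title on real data, switching only the third group to `(.*?)` would be the smallest change to try.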
Next, let’s use the str_extract function to extract matching patterns. This seems like what we wanted to do all along!
titanic_train[["Name"]] %>%
str_extract("([A-z]+)\\.") %>%
head()
[1] "Mr." "Mrs." "Miss." "Mrs." "Mr." "Mr."
Let’s break down this regular expression:
- `[]` specifies a list of permitted characters.
- `A-z` is a character range meant to cover upper- and lowercase letters. (Strictly, it also spans the few punctuation characters that sit between `Z` and `a` in ASCII, such as `_`; `A-Za-z` is the stricter form.)
- `+` outside the bracket signifies “match at least one time”, i.e., grab all the letters.
- `\\.` indicates that our pattern should end in a period.

This pattern is a bit more sophisticated to compose than the previous ones, but it gets right to the point! This last effort does end in a period, whereas the others do not. If we wanted to remove the period for consistency, we could use `str_sub` with the `end` argument to specify the position of the last character.
titanic_train[["Name"]] %>%
str_extract("([A-z]+)\\.") %>%
str_sub(end = -2) %>%
head()
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
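Two details of this approach are easy to miss, so here is a small sketch on made-up strings. It compares the `[A-z]` range with the stricter `[A-Za-z]`, and it shows a lookahead, `(?=\\.)`, as one possible way to require the period without capturing it. The strings in `x` are hypothetical, and the lookahead is a swapped-in alternative to the `str_sub(end = -2)` step, not the method used in the post.

```r
library(stringr)

# Made-up strings for illustration
x <- c("Braund, Mr. Owen Harris", "Doe, the Countess. Jane", "snake_case. text")

# [A-z] spans every code point from "A" to "z", which includes "_"
wide <- str_extract(x, "[A-z]+\\.")

# [A-Za-z] is limited to ASCII letters
strict <- str_extract(x, "[A-Za-z]+\\.")

# The lookahead (?=\\.) requires a period but leaves it out of the match
no_period <- str_extract(x, "[A-Za-z]+(?=\\.)")

print(wide)
print(strict)
print(no_period)
```

Note that a title containing a space is only captured from its last word here (`"Countess"` rather than `"the Countess"`), because a space is not in the permitted character list.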
As a last approach, we can use str_replace_all to replace all matched patterns with empty strings. Here, we specify the pattern and then the replacement string.
titanic_train[["Name"]] %>%
str_replace_all("(.*, )|(\\..*)", "") %>%
head()
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
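In this regular expression, the two groups are joined by `|`, which means “or”: `(.*, )` matches everything up to and including the comma and space, and `(\\..*)` matches the period and everything after it.

Now that we figured out how to extract the title, I’ll utilize the last method and assign `title` as a variable to the `titanic_train` data set using the `mutate` function.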
titanic_train <- titanic_train %>%
mutate(title = str_replace_all(titanic_train[["Name"]], "(.*, )|(\\..*)", ""))
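To see what each side of the alternation removes, here is a minimal sketch on two names from the printout. It also shows `str_remove_all`, which is stringr shorthand for replacing matches with an empty string; the vector `x` and the result names are chosen for this example only.

```r
library(stringr)

# Two names copied from the printout earlier in the post
x <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

# Each alternative is removed wherever it matches:
#   (.*, )   everything up to and including the comma and space
#   (\\..*)  the first remaining period and everything after it
replaced <- str_replace_all(x, "(.*, )|(\\..*)", "")

# str_remove_all() is shorthand for replacing every match with ""
removed <- str_remove_all(x, "(.*, )|(\\..*)")

print(replaced)
print(identical(replaced, removed))
```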
Now let’s use `count` to get a frequency table of the titles, with the `sort = TRUE` option to arrange the results in descending order.
titanic_train %>%
count(title, sort = TRUE)
title n
1 Mr 517
2 Miss 182
3 Mrs 125
4 Master 40
5 Dr 7
6 Rev 6
7 Col 2
8 Major 2
9 Mlle 2
10 Capt 1
11 Don 1
12 Jonkheer 1
13 Lady 1
14 Mme 1
15 Ms 1
16 Sir 1
17 the Countess 1
We can see that there are several infrequent titles occurring only one or two times, and so we should re-classify them. If you want to squeeze the most juice out of your data, try to figure out the historical context and meaning of those titles to create a better classification for them. For now, let’s take the easy way out by just re-classifying them to an `Other` group.
Fortunately, the `forcats` package has an awesome function that lets us do this quickly: `fct_lump`. We’re using `mutate` again to re-classify `title`. The `fct_lump` function combines the least frequent values together in an `Other` group, and the `n = 6` option specifies to keep the 6 most common values (so the 7th value is `Other`).
titanic_train %>%
mutate(title = fct_lump(title, n = 6)) %>%
count(title, sort = TRUE)
title n
1 Mr 517
2 Miss 182
3 Mrs 125
4 Master 40
5 Other 14
6 Dr 7
7 Rev 6
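The lumping behavior is easier to see on a tiny vector. This is a minimal sketch with made-up title counts, not the titanic data; the variable names `title` and `lumped` are chosen for the example. Recent forcats releases also provide `fct_lump_n()`, which may be worth a look if you are on a newer version.

```r
library(forcats)

# Made-up titles: Mr x3, Miss x2, Dr x1, Rev x1
title <- c("Mr", "Mr", "Mr", "Miss", "Miss", "Dr", "Rev")

# Keep the 2 most frequent values; the rest are combined into "Other"
lumped <- fct_lump(title, n = 2)

# Frequency table of the lumped factor
print(table(lumped))
```

Here `Dr` and `Rev` are the two least frequent values, so both end up in `Other`, mirroring how the 11 rarest titles above collapse into a single group of 14.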
If you wanted to explicitly re-code the infrequent titles to something more meaningful than `Other`, look into fct_recode.
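As a rough sketch of what that could look like, the example below folds three of the rare titles into more common ones with `fct_recode`, which takes `new_level = "old_level"` pairs. The mapping itself (treating `Mlle` and `Ms` as `Miss`, and `Mme` as `Mrs`) is an illustrative assumption, not a classification proposed in the post.

```r
library(forcats)

# A few titles from the frequency table, as a factor
title <- factor(c("Mr", "Mlle", "Mme", "Ms", "Miss", "Mrs"))

# new_level = "old_level"; this mapping is an illustrative assumption
recoded <- fct_recode(title, Miss = "Mlle", Mrs = "Mme", Miss = "Ms")

print(as.character(recoded))
```

Levels that are not mentioned, such as `Mr`, are left unchanged.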
Super, now `title` is ready to use for analysis!
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Pileggi (2018, Dec. 11). PIPING HOT DATA: Stringr 4 ways. Retrieved from https://www.pipinghotdata.com/posts/2018-12-11-stringr-4-ways/
BibTeX citation
@misc{pileggi2018stringr,
  author = {Pileggi, Shannon},
  title = {PIPING HOT DATA: Stringr 4 ways},
  url = {https://www.pipinghotdata.com/posts/2018-12-11-stringr-4-ways/},
  year = {2018}
}