PIPING HOT DATA: Stringr 4 ways

Shannon Pileggi

Update May 22, 2019: Thanks to @hglanz for noting that I could have used pull instead of the . as a placeholder.


library(tidyverse) # general use
library(titanic)   # to get titanic data set

Overview

The Name variable in the titanic data set has all unique values. To get started, let’s visually inspect a few values.


titanic_train %>% 
  select(Name) %>% 
  head(10)


                                                  Name
1                              Braund, Mr. Owen Harris
2  Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
6                                     Moran, Mr. James
7                              McCarthy, Mr. Timothy J
8                       Palsson, Master. Gosta Leonard
9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
10                 Nasser, Mrs. Nicholas (Adele Achem)

In this brief printout, each passenger’s title is consistently located between a comma (,) and a period (.). Luckily, this holds true for all observations! With this consistency, we can somewhat easily extract a title from Name. Depending on your field, this operation may be referred to as data preparation or feature engineering.
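If you would rather confirm the "comma, title, period" layout than eyeball it, a quick check along these lines may help. This is only a sketch: sample_names is a hypothetical stand-in vector copied from the printout above, and the pattern is one of several that could express the same idea.

```r
library(stringr)

# Hypothetical stand-in for titanic_train[["Name"]], copied from the printout
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard"
)

# TRUE when a name contains ", " followed by one or more
# non-period characters and then a period
has_title <- str_detect(sample_names, ", [^.]+\\.")

all(has_title)
```

Swapping sample_names for titanic_train[["Name"]] would run the same check on every passenger.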

TL; DR

Approach	Function(s)	Regular expression(s)
1	str_locate + str_sub	`","` + `"\\."`
2	str_match	`"(.*)(, )(.*)(\\.)(.*)"`
3	str_extract + str_sub	`"([A-z]+)\\."`
4	str_replace_all	`"(.*, )|(\\..*)"`

Read on for explanations!

Overview of regular expressions

Regular expressions are used to parse character strings; you can think of them as a key to unlock string patterns. The trick is to identify the right regular expression + function combination. Let’s demo four ways to tackle the challenge using functions from the stringr package; each method specifies a different string pattern to match.


library(stringr)

Extracting title from name

First approach

ICYMI, the double bracket in titanic_train[["Name"]] is used to extract a named variable vector from a data frame, which has some benefits over the more commonly used dollar sign (i.e., titanic_train$Name). Now onward.
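One of those benefits can be shown in a few lines of base R. The sketch below uses a small hypothetical data frame called passengers (not the real data): the dollar sign partially matches column names, while the double bracket insists on an exact name and also accepts a name stored in a variable.

```r
# Hypothetical two-row stand-in for titanic_train
passengers <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"),
  stringsAsFactors = FALSE
)

# `$` partially matches "Na" to the Name column
partial_dollar <- passengers$Na

# `[[` requires the exact column name, so this is NULL
partial_bracket <- passengers[["Na"]]

# `[[` also works when the column name is held in a variable
column <- "Name"
by_variable <- passengers[[column]]
```

Note that this partial matching is the behavior of a base data frame; other data frame classes may handle the dollar sign differently.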

The str_locate function returns the starting and ending positions of the first match of a specified pattern. If we consider the comma to be a pattern, we can figure out where it is located in each name. Here, the starting and ending values are the same because the comma is only one character.


titanic_train[["Name"]] %>% 
  str_locate(",") %>%
  head()


     start end
[1,]     7   7
[2,]     8   8
[3,]    10  10
[4,]     9   9
[5,]     6   6
[6,]     6   6

Knowing this, we can identify the positions of the comma and the period and then extract the text in between. Some notes here:

Because the str_locate function returns a matrix, we use . as a placeholder in .[,1] to access the first column of values.
Because . is a special character in regular expressions, we use the double backslash in "\\." to escape it.


comma_pos <- titanic_train[["Name"]] %>% 
  str_locate(",") %>% 
  .[,1]

period_pos <- titanic_train[["Name"]] %>% 
  str_locate("\\.") %>% 
  .[,1]

Now we can use str_sub to extract substrings from the character vector based on their physical position. To exclude the punctuation and white space, we can add two to the comma position and subtract one from the period position to get the title only.


titanic_train[["Name"]] %>% 
  str_sub(comma_pos + 2, period_pos - 1) %>% 
  head()


[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"
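For comparison, here is a sketch of the same first approach written without the . placeholder, picking the matrix column by its name ("start") rather than by position. The sample_names vector is a hypothetical stand-in copied from the printout at the top of the post.

```r
library(stringr)

# Hypothetical stand-in for titanic_train[["Name"]]
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard"
)

# str_locate returns a matrix with columns "start" and "end"
comma_pos  <- str_locate(sample_names, ",")[, "start"]
period_pos <- str_locate(sample_names, "\\.")[, "start"]

# Skip the comma and the space after it; stop just before the period
titles <- str_sub(sample_names, comma_pos + 2, period_pos - 1)
titles
```

Indexing by name is a little more verbose, but it keeps working if you later decide you want the "end" column instead.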

Super!

Second approach

The str_match function returns a character matrix with one column for the complete match, followed by one column for each group in the specified pattern. Here’s a quick example:


# ----------5 groups-->>>----1---2---3----4---5----                      
str_match("XXX, YYY. ZZZ", "(.*)(, )(.*)(\\.)(.*)")


     [,1]            [,2]  [,3] [,4]  [,5] [,6]  
[1,] "XXX, YYY. ZZZ" "XXX" ", " "YYY" "."  " ZZZ"

Let’s break down this regular expression pattern.

The parentheses () indicate a grouping.
The 2nd grouping (, ) only has one modification from the pattern used in the first example - here we include a space after the comma.
The 4th grouping (\\.) remains unchanged from the first example.
In the 1st, 3rd, and 5th groupings (.*):
- The period . means any character.
- The asterisk * means match zero or more times.
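One thing worth knowing about .* is that it is greedy: it grabs as many characters as it can. The sketch below uses a made-up name with two periods (hypothetical, not taken from the data) to show what that means for the third group, alongside a leaner single-group pattern with the lazy quantifier .*? as one possible alternative.

```r
library(stringr)

# Made-up name containing two periods, purely for illustration
tricky_name <- "Smith, Dr. J. Robert"

# Greedy: the 3rd group runs up to the LAST period
greedy <- str_match(tricky_name, "(.*)(, )(.*)(\\.)(.*)")[, 4]

# Lazy: the single group stops at the FIRST period;
# column 1 is the complete match, so the group is column 2
lazy <- str_match(tricky_name, ", (.*?)\\.")[, 2]

greedy  # "Dr. J"
lazy    # "Dr"
```

On names with exactly one period, both patterns pick out the same title.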

To execute this, we’ll grab the 4th column to catch our title.


titanic_train[["Name"]] %>% 
  str_match("(.*)(, )(.*)(\\.)(.*)") %>%
  .[,4] %>% 
  head()


[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"

All right, we got it again!

Third approach

Next, let’s use the str_extract function to extract matching patterns. This seems like what we wanted to do all along!


titanic_train[["Name"]] %>% 
  str_extract("([A-z]+)\\.") %>%
  head()


[1] "Mr."   "Mrs."  "Miss." "Mrs."  "Mr."   "Mr."

Let’s break down this regular expression:

Here, the brackets [] specify a list of permitted characters.
Inside the brackets, the range A-z covers both upper and lower case letters. (Strictly speaking, it also covers the handful of punctuation characters that sit between Z and a in ASCII; [A-Za-z] is the tighter alternative.)
The + outside the brackets signifies “match at least one time”, i.e., grab all the letters.
Finally, ending in \\. indicates that our pattern should end in a period.
Putting it all together, this pattern translates to “grab all upper case and lower case letters immediately preceding a period.”
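To see what the A-z range actually covers, here is a small sketch that tests the six ASCII characters sitting between Z and a against both A-z and the stricter A-Za-z. The variable names are my own; only str_detect comes from stringr.

```r
library(stringr)

# The six ASCII characters between "Z" and "a"
# ("\\" is a single backslash once R parses the string)
between_Z_and_a <- c("[", "\\", "]", "^", "_", "`")

# A-z is a single range, so it includes these characters
loose <- str_detect(between_Z_and_a, "[A-z]")

# A-Za-z is two separate ranges, so it does not
strict <- str_detect(between_Z_and_a, "[A-Za-z]")

loose   # all TRUE
strict  # all FALSE
```

None of these characters appear right before a period in the names shown here, so the title extraction is unaffected, but [A-Za-z] is the safer habit.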

This pattern is a bit more sophisticated to compose than the previous ones, but it gets right to the point! This result does end in a period, whereas the others do not. If we wanted to remove the period for consistency, we could use str_sub with the end argument to specify the position of the last character; end = -2 keeps everything up to the second-to-last character.


titanic_train[["Name"]] %>% 
  str_extract("([A-z]+)\\.") %>%
  str_sub(end = -2) %>%
  head()


[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"
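Another way to drop the period, sketched here as an alternative rather than as the approach used in this post, is a lookahead: (?=\\.) requires a period to follow the letters without making it part of the match. The sample_names vector is a hypothetical stand-in, and its third entry is an illustrative multi-word title.

```r
library(stringr)

# Hypothetical stand-in names; the third has a two-word title
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"
)

# Letters that are immediately followed by a period;
# the period itself is not included in the match
titles <- str_extract(sample_names, "[A-Za-z]+(?=\\.)")
titles
```

Notice the third result: a letters-only pattern stops at the space, so a multi-word title comes back as "Countess" rather than "the Countess". The same is true of the str_extract + str_sub version above, which is worth keeping in mind when comparing the approaches.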

Fourth approach

As a last approach, we can use str_replace_all to replace all matched patterns with an empty string. Here, we specify the pattern and then the replacement string.


titanic_train[["Name"]] %>% 
  str_replace_all("(.*, )|(\\..*)", "") %>%
  head()


[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"

In this regular expression,

There are two groupings separated by a vertical pipe | which means “or”.
The two groupings should look familiar:
- The first grouping looks for all characters up to and including a comma followed by a space.
- The second grouping looks for a period and all characters following it.
Putting it together, if grouping 1 or grouping 2 is satisfied then the matched characters are replaced with an empty string.
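Since replacing with an empty string is really just removing, stringr also offers str_remove_all for exactly this case. Here is a minimal sketch on a hypothetical sample_names vector showing that the two calls agree.

```r
library(stringr)

# Hypothetical stand-in for titanic_train[["Name"]]
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Palsson, Master. Gosta Leonard"
)

pattern <- "(.*, )|(\\..*)"

# Replace every match with an empty string ...
replaced <- str_replace_all(sample_names, pattern, "")

# ... or remove every match directly
removed <- str_remove_all(sample_names, pattern)

identical(replaced, removed)
```

Which one to use is a matter of taste; str_remove_all simply saves typing the empty replacement.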

Re-classifying entries

Now that we figured out how to extract the title, I’ll utilize the last method and assign title as a variable to the titanic_train data set using the mutate function.


titanic_train <- titanic_train %>%
  mutate(title = str_replace_all(Name, "(.*, )|(\\..*)", ""))

Now let’s use count to get a frequency table of the titles, with the sort = TRUE option to arrange the results in descending order.


titanic_train %>%
  count(title, sort = TRUE)


          title   n
1            Mr 517
2          Miss 182
3           Mrs 125
4        Master  40
5            Dr   7
6           Rev   6
7           Col   2
8         Major   2
9          Mlle   2
10         Capt   1
11          Don   1
12     Jonkheer   1
13         Lady   1
14          Mme   1
15           Ms   1
16          Sir   1
17 the Countess   1

We can see that there are several infrequent titles occurring only one or two times, and so we should re-classify them. If you want to squeeze the most juice out of your data, try to figure out the historical context and meaning of those titles to create a better classification for them. For now, let’s take the easy way out by just re-classifying them to an “Other” group.

Fortunately, the forcats package has an awesome function that lets us do this quickly: fct_lump. We’re using mutate again to re-classify title. The fct_lump function combines the least frequent values together in an Other group, and the n = 6 option specifies to keep the 6 most common values (so the 7th value is Other).


titanic_train %>%
  mutate(title = fct_lump(title, n = 6)) %>%
  count(title, sort = TRUE)


   title   n
1     Mr 517
2   Miss 182
3    Mrs 125
4 Master  40
5  Other  14
6     Dr   7
7    Rev   6
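Newer versions of forcats (0.5.0 and later, as far as I know) also provide fct_lump_n, which does the "keep the n most common" job under a more specific name. The sketch below uses a small made-up titles vector rather than the real data, so the counts are purely illustrative.

```r
library(forcats)

# Made-up titles: three common values and two rare ones
titles <- factor(c(rep("Mr", 5), rep("Miss", 3), rep("Mrs", 2), "Dr", "Rev"))

# Keep the 3 most common levels; everything else becomes "Other"
lumped <- fct_lump_n(titles, n = 3)

table(lumped)
```

Here Dr and Rev are folded into Other, leaving four levels in total.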

If you wanted to explicitly re-code the infrequent titles to something more meaningful than other, look into fct_recode.
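As a rough sketch of what that might look like, fct_recode takes new = "old" pairs, and repeating a new name merges several old levels into it. The groupings below (for example, treating Mlle and Ms as Miss, or pooling military ranks into a hypothetical Officer level) are my own illustrative assumptions, not a researched classification.

```r
library(forcats)

# Made-up sample of titles, including a few infrequent ones
titles <- factor(c("Mr", "Mlle", "Mme", "Ms", "Col", "Major", "Capt"))

# new_level = "old_level"; repeated new names merge levels
regrouped <- fct_recode(
  titles,
  Miss    = "Mlle",
  Miss    = "Ms",
  Mrs     = "Mme",
  Officer = "Capt",
  Officer = "Col",
  Officer = "Major"
)

as.character(regrouped)
```

Levels that are not mentioned, like Mr here, are left as they are.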

Super, now title is ready to use for analysis!
