In data science for the social sciences, text data needs to be handled with particular care. R has many packages for working with text data. In this workshop, we will go over the stringr package, which is included in the tidyverse package. We will also talk about regular expressions (regex), a common way to describe string patterns across most languages.
stringr
There are many functions that start with str_ in the stringr package. In this workshop, we are going to talk about the ones I use most often. The str_ functions generally take their arguments in the form str_...(string, pattern, sep).
For a more complete look at the functions, see the cheat sheet linked at the end of these workshop notes or search online. For this workshop, we will continue using the data from the tidyverse workshop.
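# Load the tidyverse, which includes stringr, dplyr, and readr
library(tidyverse)
# Read the workshop data into a tibble
data <- read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")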
str_detect()
The first commonly used str_ function is str_detect(). As its name suggests, this function detects a pattern in your strings: str_detect() searches each string and returns TRUE if the specified pattern is present and FALSE otherwise. For example, suppose we want to find the strings in the cohort column that contain the pattern “ology”:
data$cohort%>%
str_detect(pattern="ology")%>%
head(10)
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
This also means that we can combine str_detect() with mutate() to create a column that indicates whether the desired pattern is present in the string:
data%>%
mutate(ology = as.numeric(str_detect(cohort,
pattern="ology")))
## # A tibble: 1,976 x 7
## workshop age county social cohort gpa ology
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81 1
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23 0
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34 0
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58 1
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political Sci… 3.63 0
## 6 Yes 27 Racine County, WI VANESSA Psychology 3.88 1
## 7 Yes 31 <NA> SGGSAC Sociology 3.68 1
## 8 Yes 26 Dekalb, GA VANESSA Anthropology 3.41 1
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00 0
## 10 Yes 24 York County, South Ca… SGGSAC History 3.87 0
## # … with 1,966 more rows
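Because str_detect() returns a logical vector, it can also be used to count or subset matches. The following is a minimal sketch on a small made-up vector (cohorts is a hypothetical stand-in for data$cohort, not the workshop data):

```r
library(stringr)

# Hypothetical toy vector standing in for data$cohort
cohorts <- c("Anthropology", "Economics", "Sociology", "History")

# One TRUE/FALSE per element
has_ology <- str_detect(cohorts, pattern = "ology")
has_ology
## [1]  TRUE FALSE  TRUE FALSE

# TRUE counts as 1, so sum() gives the number of matching strings
n_ology <- sum(has_ology)
n_ology
## [1] 2

# Logical vectors can be used directly for subsetting
ology_only <- cohorts[has_ology]
ology_only
## [1] "Anthropology" "Sociology"
```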
str_extract() and str_extract_all()
Most str_ functions are named in a fairly straightforward way. In this case, str_extract() extracts the first match of the pattern from each string, while str_extract_all() extracts every match of the pattern in the string.
You might be thinking: if I already know the pattern I want, why wouldn’t I just mutate() a new column containing it?
The answer is that a pattern can actually be quite vague. We will discuss this in the regular expression section. For now, let’s continue our last example and extract the pattern “ology”.
data%>%
mutate(ology = str_extract(cohort,
pattern="ology"))
## # A tibble: 1,976 x 7
## workshop age county social cohort gpa ology
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81 ology
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23 <NA>
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34 <NA>
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58 ology
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political Sci… 3.63 <NA>
## 6 Yes 27 Racine County, WI VANESSA Psychology 3.88 ology
## 7 Yes 31 <NA> SGGSAC Sociology 3.68 ology
## 8 Yes 26 Dekalb, GA VANESSA Anthropology 3.41 ology
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00 <NA>
## 10 Yes 24 York County, South Ca… SGGSAC History 3.87 <NA>
## # … with 1,966 more rows
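To see the difference between the two functions, here is a small sketch on a made-up vector (x is hypothetical, not part of the workshop data). Note that str_extract_all() returns a list with one element per input string, rather than a character vector:

```r
library(stringr)

# Hypothetical toy vector
x <- c("Anthropology", "Economics")

# str_extract() keeps only the first match in each string
first_o <- str_extract(x, pattern = "o")
first_o
## [1] "o" "o"

# str_extract_all() keeps every match, returned as a list
all_o <- str_extract_all(x, pattern = "o")

# Number of matches found in each string
n_o <- lengths(all_o)
n_o
## [1] 3 2
```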
str_replace() and str_replace_all()
str_replace() simply replaces the pattern you specify with the replacement text you provide. The difference between str_replace() and str_replace_all() is that str_replace() only replaces the pattern the first time it comes up in the string. str_replace_all(), on the other hand, replaces every match of the pattern.
For example, if we want to replace “ology” with “ologist”, we can use str_replace()
:
data%>%
mutate(cohort = str_replace(cohort,
pattern="ology",
replacement="ologist"))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthropologist 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Sociologist 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political Scien… 3.63
## 6 Yes 27 Racine County, WI VANESSA Psychologist 3.88
## 7 Yes 31 <NA> SGGSAC Sociologist 3.68
## 8 Yes 26 Dekalb, GA VANESSA Anthropologist 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00
## 10 Yes 24 York County, South Caroli… SGGSAC History 3.87
## # … with 1,966 more rows
But if I just want to change all the “o”’s into “e”’s, str_replace()
would be insufficient:
data%>%
mutate(cohort = str_replace(cohort,
pattern="o",
replacement="e"))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthrepology 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Ecenomics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Ecenomics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Seciology 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Pelitical Scien… 3.63
## 6 Yes 27 Racine County, WI VANESSA Psychelogy 3.88
## 7 Yes 31 <NA> SGGSAC Seciology 3.68
## 8 Yes 26 Dekalb, GA VANESSA Anthrepology 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Ecenomics 4.00
## 10 Yes 24 York County, South Caroli… SGGSAC Histery 3.87
## # … with 1,966 more rows
As you can see from above, only the first “o” is replaced by “e”. To correct that, we would use str_replace_all()
:
data%>%
mutate(cohort = str_replace_all(cohort,
pattern="o",
replacement="e"))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthrepelegy 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Ecenemics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Ecenemics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Secielegy 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Pelitical Scien… 3.63
## 6 Yes 27 Racine County, WI VANESSA Psychelegy 3.88
## 7 Yes 31 <NA> SGGSAC Secielegy 3.68
## 8 Yes 26 Dekalb, GA VANESSA Anthrepelegy 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Ecenemics 4.00
## 10 Yes 24 York County, South Caroli… SGGSAC Histery 3.87
## # … with 1,966 more rows
If you want to remove a pattern, you can either use str_remove() / str_remove_all(), or use str_replace() / str_replace_all() with replacement="".
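As a quick check of that equivalence, here is a minimal sketch on a single made-up string (x is hypothetical, not taken from the workshop data):

```r
library(stringr)

# Hypothetical toy string
x <- "Sociology"

# Remove only the first "o"
removed_first <- str_remove(x, pattern = "o")
removed_first
## [1] "Sciology"

# Remove every "o"
removed_all <- str_remove_all(x, pattern = "o")
removed_all
## [1] "Scilgy"

# Replacing with an empty string gives the same result
same_result <- identical(removed_all,
                         str_replace_all(x, pattern = "o", replacement = ""))
same_result
## [1] TRUE
```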
str_to_...()
These functions let you convert a string to a particular case. There are four forms: upper, lower, title, and sentence. upper and lower change the string to the respective case. title capitalizes the first letter of each word. sentence capitalizes the first letter of the sentence.
data$county[2]%>%
str_to_upper()
## [1] "FRANKLIN COUNTY, OH"
data$county[2]%>%
str_to_lower()
## [1] "franklin county, oh"
data$county[2]%>%
str_to_title()
## [1] "Franklin County, Oh"
data$county[2]%>%
str_to_sentence()
## [1] "Franklin county, oh"
Regular expressions
Regular expressions are a common way to describe string patterns in most programming languages. In the examples below, I describe what str_extract() returns when applied to the string “aaabc123d”.
( )
When you want an exact sequence of characters to be treated as a single unit in a pattern, put parentheses around it.
For example, (abc) extracts exactly “abc” and (abcd) extracts nothing, because “abcd” never appears in the string.
|
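The | (or) operator lets you match any one of several alternatives, much like | in a dplyr::filter() condition.
For example, (abc)|(ab)|d extracts “abc”. If you use str_extract_all() instead, (abc)|(ab)|d extracts “abc”, “d”. Matches never overlap, so once “abc” has been matched, the “ab” inside it is not matched again.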
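The grouping and alternation behavior can be checked directly on the example string. This is a small sketch; s is just a name chosen here for “aaabc123d”. Because matches are found from left to right and never overlap, the search resumes after “abc” and the next match it finds is “d”:

```r
library(stringr)

s <- "aaabc123d"

# A group that appears in the string is extracted as-is
grp_hit <- str_extract(s, "(abc)")
grp_hit
## [1] "abc"

# A group that never appears gives NA
grp_miss <- str_extract(s, "(abcd)")
grp_miss
## [1] NA

# Alternation: the first (leftmost) match wins
alt_first <- str_extract(s, "(abc)|(ab)|d")
alt_first
## [1] "abc"

# All non-overlapping matches; [[1]] pulls the vector out of the list
alt_all <- str_extract_all(s, "(abc)|(ab)|d")[[1]]
alt_all
## [1] "abc" "d"
```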
[ ]
When you use brackets, your regular expression matches any single character listed inside the brackets. It is sort of like an overpowered Or condition.
For example, [abc] extracts “a”. If you were to use str_extract_all() instead, [abc] extracts every matching character in turn: “a”, “a”, “a”, “b”, “c”.
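A short sketch of the bracket behavior on the same example string (s is a name chosen here for “aaabc123d”). Each character that belongs to the set is returned as its own match, so repeated letters show up repeatedly:

```r
library(stringr)

s <- "aaabc123d"

# First character that is an a, b, or c
set_first <- str_extract(s, "[abc]")
set_first
## [1] "a"

# Every character that is an a, b, or c, one match per character
set_all <- str_extract_all(s, "[abc]")[[1]]
set_all
## [1] "a" "a" "a" "b" "c"
```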
?
Adding a ? after a letter or a group means that unit is optional: it matches whether the unit appears zero times or once.
For example, (ab)c? extracts “abc” and (ab)d? extracts “ab”.
*
* works like ?, except that it matches zero or more occurrences of that unit instead of at most one.
For example, ab*bc extracts “abc” and aa*bc extracts “aaabc”, because the match starts at the first “a” and a* then takes as many “a”s as it can.
+
+ works like *, except that it requires the unit to appear at least once.
For example, a+bc extracts “aaabc”.
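The three quantifiers can be compared side by side on the example string. This is a minimal sketch (s is a name chosen here for “aaabc123d”); keep in mind that the match starts at the leftmost possible position and that * and + are greedy, taking as many repetitions as they can:

```r
library(stringr)

s <- "aaabc123d"

# ? : the unit is optional (zero or one)
q_c <- str_extract(s, "(ab)c?")
q_c
## [1] "abc"
q_d <- str_extract(s, "(ab)d?")
q_d
## [1] "ab"

# * : zero or more of the unit
star_b <- str_extract(s, "ab*bc")
star_b
## [1] "abc"
star_a <- str_extract(s, "aa*bc")
star_a
## [1] "aaabc"

# + : one or more of the unit
plus_a <- str_extract(s, "a+bc")
plus_a
## [1] "aaabc"
```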
.
. matches any single character.
For example, a.c extracts “abc” and a..c extracts “aabc”.
\d
\d matches any digit. Note that the regex itself is \d, but inside an R string the \ must be escaped with another \, so you actually need to write \\d.
For example, .\\d+ extracts “c123”.
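Here is a small sketch of the wildcard and digit patterns on the example string (s is a name chosen here for “aaabc123d”). The double backslash is only needed because the pattern is written as an R string:

```r
library(stringr)

s <- "aaabc123d"

# . stands for exactly one character of any kind
dot_one <- str_extract(s, "a.c")
dot_one
## [1] "abc"
dot_two <- str_extract(s, "a..c")
dot_two
## [1] "aabc"

# \\d in an R string is the regex \d: a single digit
digit_one <- str_extract(s, "\\d")
digit_one
## [1] "1"

# \\d+ is a run of one or more digits
digit_run <- str_extract(s, "\\d+")
digit_run
## [1] "123"

# Any one character followed by a run of digits
char_then_digits <- str_extract(s, ".\\d+")
char_then_digits
## [1] "c123"
```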
^
^ in a regular expression means the pattern must be at the start of the string.
For example, if we use str_extract_all() with a. we will get “aa”, “ab”. But if we use ^a. instead, we will only get “aa”.
$
$ in a regular expression means the pattern must be at the end of the string.
For example, if we use str_extract_all() with .. we will get “aa”, “ab”, “c1”, “23”. But if we use ..$ instead, we will only get “3d”.
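Both anchors can be tried on the example string. This is a minimal sketch (s is a name chosen here for “aaabc123d”); the string has nine characters, so the unanchored .. leaves the final “d” unmatched, while ..$ is forced to end at the last character:

```r
library(stringr)

s <- "aaabc123d"

# Without an anchor: every non-overlapping "a" plus one character
a_any <- str_extract_all(s, "a.")[[1]]
a_any
## [1] "aa" "ab"

# With ^: only a match at the very start of the string
a_start <- str_extract_all(s, "^a.")[[1]]
a_start
## [1] "aa"

# Without an anchor: non-overlapping pairs of characters from the left
pairs_any <- str_extract_all(s, "..")[[1]]
pairs_any
## [1] "aa" "ab" "c1" "23"

# With $: only the pair that ends the string
pairs_end <- str_extract_all(s, "..$")[[1]]
pairs_end
## [1] "3d"
```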
[A-Z], [a-z], and [A-Za-z]
These regular expressions mean “any capital letter A through Z”, “any lowercase letter a through z”, and “any letter, upper or lower case”.
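To illustrate the three ranges, here is a small sketch on a made-up string in the style of the county column (x is a hypothetical example value). Spaces, commas, and digits fall outside all three ranges, so they end a run of letters:

```r
library(stringr)

# Hypothetical example value
x <- "Cook county, IL"

# Every capital letter
caps <- str_extract_all(x, "[A-Z]")[[1]]
caps
## [1] "C" "I" "L"

# First run of lowercase letters
lower_run <- str_extract(x, "[a-z]+")
lower_run
## [1] "ook"

# First run of letters of either case
letter_run <- str_extract(x, "[A-Za-z]+")
letter_run
## [1] "Cook"
```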
Regex and strings in general take a lot of practice, so now we are going to use the data set for some practice! Using our data frame data, answer the following questions:
View data
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political Scien… 3.63
## 6 Yes 27 Racine County, WI VANESSA Psychology 3.88
## 7 Yes 31 <NA> SGGSAC Sociology 3.68
## 8 Yes 26 Dekalb, GA VANESSA Anthropology 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00
## 10 Yes 24 York County, South Caroli… SGGSAC History 3.87
## # … with 1,966 more rows
Solution:
data%>%
mutate(county=str_replace(county,regex(",? .+"),""))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange SGGSAC Anthropology 3.81
## 2 Yes 23 Franklin SGGSAC Economics 3.23
## 3 Yes 23 Franklin SGGSAC Economics 3.34
## 4 Maybe 24 Cook SGGSAC Sociology 3.58
## 5 Yes 23 Knoxville SGGSAC Political Science 3.63
## 6 Yes 27 Racine VANESSA Psychology 3.88
## 7 Yes 31 <NA> SGGSAC Sociology 3.68
## 8 Yes 26 Dekalb VANESSA Anthropology 3.41
## 9 Yes 35 Hyde SGGSAC Economics 4.00
## 10 Yes 24 York SGGSAC History 3.87
## # … with 1,966 more rows
Solution:
data%>%mutate(state=str_replace(county,regex(".+, "),""))
## # A tibble: 1,976 x 7
## workshop age county social cohort gpa state
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81 CA
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23 OH
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34 OH
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58 IL
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political S… 3.63 Tennesse
## 6 Yes 27 Racine County, WI VANESSA Psychology 3.88 WI
## 7 Yes 31 <NA> SGGSAC Sociology 3.68 <NA>
## 8 Yes 26 Dekalb, GA VANESSA Anthropology 3.41 GA
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00 IL
## 10 Yes 24 York County, South… SGGSAC History 3.87 South Car…
## # … with 1,966 more rows
Solution:
data%>%
mutate(cohort=str_extract(cohort,regex("^...")))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Ant 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Eco 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Eco 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Soc 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Pol 3.63
## 6 Yes 27 Racine County, WI VANESSA Psy 3.88
## 7 Yes 31 <NA> SGGSAC Soc 3.68
## 8 Yes 26 Dekalb, GA VANESSA Ant 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Eco 4.00
## 10 Yes 24 York County, South Carolina SGGSAC His 3.87
## # … with 1,966 more rows
Solution:
data%>%
mutate(cohort=ifelse(str_detect(cohort,regex("ology$")),
NA,
str_extract(cohort,regex(".....$"))))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC <NA> 3.81
## 2 Yes 23 Franklin county, OH SGGSAC omics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC omics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC <NA> 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC ience 3.63
## 6 Yes 27 Racine County, WI VANESSA <NA> 3.88
## 7 Yes 31 <NA> SGGSAC <NA> 3.68
## 8 Yes 26 Dekalb, GA VANESSA <NA> 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC omics 4.00
## 10 Yes 24 York County, South Carolina SGGSAC story 3.87
## # … with 1,966 more rows
Solution:
data%>%
mutate(cohort=str_extract(cohort,regex("[A-Za-z]+([A-Za-z]+)?/?;?([A-Za-z]+)?")))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58
## 5 Yes 23 Knoxville, Tennesse SGGSAC Political 3.63
## 6 Yes 27 Racine County, WI VANESSA Psychology 3.88
## 7 Yes 31 <NA> SGGSAC Sociology 3.68
## 8 Yes 26 Dekalb, GA VANESSA Anthropology 3.41
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00
## 10 Yes 24 York County, South Carolina SGGSAC History 3.87
## # … with 1,966 more rows
data%>%
mutate(cohort1=str_extract(cohort,regex("[A-Za-z]+( [A-Za-z]+)?/?")),
cohort2=str_extract(cohort,regex("(/|;)([A-Za-z]+)?")))
## # A tibble: 1,976 x 8
## workshop age county social cohort gpa cohort1 cohort2
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yes 24 Orange County… SGGSAC Anthropo… 3.81 Anthropol… <NA>
## 2 Yes 23 Franklin coun… SGGSAC Economics 3.23 Economics <NA>
## 3 Yes 23 Franklin coun… SGGSAC Economics 3.34 Economics <NA>
## 4 Maybe 24 Cook county, … SGGSAC Sociology 3.58 Sociology <NA>
## 5 Yes 23 Knoxville, Te… SGGSAC Politica… 3.63 Political… <NA>
## 6 Yes 27 Racine County… VANESSA Psycholo… 3.88 Psychology <NA>
## 7 Yes 31 <NA> SGGSAC Sociology 3.68 Sociology <NA>
## 8 Yes 26 Dekalb, GA VANESSA Anthropo… 3.41 Anthropol… <NA>
## 9 Yes 35 Hyde Park, IL SGGSAC Economics 4.00 Economics <NA>
## 10 Yes 24 York County, … SGGSAC History 3.87 History <NA>
## # … with 1,966 more rows