Click here to return to the menu

Click here to go to the slides for more examples

Download the slides here

In data science for the social sciences, text data is especially important and needs to be handled with care. In R, there are many packages you can use for working with text data. In this workshop, we will go over the stringr package, which is included in the tidyverse. We will also talk about regular expressions (regex), which are a common way to work with strings across most programming languages.

stringr

There are many functions that start with str_ in the stringr package. In this workshop, we are going to go over the ones I use most often. The str_ functions, in general, take their arguments in the form str_...(string, pattern, ...): the string to work on comes first, followed by the pattern to look for.

For a more complete look at the functions, either look at the cheat sheet linked at the end of these workshop notes or Google it. For this workshop, we will continue using the data from the tidyverse workshop.

library(tidyverse)
data <- read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")

str_detect()

The first commonly used str_ function is str_detect(). As its name suggests, this function detects a certain pattern in your string. str_detect() searches a string and returns TRUE if the specified pattern is present in the string and FALSE otherwise. For example, to find which strings in the cohort column contain the pattern “ology”:

data$cohort%>%
  str_detect(pattern="ology")%>%
  head(10)
##  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE
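To see the TRUE/FALSE behavior without loading the workshop data, here is a minimal sketch on a small made-up vector. The vector cohorts and its values are assumptions for illustration only, not the real data.

```r
library(stringr)

# Hypothetical cohort names, made up for illustration
cohorts <- c("Anthropology", "Economics", "Sociology", "History")

# One TRUE/FALSE per element: does the element contain "ology"?
has_ology <- str_detect(cohorts, pattern = "ology")
has_ology
# TRUE FALSE TRUE FALSE

# Because TRUE counts as 1, sum() gives the number of matching strings
n_ology <- sum(has_ology)

# The logical vector can also be used to subset the original strings
ology_only <- cohorts[has_ology]
```

Since the result is a plain logical vector, it works anywhere R expects TRUE/FALSE values.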

This also means that we can combine it with mutate() to create a column that indicates to us whether the desired pattern is present in the string:

data%>%
  mutate(ology = as.numeric(str_detect(cohort, 
                                       pattern="ology")))
## # A tibble: 1,976 x 7
##    workshop   age county                 social  cohort           gpa ology
##    <chr>    <dbl> <chr>                  <chr>   <chr>          <dbl> <dbl>
##  1 Yes         24 Orange County, CA      SGGSAC  Anthropology    3.81     1
##  2 Yes         23 Franklin county, OH    SGGSAC  Economics       3.23     0
##  3 Yes         23 Franklin county, OH    SGGSAC  Economics       3.34     0
##  4 Maybe       24 Cook county, IL        SGGSAC  Sociology       3.58     1
##  5 Yes         23 Knoxville, Tennesse    SGGSAC  Political Sci…  3.63     0
##  6 Yes         27 Racine County, WI      VANESSA Psychology      3.88     1
##  7 Yes         31 <NA>                   SGGSAC  Sociology       3.68     1
##  8 Yes         26 Dekalb, GA             VANESSA Anthropology    3.41     1
##  9 Yes         35 Hyde Park, IL          SGGSAC  Economics       4.00     0
## 10 Yes         24 York County, South Ca… SGGSAC  History         3.87     0
## # … with 1,966 more rows
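The same idea also works with filter() when you want to keep only the matching rows instead of flagging them. The sketch below uses a tiny made-up tibble (toy, with invented values) so that it runs on its own.

```r
library(dplyr)
library(stringr)

# Hypothetical mini data set, made up for illustration
toy <- tibble(
  cohort = c("Anthropology", "Economics", "Sociology"),
  gpa    = c(3.8, 3.2, 3.6)
)

# Flag rows whose cohort contains "ology" (1 = yes, 0 = no)
flagged <- toy %>%
  mutate(ology = as.numeric(str_detect(cohort, pattern = "ology")))

# Keep only the rows whose cohort contains "ology"
only_ology <- toy %>%
  filter(str_detect(cohort, pattern = "ology"))
```

Here flagged keeps all three rows and adds a 0/1 column, while only_ology keeps just the two matching rows.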

str_extract() and str_extract_all()

Most str_ functions are fairly straightforward in terms of naming. In this case, str_extract() extracts the first match of the desired pattern from the string. str_extract_all() extracts every match of the pattern in the string.

You might be thinking, if I just want a pattern, why wouldn’t I just mutate() a new column with the pattern I want?

The answer to that question is that the pattern can actually be quite vague. We will discuss this in the Regular Expressions section. For now, let’s continue our last example and extract the pattern “ology”.

data%>%
  mutate(ology = str_extract(cohort,
                             pattern="ology"))
## # A tibble: 1,976 x 7
##    workshop   age county                 social  cohort           gpa ology
##    <chr>    <dbl> <chr>                  <chr>   <chr>          <dbl> <chr>
##  1 Yes         24 Orange County, CA      SGGSAC  Anthropology    3.81 ology
##  2 Yes         23 Franklin county, OH    SGGSAC  Economics       3.23 <NA> 
##  3 Yes         23 Franklin county, OH    SGGSAC  Economics       3.34 <NA> 
##  4 Maybe       24 Cook county, IL        SGGSAC  Sociology       3.58 ology
##  5 Yes         23 Knoxville, Tennesse    SGGSAC  Political Sci…  3.63 <NA> 
##  6 Yes         27 Racine County, WI      VANESSA Psychology      3.88 ology
##  7 Yes         31 <NA>                   SGGSAC  Sociology       3.68 ology
##  8 Yes         26 Dekalb, GA             VANESSA Anthropology    3.41 ology
##  9 Yes         35 Hyde Park, IL          SGGSAC  Economics       4.00 <NA> 
## 10 Yes         24 York County, South Ca… SGGSAC  History         3.87 <NA> 
## # … with 1,966 more rows
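The difference between the two functions only shows up when a string contains the pattern more than once. Here is a small sketch on made-up words (the vector x is an assumption, not part of the workshop data).

```r
library(stringr)

# Made-up strings for illustration; "banana" contains "an" twice
x <- c("banana", "apple", "kiwi")

# str_extract() returns one value per string: the first match, or NA
first_match <- str_extract(x, pattern = "an")
first_match
# "an" NA NA

# str_extract_all() returns a list with every match for each string
all_matches <- str_extract_all(x, pattern = "an")

# Number of matches found in each string
n_matches <- lengths(all_matches)
n_matches
# 2 0 0
```

Note that str_extract_all() returns a list rather than a character vector, so it usually needs an extra step before it can be stored in a regular data frame column.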

str_replace() and str_replace_all()

str_replace() simply replaces the pattern you specify with the replacement text you provide. The difference between str_replace() and str_replace_all() is that str_replace() only replaces the pattern the first time it comes up in the string. str_replace_all(), on the other hand, replaces all of the matches.

For example, if we want to replace “ology” with “ologist”, we can use str_replace():

data%>%
  mutate(cohort = str_replace(cohort,
                             pattern="ology",
                             replacement="ologist"))
## # A tibble: 1,976 x 6
##    workshop   age county                     social  cohort             gpa
##    <chr>    <dbl> <chr>                      <chr>   <chr>            <dbl>
##  1 Yes         24 Orange County, CA          SGGSAC  Anthropologist    3.81
##  2 Yes         23 Franklin county, OH        SGGSAC  Economics         3.23
##  3 Yes         23 Franklin county, OH        SGGSAC  Economics         3.34
##  4 Maybe       24 Cook county, IL            SGGSAC  Sociologist       3.58
##  5 Yes         23 Knoxville, Tennesse        SGGSAC  Political Scien…  3.63
##  6 Yes         27 Racine County, WI          VANESSA Psychologist      3.88
##  7 Yes         31 <NA>                       SGGSAC  Sociologist       3.68
##  8 Yes         26 Dekalb, GA                 VANESSA Anthropologist    3.41
##  9 Yes         35 Hyde Park, IL              SGGSAC  Economics         4.00
## 10 Yes         24 York County, South Caroli… SGGSAC  History           3.87
## # … with 1,966 more rows

But if I want to change all the “o”s into “e”s, str_replace() would be insufficient:

data%>%
  mutate(cohort = str_replace(cohort,
                             pattern="o",
                             replacement="e"))
## # A tibble: 1,976 x 6
##    workshop   age county                     social  cohort             gpa
##    <chr>    <dbl> <chr>                      <chr>   <chr>            <dbl>
##  1 Yes         24 Orange County, CA          SGGSAC  Anthrepology      3.81
##  2 Yes         23 Franklin county, OH        SGGSAC  Ecenomics         3.23
##  3 Yes         23 Franklin county, OH        SGGSAC  Ecenomics         3.34
##  4 Maybe       24 Cook county, IL            SGGSAC  Seciology         3.58
##  5 Yes         23 Knoxville, Tennesse        SGGSAC  Pelitical Scien…  3.63
##  6 Yes         27 Racine County, WI          VANESSA Psychelogy        3.88
##  7 Yes         31 <NA>                       SGGSAC  Seciology         3.68
##  8 Yes         26 Dekalb, GA                 VANESSA Anthrepology      3.41
##  9 Yes         35 Hyde Park, IL              SGGSAC  Ecenomics         4.00
## 10 Yes         24 York County, South Caroli… SGGSAC  Histery           3.87
## # … with 1,966 more rows

As you can see from above, only the first “o” is replaced by “e”. To correct that, we would use str_replace_all():

data%>%
  mutate(cohort = str_replace_all(cohort,
                             pattern="o",
                             replacement="e"))
## # A tibble: 1,976 x 6
##    workshop   age county                     social  cohort             gpa
##    <chr>    <dbl> <chr>                      <chr>   <chr>            <dbl>
##  1 Yes         24 Orange County, CA          SGGSAC  Anthrepelegy      3.81
##  2 Yes         23 Franklin county, OH        SGGSAC  Ecenemics         3.23
##  3 Yes         23 Franklin county, OH        SGGSAC  Ecenemics         3.34
##  4 Maybe       24 Cook county, IL            SGGSAC  Secielegy         3.58
##  5 Yes         23 Knoxville, Tennesse        SGGSAC  Pelitical Scien…  3.63
##  6 Yes         27 Racine County, WI          VANESSA Psychelegy        3.88
##  7 Yes         31 <NA>                       SGGSAC  Secielegy         3.68
##  8 Yes         26 Dekalb, GA                 VANESSA Anthrepelegy      3.41
##  9 Yes         35 Hyde Park, IL              SGGSAC  Ecenemics         4.00
## 10 Yes         24 York County, South Caroli… SGGSAC  Histery           3.87
## # … with 1,966 more rows

If you want to remove a pattern, you can either use str_remove() and str_remove_all() or you can use str_replace()/str_replace_all() with replacement="".
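As a quick check that the two approaches to removing a pattern agree, here is a minimal sketch on a single made-up string.

```r
library(stringr)

x <- "Sociology"

# Remove only the first "o"
removed_first <- str_remove(x, pattern = "o")
removed_first
# "Sciology"

# Remove every "o"
removed_all <- str_remove_all(x, pattern = "o")
removed_all
# "Scilgy"

# Replacing with an empty string gives the same result
same_as_replace <- identical(
  removed_all,
  str_replace_all(x, pattern = "o", replacement = "")
)
```

Which one you use is mostly a matter of readability; str_remove() states the intent a little more directly.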

str_to_...()

This is the family of functions that lets you convert your string into a certain case. There are 4 forms: upper, lower, title, and sentence. upper and lower change the string into the respective cases. title capitalizes the first letter of each word. sentence capitalizes only the first letter of the string.

data$county[2]%>%
  str_to_upper()
## [1] "FRANKLIN COUNTY, OH"
data$county[2]%>%
  str_to_lower()
## [1] "franklin county, oh"
data$county[2]%>%
  str_to_title()
## [1] "Franklin County, Oh"
data$county[2]%>%
  str_to_sentence()
## [1] "Franklin county, oh"
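One place these functions come in handy is when the same value has been typed with inconsistent capitalization. The sketch below uses made-up county strings (an assumption for illustration) to show how converting everything to lower case makes such duplicates line up.

```r
library(stringr)

# Made-up entries: the last two differ only in capitalization
counties <- c("Orange County, CA", "Franklin county, OH", "FRANKLIN COUNTY, OH")

# Without any cleaning, R sees three different values
n_raw <- length(unique(counties))

# After lower-casing, the two Franklin entries become identical
counties_lower <- str_to_lower(counties)
n_clean <- length(unique(counties_lower))
```

Keep in mind that str_to_title() and str_to_sentence() also change state abbreviations such as “OH” into “Oh”, as the output above shows.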

Regular Expressions (Regex)

Regular expressions are a common way to describe string patterns in most programming languages. In the examples, I will speak in terms of what str_extract() will output from the string “aaabc123d”.

Group/Block: ( )

When you want an exact group of characters to be used as a unit in a pattern, put parentheses around it.

For example, (abc) extracts exactly “abc” and (abcd) extracts nothing (NA), because “abcd” never appears in the string.

Or: |

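The Or condition matches either the pattern on its left or the pattern on its right, much like | in a dplyr::filter() condition.

For example, (abc)|(ab)|d extracts “abc”. If you use str_extract_all() instead, (abc)|(ab)|d extracts “abc”, “d”: once “abc” has been matched, there is no separate “ab” left to match, but the “d” at the end still matches.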
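The group and Or patterns can be tried directly on the example string. This is a small sketch; the variable names are only for illustration.

```r
library(stringr)

x <- "aaabc123d"

# A group that exists in the string is extracted as a whole
group_hit <- str_extract(x, "(abc)")
group_hit
# "abc"

# A group that never appears gives NA
group_miss <- str_extract(x, "(abcd)")

# With |, the first match found when reading left to right is returned
or_first <- str_extract(x, "(abc)|(ab)|d")
or_first
# "abc"

# str_extract_all() keeps searching after each match
or_all <- str_extract_all(x, "(abc)|(ab)|d")[[1]]
or_all
# "abc" "d"
```

Matches never overlap: the characters used up by “abc” are not available again for “ab”.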

Any in: [ ]

When you use brackets, your regular expression matches any single character listed inside the brackets. Sort of like an overpowered Or condition.

For example, [abc] extracts “a”. If you were to use str_extract_all() instead, [abc] extracts “a”, “a”, “a”, “b”, “c”, one match for every character in the string that is an a, b, or c.
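Here is a short sketch of the bracket pattern on the example string, showing that each match is a single character.

```r
library(stringr)

x <- "aaabc123d"

# First character that is an a, b, or c
any_first <- str_extract(x, "[abc]")
any_first
# "a"

# Every character that is an a, b, or c, one at a time
any_all <- str_extract_all(x, "[abc]")[[1]]
any_all
# "a" "a" "a" "b" "c"

# The characters in the brackets do not have to be letters
b_or_1 <- str_extract_all(x, "[b1]")[[1]]
b_or_1
# "b" "1"
```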

Zero or one: ?

Adding a ? after a letter or a block means that unit may appear either zero times or one time.

For example, (ab)c? extracts “abc” and (ab)d? extracts “ab”.

Zero or more: *

* works the same as ? except it also allows the unit to appear more than once.

For example, ab*bc extracts “abc” and aa*bc extracts “aaabc”.

One or more: +

+ works the same as * except it requires the unit to appear at least once.

For example, a+bc extracts “aaabc”.
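The three quantifiers can be compared side by side on the example string. This is a minimal sketch; the variable names are only for illustration.

```r
library(stringr)

x <- "aaabc123d"

# ? : the "c" after "ab" is optional and is taken when it is there
opt_c <- str_extract(x, "(ab)c?")
opt_c
# "abc"

# ? : there is no "d" right after "ab", so only "ab" is matched
opt_d <- str_extract(x, "(ab)d?")
opt_d
# "ab"

# * : zero or more "b" after "a", followed by "bc"
star_b <- str_extract(x, "ab*bc")
star_b
# "abc"

# * : "a" followed by zero or more "a", then "bc"
star_a <- str_extract(x, "aa*bc")
star_a
# "aaabc"

# + : one or more "a", then "bc"
plus_a <- str_extract(x, "a+bc")
plus_a
# "aaabc"
```

Quantifiers are greedy by default, meaning they grab as many repetitions as they can, which is why all three leading a’s are included in the last two results.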

Anything: .

. lets you match any single character.

For example, a.c extracts “abc” and a..c extracts “aabc”.

Any number: \d

\d lets you match any single digit. Note that the regex itself is \d, but inside an R string the \ must itself be escaped with another \, so you actually need to write \\d.

For example, .\\d+ extracts “c123”
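Here is a short sketch of the . and \\d patterns on the example string, including the doubled backslash that R strings require.

```r
library(stringr)

x <- "aaabc123d"

# . stands for exactly one character of any kind
dot_one <- str_extract(x, "a.c")
dot_one
# "abc"

dot_two <- str_extract(x, "a..c")
dot_two
# "aabc"

# \\d matches a single digit
one_digit <- str_extract(x, "\\d")
one_digit
# "1"

# \\d+ matches a run of one or more digits
digit_run <- str_extract(x, "\\d+")
digit_run
# "123"

# Any one character followed by a run of digits
char_then_digits <- str_extract(x, ".\\d+")
char_then_digits
# "c123"
```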

Start of string: ^

^ in regular expression means the pattern needs to be at the start of the string.

For example, if we use str_extract_all() with a. we will get “aa”, “ab”. But if we use ^a. instead, we will only get “aa”.

End of string: $

$ in regular expression means the pattern needs to be at the end of the string.

For example, if we use str_extract_all() with .. we will get “aa”, “ab”, “c1”, “23”. But if we use ..$ instead, we will only get “3d”.
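The effect of the two anchors can be checked directly. The sketch below reuses the example string and adds one made-up vector (cohorts) to show a typical use of $.

```r
library(stringr)

x <- "aaabc123d"

# Without an anchor, every non-overlapping match is returned
a_any <- str_extract_all(x, "a.")[[1]]
a_any
# "aa" "ab"

# ^ only allows a match at the very start of the string
a_start <- str_extract_all(x, "^a.")[[1]]
a_start
# "aa"

# Two characters at a time; the leftover "d" has no partner
pairs_any <- str_extract_all(x, "..")[[1]]
pairs_any
# "aa" "ab" "c1" "23"

# $ only allows a match at the very end of the string
pairs_end <- str_extract_all(x, "..$")[[1]]
pairs_end
# "3d"

# Made-up cohort names: which ones END with "ology"?
cohorts <- c("Sociology", "Economics")
ends_ology <- str_detect(cohorts, "ology$")
ends_ology
# TRUE FALSE
```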

[A-Z], [a-z], and [A-Za-z]

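These regular expressions mean “any capital letter A through Z”, “any lower case letter a through z”, and “any letter, upper or lower case”. Digits, spaces, and punctuation are not matched by any of them.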
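Here is a small sketch of the letter ranges. The string place is made up for illustration and only mimics the style of the county column.

```r
library(stringr)

x <- "aaabc123d"
place <- "Hyde Park, IL"  # made-up example value

# A run of lower case letters stops at the first digit
lower_run <- str_extract(x, "[a-z]+")
lower_run
# "aaabc"

# Every run of letters in the string; the digits split it in two
letter_runs <- str_extract_all(x, "[A-Za-z]+")[[1]]
letter_runs
# "aaabc" "d"

# First run of capital letters
first_caps <- str_extract(place, "[A-Z]+")
first_caps
# "H"

# Run of capital letters at the end of the string
end_caps <- str_extract(place, "[A-Z]+$")
end_caps
# "IL"

# First run of letters of either case, which stops at the space
first_word <- str_extract(place, "[A-Za-z]+")
first_word
# "Hyde"
```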

Examples

Regex and strings in general take a lot of practice, so now we are going to use the data set to get some practice! Using our data frame data, answer the following questions:

View data

## # A tibble: 1,976 x 6
##    workshop   age county                     social  cohort             gpa
##    <chr>    <dbl> <chr>                      <chr>   <chr>            <dbl>
##  1 Yes         24 Orange County, CA          SGGSAC  Anthropology      3.81
##  2 Yes         23 Franklin county, OH        SGGSAC  Economics         3.23
##  3 Yes         23 Franklin county, OH        SGGSAC  Economics         3.34
##  4 Maybe       24 Cook county, IL            SGGSAC  Sociology         3.58
##  5 Yes         23 Knoxville, Tennesse        SGGSAC  Political Scien…  3.63
##  6 Yes         27 Racine County, WI          VANESSA Psychology        3.88
##  7 Yes         31 <NA>                       SGGSAC  Sociology         3.68
##  8 Yes         26 Dekalb, GA                 VANESSA Anthropology      3.41
##  9 Yes         35 Hyde Park, IL              SGGSAC  Economics         4.00
## 10 Yes         24 York County, South Caroli… SGGSAC  History           3.87
## # … with 1,966 more rows



  1. If I want to clean the county column so that it only has the name of the county, without the state or the word “county”, what do I do?

Solution:

data%>%
  mutate(county=str_replace(county,regex(",? .+"),""))
## # A tibble: 1,976 x 6
##    workshop   age county    social  cohort              gpa
##    <chr>    <dbl> <chr>     <chr>   <chr>             <dbl>
##  1 Yes         24 Orange    SGGSAC  Anthropology       3.81
##  2 Yes         23 Franklin  SGGSAC  Economics          3.23
##  3 Yes         23 Franklin  SGGSAC  Economics          3.34
##  4 Maybe       24 Cook      SGGSAC  Sociology          3.58
##  5 Yes         23 Knoxville SGGSAC  Political Science  3.63
##  6 Yes         27 Racine    VANESSA Psychology         3.88
##  7 Yes         31 <NA>      SGGSAC  Sociology          3.68
##  8 Yes         26 Dekalb    VANESSA Anthropology       3.41
##  9 Yes         35 Hyde      SGGSAC  Economics          4.00
## 10 Yes         24 York      SGGSAC  History            3.87
## # … with 1,966 more rows



  2. If I only want the states and not the counties, what do I do?

Solution:

data%>%mutate(state=str_replace(county,regex(".+, "),""))
## # A tibble: 1,976 x 7
##    workshop   age county              social  cohort         gpa state     
##    <chr>    <dbl> <chr>               <chr>   <chr>        <dbl> <chr>     
##  1 Yes         24 Orange County, CA   SGGSAC  Anthropology  3.81 CA        
##  2 Yes         23 Franklin county, OH SGGSAC  Economics     3.23 OH        
##  3 Yes         23 Franklin county, OH SGGSAC  Economics     3.34 OH        
##  4 Maybe       24 Cook county, IL     SGGSAC  Sociology     3.58 IL        
##  5 Yes         23 Knoxville, Tennesse SGGSAC  Political S…  3.63 Tennesse  
##  6 Yes         27 Racine County, WI   VANESSA Psychology    3.88 WI        
##  7 Yes         31 <NA>                SGGSAC  Sociology     3.68 <NA>      
##  8 Yes         26 Dekalb, GA          VANESSA Anthropology  3.41 GA        
##  9 Yes         35 Hyde Park, IL       SGGSAC  Economics     4.00 IL        
## 10 Yes         24 York County, South… SGGSAC  History       3.87 South Car…
## # … with 1,966 more rows



  3. If I just want the first 3 letters of each cohort name instead of the full cohort name, what do I do?

Solution:

data%>%
  mutate(cohort=str_extract(cohort,regex("^...")))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort   gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>  <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  Ant     3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  Eco     3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  Eco     3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  Soc     3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  Pol     3.63
##  6 Yes         27 Racine County, WI           VANESSA Psy     3.88
##  7 Yes         31 <NA>                        SGGSAC  Soc     3.68
##  8 Yes         26 Dekalb, GA                  VANESSA Ant     3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  Eco     4.00
## 10 Yes         24 York County, South Carolina SGGSAC  His     3.87
## # … with 1,966 more rows
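As an aside, when the goal is a fixed number of characters rather than a pattern, stringr’s str_sub() is another option. The sketch below uses made-up cohort names and compares it with the regex approach; it is offered as an alternative, not as the intended solution.

```r
library(stringr)

# Made-up cohort names; the last one is shorter than 3 characters
cohorts <- c("Anthropology", "Economics", "Ec")

# Regex approach: exactly three characters from the start of the string
by_regex <- str_extract(cohorts, "^...")
by_regex
# "Ant" "Eco" NA

# Position approach: characters 1 through 3
by_position <- str_sub(cohorts, start = 1, end = 3)
by_position
# "Ant" "Eco" "Ec"
```

The two agree whenever the string has at least three characters. For shorter strings, the regex finds no match and returns NA, while str_sub() returns whatever characters are there.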



  4. If I want the 5-letter cohort endings of the cohorts that do not end with “ology”, what do I do?

Solution:

data%>%
  mutate(cohort=ifelse(str_detect(cohort,regex("ology$")),
                       NA,
                       str_extract(cohort,regex(".....$"))))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort   gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>  <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  <NA>    3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  omics   3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  omics   3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  <NA>    3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  ience   3.63
##  6 Yes         27 Racine County, WI           VANESSA <NA>    3.88
##  7 Yes         31 <NA>                        SGGSAC  <NA>    3.68
##  8 Yes         26 Dekalb, GA                  VANESSA <NA>    3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  omics   4.00
## 10 Yes         24 York County, South Carolina SGGSAC  story   3.87
## # … with 1,966 more rows



  5. How do I make sure that every cohort name only contains the first word of its discipline?

Solution:

data%>%
  mutate(cohort=str_extract(cohort,regex("[A-Za-z]+([A-Za-z]+)?/?;?([A-Za-z]+)?")))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort         gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>        <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  Anthropology  3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  Economics     3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  Economics     3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  Sociology     3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  Political     3.63
##  6 Yes         27 Racine County, WI           VANESSA Psychology    3.88
##  7 Yes         31 <NA>                        SGGSAC  Sociology     3.68
##  8 Yes         26 Dekalb, GA                  VANESSA Anthropology  3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  Economics     4.00
## 10 Yes         24 York County, South Carolina SGGSAC  History       3.87
## # … with 1,966 more rows
data%>%
  mutate(cohort1=str_extract(cohort,regex("[A-Za-z]+( [A-Za-z]+)?/?")),
         cohort2=str_extract(cohort,regex("(/|;)([A-Za-z]+)?")))
## # A tibble: 1,976 x 8
##    workshop   age county         social  cohort      gpa cohort1    cohort2
##    <chr>    <dbl> <chr>          <chr>   <chr>     <dbl> <chr>      <chr>  
##  1 Yes         24 Orange County… SGGSAC  Anthropo…  3.81 Anthropol… <NA>   
##  2 Yes         23 Franklin coun… SGGSAC  Economics  3.23 Economics  <NA>   
##  3 Yes         23 Franklin coun… SGGSAC  Economics  3.34 Economics  <NA>   
##  4 Maybe       24 Cook county, … SGGSAC  Sociology  3.58 Sociology  <NA>   
##  5 Yes         23 Knoxville, Te… SGGSAC  Politica…  3.63 Political… <NA>   
##  6 Yes         27 Racine County… VANESSA Psycholo…  3.88 Psychology <NA>   
##  7 Yes         31 <NA>           SGGSAC  Sociology  3.68 Sociology  <NA>   
##  8 Yes         26 Dekalb, GA     VANESSA Anthropo…  3.41 Anthropol… <NA>   
##  9 Yes         35 Hyde Park, IL  SGGSAC  Economics  4.00 Economics  <NA>   
## 10 Yes         24 York County, … SGGSAC  History    3.87 History    <NA>   
## # … with 1,966 more rows

Click here to continue to the next workshop: Introduction to RMarkdown automation

Click here to continue to stringr cheat sheet
