Workshop 5

Dealing with texts

In data science for the social sciences, text data is especially important and needs to be handled with care. In R, there are many packages for working with text data. In this workshop, we will go over the stringr package, which is included in the tidyverse. We will also talk about regular expressions (regex), a common way to describe string patterns across most programming languages.

stringr

There are many functions that start with str_ in the stringr package. In this workshop, we are going to go over the ones I use most often. In general, the str_ functions take the string first and the pattern second, in the form str_...(string, pattern, ...).

For a more complete look at the functions, either look at the cheat sheet linked at the end of these workshop notes or Google them. For this workshop, we will continue using the data from the tidyverse workshop.

require(tidyverse)
data<-read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")

str_detect()

The first commonly used str_ function is str_detect(). As its name suggests, this function detects a certain pattern in your string. str_detect() searches a string and returns TRUE if the specified pattern is present in the string and FALSE otherwise. For example, we can find the strings in the cohort column that contain the pattern “ology”:

data$cohort%>%
  str_detect(pattern="ology")%>%
  head(6)
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE
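Because str_detect() returns a logical vector, it can also be used for counting. The sketch below uses a small made-up vector (cohorts is hypothetical, not the workshop data) so that it runs on its own:

```r
library(stringr)

# Hypothetical cohort names, made up for illustration
cohorts <- c("Anthropology", "Economics", "Sociology", "History")

# One TRUE/FALSE per element
has_ology <- str_detect(cohorts, pattern = "ology")
has_ology
## [1]  TRUE FALSE  TRUE FALSE

# TRUE counts as 1, so sum() counts the matches and mean() gives their share
n_ology <- sum(has_ology)
share_ology <- mean(has_ology)
```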

This also means that we can combine it with mutate() to create a column that tells us whether the desired pattern is present in the string:

data%>%
  mutate(ology = as.numeric(str_detect(cohort, 
                                       pattern="ology")))%>%
  head(6)
## # A tibble: 6 x 7
##   workshop   age county              social  cohort              gpa ology
##   <chr>    <dbl> <chr>               <chr>   <chr>             <dbl> <dbl>
## 1 Yes         24 Orange County, CA   SGGSAC  Anthropology       3.81     1
## 2 Yes         23 Franklin county, OH SGGSAC  Economics          3.23     0
## 3 Yes         23 Franklin county, OH SGGSAC  Economics          3.34     0
## 4 Maybe       24 Cook county, IL     SGGSAC  Sociology          3.58     1
## 5 Yes         23 Knoxville, Tennesse SGGSAC  Political Science  3.63     0
## 6 Yes         27 Racine County, WI   VANESSA Psychology         3.88     1

str_extract() and str_extract_all()

Most str_ functions are fairly straightforward in terms of naming. In this case, str_extract() extracts the first match of the pattern from the string. str_extract_all() extracts every match of the pattern in the string.

You might be thinking, if I just want a pattern, why wouldn’t I just mutate() a new column with the pattern I want?

The answer to that question is that the pattern can actually be quite vague. We will discuss this in the section on regular expressions. For now, let’s continue our last example and extract the pattern “ology”.

data%>%
  mutate(ology = str_extract(cohort,
                             pattern="ology"))%>%
  head(6)
## # A tibble: 6 x 7
##   workshop   age county              social  cohort              gpa ology
##   <chr>    <dbl> <chr>               <chr>   <chr>             <dbl> <chr>
## 1 Yes         24 Orange County, CA   SGGSAC  Anthropology       3.81 ology
## 2 Yes         23 Franklin county, OH SGGSAC  Economics          3.23 <NA> 
## 3 Yes         23 Franklin county, OH SGGSAC  Economics          3.34 <NA> 
## 4 Maybe       24 Cook county, IL     SGGSAC  Sociology          3.58 ology
## 5 Yes         23 Knoxville, Tennesse SGGSAC  Political Science  3.63 <NA> 
## 6 Yes         27 Racine County, WI   VANESSA Psychology         3.88 ology
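To see the difference between the two functions, here is a minimal sketch on a made-up string (x is hypothetical) where the pattern appears twice. Note that str_extract_all() returns a list with one element per input string:

```r
library(stringr)

# Hypothetical string containing the pattern twice
x <- "Sociology and Psychology"

# Only the first match, returned as a character vector
first_match <- str_extract(x, pattern = "ology")
first_match
## [1] "ology"

# Every match, returned as a list (one element per input string)
all_matches <- str_extract_all(x, pattern = "ology")
all_matches[[1]]
## [1] "ology" "ology"

# No match gives NA for str_extract()
no_match <- str_extract("Economics", pattern = "ology")
```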

str_replace() and str_replace_all()

str_replace() simply replaces the pattern you specify with the replacement text you give it. The difference between str_replace() and str_replace_all() is that str_replace() only replaces the pattern the first time it comes up in the string. str_replace_all(), on the other hand, replaces every match of the pattern.

For example, if we want to replace “ology” with “ologist”, we can use str_replace():

data%>%
  mutate(cohort = str_replace(cohort,
                             pattern="ology",
                             replacement="ologist"))%>%
  head(6)
## # A tibble: 6 x 6
##   workshop   age county              social  cohort              gpa
##   <chr>    <dbl> <chr>               <chr>   <chr>             <dbl>
## 1 Yes         24 Orange County, CA   SGGSAC  Anthropologist     3.81
## 2 Yes         23 Franklin county, OH SGGSAC  Economics          3.23
## 3 Yes         23 Franklin county, OH SGGSAC  Economics          3.34
## 4 Maybe       24 Cook county, IL     SGGSAC  Sociologist        3.58
## 5 Yes         23 Knoxville, Tennesse SGGSAC  Political Science  3.63
## 6 Yes         27 Racine County, WI   VANESSA Psychologist       3.88

But if I just want to change all the “o”’s into “e”’s, str_replace() would be insufficient:

data%>%
  mutate(cohort = str_replace(cohort,
                             pattern="o",
                             replacement="e"))%>%
  head(6)
## # A tibble: 6 x 6
##   workshop   age county              social  cohort              gpa
##   <chr>    <dbl> <chr>               <chr>   <chr>             <dbl>
## 1 Yes         24 Orange County, CA   SGGSAC  Anthrepology       3.81
## 2 Yes         23 Franklin county, OH SGGSAC  Ecenomics          3.23
## 3 Yes         23 Franklin county, OH SGGSAC  Ecenomics          3.34
## 4 Maybe       24 Cook county, IL     SGGSAC  Seciology          3.58
## 5 Yes         23 Knoxville, Tennesse SGGSAC  Pelitical Science  3.63
## 6 Yes         27 Racine County, WI   VANESSA Psychelogy         3.88

As you can see from above, only the first “o” is replaced by “e”. To correct that, we would use str_replace_all():

data%>%
  mutate(cohort = str_replace_all(cohort,
                             pattern="o",
                             replacement="e"))%>%
  head(6)
## # A tibble: 6 x 6
##   workshop   age county              social  cohort              gpa
##   <chr>    <dbl> <chr>               <chr>   <chr>             <dbl>
## 1 Yes         24 Orange County, CA   SGGSAC  Anthrepelegy       3.81
## 2 Yes         23 Franklin county, OH SGGSAC  Ecenemics          3.23
## 3 Yes         23 Franklin county, OH SGGSAC  Ecenemics          3.34
## 4 Maybe       24 Cook county, IL     SGGSAC  Secielegy          3.58
## 5 Yes         23 Knoxville, Tennesse SGGSAC  Pelitical Science  3.63
## 6 Yes         27 Racine County, WI   VANESSA Psychelegy         3.88

If you want to remove a pattern, you can either use str_remove() and str_remove_all() or you can use str_replace()/str_replace_all() with replacement="".
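As a quick check of that equivalence, here is a small sketch on a single made-up string (x is chosen only for illustration):

```r
library(stringr)

x <- "Sociology"

# Remove only the first "o"
removed_first <- str_remove(x, "o")
removed_first
## [1] "Sciology"

# Remove every "o"
removed_all <- str_remove_all(x, "o")
removed_all
## [1] "Scilgy"

# The same results using an empty replacement
replaced_first <- str_replace(x, pattern = "o", replacement = "")
replaced_all <- str_replace_all(x, pattern = "o", replacement = "")
```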

str_to_...()

This is the family of functions that lets you convert your string to a certain case. There are 4 forms: upper, lower, title, and sentence. upper and lower change the string into the respective cases. title capitalizes the first letter of each word. sentence capitalizes the first letter of the sentence.

data$county[2]%>%
  str_to_upper()
## [1] "FRANKLIN COUNTY, OH"
data$county[2]%>%
  str_to_lower()
## [1] "franklin county, oh"

data$county[2]%>%
  str_to_title()
## [1] "Franklin County, Oh"
data$county[2]%>%
  str_to_sentence()
## [1] "Franklin county, oh"
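One place where this helps is matching text that is capitalized inconsistently, such as “County” and “county” in the county column. The sketch below copies two values from the output above into a small vector (counties) so that it runs without the full data set:

```r
library(stringr)

# Two values as they appear in the county column
counties <- c("Orange County, CA", "Franklin county, OH")

# Patterns are case sensitive, so "County" misses the lower-case spelling
case_sensitive <- str_detect(counties, "County")
case_sensitive
## [1]  TRUE FALSE

# Lower-casing first makes the match independent of capitalization
case_insensitive <- str_detect(str_to_lower(counties), "county")
case_insensitive
## [1] TRUE TRUE
```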

Regular Expressions (Regex)

Regular expressions are a common way to describe string patterns, and they work in much the same way in most programming languages. In the examples, I will speak in terms of what str_extract() will output from the string “aaabc123d”.

Group/Block: ( )

When you want an exact group of characters to be treated as one unit in a pattern, put parentheses around it.

For example, (abc) extracts exactly “abc” and (abcd) extracts nothing.

str_extract("aaabc123d",regex("(abc)"))
## [1] "abc"
str_extract_all("aaabc123d",regex("(abcd)"))
## [[1]]
## character(0)

Or: |

The Or condition | lets you match any one of several alternatives, much like | does inside a dplyr::filter() condition.

str_extract("aaabc123d",regex("(abc)|(ab)|d"))
## [1] "abc"
str_extract_all("aaabc123d",regex("(abc)|(ab)|d"))
## [[1]]
## [1] "abc" "d"
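The order of the alternatives can matter. As far as I know, the regex engine behind stringr tries the alternatives from left to right at each position and keeps the first one that works, so putting the shorter alternative first changes the result. A minimal sketch on the same string:

```r
library(stringr)

x <- "aaabc123d"

# "ab" is tried first and succeeds, so "abc" is never attempted
short_first <- str_extract(x, regex("(ab)|(abc)"))
short_first
## [1] "ab"

# With the longer alternative first, the full "abc" is matched
long_first <- str_extract(x, regex("(abc)|(ab)"))
long_first
## [1] "abc"
```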

Any in: [ ]

When you use brackets, your regular expression matches any single character listed inside the brackets. Sort of like an overpowered Or condition.

str_extract("aaabc123d",regex("[abc]"))
## [1] "a"
str_extract_all("aaabc123d",regex("[abc]"))
## [[1]]
## [1] "a" "a" "a" "b" "c"

Zero or one: ?

Adding a ? after a letter or a block means that unit is optional: the pattern matches whether the unit appears once or not at all.

str_extract("aaabc123d",regex("(ab)c?"))
## [1] "abc"
str_extract_all("aaabc123d",regex("(ab)d?"))
## [[1]]
## [1] "ab"

Zero or more: *

* works the same as ? except that the unit may also appear more than once.

str_extract("aaabc123d",regex("ab*bc"))
## [1] "abc"
str_extract_all("aaabc123d",regex("aa*bc"))
## [[1]]
## [1] "aaabc"

One or more: +

+ works the same as * except that it requires the unit to appear at least once.

str_extract("aaabc123d",regex("a+bc"))
## [1] "aaabc"
str_extract_all("aaabc123d",regex("a+bc"))
## [[1]]
## [1] "aaabc"
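To compare the three quantifiers side by side, the sketch below applies each of them to the same unit on the string “aaabc123d”. The second half uses “z”, a letter that never appears in the string, to show what happens when the unit is missing:

```r
library(stringr)

x <- "aaabc123d"

# The unit "a" is present before "b"
zero_or_one  <- str_extract(x, regex("a?b"))   # at most one "a": "ab"
zero_or_more <- str_extract(x, regex("a*b"))   # as many as possible: "aaab"
one_or_more  <- str_extract(x, regex("a+b"))   # at least one: "aaab"

# The unit "z" is absent
z_optional <- str_extract(x, regex("z?b"))     # zero "z" is fine: "b"
z_any      <- str_extract(x, regex("z*b"))     # zero "z" is fine: "b"
z_required <- str_extract(x, regex("z+b"))     # needs a "z": NA
```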

Anything: .

. lets you match any single character (by default, anything except a line break).

str_extract("aaabc123d",regex("a.c"))
## [1] "abc"
str_extract_all("aaabc123d",regex("a..c"))
## [[1]]
## [1] "aabc"

Any number: \d

\d lets you match any digit. Note that the regex itself is \d, but inside an R string the backslash has to be escaped with another backslash, so you actually need to write \\d.

str_extract("aaabc123d",regex(".\\d+"))
## [1] "c123"
str_extract_all("aaabc123d",regex(".\\d"))
## [[1]]
## [1] "c1" "23"
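If the double backslash feels confusing, it can help to look at what the regex engine actually receives. This is a small sketch: writeLines() prints the string as it is stored, and the raw string syntax r"(...)" assumes R 4.0.0 or later.

```r
library(stringr)

# Typed with two backslashes in R source code
pattern <- "\\d+"

# What the regex engine sees: a single backslash
writeLines(pattern)
## \d+

# The stored string has 3 characters: \, d, +
n_chars <- nchar(pattern)

# Raw strings (assumes R >= 4.0.0) avoid the extra backslash
raw_pattern <- r"(\d+)"

digits <- str_extract("aaabc123d", regex(pattern))
digits
## [1] "123"
```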

Start of string: ^

^ in a regular expression means the pattern needs to be at the start of the string.

str_extract_all("aaabc123d",regex("a."))
## [[1]]
## [1] "aa" "ab"
str_extract_all("aaabc123d",regex("^a."))
## [[1]]
## [1] "aa"

End of string: $

$ in a regular expression means the pattern needs to be at the end of the string.

For example, if we use str_extract_all() with .. we will get “aa”, “ab”, “c1”, “23”. But if we use ..$ instead, we will only get “3d”.

str_extract_all("aaabc123d",regex(".."))
## [[1]]
## [1] "aa" "ab" "c1" "23"
str_extract_all("aaabc123d",regex("..$"))
## [[1]]
## [1] "3d"
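The two anchors are also handy with str_detect() for “starts with” and “ends with” checks. A minimal sketch on a small vector of cohort names taken from the data shown earlier:

```r
library(stringr)

cohorts <- c("Sociology", "Political Science", "Economics")

# Contains a capital S anywhere
any_s <- str_detect(cohorts, regex("S"))
any_s
## [1]  TRUE  TRUE FALSE

# Starts with a capital S: the S in "Science" no longer counts
starts_s <- str_detect(cohorts, regex("^S"))
starts_s
## [1]  TRUE FALSE FALSE

# Ends with a lower-case s
ends_s <- str_detect(cohorts, regex("s$"))
ends_s
## [1] FALSE FALSE  TRUE
```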

[A-Z], [a-z], and [A-Za-z]

These regular expressions mean “any capital letter A through Z”, “any lower-case letter a through z”, and “any letter, upper or lower case”.
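As a short illustration, the sketch below applies the three ranges to one value from the county column (copied into x so the code runs on its own):

```r
library(stringr)

x <- "Orange County, CA"

# Each capital letter on its own
capitals <- str_extract_all(x, regex("[A-Z]"))[[1]]
capitals
## [1] "O" "C" "C" "A"

# Runs of lower-case letters
lowers <- str_extract_all(x, regex("[a-z]+"))[[1]]
lowers
## [1] "range" "ounty"

# Runs of letters in either case, which gives the words
words <- str_extract_all(x, regex("[A-Za-z]+"))[[1]]
words
## [1] "Orange" "County" "CA"
```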

Examples

Regex and strings in general take a lot of practice, so now we are going to use the data set to get some practice!

1

If I want to clean the county column so that it only has the name of the county and not the state or the word county, what do I do?

data%>%
  mutate(county=str_replace(county,regex(",? .+"),""))
## # A tibble: 1,976 x 6
##    workshop   age county    social  cohort              gpa
##    <chr>    <dbl> <chr>     <chr>   <chr>             <dbl>
##  1 Yes         24 Orange    SGGSAC  Anthropology       3.81
##  2 Yes         23 Franklin  SGGSAC  Economics          3.23
##  3 Yes         23 Franklin  SGGSAC  Economics          3.34
##  4 Maybe       24 Cook      SGGSAC  Sociology          3.58
##  5 Yes         23 Knoxville SGGSAC  Political Science  3.63
##  6 Yes         27 Racine    VANESSA Psychology         3.88
##  7 Yes         31 <NA>      SGGSAC  Sociology          3.68
##  8 Yes         26 Dekalb    VANESSA Anthropology       3.41
##  9 Yes         35 Hyde      SGGSAC  Economics          4.00
## 10 Yes         24 York      SGGSAC  History            3.87
## # … with 1,966 more rows
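Notice in row 9 that “Hyde Park, IL” was cut down to “Hyde”, because the pattern removes everything from the first space onward. If multi-word names should be kept, one possible alternative (a sketch, not the only way) is to remove the state first and then the word “county”. It uses a small vector copied from the data, and regex(ignore_case = TRUE) to catch both “County” and “county”:

```r
library(stringr)

counties <- c("Orange County, CA", "Franklin county, OH",
              "Hyde Park, IL", "York County, South Carolina", NA)

# Step 1: drop the comma and everything after it (the state)
no_state <- str_remove(counties, regex(",.*$"))

# Step 2: drop a trailing " county", whatever its capitalization
names_only <- str_remove(no_state, regex(" county$", ignore_case = TRUE))
names_only
## [1] "Orange"    "Franklin"  "Hyde Park" "York"      NA
```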

2

If I only want the states and not the counties, what do I do?

data%>%
  mutate(county=str_extract(county,regex("[A-Za-z]+$")))
## # A tibble: 1,976 x 6
##    workshop   age county   social  cohort              gpa
##    <chr>    <dbl> <chr>    <chr>   <chr>             <dbl>
##  1 Yes         24 CA       SGGSAC  Anthropology       3.81
##  2 Yes         23 OH       SGGSAC  Economics          3.23
##  3 Yes         23 OH       SGGSAC  Economics          3.34
##  4 Maybe       24 IL       SGGSAC  Sociology          3.58
##  5 Yes         23 Tennesse SGGSAC  Political Science  3.63
##  6 Yes         27 WI       VANESSA Psychology         3.88
##  7 Yes         31 <NA>     SGGSAC  Sociology          3.68
##  8 Yes         26 GA       VANESSA Anthropology       3.41
##  9 Yes         35 IL       SGGSAC  Economics          4.00
## 10 Yes         24 Carolina SGGSAC  History            3.87
## # … with 1,966 more rows
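Row 10 shows a limitation of this pattern: “South Carolina” becomes “Carolina”, because [A-Za-z] does not include the space. One possible alternative, assuming every value separates the state with a comma and a space, is to remove everything up to and including that separator. The vector below is copied from the data so the sketch runs on its own:

```r
library(stringr)

counties <- c("Orange County, CA", "Knoxville, Tennesse",
              "York County, South Carolina", NA)

# Remove everything from the start through ", ", leaving only the state.
# A value without ", " would be returned unchanged.
states <- str_remove(counties, regex("^.*, "))
states
## [1] "CA"             "Tennesse"       "South Carolina" NA
```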

3

If I just want the first 3 letters of each cohort name instead of the full cohort name, what do I do?

data%>%
  mutate(cohort=str_extract(cohort,regex("^...")))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort   gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>  <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  Ant     3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  Eco     3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  Eco     3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  Soc     3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  Pol     3.63
##  6 Yes         27 Racine County, WI           VANESSA Psy     3.88
##  7 Yes         31 <NA>                        SGGSAC  Soc     3.68
##  8 Yes         26 Dekalb, GA                  VANESSA Ant     3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  Eco     4.00
## 10 Yes         24 York County, South Carolina SGGSAC  His     3.87
## # … with 1,966 more rows

4

If I want the 5-letter cohort endings of the cohorts that do not end with “ology”, what do I do?

data%>%
  mutate(cohort=ifelse(str_detect(cohort,"ology"),
                       NA,
                       str_extract(cohort,regex(".....$"))))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort   gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>  <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  <NA>    3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  omics   3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  omics   3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  <NA>    3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  ience   3.63
##  6 Yes         27 Racine County, WI           VANESSA <NA>    3.88
##  7 Yes         31 <NA>                        SGGSAC  <NA>    3.68
##  8 Yes         26 Dekalb, GA                  VANESSA <NA>    3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  omics   4.00
## 10 Yes         24 York County, South Carolina SGGSAC  story   3.87
## # … with 1,966 more rows
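One detail to keep in mind: str_detect(cohort, "ology") is TRUE when “ology” appears anywhere in the name, not only at the end. Adding $ makes the check match the wording of the question exactly. The sketch below uses a made-up vector, where “Geology Studies” is a hypothetical value invented to show the difference:

```r
library(stringr)

# "Geology Studies" is hypothetical and not taken from the workshop data
cohorts <- c("Anthropology", "Economics", "History", "Geology Studies")

# Unanchored: "Geology Studies" is treated as an "ology" cohort
unanchored <- ifelse(str_detect(cohorts, "ology"),
                     NA_character_,
                     str_extract(cohorts, regex(".....$")))
unanchored
## [1] NA      "omics" "story" NA

# Anchored with $: only names that actually end in "ology" become NA
anchored <- ifelse(str_detect(cohorts, regex("ology$")),
                   NA_character_,
                   str_extract(cohorts, regex(".....$")))
anchored
## [1] NA      "omics" "story" "udies"
```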

5

How do I make sure that every cohort name only has the first word?

data%>%
  mutate(cohort=str_replace(cohort,regex(" .+"),""))
## # A tibble: 1,976 x 6
##    workshop   age county                      social  cohort         gpa
##    <chr>    <dbl> <chr>                       <chr>   <chr>        <dbl>
##  1 Yes         24 Orange County, CA           SGGSAC  Anthropology  3.81
##  2 Yes         23 Franklin county, OH         SGGSAC  Economics     3.23
##  3 Yes         23 Franklin county, OH         SGGSAC  Economics     3.34
##  4 Maybe       24 Cook county, IL             SGGSAC  Sociology     3.58
##  5 Yes         23 Knoxville, Tennesse         SGGSAC  Political     3.63
##  6 Yes         27 Racine County, WI           VANESSA Psychology    3.88
##  7 Yes         31 <NA>                        SGGSAC  Sociology     3.68
##  8 Yes         26 Dekalb, GA                  VANESSA Anthropology  3.41
##  9 Yes         35 Hyde Park, IL               SGGSAC  Economics     4.00
## 10 Yes         24 York County, South Carolina SGGSAC  History       3.87
## # … with 1,966 more rows