Principles of Data Visualization in R Workshop: Introduction to tidyverse

Workshop 4

What is `tidyverse`

R is a powerful language, and much of that power comes from its packages. One of the most versatile is tidyverse, which is really a collection of packages that share a common design, and we will be focusing on functions from tidyverse for the rest of this series. If you haven’t already, install tidyverse from CRAN using the code install.packages("tidyverse") and load it into your session using either library(tidyverse) or require(tidyverse). For this workshop, we will use a subset of the data set from the ggplot2 workshop with a newly created variable gpa.

require(tidyverse)
data<-read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")

Pipe operator `%>%`

One of the most powerful tools that tidyverse brings us is the pipe operator %>%. If you are familiar with the shell, you already know the idea from its | operator. If you have never heard of the shell, this is going to blow your mind.

Sometimes you need to apply multiple functions to a data set. Say you have a data set called data and 4 functions, f1, f2, f3, and f4, and you need the output from feeding data through each function sequentially. Intuitively, the first thing you would do is:

data1<-f1(data)
data2<-f2(data1)
data3<-f3(data2)
data_final<-f4(data3)

This may seem simple to do, but it gets tedious once you need more than 10 functions. To save our eyes and brains a little, programmers invented the pipe operator %>%. The pipe operator takes the output of one function and pipes it directly into the next function as its first argument. Hence the above scenario can be rewritten as:

data_final<-
  data %>% 
  f1() %>% 
  f2() %>% 
  f3() %>% 
  f4()
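To make the comparison concrete, here is a minimal runnable sketch. The functions f1 to f4 and the vector values are hypothetical stand-ins invented for illustration; any functions that take the data as their first argument would behave the same way.

```r
library(dplyr)  # dplyr re-exports %>%, as does library(tidyverse)

# Hypothetical stand-ins for f1 ... f4 (not part of the workshop data)
f1 <- function(x) x + 1
f2 <- function(x) x * 2
f3 <- function(x) sqrt(x)
f4 <- function(x) round(x, 2)

values <- c(1, 7, 17)  # toy input in place of a real data set

# Step by step, storing every intermediate result
step1 <- f1(values)
step2 <- f2(step1)
step3 <- f3(step2)
step_final <- f4(step3)

# The same chain written with pipes
piped_final <- values %>%
  f1() %>%
  f2() %>%
  f3() %>%
  f4()

# Both approaches give the same answer
identical(step_final, piped_final)
```

The piped version needs no intermediate names, so inserting or removing a step only changes one line.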

The pipe operator allows us to write much cleaner, more readable code. One of the most important and universal best practices in coding is to break your code across lines when you need to, and pipes allow us to do that easily.

Personally, I put each function in its own line so I can easily change some of the arguments in those functions.

For the rest of this workshop, I want to focus on the packages dplyr and tidyr, both of which are part of tidyverse. This means you only have to load tidyverse and these packages will automatically be included.

dplyr

To get a more complete view of the dplyr package, see its official documentation.

`filter(df , condition)`

filter() is a literal filter: it keeps only the rows that fit a certain condition. The filter function takes 2 arguments:

Data frame
Truth condition

You should place the data you want to use first and the truth condition second. Note that the truth condition can combine several comparisons. To make your code clear, always wrap each chunk of the condition in parentheses. For example, the code filter(data,((a==4&b==2)&(a==2|b==3))|(a==0|b==0)) will give me the rows that fit either (a==4&b==2) & (a==2|b==3) or (a==0|b==0). Making your filter condition clear is extremely important for readability.

For example, we can select people who answered Maybe to the workshop question in the survey:

filter(data,
       workshop=="Maybe")

## # A tibble: 240 x 6
##    workshop   age county                  social cohort           gpa
##    <chr>    <dbl> <chr>                   <chr>  <chr>          <dbl>
##  1 Maybe       24 Cook county, IL         SGGSAC Sociology       3.58
##  2 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.79
##  3 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.69
##  4 Maybe       24 Cook county, IL         SGGSAC Sociology       3.49
##  5 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.52
##  6 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.40
##  7 Maybe       28 Orange county           SUCSAC Economics       3.83
##  8 Maybe       33 Hillsborough County, FL SGGSAC Psych/poli sci  3.54
##  9 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.10
## 10 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.67
## # … with 230 more rows
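A compound condition with parentheses can be sketched in the same way. The tibble toy below is made up for illustration and only mimics a few of the workshop columns; it is not the real survey data.

```r
library(dplyr)

# Made-up rows that only mimic the workshop columns
toy <- tibble(
  workshop = c("Yes", "Maybe", "Maybe", "Yes", "Maybe"),
  age      = c(24, 25, 33, 27, 28),
  gpa      = c(3.81, 3.79, 3.54, 3.88, 3.10)
)

# Keep "Maybe" respondents who are either younger than 26
# or have a gpa above 3.5; each chunk sits in its own parentheses
maybe_young_or_high <- toy %>%
  filter((workshop == "Maybe") & ((age < 26) | (gpa > 3.5)))

maybe_young_or_high
```

Note that filter() also drops rows for which the condition evaluates to NA, so missing values in age or gpa would silently disappear from a result like this.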

`%in%`

Sometimes you want filter() to match a column against a whole vector of values, as if they were joined by the or (|) condition. In that case, you can use %in% followed by the vector you want to match. For example, we can select people who think the social committee’s name is either SUCSAC or VANESSA, first with | and then with %in%:

filter(data,
       social == "SUCSAC" | social == "VANESSA")

## # A tibble: 459 x 6
##    workshop   age county            social  cohort         gpa
##    <chr>    <dbl> <chr>             <chr>   <chr>        <dbl>
##  1 Yes         27 Racine County, WI VANESSA Psychology    3.88
##  2 Yes         26 Dekalb, GA        VANESSA Anthropology  3.41
##  3 Yes         NA Arlington, VA     VANESSA Sociology     3.39
##  4 Yes         27 Racine County, WI VANESSA Psychology    3.39
##  5 Yes         NA NY                SUCSAC  Psychology    3.77
##  6 Maybe       28 Orange county     SUCSAC  Economics     3.83
##  7 Yes         35 Cook County, IL   SUCSAC  Economics     3.44
##  8 Yes         35 Cook County, IL   SUCSAC  Economics     3.63
##  9 Yes         27 Racine County, WI VANESSA Psychology    3.45
## 10 Yes         NA NY                SUCSAC  Psychology    3.83
## # … with 449 more rows

filter(data,
       social %in% c("SUCSAC","VANESSA"))

## # A tibble: 459 x 6
##    workshop   age county            social  cohort         gpa
##    <chr>    <dbl> <chr>             <chr>   <chr>        <dbl>
##  1 Yes         27 Racine County, WI VANESSA Psychology    3.88
##  2 Yes         26 Dekalb, GA        VANESSA Anthropology  3.41
##  3 Yes         NA Arlington, VA     VANESSA Sociology     3.39
##  4 Yes         27 Racine County, WI VANESSA Psychology    3.39
##  5 Yes         NA NY                SUCSAC  Psychology    3.77
##  6 Maybe       28 Orange county     SUCSAC  Economics     3.83
##  7 Yes         35 Cook County, IL   SUCSAC  Economics     3.44
##  8 Yes         35 Cook County, IL   SUCSAC  Economics     3.63
##  9 Yes         27 Racine County, WI VANESSA Psychology    3.45
## 10 Yes         NA NY                SUCSAC  Psychology    3.83
## # … with 449 more rows
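The equivalence of the two calls above can be checked on a small example. This is only a sketch on a made-up tibble (toy is not the workshop data), and it also shows one way to negate the match.

```r
library(dplyr)

# Made-up answers, including one missing value
toy <- tibble(social = c("SGGSAC", "SUCSAC", "VANESSA", "SGGSAC", NA))

# Chained == comparisons joined by |
with_or <- toy %>%
  filter(social == "SUCSAC" | social == "VANESSA")

# The same selection with %in%
with_in <- toy %>%
  filter(social %in% c("SUCSAC", "VANESSA"))

# Negation: everything that is NOT one of the two names.
# NA %in% c(...) is FALSE, so the NA row is kept here.
not_in <- toy %>%
  filter(!(social %in% c("SUCSAC", "VANESSA")))

identical(with_or$social, with_in$social)
```

With many values to match, the %in% form stays one line long while the | form grows with every extra value.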

`arrange(df , vars...)`

arrange() lets you reorder your observations based on the values of one or more variables. By default, it sorts in ascending order. If you would like descending order, you need to wrap the variable with desc(). arrange() sorts by the variables in the order you give them. For example, arrange(data,age,desc(gpa)) would first arrange the data set from the youngest to the oldest, and then, for people of the same age, from the highest GPA to the lowest GPA.

Character columns are arranged alphabetically, and if your string column is set to a factor, it is arranged by the order of its levels, which is alphabetical by default. The priority of ordering is set by the order in which you write your arguments. arrange() is most useful when you need to see extreme values using head() or need to do sequential operations.

For example, we can arrange the data set by cohort alphabetically and gpa descendingly:

data %>% 
  arrange(as.factor(cohort),
          desc(gpa))

## # A tibble: 1,976 x 6
##    workshop   age county               social cohort         gpa
##    <chr>    <dbl> <chr>                <chr>  <chr>        <dbl>
##  1 Yes         24 San Diego county, CA SGGSAC Anthropology  4.39
##  2 Yes         24 Orange County, CA    SGGSAC Anthropology  4.30
##  3 Maybe       25 Cuyahoga County, OH  SGGSAC Anthropology  4.21
##  4 Yes         24 Orange County, CA    SGGSAC Anthropology  4.20
##  5 Yes         24 San Diego county, CA SGGSAC Anthropology  4.20
##  6 Yes         24 Orange County, CA    SGGSAC Anthropology  4.18
##  7 Yes         24 Orange County, CA    SGGSAC Anthropology  4.17
##  8 Yes         45 Cook countywid       SGGSAC Anthropology  4.16
##  9 Yes         24 Orange County, CA    SGGSAC Anthropology  4.15
## 10 Yes         45 Cook countywid       SGGSAC Anthropology  4.14
## # … with 1,966 more rows
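The two uses mentioned above, ordering by several variables and pulling out extreme values with head(), can be sketched on a made-up tibble (toy is invented for illustration, not taken from the survey):

```r
library(dplyr)

# Made-up rows that only mimic the workshop columns
toy <- tibble(
  cohort = c("Sociology", "Anthropology", "Economics", "Anthropology"),
  age    = c(24, 25, 28, 24),
  gpa    = c(3.58, 3.79, 3.83, 4.21)
)

# Youngest first; within the same age, highest gpa first
by_age_gpa <- toy %>%
  arrange(age, desc(gpa))

# The two highest gpa values: sort descending, then keep the top rows
top2 <- toy %>%
  arrange(desc(gpa)) %>%
  head(2)

by_age_gpa
top2
```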

`mutate(df , vars...)`

mutate() is perhaps one of the most useful functions and probably the function I use the most. This function lets you create a new variable in your data set, either with or without using the original data. To use mutate(), you pass the data frame as its first argument and the variables you want to create as the rest of the arguments. For example, if I want to create one variable that is age $\times$ gpa and another that is gpa $\times$ a random number between 10 and 20, I can write:

data<-data %>%
  mutate(gradeage = gpa*age,
         graderand = gpa*sample(10:20,1,TRUE))

## # A tibble: 3 x 8
##   workshop   age county          social cohort       gpa gradeage graderand
##   <chr>    <dbl> <chr>           <chr>  <chr>      <dbl>    <dbl>     <dbl>
## 1 Yes         24 Orange County,… SGGSAC Anthropol…  3.81     91.4      38.1
## 2 Yes         23 Franklin count… SGGSAC Economics   3.23     74.3      32.3
## 3 Yes         23 Franklin count… SGGSAC Economics   3.34     76.8      33.4
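One detail worth seeing in isolation: sample(10:20,1,TRUE) draws a single number, so every row is multiplied by the same value. The sketch below, on a made-up tibble, contrasts that with drawing one number per row; the column names graderand_one and graderand_each are hypothetical and chosen only for this illustration.

```r
library(dplyr)

set.seed(1)  # only so the random draws are reproducible

# Made-up rows that only mimic the workshop columns
toy <- tibble(
  age = c(24, 23, 23),
  gpa = c(3.81, 3.23, 3.34)
)

toy2 <- toy %>%
  mutate(
    gradeage       = gpa * age,
    # one draw, recycled across all rows
    graderand_one  = gpa * sample(10:20, 1),
    # one draw per row; n() is the number of rows in the current group
    graderand_each = gpa * sample(10:20, n(), replace = TRUE)
  )

toy2
```

Which version is appropriate depends on whether the random factor is meant to be shared by everyone or to vary by observation.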

`summarise(df , vars...)`

summarise() is similar to mutate(), but instead of adding columns it collapses your data frame, so the original rows are not kept. This function takes your data frame as the first argument and creates whatever summary statistics you ask for. Say I want to see the mean, maximum, and standard deviation of the variable gpa; I will write:

data %>%
  summarise(avggpa = mean(gpa),
            maxgpa = max(gpa),
            sigmagpa = sd(gpa))

## # A tibble: 1 x 3
##   avggpa maxgpa sigmagpa
##    <dbl>  <dbl>    <dbl>
## 1   3.50   4.39    0.301

Alternatively, if you want general summary statistics for your data, you can call the function summary().
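Since the age column in the workshop data contains missing values, it may help to see how summarise() treats them. This is a sketch on a made-up tibble, not the real data: without na.rm = TRUE a single NA makes the whole mean NA.

```r
library(dplyr)

# Made-up rows with one missing age
toy <- tibble(
  age = c(24, NA, 27, 35),
  gpa = c(3.81, 3.39, 3.88, 3.44)
)

stats <- toy %>%
  summarise(
    avggpa     = mean(gpa),
    maxgpa     = max(gpa),
    avgage_raw = mean(age),                # NA, because one age is missing
    avgage     = mean(age, na.rm = TRUE),  # mean of the non-missing ages
    n          = n()                       # number of rows summarised
  )

stats
```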

`group_by(df , vars...)`

group_by() is perhaps the most powerful function of the dplyr package. It allows you to do any operation separately for each group defined by one or more variables. The function outputs a data frame that has a grouping structure on it, making all subsequent operations groupwise until the structure is removed. For example, if I want to look at the average age of participants in this workshop by cohort, I simply group the data by cohort and then call summarise() or mutate(). Afterwards, I may want to remove the grouping structure so I can see the difference between the group averages and the overall average. In that case, I simply need to pipe the data into ungroup() before I pipe it again into mutate().

data<-data %>%
  group_by(cohort) %>%
  mutate(muage=mean(age, na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(groupdiff = mean(age, na.rm=TRUE)-muage)

data%>%
  head(6)

## # A tibble: 6 x 10
##   workshop   age county social cohort   gpa gradeage graderand muage
##   <chr>    <dbl> <chr>  <chr>  <chr>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Yes         24 Orang… SGGSAC Anthr…  3.81     91.4      38.1  26.1
## 2 Yes         23 Frank… SGGSAC Econo…  3.23     74.3      32.3  26.9
## 3 Yes         23 Frank… SGGSAC Econo…  3.34     76.8      33.4  26.9
## 4 Maybe       24 Cook … SGGSAC Socio…  3.58     86.0      35.8  25.7
## 5 Yes         23 Knoxv… SGGSAC Polit…  3.63     83.5      36.3  23  
## 6 Yes         27 Racin… VANES… Psych…  3.88    105.       38.8  25.4
## # … with 1 more variable: groupdiff <dbl>
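The difference between a grouped summarise() and a grouped mutate() is easier to see on a few rows. The tibble below is made up for illustration; the column names follow the example above.

```r
library(dplyr)

# Made-up rows: two cohorts, one missing age
toy <- tibble(
  cohort = c("Economics", "Economics", "Sociology", "Sociology", "Sociology"),
  age    = c(23, 35, 24, NA, 28)
)

# summarise() after group_by(): one row per cohort
by_cohort <- toy %>%
  group_by(cohort) %>%
  summarise(muage = mean(age, na.rm = TRUE),
            n     = n())

# mutate() after group_by(): every row kept, group mean attached;
# after ungroup() the second mean() runs over the whole data set
with_diff <- toy %>%
  group_by(cohort) %>%
  mutate(muage = mean(age, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(groupdiff = mean(age, na.rm = TRUE) - muage)

by_cohort
with_diff
```

If ungroup() were left out, the second mean(age) would again be computed within each cohort and groupdiff would be zero in every row.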

Bonus: If you want an operation to happen row by row (you may need to do this when geocoding), you can use a special version of group_by() called rowwise(). rowwise() imposes a grouping structure so that each row is in its own group.
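A small sketch of what rowwise() changes, using made-up columns quiz1 and quiz2 (hypothetical names, not in the workshop data): max() normally looks at whole columns, while under rowwise() it only sees the values of one row at a time.

```r
library(dplyr)

# Made-up scores
toy <- tibble(
  quiz1 = c(3, 9, 5),
  quiz2 = c(7, 2, 5)
)

# Without rowwise(): max() over both full columns, same value in every row
overall <- toy %>%
  mutate(best = max(quiz1, quiz2))

# With rowwise(): max() is evaluated separately for each row
per_row <- toy %>%
  rowwise() %>%
  mutate(best = max(quiz1, quiz2)) %>%
  ungroup()

overall$best
per_row$best
```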

`select(df, cols)`/`rename(df, new_col=old_col)`

select() and rename() are used to tidy up your data. In select(), you put your data as the first argument, and the rest of the arguments are the variables you want to keep. If you also want to rename some of the variables you keep, you just need to write select(data,newvar1=oldvar1,...). On the other hand, if you just want to rename some variables, you only need to write rename(data,newvar1=oldvar1,...). The difference is that select() only gives you back the variables you specified, while rename() gives you back your entire data frame.

If you want to use select() to reorder your data frame, you just need to specify the columns you want first, in order, and call everything() as your last argument. If, instead of selecting the columns you want, you need to drop columns you do not want, you can add a minus sign before the variable, such as select(-col1).

For example, if I want to select workshop, gpa, and age and change “workshop” to “participate”, I can write:

data %>%
  select(participate=workshop,
         gpa, age)

## # A tibble: 1,976 x 3
##    participate   gpa   age
##    <chr>       <dbl> <dbl>
##  1 Yes          3.81    24
##  2 Yes          3.23    23
##  3 Yes          3.34    23
##  4 Maybe        3.58    24
##  5 Yes          3.63    23
##  6 Yes          3.88    27
##  7 Yes          3.68    31
##  8 Yes          3.41    26
##  9 Yes          4.00    35
## 10 Yes          3.87    24
## # … with 1,966 more rows
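The other variants described above (rename(), reordering with everything(), and dropping with a minus sign) can be sketched on a made-up tibble; toy is invented for illustration.

```r
library(dplyr)

# Made-up rows that only mimic the workshop columns
toy <- tibble(
  workshop = c("Yes", "Maybe"),
  age      = c(24, 25),
  county   = c("Cook County, IL", "NY"),
  gpa      = c(3.81, 3.58)
)

# rename(): all columns kept, one gets a new name
renamed <- toy %>%
  rename(participate = workshop)

# select() with everything(): gpa moves to the front, the rest follow
reordered <- toy %>%
  select(gpa, everything())

# select() with a minus sign: drop county, keep the rest
dropped <- toy %>%
  select(-county)

names(renamed)
names(reordered)
names(dropped)
```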

`distinct(df, cols)`

distinct() gives you distinct combinations of the columns you select. For example, say I want the distinct combinations of workshop and cohort, I will write:

data %>%
  distinct(workshop,
           cohort) %>%
  arrange(cohort)

## # A tibble: 10 x 2
##    workshop cohort           
##    <chr>    <chr>            
##  1 Yes      Anthropology     
##  2 Maybe    Anthropology     
##  3 Yes      Economics        
##  4 Maybe    Economics        
##  5 Yes      History          
##  6 Yes      Political Science
##  7 Maybe    Psych/poli sci   
##  8 Yes      Psychology       
##  9 Maybe    Sociology        
## 10 Yes      Sociology
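As a small sketch on made-up rows, the example below shows distinct() on two columns, plus the .keep_all = TRUE argument, which keeps the other columns from the first row of each combination.

```r
library(dplyr)

# Made-up rows with one repeated workshop/cohort combination
toy <- tibble(
  workshop = c("Yes", "Yes", "Maybe", "Yes"),
  cohort   = c("Economics", "Economics", "Economics", "History"),
  gpa      = c(3.2, 3.4, 3.8, 3.9)
)

# Only the two columns, one row per distinct combination
combos <- toy %>%
  distinct(workshop, cohort)

# Same combinations, but the remaining columns are kept
# (values come from the first row of each combination)
first_rows <- toy %>%
  distinct(workshop, cohort, .keep_all = TRUE)

combos
first_rows
```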

tidyr

`separate(df, col, into, sep)`

separate() is used to split the values in a single column into multiple columns. The into argument takes a character vector with the names of the new columns that you want to create from the original column. sep specifies what to use as the separator. For example, in the data we used before, people answered with a county and a state, which should be separated by a comma. We can thus use separate() to split this column into county and state:

data %>% 
  separate(county,
           c("county","state"),
           sep=", ")

## # A tibble: 1,976 x 11
##    workshop   age county state social cohort   gpa gradeage graderand muage
##    <chr>    <dbl> <chr>  <chr> <chr>  <chr>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Yes         24 Orang… CA    SGGSAC Anthr…  3.81     91.4      38.1  26.1
##  2 Yes         23 Frank… OH    SGGSAC Econo…  3.23     74.3      32.3  26.9
##  3 Yes         23 Frank… OH    SGGSAC Econo…  3.34     76.8      33.4  26.9
##  4 Maybe       24 Cook … IL    SGGSAC Socio…  3.58     86.0      35.8  25.7
##  5 Yes         23 Knoxv… Tenn… SGGSAC Polit…  3.63     83.5      36.3  23  
##  6 Yes         27 Racin… WI    VANES… Psych…  3.88    105.       38.8  25.4
##  7 Yes         31 <NA>   <NA>  SGGSAC Socio…  3.68    114.       36.8  25.7
##  8 Yes         26 Dekalb GA    VANES… Anthr…  3.41     88.6      34.1  26.1
##  9 Yes         35 Hyde … IL    SGGSAC Econo…  4.00    140.       40.0  26.9
## 10 Yes         24 York … Sout… SGGSAC Histo…  3.87     92.9      38.7  24  
## # … with 1,966 more rows, and 1 more variable: groupdiff <dbl>
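Some answers in the survey have no comma at all (for example "Orange county"). The sketch below, on made-up values, shows one way to handle such rows: the fill = "right" argument of separate() fills the missing piece with NA on the right without a warning.

```r
library(dplyr)
library(tidyr)

# Made-up answers; the second one has no state
toy <- tibble(
  county = c("Cook county, IL", "Orange county", "Cuyahoga County, OH")
)

split <- toy %>%
  separate(county,
           into = c("county", "state"),
           sep  = ", ",
           fill = "right")  # rows with too few pieces get NA in state

split
```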

`unite(df, new_col, cols, sep)`

unite() is the exact opposite of separate(). It takes multiple columns and concatenates them into one column, using similar arguments. For example, if I want to concatenate workshop and social into a variable called “new” with “_” as my separator, I will write:

data %>% 
  unite("new",
        c(workshop, social),
        sep="_")

## # A tibble: 1,976 x 9
##    new       age county    cohort    gpa gradeage graderand muage groupdiff
##    <chr>   <dbl> <chr>     <chr>   <dbl>    <dbl>     <dbl> <dbl>     <dbl>
##  1 Yes_SG…    24 Orange C… Anthro…  3.81     91.4      38.1  26.1     0.260
##  2 Yes_SG…    23 Franklin… Econom…  3.23     74.3      32.3  26.9    -0.501
##  3 Yes_SG…    23 Franklin… Econom…  3.34     76.8      33.4  26.9    -0.501
##  4 Maybe_…    24 Cook cou… Sociol…  3.58     86.0      35.8  25.7     0.658
##  5 Yes_SG…    23 Knoxvill… Politi…  3.63     83.5      36.3  23       3.37 
##  6 Yes_VA…    27 Racine C… Psycho…  3.88    105.       38.8  25.4     0.944
##  7 Yes_SG…    31 <NA>      Sociol…  3.68    114.       36.8  25.7     0.658
##  8 Yes_VA…    26 Dekalb, … Anthro…  3.41     88.6      34.1  26.1     0.260
##  9 Yes_SG…    35 Hyde Par… Econom…  4.00    140.       40.0  26.9    -0.501
## 10 Yes_SG…    24 York Cou… History  3.87     92.9      38.7  24       2.37 
## # … with 1,966 more rows
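The separator and the fate of the original columns are both adjustable. As a sketch on made-up rows, the example below uses "/" as the separator and then remove = FALSE to keep the input columns alongside the new one.

```r
library(dplyr)
library(tidyr)

# Made-up rows that only mimic the workshop columns
toy <- tibble(
  workshop = c("Yes", "Maybe"),
  social   = c("SGGSAC", "SUCSAC"),
  age      = c(24, 28)
)

# Input columns are dropped by default
united <- toy %>%
  unite("new", c(workshop, social), sep = "/")

# remove = FALSE keeps workshop and social next to the new column
kept <- toy %>%
  unite("new", c(workshop, social), sep = "/", remove = FALSE)

united
kept
```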

`gather()` and `spread()`

We talked in detail about these two functions last time in the workshop Introduction to data. To summarize, gather() is used to gather wide-form data into long-form data. This function takes arguments in the following form: gather(key="string",value="string",columns). spread() is used to spread long-form data into wide-form data. This function takes arguments in the following form: spread(key="string",value="string"), with no column list, because key and value name the existing columns that hold the new column names and their values.
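A round trip on a tiny made-up table may help fix the idea. The tibble wide and its columns pop_2019 and pop_2020 are hypothetical and used only for this sketch.

```r
library(dplyr)
library(tidyr)

# Made-up wide-form data: one column per year
wide <- tibble(
  city     = c("A", "B"),
  pop_2019 = c(10, 20),
  pop_2020 = c(11, 22)
)

# Wide to long: the two year columns become key/value pairs
long <- wide %>%
  gather(key = "year", value = "pop", pop_2019, pop_2020)

# Long to wide: the values of year become column names again
back <- long %>%
  spread(key = "year", value = "pop")

long
back
```

The tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work, which is why this sketch keeps them.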

Try it yourself!

Question 1

For this question, we will be using the same data set as the first question in the ggplot2 workshop. You can load the data with readr::read_csv() from the path https://willythewoo.github.io/WillyTheWoo/workshop/data/acs_sample.csv.

As you can see, the column city currently contains both the city name and the state, and sometimes two or more states. Using the tools you have learned today, create a column named metro that contains only the city name and a column named states that contains all the states the metropolitan area is in. This includes deleting the original state column. Call this new data frame d11.

After looking at the data set d11, you realize that the variable female was coded 1 for male and 2 for female, and the city-level variable was created by averaging this code over everyone in the city. Hence this variable does not represent the percentage of females in the city. Correct this data error in d11 so the variable represents the percentage of females. Call this new data frame d12.

The median of hhincome in this data set is $58,000. Filter the data set d12 so you only have cities that are on the upper half of the household income distribution. Call this new data frame d13.

You now realize that you lost half of your data in d13, which is not a good thing. Instead, you decide to create an indicator variable from d12 that lets you conditionally do operations on the data. Create a new variable called upper that takes the value 1 if the city’s household income is greater than $58,000 and 0 otherwise. Save this new data frame as d14. (Hint: You can use the function ifelse())

Do the same thing as in the previous part, but with age 37 and data d14. Name the new variable up_age.

Now you can look at statistics in each combination of age and income bracket. Find the average percentage of females in each combination of age and income bracket by year using summarize(). Sort this data set by population in descending order.

Going back to ggplot2, plot the linear relationship between population and female, faceted by the age and income brackets in a 2 $\times$ 2 figure.

Find out how many cities are in each age and income bracket combination by year. (Hint: Use the function n() inside summarize()).

Question 2

This problem will teach you how to create a simulated data set under the generalized Roy model framework. It is also good practice for understanding tidyverse. The three functions in the first part draw from the normal, binomial, and uniform distributions, respectively.

Using functions like rnorm(), rbinom(), and runif(), create a data frame with variables $X$, $D$, and $E$ such that $X\sim U[-2,2]$, $D\sim Binom(0.3)$, and $E\sim N(0,1)$ with 1,000 observations. Call this data frame d21. In rbinom(), set the argument size to 1.

Using d21, create $Y0$ and $Y1$ such that $Y0=X+E$ and $Y1=X+0.3+E$. Then create a new variable $Y$ such that $Y=Y1$ if $D=1$ and $Y=Y0$ if $D=0$.

Create a new variable called ID that uniquely identifies each observation.

Run the following code to create a data set called id_gpa with ID numbers and gpa.

id_gpa<-data.frame(ID=1:1000,
                   gpa=rnorm(1000,3.5,0.15))

Now run ?left_join to read up on the documentation of this function. Merge id_gpa into d21 so you have the gpa of all the observations in d21.

Using ggplot, graph the relationship between Y and gpa using a scatter plot. Facet the figure by D.
