R is a powerful language, and much of that power comes from its packages. One of the most versatile is tidyverse, and we will focus on functions from it for the rest of this series. If you haven’t already, install tidyverse from CRAN with install.packages("tidyverse") and load it into your session with either library(tidyverse) or require(tidyverse). For this workshop, we will use a subset of the data set from the ggplot2 workshop, with a newly created variable gpa.

require(tidyverse)
data<-read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")

Pipe operator %>%

One of the most powerful tools that tidyverse brings us is the pipe operator %>%. If you are familiar with the shell, you already know the idea. If you have never heard of the shell, this is going to blow your mind.

Sometimes you will need to apply multiple functions to a data set. Say you have a data set called data and four functions, f1, f2, f3, and f4, and you need the output from feeding data through each function in sequence. Intuitively, the first thing you would do is:

data1<-f1(data)
data2<-f2(data1)
data3<-f3(data2)
data_final<-f4(data3)

This may seem simple, but it gets tedious when you need to chain more than 10 functions, and it fills your workspace with intermediate objects. To save our eyes and brains a little, programmers invented the pipe operator %>%. The pipe takes the output of one function and passes it to the next function as its first argument. Hence the above scenario can be rewritten as:

data_final<-
  data %>% 
  f1() %>% 
  f2() %>% 
  f3() %>% 
  f4()

The pipe operator lets us write much cleaner, more readable code. One of the most important and universal best practices in coding is to break your code across lines when you need to, and pipes allow us to do that easily. Personally, I put each function on its own line so I can easily change its arguments.
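As a concrete version of the f1() to f4() chain, here is a minimal sketch that uses three base R functions on a made-up vector. The vector x and the choice of functions are assumptions for illustration only and are not part of the workshop data.

```r
library(dplyr)  # dplyr makes the %>% operator available

# Made-up input vector for illustration
x <- c(1, 4, 9, 16)

# Nested calls: you have to read them from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: you read them from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same result
identical(nested, piped)
```

Notice that round(1) in the pipe means round(<previous result>, 1): the piped value always fills the first argument, and whatever you write inside the parentheses fills the rest.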

For the rest of this workshop, I want to focus on using the packages dplyr and tidyr from the parent package tidyverse. Note that this means you only have to load tidyverse and these packages will automatically be included.

dplyr

For a more complete view of the dplyr package, see the dplyr documentation.

filter(df , condition)

filter() is a literal filter that keeps only the rows that meet a certain condition. The filter function takes 2 arguments.

  1. Data frame
  2. Truth condition

Place the data you want to use first and the truth condition second. Note that the truth condition can combine multiple conditions with & (and) and | (or). To make your code clear, always wrap each chunk of the condition in parentheses. For example, the code filter(data,((a==4&b==2)&(a==2|b==3))|(a==0|b==0)) will give me data that fits either (a==4&b==2) & (a==2|b==3) or (a==0|b==0). Making your filter condition clear is extremely important for readability.

For example, we can select people who answered Maybe to the workshop question in the survey and have a GPA above 3.5:

filter(data,
       workshop=="Maybe" & gpa>3.5)
## # A tibble: 128 x 6
##    workshop   age county                  social cohort           gpa
##    <chr>    <dbl> <chr>                   <chr>  <chr>          <dbl>
##  1 Maybe       24 Cook county, IL         SGGSAC Sociology       3.58
##  2 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.79
##  3 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.69
##  4 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.52
##  5 Maybe       28 Orange county           SUCSAC Economics       3.83
##  6 Maybe       33 Hillsborough County, FL SGGSAC Psych/poli sci  3.54
##  7 Maybe       25 Cuyahoga County, OH     SGGSAC Anthropology    3.67
##  8 Maybe       24 Cook county, IL         SGGSAC Sociology       3.96
##  9 Maybe       33 Hillsborough County, FL SGGSAC Psych/poli sci  4.15
## 10 Maybe       24 Cook county, IL         SGGSAC Sociology       3.52
## # … with 118 more rows
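If you want to try combined conditions without downloading the workshop data, here is a small sketch on a made-up data frame. The data frame toy and its columns a and b are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Made-up data frame for illustration
toy <- tibble(
  a = c(4, 2, 0, 1),
  b = c(2, 3, 5, 0)
)

# Keep rows where (a is 4 AND b is 2) OR (either a or b is 0)
either <- toy %>%
  filter((a == 4 & b == 2) | (a == 0 | b == 0))

# Separating conditions with commas works like joining them with &
with_comma <- toy %>% filter(a > 0, b > 0)
with_amp   <- toy %>% filter(a > 0 & b > 0)

either
with_comma
```

One thing to keep in mind is that filter() drops rows where the condition evaluates to NA, so rows with missing values in the tested columns silently disappear unless you handle them with is.na().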

%in%

Sometimes you have a vector of values and want filter() to keep the rows that match any of them, which would otherwise take a chain of or (|) conditions. In that case, you can use %in% with the vector you want to match. For example, we can select people who think the social committee’s name is either SUCSAC or VANESSA:

filter(data,
       social %in% c("SUCSAC","VANESSA"))
## # A tibble: 459 x 6
##    workshop   age county            social  cohort         gpa
##    <chr>    <dbl> <chr>             <chr>   <chr>        <dbl>
##  1 Yes         27 Racine County, WI VANESSA Psychology    3.88
##  2 Yes         26 Dekalb, GA        VANESSA Anthropology  3.41
##  3 Yes         NA Arlington, VA     VANESSA Sociology     3.39
##  4 Yes         27 Racine County, WI VANESSA Psychology    3.39
##  5 Yes         NA NY                SUCSAC  Psychology    3.77
##  6 Maybe       28 Orange county     SUCSAC  Economics     3.83
##  7 Yes         35 Cook County, IL   SUCSAC  Economics     3.44
##  8 Yes         35 Cook County, IL   SUCSAC  Economics     3.63
##  9 Yes         27 Racine County, WI VANESSA Psychology    3.45
## 10 Yes         NA NY                SUCSAC  Psychology    3.83
## # … with 449 more rows
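To convince yourself that %in% is shorthand for a chain of | conditions, you can compare the two on a toy example. The data frame below is made up for illustration and only borrows the column names from the workshop data.

```r
library(dplyr)

# Made-up data frame for illustration
toy <- tibble(
  social = c("SUCSAC", "VANESSA", "SGGSAC", "SUCSAC"),
  gpa    = c(3.8, 3.4, 3.6, 3.9)
)

# The long way: one == test per value, joined with |
with_or <- toy %>% filter(social == "SUCSAC" | social == "VANESSA")

# The short way: one %in% test against a vector of values
with_in <- toy %>% filter(social %in% c("SUCSAC", "VANESSA"))

# Negate the whole test to keep everything NOT in the vector
not_in <- toy %>% filter(!(social %in% c("SUCSAC", "VANESSA")))

with_in
not_in
```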

arrange(df , vars...)

arrange() lets you reorder your observations based on the values of one or more variables. By default, it sorts each variable in ascending order. If you would like descending order, wrap the variable in desc(). When you give several variables, arrange() sorts by the first one and uses each later variable to break ties. For example, arrange(data,age,desc(gpa)) would first arrange the data set from the youngest to the oldest, and then, for people of the same age, from the highest GPA to the lowest GPA.

String columns are sorted alphabetically, and factor columns are sorted by the order of their levels, which is alphabetical by default when you convert a string column with as.factor(). The priority of ordering is set by the order in which you write your arguments. arrange() is most useful when you need to see extreme values using head() or need to do sequential operations.

For example, we can arrange the data set by cohort alphabetically and gpa descendingly:

data %>% 
  arrange(as.factor(cohort),
          desc(gpa))
## # A tibble: 1,976 x 6
##    workshop   age county               social cohort         gpa
##    <chr>    <dbl> <chr>                <chr>  <chr>        <dbl>
##  1 Yes         24 San Diego county, CA SGGSAC Anthropology  4.39
##  2 Yes         24 Orange County, CA    SGGSAC Anthropology  4.30
##  3 Maybe       25 Cuyahoga County, OH  SGGSAC Anthropology  4.21
##  4 Yes         24 Orange County, CA    SGGSAC Anthropology  4.20
##  5 Yes         24 San Diego county, CA SGGSAC Anthropology  4.20
##  6 Yes         24 Orange County, CA    SGGSAC Anthropology  4.18
##  7 Yes         24 Orange County, CA    SGGSAC Anthropology  4.17
##  8 Yes         45 Cook countywid       SGGSAC Anthropology  4.16
##  9 Yes         24 Orange County, CA    SGGSAC Anthropology  4.15
## 10 Yes         45 Cook countywid       SGGSAC Anthropology  4.14
## # … with 1,966 more rows
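Here is a small sketch of the three ideas above: tie-breaking with a second variable, finding an extreme value with head(), and sorting a factor by its level order. The data frame toy and the custom level order are made up for illustration.

```r
library(dplyr)

# Made-up data frame for illustration
toy <- tibble(
  cohort = c("Sociology", "Economics", "Economics", "Anthropology"),
  gpa    = c(3.5, 3.9, 3.2, 3.7)
)

# cohort ascending (alphabetical), then gpa descending within each cohort
sorted <- toy %>% arrange(cohort, desc(gpa))

# Highest GPA overall: sort descending and keep the first row
top <- toy %>%
  arrange(desc(gpa)) %>%
  head(1)

# A factor sorts by its level order rather than alphabetically
custom <- toy %>%
  mutate(cohort = factor(cohort,
                         levels = c("Sociology", "Economics", "Anthropology"))) %>%
  arrange(cohort)

sorted
top
custom
```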

mutate(df , vars...)

mutate() is perhaps one of the most useful functions and probably the one I use the most. This function lets you create new variables in your data set, either with or without using the existing columns. To use mutate(), you pass the data frame as its first argument and the variables you want to create as the rest of the arguments. For example, if I want to create one variable that is age \(\times\) gpa and another that is gpa \(\times\) a random number between 10 and 20, I can write the code below. Note that sample(10:20,1,TRUE) draws a single number, so every row is multiplied by the same random number.

data<-data %>%
  mutate(gradeage = gpa*age,
         graderand = gpa*sample(10:20,1,TRUE))
## # A tibble: 3 x 6
##   workshop   age county              social cohort         gpa
##   <chr>    <dbl> <chr>               <chr>  <chr>        <dbl>
## 1 Yes         24 Orange County, CA   SGGSAC Anthropology  3.81
## 2 Yes         23 Franklin county, OH SGGSAC Economics     3.23
## 3 Yes         23 Franklin county, OH SGGSAC Economics     3.34
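The difference between one random draw for the whole column and one draw per row is easy to miss, so here is a sketch on a made-up data frame. The names toy, same_rand, and own_rand, as well as the seed, are assumptions for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Made-up data frame for illustration
toy <- tibble(age = c(24, 23, 27), gpa = c(3.8, 3.2, 3.9))

toy2 <- toy %>%
  mutate(
    gradeage  = gpa * age,
    # One draw, recycled: every row gets the same number
    same_rand = sample(10:20, 1, TRUE),
    # n() draws: each row gets its own number
    own_rand  = sample(10:20, n(), TRUE)
  )

toy2
```

n() returns the number of rows in the current data frame (or group), which is why it is a convenient way to ask for exactly one draw per row.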

summarise(df , vars...)

summarise() is similar to mutate(), but instead of adding columns to your data it returns only the summary, so the original rows are not kept. This function takes your data frame as the first argument and creates whatever summary statistics you specify. Say I want to see the mean, maximum, and standard deviation of the variable gpa. I will write:

data %>%
  summarise(avggpa = mean(gpa),
            maxgpa = max(gpa),
            sigmagpa = sd(gpa))
## # A tibble: 1 x 3
##   avggpa maxgpa sigmagpa
##    <dbl>  <dbl>    <dbl>
## 1   3.50   4.39    0.301

and I will get a data frame with one row and three variables. Alternatively, if you want general summary statistics for every column of your data, you can call the function summary().
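The gpa column in the workshop data has no missing values, but many real columns do. Here is a sketch of how summarise() behaves with an NA, using a made-up column; the summary names are hypothetical.

```r
library(dplyr)

# Made-up data with one missing value
toy <- tibble(gpa = c(3.2, 3.8, NA, 3.5))

res <- toy %>%
  summarise(
    avg_with_na = mean(gpa),                # NA, because one value is missing
    avg_gpa     = mean(gpa, na.rm = TRUE),  # mean of the non-missing values
    n_rows      = n(),                      # number of rows, including the NA
    n_missing   = sum(is.na(gpa))           # how many values are missing
  )

res
```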

group_by(df , vars...)

group_by() is perhaps the most powerful function of the dplyr package. It lets you do any operation separately for each group defined by one or more variables. This function outputs a data frame with a grouping structure on it, making all subsequent operations group-wise until the structure is removed. For example, if I want to look at the average age of participants in this workshop by cohort, I simply group the data by cohort and then call summarise() or mutate(). Afterwards, I may want to remove the grouping structure so I can see the difference between the group averages and the overall average. In that case, I simply pipe the data into ungroup() before piping it into mutate() again.

data<-data %>%
  group_by(cohort) %>%
  mutate(muage=mean(age, na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(groupdiff = mean(age, na.rm=TRUE)-muage)

data%>%head(6)
## # A tibble: 6 x 8
##   workshop   age county          social cohort          gpa muage groupdiff
##   <chr>    <dbl> <chr>           <chr>  <chr>         <dbl> <dbl>     <dbl>
## 1 Yes         24 Orange County,… SGGSAC Anthropology   3.81  26.1     0.260
## 2 Yes         23 Franklin count… SGGSAC Economics      3.23  26.9    -0.501
## 3 Yes         23 Franklin count… SGGSAC Economics      3.34  26.9    -0.501
## 4 Maybe       24 Cook county, IL SGGSAC Sociology      3.58  25.7     0.658
## 5 Yes         23 Knoxville, Ten… SGGSAC Political Sc…  3.63  23       3.37 
## 6 Yes         27 Racine County,… VANES… Psychology     3.88  25.4     0.944

Bonus: If you want the operation to happen row by row (you may need to do this when geocoding), you can use a special version of group_by() called rowwise(). rowwise() imposes a grouping structure so that each row is its own group.
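To see what rowwise() changes, compare the same mutate() call with and without it. The data frame toy and its columns q1 to q3 are made up for illustration.

```r
library(dplyr)

# Made-up data frame: three scores per person
toy <- tibble(q1 = c(1, 4), q2 = c(3, 6), q3 = c(5, 8))

# Without rowwise(), mean() sees all six values at once
pooled <- toy %>%
  mutate(avg = mean(c(q1, q2, q3)))

# With rowwise(), mean() is evaluated once per row
by_row <- toy %>%
  rowwise() %>%
  mutate(avg = mean(c(q1, q2, q3))) %>%
  ungroup()

pooled$avg  # 4.5 4.5
by_row$avg  # 3 6
```

Just like with group_by(), it is a good habit to call ungroup() once you are done so that later operations are not accidentally done row by row.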

select(df, cols)/rename(df, new_col=old_col)

select() and rename() are used to tidy up your data. In select(), you put your data as the first argument, and the rest of the arguments are the variables you want to keep. If you want to rename some variables while selecting them, you just need to write select(data,newvar1=oldvar1,...). On the other hand, if you only want to rename some variables and keep everything else, you only need to write rename(data,newvar1=oldvar1,...). The difference is that select() only gives you back the variables you specified, while rename() gives you back your entire data frame.

If you want to use select() to reorder your data frame, list the columns you want first, in order, and call everything() as your last argument. If, instead of selecting the columns you want, you need to drop the columns you do not want, add a minus sign before the variable, as in select(data,-col1).

For example, if I want to select workshop, gpa, and age and change “workshop” to “participate”, I can write:

data %>%
  select(participate=workshop,
         gpa, age)
## # A tibble: 1,976 x 3
##    participate   gpa   age
##    <chr>       <dbl> <dbl>
##  1 Yes          3.81    24
##  2 Yes          3.23    23
##  3 Yes          3.34    23
##  4 Maybe        3.58    24
##  5 Yes          3.63    23
##  6 Yes          3.88    27
##  7 Yes          3.68    31
##  8 Yes          3.41    26
##  9 Yes          4.00    35
## 10 Yes          3.87    24
## # … with 1,966 more rows
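The four patterns described in this section can be compared side by side by looking at the column names each one returns. The one-row data frame below is made up for illustration.

```r
library(dplyr)

# Made-up one-row data frame for illustration
toy <- tibble(workshop = "Yes", age = 24, gpa = 3.8, county = "Cook County, IL")

# select() with renaming keeps only the listed columns
sel <- toy %>% select(participate = workshop, gpa)
names(sel)    # "participate" "gpa"

# rename() keeps every column
ren <- toy %>% rename(participate = workshop)
names(ren)    # "participate" "age" "gpa" "county"

# everything() moves gpa to the front and keeps the rest in order
front <- toy %>% select(gpa, everything())
names(front)  # "gpa" "workshop" "age" "county"

# A minus sign drops a column
dropped <- toy %>% select(-county)
names(dropped)  # "workshop" "age" "gpa"
```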

distinct(df, cols)

distinct() gives you distinct combinations of the columns you select. For example, say I want the distinct combinations of workshop and cohort, I will write:

data %>%
  distinct(workshop,
           cohort) %>%
  arrange(cohort)
## # A tibble: 10 x 2
##    workshop cohort           
##    <chr>    <chr>            
##  1 Yes      Anthropology     
##  2 Maybe    Anthropology     
##  3 Yes      Economics        
##  4 Maybe    Economics        
##  5 Yes      History          
##  6 Yes      Political Science
##  7 Maybe    Psych/poli sci   
##  8 Yes      Psychology       
##  9 Maybe    Sociology        
## 10 Yes      Sociology
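Two close relatives of this pattern are often handy: counting how many rows fall in each combination, and keeping the other columns when you deduplicate. Here is a sketch on a made-up data frame; count() and the .keep_all argument are part of dplyr, while the data are invented.

```r
library(dplyr)

# Made-up data frame for illustration
toy <- tibble(
  workshop = c("Yes", "Yes", "Maybe", "Yes"),
  cohort   = c("Economics", "Economics", "Economics", "History"),
  gpa      = c(3.2, 3.9, 3.5, 3.7)
)

# Unique combinations only
combos <- toy %>% distinct(workshop, cohort)

# Unique combinations plus the number of rows in each (column n)
counts <- toy %>% count(workshop, cohort)

# Keep the other columns, taken from the first row of each combination
first_rows <- toy %>% distinct(workshop, cohort, .keep_all = TRUE)

combos
counts
first_rows
```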

tidyr

separate(df, col, into, sep)

separate() is used to split the values in a single column into multiple columns. The into argument takes a character vector with the names of the new columns you want to create from the original column. sep specifies what to use as the separator. For example, in the data we used before, people answered with a county and a state, which should be separated by a comma. We can thus use separate() to split this column into county and state. Rows without a comma have nothing to put in state, so they are filled with NA and reported in a warning:

data %>% 
  separate(county,
           c("county","state"),
           sep=", ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 430 rows
## [16, 17, 26, 32, 40, 41, 48, 59, 65, 67, 70, 71, 74, 78, 79, 84, 91, 92,
## 96, 99, ...].
## # A tibble: 1,976 x 9
##    workshop   age county    state    social cohort      gpa muage groupdiff
##    <chr>    <dbl> <chr>     <chr>    <chr>  <chr>     <dbl> <dbl>     <dbl>
##  1 Yes         24 Orange C… CA       SGGSAC Anthropo…  3.81  26.1     0.260
##  2 Yes         23 Franklin… OH       SGGSAC Economics  3.23  26.9    -0.501
##  3 Yes         23 Franklin… OH       SGGSAC Economics  3.34  26.9    -0.501
##  4 Maybe       24 Cook cou… IL       SGGSAC Sociology  3.58  25.7     0.658
##  5 Yes         23 Knoxville Tennesse SGGSAC Politica…  3.63  23       3.37 
##  6 Yes         27 Racine C… WI       VANES… Psycholo…  3.88  25.4     0.944
##  7 Yes         31 <NA>      <NA>     SGGSAC Sociology  3.68  25.7     0.658
##  8 Yes         26 Dekalb    GA       VANES… Anthropo…  3.41  26.1     0.260
##  9 Yes         35 Hyde Park IL       SGGSAC Economics  4.00  26.9    -0.501
## 10 Yes         24 York Cou… South C… SGGSAC History    3.87  24       2.37 
## # … with 1,966 more rows
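If you already know that some rows have too few or too many pieces, you can tell separate() what to do instead of getting a warning. The sketch below uses the fill and extra arguments of separate() on made-up county strings; whether merging the extra pieces is the right choice depends on your data.

```r
library(dplyr)
library(tidyr)

# Made-up answers: a clean one, one with no comma, one with two commas
toy <- tibble(county = c("Cook County, IL",
                         "NY",
                         "York County, South Carolina, USA"))

res <- toy %>%
  separate(county,
           into  = c("county", "state"),
           sep   = ", ",
           fill  = "right",   # too few pieces: fill missing ones with NA on the right
           extra = "merge")   # too many pieces: keep the remainder in the last column

res
```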

unite(df, new_col, cols, sep)

unite() is the exact opposite of separate(). It takes multiple columns and concatenates them into one column, using similar arguments. For example, if I want to concatenate workshop and social into a variable called “new” with “_” as my separator, I will write:

data %>% 
  unite("new",
        c(workshop, social),
        sep="_")
## # A tibble: 1,976 x 7
##    new          age county              cohort          gpa muage groupdiff
##    <chr>      <dbl> <chr>               <chr>         <dbl> <dbl>     <dbl>
##  1 Yes_SGGSAC    24 Orange County, CA   Anthropology   3.81  26.1     0.260
##  2 Yes_SGGSAC    23 Franklin county, OH Economics      3.23  26.9    -0.501
##  3 Yes_SGGSAC    23 Franklin county, OH Economics      3.34  26.9    -0.501
##  4 Maybe_SGG…    24 Cook county, IL     Sociology      3.58  25.7     0.658
##  5 Yes_SGGSAC    23 Knoxville, Tennesse Political Sc…  3.63  23       3.37 
##  6 Yes_VANES…    27 Racine County, WI   Psychology     3.88  25.4     0.944
##  7 Yes_SGGSAC    31 <NA>                Sociology      3.68  25.7     0.658
##  8 Yes_VANES…    26 Dekalb, GA          Anthropology   3.41  26.1     0.260
##  9 Yes_SGGSAC    35 Hyde Park, IL       Economics      4.00  26.9    -0.501
## 10 Yes_SGGSAC    24 York County, South… History        3.87  24       2.37 
## # … with 1,966 more rows
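Here is a small sketch showing that unite() and separate() undo each other, and that remove = FALSE keeps the original columns next to the new one. The two-row data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up data frame for illustration
toy <- tibble(workshop = c("Yes", "Maybe"), social = c("SGGSAC", "VANESSA"))

# Default: the original columns are replaced by the new one
united <- toy %>% unite("new", c(workshop, social), sep = "_")

# remove = FALSE keeps the original columns as well
kept <- toy %>% unite("new", c(workshop, social), sep = "_", remove = FALSE)

# separate() with the same separator brings the original columns back
round_trip <- united %>%
  separate(new, into = c("workshop", "social"), sep = "_")

united
names(kept)
round_trip
```

The round trip only works cleanly when the separator does not already appear inside the values, so pick a separator that is not used in your data.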

gather() and spread()

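We talked in detail about these two functions last time, in the workshop Introduction to data. To summarize, gather() is used to gather wide-form data into long-form data. This function takes arguments in the following form: gather(data, key="string", value="string", columns), where key and value name the two new columns and columns are the ones to gather. spread() is used to spread long-form data into wide-form data. This function takes arguments in the following form: spread(data, key, value), where key is the column whose values become the new column names and value is the column that fills them. In current versions of tidyr, both functions are superseded by pivot_longer() and pivot_wider(), but they still work.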
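As a quick refresher, here is a sketch of a wide-to-long-to-wide round trip with gather() and spread(), followed by the same steps with pivot_longer() and pivot_wider(). The data frame wide and its columns are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per id, one column per subject
wide <- tibble(id = 1:2, math = c(90, 80), art = c(70, 85))

# Wide to long: one row per id-subject pair
long <- wide %>%
  gather(key = "subject", value = "score", math, art)

# Long back to wide
wide_again <- long %>%
  spread(key = subject, value = score)

# The same two steps with the newer functions
long2 <- wide %>%
  pivot_longer(cols = c(math, art), names_to = "subject", values_to = "score")
wide2 <- long2 %>%
  pivot_wider(names_from = subject, values_from = score)

long
wide_again
wide2
```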

Try it yourself!

  1. For this question, we will be using the same data set from the first question in the ggplot2 workshop. You can load the data with readr::read_csv() with the path https://willythewoo.github.io/WillyTheWoo/workshop/data/acs_sample.csv.
  1. As you can see, the column city contains both the city name and the state, and sometimes two or more states. Using the tools you have learned today, create a column named metro that contains only the city name and a column named states that contains all the states the metropolitan area is in. This includes deleting the original state column. Call this new data frame d11.

Click for solutions

# Loading data
d11 <- read.csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/acs_sample.csv")

# Separate the column city into metro and states
d11<-d11%>%
  separate(city,sep=",", into=c("metro","states"))%>%
  select(-state)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 161 rows
## [89, 130, 138, 217, 222, 223, 240, 349, 390, 398, 477, 482, 483, 500, 609,
## 650, 658, 737, 742, 743, ...].
# Look at the result
head(d11)
##   year                      metro states population   female age hhincome
## 1 2005                      akron     oh     700649 1.514134  37    55000
## 2 2005    albany-schenectady-troy     ny     846357 1.517147  38    60700
## 3 2005                albuquerque     nm     798722 1.514087  36    50900
## 4 2005                 alexandria     la     147128 1.528448  36    40100
## 5 2005 allentown-bethlehem-easton  pa-nj     786034 1.517561  40    61000
## 6 2005                    altoona     pa     125882 1.514830  40    42150
##   sanctuary
## 1        NA
## 2        NA
## 3        NA
## 4        NA
## 5        NA
## 6        NA
  1. After looking at the data set d11, you realize that the variable female was coded 1 for male and 2 for female. The variable was created by taking the average of this code over everyone in the city. Hence this variable does not represent the percentage of females in the city. Correct this data error in d11 so the variable represents the percentage of females. Call this new data frame d12.

Click for solutions

Since female is an average of values coded 1 and 2, and we need an average of values coded 0 and 1, we can simply subtract 1 from female. The result is the share of females as a number between 0 and 1.

d12<-d11%>%
  mutate(female = female-1)

head(d12)
##   year                      metro states population    female age hhincome
## 1 2005                      akron     oh     700649 0.5141343  37    55000
## 2 2005    albany-schenectady-troy     ny     846357 0.5171471  38    60700
## 3 2005                albuquerque     nm     798722 0.5140870  36    50900
## 4 2005                 alexandria     la     147128 0.5284480  36    40100
## 5 2005 allentown-bethlehem-easton  pa-nj     786034 0.5175613  40    61000
## 6 2005                    altoona     pa     125882 0.5148302  40    42150
##   sanctuary
## 1        NA
## 2        NA
## 3        NA
## 4        NA
## 5        NA
## 6        NA
  1. The median of hhincome in this data set is $58,000. Filter the data set d12 so you only have cities that are in the upper half of the household income distribution. Call this new data frame d13. (Hint: You can use either arrange() or median() to find the median yourself.)

Click for solutions

d13<-d12%>%filter(hhincome >= 58000)

str(d13)
## 'data.frame':    1836 obs. of  8 variables:
##  $ year      : int  2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
##  $ metro     : chr  "albany-schenectady-troy" "allentown-bethlehem-easton" "anchorage" "ann arbor" ...
##  $ states    : chr  " ny" " pa-nj" " ak" " mi" ...
##  $ population: int  846357 786034 351851 343858 4947012 1464309 2649586 224877 NA 4458891 ...
##  $ female    : num  0.517 0.518 0.498 0.501 0.504 ...
##  $ age       : int  38 40 34 33 34 32 37 44 32 37 ...
##  $ hhincome  : int  60700 61000 69360 66000 60000 58500 69000 66000 62000 74300 ...
##  $ sanctuary : int  NA NA NA NA NA 0 0 NA NA 0 ...

If you tried to compute the median yourself, you might have noticed that the output is NA. This happens because some household income values are missing. To correct this, add na.rm = TRUE as an argument to the function median().

d13<-d12%>%filter(hhincome >= median(d12$hhincome, na.rm = TRUE))

str(d13)
## 'data.frame':    1836 obs. of  8 variables:
##  $ year      : int  2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
##  $ metro     : chr  "albany-schenectady-troy" "allentown-bethlehem-easton" "anchorage" "ann arbor" ...
##  $ states    : chr  " ny" " pa-nj" " ak" " mi" ...
##  $ population: int  846357 786034 351851 343858 4947012 1464309 2649586 224877 NA 4458891 ...
##  $ female    : num  0.517 0.518 0.498 0.501 0.504 ...
##  $ age       : int  38 40 34 33 34 32 37 44 32 37 ...
##  $ hhincome  : int  60700 61000 69360 66000 60000 58500 69000 66000 62000 74300 ...
##  $ sanctuary : int  NA NA NA NA NA 0 0 NA NA 0 ...
  1. You now realize that you lost about half of your data in d13, which is not a good thing. Instead, you decide to create a logical variable in d12 that lets you do operations on the data conditionally. Create a new variable called upper that is TRUE if the city’s household income is at least $58,000 and FALSE otherwise. Save this new data frame as d14. (Hint: You can use the function ifelse())

Click for solutions

d14<-d12%>%
  mutate(upper=ifelse(hhincome>=58000,TRUE,FALSE))

str(d14)
## 'data.frame':    3759 obs. of  9 variables:
##  $ year      : int  2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
##  $ metro     : chr  "akron" "albany-schenectady-troy" "albuquerque" "alexandria" ...
##  $ states    : chr  " oh" " ny" " nm" " la" ...
##  $ population: int  700649 846357 798722 147128 786034 125882 237767 351851 343858 NA ...
##  $ female    : num  0.514 0.517 0.514 0.528 0.518 ...
##  $ age       : int  37 38 36 36 40 40 34 34 33 39 ...
##  $ hhincome  : int  55000 60700 50900 40100 61000 42150 44734 69360 66000 40900 ...
##  $ sanctuary : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ upper     : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
  1. Do the same thing as in the previous part, but with age and a cutoff of 37, using the data d14. Name the new variable up_age and store this new data frame as d15.

Click for solutions

d15<-d14%>%
  mutate(up_age=ifelse(age>=37,TRUE,FALSE))

str(d15)
## 'data.frame':    3759 obs. of  10 variables:
##  $ year      : int  2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
##  $ metro     : chr  "akron" "albany-schenectady-troy" "albuquerque" "alexandria" ...
##  $ states    : chr  " oh" " ny" " nm" " la" ...
##  $ population: int  700649 846357 798722 147128 786034 125882 237767 351851 343858 NA ...
##  $ female    : num  0.514 0.517 0.514 0.528 0.518 ...
##  $ age       : int  37 38 36 36 40 40 34 34 33 39 ...
##  $ hhincome  : int  55000 60700 50900 40100 61000 42150 44734 69360 66000 40900 ...
##  $ sanctuary : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ upper     : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ up_age    : logi  TRUE TRUE FALSE FALSE TRUE TRUE ...
  1. Now you can look at statistics in each combination of age and income bracket. Find out the raw average percentage of females in each combination of age and income bracket by year using summarize(). Sort this data set by population in descending order.

Click for solutions

d15%>%
  filter(!is.na(upper),!is.na(up_age),!is.na(female))%>%
  group_by(upper,up_age)%>%
  summarise(female=mean(female,na.rm=TRUE))
## # A tibble: 4 x 3
## # Groups:   upper [2]
##   upper up_age female
##   <lgl> <lgl>   <dbl>
## 1 FALSE FALSE   0.507
## 2 FALSE TRUE    0.511
## 3 TRUE  FALSE   0.503
## 4 TRUE  TRUE    0.508
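The solution above pools all years and does not sort the result. If you want the by-year, sorted version that the question describes, one possible sketch is shown below on a made-up stand-in for d15. The toy values, and the choice to sum population within each group before sorting, are assumptions for illustration and not part of the original solution.

```r
library(dplyr)

# Made-up stand-in for d15; all values are invented for illustration
toy <- tibble(
  year       = c(2005, 2005, 2005, 2006, 2006, 2006),
  upper      = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  up_age     = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  female     = c(0.50, 0.52, 0.51, 0.49, 0.53, 0.51),
  population = c(100, 250, 200, 500, 150, 250)
)

by_year <- toy %>%
  group_by(year, upper, up_age) %>%            # add year to the grouping
  summarise(female     = mean(female, na.rm = TRUE),
            population = sum(population, na.rm = TRUE),
            .groups    = "drop") %>%           # return an ungrouped result
  arrange(desc(population))                    # largest groups first

by_year
```

Applied to d15, the same pipeline would return one row per year and bracket combination instead of the four pooled rows shown above.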
  1. Going back to ggplot2, plot the linear relationship between population and female, faceted by the age and income brackets in a 2 \(\times\) 2 figure.

Click for solutions

# method="lm" fits a straight line; the default smoother is not linear
ggplot(d15%>%filter(!is.na(upper),!is.na(up_age),!is.na(female)),
       aes(x=population,y=female))+
  geom_smooth(method="lm",se=FALSE)+
  facet_wrap(upper~up_age)
## Warning: Removed 151 rows containing non-finite values (stat_smooth).
  1. Find out how many cities are in each age and income bracket combination by year. (Hint: Use the function n() inside summarize()).

Click for solutions

d15%>%
  filter(!is.na(upper),!is.na(up_age),!is.na(female))%>%
  group_by(upper,up_age)%>%
  summarise(female=mean(female,na.rm=TRUE),
            count=n())
## # A tibble: 4 x 4
## # Groups:   upper [2]
##   upper up_age female count
##   <lgl> <lgl>   <dbl> <int>
## 1 FALSE FALSE   0.507   964
## 2 FALSE TRUE    0.511   854
## 3 TRUE  FALSE   0.503   859
## 4 TRUE  TRUE    0.508   977
  1. This problem will teach you how to create a simulated data set under the generalized Roy model framework. It is also good practice for understanding tidyverse. The three functions used in the first part, rnorm(), rbinom(), and runif(), draw from the normal, binomial, and uniform distributions respectively.
  1. Using functions like rnorm(), rbinom(), and runif(), create a data frame with 1,000 observations and variables \(X\), \(D\), and \(E\) such that \(X\sim U[-2,2]\), \(D\sim Binom(1,0.3)\), and \(E\sim N(0,1)\). Call this data frame d21. In rbinom(), set the argument size to 1.

Click for solutions

d21<-data.frame(
  X=runif(1000,-2,2),
  D=rbinom(1000,1,0.3),
  E=rnorm(1000,0,1)
)
  1. Using d21, create \(Y0\) and \(Y1\) such that \(Y0=X+E\) and \(Y1=X+0.3+E\). Then create a new variable \(Y\) such that \(Y=Y1\) if \(D=1\) and \(Y=Y0\) if \(D=0\). Store this new data frame as d22.

Click for solutions

d22<-d21%>%
  mutate(Y0 = X + E,
         Y1 = X + 0.3 + E,
         Y = ifelse(D==1,Y1,Y0))
  1. Create a new variable called ID that uniquely identifies each observation. Store this data frame as d23.

Click for solutions

d23 <- d22%>%
  mutate(ID=1:1000)
  1. Run the following code to create a data set called id_gpa with ID numbers and gpa.
id_gpa<-data.frame(ID=1:1000,
                   gpa=rnorm(1000,3.5,0.15))

Now run ?left_join to read the documentation of this function. Merge id_gpa into d23 so you have the gpa of all the observations in d23. Store this new data frame as d24.

Click for solutions

d24<-d23%>%
  left_join(id_gpa,by="ID")
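In this exercise every ID appears in both data frames, so you cannot see what left_join() does with IDs that do not match. Here is a sketch with made-up data frames where one ID is missing on each side; the names left and right and all values are hypothetical.

```r
library(dplyr)

# Made-up data frames: ID 3 has no gpa, and ID 5 has no Y
left  <- tibble(ID = 1:4, Y = c(0.2, -1.1, 0.7, 1.5))
right <- tibble(ID = c(1L, 2L, 4L, 5L), gpa = c(3.4, 3.6, 3.5, 3.9))

# left_join() keeps every row of the left table;
# IDs without a match get NA in the new columns
joined <- left %>% left_join(right, by = "ID")

# inner_join() keeps only the IDs found in both tables
inner <- left %>% inner_join(right, by = "ID")

joined
inner
```

ID 5 exists only in the right table, so it does not show up in either result.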
  1. Using ggplot, graph the relationship between Y and gpa using a scatter plot. Facet the figure by D.

Click for solutions

# geom_point() draws the scatter plot asked for in the question
ggplot(d24,aes(x=gpa,y=Y))+
  geom_point()+
  facet_wrap(~D)

Click here to continue to the next workshop: Introduction to text data
