R is a powerful language, but a language is only as powerful as its most versatile package. In R’s case, that package is tidyverse, and we will focus on functions from tidyverse for the rest of this series. If you haven’t already, install tidyverse from CRAN with install.packages("tidyverse") and load it into your session with either library(tidyverse) or require(tidyverse). For this workshop, we will use a subset of the data set from the ggplot2 workshop with a newly created variable, gpa.
require(tidyverse)
data<-read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data3.csv")
%>%
One of the most powerful tools that tidyverse brings us is the pipe operator %>%. If you are familiar with the shell, you know what this is. If you have never heard of the shell, this is going to blow your mind.
Sometimes you need to apply multiple functions to a data set. Say you have a data set called data and four functions, f1, f2, f3, and f4, and you need the output from feeding data through each function sequentially. Intuitively, the first thing you would do is:
data1<-f1(data)
data2<-f2(data1)
data3<-f3(data2)
data_final<-f4(data3)
This may seem simple, but it gets tedious when you need to chain more than 10 functions. To save our eyes and brains a little, programmers invented the pipe operator %>%. The pipe operator takes the output of one function and pipes it directly into the next function as its first argument. Hence the above scenario can be rewritten as:
data_final<-
data %>%
f1() %>%
f2() %>%
f3() %>%
f4()
The pipe operator lets us write much cleaner, more readable code. One of the most important and universal best practices in coding is to break your code across lines when you need to, and pipes allow us to do that easily. Personally, I put each function on its own line so I can easily change the arguments of those functions.
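To see that piping and nesting give the same answer, here is a minimal sketch on a made-up vector. The vector x and the names nested and piped are illustrative and not part of the workshop data.

```r
library(dplyr)  # dplyr re-exports %>%; library(tidyverse) works as well

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# The piped version is read from top to bottom
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```

Both versions compute the same value; the piped one just lists the steps in the order they happen.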
For the rest of this workshop, I want to focus on the packages dplyr and tidyr from the parent package tidyverse. Note that this means you only have to load tidyverse and these packages will automatically be included.
To get a more complete view of the dplyr
package, click here
filter(df , condition)
filter() is a literal filter that keeps only the rows that satisfy a condition. The function takes two arguments: place the data you want to use first and the truth condition second. Note that the truth condition can combine several tests. To keep your code clear, always wrap each chunk of the condition in parentheses. For example, the code filter(data,((a==4&b==2)&(a==2|b==3))|(a==0|b==0)) will give me data that fits either (a==4&b==2) & (a==2|b==3) or (a==0|b==0). Making your filter condition clear is extremely important for readability.
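The placement of parentheses changes which rows come back, not just how the code reads. The sketch below uses a made-up data frame toy (the names toy, inner, outer, and bare are illustrative) to compare two groupings of the same three tests.

```r
library(dplyr)

toy <- data.frame(a = c(0, 2, 4, 4),
                  b = c(1, 3, 2, 0))

# a must be 4, and in addition either b is 2 or a is 0
inner <- filter(toy, a == 4 & (b == 2 | a == 0))

# either (a is 4 and b is 2), or a is 0
outer <- filter(toy, (a == 4 & b == 2) | a == 0)

# Without parentheses, R evaluates & before |, so this matches `outer`
bare <- filter(toy, a == 4 & b == 2 | a == 0)

nrow(inner)
nrow(outer)
```

Writing the parentheses out, as in outer, spares the reader from having to remember the precedence rule that bare relies on.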
For example, we can select people who answered Maybe to the workshop question in the survey and whose gpa is above 3.5:
filter(data,
workshop=="Maybe" & gpa>3.5)
## # A tibble: 128 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Maybe 24 Cook county, IL SGGSAC Sociology 3.58
## 2 Maybe 25 Cuyahoga County, OH SGGSAC Anthropology 3.79
## 3 Maybe 25 Cuyahoga County, OH SGGSAC Anthropology 3.69
## 4 Maybe 25 Cuyahoga County, OH SGGSAC Anthropology 3.52
## 5 Maybe 28 Orange county SUCSAC Economics 3.83
## 6 Maybe 33 Hillsborough County, FL SGGSAC Psych/poli sci 3.54
## 7 Maybe 25 Cuyahoga County, OH SGGSAC Anthropology 3.67
## 8 Maybe 24 Cook county, IL SGGSAC Sociology 3.96
## 9 Maybe 33 Hillsborough County, FL SGGSAC Psych/poli sci 4.15
## 10 Maybe 24 Cook county, IL SGGSAC Sociology 3.52
## # … with 118 more rows
%in%
Sometimes you have a vector of values that you want filter() to match with the or (|) condition. In that case, you can use %in% followed by the vector you want to match. For example, we can select people who think the social committee’s name is either SUCSAC or VANESSA:
filter(data,
social %in% c("SUCSAC","VANESSA"))
## # A tibble: 459 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 27 Racine County, WI VANESSA Psychology 3.88
## 2 Yes 26 Dekalb, GA VANESSA Anthropology 3.41
## 3 Yes NA Arlington, VA VANESSA Sociology 3.39
## 4 Yes 27 Racine County, WI VANESSA Psychology 3.39
## 5 Yes NA NY SUCSAC Psychology 3.77
## 6 Maybe 28 Orange county SUCSAC Economics 3.83
## 7 Yes 35 Cook County, IL SUCSAC Economics 3.44
## 8 Yes 35 Cook County, IL SUCSAC Economics 3.63
## 9 Yes 27 Racine County, WI VANESSA Psychology 3.45
## 10 Yes NA NY SUCSAC Psychology 3.83
## # … with 449 more rows
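As a quick check that %in% is shorthand for a chain of == tests joined by |, the sketch below runs both forms on a made-up data frame (toy, with_or, and with_in are illustrative names).

```r
library(dplyr)

toy <- data.frame(social = c("SUCSAC", "VANESSA", "SGGSAC", NA))

# Spelled out with == and |
with_or <- filter(toy, social == "SUCSAC" | social == "VANESSA")

# The same match written with %in%
with_in <- filter(toy, social %in% c("SUCSAC", "VANESSA"))

identical(with_or$social, with_in$social)
```

Both forms drop the row where social is missing, because filter() keeps only rows where the condition is TRUE.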
arrange(df , vars...)
arrange() lets you reorder your observations by the values of one or more variables. By default, it sorts each variable in ascending order; to sort in descending order, wrap the variable in desc(). When you give several variables, the order of the arguments sets the priority. For example, arrange(data,age,desc(gpa)) would first arrange the data set from the youngest to the oldest and then, for people of the same age, from the highest GPA to the lowest GPA.
A character column is sorted alphabetically; a factor column is sorted by the order of its levels, which is alphabetical when the factor is created with as.factor(). arrange() is most useful when you need to see extreme values using head() or need to do sequential operations.
For example, we can arrange the data set by cohort alphabetically and gpa descendingly:
data %>%
arrange(as.factor(cohort),
desc(gpa))
## # A tibble: 1,976 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 San Diego county, CA SGGSAC Anthropology 4.39
## 2 Yes 24 Orange County, CA SGGSAC Anthropology 4.30
## 3 Maybe 25 Cuyahoga County, OH SGGSAC Anthropology 4.21
## 4 Yes 24 Orange County, CA SGGSAC Anthropology 4.20
## 5 Yes 24 San Diego county, CA SGGSAC Anthropology 4.20
## 6 Yes 24 Orange County, CA SGGSAC Anthropology 4.18
## 7 Yes 24 Orange County, CA SGGSAC Anthropology 4.17
## 8 Yes 45 Cook countywid SGGSAC Anthropology 4.16
## 9 Yes 24 Orange County, CA SGGSAC Anthropology 4.15
## 10 Yes 45 Cook countywid SGGSAC Anthropology 4.14
## # … with 1,966 more rows
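The ordering priority described for arrange(data,age,desc(gpa)) can be checked on a small made-up data frame. The data frame toy and its name column are illustrative and not part of the workshop data.

```r
library(dplyr)

toy <- data.frame(name = c("A", "B", "C", "D"),
                  age  = c(25, 23, 25, 23),
                  gpa  = c(3.2, 3.9, 3.8, 3.1))

# Youngest first; within the same age, highest gpa first
sorted <- arrange(toy, age, desc(gpa))

sorted$name  # "B" "D" "C" "A"
```

The two 23-year-olds come first (B before D because 3.9 > 3.1), followed by the two 25-year-olds (C before A because 3.8 > 3.2).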
mutate(df , vars...)
mutate() is perhaps one of the most useful functions and probably the one I use the most. It lets you create new variables in your data set, with or without using the original data. To use mutate(), pass the data frame as the first argument and the variables you want to create as the remaining arguments. For example, if I want to create one variable that is age \(\times\) gpa and another that is gpa \(\times\) a random number between 10 and 20, I can write:
data<-data %>%
mutate(gradeage = gpa*age,
graderand = gpa*sample(10:20,1,TRUE))
## # A tibble: 3 x 6
## workshop age county social cohort gpa
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Yes 24 Orange County, CA SGGSAC Anthropology 3.81
## 2 Yes 23 Franklin county, OH SGGSAC Economics 3.23
## 3 Yes 23 Franklin county, OH SGGSAC Economics 3.34
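One detail of the call above is worth a second look: sample(10:20,1,TRUE) draws a single number, so every row is multiplied by the same value. The sketch below, on a made-up data frame, contrasts that with a hypothetical variant (per_row, not in the original workshop) that draws one number per row using n().

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible

toy <- data.frame(gpa = c(3.0, 3.5, 4.0))

out <- toy %>%
  mutate(one_draw = gpa * sample(10:20, 1, TRUE),    # same multiplier for all rows
         per_row  = gpa * sample(10:20, n(), TRUE))  # one multiplier per row

# The implied multiplier is constant across rows for one_draw
out$one_draw / out$gpa
```

Which behaviour you want depends on the question; the point is only that the size argument of sample() decides it.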
summarise(df , vars...)
summarise() is similar to mutate(), but it returns only the summary statistics you ask for rather than your original data set. This function takes your data frame as the first argument and creates whatever summary statistics you define. Say I want to see the mean, maximum, and standard deviation of the variable gpa; I will write:
data %>%
summarise(avggpa = mean(gpa),
maxgpa = max(gpa),
sigmagpa = sd(gpa))
## # A tibble: 1 x 3
## avggpa maxgpa sigmagpa
## <dbl> <dbl> <dbl>
## 1 3.50 4.39 0.301
and I will get a data frame with one row and three variables. Alternatively, if you want general summary statistics of your data, you can call the function summary().
group_by(df , vars...)
group_by() is perhaps the most powerful function of the dplyr package. It lets you do any operation by certain grouping characteristics. The function outputs a data frame with a grouping structure on it, which makes all subsequent operations group-wise until the structure is removed. For example, if I want to look at the average age of participants in this workshop by cohort, I simply group the data by cohort and then call summarise() or mutate(). Afterwards, I may want to remove the grouping structure so I can see the difference between the group averages and the overall average. In that case, I simply pipe the data into ungroup() before piping it again into mutate().
data<-data %>%
group_by(cohort) %>%
mutate(muage=mean(age, na.rm=TRUE)) %>%
ungroup() %>%
mutate(groupdiff = mean(age, na.rm=TRUE)-muage)
data%>%head(6)
## # A tibble: 6 x 8
## workshop age county social cohort gpa muage groupdiff
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Yes 24 Orange County,… SGGSAC Anthropology 3.81 26.1 0.260
## 2 Yes 23 Franklin count… SGGSAC Economics 3.23 26.9 -0.501
## 3 Yes 23 Franklin count… SGGSAC Economics 3.34 26.9 -0.501
## 4 Maybe 24 Cook county, IL SGGSAC Sociology 3.58 25.7 0.658
## 5 Yes 23 Knoxville, Ten… SGGSAC Political Sc… 3.63 23 3.37
## 6 Yes 27 Racine County,… VANES… Psychology 3.88 25.4 0.944
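The same group_by(), mutate(), ungroup(), mutate() pattern is easier to follow on a data frame small enough to check by hand. The data frame toy below is made up for illustration.

```r
library(dplyr)

toy <- data.frame(cohort = c("A", "A", "B", "B"),
                  age    = c(20, 30, 40, NA))

out <- toy %>%
  group_by(cohort) %>%
  mutate(muage = mean(age, na.rm = TRUE)) %>%           # group means: 25 and 40
  ungroup() %>%
  mutate(groupdiff = mean(age, na.rm = TRUE) - muage)   # overall mean is 30

out
```

Without the ungroup() step, the second mean() would still be computed within each cohort, and groupdiff would be 0 in every row.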
Bonus: If you want an operation to happen row by row (you may need this when geocoding), you can use a special version of group_by() called rowwise(). rowwise() imposes a grouping structure so that each row is in its own group.
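A minimal sketch of what rowwise() changes, using a made-up data frame with two numeric columns (toy, not_rowwise, and by_row are illustrative names):

```r
library(dplyr)

toy <- data.frame(a = c(1, 5, 3),
                  b = c(4, 2, 6))

# Without rowwise(), max() sees both whole columns and returns one number
not_rowwise <- toy %>%
  mutate(m = max(a, b))

# With rowwise(), max() is evaluated separately for each row
by_row <- toy %>%
  rowwise() %>%
  mutate(m = max(a, b)) %>%
  ungroup()

not_rowwise$m  # 6 6 6
by_row$m       # 4 5 6
```

As with group_by(), piping into ungroup() afterwards removes the row-wise structure.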
select(df, cols)/rename(df, new_col=old_col)
select() and rename() help you tidy up your data. In select(), you put your data as the first argument, and the remaining arguments are the variables you want to keep. If you also want to rename some of the variables you keep, you just need to write select(data,newvar1=oldvar1,...). On the other hand, if you only want to rename some variables, you only need to write rename(data,newvar1=oldvar1,...). The difference is that select() only gives you back the variables you specified, while rename() gives you back your entire data frame.
If you want to use select() to reorder your data frame, specify the columns you want first, in order, and call everything() as your last argument. If, instead of selecting the columns you want, you need to drop the columns you do not want, add a minus sign before the variable, as in select(-col1).
For example, if I want to select workshop, gpa, and age and change “workshop” to “participate”, I can write:
data %>%
select(participate=workshop,
gpa, age)
## # A tibble: 1,976 x 3
## participate gpa age
## <chr> <dbl> <dbl>
## 1 Yes 3.81 24
## 2 Yes 3.23 23
## 3 Yes 3.34 23
## 4 Maybe 3.58 24
## 5 Yes 3.63 23
## 6 Yes 3.88 27
## 7 Yes 3.68 31
## 8 Yes 3.41 26
## 9 Yes 4.00 35
## 10 Yes 3.87 24
## # … with 1,966 more rows
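The four uses described above (keep and rename, rename only, reorder with everything(), and drop with a minus sign) can be compared side by side by looking at the resulting column names. The one-row data frame toy is made up for illustration.

```r
library(dplyr)

toy <- data.frame(workshop = "Yes", age = 24, gpa = 3.8)

kept      <- names(select(toy, participate = workshop, gpa))  # keep two, rename one
renamed   <- names(rename(toy, participate = workshop))       # keep all, rename one
reordered <- names(select(toy, gpa, everything()))            # gpa first, rest after
dropped   <- names(select(toy, -age))                         # everything except age

kept       # "participate" "gpa"
renamed    # "participate" "age" "gpa"
reordered  # "gpa" "workshop" "age"
dropped    # "workshop" "gpa"
```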
distinct(df, cols)
distinct() gives you the distinct combinations of the columns you select. For example, say I want the distinct combinations of workshop and cohort; I will write:
data %>%
distinct(workshop,
cohort) %>% arrange(cohort)
## # A tibble: 10 x 2
## workshop cohort
## <chr> <chr>
## 1 Yes Anthropology
## 2 Maybe Anthropology
## 3 Yes Economics
## 4 Maybe Economics
## 5 Yes History
## 6 Yes Political Science
## 7 Maybe Psych/poli sci
## 8 Yes Psychology
## 9 Maybe Sociology
## 10 Yes Sociology
separate(df, col, into, sep)
separate() splits the values in a single column into multiple columns. The into argument takes a character vector with the names of the new columns that you want to create from the original column. sep specifies the separator to split on. For example, in the data we used before, people answered with a county and state, which should be separated by a comma. We can thus use separate() to split this column into county and state:
data %>%
separate(county,
c("county","state"),
sep=", ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 430 rows
## [16, 17, 26, 32, 40, 41, 48, 59, 65, 67, 70, 71, 74, 78, 79, 84, 91, 92,
## 96, 99, ...].
## # A tibble: 1,976 x 9
## workshop age county state social cohort gpa muage groupdiff
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Yes 24 Orange C… CA SGGSAC Anthropo… 3.81 26.1 0.260
## 2 Yes 23 Franklin… OH SGGSAC Economics 3.23 26.9 -0.501
## 3 Yes 23 Franklin… OH SGGSAC Economics 3.34 26.9 -0.501
## 4 Maybe 24 Cook cou… IL SGGSAC Sociology 3.58 25.7 0.658
## 5 Yes 23 Knoxville Tennesse SGGSAC Politica… 3.63 23 3.37
## 6 Yes 27 Racine C… WI VANES… Psycholo… 3.88 25.4 0.944
## 7 Yes 31 <NA> <NA> SGGSAC Sociology 3.68 25.7 0.658
## 8 Yes 26 Dekalb GA VANES… Anthropo… 3.41 26.1 0.260
## 9 Yes 35 Hyde Park IL SGGSAC Economics 4.00 26.9 -0.501
## 10 Yes 24 York Cou… South C… SGGSAC History 3.87 24 2.37
## # … with 1,966 more rows
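The warning above appears because some answers contain no comma, so there is no second piece for state. As a sketch of one way to handle this, separate() has a fill argument; with fill = "right" the missing piece is filled with NA without a warning. The data frame toy is made up for illustration.

```r
library(tidyr)

toy <- data.frame(county = c("Cook County, IL", "Orange county", NA))

out <- separate(toy, county,
                into = c("county", "state"),
                sep  = ", ",
                fill = "right")  # pad missing pieces on the right with NA

out
```

Whether silently filling with NA is appropriate depends on the data; the warning is often useful as a prompt to inspect the affected rows first.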
unite(df, new_col, cols, sep)
unite() is the exact opposite of separate(). It takes multiple columns and concatenates them into one column using similar arguments. For example, if I want to concatenate workshop and social into a variable called “new” with “_” as my separator, I will write:
data %>%
unite("new",
c(workshop, social),
sep="_")
## # A tibble: 1,976 x 7
## new age county cohort gpa muage groupdiff
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Yes_SGGSAC 24 Orange County, CA Anthropology 3.81 26.1 0.260
## 2 Yes_SGGSAC 23 Franklin county, OH Economics 3.23 26.9 -0.501
## 3 Yes_SGGSAC 23 Franklin county, OH Economics 3.34 26.9 -0.501
## 4 Maybe_SGG… 24 Cook county, IL Sociology 3.58 25.7 0.658
## 5 Yes_SGGSAC 23 Knoxville, Tennesse Political Sc… 3.63 23 3.37
## 6 Yes_VANES… 27 Racine County, WI Psychology 3.88 25.4 0.944
## 7 Yes_SGGSAC 31 <NA> Sociology 3.68 25.7 0.658
## 8 Yes_VANES… 26 Dekalb, GA Anthropology 3.41 26.1 0.260
## 9 Yes_SGGSAC 35 Hyde Park, IL Economics 4.00 26.9 -0.501
## 10 Yes_SGGSAC 24 York County, South… History 3.87 24 2.37
## # … with 1,966 more rows
gather() and spread()
We talked in detail about these two functions last time in the workshop Introduction to data. To summarize, gather() is used to gather wide-form data into long-form data. This function takes arguments in the following form: gather(key="string",value="string",columns), where key and value name the two new columns. spread() is used to spread long-form data into wide-form data. This function takes arguments in the following form: spread(key=key_column,value=value_column), where the two arguments name the existing columns that hold the new column names and their values.
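A minimal round trip on a made-up wide data frame shows how the two functions mirror each other. The names wide, long, back, term, and score are illustrative. In recent versions of tidyr, pivot_longer() and pivot_wider() are the recommended replacements, but gather() and spread() still work.

```r
library(tidyr)

wide <- data.frame(id     = c(1, 2),
                   fall   = c(3.5, 3.9),
                   spring = c(3.6, 3.7))

# Wide to long: the column names fall/spring become values of `term`
long <- gather(wide, key = "term", value = "score", fall, spring)

# Long to wide: the values of `term` become column names again
back <- spread(long, key = term, value = score)

long
back
```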
Load the data using readr::read_csv() with the path https://willythewoo.github.io/WillyTheWoo/workshop/data/acs_sample.csv
The column city contains both the state and city name and sometimes has two or more states. Using the tools you have learned today, create a column named metro that contains only the city name and a column named states that contains all the states the metropolitan area is in. This includes deleting the original state column. Call this new data frame d11.
# Loading data
d11 <- read.csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/acs_sample.csv")
# Separate the column city into metro and states
d11<-d11%>%
separate(city,sep=",", into=c("metro","states"))%>%
select(-state)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 161 rows
## [89, 130, 138, 217, 222, 223, 240, 349, 390, 398, 477, 482, 483, 500, 609,
## 650, 658, 737, 742, 743, ...].
# Look at the result
head(d11)
## year metro states population female age hhincome
## 1 2005 akron oh 700649 1.514134 37 55000
## 2 2005 albany-schenectady-troy ny 846357 1.517147 38 60700
## 3 2005 albuquerque nm 798722 1.514087 36 50900
## 4 2005 alexandria la 147128 1.528448 36 40100
## 5 2005 allentown-bethlehem-easton pa-nj 786034 1.517561 40 61000
## 6 2005 altoona pa 125882 1.514830 40 42150
## sanctuary
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
d11, you realized that the variable female was coded 1 for male and 2 for female. The variable was created by taking the average value of this variable from everyone in the city. Hence this variable does not represent the percentage of females in the city. Correct this data error from d11 so the variable represents the percentage of females. Call this new data frame d12.
Since female was coded as the average of 1 and 2, and we need it to be the average of 0 and 1, we can simply subtract 1 from female to achieve what is asked.
d12<-d11%>%
mutate(female = female-1)
head(d12)
## year metro states population female age hhincome
## 1 2005 akron oh 700649 0.5141343 37 55000
## 2 2005 albany-schenectady-troy ny 846357 0.5171471 38 60700
## 3 2005 albuquerque nm 798722 0.5140870 36 50900
## 4 2005 alexandria la 147128 0.5284480 36 40100
## 5 2005 allentown-bethlehem-easton pa-nj 786034 0.5175613 40 61000
## 6 2005 altoona pa 125882 0.5148302 40 42150
## sanctuary
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
d12 so you only have cities that are in the upper half of the household income distribution. Call this new data frame d13. (Hint: You can use either arrange() or median() to figure out the median yourself.)
d13<-d12%>%filter(hhincome >= 58000)
str(d13)
## 'data.frame': 1836 obs. of 8 variables:
## $ year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
## $ metro : chr "albany-schenectady-troy" "allentown-bethlehem-easton" "anchorage" "ann arbor" ...
## $ states : chr " ny" " pa-nj" " ak" " mi" ...
## $ population: int 846357 786034 351851 343858 4947012 1464309 2649586 224877 NA 4458891 ...
## $ female : num 0.517 0.518 0.498 0.501 0.504 ...
## $ age : int 38 40 34 33 34 32 37 44 32 37 ...
## $ hhincome : int 60700 61000 69360 66000 60000 58500 69000 66000 62000 74300 ...
## $ sanctuary : int NA NA NA NA NA 0 0 NA NA 0 ...
If you tried to compute the median yourself, you might have noticed that the output is NA or empty. This happens because some household income data are missing. To correct this, add na.rm = TRUE as an argument to median().
d13<-d12%>%filter(hhincome >= median(d12$hhincome, na.rm = TRUE))
str(d13)
## 'data.frame': 1836 obs. of 8 variables:
## $ year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
## $ metro : chr "albany-schenectady-troy" "allentown-bethlehem-easton" "anchorage" "ann arbor" ...
## $ states : chr " ny" " pa-nj" " ak" " mi" ...
## $ population: int 846357 786034 351851 343858 4947012 1464309 2649586 224877 NA 4458891 ...
## $ female : num 0.517 0.518 0.498 0.501 0.504 ...
## $ age : int 38 40 34 33 34 32 37 44 32 37 ...
## $ hhincome : int 60700 61000 69360 66000 60000 58500 69000 66000 62000 74300 ...
## $ sanctuary : int NA NA NA NA NA 0 0 NA NA 0 ...
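The effect of na.rm can be checked on a short made-up vector with one missing value (income is an illustrative name, not a column from the exercise data):

```r
income <- c(55000, NA, 61000, 58000)

with_na    <- median(income)                # NA, because one value is missing
without_na <- median(income, na.rm = TRUE)  # 58000, computed from the 3 known values

with_na
without_na
```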
d13, which is not a good thing. Instead, you decide to create a logical variable from d12 that lets you conditionally do operations on the data. Create a new variable called upper that is TRUE if the city’s household income is at least $58,000 and FALSE otherwise. Save this new data frame as d14. (Hint: You can use the function ifelse().)
d14<-d12%>%
mutate(upper=ifelse(hhincome>=58000,TRUE,FALSE))
str(d14)
## 'data.frame': 3759 obs. of 9 variables:
## $ year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
## $ metro : chr "akron" "albany-schenectady-troy" "albuquerque" "alexandria" ...
## $ states : chr " oh" " ny" " nm" " la" ...
## $ population: int 700649 846357 798722 147128 786034 125882 237767 351851 343858 NA ...
## $ female : num 0.514 0.517 0.514 0.528 0.518 ...
## $ age : int 37 38 36 36 40 40 34 34 33 39 ...
## $ hhincome : int 55000 60700 50900 40100 61000 42150 44734 69360 66000 40900 ...
## $ sanctuary : int NA NA NA NA NA NA NA NA NA NA ...
## $ upper : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
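Missing incomes carry through ifelse(): when the test is NA, the result is NA rather than TRUE or FALSE. A small sketch on a made-up vector (income, flag, and direct are illustrative names):

```r
income <- c(55000, 60700, NA)

# ifelse() returns NA wherever the test itself is NA
flag <- ifelse(income >= 58000, TRUE, FALSE)

# For a TRUE/FALSE result, the comparison alone gives the same vector
direct <- income >= 58000

flag    # FALSE  TRUE    NA
direct  # FALSE  TRUE    NA
```

This is why rows with a missing income end up with a missing upper value and may need to be filtered out with is.na() before grouping.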
d14. Name the new variable up_age and store this new data frame as d15.
d15<-d14%>%
mutate(up_age=ifelse(age>=37,TRUE,FALSE))
str(d15)
## 'data.frame': 3759 obs. of 10 variables:
## $ year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
## $ metro : chr "akron" "albany-schenectady-troy" "albuquerque" "alexandria" ...
## $ states : chr " oh" " ny" " nm" " la" ...
## $ population: int 700649 846357 798722 147128 786034 125882 237767 351851 343858 NA ...
## $ female : num 0.514 0.517 0.514 0.528 0.518 ...
## $ age : int 37 38 36 36 40 40 34 34 33 39 ...
## $ hhincome : int 55000 60700 50900 40100 61000 42150 44734 69360 66000 40900 ...
## $ sanctuary : int NA NA NA NA NA NA NA NA NA NA ...
## $ upper : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ up_age : logi TRUE TRUE FALSE FALSE TRUE TRUE ...
summarize(). Sort this data set by population descendingly.
d15%>%
filter(!is.na(upper),!is.na(up_age),!is.na(female))%>%
group_by(upper,up_age)%>%
summarise(female=mean(female,na.rm=TRUE))
## # A tibble: 4 x 3
## # Groups: upper [2]
## upper up_age female
## <lgl> <lgl> <dbl>
## 1 FALSE FALSE 0.507
## 2 FALSE TRUE 0.511
## 3 TRUE FALSE 0.503
## 4 TRUE TRUE 0.508
ggplot(d15%>%filter(!is.na(upper),!is.na(up_age),!is.na(female)),aes(x=population,y=female))+
geom_smooth(se=FALSE)+
facet_wrap(upper~up_age)
## Warning: Removed 151 rows containing non-finite values (stat_smooth).
n() inside summarize()).
d15%>%
filter(!is.na(upper),!is.na(up_age),!is.na(female))%>%
group_by(upper,up_age)%>%
summarise(female=mean(female,na.rm=TRUE),
count=n())
## # A tibble: 4 x 4
## # Groups: upper [2]
## upper up_age female count
## <lgl> <lgl> <dbl> <int>
## 1 FALSE FALSE 0.507 964
## 2 FALSE TRUE 0.511 854
## 3 TRUE FALSE 0.503 859
## 4 TRUE TRUE 0.508 977
tidyverse. The three functions in i. are the normal distribution, binomial distribution, and uniform distribution respectively.
rnorm(), rbinom(), and runif(), create a data frame with variables \(X\), \(D\), and \(E\) such that \(X\sim U[-2,2]\), \(D\sim Binom(0.3)\), and \(E\sim N(0,1)\) with 1,000 observations. Call this data frame d21. In rbinom(), set the argument size to 1.
d21<-data.frame(
X=runif(1000,-2,2),
D=rbinom(1000,1,0.3),
E=rnorm(1000,0,1)
)
d21, create \(Y0\) and \(Y1\) such that \(Y0=X+E\) and \(Y1=X+0.3+E\). Then create a new variable \(Y\) such that \(Y=Y1\) if \(D=1\) and \(Y=Y0\) if \(D=0\). Store this new data frame as d22.
d22<-d21%>%
mutate(Y0 = X + E,
Y1 = X + 0.3 + E,
Y = ifelse(D==1,Y1,Y0))
d23.
d23 <- d22%>%
mutate(ID=1:1000)
id_gpa with ID numbers and gpa.
id_gpa<-data.frame(ID=1:1000,
gpa=rnorm(1000,3.5,0.15))
Now run ?left_join to read the documentation of this function. Merge id_gpa into d23 so you have the gpa of all the observations in d21. Store this new data frame as d24.
d24<-d23%>%
left_join(id_gpa,by="ID")
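To see what left_join() does when the two tables do not match perfectly, here is a sketch with two small made-up data frames (left, right, and joined are illustrative names). Every row of the left table is kept, unmatched rows get NA, and rows that exist only in the right table are dropped.

```r
library(dplyr)

left  <- data.frame(ID = c(1L, 2L, 3L), Y = c(0.1, 0.2, 0.3))
right <- data.frame(ID = c(1L, 3L, 4L), gpa = c(3.4, 3.6, 3.8))

joined <- left %>%
  left_join(right, by = "ID")

joined
# ID 2 has no match in `right`, so its gpa is NA;
# ID 4 appears only in `right`, so it is not in the result
```

In the exercise, both tables contain IDs 1 to 1000, so every row finds a match and no NA values are introduced.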
ggplot(d24,aes(x=gpa,y=Y))+
geom_smooth(se=FALSE)+
facet_wrap(~D)