As we discussed in the first workshop, I started this workshop to increase people’s general understanding of data. This includes the structure of data, different types of graphs, and exploratory analysis versus causal analysis.
For the purpose of this workshop, I will only discuss these topics superficially to give you a basic understanding. The first thing we should talk about is the structure of data. For this workshop, I will be using data collected from the workshop signup. You can load the data set with the code below and work on it as we go.
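# Load the tidyverse packages (readr, dplyr, tidyr, ggplot2, ...)
library(tidyverse)
# Read the workshop signup data directly from the web
data <- read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data2.csv")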
Recall that a data frame looks like an Excel sheet. In a data frame, each column is a variable, sometimes called an attribute. Each row represents an observation in the data. This means each row is an instance that has attributes described by the data.
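To make the rows-and-columns picture concrete, here is a minimal sketch using a made-up data frame. The name signups and its contents are hypothetical and are not part of the workshop data.

```r
library(dplyr)

# Hypothetical toy data: three sign-ups (rows) described by two attributes (columns).
signups <- tibble(
  student = c("A", "B", "C"),
  major   = c("Economics", "Sociology", "History")
)

nrow(signups)   # number of observations (rows)
ncol(signups)   # number of variables (columns)
names(signups)  # the variable names
```

You can run the same three calls on the workshop data to see how many observations and variables it has.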
What this distinction means is that our data can be in either long form or wide form. Long-form data stores multiple instances (rows) of the same observation. Wide-form data does the opposite: each observation has only a single instance.
When you transform a data set from long to wide, you spread the multiple instances into different attributes. This means the data will have many attributes, with each attribute representing the value of an instance.
When you transform a data set from wide to long, you gather the multiple attributes (that belong to the single instance) of each observation and put them into one attribute.
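The two transformations can be sketched on a tiny made-up table. This uses gather() and spread() from tidyr, which are introduced further down. The table wide and its column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per student, one column per week rating.
wide <- tibble(
  student = c("A", "B"),
  week1   = c(5, 3),
  week2   = c(4, 2)
)

# Wide to long: the week columns are gathered into a key/value pair.
long <- wide %>%
  gather(key = "week", value = "rating", week1:week2)

# Long to wide: the key column is spread back into separate columns.
wide_again <- long %>%
  spread(key = week, value = rating)

dim(wide)  # rows and columns of the wide table
dim(long)  # the long table has more rows and a fixed set of columns
```

Going from wide to long trades columns for rows: every student now appears once per week instead of once overall.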
From my discussion above, you are probably thinking that the line between long and wide seems blurry. You would be correct. A data set can actually be long and wide at the same time depending on how you want to define the observation and instance.
Under tidyverse, we call these observations “key” and these instances “value”. Key tells R which attribute(s) uniquely identify your observations. For example, in our workshop data, you can see the column good and the columns bad1 through bad9. The variable good stores information on the observation’s favorite week(s) of the perspectives class. The variables bad1 through bad9 store information on the observation’s least favorite week(s) of class (not ranked).
## # A tibble: 5 x 2
##   TimeStamp                    good
##   <chr>                        <chr>
## 1 2020/06/12 3:50:53 pm GMT-5  Feminism
## 2 2020/06/16 7:50:24 pm GMT-5  Feminism
## 3 2020/06/13 6:35:45 am GMT-5  Economics
## 4 2020/06/16 7:50:24 pm GMT-5  Feminism
## 5 2020/06/14 10:44:53 pm GMT-5 Sociology
As you can see, if we look at the variable good, the data set is in long form, since every row represents a unique combination of student and favorite week. On the other hand, if we look at bad1 through bad9, every row for a given student has the same values in bad1 through bad9, so with respect to these columns the data set is in wide form.
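One way to check this kind of structure is to count rows and distinct values per student. The sketch below uses a hypothetical miniature of the signup data (the values in toy are made up) and assumes, as the output above suggests, that TimeStamp identifies a student's signup.

```r
library(dplyr)

# Hypothetical stand-in for the signup data: student "s1" picked two
# favorite weeks, so "s1" appears in two rows.
toy <- tibble(
  TimeStamp = c("s1", "s1", "s2"),
  good      = c("Feminism", "Economics", "Sociology"),
  bad1      = c("History", "History", "Psychology"),
  bad2      = c("Psychology", "Psychology", NA)
)

check <- toy %>%
  group_by(TimeStamp) %>%
  summarise(
    rows          = n(),               # more than 1 row: long form in good
    distinct_good = n_distinct(good),  # good varies within a student
    distinct_bad1 = n_distinct(bad1)   # bad1 is constant within a student
  )

check
```

A student with several rows but only one distinct bad1 value is exactly the mixed long-and-wide situation described above.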
gather() and spread()
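tidyr::gather() (tidyr is part of the tidyverse) is used to gather wide-form data into long-form data. This function takes arguments in the following form: gather(key = "string", value = "string", columns). Here key names the new column that stores the old column names, value names the new column that stores their values, and columns selects the columns to gather. tidyr::spread() is used to spread long-form data into wide-form data. This function takes arguments in the following form: spread(key = column, value = column). Here key is the existing column whose values become the new column names, and value is the existing column whose values fill them.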
For example, we can gather bad1 through bad9 into a single variable.
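data %>%
  # stack bad1..bad9 into one column; "bad" records the source column
  gather(key = "bad", value = "bad_week", bad1:bad9) %>%
  filter(!is.na(bad_week)) %>%  # drop the empty slots
  select(bad, bad_week) %>%
  sample_n(5)                   # show five random rows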
## # A tibble: 5 x 2
##   bad   bad_week
##   <chr> <chr>
## 1 bad1  History
## 2 bad1  Psychology
## 3 bad3  Feminism
## 4 bad1  History
## 5 bad1  Psychology
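As a side note, gather() has an na.rm argument that can replace the separate filter step. The sketch below compares the two on a hypothetical miniature of the bad columns. The table toy and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the bad1..bad9 columns, with only two slots.
toy <- tibble(
  student = c("s1", "s2"),
  bad1    = c("History", "Psychology"),
  bad2    = c("Feminism", NA)
)

# Version A: gather, then filter out the empty slots.
a <- toy %>%
  gather(key = "bad", value = "bad_week", bad1:bad2) %>%
  filter(!is.na(bad_week))

# Version B: let gather() drop the missing values itself.
b <- toy %>%
  gather(key = "bad", value = "bad_week", bad1:bad2, na.rm = TRUE)

identical(a$bad_week, b$bad_week)  # compare the two results
```

Either version is fine. The explicit filter makes the step more visible, while na.rm keeps the pipeline shorter.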
Similarly, we can use spread() to transform data from long to wide with good.
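data %>%
  group_by(TimeStamp) %>%
  mutate(num = row_number()) %>%                      # number the rows within each signup
  spread(key = "num", value = "good", sep = "_") %>%  # one num_* column per row number
  .[, 26:30] %>%                                      # column positions specific to this data set
  head(n = 5)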
## # A tibble: 5 x 5
##   num_4          num_5          num_6          num_7          num_8
##   <chr>          <chr>          <chr>          <chr>          <chr>
## 1 Anthropology   Anthropology   Anthropology   Anthropology   Anthropology
## 2 History        History        History        History        History
## 3 Feminism       Feminism       Feminism       Feminism       Feminism
## 4 Post Structur… Post Structur… Post Structur… Post Structur… Post Structu…
## 5 Post Structur… Post Structur… Post Structur… Post Structur… Post Structu…
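The reason for numbering the rows within each TimeStamp is that spread() needs a key column to build the new column names from. The sketch below shows the same idea on a hypothetical two-column table (toy and its values are made up), where the result is small enough to read in full.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: student "s1" listed two favorite weeks.
toy <- tibble(
  TimeStamp = c("s1", "s1", "s2"),
  good      = c("Feminism", "Economics", "Sociology")
)

wide <- toy %>%
  group_by(TimeStamp) %>%
  mutate(num = row_number()) %>%  # 1, 2, ... within each student
  ungroup() %>%
  spread(key = num, value = good, sep = "_")

wide
```

Each student now occupies one row, with the favorite weeks spread over the columns num_1 and num_2. A student with fewer favorites than the maximum gets NA in the unused columns.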
In ggplot2, we briefly talked about different types of graphs: scatter plot, jitter plot, line plot, fitted-line plot, bar chart, histogram, box plot, etc. Each of these graphs has a geom_...() function associated with it.
[Example plots not shown. The rendering warnings indicate point, smoothed-line, line, density, and box-plot layers, each dropping 346 rows with missing values.]
For more information on how to use these functions, check out the ggplot2 cheat sheet
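As a rough illustration of how graph types map to geom functions, here is a minimal sketch. It uses mpg, a data set that ships with ggplot2, rather than the workshop data, and the object names are my own.

```r
library(ggplot2)

# mpg ships with ggplot2; displ is engine size, hwy is highway mileage.
base <- ggplot(mpg, aes(x = displ, y = hwy))

scatter  <- base + geom_point()              # scatter plot
jittered <- base + geom_jitter(width = 0.1)  # scatter plot with small random noise
fit_line <- base + geom_point() +
  geom_smooth(method = "lm", se = FALSE)     # fitted-line plot
boxes    <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()   # box plot
histo    <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)   # histogram
bars     <- ggplot(mpg, aes(x = class)) + geom_bar()                # bar chart

scatter  # printing a plot object draws it
```

Swapping the geom while keeping the same data and aes() mapping is usually all it takes to change the type of graph.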
Exploratory analysis is similar to what you probably imagine big data does. Essentially, you fit the data to a model or graph the data in order to get a sense of the underlying relationship.
Many things you see on the internet are simply graphs from exploratory analysis. Notice that even though exploratory analysis can reveal an underlying relationship, it can also be used to mislead.
For example, if we graph the precipitation on the y-axis and the probability of you having your umbrella with you on the x-axis, we will probably get a positive relationship. However, this does not mean that bringing an umbrella causes higher precipitation. For more spurious and hilarious correlations, click here.
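The umbrella example can be simulated. In the sketch below, every number is made up: a hypothetical cloudiness variable drives both the umbrella probability and the precipitation, and by construction neither of the two affects the other.

```r
set.seed(42)  # make the simulation reproducible

n <- 1000
# Hypothetical common cause: how cloudy each morning looks (0 = clear, 1 = overcast).
cloudiness <- runif(n)

# Both variables depend on cloudiness, but not on each other.
umbrella_prob <- pmin(pmax(cloudiness + rnorm(n, sd = 0.2), 0), 1)
precipitation <- 10 * cloudiness + rnorm(n, sd = 1)

# Exploratory step 1: the raw correlation between the two variables.
raw_cor <- cor(umbrella_prob, precipitation)
raw_cor

# Exploratory step 2: fit a simple linear model, precipitation on umbrella.
fit <- lm(precipitation ~ umbrella_prob)
coef(fit)
```

The correlation and the fitted slope both come out positive, even though the simulation contains no effect of umbrellas on rain. That gap between what the graph shows and what actually generated the data is what makes an exploratory result easy to misread.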
Causal analysis is what social scientists try to do. It is often similar to an exploratory analysis, but it requires much more attention to logic and reasoning. For this workshop, I will not delve into how to do causal analysis, as that takes many actual classes and years of training.
This is the end of this workshop. I hope it helped you clear some things up. There is no homework from this workshop, but I highly recommend looking at the spurious correlation website!
Click here to continue to the next workshop: Introduction to tidyverse