How do we use data?

As we discussed in the first workshop, I started this workshop to increase people’s general understanding of data. This includes understanding the structure of data, different graphs, exploratory analysis vs causal analysis, etc.

For the purpose of this workshop, I will only discuss these superficially to give you a basic understanding. The first thing we should talk about is the structure of data. For this workshop, I will be using data collected from the workshop signup. You can load the data set with the code below, and work on it as we go.

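# Load the tidyverse (readr, dplyr, tidyr, ggplot2, ...)
library(tidyverse)
# Read the workshop signup data from the workshop website
data <- read_csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/workshop_data2.csv")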

Data Structure: Long vs Wide

Recall that when we look at a data frame, it is like an Excel sheet. In a data frame, each column is a variable, sometimes called an attribute. Each row represents an observation in the data. This means each row is an instance that has attributes described by the data.

What this distinction means is that our data can be in either a long form or a wide form. Long form data means that the data collects multiple instances of the same observation, so the same observation appears on several rows. Wide form data means the opposite. In wide form data, each observation appears on only a single row.

When you transform a data set from long to wide, you spread the multiple instances into different attributes. This means the data will have many attributes, with each attribute representing the value of an instance.

When you transform a data set from wide to long, you gather the multiple attributes that belong to each observation and put them into one attribute, with one row per instance.
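
To make the two shapes concrete, here is a small made-up example. The students and quiz scores are invented for illustration and are not part of the workshop data.

```r
library(tibble)

# Hypothetical example: two students, each with scores from two quizzes.

# Long form: one row per student-quiz combination,
# so each student appears on several rows.
long <- tibble(
  student = c("A", "A", "B", "B"),
  quiz    = c("quiz1", "quiz2", "quiz1", "quiz2"),
  score   = c(90, 85, 75, 80)
)

# Wide form: one row per student,
# with each quiz spread into its own column.
wide <- tibble(
  student = c("A", "B"),
  quiz1   = c(90, 75),
  quiz2   = c(85, 80)
)

dim(long)  # 4 rows, 3 columns
dim(wide)  # 2 rows, 3 columns
```

Both tibbles hold the same four scores. Only the layout differs.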

Key and Value

From my discussion above, you are probably thinking that the line between long and wide seems blurry. You would be correct. A data set can actually be long and wide at the same time depending on how you want to define the observation and instance.

In the tidyverse, gather() and spread() describe this with a “key” and a “value”. The key column records which attribute a value came from, and the value column holds the value itself. The remaining columns identify your observations. For example, in our workshop data, you can see the column good and the columns bad1 through bad9. The variable good stores information on the observation’s favorite week(s) of the perspectives class. The variables bad1 through bad9 store information on the observation’s least favorite week(s) of class [not ranked].

## # A tibble: 5 x 2
##   TimeStamp                    good     
##   <chr>                        <chr>    
## 1 2020/06/12 3:50:53 pm GMT-5  Feminism 
## 2 2020/06/16 7:50:24 pm GMT-5  Feminism 
## 3 2020/06/13 6:35:45 am GMT-5  Economics
## 4 2020/06/16 7:50:24 pm GMT-5  Feminism 
## 5 2020/06/14 10:44:53 pm GMT-5 Sociology

As you can see, if we look at the variable good, the data set is in long form since every row represents a unique combination of student and favorite week. On the other hand, if we look at bad1 through bad9, the data set is in wide form: every row for a given student repeats the same values of bad1 through bad9, with one column per least favorite week.
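
One way to see this mix of long and wide is to count, for each student, how many distinct values a column takes. The sketch below uses a tiny invented tibble (toy, with made-up timestamps t1 and t2) instead of the real workshop data, so treat the names as assumptions.

```r
library(dplyr)
library(tibble)

# Hypothetical data: student t1 listed two favorite weeks, student t2 listed one
toy <- tibble(
  TimeStamp = c("t1", "t1", "t2"),
  good      = c("Feminism", "Economics", "Sociology"),
  bad1      = c("History", "History", "Psychology")
)

# For each student, count rows and distinct values of good and bad1
check <- toy %>%
  group_by(TimeStamp) %>%
  summarise(
    rows   = n(),
    n_good = n_distinct(good),
    n_bad1 = n_distinct(bad1)
  )

check
```

A column that varies within a student (n_good above 1) is stored in long form, while a column that stays constant within a student (n_bad1 equal to 1) is stored in wide form.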

`gather()` and `spread()`

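tidyr::gather() (loaded with the tidyverse) is used to gather wide form data into long form data. This function takes arguments in the following form: gather(key="string",value="string",columns), where key and value name the two new columns and columns selects the columns to gather. tidyr::spread() is used to spread long form data into wide form data. This function takes arguments in the following form: spread(key=column,value=column), where key is the existing column whose values become the new column names and value is the existing column whose values fill them.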

For example, we can gather bad1 through bad9 into a single variable.

data %>%
  gather(key = "bad", value = "bad_week", bad1:bad9) %>%  # stack bad1..bad9 into one column
  filter(!is.na(bad_week)) %>%                            # drop empty slots
  select(bad, bad_week) %>%
  sample_n(5)

## # A tibble: 5 x 2
##   bad   bad_week  
##   <chr> <chr>     
## 1 bad1  History   
## 2 bad1  Psychology
## 3 bad3  Feminism  
## 4 bad1  History   
## 5 bad1  Psychology

Similarly, we can use spread() to transform data from long to wide with good.

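data %>%
  group_by(TimeStamp) %>%
  mutate(num = row_number()) %>%                      # number the rows within each TimeStamp
  spread(key = "num", value = "good", sep = "_") %>%  # one num_* column per row number
  .[, 26:30] %>%
  head(n = 5)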

## # A tibble: 5 x 5
##   num_4          num_5          num_6          num_7          num_8        
##   <chr>          <chr>          <chr>          <chr>          <chr>        
## 1 Anthropology   Anthropology   Anthropology   Anthropology   Anthropology 
## 2 History        History        History        History        History      
## 3 Feminism       Feminism       Feminism       Feminism       Feminism     
## 4 Post Structur… Post Structur… Post Structur… Post Structur… Post Structu…
## 5 Post Structur… Post Structur… Post Structur… Post Structur… Post Structu…
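
If you want to try the round trip without downloading the workshop data, here is a minimal sketch on an invented tibble. The student column and its values are made up for illustration. The last two lines use pivot_longer() and pivot_wider(), which tidyr 1.0.0 and later provide as the newer replacements for gather() and spread().

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per student, least favorite weeks in bad1 and bad2
wide <- tibble(
  student = c("A", "B"),
  bad1    = c("History", "Psychology"),
  bad2    = c("Feminism", NA)
)

# gather(): wide to long. "bad" records the source column, "bad_week" its value.
long <- gather(wide, key = "bad", value = "bad_week", bad1:bad2)

# spread(): long back to wide, one column per distinct value of bad.
wide_again <- spread(long, key = bad, value = bad_week)

# The same two steps with the newer functions (requires tidyr >= 1.0.0)
long2 <- pivot_longer(wide, cols = bad1:bad2,
                      names_to = "bad", values_to = "bad_week")
wide2 <- pivot_wider(long2, names_from = bad, values_from = bad_week)
```

Printing long and wide_again side by side is a quick way to check that nothing was lost in the reshaping.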

Different graphs

In ggplot2, we briefly talked about different types of graphs: scatter plot, jitter plot, line plot, fitted-line plot, bar chart, histogram, box plot, etc. Each of these graphs has a geom_...() function associated with it.

Scatter vs Jitter
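
A minimal sketch of the two geoms, assuming the mpg data set that ships with ggplot2 instead of the workshop data. The jitter widths are arbitrary choices for illustration.

```r
library(ggplot2)

# cty and hwy are whole numbers, so many points land on top of each other
p_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# geom_jitter() adds a small amount of random noise to each point,
# which makes overlapping points visible
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

p_scatter
p_jitter
```

Jittering changes where points are drawn, not the data itself, so it is best suited to getting a feel for how dense the data is.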


Line vs Smooth
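
A minimal sketch of the two geoms, assuming the economics data set that ships with ggplot2 instead of the workshop data. The loess method is my choice for this example, and geom_smooth() can pick a different method on its own.

```r
library(ggplot2)

# geom_line() connects the observations in order of x
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# geom_smooth() draws a fitted curve with a confidence band around it
p_smooth <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE)

p_line
p_smooth
```

The line plot shows every observation, while the smooth plot shows a model's summary of the trend.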


Histogram vs Bar chart
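
A minimal sketch of the two geoms, assuming the mpg data set that ships with ggplot2 instead of the workshop data. The bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

# Histogram: cuts a continuous variable (highway miles per gallon) into bins
# and counts the rows in each bin
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Bar chart: counts the rows in each category of a discrete variable
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

p_hist
p_bar
```

Both graphs show counts. The difference is whether the x-axis is a continuous variable split into bins or a set of categories.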

Box plot vs Density plot
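
A minimal sketch of the two geoms, assuming the mpg data set that ships with ggplot2 instead of the workshop data. The alpha value is an arbitrary choice to keep overlapping curves readable.

```r
library(ggplot2)

# Box plot: median, quartiles and outliers of hwy for each drive type
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# Density plot: a smoothed picture of the full distribution of hwy by drive type
p_density <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4)

p_box
p_density
```

The box plot compresses each group into a few summary numbers, while the density plot keeps the shape of the distribution.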


For more information on how to use these functions, check out the ggplot2 cheat sheet.

Analysis

Exploratory analysis

Exploratory analysis is similar to what you would probably think big data does. Essentially, you fit the data to a model or graph the data in order to get a sense of the underlying relationship.

Many things you see on the internet are simply graphs from exploratory analysis. Notice that even though exploratory analysis can show an underlying relationship, it can also end up being used to mislead.

For example, if we graph precipitation on the y-axis and the probability of you having your umbrella with you on the x-axis, we will probably get a positive relationship. However, this does not mean that bringing an umbrella causes higher precipitation. For more spurious and hilarious correlations, see the spurious correlation website.
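
The umbrella example can be sketched with simulated data. Everything below is invented: I assume a daily rain forecast that drives both whether you carry an umbrella and how much it rains, and the specific distributions are arbitrary.

```r
set.seed(1)
n <- 500

# Hypothetical setup: the forecast chance of rain drives both variables
forecast <- runif(n)                          # forecast chance of rain, between 0 and 1
umbrella <- rbinom(n, 1, forecast)            # 1 if you carry an umbrella that day
rain_mm  <- rpois(n, lambda = 10 * forecast)  # precipitation that day

# Across all days, umbrella and rain move together
cor_all <- cor(umbrella, rain_mm)
cor_all

# Among days with a similar forecast, the relationship should be much weaker,
# because the umbrella never influenced the rain in this simulation
similar <- forecast > 0.45 & forecast < 0.55
cor_similar <- cor(umbrella[similar], rain_mm[similar])
cor_similar
```

Nothing in the simulation lets the umbrella affect the rain, yet the first correlation comes out clearly positive. That is the trap an exploratory graph can set.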

Causal analysis

Causal analysis is what social scientists try to do. It is often similar to an exploratory analysis, but it requires much more attention to logic and reasoning. For this workshop, I will not delve into how to do causal analysis, as that takes many actual classes and years of training.

This is the end of this workshop. I hope it helped you clear some things up. There is no homework from this workshop, but I highly recommend looking at the spurious correlation website!

Next workshop: Introduction to tidyverse


Principles of Data Visualization in R Workshop: Data

Willy Chen

Workshop 3

