Workshop 1

What is R?

R is an object-oriented statistical language that is based on the language S, and the R interpreter itself is written largely in C. Hence most of what you do in R ultimately runs as compiled C code on your computer. Cool, right?

Anyways, to understand R, there are some basics you need to know.

Data/Value types

Values

Before we talk about all the bells and whistles in R, let’s first talk about values. There are 6 types of values in R and in most languages:

1. Integer

An integer is a whole number that does not have decimals.

class(as.integer(1))
## [1] "integer"

2. Numeric/Float/Double

A numeric is a number that can be any number with or without decimals. Note that an integer can be transformed into a numeric without losing information while a numeric transformed into an integer can lose the information after the decimals. For example,

as.integer(2.2)
## [1] 2
as.numeric(2)
## [1] 2
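As a small sketch of the difference between the two number types (the variable names here are made up for illustration): a number typed on its own is stored as numeric, the L suffix asks for an integer, and as.integer() drops the decimals rather than rounding.

```r
# A plain number is stored as numeric (double) by default
x_num <- 2
# The L suffix asks R for an integer instead
x_int <- 2L

class(x_num)   # "numeric"
class(x_int)   # "integer"

# as.integer() truncates toward zero; it does not round
truncated <- as.integer(c(2.9, -2.9))
truncated      # 2 -2

# round() is the function to reach for when you do want rounding
rounded <- round(c(2.9, -2.9))
rounded        # 3 -3
```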

3. Logical/Boolean

A Boolean [Boo-Lee-en] or logical value takes on either TRUE or FALSE. If you make a logical value numeric or integer, you will get 1 for a true value and a 0 for a false value. You can also make 0 into FALSE and any non-zero value into TRUE. For example,

as.logical(0)
## [1] FALSE
as.logical(5)
## [1] TRUE
as.integer(TRUE)
## [1] 1
as.numeric(FALSE)
## [1] 0

4. Character/String

String/character values are just text. Any text value is a string. You can create strings with "". A number can also be read as a string. If you see a number read as a string, you can convert it back to a number using as.numeric().

class("1")
## [1] "character"
class(as.numeric(as.character(1)))
## [1] "numeric"
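Here is a minimal sketch of what happens when the text is not actually a number (the variable names are made up for illustration). The conversion does not fail; it quietly hands you an NA together with a warning, so it is worth checking the result.

```r
# A number stored as text converts cleanly
price <- as.numeric("3.50")
price          # 3.5

# Text that is not a number becomes NA (R also prints a warning,
# which is silenced here to keep the output tidy)
not_a_number <- suppressWarnings(as.numeric("three"))
not_a_number   # NA

# Arithmetic only works after the conversion
total <- as.numeric("2") + 1
total          # 3
```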

5. Factors

Factors are a special type of value: categorical strings with an order associated with them. You can specify the order when making a factor by using the levels argument inside the function factor(). In R 4.0.0 and above, string variables are NOT converted to factors by default, while in older versions of R they are.

When the default is stringsAsFactors = TRUE, the levels are ordered alphabetically. To check your default setting in an older version of R, call default.stringsAsFactors() in your console; this function is deprecated in more recent versions of R.

names<-c("Willy","Caroline","Aila")
name<-factor(names,levels=c("Aila","Willy","Caroline"))
name
## [1] Willy    Caroline Aila    
## Levels: Aila Willy Caroline
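To see what the "order" of a factor means in practice, the sketch below reuses the same three names and looks at the integer codes R stores behind each level. The object names codes and name_default are made up for illustration.

```r
names <- c("Willy", "Caroline", "Aila")

# Levels given explicitly: Aila = 1, Willy = 2, Caroline = 3
name <- factor(names, levels = c("Aila", "Willy", "Caroline"))
levels(name)          # "Aila" "Willy" "Caroline"

# Behind the scenes each value is stored as the position of its level
codes <- as.integer(name)
codes                 # 2 3 1

# No levels given: R sorts the unique values alphabetically
name_default <- factor(names)
levels(name_default)  # "Aila" "Caroline" "Willy"
```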

6. NA (Special case)

NA is tricky to deal with because it is not actually a type of value. Instead of representing a value, NA represents the absence of a value. Arithmetic or a comparison that involves NA will give you NA. If you compare two actual values, you get TRUE or FALSE, but if one of them is NA, you get NA. For example,

2+2
## [1] 4
2+TRUE
## [1] 3
2==2
## [1] TRUE

6. NA (Special case) (Cont.)

2+NA
## [1] NA
2==NA
## [1] NA
TRUE+NA
## [1] NA
NA==NA
## [1] NA

6. NA (Special case) (Cont.)

To check whether something is NA, you can use is.na() or anyNA().

is.na(NA)
## [1] TRUE
anyNA(c(NA,2,1))
## [1] TRUE
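Because NA spreads through arithmetic, summary functions such as sum() and mean() return NA as soon as one element is missing. A minimal sketch, assuming a made-up vector called scores, of how the na.rm argument tells them to skip the missing values:

```r
scores <- c(NA, 2, 1)

sum(scores)                    # NA, because one value is unknown

# na.rm = TRUE removes the NAs before the calculation
total <- sum(scores, na.rm = TRUE)
total                          # 3
average <- mean(scores, na.rm = TRUE)
average                        # 1.5

# is.na() works element by element, so its result can be counted
n_missing <- sum(is.na(scores))
n_missing                      # 1
```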

As you might have noticed from the examples above, you can do the basic operations with values using + , - , * , / and compare the sizes of the values with ==(two equal signs), !=, < , <= , > , >=.

This concludes our introduction to values.

Pop quiz

  1. When dealing with string data that are numbers, the best way is to transform it into numeric/double using as.numeric()

  2. Which of the following expressions does not return a logical value? is.na(NA), as.logical(TRUE+TRUE), TRUE==TRUE, TRUE+FALSE

is.na(NA) \(\Rightarrow\) TRUE

as.logical(TRUE+TRUE) \(\Rightarrow\) TRUE

TRUE==TRUE \(\Rightarrow\) TRUE

TRUE+FALSE \(\Rightarrow\) 1

Object types

Objects

R, like I said, is an object-oriented language, so it is crucial that we understand what objects are. Objects are the things we store values in; they are what we use to tell the computer what we are trying to do and what we are doing it to. Starting from the very basics, we have the following objects. Before you continue, copy, paste, and run the following chunk of code in your RStudio console.

vector1<-1
vector2<-2
vector3<-c(1,2,4)
matrix1<-matrix(c(1,2,3,4),nrow=2)
dataframe1<-data.frame(matrix1)

[Atomic] Vectors

Vectors are the basic things we store our values in. A vector is an ordered set of values. To create a vector, write the name of the vector you want to create, put the assignment operator “<-” next to it, and then start your vector with a “c” followed by a pair of parentheses with the values in the vector separated by commas. For example, myvector <- c(1,2,3) will give you

## [1] 1 2 3

As you can see from the code chunk I asked you to run, there is a vector called vector1 and a vector called vector2. RStudio displays both of them as single values, which shows you that R treats a single value as a vector with one element even when you do not wrap it in c(). vector3 shows you what a vector with multiple values looks like.

If you were versed in linear algebra or matrix algebra at some point in your life, you’re probably thinking, are the vectors vertical or horizontal?

[Atomic] Vectors (Cont.)

And the answer is: Vertical! However, the only time the direction of your vector matters is when you are merging it with a data set [adding a column].

In that case, you will use bind_cols() [recommended] or cbind() to do so.

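If you want to append one data set to another, you can use bind_rows() [recommended] or rbind(). With rbind(), the two data sets need to have the same column names in order to append without error; bind_rows() matches columns by name and fills any column that is missing from one data set with NA.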
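As a rough sketch of adding a column and appending rows, here is the base R version using cbind() and rbind(), which needs no extra packages. The data frames scores and more_scores and their values are made up for illustration; bind_cols() and bind_rows() from the tidyverse are used in the same spirit.

```r
scores <- data.frame(name = c("Aila", "Willy"), grade = c(90, 85))

# Add a column: the vector needs one value per existing row
age <- c(24, 27)
with_age <- cbind(scores, age)
names(with_age)    # "name" "grade" "age"

# Append rows: the column names of both data frames match
more_scores <- data.frame(name = "Caroline", grade = 88)
all_scores <- rbind(scores, more_scores)
dim(all_scores)    # 3 rows, 2 columns
```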

To create a sequential vector, you can use : to create a sequence that increases by 1. If you want different gaps, you can use seq(). If you want repeated values, you can use rep(). For example,

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10,2)
## [1] 1 3 5 7 9

[Atomic] Vectors (Cont.)

rep(c(1,2,3),5)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
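Both seq() and rep() take named arguments that change how the vector is built. A short sketch of three common ones (the variable names are made up for illustration):

```r
# by sets the gap between neighbouring values
quarters <- seq(0, 1, by = 0.25)
quarters        # 0.00 0.25 0.50 0.75 1.00

# length.out sets how many values you want instead of the gap
four_steps <- seq(1, 10, length.out = 4)
four_steps      # 1 4 7 10

# each repeats every element before moving on to the next one
repeated <- rep(c(1, 2, 3), each = 2)
repeated        # 1 1 2 2 3 3
```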

Important note: A vector can only contain one value type. If you combine different types, R converts everything to a single type: logical values become numbers when mixed with numbers, and everything becomes a string when mixed with strings. NA’s are not subject to this rule because NA is not a value, it is the absence of a value.

c(rep(c(1,2,3),5),TRUE)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
c(rep(c(1,2,3),5),"TRUE")
##  [1] "1"    "2"    "3"    "1"    "2"    "3"    "1"    "2"    "3"    "1"   
## [11] "2"    "3"    "1"    "2"    "3"    "TRUE"
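The conversion always moves toward the more general type: logical, then integer, then numeric, then character. A minimal sketch that checks each step with class() (the variable name mixed is made up for illustration):

```r
class(c(TRUE, 2L))      # "integer": TRUE becomes 1
class(c(2L, 2.5))       # "numeric"
class(c(2.5, "text"))   # "character"

# With a string present, every element is turned into text
mixed <- c(TRUE, 2L, 2.5, "text")
mixed                   # "TRUE" "2" "2.5" "text"

# NA takes on whatever type the rest of the vector has
class(c(1, NA))         # "numeric"
class(c("a", NA))       # "character"
```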

Data Frames

If you use the as.matrix() or matrix() functions on a vector, you can transform it into a matrix. However, unless you are doing matrix operations to solve equations by hand, you are not going to need it. Instead, to deal with rectangular data, we have what is called a data frame. Data frames come in many types: a data.frame, a data.table, a tibble, etc. At this stage, all you need to know is that a data frame looks like what you would imagine an Excel sheet to look like. It has columns storing different types of values like name, salary, grades, etc., and each row in the data frame represents one observation of your data. To create a simple data frame, we can use either a matrix or multiple vectors in conjunction with the data.frame() function.

Data Frames (Cont.)

df<-data.frame(vector3,myvector)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ vector3 : num  1 2 4
##  $ myvector: num  1 2 3

To add a vector to your data frame, use bind_cols() to add it as a column or bind_rows() to add it as a row. Because vectors, matrices, and data frames store your “data”, they also share the name “data objects”, even though data objects that are not matrices or data frames are commonly referred to as just atomic vectors.
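To make the Excel sheet picture concrete, here is a sketch of a data frame whose columns hold different value types. The data frame students and its values are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the result is the same in old and new versions of R.

```r
students <- data.frame(
  name   = c("Aila", "Willy", "Caroline"),
  grade  = c(90, 85, 88),
  passed = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

nrow(students)    # 3 observations (rows)
ncol(students)    # 3 variables (columns)
names(students)   # "name" "grade" "passed"

# Each column is a vector with its own single value type
column_types <- sapply(students, class)
column_types      # "character" "numeric" "logical"
```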

Functions

Functions are objects that take in objects as arguments, do some operations with them, and then spit back an object. Functions are in the general form of function(arguments...) where the arguments of the function are specified by different objects and values for the operation. You might have been wondering what I’ve been doing with class(), str(), as.numeric(), data.frame(), etc. These are all functions that do specific things. If you want to see what a function does, just type in ? before the function you want to check in the console and hit enter/return. For example, type in ?class and see what it says.

The answer is, it tells you what the object type or value type of the thing you just fed it is. class(2) would give you “numeric” and class(TRUE) will give you “logical”.
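Arguments can be passed to a function by position or by name. The sketch below uses seq() and round(), two functions that come with R, to show that both styles give the same result (the variable names are made up for illustration).

```r
# Arguments matched by position: from, to, by
by_position <- seq(1, 10, 3)
by_position                       # 1 4 7 10

# The same call with the arguments named, in a different order
by_name <- seq(by = 3, to = 10, from = 1)
identical(by_position, by_name)   # TRUE

# Naming an argument also makes the call easier to read
rounded <- round(3.14159, digits = 2)
rounded                           # 3.14

# Functions are objects too, so class() works on them
class(seq)                        # "function"
```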

We will learn about more complicated and versatile functions later. As for writing your own functions, I will omit that for now. If need be, I can talk about it later or add it back in here.

Pop quiz

The following is an example of a function. Can you identify each part of the function?

both_odd<-function(num1=2, num2){
  O1<-(as.integer(num1/2)==(num1/2))
  O2<-(as.integer(num2/2)==(num2/2))
  Result<-O1==O2
  return(Result)
}

What should the call both_odd(1) return?

What about both_odd(2,1)?

Syntax and Semantics

How to talk to your computer

Syntax and semantics are how you communicate with your computer in the language you have chosen. Every language (even real languages like English) has a set of rules that tell you how words are arranged [syntax] and what they mean [semantics]. In R, there are only a couple you need to remember:

Assignment operator

<- is your assignment operator. You can use it to assign values to objects. You can also use = for assignment, as in Python, but why would you?

a <- TRUE
b = 1
a==b
## [1] TRUE

Selection operator

[] is your selection operator for data objects [vectors, matrices, data frames]. As you might have realized, commas inside c() separate the elements of a vector in order; inside the selection operator, however, the comma separates the row index from the column index. To select an element of a vector, put the selection operator next to your vector. For example, myvector[2] will give me the second element of the vector myvector. To select something out of a 2D data object (matrices, data frames), simply use the syntax 2Dobject[row,column], where 2Dobject stands in for the name of your object. The first index represents the row you want, and the second index represents the column you want.

Now you might be thinking, “What if I want multiple columns of one row?” This is where vectors come in. Recall that commas in vectors represent ordering, so say you want the second and fourth columns of the 5th row from 2Dobject, you would write 2Dobject[5,c(2,4)]. Remember, you can also use : or seq() or rep() or a combination of all these things as indices in your selection operator.
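The selection rules can be tried out on a small example. In the sketch below, the vector example_vec and the data frame people are made up for illustration; people is modelled on the three-row example printed in this section.

```r
example_vec <- c(10, 20, 30, 40)

example_vec[2]                  # 20
example_vec[c(1, 3)]            # 10 30
example_vec[-1]                 # everything except the first: 20 30 40
example_vec[example_vec > 25]   # logical selection: 30 40

people <- data.frame(
  number = c(1, 2, NA),
  field  = c("economics", NA, "anthropology"),
  name   = c("Willy", "Grace", "Kristi"),
  stringsAsFactors = FALSE
)

people[2, 3]                    # row 2, column 3: "Grace"
people[2, ]                     # leave the column blank to get the whole row
people[, "name"]                # leave the row blank to get the whole column

# Rows 1 and 3, columns picked by name
subset_df <- people[c(1, 3), c("number", "name")]
dim(subset_df)                  # 2 rows, 2 columns
```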

Selection operator (Cont.)

data
##   number        field   name
## 1      1    economics  Willy
## 2      2         <NA>  Grace
## 3     NA anthropology Kristi
data[c(2,3),c(TRUE,FALSE,TRUE)]
##   number   name
## 2      2  Grace
## 3     NA Kristi
class(data[c(2,3),c(TRUE,FALSE,TRUE)])
## [1] "data.frame"

Selection operator (Cont.)

data==2
##      number field  name
## [1,]  FALSE FALSE FALSE
## [2,]   TRUE    NA FALSE
## [3,]     NA FALSE FALSE
class(is.na(data))
## [1] "matrix"
data[is.na(data)]
## [1] NA NA

Column selection operator

$ placed after a data frame will allow you to select a column by its name. If I have a data set of all MAPSS students but only want to view their GPA, I can use mapss$gpa and get only that column. To select multiple columns by their names, you can use the select() function which will be discussed in the next workshop.

data$name
## [1] Willy  Grace  Kristi
## Levels: Grace Kristi Willy
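For comparison, here is a sketch of three ways to pick a column by name. The data frame mapss and its values are invented for illustration. $ and [[ ]] hand back the column as a vector, while single brackets keep it as a one-column data frame.

```r
# Hypothetical data: made-up names and GPAs
mapss <- data.frame(
  name = c("Willy", "Grace", "Kristi"),
  gpa  = c(3.7, 3.9, 3.8),
  stringsAsFactors = FALSE
)

mapss$gpa               # a numeric vector: 3.7 3.9 3.8
mapss[["gpa"]]          # the same vector, with the name given as a string
class(mapss["gpa"])     # "data.frame": single brackets keep the table shape

# Because $ returns a vector, you can select from it right away
second_gpa <- mapss$gpa[2]
second_gpa              # 3.9
```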

AND and OR

To combine logical conditions in R, we can use the AND operator & and the OR operator |. You would normally place one condition on each side of the operator, and it will give you the result of the combined condition.

For example, if we want to see whether 2 is less than 4 AND 4 is less than 4, we can write 2<4&4<4 and we would get the output FALSE. But if we switch to the OR operator and write 2<4|4<4, then we will get the output TRUE. As a special note, & and | work element by element on whole vectors, while && and || are meant for single TRUE/FALSE values, such as the condition of an if statement. && checks the first condition, and if it is FALSE the result is FALSE without the second condition ever being checked, which saves time when the second condition is expensive to compute.

(2<4 & 4<4) | (2<4 | 4<4)
## [1] TRUE
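A short sketch of the difference between the single and double operators (the variable names are made up for illustration). The stop() calls would raise an error if they were ever run, so getting a result back shows that the second condition was skipped.

```r
x <- c(1, 5, 10)

# & and | compare element by element and return one result per element
in_range <- x > 2 & x < 8
in_range      # FALSE TRUE FALSE
at_edges <- x < 2 | x > 8
at_edges      # TRUE FALSE TRUE

# && stops as soon as the first condition is FALSE
short_and <- FALSE && stop("never evaluated")
short_and     # FALSE

# || stops as soon as the first condition is TRUE
short_or <- TRUE || stop("never evaluated")
short_or      # TRUE
```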

Packages

If you think of R as your phone, packages are like the apps you would download and install from the app store. Most popular languages have a vast collection of packages developed either by a company or individual developers. In R’s case, it is both. Great packages allow us to do things more efficiently in R and make more beautiful graphs.

  • Installing packages: To install a package that is not already installed on your computer, run the code install.packages("packagename"). Simply replace the term "packagename" with the name of the package you want to install. Note that you need to make the package name a string/character for the function to work correctly. For example, the main package this workshop series will be using is the tidyverse package. As an example, run the following code in your console. install.packages("tidyverse")

  • Loading packages: At the beginning of each of your R sessions, you need to tell the computer which packages you want to use during the session. This is called loading a package. You can do so by using either library(packagename) or require(packagename). Notice that there are no quotes needed as these functions take your input as a “symbol”. You don’t need to know what symbol means at this stage. Neither function reloads a package that is already loaded. The difference is what happens when the package is not installed: library() stops with an error, while require() gives a warning and returns FALSE so your code keeps running.

  • Direct calls: Sometimes, you may wish to call a function from a package which is “masked” by a function with the same name from a different package. In this case, you will only get the correct call by designating the package to call from. You can do so by using ::. For example, when you run library(tidyverse), you should get a message that says dplyr::filter() masks stats::filter(). If you want to call filter() from stats, you simply have to call stats::filter() instead of just filter().
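The three points above can be tried without installing anything. The sketch below is only an illustration: the package name in missing_pkg is deliberately made up so that loading fails, and stats is used for the direct call because it ships with R.

```r
# stats comes with R, so this check should succeed on any installation
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats     # TRUE

# Hypothetical package name, used only to show what failure looks like
missing_pkg <- "aPackageThatDoesNotExist123"

# require() returns FALSE when the package cannot be loaded
loaded <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
loaded        # FALSE

# library() stops with an error instead; try() catches it here
result <- try(library(missing_pkg, character.only = TRUE), silent = TRUE)
inherits(result, "try-error")   # TRUE

# :: calls a function from one specific package, masked or not.
# stats::filter() here computes a moving average over three neighbours.
smoothed <- stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))
as.numeric(smoothed)            # NA 2 3 4 NA
```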

Try it yourself!

Practice problems

1. Create a data frame that has all the even numbers from 1 to 1000 in the first column and all the odd numbers from 1 to 1000 in the second column and call this mydf. (Hint: Use either bind_cols() or bind_rows())

2. Using the selection operator [], create the sub-data frame in mydf such that it only contains numbers less than 100 or greater than 900. (Hint: You can put logical values in the selection operator. Try and see how it works)

Practice problems (Cont.)

3. Use the following code [copy and paste into your console] to load ps1.csv into a data frame called data.

Ignore the first two lines if you have already installed and loaded the package tidyverse

install.packages("tidyverse")
require(tidyverse)
data<-read.csv("https://willythewoo.github.io/WillyTheWoo/workshop/data/ps1.csv")

(a). Use the function head() to look at the first couple of rows of the data. What do you see?

(b). How many NA’s are there in this data frame? (Hint: you can sum up a vector using sum())