1 Foreword

Geographic Information Systems, use of satellite imagery, modeling, and mapping have become essentials in the toolbox of many researchers, natural resource managers, and conservationists. Yet, these resources are often not readily accessible to practitioners around the world. Most restrictive is the limited access to specialized and expensive commercial software. In recent years, several powerful open-source tools for geospatial analysis have emerged. R is a widely used statistical programming language with considerable geospatial analysis capabilities. This training focuses on performing geospatial work in R.

1.1 Learning Outcomes

By the end of these tutorials, you will be able to:

  1. Understand R coding language
  2. Understand how to troubleshoot for R errors
  3. Perform database and spatial queries in R
  4. Create and import spatial data in R
  5. Know basic raster and vector analysis functions in R
  6. Project spatial data in R
  7. Use R to create maps and visualize data
  8. Extract environmental values for spatial points and regions
  9. Fit Generalized Linear Models and predictions over space
  10. Learn to simulate that and the basics of Bayesian statistics.

1.2 About this material

This content has been the product of years of work and collaborations with many amazing people. Some of the code has been provided by them.

We thank Dr. Quingyu Huangm, Dr. Grant Connette, and Dr. Joseph Kolowski for their inputs.

1.3 Required software

During the course, students will use R and R-Studio. R is a programming language and free software environment for statistical computing widely used by researchers. R-Studio is an integrated development environment for R that makes the software more user friendly. Both R and R-Studio are compatible with the most common operating systems and are free to use. Before the course begins, you should have the software installed on your personal computers. Installation files can be accessed at:

-R: https://www.r-project.org

-RStudio Desktop: https://rstudio.com/products/rstudio/

1.4 Data access

All the data used in the tutorials are available in this link

2 An intro to the R programming language

2.1 What is R and why use R?

R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.

Among the reasons to use R you have:

  1. It’s free – open source! If you are a teacher or a student, the benefits are obvious.
  2. It runs on a variety of platforms including Windows, Unix and MacOS.
  3. It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.
  4. It contains advanced statistical routines not yet available in other software.
  5. New add-on “packages” are being created and updated constantly.
  6. It has state-of-the-art graphics capabilities.

R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.

Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.

2.2 R or RStudio? How to get them?

R is available for Linux, MacOS X, and Windows (95 or later) platforms. Software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work on a script file, where you can write and save your work, and other windows that will show up on demand such as the plot tab (Fig. 1).

R console, script and plot tabs.
R console, script and plot tabs.

RStudio is an enterprise-ready professional software tool that integrates with R. It has some nice features beyond the normal R interface, which many users feel it is easier to use than R (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.

RStudio sowfware
RStudio sowfware

2.3 Getting Help

One of the most useful commands in R is ?. At the command prompt (signified by > in your Console window), type ? followed by any command and you will be prompted with a help tab for that command (e.g., ?mean Fig. 3). You can also search through the help tab directly by searching functions on the search bar.

Getting help on R.
Getting help on R.

The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.

2.4 Basic R concepts

There are a few concepts that are important to keep in mind before you start coding. The fact that R is a programming language may deter some users who think “I can’t program”. This should not be the case for two reasons. First, R is an interpreted language, not a compiled one, meaning that all commands typed on the keyboard are directly executed without requiring you to build a complete program like in most computer languages (C, Pascal, . . . ). Second, R’s syntax is very simple and intuitive. For instance, a linear regression can be done with the command lm(y ~ x) which means fitting a linear model with y as the response and x as a predictor.

In R, in order to be executed, a function always needs to be written with parentheses, even if there is nothing within them (e.g., ls()). If you type the name of a function without parentheses, R will display the content of the function.

When R is running, variables, data, functions, results, etc…, are stored in the active memory of the computer in the form of objects that you assign a name.The user can do actions on these objects with operators (arithmetic, logical, comparison, . . . ) and functions (which are themselves objects).

The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).

When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.

It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).


2.5 Starting with R

2.5.1 Setting your working directory

Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:

getwd()

Use setwd() to change or set a new working directory. For instance, you can set your working directory to be in your Documents folder on the C: drive, or in any folder you prefer.

setwd("C:/Documents/R_Practice")

In RStudio, you can do some basic operations, such as setting your working directory, using the point-and-click interface: Session > Set Working Directory > Choose Directory … as shown in Fig. 4.

Setting working directory in RStudio
Setting working directory in RStudio

You can also work with R projects. R projects have the advantage that they enhance organization, collaboration, and reproducibility in R-based work by providing a structured and isolated workspace for your projects. They contribute to a more efficient and transparent workflow, particularly when working on multiple projects or collaborating with others.

The main advantage is that R Projects create a dedicated workspace for your R-related work, keeping all files, scripts, data, and outputs within a well-defined directory. This isolation helps avoid conflicts between projects and ensures a clean environment for each project. R projects use relative paths, allowing you to refer to files and directories within the project without specifying the full path. This enhances portability and makes it easier to share your project with others.

To start a new R Project click File > New Project or directly click on the R Project icon and follow the directions. You can set up your new project on a new folder or an existing folder.

RStudio sowfware
RStudio sowfware

2.6 R Fundamentals

2.6.1 Data Types

There are four fundamental data types in R that you will work with:

  1. Character
  2. Numeric
  3. Integer
  4. Logical

You can check the data type of an object using the function class(). To convert between data types you can use: as.integer(), as.numeric(), as.logical(), as. character().

For instance:

city <- 'Nairobi'
class(city)
## [1] "character"
number <- 3
class(number)
## [1] "numeric"
Integer <- as.integer(number)
class(Integer)
## [1] "integer"
logical <- 3 > 5
logical
## [1] FALSE

2.6.2 Assigning data to objects

Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that values are case sensitive; ‘x’ is not the same as ‘X’!

city <- "front royal"
summary(city)

number <- 2
summary(number)

character <- as.character(2)

Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:

table <- read.csv(file="data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder

Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.

2.6.3 Special characters

The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.

#This is a comment

The semicolon (;) defines a line continuation character so that you can write different commands on the same line of code.

a <- 3; b <- 6; c <- a+b
a
## [1] 3
b
## [1] 6
c
## [1] 9

2.7 R Data Structure

2.7.1 Vectors

Vectors are a basic data structure in R. They contain a sequence of data and can contain characters, numbers, or be TRUE/FALSE values. Remember: If you are unsure or need help, use the help function (e.g., help(seq) or ?seq). Below are several ways to create vectors in R.

1:20
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
rep(1:5,5)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep("A rolling stone gathers no moss",4)
## [1] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
## [3] "A rolling stone gathers no moss" "A rolling stone gathers no moss"

2.7.1.1 Extract subset of values from a vector using [] notation

x <- 1:10
y <- c(1.5, 9, 3.46, 12.2)

To see only part (i.e., a subset) of the data stored in a vector, you need to “ask” R to extract the information you want using square brackets. Most commonly, you will indicate in square brackets the position of the data you want to extract (from beginning of the vector [1] to the Nth slot in the vector [n]). If you only wanted to see the first 5 values of ‘x’, how would you do that? Or only the 2nd and 4th element of y? What if you wanted to see all the records in y, except for the 2nd and 3rd records? There are more ways to use notation to select subsets of data, which we will cover in more detail below.

x
##  [1]  1  2  3  4  5  6  7  8  9 10
(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
x[1:5]
## [1] 1 2 3 4 5
y[c(2,4)]
## [1]  9.0 12.2
y[-c(2,3)]
## [1]  1.5 12.2

2.7.2 Matrices and Dataframes

Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.

Create a matrix with 4 rows and 5 columns using the data from x above. Consult the help (e.g., help(matrix) or ?matrix) to determine the syntax required.

test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9    3    7
## [2,]    2    6   10    4    8
## [3,]    3    7    1    5    9
## [4,]    4    8    2    6   10
# Note, I can assign any name to an object that I create.  Generally it is best to name things in a way that is meaningful, but we'll have some fun here!
superman <- matrix(data = x, nrow = 4, ncol = 5)

2.7.2.1 Subset of Matrices and Dataframes

Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:

#test_matrix(row,column)
test_matrix[1,5]
## [1] 7

In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!

For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.

test_matrix[,5]
## [1]  7  8  9 10

What about the 4th row?

test_matrix[4,]
## [1]  4  8  2  6 10

What if we wanted to view the values in the 3rd row, but only the 2nd and 4th columns?

test_matrix[3,c(2,4)]
## [1] 7 5

What happens to the matrix if we append a character field? Use the cbind() (column bind) command to bind a new column, called ‘countries’. Note that I am not changing the contents of test_matrix. Can you figure out how to do a row bind (hint: use rbind())

countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
##                            countries      
## [1,] "1" "5" "9"  "3" "7"  "United States"
## [2,] "2" "6" "10" "4" "8"  "Pakistan"     
## [3,] "3" "7" "1"  "5" "9"  "Ireland"      
## [4,] "4" "8" "2"  "6" "10" "China"
#Note that I am not changing/overwriting the contents of test_matrix.  I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)

Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?

test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
##   X1 X2 X3 X4 X5     countries
## 1  1  5  9  3  7 United States
## 2  2  6 10  4  8      Pakistan
## 3  3  7  1  5  9       Ireland
## 4  4  8  2  6 10         China
# Have I changed the file type?
class(test_dataframe)
## [1] "data.frame"

Can I rename the column headings?

names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
##   Val1 Val2 Val3 Val4 Val5     Countries
## 1    1    5    9    3    7 United States
## 2    2    6   10    4    8      Pakistan
## 3    3    7    1    5    9       Ireland
## 4    4    8    2    6   10         China
# Also see the colnames() function

Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?

test_dataframe[3,5]
## [1] 9
test_dataframe[,5]
## [1]  7  8  9 10
test_dataframe$Val5[3]
## [1] 9
test_dataframe$Val5
## [1]  7  8  9 10
test_dataframe[,"Val5"]
## [1]  7  8  9 10

You can also use some very simple commands to determine the size of dataframes or matrices.

nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6

You can delete individual objects to clear your working directory (rm(dataset)), or start every script with the following command to make sure you are starting fresh (this is good programming practice):

#rm(list=ls())

2.8 Functions

R functions can be defined as a collection of arguments structured together for carrying out a definite task. Functions have optional input and output arguments that return a value. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions within base packages or other downloadable packages.

Most functions have optional arguments or are given default values (in the function’s help document, under the ‘Usage’ section, the optional arguments are given a default value following the “=” symbol). When you don’t specify the optional arguments, they will take the default values. Functions normally can be called using the following format: function_name(input_data, argument1, argument2.)

print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
y <- matrix(1:5)
y
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [5,]    5
df.example <- cbind(x, y)
df.example
##      [,1] [,2] [,3]
## [1,]    1    6    1
## [2,]    2    7    2
## [3,]    3    8    3
## [4,]    4    9    4
## [5,]    5   10    5

?function_name can load the function help file. Also note that any functions in non-base packages will require installing and loading that package.

Here, for example, we install and load package named “ggplot2” that we will use for data visualization.

install.packages("ggplot2")
library(ggplot2)

2.8.1 Pre-existing Functions

R also contains many pre-existing functions in the base software. Numeric functions include sum(), mean(), sd(), min(), max(), median(), range(), quantile(), or summary(). Try a few of these on the numeric vectors you have created.

sum(x)
## [1] 55
summary(x)
##        V1          V2    
##  Min.   :1   Min.   : 6  
##  1st Qu.:2   1st Qu.: 7  
##  Median :3   Median : 8  
##  Mean   :3   Mean   : 8  
##  3rd Qu.:4   3rd Qu.: 9  
##  Max.   :5   Max.   :10
range(y)
## [1] 1 5

2.8.2 Calculations & Arithmetic Operators

R can be used to perform basic calculations and report the results back to the user.

4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260

Exponentiation: ^

2^3
## [1] 8

Max and Min: max(), min()

vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers) 
## [1] 10
min(vector_numbers)
## [1] 2

Can you calculate the square root and then subtract 5 for each element in vector_number?

2.8.3 Relational Operators

<,>, =, !=, >=, <=, Evaluate a conditional expression and return TRUE or FALSE

3 > max(c(2,3,4,5))
## [1] FALSE

2.9 Loops

The for loop is used to iterate over a sequence (numeric vector, list, or other iterable objects). Loops are very important to perform opperations and especial building blocks of many advanced models.

Here is a simple for loop that print numbers 1 to 5. As you can see, i is the element in which the loop is iterating. It can take a value of 1 to 5, and the loop ends when it reaches the last element of the sequence.

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

But the loop can iterate not just on numbers, but also lists.

my_list <- c("apple", "orange", "banana")
for (fruit in my_list) {
  print(fruit)
}
## [1] "apple"
## [1] "orange"
## [1] "banana"

We can create a vector from which we will add 5 to each element and save it in a new vector. For this, we first need to create an empty vector where we will save the new numbers. Note how we use [] to access the elements in the i positions.

vector <- c(2,3,4,5,2) #Data vector
newdata <- NULL #Vector to store output
for (i in 1:length(vector)){
  newdata[i] <- vector[i] + 5
}
newdata
## [1]  7  8  9 10  7

Finally, we can nest loops. A loop inside a loop. Make sure you understand dimensions before trying to understand the nested loop.

vector <- c(1,2,3,4,5) #Data vector
newdata <- matrix(NA, 5,5) #Vector to store output
for (i in 1:5) {
  for (j in 1:5) {
    newdata[i,j] <- vector[i] * vector[j]
  }
}
newdata
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    2    4    6    8   10
## [3,]    3    6    9   12   15
## [4,]    4    8   12   16   20
## [5,]    5   10   15   20   25

2.10 Installing SWIRL

A good way to learn more about R is to use SWIRL. This is a user-generated program (also called a package or library) for learning how to code in R. To access this, you must first install the package so it is accessible. In the Console window (bottom left), type the following and press ENTER:

install.packages("swirl")

# Data can also be installed locally if you've downloaded the zip file to your hard drive
#install.packages("yourzip.tar.gz", repos=NULL)

This may take a little while, but when the stop sign in the upper right of the console window is gone, you can proceed. For any package you install in R, you will also need to turn them on before using them. You can do this with the require() or library() functions. Type this now:

library(swirl)

Note: You may be prompted to select a “mirror” from which to download the package. If this is the case, it is recommended that you choose the mirror that is geographically closest to you.

To install the lesson, you will need to use:

install_from_swirl("R Programming")

Find out more about other courses, and other download options here: https://github.com/swirldev/swirl_courses

2.10.1 SWIRL Lessons

There are many lessons within R. Once SWIRL is loaded (below), you will be given the option of which lessons to complete. Some of the core lessons can be found in the initial section labeled R Programming (estimated time to compete required lessons is about 2 hours). We recommend to start with the following lessons:

  1. Basic Building Blocks (10 min)
  2. Workspace and Files (15 min)
  3. Sequences of Numbers (5 min)
  4. Vectors (8 min)
  5. Missing Values (5 min)
  6. Subsetting Vectors (12 min)
  7. Matrices and Data Frames (13 min)
  8. Logic (optional)
  9. Functions (30 min)
  10. lapply and sapply (optional)
  11. vapply and tapply (optional)
  12. Looking at Data (5 min)
  13. Simulation (optional)
  14. Dates and Times (10 min) (optional)
  15. Base Graphics (10 Min)

2.10.2 Run SWIRL

Type the following to begin using SWIRL. Also, when restarting your session later, you’ll need to “turn on” SWIRL each time with either library(swirl) or require(swirl).

swirl()

Have fun!

3 Introduction to Data Analysis

Now that you’ve learned the basics of R programming, we’ll take things a step further.This chapter will walk you through a new set of analyses.

We’ll be working with the dataset “mtcars” which comes pre-loaded with R. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data and to reiterate some principles in statistical modelling. This chapter will help as we move towards our ultimate goal of conducting more advanced analyses on animal point data.

You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.


For this tutorial section I followed the book: Data Analysis with R Statistical Software - A Guidebook for Scientists This is, in my opinion, one of the best resources out there for an introduction to data analysis with R.


Whenever you start working with a new script, you should first set a working directory. If you are working within a R-Studio project, the working directory will automatically be set by default. This directory will contain all the data for your analysis and will be where you will save all the data outputs.

Remember that you can check the current working directory by typing:

getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/R-Tutorials"

Now, let’s change the working directory to a location of your choosing. Create a folder if you don’t have one already, then make sure your working directory is in that folder. If you already have a folder, just set the working directory to the folder you want to use.

setwd("C:/....")

3.1 Data Management and Data Manipulation

3.1.1 Exploring the Data

Let’s start investigating a data set to later fit a linear model.

Load the “mtcars”” data set. This dataset comes with R. We will use this dataset as it is intuitive to think about it and is a good example on how to tackle many other datasets.

data(mtcars)

View the first 10 lines of the data set.

head(mtcars, 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Assess the overall structure of the data set to get a sense of the number and type of variables included. When you work with your own data, you will be familiar with the data structure, but it is always good practice to examine your data before moving on to any model fitting. Assure that the data structure of each column of the data frame is correct and/or what you expect it to be.

Note, all columns/variables included in this sample dataset are numeric-type. You can confirm the data type of each column by typing is.numeric() next to the variable name (e.g., is.numeric(mtcars$mpg)).

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Now, summarize the data to provide a list of each variable with the mean, min, and max.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

‘Summary’ is a great function to have a quick view at the data. But what if you want to save the mean, min, or max values of each variable? There is a family of functions in R that are great for applying functions to all columns or all rows in a matrix and that return the result as a vector or list of values. This is the apply function.

The apply function has two main arguments. The MARGIN that is a 1 or a 2, indicating whether you want to operate on rows (1) or columns (2) and the FUN arguments that tell R what function is to be applied. For example, to obtain the mean of each variable in the mtcars dataset, we do:

apply(mtcars, 2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Compare with the values reported by the summary function.

QUESTION: Can you calculate the min and max values for each variable?

Let’s have a quick look at all our variables together (since they are all numeric) by looking at scatter plots of each variable combination. Use the function “pairs” or “plot” on the data set.

plot(mtcars)

You should be able to see cases where there seems to be a strong relationship between two variables. “mpg” vs. “wt” is a good example of this. This is miles per gallon vs. the weight of the car, and this makes sense, a heavier car should consume more petrol per distance. We should transform miles to kilometers, because, what kind of measurement is miles? But in appreciation of all my American friends, we will use miles, just for this exercise. Is the slope here positive or negative?

We can plot these two variables against each other to examine the relationship closer.

plot(mtcars$mpg ~ mtcars$wt)

You could plot any of the variables in the data frame. Plotting the data is one of the simplest ways to look and explore the data.

In R you can customize your plots. We will dedicate a section to ggplot2. But for now, can you change the ‘x’ and ‘y’ variable names and add a title to this plot? Use the help file help(plot) or ?plot to determine the proper syntax, or simply Google “add tile to plot in R”.

plot(mtcars$mpg ~ mtcars$wt, xlab="Weight", ylab="MPG", main="MPG vs Weight")

Calculate the correlation coefficient between the two variables, then perform a correlation test to see if they are significantly correlated.

cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
cor.test(mtcars$mpg, mtcars$wt)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

The p-value is very small and the correlation coefficient of -0.8676594 is very high. We also note that this value is negative, meaning that as the weight increases, the fuel efficiency decreases. This makes intuitive sense. We will talk about the difference between correlation and causation later.

Let’s practice some data management before we look into these variables in more detail.

Create a new data set called “my_mtcars”, excluding the variables “vs” and “am” using vector notation. Then look at your new data set to make sure it worked. We don’t necessarily need to remove these variables to continue the analysis. We are simply doing this so that you get more familiar with manipulating data frames.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
my_mtcars <- mtcars[, -c(8,9)] #Remove columns 8 and 9
head(my_mtcars, 10) 
##                    mpg cyl  disp  hp drat    wt  qsec gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30    4    4

Now, keeping the same name for your data set, exclude the “gear” and “carb” columns without using vector notation. Instead use the “subset” function. Check out the help (?subset) for this function to figure out how to exclude columns by name.

my_mtcars <- subset(my_mtcars, select = -c(gear, carb))

Note that the initial data of my_mycars with 9 variables no longer exists, because my syntax states to save the modified 7 variable data with the original name.

QUESTION: How could you do this without overwriting the original data?

You should now have a data set called my_mtcars that has 32 observations of 7 variables. Check this.

str(my_mtcars)
## 'data.frame':    32 obs. of  7 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
dim(my_mtcars)
## [1] 32  7

QUESTION: What does ## [1] 32 7 tells us? Hint, recall vector notation.

Another way of checking the number of rows and columns is using nrow() and ncol() functions.

The variable “cyl” represents the number of cylinders in the car engine. You’ll see this is classified as a numeric variable. However, we aren’t necessarily interested in this as an actual number. For us, the number of cylinders is more useful as a grouping mechanisms to serve as a factor, or categorical variable. Let’s use the as.factor function to convert it, keeping the same variable name. Then check what the class of the variable is, to confirm the conversion worked.

my_mtcars$cyl <- as.factor(my_mtcars$cyl)
class(my_mtcars$cyl)
## [1] "factor"

Creating a categorical factor variable will enable us to generate summary statistics and plot data by groups.

We can now use this factor variable to group different operations. tapply is a great function to use for grouped operations. Check the help for ?tapply and try to calculate the mean of “mpg” for each factor of the “cyl” variable.

tapply(my_mtcars$mpg, my_mtcars$cyl, mean)
##        4        6        8 
## 26.66364 19.74286 15.10000

Now let’s create box plots of our two variables of interest (mpg, wt) to visualize their distribution. First, change your graphic parameter to show two graphs side by side.

Question: How do you think you’d specify the title of each box plot?

par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg, main = "mpg")
boxplot(my_mtcars$wt, main = "weight")  

You can see that two points are potential outliers in the plot forwt. The plot gives you a tool to make a decision whether to remove the data points. Here we will keep them.

Since cyl is a categorical variable, we can also visualize the distribution of mpg and wt across different cylinder classes.

par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg ~ my_mtcars$cyl, main = "mpg")
boxplot(my_mtcars$wt ~ my_mtcars$cyl, main = "weight") 

QUESTION: How do you modify the label for x axis and y axis?

Before we move forward, let’s exclude these two observations by using a logical expression that removes the points (rows in the dataframe) in the data set where weight is greater than 5.3 tons. There are a few different ways to do this, as is the case with most things in R. Let’s use the subset function again.

my_mtcars <- subset(my_mtcars, wt <= 5.3)

You should now have a data set with 30 observations of 7 variables.

Note that removing data points from an analysis is not a good statistical practice and when done, it should be justified. But here, the goal is to use different functions on how we can work with data frames.

3.2 Data Table Manipulation with Dplyr

The most basic R skills is to query and manipulate various data tables. Table manipulation is also something that is almost always required, regardless of what you decide to apply R for. For beginners, familiarizing and reinforcing table manipulation skills to meet different needs is a great way of improving R skills. If you wish to become really good at R, but don’t know where to start, start with tables!

The base R functions that come with the default R installation have the capacity for almost all the table manipulation you will need (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, sometimes their syntax are less user-friendly and intuitive than some of the special packages built for table manipulation purposes. So, here we are introducing a few of the most useful table manipulation functions within dplyr package. This is a package I use a lot.

Note that you will have to use install.packages() and library() function to download and activate the dplyr before using it.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Now, we will see how different functions of this package work.

3.2.1 select()

We can use select() to select column(s) that meet an specific pattern:

select(mtcars, mpg) # select column called mpg (miles per galon)
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4
select(mtcars, -mpg) # select all columns in the data except mpg column
##                     cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4             6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag         6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive        6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout     8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant               6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360            8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D             4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230              4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280              6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C             6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE            8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL            8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC           8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood    8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial     8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128              4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic           4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla        4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona         4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger      8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin           8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28            8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird      8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9             4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2         4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa          4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L        8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino          6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora         8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E            4 121.0 109 4.11 2.780 18.60  1  1    4    2
select(mtcars, hp:vs) # select a continuous block columns starting from hp column and end on vs column 
##                      hp drat    wt  qsec vs
## Mazda RX4           110 3.90 2.620 16.46  0
## Mazda RX4 Wag       110 3.90 2.875 17.02  0
## Datsun 710           93 3.85 2.320 18.61  1
## Hornet 4 Drive      110 3.08 3.215 19.44  1
## Hornet Sportabout   175 3.15 3.440 17.02  0
## Valiant             105 2.76 3.460 20.22  1
## Duster 360          245 3.21 3.570 15.84  0
## Merc 240D            62 3.69 3.190 20.00  1
## Merc 230             95 3.92 3.150 22.90  1
## Merc 280            123 3.92 3.440 18.30  1
## Merc 280C           123 3.92 3.440 18.90  1
## Merc 450SE          180 3.07 4.070 17.40  0
## Merc 450SL          180 3.07 3.730 17.60  0
## Merc 450SLC         180 3.07 3.780 18.00  0
## Cadillac Fleetwood  205 2.93 5.250 17.98  0
## Lincoln Continental 215 3.00 5.424 17.82  0
## Chrysler Imperial   230 3.23 5.345 17.42  0
## Fiat 128             66 4.08 2.200 19.47  1
## Honda Civic          52 4.93 1.615 18.52  1
## Toyota Corolla       65 4.22 1.835 19.90  1
## Toyota Corona        97 3.70 2.465 20.01  1
## Dodge Challenger    150 2.76 3.520 16.87  0
## AMC Javelin         150 3.15 3.435 17.30  0
## Camaro Z28          245 3.73 3.840 15.41  0
## Pontiac Firebird    175 3.08 3.845 17.05  0
## Fiat X1-9            66 4.08 1.935 18.90  1
## Porsche 914-2        91 4.43 2.140 16.70  0
## Lotus Europa        113 3.77 1.513 16.90  1
## Ford Pantera L      264 4.22 3.170 14.50  0
## Ferrari Dino        175 3.62 2.770 15.50  0
## Maserati Bora       335 3.54 3.570 14.60  0
## Volvo 142E          109 4.11 2.780 18.60  1
select(mtcars, starts_with("c")) # select all columns that start with "c" in their column names
##                     cyl carb
## Mazda RX4             6    4
## Mazda RX4 Wag         6    4
## Datsun 710            4    1
## Hornet 4 Drive        6    1
## Hornet Sportabout     8    2
## Valiant               6    1
## Duster 360            8    4
## Merc 240D             4    2
## Merc 230              4    2
## Merc 280              6    4
## Merc 280C             6    4
## Merc 450SE            8    3
## Merc 450SL            8    3
## Merc 450SLC           8    3
## Cadillac Fleetwood    8    4
## Lincoln Continental   8    4
## Chrysler Imperial     8    4
## Fiat 128              4    1
## Honda Civic           4    2
## Toyota Corolla        4    1
## Toyota Corona         4    1
## Dodge Challenger      8    2
## AMC Javelin           8    2
## Camaro Z28            8    4
## Pontiac Firebird      8    2
## Fiat X1-9             4    1
## Porsche 914-2         4    2
## Lotus Europa          4    2
## Ford Pantera L        8    4
## Ferrari Dino          6    6
## Maserati Bora         8    8
## Volvo 142E            4    2
  • starts_with argument is very convenient because it allow you to select multiple columns that start with the same text. A few similar arguments are available to define other naming patterns.
  • ends_with()= Select columns that end with a character string.
  • contains()= Select columns that contain a character string.
  • matches()= Select columns that match a regular expression.
  • one_of()= Select columns names that are from a group of names.

Question: How do you select or exclude two columns: mpg and cyl?

3.2.2 filter()

Filter/select row(s) of data based on specific requirement of column(s) values:

filter(mtcars, cyl %in% c(4,6)) # select cars with 4 or 6 cylinders
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
filter (mtcars, mpg>20) # select rows that have mpg > 20
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
filter(mtcars, mpg>20 & cyl == 6) # select rows that have mpg>20 AND cyl == 6
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
filter(mtcars, mpg>20 | hp > 100) # select rows that have mpg>20 OR hp > 100
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Question 1: How do you select rows with 6 and 8 cylinders knowing that cyl is a factor?

3.2.3 pipe operator

The pipe operator allows you to pipe the output from one function to the input of the next function. Instead of nesting functions (reading from the inside to the outside), the idea of of piping is to read the functions from left to right. It can also help you avoid creating and saving a lot of intermediate variables that you don’t need to keep. The old operator for pipes was %>%, but now a new version has been introduced, |>

# old operator
pipe_result<- mtcars %>%
  select(mpg, cyl) %>%
  head()
pipe_result
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6
# new operator
pipe_result<- mtcars |>
  select(mpg, cyl) |>
  head()
pipe_result
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

3.2.4 arrange()

This function arranges or re-orders rows based on their value, the rows are arranged by default in ascending order

order_data1<- mtcars %>% 
    arrange(mpg) 
order_data1
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
order_data2<- mtcars %>%
    arrange(cyl, mpg)
order_data2
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Now we learn pipe operator, can you understand what order_data1 and order_data2 are producing? 

Question: Can you arrange the table first by wt and then by hp in decending order?

3.2.5 mutate()

The mutate() command creates new column(s) and define their values. For instance, we can create a new column with random numbers between 0 and 100 using a uniform distribution.

new_col<- mtcars %>%
    mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100)) # note the use of nrow to get the exact number of random numbers as there are rows in the dataframe.
new_col
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                        new_col
## Mazda RX4           68.0082178
## Mazda RX4 Wag       22.6670911
## Datsun 710          47.3906628
## Hornet 4 Drive      87.9783226
## Hornet Sportabout   38.6728120
## Valiant             43.2574856
## Duster 360          79.9368094
## Merc 240D           45.4630289
## Merc 230            84.9415006
## Merc 280            53.0237370
## Merc 280C            1.4590759
## Merc 450SE          62.0795711
## Merc 450SL          81.5701176
## Merc 450SLC         32.5959353
## Cadillac Fleetwood  52.1947961
## Lincoln Continental 45.9956347
## Chrysler Imperial    0.9429407
## Fiat 128            62.2708343
## Honda Civic         72.3534224
## Toyota Corolla       9.3948067
## Toyota Corona       43.3100333
## Dodge Challenger    39.3705221
## AMC Javelin         53.8806726
## Camaro Z28          29.9311936
## Pontiac Firebird    26.0978157
## Fiat X1-9           99.9855554
## Porsche 914-2       46.2409172
## Lotus Europa        25.5154344
## Ford Pantera L       7.8714678
## Ferrari Dino        93.9733470
## Maserati Bora       17.3660055
## Volvo 142E          71.0062781
new_col2 <- mtcars %>%
    mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100),  wt_kg = wt * 1000) # you can create multiple columns at once. Lets transform the weight from tones to kg
new_col2
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                       new_col wt_kg
## Mazda RX4           39.431255  2620
## Mazda RX4 Wag       30.763502  2875
## Datsun 710          25.109426  2320
## Hornet 4 Drive      86.945529  3215
## Hornet Sportabout   39.114930  3440
## Valiant             40.612382  3460
## Duster 360           2.979379  3570
## Merc 240D           64.125189  3190
## Merc 230             5.158186  3150
## Merc 280            52.688352  3440
## Merc 280C           68.500006  3440
## Merc 450SE          11.045863  4070
## Merc 450SL          58.549967  3730
## Merc 450SLC         61.971348  3780
## Cadillac Fleetwood  87.852788  5250
## Lincoln Continental 42.586525  5424
## Chrysler Imperial   45.012744  5345
## Fiat 128             4.594393  2200
## Honda Civic         15.298820  1615
## Toyota Corolla      82.605559  1835
## Toyota Corona       55.259505  2465
## Dodge Challenger    62.798611  3520
## AMC Javelin         93.932539  3435
## Camaro Z28          92.219728  3840
## Pontiac Firebird    63.116971  3845
## Fiat X1-9            2.939228  1935
## Porsche 914-2       60.078523  2140
## Lotus Europa         4.487375  1513
## Ford Pantera L      23.166902  3170
## Ferrari Dino         6.026278  2770
## Maserati Bora       50.565610  3570
## Volvo 142E           1.647168  2780

Can you create a new column call zero and give it a value of 0 ?

3.2.6 summarise()

This function calculates a summary statistics among all rows or rows within certain grouping, often used in combination with group_by()

sum_table <- mtcars %>% 
summarise(mean(wt))
sum_table
##   mean(wt)
## 1  3.21725
sum_table2 <- mtcars%>% 
summarise(avg_wt= mean(wt), min_wt= min(wt))
sum_table2
##    avg_wt min_wt
## 1 3.21725  1.513

3.2.7 group_by()

This is a great function. group_by() divides data rows into groups based on grouping column(s) provided, often used in combination with other functions which define what you do with them after placing them in groups. When group_by() and summarise() are used together, you are essentially telling R to separate rows into different groups, and for each groups you use summarise() to generate a series of summary statistics that characterize the column values.

group_summary <- mtcars |>
  group_by(cyl) |>
  summarise(avg_wt= mean(wt), min_wt= min(wt))
group_summary
## # A tibble: 3 × 3
##     cyl avg_wt min_wt
##   <dbl>  <dbl>  <dbl>
## 1     4   2.29   1.51
## 2     6   3.12   2.62
## 3     8   4.00   3.17

You can also create groups by the combination of two or multiple columns

group_summary2<- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_wt= mean(wt), min_wt= min(wt))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
group_summary2
## # A tibble: 8 × 4
## # Groups:   cyl [3]
##     cyl  gear avg_wt min_wt
##   <dbl> <dbl>  <dbl>  <dbl>
## 1     4     3   2.46   2.46
## 2     4     4   2.38   1.62
## 3     4     5   1.83   1.51
## 4     6     3   3.34   3.22
## 5     6     4   3.09   2.62
## 6     6     5   2.77   2.77
## 7     8     3   4.10   3.44
## 8     8     5   3.37   3.17

3.2.8 Join table

Let’s read two other tables that are available in the package ‘dplyr’

data(band_members)
band_members
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
data(band_instruments)
band_instruments
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
  • left_join()

Returns all records from the left table, and the matched (matched by shared ID columns) records from the right table. The result is NULL from the right side, if there is no match.

left_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
  • full_join()

Return all records in both tables.

full_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 4 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Other than left_join() and full_join() there are a few other join functions that would join table differently. Sometimes you will find them useful for a specific table you want to make, but in general, you can reply on left_join() for most of the job.

  • right_join(): similar to left join, it keeps all records from the right table instead.

  • inner_join(): return only the matched records from both tables.

3.2.9 Conclusion on data management

This has been a glimpse to what can be done in R to work with tabular data. There are plenty other packages that in time you will learn by Googling and learning from other people, but for now, all the functions we covered are a very good set of tools to do most of what you will need.

3.3 Statistical Modelling

Let’s move on into data analysis.

3.3.1 General Principles

When building an statistical model, you need to specify a dependent variable (what you are trying to explain) and one or more independent variables (the variables that you are using to explain the dependent variable).

Thus, to build a model you will first need to identify the dependent and independent variables which come from the research question. Once you have identified the variables, you need to decide what kind of model structure is appropriate for the data set at hand. For this reason, knowing models and model structures can help you design the data collection process as you have a good idea of what type of data you need for a specific model.

Running models can be easy, but before you can make interpretations from the model, you need to check that the assumptions of the model are valid (model validation), decide what is the best model structure (model selection) and then, finally, interpret the parameters of the model and/or make predictions.

The general structure for defining any model in R is:

model.type(dependent.variable ~ independent.variable1 + independent.variable2, data = dataframe)

Note that in the model equation you do not have to specify the intercept. We have assumed in this example that there are 2 different independent variables in the example provided (independent.variable1 and independent.variable2), but the model could contain any number of independent variables.

3.3.2 Regression vs Classification

The dependent variable can either be continuous or categorical data. We refer to these separate parameterizations as a regression model or a classification model, respectively. Most of the models we demonstrate here are regression models, but the structure to build classification models in R are almost identical. In most cases, the data type of the dependent variable determines whether the model is a regression (non-categorical numeric-type dependent variable) or classification (categorical factor-type dependent variable). You can use functions such as as.factor() and as.numeric() to convert different data types of data to a factor or numeric variable. Note, just because a data set contains numbers, does NOT necessarily mean the numbers are numeric-type. Numbers can be used as symbols to differentiate categories as well. It is always a good practice to confirm the data type of the dependent and independent variables that you are inputing into the model.

3.4 Linear Regression

We will start with a simple linear regression analysis.

QUESTION: What is the difference between correlation and regression?

3.4.1 Model design and model fit

Model design involves deciding what we are trying to explain (i.e., the dependent variable) and what we are going to use to explain it (i.e., the independent variables). These are questions that are informed by the researcher.

Then, we need to decide what kind of model structure is appropriate to the data set. In our case, we will start with a simple linear regression model. Later in the course, however, we will investigate more complex model structures.

Run a basic linear regression model, attempting to predict/explain the fuel efficiency of a car (mpg) based on its weight. Finally, assign the model output “lm1”. Remember to specify the my_mtcars data set in the model so the R understands that the name of dependent and independent variables are columns in that specific dataframe.

lm1 <- lm(mpg ~ wt, data = my_mtcars)

3.4.2 Investigate your model

Once you have fit the model, you can use the summary function to investigate the beta coefficient, SE, and intercept values.

summary(lm1)
## 
## Call:
## lm(formula = mpg ~ wt, data = my_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1820 -2.4603 -0.4526  1.5046  6.3191 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.913      2.002  19.936  < 2e-16 ***
## wt            -6.287      0.630  -9.979 1.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.793 on 28 degrees of freedom
## Multiple R-squared:  0.7805, Adjusted R-squared:  0.7727 
## F-statistic: 99.59 on 1 and 28 DF,  p-value: 1.007e-10

3.4.3 Model validation - Checking the assumptions of your model

IMPORTANT: Before you can trust your model, you need to examine whether the assumptions on which the model is based are actually valid. If the assumptions are not satisfied, then your model will be unreliable, at least to some extent!

If you apply the function plot to the model, R will provide a series of plots that will help you to inspect model assumptions. Click enter to get all the plots.

plot(lm1)

  1. Assumption 1: The residuals are normally distributed

Check the Q-Q (quantile-quantile) plot. A perfect normal distribution is represented by a straight line. Your residuals should be close to a straight line.

  1. Assumption 2: The variances of the residuals are homogeneous (homoscedasticity)

The variability of the residuals should be uniform across the range of fitted values of the dependent variable. In the plot of residuals vs fitted values of y and standardized residuals vs. fitted values there should be a “scatter” of dots across the graphs, with no discernible pattern. Points should be randomly distributed. If you have a scatter of points on one side of the graph, your data may NOT be homogeneous.

  1. Assumption 3: The independent variables are independent of each other (no collinearity)

There are different ways you can address this before you fit your model. For instance, you can estimate the correlation of each pair of covariates and discard variables or exclude them from analysis if they highly correlated (positively or negatively). We will examine this later in the course when dealing with more complex models.

  1. Assumption 4: The data set does not contain serial auto-correlation

Serial autocorrelation is when there is significant correlation between successive data points in the data set. Spatial data or time-series data tend to be autocorrelated. There are different ways to deal with autocorrelation, such as using mixed-effect models.

  1. Assumption 5: The model is not biased by unduly influential observations.

We can check this by looking at the plot of standardized residuals vs leverage and “Cook’s Distance.”

Leverage is a measure of the influence of individual data points on the model’s parameters, measured on the scale of 0-1, where high values indicate a strong influence on the model’s parameters.

Cook’s distance is the sum of squared distances between the fitted values using the whole data set, and the fitted values with the ith observation removed. A large difference indicates that the ith observation exerts a strong influence on the model’s parameters.

No values beyond 1.


QUESTION: How does the QQ plot look? Does it indicate a potential problem?

If your standardized residuals are not normally distributed you can transform the dependent variable.

Let’s try a log transformation, which is commonly used:

lm1 <- lm(log(mpg) ~ wt, data = my_mtcars)
plot(lm1)