Geographic Information Systems, use of satellite imagery, modeling, and mapping have become essentials in the toolbox of many researchers, natural resource managers, and conservationists. Yet, these resources are often not readily accessible to practitioners around the world. Most restrictive is the limited access to specialized and expensive commercial software. In recent years, several powerful open-source tools for geospatial analysis have emerged. R is a widely used statistical programming language with considerable geospatial analysis capabilities. This training focuses on performing geospatial work in R.
By the end of these tutorials, you will be able to:
This content has been the product of years of work and collaborations with many amazing people. Some of the code has been provided by them.
We thank Dr. Quingyu Huangm, Dr. Grant Connette, and Dr. Joseph Kolowski for their inputs.
During the course, students will use R and R-Studio. R is a programming language and free software environment for statistical computing widely used by researchers. R-Studio is an integrated development environment for R that makes the software more user friendly. Both R and R-Studio are compatible with the most common operating systems and are free to use. Before the course begins, you should have the software installed on your personal computers. Installation files can be accessed at:
-RStudio Desktop: https://rstudio.com/products/rstudio/
All the data used in the tutorials are available in this link
R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.
Among the reasons to use R you have:
R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.
Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.
R is available for Linux, MacOS X, and Windows (95 or later) platforms. Software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work on a script file, where you can write and save your work, and other windows that will show up on demand such as the plot tab (Fig. 1).
RStudio is an enterprise-ready professional software tool that integrates with R. It has some nice features beyond the normal R interface, which many users feel it is easier to use than R (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.
One of the most useful commands in R is
?
. At the command prompt (signified by >
in
your Console window), type ?
followed by any command and
you will be prompted with a help tab for that command (e.g.,
?mean
Fig. 3). You can also search through the help tab
directly by searching functions on the search bar.
The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.
There are a few concepts that are important to keep in mind before
you start coding. The fact that R is a programming
language may deter some users who think “I can’t program”. This should
not be the case for two reasons. First, R is an
interpreted language, not a compiled one, meaning that all commands
typed on the keyboard are directly executed without requiring you to
build a complete program like in most computer languages (C, Pascal, . .
. ). Second, R’s syntax is very simple and intuitive.
For instance, a linear regression can be done with the command
lm(y ~ x)
which means fitting a linear model with y as the
response and x as a predictor.
In R, in order to be executed, a function always
needs to be written with parentheses, even if there is nothing within
them (e.g., ls()
). If you type the name of a function
without parentheses, R will display the content of the
function.
When R is running, variables, data, functions, results, etc…, are stored in the active memory of the computer in the form of objects that you assign a name.The user can do actions on these objects with operators (arithmetic, logical, comparison, . . . ) and functions (which are themselves objects).
The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).
When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.
It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).
Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:
getwd()
Use setwd()
to change or set a new working directory.
For instance, you can set your working directory to be in your Documents
folder on the C: drive, or in any folder you prefer.
setwd("C:/Documents/R_Practice")
In RStudio, you can do some basic operations, such as setting your working directory, using the point-and-click interface: Session > Set Working Directory > Choose Directory … as shown in Fig. 4.
You can also work with R projects. R projects have the advantage that they enhance organization, collaboration, and reproducibility in R-based work by providing a structured and isolated workspace for your projects. They contribute to a more efficient and transparent workflow, particularly when working on multiple projects or collaborating with others.
The main advantage is that R Projects create a dedicated workspace for your R-related work, keeping all files, scripts, data, and outputs within a well-defined directory. This isolation helps avoid conflicts between projects and ensures a clean environment for each project. R projects use relative paths, allowing you to refer to files and directories within the project without specifying the full path. This enhances portability and makes it easier to share your project with others.
To start a new R Project click File > New Project or directly click on the R Project icon and follow the directions. You can set up your new project on a new folder or an existing folder.
There are four fundamental data types in R that you will work with:
You can check the data type of an object using the function
class()
. To convert between data types you can use:
as.integer()
, as.numeric()
,
as.logical()
, as. character()
.
For instance:
city <- 'Nairobi'
class(city)
## [1] "character"
number <- 3
class(number)
## [1] "numeric"
Integer <- as.integer(number)
class(Integer)
## [1] "integer"
logical <- 3 > 5
logical
## [1] FALSE
Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that values are case sensitive; ‘x’ is not the same as ‘X’!
city <- "front royal"
summary(city)
number <- 2
summary(number)
character <- as.character(2)
Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:
table <- read.csv(file="data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder
Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.
The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.
#This is a comment
The semicolon (;) defines a line continuation character so that you can write different commands on the same line of code.
a <- 3; b <- 6; c <- a+b
a
## [1] 3
b
## [1] 6
c
## [1] 9
Vectors are a basic data structure in R. They
contain a sequence of data and can contain characters, numbers, or be
TRUE/FALSE values. Remember: If you are unsure or need help, use the
help function (e.g., help(seq)
or ?seq
).
Below are several ways to create vectors in R.
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
rep(1:5,5)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep("A rolling stone gathers no moss",4)
## [1] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
## [3] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
x <- 1:10
y <- c(1.5, 9, 3.46, 12.2)
To see only part (i.e., a subset) of the data stored in a vector, you
need to “ask” R to extract the information you want
using square brackets. Most commonly, you will indicate in square
brackets the position of the data you want to extract (from beginning of
the vector [1] to the Nth slot in the vector [n]). If you only wanted to
see the first 5 values of ‘x’, how would you do that? Or only the 2nd
and 4th element of y? What if you wanted to see all the records in
y
, except for the 2nd and 3rd records? There are more ways
to use notation to select subsets of data, which we will cover in more
detail below.
x
## [1] 1 2 3 4 5 6 7 8 9 10
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
x[1:5]
## [1] 1 2 3 4 5
y[c(2,4)]
## [1] 9.0 12.2
y[-c(2,3)]
## [1] 1.5 12.2
Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.
Create a matrix with 4 rows and 5 columns using the data from
x
above. Consult the help (e.g., help(matrix)
or ?matrix
) to determine the syntax required.
test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 3 7
## [2,] 2 6 10 4 8
## [3,] 3 7 1 5 9
## [4,] 4 8 2 6 10
# Note, I can assign any name to an object that I create. Generally it is best to name things in a way that is meaningful, but we'll have some fun here!
superman <- matrix(data = x, nrow = 4, ncol = 5)
Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:
#test_matrix(row,column)
test_matrix[1,5]
## [1] 7
In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!
For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.
test_matrix[,5]
## [1] 7 8 9 10
What about the 4th row?
test_matrix[4,]
## [1] 4 8 2 6 10
What if we wanted to view the values in the 3rd row, but only the 2nd and 4th columns?
test_matrix[3,c(2,4)]
## [1] 7 5
What happens to the matrix if we append a character field? Use the
cbind()
(column bind) command to bind a new column, called
‘countries’. Note that I am not changing the contents of test_matrix.
Can you figure out how to do a row bind (hint: use
rbind()
)
countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
## countries
## [1,] "1" "5" "9" "3" "7" "United States"
## [2,] "2" "6" "10" "4" "8" "Pakistan"
## [3,] "3" "7" "1" "5" "9" "Ireland"
## [4,] "4" "8" "2" "6" "10" "China"
#Note that I am not changing/overwriting the contents of test_matrix. I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)
Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?
test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
## X1 X2 X3 X4 X5 countries
## 1 1 5 9 3 7 United States
## 2 2 6 10 4 8 Pakistan
## 3 3 7 1 5 9 Ireland
## 4 4 8 2 6 10 China
# Have I changed the file type?
class(test_dataframe)
## [1] "data.frame"
Can I rename the column headings?
names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
## Val1 Val2 Val3 Val4 Val5 Countries
## 1 1 5 9 3 7 United States
## 2 2 6 10 4 8 Pakistan
## 3 3 7 1 5 9 Ireland
## 4 4 8 2 6 10 China
# Also see the colnames() function
Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?
test_dataframe[3,5]
## [1] 9
test_dataframe[,5]
## [1] 7 8 9 10
test_dataframe$Val5[3]
## [1] 9
test_dataframe$Val5
## [1] 7 8 9 10
test_dataframe[,"Val5"]
## [1] 7 8 9 10
You can also use some very simple commands to determine the size of dataframes or matrices.
nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6
You can delete individual objects to clear your working directory
(rm(dataset)
), or start every script with the following
command to make sure you are starting fresh (this is good programming
practice):
#rm(list=ls())
R functions can be defined as a collection of arguments structured together for carrying out a definite task. Functions have optional input and output arguments that return a value. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions within base packages or other downloadable packages.
Most functions have optional arguments or are given default values (in the function’s help document, under the ‘Usage’ section, the optional arguments are given a default value following the “=” symbol). When you don’t specify the optional arguments, they will take the default values. Functions normally can be called using the following format: function_name(input_data, argument1, argument2.)
print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
y <- matrix(1:5)
y
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
df.example <- cbind(x, y)
df.example
## [,1] [,2] [,3]
## [1,] 1 6 1
## [2,] 2 7 2
## [3,] 3 8 3
## [4,] 4 9 4
## [5,] 5 10 5
?function_name can load the function help file. Also note that any functions in non-base packages will require installing and loading that package.
Here, for example, we install and load package named “ggplot2” that we will use for data visualization.
install.packages("ggplot2")
library(ggplot2)
R also contains many pre-existing functions in the
base software. Numeric functions include sum()
,
mean()
, sd()
, min()
,
max()
, median()
, range()
,
quantile()
, or summary()
. Try a few of these
on the numeric vectors you have created.
sum(x)
## [1] 55
summary(x)
## V1 V2
## Min. :1 Min. : 6
## 1st Qu.:2 1st Qu.: 7
## Median :3 Median : 8
## Mean :3 Mean : 8
## 3rd Qu.:4 3rd Qu.: 9
## Max. :5 Max. :10
range(y)
## [1] 1 5
R can be used to perform basic calculations and report the results back to the user.
4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260
Exponentiation: ^
2^3
## [1] 8
Max and Min: max(), min()
vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers)
## [1] 10
min(vector_numbers)
## [1] 2
Can you calculate the square root and then subtract 5 for each element in vector_number?
<,>, =, !=, >=, <=, Evaluate a conditional expression and return TRUE or FALSE
3 > max(c(2,3,4,5))
## [1] FALSE
The for loop is used to iterate over a sequence (numeric vector, list, or other iterable objects). Loops are very important to perform opperations and especial building blocks of many advanced models.
Here is a simple for loop that print numbers 1 to 5. As you can see, i is the element in which the loop is iterating. It can take a value of 1 to 5, and the loop ends when it reaches the last element of the sequence.
for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
But the loop can iterate not just on numbers, but also lists.
my_list <- c("apple", "orange", "banana")
for (fruit in my_list) {
print(fruit)
}
## [1] "apple"
## [1] "orange"
## [1] "banana"
We can create a vector from which we will add 5 to each element and save it in a new vector. For this, we first need to create an empty vector where we will save the new numbers. Note how we use [] to access the elements in the i positions.
vector <- c(2,3,4,5,2) #Data vector
newdata <- NULL #Vector to store output
for (i in 1:length(vector)){
newdata[i] <- vector[i] + 5
}
newdata
## [1] 7 8 9 10 7
Finally, we can nest loops. A loop inside a loop. Make sure you understand dimensions before trying to understand the nested loop.
vector <- c(1,2,3,4,5) #Data vector
newdata <- matrix(NA, 5,5) #Vector to store output
for (i in 1:5) {
for (j in 1:5) {
newdata[i,j] <- vector[i] * vector[j]
}
}
newdata
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 2 4 6 8 10
## [3,] 3 6 9 12 15
## [4,] 4 8 12 16 20
## [5,] 5 10 15 20 25
A good way to learn more about R is to use SWIRL. This is a user-generated program (also called a package or library) for learning how to code in R. To access this, you must first install the package so it is accessible. In the Console window (bottom left), type the following and press ENTER:
install.packages("swirl")
# Data can also be installed locally if you've downloaded the zip file to your hard drive
#install.packages("yourzip.tar.gz", repos=NULL)
This may take a little while, but when the stop sign in the upper
right of the console window is gone, you can proceed. For any package
you install in R, you will also need to turn them on
before using them. You can do this with the require()
or
library()
functions. Type this now:
library(swirl)
Note: You may be prompted to select a “mirror” from which to download the package. If this is the case, it is recommended that you choose the mirror that is geographically closest to you.
To install the lesson, you will need to use:
install_from_swirl("R Programming")
Find out more about other courses, and other download options here: https://github.com/swirldev/swirl_courses
There are many lessons within R. Once SWIRL is loaded (below), you will be given the option of which lessons to complete. Some of the core lessons can be found in the initial section labeled R Programming (estimated time to compete required lessons is about 2 hours). We recommend to start with the following lessons:
Type the following to begin using SWIRL. Also, when restarting your
session later, you’ll need to “turn on” SWIRL each time with either
library(swirl)
or require(swirl)
.
swirl()
Have fun!
Now that you’ve learned the basics of R programming, we’ll take things a step further.This chapter will walk you through a new set of analyses.
We’ll be working with the dataset “mtcars” which comes pre-loaded with R. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data and to reiterate some principles in statistical modelling. This chapter will help as we move towards our ultimate goal of conducting more advanced analyses on animal point data.
You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.
For this tutorial section I followed the book: Data Analysis with R Statistical Software - A Guidebook for Scientists This is, in my opinion, one of the best resources out there for an introduction to data analysis with R.
Whenever you start working with a new script, you should first set a working directory. If you are working within a R-Studio project, the working directory will automatically be set by default. This directory will contain all the data for your analysis and will be where you will save all the data outputs.
Remember that you can check the current working directory by typing:
getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/R-Tutorials"
Now, let’s change the working directory to a location of your choosing. Create a folder if you don’t have one already, then make sure your working directory is in that folder. If you already have a folder, just set the working directory to the folder you want to use.
setwd("C:/....")
Let’s start investigating a data set to later fit a linear model.
Load the “mtcars”” data set. This dataset comes with R. We will use this dataset as it is intuitive to think about it and is a good example on how to tackle many other datasets.
data(mtcars)
View the first 10 lines of the data set.
head(mtcars, 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Assess the overall structure of the data set to get a sense of the number and type of variables included. When you work with your own data, you will be familiar with the data structure, but it is always good practice to examine your data before moving on to any model fitting. Assure that the data structure of each column of the data frame is correct and/or what you expect it to be.
Note, all columns/variables included in this sample dataset are
numeric-type. You can confirm the data type of each column by typing
is.numeric() next to the variable name (e.g.,
is.numeric(mtcars$mpg)
).
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Now, summarize the data to provide a list of each variable with the mean, min, and max.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
‘Summary’ is a great function to have a quick view at the data. But
what if you want to save the mean, min, or max values of each variable?
There is a family of functions in R that are great for
applying functions to all columns or all rows in a matrix and that
return the result as a vector or list of values. This is the
apply
function.
The apply
function has two main arguments. The MARGIN
that is a 1 or a 2, indicating whether you want to operate on rows (1)
or columns (2) and the FUN arguments that tell R what
function is to be applied. For example, to obtain the mean of each
variable in the mtcars dataset, we do:
apply(mtcars, 2, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Compare with the values reported by the summary
function.
QUESTION: Can you calculate the min and max values for each variable?
Let’s have a quick look at all our variables together (since they are all numeric) by looking at scatter plots of each variable combination. Use the function “pairs” or “plot” on the data set.
plot(mtcars)
You should be able to see cases where there seems to be a strong relationship between two variables. “mpg” vs. “wt” is a good example of this. This is miles per gallon vs. the weight of the car, and this makes sense, a heavier car should consume more petrol per distance. We should transform miles to kilometers, because, what kind of measurement is miles? But in appreciation of all my American friends, we will use miles, just for this exercise. Is the slope here positive or negative?
We can plot these two variables against each other to examine the relationship closer.
plot(mtcars$mpg ~ mtcars$wt)
You could plot any of the variables in the data frame. Plotting the data is one of the simplest ways to look and explore the data.
In R you can customize your plots. We will dedicate
a section to ggplot2. But for now, can you change the ‘x’ and ‘y’
variable names and add a title to this plot? Use the help file
help(plot)
or ?plot
to determine the proper
syntax, or simply Google “add tile to plot in R”.
plot(mtcars$mpg ~ mtcars$wt, xlab="Weight", ylab="MPG", main="MPG vs Weight")
Calculate the correlation coefficient between the two variables, then perform a correlation test to see if they are significantly correlated.
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
cor.test(mtcars$mpg, mtcars$wt)
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
The p-value is very small and the correlation coefficient of -0.8676594 is very high. We also note that this value is negative, meaning that as the weight increases, the fuel efficiency decreases. This makes intuitive sense. We will talk about the difference between correlation and causation later.
Let’s practice some data management before we look into these variables in more detail.
Create a new data set called “my_mtcars”, excluding the variables “vs” and “am” using vector notation. Then look at your new data set to make sure it worked. We don’t necessarily need to remove these variables to continue the analysis. We are simply doing this so that you get more familiar with manipulating data frames.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
my_mtcars <- mtcars[, -c(8,9)] #Remove columns 8 and 9
head(my_mtcars, 10)
## mpg cyl disp hp drat wt qsec gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 4 4
Now, keeping the same name for your data set, exclude the “gear” and
“carb” columns without using vector notation. Instead use the “subset”
function. Check out the help (?subset
) for this function to
figure out how to exclude columns by name.
my_mtcars <- subset(my_mtcars, select = -c(gear, carb))
Note that the initial data of my_mycars with 9 variables no longer exists, because my syntax states to save the modified 7 variable data with the original name.
QUESTION: How could you do this without overwriting the original data?
You should now have a data set called my_mtcars that has 32 observations of 7 variables. Check this.
str(my_mtcars)
## 'data.frame': 32 obs. of 7 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
dim(my_mtcars)
## [1] 32 7
QUESTION: What does
## [1] 32 7
tells us? Hint, recall vector notation.
Another way of checking the number of rows and columns is using
nrow()
and ncol()
functions.
The variable “cyl” represents the number of cylinders in the car
engine. You’ll see this is classified as a numeric variable. However, we
aren’t necessarily interested in this as an actual number. For us, the
number of cylinders is more useful as a grouping mechanisms to serve as
a factor, or categorical variable. Let’s use the as.factor
function to convert it, keeping the same variable name. Then check what
the class of the variable is, to confirm the conversion worked.
my_mtcars$cyl <- as.factor(my_mtcars$cyl)
class(my_mtcars$cyl)
## [1] "factor"
Creating a categorical factor variable will enable us to generate summary statistics and plot data by groups.
We can now use this factor variable to group different operations.
tapply
is a great function to use for grouped operations.
Check the help for ?tapply
and try to calculate the mean of
“mpg” for each factor of the “cyl” variable.
tapply(my_mtcars$mpg, my_mtcars$cyl, mean)
## 4 6 8
## 26.66364 19.74286 15.10000
Now let’s create box plots of our two variables of interest (mpg, wt) to visualize their distribution. First, change your graphic parameter to show two graphs side by side.
Question: How do you think you’d specify the title of each box plot?
par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg, main = "mpg")
boxplot(my_mtcars$wt, main = "weight")
You can see that two points are potential outliers in the plot
forwt
. The plot gives you a tool to make a decision whether
to remove the data points. Here we will keep them.
Since cyl
is a categorical variable, we can also
visualize the distribution of mpg
and wt
across different cylinder classes.
par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg ~ my_mtcars$cyl, main = "mpg")
boxplot(my_mtcars$wt ~ my_mtcars$cyl, main = "weight")
QUESTION: How do you modify the label for x axis and y axis?
Before we move forward, let’s exclude these two observations by using
a logical expression that removes the points (rows in the dataframe) in
the data set where weight is greater than 5.3 tons. There are a few
different ways to do this, as is the case with most things in
R. Let’s use the subset
function
again.
my_mtcars <- subset(my_mtcars, wt <= 5.3)
You should now have a data set with 30 observations of 7 variables.
Note that removing data points from an analysis is not a good statistical practice and when done, it should be justified. But here, the goal is to use different functions on how we can work with data frames.
The most basic R skills is to query and manipulate various data tables. Table manipulation is also something that is almost always required, regardless of what you decide to apply R for. For beginners, familiarizing and reinforcing table manipulation skills to meet different needs is a great way of improving R skills. If you wish to become really good at R, but don’t know where to start, start with tables!
The base R functions that come with the default R installation have
the capacity for almost all the table manipulation you will need (e.g.,
split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()
).
However, sometimes their syntax are less user-friendly and intuitive
than some of the special packages built for table manipulation purposes.
So, here we are introducing a few of the most useful table manipulation
functions within dplyr
package. This is a package I use a
lot.
Note that you will have to use install.packages()
and
library()
function to download and activate the
dplyr
before using it.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now, we will see how different functions of this package work.
select()
We can use select()
to select column(s) that meet an
specific pattern:
select(mtcars, mpg) # select column called mpg (miles per galon)
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
select(mtcars, -mpg) # select all columns in the data except mpg column
## cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 4 121.0 109 4.11 2.780 18.60 1 1 4 2
select(mtcars, hp:vs) # select a continuous block columns starting from hp column and end on vs column
## hp drat wt qsec vs
## Mazda RX4 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 110 3.90 2.875 17.02 0
## Datsun 710 93 3.85 2.320 18.61 1
## Hornet 4 Drive 110 3.08 3.215 19.44 1
## Hornet Sportabout 175 3.15 3.440 17.02 0
## Valiant 105 2.76 3.460 20.22 1
## Duster 360 245 3.21 3.570 15.84 0
## Merc 240D 62 3.69 3.190 20.00 1
## Merc 230 95 3.92 3.150 22.90 1
## Merc 280 123 3.92 3.440 18.30 1
## Merc 280C 123 3.92 3.440 18.90 1
## Merc 450SE 180 3.07 4.070 17.40 0
## Merc 450SL 180 3.07 3.730 17.60 0
## Merc 450SLC 180 3.07 3.780 18.00 0
## Cadillac Fleetwood 205 2.93 5.250 17.98 0
## Lincoln Continental 215 3.00 5.424 17.82 0
## Chrysler Imperial 230 3.23 5.345 17.42 0
## Fiat 128 66 4.08 2.200 19.47 1
## Honda Civic 52 4.93 1.615 18.52 1
## Toyota Corolla 65 4.22 1.835 19.90 1
## Toyota Corona 97 3.70 2.465 20.01 1
## Dodge Challenger 150 2.76 3.520 16.87 0
## AMC Javelin 150 3.15 3.435 17.30 0
## Camaro Z28 245 3.73 3.840 15.41 0
## Pontiac Firebird 175 3.08 3.845 17.05 0
## Fiat X1-9 66 4.08 1.935 18.90 1
## Porsche 914-2 91 4.43 2.140 16.70 0
## Lotus Europa 113 3.77 1.513 16.90 1
## Ford Pantera L 264 4.22 3.170 14.50 0
## Ferrari Dino 175 3.62 2.770 15.50 0
## Maserati Bora 335 3.54 3.570 14.60 0
## Volvo 142E 109 4.11 2.780 18.60 1
select(mtcars, starts_with("c")) # select all columns that start with "c" in their column names
## cyl carb
## Mazda RX4 6 4
## Mazda RX4 Wag 6 4
## Datsun 710 4 1
## Hornet 4 Drive 6 1
## Hornet Sportabout 8 2
## Valiant 6 1
## Duster 360 8 4
## Merc 240D 4 2
## Merc 230 4 2
## Merc 280 6 4
## Merc 280C 6 4
## Merc 450SE 8 3
## Merc 450SL 8 3
## Merc 450SLC 8 3
## Cadillac Fleetwood 8 4
## Lincoln Continental 8 4
## Chrysler Imperial 8 4
## Fiat 128 4 1
## Honda Civic 4 2
## Toyota Corolla 4 1
## Toyota Corona 4 1
## Dodge Challenger 8 2
## AMC Javelin 8 2
## Camaro Z28 8 4
## Pontiac Firebird 8 2
## Fiat X1-9 4 1
## Porsche 914-2 4 2
## Lotus Europa 4 2
## Ford Pantera L 8 4
## Ferrari Dino 6 6
## Maserati Bora 8 8
## Volvo 142E 4 2
starts_with
argument is very convenient because it
allow you to select multiple columns that start with the same text. A
few similar arguments are available to define other naming
patterns.ends_with()
= Select columns that end with a character
string.contains()
= Select columns that contain a character
string.matches()
= Select columns that match a regular
expression.one_of()
= Select columns names that are from a group of
names.Question: How do you select or exclude two columns: mpg and cyl?
filter()
Filter/select row(s) of data based on specific requirement of column(s) values:
filter(mtcars, cyl %in% c(4,6)) # select cars with 4 or 6 cylinders
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
filter (mtcars, mpg>20) # select rows that have mpg > 20
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
filter(mtcars, mpg>20 & cyl == 6) # select rows that have mpg>20 AND cyl == 6
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
filter(mtcars, mpg>20 | hp > 100) # select rows that have mpg>20 OR hp > 100
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Question 1: How do you select rows with 6 and 8 cylinders knowing that cyl is a factor?
The pipe operator allows you to pipe the output from one function to
the input of the next function. Instead of nesting functions (reading
from the inside to the outside), the idea of of piping is to read the
functions from left to right. It can also help you avoid creating and
saving a lot of intermediate variables that you don’t need to keep. The
old operator for pipes was %>%
, but now a new version
has been introduced, |>
# old operator
pipe_result<- mtcars %>%
select(mpg, cyl) %>%
head()
pipe_result
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
# new operator
pipe_result<- mtcars |>
select(mpg, cyl) |>
head()
pipe_result
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
arrange()
This function arranges or re-orders rows based on their value, the rows are arranged by default in ascending order
order_data1<- mtcars %>%
arrange(mpg)
order_data1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
order_data2<- mtcars %>%
arrange(cyl, mpg)
order_data2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# Now we learn pipe operator, can you understand what order_data1 and order_data2 are producing?
Question: Can you arrange the table first by wt and then by hp in decending order?
mutate()
The mutate()
command creates new column(s) and define
their values. For instance, we can create a new column with random
numbers between 0 and 100 using a uniform distribution.
new_col<- mtcars %>%
mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100)) # note the use of nrow to get the exact number of random numbers as there are rows in the dataframe.
new_col
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## new_col
## Mazda RX4 68.0082178
## Mazda RX4 Wag 22.6670911
## Datsun 710 47.3906628
## Hornet 4 Drive 87.9783226
## Hornet Sportabout 38.6728120
## Valiant 43.2574856
## Duster 360 79.9368094
## Merc 240D 45.4630289
## Merc 230 84.9415006
## Merc 280 53.0237370
## Merc 280C 1.4590759
## Merc 450SE 62.0795711
## Merc 450SL 81.5701176
## Merc 450SLC 32.5959353
## Cadillac Fleetwood 52.1947961
## Lincoln Continental 45.9956347
## Chrysler Imperial 0.9429407
## Fiat 128 62.2708343
## Honda Civic 72.3534224
## Toyota Corolla 9.3948067
## Toyota Corona 43.3100333
## Dodge Challenger 39.3705221
## AMC Javelin 53.8806726
## Camaro Z28 29.9311936
## Pontiac Firebird 26.0978157
## Fiat X1-9 99.9855554
## Porsche 914-2 46.2409172
## Lotus Europa 25.5154344
## Ford Pantera L 7.8714678
## Ferrari Dino 93.9733470
## Maserati Bora 17.3660055
## Volvo 142E 71.0062781
new_col2 <- mtcars %>%
mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100), wt_kg = wt * 1000) # you can create multiple columns at once. Lets transform the weight from tones to kg
new_col2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## new_col wt_kg
## Mazda RX4 39.431255 2620
## Mazda RX4 Wag 30.763502 2875
## Datsun 710 25.109426 2320
## Hornet 4 Drive 86.945529 3215
## Hornet Sportabout 39.114930 3440
## Valiant 40.612382 3460
## Duster 360 2.979379 3570
## Merc 240D 64.125189 3190
## Merc 230 5.158186 3150
## Merc 280 52.688352 3440
## Merc 280C 68.500006 3440
## Merc 450SE 11.045863 4070
## Merc 450SL 58.549967 3730
## Merc 450SLC 61.971348 3780
## Cadillac Fleetwood 87.852788 5250
## Lincoln Continental 42.586525 5424
## Chrysler Imperial 45.012744 5345
## Fiat 128 4.594393 2200
## Honda Civic 15.298820 1615
## Toyota Corolla 82.605559 1835
## Toyota Corona 55.259505 2465
## Dodge Challenger 62.798611 3520
## AMC Javelin 93.932539 3435
## Camaro Z28 92.219728 3840
## Pontiac Firebird 63.116971 3845
## Fiat X1-9 2.939228 1935
## Porsche 914-2 60.078523 2140
## Lotus Europa 4.487375 1513
## Ford Pantera L 23.166902 3170
## Ferrari Dino 6.026278 2770
## Maserati Bora 50.565610 3570
## Volvo 142E 1.647168 2780
Can you create a new column call zero and give it a value of 0 ?
summarise()
This function calculates a summary statistics among all rows or rows
within certain grouping, often used in combination with
group_by()
sum_table <- mtcars %>%
summarise(mean(wt))
sum_table
## mean(wt)
## 1 3.21725
sum_table2 <- mtcars%>%
summarise(avg_wt= mean(wt), min_wt= min(wt))
sum_table2
## avg_wt min_wt
## 1 3.21725 1.513
group_by()
This is a great function. group_by()
divides data rows
into groups based on grouping column(s) provided, often used in
combination with other functions which define what you do with them
after placing them in groups. When group_by()
and
summarise()
are used together, you are essentially telling
R to separate rows into different groups, and for each groups you use
summarise()
to generate a series of summary statistics that
characterize the column values.
group_summary <- mtcars |>
group_by(cyl) |>
summarise(avg_wt= mean(wt), min_wt= min(wt))
group_summary
## # A tibble: 3 × 3
## cyl avg_wt min_wt
## <dbl> <dbl> <dbl>
## 1 4 2.29 1.51
## 2 6 3.12 2.62
## 3 8 4.00 3.17
You can also create groups by the combination of two or multiple columns
group_summary2<- mtcars |>
group_by(cyl, gear) |>
summarise(avg_wt= mean(wt), min_wt= min(wt))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
group_summary2
## # A tibble: 8 × 4
## # Groups: cyl [3]
## cyl gear avg_wt min_wt
## <dbl> <dbl> <dbl> <dbl>
## 1 4 3 2.46 2.46
## 2 4 4 2.38 1.62
## 3 4 5 1.83 1.51
## 4 6 3 3.34 3.22
## 5 6 4 3.09 2.62
## 6 6 5 2.77 2.77
## 7 8 3 4.10 3.44
## 8 8 5 3.37 3.17
Let’s read two other tables that are available in the package ‘dplyr’
data(band_members)
band_members
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
data(band_instruments)
band_instruments
## # A tibble: 3 × 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join()
Returns all records from the left table, and the matched (matched by shared ID columns) records from the right table. The result is NULL from the right side, if there is no match.
left_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
full_join()
Return all records in both tables.
full_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 4 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
Other than left_join()
and full_join()
there are a few other join functions that would join table differently.
Sometimes you will find them useful for a specific table you want to
make, but in general, you can reply on left_join()
for most
of the job.
right_join()
: similar to left join, it keeps all
records from the right table instead.
inner_join()
: return only the matched records from
both tables.
This has been a glimpse to what can be done in R to work with tabular data. There are plenty other packages that in time you will learn by Googling and learning from other people, but for now, all the functions we covered are a very good set of tools to do most of what you will need.
Let’s move on into data analysis.
When building an statistical model, you need to specify a dependent variable (what you are trying to explain) and one or more independent variables (the variables that you are using to explain the dependent variable).
Thus, to build a model you will first need to identify the dependent and independent variables which come from the research question. Once you have identified the variables, you need to decide what kind of model structure is appropriate for the data set at hand. For this reason, knowing models and model structures can help you design the data collection process as you have a good idea of what type of data you need for a specific model.
Running models can be easy, but before you can make interpretations from the model, you need to check that the assumptions of the model are valid (model validation), decide what is the best model structure (model selection) and then, finally, interpret the parameters of the model and/or make predictions.
The general structure for defining any model in R is:
model.type(dependent.variable ~ independent.variable1 + independent.variable2, data = dataframe)
Note that in the model equation you do not have to specify the intercept. We have assumed in this example that there are 2 different independent variables in the example provided (independent.variable1 and independent.variable2), but the model could contain any number of independent variables.
The dependent variable can either be continuous or categorical data.
We refer to these separate parameterizations as a regression model or a
classification model, respectively. Most of the models we demonstrate
here are regression models, but the structure to build classification
models in R are almost identical. In most cases, the
data type of the dependent variable determines whether the model is a
regression (non-categorical numeric-type dependent variable) or
classification (categorical factor-type dependent variable). You can use
functions such as as.factor()
and as.numeric()
to convert different data types of data to a factor or numeric variable.
Note, just because a data set contains numbers, does
NOT necessarily mean the numbers are numeric-type.
Numbers can be used as symbols to differentiate categories as well. It
is always a good practice to confirm the data type of the dependent and
independent variables that you are inputing into the model.
We will start with a simple linear regression analysis.
QUESTION: What is the difference between correlation and regression?
Model design involves deciding what we are trying to explain (i.e., the dependent variable) and what we are going to use to explain it (i.e., the independent variables). These are questions that are informed by the researcher.
Then, we need to decide what kind of model structure is appropriate to the data set. In our case, we will start with a simple linear regression model. Later in the course, however, we will investigate more complex model structures.
Run a basic linear regression model, attempting to predict/explain
the fuel efficiency of a car (mpg) based on its weight. Finally, assign
the model output “lm1”. Remember to specify the my_mtcars
data set in the model so the R understands that the
name of dependent and independent variables are columns in that specific
dataframe.
lm1 <- lm(mpg ~ wt, data = my_mtcars)
Once you have fit the model, you can use the summary function to investigate the beta coefficient, SE, and intercept values.
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = my_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1820 -2.4603 -0.4526 1.5046 6.3191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.913 2.002 19.936 < 2e-16 ***
## wt -6.287 0.630 -9.979 1.01e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.793 on 28 degrees of freedom
## Multiple R-squared: 0.7805, Adjusted R-squared: 0.7727
## F-statistic: 99.59 on 1 and 28 DF, p-value: 1.007e-10
IMPORTANT: Before you can trust your model, you need to examine whether the assumptions on which the model is based are actually valid. If the assumptions are not satisfied, then your model will be unreliable, at least to some extent!
If you apply the function plot to the model, R will provide a series of plots that will help you to inspect model assumptions. Click enter to get all the plots.
plot(lm1)
Check the Q-Q (quantile-quantile) plot. A perfect normal distribution is represented by a straight line. Your residuals should be close to a straight line.
The variability of the residuals should be uniform across the range of fitted values of the dependent variable. In the plot of residuals vs fitted values of y and standardized residuals vs. fitted values there should be a “scatter” of dots across the graphs, with no discernible pattern. Points should be randomly distributed. If you have a scatter of points on one side of the graph, your data may NOT be homogeneous.
There are different ways you can address this before you fit your model. For instance, you can estimate the correlation of each pair of covariates and discard variables or exclude them from analysis if they highly correlated (positively or negatively). We will examine this later in the course when dealing with more complex models.
Serial autocorrelation is when there is significant correlation between successive data points in the data set. Spatial data or time-series data tend to be autocorrelated. There are different ways to deal with autocorrelation, such as using mixed-effect models.
We can check this by looking at the plot of standardized residuals vs leverage and “Cook’s Distance.”
Leverage is a measure of the influence of individual data points on the model’s parameters, measured on the scale of 0-1, where high values indicate a strong influence on the model’s parameters.
Cook’s distance is the sum of squared distances between the fitted values using the whole data set, and the fitted values with the ith observation removed. A large difference indicates that the ith observation exerts a strong influence on the model’s parameters.
No values beyond 1.
QUESTION: How does the QQ plot look? Does it indicate a potential problem?
If your standardized residuals are not normally distributed you can transform the dependent variable.
Let’s try a log transformation, which is commonly used:
lm1 <- lm(log(mpg) ~ wt, data = my_mtcars)
plot(lm1)