Geographic Information Systems (GIS), satellite imagery, modeling, and mapping have become essential tools for many researchers, natural resource managers, and conservationists. Yet these resources are often not readily accessible to practitioners around the world. Most restrictive is the limited access to specialized and expensive commercial software. In recent years, several powerful open-source tools for geospatial analysis have emerged. R is a widely used statistical programming language with considerable geospatial analysis capabilities. This training focuses on performing geospatial work in R.
By the end of these tutorials, you will be able to:
This content is the product of years of work and collaboration with many amazing people, some of whom contributed code. We thank Dr. Quingyu Huang, Dr. Grant Connette, and Dr. Joseph Kolowski for their input.
During the course, students will use R and RStudio. R is a programming language and free software environment for statistical computing widely used by researchers. RStudio is an integrated development environment for R that makes the software more user friendly. Both R and RStudio are compatible with the most common operating systems and are free to use. Before the course begins, you should have the software installed on your personal computer. Installation files can be accessed at:
- RStudio Desktop: https://rstudio.com/products/rstudio/
All the data used in the tutorials are available at this link.
R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.
Among the reasons to use R are:
R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.
Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.
R is available for Linux, MacOS X, and Windows (95 or later) platforms. Software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work in a script file, where you can write and save your work, and use other windows that show up on demand, such as the plot tab (Fig. 1).
RStudio is an enterprise-ready professional software tool that integrates with R. It has some nice features beyond the normal R interface, and many users find it easier to use than R alone (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.
One of the most useful commands in R is ?. At the command prompt (signified by > in your Console window), type ? followed by any command and you will be prompted with a help tab for that command (e.g., ?mean; Fig. 3). You can also search the help tab directly using its search bar.
The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.
There are a few concepts that are important to keep in mind before
you start coding. The fact that R is a programming
language may deter some users who think “I can’t program”. This should
not be the case for two reasons. First, R is an
interpreted language, not a compiled one, meaning that all commands
typed on the keyboard are directly executed without requiring you to
build a complete program like in most computer languages (C, Pascal, . .
. ). Second, R’s syntax is very simple and intuitive.
For instance, a linear regression can be done with the command
lm(y ~ x)
which means fitting a linear model with y as the
response and x as a predictor.
In R, in order to be executed, a function always
needs to be written with parentheses, even if there is nothing within
them (e.g., ls()
). If you type the name of a function
without parentheses, R will display the content of the
function.
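For example, a quick illustration of this behavior using the base function ls():
ls()   # calling the function (with parentheses) lists the objects in your workspace
ls     # typing the name without parentheses prints the function's definition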
When R is running, variables, data, functions, results, etc., are stored in the active memory of the computer as objects to which you assign a name. The user can perform actions on these objects with operators (arithmetic, logical, comparison, ...) and functions (which are themselves objects).
The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).
When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.
It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).
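For instance, a small illustration of these rules (the object names here are just examples):
my_result.1 <- 5   # a valid name: starts with a letter, uses a dot and a digit
x <- 1
X <- 10
x == X             # FALSE, because x and X are two distinct objects
## [1] FALSE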
Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:
getwd()
Use setwd()
to change or set a new working directory.
For instance, you can set your working directory to be in your Documents
folder on the C: drive, or in any folder you prefer.
setwd("C:/Documents/R_Practice")
In RStudio, you can do some basic operations, such as setting your working directory, using the point-and-click interface: Session > Set Working Directory > Choose Directory … as shown in Fig. 4.
You can also work with R projects. R projects have the advantage that they enhance organization, collaboration, and reproducibility in R-based work by providing a structured and isolated workspace for your projects. They contribute to a more efficient and transparent workflow, particularly when working on multiple projects or collaborating with others.
The main advantage is that R Projects create a dedicated workspace for your R-related work, keeping all files, scripts, data, and outputs within a well-defined directory. This isolation helps avoid conflicts between projects and ensures a clean environment for each project. R projects use relative paths, allowing you to refer to files and directories within the project without specifying the full path. This enhances portability and makes it easier to share your project with others.
To start a new R Project click File > New Project or directly click on the R Project icon and follow the directions. You can set up your new project on a new folder or an existing folder.
There are four fundamental data types in R that you will work with:
You can check the data type of an object using the function class(). To convert between data types you can use: as.integer(), as.numeric(), as.logical(), as.character().
For instance:
city <- 'Nairobi'
class(city)
## [1] "character"
number <- 3
class(number)
## [1] "numeric"
Integer <- as.integer(number)
class(Integer)
## [1] "integer"
logical <- 3 > 5
logical
## [1] FALSE
Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that values are case sensitive; ‘x’ is not the same as ‘X’!
city <- "front royal"
summary(city)
number <- 2
summary(number)
character <- as.character(2)
Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:
table <- read.csv(file="data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder
Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.
The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.
#This is a comment
The semicolon (;) acts as a command separator, allowing you to write several commands on the same line of code.
a <- 3; b <- 6; c <- a+b
a
## [1] 3
b
## [1] 6
c
## [1] 9
Vectors are a basic data structure in R. They contain a sequence of data and can hold characters, numbers, or TRUE/FALSE values. Remember: if you are unsure or need help, use the help function (e.g., help(seq) or ?seq).
Below are several ways to create vectors in R.
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
rep(1:5,5)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep("A rolling stone gathers no moss",4)
## [1] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
## [3] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
x <- 1:10
y <- c(1.5, 9, 3.46, 12.2)
To see only part (i.e., a subset) of the data stored in a vector, you
need to “ask” R to extract the information you want
using square brackets. Most commonly, you will indicate in square
brackets the position of the data you want to extract (from beginning of
the vector [1] to the Nth slot in the vector [n]). If you only wanted to
see the first 5 values of ‘x’, how would you do that? Or only the 2nd
and 4th element of y? What if you wanted to see all the records in
y
, except for the 2nd and 3rd records? There are more ways
to use notation to select subsets of data, which we will cover in more
detail below.
x
## [1] 1 2 3 4 5 6 7 8 9 10
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
x[1:5]
## [1] 1 2 3 4 5
y[c(2,4)]
## [1] 9.0 12.2
y[-c(2,3)]
## [1] 1.5 12.2
Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.
Create a matrix with 4 rows and 5 columns using the data from
x
above. Consult the help (e.g., help(matrix)
or ?matrix
) to determine the syntax required.
test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 3 7
## [2,] 2 6 10 4 8
## [3,] 3 7 1 5 9
## [4,] 4 8 2 6 10
# Note, I can assign any name to an object that I create. Generally it is best to name things in a way that is meaningful, but we'll have some fun here!
superman <- matrix(data = x, nrow = 4, ncol = 5)
Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:
#test_matrix(row,column)
test_matrix[1,5]
## [1] 7
In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!
For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.
test_matrix[,5]
## [1] 7 8 9 10
What about the 4th row?
test_matrix[4,]
## [1] 4 8 2 6 10
What if we wanted to view the values in the 3rd row, but only the 2nd and 4th columns?
test_matrix[3,c(2,4)]
## [1] 7 5
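As a small additional sketch, two of the other notations mentioned above also work on test_matrix (outputs omitted here):
test_matrix[-2, ]                            # a negative integer excludes the 2nd row
test_matrix[c(TRUE, FALSE, TRUE, FALSE), ]   # logical values keep the 1st and 3rd rows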
What happens to the matrix if we append a character field? Use the cbind() (column bind) command to bind a new column, called ‘countries’. Note that I am not changing the contents of test_matrix. Can you figure out how to do a row bind (hint: use rbind())?
countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
## countries
## [1,] "1" "5" "9" "3" "7" "United States"
## [2,] "2" "6" "10" "4" "8" "Pakistan"
## [3,] "3" "7" "1" "5" "9" "Ireland"
## [4,] "4" "8" "2" "6" "10" "China"
#Note that I am not changing/overwriting the contents of test_matrix. I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)
Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?
test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
## X1 X2 X3 X4 X5 countries
## 1 1 5 9 3 7 United States
## 2 2 6 10 4 8 Pakistan
## 3 3 7 1 5 9 Ireland
## 4 4 8 2 6 10 China
# Have I changed the file type?
class(test_dataframe)
## [1] "data.frame"
Can I rename the column headings?
names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
## Val1 Val2 Val3 Val4 Val5 Countries
## 1 1 5 9 3 7 United States
## 2 2 6 10 4 8 Pakistan
## 3 3 7 1 5 9 Ireland
## 4 4 8 2 6 10 China
# Also see the colnames() function
Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?
test_dataframe[3,5]
## [1] 9
test_dataframe[,5]
## [1] 7 8 9 10
test_dataframe$Val5[3]
## [1] 9
test_dataframe$Val5
## [1] 7 8 9 10
test_dataframe[,"Val5"]
## [1] 7 8 9 10
You can also use some very simple commands to determine the size of dataframes or matrices.
nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6
You can delete individual objects to clear your workspace (rm(dataset)), or start every script with the following command to make sure you are starting fresh (this is good programming practice):
#rm(list=ls())
R functions can be defined as a collection of instructions structured together to carry out a specific task. Functions take optional input arguments and return output values. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions from base packages or other downloadable packages.
Most functions have optional arguments with default values (in the function’s help document, under the ‘Usage’ section, optional arguments are shown with a default value following the “=” symbol). When you don’t specify the optional arguments, they take their default values. Functions are normally called using the following format: function_name(input_data, argument1, argument2, ...).
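For instance, the base function round() has an optional digits argument with a default value of 0:
round(3.14159)             # digits is left at its default of 0
## [1] 3
round(3.14159, digits = 2) # here we override the default
## [1] 3.14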
print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
y <- matrix(1:5)
y
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
df.example <- cbind(x, y)
df.example
## [,1] [,2] [,3]
## [1,] 1 6 1
## [2,] 2 7 2
## [3,] 3 8 3
## [4,] 4 9 4
## [5,] 5 10 5
?function_name can load the function help file. Also note that any functions in non-base packages will require installing and loading that package.
Here, for example, we install and load package named “ggplot2” that we will use for data visualization.
install.packages("ggplot2")
library(ggplot2)
R also contains many pre-existing functions in the base software. Numeric functions include sum(), mean(), sd(), min(), max(), median(), range(), quantile(), and summary(). Try a few of these on the numeric vectors you have created.
sum(x)
## [1] 55
summary(x)
## V1 V2
## Min. :1 Min. : 6
## 1st Qu.:2 1st Qu.: 7
## Median :3 Median : 8
## Mean :3 Mean : 8
## 3rd Qu.:4 3rd Qu.: 9
## Max. :5 Max. :10
range(y)
## [1] 1 5
R can be used to perform basic calculations and report the results back to the user.
4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260
Exponentiation: ^
2^3
## [1] 8
Max and Min: max(), min()
vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers)
## [1] 10
min(vector_numbers)
## [1] 2
Can you calculate the square root and then subtract 5 from each element in vector_numbers?
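One possible approach, as a minimal sketch:
sqrt(vector_numbers) - 5   # take the square root of each element, then subtract 5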
The relational operators <, >, ==, !=, >=, and <= evaluate a conditional expression and return TRUE or FALSE.
3 > max(c(2,3,4,5))
## [1] FALSE
The for loop is used to iterate over a sequence (a numeric vector, list, or other iterable object). Loops are very important for performing operations and are essential building blocks of many advanced models.
Here is a simple for loop that prints the numbers 1 to 5. As you can see, i is the element over which the loop iterates. It takes values from 1 to 5, and the loop ends when it reaches the last element of the sequence.
for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
The loop can iterate not just over numbers, but also over lists.
my_list <- c("apple", "orange", "banana")
for (fruit in my_list) {
print(fruit)
}
## [1] "apple"
## [1] "orange"
## [1] "banana"
We can create a vector, add 5 to each of its elements, and save the result in a new vector. For this, we first need to create an empty vector where we will store the new numbers. Note how we use [] to access the element at position i.
vector <- c(2,3,4,5,2) #Data vector
newdata <- NULL #Vector to store output
for (i in 1:length(vector)){
newdata[i] <- vector[i] + 5
}
newdata
## [1] 7 8 9 10 7
Finally, we can nest loops. A loop inside a loop. Make sure you understand dimensions before trying to understand the nested loop.
vector <- c(1,2,3,4,5) #Data vector
newdata <- matrix(NA, 5,5) #Matrix to store output
for (i in 1:5) {
for (j in 1:5) {
newdata[i,j] <- vector[i] * vector[j]
}
}
newdata
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 2 4 6 8 10
## [3,] 3 6 9 12 15
## [4,] 4 8 12 16 20
## [5,] 5 10 15 20 25
A good way to learn more about R is to use SWIRL. This is a user-generated program (also called a package or library) for learning how to code in R. To access this, you must first install the package so it is accessible. In the Console window (bottom left), type the following and press ENTER:
install.packages("swirl")
# Packages can also be installed locally if you've downloaded the zip file to your hard drive
#install.packages("yourzip.tar.gz", repos=NULL)
This may take a little while, but when the stop sign in the upper right of the console window is gone, you can proceed. Any package you install in R also needs to be loaded before you can use it. You can do this with the require() or library() functions. Type this now:
library(swirl)
Note: You may be prompted to select a “mirror” from which to download the package. If this is the case, it is recommended that you choose the mirror that is geographically closest to you.
To install the lesson, you will need to use:
install_from_swirl("R Programming")
Find out more about other courses, and other download options here: https://github.com/swirldev/swirl_courses
There are many lessons within R. Once SWIRL is loaded (below), you will be given the option of which lessons to complete. Some of the core lessons can be found in the initial section labeled R Programming (estimated time to complete the required lessons is about 2 hours). We recommend starting with the following lessons:
Type the following to begin using SWIRL. Also, when restarting your
session later, you’ll need to “turn on” SWIRL each time with either
library(swirl)
or require(swirl)
.
swirl()
Have fun!
Now that you’ve learned the basics of R programming, we’ll take things a step further. This chapter will walk you through a new set of analyses.
We’ll be working with the dataset “mtcars” which comes pre-loaded with R. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data and to reiterate some principles in statistical modelling. This chapter will help as we move towards our ultimate goal of conducting more advanced analyses on animal point data.
You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.
For this tutorial section I followed the book Data Analysis with R Statistical Software - A Guidebook for Scientists. This is, in my opinion, one of the best resources out there for an introduction to data analysis with R.
Whenever you start working with a new script, you should first set a working directory. If you are working within an RStudio project, the working directory will be set automatically by default. This directory will contain all the data for your analysis and will be where you save all the data outputs.
Remember that you can check the current working directory by typing:
getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/R-Tutorials"
Now, let’s change the working directory to a location of your choosing. Create a folder if you don’t have one already, then make sure your working directory is in that folder. If you already have a folder, just set the working directory to the folder you want to use.
setwd("C:/....")
Let’s start investigating a data set to later fit a linear model.
Load the “mtcars” data set. This dataset comes with R. We will use it because it is intuitive to think about and is a good example of how to tackle many other datasets.
data(mtcars)
View the first 10 lines of the data set.
head(mtcars, 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Assess the overall structure of the data set to get a sense of the number and type of variables included. When you work with your own data, you will be familiar with the data structure, but it is always good practice to examine your data before moving on to any model fitting. Ensure that the data structure of each column of the data frame is correct and/or what you expect it to be.
Note that all columns/variables included in this sample dataset are numeric. You can confirm the data type of each column by calling is.numeric() on the variable (e.g., is.numeric(mtcars$mpg)).
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Now, summarize the data to provide a list of each variable with the mean, min, and max.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
‘Summary’ is a great function for getting a quick view of the data. But what if you want to save the mean, min, or max values of each variable? There is a family of functions in R that are great for applying a function to all columns or all rows of a matrix and returning the result as a vector or list of values. This is the apply function.
The apply function has two main arguments: MARGIN, which is 1 or 2, indicating whether you want to operate on rows (1) or columns (2), and FUN, which tells R what function to apply. For example, to obtain the mean of each variable in the mtcars dataset, we do:
apply(mtcars, 2, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Compare with the values reported by the summary
function.
QUESTION: Can you calculate the min and max values for each variable?
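One possible answer, as a minimal sketch using the same apply() syntax:
apply(mtcars, 2, min)   # minimum of each column
apply(mtcars, 2, max)   # maximum of each column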
Let’s have a quick look at all our variables together (since they are all numeric) by looking at scatter plots of each variable combination. Use the function “pairs” or “plot” on the data set.
plot(mtcars)
You should be able to see cases where there seems to be a strong relationship between two variables. “mpg” vs. “wt” is a good example of this. This is miles per gallon vs. the weight of the car, and it makes sense: a heavier car should consume more petrol per distance. We should transform miles to kilometers, because, what kind of measurement is miles? But in appreciation of all my American friends, we will use miles, just for this exercise. Is the slope here positive or negative?
We can plot these two variables against each other to examine the relationship closer.
plot(mtcars$mpg ~ mtcars$wt)
You could plot any of the variables in the data frame. Plotting the data is one of the simplest ways to look at and explore the data.
In R you can customize your plots. We will dedicate a section to ggplot2. But for now, can you change the ‘x’ and ‘y’ variable names and add a title to this plot? Use the help file help(plot) or ?plot to determine the proper syntax, or simply Google “add title to plot in R”.
plot(mtcars$mpg ~ mtcars$wt, xlab="Weight", ylab="MPG", main="MPG vs Weight")
Calculate the correlation coefficient between the two variables, then perform a correlation test to see if they are significantly correlated.
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
cor.test(mtcars$mpg, mtcars$wt)
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
The p-value is very small and the correlation coefficient of -0.8676594 is very high. We also note that this value is negative, meaning that as the weight increases, the fuel efficiency decreases. This makes intuitive sense. We will talk about the difference between correlation and causation later.
Let’s practice some data management before we look into these variables in more detail.
Create a new data set called “my_mtcars”, excluding the variables “vs” and “am” using vector notation. Then look at your new data set to make sure it worked. We don’t necessarily need to remove these variables to continue the analysis. We are simply doing this so that you get more familiar with manipulating data frames.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
my_mtcars <- mtcars[, -c(8,9)] #Remove columns 8 and 9
head(my_mtcars, 10)
## mpg cyl disp hp drat wt qsec gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 4 4
Now, keeping the same name for your data set, exclude the “gear” and
“carb” columns without using vector notation. Instead use the “subset”
function. Check out the help (?subset
) for this function to
figure out how to exclude columns by name.
my_mtcars <- subset(my_mtcars, select = -c(gear, carb))
Note that the initial my_mtcars data with 9 variables no longer exists, because the syntax saves the modified 7-variable data under the original name.
QUESTION: How could you do this without overwriting the original data?
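One minimal sketch (using a hypothetical object name, and assuming it is run instead of the line above): simply assign the result to a new name so the original is preserved.
my_mtcars_v2 <- subset(my_mtcars, select = -c(gear, carb))   # my_mtcars itself is left unchanged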
You should now have a data set called my_mtcars that has 32 observations of 7 variables. Check this.
str(my_mtcars)
## 'data.frame': 32 obs. of 7 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
dim(my_mtcars)
## [1] 32 7
QUESTION: What does ## [1] 32 7 tell us? Hint: recall vector notation.
Another way of checking the number of rows and columns is using the nrow() and ncol() functions.
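For example, on the current data set:
nrow(my_mtcars)
## [1] 32
ncol(my_mtcars)
## [1] 7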
The variable “cyl” represents the number of cylinders in the car engine. You’ll see this is classified as a numeric variable. However, we aren’t necessarily interested in this as an actual number. For us, the number of cylinders is more useful as a grouping mechanism, serving as a factor, or categorical variable. Let’s use the as.factor function to convert it, keeping the same variable name. Then check the class of the variable to confirm the conversion worked.
my_mtcars$cyl <- as.factor(my_mtcars$cyl)
class(my_mtcars$cyl)
## [1] "factor"
Creating a categorical factor variable will enable us to generate summary statistics and plot data by groups.
We can now use this factor variable to group different operations. tapply is a great function to use for grouped operations. Check the help for ?tapply and try to calculate the mean of “mpg” for each level of the “cyl” variable.
tapply(my_mtcars$mpg, my_mtcars$cyl, mean)
## 4 6 8
## 26.66364 19.74286 15.10000
Now let’s create box plots of our two variables of interest (mpg, wt) to visualize their distribution. First, change your graphic parameter to show two graphs side by side.
Question: How do you think you’d specify the title of each box plot?
par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg, main = "mpg")
boxplot(my_mtcars$wt, main = "weight")
You can see that two points are potential outliers in the plot for wt. The plot gives you a tool to decide whether to remove these data points. Here we will keep them.
Since cyl
is a categorical variable, we can also
visualize the distribution of mpg
and wt
across different cylinder classes.
par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg ~ my_mtcars$cyl, main = "mpg")
boxplot(my_mtcars$wt ~ my_mtcars$cyl, main = "weight")
QUESTION: How do you modify the label for x axis and y axis?
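One way, as a small sketch using the xlab and ylab arguments of boxplot() (the label text here is just an example):
boxplot(my_mtcars$mpg ~ my_mtcars$cyl, main = "mpg", xlab = "Number of cylinders", ylab = "Miles per gallon")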
Before we move forward, let’s exclude these two observations by using
a logical expression that removes the points (rows in the dataframe) in
the data set where weight is greater than 5.3 tons. There are a few
different ways to do this, as is the case with most things in
R. Let’s use the subset
function
again.
my_mtcars <- subset(my_mtcars, wt <= 5.3)
You should now have a data set with 30 observations of 7 variables.
Note that removing data points from an analysis is not a good statistical practice and when done, it should be justified. But here, the goal is to use different functions on how we can work with data frames.
One of the most basic R skills is querying and manipulating data tables. Table manipulation is also something that is almost always required, regardless of what you decide to apply R for. For beginners, familiarizing yourself with and reinforcing table manipulation skills to meet different needs is a great way of improving your R skills. If you wish to become really good at R but don’t know where to start, start with tables!
The base R functions that come with the default R installation have the capacity for almost all the table manipulation you will need (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, their syntax is sometimes less user-friendly and intuitive than that of some of the special packages built for table manipulation. So, here we introduce a few of the most useful table manipulation functions from the dplyr package. This is a package I use a lot.
Note that you will have to use the install.packages() and library() functions to download and activate dplyr before using it.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now, we will see how different functions of this package work.
select()
We can use select() to select column(s) that meet a specific pattern:
select(mtcars, mpg) # select column called mpg (miles per gallon)
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
select(mtcars, -mpg) # select all columns in the data except mpg column
## cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 4 121.0 109 4.11 2.780 18.60 1 1 4 2
select(mtcars, hp:vs) # select a continuous block of columns starting at the hp column and ending at the vs column
## hp drat wt qsec vs
## Mazda RX4 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 110 3.90 2.875 17.02 0
## Datsun 710 93 3.85 2.320 18.61 1
## Hornet 4 Drive 110 3.08 3.215 19.44 1
## Hornet Sportabout 175 3.15 3.440 17.02 0
## Valiant 105 2.76 3.460 20.22 1
## Duster 360 245 3.21 3.570 15.84 0
## Merc 240D 62 3.69 3.190 20.00 1
## Merc 230 95 3.92 3.150 22.90 1
## Merc 280 123 3.92 3.440 18.30 1
## Merc 280C 123 3.92 3.440 18.90 1
## Merc 450SE 180 3.07 4.070 17.40 0
## Merc 450SL 180 3.07 3.730 17.60 0
## Merc 450SLC 180 3.07 3.780 18.00 0
## Cadillac Fleetwood 205 2.93 5.250 17.98 0
## Lincoln Continental 215 3.00 5.424 17.82 0
## Chrysler Imperial 230 3.23 5.345 17.42 0
## Fiat 128 66 4.08 2.200 19.47 1
## Honda Civic 52 4.93 1.615 18.52 1
## Toyota Corolla 65 4.22 1.835 19.90 1
## Toyota Corona 97 3.70 2.465 20.01 1
## Dodge Challenger 150 2.76 3.520 16.87 0
## AMC Javelin 150 3.15 3.435 17.30 0
## Camaro Z28 245 3.73 3.840 15.41 0
## Pontiac Firebird 175 3.08 3.845 17.05 0
## Fiat X1-9 66 4.08 1.935 18.90 1
## Porsche 914-2 91 4.43 2.140 16.70 0
## Lotus Europa 113 3.77 1.513 16.90 1
## Ford Pantera L 264 4.22 3.170 14.50 0
## Ferrari Dino 175 3.62 2.770 15.50 0
## Maserati Bora 335 3.54 3.570 14.60 0
## Volvo 142E 109 4.11 2.780 18.60 1
select(mtcars, starts_with("c")) # select all columns that start with "c" in their column names
## cyl carb
## Mazda RX4 6 4
## Mazda RX4 Wag 6 4
## Datsun 710 4 1
## Hornet 4 Drive 6 1
## Hornet Sportabout 8 2
## Valiant 6 1
## Duster 360 8 4
## Merc 240D 4 2
## Merc 230 4 2
## Merc 280 6 4
## Merc 280C 6 4
## Merc 450SE 8 3
## Merc 450SL 8 3
## Merc 450SLC 8 3
## Cadillac Fleetwood 8 4
## Lincoln Continental 8 4
## Chrysler Imperial 8 4
## Fiat 128 4 1
## Honda Civic 4 2
## Toyota Corolla 4 1
## Toyota Corona 4 1
## Dodge Challenger 8 2
## AMC Javelin 8 2
## Camaro Z28 8 4
## Pontiac Firebird 8 2
## Fiat X1-9 4 1
## Porsche 914-2 4 2
## Lotus Europa 4 2
## Ford Pantera L 8 4
## Ferrari Dino 6 6
## Maserati Bora 8 8
## Volvo 142E 4 2
The starts_with() argument is very convenient because it allows you to select multiple columns that start with the same text. A few similar arguments are available to define other naming patterns:
ends_with() = Select columns that end with a character string.
contains() = Select columns that contain a character string.
matches() = Select columns that match a regular expression.
one_of() = Select column names that are from a group of names.
Question: How do you select or exclude two columns: mpg and cyl?
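One possible answer, as a minimal sketch (outputs omitted):
select(mtcars, mpg, cyl)     # select only the mpg and cyl columns
select(mtcars, -c(mpg, cyl)) # exclude the mpg and cyl columns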
filter()
Filter/select row(s) of data based on specific requirements on column value(s):
filter(mtcars, cyl %in% c(4,6)) # select cars with 4 or 6 cylinders
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
filter (mtcars, mpg>20) # select rows that have mpg > 20
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
filter(mtcars, mpg>20 & cyl == 6) # select rows that have mpg>20 AND cyl == 6
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
filter(mtcars, mpg>20 | hp > 100) # select rows that have mpg>20 OR hp > 100
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Question 1: How do you select rows with 6 and 8 cylinders knowing that cyl is a factor?
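One possible answer, as a minimal sketch (assuming you use the my_mtcars data in which cyl was converted to a factor earlier):
filter(my_mtcars, cyl %in% c("6", "8"))   # quote the levels because cyl is a factor here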
The pipe operator allows you to pipe the output from one function to the input of the next function. Instead of nesting functions (reading from the inside out), the idea of piping is to read the functions from left to right. It can also help you avoid creating and saving a lot of intermediate variables that you don’t need to keep. The old pipe operator was %>%, but a new version, |>, has now been introduced.
# old operator
pipe_result<- mtcars %>%
select(mpg, cyl) %>%
head()
pipe_result
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
# new operator
pipe_result<- mtcars |>
select(mpg, cyl) |>
head()
pipe_result
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
arrange()
This function arranges or re-orders rows based on their values. By default, rows are arranged in ascending order.
order_data1<- mtcars %>%
arrange(mpg)
order_data1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
order_data2<- mtcars %>%
arrange(cyl, mpg)
order_data2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# Now that we have learned the pipe operator, can you understand what order_data1 and order_data2 are producing?
Question: Can you arrange the table first by wt and then by hp in descending order?
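One reading of the question, as a minimal sketch using desc():
mtcars %>%
  arrange(wt, desc(hp))   # ascending by wt, then descending by hp to break ties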
mutate()
The mutate() command creates new column(s) and defines their values. For instance, we can create a new column with random numbers between 0 and 100 using a uniform distribution.
new_col<- mtcars %>%
  mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100)) # note the use of nrow() to generate exactly as many random numbers as there are rows in the dataframe
new_col
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## new_col
## Mazda RX4 13.320419
## Mazda RX4 Wag 91.380441
## Datsun 710 43.099530
## Hornet 4 Drive 59.621749
## Hornet Sportabout 23.499179
## Valiant 80.746549
## Duster 360 63.343340
## Merc 240D 18.063325
## Merc 230 29.064578
## Merc 280 85.146029
## Merc 280C 11.208987
## Merc 450SE 81.418487
## Merc 450SL 82.116567
## Merc 450SLC 26.008203
## Cadillac Fleetwood 66.913571
## Lincoln Continental 74.533827
## Chrysler Imperial 42.702340
## Fiat 128 1.639470
## Honda Civic 45.892397
## Toyota Corolla 9.612745
## Toyota Corona 13.092204
## Dodge Challenger 11.044016
## AMC Javelin 52.805285
## Camaro Z28 57.636350
## Pontiac Firebird 89.993556
## Fiat X1-9 12.588983
## Porsche 914-2 8.648430
## Lotus Europa 88.083560
## Ford Pantera L 99.760419
## Ferrari Dino 91.033441
## Maserati Bora 73.210326
## Volvo 142E 59.449843
new_col2 <- mtcars %>%
  mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100), wt_kg = wt * 1000) # you can create multiple columns at once. Let's transform the weight from tons to kg
new_col2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## new_col wt_kg
## Mazda RX4 75.572603 2620
## Mazda RX4 Wag 86.507752 2875
## Datsun 710 65.434586 2320
## Hornet 4 Drive 29.154522 3215
## Hornet Sportabout 64.609343 3440
## Valiant 87.118349 3460
## Duster 360 1.399047 3570
## Merc 240D 13.651664 3190
## Merc 230 84.343792 3150
## Merc 280 79.836075 3440
## Merc 280C 65.493076 3440
## Merc 450SE 75.247570 4070
## Merc 450SL 66.274205 3730
## Merc 450SLC 28.878130 3780
## Cadillac Fleetwood 77.404107 5250
## Lincoln Continental 34.216939 5424
## Chrysler Imperial 29.698619 5345
## Fiat 128 34.958406 2200
## Honda Civic 60.662340 1615
## Toyota Corolla 45.379436 1835
## Toyota Corona 87.526815 2465
## Dodge Challenger 93.870674 3520
## AMC Javelin 1.647280 3435
## Camaro Z28 22.489337 3840
## Pontiac Firebird 27.950543 3845
## Fiat X1-9 67.181611 1935
## Porsche 914-2 61.389510 2140
## Lotus Europa 69.277950 1513
## Ford Pantera L 54.042772 3170
## Ferrari Dino 60.958549 2770
## Maserati Bora 34.495943 3570
## Volvo 142E 81.788555 2780
Can you create a new column called zero and give it a value of 0?
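One possible answer, as a minimal sketch:
mtcars %>%
  mutate(zero = 0)   # every row gets the constant value 0 in the new column 'zero'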
summarise()
This function calculates summary statistics across all rows, or across rows within groups; it is often used in combination with group_by().
sum_table <- mtcars %>%
summarise(mean(wt))
sum_table
## mean(wt)
## 1 3.21725
sum_table2 <- mtcars %>%
summarise(avg_wt= mean(wt), min_wt= min(wt))
sum_table2
## avg_wt min_wt
## 1 3.21725 1.513
group_by()
This is a great function. group_by() divides the data rows into groups based on the grouping column(s) provided. It is often used in combination with other functions that define what to do with the rows after placing them in groups. When group_by() and summarise() are used together, you are essentially telling R to separate the rows into groups and, for each group, to use summarise() to generate summary statistics that characterize the column values.
group_summary <- mtcars |>
group_by(cyl) |>
summarise(avg_wt= mean(wt), min_wt= min(wt))
group_summary
## # A tibble: 3 × 3
## cyl avg_wt min_wt
## <dbl> <dbl> <dbl>
## 1 4 2.29 1.51
## 2 6 3.12 2.62
## 3 8 4.00 3.17
You can also create groups based on the combination of two or more columns.
group_summary2<- mtcars |>
group_by(cyl, gear) |>
summarise(avg_wt= mean(wt), min_wt= min(wt))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
group_summary2
## # A tibble: 8 × 4
## # Groups: cyl [3]
## cyl gear avg_wt min_wt
## <dbl> <dbl> <dbl> <dbl>
## 1 4 3 2.46 2.46
## 2 4 4 2.38 1.62
## 3 4 5 1.83 1.51
## 4 6 3 3.34 3.22
## 5 6 4 3.09 2.62
## 6 6 5 2.77 2.77
## 7 8 3 4.10 3.44
## 8 8 5 3.37 3.17
Let’s read two other tables that are available in the package ‘dplyr’
data(band_members)
band_members
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
data(band_instruments)
band_instruments
## # A tibble: 3 × 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join()
Returns all records from the left table and the matched records (matched by shared ID columns) from the right table. Columns from the right table are filled with NA when there is no match.
left_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
full_join()
Returns all records from both tables.
full_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 4 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
Other than left_join() and full_join(), there are a few other join functions that combine tables differently. Sometimes you will find them useful for a specific table you want to make, but in general, you can rely on left_join() for most of the job.
right_join(): similar to left_join(), but it keeps all records from the right table instead.
inner_join(): returns only the matched records from both tables.
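To see the difference concretely, here is a small sketch using the same example tables (output not shown):
right_join(band_members, band_instruments, by = "name") # keeps all rows of band_instruments; Keith gets NA for band
inner_join(band_members, band_instruments, by = "name") # keeps only the names present in both tables (John and Paul)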
This has been a glimpse of what can be done in R to work with tabular data. There are plenty of other packages that you will learn in time by Googling and learning from other people, but for now, the functions we covered are a very good set of tools to do most of what you will need.
Let's move on to data analysis.
When building a statistical model, you need to specify a dependent variable (what you are trying to explain) and one or more independent variables (the variables that you are using to explain the dependent variable).
Thus, to build a model you will first need to identify the dependent and independent variables which come from the research question. Once you have identified the variables, you need to decide what kind of model structure is appropriate for the data set at hand. For this reason, knowing models and model structures can help you design the data collection process as you have a good idea of what type of data you need for a specific model.
Running models can be easy, but before you can make interpretations from the model, you need to check that the assumptions of the model are valid (model validation), decide what is the best model structure (model selection) and then, finally, interpret the parameters of the model and/or make predictions.
The general structure for defining any model in R is:
model.type(dependent.variable ~ independent.variable1 + independent.variable2, data = dataframe)
Note that in the model equation you do not have to specify the intercept. This example assumes two independent variables (independent.variable1 and independent.variable2), but the model could contain any number of independent variables.
The dependent variable can be either continuous or categorical. We refer to these parameterizations as a regression model or a classification model, respectively. Most of the models we demonstrate here are regression models, but the structure used to build classification models in R is almost identical. In most cases, the data type of the dependent variable determines whether the model is a regression (non-categorical, numeric-type dependent variable) or a classification (categorical, factor-type dependent variable). You can use functions such as as.factor() and as.numeric() to convert data to a factor or numeric variable. Note that just because a data set contains numbers does NOT necessarily mean the numbers are numeric-type. Numbers can be used as symbols to differentiate categories as well. It is always good practice to confirm the data type of the dependent and independent variables that you are inputting into the model.
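As a quick illustration (a minimal sketch, not part of the modeling workflow below):
class(mtcars$cyl) # check how the cylinder column is currently stored
cyl_factor <- as.factor(mtcars$cyl) # treat cylinder count as a category
cyl_numeric <- as.numeric(as.character(cyl_factor)) # back to numbers (via character, so 4/6/8 are preserved rather than 1/2/3)
class(cyl_factor); class(cyl_numeric)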
We will start with a simple linear regression analysis.
QUESTION: What is the difference between correlation and regression?
Model design involves deciding what we are trying to explain (i.e., the dependent variable) and what we are going to use to explain it (i.e., the independent variables). These are questions that are informed by the researcher.
Then, we need to decide what kind of model structure is appropriate to the data set. In our case, we will start with a simple linear regression model. Later in the course, however, we will investigate more complex model structures.
Run a basic linear regression model, attempting to predict/explain the fuel efficiency of a car (mpg) based on its weight, and assign the model output to lm1. Remember to specify the my_mtcars data set in the model so that R understands that the names of the dependent and independent variables are columns in that specific dataframe.
lm1 <- lm(mpg ~ wt, data = my_mtcars)
Once you have fit the model, you can use the summary function to investigate the beta coefficient, SE, and intercept values.
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = my_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1820 -2.4603 -0.4526 1.5046 6.3191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.913 2.002 19.936 < 2e-16 ***
## wt -6.287 0.630 -9.979 1.01e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.793 on 28 degrees of freedom
## Multiple R-squared: 0.7805, Adjusted R-squared: 0.7727
## F-statistic: 99.59 on 1 and 28 DF, p-value: 1.007e-10
IMPORTANT: Before you can trust your model, you need to examine whether the assumptions on which the model is based are actually valid. If the assumptions are not satisfied, then your model will be unreliable, at least to some extent!
If you apply the function plot to the model, R will provide a series of plots that will help you inspect the model assumptions. Press Enter in the console to cycle through the plots.
plot(lm1)
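If you prefer to see all four diagnostic plots in a single window rather than cycling through them, you can change the plotting layout first (an optional step):
par(mfrow = c(2, 2)) # split the plotting window into a 2 x 2 grid
plot(lm1)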
Check the Q-Q (quantile-quantile) plot. A perfect normal distribution is represented by a straight line. Your residuals should be close to a straight line.
The variability of the residuals should be uniform across the range of fitted values of the dependent variable. In the plot of residuals vs fitted values of y and standardized residuals vs. fitted values there should be a “scatter” of dots across the graphs, with no discernible pattern. Points should be randomly distributed. If you have a scatter of points on one side of the graph, your data may NOT be homogeneous.
There are different ways you can address this before you fit your model. For instance, you can estimate the correlation of each pair of covariates and discard variables, or exclude them from the analysis, if they are highly correlated (positively or negatively). We will examine this later in the course when dealing with more complex models.
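For example, a quick way to inspect pairwise correlations among candidate covariates (a small sketch using three numeric columns of mtcars):
round(cor(mtcars[, c("wt", "hp", "disp")]), 2) # values close to 1 or -1 indicate strong correlation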
Serial autocorrelation is when there is significant correlation between successive data points in the data set. Spatial data or time-series data tend to be autocorrelated. There are different ways to deal with autocorrelation, such as using mixed-effect models.
We can check this by looking at the plot of standardized residuals vs leverage and “Cook’s Distance.”
Leverage is a measure of the influence of individual data points on the model’s parameters, measured on the scale of 0-1, where high values indicate a strong influence on the model’s parameters.
Cook’s distance is the sum of squared distances between the fitted values using the whole data set, and the fitted values with the ith observation removed. A large difference indicates that the ith observation exerts a strong influence on the model’s parameters.
There should be no values beyond 1.
QUESTION: How does the QQ plot look? Does it indicate a potential problem?
If your standardized residuals are not normally distributed you can transform the dependent variable.
Let’s try a log transformation, which is commonly used:
lm1 <- lm(log(mpg) ~ wt, data = my_mtcars)
plot(lm1)
QUESTION: Does this improve the result?
The residuals are now randomly distributed and are homogeneous. We can now trust model results.
Continuing with interpreting results, the slope value should be -0.3055548. This means that for each unit of weight added to a car (1 ton), the log of miles per gallon it achieves is predicted to REDUCE by a value of 0.3055548. There is a negative effect of weight on car efficiency.
The intercept of 3.9259255 sets the start of the regression line at a weight of zero. In this case, this isn't very useful (a car will not weigh zero tons), but it is a necessary element of describing a linear relationship. Here, the equation for the line is log(mpg) = 3.92593 - 0.30555 (wt). Note that you can call the individual coefficients from a model directly using, in this example, lm1$coefficients.
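For example (the values in the comments are approximate and correspond to the model fitted above):
lm1$coefficients # named vector with the intercept (~3.93) and the slope for wt (~-0.31)
coef(lm1)["wt"] # an equivalent way to extract just the slope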
Now, plot again a scatterplot of weight vs. mpg and draw the regression line in blue. First return your graphing parameters to their default setting of (1,1):
par(mfrow=c(1, 1))
plot(log(mpg) ~ wt, data = my_mtcars)
abline(lm1, col="blue")
Here are two other ways you can draw the predicted regression line:
abline(coef = lm1$coefficients, col="green")
abline(lm1$coefficients, col="red")
In any mathematical modeling approach, there may be other variables, or some combination of variables, that are most effective and efficient at predicting your response variable of interest. In this case, our response variable is mpg. Looking at your plot of all the two-way relationships between variables, are there any other variables that may help predict mpg? Horsepower (hp) seems potentially informative. The question now is, which model might be best at predicting mpg?
Let's say we have three options:
1. mpg ~ wt
2. mpg ~ wt + hp
3. mpg ~ hp
We've already fitted the first model. Now, fit a linear regression model for the next two parameter combinations, giving them unique names (lm2 and lm3), and look at the summary of their results.
lm2 <- lm(log(mpg) ~ wt + hp, data = my_mtcars)
lm3 <- lm(log(mpg) ~ hp, data = my_mtcars)
summary(lm2)
##
## Call:
## lm(formula = log(mpg) ~ wt + hp, data = my_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17048 -0.05589 -0.03043 0.07386 0.19392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8995420 0.0705574 55.268 < 2e-16 ***
## wt -0.2287735 0.0284846 -8.031 1.25e-08 ***
## hp -0.0014795 0.0003458 -4.278 0.000211 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09807 on 27 degrees of freedom
## Multiple R-squared: 0.8857, Adjusted R-squared: 0.8772
## F-statistic: 104.6 on 2 and 27 DF, p-value: 1.932e-13
summary(lm3)
##
## Call:
## lm(formula = log(mpg) ~ hp, data = my_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44058 -0.06574 -0.02156 0.09492 0.34549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4444246 0.0759960 45.324 < 2e-16 ***
## hp -0.0032294 0.0004855 -6.652 3.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1773 on 28 degrees of freedom
## Multiple R-squared: 0.6125, Adjusted R-squared: 0.5986
## F-statistic: 44.25 on 1 and 28 DF, p-value: 3.225e-07
As you'll see, all 3 of these models are reasonably good. Which is optimal? You might know that when you add more independent variables to a model, the model fit will often improve. But it is certainly not ideal to have a model with a large number of independent variables (because we want to avoid overfitting). Akaike's Information Criterion (AIC) provides a good mechanism for model comparison. AIC ranks a model's performance by accounting for model fit while penalizing models with more variables. When adding a variable, the improved fit of the model must outweigh the penalty; otherwise the improved fit will not be deemed worthwhile given the additional model complexity.
Use the AIC function to compare the AIC values of our 3 models. The lowest AIC indicates the model that is a "better" representation of the data and has better predictive power. Note that AIC is not a measure of fit.
AIC(lm1)
## [1] -35.82462
AIC(lm2)
## [1] -49.35107
AIC(lm3)
## [1] -14.73444
What can we conclude? The best model (of the three we have tried) for predicting the miles per gallon of a car uses both the horsepower and the weight of the car in the prediction (lm2).
To finish, let's see if we can now use this model (lm2) to predict the fuel efficiency of a car with hp = 225 and wt = 4.0 tons. Making predictions for a new set of independent variables when the dependent variable is absent is ultimately one of the main goals of regression analysis.
First, we need a dataframe with our new independent variables to be used in the prediction. Then, we can use the predict() function to apply our established linear model to the new information. Lastly, we need to transform miles per gallon back from the log scale to make it more easily interpretable.
nd <- data.frame(hp = 225, wt = 4.0)
exp(predict(lm2, newdata = nd, type = "response"))
## 1
## 14.17612
We can get this same result without using the predict command, simply by writing out the linear equation. That is, our predicted response equals our intercept term, plus our coefficient for wt multiplied by the weight value, plus our coefficient for hp multiplied by our horsepower value (the predict() function just makes things a little easier for us). That is:
\[\log(\widehat{mpg}) = \beta_0 + \beta_{wt} \, wt + \beta_{hp} \, hp\]
We must also use the exp() function to back-transform from the log scale.
exp(lm2$coefficients[[1]] + (lm2$coefficients[[2]]*4.0) + (lm2$coefficients[[3]]*225))
## [1] 14.17612
The analysis of variance (ANOVA) is a linear model in which the explanatory variable is categorical. As we saw before, the categorical variable must be a factor. Remember that if the category has been coded with numbers, R will assume the variable is continuous.
We will continue working with the mtcars data set. As we did before, we specify cylinders as a categorical variable:
mtcars$cyl <- as.factor(mtcars$cyl)
class(mtcars$cyl)
## [1] "factor"
Here we ask the question: do more cylinders imply more power?
Our hypothesis (\(H_1\)) is that more cylinders will provide more power to the car, expressed in gross horsepower (HP). The null hypothesis (\(H_0\)) is that there is no difference in HP among number of cylinders.
To perform an ANOVA in R, we use the function aov(), and summary() to see the results.
model1 <- aov(hp ~ cyl, data=mtcars)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 104031 52015 36.18 1.32e-08 ***
## Residuals 29 41696 1438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the results, the table shows us a p-value representing the overall significance of the categorical variable. If the term is significant, then we know at least one level is significantly different from the others.
We can reject the null hypothesis and conclude that gross horsepower differs among different number of cylinders.
However, before we can trust this result, we need to check the model assumptions. We will do this in the same way we did for the linear model, using plot():
plot(model1)
From the second plot we can conclude that the residuals are normally distributed (straight line). From the first and third plots, we can see that the variances of the residuals are homogeneous (homoscedasticity). From the final plot, we can conclude that the model is not biased by unduly influential observations.
To understand how the levels compare to each other once we know that the ANOVA is significant, we can use a post-hoc test. Here we will use the Tukey post-hoc test with the function TukeyHSD().
TukeyHSD(model1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = hp ~ cyl, data = mtcars)
##
## $cyl
## diff lwr upr p adj
## 6-4 39.64935 -5.627454 84.92616 0.0949068
## 8-4 126.57792 88.847251 164.30859 0.0000000
## 8-6 86.92857 43.579331 130.27781 0.0000839
diff is the difference in means between two levels, lwr and upr are the lower and upper 95% confidence intervals for the difference, and p adj is the adjusted p-value for the difference.
In this example, we can see that there is no significant difference in HP between 4 and 6 cylinders (p > 0.05), and that 8 cylinders provide more HP than 4 and 6 cylinders (p < 0.05).
In the past few decades, there has been an increasing number of studies that use non-parametric machine learning models for classification or regression. In the most simplistic terms, non-parametric/machine learning models do not require the predictive variables to follow a predetermined distribution (e.g., a normal distribution). They are also relatively less stringent in terms of model assumptions.
One of the most popular and most powerful machine learning tools is Random Forest. Random Forest is an ensemble method for classification or regression that operates by constructing a series of decision trees. We will not go into detail about the Random Forest algorithm or how it works. But in general, Random Forest models perform really well when there are:
Although the Random Forest algorithm is complicated, running a Random Forest model in R is as easy as running a linear model. For the most part, we just need to specify the data set, and the regression model formula. Let’s use the “mtcars” data set again for this example.
Load the mtcars data set, and load the randomForest package.
#install.packages("randomForest") # Install the package if it is not already installed on your computer.
data(mtcars)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
Now let's run a model that includes all metrics in the data set to predict the mpg of a car.
Note that in the following code the formula was simplified to mpg ~ . The . represents all the columns in the dataframe that are not already used as the dependent variable. Whenever you have a dataframe that contains only the independent and dependent variables you need in the model, you can simplify the formula in this way. This essentially tells R to use all the variables included in the dataframe.
rf.model<- randomForest(mpg ~ ., data=mtcars)
rf.model
##
## Call:
## randomForest(formula = mpg ~ ., data = mtcars)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 5.744222
## % Var explained: 83.68
Notice that one of the most important results of Random Forest is the "% variance explained". It can be considered equivalent to the \(R^2\) value of a linear model, quantifying how well the model fits the data. The closer the "% var explained" is to 100, the better the model's performance.
Here you can see that the default number of trees used in the model is 500. In practice, the number of trees is one of the very few parameters that may need to be modified by the user. It determines the number of trees built in the ensemble. As we increase the number of trees, the model results will vary less and eventually stabilize, but we do not want to create too many trees because it takes more computational time. The general rule is to increase the number of trees incrementally until the value of "% var explained" changes very little.
Let's change the default number of trees to 1000 and then 5000 and observe the difference in the model results.
rf.model.1000<- randomForest( mpg ~ ., data=mtcars, ntree=1000)
rf.model.1000
##
## Call:
## randomForest(formula = mpg ~ ., data = mtcars, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 5.630956
## % Var explained: 84
rf.model.5000<- randomForest( mpg ~ ., data=mtcars, ntree=5000)
rf.model.5000
##
## Call:
## randomForest(formula = mpg ~ ., data = mtcars, ntree = 5000)
## Type of random forest: regression
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 5.616431
## % Var explained: 84.04
The larger number of trees (ntree) does not change the "% var explained" significantly, so the default ntree = 500 is as good as ntree = 1000 or ntree = 5000.
Lastly, a convenient feature of Random Forest models is that they internally summarize how important each variable is to model performance. The model does this by replacing each predictive variable, one at a time, with randomly permuted values. The more the model accuracy decreases after this replacement, the more important the specific variable (that was replaced in the process) is. You can retrieve and visualize the variable importance data using the importance() and varImpPlot() functions.
importance(rf.model)
## IncNodePurity
## cyl 167.86329
## disp 257.59362
## hp 174.06653
## drat 57.84139
## wt 276.08078
## qsec 39.34845
## vs 21.40127
## am 17.73601
## gear 20.64983
## carb 25.42392
varImpPlot(rf.model)
Note that “IncNodePurity” is just a measure of how much the model performance has decreased. The higher that value, the more important the variable is.
QUESTION: Which variable is contributing most to predicting mpg of a car?
Lastly, just as with all the other models, the most powerful use of a Random Forest model is to predict the dependent variable when you have a set of independent variables. You can, once again, use the predict() function to perform this action. Below, we will create a fictitious car, changing the Hornet 4 Drive from 6 cylinders to 8 cylinders. We want to know what the mpg of this new car would be.
new.car<- mtcars[4,]
new.car$cyl<- 8
new.car
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 8 258 110 3.08 3.215 19.44 1 0 3 1
predict(rf.model, newdata = new.car, type = "response")
QUESTIONS:
- Do you think the prediction makes sense? Why?
- Can you build a random forest classification model to predict the number of cylinders based on other metrics included in the dataset?
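As a starting point for the second question, here is a minimal sketch (one possible approach, not the only one). Converting cyl to a factor makes randomForest() fit a classification forest, which then reports a confusion matrix and an out-of-bag error rate.
mtcars_class <- mtcars
mtcars_class$cyl <- as.factor(mtcars_class$cyl) # a factor response triggers classification mode
rf.class <- randomForest(cyl ~ ., data = mtcars_class)
rf.class
predict(rf.class, newdata = mtcars_class[1:3, ]) # predicted cylinder class for the first three cars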
A primary goal of basic ecological and applied conservation research is to understand how species are distributed across space and through time. However, animal locations or abundance data typically cannot be modelled by a normal distribution. As a result, other types of model structures are needed.
Generalized linear models (GLM) are used when the residuals are non-normal, when there are non-linear relationships between dependent and independent variables, and/or when the variance in the dependent variable is not uniform across its range. E.g., presence-absence data, count data.
GLMs consist of three elements:
1. A probability distribution for the dependent variable (the error structure).
2. A linear predictor, η = Xβ.
3. A link function that relates the mean of the distribution to the linear predictor.
A GLM allows the specification of a variety of different error distributions, for example Gaussian (normal), binomial, Poisson, or gamma.
The linear predictor η (eta) is the quantity which incorporates the information about the independent variables into the model. It is related to the expected value of the dependent variable through the link function.
η is expressed as a linear combination of unknown parameters β.
The linear predictor can be represented as the matrix X containing your independent variables multiplied by a vector of coefficients β; η can thus be expressed as \(η = Xβ\).
The link function g provides the relationship between the linear predictor and the mean, or expected value, of the distribution function; that is, \(g(\mu) = η\).
There are many commonly used link functions, and their choice can seem somewhat arbitrary. It can be convenient to match the domain of the link function to the range of the distribution function’s mean.
A logistic regression model allows us to establish a relationship between a binary (0 vs. 1) outcome variable and a group of predictor variables. It models the logit-transformed probability of a certain outcome as a linear relationship with the predictor variables.
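In other words, if p is the probability of the outcome and η is the linear predictor, the logit link gives:
\[\text{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \frac{e^{\eta}}{1 + e^{\eta}}\]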
To demonstrate how a logistic regression model works, we will use the following study:
Bolger et al. (1997) investigated the impacts of habitat fragmentation on the occurrence of native rodents. Quinn and Keough (2002) subsequently modeled the presence/absence native rodents against some Bolger et al. (1997) biogeographic variables.
Bolger et al. (1997) studied whether small fragments of the shrub habitats in canyons of San Diego, CA, isolated by urbanization, were capable of supporting viable populations of native rodent species.
The data set consists of rodent presence/absence and three predictor variables:
1. PERSHRUB = percentage of shrub cover
2. DISTX = the distance to the nearest large (>100 ha) "source canyon"
3. AGE = time elapsed since the fragment was isolated by development
For this demonstration we need to install and load a new package.
#install.packages("usdm") # Install the package if it is not installed on your computer.
library(usdm)
Import the data set using the read.table() function. The data for this example are in a .txt format. read.table() will read the .txt file and load it as a data frame into the R environment. Look at the first lines of data using the head() function.
bolger <- read.table("data/bolger.txt", header=T, sep= "")
head(bolger)
## PERSHRUB DISTX AGE RODENTSP
## 1 66 2100 50 1
## 2 90 914 20 1
## 3 75 1676 34 1
## 4 75 243 34 1
## 5 60 822 16 1
## 6 50 121 14 1
Note: It is often best to rescale continuous covariates to improve numerical stability in the models. We will skip this step for this example, but we will look into scaling in other examples.
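If you did want to rescale, a minimal sketch could look like this (the object name bolger_scaled is just an example); scale() centers each covariate and divides it by its standard deviation:
bolger_scaled <- bolger
# center and scale the three continuous predictors
bolger_scaled[, c("PERSHRUB", "DISTX", "AGE")] <- scale(bolger_scaled[, c("PERSHRUB", "DISTX", "AGE")])
head(bolger_scaled)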
When we fit a model we assume that the estimated effects of predictor variables are independent of each other. Thus, before we fit a model we first need to check our independent variables for potential collinearity.
Collinearity occurs when 2 or more independent variables have a strong linear relationship, making it difficult to determine which of the collinear variables is more strongly associated with the response variable. Collinearity affects parameter estimates and increases both the standard errors and p-values of parameters, likely obscuring some important relationships.
plot(bolger[,1:3])
To make sure we do not have a (multi)collinearity problem in our predictor variables, we can look at the Variance Inflation Factor (VIF) of each covariate. Variance inflation is a measure of the degree of collinearity between variables. Any variable with a VIF value > 3 (or 5, or 10, depending on who you ask) causes concerning collinearity in the model. If a covariate has a VIF > 3, remove it (or other correlated variables) and recalculate the VIFs after deletion.
vifstep(bolger[,1:3])
## No variable from the 3 input variables has collinearity problem.
##
## The linear correlation coefficients ranges between:
## min correlation ( DISTX ~ PERSHRUB ): -0.2745893
## max correlation ( AGE ~ PERSHRUB ): -0.7952976
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 PERSHRUB 2.743579
## 2 DISTX 1.093362
## 3 AGE 2.750798
In our example there is no evidence of collinearity as all VIFs are < 3. Good!
bolger.glm <- glm(RODENTSP ~ DISTX + AGE + PERSHRUB,
family=binomial (link= "logit"),
data=bolger)
summary(bolger.glm)
##
## Call:
## glm(formula = RODENTSP ~ DISTX + AGE + PERSHRUB, family = binomial(link = "logit"),
## data = bolger)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9099159 3.1125426 -1.899 0.0576 .
## DISTX 0.0003087 0.0007741 0.399 0.6900
## AGE 0.0250077 0.0376618 0.664 0.5067
## PERSHRUB 0.0958695 0.0406119 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.617 on 24 degrees of freedom
## Residual deviance: 19.358 on 21 degrees of freedom
## AIC: 27.358
##
## Number of Fisher Scoring iterations: 5
In the case of a linear model, we used the plot() function to check the standardized residuals for any "patterns". For binomial data, however, you will just see two bands in the residuals due to the binary nature of the dependent variable.
What we need to check instead is whether the model fits the data well. To do this, we extract the deviance residuals and examine their distribution. Observations with a deviance residual > 2 may indicate a lack of fit.
devresid<-resid(bolger.glm, type = "deviance")
hist(devresid)
In this case, we can conclude there is no strong evidence of lack of fit. We can now feel confident about our model to make interpretations.
Let’s look again at the model summary:
summary(bolger.glm)
##
## Call:
## glm(formula = RODENTSP ~ DISTX + AGE + PERSHRUB, family = binomial(link = "logit"),
## data = bolger)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9099159 3.1125426 -1.899 0.0576 .
## DISTX 0.0003087 0.0007741 0.399 0.6900
## AGE 0.0250077 0.0376618 0.664 0.5067
## PERSHRUB 0.0958695 0.0406119 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.617 on 24 degrees of freedom
## Residual deviance: 19.358 on 21 degrees of freedom
## AIC: 27.358
##
## Number of Fisher Scoring iterations: 5
The chance of native rodent occurrence increases significantly with increasing shrub cover (\(B=0.096\), \(p = 0.0182\)). Neither fragment isolation age nor distance had a significant effect.
The default results are on the logit scale. The coefficient of PERSHRUB is 0.096, so a one-unit change in the percentage of shrub cover produces approximately a 0.096 unit change in the log odds of presence (i.e., a 0.096 unit change on the logit scale).
The odds of presence are defined as the ratio of the probability of presence over the probability of absence. For instance, if the probability of presence is 0.8 and the probability of absence is 0.2, the odds are 0.8/0.2 = 4, or the odds of presence are 4 to 1. For more information on how to interpret odds visit: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
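To express the fitted coefficients as odds ratios rather than changes in log odds, you can exponentiate them (a quick sketch based on the model above; the value in the comment is approximate):
exp(coef(bolger.glm)) # odds ratios for each term
exp(coef(bolger.glm))["PERSHRUB"] # ~1.10: each additional 1% of shrub cover multiplies the odds of presence by about 1.10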
It is hard to think in terms of logits. We can plot the association between the probability of native rodent presence and percentage shrub cover.
See how the code for this plot has become much longer and more complex. Make sure you understand each line of code. Remember to use the help command when you do not understand a specific function and all its arguments.
xs <- seq(0, 100, 1)
bolger.predict <- predict(bolger.glm, type="response", se=T,
newdata=data.frame(DISTX=mean(bolger$DISTX), AGE=mean(bolger$AGE), PERSHRUB=xs))
# Plot results
plot(bolger$RODENTSP ~ bolger$PERSHRUB, xlab="Percentage shrub cover", ylab="Rodent presence probability", axes=T, pch=16)
lines(bolger.predict$fit~xs, type="l", col="gray")
lines(bolger.predict$fit + bolger.predict$se.fit~xs, col="red", type= "l", lty=2)
lines(bolger.predict$fit - bolger.predict$se.fit~xs, col="red", type= "l", lty=2)
***
What is the probability of native rodent occurrence with a shrub cover of 80%, a DISTX of 100, and a fragment isolation age of 5?
predict(bolger.glm, type="response", se=T, newdata=data.frame(DISTX=100, AGE=5, PERSHRUB=80))
## $fit
## 1
## 0.8716415
##
## $se.fit
## 1
## 0.1277815
##
## $residual.scale
## [1] 1
Note that the predict function already converts values from the logit scale back to the probability scale.
To do that manually, we can do the following:
x <- -5.9099159 + 0.0003087*100 + 0.0250077 * 5 + 0.0958695 * 80
exp(x)/(exp(x)+1) # This is the inverse of the logit scale
## [1] 0.8716417
plogis(x) # Use a built in function
## [1] 0.8716417
The package ggplot2 is widely used for data visualization in R. This package is extremely powerful. I used to like using base R code for plotting but eventually I had to admit that ggplot is extremely cool and had to adopt it.
ggplot2 is based on the grammar of graphics, which allows users to create complex plots from data in a systematic way.
I will first present the basics of ggplot2 using the same mtcars dataset, and then I will present the code of different figures I used in my publications.
As with any R package, before you can use ggplot2, you need to install it (if you haven’t already) and load it into your R session.
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(dplyr)
There are a few concepts that we need to know to understand how to code ggplots.
- Data: The dataset you want to visualize.
- Aesthetics (aes): The visual properties of the plot (e.g., x and y position, color, size).
- Geometries (geom_): The actual marks we put on the plot (e.g., points, lines, bars). We can, for instance, use geom_line() for plotting lines.
- Facets: Subplots that display subsets of the data.
- Scales: Control how data values are mapped to visual properties.
- Themes: Control the overall appearance of the plot (e.g., font size, background color).
Let’s start with a simple scatter plot using the built-in mtcars dataset. We have created this plot before, so the look should be familiar to you.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point()
Here, ggplot(data = mtcars, aes(x = wt, y = mpg)) initializes the plot with the mtcars dataset, setting wt (weight) on the x-axis and mpg (miles per gallon) on the y-axis. geom_point() adds points to the plot to create a scatter plot.
We now have a basic plot. Let's start to customize it. We will first add a title, subtitle, and axis labels with the labs() function.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "Scatter Plot of MPG vs Weight",
subtitle = "Data from mtcars dataset",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon",
caption = "Source: mtcars dataset")
We can also change the point colors and sizes by mapping color and size to variables in the data. For this, we need to specify in the aesthetics which variables represent color and size. Then we use scale_color_continuous() and scale_size_continuous() to set the names in the legend.
ggplot(data = mtcars, aes(x = wt, y = mpg, color = as.numeric(cyl), size = hp)) +
geom_point() +
labs(title = "Scatter Plot with Color and Size Mappings",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon") +
scale_color_continuous(name = "Number of Cylinders") +
scale_size_continuous(name = "Horsepower")
Looking pretty good.
What if we want to add a line trend?
We can use the geom_smooth() function to add a linear trend. The method argument refers to the smoothing method (function); here we use a linear model. se controls whether you plot the standard error, and color controls the color of the line.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = T, color = "blue") +
labs(title = "Scatter Plot with Linear Regression Line",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
Check the help on geom_smooth() for other methods, like GAM models.
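For instance, a GAM smoother can be added like this (a small sketch; ggplot2 relies on the mgcv package for method = "gam", so that package needs to be installed):
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x), se = TRUE, color = "darkgreen") +
labs(title = "Scatter Plot with a GAM Smoother",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon")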
Faceting allows you to create multiple plots based on a categorical variable. This is very nice. Let's generate different plots for cars with different numbers of cylinders.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~ cyl) +
labs(title = "Scatter Plot Faceted by Number of Cylinders",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon")
Let’s try now different geometries.
For a bar plot we use geom_bar().
ggplot(data = mtcars, aes(x = factor(cyl))) +
geom_bar() +
labs(title = "Bar Plot of Cylinder Counts",
x = "Number of Cylinders",
y = "Count")
For a histogram we use geom_histogram().
ggplot(data = mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black") +
labs(title = "Histogram of Miles Per Gallon",
x = "Miles Per Gallon",
y = "Count")
We use geom_boxplot() for a boxplot.
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot() +
labs(title = "Box Plot of MPG by Number of Cylinders",
x = "Number of Cylinders",
y = "Miles Per Gallon")
#### Violin Plot
We use geom_violin() for a violin plot. Sometimes, violin plots can better represent the distribution of the data.
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
geom_violin() +
labs(title = "Box Plot of MPG by Number of Cylinders",
x = "Number of Cylinders",
y = "Miles Per Gallon")
The themes control the overall look of your plot. There are many themes available. For instance, we can use theme_minimal().
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter Plot with Minimal Theme",
x = "Number of Cylinders",
y = "Miles Per Gallon")
Other available themes include theme_gray(), theme_classic(), theme_bw(), and more. Check them out yourself.
You can customize the scales of axes, colors, and other aesthetics. In this case, we are colouring the dots based on horsepower, using the function scale_color_gradient() to set the range of colors from blue (low) to red (high).
ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point() +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Scatter Plot with Custom Color Scale",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon",
color = "Horsepower")
You can add annotations to your plots. This can be useful to highlight something in your plot.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
annotate("text", x = 5, y = 30, label = "High MPG", color = "red") +
labs(title = "Scatter Plot with Annotation",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon")
Something very good about ggplot is that it is very easy to add multiple layers to the same plot. You can combine different geoms in one plot using the same or different data.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Scatter Plot with Multiple Trend Lines",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
With ggplot2, you have a powerful tool to explore and present your data in compelling ways.
There are many options that you can control on ggplot. The best way to learn all the possibilities is by playing with it. Pretty much, everything can be customized. Googling anything you want to do is one of the best ways to troubleshoot.
Feel free to experiment with different datasets and ggplot2 functions to create the visualizations that best communicate your insights!
To save your plot to a file, use the function ggsave(). Note that first you need to save your plot as an object.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "Scatter Plot of MPG vs Weight")
ggsave("scatter_plot.png", plot = p, width = 6, height = 4)
I want to show you now some more advanced plots I have created through the years for different publications and presentations.
The first examples come from this manuscript.
In this study, I used a multi-species occupancy model and looked at how species richness of large herbivores relates to different covariates and land uses in central Kenya.
I created a violin plot for figure 1 of the article to show how species richness varies across years at each land use.
Let’s load the data.
data <-read.csv("data/forplot.csv")
colnames(data)[3:5]<-c("DistanceToWater","NDVI","LivestockAbundance")
Now we will create the plot. We specify the aesthetics with Year on the x axis, SR (species richness) on the y axis, and Year as the grouping factor. We use geom_violin() to create the violin plot. I used a cool function, stat_summary(), to plot the mean and standard deviation of species richness within each violin. I faceted the plot by land use and used the gray theme (theme_gray()). With labs(), we specify the labels. And finally, within the theme() function we make more customizations, such as setting the font sizes of the axes, titles, and labels.
ggplot(data, aes(x = Year, y = SR, group = Year)) +
geom_violin() +
stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "pointrange", color = "black") +
facet_wrap(~LandUse, ncol = 5) +
theme_gray() +
labs(x = "Year", y = "Estimated point sp. richness") +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12), strip.text = element_text(size = 12))
In that same paper, I wanted to show the relationship of species richness with livestock abundance. I then created this scatter plot, showing how species richness declines with increasing livestock but highlighting the differences among land uses.
Here, I used geom_point() and stat_smooth() to create trends based on GAM models for each land use. Note that I specified LandUse as the grouping and color factor in the aesthetics. I customized the colors using scale_color_manual() and set a black and white theme. With the theme() function, I set the position of the legend (getting the position right requires a lot of trial and error; change it and you will understand). I also customized font sizes.
ggplot(data, aes(x=LivestockAbundance, y=SR, group=LandUse, color=LandUse)) + geom_point() + stat_smooth(aes(x=LivestockAbundance, y=SR), method = "gam", formula = y ~ s(x)) +
scale_color_manual(name = "Land use", values = c("#FF7F00", "#8B0000", "#66CD00","#006400")) +
labs(x = "Livestock abundance", y = "") +
theme_bw() +
theme(legend.position=c(0.8,0.8),legend.title=element_text(size=14),legend.text=element_text(size=12), axis.text = element_text(size = 14), axis.title = element_text(size=15))
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Here is the same plot faceted by year.
ggplot(data, aes(x=LivestockAbundance, y=SR, group=LandUse, color=LandUse)) + facet_wrap(~ Year, ncol=3) + geom_point() +
stat_smooth(aes(x=LivestockAbundance, y=SR), method = "gam", formula = y ~ x) +
scale_color_manual(name = "Land use", values = c("#FF7F00", "#8B0000", "#66CD00","#006400")) +
labs(x = "Livestock abundance", y = "Estimated point species richness") +
theme_bw() +
theme(legend.position=c(0.85,0.12),legend.title=element_text(size=14),legend.text=element_text(size=12)) +
theme(axis.text = element_text(size = 11), axis.title = element_text(size = 15), strip.text = element_text(size=15))
Very cool!!
The second example comes from this manuscript.
In this study, we combined eight 2-km long transects surveyed monthly between October 2017 and March 2020 with a hierarchical multi-species and multi-season distance sampling modelling framework to: (1) estimate monthly density of an ensemble of 10 different large herbivores and (2) understand how species respond to changes in vegetation productivity and time across the Naboisho Conservancy in the Greater Mara Ecosystem, Kenya.
This first plot shows the average density of species across time with corresponding credible intervals.
Load the data. The data for plotting are organised in lists. The first plot uses the first element of datalist, while the second plot uses 3 different datasets contained in datalist2.
load('./data/ListForPlotData')
load('./data/ListForPlotData2')
You can inspect the data we are going to use.
head(datalist[[1]])
## Species Date MedianAb CI CI_low CI_high mean sd
## 1 Eland 2020-03-01 4.1766996 0.89 1.647684048 7.116809 4.498319 1.878446
## 2 Eland 2020-02-01 14.7177689 0.89 7.331800163 22.200248 15.305441 4.853975
## 3 Eland 2020-01-01 16.4567753 0.89 8.813831968 25.198753 17.045384 5.398528
## 4 Eland 2019-12-01 2.3835696 0.89 0.608483624 4.613852 2.689674 1.415061
## 5 Eland 2019-11-01 0.5767496 0.89 0.006554621 1.837341 0.944494 1.357441
## 6 Eland 2019-10-01 0.5971446 0.89 0.030504372 1.943356 0.940102 1.252115
## Year Mass
## 1 2020 419
## 2 2020 419
## 3 2020 419
## 4 2019 419
## 5 2019 419
## 6 2019 419
head(datalist2[[1]])
## x y time Species
## 1 0.235640 0.1695834 2017-10-01 Hartebeest
## 2 0.249109 0.1820851 2017-10-01 Hartebeest
## 3 0.262578 0.1947698 2017-10-01 Hartebeest
## 4 0.276047 0.2075512 2017-10-01 Hartebeest
## 5 0.289516 0.2203360 2017-10-01 Hartebeest
## 6 0.302985 0.2330247 2017-10-01 Hartebeest
head(datalist2[[2]])
## Species y yUpper yLower x
## 1 Hartebeest 0.2079740 0.4425786 0.09190152 0.235640
## 2 Hartebeest 0.2225848 0.4467321 0.10418357 0.249109
## 3 Hartebeest 0.2373915 0.4458057 0.11616434 0.262578
## 4 Hartebeest 0.2525141 0.4542040 0.12976854 0.276047
## 5 Hartebeest 0.2664800 0.4607097 0.14319463 0.289516
## 6 Hartebeest 0.2806685 0.4662031 0.15587398 0.302985
head(datalist2[[3]])
## Species y yUpper yLower x
## 1 Community 1.146447 2.884560 0.4580493 0.235640
## 2 Community 1.267944 3.163149 0.5149669 0.249109
## 3 Community 1.394953 3.441288 0.5744531 0.262578
## 4 Community 1.530966 3.706783 0.6322530 0.276047
## 5 Community 1.664420 3.971599 0.6961910 0.289516
## 6 Community 1.806478 4.249713 0.7626169 0.302985
This first plot, which is Fig. 2 in the manuscript, shows the changes in monthly abundance for each species.
This plot uses geom_point() and facet_wrap() as before. To plot the credible intervals around the density estimates, I use the geom_errorbar() function, to which you need to supply the ymin and ymax values. Also note the use of scale_x_date() to format the x axis labels as dates.
ggplot(datalist[[1]], aes(x=as.Date(Date), y=MedianAb)) +
facet_wrap(~Species, scales="free_y", nrow = 4, ncol = 3) +
labs(y="Estimated ind/km2", x = "Time") +
geom_point(size=2) +
geom_errorbar(aes(ymin = CI_low, ymax = CI_high), width = 0.5) +
theme(axis.text=element_text(size=8),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(size=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(),
legend.position = c(0.85, 0.1),
legend.title = element_blank()) +
scale_x_date(date_breaks = "6 months", date_labels = "%Y %b")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The next plot is quite complex. It consists of 3 different datasets, showing the full potential of ggplot.
The plot shows the response of each species' abundance (as the estimated number of groups) to vegetation productivity (as NDVI). This is Fig. 4 of the manuscript. There are 3 different datasets because the hierarchical multi-species and multi-year model allows you to get estimates for each month, but also an average for each species across all months and a community average.
The first dataset (datalist2[[1]]) contains the estimated number of groups per species and per month for a range of NDVI values. The second dataset (datalist2[[2]]) contains the average estimate of the number of groups per species across all months. The third dataset (datalist2[[3]]) contains the average estimated number of groups for the 'community' (i.e., the assemblage of all species included in the analysis).
Let’s plot one by one, and then we combine them all together.
Here, we have the estimated number of groups on the y axis as a function of NDVI on the x axis for each month, so each line represents a different month (group = factor(time)). We use geom_line() to plot the lines. We facet by species, letting the y axis vary freely among species: facet_wrap(~Species, scales="free_y").
ggplot(datalist2[[1]], aes(x = x, y = y, group = factor(time))) + facet_wrap(~Species, scales="free_y") + geom_line(colour = 'grey') +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none') +
scale_color_manual(values = rep('darkgray', 34))
We will now plot the second dataset. In this case, we have the average estimate of the number of groups across all months. Note that we now have credible intervals for the estimates of the number of groups. In this case, we use the function geom_ribbon() to plot the credible intervals. The alpha argument controls the transparency.
ggplot(datalist2[[2]], aes(x = x, y = y, group = Species)) + facet_wrap(~Species, scales="free_y") + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(data = datalist2[[2]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none')
Finally, we have a plot for the community.
ggplot(datalist2[[3]], aes(x = x, y = y, group = Species)) + facet_wrap(~Species, scales="free_y") + geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = 'black', alpha = 0.5) + geom_line(linewidth=1, colour = 'black') +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none')
We can now combine all datasets in the same plot by adding multiple layers, each time specifying the new dataset. So, there are a few modifications to the code of each individual plot. This is the final code:
ggplot(datalist2[[1]], aes(x = x, y = y, group = factor(time))) + facet_wrap(~Species, scales="free_y") + geom_line(colour = 'grey') + geom_ribbon(data = datalist2[[2]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) + geom_line(data = datalist2[[2]], aes(x = x, y = y, group = Species), linewidth=1, colour = 'blue') + geom_ribbon(data = datalist2[[3]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = 'black', alpha = 0.5) + geom_line(data = datalist2[[3]], aes(x= x, y = y, group = Species), linewidth=1, colour = 'black') +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none') +
scale_color_manual(values = rep('darkgray', 34))
Note that, for the community plot to be added to the facets, the data frame needs to have a column called Species with the value Community. The Species column in each dataset needs to be specified as a factor with levels corresponding to all species plus the community, as shown here:
factor(data$Species, levels= c("Hartebeest", 'Eland','Giraffe',"Grant's gazelle",'Impala',"Thomson's gazelle",'Topi','Warthog','Wildebeest',"Plains zebra", "Community"))
Phylopic is a free database of silhouette images of organisms. You can access those images and use them in your scientific figures. In R, you can use the rphylopic package to download and incorporate these silhouettes into your ggplot2 visualizations. Pretty cool. Here's an example to get you started with using Phylopic silhouettes in R with ggplot2.
First, let's load the package (and install it if you have not done so yet).
#install.packages("rphylopic")
library(rphylopic)
## You are using rphylopic v.1.5.0. Please remember to credit PhyloPic contributors (hint: `get_attribution()`) and cite rphylopic in your work (hint: `citation("rphylopic")`).
The first thing we need to do to use a Phylopic image is to search for the desired organism and retrieve its unique identifier (UUID). The rphylopic package provides functions to search for and download these images.
Here's how you can search for and download a Phylopic image of a giraffe. First, we search for the available images with the pick_phylopic() function. Some species have multiple options to choose from; here I am asking R to display no more than 4 images. This is an interactive function and you will need to enter which image to display. This is very useful for choosing which silhouette you want to use.
pick_phylopic(name = "Giraffa camelopardalis", n = 4, view = 1)
You will see something like this:
Once you know which image you want to use, retrieve its unique identifier using get_uuid(). Here, we are choosing the first image.
giraffe <- get_uuid(name = "Giraffa camelopardalis", n = 1)
giraffe
## [1] "b35f867d-26e8-4c48-806e-0f07f18b1844"
Now, let’s save this uuid in a data.frame. Note, you could add more species.
phylopicsuuid <- data.frame(Species = "Giraffe",
uuid = c(giraffe))
Let’s now add the giraffe silhouette to the figure we created before. First, we need to filter the data to keep just the giraffe.
forgirplot <- filter(datalist2[[2]], Species == 'Giraffe')
We can recreate the previous plot but using the giraffe only data.
ggplot(forgirplot, aes(x = x, y = y)) + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none')
Now, we will use the function geom_phylopic() to add the giraffe silhouette to our plot. We need to specify the dataframe with the uuid as the data for geom_phylopic(). The tricky part, and it is quite trial and error, is defining the coordinates where the silhouette will appear (x and y); these will depend on the data you are plotting. Getting the size right is also trial and error. Color, fill, and alpha work as in other ggplot functions.
ggplot(forgirplot, aes(x = x, y = y)) + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
geom_phylopic(data = phylopicsuuid, aes(x = 0.5, y = 1.1, uuid = uuid), size = 1.9, color = 'black', fill = 'black', alpha = 0.3) +
labs(y=expression("Nr. of groups"), x = "NDVI") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=12),
axis.line=element_line(linewidth=0.5),
axis.ticks=element_line(linewidth=0.5),
strip.text=element_text(size=11),
strip.background=element_blank(),
panel.border=element_blank(),
panel.grid=element_blank(), legend.position = 'none')
## Warning: Using the `size` aesthetic in this geom was deprecated in rphylopic 1.5.0.
## ℹ Please use the `height` and `width` aesthetics instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Very cool!!
Stay tuned for more examples.
We will now start to work with spatial data (i.e. vector and raster) in R.
Before we start, and if you are not using an R project, do not forget to set your working directory
setwd("...")
# don't forget to set the directory to your own data folder!!
Next, load packages:
library(sf)
library(terra)
library(tidyverse)
There is a lot of information on the internet about geospatial data and R. Here, I summarize the basic tools for conducting spatial analyses in R, with the goal that you will build your skills as we progress into some case studies. I have modified some of the code from Claudia Engel (https://cengel.github.io/R-spatial/). I recommend this site for a more in-depth tutorial.
Vector data are composed of points, lines, and polygons.
plot(st_point(c(0,0)))
plot(st_point(c(0.5,0.5)), add = T)
plot(st_linestring(matrix(runif(6), ncol=2)) )
plot(st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))))
Any spatial data collected using GPS devices or downloaded online must have a projection system that links the coordinates to specific locations on the earth's surface. Without a coordinate reference system, we cannot locate places on Earth.
Any given geospatial dataset (e.g., a set of points) with an incorrect or unknown projection system can be visualized on its own. However, you will not be able to plot it together with other spatial data, because the coordinate reference system is missing. More importantly, you will not be able to conduct any spatial analysis (e.g., calculate the distance from a given point to a road, or extract the value of the underlying raster at the point locations), since the incorrectly projected (or unprojected) spatial data point to a completely different location on Earth than intended.
Spatial data without the correct projection system are basically useless. When you acquire or download geospatial data, make sure you know what projection the data were originally created in (this information is usually found in the metadata of the dataset).
There are EPSG codes that can be used to avoid manually setting all the parameters needed in a projection. For instance, the code EPSG:32637 stands for the UTM zone 37N projection with the WGS84 datum and ellipsoid. You can find the EPSG reference list at the following link: http://spatialreference.org/ref/epsg/
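For instance, you can print the full definition behind an EPSG code with the st_crs() function (a minimal check; it assumes the sf package loaded above):
# Inspect what EPSG:32637 stands for (WGS 84 / UTM zone 37N)
st_crs(32637)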
There have been two main packages in R for handling geospatial vector data: the sp package and the sf package.
The first package to provide classes and methods for spatial data types in R was called sp. Development of the sp package began in the early 2000s in an attempt to standardize how spatial data would be treated in R and to allow for better interoperability between different analysis packages that use spatial data. The package (first released on CRAN in 2005) provides classes and methods to create points, lines, polygons, and grids and to operate on them. About 350 packages developed for spatial analysis use the data types implemented in sp, which means several spatial packages "depend" on the sp package in some way.
The sp workflow also relied on the rgeos and rgdal packages. However, at the end of 2023 these packages were retired, primarily because their maintainer, Roger Bivand, retired. You can read more about it here.
Geospatial vector work in R is now handled by the more modern package: sf. It was first released on CRAN in late October 2016. It implements a formal standard called "Simple Features" that specifies a storage and access model for spatial geometries (i.e., points, lines, polygons). sf provides fast I/O, which is particularly relevant for large files.
The sf Package
The sf class is an extension of data frames. Essentially, sf objects can be treated as data frames that also contain spatial data, as opposed to spatial data that may or may not also contain data frames. This enables sf objects to fit within the tidyverse workflow, making it possible to manipulate them with the dplyr commands we saw before.
sf objects consist of rows of features, hence the name “simple features (sf)”, which have both non-spatial and spatial data formats.
The spatial data of an sf object is contained in a special geometry column that is of class ‘sfc’. The geometry column contains the basic types of spatial data: the Coordinate Reference System (CRS), coordinates, and type of geometric object.
The ‘sfc’ class has seven subclasses to denote the types of geometric objects within the geometry column, which are derived from the simple features standard. The possible geometric objects are point, linestring, polygon, multipoint, multilinestring, multipolygon, and geometrycollection, which is used for any combination of the other types.
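As a small illustration (not part of the workflow below), you can build a bare geometry column with st_sfc() and inspect its class and geometry type:
# Two points combined into an 'sfc' geometry column with a CRS
pts <- st_sfc(st_point(c(36.8, -1.3)), st_point(c(39.7, -4.0)), crs = 4326)
class(pts)            # "sfc_POINT" "sfc"
st_geometry_type(pts) # POINT POINT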
Now let's do some operations using sf.
In your research you may create spreadsheets that contain data associated with spatial coordinates. You already know how to read the spreadsheet into a data frame with read.table or read.csv. We can then very easily convert the table into a spatial object in R.
Let's start by converting a data frame into an sf object. For this, our data frame needs to have at least two columns indicating the coordinates of points. This could be the location of animal observations recorded in the field, for instance.
We will use the st_as_sf() function, which takes a vector of the coordinate column names for the 'coords' argument, and either an EPSG code or a PROJ definition for the crs argument.
mt_df <- read.csv("data/ke_major-towns.csv", header = TRUE)
# Create sf object with geo_data data frame and CRS
points_sf <- st_as_sf(mt_df, coords = c("lon", "lat"), crs = 4326)
print(points_sf)
## Simple feature collection with 55 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 34.0701 ymin: -4.16856 xmax: 41.8401 ymax: 4.26594
## Geodetic CRS: WGS 84
## First 10 features:
## TOWN_NAME geometry
## 1 Moyale POINT (39.0671 3.51502)
## 2 Mandera POINT (41.8401 3.93175)
## 3 Wajir POINT (40.0699 1.74579)
## 4 Marsabit POINT (37.9998 2.32618)
## 5 Lodwar POINT (35.6075 3.12534)
## 6 Lokitaung POINT (35.7604 4.26594)
## 7 Kitale POINT (35.0197 1.02763)
## 8 Eldoret POINT (35.284 0.511704)
## 9 Kapenguria POINT (35.1246 1.23239)
## 10 Bungoma POINT (34.5667 0.566278)
We now have an sf object containing the locations of major towns in Kenya.
If you need to assign a CRS to an sf object, you can use the function st_set_crs().
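For example, if points_sf had been created without a CRS, a minimal sketch (assuming the coordinates are longitude/latitude on WGS84) would be:
# Assign, not transform: this only labels the existing coordinates with a CRS
points_sf <- st_set_crs(points_sf, 4326)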
To reproject an sf object, you can use the function st_transform(). You need to provide the object and the new CRS.
Here, we are going to project the spatial object from WGS84 to UTM37N.
points_sf_UTM <- st_transform(points_sf, crs="+proj=utm +zone=37 +north +ellps=WGS84 +datum=WGS84 +units=m")
points_sf_UTM
## Simple feature collection with 55 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -49256.84 ymin: -460773.8 xmax: 815422 ymax: 472280.3
## Projected CRS: +proj=utm +zone=37 +north +ellps=WGS84 +datum=WGS84 +units=m
## First 10 features:
## TOWN_NAME geometry
## 1 Moyale POINT (507452.6 388520.5)
## 2 Mandera POINT (815422 435119.7)
## 3 Wajir POINT (619005.1 192996.7)
## 4 Marsabit POINT (388788.2 257153.6)
## 5 Lodwar POINT (122836.8 346056.7)
## 6 Lokitaung POINT (140314.8 472280.3)
## 7 Kitale POINT (56804.23 113860.5)
## 8 Eldoret POINT (86226.52 56678.64)
## 9 Kapenguria POINT (68533.06 136530.5)
## 10 Bungoma POINT (6212.445 62779.85)
We can see that the projection is now UTM 37N.
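An equivalent, shorter call uses the EPSG code for WGS 84 / UTM zone 37N mentioned earlier; this should give the same result as the PROJ string above:
points_sf_UTM <- st_transform(points_sf, crs = 32637)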
Reading shapefiles into R using the sf package is quite simple. We can use the function st_read():
mammals <- st_read("data/ke_mammals_WGS84.shp")
## Reading layer `ke_mammals_WGS84' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/ke_mammals_WGS84.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 11128 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 33.9207 ymin: -4.677568 xmax: 41.91134 ymax: 5.428362
## Geodetic CRS: WGS 84
mammals
## Simple feature collection with 11128 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 33.9207 ymin: -4.677568 xmax: 41.91134 ymax: 5.428362
## Geodetic CRS: WGS 84
## First 10 features:
## MAMMALS geometry
## 1 59 POLYGON ((35.72504 5.392665...
## 2 59 POLYGON ((35.6244 5.402578,...
## 3 60 POLYGON ((35.46943 5.427557...
## 4 59 POLYGON ((35.59788 5.412438...
## 5 60 POLYGON ((35.41483 5.426372...
## 6 59 POLYGON ((35.49558 5.428124...
## 7 58 POLYGON ((35.74144 5.381304...
## 8 59 POLYGON ((35.32218 5.424792...
## 9 60 POLYGON ((35.61533 5.405414...
## 10 59 POLYGON ((35.83205 5.343137...
What kind of vector data is this?
All sf objects can generally be treated as data frames when wrangling the data. This makes it possible to use workflows from the tidyverse package.
For instance, returning the subset of mammals as we did before can be done with dplyr's filter() function.
Working with the mammals sf object, let's filter polygons that contain more than 60 species of mammals:
mammals_60 <- mammals %>% filter(MAMMALS > 60)
mammals_60
## Simple feature collection with 6884 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 33.9207 ymin: -4.677568 xmax: 40.33287 ymax: 5.385777
## Geodetic CRS: WGS 84
## First 10 features:
## MAMMALS geometry
## 1 61 POLYGON ((35.49277 5.385777...
## 2 61 POLYGON ((35.73524 5.37106,...
## 3 61 POLYGON ((35.49269 5.362251...
## 4 61 POLYGON ((35.78922 5.358685...
## 5 61 POLYGON ((35.56057 5.294257...
## 6 61 POLYGON ((35.54746 5.28794,...
## 7 62 POLYGON ((35.49659 5.263432...
## 8 61 POLYGON ((35.7855 5.264725,...
## 9 61 POLYGON ((35.49269 5.261552...
## 10 61 POLYGON ((35.79181 5.251389...
We now have a new object with only polygons containing more than 60 species of mammals.
You can save this new object as a shapefile using the st_write() function. You need to provide the sf object and the path and name where you want to save the shapefile. By default, this function checks whether the file already exists, and if so it will not overwrite it. You can set the argument delete_layer = TRUE to overwrite a shapefile with the same name.
st_write(mammals_60, "data/mammals_60", driver = "ESRI Shapefile", delete_layer = TRUE)
This is a very basic introduction to vector data. In future workflows, we will learn to use more spatial tools to manipulate vector data.
Note that we have not plotted any vector data yet. You can use the function plot() to plot sf objects, but the plots are not great. We will learn to use another package for making maps later.
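If you want a quick look anyway, a minimal option is to plot only the geometry column, so that sf does not draw one panel per attribute:
plot(st_geometry(mammals_60))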
Raster processing is an important aspect of building spatial datasets for further analysis. Doing so within R can greatly help streamline workflows, allowing for all analyses to be completed in a single program. Here, we will walk through a series of steps to provide the basic skills necessary for image analysis.
A raster image is composed of pixels or cells. Each raster contains a CRS, the coordinates of its origin, its dimensions, and an array of cell values.
The package raster was the main package for working with raster data in R. In 2020, the package terra was released and, thanks to its greater functionality and efficiency, it has now replaced the raster package.
For this exercise we will use a Landsat 8 image. Before we start processing the image it is good to remember the different bands of Landsat 8:
Band | Band name and wavelength (µm) | Useful for mapping |
---|---|---|
Band 1 | Coastal Aerosol 0.435 - 0.451 | Coastal and aerosol studies |
Band 2 | Blue 0.452 - 0.512 | Bathymetric mapping, distinguishing soil from vegetation, and deciduous from coniferous vegetation |
Band 3 | Green 0.533 - 0.590 | Emphasizes peak vegetation, which is useful for assessing plant vigor |
Band 4 | Red 0.636 - 0.673 | Discriminates vegetation slopes |
Band 5 | Near Infrared (NIR) 0.851 - 0.879 | Emphasizes biomass content and shorelines |
Band 6 | Short-wave Infrared (SWIR) 1 1.566 - 1.651 | Discriminates moisture content of soil and vegetation; penetrates thin clouds |
Band 7 | Short-wave Infrared (SWIR) 2 2.107 - 2.294 | Improved moisture content of soil and vegetation and thin cloud penetration |
Band 8 | Panchromatic 0.503 - 0.676 | 15 meter resolution, sharper image definition |
Band 9 | Cirrus 1.363 - 1.384 | Improved detection of cirrus cloud contamination |
Band 10 | TIRS 1 10.60 - 11.19 | 100 meter resolution, thermal mapping and estimated soil moisture |
Band 11 | TIRS 2 11.50 - 12.51 | 100 meter resolution, Improved thermal mapping and estimated soil moisture |
BQA | Landsat Collection 1 QA Bitmask |
We use the function rast()
to read a raster. In this
case we are reading the Landsat8 image.
Image <- rast("./data/Landsat8.tif")
Image
## class : SpatRaster
## dimensions : 795, 814, 12 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 287430, 311850, -2490, 21360 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637)
## source : Landsat8.tif
## names : B1, B2, B3, B4, B5, B6, ...
Questions: What is the spatial resolution of this image? What is the CRS? How many bands are there?
If you only want to read the first 11 layers, you can use the argument lyrs:
Image2 <- rast("./data/Landsat8.tif", lyrs = c(1:11))
Image2
## class : SpatRaster
## dimensions : 795, 814, 11 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 287430, 311850, -2490, 21360 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637)
## source : Landsat8.tif
## names : B1, B2, B3, B4, B5, B6, ...
Similar to a data.frame, raster objects store all the relevant information to do data manipulation and/or visualization.
print(Image) # Alternatively, you could just type: Image
## class : SpatRaster
## dimensions : 795, 814, 12 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 287430, 311850, -2490, 21360 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637)
## source : Landsat8.tif
## names : B1, B2, B3, B4, B5, B6, ...
# Raster Class
class(Image)
## [1] "SpatRaster"
## attr(,"package")
## [1] "terra"
# Names of the raster layers
names(Image)
## [1] "B1" "B2" "B3" "B4" "B5"
## [6] "B6" "B7" "B10" "B11" "sr_aerosol"
## [11] "pixel_qa" "radsat_qa"
# Summary statistics (min, max, quartiles)
summary(Image)
## B1 B2 B3 B4
## Min. :-2000.0 Min. :-2000.0 Min. : 63.0 Min. : 26.0
## 1st Qu.: 263.0 1st Qu.: 333.0 1st Qu.: 571.0 1st Qu.: 660.0
## Median : 373.0 Median : 461.0 Median : 737.0 Median : 965.0
## Mean : 372.6 Mean : 459.4 Mean : 733.6 Mean : 935.2
## 3rd Qu.: 481.0 3rd Qu.: 587.0 3rd Qu.: 890.0 3rd Qu.:1201.0
## Max. : 2189.0 Max. : 2576.0 Max. :3183.0 Max. :3446.0
## B5 B6 B7 B10 B11
## Min. : 159 Min. : 133 Min. : 130 Min. :2880 Min. :2876
## 1st Qu.:1964 1st Qu.:1882 1st Qu.:1253 1st Qu.:3010 1st Qu.:2992
## Median :2207 Median :2440 Median :1788 Median :3043 Median :3022
## Mean :2236 Mean :2421 Mean :1754 Mean :3030 Mean :3011
## 3rd Qu.:2467 3rd Qu.:2971 3rd Qu.:2251 3rd Qu.:3065 3rd Qu.:3043
## Max. :6027 Max. :5603 Max. :5769 Max. :3120 Max. :3090
## sr_aerosol pixel_qa radsat_qa
## Min. : 8.0 Min. :322.0 Min. :0
## 1st Qu.: 96.0 1st Qu.:322.0 1st Qu.:0
## Median : 96.0 Median :322.0 Median :0
## Mean :117.2 Mean :322.1 Mean :0
## 3rd Qu.:160.0 3rd Qu.:322.0 3rd Qu.:0
## Max. :228.0 Max. :480.0 Max. :0
# Spatial extent
ext(Image)
## SpatExtent : 287430, 311850, -2490, 21360 (xmin, xmax, ymin, ymax)
You can check the projection of your raster using crs():
crs(Image, proj = TRUE)
## [1] "+proj=utm +zone=37 +datum=WGS84 +units=m +no_defs"
To reproject a raster, you need to use the function project(). Note that because the pixel positions will shift slightly, you need to provide an argument for the method used to recalculate the new pixel values. In general, you use 'bilinear' for continuous data and 'near' (nearest neighbor) for categorical data.
Imagelatlong <- project(Image, "+init=epsg:4326", method = 'bilinear')
crs(Imagelatlong, proj = TRUE)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
Raster images can be plotted using a number of commands. Try a few commands to see how the results differ.
# A basic plot of a single band (band 4); plot() has many more options
plot(Image,4)
# Plot a 3-band RGB image
plotRGB(Image, r=4, g=3, b=2, stretch = "lin", axes=TRUE, xlab="Easting", ylab="Northing", main="Landsat 8, Bands 4,3,2 (RGB) - True Color Composite")
# Plot a 3-band RGB image in false color composite
plotRGB(Image, r=5, g=4, b=3, stretch = "lin", axes=TRUE, xlab="Easting", ylab="Northing", main="Landsat 8, Bands 5,4,3 (RGB) - False Color Composite")
The real strength of the terra package is the ability to perform complex mathematical operations.
NDVI
One of the most commonly used raster products in terrestrial ecology is the Normalized Difference Vegetation Index (NDVI). Calculated from a ratio of the red and near-infrared spectral bands, the index is a measure of greenness (i.e., density of live green vegetation). NDVI values fall within the -1 to 1 range. Negative NDVI values (values approaching -1) likely correspond to water. Values close to zero (-0.1 to 0.1) generally correspond to barren areas of rock, sand, or snow. Low positive values represent shrub and grassland (approximately 0.2 to 0.4), while high values indicate dense vegetation, such as that found in temperate and tropical forests (values approaching 1).
To calculate NDVI, we can do a simple math calculation to manipulate the raster data layers. NDVI is:
\[NDVI = \frac{(NIR - red)}{(NIR + red)}\] where red and NIR stand for the spectral reflectance measurements acquired in the red (visible) and near-infrared regions of the electromagnetic spectrum.
# Using standard math operators, extract the red band (band 4) and the near-infrared band (band 5) from the image.
redBand <- Image$B4
nirBand <- Image$B5
# Create NDVI
ndvi <- (nirBand-redBand)/(nirBand+redBand)
plot(ndvi)
hist(ndvi)
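As an optional aside (not part of the original workflow), the approximate thresholds described above can be turned into a categorical layer with terra's classify() function:
# Reclassification matrix: from, to, becomes
rcl <- matrix(c(-1, 0.1, 1,   # water / barren
                0.1, 0.4, 2,  # shrub / grassland
                0.4, 1,   3), # dense vegetation
              ncol = 3, byrow = TRUE)
ndvi_class <- classify(ndvi, rcl)
plot(ndvi_class)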
One of the most important steps in raster processing is developing samples of spatial datasets for further analysis. We will see how to extract pixel values from the NDVI raster. For this example, we will create random points using the st_sample() function within the bounding box of the NDVI raster.
# Select random samples (pixels) from a raster
ranSamps <- st_sample(st_as_sfc(st_bbox(ndvi)), size=10, type = "random")
ranSamps <- st_as_sf(ranSamps) #Convert to simple feature collection
ranSamps
## Simple feature collection with 10 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 288421 ymin: -882.7695 xmax: 308887.8 ymax: 17400.44
## Projected CRS: WGS 84 / UTM zone 37N
## x
## 1 POINT (306968.6 414.9227)
## 2 POINT (292759.3 5147.23)
## 3 POINT (302153.5 17400.44)
## 4 POINT (300060.8 -882.7695)
## 5 POINT (300599 9189.424)
## 6 POINT (292283.8 15371.02)
## 7 POINT (291921.1 11205.62)
## 8 POINT (297620.5 4256.651)
## 9 POINT (288421 16766.3)
## 10 POINT (308887.8 2033.875)
# Plot ndvi and the random points
plot(ndvi)
plot(ranSamps, add = T)
Now, we use the extract() function to get the NDVI pixel values.
ndvivalues <- terra::extract(ndvi, ranSamps)
ndvivalues
## ID B5
## 1 1 0.7694704
## 2 2 0.2596119
## 3 3 0.2643312
## 4 4 0.7040000
## 5 5 0.3405311
## 6 6 0.2721879
## 7 7 0.2496100
## 8 8 0.8188235
## 9 9 0.4093344
## 10 10 0.7205479
This function is the basis for many spatial analyses, as it allows you to extract pixel values at locations that can then be used as predictor variables in models. We will see that later on.
You can also save rasters to your hard drive so that they can be read into other programs (e.g., ArcGIS, QGIS). Use the writeRaster() function.
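For example, a minimal call to save the NDVI layer we just created as a GeoTIFF (the file name here is just an illustration) would be:
writeRaster(ndvi, "data/ndvi_landsat8.tif", overwrite = TRUE)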
Now let's conduct some simple analyses. Note that, for each task we try to accomplish here, there are multiple functions from different packages that can do the same job. Here we only introduce one way of doing it. When you become more familiar with R, you will develop your own preferences for specific packages and functions.
birds <- st_read("data/Bird_pts.shp")
## Reading layer `Bird_pts' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Bird_pts.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 20 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
fences<- st_read("data/fences.shp")
## Reading layer `fences' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/fences.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 353 features and 13 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: 744545.6 ymin: 4305558 xmax: 749943.7 ymax: 4310110
## Projected CRS: WGS 84 / UTM zone 17N
habitat<- st_read("data/habitat.shp")
## Reading layer `habitat' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/habitat.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 72 features and 7 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 744545.6 ymin: 4305557 xmax: 749947.7 ymax: 4310110
## Projected CRS: WGS 84 / UTM zone 17N
roads<- st_read("data/roads.shp")
## Reading layer `roads' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/roads.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 108 features and 9 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: 744483.1 ymin: 4305488 xmax: 749995.6 ymax: 4310127
## Projected CRS: WGS 84 / UTM zone 17N
landcover<- rast("data/landcover.img")
ndvi<- rast("data/ndvi250m.tif")
canvas<- rast(ext(roads), crs=crs(roads), resolution=30)
road.r<- rasterize(roads, canvas) # the distance function only works with raster data type, so here rasterize roads first
distance<- distance(road.r)
plot(distance)
plot(roads, add=T)
## Warning in plot.sf(roads, add = T): ignoring all but the first attribute
What do you need to do to calculate the distance between each bird occurrence location and the nearest road?
Calculate a 50 m buffer and a 100 m buffer along the fences and plot the results. Note that when plotting sf objects, a plot is created for each attribute. To avoid multiple plots, we can extract the geometry and plot that instead.
fence_50m<- st_buffer(fences, 50)
plot(st_geometry(fence_50m))
fence_100m<- st_buffer(fences, 100)
plot(st_geometry(fence_100m))
Now we want to convert the buffer polygon to a raster, setting the values of all pixels within the 100 m buffer area to 1 and the pixels outside the buffer area to 0.
rasterize.buff<- rasterize(fence_100m, ndvi, field=1, background=0)
# Notice that we used the existing ndvi layer as the template for the new raster: the rasterized layer will adopt the extent, projection, and pixel size of the template, which is convenient.
plot(rasterize.buff)
Next, we extract raster values at point locations so we can quantify the environmental conditions captured by a raster at our points of interest. Notice that for the extract function to work, the points and the raster data need to overlap in extent and share the same projection.
landcover.extract<- terra::extract(landcover, birds)
birds # this is the original data table
## Simple feature collection with 20 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
## id Presence geometry
## 1 1 1 POINT (748020.7 4308458)
## 2 2 1 POINT (747684.4 4308521)
## 3 3 1 POINT (747940.9 4308299)
## 4 4 1 POINT (749314.8 4308116)
## 5 5 1 POINT (748397 4308208)
## 6 6 1 POINT (749121 4307797)
## 7 7 1 POINT (748824.5 4307586)
## 8 8 1 POINT (749195.1 4307432)
## 9 9 1 POINT (749343.3 4307552)
## 10 10 1 POINT (748784.6 4307911)
birds.extract<- cbind(birds, landcover.extract)
birds.extract # inspect the differences
## Simple feature collection with 20 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
## id Presence ID landcover geometry
## 1 1 1 1 11 POINT (748020.7 4308458)
## 2 2 1 2 11 POINT (747684.4 4308521)
## 3 3 1 3 11 POINT (747940.9 4308299)
## 4 4 1 4 11 POINT (749314.8 4308116)
## 5 5 1 5 11 POINT (748397 4308208)
## 6 6 1 6 11 POINT (749121 4307797)
## 7 7 1 7 11 POINT (748824.5 4307586)
## 8 8 1 8 11 POINT (749195.1 4307432)
## 9 9 1 9 11 POINT (749343.3 4307552)
## 10 10 1 10 11 POINT (748784.6 4307911)
subset.habitat<- habitat %>% filter(HABITAT =="Oak")
par(mfrow= c(1,2))
plot(habitat)
plot(subset.habitat)
How would you select only the polygons with an area larger than 100,000 square meters?
R offers the opportunity to create beautiful high quality maps for publication. This saves you from having to export all files to make the final maps in other software, such as ArcGIS or QGIS.
Making maps is like art. Good maps can help enormously with presenting your results to an audience. It takes skills and practice.
When combining R's powerful data processing with visualization packages, we can move beyond static maps. Here, we will learn the basics of making three types of maps:
We will use the 'tmap' package, which is a great addition to R's mapping tools. tmap is a powerful and flexible package with a concise syntax that will feel familiar to ggplot2 users. It also has the unique capability of generating static and interactive maps from the same code via tmap_mode().
library(tmap)
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(spData)
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
Before we start using tmap, remember that you can always use the help. Some functions in tmap have many arguments that can be edited. We will cover some basics here, hoping that you will explore more options on your own.
We will use generic data to demonstrate the basics, and then use some of the data from this chapter to create a publishable map.
Static maps are the most common and traditional outputs from geospatial analysis.
Tmap syntax involves a separation between the input data and the aesthetics for data visualization. Each input dataset can be displayed in many different ways, largely depending on the nature of the data (raster, points, polygons, etc.)
The dataset we are using here is the world object from the ‘spData’ package. It contains some variables from the World Bank.
In tmap the basic building block is tm_shape(), which creates a tmap element that defines the input data as vector or raster objects. We can then add one or more layer elements such as tm_fill(), tm_borders(), tm_dots(), or tm_raster(), depending on what you want to display.
Here is a simple example:
# Add fill layer to the world shape
data(world)
tm_shape(world) +
tm_fill()
# Add border layer to the world shape
tm_shape(world) +
tm_borders()
# Add fill and border layers to the world shape
tm_shape(world) +
tm_fill() +
tm_borders()
Objects in tmap can be stored with names, similar to ggplot, and then called or used as a starting point to add more objects.
map1 <- tm_shape(world) +
tm_fill() +
tm_borders()
map1
If we want to add more objects (e.g., another shapefile), we use + tm_shape(new_obj). When a new shape is added in this way, all subsequent aesthetic functions refer to it, until another new shape is added. This syntax allows the creation of maps with multiple shapes and layers.
tm_shape(world) +
tm_borders() +
tm_shape(urban_agglomerations) +
tm_dots(size = "population_millions")
In addition, you can change the aesthetics of the maps using either variable fields (data in your shapefile) or constant values. The most commonly used aesthetics for fill and border layers include color, transparency, line width and line type, set with col, alpha, lwd, and lty arguments, respectively.
Here are some examples using constant values and colors. Note the use of tmap_arrange() to plot multiple maps.
ma1 = tm_shape(world) + tm_fill(col = "red")
ma2 = tm_shape(world) + tm_fill(col = "red", alpha = 0.3)
ma3 = tm_shape(world) + tm_borders(col = "blue")
ma4 = tm_shape(world) + tm_borders(lwd = 3)
ma5 = tm_shape(world) + tm_borders(lty = 2)
ma6 = tm_shape(world) + tm_fill(col = "red", alpha = 0.3) +
tm_borders(col = "blue", lwd = 3, lty = 2)
tmap_arrange(ma1, ma2, ma3, ma4, ma5, ma6)
But we can also color the maps based on some variable of interest. You can use the default breaks for continuous variables or define your own. ‘n’ sets the number of bins into which numeric variables are categorized and ‘palette’ defines the color scheme, which in the following example is ‘BuGn’.
Let's plot population size by country:
tm_shape(world) + tm_polygons(col = "pop") #default breaks
breaks = c(0, 2, 5, 10, 100, 500, 2000) * 1000000 # customized breaks
tm_shape(world) + tm_polygons(col = "pop", breaks = breaks)
tm_shape(world) + tm_polygons(col = "pop", n = 5)
tm_shape(world) + tm_polygons(col = "pop", palette = "BuGn", breaks = breaks)
So far we have been editing the objects, but a good map needs other elements in the layout. These include title, scale bar, north arrow, any text, etc.
Additional elements such as north arrows and scale bars have their own functions: tm_compass() and tm_scale_bar(). Here is an example of how to add the north arrow:
tm_shape(world) +
tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population") +
tm_compass(position = c("left", "top"))
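A scale bar can be added in the same way; for instance, a small extension of the map above (a sketch using the same breaks object defined earlier):
tm_shape(world) +
tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population") +
tm_compass(position = c("left", "top")) +
tm_scale_bar(position = c("left", "bottom"))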
While static maps are sometimes all you need for a publication or a poster, animated maps can be a fantastic way of showing change over time in presentations or on websites where you can display gifs or movies.
There are a few things we need to learn first. In tmap, the function tm_facets() generates faceted maps. Its arguments allow you to display change either as several small maps in one figure or as an animation, depending on which argument you use. If you set 'by', we get several static maps side by side; if we set 'along', we get all maps displayed on top of each other, ready for producing an animation.
For instance, a faceted map can look like this:
tm_shape(world) +
tm_polygons() +
tm_shape(urban_agglomerations) +
tm_symbols(col = "black", border.col = "white", size = "population_millions") +
tm_facets(by = "year", nrow = 4, free.coords = FALSE)
As you may notice, it can be difficult to visualize the difference between so many maps when they are displayed side by side. But if we set along = "year", all the available years are used to create an animation. We also want to set free.coords = FALSE, which maintains the same map extent for each map iteration. We then save the map as an object.
pop_anim <- tm_shape(world) + tm_polygons() +
tm_shape(urban_agglomerations) + tm_dots(size = "population_millions") +
tm_facets(along = "year", free.coords = FALSE)
tmap_animation(pop_anim, filename = "urb_anim.gif", delay = 25)
## Creating frames
## Creating animation
## Animation saved to /Users/ramirocrego/Documents/GitHub/R-Tutorials/urb_anim.gif
Note that the .gif file will be saved to your directory. It will look like this:
It was cool to make animated maps, but we can now take maps to a new level with interactive maps. The release of the ‘leaflet’ package in 2015 revolutionized interactive web map creation from within R and a number of packages have built on it since then.
Here we will learn how to create interactive maps using tmap. tmap has a unique feature that lets you display your maps as either static or interactive. The only caution is that certain elements, such as scale bars and north arrows, are not displayed or used in interactive mode. It also has some limitations on the legends that can be created. But depending on the purpose, the gains are often worth it.
All you need to do is switch to view mode, using the command tmap_mode("view").
Let's display one of our previous maps in interactive mode:
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(world) + tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population size in millions")
You can navigate the map and click on each country to query the population size. You can also change the base map displayed. This type of map is rendered as html and can be saved as an html file to share or even to embed in your own website.
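For instance, a minimal sketch for saving the interactive map to a standalone html file (the file name is just an example) uses tmap_save():
interactive_map <- tm_shape(world) + tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population size in millions")
tmap_save(interactive_map, filename = "world_population.html")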
We will use the previous bird data to create a high quality map.
tmap_mode("plot")
## tmap mode set to plotting
tm_shape(habitat) +
tm_polygons(col="HABITAT", palette = "Greens", title ="Habitat type") +
tm_shape(roads) +
tm_lines(col = "ROAD_USE", palette = "Set1", title.col = "Road use") +
tm_shape(fences) +
tm_lines(lty = 'solid', col = "black") + tm_add_legend("line", col = 'black', title = "Fences") +
tm_shape(birds) +
tm_dots(col = 'red', size = 0.3) + tm_add_legend("symbol", shape = 16, size=0.4 ,col = 'red', title = "Bird occurrence") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("right", "bottom")) +
tm_layout(legend.outside = T)
We can also create an interactive map. Note that some elements have changed. We removed the scale bar and north arrow, and the legends are simplified. There are currently no options for point shapes in tmap view mode.
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(habitat) +
tm_polygons(col="HABITAT", palette = "Greens", title ="Habitat type") +
tm_shape(roads) +
tm_lines(col = "ROAD_USE", palette = "Set1", title.col = "Road use") +
tm_shape(fences) +
tm_lines(lty = 'solid', col = "black") +
tm_shape(birds) +
tm_markers() + tm_add_legend("fill", col = c('black','blue'), labels = c("Fences","Bird locations"))
In this manuscript I used ArcGIS to make the static maps for publication:
But for presentations, I wanted to show the map in a dynamic, animated way. This is the code:
Load data:
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
##
## ======================
## Welcome to heatmaply version 1.5.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags:
## https://stackoverflow.com/questions/tagged/heatmaply
## ======================
SR <- st_read("./data/EstiSpRichness.shp")
## Reading layer `EstiSpRichness' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/EstiSpRichness.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 288 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 201250 ymin: -22500 xmax: 316250 ymax: 92500
## Projected CRS: WGS 84 / UTM zone 37N
colnames(SR)[3:10]<-c(2001,2004,2006,2008,2010,2012,2015,2016)
prop<-st_read("./data/Property_boundary_2015.shp")
## Reading layer `Property_boundary_2015' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Property_boundary_2015.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 314 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 186885.9 ymin: -32707.34 xmax: 321651.3 ymax: 96041.66
## Projected CRS: WGS 84 / UTM zone 37N
prop
## Simple feature collection with 314 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 186885.9 ymin: -32707.34 xmax: 321651.3 ymax: 96041.66
## Projected CRS: WGS 84 / UTM zone 37N
## First 10 features:
## NAME_96 LWF_UNIT LANDUSE AREA TENURE RECNO
## 1 NYS Mar Mar Ranch NORTHERN 2 182832252 Government Land 1
## 2 Louniek SFT WESTERN 6 206768128 Government Land 2
## 3 Lenges Ranch NORTHERN 11 28727854 Private Ranch 3
## 4 Mugie Conservancy NORTHERN 11 183191628 Private Ranch 4
## 5 Laikipia N.Reserve NORTHERN 2 166186628 Government Land 5
## 6 P&D NORTHERN 6 173673828 Small Holder 7
## 7 Laikipia N. Conservancy WESTERN 11 386049301 Private Ranch 8
## 8 L. West WESTERN 8 204385714 Small Holder 9
## 9 Ol Malo NORTHERN 11 7568867 Private Ranch 10
## 10 Mutukanio WESTERN 8 210504638 Small Holder 11
## PERIMETER ACRES HECTARES AREA_KM2 TENURE_211
## 1 56493.40 1828322.52 18283.225 182.83 Government Land
## 2 62937.69 2067681.27 20676.813 206.77 Government Land
## 3 24525.24 287278.54 2872.785 28.73 Privately Owned Ranch
## 4 71924.57 1831916.28 18319.163 183.19 Privately Owned Ranch
## 5 66749.15 1661866.28 16618.663 166.19 Government Land
## 6 100202.37 1736738.28 17367.383 173.67 Privately Owned Smallholding
## 7 81338.23 3860493.01 38604.930 386.05 Privately Owned Ranch
## 8 80095.26 2043857.14 20438.571 204.39 Privately Owned Smallholding
## 9 22844.52 75688.67 756.887 7.57 Privately Owned Ranch
## 10 84720.16 2105046.38 21050.464 210.50 Privately Owned Smallholding
## LANDUSE211 CONSERV ANY_TRISM CONSER_VAL
## 1 Pastoralist Grazing Area 0 0 1
## 2 Pastoralist Grazing Area 0 0 2
## 3 Ranching 0 0 2
## 4 Ranching & Wildlife 1 1 1
## 5 Pastoralist Grazing Area 0 0 2
## 6 Pastoralist Grazing Area 0 0 2
## 7 Wildlife Only 1 1 1
## 8 Pastoralist Grazing Area 0 0 2
## 9 Ranching & Wildlife 1 1 1
## 10 Pastoralist Grazing Area 0 0 2
## CONS_VAL_T CONS_INV LU
## 1 Core wildlife area Passive Pastorial
## 2 Secondary wildlife area Passive Pastorial
## 3 Secondary wildlife area Passive Mixed Farming
## 4 Core wildlife area Active ProWildlife
## 5 Secondary wildlife area Passive Pastorial
## 6 Secondary wildlife area Passive Pastorial
## 7 Core wildlife area Active Wildlife Only
## 8 Secondary wildlife area Passive Pastorial
## 9 Core wildlife area Active ProWildlife
## 10 Secondary wildlife area Passive Pastorial
## geometry
## 1 MULTIPOLYGON (((240602.1 83...
## 2 MULTIPOLYGON (((211302.1 79...
## 3 MULTIPOLYGON (((236329.8 87...
## 4 MULTIPOLYGON (((235449.3 87...
## 5 MULTIPOLYGON (((259130.1 88...
## 6 MULTIPOLYGON (((245758.4 79...
## 7 MULTIPOLYGON (((220833.5 74...
## 8 MULTIPOLYGON (((225684.3 72...
## 9 MULTIPOLYGON (((266006.3 69...
## 10 MULTIPOLYGON (((228857.2 60...
Make a static map for one year
tmap_mode(mode="plot")
## tmap mode set to plotting
tm_shape(prop) + tm_polygons("LU", palette = "Greys", border.col = "white", title="Land use") +
tm_legend(legend.show=T) +
tm_layout(legend.outside = T, scale = 1.3, title = 2016, title.size = 2, frame = F, legend.position = c("left","center")) +
tm_scale_bar(position = c("left","bottom"), size = 0.5) +
tm_compass (position = c("left","top")) +
tm_shape(SR) + tm_bubbles(size = "2016", col= "2016", border.col="white", palette = RdYlGn (6), scale = 1.1, size.max = 13, title.size = "Sp richness", style = "fixed", breaks = c(0,0.990001,3.90001,5.90001,7.90001,9.90001,13), title.col = "Sp richness", legend.col.show = FALSE, legend.size.show = FALSE) +
tm_add_legend("symbol", title = "Sp. richness", col = RdYlGn (6), size = c(0.4,0.5,0.6,0.7,0.8,0.9), labels = c("0", "1-4","4-6","6-8","8-10","10-13"), border.col="white")
## Warning: The argument size of tm_scale_bar is deprecated. It has been renamed
## to text.size
Re-arrange the dataframe so we can use the function tm_facets():
colnames(SR)[3:10]<-c('SR')
Year <- c(2001,2004,2006,2008,2010,2012,2015,2016)
SR2 <- SR[,3]
SR2$Year <- Year[1]
for (i in 2:8){
temp <- SR[,2+i]
temp$Year <- Year[i]
SR2 <- rbind(SR2,temp)
}
Create the dynamic map:
map <- tm_shape(prop) + tm_polygons("LU", palette = "Greys", border.col = "white", title="Land use") +
tm_legend(legend.show=T) +
tm_layout(legend.outside = T, scale = 1.3, frame = F, legend.position = c("left","center")) +
tm_scale_bar(position = c("left","bottom"), text.size = 0.5) +
tm_compass (position = c("left","top")) +
tm_shape(SR2) + tm_bubbles(size = "SR", col= "SR", border.col="white", palette = RdYlGn (6), scale = 1.1, size.max = 13, title.size = "Sp richness", style = "fixed", breaks = c(0,0.990001,3.90001,5.90001,7.90001,9.90001,13), title.col = "Sp richness", legend.col.show = FALSE, legend.size.show = FALSE) +
tm_add_legend("symbol", title = "Sp. richness", col = RdYlGn (6), size = c(0.4,0.5,0.6,0.7,0.8,0.9), labels = c("0", "1-4","4-6","6-8","8-10","10-13"), border.col="white") +
tm_facets(along = "Year", free.coords = FALSE)
tmap_animation(map, filename = "map_anim.gif", delay = 50)
## Creating frames
## Creating animation
## Animation saved to /Users/ramirocrego/Documents/GitHub/R-Tutorials/map_anim.gif
Addax (Addax nasomaculatus) are considered one of the rarest antelopes on earth, with an estimated population of <100 individuals in the wild. We assessed the distribution and occurrence of this species from field surveys collected by the Sahara Conservation Fund over a 7-year period (2008-2014). Our results provide insight into the factors contributing to species occurrence and are guiding field surveys to areas that have the potential to support small and geographically isolated populations. We incorporated field-derived variables of vegetation cover with remote sensing measures of vegetation productivity (NDVI - Normalized Difference Vegetation Index) and surface roughness (derived from SRTM - Shuttle Radar Topography Mission). Models were fit in a generalized linear regression framework to evaluate and predict addax occurrence.
The following analysis follows steps detailed in:
Stabach, J.A., T. Rabeil, V. Turmine, T. Wacher, T. Mueller, and P. Leimgruber. 2017. On the brink of extinction - Habitat selection of addax and dorcas gazelle across the Tin Toumma desert, Niger. Diversity and Distributions 23:581-591.
A link to the repository of the analysis is here.
In this case study, we will demonstrate how to combine the skills learned in GIS, remote sensing, and modeling to provide a predictive surface of addax habitat suitability. This will include:
One of the most difficult (and perhaps most time consuming) aspects of conducting a spatial analysis is the construction of the spatial database. Main components include locating and downloading available spatial data, mosaicking (if necessary) multiple tiles together, reprojecting the data so that all layers have the same coordinate reference system (CRS), and resampling the cell sizes to match.
Multiple sources now exist for finding available spatial data, including EarthExplorer and Google Earth Engine. Some layers, however, can also be accessed directly within R using the geodata package (among others).
Like other exercises in R,
we will first load the necessary packages to perform the analyses. Use
help()
for any operations that you don’t understand or that
require additional information.
# Load required libraries
library(terra)
library(sf)
library(tidyverse)
library(geodata)
First, we will load the spatial points dataframe (our sampling point
locations). This is the occurrence dataset collected by partners in
Niger. In this case, I have saved the file as a shapefile (Lat/Long,
WGS84) that we will load directly into R. This file could just as easily
be loaded as a .csv
and then converted to a spatial object
by defining the projection.
# Load the point data and project to UTM32N
Addax <- st_read("data/Occurrence_dd.shp")
## Reading layer `Occurrence_dd' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Occurrence_dd.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 661 features and 12 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 11.71088 ymin: 16.01946 xmax: 12.27343 ymax: 16.95161
## Geodetic CRS: WGS 84
# Project to UTM32N, WGS84
# See: https://spatialreference.org/
UTM32N.proj <- "epsg:32632"
Addax <- st_transform(Addax, crs = UTM32N.proj) # Note that I am overwriting the original Lat/Long object
# Look at the file
Addax
## Simple feature collection with 661 features and 12 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 789521 ymin: 1773472 xmax: 850100 ymax: 1876866
## Projected CRS: WGS 84 / UTM zone 32N
## First 10 features:
## Pt_ID Date YearMonth Id_1 X Y Addax Dorcas Cornul Stipa1
## 1 13 2008-12-06 200812 23 825358.1 1773472 0 3 0 1
## 2 14 2008-12-06 200812 23 830041.8 1776055 0 5 0 0
## 3 15 2008-12-06 200812 23 835354.5 1779007 0 7 0 0
## 4 17 2008-12-06 200812 23 840739.0 1781912 0 2 0 1
## 5 19 2008-12-06 200812 23 845644.1 1784587 0 1 0 0
## 6 21 2008-12-06 200812 23 850100.0 1787353 0 11 0 0
## 7 16 2008-12-06 200812 12 800058.8 1780978 1 0 0 0
## 8 18 2008-12-06 200812 12 805058.9 1783687 2 0 0 1
## 9 20 2008-12-06 200812 12 809524.6 1786051 5 1 0 0
## 10 22 2008-12-06 200812 12 814934.7 1788977 0 0 0 1
## Stipa2 Human geometry
## 1 0 0 POINT (825358.1 1773472)
## 2 0 0 POINT (830041.8 1776055)
## 3 0 1 POINT (835354.5 1779007)
## 4 0 0 POINT (840739 1781912)
## 5 0 1 POINT (845644.1 1784587)
## 6 1 0 POINT (850100 1787353)
## 7 0 0 POINT (800058.8 1780978)
## 8 0 0 POINT (805058.9 1783687)
## 9 1 0 POINT (809524.6 1786051)
## 10 0 2 POINT (814934.7 1788977)
R has a number of options
for directly importing remotely sensed data into the console. Here, we
will import a NDVI data layer (a .tif file) that I downloaded from EarthExplorer. This layer is
located in your data directory and has already been re-projected to UTM
32N WGS84 and clipped to the study area. We will also use functions in
the terra
package to load a 90-m elevation dataset. Both
layers must exactly match each other (same CRS, extent, grid cell size)
in order to predict with the rasters.
# NDVI - Normalized Difference Vegetation Index (250-m)
# ********************************************
# ********************************************
# Note the file is already projected to UTM32N WGS84
ndvi <- rast("data/MOD13Q1_Nov2017.tif")
ndvi <- ndvi * 0.0001 # Convert to decimal values (This is done to reduce the file size)
ndvi
## class : SpatRaster
## dimensions : 421, 376, 1 (nrow, ncol, nlyr)
## resolution : 250, 250 (x, y)
## extent : 760705.1, 854705.1, 1771859, 1877109 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632)
## source(s) : memory
## varname : MOD13Q1_Nov2017
## name : MOD13Q1_Nov2017
## min value : 0.0324
## max value : 0.1276
# SRTM - Elevation data from the Shuttle Radar Topography Mission (30-m)
# ********************************************
# ********************************************
srtm <- elevation_3s(lon=12, lat=16.5, path = tempdir()) # Inputting general coordinates to grab tile
srtm
## class : SpatRaster
## dimensions : 6000, 6000, 1 (nrow, ncol, nlyr)
## resolution : 0.0008333333, 0.0008333333 (x, y)
## extent : 10, 15, 15, 20 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs
## source : srtm_39_09.tif
## name : srtm_39_09
# Clip (crop) the raster to the study area extent. This will help with processing the image
# Read in a study area boundary (also in Lat/Long).
SA <- st_read("data/StudyArea.shp")
## Reading layer `StudyArea' from data source
## `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/StudyArea.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 11.44643 ymin: 16.0124 xmax: 12.31479 ymax: 16.95185
## Geodetic CRS: WGS 84
plot(srtm)
plot(SA, add=T)
# Crop the raster to the study area extent
# You could also manually set the extent, if inclined to do so (SA <- extent(11,13,16,17))
srtm <- crop(srtm,SA)
# Note: In many cases, you might need to download multiple tiles to cover your study area. To put them together, you'd need to use the mosaic function. Be careful, rasters can get large.
#srtm1 <- elevation_3s(lon = 12, lat = 16.5, path = tempdir())
#srtm2 <- elevation_3s(lon = 15, lat = 16.5, path = tempdir())
#srtm.mosaic <- mosaic(srtm1,srtm2, fun=mean) # Using mean for overlapping region
# Global Administrative Boundaries (GADM):
# Sometimes very useful to verify you are working where you think you are
Niger <- world(resolution=1, level=0, path = tempdir()) |> st_as_sf() |> filter(NAME_0 == "Niger")
Niger <- st_transform(Niger, crs = 4326)
# Plot together - Overlapping where I think they should!
plot(Niger[1])
plot(srtm, add=T)
You’ll note that the projection of the NDVI layer is different than the SRTM data layer. We need to project the SRTM data layer to UTM32N WGS, making sure the extents, cell sizes (essential for final prediction), and grids all match. We’ll use bilinear interpolation because the data are continuous. For a categorical image (e.g., land-cover classification), use nearest neighbor. We also now have a choice to make about the resolution we will use to conduct our analyses. What resolution should we use?
# Project the srtm image, setting the resolution
srtm.utm <- project(srtm, "epsg:32632", method = 'bilinear') # This can be a slow process, depending on the size of the raster
# Resample the srtm image to the ndvi image
# The native resolution of our analysis will then be 250m - This will help to speed up the processing, but means we are losing some of the fine-scale information contained in the 90m elevation model which might be important
srtm.250m <- resample(srtm.utm, ndvi, "bilinear")
# Print both rasters and make sure they have the same extent, rows/columns, projection, and resolution
ndvi
## class : SpatRaster
## dimensions : 421, 376, 1 (nrow, ncol, nlyr)
## resolution : 250, 250 (x, y)
## extent : 760705.1, 854705.1, 1771859, 1877109 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632)
## source(s) : memory
## varname : MOD13Q1_Nov2017
## name : MOD13Q1_Nov2017
## min value : 0.0324
## max value : 0.1276
srtm.250m
## class : SpatRaster
## dimensions : 421, 376, 1 (nrow, ncol, nlyr)
## resolution : 250, 250 (x, y)
## extent : 760705.1, 854705.1, 1771859, 1877109 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632)
## source(s) : memory
## varname : MOD13Q1_Nov2017
## name : srtm_39_09
## min value : 372.0851
## max value : 586.0993
Questions:
Now that we have projected and processed the spatial data, let’s make a plot of all the data to make sure everything looks good. Data are located where I expect them to be. Amazing!
# Plot the result (order matters)
plot(srtm.250m)
plot(Addax,pch=15,cex=0.7,add=TRUE)
We assumed that addax would select inter-dunal depressions. Instead of including the raw SRTM (elevation) data in our models, we created a variable derived from the elevation dataset, termed surface roughness. This variable is the change in elevation range between neighboring pixels. That is, it is the difference between the minimum and maximum values of a cell and its eight surrounding neighbors. Pixels with similar neighborhood values have a low change in elevation range (e.g., flat areas), whereas highly variable neighborhoods have a large change in elevation range (e.g., steep slopes) (Wilson et al. 2007).
Terrain variables can be easily derived from digital elevation models. Variables that can be calculated include slope, aspect, Terrain Position Index (TPI), Terrain Ruggedness Index (TRI), and surface roughness. Some of these variables are likely to be highly correlated. Use help(terrain) for additional information. In this example, we will calculate surface roughness.
rough <- terrain(srtm.250m, v='roughness')
plot(rough)
plot(Addax,pch=15,cex=0.7,add=TRUE)
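As a quick aside (not part of the published analysis), several terrain variables can be computed in one call and their relationships inspected before choosing predictors:
# Compute several terrain variables from the resampled elevation layer
terr <- terrain(srtm.250m, v = c("slope", "TPI", "TRI", "roughness"))
pairs(terr) # pairwise scatterplots to spot highly correlated variables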
The extract command from the terra package is an extremely useful and powerful function for joining spatial objects. Here, we will extract surface roughness and NDVI at each of the occurrence locations. If we stack the layers together, we can extract values from all the rasters with a single command. See help(extract) for additional information about this function. Note that we tell R that we want the function from the terra package, since other loaded packages also provide an extract() function.
# Extract NDVI and rough
ndvi.point <- terra::extract(ndvi,Addax)
rough.point <- terra::extract(rough,Addax)
# There are some NA, so we will fill them with the mean of all other values. Note this is happening because the ndvi layer we are using here is not the original used in the actual analysis, which matched the Modis images to the time each transect was conducted.
ndvi.point$MOD13Q1_Nov2017 <- ifelse(is.na(ndvi.point$MOD13Q1_Nov2017), mean(ndvi.point$MOD13Q1_Nov2017, na.rm = T),ndvi.point$MOD13Q1_Nov2017)
rough.point$roughness <- ifelse(is.na(rough.point$roughness), mean(rough.point$roughness, na.rm = T),rough.point$roughness)
# Append to spatialobject
Addax <- data.frame(Addax, ndvi=ndvi.point$MOD13Q1_Nov2017,
rough=rough.point$roughness)
# Look at the data
head(Addax)
## Pt_ID Date YearMonth Id_1 X Y Addax Dorcas Cornul Stipa1
## 1 13 2008-12-06 200812 23 825358.1 1773472 0 3 0 1
## 2 14 2008-12-06 200812 23 830041.8 1776055 0 5 0 0
## 3 15 2008-12-06 200812 23 835354.5 1779007 0 7 0 0
## 4 17 2008-12-06 200812 23 840739.0 1781912 0 2 0 1
## 5 19 2008-12-06 200812 23 845644.1 1784587 0 1 0 0
## 6 21 2008-12-06 200812 23 850100.0 1787353 0 11 0 0
## Stipa2 Human geometry ndvi rough
## 1 0 0 POINT (825358.1 1773472) 0.0956 16.258545
## 2 0 0 POINT (830041.8 1776055) 0.0882 10.016754
## 3 0 1 POINT (835354.5 1779007) 0.0854 9.454865
## 4 0 0 POINT (840739 1781912) 0.0920 19.360474
## 5 0 1 POINT (845644.1 1784587) 0.0938 15.377411
## 6 1 0 POINT (850100 1787353) 0.0878 20.611420
Note: We have simplified things a bit here because we have a time series of points and therefore, we must download and extract a time series of NDVI. Each of the dates need to be aligned with their corresponding NDVI image. In the next exercise, the dataset provided will already have this step completed so that we can focus on the modeling.
Now that we have extracted the data, we will now walk through the steps of fitting a logistic regression model to model the probability of addax occurrence.
We now have the occurrence dataset with all associated variables.
Animal counts (addax, but also dorcas gazelle) were summarized at plot locations. We re-coded all counts within a 2.5-km radius to a measure of occurrence (i.e., presence/absence). Thus, we modeled the data as a series of 1's and 0's, representing addax occurrence at plot locations. Data were aggregated in this fashion because of variability between surveys (i.e., the transect locations didn't overlap exactly) and because we did not have confidence in the accuracy of the number of individuals recorded at each sighting. In addition, distances to animal sighting locations were only recorded in a subset of the surveys. Sightings >500 m from the transect were removed due to an assumed undercounting bias (confirmed by investigating the frequency of sightings in relation to distance). This allowed for a conservative, broad-scale approach to incorporating extremely messy field data collected over multiple years. See more details in Stabach et al. 2017.
In this section of the analysis we will:
As done previously, we remove objects in R’s memory and load the necessary
packages to perform the analyses. Please use the help()
for
any operations that you don’t understand or that require additional
information. Knowing what to search for in Google is essential. I usually start
any query with “r-cran” (e.g., “r-cran Import shapefile”) to limit the
search terms to R.
Note: There are quite a few additional packages to
install (‘install.packages(“package name”)’) in this exercise.
# Load required libraries
library(lubridate)
library(pROC)
library(tmap)
library(visreg)
library(coefplot)
library(usdm)
As a first step, we will summarize the occurrence dataset and visualize some of the patterns.
Columns in the Addax dataframe include the date of survey, the X and Y location of each plot, the number of addax and dorcas gazelle sighted at each location, a unique plot ID (Pt_ID), and the presence/absence of the vegetation species Cornulaca monocantha (Cornul), Stipagrostis acutiflora (Stipa1), and Stipagrostis vulnerans (Stipa2). These vegetation species were thought a priori to influence addax occurrence and were recorded at each plot location. Human disturbance (e.g., footprints, sightings, tire tracks) was also recorded (i.e., Human = 1). You'll also note that the remote sensing data that we extracted, surface roughness (Rough) and the Normalized Difference Vegetation Index (NDVI), are also included.
We need to add a few things to the dataframe, most importantly re-coding the abundance records to a measure of occurrence. This was done because of a lack of confidence in the exact counts collected. Here, we create Month, Year, and Season fields, correct some of the field types, and re-code the occurrence.
# Create month, year and season fields
Occ <- Addax %>% st_drop_geometry() %>% mutate(
Month = month(Date),
Year = year(Date),
Season = case_when(
Month >=3 & Month <= 6 ~ "Dry",
Month >=7 & Month <=10 ~ "Wet",
Month >=11 & Month <=12 ~ "Cold",
TRUE ~ "Fix Problem"
)
)
# Case_when is the same as:
#Occ$Season <- ifelse(Occ$Month >=3 & Occ$Month <=6, "Dry",
# ifelse(Occ$Month >=7 & Occ$Month <=10, "Wet",
# ifelse(Occ$Month >=11 & Occ$Month <=12, "Cold","Fix Problem")
# ))
# We could easily include the following in the piping above.
# Recode Occurrence
Occ$obsAddax <- ifelse(Occ$Addax > 0, 1, 0)
Occ$obsDorcas <- ifelse(Occ$Dorcas > 0, 1, 0)
# Add some summaries
Occ$SumBoth <- Occ$obsAddax + Occ$obsDorcas
Occ$BothPresent <- ifelse(Occ$SumBoth > 1, 1, 0) # Both species present/occurred
Occ$OnePresent <- ifelse(Occ$SumBoth > 0, 1, 0) # At least one species present/occurred
# Correct the data types
# Individually, this would be:
Occ$Cornul <- as.factor(Occ$Cornul)
Occ$Stipa1 <- as.factor(Occ$Stipa1)
Occ$Stipa2 <- as.factor(Occ$Stipa2)
Occ$Season <- as.factor(Occ$Season)
Occ$Year <- as.factor(Occ$Year)
# More succinct
# cols <- c("Cornul", "Stipa1", "Stipa2", "Season", "Year")
# Occ[cols] <- lapply(Occ[cols], as.factor)
# str(Occ)
When we summarize the dataset, we see that the data were collected multiple times a year and that the occurrence of addax (and dorcas gazelle) vary between years and seasons.
# Summarize
Occ.Summary <- Occ %>%
group_by(Year, Month, Season) %>%
summarize(PresAddax = sum(obsAddax),
PresDorc = sum(obsDorcas),
PrevAdd= round(sum(obsAddax)/length(obsAddax)*100,digits=1),
PrevDorc = round(sum(obsDorcas)/length(obsAddax)*100,digits=1),
One = round(sum(OnePresent)/length(obsAddax)*100,digits=1) ,
Both = round(sum(BothPresent)/length(obsAddax)*100,digits=1)
)
# Look at the result
Occ.Summary <- as.data.frame(Occ.Summary)
Occ.Summary
## Year Month Season PresAddax PresDorc PrevAdd PrevDorc One Both
## 1 2008 12 Cold 21 14 43.8 29.2 62.5 10.4
## 2 2009 6 Dry 6 4 12.5 8.3 14.6 6.2
## 3 2009 9 Wet 8 11 16.7 22.9 37.5 2.1
## 4 2009 12 Cold 16 23 33.3 47.9 66.7 14.6
## 5 2010 4 Dry 4 11 8.3 22.9 25.0 6.2
## 6 2010 6 Dry 6 10 12.5 20.8 33.3 0.0
## 7 2010 10 Wet 3 4 6.2 8.3 14.6 0.0
## 8 2010 12 Cold 6 14 12.0 28.0 36.0 4.0
## 9 2011 3 Dry 6 4 25.0 16.7 25.0 16.7
## 10 2011 7 Wet 6 6 20.0 20.0 26.7 13.3
## 11 2012 12 Cold 6 8 15.0 20.0 27.5 7.5
## 12 2013 11 Cold 16 31 33.3 64.6 75.0 22.9
## 13 2014 3 Dry 3 8 7.5 20.0 22.5 5.0
## 14 2014 9 Wet 4 5 10.0 12.5 22.5 0.0
## 15 2014 12 Cold 1 31 1.9 58.5 58.5 1.9
# If you want, write to a file
#write.csv(Occ.Summary, file="Addax_Dorcas_Prevalence.csv", quote = FALSE, row.names = FALSE)
Let’s look at the data one last time before moving on to the modelling.
# Plot
plot(Occ$X,Occ$Y,xlab = "Easting", ylab = "Northing", main = "Plot Locations", frame = FALSE, pch = ".", col="red", cex = 5, asp=1)
Question:
It is often helpful and necessary to scale (\(z_i = (x_i - \bar{x}) / sd(x)\)) continuous predictor variables that have vastly different value ranges (e.g., elevation and NDVI). See the scale function. Doing so can help with model convergence and makes coefficients comparable. While the relationships between your dependent variable and each independent variable will remain the same, it is important to remember that the data are no longer on the same scale as the original values, so the same transformation must be applied (or the values back-transformed) when making raster predictions. This is critical.
# Scale the continuous variables that we'll include in the modelling
Occ <- Occ %>% mutate(
sHuman = as.numeric(scale(Human)), # This is a bit ugly, but if you don't specify as "as.numeric", the variables can be difficult to plot
sndvi = as.numeric(scale(ndvi)),
srough = as.numeric(scale(rough)),
sDorcas = as.numeric(scale(Dorcas)),
sAddax = as.numeric(scale(Addax))
)
# The default scale function simply does the following:
#CenterScale <- function(x){
# (x-mean(x))/sd(x)
#}
#sndvi2 <- CenterScale(Occ$ndvi)
# These two scaled ndvi vectors should be exactly the same:
#cor(Occ$sndvi,sndvi2)
#summary(Occ$sndvi)
#summary(sndvi2)
# Let's also scale our raster layers by the mean and standard deviation at each location so we don't forget to later when making predictions from our model
s.ndvi <- (ndvi - mean(Occ$ndvi)) / sd(Occ$ndvi)
s.rough <- (rough - mean(Occ$rough)) / sd(Occ$rough)
par(mfrow=c(2,2))
# You should notice that the rasters look exactly the same, but their values have changed. They are now on a much more similar scale.
plot(ndvi, main = "Non-Scaled NDVI");plot(s.ndvi, main = "Scaled NDVI");plot(rough, main = "Non-Scaled Rough");plot(s.rough, main = "Scaled Rough")
par(mfrow=c(1,1)) # Returning plotting window to normal
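Because scaling only subtracts the mean and divides by the standard deviation, nothing is lost; you can always recover the original units. A quick check using the objects created above:
# Back-transform the scaled NDVI and compare to the original values
ndvi.check <- Occ$sndvi * sd(Occ$ndvi) + mean(Occ$ndvi)
summary(ndvi.check - Occ$ndvi) # differences should be essentially zero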
Like any other analysis, we need to evaluate redundancy between our predictor variables to make sure they are sufficiently independent. Otherwise, we may be unable to interpret our model coefficients reliably. We will conduct a Variance Inflation Factor (VIF) analysis, currently one of the preferred methods for assessing collinearity. Any variable with a VIF > 3 is a potential cause for concern. Are any of the continuous variables included in our analysis highly collinear?
Question:
# Assess the continuous variables we'll include in our analysis
vifstep(Occ[,24:28]) # Including scaled Human Presence, NDVI, roughness, and Dorcas/Addax
## No variable from the 5 input variables has collinearity problem.
##
## The linear correlation coefficients ranges between:
## min correlation ( sAddax ~ sDorcas ): 0.01051778
## max correlation ( srough ~ sndvi ): -0.2823358
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 sHuman 1.010645
## 2 sndvi 1.120611
## 3 srough 1.097381
## 4 sDorcas 1.044630
## 5 sAddax 1.010735
# You could also use the cor function to investigate correlation.
# What would be a reasonable correlation threshold? |0.65| is a common rule of thumb.
#cor(Occ[, c("sHuman", "sndvi", "srough", "sDorcas", "sAddax")])
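If you are curious what vifstep() is doing under the hood, the VIF for a variable is simply 1/(1 - R^2) from regressing that variable on all of the other predictors. A quick check for sndvi:
# VIF for sndvi by hand: regress it on the other scaled predictors
r2 <- summary(lm(sndvi ~ srough + sHuman + sDorcas + sAddax, data = Occ))$r.squared
1/(1 - r2) # should closely match the vifstep() value for sndvi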
Model the occurrence of addax in a generalized linear model (GLM) framework. We expected non-linear relationships with surface roughness and NDVI, so we include these terms as quadratic effects in the model. Our goal here was not necessarily to create the very best model. Instead, we aimed to:
Question:
# Create a full model with all the variables you think are important predictors of addax occurrence
glm.Addax <- glm(obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2) + sHuman + sDorcas + Stipa1 + Stipa2 + Cornul + Season + Year,
data = Occ,
family = binomial(link="logit"))
# Summarize result
summary(glm.Addax)
##
## Call:
## glm(formula = obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2) +
## sHuman + sDorcas + Stipa1 + Stipa2 + Cornul + Season + Year,
## family = binomial(link = "logit"), data = Occ)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.95886 0.42608 -2.250 0.024421 *
## srough 0.06296 0.11707 0.538 0.590720
## I(srough^2) 0.01480 0.09313 0.159 0.873701
## sndvi -0.19687 0.12832 -1.534 0.124958
## I(sndvi^2) 0.03531 0.08309 0.425 0.670829
## sHuman -0.61539 0.24360 -2.526 0.011529 *
## sDorcas 0.03529 0.11829 0.298 0.765467
## Stipa11 0.19281 0.28041 0.688 0.491705
## Stipa21 1.02921 0.28627 3.595 0.000324 ***
## Cornul1 0.29894 0.30959 0.966 0.334256
## SeasonDry -0.14792 0.36924 -0.401 0.688717
## SeasonWet -0.29818 0.37409 -0.797 0.425401
## Year2009 -1.09962 0.45669 -2.408 0.016049 *
## Year2010 -1.99582 0.49594 -4.024 5.71e-05 ***
## Year2011 -0.96937 0.59233 -1.637 0.101726
## Year2012 -2.35473 0.58799 -4.005 6.21e-05 ***
## Year2013 -0.74979 0.47051 -1.594 0.111034
## Year2014 -3.22724 0.55623 -5.802 6.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 601.51 on 660 degrees of freedom
## Residual deviance: 507.69 on 643 degrees of freedom
## AIC: 543.69
##
## Number of Fisher Scoring iterations: 6
# Nice summary table, including model estimates and odds ratios
#tab_model(glm.Addax)
# Or print the confidence intervals
#confint(glm.Addax)
Use the visreg and coefplot functions to easily graph the results from a glm and evaluate the model output. Be careful, however, with how categorical variables are interpreted. visreg does the hard work of generating a predicted response while holding all other variables constant.
# Plot the coefficients
coefplot(glm.Addax,
plot=TRUE,
mar=c(1,4,5.1,2),
intercept=FALSE,
vertical=TRUE,
main="",
var.las=1,
frame.plot=FALSE)
# Reset plotting
par(mar=c(5,4,4,2)+0.1)
# Graph result for all variables
#visreg(glm.Addax, scale="response", ylab="Prob", partial=TRUE)
# Or just one variable at a time
par(mfrow=c(1,2))
visreg(glm.Addax,"srough",
scale="response",
ylab="Probability of Occurrence",
xlab="Surface Roughness",
partial=TRUE,
line=list(col="blue"),
fill=list(col="gray"),
points=list(col="black",cex=0.25,pch=19),
ylim=c(0,1))
visreg(glm.Addax,"sndvi",
scale="response",
ylab="Probability of Occurrence",
xlab="NDVI",
partial=TRUE,
line=list(col="blue"),
fill=list(col="gray"),
points=list(col="black",cex=0.25,pch=19),
ylim=c(0,1))
par(mfrow=c(1,1))
Binomial data can be notoriously difficult to validate, at least when compared to the standard tools used for linear regression. Ramiro discussed a few techniques to assess model assumptions, including ways to assess a model’s predictive power. Here we will calculate the Area Under the Curve (AUC) of the receiver operating characteristic, which summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) across classification thresholds.
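To make sensitivity and specificity concrete, here is a minimal sketch (using the glm.Addax model fitted above) that classifies each plot at an arbitrary 0.5 cut-off and cross-tabulates predictions against observations; the AUC integrates this trade-off over all possible cut-offs.
# Classify fitted probabilities at a 0.5 threshold and cross-tabulate
pred.class <- ifelse(predict(glm.Addax, type = "response") > 0.5, 1, 0)
table(Observed = Occ$obsAddax, Predicted = pred.class)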
An even better option is to validate the model with an independent dataset (a simple hold-out version of this idea is sketched after the AUC code below).
Some guidelines for AUC:
# Evaluate deviance residuals
# No strong evidence of lack of fit. Most residuals are around a value of 0.
devresid <- resid(glm.Addax, type = "deviance")
hist(devresid)
# Calculate AUC
predpr <- predict(glm.Addax, type=c("response"))
(roccurve <- roc(Occ$obsAddax ~ predpr))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = Occ$obsAddax ~ predpr)
##
## Data: predpr in 549 controls (Occ$obsAddax 0) < 112 cases (Occ$obsAddax 1).
## Area under the curve: 0.77
plot(roccurve, main="AUC")
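As noted above, validating with independent data is preferable. We do not have a separate dataset here, but the workflow can be sketched with a simple random hold-out split (not truly independent, since the withheld plots come from the same surveys; a reduced set of continuous predictors is used to keep the sketch short):
# Hold out 30% of the plots, refit the model, and compute AUC on the withheld data
set.seed(42)
train.idx <- sample(nrow(Occ), size = round(0.7 * nrow(Occ)))
glm.train <- glm(obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2) + sHuman + sDorcas,
                 data = Occ[train.idx, ],
                 family = binomial(link = "logit"))
pred.test <- predict(glm.train, newdata = Occ[-train.idx, ], type = "response")
roc(Occ$obsAddax[-train.idx], pred.test) # AUC on data the model never saw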
One of the most valuable parts of a species distribution model is predicting to locations where surveys were not performed. In order to make a prediction at these locations, we need predictor data that has wall-to-wall coverage. Unfortunately, only two of our layers incorporated in the full model have full coverage (NDVI and Surface Roughness).
Create a model with these two layers, assess how it compares with the full model, and predict across the entire study area. As you will see, the model statistics indicate that this sub-model is not as good as the full model (compare the AIC and AUC values).
The model, however, can still be useful, as long as we are clear about its shortcomings (e.g., we’d expect the predictive power to be decreased since we are not including the fine-scale data collected at individual plot locations).
glm.Addax2 <- glm(obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2),
data = Occ,
family = binomial(link="logit"))
# Summarize and print confidence intervals
summary(glm.Addax2)
##
## Call:
## glm(formula = obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2),
## family = binomial(link = "logit"), data = Occ)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.63473 0.15623 -10.464 <2e-16 ***
## srough 0.07565 0.10696 0.707 0.479
## I(srough^2) 0.04945 0.08254 0.599 0.549
## sndvi -0.09295 0.11269 -0.825 0.409
## I(sndvi^2) -0.01304 0.07575 -0.172 0.863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 601.51 on 660 degrees of freedom
## Residual deviance: 599.00 on 656 degrees of freedom
## AIC: 609
##
## Number of Fisher Scoring iterations: 4
# View model output with odds ratios
#tab_model(glm.Addax2)
# Calculate AUC
predpr <- predict(glm.Addax2, type=c("response"))
(roccurve <- roc(Occ$obsAddax ~ predpr))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = Occ$obsAddax ~ predpr)
##
## Data: predpr in 549 controls (Occ$obsAddax 0) < 112 cases (Occ$obsAddax 1).
## Area under the curve: 0.5567
plot(roccurve) # Not great
Question:
We can then use the predict command to take our model coefficients and predict across the raster layers.
# We could physically calculate the prediction from the model coefficients:
#coef <- summary(glm.Addax2)
#coef <- coef$coefficients
#coef
#Addax.predict <- (exp(coef[1] + s.rough*coef[2] + s.rough^2*coef[3] + s.ndvi*coef[4] + s.ndvi^2*coef[5])/(1 + exp(coef[1] + s.rough*coef[2] + s.rough^2*coef[3] + s.ndvi*coef[4] + s.ndvi^2*coef[5])))
# Add to a stack of rasters and rename layer names.
satImage <- c(s.rough, s.ndvi)
names(satImage) <- c("srough", "sndvi")
# Predict and export image to directory
Addax.predict <- predict(satImage, glm.Addax2, type="response", progress='text')
# Plot result
plot(Addax.predict) # Not a great prediction, with mostly low values, but it highlights some important areas.
Lastly, we can take this one step further and plot our predicted surface on an interactive map, so that we have a better idea of its context in the real world and a convenient way to share our results (e.g., on relevant social media).
# Load ESRI imagery baselayer
tmap_mode("view")
tm_basemap("Esri.WorldImagery") +
tm_shape(Addax.predict, name = "Addax prediction") +
tm_raster(palette="-inferno", n=8, alpha=0.6,
title = "Predicted Addax Occurrence")