1 Foreword

Geographic Information Systems, satellite imagery, modeling, and mapping have become essential parts of the toolbox of many researchers, natural resource managers, and conservationists. Yet these resources are often not readily accessible to practitioners around the world. The most restrictive factor is limited access to specialized and expensive commercial software. In recent years, several powerful open-source tools for geospatial analysis have emerged. R is a widely used statistical programming language with considerable geospatial analysis capabilities. This training focuses on performing geospatial work in R.

1.1 Learning Outcomes

By the end of these tutorials, you will be able to:

  1. Understand the R coding language
  2. Understand how to troubleshoot R errors
  3. Perform database and spatial queries in R
  4. Create and import spatial data in R
  5. Know basic raster and vector analysis functions in R
  6. Project spatial data in R
  7. Use R to create maps and visualize data
  8. Extract environmental values for spatial points and regions
  9. Fit Generalized Linear Models and make predictions over space
  10. Learn to simulate data and the basics of Bayesian statistics

1.2 About this material

This content is the product of years of work and collaboration with many amazing people, some of whom provided parts of the code.

We thank Dr. Quingyu Huang, Dr. Grant Connette, and Dr. Joseph Kolowski for their input.

1.3 Required software

During the course, students will use R and RStudio. R is a programming language and free software environment for statistical computing widely used by researchers. RStudio is an integrated development environment for R that makes the software more user friendly. Both R and RStudio are compatible with the most common operating systems and are free to use. Before the course begins, you should have the software installed on your personal computer. Installation files can be accessed at:

-R: https://www.r-project.org

-RStudio Desktop: https://rstudio.com/products/rstudio/

1.4 Data access

All the data used in the tutorials are available at this link.

2 An intro to the R programming language

2.1 What is R and why use R?

R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.

Among the reasons to use R are:

  1. It’s free – open source! If you are a teacher or a student, the benefits are obvious.
  2. It runs on a variety of platforms including Windows, Unix and MacOS.
  3. It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.
  4. It contains advanced statistical routines not yet available in other software.
  5. New add-on “packages” are being created and updated constantly.
  6. It has state-of-the-art graphics capabilities.

R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.

Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.

2.2 R or RStudio? How to get them?

R is available for Linux, MacOS X, and Windows (95 or later) platforms. The software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work in a script file, where you write and save your work, and in other windows that show up on demand, such as the plot tab (Fig. 1).

R console, script and plot tabs.

RStudio is an enterprise-ready professional software tool that integrates with R. It has some nice features beyond the normal R interface, and many users find it easier to use than R alone (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.

RStudio software.

2.3 Getting Help

One of the most useful commands in R is ?. At the command prompt (signified by > in your Console window), type ? followed by any command and you will be prompted with a help tab for that command (e.g., ?mean; Fig. 3). You can also search the help tab directly by typing function names in its search bar.

Getting help on R.

The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.

2.4 Basic R concepts

There are a few concepts that are important to keep in mind before you start coding. The fact that R is a programming language may deter some users who think “I can’t program”. This should not be the case for two reasons. First, R is an interpreted language, not a compiled one, meaning that all commands typed on the keyboard are directly executed without requiring you to build a complete program like in most computer languages (C, Pascal, . . . ). Second, R’s syntax is very simple and intuitive. For instance, a linear regression can be done with the command lm(y ~ x) which means fitting a linear model with y as the response and x as a predictor.
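For instance, a minimal sketch with two short vectors invented purely for illustration:

x <- c(1, 2, 3, 4, 5)           # predictor
y <- c(2.1, 3.9, 6.2, 8.1, 9.8) # response
lm(y ~ x)                       # fit a linear model of y as a function of x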

In R, in order to be executed, a function always needs to be written with parentheses, even if there is nothing within them (e.g., ls()). If you type the name of a function without parentheses, R will display the content of the function.
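For example, with the built-in function ls():

ls()  # with parentheses: runs the function and lists the objects in your workspace
ls    # without parentheses: prints the source code of the function itself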

When R is running, variables, data, functions, results, etc., are stored in the active memory of the computer in the form of objects to which you assign a name. The user can perform actions on these objects with operators (arithmetic, logical, comparison, etc.) and functions (which are themselves objects).

The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).

When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.
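For instance, a path copied from Windows as C:\Documents\R_Practice\data\myfile.csv (a hypothetical file, used only for illustration) would need to be rewritten with forward slashes or with escaped backslashes:

dat <- read.csv("C:/Documents/R_Practice/data/myfile.csv")       # forward slashes work
# dat <- read.csv("C:\\Documents\\R_Practice\\data\\myfile.csv") # escaped backslashes also work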

It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).
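A quick illustration:

x <- 5
X <- 10
x == X # x and X are two distinct objects
## [1] FALSE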


2.5 Starting with R

2.5.1 Setting your working directory

Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:

getwd()

Use setwd() to change or set a new working directory. For instance, you can set your working directory to be in your Documents folder on the C: drive, or in any folder you prefer.

setwd("C:/Documents/R_Practice")

In RStudio, you can do some basic operations, such as setting your working directory, using the point-and-click interface: Session > Set Working Directory > Choose Directory … as shown in Fig. 4.

Setting working directory in RStudio

You can also work with R projects. R projects have the advantage that they enhance organization, collaboration, and reproducibility in R-based work by providing a structured and isolated workspace for your projects. They contribute to a more efficient and transparent workflow, particularly when working on multiple projects or collaborating with others.

The main advantage is that R Projects create a dedicated workspace for your R-related work, keeping all files, scripts, data, and outputs within a well-defined directory. This isolation helps avoid conflicts between projects and ensures a clean environment for each project. R projects use relative paths, allowing you to refer to files and directories within the project without specifying the full path. This enhances portability and makes it easier to share your project with others.

To start a new R Project, click File > New Project or directly click on the R Project icon and follow the directions. You can set up your new project in a new folder or an existing folder.

RStudio software.

2.6 R Fundamentals

2.6.1 Data Types

There are four fundamental data types in R that you will work with:

  1. Character
  2. Numeric
  3. Integer
  4. Logical

You can check the data type of an object using the function class(). To convert between data types you can use: as.integer(), as.numeric(), as.logical(), as.character().

For instance:

city <- 'Nairobi'
class(city)
## [1] "character"
number <- 3
class(number)
## [1] "numeric"
Integer <- as.integer(number)
class(Integer)
## [1] "integer"
logical <- 3 > 5
logical
## [1] FALSE

2.6.2 Assigning data to objects

Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that object names are case sensitive; ‘x’ is not the same as ‘X’!

city <- "front royal"
summary(city)

number <- 2
summary(number)

character <- as.character(2)

Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:

table <- read.csv(file="data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder

Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.

2.6.3 Special characters

The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.

#This is a comment

The semicolon (;) acts as a command separator, allowing you to write several commands on the same line of code.

a <- 3; b <- 6; c <- a+b
a
## [1] 3
b
## [1] 6
c
## [1] 9

2.7 R Data Structure

2.7.1 Vectors

Vectors are a basic data structure in R. They contain a sequence of values, which can be characters, numbers, or TRUE/FALSE values. Remember: If you are unsure or need help, use the help function (e.g., help(seq) or ?seq). Below are several ways to create vectors in R.

1:20
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
rep(1:5,5)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep("A rolling stone gathers no moss",4)
## [1] "A rolling stone gathers no moss" "A rolling stone gathers no moss"
## [3] "A rolling stone gathers no moss" "A rolling stone gathers no moss"

2.7.1.1 Extract subset of values from a vector using [] notation

x <- 1:10
y <- c(1.5, 9, 3.46, 12.2)

To see only part (i.e., a subset) of the data stored in a vector, you need to “ask” R to extract the information you want using square brackets. Most commonly, you will indicate in square brackets the position of the data you want to extract (from beginning of the vector [1] to the Nth slot in the vector [n]). If you only wanted to see the first 5 values of ‘x’, how would you do that? Or only the 2nd and 4th element of y? What if you wanted to see all the records in y, except for the 2nd and 3rd records? There are more ways to use notation to select subsets of data, which we will cover in more detail below.

x
##  [1]  1  2  3  4  5  6  7  8  9 10
(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
x[1:5]
## [1] 1 2 3 4 5
y[c(2,4)]
## [1]  9.0 12.2
y[-c(2,3)]
## [1]  1.5 12.2

2.7.2 Matrices and Dataframes

Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.

Create a matrix with 4 rows and 5 columns using the data from x above. Consult the help (e.g., help(matrix) or ?matrix) to determine the syntax required.

test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9    3    7
## [2,]    2    6   10    4    8
## [3,]    3    7    1    5    9
## [4,]    4    8    2    6   10
# Note, I can assign any name to an object that I create.  Generally it is best to name things in a way that is meaningful, but we'll have some fun here!
superman <- matrix(data = x, nrow = 4, ncol = 5)

2.7.2.1 Subset of Matrices and Dataframes

Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:

#test_matrix(row,column)
test_matrix[1,5]
## [1] 7

In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!

For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.

test_matrix[,5]
## [1]  7  8  9 10

What about the 4th row?

test_matrix[4,]
## [1]  4  8  2  6 10

What if we wanted to view the values in the 3rd row, but only the 2nd and 4th columns?

test_matrix[3,c(2,4)]
## [1] 7 5
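The other notations mentioned above work the same way. Here is a short sketch using the same test_matrix (outputs omitted):

test_matrix[-1, ]                                 # negative integer: all rows except the 1st
test_matrix[, c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # logical: keep only columns 1, 3, and 5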

What happens to the matrix if we append a character field? Use the cbind() (column bind) command to bind a new column, called ‘countries’. Note that I am not changing the contents of test_matrix. Can you figure out how to do a row bind (hint: use rbind())?

countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
##                            countries      
## [1,] "1" "5" "9"  "3" "7"  "United States"
## [2,] "2" "6" "10" "4" "8"  "Pakistan"     
## [3,] "3" "7" "1"  "5" "9"  "Ireland"      
## [4,] "4" "8" "2"  "6" "10" "China"
#Note that I am not changing/overwriting the contents of test_matrix.  I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)

Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?

test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
##   X1 X2 X3 X4 X5     countries
## 1  1  5  9  3  7 United States
## 2  2  6 10  4  8      Pakistan
## 3  3  7  1  5  9       Ireland
## 4  4  8  2  6 10         China
# Have I changed the file type?
class(test_dataframe)
## [1] "data.frame"

Can I rename the column headings?

names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
##   Val1 Val2 Val3 Val4 Val5     Countries
## 1    1    5    9    3    7 United States
## 2    2    6   10    4    8      Pakistan
## 3    3    7    1    5    9       Ireland
## 4    4    8    2    6   10         China
# Also see the colnames() function

Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?

test_dataframe[3,5]
## [1] 9
test_dataframe[,5]
## [1]  7  8  9 10
test_dataframe$Val5[3]
## [1] 9
test_dataframe$Val5
## [1]  7  8  9 10
test_dataframe[,"Val5"]
## [1]  7  8  9 10

You can also use some very simple commands to determine the size of dataframes or matrices.

nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6

You can delete individual objects to clear your workspace (rm(dataset)), or start every script with the following command to make sure you are starting fresh (this is good programming practice):

#rm(list=ls())

2.8 Functions

An R function is a collection of statements structured together to carry out a specific task. Functions take input arguments (some of them optional) and return a value. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions within base packages or other downloadable packages.

Most functions have optional arguments that are given default values (in the function’s help document, under the ‘Usage’ section, the optional arguments are given a default value following the “=” symbol). When you don’t specify the optional arguments, they take the default values. Functions are normally called using the following format: function_name(input_data, argument1, argument2, ...).
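As an illustration, here is a minimal sketch of a custom function; the name add_five and its single argument are invented purely for this example:

add_five <- function(x) {
  x + 5 # the value of the last evaluated expression is returned
}
add_five(10)
## [1] 15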

print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
y <- matrix(1:5)
y
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [5,]    5
df.example <- cbind(x, y)
df.example
##      [,1] [,2] [,3]
## [1,]    1    6    1
## [2,]    2    7    2
## [3,]    3    8    3
## [4,]    4    9    4
## [5,]    5   10    5

?function_name can load the function help file. Also note that any functions in non-base packages will require installing and loading that package.

Here, for example, we install and load package named “ggplot2” that we will use for data visualization.

install.packages("ggplot2")
library(ggplot2)

2.8.1 Pre-existing Functions

R also contains many pre-existing functions in the base software. Numeric functions include sum(), mean(), sd(), min(), max(), median(), range(), quantile(), or summary(). Try a few of these on the numeric vectors you have created.

sum(x)
## [1] 55
summary(x)
##        V1          V2    
##  Min.   :1   Min.   : 6  
##  1st Qu.:2   1st Qu.: 7  
##  Median :3   Median : 8  
##  Mean   :3   Mean   : 8  
##  3rd Qu.:4   3rd Qu.: 9  
##  Max.   :5   Max.   :10
range(y)
## [1] 1 5

2.8.2 Calculations & Arithmetic Operators

R can be used to perform basic calculations and report the results back to the user.

4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260

Exponentiation: ^

2^3
## [1] 8

Max and Min: max(), min()

vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers) 
## [1] 10
min(vector_numbers)
## [1] 2

Can you calculate the square root and then subtract 5 for each element in vector_numbers?

2.8.3 Relational Operators

<, >, ==, !=, >=, and <= evaluate a conditional expression and return TRUE or FALSE.

3 > max(c(2,3,4,5))
## [1] FALSE
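A few more quick examples:

5 == 5  # equality test
## [1] TRUE
5 != 4  # inequality test
## [1] TRUE
2 >= 3
## [1] FALSE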

2.9 Loops

The for loop is used to iterate over a sequence (numeric vector, list, or other iterable object). Loops are very important for performing repeated operations and are essential building blocks of many advanced models.

Here is a simple for loop that prints the numbers 1 to 5. As you can see, i is the element over which the loop iterates. It takes values from 1 to 5, and the loop ends when it reaches the last element of the sequence.

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

But a loop can iterate not just over numbers; it can also iterate over the elements of a character vector or list.

my_list <- c("apple", "orange", "banana")
for (fruit in my_list) {
  print(fruit)
}
## [1] "apple"
## [1] "orange"
## [1] "banana"

We can create a vector, add 5 to each of its elements, and save the result in a new vector. For this, we first need to create an empty vector where we will store the new numbers. Note how we use [] to access the element at position i.

vector <- c(2,3,4,5,2) #Data vector
newdata <- NULL #Vector to store output
for (i in 1:length(vector)){
  newdata[i] <- vector[i] + 5
}
newdata
## [1]  7  8  9 10  7

Finally, we can nest loops: a loop inside a loop. Make sure you understand matrix dimensions before trying to understand the nested loop.

vector <- c(1,2,3,4,5) #Data vector
newdata <- matrix(NA, 5, 5) # Matrix to store output
for (i in 1:5) {
  for (j in 1:5) {
    newdata[i,j] <- vector[i] * vector[j]
  }
}
newdata
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    2    4    6    8   10
## [3,]    3    6    9   12   15
## [4,]    4    8   12   16   20
## [5,]    5   10   15   20   25

2.10 Installing SWIRL

A good way to learn more about R is to use SWIRL. This is a user-generated program (also called a package or library) for learning how to code in R. To access this, you must first install the package so it is accessible. In the Console window (bottom left), type the following and press ENTER:

install.packages("swirl")

# Data can also be installed locally if you've downloaded the zip file to your hard drive
#install.packages("yourzip.tar.gz", repos=NULL)

This may take a little while, but when the stop sign in the upper right of the console window is gone, you can proceed. Any package you install in R also needs to be loaded before you can use it. You can do this with the require() or library() functions. Type this now:

library(swirl)

Note: You may be prompted to select a “mirror” from which to download the package. If this is the case, it is recommended that you choose the mirror that is geographically closest to you.

To install the lesson, you will need to use:

install_from_swirl("R Programming")

Find out more about other courses, and other download options here: https://github.com/swirldev/swirl_courses

2.10.1 SWIRL Lessons

There are many lessons within SWIRL. Once SWIRL is loaded (below), you will be given the option of which lessons to complete. Some of the core lessons can be found in the initial section labeled R Programming (estimated time to complete the required lessons is about 2 hours). We recommend starting with the following lessons:

  1. Basic Building Blocks (10 min)
  2. Workspace and Files (15 min)
  3. Sequences of Numbers (5 min)
  4. Vectors (8 min)
  5. Missing Values (5 min)
  6. Subsetting Vectors (12 min)
  7. Matrices and Data Frames (13 min)
  8. Logic (optional)
  9. Functions (30 min)
  10. lapply and sapply (optional)
  11. vapply and tapply (optional)
  12. Looking at Data (5 min)
  13. Simulation (optional)
  14. Dates and Times (10 min) (optional)
  15. Base Graphics (10 Min)

2.10.2 Run SWIRL

Type the following to begin using SWIRL. Also, when restarting your session later, you’ll need to “turn on” SWIRL each time with either library(swirl) or require(swirl).

swirl()

Have fun!

3 Introduction to Data Analysis

Now that you’ve learned the basics of R programming, we’ll take things a step further. This chapter will walk you through a new set of analyses.

We’ll be working with the dataset “mtcars” which comes pre-loaded with R. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data and to reiterate some principles in statistical modelling. This chapter will help as we move towards our ultimate goal of conducting more advanced analyses on animal point data.

You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.


For this tutorial section I followed the book Data Analysis with R Statistical Software - A Guidebook for Scientists. This is, in my opinion, one of the best resources out there for an introduction to data analysis with R.


Whenever you start working with a new script, you should first set a working directory. If you are working within an RStudio project, the working directory will be set automatically by default. This directory will contain all the data for your analysis and will be where you save all the data outputs.

Remember that you can check the current working directory by typing:

getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/R-Tutorials"

Now, let’s change the working directory to a location of your choosing. Create a folder if you don’t have one already, then make sure your working directory is in that folder. If you already have a folder, just set the working directory to the folder you want to use.

setwd("C:/....")

3.1 Data Management and Data Manipulation

3.1.1 Exploring the Data

Let’s start investigating a data set to later fit a linear model.

Load the “mtcars” data set. This dataset comes with R. We will use this dataset because it is intuitive to think about and is a good example of how to tackle many other datasets.

data(mtcars)

View the first 10 lines of the data set.

head(mtcars, 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Assess the overall structure of the data set to get a sense of the number and type of variables included. When you work with your own data, you will be familiar with the data structure, but it is always good practice to examine your data before moving on to any model fitting. Make sure that the data type of each column of the data frame is correct and/or what you expect it to be.

Note, all columns/variables included in this sample dataset are numeric-type. You can confirm the data type of each column by typing is.numeric() next to the variable name (e.g., is.numeric(mtcars$mpg)).
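For instance:

is.numeric(mtcars$mpg)
## [1] TRUE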

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Now, summarize the data to provide a list of each variable with the mean, min, and max.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

summary() is a great function for a quick look at the data. But what if you want to save the mean, min, or max values of each variable? There is a family of functions in R that are great for applying a function to all columns or all rows of a matrix and that return the result as a vector or list of values. This is the apply family of functions.

The apply() function has two main arguments: MARGIN, which takes the value 1 or 2 to indicate whether you want to operate on rows (1) or columns (2), and FUN, which tells R what function to apply. For example, to obtain the mean of each variable in the mtcars dataset, we do:

apply(mtcars, 2, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Compare with the values reported by the summary function.

QUESTION: Can you calculate the min and max values for each variable?

Let’s have a quick look at all our variables together (since they are all numeric) by looking at scatter plots of each variable combination. Use the function “pairs” or “plot” on the data set.

plot(mtcars)

You should be able to see cases where there seems to be a strong relationship between two variables. “mpg” vs. “wt” is a good example of this. This is miles per gallon vs. the weight of the car, and this makes sense, a heavier car should consume more petrol per distance. We should transform miles to kilometers, because, what kind of measurement is miles? But in appreciation of all my American friends, we will use miles, just for this exercise. Is the slope here positive or negative?
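(If you ever did want metric units, the conversion is a one-liner, using about 1.609 km per mile and about 3.785 litres per US gallon; the object name kpl is arbitrary.)

kpl <- mtcars$mpg * 1.609344 / 3.785412 # kilometres per litre
head(kpl)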

We can plot these two variables against each other to examine the relationship closer.

plot(mtcars$mpg ~ mtcars$wt)

You could plot any of the variables in the data frame. Plotting the data is one of the simplest ways to look and explore the data.

In R you can customize your plots. We will dedicate a section to ggplot2. But for now, can you change the ‘x’ and ‘y’ variable names and add a title to this plot? Use the help file help(plot) or ?plot to determine the proper syntax, or simply Google “add title to plot in R”.

plot(mtcars$mpg ~ mtcars$wt, xlab="Weight", ylab="MPG", main="MPG vs Weight")

Calculate the correlation coefficient between the two variables, then perform a correlation test to see if they are significantly correlated.

cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
cor.test(mtcars$mpg, mtcars$wt)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

The p-value is very small and the correlation coefficient of -0.8676594 is very high. We also note that this value is negative, meaning that as the weight increases, the fuel efficiency decreases. This makes intuitive sense. We will talk about the difference between correlation and causation later.

Let’s practice some data management before we look into these variables in more detail.

Create a new data set called “my_mtcars”, excluding the variables “vs” and “am” using vector notation. Then look at your new data set to make sure it worked. We don’t necessarily need to remove these variables to continue the analysis. We are simply doing this so that you get more familiar with manipulating data frames.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
my_mtcars <- mtcars[, -c(8,9)] #Remove columns 8 and 9
head(my_mtcars, 10) 
##                    mpg cyl  disp  hp drat    wt  qsec gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30    4    4

Now, keeping the same name for your data set, exclude the “gear” and “carb” columns without using vector notation. Instead use the “subset” function. Check out the help (?subset) for this function to figure out how to exclude columns by name.

my_mtcars <- subset(my_mtcars, select = -c(gear, carb))

Note that the initial version of my_mtcars with 9 variables no longer exists, because the syntax saves the modified 7-variable data under the original name.

QUESTION: How could you do this without overwriting the original data?

You should now have a data set called my_mtcars that has 32 observations of 7 variables. Check this.

str(my_mtcars)
## 'data.frame':    32 obs. of  7 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
dim(my_mtcars)
## [1] 32  7

QUESTION: What does ## [1] 32 7 tell us? Hint: recall vector notation.

Another way of checking the number of rows and columns is using nrow() and ncol() functions.
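For example:

nrow(my_mtcars)
## [1] 32
ncol(my_mtcars)
## [1] 7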

The variable “cyl” represents the number of cylinders in the car engine. You’ll see this is classified as a numeric variable. However, we aren’t necessarily interested in this as an actual number. For us, the number of cylinders is more useful as a grouping mechanism, serving as a factor, or categorical variable. Let’s use the as.factor function to convert it, keeping the same variable name. Then check the class of the variable to confirm the conversion worked.

my_mtcars$cyl <- as.factor(my_mtcars$cyl)
class(my_mtcars$cyl)
## [1] "factor"

Creating a categorical factor variable will enable us to generate summary statistics and plot data by groups.

We can now use this factor variable to group different operations. tapply is a great function for grouped operations. Check the help with ?tapply and try to calculate the mean of “mpg” for each level of the “cyl” variable.

tapply(my_mtcars$mpg, my_mtcars$cyl, mean)
##        4        6        8 
## 26.66364 19.74286 15.10000

Now let’s create box plots of our two variables of interest (mpg, wt) to visualize their distribution. First, change your graphic parameter to show two graphs side by side.

Question: How do you think you’d specify the title of each box plot?

par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg, main = "mpg")
boxplot(my_mtcars$wt, main = "weight")  

You can see that two points are potential outliers in the plot for wt. The plot gives you a tool to decide whether to remove those data points. Here we will keep them.

Since cyl is a categorical variable, we can also visualize the distribution of mpg and wt across different cylinder classes.

par(mfrow=c(1, 2))
boxplot(my_mtcars$mpg ~ my_mtcars$cyl, main = "mpg")
boxplot(my_mtcars$wt ~ my_mtcars$cyl, main = "weight") 

QUESTION: How do you modify the label for x axis and y axis?

Before we move forward, let’s exclude these two observations by using a logical expression that removes the points (rows in the dataframe) where weight is greater than 5.3 (wt is measured in units of 1,000 lbs). There are a few different ways to do this, as is the case with most things in R. Let’s use the subset function again.

my_mtcars <- subset(my_mtcars, wt <= 5.3)

You should now have a data set with 30 observations of 7 variables.

Note that removing data points from an analysis is not good statistical practice and, when done, it should be justified. Here, however, the goal is simply to practice different functions for working with data frames.

3.2 Data Table Manipulation with Dplyr

One of the most basic R skills is querying and manipulating data tables. Table manipulation is almost always required, regardless of what you decide to apply R to. For beginners, practicing and reinforcing table manipulation skills to meet different needs is a great way of improving your R skills. If you wish to become really good at R but don’t know where to start, start with tables!

The base R functions that come with the default R installation have the capacity for almost all the table manipulation you will need (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, their syntax is sometimes less user-friendly and intuitive than that of packages built specifically for table manipulation. So, here we introduce a few of the most useful table manipulation functions within the dplyr package. This is a package I use a lot.

Note that you will have to use the install.packages() and library() functions to download and load dplyr before using it.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Now, we will see how different functions of this package work.

3.2.1 select()

We can use select() to select column(s) that match a specific pattern:

select(mtcars, mpg) # select the column called mpg (miles per gallon)
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4
select(mtcars, -mpg) # select all columns in the data except mpg column
##                     cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4             6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag         6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive        6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout     8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant               6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360            8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D             4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230              4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280              6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C             6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE            8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL            8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC           8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood    8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial     8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128              4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic           4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla        4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona         4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger      8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin           8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28            8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird      8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9             4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2         4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa          4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L        8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino          6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora         8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E            4 121.0 109 4.11 2.780 18.60  1  1    4    2
select(mtcars, hp:vs) # select a continuous block of columns starting at hp and ending at vs
##                      hp drat    wt  qsec vs
## Mazda RX4           110 3.90 2.620 16.46  0
## Mazda RX4 Wag       110 3.90 2.875 17.02  0
## Datsun 710           93 3.85 2.320 18.61  1
## Hornet 4 Drive      110 3.08 3.215 19.44  1
## Hornet Sportabout   175 3.15 3.440 17.02  0
## Valiant             105 2.76 3.460 20.22  1
## Duster 360          245 3.21 3.570 15.84  0
## Merc 240D            62 3.69 3.190 20.00  1
## Merc 230             95 3.92 3.150 22.90  1
## Merc 280            123 3.92 3.440 18.30  1
## Merc 280C           123 3.92 3.440 18.90  1
## Merc 450SE          180 3.07 4.070 17.40  0
## Merc 450SL          180 3.07 3.730 17.60  0
## Merc 450SLC         180 3.07 3.780 18.00  0
## Cadillac Fleetwood  205 2.93 5.250 17.98  0
## Lincoln Continental 215 3.00 5.424 17.82  0
## Chrysler Imperial   230 3.23 5.345 17.42  0
## Fiat 128             66 4.08 2.200 19.47  1
## Honda Civic          52 4.93 1.615 18.52  1
## Toyota Corolla       65 4.22 1.835 19.90  1
## Toyota Corona        97 3.70 2.465 20.01  1
## Dodge Challenger    150 2.76 3.520 16.87  0
## AMC Javelin         150 3.15 3.435 17.30  0
## Camaro Z28          245 3.73 3.840 15.41  0
## Pontiac Firebird    175 3.08 3.845 17.05  0
## Fiat X1-9            66 4.08 1.935 18.90  1
## Porsche 914-2        91 4.43 2.140 16.70  0
## Lotus Europa        113 3.77 1.513 16.90  1
## Ford Pantera L      264 4.22 3.170 14.50  0
## Ferrari Dino        175 3.62 2.770 15.50  0
## Maserati Bora       335 3.54 3.570 14.60  0
## Volvo 142E          109 4.11 2.780 18.60  1
select(mtcars, starts_with("c")) # select all columns that start with "c" in their column names
##                     cyl carb
## Mazda RX4             6    4
## Mazda RX4 Wag         6    4
## Datsun 710            4    1
## Hornet 4 Drive        6    1
## Hornet Sportabout     8    2
## Valiant               6    1
## Duster 360            8    4
## Merc 240D             4    2
## Merc 230              4    2
## Merc 280              6    4
## Merc 280C             6    4
## Merc 450SE            8    3
## Merc 450SL            8    3
## Merc 450SLC           8    3
## Cadillac Fleetwood    8    4
## Lincoln Continental   8    4
## Chrysler Imperial     8    4
## Fiat 128              4    1
## Honda Civic           4    2
## Toyota Corolla        4    1
## Toyota Corona         4    1
## Dodge Challenger      8    2
## AMC Javelin           8    2
## Camaro Z28            8    4
## Pontiac Firebird      8    2
## Fiat X1-9             4    1
## Porsche 914-2         4    2
## Lotus Europa          4    2
## Ford Pantera L        8    4
## Ferrari Dino          6    6
## Maserati Bora         8    8
## Volvo 142E            4    2
  • The starts_with() helper is very convenient because it allows you to select multiple columns that start with the same text. A few similar helpers are available to define other naming patterns:
  • ends_with(): Select columns that end with a character string.
  • contains(): Select columns that contain a character string.
  • matches(): Select columns that match a regular expression.
  • one_of(): Select column names that are from a group of names.
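For example, two of these helpers in action (outputs omitted):

select(mtcars, ends_with("p"))  # columns whose names end in "p": disp and hp
select(mtcars, contains("ar"))  # columns whose names contain "ar": gear and carb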

Question: How do you select or exclude two columns: mpg and cyl?

3.2.2 filter()

Filter/select row(s) of data based on specific requirements on column values:

filter(mtcars, cyl %in% c(4,6)) # select cars with 4 or 6 cylinders
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
filter (mtcars, mpg>20) # select rows that have mpg > 20
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
filter(mtcars, mpg>20 & cyl == 6) # select rows that have mpg>20 AND cyl == 6
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
filter(mtcars, mpg>20 | hp > 100) # select rows that have mpg>20 OR hp > 100
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Question 1: How do you select rows with 6 and 8 cylinders knowing that cyl is a factor?

3.2.3 pipe operator

The pipe operator allows you to pipe the output from one function to the input of the next function. Instead of nesting functions (reading from the inside out), the idea of piping is to read the functions from left to right. It can also help you avoid creating and saving a lot of intermediate variables that you don’t need to keep. The older pipe operator is %>%; a newer, native version, |>, has since been introduced in base R.

# old operator
pipe_result<- mtcars %>%
  select(mpg, cyl) %>%
  head()
pipe_result
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6
# new operator
pipe_result<- mtcars |>
  select(mpg, cyl) |>
  head()
pipe_result
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

3.2.4 arrange()

This function arranges or re-orders rows based on their values; by default, the rows are arranged in ascending order.

order_data1<- mtcars %>% 
    arrange(mpg) 
order_data1
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
order_data2<- mtcars %>%
    arrange(cyl, mpg)
order_data2
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Now that we have learned the pipe operator, can you understand what order_data1 and order_data2 are producing? 

Question: Can you arrange the table first by wt and then by hp in descending order?

3.2.5 mutate()

The mutate() command creates new column(s) and defines their values. For instance, we can create a new column with random numbers between 0 and 100 drawn from a uniform distribution.

new_col<- mtcars %>%
    mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100)) # note the use of nrow() to generate exactly as many random numbers as there are rows in the dataframe.
new_col
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                       new_col
## Mazda RX4           13.320419
## Mazda RX4 Wag       91.380441
## Datsun 710          43.099530
## Hornet 4 Drive      59.621749
## Hornet Sportabout   23.499179
## Valiant             80.746549
## Duster 360          63.343340
## Merc 240D           18.063325
## Merc 230            29.064578
## Merc 280            85.146029
## Merc 280C           11.208987
## Merc 450SE          81.418487
## Merc 450SL          82.116567
## Merc 450SLC         26.008203
## Cadillac Fleetwood  66.913571
## Lincoln Continental 74.533827
## Chrysler Imperial   42.702340
## Fiat 128             1.639470
## Honda Civic         45.892397
## Toyota Corolla       9.612745
## Toyota Corona       13.092204
## Dodge Challenger    11.044016
## AMC Javelin         52.805285
## Camaro Z28          57.636350
## Pontiac Firebird    89.993556
## Fiat X1-9           12.588983
## Porsche 914-2        8.648430
## Lotus Europa        88.083560
## Ford Pantera L      99.760419
## Ferrari Dino        91.033441
## Maserati Bora       73.210326
## Volvo 142E          59.449843
new_col2 <- mtcars %>%
    mutate(new_col = runif(n = nrow(mtcars), min = 0, max = 100),  wt_kg = wt * 1000) # you can create multiple columns at once. Let's transform the weight from tons to kg
new_col2
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                       new_col wt_kg
## Mazda RX4           75.572603  2620
## Mazda RX4 Wag       86.507752  2875
## Datsun 710          65.434586  2320
## Hornet 4 Drive      29.154522  3215
## Hornet Sportabout   64.609343  3440
## Valiant             87.118349  3460
## Duster 360           1.399047  3570
## Merc 240D           13.651664  3190
## Merc 230            84.343792  3150
## Merc 280            79.836075  3440
## Merc 280C           65.493076  3440
## Merc 450SE          75.247570  4070
## Merc 450SL          66.274205  3730
## Merc 450SLC         28.878130  3780
## Cadillac Fleetwood  77.404107  5250
## Lincoln Continental 34.216939  5424
## Chrysler Imperial   29.698619  5345
## Fiat 128            34.958406  2200
## Honda Civic         60.662340  1615
## Toyota Corolla      45.379436  1835
## Toyota Corona       87.526815  2465
## Dodge Challenger    93.870674  3520
## AMC Javelin          1.647280  3435
## Camaro Z28          22.489337  3840
## Pontiac Firebird    27.950543  3845
## Fiat X1-9           67.181611  1935
## Porsche 914-2       61.389510  2140
## Lotus Europa        69.277950  1513
## Ford Pantera L      54.042772  3170
## Ferrari Dino        60.958549  2770
## Maserati Bora       34.495943  3570
## Volvo 142E          81.788555  2780

Can you create a new column called zero and give it a value of 0?

3.2.6 summarise()

This function calculates summary statistics across all rows, or across rows within groups; it is often used in combination with group_by().

sum_table <- mtcars %>% 
summarise(mean(wt))
sum_table
##   mean(wt)
## 1  3.21725
sum_table2 <- mtcars%>% 
summarise(avg_wt= mean(wt), min_wt= min(wt))
sum_table2
##    avg_wt min_wt
## 1 3.21725  1.513

3.2.7 group_by()

This is a great function. group_by() divides data rows into groups based on the grouping column(s) provided, and is often used in combination with other functions that define what to do with the rows once they are grouped. When group_by() and summarise() are used together, you are essentially telling R to separate rows into different groups and, for each group, use summarise() to generate a series of summary statistics that characterize the column values.

group_summary <- mtcars |>
  group_by(cyl) |>
  summarise(avg_wt= mean(wt), min_wt= min(wt))
group_summary
## # A tibble: 3 × 3
##     cyl avg_wt min_wt
##   <dbl>  <dbl>  <dbl>
## 1     4   2.29   1.51
## 2     6   3.12   2.62
## 3     8   4.00   3.17

You can also create groups based on the combination of two or more columns.

group_summary2<- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_wt= mean(wt), min_wt= min(wt))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
group_summary2
## # A tibble: 8 × 4
## # Groups:   cyl [3]
##     cyl  gear avg_wt min_wt
##   <dbl> <dbl>  <dbl>  <dbl>
## 1     4     3   2.46   2.46
## 2     4     4   2.38   1.62
## 3     4     5   1.83   1.51
## 4     6     3   3.34   3.22
## 5     6     4   3.09   2.62
## 6     6     5   2.77   2.77
## 7     8     3   4.10   3.44
## 8     8     5   3.37   3.17

3.2.8 Joining tables

Let’s read two other tables that are available in the package ‘dplyr’

data(band_members)
band_members
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
data(band_instruments)
band_instruments
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
  • left_join()

Returns all records from the left table, and the matched records (matched by shared ID columns) from the right table. If there is no match, the columns from the right table are filled with NA.

left_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
  • full_join()

Returns all records from both tables; unmatched rows are filled with NA on either side.

full_join(band_members, band_instruments, by = c("name"= "name"))
## # A tibble: 4 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Other than left_join() and full_join(), there are a few other join functions that join tables differently. Sometimes you will find them useful for a specific table you want to make, but in general, you can rely on left_join() for most of the job.

  • right_join(): similar to left_join(), but it keeps all records from the right table instead.

  • inner_join(): returns only the matched records from both tables.
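As a quick sketch of how they behave with the two band tables above (output omitted):

right_join(band_members, band_instruments, by = "name")  # keeps Keith, drops Mick
inner_join(band_members, band_instruments, by = "name")  # keeps only John and Paul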

3.2.9 Conclusion on data management

This has been a glimpse of what can be done in R to work with tabular data. There are plenty of other packages that, in time, you will learn by Googling and by learning from other people, but for now, the functions we covered are a very good set of tools for most of what you will need.

3.3 Statistical Modelling

Let’s move on into data analysis.

3.3.1 General Principles

When building a statistical model, you need to specify a dependent variable (what you are trying to explain) and one or more independent variables (the variables that you are using to explain the dependent variable).

Thus, to build a model you will first need to identify the dependent and independent variables which come from the research question. Once you have identified the variables, you need to decide what kind of model structure is appropriate for the data set at hand. For this reason, knowing models and model structures can help you design the data collection process as you have a good idea of what type of data you need for a specific model.

Running models can be easy, but before you can make interpretations from the model, you need to check that the assumptions of the model are valid (model validation), decide what is the best model structure (model selection) and then, finally, interpret the parameters of the model and/or make predictions.

The general structure for defining any model in R is:

model.type(dependent.variable ~ independent.variable1 + independent.variable2, data = dataframe)

Note that in the model equation you do not have to specify the intercept. We have assumed in this example that there are 2 different independent variables in the example provided (independent.variable1 and independent.variable2), but the model could contain any number of independent variables.

3.3.2 Regression vs Classification

The dependent variable can be either continuous or categorical. We refer to these separate parameterizations as a regression model or a classification model, respectively. Most of the models we demonstrate here are regression models, but the structure to build classification models in R is almost identical. In most cases, the data type of the dependent variable determines whether the model is a regression (non-categorical, numeric-type dependent variable) or a classification (categorical, factor-type dependent variable). You can use functions such as as.factor() and as.numeric() to convert data to a factor or a numeric variable. Note that just because a data set contains numbers does NOT necessarily mean the numbers are numeric-type; numbers can also be used as symbols to differentiate categories. It is always good practice to confirm the data types of the dependent and independent variables that you are inputting into the model.
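As a quick sketch (using mtcars, which we have already loaded), you can check and convert variable types like this:

class(mtcars$cyl)                                     # check the current data type of the cyl column
cyl_factor  <- as.factor(mtcars$cyl)                  # treat the numbers 4, 6, 8 as categories
cyl_numeric <- as.numeric(as.character(cyl_factor))   # convert the factor back to real numbers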

3.4 Linear Regression

We will start with a simple linear regression analysis.

QUESTION: What is the difference between correlation and regression?

3.4.1 Model design and model fit

Model design involves deciding what we are trying to explain (i.e., the dependent variable) and what we are going to use to explain it (i.e., the independent variables). These are questions that are informed by the researcher.

Then, we need to decide what kind of model structure is appropriate to the data set. In our case, we will start with a simple linear regression model. Later in the course, however, we will investigate more complex model structures.

Run a basic linear regression model, attempting to predict/explain the fuel efficiency of a car (mpg) based on its weight. Finally, assign the model output to "lm1". Remember to specify the my_mtcars data set in the model so that R understands that the names of the dependent and independent variables are columns in that specific dataframe.

lm1 <- lm(mpg ~ wt, data = my_mtcars)

3.4.2 Investigate your model

Once you have fit the model, you can use the summary function to investigate the beta coefficient, SE, and intercept values.

summary(lm1)
## 
## Call:
## lm(formula = mpg ~ wt, data = my_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1820 -2.4603 -0.4526  1.5046  6.3191 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.913      2.002  19.936  < 2e-16 ***
## wt            -6.287      0.630  -9.979 1.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.793 on 28 degrees of freedom
## Multiple R-squared:  0.7805, Adjusted R-squared:  0.7727 
## F-statistic: 99.59 on 1 and 28 DF,  p-value: 1.007e-10

3.4.3 Model validation - Checking the assumptions of your model

IMPORTANT: Before you can trust your model, you need to examine whether the assumptions on which the model is based are actually valid. If the assumptions are not satisfied, then your model will be unreliable, at least to some extent!

If you apply the function plot() to the model, R will provide a series of diagnostic plots that will help you inspect the model assumptions. Press Enter in the console to cycle through all the plots.

plot(lm1)
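A handy (optional) trick is to arrange the plotting window as a 2 x 2 grid first, so that all four diagnostic plots appear on one page; a quick sketch:

par(mfrow = c(2, 2))  # split the plotting window into 2 rows and 2 columns
plot(lm1)             # the four diagnostic plots now appear together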

  1. Assumption 1: The residuals are normally distributed

Check the Q-Q (quantile-quantile) plot. A perfect normal distribution is represented by a straight line. Your residuals should be close to a straight line.

  2. Assumption 2: The variances of the residuals are homogeneous (homoscedasticity)

The variability of the residuals should be uniform across the range of fitted values of the dependent variable. In the plots of residuals vs. fitted values and standardized residuals vs. fitted values, there should be a "scatter" of dots across the graph with no discernible pattern; points should be randomly distributed. If the points cluster on one side of the graph, your residual variances may NOT be homogeneous.

  3. Assumption 3: The independent variables are independent of each other (no collinearity)

There are different ways you can address this before you fit your model. For instance, you can estimate the correlation of each pair of covariates and discard variables or exclude them from the analysis if they are highly correlated (positively or negatively). We will examine this later in the course when dealing with more complex models.

  4. Assumption 4: The data set does not contain serial auto-correlation

Serial autocorrelation is when there is significant correlation between successive data points in the data set. Spatial data or time-series data tend to be autocorrelated. There are different ways to deal with autocorrelation, such as using mixed-effect models.

  5. Assumption 5: The model is not biased by unduly influential observations.

We can check this by looking at the plot of standardized residuals vs leverage and “Cook’s Distance.”

Leverage is a measure of the influence of individual data points on the model’s parameters, measured on the scale of 0-1, where high values indicate a strong influence on the model’s parameters.

Cook’s distance is the sum of squared distances between the fitted values using the whole data set, and the fitted values with the ith observation removed. A large difference indicates that the ith observation exerts a strong influence on the model’s parameters.

Ideally, no observation should have a Cook's distance greater than 1.
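If you prefer numbers over plots, you can extract these diagnostics directly with base R functions; a quick sketch:

cooks.distance(lm1)  # Cook's distance for each observation
hatvalues(lm1)       # leverage of each observation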


QUESTION: How does the QQ plot look? Does it indicate a potential problem?

If your standardized residuals are not normally distributed you can transform the dependent variable.

Let’s try a log transformation, which is commonly used:

lm1 <- lm(log(mpg) ~ wt, data = my_mtcars)
plot(lm1)

QUESTION: Does this improve the result?

The residuals are now randomly distributed and are homogeneous. We can now trust model results.


Continuing with interpreting the results, the slope value should be -0.3055548. This means that for each unit of weight added to a car (1 ton), the log of miles per gallon it achieves is predicted to DECREASE by 0.3055548. There is a negative effect of weight on fuel efficiency.

The intercept of 3.9259255 sets the start of the regression line at the weight of zero. In this case, this isn’t very useful (a car will not weigh zero tons) but it is a necessary element of describing a linear relationship. Here, the equation for the line is log(mpg) = 3.92593 -0.30555 (wt). Note that you can call the individual coefficients from a model directly using, in this example, “lm1$coefficients”.
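For example, you can extract the coefficients like this (a quick sketch):

lm1$coefficients   # the intercept and the slope
coef(lm1)[["wt"]]  # just the slope for weight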

Now, plot again a scatterplot of weight vs. mpg and draw the regression line, in blue. First, return your graphing parameter to its default setting of (1,1).

par(mfrow=c(1, 1))
plot(log(mpg) ~ wt, data = my_mtcars)
abline(lm1, col="blue")

Here are two other ways you can draw the predicted regression line:

abline(coef = lm1$coefficients, col="green")
abline(lm1$coefficients, col="red")

3.4.4 Model Selection

In any mathematical modeling approach, there may be other variables, or some combination of variables, that are most effective and efficient at predicting your response variable of interest. In this case, our response variable is mpg. Looking at your plot of all the two-way relationships between variables, are there any other variables that may help predict mpg? Horsepower (hp) seems to be potentially informative. The question now is, which model might be best at predicting mpg.

Let’s say we have three options: 1. mpg ~ wt 2. mpg ~ wt + hp 3. mpg ~ hp

We’ve already fitted the first model. Now, fit a linear regression model for the next two parameter combinations, giving them unique names (lm2 and lm3), and look at the summary of their results.

lm2 <- lm(log(mpg) ~ wt + hp, data = my_mtcars)
lm3 <- lm(log(mpg) ~ hp, data = my_mtcars)
summary(lm2)
## 
## Call:
## lm(formula = log(mpg) ~ wt + hp, data = my_mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17048 -0.05589 -0.03043  0.07386  0.19392 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.8995420  0.0705574  55.268  < 2e-16 ***
## wt          -0.2287735  0.0284846  -8.031 1.25e-08 ***
## hp          -0.0014795  0.0003458  -4.278 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09807 on 27 degrees of freedom
## Multiple R-squared:  0.8857, Adjusted R-squared:  0.8772 
## F-statistic: 104.6 on 2 and 27 DF,  p-value: 1.932e-13
summary(lm3)
## 
## Call:
## lm(formula = log(mpg) ~ hp, data = my_mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44058 -0.06574 -0.02156  0.09492  0.34549 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.4444246  0.0759960  45.324  < 2e-16 ***
## hp          -0.0032294  0.0004855  -6.652 3.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1773 on 28 degrees of freedom
## Multiple R-squared:  0.6125, Adjusted R-squared:  0.5986 
## F-statistic: 44.25 on 1 and 28 DF,  p-value: 3.225e-07

As you'll see, all 3 of these models are reasonably good. Which is optimal? You might know that when you add more independent variables to a model, the model fit will often improve. But it is certainly not ideal to have a model with a large number of independent variables (because we want to avoid overfitting). Akaike's Information Criterion (AIC) provides a good mechanism for model comparison. AIC ranks a model's performance by accounting for model fit while penalizing models with more variables. When adding a variable, the improved fit of the model must outweigh the penalty; otherwise, the improved fit will not be deemed worthwhile given the additional model complexity.

Use the AIC function to compare the AIC values of our 3 models. The lowest AIC indicates the model that is a "better" representation of the data and has better predictive power. Note, AIC is not an absolute measure of fit; it is only useful for comparing models.

AIC(lm1)
## [1] -35.82462
AIC(lm2)
## [1] -49.35107
AIC(lm3)
## [1] -14.73444

What can we conclude? The best model (from the three we have tried) for predicting miles per gallon of a car uses the horsepower and weight of the car in the prediction (lm2).

To finish, let's see if we can now use this model (lm2) to predict the fuel efficiency of a car with hp = 225 and wt = 4.0 tons. Making predictions from a new set of independent variables, when the dependent variable is unknown, is ultimately one of the main goals of a regression analysis.

First, we need a dataframe with our new independent variables to be used in the prediction. Then, we can use the predict() function to apply our established linear model to the new information. Lastly, we need to transform miles per gallon back from the log scale to make it more easily interpretable.

nd <- data.frame(hp = 225, wt = 4.0)
exp(predict(lm2, newdata = nd, type = "response"))
##        1 
## 14.17612

We can get this same result without using the predict command, simply by writing out the linear equation. That is, our predicted response equals our intercept term, plus our coefficient for wt multiplied by the weight value, plus our coefficient for hp multiplied by our horsepower value (The function predict() just makes things a little easier for us). That is:

\[\hat{y}_i = \beta_0 + \beta_1\,\mathrm{wt}_i + \beta_2\,\mathrm{hp}_i\]

We must also use the exp() function to back-transform from the log-scale.

exp(lm2$coefficients[[1]] + (lm2$coefficients[[2]]*4.0) + (lm2$coefficients[[3]]*225))
## [1] 14.17612

3.5 Analysis of Variance: ANOVA

The analysis of variance (ANOVA) is a linear model, but the explanatory variable is categorical. As we saw before, the categorical variable must be a factor. Remember that if the category has been coded with numbers, R will assume the variable is continuous.

We will continue working with the mtcars data set. As we did before, we specify the number of cylinders as a categorical variable:

mtcars$cyl <- as.factor(mtcars$cyl)
class(mtcars$cyl)
## [1] "factor"

3.5.1 One-way ANOVA

Here we ask the question: do more cylinders imply more power?

Our hypothesis (\(H_1\)) is that more cylinders will provide more power to the car, expressed in gross horsepower (HP). The null hypothesis (\(H_0\)) is that there is no difference in HP among number of cylinders.

To perform an ANOVA in R, we use the function aov() and summary() to see the results.

model1 <- aov(hp ~ cyl, data=mtcars)
summary(model1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          2 104031   52015   36.18 1.32e-08 ***
## Residuals   29  41696    1438                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the results, the table shows a p-value for the overall significance test of the categorical variable. If the term is significant, then we know at least one level is significantly different from the others.

We can reject the null hypothesis and conclude that gross horsepower differs among different number of cylinders.

However, before we know we can trust this result, we need to check the model assumptions. We will do this in the same way we did for the linear model, using plot().

plot(model1)

From the second plot we can conclude that the residuals are normally distributed (straight line). From the first and third plots, we can see that the variances of the residuals are homogeneous (homoscedasticity). From the final plot, we can conclude that the model is not biased by unduly influential observations.

3.5.2 Pairwise post-hoc tests

To understand how levels compare to each other once we know that the ANOVA is significant, we can use a post-hoc test. We will use the Tukey post-hoc test with the function TukeyHSD().

TukeyHSD(model1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = hp ~ cyl, data = mtcars)
## 
## $cyl
##          diff       lwr       upr     p adj
## 6-4  39.64935 -5.627454  84.92616 0.0949068
## 8-4 126.57792 88.847251 164.30859 0.0000000
## 8-6  86.92857 43.579331 130.27781 0.0000839

diff is the difference in means between two levels, lwr and upr are the lower and upper bounds of the 95% confidence interval for the difference, and p adj is the adjusted p-value for the difference.

In this example, we can see that there is no significant difference in HP between 4 and 6 cylinders (p > 0.05), and that 8 cylinders have more HP than 4 and 6 cylinders (p < 0.05).
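You can also visualize these pairwise comparisons and their confidence intervals directly (a quick sketch):

plot(TukeyHSD(model1))  # plots each pairwise difference with its 95% confidence interval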

3.6 Non-parametric Random Forest Model

In the past few decades, there has been an increasing number of studies that use non-parametric machine learning models for classification or regression. In the simplest terms, non-parametric/machine learning models do not require the data to follow a predetermined distribution (e.g. a normal distribution). They are also relatively less stringent in terms of model assumptions.

One of the most popular and also most powerful machine learning tools is Random Forest. Random Forest is an ensemble method for classification or regression analyses that operates by constructing a series of decision trees. We will not go into detail to discuss the Random Forest algorithm or how it works. But in general, Random Forest models perform really well when there are:

  1. Large number of independent variables.
  2. Significant correlation among independent variables.
  3. Independent variables that are not normally distributed.

3.6.1 Random Forest Regression

Although the Random Forest algorithm is complicated, running a Random Forest model in R is as easy as running a linear model. For the most part, we just need to specify the data set, and the regression model formula. Let’s use the “mtcars” data set again for this example.

Load the "mtcars" data set, and load the randomForest package.

#install.packages("randomForest") # Install the package if it is not already installed on your computer.
data(mtcars)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine

Now let’s run a model that includes all metrics in the data set to predict the mpg of a car.

Note that in the following code the formula was simplified to mpg ~ . where the . represents all the remaining columns in the dataframe (those not already named in the formula), which are then used as predictor variables. Whenever you have a dataframe that contains only the dependent and independent variables you need in the model, you can simplify the formula this way; it essentially tells R to use all the other variables in the dataframe as predictors.

rf.model<-  randomForest(mpg ~ ., data=mtcars)
rf.model
## 
## Call:
##  randomForest(formula = mpg ~ ., data = mtcars) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 5.744222
##                     % Var explained: 83.68

Notice that one of the most important results of Random Forest is the "% variance explained". It can be considered equivalent to an \(R^2\) value in a linear model, quantifying how well the model fits the data. The closer the "% var explained" is to 100, the better the model's performance.

Here you can see that the default number of trees used in the model is 500. In reality, the number of trees is one of the very few parameters that may need to be modified by the user. The number of trees determines the number of ensemble models that are created in the modeling process. As we increase the number of trees, the model results will tend to vary less and eventually stabilize. But we don't want to create too many ensemble tree models because it will take more computational time. The general rule is to increase the number of trees incrementally until the value of "% var explained" changes very little.

Let's change the default number of trees to 1000 and then to 5000 and observe the difference in the model results.

rf.model.1000<- randomForest( mpg ~ ., data=mtcars, ntree=1000)
rf.model.1000
## 
## Call:
##  randomForest(formula = mpg ~ ., data = mtcars, ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 5.630956
##                     % Var explained: 84
rf.model.5000<- randomForest( mpg ~ ., data=mtcars, ntree=5000)
rf.model.5000
## 
## Call:
##  randomForest(formula = mpg ~ ., data = mtcars, ntree = 5000) 
##                Type of random forest: regression
##                      Number of trees: 5000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 5.616431
##                     % Var explained: 84.04

Increasing the number of trees (ntree) does not change the "% var explained" meaningfully, so the default ntree = 500 is as good as ntree = 1000 or ntree = 5000.

Lastly, a convenient feature of Random Forest models is that they internally summarize how important each variable is to model performance. The model does this by separately replacing each predictor variable with randomly permuted values; the more the model accuracy decreases after the permutation, the more important the permuted variable is. You can retrieve and visualize the variable importance data using the importance() and varImpPlot() functions.

importance(rf.model)        
##      IncNodePurity
## cyl      167.86329
## disp     257.59362
## hp       174.06653
## drat      57.84139
## wt       276.08078
## qsec      39.34845
## vs        21.40127
## am        17.73601
## gear      20.64983
## carb      25.42392
varImpPlot(rf.model)  

Note that "IncNodePurity" is the default importance measure reported here; it summarizes the total decrease in node impurity from splits on that variable. The higher the value, the more important the variable is.
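The permutation-based importance measure described above (reported as %IncMSE for regression models) is only computed when the model is fitted with importance = TRUE; a minimal sketch of refitting the model to obtain it:

rf.model.imp <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)
importance(rf.model.imp)   # now reports %IncMSE in addition to IncNodePurity
varImpPlot(rf.model.imp)   # plots both importance measures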

QUESTION: Which variable is contributing most to predicting mpg of a car?

Lastly, just as with all the other models, the most powerful use of a Random Forest model is to predict the dependent variable when you have a new set of independent variables. You can, once again, use the predict() function to perform this action. Below, we will create a fictitious car by changing the Hornet 4 Drive from 6 cylinders to 8 cylinders. We want to know what the mpg of this new car would be.

new.car<- mtcars[4,]
new.car$cyl<- 8
new.car
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   8  258 110 3.08 3.215 19.44  1  0    3    1
predict(rf.model, newdata = new.car, type = "response")

QUESTIONS:

  1. Do you think the prediction makes sense? Why?
  2. Can you build a random forest classification model to predict the number of cylinders based on other metrics included in the dataset?

3.7 Generalized Linear Models

A primary goal of basic ecological and applied conservation research is to understand how species are distributed across space and through time. However, animal locations or abundance data typically cannot be modelled by a normal distribution. As a result, other types of model structures are needed.

Generalized linear models (GLM) are used when the residuals are non-normal, when there are non-linear relationships between dependent and independent variables, and/or when the variance in the dependent variable is not uniform across its range. E.g., presence-absence data, count data.

GLMs consist of three elements:

  1. A probability distribution from the exponential family.
  2. A linear predictor \(η = Xβ\).
  3. A link function \(g\)

A GLM allows the specification of a variety of different error distributions:

  1. Poisson errors (useful with count data)
  2. Binomial errors (useful with data on proportions)
  3. Gamma errors (useful with data showing a constant coefficient of variation)
  4. Exponential errors (useful with data on time to event - e.g., survival analysis)

The linear predictor η (eta) is the quantity which incorporates the information about the independent variables into the model. It is related to the expected value of the dependent variable through the link function.

η is expressed as linear combinations of unknown parameters β.

The linear predictor can be represented as the matrix of independent variables \(X\) multiplied by the vector of coefficients \(β\). η can thus be expressed as \(η = Xβ\).

The link function g provides the relationship between the linear predictor and the mean, or expected value, of the distribution function.

There are many commonly used link functions, and their choice can seem somewhat arbitrary. It can be convenient to match the domain of the link function to the range of the distribution function’s mean.

(Figure: commonly used link functions)
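In R, the error distribution and its link function are specified together through a family object when fitting a GLM; a quick sketch of some common choices:

binomial(link = "logit")  # presence/absence (binary) data
poisson(link = "log")     # count data
Gamma(link = "log")       # positive continuous data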

3.7.1 Binomial GLM for binary data (logistic regression/logistic GLM)

A logistic regression model allows us to establish a relationship between a binary (0 vs. 1) outcome variable and a group of predictor variables. It models the logit-transformed probability of a certain outcome as a linear relationship with the predictor variables.
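In equation form, with \(p_i\) the probability of the outcome for observation \(i\), the model is:

\[\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}\]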

To demonstrate how a logistic regression model works, we will use the following study:

Bolger et al. (1997) investigated the impacts of habitat fragmentation on the occurrence of native rodents. Quinn and Keough (2002) subsequently modeled the presence/absence of native rodents against some of Bolger et al.'s (1997) biogeographic variables.

Bolger et al. (1997) studied whether small fragments of the shrub habitats in canyons of San Diego, CA, isolated by urbanization, were capable of supporting viable populations of native rodent species.

The data set consists of rodent presence/absence and three predictor variables:

  1. PERSHRUB = percentage of shrub cover
  2. DISTX = the distance to the nearest large (>100 ha) "source canyon"
  3. AGE = time elapsed since the fragment was isolated by development

(Figure from Bolger et al. 1997)

For this demonstration we need to install and load a new package.

#install.packages("usdm") # Install the package if it is not installed on your computer.
library(usdm)

3.7.2 Import data set

Import the data set using the read.table() function. The data for this example are in .txt format. read.table() will read the .txt file and load it as a data frame into the R environment.

Look at the first lines of data using the head function.

bolger <- read.table("data/bolger.txt", header=T, sep= "")
head(bolger)
##   PERSHRUB DISTX AGE RODENTSP
## 1       66  2100  50        1
## 2       90   914  20        1
## 3       75  1676  34        1
## 4       75   243  34        1
## 5       60   822  16        1
## 6       50   121  14        1

Note: It is often best to rescale continuous covariates to improve numerical stability in the models. We will skip this step for this example, but we will look into scaling in other examples.


3.7.3 Investigate potential (multi)collinearity

When we fit a model we assume that the estimated effects of predictor variables are independent of each other. Thus, before we fit a model we first need to check our independent variables for potential collinearity.

Collinearity occurs when 2 or more independent variables have a strong linear relationship, making it difficult to determine which of the collinear variables is more strongly associated with the response variable. Collinearity affects parameter estimates and increases both the standard errors and p-values of parameters, likely obscuring some important relationships.

plot(bolger[,1:3])
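Besides the pairwise scatterplots, you can also inspect the pairwise correlation coefficients directly (a quick sketch; output not shown):

cor(bolger[, 1:3])  # correlation matrix of the three predictor variables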

To make sure we do not have a (multi)collinearity problem in our predictor variables, we can look at the Variance Inflation Factor (VIF) for each covariate. Variance inflation is a measure of the degree of collinearity between variables. Any variable with a VIF > 3 (or 5, or 10, depending on who you ask) indicates concerning collinearity in the model. If a covariate has a VIF > 3, remove it (or the other correlated variables) and recalculate the VIFs after deletion.

vifstep(bolger[,1:3])
## No variable from the 3 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( DISTX ~ PERSHRUB ):  -0.2745893 
## max correlation ( AGE ~ PERSHRUB ):  -0.7952976 
## 
## ---------- VIFs of the remained variables -------- 
##   Variables      VIF
## 1  PERSHRUB 2.743579
## 2     DISTX 1.093362
## 3       AGE 2.750798

In our example there is no evidence of collinearity as all VIFs are < 3. Good!

3.7.4 Fit a Binomial GLM

bolger.glm <- glm(RODENTSP ~ DISTX + AGE + PERSHRUB,
                family=binomial (link= "logit"),
                data=bolger)
summary(bolger.glm)
## 
## Call:
## glm(formula = RODENTSP ~ DISTX + AGE + PERSHRUB, family = binomial(link = "logit"), 
##     data = bolger)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -5.9099159  3.1125426  -1.899   0.0576 .
## DISTX        0.0003087  0.0007741   0.399   0.6900  
## AGE          0.0250077  0.0376618   0.664   0.5067  
## PERSHRUB     0.0958695  0.0406119   2.361   0.0182 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.617  on 24  degrees of freedom
## Residual deviance: 19.358  on 21  degrees of freedom
## AIC: 27.358
## 
## Number of Fisher Scoring iterations: 5

3.7.5 Model assumptions validation

In the case of a linear model, we used ‘plot(model1)’ to check the standardized residuals for any “patterns”. For binomial data, however, you will just see two bands in the residuals due to the binomial nature of the dependent variable.

What we need to check instead is evidence of model fit, that is, whether the model fits the data well. To do this, we extract the deviance residuals and examine their distribution. Observations with a deviance residual > 2 may indicate a lack of fit.

devresid<-resid(bolger.glm, type = "deviance")
hist(devresid)

In this case, we can conclude there is no strong evidence of lack of fit. We can now feel confident about our model to make interpretations.

3.7.6 Model interpretation and predictions

Let’s look again at the model summary:

summary(bolger.glm)
## 
## Call:
## glm(formula = RODENTSP ~ DISTX + AGE + PERSHRUB, family = binomial(link = "logit"), 
##     data = bolger)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -5.9099159  3.1125426  -1.899   0.0576 .
## DISTX        0.0003087  0.0007741   0.399   0.6900  
## AGE          0.0250077  0.0376618   0.664   0.5067  
## PERSHRUB     0.0958695  0.0406119   2.361   0.0182 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.617  on 24  degrees of freedom
## Residual deviance: 19.358  on 21  degrees of freedom
## AIC: 27.358
## 
## Number of Fisher Scoring iterations: 5

The chance of native rodent occurrence increases significantly with increasing shrub cover (\(B=0.096\), \(p = 0.0182\)). Neither fragment isolation age nor distance had a significant effect.

The default results are on the logit scale. The coefficient of PERSHRUB is: 0.096, so that a one unit change in percentage of shrub cover produces approximately a 0.096 unit change in the log odds of presence (i.e. a 0.096 unit change on the logit scale).

The odds of presence are defined as the ratio of the probability of presence over the probability of absence. For instance, if the probability of presence is 0.8 and the probability of absence is 0.2, the odds are 0.8/0.2 = 4, or the odds of presence are 4 to 1. For more information on how to interpret odds visit: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
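For example, you can convert the PERSHRUB coefficient to an odds ratio (a quick sketch):

exp(coef(bolger.glm)[["PERSHRUB"]])  # roughly 1.10: each extra 1% of shrub cover multiplies the odds of presence by about 1.10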

It is hard to think in terms of logits. We can plot the association between the probability of native rodent presence and percentage shrub cover.

See how the code for this plot has gotten much longer and more complex. Make sure you understand each line of code. Remember to use the help command when you do not understand a specific function and all its arguments.

xs <- seq(0, 100, 1)
bolger.predict <- predict(bolger.glm, type="response", se=T, 
                          newdata=data.frame(DISTX=mean(bolger$DISTX), AGE=mean(bolger$AGE), PERSHRUB=xs))

# Plot results
plot(bolger$RODENTSP ~ bolger$PERSHRUB, xlab="Percentage shrub cover", ylab="Rodent presence probability", axes=T, pch=16)
lines(bolger.predict$fit~xs, type="l", col="gray")
lines(bolger.predict$fit + bolger.predict$se.fit~xs, col="red", type= "l", lty=2)
lines(bolger.predict$fit - bolger.predict$se.fit~xs, col="red", type= "l", lty=2)

***

What is the probability of native rodent occurrence with a shrub cover of 80%, DISTX of 100, and fragment isolation age of 5?

predict(bolger.glm, type="response", se=T, newdata=data.frame(DISTX=100, AGE=5, PERSHRUB=80))
## $fit
##         1 
## 0.8716415 
## 
## $se.fit
##         1 
## 0.1277815 
## 
## $residual.scale
## [1] 1

Note that the predict function already converts values from the logit scale back to the probability scale.

To do that manually, we can do the following:

x <- -5.9099159 + 0.0003087*100 + 0.0250077 * 5 + 0.0958695 * 80
exp(x)/(exp(x)+1) # This is the inverse of the logit scale
## [1] 0.8716417
plogis(x) # Use a built in function
## [1] 0.8716417

4 An introduction to plotting using ggplot2

The package ggplot2 is widely used for data visualization in R. This package is extremely powerful. I used to like using base R code for plotting but eventually, I had to admit that ggplot2 is extremely cool and had to adopt it.

ggplot2 is based on the grammar of graphics, which allows users to create complex plots from data in a systematic way.

I will first present the basics of ggplot2 using the same mtcars dataset, and then I will show you the code for different figures I have used in my publications.

As with any R package, before you can use ggplot2, you need to install it (if you haven’t already) and load it into your R session.

#install.packages("ggplot2")
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(dplyr)

4.1 Basic Concepts

There are a few concepts that we need to know to understand how to code ggplots.

  • Data: The dataset you want to visualize.
  • Aesthetics (aes): The visual properties of the plot (e.g., x and y position, color, size).
  • Geometries (geom_): The actual marks we put on the plot (e.g., points, lines, bars). We can, for instance, use geom_line() for plotting lines.
  • Facets: Subplots that display subsets of the data.
  • Scales: Control how data values are mapped to visual properties.
  • Themes: Control the overall appearance of the plot (e.g., font size, background color).

4.2 Creating Your First Plot

Let’s start with a simple scatter plot using the built-in mtcars dataset. We have created this plot before, so the look should be familiar to you.

4.2.1 A basic scatter plot

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

Here, ggplot(data = mtcars, aes(x = wt, y = mpg)) initializes the plot with the mtcars dataset, setting wt (weight) on the x-axis and mpg (miles per gallon) on the y-axis. geom_point() adds points to the plot to create a scatter plot.

4.2.2 Customizing the plot

4.2.2.1 Titles and labels

We now have a basic plot. Let’s start to customize it. We will first add a title, subtitle, and axis labels with the labs() function.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight",
       subtitle = "Data from mtcars dataset",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       caption = "Source: mtcars dataset")

4.2.2.2 Color and size

We can also map point color and size to variables in the data. To do this, we specify in the aesthetics which variables control color and size. Then we use scale_color_continuous() and scale_size_continuous() to set the names shown in the legend.

ggplot(data = mtcars, aes(x = wt, y = mpg, color = as.numeric(cyl), size = hp)) +
  geom_point() +
  labs(title = "Scatter Plot with Color and Size Mappings",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  scale_color_continuous(name = "Number of Cylinders") +
  scale_size_continuous(name = "Horsepower")

Looking pretty good.

4.2.2.3 Facets

Faceting allows you to create multiple plots based on a categorical variable. This is very nice. Let's generate different plots for cars with different numbers of cylinders.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(title = "Scatter Plot Faceted by Number of Cylinders",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")

4.2.3 Using Different Geometries

Let’s try now different geometries.

4.2.3.1 Bar Plot

For a bar plot we use geom_bar().

ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Bar Plot of Cylinder Counts",
       x = "Number of Cylinders",
       y = "Count")

4.2.3.2 Histogram

For a histogram we use geom_histogram().

ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Histogram of Miles Per Gallon",
       x = "Miles Per Gallon",
       y = "Count")

4.2.3.3 Box Plot

We use geom_boxplot() for a box plot.

ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "Box Plot of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon")

4.2.3.4 Violin Plot

We use geom_violin() for a violin plot. Sometimes, violin plots can better represent the distribution of the data.

ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin() +
  labs(title = "Violin Plot of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon")

4.2.4 Customizing the plot appearance

4.2.4.1 Themes

The themes control the overall look of your plot. There are many themes available. For instance, we can use theme_minimal().

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter Plot with Minimal Theme",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")

Other available themes include theme_gray(), theme_classic(), theme_bw(), and more. Check them out yourself.

4.2.4.2 Customizing scales

You can customize the scales of axes, colors, and other aesthetics. In this case, we are colouring the dots based on horsepower using the function scale_color_gradient() and setting the range of colors from blue (low) to red (high).

ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Scatter Plot with Custom Color Scale",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Horsepower")

4.2.4.3 Adding Annotations

You can add annotations to your plots. This can be useful to highlight something in your plot.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("text", x = 5, y = 30, label = "High MPG", color = "red") +
  labs(title = "Scatter Plot with Annotation",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")

4.2.4.4 Multi-layered plots

One of ggplot2's strengths is that it is very easy to add multiple layers to the same plot. You can combine different geoms in one plot, using the same or different data.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Scatter Plot with Multiple Trend Lines",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

With ggplot2, you have a powerful tool to explore and present your data in compelling ways.

There are many options you can control in ggplot2. The best way to learn all the possibilities is by playing with it; pretty much everything can be customized. Googling anything you want to do is one of the best ways to troubleshoot.

Feel free to experiment with different datasets and ggplot2 functions to create the visualizations that best communicate your insights!

4.2.5 Saving Your Plot

To save your plot to a file use the function ggsave(). Note that first you need to save your plot as an object.

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight")

ggsave("scatter_plot.png", plot = p, width = 6, height = 4)

4.3 Examples from my work

I want to show you now some more advanced plots I have created through the years for different publications and presentations.

4.3.1 Example 1

The first examples come from this manuscript.

In this study, I used a multi-species occupancy model to look at how the species richness of large herbivores relates to different covariates and land uses in central Kenya.

I created a violin plot for figure 1 of the article to show how species richness varies across years at each land use.

Let’s load the data.

data <-read.csv("data/forplot.csv")
colnames(data)[3:5]<-c("DistanceToWater","NDVI","LivestockAbundance")

Now we will create the plot. We specify the aesthetics with Year on the x axis, SR (species richness) on the y axis, and Year as the grouping factor. We use geom_violin() to create the violin plot. I used a handy function, stat_summary(), to plot the mean and standard deviation of species richness within each violin. I faceted the plot by land use and used the gray theme (theme_gray()). With labs(), we specify the labels. Finally, within the theme() function we make further customizations, such as setting the font sizes of the axes, titles, and labels.

ggplot(data, aes(x = Year, y = SR, group = Year)) +
  geom_violin() +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "pointrange", color = "black") +
  facet_wrap(~ LandUse, ncol = 5) +
  theme_gray() +
  labs(x = "Year", y = "Estimated point sp. richness") +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12), strip.text = element_text(size = 12))

In that same paper, I wanted to show the relationship of species richness with livestock abundance. I then created this scatter plot, showing how species richness declines with increasing livestock but highlighting the differences among land uses.

Here, I used geom_point() and stat_smooth() to create trends based on GAM models for each land use. Note that I specified LandUse as the grouping and color factor in the aesthetics. I customized the colors using scale_color_manual() and set a black and white theme. With the theme() function, I set the position of the legend (getting the position right requires a lot of trial and error; change it and you will understand). I also customized the font sizes.

ggplot(data, aes(x=LivestockAbundance, y=SR, group=LandUse, color=LandUse)) + geom_point() + stat_smooth(aes(x=LivestockAbundance, y=SR), method = "gam", formula = y ~ s(x)) +
  scale_color_manual(name = "Land use", values = c("#FF7F00", "#8B0000", "#66CD00","#006400")) +
  labs(x = "Livestock abundance", y = "") +
  theme_bw() +
  theme(legend.position=c(0.8,0.8),legend.title=element_text(size=14),legend.text=element_text(size=12), axis.text = element_text(size = 14), axis.title = element_text(size=15))
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Here is the same plot faceted by year.

ggplot(data, aes(x=LivestockAbundance, y=SR, group=LandUse, color=LandUse)) + facet_wrap(~ Year, ncol=3) + geom_point() +
  stat_smooth(aes(x=LivestockAbundance, y=SR), method = "gam", formula = y ~ x) +
  scale_color_manual(name = "Land use", values = c("#FF7F00", "#8B0000", "#66CD00","#006400")) +
  labs(x = "Livestock abundance", y = "Estimated point species richness") +
  theme_bw() +
  theme(legend.position=c(0.85,0.12),legend.title=element_text(size=14),legend.text=element_text(size=12)) +
  theme(axis.text = element_text(size = 11), axis.title = element_text(size = 15), strip.text = element_text(size=15))

Very cool!!

4.3.2 Example 2

The second example comes from this manuscript.

In this study, we combined eight 2-km long transects surveyed monthly between October 2017 and March 2020 with a hierarchical multi-species and multi-season distance sampling modelling framework to: (1) estimate monthly density of an ensemble of 10 different large herbivores and (2) understand how species respond to changes in vegetation productivity and time across the Naboisho Conservancy in the Greater Mara Ecosystem, Kenya.

This first plot shows the average density of species across time with corresponding credible intervals.

Load the data. The data for plotting are organised in lists. The first plot uses the first element of datalist, while the second plot uses 3 different datasets contained in datalist2.

load('./data/ListForPlotData')
load('./data/ListForPlotData2')

You can inspect the data we are going to use.

head(datalist[[1]])
##   Species       Date   MedianAb   CI      CI_low   CI_high      mean       sd
## 1   Eland 2020-03-01  4.1766996 0.89 1.647684048  7.116809  4.498319 1.878446
## 2   Eland 2020-02-01 14.7177689 0.89 7.331800163 22.200248 15.305441 4.853975
## 3   Eland 2020-01-01 16.4567753 0.89 8.813831968 25.198753 17.045384 5.398528
## 4   Eland 2019-12-01  2.3835696 0.89 0.608483624  4.613852  2.689674 1.415061
## 5   Eland 2019-11-01  0.5767496 0.89 0.006554621  1.837341  0.944494 1.357441
## 6   Eland 2019-10-01  0.5971446 0.89 0.030504372  1.943356  0.940102 1.252115
##   Year Mass
## 1 2020  419
## 2 2020  419
## 3 2020  419
## 4 2019  419
## 5 2019  419
## 6 2019  419
head(datalist2[[1]])
##          x         y       time    Species
## 1 0.235640 0.1695834 2017-10-01 Hartebeest
## 2 0.249109 0.1820851 2017-10-01 Hartebeest
## 3 0.262578 0.1947698 2017-10-01 Hartebeest
## 4 0.276047 0.2075512 2017-10-01 Hartebeest
## 5 0.289516 0.2203360 2017-10-01 Hartebeest
## 6 0.302985 0.2330247 2017-10-01 Hartebeest
head(datalist2[[2]])
##      Species         y    yUpper     yLower        x
## 1 Hartebeest 0.2079740 0.4425786 0.09190152 0.235640
## 2 Hartebeest 0.2225848 0.4467321 0.10418357 0.249109
## 3 Hartebeest 0.2373915 0.4458057 0.11616434 0.262578
## 4 Hartebeest 0.2525141 0.4542040 0.12976854 0.276047
## 5 Hartebeest 0.2664800 0.4607097 0.14319463 0.289516
## 6 Hartebeest 0.2806685 0.4662031 0.15587398 0.302985
head(datalist2[[3]])
##     Species        y   yUpper    yLower        x
## 1 Community 1.146447 2.884560 0.4580493 0.235640
## 2 Community 1.267944 3.163149 0.5149669 0.249109
## 3 Community 1.394953 3.441288 0.5744531 0.262578
## 4 Community 1.530966 3.706783 0.6322530 0.276047
## 5 Community 1.664420 3.971599 0.6961910 0.289516
## 6 Community 1.806478 4.249713 0.7626169 0.302985

This first plot, which is Fig. 2 in the manuscript, shows the changes in monthly abundance for each species.

This plot uses geom_point() and facet_wrap() as before. To plot the credible intervals around the density estimates, I use the geom_errorbar() function, for which you need to specify the ymin and ymax values. Also note the use of scale_x_date() to format the x axis labels as dates.

ggplot(datalist[[1]], aes(x = as.Date(Date), y = MedianAb)) +
  facet_wrap(~Species, scales = "free_y", nrow = 4, ncol = 3) +
  labs(y = "Estimated ind/km2", x = "Time") +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = CI_low, ymax = CI_high), width = 0.5) +
  theme(axis.text = element_text(size = 8), axis.title = element_text(size = 12),
        axis.line = element_line(linewidth = 0.5), axis.ticks = element_line(size = 0.5),
        strip.text = element_text(size = 11), strip.background = element_blank(),
        panel.border = element_blank(), panel.grid = element_blank(),
        legend.position = c(0.85, 0.1), legend.title = element_blank()) +
  scale_x_date(date_breaks = "6 months", date_labels = "%Y %b")
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The next plot is quite complex. It consists of 3 different datasets, showing the full potential of ggplot.

The plot shows the response of each species abundance (as estimated number of groups) to vegetation productivity (as NDVI). This is Fig. 4 of the manuscript. There are 3 different datasets because the hierarchical multi-species and multi-year model allows you to get estimates for each month, but also an average for each species across all months, and a community average.

The first dataset (datalist2[[1]]) contains the estimated number of groups per species and per month for a range of NDVI values. The second dataset (datalist2[[2]]) contains the average estimate of the number of groups per species across all months. The third dataset (datalist2[[3]]) contains the average estimated number of groups for the ‘community’ (i.e., the assemblage of all species included in the analysis).

Let’s plot one by one, and then we combine them all together.

Here, we have the estimated number of groups on the y axis as a function of NDVI on the x axis for each month. So, each line represents a different month (group = factor(time)). We use geom_line() to plot the lines. We facet by species, letting the y axis change freely among species with facet_wrap(~Species, scales="free_y").

ggplot(datalist2[[1]], aes(x = x, y = y, group = factor(time))) + facet_wrap(~Species, scales="free_y") + geom_line(colour = 'grey') +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none') +
  scale_color_manual(values = rep('darkgray', 34))

We will now plot the second dataset. In this case, we have the average estimate of the number of groups across all months. Note that we now have credible intervals for the estimated number of groups. In this case, we use the function geom_ribbon() to plot the credible intervals. The alpha argument controls the transparency.

ggplot(datalist2[[2]], aes(x = x, y = y, group = Species)) + facet_wrap(~Species, scales="free_y") + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(data = datalist2[[2]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none')

Finally, we have a plot for the community.

ggplot(datalist2[[3]], aes(x = x, y = y, group = Species)) + facet_wrap(~Species, scales="free_y") +  geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = 'black', alpha = 0.5) + geom_line(linewidth=1, colour = 'black') +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none')

We can now combine all datasets in the same plot by adding multiple layers, each time specifying the new dataset. So, there are a few modifications to the code of each individual plot. This is the final code:

ggplot(datalist2[[1]], aes(x = x, y = y, group = factor(time))) + facet_wrap(~Species, scales="free_y") + geom_line(colour = 'grey') + geom_ribbon(data = datalist2[[2]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) + geom_line(data = datalist2[[2]], aes(x = x, y = y, group = Species), linewidth=1, colour = 'blue') +  geom_ribbon(data = datalist2[[3]], aes(ymin = yLower, ymax = yUpper, group = Species), fill = 'black', alpha = 0.5) + geom_line(data = datalist2[[3]], aes(x= x, y = y, group = Species), linewidth=1, colour = 'black') +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none') +
  scale_color_manual(values = rep('darkgray', 34))

Note that for the community plot to be added to the facets, the data frame needs to have a column called Species with the name Community. The Species column in each dataset needs to be specified as a factor with levels corresponding to all species plus the community, as shown here:

factor(data$Species, levels= c("Hartebeest", 'Eland','Giraffe',"Grant's gazelle",'Impala',"Thomson's gazelle",'Topi','Warthog','Wildebeest',"Plains zebra", "Community"))

4.3.3 Example 3. Using phylopics.

Phylopic is a free database of silhouetted images of organisms. You can access those images and use them in your scientific figures. In R, you can use the rphylopic package to download and incorporate these silhouettes into your ggplot2 visualizations. Pretty cool. Here’s an example to get you started with using Phylopic silhouettes in R with ggplot2.

First, let’s load the package (and install it if you have not done so yet).

#install.packages("rphylopic")
library(rphylopic)
## You are using rphylopic v.1.5.0. Please remember to credit PhyloPic contributors (hint: `get_attribution()`) and cite rphylopic in your work (hint: `citation("rphylopic")`).

The first thing we need to do to use a Phylopic image is to search for the desired organism and retrieve its unique identifier (UUID). The rphylopic package provides functions to search for and download these images.

Here’s how you can search for and download a Phylopic image of a giraffe. First, we search for the images available with the pick_phylopic() function. Some species have multiple options to choose from. Here I am asking R to display no more than 4 images. This is an interactive function and you will need to enter which image to display. This is very useful for choosing which silhouette you want to use.

pick_phylopic(name = "Giraffa camelopardalis", n = 4, view = 1)

You will see something like this:

Once you know which image you want to use, you retrieve the unique identifier using get_uuid(). Here, we are choosing the first image.

giraffe <- get_uuid(name = "Giraffa camelopardalis", n = 1)
giraffe
## [1] "b35f867d-26e8-4c48-806e-0f07f18b1844"

Now, let’s save this uuid in a data.frame. Note, you could add more species.

phylopicsuuid <- data.frame(Species = "Giraffe", 
                            uuid = c(giraffe))

Let’s now add the giraffe silhouette to the figure we created before. First, we need to filter the data to keep just the giraffe.

forgirplot <- filter(datalist2[[2]], Species == 'Giraffe')

We can recreate the previous plot but using the giraffe only data.

ggplot(forgirplot, aes(x = x, y = y)) + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none')

Now, we will use the function geom_phylopic() to add the giraffe silhouette to our plot. We need to specify the data frame with the uuid as the data for geom_phylopic(). The tricky part, and it is quite trial and error, is defining the coordinates where the silhouette will appear (x and y). These will depend on the data you are plotting. Getting the size right is also trial and error. Color, fill and alpha work as in other ggplot functions.

ggplot(forgirplot, aes(x = x, y = y)) + geom_line(linewidth=1, colour = 'blue') + geom_ribbon(aes(ymin = yLower, ymax = yUpper, group = Species), fill = "blue", alpha = 0.5) +
  geom_phylopic(data = phylopicsuuid, aes(x = 0.5, y = 1.1, uuid = uuid), size = 1.9, color = 'black', fill = 'black', alpha = 0.3) +
  labs(y=expression("Nr. of groups"), x = "NDVI") +
  theme(axis.text=element_text(size=11), 
        axis.title=element_text(size=12), 
        axis.line=element_line(linewidth=0.5),
        axis.ticks=element_line(linewidth=0.5), 
        strip.text=element_text(size=11), 
        strip.background=element_blank(), 
        panel.border=element_blank(), 
        panel.grid=element_blank(), legend.position = 'none')
## Warning: Using the `size` aesthetic in this geom was deprecated in rphylopic 1.5.0.
## ℹ Please use the `height` and `width` aesthetics instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Very cool!!

Stay tuned for more examples.

5 An introduction to GIS in R

We will now start to work with spatial data (i.e. vector and raster) in R.

Before we start, and if you are not using an R project, do not forget to set your working directory.

setwd("...")
# don't forget to set the directory to your own data folder!!

Next, load packages:

library(sf)
library(terra)
library(tidyverse)

There is a lot of information on the internet about geospatial data and R. Here, I summarize the basic tools to conduct spatial analyses in R, with the goal that you will build your skills as we progress into some case studies. I have modified some of the code from Claudia Engel (https://cengel.github.io/R-spatial/). I recommend this site for a more in-depth tutorial.

5.1 Vector Data in R

Vector data are composed of points, lines, and polygons.

  • Points are the simplest spatial object. Each point consists of a coordinate pair (X, Y) that is structured as a two-column matrix, with one column for latitude and another for longitude. Each point can also have n associated variables as attribute information. For instance, species occurrence data often have information on elevation, date, and habitat of each point recorded in the field, in addition to their X and Y coordinates. In such cases, the data will be a dataframe containing several columns, each corresponding to a different variable (e.g. Point ID, X coordinate, Y coordinate, elevation, date recorded, habitat type), while each row corresponds to a single point.
plot(st_point(c(0,0)))
plot(st_point(c(0.5,0.5)), add = T)

  • Lines (or polylines) are used to represent linear features such as rivers/streams, trails, streets, etc. A single line is composed of an ordered set of points (2D coordinates) that are connected by actual line segments, which allows us to calculate length/distance.
plot(st_linestring(matrix(runif(6), ncol=2)) )

  • Polygons are used to represent the extent of geographic features, such as city boundary, lakes, sampling plots, and habitat patches. Polygons are composed of several lines that close into a polygon, allowing us to calculate area and perimeter.
plot(st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))))

5.1.1 A short, but important word about coordinates and projections

Any spatial data collected using GPS devices or downloaded online must have a projection system that links the coordinates to specific locations on the earth’s surface. Without a coordinate reference system, we cannot locate places on Earth.

Any given geospatial dataset (e.g. a set of points) with an incorrect or unknown projection system can be visualized on its own. However, you will not be able to plot that spatial data with other spatial data, as the coordinate reference system is missing. Importantly, you will not be able to conduct any spatial analysis (e.g., calculate the distance of a given point to a road, or extract the underlying raster value at the point locations), since the incorrectly projected (or unprojected) spatial data is pointing to a completely different location on earth than intended.

Spatial data without the correct projection system is basically useless. When you acquire or download geospatial data, make sure you know what projection the data were originally created in (this information is usually found in the metadata of a given dataset).

There are EPSG Codes that can be used to avoid manually setting all the parameters needed in a projection. For instance, the code EPSG:32637 codes for the UTM37N projection with WGS84 datum and ellipsoid. You can find the EPSG reference list in the following link: http://spatialreference.org/ref/epsg/
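As a quick check, and assuming the sf package that we introduce below is installed, you can inspect the full definition behind an EPSG code directly in R (a minimal sketch):

library(sf)
st_crs(32637) # prints the WKT definition for WGS 84 / UTM zone 37N
st_crs(4326)  # WGS 84 geographic coordinates (longitude/latitude)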

5.1.2 R packages for Vector Data

There have been two main packages in R for handling geospatial vector data: the sp package and the sf package.

The first package to provide classes and methods for spatial data types in R is called sp. Development of the sp package began in the early 2000s in an attempt to standardize how spatial data would be treated in R and to allow for better interoperability between different analysis packages that use spatial data. The package (first release on CRAN in 2005) provides classes and methods to create points, lines, polygons, and grids and to operate them. About 350 packages developed for spatial analysis use the data types implemented in sp, which means several spatial packages “depend” on the sp package in some way.

The sp package depends on rgeos and rgdal. However, at the end of 2023, these packages were retired, primarily because their maintainer, Roger Bivand, retired. You can read more about it here.

Geospatial vector work in R has now largely moved to the more modern package: sf

It was first released on CRAN in late October 2016. It implements a formal standard called “Simple Features” that specifies a storage and access model of spatial geometries (i.e. points, lines, polygons).

sf provides fast I/O, particularly relevant for large files.

5.1.3 The sf Package

The sf class is an extension of data frames. Essentially, sf objects can be treated as data frames that also contain spatial data, as opposed to spatial data that may or may not also contain data frames. This enables sf objects to fit within the tidyverse workflow, making it possible to manipulate them with the dplyr commands we saw before.

sf objects consist of rows of features, hence the name “simple features” (sf), each of which contains both non-spatial attributes and spatial data.

The spatial data of an sf object is contained in a special geometry column that is of class ‘sfc’. The geometry column contains the basic types of spatial data: the Coordinate Reference System (CRS), coordinates, and type of geometric object.

The ‘sfc’ class has seven subclasses to denote the types of geometric objects within the geometry column, which are derived from the simple features standard. The possible geometric objects are point, linestring, polygon, multipoint, multilinestring, multipolygon, and geometrycollection, which is used for any combination of the other types.
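As a small illustrative sketch (the coordinates here are made up), you can build a geometry column by hand and inspect its class and geometry type:

# A minimal sfc (geometry column): two points with a WGS84 CRS
pts <- st_sfc(st_point(c(36.8, -1.3)), st_point(c(37.0, -1.2)), crs = 4326)
class(pts)            # "sfc_POINT" "sfc"
st_geometry_type(pts) # POINT POINT
st_crs(pts)           # the CRS stored with the geometries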

Now let’s do some operations using sf.

5.1.4 Convert a data frame to a sf object

In your research you may create spreadsheets that contain data associated with spatial coordinates. You already know how to read the spreadsheet into a data frame with read.table or read.csv. We can then very easily convert the table into a spatial object in R.

Let’s start by converting a data frame into an sf object. For this, our data frame needs to have at least two columns indicating the coordinates of points. This could be the location of animal observations recorded in the field, for instance.

We will use the st_as_sf() function, which takes a vector of the coordinate columns for the ‘coords’ argument, and either an EPSG code or a PROJ definition for the crs.

mt_df <- read.csv("data/ke_major-towns.csv", header = TRUE)

# Create sf object from the mt_df data frame and set the CRS
points_sf <- st_as_sf(mt_df, coords = c("lon", "lat"), crs = 4326)
print(points_sf)
## Simple feature collection with 55 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 34.0701 ymin: -4.16856 xmax: 41.8401 ymax: 4.26594
## Geodetic CRS:  WGS 84
## First 10 features:
##     TOWN_NAME                 geometry
## 1      Moyale  POINT (39.0671 3.51502)
## 2     Mandera  POINT (41.8401 3.93175)
## 3       Wajir  POINT (40.0699 1.74579)
## 4    Marsabit  POINT (37.9998 2.32618)
## 5      Lodwar  POINT (35.6075 3.12534)
## 6   Lokitaung  POINT (35.7604 4.26594)
## 7      Kitale  POINT (35.0197 1.02763)
## 8     Eldoret  POINT (35.284 0.511704)
## 9  Kapenguria  POINT (35.1246 1.23239)
## 10    Bungoma POINT (34.5667 0.566278)

We now have an sf object containing the location of major towns in Kenya.

5.1.5 Projections and reprojections

If you need to assign a CRS to an sf object, you can use the function st_set_crs().
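For example, a minimal sketch reusing the mt_df table from above, assuming we know its coordinates are WGS84 longitude/latitude:

# Create the points without a CRS, then assign WGS84 with st_set_crs()
# Note: st_set_crs() only labels the coordinates; it does not reproject them
points_nocrs <- st_as_sf(mt_df, coords = c("lon", "lat"))
st_crs(points_nocrs) # NA: no CRS defined yet
points_wgs84 <- st_set_crs(points_nocrs, 4326)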

To reproject an sf object, you can use the function st_transform(). You need to provide the object and the new CRS.

Here, we are going to project the spatial object from WGS84 to UTM37N.

points_sf_UTM <- st_transform(points_sf, crs="+proj=utm +zone=37 +north +ellps=WGS84 +datum=WGS84 +units=m")
points_sf_UTM
## Simple feature collection with 55 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -49256.84 ymin: -460773.8 xmax: 815422 ymax: 472280.3
## Projected CRS: +proj=utm +zone=37 +north +ellps=WGS84 +datum=WGS84 +units=m
## First 10 features:
##     TOWN_NAME                  geometry
## 1      Moyale POINT (507452.6 388520.5)
## 2     Mandera   POINT (815422 435119.7)
## 3       Wajir POINT (619005.1 192996.7)
## 4    Marsabit POINT (388788.2 257153.6)
## 5      Lodwar POINT (122836.8 346056.7)
## 6   Lokitaung POINT (140314.8 472280.3)
## 7      Kitale POINT (56804.23 113860.5)
## 8     Eldoret POINT (86226.52 56678.64)
## 9  Kapenguria POINT (68533.06 136530.5)
## 10    Bungoma POINT (6212.445 62779.85)

We can see that the projection is now UTM 37N.

5.1.6 Reading Shapefiles into R

Reading shapefiles into R using the sf package is quite simple. We can use the function st_read()

mammals <- st_read("data/ke_mammals_WGS84.shp")
## Reading layer `ke_mammals_WGS84' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/ke_mammals_WGS84.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 11128 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 33.9207 ymin: -4.677568 xmax: 41.91134 ymax: 5.428362
## Geodetic CRS:  WGS 84
mammals
## Simple feature collection with 11128 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 33.9207 ymin: -4.677568 xmax: 41.91134 ymax: 5.428362
## Geodetic CRS:  WGS 84
## First 10 features:
##    MAMMALS                       geometry
## 1       59 POLYGON ((35.72504 5.392665...
## 2       59 POLYGON ((35.6244 5.402578,...
## 3       60 POLYGON ((35.46943 5.427557...
## 4       59 POLYGON ((35.59788 5.412438...
## 5       60 POLYGON ((35.41483 5.426372...
## 6       59 POLYGON ((35.49558 5.428124...
## 7       58 POLYGON ((35.74144 5.381304...
## 8       59 POLYGON ((35.32218 5.424792...
## 9       60 POLYGON ((35.61533 5.405414...
## 10      59 POLYGON ((35.83205 5.343137...

What kind of vector data is this?

5.1.7 sf object manipulation

All sf objects can generally be treated as data frames when wrangling the data. This makes it possible to use workflows from the tidyverse package.

For instance, subsetting the mammals data as we did before with data frames can be done with dplyr’s filter() function.

Working with the mammals sf object, let’s filter the polygons that contain more than 60 species of mammals.

mammals_60 <- mammals %>% filter(MAMMALS > 60)
mammals_60
## Simple feature collection with 6884 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 33.9207 ymin: -4.677568 xmax: 40.33287 ymax: 5.385777
## Geodetic CRS:  WGS 84
## First 10 features:
##    MAMMALS                       geometry
## 1       61 POLYGON ((35.49277 5.385777...
## 2       61 POLYGON ((35.73524 5.37106,...
## 3       61 POLYGON ((35.49269 5.362251...
## 4       61 POLYGON ((35.78922 5.358685...
## 5       61 POLYGON ((35.56057 5.294257...
## 6       61 POLYGON ((35.54746 5.28794,...
## 7       62 POLYGON ((35.49659 5.263432...
## 8       61 POLYGON ((35.7855 5.264725,...
## 9       61 POLYGON ((35.49269 5.261552...
## 10      61 POLYGON ((35.79181 5.251389...

We now have a new object with only polygons containing more than 60 species of mammals.

You can save this new object as a shapefile using the st_write() function. You need to provide the sf object and the path and name where you want to save the shapefile. By default, this function checks if the file already exists, and if so it will not overwrite it. You can set the parameter delete_layer = TRUE to overwrite a shapefile with the same name.

st_write(mammals_60, "data/mammals_60", driver = "ESRI Shapefile", delete_layer = TRUE)

This is a very basic introduction to vector data. In future workflows, we will learn to use more spatial tools to manipulate vector data.

Note that we have not plotted any vector data yet. You can use the function plot() to plot sf objects, but the plots are not great. We will learn to use another package for making maps later.
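Still, for a quick look you can plot just the geometry column so that you get a single panel instead of one panel per attribute. A minimal sketch using the layers we just created:

# Plot only the geometries: polygon outlines, then overlay the >60 species subset
plot(st_geometry(mammals))
plot(st_geometry(mammals_60), col = "darkgreen", add = TRUE)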

5.2 Raster Data in R

Raster processing is an important aspect of building spatial datasets for further analysis. Doing so within R can greatly help streamline workflows, allowing for all analyses to be completed in a single program. Here, we will walk through a series of steps to provide the basic skills necessary for image analysis.

A raster image is composed of pixels or cells. Each raster contains a CRS, the coordinates of its origin, dimensions, and an array of cell values.

The package raster was the main package to work with raster data in R. In 2020, the package terra was released, and thanks to its greater functionality and efficiency, it has now replaced the raster package.
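To make those components concrete, here is a small sketch (purely illustrative values) that builds a raster from scratch with terra:

library(terra)

# A 10 x 10 raster with an extent, a CRS, and random cell values
r <- rast(nrows = 10, ncols = 10,
          xmin = 0, xmax = 10, ymin = 0, ymax = 10,
          crs = "EPSG:4326")
values(r) <- runif(ncell(r)) # fill the cells with values
r                            # prints dimensions, resolution, extent, and CRS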


For this exercise we will use a Landsat 8 image. Before we start processing the image it is good to remember the different bands of Landsat 8:

Band | Wavelength (µm) | Useful for mapping
Band 1 Coastal Aerosol | 0.435 - 0.451 | Coastal and aerosol studies
Band 2 Blue | 0.452 - 0.512 | Bathymetric mapping, distinguishing soil from vegetation, and deciduous from coniferous vegetation
Band 3 Green | 0.533 - 0.590 | Emphasizes peak vegetation, which is useful for assessing plant vigor
Band 4 Red | 0.636 - 0.673 | Discriminates vegetation slopes
Band 5 Near Infrared (NIR) | 0.851 - 0.879 | Emphasizes biomass content and shorelines
Band 6 Short-wave Infrared (SWIR) 1 | 1.566 - 1.651 | Discriminates moisture content of soil and vegetation; penetrates thin clouds
Band 7 Short-wave Infrared (SWIR) 2 | 2.107 - 2.294 | Improved moisture content of soil and vegetation and thin cloud penetration
Band 8 Panchromatic | 0.503 - 0.676 | 15 meter resolution, sharper image definition
Band 9 Cirrus | 1.363 - 1.384 | Improved detection of cirrus cloud contamination
Band 10 TIRS 1 | 10.60 - 11.19 | 100 meter resolution, thermal mapping and estimated soil moisture
Band 11 TIRS 2 | 11.50 - 12.51 | 100 meter resolution, improved thermal mapping and estimated soil moisture
BQA | | Landsat Collection 1 QA Bitmask

5.2.1 Loading rasters

We use the function rast() to read a raster. In this case we are reading the Landsat8 image.

Image <- rast("./data/Landsat8.tif") 
Image
## class       : SpatRaster 
## dimensions  : 795, 814, 12  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 287430, 311850, -2490, 21360  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637) 
## source      : Landsat8.tif 
## names       : B1, B2, B3, B4, B5, B6, ...

Questions: What is the spatial resolution of this image? What is the CRS? How many bands are there?

If you only want to read the first 11 bands, you can use the argument lyrs:

Image2 <- rast("./data/Landsat8.tif", lyrs = c(1:11)) 
Image2
## class       : SpatRaster 
## dimensions  : 795, 814, 11  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 287430, 311850, -2490, 21360  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637) 
## source      : Landsat8.tif 
## names       : B1, B2, B3, B4, B5, B6, ...

5.2.2 Querying the Image to Obtain Basic Image Information

Similar to a data.frame, raster objects store all the relevant information to do data manipulation and/or visualization.

print(Image) # Alternatively, you could just type: Image
## class       : SpatRaster 
## dimensions  : 795, 814, 12  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 287430, 311850, -2490, 21360  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 37N (EPSG:32637) 
## source      : Landsat8.tif 
## names       : B1, B2, B3, B4, B5, B6, ...
# Raster Class
class(Image)
## [1] "SpatRaster"
## attr(,"package")
## [1] "terra"
# Names of the raster layers
names(Image)
##  [1] "B1"         "B2"         "B3"         "B4"         "B5"        
##  [6] "B6"         "B7"         "B10"        "B11"        "sr_aerosol"
## [11] "pixel_qa"   "radsat_qa"
# Summary statistics (min, max, quartiles)
summary(Image)
##        B1                B2                B3               B4        
##  Min.   :-2000.0   Min.   :-2000.0   Min.   :  63.0   Min.   :  26.0  
##  1st Qu.:  263.0   1st Qu.:  333.0   1st Qu.: 571.0   1st Qu.: 660.0  
##  Median :  373.0   Median :  461.0   Median : 737.0   Median : 965.0  
##  Mean   :  372.6   Mean   :  459.4   Mean   : 733.6   Mean   : 935.2  
##  3rd Qu.:  481.0   3rd Qu.:  587.0   3rd Qu.: 890.0   3rd Qu.:1201.0  
##  Max.   : 2189.0   Max.   : 2576.0   Max.   :3183.0   Max.   :3446.0  
##        B5             B6             B7            B10            B11      
##  Min.   : 159   Min.   : 133   Min.   : 130   Min.   :2880   Min.   :2876  
##  1st Qu.:1964   1st Qu.:1882   1st Qu.:1253   1st Qu.:3010   1st Qu.:2992  
##  Median :2207   Median :2440   Median :1788   Median :3043   Median :3022  
##  Mean   :2236   Mean   :2421   Mean   :1754   Mean   :3030   Mean   :3011  
##  3rd Qu.:2467   3rd Qu.:2971   3rd Qu.:2251   3rd Qu.:3065   3rd Qu.:3043  
##  Max.   :6027   Max.   :5603   Max.   :5769   Max.   :3120   Max.   :3090  
##    sr_aerosol       pixel_qa       radsat_qa
##  Min.   :  8.0   Min.   :322.0   Min.   :0  
##  1st Qu.: 96.0   1st Qu.:322.0   1st Qu.:0  
##  Median : 96.0   Median :322.0   Median :0  
##  Mean   :117.2   Mean   :322.1   Mean   :0  
##  3rd Qu.:160.0   3rd Qu.:322.0   3rd Qu.:0  
##  Max.   :228.0   Max.   :480.0   Max.   :0
# Spatial extent
ext(Image)
## SpatExtent : 287430, 311850, -2490, 21360 (xmin, xmax, ymin, ymax)

5.2.3 Projections in rasters

You can check the projection of your raster using crs().

crs(Image, proj = TRUE)
## [1] "+proj=utm +zone=37 +datum=WGS84 +units=m +no_defs"

To reproject a raster, you need to use the function project(). Note that as the pixel positions will slightly shift, you need to provide an argument for the method used to recalculate the new pixel values. In general, you use ‘bilinear’ for continuous data and ‘near’ (nearest neighbor) for categorical data (classes).

Imagelatlong <- project(Image, "+init=epsg:4326", method = 'bilinear')
crs(Imagelatlong, proj = TRUE)
## [1] "+proj=longlat +datum=WGS84 +no_defs"

5.2.4 Image Plotting

Raster images can be plotted using a number of commands. Try a few commands to see how the results differ.

# Plot a single layer of the image (here band 4)
plot(Image, 4)

# Plot a 3-band RGB image
plotRGB(Image, r=4, g=3, b=2, stretch = "lin", axes=TRUE, xlab="Easting", ylab="Northing", main="Landsat 8, Bands 4,3,2 (RGB) - True Color Composite")

# Plot a 3-band RGB image in false color composite
plotRGB(Image, r=5, g=4, b=3, stretch = "lin", axes=TRUE, xlab="Easting", ylab="Northing", main="Landsat 8, Bands 5,4,3 (RGB) - False Color Composite")

5.2.5 Raster calculations

The real strength of the terra package is the ability to perform complex mathematical operations.

NDVI

One of the most commonly used raster products in terrestrial ecology is the Normalized Difference Vegetation Index (NDVI). Calculated from a ratio of the red and near-infrared spectral bands, the index is a measure of greenness (i.e. density of live green vegetation). NDVI values are within the -1 to 1 range. Negative NDVI values (values approaching -1) will likely correspond to water. Values close to zero (-0.1 to 0.1) generally correspond to barren areas of rock, sand, or snow. Low, positive values represent shrub and grassland (approximately 0.2 to 0.4), while high values indicate dense vegetation, such as that found in temperate and tropical forests (values approaching 1).

To calculate NDVI, we can do a simple math calculation to manipulate the raster data layers. NDVI is:

\[NDVI = \frac{(NIR - red)}{(NIR + red)}\] where red and NIR stand for the spectral reflectance measurements acquired in the red (visible) and near-infrared regions of the electromagnetic spectrum.

# Using standard math operators, extract the red band (band 4) and the near-infrared band (band 5) from the image.
redBand <- Image$B4
nirBand <- Image$B5

# Create NDVI
ndvi <- (nirBand-redBand)/(nirBand+redBand)

plot(ndvi)

hist(ndvi)

5.2.6 Pixel Extraction

One of the most important items in raster processing is the development of samples of spatial datasets for further analysis. We will see how to extract pixel values from the NDVI raster. For this example, we will create random points using the st_sample() function within the bounding box of the NDVI raster.

# Select random samples (pixels) from a raster
ranSamps <- st_sample(st_as_sfc(st_bbox(ndvi)), size=10, type = "random")
ranSamps <- st_as_sf(ranSamps) #Convert to simple feature collection
ranSamps
## Simple feature collection with 10 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 288421 ymin: -882.7695 xmax: 308887.8 ymax: 17400.44
## Projected CRS: WGS 84 / UTM zone 37N
##                             x
## 1   POINT (306968.6 414.9227)
## 2    POINT (292759.3 5147.23)
## 3   POINT (302153.5 17400.44)
## 4  POINT (300060.8 -882.7695)
## 5     POINT (300599 9189.424)
## 6   POINT (292283.8 15371.02)
## 7   POINT (291921.1 11205.62)
## 8   POINT (297620.5 4256.651)
## 9      POINT (288421 16766.3)
## 10  POINT (308887.8 2033.875)
# Plot ndvi and the random points
plot(ndvi)
plot(ranSamps, add = T)

Now, we use the extract() function to get the NDVI pixel values.

ndvivalues <- terra::extract(ndvi, ranSamps) 
ndvivalues
##    ID        B5
## 1   1 0.7694704
## 2   2 0.2596119
## 3   3 0.2643312
## 4   4 0.7040000
## 5   5 0.3405311
## 6   6 0.2721879
## 7   7 0.2496100
## 8   8 0.8188235
## 9   9 0.4093344
## 10 10 0.7205479

This function is the basis for many spatial analyses, as it allows you to extract pixel values at locations that can then be used as predictor variables in models. We will see that later on.

5.2.7 Writing an Image

You can also save rasters to your hard drive so that they can be read into other programs (e.g., ArcGIS, QGIS). Use the writeRaster() function.
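For example, a minimal sketch saving the ndvi layer we created as a GeoTIFF (the file name is just an example):

# Write the ndvi raster to a GeoTIFF on disk, overwriting any existing file
writeRaster(ndvi, "ndvi.tif", overwrite = TRUE)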

5.3 Spatial operations

Now let’s conduct some simple analyses. Note that, for each functionality we try to achieve here, there are multiple functions from different packages that can do the same job. Here we only introduce one way of doing it. When you are more familiar with R, you will develop your own preference in specific packages and functions.

5.3.1 Read in all the spatial data and visualize them together

birds <- st_read("data/Bird_pts.shp")
## Reading layer `Bird_pts' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Bird_pts.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 20 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
fences<- st_read("data/fences.shp")
## Reading layer `fences' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/fences.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 353 features and 13 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 744545.6 ymin: 4305558 xmax: 749943.7 ymax: 4310110
## Projected CRS: WGS 84 / UTM zone 17N
habitat<- st_read("data/habitat.shp")
## Reading layer `habitat' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/habitat.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 72 features and 7 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 744545.6 ymin: 4305557 xmax: 749947.7 ymax: 4310110
## Projected CRS: WGS 84 / UTM zone 17N
roads<- st_read("data/roads.shp")
## Reading layer `roads' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/roads.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 108 features and 9 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 744483.1 ymin: 4305488 xmax: 749995.6 ymax: 4310127
## Projected CRS: WGS 84 / UTM zone 17N
landcover<- rast("data/landcover.img") 
ndvi<- rast("data/ndvi250m.tif")

5.3.2 Calculate distance to vector lines

canvas<- rast(ext(roads), crs=crs(roads), resolution=30)
road.r<- rasterize(roads, canvas)  # the distance function only works with raster data type, so here rasterize roads first
distance<- distance(road.r)
plot(distance)
plot(roads, add=T)
## Warning in plot.sf(roads, add = T): ignoring all but the first attribute

What do you need to do to calculate the distance between each bird occurrence location and the nearest road?

5.3.3 Calculate buffer area along the vector line

Calculate a 50m buffer and a 100m buffer along the fences and plot the results. Note that when plotting sf objects, a plot for each attribute is created. To avoid multiple plots, we can extract the geometry and plot that instead.

fence_50m<- st_buffer(fences, 50)
plot(st_geometry(fence_50m))

fence_100m<- st_buffer(fences, 100)
plot(st_geometry(fence_100m))

Now we want to convert the buffer polygon to a raster and set the values of all pixels within the 100m buffer area to 1, and the pixels outside the buffer area to 0.

rasterize.buff<- rasterize(fence_100m, ndvi, field=1, background=0)
# Notice that here we used the existing ndvi layer as the template for the new raster: the rasterized layer will adopt the extent, the projection, and the pixel size of the template, which is convenient.
plot(rasterize.buff)

5.3.4 Extract raster value at point location

Extract raster values at point locations so we can quantify the environmental values captured by raster data at those point locations. Notice that in order for the extract function to work, the points and the raster data need to cover the same extent and have the same projection.

landcover.extract<- terra::extract(landcover, birds)
birds   # this is the original data table
## Simple feature collection with 20 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
##    id Presence                 geometry
## 1   1        1 POINT (748020.7 4308458)
## 2   2        1 POINT (747684.4 4308521)
## 3   3        1 POINT (747940.9 4308299)
## 4   4        1 POINT (749314.8 4308116)
## 5   5        1   POINT (748397 4308208)
## 6   6        1   POINT (749121 4307797)
## 7   7        1 POINT (748824.5 4307586)
## 8   8        1 POINT (749195.1 4307432)
## 9   9        1 POINT (749343.3 4307552)
## 10 10        1 POINT (748784.6 4307911)
birds.extract<- cbind(birds, landcover.extract)
birds.extract # inspect the differences
## Simple feature collection with 20 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 746042.6 ymin: 4306606 xmax: 749343.3 ymax: 4309342
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
##    id Presence ID landcover                 geometry
## 1   1        1  1        11 POINT (748020.7 4308458)
## 2   2        1  2        11 POINT (747684.4 4308521)
## 3   3        1  3        11 POINT (747940.9 4308299)
## 4   4        1  4        11 POINT (749314.8 4308116)
## 5   5        1  5        11   POINT (748397 4308208)
## 6   6        1  6        11   POINT (749121 4307797)
## 7   7        1  7        11 POINT (748824.5 4307586)
## 8   8        1  8        11 POINT (749195.1 4307432)
## 9   9        1  9        11 POINT (749343.3 4307552)
## 10 10        1 10        11 POINT (748784.6 4307911)

5.3.5 Subset polygon based on attribute value

subset.habitat<- habitat %>% filter(HABITAT =="Oak")
par(mfrow= c(1,2))
plot(habitat)

plot(subset.habitat)

How do you select only the polygon with area larger than 100,000 square meters?

5.4 Making pretty maps with R

R offers the opportunity to create beautiful high quality maps for publication. This saves you from having to export all files to make the final maps in other software, such as ArcGIS or QGIS.

Making maps is like art. Good maps can help enormously with presenting your results to an audience. It takes skills and practice.

When combining R’s powerful data processing with visualization packages, we can move beyond static maps. Here, we will learn the basics for making three types of maps:

  1. Static maps
  2. Animated maps
  3. Interactive maps

We will use the ‘tmap’ package, which is a great addition to R tools. Tmap is a powerful and flexible package with a concise syntax that will be familiar to ggplot2 users. It also has the unique capability to generate static and interactive maps using the same code via tmap_mode().

library(tmap)
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(spData)
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

Before we start using tmap, remember that you can always check the help files. Some functions in tmap have lots of arguments that can be edited. We will cover some basics here, hoping that you will explore more options on your own.

5.4.1 Static maps

We will use generic data to demonstrate the basics, and then use some of the data from this chapter to create a publishable map.

Static maps are the most common and traditional outputs from geospatial analysis.

Tmap syntax involves a separation between the input data and the aesthetics for data visualization. Each input dataset can be displayed in many different ways, largely depending on the nature of the data (raster, points, polygons, etc.)

The dataset we are using here is the world object from the ‘spData’ package. It contains some variables from the World Bank.

In tmap the basic building block is tm_shape(), which creates a tmap-element that defines input data as vector or raster objects. We can then add one or more layer elements such as tm_fill(), tm_borders(), tm_dots(), tm_raster(), depending on what you want to display.

Here is a simple example:

# Add fill layer to the world shape
data(world)
tm_shape(world) +
  tm_fill() 

# Add border layer to the world shape
tm_shape(world) +
  tm_borders() 

# Add fill and border layers to the world shape
tm_shape(world) +
  tm_fill() +
  tm_borders() 

Objects in tmap can be stored as named objects, similar to ggplot, and then called or used as a starting point to add more objects.

map1 <- tm_shape(world) +
  tm_fill() +
  tm_borders() 
map1

If we want to add more objects (e.g., another shapefile), we use + tm_shape(new_obj). When a new shape is added in this way, all subsequent aesthetic functions refer to it, until another new shape is added. This syntax allows the creation of maps with multiple shapes and layers.

tm_shape(world) + 
    tm_borders() + 
tm_shape(urban_agglomerations) + 
    tm_dots(size = "population_millions")

In addition, you can change the aesthetics of the maps using either variable fields (data in your shapefile) or constant values. The most commonly used aesthetics for fill and border layers include color, transparency, line width and line type, set with col, alpha, lwd, and lty arguments, respectively.

Here are some examples using constant values and colors. Note the use of tmap_arrange() to plot multiple maps.

ma1 = tm_shape(world) + tm_fill(col = "red")
ma2 = tm_shape(world) + tm_fill(col = "red", alpha = 0.3)
ma3 = tm_shape(world) + tm_borders(col = "blue")
ma4 = tm_shape(world) + tm_borders(lwd = 3)
ma5 = tm_shape(world) + tm_borders(lty = 2)
ma6 = tm_shape(world) + tm_fill(col = "red", alpha = 0.3) +
  tm_borders(col = "blue", lwd = 3, lty = 2)
tmap_arrange(ma1, ma2, ma3, ma4, ma5, ma6)

But we can also color the maps based on some variable of interest. You can use the default breaks for continuous variables or define your own. ‘n’ sets the number of bins into which numeric variables are categorized and ‘palette’ defines the color scheme, which in the following example is ‘BuGn’.

Let’s plot population size by country:

tm_shape(world) + tm_polygons(col = "pop") #default breaks

breaks = c(0, 2, 5, 10, 100, 500, 2000) * 1000000 # customized breaks
tm_shape(world) + tm_polygons(col = "pop", breaks = breaks)

tm_shape(world) + tm_polygons(col = "pop", n = 5)

tm_shape(world) + tm_polygons(col = "pop", palette = "BuGn", breaks = breaks)

So far we have been editing the map objects, but a good map needs other elements in the layout. These include a title, scale bar, north arrow, text, etc.

Additional elements such as north arrows and scale bars have their own functions: tm_compass() and tm_scale_bar(). Here is an example of how to add a north arrow.

tm_shape(world) + 
  tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population") +
  tm_compass(position = c("left", "top"))
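A scale bar can be added in the same way with tm_scale_bar(); here is a sketch combining both elements (the positions are just one possible choice):

tm_shape(world) + 
  tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population") +
  tm_compass(position = c("left", "top")) +
  tm_scale_bar(position = c("right", "bottom"))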

5.4.2 Animated maps

While static maps are sometimes all you need for a publication or a poster, animated maps can be a fantastic way of showing change across time in presentations or on websites where you can display gifs or movies.

There are a few things we need to learn first. In tmap, the function tm_facets() generates faceted maps. Its arguments allow you to display change either as several small maps in one figure or as an animation. If you set by, we get several static maps side by side; if we set along, we get all maps displayed on top of each other, ready for producing an animation.

For instance, a faceted map can look like this:

tm_shape(world) +
  tm_polygons() +
  tm_shape(urban_agglomerations) +
  tm_symbols(col = "black", border.col = "white", size = "population_millions") +
  tm_facets(by = "year", nrow = 4, free.coords = FALSE)

As you may notice, it can be difficult to visualize the differences between so many maps when they’re displayed side by side. But if we set along = "year", it uses all years available to create an animation. We also want to set free.coords = FALSE, which maintains the map extent for each map iteration. We then save the map as an object.

pop_anim <- tm_shape(world) + tm_polygons() + 
  tm_shape(urban_agglomerations) + tm_dots(size = "population_millions") +
  tm_facets(along = "year", free.coords = FALSE)
tmap_animation(pop_anim, filename = "urb_anim.gif", delay = 25)
## Creating frames
## Creating animation
## Animation saved to /Users/ramirocrego/Documents/GitHub/R-Tutorials/urb_anim.gif

Note that the .gif file will be saved to your directory. It will look like this:

World human population change

5.4.3 Interactive maps

Animated maps are cool, but we can now take maps to a new level with interactive maps. The release of the ‘leaflet’ package in 2015 revolutionized interactive web map creation from within R, and a number of packages have built on it since then.

Here we will learn how to create interactive maps using tmap. Tmap has a unique function that allows you to display your maps as static or interactive. The only caution is that certain features, such as scale bars and north arrows, are not displayed or used in interactive mode. It also has some limitations on the legends that can be created. But depending on the purpose, the gains are worth it.

All you need to do is switch to view mode, using the command tmap_mode("view").

Let’s display one of our previous maps in interactive mode:

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(world) + tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population size in millions")

You can navigate the map and click on each country to query the population size. You can also change the base map displayed. This type of map is displayed in html and can be saved as an html file to share or even to embed in your own website.
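One way to do this is to store the map in an object and write it out with tmap_save(); a minimal sketch (the file name is just an example):

# Save the interactive map as a standalone html file
int_map <- tm_shape(world) + 
  tm_polygons(col = "pop", palette = "BuGn", breaks = breaks, title = "Population size in millions")
tmap_save(int_map, filename = "world_population.html")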

5.4.4 Creating our own high quality map.

We will use the previous bird data to create a high quality map.

tmap_mode("plot")
## tmap mode set to plotting
tm_shape(habitat) +
  tm_polygons(col="HABITAT", palette = "Greens", title ="Habitat type") +
tm_shape(roads) +
  tm_lines(col = "ROAD_USE", palette = "Set1", title.col = "Road use") +
tm_shape(fences) +
  tm_lines(lty = 'solid', col = "black") + tm_add_legend("line", col = 'black', title = "Fences") +
tm_shape(birds) +
  tm_dots(col = 'red', size = 0.3) + tm_add_legend("symbol", shape = 16, size=0.4 ,col = 'red', title = "Bird occurrence") +
tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("right", "bottom")) +
tm_layout(legend.outside = T)

We can also create an interactive map. Note that some elements have changed. We removed the scale bar and north arrow, and the legends are simplified. Options for point shapes are not available in tmap view mode at the moment.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(habitat) +
  tm_polygons(col="HABITAT", palette = "Greens", title ="Habitat type") +
tm_shape(roads) +
  tm_lines(col = "ROAD_USE", palette = "Set1", title.col = "Road use") +
tm_shape(fences) +
  tm_lines(lty = 'solid', col = "black") +
tm_shape(birds) +
  tm_markers() + tm_add_legend("fill", col = c('black','blue'), labels = c("Fences","Bird locations")) 

5.4.5 Another example from my own work

In this manuscript I used ArcGIS to make the static maps for publication:

But for presentations, I wanted to show the map in a dynamic way, as an animated map. This is the code:

Load data:

library(heatmaply)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
## 
## ======================
## Welcome to heatmaply version 1.5.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================
SR <- st_read("./data/EstiSpRichness.shp")
## Reading layer `EstiSpRichness' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/EstiSpRichness.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 288 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 201250 ymin: -22500 xmax: 316250 ymax: 92500
## Projected CRS: WGS 84 / UTM zone 37N
colnames(SR)[3:10]<-c(2001,2004,2006,2008,2010,2012,2015,2016)

prop<-st_read("./data/Property_boundary_2015.shp")
## Reading layer `Property_boundary_2015' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Property_boundary_2015.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 314 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 186885.9 ymin: -32707.34 xmax: 321651.3 ymax: 96041.66
## Projected CRS: WGS 84 / UTM zone 37N
prop
## Simple feature collection with 314 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 186885.9 ymin: -32707.34 xmax: 321651.3 ymax: 96041.66
## Projected CRS: WGS 84 / UTM zone 37N
## First 10 features:
##                    NAME_96 LWF_UNIT LANDUSE      AREA          TENURE RECNO
## 1        NYS Mar Mar Ranch NORTHERN       2 182832252 Government Land     1
## 2              Louniek SFT  WESTERN       6 206768128 Government Land     2
## 3             Lenges Ranch NORTHERN      11  28727854   Private Ranch     3
## 4        Mugie Conservancy NORTHERN      11 183191628   Private Ranch     4
## 5       Laikipia N.Reserve NORTHERN       2 166186628 Government Land     5
## 6                      P&D NORTHERN       6 173673828    Small Holder     7
## 7  Laikipia N. Conservancy  WESTERN      11 386049301   Private Ranch     8
## 8                  L. West  WESTERN       8 204385714    Small Holder     9
## 9                  Ol Malo NORTHERN      11   7568867   Private Ranch    10
## 10               Mutukanio  WESTERN       8 210504638    Small Holder    11
##    PERIMETER      ACRES  HECTARES AREA_KM2                   TENURE_211
## 1   56493.40 1828322.52 18283.225   182.83              Government Land
## 2   62937.69 2067681.27 20676.813   206.77              Government Land
## 3   24525.24  287278.54  2872.785    28.73        Privately Owned Ranch
## 4   71924.57 1831916.28 18319.163   183.19        Privately Owned Ranch
## 5   66749.15 1661866.28 16618.663   166.19              Government Land
## 6  100202.37 1736738.28 17367.383   173.67 Privately Owned Smallholding
## 7   81338.23 3860493.01 38604.930   386.05        Privately Owned Ranch
## 8   80095.26 2043857.14 20438.571   204.39 Privately Owned Smallholding
## 9   22844.52   75688.67   756.887     7.57        Privately Owned Ranch
## 10  84720.16 2105046.38 21050.464   210.50 Privately Owned Smallholding
##                  LANDUSE211 CONSERV ANY_TRISM CONSER_VAL
## 1  Pastoralist Grazing Area       0         0          1
## 2  Pastoralist Grazing Area       0         0          2
## 3                  Ranching       0         0          2
## 4       Ranching & Wildlife       1         1          1
## 5  Pastoralist Grazing Area       0         0          2
## 6  Pastoralist Grazing Area       0         0          2
## 7             Wildlife Only       1         1          1
## 8  Pastoralist Grazing Area       0         0          2
## 9       Ranching & Wildlife       1         1          1
## 10 Pastoralist Grazing Area       0         0          2
##                 CONS_VAL_T CONS_INV            LU
## 1       Core wildlife area  Passive     Pastorial
## 2  Secondary wildlife area  Passive     Pastorial
## 3  Secondary wildlife area  Passive Mixed Farming
## 4       Core wildlife area   Active   ProWildlife
## 5  Secondary wildlife area  Passive     Pastorial
## 6  Secondary wildlife area  Passive     Pastorial
## 7       Core wildlife area   Active Wildlife Only
## 8  Secondary wildlife area  Passive     Pastorial
## 9       Core wildlife area   Active   ProWildlife
## 10 Secondary wildlife area  Passive     Pastorial
##                          geometry
## 1  MULTIPOLYGON (((240602.1 83...
## 2  MULTIPOLYGON (((211302.1 79...
## 3  MULTIPOLYGON (((236329.8 87...
## 4  MULTIPOLYGON (((235449.3 87...
## 5  MULTIPOLYGON (((259130.1 88...
## 6  MULTIPOLYGON (((245758.4 79...
## 7  MULTIPOLYGON (((220833.5 74...
## 8  MULTIPOLYGON (((225684.3 72...
## 9  MULTIPOLYGON (((266006.3 69...
## 10 MULTIPOLYGON (((228857.2 60...

Make a static map for one year

tmap_mode(mode="plot")
## tmap mode set to plotting
tm_shape(prop) + tm_polygons("LU", palette = "Greys", border.col = "white", title="Land use") +
  tm_legend(legend.show=T) +
  tm_layout(legend.outside = T, scale = 1.3, title = 2016, title.size = 2, frame = F, legend.position = c("left","center")) +
  tm_scale_bar(position = c("left","bottom"), text.size = 0.5) +
  tm_compass (position = c("left","top")) +
  tm_shape(SR) + tm_bubbles(size = "2016", col= "2016", border.col="white", palette = RdYlGn (6), scale = 1.1, size.max = 13, title.size = "Sp richness", style = "fixed", breaks = c(0,0.990001,3.90001,5.90001,7.90001,9.90001,13), title.col = "Sp richness", legend.col.show = FALSE, legend.size.show = FALSE) +
  tm_add_legend("symbol", title = "Sp. richness",  col = RdYlGn (6), size = c(0.4,0.5,0.6,0.7,0.8,0.9), labels = c("0", "1-4","4-6","6-8","8-10","10-13"), border.col="white")

Re-arrange the dataframe so we can use the tm_facets function:

# Rename the yearly richness columns to a common name ('SR') so they can be stacked
colnames(SR)[3:10]<-c('SR')
Year <- c(2001,2004,2006,2008,2010,2012,2015,2016)

# Stack the eight yearly columns into one long-format object with a Year field
SR2 <- SR[,3]
SR2$Year <- Year[1]
for (i in 2:8){
  temp <- SR[,2+i]
  temp$Year <- Year[i]
  SR2 <- rbind(SR2,temp)
}

Create the dynamic map:

map <- tm_shape(prop) + tm_polygons("LU", palette = "Greys", border.col = "white", title="Land use") +
  tm_legend(legend.show=T) +
  tm_layout(legend.outside = T, scale = 1.3, frame = F, legend.position = c("left","center")) +
  tm_scale_bar(position = c("left","bottom"), text.size = 0.5) +
  tm_compass (position = c("left","top")) +
  tm_shape(SR2) + tm_bubbles(size = "SR", col= "SR", border.col="white", palette = RdYlGn (6), scale = 1.1, size.max = 13, title.size = "Sp richness", style = "fixed", breaks = c(0,0.990001,3.90001,5.90001,7.90001,9.90001,13), title.col = "Sp richness", legend.col.show = FALSE, legend.size.show = FALSE) +
  tm_add_legend("symbol", title = "Sp. richness", col = RdYlGn (6), size = c(0.4,0.5,0.6,0.7,0.8,0.9), labels = c("0", "1-4","4-6","6-8","8-10","10-13"), border.col="white") +
  tm_facets(along = "Year", free.coords = FALSE)


tmap_animation(map, filename = "map_anim.gif", delay = 50)
## Creating frames
## Creating animation
## Animation saved to /Users/ramirocrego/Documents/GitHub/R-Tutorials/map_anim.gif

6 Case study 1: Species Occurrence and Habitat Selection

6.1 Introduction

Addax (Addax nasomaculatus) are considered one of the rarest antelopes on earth, with an estimated population of <100 individuals in the wild. We assessed the distribution and occurrence of this species from field surveys collected by the Sahara Conservation Fund over a 7-year period (2008-2014). Our results provide insight into the factors contributing to species occurrence and are guiding field surveys to areas that have the potential to support small and geographically isolated populations. We incorporated field-derived variables of vegetation cover with remote sensing measures of vegetation productivity (NDVI - Normalized Difference Vegetation Index) and surface roughness (derived from SRTM - Shuttle Radar Topography Mission). Models were fit in a generalized linear regression framework to evaluate and predict addax occurrence.

Addax moving across shifting sands in the Tin Toumma desert, Niger (Photo: T.Rabeil, SCF)

The following analysis follows steps detailed in:

Stabach, J.A., T. Rabeil, V. Turmine, T. Wacher, T. Mueller, and P. Leimgruber. 2017. On the brink of extinction - Habitat selection of addax and dorcas gazelle across the Tin Toumma desert, Niger. Diversity and Distributions 23:581-591.

A link to the repository of the analysis is here.

In this case study, we will demonstrate how to combine the skills learned in GIS, remote sensing, and modeling to provide a predictive surface of addax habitat suitability. This will include:

  • Loading and processing available spatial data
  • Creating a proxy for surface roughness
  • Relating occurrence data to the remotely sensed layers (raster extraction)

6.2 Import & Process Spatial Data

One of the most difficult (and perhaps most time consuming) aspects of conducting a spatial analysis is the construction of the spatial database. Main components include locating and downloading available spatial data, mosaicking (if necessary) multiple tiles together, reprojecting the data so that all layers have the same coordinate reference system (CRS), and resampling the cell sizes to match.

Multiple sources now exist for finding available spatial data, which include EarthExplorer and Google Earth Engine. Some layers, however, can also be directly accessed within R using the geodata package (among others).

6.2.1 Load Packages

Like other exercises in R, we will first load the necessary packages to perform the analyses. Use help() for any operations that you don’t understand or that require additional information.

# Load required libraries
library(terra)
library(sf)
library(tidyverse)
library(geodata)

6.2.2 Occurrence Data

First, we will load the spatial points dataframe (our sampling point locations). This is the occurrence dataset collected by partners in Niger. In this case, I have saved the file as a shapefile (Lat/Long, WGS84) that we will load directly into R. This file could just as easily be loaded as a .csv and then converted to a spatial object by defining the projection.
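
For reference, here is a minimal sketch of that .csv route, assuming a hypothetical file with Longitude and Latitude columns recorded in WGS84 (EPSG:4326):

#occ.df <- read.csv("data/Occurrence_dd.csv")                               # hypothetical .csv version of the same data
#Addax <- st_as_sf(occ.df, coords = c("Longitude", "Latitude"), crs = 4326) # build the sf object and define the geographic CRS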

# Load the point data and project to UTM32N
Addax <- st_read("data/Occurrence_dd.shp")
## Reading layer `Occurrence_dd' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/Occurrence_dd.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 661 features and 12 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 11.71088 ymin: 16.01946 xmax: 12.27343 ymax: 16.95161
## Geodetic CRS:  WGS 84
# Project to UTM32N, WGS84
# See: https://spatialreference.org/
UTM32N.proj <- "epsg:32632"
Addax <- st_transform(Addax, crs = UTM32N.proj) # Note that I am overwriting the original Lat/Long object

# Look at the file
Addax
## Simple feature collection with 661 features and 12 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 789521 ymin: 1773472 xmax: 850100 ymax: 1876866
## Projected CRS: WGS 84 / UTM zone 32N
## First 10 features:
##    Pt_ID       Date YearMonth Id_1        X       Y Addax Dorcas Cornul Stipa1
## 1     13 2008-12-06    200812   23 825358.1 1773472     0      3      0      1
## 2     14 2008-12-06    200812   23 830041.8 1776055     0      5      0      0
## 3     15 2008-12-06    200812   23 835354.5 1779007     0      7      0      0
## 4     17 2008-12-06    200812   23 840739.0 1781912     0      2      0      1
## 5     19 2008-12-06    200812   23 845644.1 1784587     0      1      0      0
## 6     21 2008-12-06    200812   23 850100.0 1787353     0     11      0      0
## 7     16 2008-12-06    200812   12 800058.8 1780978     1      0      0      0
## 8     18 2008-12-06    200812   12 805058.9 1783687     2      0      0      1
## 9     20 2008-12-06    200812   12 809524.6 1786051     5      1      0      0
## 10    22 2008-12-06    200812   12 814934.7 1788977     0      0      0      1
##    Stipa2 Human                 geometry
## 1       0     0 POINT (825358.1 1773472)
## 2       0     0 POINT (830041.8 1776055)
## 3       0     1 POINT (835354.5 1779007)
## 4       0     0   POINT (840739 1781912)
## 5       0     1 POINT (845644.1 1784587)
## 6       1     0   POINT (850100 1787353)
## 7       0     0 POINT (800058.8 1780978)
## 8       0     0 POINT (805058.9 1783687)
## 9       1     0 POINT (809524.6 1786051)
## 10      0     2 POINT (814934.7 1788977)

6.2.3 Raster Data

R has a number of options for directly importing remotely sensed data into the console. Here, we will import a NDVI data layer (a .tif file) that I downloaded from EarthExplorer. This layer is located in your data directory and has already been re-projected to UTM 32N WGS84 and clipped to the study area. We will also use functions in the terra package to load a 90-m elevation dataset. Both layers must exactly match each other (same CRS, extent, grid cell size) in order to predict with the rasters.

# NDVI - Normalized Difference Vegetation Index (250-m)
# ********************************************
# ********************************************
# Note the file is already projected to UTM32N WGS84
ndvi <- rast("data/MOD13Q1_Nov2017.tif")
ndvi <- ndvi * 0.0001 # Convert to decimal values (This is done to reduce the file size)
ndvi
## class       : SpatRaster 
## dimensions  : 421, 376, 1  (nrow, ncol, nlyr)
## resolution  : 250, 250  (x, y)
## extent      : 760705.1, 854705.1, 1771859, 1877109  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632) 
## source(s)   : memory
## varname     : MOD13Q1_Nov2017 
## name        : MOD13Q1_Nov2017 
## min value   :          0.0324 
## max value   :          0.1276
# SRTM - Elevation data from the Shuttle Radar Topography Mission (30-m)
# ********************************************
# ********************************************
srtm <- elevation_3s(lon=12, lat=16.5, path = tempdir()) # Inputting general coordinates to grab tile
srtm
## class       : SpatRaster 
## dimensions  : 6000, 6000, 1  (nrow, ncol, nlyr)
## resolution  : 0.0008333333, 0.0008333333  (x, y)
## extent      : 10, 15, 15, 20  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
## source      : srtm_39_09.tif 
## name        : srtm_39_09
# Clip (crop) the raster to the study area extent. This will help with processing the image
# Read in a study area boundary (also in Lat/Long).
SA <- st_read("data/StudyArea.shp")
## Reading layer `StudyArea' from data source 
##   `/Users/ramirocrego/Documents/GitHub/R-Tutorials/data/StudyArea.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 11.44643 ymin: 16.0124 xmax: 12.31479 ymax: 16.95185
## Geodetic CRS:  WGS 84
plot(srtm)
plot(SA, add=T)

# Crop the raster to the study area extent
# You could also manually set the extent, if inclined to do so (SA <- ext(11, 13, 16, 17))
srtm <- crop(srtm,SA)

# Note: In many cases, you might need to download multiple tiles to cover your study area.  To put them together, you'd need to use the mosaic function.  Be careful, rasters can get large.
#srtm1 <- elevation_3s(lon = 12, lat = 16.5, path = tempdir())
#srtm2 <- elevation_3s(lon = 15, lat = 16.5, path = tempdir())
#srtm.mosaic <- mosaic(srtm1,srtm2, fun=mean) # Using mean for overlapping region

# Global Administrative Boundaries (GADM):
# Sometimes very useful to verify you are working where you think you are
Niger <- world(resolution=1, level=0, path = tempdir()) |> st_as_sf() |> filter(NAME_0 == "Niger")
Niger <- st_transform(Niger, crs = 4326)
# Plot together - Overlapping where I think they should!
plot(Niger[1])
plot(srtm, add=T)

6.2.4 Raster Project & Resample

You’ll note that the projection of the NDVI layer differs from that of the SRTM layer. We need to project the SRTM layer to UTM 32N WGS84, making sure the extents, cell sizes (essential for the final prediction), and grids all match. We’ll use bilinear interpolation because the data are continuous. For a categorical image (e.g., a land-cover classification), use nearest neighbor. We also now have a choice to make about the resolution at which to conduct our analyses. What resolution should we use?
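
For comparison, a categorical layer (say, a hypothetical land-cover raster named landcov) would be projected with nearest-neighbour resampling so class values are not averaged:

#landcov.utm <- project(landcov, "epsg:32632", method = "near") # 'near' preserves discrete class values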

# Project the srtm image, setting the resolution
srtm.utm <- project(srtm, "epsg:32632", method = 'bilinear') # This can be a slow process, depending on the size of the raster

# Resample the srtm image to the ndvi image
# The native resolution of our analysis will then be 250m - This will help to speed up the processing, but means we are losing some of the fine-scale information contained in the 90m elevation model which might be important
srtm.250m <- resample(srtm.utm, ndvi, "bilinear")

# Print both rasters and make sure they have the same extent, rows/columns, projection, and resolution
ndvi
## class       : SpatRaster 
## dimensions  : 421, 376, 1  (nrow, ncol, nlyr)
## resolution  : 250, 250  (x, y)
## extent      : 760705.1, 854705.1, 1771859, 1877109  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632) 
## source(s)   : memory
## varname     : MOD13Q1_Nov2017 
## name        : MOD13Q1_Nov2017 
## min value   :          0.0324 
## max value   :          0.1276
srtm.250m
## class       : SpatRaster 
## dimensions  : 421, 376, 1  (nrow, ncol, nlyr)
## resolution  : 250, 250  (x, y)
## extent      : 760705.1, 854705.1, 1771859, 1877109  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632) 
## source(s)   : memory
## varname     : MOD13Q1_Nov2017 
## name        : srtm_39_09 
## min value   :   372.0851 
## max value   :   586.0993
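
Rather than relying on a visual comparison of the printouts, the match can also be checked programmatically; a minimal sketch using terra's compareGeom():

#compareGeom(ndvi, srtm.250m)  # returns TRUE when the rasters share the same geometry (CRS, extent, rows/columns); errors otherwise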

Questions:

  1. Are these files projected? How would you check?
  2. What is the R function for re-projecting raster and/or spatial point data files if you needed to?
  3. What should you always do when importing spatial data (also good practice for any data you load into R)?

Now that we have projected and processed the spatial data, let’s make a plot of all the data to make sure everything looks good. Data are located where I expect them to be. Amazing!

# Plot the result (order matters)
plot(srtm.250m)
plot(Addax,pch=15,cex=0.7,add=TRUE)

6.2.5 Generate Terrain Variables

We assumed that addax would select inter-dunal depressions. Instead of including the raw SRTM (elevation) data in our models, we created a variable derived from the elevation dataset, termed Surface Roughness. This variable is the change in elevation range among neighboring pixels. That is, it is the difference between the minimum and maximum values of a cell and its eight surrounding neighbors. Pixels with similar neighborhood values have a low change in elevation range (e.g., flat areas), whereas highly variable neighborhoods have a large change in elevation range (e.g., steep slopes) (Wilson et al. 2007).

Terrain variables can be easily derived from digital elevation models. Variables that can be calculated include slope, aspect, Terrain Position Index (TPI), Terrain Ruggedness Index (TRI), and surface roughness. Some of these variables are likely to be highly correlated. Use help(terrain) for additional information. In this example, we will calculate surface roughness.

rough <- terrain(srtm.250m, v='roughness')
plot(rough)
plot(Addax,pch=15,cex=0.7,add=TRUE)
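
To make the definition above concrete, roughness can be reproduced (edge cells aside) with two moving-window (focal) operations; a minimal sketch:

# Roughness as the range of each 3 x 3 neighbourhood (a cell and its eight neighbours)
#nb.max <- focal(srtm.250m, w = 3, fun = max, na.rm = TRUE)
#nb.min <- focal(srtm.250m, w = 3, fun = min, na.rm = TRUE)
#rough.check <- nb.max - nb.min   # should closely match terrain(srtm.250m, v = 'roughness')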

6.2.6 Raster Extraction

The extract command from the terra package is an extremely useful and powerful function for joining spatial objects. Here, we will extract surface roughness and NDVI at each of the occurrence locations. If we stack the layers together, we can extract all the rasters with a single command. See help(extract) for additional information about this function. Note that we call the function as terra::extract() because other loaded packages (e.g., tidyr in the tidyverse) also define an extract() function.
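
A minimal sketch of that stacked, single-command approach (below we extract the layers one at a time):

#covs <- c(ndvi, rough)                     # combine layers into one multi-layer SpatRaster
#cov.point <- terra::extract(covs, Addax)   # one row per point; one column per layer (plus an ID column)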

# Extract NDVI and rough  
ndvi.point <- terra::extract(ndvi,Addax)
rough.point <- terra::extract(rough,Addax)

# There are some NAs, so we will fill them with the mean of all other values. Note this happens because the NDVI layer used here is not the original layer used in the actual analysis, which matched the MODIS images to the time each transect was conducted.
ndvi.point$MOD13Q1_Nov2017 <- ifelse(is.na(ndvi.point$MOD13Q1_Nov2017), mean(ndvi.point$MOD13Q1_Nov2017, na.rm = T),ndvi.point$MOD13Q1_Nov2017)
rough.point$roughness <- ifelse(is.na(rough.point$roughness), mean(rough.point$roughness, na.rm = T),rough.point$roughness)

# Append the extracted values to the spatial object
Addax <- data.frame(Addax, ndvi=ndvi.point$MOD13Q1_Nov2017,
                    rough=rough.point$roughness) 

# Look at the data
head(Addax)
##   Pt_ID       Date YearMonth Id_1        X       Y Addax Dorcas Cornul Stipa1
## 1    13 2008-12-06    200812   23 825358.1 1773472     0      3      0      1
## 2    14 2008-12-06    200812   23 830041.8 1776055     0      5      0      0
## 3    15 2008-12-06    200812   23 835354.5 1779007     0      7      0      0
## 4    17 2008-12-06    200812   23 840739.0 1781912     0      2      0      1
## 5    19 2008-12-06    200812   23 845644.1 1784587     0      1      0      0
## 6    21 2008-12-06    200812   23 850100.0 1787353     0     11      0      0
##   Stipa2 Human                 geometry   ndvi     rough
## 1      0     0 POINT (825358.1 1773472) 0.0956 16.258545
## 2      0     0 POINT (830041.8 1776055) 0.0882 10.016754
## 3      0     1 POINT (835354.5 1779007) 0.0854  9.454865
## 4      0     0   POINT (840739 1781912) 0.0920 19.360474
## 5      0     1 POINT (845644.1 1784587) 0.0938 15.377411
## 6      1     0   POINT (850100 1787353) 0.0878 20.611420

Note: We have simplified things a bit here because we have a time series of points and therefore, we must download and extract a time series of NDVI. Each of the dates need to be aligned with their corresponding NDVI image. In the next exercise, the dataset provided will already have this step completed so that we can focus on the modeling.

6.3 Fitting a GLM

Now that we have extracted the data, we will walk through the steps of fitting a logistic regression model for the probability of addax occurrence.

We now have the occurrence dataset with all associated variables.

Animal counts (addax, but also dorcas gazelle) were summarized at plot locations. We re-coded all counts within a 2.5-km radius to a measure of occurrence (i.e., presence/absence). Thus, we modeled the data as a series of 1’s and 0’s, representing addax occurrence at plot locations. Data were aggregated in this fashion because of variability between surveys (i.e., the transect locations didn’t overlap exactly) and because we did not have confidence in the accuracy of the number of individuals recorded at each sighting. In addition, distances to animal sighting locations were only recorded in a subset of the surveys. Sightings >500 m from the transect were removed due to an assumed undercounting bias (confirmed by investigating the frequency of sightings in relation to distance). This allowed for a conservative, broad-scale approach to incorporate extremely messy field data collected over multiple years. See more details in Stabach et al. 2017.

In this section of the analysis we will:

  • Summarize the occurrence dataset
  • Scale/center continuous predictor variables
  • Assess potential collinearity between variables
  • Apply logistic regression to model the probability of occurrence of Addax
  • Graph response curves and interpret coefficients
  • Validate the result
  • Make a predictive surface

6.3.1 Load extra packages

As done previously, we remove objects in R’s memory and load the necessary packages to perform the analyses. Please use help() for any operations that you don’t understand or that require additional information. Knowing what to search for in Google is essential. I usually start any query with “r-cran” (e.g., “r-cran Import shapefile”) to limit the search terms to R. Note: There are quite a few additional packages to install (‘install.packages(“package name”)’) in this exercise.

# Load required libraries
library(lubridate)
library(pROC)
library(tmap)
library(visreg)
library(coefplot)
library(usdm)

6.3.2 Summarize the Occurrence Dataset

As a first step, we will summarize the occurrence dataset and visualize some of the patterns.

6.3.3 Data Cleaning

Columns in the Addax dataframe include Date of survey, X and Y location of each plot, the number of addax and dorcas gazelle sighted at each location, a unique plot ID (Pt_ID), and the presence/absence of vegetation species Cornulaca monocantha (Cornul), Stipagrostis acutiflora (Stipa1), and Stipagrostis vulnerans (Stipa2). These vegetation species were thought a priori to influence addax occurrence and were collected at each plot location. Human disturbance (e.g., footprints, sightings, tire tracks) was also recorded (i.e., Human = 1). You’ll also note that the remote sensing data that we extracted, surface roughness (Rough) and the Normalized Difference Vegetation Index (NDVI), are also included.

We need to add a few things to the dataframe, which most importantly includes re-coding the abundance records to a measure of occurrence. This was done because of a lack of confidence in the exact counts collected. Here, we create a Month, Year, and Season field, correct some of the field types, and re-code the occurrence.

# Create month, year and season fields
Occ <- Addax %>% st_drop_geometry() %>% mutate(
  Month = month(Date),
  Year = year(Date),
  Season = case_when(
    Month >=3 & Month <= 6 ~ "Dry",
    Month >=7 & Month <=10 ~ "Wet",
    Month >=11 & Month <=12 ~ "Cold",
    TRUE ~ "Fix Problem"
  )
)

# Case_when is the same as:
#Occ$Season <- ifelse(Occ$Month >=3 & Occ$Month <=6, "Dry",
#                          ifelse(Occ$Month >=7 & Occ$Month <=10, "Wet",
#                                 ifelse(Occ$Month >=11 & Occ$Month <=12, "Cold","Fix Problem")
#                          ))

# We could easily include the following in the piping above.
# Recode Occurrence  
Occ$obsAddax <- ifelse(Occ$Addax > 0, 1, 0)
Occ$obsDorcas <- ifelse(Occ$Dorcas > 0, 1, 0)

# Add some summaries
Occ$SumBoth <- Occ$obsAddax + Occ$obsDorcas
Occ$BothPresent <- ifelse(Occ$SumBoth > 1, 1, 0) # Both species present/occurred
Occ$OnePresent <- ifelse(Occ$SumBoth > 0, 1, 0) # At least one species present/occurred

# Correct the data types
# Individually, this would be:
Occ$Cornul <- as.factor(Occ$Cornul)
Occ$Stipa1 <- as.factor(Occ$Stipa1)
Occ$Stipa2 <- as.factor(Occ$Stipa2)
Occ$Season <- as.factor(Occ$Season)
Occ$Year <- as.factor(Occ$Year)
# More succinct

# cols <- c("Cornul", "Stipa1", "Stipa2", "Season", "Year")
# Occ[cols] <- lapply(Occ[cols], as.factor)
# str(Occ)

6.3.4 Data Summary

When we summarize the dataset, we see that the data were collected multiple times a year and that the occurrence of addax (and dorcas gazelle) varies among years and seasons.

# Summarize
Occ.Summary <- Occ %>%
  group_by(Year, Month, Season) %>%
  summarize(PresAddax = sum(obsAddax),
            PresDorc = sum(obsDorcas),
            PrevAdd= round(sum(obsAddax)/length(obsAddax)*100,digits=1),
            PrevDorc = round(sum(obsDorcas)/length(obsAddax)*100,digits=1),
            One = round(sum(OnePresent)/length(obsAddax)*100,digits=1) ,
            Both = round(sum(BothPresent)/length(obsAddax)*100,digits=1)
            )

# Look at the result
Occ.Summary <- as.data.frame(Occ.Summary)
Occ.Summary
##    Year Month Season PresAddax PresDorc PrevAdd PrevDorc  One Both
## 1  2008    12   Cold        21       14    43.8     29.2 62.5 10.4
## 2  2009     6    Dry         6        4    12.5      8.3 14.6  6.2
## 3  2009     9    Wet         8       11    16.7     22.9 37.5  2.1
## 4  2009    12   Cold        16       23    33.3     47.9 66.7 14.6
## 5  2010     4    Dry         4       11     8.3     22.9 25.0  6.2
## 6  2010     6    Dry         6       10    12.5     20.8 33.3  0.0
## 7  2010    10    Wet         3        4     6.2      8.3 14.6  0.0
## 8  2010    12   Cold         6       14    12.0     28.0 36.0  4.0
## 9  2011     3    Dry         6        4    25.0     16.7 25.0 16.7
## 10 2011     7    Wet         6        6    20.0     20.0 26.7 13.3
## 11 2012    12   Cold         6        8    15.0     20.0 27.5  7.5
## 12 2013    11   Cold        16       31    33.3     64.6 75.0 22.9
## 13 2014     3    Dry         3        8     7.5     20.0 22.5  5.0
## 14 2014     9    Wet         4        5    10.0     12.5 22.5  0.0
## 15 2014    12   Cold         1       31     1.9     58.5 58.5  1.9
# If you want, write to a file
#write.csv(Occ.Summary, file="Addax_Dorcas_Prevalence.csv", quote = FALSE, row.names = FALSE)

Let’s look at the data one last time before moving on to the modelling.

# Plot
plot(Occ$X,Occ$Y,xlab = "Easting", ylab = "Northing", main = "Plot Locations", frame = FALSE, pch = ".", col="red", cex = 5, asp=1)

Question:

  1. The field team didn’t visit the same plots every year or month. How could you programmatically and efficiently view the plots that were surveyed in each field season? One possible approach is sketched below.
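
One possible approach, sketched with ggplot2 (loaded with the tidyverse), is to facet the plot locations by survey year and season:

#ggplot(Occ, aes(x = X, y = Y)) +
#  geom_point(colour = "red", size = 0.5) +
#  facet_wrap(~ Year + Season) +   # one panel per field season
#  coord_equal() +
#  labs(x = "Easting", y = "Northing")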

6.3.5 Scale Continuous Variables

It is often helpful and necessary to scale (\(x_i' = (x_i - \bar{x}) / sd(x)\)) continuous predictor variables that have vastly different value ranges (e.g., elevation and NDVI). See the scale function. Doing so can help with model convergence and makes coefficients comparable. While the relationships between your dependent variable and each independent variable will remain the same, it is important to remember that the scaled data are not on the same scale as the original values, so any raster layers used for prediction must be transformed with the same mean and standard deviation. This is critical.

# Scale the continuous variables that we'll include in the modelling
Occ <- Occ %>% mutate(
  sHuman = as.numeric(scale(Human)), # A bit ugly, but without as.numeric() the scaled variables can be difficult to plot
  sndvi = as.numeric(scale(ndvi)),
  srough = as.numeric(scale(rough)),
  sDorcas = as.numeric(scale(Dorcas)),
  sAddax = as.numeric(scale(Addax))
)

# The default scale function simply does the following:
#CenterScale <- function(x){
#  (x-mean(x))/sd(x)
#}

#sndvi2 <- CenterScale(Occ$ndvi)

# These two scaled ndvi vectors should be exactly the same:
#cor(Occ$sndvi,sndvi2)
#summary(Occ$sndvi)
#summary(sndvi2)

# Let's also scale our raster layers using the mean and standard deviation of the point data, so we don't forget to do it later when making predictions from our model
s.ndvi <- (ndvi - mean(Occ$ndvi)) / sd(Occ$ndvi)
s.rough <- (rough - mean(Occ$rough)) / sd(Occ$rough)

par(mfrow=c(2,2))
# You should notice that the rasters look exactly the same, but their values have changed.  They are now on a much more similar scale.
plot(ndvi, main = "Non-Scaled NDVI");plot(s.ndvi, main = "Scaled NDVI");plot(rough, main = "Non-Scaled Rough");plot(s.rough, main = "Scaled Rough")

par(mfrow=c(1,1)) # Returning plotting window to normal

6.3.6 Investigate Collinearity

Like any other analysis, we need to evaluate redundancy between our predictor variables to make sure they are sufficiently independent. Otherwise, we may be unable to reliably interpret our model coefficients. We will conduct a Variance Inflation Factor (VIF) analysis, currently a preferred statistical method for assessing collinearity. Any variable with a VIF > 3 is a potential cause for concern. Are any of the continuous variables included in our analysis highly collinear?

Question:

  1. What do we do with categorical variables?
# Assess the continuous variables we'll include in our analysis
vifstep(Occ[,24:28]) # Including scaled Human Presence, NDVI, roughness, and Dorcas/Addax
## No variable from the 5 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( sAddax ~ sDorcas ):  0.01051778 
## max correlation ( srough ~ sndvi ):  -0.2823358 
## 
## ---------- VIFs of the remained variables -------- 
##   Variables      VIF
## 1    sHuman 1.010645
## 2     sndvi 1.120611
## 3    srough 1.097381
## 4   sDorcas 1.044630
## 5    sAddax 1.010735
# You could also use the cor function to investigate correlation.
# What is a reasonable correlation threshold to use |0.65|??
#cor(Occ[,22:26])

6.3.7 Generalized Linear Model (GLM)

Model the occurrence of addax in a Generalized Linear Model (GLM) framework. We expected non-linear relationships with surface roughness and NDVI, so we include these terms as quadratic effects in the model. Our goal here was not necessarily to create the very best model. Instead, we aimed to:

  1. Identify the relationships between all reasonable predictor variables and addax occurrence
  2. Evaluate a sub-model that contains only the remote sensing layers to make a prediction of habitat suitability that can be extrapolated across the landscape

Question:

  1. Why is using a GLM advantageous given the data and objectives?
# Create a full model with all the variables you think are important predictors of addax occurrence
glm.Addax <- glm(obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2) + sHuman + sDorcas + Stipa1 + Stipa2 + Cornul + Season + Year, 
                 data = Occ, 
                 family = binomial(link="logit"))
# Summarize result
summary(glm.Addax)
## 
## Call:
## glm(formula = obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2) + 
##     sHuman + sDorcas + Stipa1 + Stipa2 + Cornul + Season + Year, 
##     family = binomial(link = "logit"), data = Occ)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.95886    0.42608  -2.250 0.024421 *  
## srough       0.06296    0.11707   0.538 0.590720    
## I(srough^2)  0.01480    0.09313   0.159 0.873701    
## sndvi       -0.19687    0.12832  -1.534 0.124958    
## I(sndvi^2)   0.03531    0.08309   0.425 0.670829    
## sHuman      -0.61539    0.24360  -2.526 0.011529 *  
## sDorcas      0.03529    0.11829   0.298 0.765467    
## Stipa11      0.19281    0.28041   0.688 0.491705    
## Stipa21      1.02921    0.28627   3.595 0.000324 ***
## Cornul1      0.29894    0.30959   0.966 0.334256    
## SeasonDry   -0.14792    0.36924  -0.401 0.688717    
## SeasonWet   -0.29818    0.37409  -0.797 0.425401    
## Year2009    -1.09962    0.45669  -2.408 0.016049 *  
## Year2010    -1.99582    0.49594  -4.024 5.71e-05 ***
## Year2011    -0.96937    0.59233  -1.637 0.101726    
## Year2012    -2.35473    0.58799  -4.005 6.21e-05 ***
## Year2013    -0.74979    0.47051  -1.594 0.111034    
## Year2014    -3.22724    0.55623  -5.802 6.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 601.51  on 660  degrees of freedom
## Residual deviance: 507.69  on 643  degrees of freedom
## AIC: 543.69
## 
## Number of Fisher Scoring iterations: 6
# Nice summary table, including model estimates and odds ratios
#tab_model(glm.Addax)

# Or print the confidence intervals
#confint(glm.Addax)

6.3.8 Graph and Interpret

Use the visreg and coefplot functions to easily graph the results from a GLM and evaluate the model output. Be careful, however, with how categorical variables are interpreted. Visreg does the hard part of generating a predicted response while holding all other variables constant.

# Plot the coefficients
coefplot(glm.Addax,
        plot=TRUE,
        mar=c(1,4,5.1,2),
        intercept=FALSE,
        vertical=TRUE,
        main="",
        var.las=1,
        frame.plot=FALSE)

# Reset plotting
par(mar=c(5,4,4,2)+0.1) 

# Graph result for all variables
#visreg(glm.Addax, scale="response", ylab="Prob", partial=TRUE)

# Or just one variable at a time
par(mfrow=c(1,2))
visreg(glm.Addax,"srough",
       scale="response", 
       ylab="Probability of Occurrence", 
       xlab="Surface Roughness",
       partial=TRUE, 
       line=list(col="blue"), 
       fill=list(col="gray"),
       points=list(col="black",cex=0.25,pch=19),
       ylim=c(0,1))
visreg(glm.Addax,"sndvi",
       scale="response", 
       ylab="Probability of Occurrence",
       xlab="NDVI",
       partial=TRUE, 
       line=list(col="blue"), 
       fill=list(col="gray"),
       points=list(col="black",cex=0.25,pch=19),
       ylim=c(0,1))

par(mfrow=c(1,1))

6.3.9 Model Validation

Binomial data can be notoriously difficult to validate, at least when compared to the standard tools used for linear regression. Ramiro discussed a few techniques to assess model assumptions, including ways to assess a model’s predictive power. Here we will calculate the Area Under the Curve (AUC) to assess predictive power. The AUC summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) across classification thresholds.

Another (and often the best) option is to validate the model with an independent dataset.

Some guidelines for AUC:

  • 0.9 - 1: Excellent (A)
  • 0.8 - 0.9: Good (B)
  • 0.7 - 0.8: Fair (C)
  • 0.6 - 0.7: Poor (D)
  • 0.5 - 0.6: Fail (F)
# Evaluate deviance residuals
# No strong evidence of lack of fit.  Most residuals are around a value of 0.
devresid <- resid(glm.Addax, type = "deviance")
hist(devresid)

# Calculate AUC
predpr <- predict(glm.Addax, type=c("response"))
(roccurve <- roc(Occ$obsAddax ~ predpr))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.formula(formula = Occ$obsAddax ~ predpr)
## 
## Data: predpr in 549 controls (Occ$obsAddax 0) < 112 cases (Occ$obsAddax 1).
## Area under the curve: 0.77
plot(roccurve, main="AUC")

6.3.10 Make a Predictive Surface

One of the most valuable parts of a species distribution model is predicting to locations where surveys were not performed. In order to make a prediction at these locations, we need predictor data that has wall-to-wall coverage. Unfortunately, only two of our layers incorporated in the full model have full coverage (NDVI and Surface Roughness).

Create a model with these two layers, assess how the model compares with the full model and predict across the entire study area. As you will see, model statistics indicate that this sub-model is not as good as the full model (compare the AIC, AUC).

The model, however, can still be useful, as long as we are clear about its shortcomings (e.g., we’d expect the predictive power to be decreased since we are not including the fine scale data collected at individual plot locations).

6.3.11 Create Model Subset

glm.Addax2 <- glm(obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2), 
                  data = Occ, 
                  family = binomial(link="logit"))

# Summarize and print confidence intervals
summary(glm.Addax2)
## 
## Call:
## glm(formula = obsAddax ~ srough + I(srough^2) + sndvi + I(sndvi^2), 
##     family = binomial(link = "logit"), data = Occ)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.63473    0.15623 -10.464   <2e-16 ***
## srough       0.07565    0.10696   0.707    0.479    
## I(srough^2)  0.04945    0.08254   0.599    0.549    
## sndvi       -0.09295    0.11269  -0.825    0.409    
## I(sndvi^2)  -0.01304    0.07575  -0.172    0.863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 601.51  on 660  degrees of freedom
## Residual deviance: 599.00  on 656  degrees of freedom
## AIC: 609
## 
## Number of Fisher Scoring iterations: 4
# View model output with odds ratios
#tab_model(glm.Addax2)

# Calculate AUC
predpr <- predict(glm.Addax2, type=c("response"))
(roccurve <- roc(Occ$obsAddax ~ predpr))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.formula(formula = Occ$obsAddax ~ predpr)
## 
## Data: predpr in 549 controls (Occ$obsAddax 0) < 112 cases (Occ$obsAddax 1).
## Area under the curve: 0.5567
plot(roccurve) # Not great
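
A quick way to put the two models side by side is to compare their AIC values directly (lower is better):

#AIC(glm.Addax, glm.Addax2)   # the full model has the lower AIC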

Question:

  1. Is our reduced model as good as our full model?

6.3.12 Raster Prediction

We can then use the predict command to apply the model coefficients to the raster layers and predict across the entire study area.

# We could physically calculate the prediction from the model coefficients:
#coef <- summary(glm.Addax2)
#coef <- coef$coefficients
#coef

#Addax.predict <- (exp(coef[1] + s.rough*coef[2] + s.rough^2*coef[3] + s.ndvi*coef[4] + s.ndvi^2*coef[5])/(1 + exp(coef[1] + s.rough*coef[2] + s.rough^2*coef[3] + s.ndvi*coef[4] + s.ndvi^2*coef[5])))

# Add to a stack of rasters and rename layer names.
satImage <- c(s.rough, s.ndvi)
names(satImage) <- c("srough", "sndvi")

# Predict and export image to directory
Addax.predict <- predict(satImage, glm.Addax2, type="response", progress='text')

# Plot result
plot(Addax.predict) # Not a great prediction, with mostly low values, but it highlights some important areas.
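
If you want to keep the predicted surface for use outside of R, it can be written to disk; a minimal sketch with a hypothetical file name:

#writeRaster(Addax.predict, "AddaxPrediction.tif", overwrite = TRUE)   # GeoTIFF written to the working directory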

6.3.13 Interactive Mapping

Lastly, we can take this one step further and plot our predicted surface on an interactive map, giving us a better sense of its real-world context and an easy way to share our results online.

# Load ESRI imagery baselayer
tmap_mode("view")
tm_basemap("Esri.WorldImagery") +
  tm_shape(Addax.predict, name = "Addax prediction") +
  tm_raster(palette="-inferno", n=8, alpha=0.6, 
            title = "Predicted Addax Occurrence")