Examples from R for Ecology.
library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)
pak::pkg_install("purrr")
library(purrr)
First, let’s create a sample dataset to work with. This dataset will have columns with the heights of five plants after 0, 10, and 20 days of growth. Each row represents an individual plant.
plant_data <- data.frame(
plant_id = c(1, 2, 3, 4, 5),
day_0 = c(10, 12, 9, 11, 10),
day_10 = c(15, 18, 14, 16, 15),
day_20 = c(20, 24, 19, 22, 20)
)
head(plant_data)
## plant_id day_0 day_10 day_20
## 1 1 10 15 20
## 2 2 12 18 24
## 3 3 9 14 19
## 4 4 11 16 22
## 5 5 10 15 20
apply for Row and Column CalculationsLet’s practice the apply function. We will calculate the
mean height of each plant across the three time points by setting
MARGIN = 1. The result will contain an average for each
individual, and the original columns will effectively disappear. The
same result can be accomplished with the rowMeans()
function, which can be more intuitive and easier to use.
# Calculating the mean height for each row in the data frame
row_means <- apply(plant_data[, 2:4], MARGIN = 1, mean)
# View row means
row_means
## [1] 15.00000 18.00000 14.00000 16.33333 15.00000
Next, let’s use apply to calculate the variance within
each column of the dataset. By setting MARGIN = 2 and
setting out function to var, we will create a new vector
with the variance of each column. The result will only have one row and
will retain the original column names.
# Calculating the variance for each column in the data frame
col_vars <- apply(plant_data[, 2:4], 2, var) # the MARGIN parameter name can be left out if the arguments are in the correct order
# View the column variances
col_vars
## day_0 day_10 day_20
## 1.3 2.3 4.0
Finally, apply can be used to perform your own custom
functions. In this example, we will calculate the range of each plant
heights by subtracting the minimum height from the maximum height. The
result will be a vector with the range of plant heights for each time
point.
# Creating the range function
get_range <- function(x) {
return(max(x) - min(x))
}
# Calculating the range of plant heights for each time point
height_ranges <- apply(plant_data[, 2:4], 2, get_range)
# View the plant height ranges
height_ranges
## day_0 day_10 day_20
## 3 4 5
tapply for Grouped CalculationsRemember, tapply is the old school
group_by. tapply is used to apply a function
to subsets of a vector based on a factor. In this example, we will
calculate the mean height of each plant across time points.
# First, pivot the data into long format so that we have a factor (plant_id) to work with
plant_data_long <- plant_data %>%
pivot_longer(cols = -plant_id, names_to = "day", values_to = "height")
# Use sub() to remove the "height_" prefix from the day column
plant_data_long$day <- sub("day_", "", plant_data_long$day)
# View the long-format data
head(plant_data_long)
## # A tibble: 6 × 3
## plant_id day height
## <dbl> <chr> <dbl>
## 1 1 0 10
## 2 1 10 15
## 3 1 20 20
## 4 2 0 12
## 5 2 10 18
## 6 2 20 24
Now we can use the plant_id factor to calculate the mean
height of each plant across time points. Notice that the result is the
same as the one obtained with apply that resulted in
row_means in the previous section. The usage depends on the
format of the data and the desired output.
# Calculate the mean height of each plant across time points
height_means <- tapply(plant_data_long$height, plant_data_long$plant_id, mean)
# View the mean heights
height_means
## 1 2 3 4 5
## 15.00000 18.00000 14.00000 16.33333 15.00000
lapply and sapply for ListsBefore we practice using lapply and sapply,
let’s create a list of vectors to work with. Each vector contains data
on different plant attributes.
# Fix the random seed so that we obtain the same results in each run
set.seed(123)
# Create a list of vectors with plant data
plant_list <- list(
height = rnorm(10, 10, 2),
weight = rnorm(10, 5, 1),
leaf_count = rpois(10, 20) # rpois generates random numbers from a Poisson distribution
)
# View the list
plant_list
## $height
## [1] 8.879049 9.539645 13.117417 10.141017 10.258575 13.430130 10.921832
## [8] 7.469878 8.626294 9.108676
##
## $weight
## [1] 6.224082 5.359814 5.400771 5.110683 4.444159 6.786913 5.497850 3.033383
## [9] 5.701356 4.527209
##
## $leaf_count
## [1] 15 18 15 17 19 14 14 18 14 24
Now, let’s use lapply to calculate the mean of each
vector in the list. The result will be a list of means for each
attribute. We could use the mean function directly, or even
use a for loop, but this example demonstrates the elegance
of lapply.
# Calculate the mean of each attribute in the list
plant_means <- lapply(plant_list, mean)
# View the means
plant_means
## $height
## [1] 10.14925
##
## $weight
## [1] 5.208622
##
## $leaf_count
## [1] 16.8
Lastly, we will use sapply to simplify the output of the
previous calculation. The result will be a vector with the means of each
attribute, which can be easier to work with than a list.
# Calculate the mean of each attribute in the list using sapply
plant_means_vector <- sapply(plant_list, mean)
# View the simplified means
plant_means_vector
## height weight leaf_count
## 10.149251 5.208622 16.800000
apply Functions in PythonLet’s re-create the original plant heights dataset in Python.
import pandas as pd
plant_data = pd.DataFrame({
'plant_id': [1, 2, 3, 4, 5],
'day_0': [10, 12, 9, 11, 10],
'day_10': [15, 18, 14, 16, 15],
'day_20': [20, 24, 19, 22, 20]
})
Instead of using apply, we will use the
axis parameter to define the direction of the operation.
Here, we will calculate the mean height of each plant across the three
time points. By setting axis = 1, we are “sqeezing” the
columns and they will essentially disappear.
# Calculate the mean height for each row in the data frame
row_means = plant_data.iloc[:, 1:].mean(axis = 1) # iloc allows direct indexing in the data frame
# View the row means
row_means
## 0 15.000000
## 1 18.000000
## 2 14.000000
## 3 16.333333
## 4 15.000000
## dtype: float64
Next, we will calculate the variance within each column of the
dataset. By setting axis = 0, we will calculate the
variance for each column.
# Calculate the variance for each column in the data frame
col_vars = plant_data.iloc[:, 1:].var(axis = 0)
# View the column variances
col_vars
## day_0 1.3
## day_10 2.3
## day_20 4.0
## dtype: float64
Python even has its own apply function that allows use
to perform custom operations. In this example, we will calculate the
range of each plant heights for each time point.
# Calculate the range of plant heights for each time point
height_ranges = plant_data.iloc[:, 1:].apply(lambda x: x.max() - x.min(), axis = 0)
# View the plant height ranges
height_ranges
## day_0 3
## day_10 4
## day_20 5
## dtype: int64
groupby for Grouped CalculationsPython has no tapply function, but we can use
groupby to perform similar operations as we learned last
week. In this example, we will calculate the mean height of each plant
across time points.
# First, pivot the data into long format so that we have a factor (plant_id) to work with
plant_data_long = plant_data.melt(id_vars = 'plant_id', var_name = 'day', value_name = 'height')
# View the long-format data
plant_data_long.head()
## plant_id day height
## 0 1 day_0 10
## 1 2 day_0 12
## 2 3 day_0 9
## 3 4 day_0 11
## 4 5 day_0 10
Now we can use the plant_id factor to calculate the mean
height of each plant across time points. Again, we get the same result
as using mean with axis = 1 in the previous
section.
# Calculate the mean height of each plant across time points
height_means = plant_data_long.groupby('plant_id')['height'].mean()
# View the mean heights
height_means
## plant_id
## 1 15.000000
## 2 18.000000
## 3 14.000000
## 4 16.333333
## 5 15.000000
## Name: height, dtype: float64
Finally, we can accomplish the same result with agg.
# Calculate the mean height of each plant across time points using agg
height_means_agg = plant_data_long.groupby('plant_id').agg({'height': 'mean'})
# View the mean heights
height_means_agg
## height
## plant_id
## 1 15.000000
## 2 18.000000
## 3 14.000000
## 4 16.333333
## 5 15.000000
map Functions from the purrr Package
in RNow let’s use a short example to demonstrate the usage of
map. map applies the same function to each
element of a list or vector. In this example, we will calculate the
square root of each element in a vector containing integers 1 through
10.
print(class(1:10))
## [1] "integer"
sqrts = map_dbl(1:10, sqrt)
sqrts
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
Next, let’s take the original vector of integers and add a constant to each element. We will define a function that accepts two arguments and adds them together.
add_nums <- function(x, y) {
return(x + y)
}
added_constant = map(1:10, add_nums, 5) # the number 5 will be added to each vector element
added_constant
## [[1]]
## [1] 6
##
## [[2]]
## [1] 7
##
## [[3]]
## [1] 8
##
## [[4]]
## [1] 9
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 11
##
## [[7]]
## [1] 12
##
## [[8]]
## [1] 13
##
## [[9]]
## [1] 14
##
## [[10]]
## [1] 15
Next, let’s take the original vector of integers and add another vector of constants to each element. We will use the same custom function as in the previous example.
added_vector = map(1:10, add_nums, 1:10)
added_vector
## [[1]]
## [1] 2 3 4 5 6 7 8 9 10 11
##
## [[2]]
## [1] 3 4 5 6 7 8 9 10 11 12
##
## [[3]]
## [1] 4 5 6 7 8 9 10 11 12 13
##
## [[4]]
## [1] 5 6 7 8 9 10 11 12 13 14
##
## [[5]]
## [1] 6 7 8 9 10 11 12 13 14 15
##
## [[6]]
## [1] 7 8 9 10 11 12 13 14 15 16
##
## [[7]]
## [1] 8 9 10 11 12 13 14 15 16 17
##
## [[8]]
## [1] 9 10 11 12 13 14 15 16 17 18
##
## [[9]]
## [1] 10 11 12 13 14 15 16 17 18 19
##
## [[10]]
## [1] 11 12 13 14 15 16 17 18 19 20
Finally, let’s take the original vector of integers and add them to another vector of integers, element-wise. Here, the lengths of the two vectors must be the same, and the function will be applied to each pair of elements.
element_wise_addition = map2(1:10, 1:10, add_nums) # map2 is needed to map over two inputs
element_wise_addition
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 12
##
## [[7]]
## [1] 14
##
## [[8]]
## [1] 16
##
## [[9]]
## [1] 18
##
## [[10]]
## [1] 20
Task 1: Calculate the mean and variance of the
displacement, horsepower, and
weight columns in the “auto-mpg.csv” dataset.
Expected Output: A data.frame with
the means and variances for each indicated column.
Task 2: Calculate the average mpg for each car manufacturer in the dataset. Which manufacturer has the highest average mpg? Note: you can treat abbreviations as unique manufacturers.
Expected Output: The name of the manufacturer with the highest average mpg.
The iris3 dataset can be loaded into R using the
following command:
data(iris3)
Task 1: Calculate the average sepal length,
sepal width, and petal length across individuals for each species in the
iris3 dataset.
Expected Output: A data.frame with
the average values for each species.
Task 2: Define a function
sepal_area that calculates the sepal area
(length * width) for each individual in the dataset. Apply
this function to the appropriate columns and create a new column to
store the values. Finally, report the variance in sepal area for each
species.
Expected Output: A data.frame with
the variance in sepal area for each species.
purrr in Rresult = map(1:10, your_function_here, 1:100)
Create a function add_mean that, when replacing
your_function_here, adds the mean of the vector
1:100 to each element in 1:10.
Expected Output: The vector result
with the correct values.
Task 2: Perform element-wise division between
the vectors 1:10 and 91:100 where
1:10 is the divisor and store the results in a new vector
quotients.
Expected Output: The quotients
vector with the correct values.