GitHub Classroom Assignment

Applying Functions in R

Examples from R for Ecology.

library(pak)
pak::pkg_install("tidyverse")
library(tidyverse)

pak::pkg_install("purrr")
library(purrr)

First, let’s create a sample dataset to work with. This dataset will have columns with the heights of five plants after 0, 10, and 20 days of growth. Each row represents an individual plant.

plant_data <- data.frame(
  plant_id = c(1, 2, 3, 4, 5),
  day_0 = c(10, 12, 9, 11, 10),
  day_10 = c(15, 18, 14, 16, 15),
  day_20 = c(20, 24, 19, 22, 20)
)

head(plant_data)
##   plant_id day_0 day_10 day_20
## 1        1    10     15     20
## 2        2    12     18     24
## 3        3     9     14     19
## 4        4    11     16     22
## 5        5    10     15     20

Using apply for Row and Column Calculations

Let’s practice the apply function. We will calculate the mean height of each plant across the three time points by setting MARGIN = 1. The result will contain an average for each individual, and the original columns will effectively disappear. The same result can be accomplished with the rowMeans() function, which can be more intuitive and easier to use.

# Calculating the mean height for each row in the data frame
row_means <- apply(plant_data[, 2:4], MARGIN = 1, mean)

# View row means
row_means
## [1] 15.00000 18.00000 14.00000 16.33333 15.00000

Next, let’s use apply to calculate the variance within each column of the dataset. By setting MARGIN = 2 and setting our function to var, we will create a new vector with the variance of each column. The result will have one value per column and will retain the original column names.

# Calculating the variance for each column in the data frame
col_vars <- apply(plant_data[, 2:4], 2, var) # the MARGIN parameter name can be left out if the arguments are in the correct order

# View the column variances
col_vars
##  day_0 day_10 day_20 
##    1.3    2.3    4.0

Finally, apply can be used to run your own custom functions. In this example, we will calculate the range of plant heights at each time point by subtracting the minimum height from the maximum height. The result will be a vector with one range per time point.

# Creating the range function
get_range <- function(x) {
  return(max(x) - min(x))
}

# Calculating the range of plant heights for each time point
height_ranges <- apply(plant_data[, 2:4], 2, get_range)

# View the plant height ranges
height_ranges
##  day_0 day_10 day_20 
##      3      4      5

Using tapply for Grouped Calculations

Remember, tapply is the old-school group_by: it applies a function to subsets of a vector defined by a factor. In this example, we will calculate the mean height of each plant across time points.

# First, pivot the data into long format so that we have a factor (plant_id) to work with
plant_data_long <- plant_data %>%
  pivot_longer(cols = -plant_id, names_to = "day", values_to = "height")

# Use sub() to remove the "day_" prefix from the day column
plant_data_long$day <- sub("day_", "", plant_data_long$day)

# View the long-format data
head(plant_data_long)
## # A tibble: 6 × 3
##   plant_id day   height
##      <dbl> <chr>  <dbl>
## 1        1 0         10
## 2        1 10        15
## 3        1 20        20
## 4        2 0         12
## 5        2 10        18
## 6        2 20        24

Now we can use the plant_id factor to calculate the mean height of each plant across time points. Notice that the result matches row_means, which we computed with apply in the previous section. Which function to use depends on the format of the data and the desired output.

# Calculate the mean height of each plant across time points
height_means <- tapply(plant_data_long$height, plant_data_long$plant_id, mean)

# View the mean heights
height_means
##        1        2        3        4        5 
## 15.00000 18.00000 14.00000 16.33333 15.00000

Using lapply and sapply for Lists

Before we practice using lapply and sapply, let’s create a list of vectors to work with. Each vector contains data on different plant attributes.

# Fix the random seed so that we obtain the same results in each run
set.seed(123)

# Create a list of vectors with plant data
plant_list <- list(
  height = rnorm(10, 10, 2),
  weight = rnorm(10, 5, 1),
  leaf_count = rpois(10, 20) # rpois generates random numbers from a Poisson distribution
)

# View the list
plant_list
## $height
##  [1]  8.879049  9.539645 13.117417 10.141017 10.258575 13.430130 10.921832
##  [8]  7.469878  8.626294  9.108676
## 
## $weight
##  [1] 6.224082 5.359814 5.400771 5.110683 4.444159 6.786913 5.497850 3.033383
##  [9] 5.701356 4.527209
## 
## $leaf_count
##  [1] 15 18 15 17 19 14 14 18 14 24

Now, let’s use lapply to calculate the mean of each vector in the list. The result will be a list of means for each attribute. We could use the mean function directly, or even use a for loop, but this example demonstrates the elegance of lapply.

# Calculate the mean of each attribute in the list
plant_means <- lapply(plant_list, mean)

# View the means
plant_means
## $height
## [1] 10.14925
## 
## $weight
## [1] 5.208622
## 
## $leaf_count
## [1] 16.8

Lastly, we will use sapply to simplify the output of the previous calculation. The result will be a vector with the means of each attribute, which can be easier to work with than a list.

# Calculate the mean of each attribute in the list using sapply
plant_means_vector <- sapply(plant_list, mean)

# View the simplified means
plant_means_vector
##     height     weight leaf_count 
##  10.149251   5.208622  16.800000

Applying Functions in Python

Alternatives to R’s apply Functions in Python

Let’s re-create the original plant heights dataset in Python.

import pandas as pd
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

Instead of using apply, we will use the axis parameter to define the direction of the operation. Here, we will calculate the mean height of each plant across the three time points. By setting axis = 1, we are “squeezing” the columns and they will essentially disappear.

# Calculate the mean height for each row in the data frame
row_means = plant_data.iloc[:, 1:].mean(axis = 1) # iloc selects by integer position; 1: skips the plant_id column

# View the row means
row_means
## 0    15.000000
## 1    18.000000
## 2    14.000000
## 3    16.333333
## 4    15.000000
## dtype: float64
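To make the direction of axis concrete, the minimal sketch below computes means along both axes and checks the first row mean by hand. The names heights, col_means, and manual_first are illustrative and do not appear in the original example.

```python
import pandas as pd

# Re-create the plant heights dataset from above
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

# Keep only the height columns (illustrative name)
heights = plant_data.iloc[:, 1:]

# axis=1 collapses the columns: one value per row (plant)
row_means = heights.mean(axis=1)

# axis=0 collapses the rows: one value per column (time point)
col_means = heights.mean(axis=0)

# Manual check for the first plant: (10 + 15 + 20) / 3
manual_first = (10 + 15 + 20) / 3

print(row_means.shape)  # one entry per plant
print(col_means.shape)  # one entry per time point
print(abs(row_means.iloc[0] - manual_first) < 1e-9)
```

A useful way to remember the convention is that axis names the dimension that is collapsed, not the one that remains.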

Next, we will calculate the variance within each column of the dataset. By setting axis = 0, we will calculate the variance for each column.

# Calculate the variance for each column in the data frame
col_vars = plant_data.iloc[:, 1:].var(axis = 0)

# View the column variances
col_vars
## day_0     1.3
## day_10    2.3
## day_20    4.0
## dtype: float64

pandas even has its own apply method that allows us to perform custom operations. In this example, we will calculate the range of plant heights at each time point.

# Calculate the range of plant heights for each time point
height_ranges = plant_data.iloc[:, 1:].apply(lambda x: x.max() - x.min(), axis = 0)

# View the plant height ranges
height_ranges
## day_0     3
## day_10    4
## day_20    5
## dtype: int64
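The lambda above can also be written as a named function, which mirrors the get_range function from the R section. The sketch below is one possible way to do this. The row-wise call and the name plant_ranges are additions for illustration only.

```python
import pandas as pd

# Re-create the plant heights dataset from above
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

def get_range(x):
    """Return the difference between the largest and smallest value."""
    return x.max() - x.min()

# axis=0 passes each column to get_range as a Series
height_ranges = plant_data.iloc[:, 1:].apply(get_range, axis=0)

# axis=1 passes each row instead, giving the span of heights per plant
# (plant_ranges is a hypothetical name, not part of the original example)
plant_ranges = plant_data.iloc[:, 1:].apply(get_range, axis=1)

print(height_ranges)
print(plant_ranges)
```

A named function is easier to reuse and test than a lambda, at the cost of a few extra lines.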

Using groupby for Grouped Calculations

pandas has no tapply function, but groupby, which we learned last week, performs similar operations. In this example, we will calculate the mean height of each plant across time points.

# First, pivot the data into long format so that we have a factor (plant_id) to work with
plant_data_long = plant_data.melt(id_vars = 'plant_id', var_name = 'day', value_name = 'height')

# View the long-format data
plant_data_long.head()
##    plant_id    day  height
## 0         1  day_0      10
## 1         2  day_0      12
## 2         3  day_0       9
## 3         4  day_0      11
## 4         5  day_0      10
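Unlike the R version, the melted data still carries the "day_" prefix and is ordered by day rather than by plant. If you want the Python table to match the R one, a sketch like the following could be used. It assumes that converting day to an integer and sorting are acceptable for your analysis; neither step is required for the groupby that follows.

```python
import pandas as pd

# Re-create the plant heights dataset from above
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

# Reshape to long format, one row per plant and time point
plant_data_long = plant_data.melt(
    id_vars='plant_id', var_name='day', value_name='height'
)

# Remove the literal "day_" prefix and store the day as an integer
plant_data_long['day'] = (
    plant_data_long['day'].str.replace('day_', '', regex=False).astype(int)
)

# Order the rows by plant and then by day, as in the R output
plant_data_long = (
    plant_data_long.sort_values(['plant_id', 'day']).reset_index(drop=True)
)

print(plant_data_long.head())
```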

Now we can use the plant_id factor to calculate the mean height of each plant across time points. Again, we get the same result as using mean with axis = 1 in the previous section.

# Calculate the mean height of each plant across time points
height_means = plant_data_long.groupby('plant_id')['height'].mean()

# View the mean heights
height_means
## plant_id
## 1    15.000000
## 2    18.000000
## 3    14.000000
## 4    16.333333
## 5    15.000000
## Name: height, dtype: float64
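The claim that both approaches agree can be checked in code. The sketch below compares values only, because the two results carry different indexes (row position versus plant_id). The use of numpy.allclose and the name same are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Re-create the plant heights dataset from above
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

# Wide format: mean across columns for each row
row_means = plant_data.iloc[:, 1:].mean(axis=1)

# Long format: mean of height within each plant_id group
plant_data_long = plant_data.melt(
    id_vars='plant_id', var_name='day', value_name='height'
)
height_means = plant_data_long.groupby('plant_id')['height'].mean()

# The indexes differ (0-4 versus 1-5), so compare the underlying values
same = np.allclose(row_means.to_numpy(), height_means.to_numpy())
print(same)
```

The comparison relies on groupby returning groups sorted by plant_id, which here matches the row order of the wide table.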

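Finally, we can accomplish the same result with agg. Passing a dictionary that maps a column name to a function returns a DataFrame rather than a Series.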

# Calculate the mean height of each plant across time points using agg
height_means_agg = plant_data_long.groupby('plant_id').agg({'height': 'mean'})

# View the mean heights
height_means_agg
##              height
## plant_id           
## 1         15.000000
## 2         18.000000
## 3         14.000000
## 4         16.333333
## 5         15.000000
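One reason to reach for agg is that it can compute several summaries in a single call. The sketch below passes a list of function names; the particular choice of mean, min, and max, and the name summary, are illustrative.

```python
import pandas as pd

# Re-create the plant heights dataset from above
plant_data = pd.DataFrame({
    'plant_id': [1, 2, 3, 4, 5],
    'day_0': [10, 12, 9, 11, 10],
    'day_10': [15, 18, 14, 16, 15],
    'day_20': [20, 24, 19, 22, 20]
})

# Reshape to long format, one row per plant and time point
plant_data_long = plant_data.melt(
    id_vars='plant_id', var_name='day', value_name='height'
)

# A list of function names yields one output column per function
summary = plant_data_long.groupby('plant_id')['height'].agg(['mean', 'min', 'max'])

print(summary)
```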

Using map Functions from the purrr Package in R

Now let’s use a short example to demonstrate map. map applies the same function to each element of a list or vector and returns a list; the variant map_dbl used here returns a numeric (double) vector instead. In this example, we will calculate the square root of each element in a vector containing the integers 1 through 10.

print(class(1:10))
## [1] "integer"
sqrts = map_dbl(1:10, sqrt)

sqrts
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

Next, let’s take the original vector of integers and add a constant to each element. We will define a function that accepts two arguments and adds them together.

add_nums <- function(x, y) {
  return(x + y)
}

added_constant = map(1:10, add_nums, 5) # the number 5 will be added to each vector element

added_constant
## [[1]]
## [1] 6
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 8
## 
## [[4]]
## [1] 9
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 11
## 
## [[7]]
## [1] 12
## 
## [[8]]
## [1] 13
## 
## [[9]]
## [1] 14
## 
## [[10]]
## [1] 15

Next, let’s pass a whole vector as the extra argument instead of a single constant, using the same custom function as in the previous example. Because map hands the full vector 1:10 to every call, each element of the result is itself a vector of length 10.

added_vector = map(1:10, add_nums, 1:10)

added_vector
## [[1]]
##  [1]  2  3  4  5  6  7  8  9 10 11
## 
## [[2]]
##  [1]  3  4  5  6  7  8  9 10 11 12
## 
## [[3]]
##  [1]  4  5  6  7  8  9 10 11 12 13
## 
## [[4]]
##  [1]  5  6  7  8  9 10 11 12 13 14
## 
## [[5]]
##  [1]  6  7  8  9 10 11 12 13 14 15
## 
## [[6]]
##  [1]  7  8  9 10 11 12 13 14 15 16
## 
## [[7]]
##  [1]  8  9 10 11 12 13 14 15 16 17
## 
## [[8]]
##  [1]  9 10 11 12 13 14 15 16 17 18
## 
## [[9]]
##  [1] 10 11 12 13 14 15 16 17 18 19
## 
## [[10]]
##  [1] 11 12 13 14 15 16 17 18 19 20

Finally, let’s take the original vector of integers and add them to another vector of integers, element-wise. Here, the lengths of the two vectors must be the same, and the function will be applied to each pair of elements.

element_wise_addition = map2(1:10, 1:10, add_nums) # map2 is needed to map over two inputs

element_wise_addition
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 12
## 
## [[7]]
## [1] 14
## 
## [[8]]
## [1] 16
## 
## [[9]]
## [1] 18
## 
## [[10]]
## [1] 20
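For readers following the Python track, the three patterns above have rough counterparts in plain Python. The sketch below is an analogy rather than an equivalent of purrr: it uses list comprehensions and zip, and the name numbers is introduced only for this illustration.

```python
import math

def add_nums(x, y):
    """Add two values, mirroring the R function of the same name."""
    return x + y

# The integers 1 through 10, like 1:10 in R
numbers = list(range(1, 11))

# Rough analogue of map_dbl(1:10, sqrt)
sqrts = [math.sqrt(n) for n in numbers]

# Rough analogue of map(1:10, add_nums, 5): the second argument is constant
added_constant = [add_nums(n, 5) for n in numbers]

# Rough analogue of map2(1:10, 1:10, add_nums): zip pairs elements by position
element_wise_addition = [add_nums(a, b) for a, b in zip(numbers, numbers)]

print(added_constant)
print(element_wise_addition)
```

Note that zip stops at the end of the shorter input without raising an error, so it is worth checking that both inputs have the same length first.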

Exercises

Exercise 1 – Exploring 2D Automobile Data in Python

  • Task 1: Calculate the mean and variance of the displacement, horsepower, and weight columns in the “auto-mpg.csv” dataset.

  • Expected Output: A DataFrame with the means and variances for each indicated column.

  • Task 2: Calculate the average mpg for each car manufacturer in the dataset. Which manufacturer has the highest average mpg? Note: you can treat abbreviations as unique manufacturers.

  • Expected Output: The name of the manufacturer with the highest average mpg.

Exercise 2 – Exploring 3D Iris Data in R

The iris3 dataset can be loaded into R using the following command:

data(iris3)

  • Task 1: Calculate the average sepal length, sepal width, and petal length across individuals for each species in the iris3 dataset.

  • Expected Output: A data.frame with the average values for each species.

  • Task 2: Define a function sepal_area that calculates the sepal area (length * width) for each individual in the dataset. Apply this function to the appropriate columns and create a new column to store the values. Finally, report the variance in sepal area for each species.

  • Expected Output: A data.frame with the variance in sepal area for each species.

Exercise 3 – Learning to purrr in R

  • Task 1: Using the following line of code:
result = map(1:10, your_function_here, 1:100)

Create a function add_mean that, when replacing your_function_here, adds the mean of the vector 1:100 to each element in 1:10.

  • Expected Output: The vector result with the correct values.

  • Task 2: Perform element-wise division between the vectors 1:10 and 91:100 where 1:10 is the divisor and store the results in a new vector quotients.

  • Expected Output: The quotients vector with the correct values.