Applyong and mapping functions

.title[
# Applyong and mapping functions
]
.subtitle[
## Avoidings loops! – Split Apply Combine
]
.author[
### Guillaume Falmagne
]
.date[
### <br> Oct. 7th, 2023
]

---

---

class: left, top
background-image: url(figures/splitApplyCombine.png)
background-position: center middle
background-size:  70%

# Split-apply-combine scheme

---

# The apply family of functions in R

- Meant to make using functions along object easy

- Several types, depending on the input and output type

- `apply`: along dimensions of an array
    - `lapply`: along the elements of a vector or list, returns a list
    - `sapply`: along the elements of a vector or list, returns a vector (if possible)
    - `tapply`: along the elements of a vector, grouped by a factor vector

- These differ in how the "split" part of Split-Apply-Combine is represented

- These are not always "type stable": the output format can change depending on the function we use

- Equivalent in Python: mainly `df.apply(some_function)` or `dfcolumn.map(function)`, though sometimes tricky (and does not work `inplace`). 
  - For `numpy` arrays, simple functions are implemented in numpy -- use the `axis=` argument in them!

---
# apply in vectors and array

- General structure: 
```r
apply(X = array or matrix, 
          MARGIN = dimension that is kept, 
          FUN = function that is used)
```
.center[
![](figures/arrays.jpeg)
]
---

# apply in vectors and array

- General structure: 
```r
apply(X = array or matrix, 
          MARGIN = dimension that is kept, 
          FUN = function that is used)
```

``` r
data(iris)
# Extract the numeric values as a matrix
iris_num = as.matrix(iris[,1:4])

# calculate column means
apply(iris_num, 2, mean)
```

```
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
```

``` r
# This is so common there is a fast dedicated function:
colMeans(iris_num)
```

- In numpy, simply: `somearray.mean(axis=n)`. **BEWARE**, here `n` is the axis that is averaged on/disappears!

---

# At least one array during this course...

- The `iris3` dataset has the same data as in the `iris` dataframe, but in an array format

- It has 3 dimensions
    - Dim 1, rows, indexed individuals
    - Dim 2, columns, indexes traits
    - Dim 3, that lacks a common name, holds species

``` r
dim(iris3)
```

```
## [1] 50  4  3
```
--

- To calculate the variance per trait per species

``` r
apply(iris3, c(2, 3), var)
```

```
##              Setosa Versicolor  Virginica
## Sepal L. 0.12424898 0.26643265 0.40434286
## Sepal W. 0.14368980 0.09846939 0.10400408
## Petal L. 0.03015918 0.22081633 0.30458776
## Petal W. 0.01110612 0.03910612 0.07543265
```
---

# `tapply`, old school `group_by`

- General structure: 
```r
tapply(X = vector,
          MARGIN = grouping_vector, 
          FUN = function that is used)
```

``` r
sepal_means = tapply(iris$Petal.Length, iris$Species, mean)
sepal_means
```

```
##     setosa versicolor  virginica 
##      1.462      4.260      5.552
```

- Pandas: go back to `iris.groupby('Length').apply('mean')`. Also look into `df.agg({'col_name':function, ...})`

---

# `lapply`, the super useful generic function for lists

- General structure: 
```r
lapply(X = vector or list,
           FUN = function that is used)
```

``` r
# List all the csv files in the current folder
my_csvs = dir(pattern = "\\.csv$")

# Apply the read.csv function to each file
input_dfs = lapply(my_csvs, read.csv)

# Check out the dimensions of the inputed data.frames
lapply(input_dfs, dim)
```

```
## list()
```

---

---

# The evolution of the apply family: `plyr`

- The `plyr` package is an early tidyverse precursor that provides more intuitive, type stable, replacements for the apply family of functions.

- Functions have a generic naming scheme: `..ply()`

.pull-left[
  - The first letter defines the input format:
    - `l.ply()` - receives a list
    - `d.ply()` - receives a data.frame
    - `a.ply()` - receives an array
] 
--
.pull-right[
  - The second letter defines the output format:
    - `.lply()` - outputs a list
    - `.dply()` - outputs a data.frame
    - `.aply()` - outputs an array
] 
--

``` r
library(plyr)
adply(iris3, 3, colMeans) # Array to data.frame
```

```
##           X1 Sepal L. Sepal W. Petal L. Petal W.
## 1     Setosa    5.006    3.428    1.462    0.246
## 2 Versicolor    5.936    2.770    4.260    1.326
## 3  Virginica    6.588    2.974    5.552    2.026
```
---

# `l.ply`

- Lists are naturaly segregated, so we only need to pass an input list and a function

``` r
library(plyr)
simple_list <- list('zero' = rnorm(5), 
                    'five' = rnorm(5, 5), 
                    'ten' = rnorm(5, 10))
simple_list
```

```
## $zero
## [1]  0.4315625 -0.8129847  0.4320722 -0.4182812  0.2801960
## 
## $five
## [1] 5.026150 5.167334 4.117727 2.927309 4.227760
## 
## $ten
## [1] 10.971430  9.139683  9.429977 10.085677 10.792252
```

``` r
laply(simple_list, mean)
```

```
## [1] -0.01748703  4.29325611 10.08380365
```

---

# Type conversions using `plyr`

- The plyr functions are convenient for converting objects:
- To convert out list to a data.frame, use `ldply`:

``` r
ldply(simple_list)
```

```
##    .id         V1         V2        V3         V4        V5
## 1 zero  0.4315625 -0.8129847 0.4320722 -0.4182812  0.280196
## 2 five  5.0261504  5.1673342 4.1177267  2.9273091  4.227760
## 3  ten 10.9714300  9.1396827 9.4299772 10.0856768 10.792252
```

- Or to an array, using `laply`:

``` r
laply(simple_list, identity) # for arrays we need the identity() function
```

```
##               1          2         3          4         5
## [1,]  0.4315625 -0.8129847 0.4320722 -0.4182812  0.280196
## [2,]  5.0261504  5.1673342 4.1177267  2.9273091  4.227760
## [3,] 10.9714300  9.1396827 9.4299772 10.0856768 10.792252
```

- Python: simply use the list as argument to a `np.array()` or `pd.DataFrame` initializer

---

# `a.ply` functions

- Arrays lack a natural division, so we need to specify the dimensions to apply the functions using the second argument

``` r
aaply(iris3, 3, colMeans)
```

```
##             
## X1           Sepal L. Sepal W. Petal L. Petal W.
##   Setosa        5.006    3.428    1.462    0.246
##   Versicolor    5.936    2.770    4.260    1.326
##   Virginica     6.588    2.974    5.552    2.026
```

---

# `d.ply` is mostly obsolete

- Any time you feel the urge to use plyr to work on a data.frame, you should probably use the tidyverse functions we saw last week

- `dplyr`, the tidyverse package that holds all the data.frame manipulation functions is a more modern version of `ddply`

---

# Parallel computing for the masses

- `plyr` is old, but it is the easy way to run parallel code in R. The parallel interface in `plyr` is dead simple

1. Just load a parallel computation package, 
2. tell it how many threads you want to use, 
3. and use the `plyr` functions with the `.parallel = TRUE` argument

`library(doMC)`

``` r
registerDoMC(8)

system.time(x <- llply(1:10000, function(x) rnorm(10), .parallel = FALSE))
```

```
##    user  system elapsed 
##    0.01    0.00    0.01
```

``` r
system.time(x <- llply(1:10000, function(x) rnorm(10), .parallel = TRUE))
```

```
##    user  system elapsed 
##   0.606   0.177   0.530
```

---

# Running a parallel bootstrap

``` r
runGLM <- function(arg){
  x <- iris[which(iris[,5] != "setosa"), c(1,5)]
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))   # the '~' is used for a "formula" 
  coefficients(result1)
}
```

- This computation is complex enough that it is worth using the extra cores

``` r
system.time(x <- llply(1:1000, runGLM, .parallel = FALSE))
```

```
##    user  system elapsed 
##   0.746   0.006   0.752
```

``` r
system.time(x <- llply(1:1000, runGLM, .parallel = TRUE))
```

```
##    user  system elapsed 
##   1.124   0.369   0.274
```

---

---

# Back into the tidyverse with `purrr`

- Functions for applying a function over a vector

- Great error messages

- Basic function is `map`

``` r
triple <- function(x) x * 3
map(1:3, triple)
```

```
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9
```
]
.pull-right[
  ![:scale 50%](figures/purrr.png)
]

---

background-image: url(figures/map.png)
background-position: middle center
background-size:  50%
# Visualizing `map`

---

background-image: url(figures/map-arg.png)
background-position: middle center
background-size:  50%
# Additional arguments get repeated

---

background-image: url(figures/map-arg-recycle.png)
background-position: middle center
background-size:  50%
# Additional arguments get repeated

- Even vectors get repeated

---

background-image: url(figures/map2.png)
background-position: middle center
background-size:  50%
# If you have two lists of arguments, `map2`

---

background-image: url(figures/pmap-3.png)
background-position: middle center
background-size:  50%
# After that, `pmap`

- For `pmap`, the inputs should be wrapped in a list

---

background-image: url(figures/pmap-arg.png)
background-position: middle center
background-size:  50%
# All these functions allow common arguments

---

# Type stable functions

- You usually know the type of the output you expect

- There are some functions that guarantee that outcome:

- `map_int`: returns a `int` vector

- `map_dbl`: returns a `numeric` vector

- `map_lgl`: returns a `logical` vector

- `map_vec`: returns a vector of the simplest possible type

- These throw great comprehensible errors when the output type is unexpected

- [Purrr Cheat Sheet](https://rstudio.github.io/cheatsheets/purrr.pdf)

- [Functional programming](https://adv-r.hadley.nz/functionals.html)

---

# `purrr` in the tidyverse

``` r
library(purrr)

mtcars |> 
  split(mtcars$cyl) |>  # from base R
  map(\(df) lm(mpg ~ wt, data = df)) |>  # \(df) is a shortcut for function(df)
  map(summary) |>
  map_dbl("r.squared")  # R^2 is one of the outputs of the summary from the linear model fit
```

```
##         4         6         8 
## 0.5086326 0.4645102 0.4229655
```

---

background-image: url(figures/reduce.png)
background-position: middle center
background-size:  50%
# Reduce

---

background-image: url(figures/walk.png)
background-position: middle center
background-size:  35%
# `walk` and `walk2`, for the side effects

- Side effects are things that happen in a program that do not alter any object.
  - Showing a plot 
  - Outputing a file (`export`, `write_csv`)
  - Printing to the console (`cat`, `print`)

---

background-image: url(figures/walk2.png)
background-position: middle center
background-size:  40%
# `walk` and `walk2`, for the side effects

- Side effects are things that happen in a program that do not alter any object.
  - Showing a plot 
  - Outputing a file (`export`, `write_csv`)
  - Printing to the console (`cat`, `print`)

---

# walk2 for outputing files

``` r
# Split the iris data.set into 3 parts
iris_list = split(iris, iris$Species)

# Create the appropriate file names:
iris_out_files = paste0("iris-", names(iris_list), ".csv")

# Use the walk2 function to write the files
walk2(iris_list, iris_out_files, write_csv)

# Check that the files were created
dir(pattern = "iris")
```

```
## [1] "iris-setosa.csv"     "iris-versicolor.csv" "iris-virginica.csv"
```