class: center, middle, inverse, title-slide .title[ # Applyong and mapping functions ] .subtitle[ ## Avoidings loops! – Split Apply Combine ] .author[ ### Guillaume Falmagne ] .date[ ###
Oct. 7th, 2023 ] --- class: inverse, center, middle # Loops are slow in R (and Python) --- class: left, top background-image: url(figures/splitApplyCombine.png) background-position: center middle background-size: 70% # Split-apply-combine scheme --- # The apply family of functions in R - Meant to make using functions along object easy - Several types, depending on the input and output type - `apply`: along dimensions of an array - `lapply`: along the elements of a vector or list, returns a list - `sapply`: along the elements of a vector or list, returns a vector (if possible) - `tapply`: along the elements of a vector, grouped by a factor vector - These differ in how the "split" part of Split-Apply-Combine is represented - These are not always "type stable": the output format can change depending on the function we use - Equivalent in Python: mainly `df.apply(some_function)` or `dfcolumn.map(function)`, though sometimes tricky (and does not work `inplace`). - For `numpy` arrays, simple functions are implemented in numpy -- use the `axis=` argument in them! --- # apply in vectors and array - General structure: ```r apply(X = array or matrix, MARGIN = dimension that is kept, FUN = function that is used) ``` .center[  ] --- # apply in vectors and array - General structure: ```r apply(X = array or matrix, MARGIN = dimension that is kept, FUN = function that is used) ``` ``` r data(iris) # Extract the numeric values as a matrix iris_num = as.matrix(iris[,1:4]) # calculate column means apply(iris_num, 2, mean) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 5.843333 3.057333 3.758000 1.199333 ``` ``` r # This is so common there is a fast dedicated function: colMeans(iris_num) ``` - In numpy, simply: `somearray.mean(axis=n)`. **BEWARE**, here `n` is the axis that is averaged on/disappears! --- # At least one array during this course... - The `iris3` dataset has the same data as in the `iris` dataframe, but in an array format - It has 3 dimensions - Dim 1, rows, indexed individuals - Dim 2, columns, indexes traits - Dim 3, that lacks a common name, holds species ``` r dim(iris3) ``` ``` ## [1] 50 4 3 ``` -- - To calculate the variance per trait per species ``` r apply(iris3, c(2, 3), var) ``` ``` ## Setosa Versicolor Virginica ## Sepal L. 0.12424898 0.26643265 0.40434286 ## Sepal W. 0.14368980 0.09846939 0.10400408 ## Petal L. 0.03015918 0.22081633 0.30458776 ## Petal W. 0.01110612 0.03910612 0.07543265 ``` --- # `tapply`, old school `group_by` - General structure: ```r tapply(X = vector, MARGIN = grouping_vector, FUN = function that is used) ``` ``` r sepal_means = tapply(iris$Petal.Length, iris$Species, mean) sepal_means ``` ``` ## setosa versicolor virginica ## 1.462 4.260 5.552 ``` - Pandas: go back to `iris.groupby('Length').apply('mean')`. Also look into `df.agg({'col_name':function, ...})` --- # `lapply`, the super useful generic function for lists - General structure: ```r lapply(X = vector or list, FUN = function that is used) ``` ``` r # List all the csv files in the current folder my_csvs = dir(pattern = "\\.csv$") # Apply the read.csv function to each file input_dfs = lapply(my_csvs, read.csv) # Check out the dimensions of the inputed data.frames lapply(input_dfs, dim) ``` ``` ## list() ``` --- class: inverse, center, middle # Modern Alternatives to apply functions --- # The evolution of the apply family: `plyr` - The `plyr` package is an early tidyverse precursor that provides more intuitive, type stable, replacements for the apply family of functions. - Functions have a generic naming scheme: `..ply()` .pull-left[ - The first letter defines the input format: - `l.ply()` - receives a list - `d.ply()` - receives a data.frame - `a.ply()` - receives an array ] -- .pull-right[ - The second letter defines the output format: - `.lply()` - outputs a list - `.dply()` - outputs a data.frame - `.aply()` - outputs an array ] -- ``` r library(plyr) adply(iris3, 3, colMeans) # Array to data.frame ``` ``` ## X1 Sepal L. Sepal W. Petal L. Petal W. ## 1 Setosa 5.006 3.428 1.462 0.246 ## 2 Versicolor 5.936 2.770 4.260 1.326 ## 3 Virginica 6.588 2.974 5.552 2.026 ``` --- # `l.ply` - Lists are naturaly segregated, so we only need to pass an input list and a function ``` r library(plyr) simple_list <- list('zero' = rnorm(5), 'five' = rnorm(5, 5), 'ten' = rnorm(5, 10)) simple_list ``` ``` ## $zero ## [1] 0.4315625 -0.8129847 0.4320722 -0.4182812 0.2801960 ## ## $five ## [1] 5.026150 5.167334 4.117727 2.927309 4.227760 ## ## $ten ## [1] 10.971430 9.139683 9.429977 10.085677 10.792252 ``` ``` r laply(simple_list, mean) ``` ``` ## [1] -0.01748703 4.29325611 10.08380365 ``` --- # Type conversions using `plyr` - The plyr functions are convenient for converting objects: - To convert out list to a data.frame, use `ldply`: ``` r ldply(simple_list) ``` ``` ## .id V1 V2 V3 V4 V5 ## 1 zero 0.4315625 -0.8129847 0.4320722 -0.4182812 0.280196 ## 2 five 5.0261504 5.1673342 4.1177267 2.9273091 4.227760 ## 3 ten 10.9714300 9.1396827 9.4299772 10.0856768 10.792252 ``` - Or to an array, using `laply`: ``` r laply(simple_list, identity) # for arrays we need the identity() function ``` ``` ## 1 2 3 4 5 ## [1,] 0.4315625 -0.8129847 0.4320722 -0.4182812 0.280196 ## [2,] 5.0261504 5.1673342 4.1177267 2.9273091 4.227760 ## [3,] 10.9714300 9.1396827 9.4299772 10.0856768 10.792252 ``` - Python: simply use the list as argument to a `np.array()` or `pd.DataFrame` initializer --- # `a.ply` functions - Arrays lack a natural division, so we need to specify the dimensions to apply the functions using the second argument ``` r aaply(iris3, 3, colMeans) ``` ``` ## ## X1 Sepal L. Sepal W. Petal L. Petal W. ## Setosa 5.006 3.428 1.462 0.246 ## Versicolor 5.936 2.770 4.260 1.326 ## Virginica 6.588 2.974 5.552 2.026 ``` --- # `d.ply` is mostly obsolete - Any time you feel the urge to use plyr to work on a data.frame, you should probably use the tidyverse functions we saw last week - `dplyr`, the tidyverse package that holds all the data.frame manipulation functions is a more modern version of `ddply` --- # Parallel computing for the masses - `plyr` is old, but it is the easy way to run parallel code in R. The parallel interface in `plyr` is dead simple 1. Just load a parallel computation package, 2. tell it how many threads you want to use, 3. and use the `plyr` functions with the `.parallel = TRUE` argument `library(doMC)` ``` r registerDoMC(8) system.time(x <- llply(1:10000, function(x) rnorm(10), .parallel = FALSE)) ``` ``` ## user system elapsed ## 0.01 0.00 0.01 ``` ``` r system.time(x <- llply(1:10000, function(x) rnorm(10), .parallel = TRUE)) ``` ``` ## user system elapsed ## 0.606 0.177 0.530 ``` --- # Running a parallel bootstrap ``` r runGLM <- function(arg){ x <- iris[which(iris[,5] != "setosa"), c(1,5)] ind <- sample(100, 100, replace=TRUE) result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit)) # the '~' is used for a "formula" coefficients(result1) } ``` - This computation is complex enough that it is worth using the extra cores ``` r system.time(x <- llply(1:1000, runGLM, .parallel = FALSE)) ``` ``` ## user system elapsed ## 0.746 0.006 0.752 ``` ``` r system.time(x <- llply(1:1000, runGLM, .parallel = TRUE)) ``` ``` ## user system elapsed ## 1.124 0.369 0.274 ``` --- class: inverse, center, middle # Post-Modern Alternatives to apply functions --- # Back into the tidyverse with `purrr` - Functions for applying a function over a vector - Great error messages - Basic function is `map` .pull-left[ ``` r triple <- function(x) x * 3 map(1:3, triple) ``` ``` ## [[1]] ## [1] 3 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ``` ] .pull-right[  ] --- background-image: url(figures/map.png) background-position: middle center background-size: 50% # Visualizing `map` --- background-image: url(figures/map-arg.png) background-position: middle center background-size: 50% # Additional arguments get repeated --- background-image: url(figures/map-arg-recycle.png) background-position: middle center background-size: 50% # Additional arguments get repeated - Even vectors get repeated --- background-image: url(figures/map2.png) background-position: middle center background-size: 50% # If you have two lists of arguments, `map2` --- background-image: url(figures/pmap-3.png) background-position: middle center background-size: 50% # After that, `pmap` - For `pmap`, the inputs should be wrapped in a list --- background-image: url(figures/pmap-arg.png) background-position: middle center background-size: 50% # All these functions allow common arguments --- # Type stable functions - You usually know the type of the output you expect - There are some functions that guarantee that outcome: - `map_int`: returns a `int` vector - `map_dbl`: returns a `numeric` vector - `map_lgl`: returns a `logical` vector - `map_vec`: returns a vector of the simplest possible type - These throw great comprehensible errors when the output type is unexpected - [Purrr Cheat Sheet](https://rstudio.github.io/cheatsheets/purrr.pdf) - [Functional programming](https://adv-r.hadley.nz/functionals.html) --- # `purrr` in the tidyverse ``` r library(purrr) mtcars |> split(mtcars$cyl) |> # from base R map(\(df) lm(mpg ~ wt, data = df)) |> # \(df) is a shortcut for function(df) map(summary) |> map_dbl("r.squared") # R^2 is one of the outputs of the summary from the linear model fit ``` ``` ## 4 6 8 ## 0.5086326 0.4645102 0.4229655 ``` --- background-image: url(figures/reduce.png) background-position: middle center background-size: 50% # Reduce --- background-image: url(figures/walk.png) background-position: middle center background-size: 35% # `walk` and `walk2`, for the side effects - Side effects are things that happen in a program that do not alter any object. - Showing a plot - Outputing a file (`export`, `write_csv`) - Printing to the console (`cat`, `print`) --- background-image: url(figures/walk2.png) background-position: middle center background-size: 40% # `walk` and `walk2`, for the side effects - Side effects are things that happen in a program that do not alter any object. - Showing a plot - Outputing a file (`export`, `write_csv`) - Printing to the console (`cat`, `print`) --- # walk2 for outputing files ``` r # Split the iris data.set into 3 parts iris_list = split(iris, iris$Species) # Create the appropriate file names: iris_out_files = paste0("iris-", names(iris_list), ".csv") # Use the walk2 function to write the files walk2(iris_list, iris_out_files, write_csv) # Check that the files were created dir(pattern = "iris") ``` ``` ## [1] "iris-setosa.csv" "iris-versicolor.csv" "iris-virginica.csv" ```