--- title: "A matsindf_apply primer" author: "Matthew Kuperus Heun" date: "`r Sys.Date()`" header-includes: - \usepackage{amsmath} output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A matsindf_apply primer} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: References.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(dplyr) library(matsbyname) library(matsindf) library(tidyr) ``` ## Introduction `matsindf_apply()` is a powerful and versatile function that enables analysis with lists and data frames by applying `FUN` in helpful ways. The function is called `matsindf_apply()`, because it can be used to apply `FUN` to a `matsindf` data frame, a data frame that contains matrices as individual entries in a data frame. (A `matsindf` data frame can be created by calling `collapse_to_matrices()`, as demonstrated below.) But `matsindf_apply()` can apply `FUN` across much more: data frames of single numbers, lists of matrices, lists of single numbers, and individual numbers. This vignette demonstrates `matsindf_apply()`, starting with simple examples and proceeding toward sophisticated analyses. ## The basics The basis of all analyses conducted with `matsindf_apply()` is a function (`FUN`) to be applied across data supplied in `.dat` or `...`. `FUN` must return a named list of variables as its result. Here is an example function that both adds and subtracts its arguments, `a` and `b`, and returns a list containing its result, `c` and `d`. ```{r} example_fun <- function(a, b){ return(list(c = matsbyname::sum_byname(a, b), d = matsbyname::difference_byname(a, b))) } ``` Similar to `lapply()` and its siblings, additional argument(s) to `matsindf_apply()` include the data over which `FUN` is to be applied. These arguments can, in the first instance, be supplied as named arguments to the `...` argument of `matsindf_apply()`. All arguments in `...` must be named. The `...` arguments to `matsindf_apply()` are passed to `FUN` according to their names. In this case, the output of `matsindf_apply()` is the the named list returned by `FUN`. ```{r} matsindf_apply(FUN = example_fun, a = 2, b = 1) ``` Passing an additional argument (`z = 2`) causes an unused argument error, because `example_fun` does not have a `z` argument. ```{r} tryCatch( matsindf_apply(FUN = example_fun, a = 2, b = 1, z = 2), error = function(e){e} ) ``` Failing to pass a needed argument (`b`) causes an error that indicates the missing argument. ```{r} tryCatch( matsindf_apply(FUN = example_fun, a = 2), error = function(e){e} ) ``` Alternatively, arguments to `FUN` can be given in a named list to `.dat`, the first argument of `matsindf_apply()`. When a value is assigned to `.dat`, the return value from `matsindf_apply()` contains all named variables in `.dat` (in this case both `a` and `b`) in addition to the results provided by `FUN` (in this case both `c` and `d`). ```{r} matsindf_apply(list(a = 2, b = 1), FUN = example_fun) ``` Extra variables are tolerated in `.dat`, because `.dat` is considered to be a store of data from which variables can be drawn as needed. ```{r} matsindf_apply(list(a = 2, b = 1, z = 42), FUN = example_fun) ``` In contrast, arguments to `...` are named explicitly by the user, so including an extra argument in `...` is considered an error, as shown above. ## Some details If a named argument is supplied by both `.dat` and `...`, the argument in `...` takes precedence, overriding the argument in `.dat`. ```{r} matsindf_apply(list(a = 2, b = 1), FUN = example_fun, a = 10) ``` When supplying **both** `.dat` and `...`, `...` can contain named strings of length `1` which are interpreted as mappings from named items in `.dat` to arguments in the signature of `FUN`. In the example below, `a = "z"` indicates that argument `a` to `FUN` should be supplied by item `z` in `.dat`. ```{r} matsindf_apply(list(a = 2, b = 1, z = 42), FUN = example_fun, a = "z") ``` If a named argument appears in both `.dat` and the output of `FUN`, a name collision occurs in the output of `matsindf_apply()`, and a warning is issued. ```{r} tryCatch( matsindf_apply(list(a = 2, b = 1, c = 42), FUN = example_fun), warning = function(w){w} ) ``` `FUN` can accept more than just numerics. `example_fun_with_string()` accepts a character string and a numeric. However, because `...` argument that is a character string of length `1` has special meaning (namely mapping variables in `.dat` to arguments of `FUN`), passing a character string of length `1` can cause an error. To get around the problem, wrap the single string in a list, as shown below. ```{r} example_fun_with_string <- function(str_a, b) { a <- as.numeric(str_a) list(added = matsbyname::sum_byname(a, b), subtracted = matsbyname::difference_byname(a, b)) } # Causes an error tryCatch( matsindf_apply(FUN = example_fun_with_string, str_a = "1", b = 2), error = function(e){e} ) # To solve the problem, wrap "1" in list(). matsindf_apply(FUN = example_fun_with_string, str_a = list("1"), b = 2) matsindf_apply(FUN = example_fun_with_string, str_a = list("1"), b = list(2)) matsindf_apply(FUN = example_fun_with_string, str_a = list("1", "3"), b = list(2, 4)) matsindf_apply(.dat = list(str_a = list("1"), b = list(2)), FUN = example_fun_with_string) matsindf_apply(.dat = list(m = list("1"), n = list(2)), FUN = example_fun_with_string, str_a = "m", b = "n") ``` ## `matsindf_apply()` and data frames `.dat` can also contain a data frame (or tibble), both of which are fancy lists. When `.dat` is a data frame or tibble, the output of `matsindf_apply()` is a tibble, and `FUN` acts like a specialized `dplyr::mutate()`, adding new columns at the right of `.dat`. ```{r} matsindf_apply(.dat = data.frame(str_a = c("1", "3"), b = c(2, 4)), FUN = example_fun_with_string) matsindf_apply(.dat = data.frame(str_a = c("1", "3"), b = c(2, 4)), FUN = example_fun_with_string, str_a = "str_a", b = "b") matsindf_apply(.dat = data.frame(m = c("1", "3"), n = c(2, 4)), FUN = example_fun_with_string, str_a = "m", b = "n") ``` Additional niceties are available when `.dat` is a data frame or a tibble. `matsindf_apply()` works when the data frame is filled with single numeric values, as is typical. ```{r} df <- data.frame(a = 2:4, b = 1:3) matsindf_apply(df, FUN = example_fun) ``` But `matsindf_apply()` also works with `matsindf` data frames, data frames in which each cell of the data frame is filled with a single matrix. To demonstrate use of `matsindf_apply()` with a `matsindf` data frame, we'll construct a simple `matsindf` data frame (`midf`) using functions in this package. ```{r} # Create a tidy data frame containing data for matrices tidy <- tibble::tibble(Year = rep(c(rep(2017, 4), rep(2018, 4)), 2), matnames = c(rep("U", 8), rep("V", 8)), matvals = c(1:4, 11:14, 21:24, 31:34), rownames = c(rep(c(rep("p1", 2), rep("p2", 2)), 2), rep(c(rep("i1", 2), rep("i2", 2)), 2)), colnames = c(rep(c("i1", "i2"), 4), rep(c("p1", "p2"), 4))) |> dplyr::mutate( rowtypes = case_when( matnames == "U" ~ "Product", matnames == "V" ~ "Industry", TRUE ~ NA_character_ ), coltypes = case_when( matnames == "U" ~ "Industry", matnames == "V" ~ "Product", TRUE ~ NA_character_ ) ) tidy # Convert to a matsindf data frame midf <- tidy |> dplyr::group_by(Year, matnames) |> collapse_to_matrices(rowtypes = "rowtypes", coltypes = "coltypes") |> tidyr::pivot_wider(names_from = "matnames", values_from = "matvals") # Take a look at the midf data frame and some of the matrices it contains. midf midf$U[[1]] midf$V[[1]] ``` With `midf` in hand, we can demonstrate use of [`tidyverse`](https://www.tidyverse.org)-style functional programming to perform matrix algebra within a data frame. The functions of the `matsbyname` package (such as `difference_byname()` below) can be used for this purpose. ```{r} result <- midf |> dplyr::mutate( W = difference_byname(transpose_byname(V), U) ) result result$W[[1]] result$W[[2]] ``` This way of performing matrix calculations works equally well within a 2-row `matsindf` data frame (as shown above) or within a 1000-row `matsindf` data frame. ## Programming with `matsindf_apply()` Users can write their own functions using `matsindf_apply()`. A flexible `calc_W()` function can be written as follows. ```{r} calc_W <- function(.DF = NULL, U = "U", V = "V", W = "W") { # The inner function does all the work. W_func <- function(U_mat, V_mat){ # When we get here, U_mat and V_mat will be single matrices or single numbers, # not a column in a data frame or an item in a list. if (length(U_mat) == 0 & length(V_mat == 0)) { # Tolerate zero-length arguments by returning a zero-length # a list with the correct name and return type. return(list(numeric()) |> magrittr::setnames(W)) } # Calculate W_mat from the inputs U_mat and V_mat. W_mat <- matsbyname::difference_byname( matsbyname::transpose_byname(V_mat), U_mat) # Return a named list. list(W_mat) |> magrittr::set_names(W) } # The body of the main function consists of a call to matsindf_apply # that specifies the inner function in the FUN argument. matsindf_apply(.DF, FUN = W_func, U_mat = U, V_mat = V) } ``` This style of writing `matsindf_apply()` functions is incredibly versatile, leveraging the capabilities of both the `matsindf` and `matsbyname` packages. (Indeed, the `Recca` package uses `matsindf_apply()` heavily and is built upon the functions in the `matsindf` and `matsbyname` packages.) Functions written like `calc_W()` can operate in ways similar to `matsindf_apply()` itself. To demonstrate, we'll use `calc_W()` in all the ways that `matsindf_apply()` can be used, going in the reverse order to our demonstration of the capabilities of `matsindf_apply()` above. `calc_W()` can be used as a specialized `mutate` function that operates on `matsindf` data frames. ```{r} midf |> calc_W() ``` The added column could be given a different name from the default ("`W`") using the `W` argument. ```{r} midf |> calc_W(W = "W_prime") ``` As with `matsindf_apply()`, column names in `midf` can be mapped to the arguments of `calc_W()` by the arguments to `calc_W()`. ```{r} midf |> dplyr::rename(X = U, Y = V) |> calc_W(U = "X", V = "Y") ``` `calc_W()` can operate on lists of single matrices, too. This approach works, because the default values for the `U` and `V` arguments to `calc_W()` are "U" and "V", respectively. The input list members (in this case `midf$U[[1]]` and `midf$V[[1]]`) are returned with the output, because `list(U = midf$U[[1]], V = midf$V[[1]])` is passed to the `.dat` argument of `matsindf_apply()`. ```{r} calc_W(list(U = midf$U[[1]], V = midf$V[[1]])) ``` It may be clearer to name the arguments as required by the `calc_W()` function without wrapping in a list first, as shown below. But in this approach, the input matrices are not returned with the output, because arguments `U` and `V` are passed to the `...` argument of `matsindf_apply()`, not the `.dat` argument of `matsindf_apply()`. ```{r} calc_W(U = midf$U[[1]], V = midf$V[[1]]) ``` `calc_W()` can operate on data frames containing single numbers. ```{r} data.frame(U = c(1, 2), V = c(3, 4)) |> calc_W() ``` Finally, `calc_W()` can be applied to single numbers, and the result is 1x1 matrix. ```{r} calc_W(U = 2, V = 3) ``` It is good practice to write internal functions that tolerate zero-length inputs, as `calc_W()` does. Doing so, enables results from different calculations to be `rbind`ed together. ```{r} calc_W(U = numeric(), V = numeric()) calc_W(list(U = numeric(), V = numeric())) res <- calc_W(list(U = c(2, 3, 4, 5), V = c(3, 4, 5, 6))) res0 <- calc_W(list(U = numeric(), V = numeric())) dplyr::bind_rows(res, res0) ``` ## Conclusion This vignette demonstrated use of the versatile `matsindf_apply()` function. Inputs to `matsindf_apply()` can be * single numbers, * matrices, or * data frames with appropriately-named columns. `matsindf_apply()` can be used for programming, and functions constructed as demonstrated above share characteristics with `matsindf_apply()`: * they can be used as specialized `dplyr::mutate()` operators, and * they can be applied to single numbers, matrices, or data frames with appropriately-named columns.