--- title: "Grouped Compositional Data" author: "N. Frerebeau" date: "`r Sys.Date()`" output: markdown::html_format: options: toc: true number_sections: true vignette: > %\VignetteIndexEntry{Grouped Compositional Data} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) Sys.setenv(LANGUAGE = "en") # Force locale ``` ```{r setup} ## Install extra packages (if needed) # install.packages("folio") # datasets library(nexus) ``` # Reference Groups Provenance studies typically rely on two approaches, which can be used together: * Identification of groups among the artifacts being studied, based on mineralogical or geochemical criteria (*clustering*). * Comparison with so-called reference groups, i.e. known geological sources or archaeological contexts (*classification*). We use here the results of the analysis of 369 ancient bronzes (see `help(bronze, package = "folio")`) attributed to three dynasties. For the sake of the demonstration, let's assume that a third of the samples are of unknown provenance: missing values (`NA`) can be used to specify that a sample does not belong to any group. ```{r data} ## Data from Wood and Liu 2023 data("bronze", package = "folio") dynasty <- as.character(bronze$dynasty) # Save original data for further use ## Randomly add missing values set.seed(12345) # Set seed for reproductibility n <- nrow(bronze) bronze$dynasty[sample(n, size = n / 3)] <- NA ``` When coercing a `data.frame` to a `CompositionMatrix` object, **nexus** allows to specify whether an observation belongs to a specific group (or not): ```{r coda} ## Use the third column (dynasties) for grouping coda <- as_composition(bronze, parts = 4:11, groups = 3) ``` Alternatively, `group()` allows to set groups of an existing `CompositionMatrix`. ```{r group} ## Create a composition data matrix coda <- as_composition(bronze, parts = 4:11) ## Use the third dynasties for grouping coda <- group(coda, by = bronze$dynasty) ``` Once groups have been defined, they can be used by further methods (e.g. plotting). Note that for better readability, you can select only some of the parts (e.g. major elements): ```{r barplot, fig.width=10, fig.height=10, out.width='100%'} ## Select major elements major <- coda[, is_element_major(coda)] ## Compositional bar plot barplot(major, order_rows = "Cu", space = 0) ``` ```{r ternary, fig.width=10, fig.height=10, out.width='100%'} ## Matrix of ternary plots pairs(coda) ``` # Log-Ratio Analysis ```{r pca, fig.width=7, fig.height=7, out.width='50%', fig.show='hold'} ## CLR clr <- transform_clr(coda, weights = TRUE) ## PCA lra <- pca(clr) ## Visualize results viz_individuals( x = lra, extra_quali = group_names(clr), color = c("#004488", "#DDAA33", "#BB5566"), hull = TRUE ) viz_variables(lra) ``` # Discriminant Analysis ```{r manova} ## Subset training data train <- coda[is_assigned(coda), ] ## ILR ilr_train <- transform_ilr(train) ## MANOVA fit <- manova(ilr_train ~ group_names(ilr_train)) summary(fit) ``` The MANOVA results suggest that there are statistically significant differences between groups. Let's now try how effective a Linear Discriminant Analysis (LDA) is at separating the three groups: ```{r lda} ## LDA (discr <- MASS::lda(ilr_train, grouping = group_names(ilr_train))) ``` By default, some of these results (group averages and discriminant function coefficients) are reported in ILR coordinates, making them difficult to interpret. Therefore, it is preferable to transform these results back into compositions or equivalent CLR coefficients: ```{r lda-model} ## Back transform results transform_inverse(discr$means, origin = ilr_train) ``` ```{r lda-plot, fig.width=10, fig.height=10, out.width='100%'} plot(discr, col = c("#DDAA33", "#BB5566", "#004488")[group_indices(ilr_train)]) ``` We can then try to predict the group membership of the unassigned samples: ```{r lda-predict} ## Subset unassigned samples test <- coda[!is_assigned(coda), ] ilr_test <- transform_ilr(test) ## Predict group membership results <- predict(discr, ilr_test) ## Assess the accuracy of the prediction (ct <- table( predicted = results$class, expected = dynasty[!is_assigned(coda)] )) diag(proportions(ct, margin = 1)) ## Total percent correct sum(diag(proportions(ct))) ```