\documentclass{article} \usepackage[numbers]{natbib} \newcommand{\bibTitle}[1]{\emph{#1}} \usepackage{graphics} \usepackage{amsmath} \usepackage{indentfirst} \usepackage[utf8]{inputenc} \usepackage{hyperref} \usepackage[margin=0.5in]{geometry} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} %\VignetteIndexEntry{Information-theoretic PGLS model selection workflow: a primate vocal communication example} \begin{document} \setkeys{Gin}{width=1.1\textwidth} <>= opt.old <- options(keep.source = TRUE, width = 95) foo <- packageDescription("mmodely") @ \title{ Ecological factors influencing primate vocal signaling:\\ a phylogenetic regression workflow for the \\ \Rpackage{mmodely} R-package (Version \Sexpr{foo$Version})\\ } \author{ David M. Schruth \\ \texttt{dschruth@anthropoidea.org}\\ \\ } \maketitle \section{Introduction} The historical relationships between evolved traits of organisms and the ecological settings that shape these traits are complicated systems that can be challenging to untangle \cite{Pagel1999}. The origins of behavioral traits can particularly difficult to understand as they tend to also be mediated through the behaviors of other organisms, which are themselves constantly in flux and considerably labile \cite{Blomberg2003}. A perfect example of such a trait is that of vocal signal complexity. Animals use complex calls to assert obscured position, unique identity, special status, or emotive state to conspecifics over interference from other calls or distortions from background noise \cite{Kaplan2000}. Here, using the \Rpackage{mmodely} package on a primate vocalization dataset \cite{Schruth2019a}, I demonstrate how the origins of complex call structure, such as syllablic diversity \cite{Schruth2021a}, can be elucidated from a range of environmental and behavioral covariates from disparate datasets \cite{Rowe2017}. Model averaging [MA] \cite{Symonds2011} and model selection [MS] \cite{Johnson2004} results primarily highlight locomotion \cite{Schruth2019b} and mating system \cite{Feuntes1998} as important factors driving complex calling, as well as the trophic security \cite{Schruth2021b} variables of mass, group size, and arboreality. The \Rpackage{mmodely} package enables implementation of a combination of phylogenetic controlled regression \cite{Revell2014} and information theoretic \cite{Garamszegi2011} (MA and MS) examination to simultaneously compare (weighted) predictor coefficients across the numerous sub-datasets generated during exploration of all possible model combinations. \section{Licensing} The \Rpackage{mmodely} package is licensed under the Apache License v2.0: it is therefore free to use and redistribute, however, we, the copyright holders, wish to maintain primary control over any further development. Please be sure to cite \Rpackage{mmodely} if you use the package in presentations or work leading to publication. \section{Installation} This package largely depends upon the \Rpackage{caper} package, but most functions do not require any particular library. It is recommended that you have \Rpackage{caper}, \Rpackage{ape}, and the \Rpackage{caroline} package installed as a minimum. <>= # wget https://cran.r-project.org/src/contrib/Archive/caroline/caroline_0.8.0.tar.gz # wget https://cran.r-project.org/src/contrib/Archive/caper/caper_0.5.tar.gz # wget https://cran.r-project.org/src/contrib/Archive/ape/ape_3.0-5.tar.gz # R CMD INSTALL caroline_0.8.0.tar.gz # R CMD INSTALL caper_0.5.tar.gz # R CMD INSTALL ape_3.0-5.tar.gz @ Building the \Rpackage{mmodely} package from source requires that you have the proper dependency packages, \Rpackage{caroline}, installed from CRAN. This can typically be accomplished via the following commands from within the R command line environment: \begin{verbatim} install.packages(c('caroline','ape','caper')) \end{verbatim} After a successful installation the \Rpackage{mmodely} package can be loaded in the normal way: by starting R and invoking the following \Rfunction{library} command: <>= library(caper) library(mmodely) @ \section{Reading in Data} Read in the tree \cite{Springer2012} and datasets then merge them together. <>= data.path <- system.file("extdata","primate-example.data.csv", package="mmodely") data <- read.csv(data.path, row.names=1) data$gn_sp <- rownames(data) #multiply two vocalization metrics together to create "vocal complexity" data$VC <- apply(data[,c('syllables_max','rhythm_max')], 1, prod) data <- subset(data, !is.na(VC)) # merge data sets here if applicable tree.path <- system.file("extdata","primate-springer.2012.tre", package="mmodely") phyl <- ape::read.tree(tree.path)[[5]] #5. RAxML phylogram based on the 61199 bp concatenation of 69 nuclear and ten mitochondrial genes. phyl <- trim.phylo(phylo=phyl, gs.vect=data$gn_sp) # prune unused nodes and branches comp <- comp.data(phylo=phyl, df=data) @ Typically there will be some missing data (species) in certain sources that do not occur in others. A merge of these will result in NA values for some cells. The more missing cells and merges there are, the more sub-datasets will be possible, due to case-wise deletion in the process of combinatorics underlying model iteration, averaging, and selection. The above example has little if any missing data, but the examples below introduce some artificially. \section{Basic Reporting} First, for illustration purposes, we perform a simple analysis of a single model using 'pgls' directly from the \Rpackage{caper} package, then show-off the 'pgls.report' functionality of the \Rpackage{mmodely} package. ANOVA, AIC, and one-line model reports can be output via this function. <>= model <- as.formula('VC ~ mass.Kg + group.size') fit <- caper::pgls(formula=model, data=comp) summary(fit) pgls.report(comp, f=model, anova=TRUE, QC.plot=TRUE) @ \clearpage \section{Multivariate Combinatoric Iteration} The \Rpackage{mmodely} package's chief contribution is enabling approaches that utilize multi-model iteration averaging. Using a smaller subset of variables can speed up the (slower) maximum likelihood computation step and still achieve the desired result of fixed tree transformation parameters. <>= pv0 <- c("mass.Kg","arboreal","home.range","monogamy") #"swing.pct" est.mods <- get.model.combos(predictor.vars=pv0, outcome.var='VC', min.q=2) ps <- get.phylo.stats(phylo=phyl, data=data, trait.clmn='VC'); lambda <- ps$lambda$lambda ; print(lambda) PGLSi <- pgls.iter(models=est.mods, phylo=phyl, df=data, l=lambda, k='ML', d='ML') @ \clearpage \section{Tree Transformation Averaging and Re-iteration} After running PGLS on a test-subset of predictor-variable combinations using maximum likelihood, we can average the tree transformation parameters \cite{Schruth2021d}to obtain fixed values going forward. This approach can speed up computations for larger sets of modeling data and variable combinations. <>= tt.avgs <- apply(PGLSi$params, 2, mean, na.rm=TRUE) # tree transformation averages print(tt.avgs) @ Next we use the full set of variables and our tree transform averages. For demonstration, we sprinkle in some missing values to our dataset so as to artificially boost the number of sub-datasets. The subsequent fixed tree parameter itteration run should now generate more diverse output upon which the \Rpackage{mmodely} can demonstrate it unique model averaging and model selection functionality. <>= pvs <- c("mass.Kg","group.size","arboreal","monogamy","leap.pct","swing.pct") all.mods <- get.model.combos(predictor.vars=pvs, outcome.var='VC', min.q=2) # randomly sprinkle in some missing values (for more interesting for model selection) missing.value.ct <- 1 for(pv in pv0){ data[sample(x=1:nrow(data),size=missing.value.ct),pv] <- NA} PGLSi <- pgls.iter(models=all.mods, phylo=phyl, df=data, l=lambda, k=tt.avgs['k'], d=tt.avgs['d']) pgls.iter.stats(PGLSi) @ \clearpage \section{Fixed iteration run statistics} We should briefly inspect how this fixed iteration run performed and how many sub-datasets we need to investigate. It is recommended to try \Rpackage{mmodely} using 'rwGsm.' This abbreviation stands for 'raw \emph{Genus species} sums.' It represents a sum of the (concatenated) raw character values of all species constituting the underlying dataset (which has all rows with any missing data removed) for a particular combination of model predictor variables. While this default is preferred, the number of species 'n' [default] or number of model variables 'q' can also be used. <>= pgls.iter.stats(PGLSi) @ \clearpage \section{Model Averaging} Now we can estimate the predictor variable parameters by averaging over all possible fixed PGLS runs, using the AICc differences (from the lowest AICc) as weights. By default this AICw weighted average is performed per sub-dataset using 'rwGsm' or 'n' [default] as mentioned in the preceding section. While model averaging is not recommended under high multicolinearity, as denominators of regression coefficients change across models, it is possible to rescale these using 'standarize' \cite{Cade2015}. A slightly more conservative alternative to MA uses 'variable importance' which is equivalent to an AIC-weighted MA of binary indicators of presence or absence of covariate model inclusion \cite{Anderson2000}. <>= w.means.pds <- average.fit.models(vars=pvs, fits=PGLSi$fits, optims=PGLSi$optim, by='rwGsm', standardize=TRUE) # apply(w.means.pds, 2, mean, na.rm=T) #average of weighted means over all sub-datasets w.means.pds # weighted means per sub-dataset @ <>= w.import.pds <- variable.importance(vars=pvs, fits=PGLSi$fits, optims=PGLSi$optim, by='rwGsm') # apply(w.import.pds, 2, mean, na.rm=T) #average of weighted means over all sub-datasets w.import.pds # weighted means per sub-dataset @ \clearpage \section{Model Selection} We can select the best model by sorting each subset (e.g. by AICc) or by using visualization methods. <>= select.best.models(PGLSi, using='AICc') @ Plotting the coefficients of determination versus the AIC values allows selection of high-performing models for inspection and reporting. <>= plot.pgls.iters(PGLSi) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{All possible model combinations appear as individual points above. As there is a generally negative association between AIC and the coefficient of determination, the points tend to follow a negative sloping streak to the lower right. The "best" models typically appear in the lower right of each streak. Therefore, minimizing AIC tends to also maximize the coefficient of determination, but not necessarily. This four panel plot looks at correct and adjusted versions of each model assessment measure. All points are scaled by subdataset sample size by default if 'n' is used in grouping.}{} \label{plotiter:one} \end{figure} <>= sdevs.objs <- get.pgls.coefs(PGLSi$fits, est='t value') coefs.objs <- get.pgls.coefs(PGLSi$fits, est='Estimate') @ <>= report.vect <- sapply(1:length(PGLSi$fits), function(i) fit.1ln.rprt(PGLSi$fits[[i]], rtrn.line=FALSE, mn=i)) @ <>== par(mar=c(5,5,3,3)) plot.pgls.R2AIC(PGLSi$optim) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{This is a one panel verison of the previous model selection plot. The numbered points in the lower right corner of each streak of possible models represent the best model within a sub-dataset. Since these AICc values should not strictly be compared, it is not a bad idea that all "best" models selected from each sub-dataset should get reported, such as in the form of the 'sparge' plot below.}{} \label{R2AIC:one} \end{figure} \section{Coefficient Plotting} Finally, the resulting model fits from the PGLS runs can be be plotted out horizontally as distributions so the influence of each ecological predictor variable can be compared. <>= par.old <- par(mar=c(5,8,1,4),mfrow=c(2,1)) sparge.modsel(sdevs.objs, R2x=7, xlab='t value') sparge.modsel(coefs.objs, R2x=7, xlab='Estimate') @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{These horizontal parameter distribution dot-plots demonstrate how the (t-values of) coefficients from models can be simultaneously plot in order to verify consistency of estimates across the various (often missing-data driven) sub-datasets. To visually assess potential over-fitting in each model, point sizes represent underlying sample sizes and circle thickness corresponds coefficient of determination values. Note that mate choice, locomotion, and statural factors drive complex (here rhythmically syllabic) calling in primates. } \label{coefdists:one} \end{figure} \clearpage \begin{thebibliography}{Bibliography} \bibitem{Kaplan2000}Rogers, L.J. and Kaplan, G. (2000) \bibTitle{Songs, Roars, and Rituals: communication in birds, mammals, and other animals}. Harvard University Press, Cambridge. \bibitem{Pagel1999}Pagel, M. (1999) ‘Inferring the historical patterns of biological evolution’ \bibTitle{Nature}, 401, pp. 877–884. Available at: \url{https://doi.org/10.1038/44766}. \bibitem{Blomberg2003}Blomberg, S.P., Garland Jr, T., and Ives, A.R. (2003) ‘Testing for phylogenetic signal in comparative data: behavioral traits are more labile’ Evolution, 57(4) pp. 717–745. Available at: \url{https://doi.org/10.1111/j.0014-3820.2003.tb00285.x}. \bibitem{Grueber2011}Grueber, C.E., Nakagawa, .R.J., and Jamieson, I.G. (2011) ‘Multimodel inference in ecology and evolution: challenges and solutions’ \bibTitle{Journal of Evolutionary Biology}, 24, pp. 699-711. Available at: \url{https://doi.org/10.1111/j.1420-9101.2010.02210.x} \bibitem{Stephens2007}Stephens, P.A., Buskirk, S.W., and del Rio, C.M. (2007) ‘Inference in ecology and evolution’ \bibTitle{Trends in Ecology and Evolution}, 22(4). Available at: \url{https://doi.org/10.1016/j.tree.2006.12.003}. \bibitem{Symonds2011}Symonds, M.R.E., and Moussalli, A. (2011) ‘A brief guide to model selection, multimodel inference and model averaging in behavioral ecology using Akaike's information criterion’ \bibTitle{Behavioral Ecology and Sociobiology}, 65, pp. 13–21. Available at: \url{https://doi.org/10.1007/s00265-010-1037-6}. \bibitem{Johnson2004}Johnson, J.B., Omland, K.S. (2004) ‘Model selection in ecology and evolution’ \bibTitle{Trends in Ecology and Evolution}, 19(2). Available at: \url{https://doi.org/10.1016/j.tree.2003.10.013}. \bibitem{Revell2014}Revell, L.J. (2014) phytools: An R package for phylogenetic comparative biology (and other things). [CRAN]. Available at: \url{http://cran.r-project.org/package=phytools}. \bibitem{Garamszegi2011}Garamszegi L.Z. (2011) ‘Information-theoretic approaches to statistical analysis in behavioral ecology: an introduction’ \bibTitle{Behavioral Ecology and Sociobiology}, 65, pp. 1–11. Available at: \url{https://doi.org/10.1007/s00265-010-1028-7}. \bibitem{Rowe2017}Rowe, N. and Meyers, M. (2017) \bibTitle{All the World’s Primates}. Charlestown, RI: Pogonias Press. \bibitem{Anderson2000} Burnham, P.B. and Anderson, D.R. (2000) \bibTitle{‘Model Selection and Inference: A Practical Information-Theoretic Approach’} \bibitem{Cade2015} Cade, B.S. (2015) ‘Model averaging and muddled multimodel inferences’ \bibTitle{Ecology}, 96(9), pp. 2370–2382. Available at: \url{https://doi.org/abs/10.1890/14-1639.1}. \bibitem{Feuntes1998}Feuntes, A. (1998) ‘Re-evaluating primate monogamy’ \bibTitle{American Anthropology}, 100(4) pp. 890-907. \bibitem{Springer2012}Springer, M.S., et.al. (2012) ‘Re-evaluating primate monogamy’ \bibTitle{PLoS ONE} 7(11) p. e49521 \url{http://doi.org/10.1371/journal.pone.0049521}. \bibitem{Schruth2019a}Schruth, D.M. (2019a) ‘Structural acoustic features of human musicality scored on primate vocalizations’ \bibTitle{The Center for Open Science}. Available at: \url{https://doi.io/bvsfz/}. \bibitem{Schruth2019b}Schruth, D.M. (2019b) ‘Primate Locomotor Activity’ \bibTitle{The Center for Open Science}. Available at: \url{https://osf.io/cd68q/}. \bibitem{Schruth2021a}Schruth, D.M., Templeton, C.N. and Holman, D.J. (2021a) ‘On reappearance and complexity in musical calling’ \bibTitle{PLoS ONE}. Available at: \url{https://doi.org/10.1371/journal.pone.0218006}. \bibitem{Schruth2021b}Schruth, D.M. (2021b) Arboreal locomotion and trophic security at the dawn of Euprimate vision. EcoEvoRxiv. Available at: \url{http://doi.org/10.32942/osf.io/d6wk2}. \bibitem{Schruth2021c}Schruth, D.M. (2021c) ‘A global variable-permutation based approach for estimating tree transformation parameters used in phylogenetically controlled multivariate regression’ \bibTitle{Protocols.io}. Available at: \url{http://doi.org/10.17504/protocols.io.bzdhp236}. \bibitem{Schruth2021d}Schruth, D.M. (2021d) ‘Primates evolved spectrally complex calls in compensation for reduction in olfactory cognition’ \bibTitle{Proceedings of the Annual Meeting of the Cognitive Science Society} 43. Available at: \url{https://escholarship.org/uc/item/0jw446s9}. \end{thebibliography} <>= options(opt.old) par(par.old) @ \end{document}