\documentclass{article} \usepackage[numbers]{natbib} \newcommand{\bibTitle}[1]{\emph{#1}} \usepackage{graphics} \usepackage{amsmath} \usepackage{indentfirst} \usepackage[utf8]{inputenc} \usepackage{hyperref} \usepackage[margin=0.5in]{geometry} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} %\VignetteIndexEntry{Information-theoretic PGLS model selection workflow: a primate vision example} \begin{document} \setkeys{Gin}{width=1.1\textwidth} <>= opt.old <- options(keep.source = TRUE, width = 95) foo <- packageDescription("mmodely") @ \title{ Ecological factors influencing primate vision conformation:\\ a phylogenetic regression workflow for the \\ \Rpackage{mmodely} R-package (Version \Sexpr{foo$Version})\\ } \author{ David M. Schruth \\ \texttt{dschruth@anthropoidea.org}\\ \\ } \maketitle \section{Introduction} Traits evolve in the context of historically and ecologically complex arrangements that can present difficulty for researchers attempting to uncover causal relationships \cite{Pagel1999} or draw inference to larger populations \cite{Stephens2007}. Primates represent a manageably diverse clade of mammals who exhibit a wide range of behaviors and morphology conducive to revealing evolutionary processes. Ideas on the origins on primates range from predation detection \cite{Isbell2006} or deterrence \cite{Schruth2021a} to targeting in hunting \cite{Cartmill1972} or locomotion \cite{Szalay1988}. Here, using the \Rpackage{mmodely} package on primate data for locomotion \cite{Schruth2019} and vision \cite{Wheeler2011}, I demonstrate how the origins of primate cranial morphology can be elucidated via several ecological variables from numerous datasets \cite{Rowe2017}. Model averaging [MA] \cite{Symonds2011} and model selection [MS] \cite{Johnson2004} results primarily highlight arboreal locomotor targeting and trophic security \cite{Schruth2021a} variables (such as stature or group size) as playing key roles in determining convergence of primate orbits. The \Rpackage{mmodely} package enables implementation of a combination of phylogenetic controlled regression \cite{Revell2014} and information theoretic \cite{Garamszegi2011} (MA and MS) examination to simultaneously compare (weighted) predictor coefficients across the numerous sub-datasets generated during the exploration of all possible model combinations. \section{Licensing} The \Rpackage{mmodely} package is licensed under the Apache License v2.0: it is therefore free to use and redistribute, however, we, the copyright holders, wish to maintain primary control over any further development. Please be sure to cite \Rpackage{mmodely} if you use the package in presentations or work leading to publication. \section{Installation} This package largely depends upon the \Rpackage{caper} package, but most functions do not require any particular library. It is recommended that you have \Rpackage{caper}, \Rpackage{ape}, and the \Rpackage{caroline} package installed as a minimum. <>= # wget https://cran.r-project.org/src/contrib/Archive/caroline/caroline_0.8.0.tar.gz # wget https://cran.r-project.org/src/contrib/Archive/caper/caper_0.5.tar.gz # wget https://cran.r-project.org/src/contrib/Archive/ape/ape_3.0-5.tar.gz # R CMD INSTALL caroline_0.8.0.tar.gz # R CMD INSTALL caper_0.5.tar.gz # R CMD INSTALL ape_3.0-5.tar.gz @ Building the \Rpackage{mmodely} package from source requires that you have the proper dependency packages, \Rpackage{caroline}, installed from CRAN. This can typically be accomplished via the following commands from within the R command line environment: \begin{verbatim} install.packages(c('caroline','ape','caper')) \end{verbatim} After a successful installation the \Rpackage{mmodely} package can be loaded in the normal way: by starting R and invoking the following \Rfunction{library} command: <>= library(caper) library(mmodely) @ \section{Reading in Data} Read in the tree \cite{Springer2012} and datasets then merge them together. <>= data.path <- system.file("extdata","primate-example.data.csv", package="mmodely") data <- read.csv(data.path, row.names=1) data$gn_sp <- rownames(data) # merge data sets here if applicable tree.path <- system.file("extdata","primate-springer.2012.tre", package="mmodely") phyl <- ape::read.tree(tree.path)[[5]] #5. RAxML phylogram based on the 61199 bp concatenation of 69 nuclear and ten mitochondrial genes. phyl <- trim.phylo(phylo=phyl, gs.vect=data$gn_sp) # prune unused nodes and branches comp <- comp.data(phylo=phyl, df=data) @ Typically there will be some missing data (species) in certain sources that do not occur in others. A merge of these will result in NA values for some cells. The more missing cells and merges there are, the more sub-datasets will be possible, due to case-wise deletion in the process of combinatorics underlying model iteration, averaging, and selection. The above example has little if any missing data, but the examples below introduce some artificially. \section{Basic Reporting} First, for illustration purposes, we perform a simple analysis of a single model using 'pgls' directly from the \Rpackage{caper} package, then show-off the 'pgls.report' functionality of the \Rpackage{mmodely} package. ANOVA, AIC, and one-line model reports can be output via this function. <>= model <- as.formula('OC ~ mass.Kg + group.size') fit <- caper::pgls(formula=model, data=comp) summary(fit) pgls.report(comp, f=model, anova=TRUE, QC.plot=TRUE) @ \clearpage \section{Multivariate Combinatoric Iteration} The \Rpackage{mmodely} package's chief contribution is enabling approaches that utilize multi-model iteration averaging. Using a smaller subset of variables can speed up the (slower) maximum likelihood computation step and still achieve the desired result of fixed tree transformation parameters. <>= pv0 <- c("mass.Kg","group.size","arboreal","nocturnal") #"swing.pct" est.mods <- get.model.combos(predictor.vars=pv0, outcome.var='OC', min.q=2) ps <- get.phylo.stats(phylo=phyl, data=data, trait.clmn='OC'); lambda <- ps$lambda$lambda ; print(lambda) PGLSi <- pgls.iter(models=est.mods, phylo=phyl, df=data, l=lambda, k='ML', d='ML') pgls.iter.stats(PGLSi) # check run, especially to see how few sub-datasets exist @ \clearpage \section{Tree Transformation Averaging and Re-iteration} After running PGLS on a test-subset of predictor-variable combinations using maximum likelihood, we can average the tree transformation parameters \cite{Schruth2021b} to obtain fixed values going forward. This approach can speed up computations for larger sets of modeling data and variable combinations. <>= tt.avgs <- apply(PGLSi$params, 2, mean, na.rm=TRUE) # tree transformation averages print(tt.avgs) @ Next we use the full set of variables and our tree transform averages. For demonstration, we sprinkle in some missing values to our dataset so as to artificially boost the number of sub-datasets. The subsequent fixed tree parameter itteration run should now generate more diverse output upon which the \Rpackage{mmodely} can demonstrate it unique model averaging and model selection functionality. <>= pvs <- c("mass.Kg","group.size","infant.carry","arboreal","DPL.km","nocturnal") all.mods <- get.model.combos(predictor.vars=pvs, outcome.var='OC', min.q=2) data <- subset(data,!grepl(rownames(data),pattern='gorilla')) # remove an OC measurement outlier # randomly sprinkle in some missing values (for more interesting for model selection) missing.value.ct <- 1 for(pv in pv0){ data[sample(x=1:nrow(data),size=missing.value.ct),pv] <- NA} PGLSi <- pgls.iter(models=all.mods, phylo=phyl, df=data, l=lambda, k=tt.avgs['k'], d=tt.avgs['d']) @ \section{Fixed iteration run statistics} We should briefly inspect how this fixed iteration run performed and how many sub-datasets we need to investigate. It is recommended to try \Rpackage{mmodely} using 'rwGsm.' This abbreviation stands for 'raw \emph{Genus species} sums.' It represents a sum of the (concatenated) raw character values of all species constituting the underlying dataset (which has all rows with any missing data removed) for a particular combination of model predictor variables. While this default is preferred, the number of species 'n' [default] or number of model variables 'q' can also be used. <>= pgls.iter.stats(PGLSi) @ \section{Model Averaging} Now we can estimate the predictor variable parameters by averaging over all possible fixed PGLS runs, using the AICc differences (from the lowest AICc) as weights. By default this AICw weighted average is performed per sub-dataset using 'rwGsm' or 'n' [default] as mentioned in the preceding section. While model averaging is not recommended under high multicolinearity, as denominators of regression coefficients change across models, it is possible to rescale these using 'standarize' \cite{Cade2015}. A slightly more conservative alternative to MA uses 'model importance' which is equivalent to an AIC-weighted MA of binary indicators of presence or absence of covariate model inclusion \cite{Anderson2000}. <>= w.means.pds <- average.fit.models(vars=pvs, fits=PGLSi$fits, optims=PGLSi$optim, by='rwGsm', standardize=TRUE) # apply(w.means.pds, 2, mean, na.rm=T) #average of weighted means over all sub-datasets w.means.pds # weighted means per sub-dataset @ <>= w.import.pds <- variable.importance(vars=pvs, fits=PGLSi$fits, optims=PGLSi$optim, by='rwGsm') # apply(w.import.pds, 2, mean, na.rm=T) #average of weighted means over all sub-datasets w.import.pds # weighted means per sub-dataset @ \clearpage \section{Model Selection} We can select the best model by sorting each subset (e.g. by AICc) or by using visualization methods. <>= select.best.models(PGLSi, using='AICc') @ Plotting the coefficients of determination versus the AIC values allows selection of high-performing models for inspection and reporting. <>= plot.pgls.iters(PGLSi) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{All possible model combinations appear as individual points above. As there is a generally negative association between AIC and the coefficient of determination, the points tend to follow a negative sloping streak to the lower right. The "best" models typically appear in the lower right of each streak. Therefore, minimizing AIC tends to also maximize the coefficient of determination, but not necessarily. This four panel plot looks at correct and adjusted versions of each model assessment measure. All points are scaled by subdataset sample size by default if 'n' is used in grouping.}{} \label{plotiter:one} \end{figure} <>= sdevs.objs <- get.pgls.coefs(PGLSi$fits, est='t value') coefs.objs <- get.pgls.coefs(PGLSi$fits, est='Estimate') @ <>= report.vect <- sapply(1:length(PGLSi$fits), function(i) fit.1ln.rprt(PGLSi$fits[[i]], rtrn.line=FALSE, mn=i)) @ <>== par(mar=c(5,5,3,3)) plot.pgls.R2AIC(PGLSi$optim) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{This is a one panel verison of the previous model selection plot. The numbered points in the lower right corner of each streak of possible models represent the best model within a sub-dataset. Since these AICc values should not strictly be compared, it is recommended that all "best" models selected from each sub-dataset should be inspected or reported somehow, such as in the form of the sparge plot below.}{} \label{R2AIC:one} \end{figure} \section{Coefficient Plotting} Finally, the resulting model fits from the PGLS runs can be be plotted out horizontally as distributions so the influence of each ecological predictor variable can be compared. <>= par.old <- par(mar=c(5,8,1,4),mfrow=c(2,1)) sparge.modsel(sdevs.objs, R2x=7, xlab='t value') sparge.modsel(coefs.objs, R2x=7, xlab='Estimate') @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{These horizontal parameter distribution dots plots demonstrate how the (t-values of) coefficients from all models can be simultaneously plot in order to verify consistency of estimates across the various (often missing-data driven) sub-datasets. To visually assess potential over-fitting of each model, point sizes represent underlying sample sizes and circle thickness corresponds to coefficient of determination values. Note that arboreal locomotor targeting and staturally protective (e.g. infant fur-cling carrying) factors play key roles driving orbital convergence.} \label{coefdists:one} \end{figure} \clearpage \begin{thebibliography}{Bibliography} \bibitem{Isbell2006} Isbell, L. (2006) ‘Snakes as agents of evolutionary change in primate brains’ \bibTitle{Journal of Human Evolution}, 51(1) pp. 1–35. \bibitem{Cartmill1972} Cartmill, M. (1972) ‘Arboreal Adaptations and the Origin of the Order Primates’ In \bibTitle{The Function and Evolutionary Biology of Primates}, pp. 97–122, Aldine-Atherton. \bibitem{Szalay1988} Szalay, Frederick S., and Marian Dagosto. ‘Evolution of Hallucial Grasping in the Primates’ \bibTitle{Journal of Human Evolution}, 17(1–2) pp. 1–33. \url{https://doi.org/10.1016/0047-2484(88)90047-4}. \bibitem{Wheeler2011} Wheeler, B.C., B.J. Bradely, and J.M. Kamilar. (2011) ‘Predictors of Orbital Convergence in Primates: A Test of the Snake Detection Hypothesis of Primate Evolution’ \bibTitle{Journal of Human Evolution}, 61, pp. 233–42. \bibitem{Pagel1999}Pagel, M. (1999) ‘Inferring the historical patterns of biological evolution’ \bibTitle{Nature}, 401, pp. 877–884. Available at: \url{https://doi.org/10.1038/44766}. \bibitem{Grueber2011}Grueber, C.E., Nakagawa, .R.J., and Jamieson, I.G. (2011) ‘Multimodel inference in ecology and evolution: challenges and solutions’ \bibTitle{Journal of Evolutionary Biology}, 24, pp. 699-711. Available at: \url{https://doi.org/10.1111/j.1420-9101.2010.02210.x} \bibitem{Stephens2007}Stephens, P.A., Buskirk, S.W., and del Rio, C.M. (2007) ‘Inference in ecology and evolution’ \bibTitle{Trends in Ecology and Evolution}, 22(4). Available at: \url{https://doi.org/10.1016/j.tree.2006.12.003}. \bibitem{Symonds2011}Symonds, M.R.E., and Moussalli, A. (2011) ‘A brief guide to model selection, multimodel inference and model averaging in behavioral ecology using Akaike's information criterion’ \bibTitle{Behavioral Ecology and Sociobiology}, 65, pp. 13–21. Available at: \url{https://doi.org/10.1007/s00265-010-1037-6}. \bibitem{Johnson2004}Johnson, J.B., Omland, K.S. (2004) ‘Model selection in ecology and evolution’ \bibTitle{Trends in Ecology and Evolution}, 19(2). Available at: \url{https://doi.org/10.1016/j.tree.2003.10.013}. \bibitem{Revell2014}Revell, L.J. (2014) phytools: An R package for phylogenetic comparative biology (and other things). [CRAN]. Available at: \url{http://cran.r-project.org/package=phytools}. \bibitem{Garamszegi2011}Garamszegi L.Z. (2011) ‘Information-theoretic approaches to statistical analysis in behavioral ecology: an introduction’ \bibTitle{Behavioral Ecology and Sociobiology}, 65, pp. 1–11. Available at: \url{https://doi.org/10.1007/s00265-010-1028-7}. \bibitem{Anderson2000}Burnham, P.B. and Anderson, D.R. (2000) \bibTitle{‘Model Selection and Inference: A Practical Information-Theoretic Approach’} \bibitem{Cade2015}Cade, B.S. (2015) ‘Model averaging and muddled multimodel inferences’ \bibTitle{Ecology}, 96(9), pp. 2370–2382. Available at: \url{https://doi.org/10.1890/14-1639.1}. \bibitem{Rowe2017}Rowe, N. and Meyers, M. (2017) \bibTitle{All the World’s Primates}. Charlestown, RI: Pogonias Press. \bibitem{Springer2012}Springer, M.S., et.al. (2012) ‘Re-evaluating primate monogamy’ \bibTitle{PLoS ONE} 7(11) p. e49521 \url{http://doi.org/10.1371/journal.pone.0049521}. \bibitem{Schruth2019}Schruth, D.M. (2019) ‘Primate Locomotor Activity’ \bibTitle{The Center for Open Science}. Available at: \url{https://osf.io/cd68q/}. \bibitem{Schruth2021a}Schruth, D.M. (2021a) Arboreal locomotion and trophic security at the dawn of Euprimate vision. EcoEvoRxiv. Available at: \url{http://doi.org/10.32942/osf.io/d6wk2}. \bibitem{Schruth2021b}Schruth, D.M. (2021b) ‘A global variable-permutation based approach for estimating tree transformation parameters used in phylogenetically controlled multivariate regression’ \bibTitle{Protocols.io}. Available at: \url{http://doi.org/10.17504/protocols.io.bzdhp236}. \bibitem{Ross2001}Ross, C. (2001) ‘Park or Ride? Evolution of Infant Carrying in Primates’ \bibTitle{International Journal of Primatology}, 22(5), pp. 749–771. Available at: \url{https://doi.org/10.1023/A:1012065332758}. \end{thebibliography} <>= options(opt.old) par(par.old) @ \end{document}