%\VignetteIndexEntry{Qtlhot Tutorial} %\VignetteKeywords{permutation,threshold,QTL} \documentclass[11pt]{article} \usepackage{epsf} \usepackage{epsfig} \usepackage{amssymb,amsmath} \usepackage{amsbsy} \usepackage[all]{xy} \usepackage{latexsym} \usepackage{float} \usepackage{amsthm} \usepackage{graphicx} \usepackage{Sweave} \topmargin 0.0cm \oddsidemargin 0cm \evensidemargin 0cm \textwidth 16cm \textheight 21cm \date{\today} \SweaveOpts{eps=FALSE,keep.source=TRUE} \title{Quantile-based permutation thresholds for QTL hotspot analysis: a tutorial} \author{Elias Chaibub Neto\footnote{Department of Computational Biology, Sage Bionetworks, Seattle WA} \ and Brian S Yandell\footnote{Department of Statistics, University of Wisconsin-Madison, Madison WI}} \date{\today} \begin{document} \maketitle %\begin{center} %{\bf Abstract} %\end{center} \section{Motivation} QTL hotspots, groups of traits co-mapping to the same genomic location, are a common feature of genetical genomics studies. Genomic locations associated with many traits are biologically interesting since they may harbor influential regulators. Nonetheless, non-genetic mechanisms, uncontrolled environmental factors and unmeasured variables are capable of inducing a strong correlation structure among clusters of transcripts, and as a consequence, whenever a transcript shows a spurious linkage, many correlated transcripts will likely map to the same locus, creating a spurious QTL hotspot. Permutation approaches that do not take into account the phenotypic correlation tend to underestimate the size of the hotspots that might appear by change in these situations (Breitling et al. 2008). This issue motivated the development of permutation tests that preserve the correlation structure of the phenotypes in order to determine the significance of QTL hotspots (Breitling et al. 2008, Chaibub Neto et al. 2012). In this tutorial we present software tools implementing the $NL$-method (Chaibub Neto et al. 2012), the $N$-method (Breitling et al. 2008), and the $Q$-method (West et al. 2007, Wu et al. 2008) permutation approaches. \section{Overview} This tutorial illustrates the application of the $NL$-, $N$- and $Q$-methods, implemented in the \texttt{qtlhot} R package, to a few toy examples. The \texttt{qtlhot} package is built over the \texttt{R/qtl} package (Broman et al. 2003), and we assume the reader is familiar with it. \section{Basic functionality with No Real Hotspots} In this section we consider two toy simulated examples. In the first we simulate highly correlated phenotypes. In the second, we simulate uncorrelated phenotypes. <>= library(qtlhot) @ We start by simulating a ``null backcross" data set composed of 1,000 phenotypes, 4 chromosomes, 51 equally spaced genetic markers per chromosome, and 100 individuals, with the \texttt{sim.null.cross} function. The \texttt{latent.eff} parameter control the amount of correlation among the phenotypes. Each phenotype $k$ is generated according to the model $Y_k = \theta L + \epsilon_k$, where $L \sim N(0,\sigma^2)$ is a latent variable, $\theta$ represents the effect (\texttt{latent.eff}) of the latent variable on the phenotype, and $\epsilon_k \sim N(0, \sigma^2)$ represents a residual error term with $\sigma^2$ set to \texttt{res.var}. Note that we do not simulate any QTLs in a ``null cross" and any linkages we might detect in such a data set are due entirely to chance. <<>>= ncross1 <- sim.null.cross(chr.len = rep(100, 4), n.mar = 51, n.ind = 100, type = "bc", n.phe = 1000, latent.eff = 3, res.var = 1, init.seed = 123457) @ The function \texttt{include.hotspots} takes the ``null cross" as an input and includes 3 hotspots of size \texttt{hsize} at position \texttt{hpos} of chromosome \texttt{hchr} into it. Explicitly, it simulates each one of the hotspots according to the model $Y_k^\ast = \beta M + Y_k$, where $Y_k$ is the phenotype generated by the \texttt{generate.null.cross} function; $M = \gamma \, Q + \epsilon_M$ is a master regulator that affects all phenotypes in the hotspot; $Q$ is a QTL located at position \texttt{hpos} of chromosome \texttt{hchr}; $\gamma$ represents the QTL effect (\texttt{Q.eff}); $\epsilon_M \sim N(0, \sigma^2)$; and $\beta$ is computed such that the association between $Y_k^\ast$ and Q, measured by the LOD score, is given (theoretically) by a valued sampled from the user specified LOD score range (\texttt{lod.range.1} and etc). <<>>= cross1 <- include.hotspots(cross = ncross1, hchr = c(2, 3, 4), hpos = c(25, 75, 50), hsize = c(100, 50, 20), Q.eff = 2, latent.eff = 3, lod.range.1 = c(2.5, 2.5), lod.range.2 = c(5, 8), lod.range.3 = c(10, 15), res.var = 1, n.phe = 1000, init.seed = 12345) @ Note that by choosing \texttt{latent.eff = 3} we generate highly correlated phenotype data. The distribution of the correlation values for each pair of phenotypes is given below. <<>>= ncor1 <- cor(cross1$pheno) summary(ncor1[lower.tri(ncor1)]) rm(ncor1) @ <>= if(file.exists("savedperms.RData")) load("savedperms.RData") @ Next we obtain standard permutation thresholds (Churchill and Doerge 1994) for single trait QTL mapping analysis for the sequence \texttt{alphas}, representing target genome wide error rates (GWER). <>= if(!exists("pt")) { set.seed(123) pt <- scanone(ncross1, method = "hk", n.perm = 1000) } @ \begin{Schunk} \begin{Sinput} > set.seed(123) > pt <- scanone(ncross1, method = "hk", n.perm = 1000) \end{Sinput} \end{Schunk} <<>>= alphas <- seq(0.01, 0.10, by=0.01) lod.thrs <- summary(pt, alphas) lod.thrs lod.thr <- lod.thrs[5] @ We perform QTL mapping analysis for 1,000 phenotypes using Haley-Knott regression, and keep only the \texttt{drop.lod} = 1.5 LOD support interval (Manichaikul et al. 2006) around significant peaks (above \texttt{lod.thr}) at each chromosome for each trait. LOD support intervals are the most commonly used interval estimate for the location of a QTL. <<>>= scan1 <- qtl::scanone(cross1, pheno.col = 1:1000, method = "hk") @ The routine \texttt{highlod} just saves the LOD support intervals for significant peaks, which dramatically reduces the object size from the result of \texttt{scanone}. We can examine the maximum hotspot size for each possible single trait threshold using the \texttt{max} method for \texttt{highlod} objects. The first column gives the counts associated with QTL mapping threshold of \Sexpr{round(lod.thrs[1], 2)}, whereas the last one shows the counts based on the more liberal threshold \Sexpr{round(lod.thrs[10], 2)}. Note how the counts increase as the QTL mapping thresholds decrease. <<>>= high1 <- highlod(scan1, lod.thr = min(lod.thrs), drop.lod = 1.5) max(high1, lod.thr = lod.thrs) @ Next we infer the hotspot architecture at varying QTL mapping thresholds. In other words, for each genomic position, we count the number of traits that map to it with a LOD score equal or higher than the threshold in \texttt{lod.thr}. The \texttt{hots1} object is a \texttt{scanone} object with added attributes and specialized summary and plot methods. The summary shows the counts for the most significant loci per chromosome, essentially using the \texttt{summary} method for \texttt{scanone} objects. <<>>= hots1 <- hotsize(high1, lod.thr = lod.thr) summary(hots1) @ \begin{figure}[!ht] \begin{center} <>= par(mar=c(4.1,4.1,0.5,0.1)) plot(hots1, cex.lab = 1.5, cex.axis = 1.5) @ \end{center} \caption{Hotspot architecture associated with QTL mapping threshold of \Sexpr{round(lod.thr, 2)} in example 1.} \label{plotex1counts} \end{figure} We plot the hotspot architecture inferred using the single trait permutation threshold \Sexpr{round(lod.thr, 2)} ($\alpha = 0.05$). Figure \ref{plotex1counts} shows the counts across the genome. Recall that in the call of function \texttt{include.hotspots} we set to simulate 3 hotspots: (1) a hotspot of size 100 at position 25cM of chromosome 2 with LOD scores around 2.5; (2) a hotspot of size 50 at position 75cM of chromosome 3 with LOD scores ranging from 5 to 8; and (3) a hotspot of size 20 at position 50cM of chromosome 4 with LOD scores ranging from 10 to 15. Nonetheless, Figure \ref{plotex1counts} shows several spurious peaks on chromosome 1, that arise because of the high correlation of the phenotypes. \begin{Schunk} \begin{Sinput} > plot(hots1, cex.lab = 1.5, cex.axis = 1.5) \end{Sinput} \end{Schunk} Next, we perform permutation tests to assess the statistical significance of the hotspots detected on Figure \ref{plotex1counts}. We consider the $N$- and $NL$-methods. [The $Q$ method of West and Wu is documented in the \texttt{ww.perm} manual page but not shown here.] The \texttt{hotperm} function implements the $N$- and $NL$-methods' permutation schemes (see Chaibub Neto et al. 2012, for details). The parameter \texttt{n.quant} sets the maximum hotspot size to be analyzed by the $NL$-method. The parameter \texttt{drop.lod} controls the magnitude of the LOD support interval computation during the LOD profile processing step. The function's output is a list with two elements: \texttt{max.lod.quant} and \texttt{max.N}. <>= if(!exists("hotperm1")) { set.seed(12345) hotperm1 <- hotperm(cross = cross1, n.quant = 300, n.perm = 100, lod.thrs = lod.thrs, alpha.levels = alphas, drop.lod = 1.5, verbose = FALSE) } @ \begin{Schunk} \begin{Sinput} > set.seed(12345) > hotperm1 <- hotperm(cross = cross1, + n.quant = 300, + n.perm = 100, + lod.thrs = lod.thrs, + alpha.levels = alphas, + drop.lod = 1.5, + verbose = FALSE) \end{Sinput} \end{Schunk} <<>>= names(hotperm1) summary(hotperm1) @ The \texttt{max.N} element of the \texttt{hotperm1} object stores the output of the $N$-method's permutations and is given by a matrix with 100 rows representing the permutations, and 10 columns representing the QTL mapping thresholds. Entry $ij$ stores the maximum genome wide hotspot size detected at permutation $i$ using the QTL mapping threshold $j$. The \texttt{max.N} element of the \texttt{summary} is a 10 by 10 matrix with rows indexing the QTL mapping thresholds and columns indexing the target genome wide error rates. Each entry $ij$ shows the hotspot size above which a hotspot is considered significant at a GWER $j$ using the QTL mapping threshold $i$. As before, our interest focus on the diagonal, and the $N$-method's threshold that controls the hotspot GWER at a 5\% level when the QTL mapping was controlled at a GWER of 5\% (single trait LOD threshold of \Sexpr{round(lod.thr, 2)}) is \Sexpr{ceiling(summary(hotperm1)$max.N[5,5])} traits per hotspot (rounded up). Note that according to the $N$-method, none of the hotspots on Figure \ref{plotex1counts} is significant. The \texttt{max.lod.quant} element of the \texttt{hotperm1} object stores the output of the $NL$-method's permutations and is given by a matrix with 100 rows representing the permutations, and 300 columns representing the hotspot sizes analyzed. Entry $ij$ stores the maximum genome wide $qLOD(n)$ value--the $n$th LOD score in a sample ordered from highest to lowest--computed at permutation $i$ using the QTL mapping threshold $j$. The \texttt{max.lod.quant} element of the \texttt{summary} shows just the extremes and quartiles. The \texttt{max.lod.quant} element of the \texttt{quantile} contains the sliding LOD quantiles from 1 up to 300 for each single trait LOD threshold. This can be used to plot how many traits pass the sliding LOD threshold, and are significant, at each locus in the genome. Figure \ref{plotex1slidingbar} shows the hotspot significance profile for the thresholds targeting GWER at a 5\% level. Note that we only consider LODs above \texttt{lod.thr}, and chromosome 1 has none. <>= par(mar=c(4.1,4.1,0.5,4.1)) quant1 <- quantile(hotperm1, 0.05, lod.thr = lod.thr) plot(high1, quant.level = quant1, sliding = TRUE) @ \begin{figure}[!ht] \begin{center} <>= <> @ \caption{Hotspot size significance profile targeting GWER at a 5\% level for example 1.} \label{plotex1slidingbar} \end{center} \end{figure} \begin{Schunk} \begin{Sinput} > quant1 <- quantile(hotperm1, 0.05, lod.thr = lod.thr) > plot(high1, quant.level = quant1, sliding = TRUE) \end{Sinput} \end{Schunk} <>= hotsq1 <- hotsize(high1, lod = lod.thr, window = 5, quant.level = quant1) quant.axis <- pmax(1, pretty(c(0, qtl:::summary.scanone(hotsq1)[3,"quant"]))) quant.level <- round(attr(hotsq1, "quant.level")[quant.axis], 2) @ <>= par(mar=c(4.1,4.1,0.5,4.1)) plot(hotsq1) @ \begin{figure}[!ht] \begin{center} <>= <> @ \caption{Hotspot size significance profile at 5\% level for example 1. Black is raw counts, red is counts smoothed over a 5cM window, blue is the sliding LOD threshold approach.} \label{plotex1elias} \end{center} \end{figure} Figure \ref{plotex1slidingbar} depicts a sliding window of hotspot size thresholds ranging from $n=1,\ldots,N$, where $N=100$ corresponds to the (approximate) hotspot size threshold derived from the $N$-method. For each genomic location this figure shows the hotspot sizes at which the hotspot was significant, that is, at which the hotspot locus had more traits than the hotspot size threshold on the left mapping to it with a LOD score higher than the threshold on the right than expected by chance. For example, the hotspot on chromosome 3 was significant at all sizes from 1 to \Sexpr{max(quant.axis)}, meaning that at least \Sexpr{quant.axis[1]} trait mapped to the hotspot locus with LOD higher than \Sexpr{quant.level[1]}, at least \Sexpr{quant.axis[3]} traits mapped to the hotspot locus with LOD higher than \Sexpr{quant.level[3]}, and so on up to hotspot size \Sexpr{max(quant.axis)}, all with LODs higher than \Sexpr{min(quant.level)}. The $N$-method that did not detect any hotspots, but the $NL$-method's sliding window correctly detected the simulated hotspots and showed that the apparent hotspots on chromosome 1 were noisy artifacts. Figure \ref{plotex1elias} is another way to visualize this, using the sliding LOD threshold (blue cureve). We also show a 5-cM smoothing window applied to the counts (red curve). Note that with the sliding LOD threshold, no hotspots (or individual traits) are significant on chr 1, and hotspots of size 20 are found on chr 2 and 4, with a significant hotspot of size \Sexpr{max(quant.axis)} on chr 3. \begin{Schunk} \begin{Sinput} > hotsq1 <- hotsize(high1, lod = lod.thr, window = 5, quant.level = quant1) > plot(hotsq1) \end{Sinput} \end{Schunk} <<>>= summary(hotsq1) @ \section{Example with Uncorrelated Phenotypes} Next we consider a second toy example with uncorrelated phenotype data. We repeat the simulation and analysis steps presented previously changing \texttt{latent.eff} to zero. <>= ncross2 <- sim.null.cross(chr.len = rep(100,4), n.mar = 51, n.ind = 100, type = "bc", n.phe = 1000, latent.eff = 0, res.var = 1, init.seed = 123457) cross2 <- include.hotspots(cross = ncross2, hchr = c(2, 3, 4), hpos = c(25, 75, 50), hsize = c(100, 50, 20), Q.eff = 2, latent.eff = 0, lod.range.1 = c(2.5, 2.5), lod.range.2 = c(5, 8), lod.range.3 = c(10, 15), res.var = 1, n.phe = 1000, init.seed = 12345) @ <>= scan2 <- scanone(cross2, pheno.col = 1:1000, method = "hk") high2 <- highlod(scan2, lod.thr = lod.thr, drop.lod = 1.5) hots2 <- hotsize(high2) par(mar=c(4.1,4.1,0.1,0.1)) plot(hots2, cex.lab = 1.5, cex.axis = 1.5) @ \begin{figure}[!ht] \begin{center} <>= <> @ \caption{Hotspot architecture targeting 5\% GWER for example 2.} \label{plotex2counts} \end{center} \end{figure} \begin{Schunk} \begin{Sinput} > ncross2 <- sim.null.cross(chr.len = rep(100,4), + n.mar = 51, + n.ind = 100, + type = "bc", + n.phe = 1000, + latent.eff = 0, + res.var = 1, + init.seed = 123457) > cross2 <- include.hotspots(cross = ncross2, + hchr = c(2, 3, 4), + hpos = c(25, 75, 50), + hsize = c(100, 50, 20), + Q.eff = 2, + latent.eff = 0, + lod.range.1 = c(2.5, 2.5), + lod.range.2 = c(5, 8), + lod.range.3 = c(10, 15), + res.var = 1, + n.phe = 1000, + init.seed = 12345) \end{Sinput} \end{Schunk} <<>>= ncor2 <- cor(cross2$pheno) summary(ncor2[lower.tri(ncor2)]) rm(ncor2) @ \begin{Schunk} \begin{Sinput} > scan2 <- scanone(cross2, pheno.col = 1:1000, method = "hk") > high2 <- highlod(scan2, lod.thr = lod.thr, drop.lod = 1.5) > hots2 <- hotsize(high2) > plot(hots2, cex.lab = 1.5, cex.axis = 1.5) \end{Sinput} \end{Schunk} <>= if(!exists("hotperm2")) { set.seed(12345) hotperm2 <- hotperm(cross = cross2, n.quant = 300, n.perm = 100, lod.thrs = lod.thrs, alpha.levels = alphas, drop.lod = 1.5, verbose = FALSE) } @ <>= quant2 <- quantile(hotperm2, 0.05, lod.thr = lod.thr) @ <>= par(mar=c(4.1,4.1,0.5,0.1)) plot(high2, lod.thr = lod.thr, quant.level = quant2, sliding = TRUE) @ \begin{figure}[!ht] \begin{center} \includegraphics{qtlhot-plotex2slidingbar} \caption{Hotspot significance profile targeting 5\% GWER for example 2.} \label{plotex2slidingbar} \end{center} \end{figure} <>= par(mar=c(4.1,4.1,0.5,0.1)) plot(high2, quant.level = quant2) @ \begin{figure}[!ht] \begin{center} \includegraphics{qtlhot-plotex2sigct} \caption{Hotspot significance scan targeting 5\% GWER for example 2.} \label{plotex2sigct} \end{center} \end{figure} \begin{Schunk} \begin{Sinput} > set.seed(12345) > hotperm2 <- hotperm(cross = cross2, + n.quant = 300, + n.perm = 100, + lod.thrs = lod.thrs, + alpha.levels = alphas, + drop.lod = 1.5, + verbose = FALSE) > quant2 <- quantile(hotperm2, 0.05, lod.thr = lod.thr) \end{Sinput} \end{Schunk} <>= if(!file.exists("savedperms.RData")) save(pt, hotperm1, hotperm2, file = "savedperms.RData", compress = TRUE) @ The $N$-method, as expected, gave rise to much smaller thresholds in this second example with uncorrelated phenotypes. Additionally, inspection of Figure \ref{plotex2counts} shows no spurious hotspots on chromosome 1. \begin{Schunk} \begin{Sinput} > plot(high2, lod.thr = lod.thr, quant.level = quant2, sliding = TRUE) \end{Sinput} \end{Schunk} Figure \ref{plotex2slidingbar} presents the hotspot significance profile targeting 5\% GWER. For this second example, all methods correctly detected the simulated hotspots. \begin{Schunk} \begin{Sinput} > plot(high2, quant.level = quant2) \end{Sinput} \end{Schunk} Figure \ref{plotex2sigct} shows another way to represent significant hotspots. We overlay the largest significant hotspot counts using the sliding quantiles in red on top of the curve on Figure \ref{plotex2counts}. Notice that the large sizes are all significant, but only small sizes corresponding to larger LOD scores are significant. We add a right axis with the sliding LOD thresholds. \section{References} \begin{enumerate} \item Breitling R., Y. Li, B. M. Tesson, J. Fu, C. Wu, et al., 2008 Genetical genomics: spotlight on QTL hotspots. PLoS Genetics {\bf 4:} e1000232. \item Broman K. W., W. Wu, S. Sen, G. A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics {\bf 19}: 889-890. \item Chaibub Neto et al., 2012 Quantile-based permutation thresholds for QTL hotspots. Genetics (under review). \item Churchill G. A., and R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics {\bf 138}: 963-971. \item Manichaikul A., J. Dupuis, S. Sen, and K. W. Broman, 2006 Poor performance of bootstrap confidence intervals for the location of a quantitative trait locus. Genetics {\bf 174:} 481-489. \item West M. A. L., K. Kim, D. J. Kliebenstein, H. van Leeuwen, R. W. Michelmore, R. W. Doerge, D. A. St. Clair 2007 Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics {\bf 175:} 1441-1450. \item Wu C., D. L. Delano, N. Mitro, S. V. Su, J. Janes, et al. 2008 Gene set enrichment in eQTL data identifies novel annotations and pathway regulators. PLoS Genetics {\bf 4:} e1000070. \end{enumerate} \end{document}