%\VignetteIndexEntry{Using jetset}
%\VignetteKeywords{annotation, Affymetrix, hgu95av2, hgu133a, hgu133plus2, u133x3p}
%\VignetteDepends{org.Hs.eg.db}
%\VignettePackage{jetset}

\documentclass[10pt,a4paper]{article}

% \usepackage{a4wide}

% Replacement for now-defunct a4wide?
\usepackage{geometry}
\geometry{
  includeheadfoot,
  margin=2.54cm
}


% Change the Sweave path if the TeX engine does not find the file Sweave.sty
\usepackage{Sweave}
\usepackage{hyperref}

\begin{document}

\title{Jetset: selecting an optimal Affymetrix probe set to represent a gene}
\author{Qiyuan Li, Aron C Eklund}
\date{\today}

\maketitle


\tableofcontents

\section{Introduction}

On Affymetrix gene expression microarrays, a given gene may be detected by multiple probe sets which may deliver inconsistent or even contradictory measurements. Therefore, obtaining an unambiguous expression estimate of a pre-specified gene can be nontrivial. We developed scoring methods to assess each probe set for specificity, coverage, and degradation resistance. We used these scores to select the optimal probe set for each gene, and thus create a simple one-to-one mapping between gene and probe set. 

\texttt{jetset} is a package enabling the selection of optimal probe sets from the HG-U95Av2, HG-U133A, HG-U133 Plus 2.0, or U133 X3P microarray platforms.

<<>>=
  library(jetset)
@


\newpage
\section{Contents of the package}

The \texttt{jetset} package contains the following objects:

<<>>=
  ls("package:jetset")
@

The functions \texttt{jmap} and \texttt{jscores} are the intended user-level interface, and \texttt{scores.*} are data sets that support these functions.


\subsection{\texttt{jmap}}

\texttt{jmap} is a function that returns the best probe sets matching a list of Entrez GeneIDs, gene symbols, or gene aliases.


\subsection{\texttt{jscores}}

\texttt{jscores} is a function that returns the jetset scores for all probe sets matching a list of Entrez GeneIDs, gene symbols, aliases, or ensembl IDs.


\subsection{\texttt{scores.hgu95av2}}

\texttt{scores.hgu95av2} is a data frame with Entrez IDs and pre-calculated quality control scores for each probe set ID. All scores range from 0 to 1, and a higher score indicates better (predicted) performance.

<<>>=
  head(scores.hgu95av2)
@

\begin{itemize}

\item The Entrez GeneID (\emph{EntrezID}) is a unique gene identifier.  Note: as in other Bioconductor packages, the GeneID is stored as type \emph{character}.

\item The processivity requirement (\emph{process}) is the number of consecutive bases that must be synthesized to generate a target that can be detected by the probe set.

\item The \emph{specificity} score is the fraction of the probes in a probe set that are likely to detect the targeted gene and unlikely to detect other genes.

\item The \emph{coverage} score is the fraction of the splice isoforms belonging to the targeted gene that are detected by the probe set.

\end{itemize}

Note that the robustness score and overall score are not stored in this data; instead these scores are calculated on-the-fly when the score data is retrieved using \texttt{jscores}.

\begin{itemize}

\item The robustness score (\emph{robust}) is intended to quantify robustness against transcript degradation.  The robustness score uses the processivity requirement to estimate the signal intensity of a probe set, relative to the ideal case of perfect processivity.

\item The \emph{overall} score is the product of the specificity score, coverage score, and robustness score.

\end{itemize}


\newpage
\section{Probe set selection}

A typical application of the jetset packages is to identify probe sets corresponding to a published list of genes.

\subsection{Selecting probe sets by Entrez Gene ID}

The Entrez GeneID is an unambiguous way to specify a gene; however the numeric ID is not particularly descriptive, which may explain why this is not commonly provided in publications.

<<>>==
  jmap('hgu95av2', eg = "2099")
  jmap('hgu133a', eg = "2099")
  jmap('hgu133plus2', eg = "2099")
  jmap('u133x3p', eg = "2099")
@

Entrez GeneID 2099 corresponds to the estrogen receptor (ESR1) gene, which is an important indicator of breast cancer phenotype.


\subsection{Selecting probe sets by ensembl ID}

The ensembl ID is another unambiguous way to specify a gene.

<<>>==
  jmap('hgu95av2', ensembl = "ENSG00000091831")
  jmap('hgu133a', ensembl = "ENSG00000091831")
@


\subsection{Selecting by gene symbol}

Often, we know the official HUGO gene symbols, and want the corresponding probe sets.

<<>>=
  jmap('hgu133a', symbol = c("ESR1", "ERBB2", "AURKA"))
@

Unfortunately, the gene symbol for a given gene can change. Furthermore, in rare cases a HUGO gene symbol can correspond to two distinct genes.


\subsection{Selecting by gene alias}

If we have gene symbols, but they are not the \emph{official} symbols, the gene aliases might be useful.

<<>>=
  jmap('u133x3p', alias = c("P53", "HER-2", "K-RAS"))
@


\newpage
\section{View probe set quality scores}

We might want to compare quality scores for all probe sets corresponding to a gene of interest. For this example, the STAT1 gene is used because it is detected by several probe sets.

<<>>=
  jscores('hgu95av2', symbol = 'STAT1')
@

Note that the probe sets are sorted by decreasing overall score.

We can confirm that \texttt{jmap} returns the probe set with the highest overall score:

<<>>==
  jmap('hgu95av2', symbol = 'STAT1')
@

We can also retrieve quality scores for all probesets:

<<>>=
  allscores <- jscores('hgu95av2')
  str(allscores)
@


\newpage
\section{Statistics for this release}

\subsection{How many probe sets could be mapped to a gene?}

<<>>==
table(!is.na(scores.hgu95av2$EntrezID))
table(!is.na(scores.hgu133a$EntrezID))
table(!is.na(scores.hgu133plus2$EntrezID))
table(!is.na(scores.u133x3p$EntrezID))
@

\subsection{How many genes are mapped to a probe set?}

<<>>==
length(na.omit(unique(scores.hgu95av2$EntrezID)))
length(na.omit(unique(scores.hgu133a$EntrezID)))
length(na.omit(unique(scores.hgu133plus2$EntrezID)))
length(na.omit(unique(scores.u133x3p$EntrezID)))
@


\newpage
\section{Jetset algorithm details}

We downloaded probe sequences corresponding to four human gene expression microarrays from Affymetrix: U95Av2, U133A, U133 Plus 2.0, and X3P. We used NCBI BLASTN to search the 25-base probe sequences for matches to the Refseq human RNA database (Pruitt, et al., 2005). The BLASTN search was run with the default parameters, except that filtering was turned off, the word size was set to 8 to increase sensitivity, and the expectation value was set to 1 to reduce output size. 

\texttt{
blastall -p blastn -d refseq.human.rna -i probe.hgu133a.fa -o hgu133a.refseq.20110817.bls -F F -m 8 -e 1 -W 8 -a 8
}

We used the alignment score (bit score) between each probe and cDNA as an indication of probe sensitivity. We defined three levels of alignment: a strong alignment has a score between 48 and 51, indicating that at least 24 bases are identical and that the probe is very likely to detect the target.  A moderate alignment has a score between 32 and 47, corresponding to an uninterrupted alignment of length 16 to 23 bases; the probe may or may not respond to the target. A weak alignment has a score less than 32 and is unlikely to respond to the target.

\textbf{Specificity.} A probe was considered to specifically detect a given gene if it aligned strongly to at least one transcript of the gene, but did not have a strong or moderate alignment to a transcript from another gene. The gene specifically detected by the largest number of probes in a probe set was considered the targeted gene of the probe set. We defined the specificity score $S_s$ of a probe set as the fraction of its probes that specifically detect the targeted gene.

\textbf{Coverage.} A transcript of the targeted gene was considered detected by a probe set if the transcript has a strong alignment to the majority of the probes in the probe set. The coverage score $S_c$ of a probe set is defined as the fraction of all transcripts belonging to the targeted gene that are detected by the probe set.

\textbf{Robustness.} The processivity requirement for a probe-transcript alignment is the number of bases between the 5' end of the alignment and the 3' end of the transcript sequence; this corresponds to the length of labeled target that must be synthesized by in vitro transcription to reach the query region. The overall processivity requirement $N$ of a probe set is the median processivity requirement for all strong alignments between probes in the probe set and transcripts in the targeted gene. We define the robustness score $S_r$ of a probe set as the probability that synthesis of the target up to the processivity requirement is achieved without interruption: $S_r = (1-p)^n$
 
Here, $p$ is the probability of the IVT synthesis being interrupted at each base, due to either transcript degradation or lack of enzyme processivity. The value of $p$ is likely to be variable in clinical specimens, but for simplicity we use a value corresponding to the manufacturer's design criteria: $1/300$ for the X3P array, or $1/600$ for the other arrays.

\textbf{Overall score.} We define the overall score $S_o$ as the product of the three scores described above:  $S_o = S_s * S_c * S_r$

For a given gene, the probe set targeting this gene with the highest overall score is selected to represent the gene.


\newpage
\section{Reference}

Qiyuan Li, Nicolai J. Birkbak, Balazs Gyorffy, Zoltan Szallasi and Aron C. Eklund (2011).
Jetset: selecting the optimal microarray probe set to represent a gene.  BMC Bioinformatics. 12:474.


\section{R sessionInfo}

The results in this file were generated using the following packages:

<<sessionInfo>>=
sessionInfo()
@


\end{document}