--- title: "BioUtils: A Case Study in Transcriptomic Analysis of Renal Cell Carcinoma" author: "Spencer Treadway" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{BioUtils: A Case Study in Transcriptomic Analysis of Renal Cell Carcinoma} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, cache = FALSE, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 5 ) ``` ## Overview Understanding which genes are expressed differently in diseased versus healthy tissue is one of the central questions of modern molecular biology. When a cell becomes cancerous, hundreds or thousands of genes may change their activity levels, some becoming dramatically more active, others effectively silenced. Identifying these changes is essential for discovering what drives the disease, finding diagnostic biomarkers, and ultimately developing targeted therapies. This vignette walks through a complete transcriptomic analysis of **renal cell carcinoma (RCC)**, the most common form of kidney cancer, using publicly available gene expression microarray data from the NCBI Gene Expression Omnibus (GEO). The dataset, **GDS507**, contains expression profiles from 17 samples, a mix of clear cell RCC tumor tissue and matched normal kidney tissue, measured across ~22,000 probe sets on an Affymetrix HG-U133A microarray. We will use the BioUtils package to carry out the full analytical workflow: data import, quality control, differential expression analysis, visualization, co-expression analysis, and multi-gene predictive modeling. Each step is explained in terms of both what the code is doing and what it means biologically. --- ## Part 1: Data Import and Quality Control ### 1.1 What Is a Microarray? Before we load data, it helps to understand what we are working with. A DNA microarray is a glass slide or chip printed with tens of thousands of short DNA sequences called **probes**, each designed to bind to the RNA transcribed from a specific gene. When a biological sample is processed and washed over the chip, messenger RNA from active genes hybridizes to its matching probe and emits a fluorescent signal. The brightness of each spot reflects how actively that gene is being transcribed, its **expression level**. The result is a table of ~22,000 numbers per sample, one for each probe on the chip, representing the transcriptional activity of the cell at the moment the sample was collected. Our job is to extract biologically meaningful signal from this high-dimensional data. ### 1.2 Loading the Dataset The GEO archive distributes datasets in SOFT format files. BioUtils reads these directly using `load.geo.soft()`, which wraps GEOquery to parse the file and return a structured `ExpressionSet` object: ```{r load} library(BioUtils) eset <- load.geo.soft("", "GDS507", log.transform = TRUE) ``` The `log.transform = TRUE` argument is important here. Affymetrix arrays report raw intensity values that span several orders of magnitude and are right-skewed. Applying a log2 transformation compresses this range, makes the data more symmetric, and means that differences in expression become differences in fold change, a 1-unit increase on the log2 scale represents a doubling of expression. This is standard preprocessing for Affymetrix data and is necessary before any statistical testing. ### 1.3 Extracting the Data Components `extract.expression()` decomposes the `ExpressionSet` into the three components used throughout BioUtils: ```{r extract} geo <- extract.expression(eset) ``` This returns a named list with: - **`geo$expression`** - a 22,645 by 17 matrix of log2-transformed probe intensities. Each row is a probe (a specific DNA sequence on the chip), each column is a patient sample. - **`geo$phenotype`** - a data frame of sample metadata. Most importantly for us, the `disease.state` column identifies each sample as either `"RCC"` or `"normal"`. - **`geo$gene`** - a data frame of probe annotations mapping each probe ID to the gene it targets (gene symbol, full title, chromosomal location, etc.). ### 1.4 Principal Component Analysis for Quality Control With 22,000 variables and 17 samples, the data is far too high-dimensional to inspect directly. **Principal Component Analysis (PCA)** is the standard first step for understanding the global structure of the dataset. It finds the axes of maximum variance in the data and projects each sample onto these axes, so that similar samples cluster together in a 2D plot. Before any analysis, we want to confirm that the largest source of variation in the data is the biological signal we care about, disease state, rather than technical artifacts like batch effects or sample processing date. ```{r pca} pca.plot(geo$expression, geo$phenotype, color.by = "disease.state") ``` `pca.plot()` runs PCA on the transposed expression matrix (so samples are the observations and probes are the variables), extracts the first two principal components, and plots each sample as a point colored by disease state. A well-separated PCA plot, with RCC and normal samples forming distinct clusters, tells us that the transcriptional differences between tumor and normal tissue are large enough to dominate the global variance structure of the dataset. This is a reassuring quality control check: it means the biology is strong and our statistical tests should have good power to detect it. If instead the samples were mixed randomly or clustered by something unrelated (a technician ID, a plate number), we would need to investigate and correct for that before proceeding. --- ## Part 2: Genome-Wide Differential Expression ### 2.1 The Statistical Challenge We want to know, for each of the 22,645 probes, whether the average expression level differs significantly between RCC and normal tissue. The naive approach, run a t-test on each probe separately, has a fundamental problem: if you run 22,645 independent tests at a 5% significance threshold, you expect to get roughly 1,132 false positives by chance alone, even if nothing is truly different. This is the **multiple testing problem**. ### 2.2 Running limma BioUtils uses the `limma` (Linear Models for Microarray data) framework for differential expression analysis, which solves both problems elegantly: ```{r limma} de.results <- run.limma.de(geo, condition.col = "disease.state") ``` Rather than testing each gene independently, `limma` fits a linear model to all probes simultaneously. Its key innovation is **empirical Bayes moderation**: it borrows variance information across all genes to produce a stabilized estimate of within-group variability for each probe. Genes with few samples or noisy measurements benefit from this pooled information, which prevents genes with accidentally small variance estimates from generating artificially inflated t-statistics. The result, called a **TopTable**, is a data frame with one row per probe, sorted by evidence of differential expression. The most important columns are: - **`logFC`** - the log2 fold change between RCC and normal tissue. A value of 2 means RCC samples express that gene 4-fold higher on average. A value of -1 means expression is halved in RCC. - **`adj.P.Val`** - the p-value after **Benjamini-Hochberg correction** for multiple testing. This controls the **false discovery rate (FDR)**: an `adj.P.Val` of 0.05 means we expect 5% of the probes we call significant to be false positives, rather than 5% of *all* tests to be wrong. ```{r inspect_de} # Look at the top 10 most significantly differentially expressed probes head(de.results[order(de.results$adj.P.Val), ], 10) ``` ### 2.3 The Volcano Plot A volcano plot is the standard visualization for a differential expression result set. It places every probe in a two-dimensional space: fold change on the x-axis (how large the difference is) and statistical significance on the y-axis (how confident we are in that difference). The most biologically interesting genes, large effect, high confidence, appear in the upper corners. ```{r volcano} volcano.plot(de.results, fc.threshold = 1, fdr.threshold = 0.05) ``` The dashed lines mark our thresholds: genes must have an absolute log2 fold change greater than 1 (a 2-fold change) AND an FDR-adjusted p-value below 0.05 to be colored as differentially expressed. Genes in red are upregulated in RCC relative to normal; genes in blue are downregulated. In kidney cancer, we might expect to see upregulation of genes involved in angiogenesis (tumor blood vessel formation), glycolysis (many kidney cancers switch their metabolism even under normal oxygen levels, the Warburg effect), and cell proliferation, while genes involved in normal kidney function and differentiation may be downregulated. ### 2.4 Identifying Top Candidates We now extract the probe IDs of the most significantly differentially expressed genes to carry forward into deeper analysis: ```{r top_probes} # Extract the 10 most significant probe IDs (row names of the TopTable) top.probes <- rownames(head(de.results[order(de.results$adj.P.Val), ], 10)) # Resolve probe IDs to human-readable gene names top.genes <- get.gene.name(geo$gene, top.probes) # Some probes on the array do not map to annotated genes, remove those annotated.mask <- which(top.genes != "") top.probes <- top.probes[annotated.mask] top.genes <- top.genes[annotated.mask] cat("Top differentially expressed genes:\n") print(top.genes) ``` `get.gene.name()` performs the reverse lookup of `find.probe.by.gene()`, given probe IDs from the TopTable, it retrieves the human-readable gene title from the annotation data frame. The filtering step removes probes that are present on the array but have no gene annotation in the database (common in older array designs). --- ## Part 3: Deep-Dive Single-Gene Analysis The limma TopTable tells us *which* genes are differentially expressed. Now we use BioUtils to understand *how* they are expressed, the magnitude, direction, and robustness of each gene's difference. ### 3.1 Retrieving and Preparing Expression Data ```{r single_prep} # Retrieve expression values for all top probes as a matrix expr.mat <- get.gene.expression(geo$expression, top.probes) # Build the long-format analysis data frame df.multi <- build.analysis.df(expr.mat, geo$phenotype, geo$gene) ``` `get.gene.expression()` slices the rows of the full expression matrix corresponding to our selected probes. `build.analysis.df()` then pivots this wide matrix (probes by samples) into a **long-format** data frame where each row represents one gene-sample observation, with columns for gene name, expression value, and disease state group. This format is required by all downstream analysis and visualization functions. ### 3.2 Multi-Gene Overview ```{r multi_plot} # Faceted violin + boxplot across all top genes simultaneously gene.analysis.plot(df.multi) ``` In multi-gene mode, `gene.analysis.plot()` creates a faceted panel, one violin/boxplot per gene, allowing visual comparison of expression patterns across the entire candidate set at once. This is useful for identifying which genes show large, clean separation between groups and which show more overlap or outlier-driven differences. ### 3.3 Focused Single-Gene Analysis For the gene showing the most dramatic separation, we perform a full statistical analysis: ```{r single_gene} # Subset to a single gene of interest gene.of.interest <- get.gene.name(geo$gene, top.probes[1], use.symbols=TRUE) df.single <- df.multi[which(df.multi$gene == gene.of.interest), ] # Annotated single-gene visualization gene.analysis.plot(df.single) ``` In single-gene mode, `gene.analysis.plot()` runs `analyze.gene()` internally and annotates the plot with the key statistical results. ### 3.4 Understanding `analyze.gene()` ```{r analyze} result <- analyze.gene(df.single) cat(result$interpretation) ``` `analyze.gene()` integrates multiple statistical perspectives into a single object. It is worth understanding what each component contributes: **The adaptive t-test** (`result$raw$parametric`): BioUtils first runs `var.test()` to check whether the two groups have equal variances. If they do not (a common occurrence in gene expression data, where disease states often increase expression variability), it automatically uses **Welch's t-test**, which does not assume equal variances, rather than the standard Student's t-test. This prevents a common source of inflated false positives. **Cohen's d** (`result$effect.size`): A p-value alone does not tell you whether a statistically significant difference is biologically meaningful. With 17 samples, even a tiny difference in means can reach statistical significance. Cohen's d is a **standardized effect size**, the difference in group means divided by the pooled standard deviation, that puts the magnitude of the difference in context. A d of 0.2 is small; 0.5 is moderate; 0.8 or above is large. In a clinical context, genes with small effect sizes are unlikely to serve as robust biomarkers even if they are highly significant. **Bootstrapped confidence interval** (`result$confidence.interval`): The confidence interval around Cohen's d quantifies our uncertainty about the true effect size given the sample size. A narrow CI (e.g., [0.8, 1.2]) means we have a precise estimate; a wide CI (e.g., [0.1, 1.5]) means the true effect could be anywhere from negligible to very large, and we should be cautious about strong claims. **The Wilcoxon rank-sum test** (`result$nonparametric.p`): The t-test assumes that expression values are roughly normally distributed within each group. Gene expression data often violates this assumption, particularly when one group has outlier samples. The Wilcoxon test makes no distributional assumptions and provides an independent check. When both tests agree (`result$robustness`), we can be more confident in the result. When they disagree, t-test significant, Wilcoxon not, the result may be driven by distributional differences rather than a genuine shift in the population mean. ```{r interpret} # Inspect individual components cat("Effect size (Cohen's d):", round(result$effect.size, 3), "\n") cat("Effect size class: ", result$effect.size.class, "\n") cat("95% CI: [", round(result$confidence.interval$lower, 3), ",", round(result$confidence.interval$upper, 3), "]\n") cat("Parametric p-value: ", signif(result$p.value, 3), "\n") cat("Nonparametric p-value: ", signif(result$nonparametric.p, 3), "\n") cat("Robustness: ", result$robustness, "\n") cat("Biological assessment: ", result$biological.relevance, "\n") ``` --- ## Part 4: Pathway-Level Interpretation with GSEA So far every analysis has focused on individual genes, which ones are differentially expressed, how large their effects are, and which ones jointly predict disease state. But genes do not act in isolation. They participate in coordinated **biological pathways**: cascades of molecular events that together accomplish a cellular function. The limma TopTable tells us *which* genes change; Gene Set Enrichment Analysis (GSEA) tells us *what those changes mean* at the level of biology. ### 4.1 What Is a Gene Set? A **gene set** is simply a curated list of genes known to participate in the same biological process. The Molecular Signatures Database (MSigDB) maintains thousands of such sets organized into collections. The most widely used collection for disease studies is the **Hallmark** collection (`category = "H"`), which contains 50 gene sets representing well-defined biological states, things like "HYPOXIA", "MYC_TARGETS_V1", "INFLAMMATORY_RESPONSE", and "EPITHELIAL_MESENCHYMAL_TRANSITION". Each set was assembled by expert curation to be coherent and minimally overlapping with the others. For kidney cancer specifically, we might expect to see strong enrichment of hypoxia-related gene sets. Clear cell RCC commonly inactivates the *VHL* tumor suppressor gene, which normally targets the HIF1$\alpha$ transcription factor for degradation. Without VHL, HIF1$\alpha$ accumulates and drives a transcriptional program mimicking chronic oxygen deprivation, activating genes for blood vessel formation, glucose uptake, and anaerobic metabolism. This is one of the best-characterized molecular signatures in all of oncology, and GSEA is an ideal tool for detecting it. ### 4.2 Loading Gene Sets We use the `msigdbr` package to download the Hallmark gene sets as a named list, which is the format `run.gsea()` expects: ```{r gene_sets} # Download Hallmark gene sets for Homo sapiens hallmark.df <- msigdbr::msigdbr(species = "Homo sapiens", category = "H") # Convert to a named list: pathway name -> character vector of gene symbols pathways <- split(hallmark.df$gene_symbol, hallmark.df$gs_name) # Each element is a vector of gene symbols belonging to that pathway cat("Number of Hallmark gene sets:", length(pathways), "\n") cat("Example - first 6 genes in HALLMARK_HYPOXIA:\n") print(head(pathways[["HALLMARK_HYPOXIA"]])) ``` ### 4.3 Running GSEA GSEA works on a **ranked gene list**, all genes in the experiment ordered from most upregulated to most downregulated, rather than a filtered list of significant hits. This is an important distinction from simpler enrichment methods that only ask "are pathway genes over-represented among my significant hits?" GSEA asks "do pathway genes cluster at the top or bottom of the full ranked list?" which makes it sensitive to coordinated shifts across an entire pathway even when no individual gene clears a significance threshold. The ranking metric used by `run.gsea()` is log2 fold change from the limma TopTable: genes at the top are the most upregulated in RCC relative to normal, and genes at the bottom are the most downregulated. ```{r gsea} gsea.results <- run.gsea(de.results, geo$gene, pathways, min.size = 15, max.size = 500) # Sort by adjusted p-value and inspect the top enriched pathways gsea.results <- gsea.results[order(gsea.results$padj), ] print(head(gsea.results[, c("pathway", "NES", "pval", "padj", "size")], 10)) ``` ### 4.4 Interpreting the Results The two most important columns in the GSEA output are: **`NES` (Normalized Enrichment Score):** The core statistic of GSEA. A positive NES means the gene set is enriched among genes upregulated in RCC, the genes belonging to this pathway tend to cluster at the top of the ranked list. A negative NES means the pathway is enriched among downregulated genes, suppressed in RCC relative to normal tissue. The normalization accounts for gene set size, so NES values are comparable across sets. **`padj`:** The NES is tested for significance by permuting the gene labels and recomputing enrichment scores under the null hypothesis that genes are randomly distributed in the ranked list. `padj` is the resulting p-value after Benjamini-Hochberg correction across all tested pathways. **`leadingEdge`:** The subset of genes from the pathway that contribute most to driving the enrichment score. These are the most biologically interesting genes within an enriched pathway, the ones at the "leading edge" of the ranked list. They are strong candidates for follow-up with `analyze.gene()`. ```{r gsea_followup} # Extract the leading edge genes of the top enriched pathway top.pathway <- gsea.results$pathway[1] leading.genes <- unlist(gsea.results$leadingEdge[1]) cat("Top enriched pathway:", top.pathway, "\n") cat("Leading edge genes:\n") print(leading.genes) # Find probes for the leading edge genes and perform single-gene deep-dives leading.probes <- find.probe.by.gene(geo$gene, leading.genes) leading.probes <- leading.probes[leading.probes != ""] if(length(leading.probes) > 0) { leading.expr <- get.gene.expression(geo$expression, leading.probes) df.leading <- build.analysis.df(leading.expr, geo$phenotype, geo$gene) gene.analysis.plot(df.leading) } ``` This closes the analytical loop: GSEA identifies which biological programs are dysregulated in RCC at the pathway level, and the leading edge genes point us back to the individual-gene tools, `analyze.gene()` and `gene.analysis.plot()`, to characterize the most important drivers within each pathway. --- ## Part 5: Multi-Gene Relationships Individual gene analysis tells us about each gene in isolation. But biology is a system, genes do not act alone. They are connected in regulatory networks, share transcription factors, and participate in common pathways. BioUtils provides two complementary tools for exploring these multi-gene relationships. ### 5.1 Co-expression Correlation Heatmap Genes that are regulated together tend to show correlated expression patterns across samples: when one goes up, the other goes up. These **co-expression relationships** can suggest shared regulatory control, functional partnership, or membership in the same biological pathway. ```{r coexp} # Compute pairwise Pearson correlations between our top probes cor.mat <- gene.correlation.matrix(geo$expression, top.probes, method = "pearson") # Visualize with hierarchical clustering correlation.heatmap.plot(cor.mat, gene.names = get.gene.name(geo$gene, top.probes, use.symbols=TRUE)) ``` `gene.correlation.matrix()` computes the full pairwise correlation matrix across all samples. `correlation.heatmap.plot()` then visualizes this as a clustered heatmap, where hierarchical clustering reorders rows and columns so that genes with similar co-expression profiles are placed next to each other. Red cells indicate strongly co-expressed gene pairs; blue cells indicate anti-correlated pairs (when one goes up, the other tends to go down). Clusters of co-expressed genes visible in the heatmap often correspond to genes in the same biological pathway or under the control of the same transcription factor. In a renal cell carcinoma context, you might see a cluster of angiogenesis-related genes (driven by HIF1$\alpha$ pathway activation, common in RCC) all moving together, or a cluster of genes involved in fatty acid metabolism. ### 5.2 LASSO Regression for Predictive Biomarker Discovery Co-expression tells us about correlations. But a separate question is: which genes, **taken together**, best discriminate RCC from normal tissue? This is a classification problem, and it is where `fit.lasso()` comes in. ```{r lasso} # Encode disease state as a binary outcome variable phenotype.binary <- ifelse(geo$phenotype$disease.state == "RCC", 1, 0) # Fit LASSO logistic regression across all top probes simultaneously lasso.fit <- fit.lasso(geo$expression, phenotype.binary) # Extract genes selected by LASSO at the 1-standard-error lambda coef.mat <- coef(lasso.fit, s = "lambda.1se") selected <- coef.mat[coef.mat[, 1] != 0, ] print(selected) ``` **Why LASSO?** Standard logistic regression cannot be applied here: we have more probes than samples, and many of the top genes are correlated with each other (as shown in the heatmap). LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty term to the logistic regression that forces most gene coefficients to exactly zero, keeping only those with **independent predictive value** when all other selected genes are accounted for. `cv.glmnet()` determines the optimal penalty strength through 10-fold cross-validation. At `lambda.1se`, the most regularized model within one standard error of the minimum cross-validation error, LASSO returns the sparsest model that still predicts well, which tends to be the most interpretable and generalizable. The genes with non-zero coefficients in the selected output are BioUtils's answer to the question: "If you could only measure a handful of genes to diagnose RCC from a biopsy, which ones would give you the most information?" These are strong candidates for further validation as diagnostic biomarkers. This is complementary to the limma DE analysis: limma identifies genes that differ **individually** between groups, while LASSO identifies the minimal set of genes that **jointly** carry discriminative information. A gene might be highly significant in limma but dropped by LASSO because its information is redundant with another gene in the panel. --- ## Part 6: The Complete Workflow The full analysis, from raw SOFT file to multi-gene biomarker panel, can be expressed as a coherent pipeline: ```{r full_pipeline} library(BioUtils) # == 1. Import ================================================================= eset <- load.geo.soft("", "GDS507", log.transform = TRUE) geo <- extract.expression(eset) # == 2. Quality Control ======================================================== pca.plot(geo$expression, geo$phenotype, color.by = "disease.state") # == 3. Genome-Wide Differential Expression ==================================== de.results <- run.limma.de(geo, condition.col = "disease.state") volcano.plot(de.results, fc.threshold = 1, fdr.threshold = 0.05) # == 4. Select Top Candidates ================================================== top.probes <- rownames(head(de.results[order(de.results$adj.P.Val), ], 10)) top.genes <- get.gene.name(geo$gene, top.probes) valid.mask <- which(top.genes != "") top.probes <- top.probes[valid.mask] top.genes <- top.genes[valid.mask] # == 5. Build Analysis Data Frame ============================================== expr.mat <- get.gene.expression(geo$expression, top.probes) df.multi <- build.analysis.df(expr.mat, geo$phenotype, geo$gene) # == 6. Multi-Gene Visualization =============================================== gene.analysis.plot(df.multi) # == 7. Single-Gene Deep-Dive ================================================== df.single <- df.multi[which(df.multi$gene == get.gene.name(geo$gene, top.probes[1], use.symbols=TRUE)), ] gene.analysis.plot(df.single) result <- analyze.gene(df.single) cat(result$interpretation) # == 8. Co-expression Analysis ================================================= cor.mat <- gene.correlation.matrix(geo$expression, top.probes, method = "pearson") correlation.heatmap.plot(cor.mat, gene.names = get.gene.name(geo$gene, top.probes, use.symbols=TRUE)) # == 9. Pathway Enrichment ===================================================== hallmark.df <- msigdbr::msigdbr(species = "Homo sapiens", category = "H") pathways <- split(hallmark.df$gene_symbol, hallmark.df$gs_name) gsea.results <- run.gsea(de.results, geo$gene, pathways) gsea.results <- gsea.results[order(gsea.results$padj), ] print(head(gsea.results[, c("pathway", "NES", "pval", "padj")], 10)) # == 10. Multi-Gene Predictive Modeling ======================================== phenotype.binary <- ifelse(geo$phenotype$disease.state == "RCC", 1, 0) lasso.fit <- fit.lasso(geo$expression, phenotype.binary) coef.mat <- coef(lasso.fit, s = "lambda.1se") selected <- coef.mat[coef.mat[, 1] != 0, ] print(selected) ``` --- ## Summary The table below summarizes every BioUtils function used in this vignette, the input it expects, and the output it produces: | Function | Input | Output | Role in Workflow | |---|---|---|---| | `load.geo.soft()` | SOFT file path | `ExpressionSet` | Data import | | `extract.expression()` | `ExpressionSet` | Named list | Decompose into usable components | | `pca.plot()` | Expression matrix, phenotype | ggplot | Quality control | | `run.limma.de()` | `geo` list | TopTable data frame | Genome-wide DE | | `volcano.plot()` | TopTable | ggplot | DE visualization | | `get.gene.name()` | Annotation df, probe IDs | Gene name strings | Probe to gene resolution | | `find.probe.by.gene()` | Annotation df, gene names | Probe ID integers | Gene to probe resolution | | `get.gene.expression()` | Expression matrix, probe IDs | Expression matrix | Slice expression data | | `build.analysis.df()` | Expression matrix, phenotype, annotation | Long-format df | Prepare for analysis | | `gene.analysis.plot()` | Long-format df | ggplot | Per-gene visualization | | `analyze.gene()` | Long-format df | Results list | Full statistical analysis | | `gene.correlation.matrix()` | Expression matrix, probe IDs | Correlation matrix | Co-expression | | `correlation.heatmap.plot()` | Correlation matrix | pheatmap | Co-expression visualization | | `run.gsea()` | TopTable, pathway list | GSEA results data frame | Pathway enrichment | | `fit.lasso()` | Expression matrix, binary phenotype | cv.glmnet | Multi-gene biomarker selection |