tepr (Transcription Elongation Profile) is an R package designed for analyzing data from nascent RNA sequencing technologies, such as TT-seq, mNET-seq, and PRO-seq. It calculates the probability distribution of nascent RNA sequencing signal across the gene body or transcription unit of a given gene. By comparing this profile to a uniform signal, tepr can identify transcription attenuation sites. Furthermore, it can detect increased or decreased transcription attenuation by comparing profiles across different conditions. Beyond its rigorous statistical testing and high sensitivity, a key strength of tepr is its ability to resolve the elongation pattern of individual genes, including the precise location of the primary attenuation point, when present. This capability allows users to visualize and refine genome-wide aggregated analyses, enabling the robust identification of effects specific to gene subsets. These metrics facilitate comparisons between genes within a condition, across conditions for the same gene, or against a theoretical model of perfect uniform elongation.
For any questions or bug reports, please open an issue on GitHub.
The following example demonstrates a complete run of the package. Preprocessing is performed on chromosome 13, and postprocessing is limited to the first 100 transcripts. The system.file command retrieves file paths from the package’s “extdata” directory.
library(tepr)
#########
# Pre-processing - takes ~ 21 seconds
#########
## Parameters
expprepath <- system.file("extdata", "exptab-preprocessing.csv", package="tepr")
windsize <- 200
## Read input tables
expdfpre <- read.csv(expprepath)
## Retrieving the bedgraph paths
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
    package = "tepr"))
## Replace the path column of expdfpre with the previously retrieved paths
## and writing the new experiment file to the current folder
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE, quote = FALSE)
expprepath <- "exptab-preprocessing.csv"
gencodepath <- system.file("extdata", "gencode-chr13.gtf", package = "tepr")
maptrackpath <- system.file("extdata", "k50.umap.chr13.hg38.0.8.bed",
    package = "tepr")
blacklistpath <- system.file("extdata", "hg38-blacklist-chr13.v2.bed",
    package = "tepr")
genomename <- "hg38"
## The lines below are optional. The chromosome info can be retrieved automatically
## Make chromosome info retrieval explicit for building the vignette
chromtabtestppath <- system.file("extdata", "chromtabtest.rds", package="tepr")
chromtabtest <- readRDS(chromtabtestppath)
allchromvec <- GenomeInfoDb::seqnames(chromtabtest)
chromtabtest <- chromtabtest[allchromvec[which(allchromvec == "chr13")], ]
finaltab <- preprocessing(expprepath, gencodepath, windsize, maptrackpath,
    blacklistpath, genomename, finaltabpath = tempdir(), finaltabname = "anno.tsv",
    chromtab = chromtabtest, showtime = FALSE, verbose = FALSE)
#########
# tepr analysis - takes ~ 1 seconds
#########
## Parameters (transpath limited to 6 transcripts)
exppath <-  system.file("extdata", "exptab.csv", package="tepr")
transpath <- system.file("extdata", "cugusi_6.tsv", package="tepr")
expthres <- 0.1
## Read input tables
expdf <- read.csv(exppath)
transdf <- read.delim(transpath, header = FALSE)
reslist <- tepr(expdf, transdf, expthres, showtime = FALSE, verbose = FALSE)The table produced by the preprocessing function is saved without column headers. The resulting data, using the testing dataset, is:
##   biotype   chr coor1 coor2        transcript    gene strand window
## 1  lncRNA chr13  1000  1003 ENST00000623511.1 FAM230C      +      1
## 2  lncRNA chr13  1003  1006 ENST00000623511.1 FAM230C      +      2
## 3  lncRNA chr13  1006  1009 ENST00000623511.1 FAM230C      +      3
##                              id         dataset.x  score.x         dataset.y
## 1 ENST00000623511.1_FAM230C_+_1 ctrl_rep1.forward 3.536147 ctrl_rep1.reverse
## 2 ENST00000623511.1_FAM230C_+_2 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## 3 ENST00000623511.1_FAM230C_+_3 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
##   score.y
## 1      NA
## 2      NA
## 3      NAThe tepr function returns two tables that are used for
plotting different information (see below for explanations and
?tepr). On the testing dataset it gives:
## The table meandifference:##          biotype  chr    coor1    coor2         transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11  DAP      -    200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11  DAP      -    199
##                             id         ctrl_rep1         ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
##           HS_rep1         HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse     1              2.424338
## 2 HS_rep1.reverse HS_rep2.reverse     2              9.451467
##   value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1             0.5111166             1.64734             2.20375
## 2            14.2957119            11.85357            13.22088
##   Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1        0.001792650       0.0002728066       0.00839895       0.01045131
## 2        0.008888557       0.0080750764       0.07086614       0.07315914
##   mean_value_ctrl mean_Fx_ctrl diff_Fx_ctrl mean_value_HS  mean_Fx_HS
## 1        1.467727  0.001032728 -0.003967272      1.925545 0.009425128
## 2       11.873589  0.008481817 -0.001518183     12.537228 0.072012643
##    diff_Fx_HS Diff_meanvalue_ctrl_HS Diff_meanvalue_HS_ctrl Diff_meanFx_ctrl_HS
## 1 0.004425128             -0.4578180              0.4578180         -0.00839240
## 2 0.062012643             -0.6636391              0.6636391         -0.06353083
##   Diff_meanFx_HS_ctrl
## 1          0.00839240
## 2          0.06353083## 
## 
##  The table universegroup:##   Universe      Group         transcript    gene strand  chr    coor1    coor2
## 1     TRUE Attenuated ENST00000230895.11     DAP      - chr5 10679230 10761234
## 2     TRUE   Outgroup ENST00000274140.10 MARCHF6      + chr5 10353695 10440388
##    size window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl    AUC_HS
## 1 82005         410 5.830076  0.9874106      0.045           7.927029 65.589375
## 2 86694         433 5.046713  0.9639452      0.050           8.032174  7.447405
##       p_AUC_HS D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 1.203856e-28    0.570         1.003308         0.9874106    3.611568e-28
## 2 6.271671e-01    0.075         9.129133         0.9874106    6.271671e-01
##   dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1                59.759300               9.396710e-26
## 2                 2.400692               9.874106e-01
##   D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1                      0.540                      5.638026e-25            49
## 2                      0.045                      9.874106e-01            49
##   max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1       0.04433176          34     0.56852906        0     9.375234
## 2       0.04980327          76     0.07198283        0     9.662655
##   DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1       7.464112         20.38480    4.36426    0.3201154       92.66507
## 2       7.522987         22.14369   10.80359    8.1156322       24.88020For a comprehensive analysis, we will use the data from Cugusi et al.. To validate our approach and explore the utility of tepr metrics in revealing gene-specific elongation characteristics, we re-analyzed previous experiments demonstrating that heat shock (HS) induces attenuation in human cultured cells. We compared TT-seq data from HS-stressed and control MRC5-VA cells. Stressed cells were cultured for 2 hours at 42°C, while control cells were maintained at 37°C. Both samples were subjected to a 15-minute pulse of 4-thiouridine (4sU) labeling of nascent transcripts, followed by purification and sequencing.
The bedgraph files, mappability track and black list necessary for performing the entire analysis can be downloaded from zenodo:
#!/usr/bin/sh
mkdir data
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.basic.annotation.gtf.gz -P data/ && gunzip data/gencode.v43.basic.annotation.gtf.gz
wget https://github.com/Boyle-Lab/Blacklist/raw/refs/heads/master/lists/hg38-blacklist.v2.bed.gz -P data/ && gunzip data/hg38-blacklist.v2.bed.gz
wget https://zenodo.org/records/15050723/files/k50.umap.hg38.0.8.bed -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep1.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep1.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep2.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep2.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep1.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep1.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep2.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep2.reverse.bg -P data/If one would like to skip the preprocessing, the preprocessing result file can be downloaded with:
The preprocessing pipeline consists of the following steps:
windsize).structure
Important: The preprocessing function
allows for saving intermediate objects, which prevents recomputation in
cases of failure due to memory or time limits typically set in HPC
parameters. See the saveobjectpath and
deletetmp parameters in the ?preprocessing
help documentation for details. Resource requirements for processing the
complete dataset are provided below.
| nb CPU | RAM | Time | 
|---|---|---|
| 15 | 113.5 G | 3h47 | 
| 10 | 78.9 G | 4h24 | 
| 7 | 57.4 G | 5h27 | 
| 5 | 43G | 6h58 | 
| 3 | 28.7G | 9h02 | 
The testing dataset below is limited to a portion of the chromosome 13 and one control replicate.
The following input files are required for the analysis:
?checkexptab utility function can be used to
verify the table’s format.NA.The ?retrieveanno function performs the following
steps:
exptab CSV file.saveobjectpath directory.showtime = TRUE).## 
##  The result is:##   chrom start  end           ensembl  symbol strand        biotype
## 1 chr13     1  800 ENST00000400113.8  TUBA3C      - protein-coding
## 2 chr13  1000 1600 ENST00000623511.1 FAM230C      +         lncRNAThe ?makewindows function performs the following
operations:
windsize.windsize.The output includes metadata for each window, such as its chromosome, start and end coordinates, associated gene, and window number.
## 
##  The result is:##          biotype   chr coor1 coor2        transcript   gene strand window
## 1 protein-coding chr13     1     5 ENST00000400113.8 TUBA3C      -      1
## 2 protein-coding chr13     5     9 ENST00000400113.8 TUBA3C      -      2
## 3 protein-coding chr13     9    13 ENST00000400113.8 TUBA3C      -      3The ?blacklisthighmap function iterates through each
chromosome, processing genomic scores by removing those that overlap
with blacklisted and low-mappability regions. It also calculates
weighted means for scores within each window. This function leverages
parallel processing for efficiency and supports saving
(saveobjectpath) and reloading (reload)
intermediate results to optimize the workflow.
The main steps include:
tmpfold).If you did not execute the “quick start” section code, you need to first copy the bedgraph files to your current folder:
expprepath <-  system.file("extdata", "exptab-preprocessing.csv", package="tepr")
expdfpre <- read.csv(expprepath)
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
    package = "tepr"))
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE, quote = FALSE)
exptabpath <- "exptab-preprocessing.csv"Note that this function does not return a value; instead, it saves
the processed files to a temporary folder for use by the subsequent
?createtablescores function.
The ?createtablescores function merges the files stored
in the previously produced temporary folder that belong to the same
experiment and direction. These files are combined into a single table,
with two columns per experiment: one containing the experiment name and
the other containing the corresponding scores. The resulting table also
includes annotations for each transcript.
finaltab <- createtablescores(tmpfold = file.path(tempdir(), "tmptepr"), expprepath, savefinaltable = FALSE)The final table should look like:
##       V1    V2   V3   V4                V5      V6 V7 V8
## 1 lncRNA chr13 1000 1003 ENST00000623511.1 FAM230C  +  1
## 2 lncRNA chr13 1003 1006 ENST00000623511.1 FAM230C  +  2
## 3 lncRNA chr13 1006 1009 ENST00000623511.1 FAM230C  +  3
##                              V9               V10      V11               V12
## 1 ENST00000623511.1_FAM230C_+_1 ctrl_rep1.forward 3.536147 ctrl_rep1.reverse
## 2 ENST00000623511.1_FAM230C_+_2 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## 3 ENST00000623511.1_FAM230C_+_3 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
##   V13
## 1  NA
## 2  NA
## 3  NAThe quick start section demonstrates how to perform a tepr analysis
in a single call using the ?tepr function. Below, we detail
the individual steps performed by ?tepr using a subset of 6
transcripts for demonstration purposes. Plots generated from the
complete annotation dataset are also provided. The complete table
resulting from preprocessing can be downloaded here.
See section 6.1 for notes on processing the complete dataset.
Two tables are required as input:
transdf: The table generated by the
?preprocessing function (described in previous sections).
This table contains information about each experiment, along with scores
for each protein-coding and lncRNA annotation.
exptab: A table describing the experiments,
containing the columns “condition”, “replicate”, “direction”, “strand”,
and “path”. The “direction” column should contain the values “forward”
and “reverse”, and the “strand” column should contain “plus” and
“minus”. You can verify the table’s format using the
?checkexptab utility function.
These two tables are included with the package and can be accessed
using the ?system.file command within the “extdata”
directory. For this demonstration, we have sampled 6 transcripts from
the transdf table to reduce computation time.
## Define manually the column names for display purpose
colnames(transdf) <- c("biotype", "chr", "coor1", "coor2", "transcript", "gene",
    "strand", "window", "id", "ctrl_rep1.plus", "ctrl_rep1.plus_score",
    "ctrl_rep1.minus", "ctrl_rep1.minus_score", "ctrl_rep2.plus",
    "ctrl_rep2.plus_score", "ctrl_rep2.minus", "ctrl_rep2.minus_score",
    "HS_rep1.plus", "HS_rep1.plus_score", "HS_rep1.minus",
    "HS_rep1.minus_score", "HS_rep2.plus", "HS_rep2.plus_score",
    "HS_rep2.minus", "HS_rep2.minus_score")
message("The table given by the preprocessing function is:\n")## The table given by the preprocessing function is:##          biotype  chr    coor1    coor2        transcript gene strand window
## 1 protein-coding chr7 55019017 55019980 ENST00000275493.7 EGFR      +      1
## 2 protein-coding chr7 55019980 55020943 ENST00000275493.7 EGFR      +      2
## 3 protein-coding chr7 55020943 55021906 ENST00000275493.7 EGFR      +      3
##                           id    ctrl_rep1.plus ctrl_rep1.plus_score
## 1 ENST00000275493.7_EGFR_+_1 ctrl_rep1.forward             1.356774
## 2 ENST00000275493.7_EGFR_+_2 ctrl_rep1.forward             3.657029
## 3 ENST00000275493.7_EGFR_+_3 ctrl_rep1.forward             8.161519
##     ctrl_rep1.minus ctrl_rep1.minus_score    ctrl_rep2.plus
## 1 ctrl_rep1.reverse                    NA ctrl_rep2.forward
## 2 ctrl_rep1.reverse                    NA ctrl_rep2.forward
## 3 ctrl_rep1.reverse                    NA ctrl_rep2.forward
##   ctrl_rep2.plus_score   ctrl_rep2.minus ctrl_rep2.minus_score    HS_rep1.plus
## 1             1.430225 ctrl_rep2.reverse                    NA HS_rep1.forward
## 2             5.461259 ctrl_rep2.reverse                    NA HS_rep1.forward
## 3             9.692524 ctrl_rep2.reverse                    NA HS_rep1.forward
##   HS_rep1.plus_score   HS_rep1.minus HS_rep1.minus_score    HS_rep2.plus
## 1           2.011272 HS_rep1.reverse                  NA HS_rep2.forward
## 2           8.927487 HS_rep1.reverse                  NA HS_rep2.forward
## 3          12.844230 HS_rep1.reverse                  NA HS_rep2.forward
##   HS_rep2.plus_score   HS_rep2.minus HS_rep2.minus_score
## 1           2.068358 HS_rep2.reverse                  NA
## 2           8.943509 HS_rep2.reverse                  NA
## 3          14.917819 HS_rep2.reverse                  NA## 
##  The expdf table contains information about each replicate (here limited to one):##   condition replicate direction strand                       path
## 1      ctrl         1   forward   plus ctrl_rep1_chr13.forward.bg
## 2      ctrl         1   reverse  minus ctrl_rep1_chr13.reverse.bg
## 3      ctrl         2   forward   plus ctrl_rep2_chr13.forward.bg
## 4      ctrl         2   reverse  minus ctrl_rep2_chr13.reverse.bg
## 5        HS         1   forward   plus   HS_rep1_chr13.forward.bg
## 6        HS         1   reverse  minus   HS_rep1_chr13.reverse.bgYou can verify the table’s format using the ?checkexptab
utility function. Note that this verification is also performed by
?averageandfilterexprs.
To study changes in nascent RNA patterns, the initial step is to
select expressed transcripts with ?averageandfilterexprs.
The expression value of a gene is defined as the mean score across both
strands. These scores are derived from the bedgraph files. A gene is
considered expressed if its mean expression in all conditions exceeds a
specified threshold. The default expression threshold is set to 0.1. If
no expressed transcript is returned, an error is thrown.
resallexprs is a list. Its first element is the input
transdf with additional columns, such as the calculated
mean expression values. The second element is a vector containing the
IDs of the transcripts considered expressed. This vector will be used in
subsequent downstream analyses.
## [1] "ENST00000230895.11" "ENST00000274140.10" "ENST00000275493.7" 
## [4] "ENST00000527786.7"  "ENST00000549802.5"  "ENST00000615891.2"During preprocessing, scores within user-defined blacklisted regions,
low-mappability regions, or any other excluded regions are replaced with
NA values. The ?countna function retrieves the number of
missing scores for each transcript. This table can be used for further
data filtering, such as removing transcripts with an excessive number of
windows lacking scores.
##                         gene         transcript strand Count_NA
## ENST00000230895.11       DAP ENST00000230895.11      -        0
## ENST00000274140.10   MARCHF6 ENST00000274140.10      +        0
## ENST00000275493.7       EGFR  ENST00000275493.7      +        0
## ENST00000527786.7       FLI1  ENST00000527786.7      +        0
## ENST00000549802.5  LINC01619  ENST00000549802.5      -        0
## ENST00000615891.2      AP5S1  ENST00000615891.2      +        8The ?genesECDF function calculates the empirical
cumulative distribution function (ECDF) for expressed genes across their
transcripts. An ECDF is a probability distribution that
estimates the Cumulative Distribution Function; here, we use the
?ecdf function from R. It generates an ECDF for each
transcript, which are then used to assess differences in signal using a
Kolmogorov-Smirnov test. The ECDF helps answer the question: How likely
is it that we would observe a distribution of signals like this if the
signals were drawn from a uniform probability distribution? In other
words, this statistic helps identify specific loci where the
transcription signal deviates significantly from a uniform density. In
the following steps, the ECDF is used to compute the difference from the
cumulative distribution (see ?meananddifference), which is
subsequently used to calculate the area under the curve (AUC, see
?allauc) and the inflection point or knee (see
?kneeid).
The ?genesECDF function returns a list. The first
element is the main table with added ECDF columns, prefixed with
Fx.
##          biotype  chr    coor1    coor2         transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11  DAP      -    200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11  DAP      -    199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11  DAP      -    198
##                             id         ctrl_rep1         ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
##           HS_rep1         HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse     1              2.424338
## 2 HS_rep1.reverse HS_rep2.reverse     2              9.451467
## 3 HS_rep1.reverse HS_rep2.reverse     3             13.218833
##   value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1             0.5111166            1.647340             2.20375
## 2            14.2957119           11.853574            13.22088
## 3            16.2775011            9.271255            14.74644
##   Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1        0.001792650       0.0002728066       0.00839895       0.01045131
## 2        0.008888557       0.0080750764       0.07086614       0.07315914
## 3        0.018748133       0.0169685727       0.11968504       0.14299287The second element returns the number of windows per transcript, which is set to 200 by default during preprocessing.
## [1] 200The ?meandifference function computes three types of
statistics: mean score values, mean ECDF for each replicate, and the
difference between the mean ECDF and the uniform cumulative
distribution. Mean values across replicates are calculated, resulting in
one mean_value column per condition. These
mean_value score columns are required for subsequent
attenuation score calculations (see ?attenuation). The mean
ECDF values are used to calculate the difference from the cumulative
distribution, which we term diff_Fx. The diff_Fx
statistic is used to calculate the area under the curve (AUC, see
?allauc) and the inflection point or knee (see
?kneeid).
The difference is calculated as: diff_Fx = meanECDF - cumvec / nbwindows
resmeandiff <- meandifference(resecdflist[[1]], expdf, nbwindows, verbose = FALSE)
print(head(resmeandiff, 3))##          biotype  chr    coor1    coor2         transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11  DAP      -    200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11  DAP      -    199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11  DAP      -    198
##                             id         ctrl_rep1         ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
##           HS_rep1         HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse     1              2.424338
## 2 HS_rep1.reverse HS_rep2.reverse     2              9.451467
## 3 HS_rep1.reverse HS_rep2.reverse     3             13.218833
##   value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1             0.5111166            1.647340             2.20375
## 2            14.2957119           11.853574            13.22088
## 3            16.2775011            9.271255            14.74644
##   Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1        0.001792650       0.0002728066       0.00839895       0.01045131
## 2        0.008888557       0.0080750764       0.07086614       0.07315914
## 3        0.018748133       0.0169685727       0.11968504       0.14299287
##   mean_value_ctrl mean_Fx_ctrl diff_Fx_ctrl mean_value_HS  mean_Fx_HS
## 1        1.467727  0.001032728 -0.003967272      1.925545 0.009425128
## 2       11.873589  0.008481817 -0.001518183     12.537228 0.072012643
## 3       14.748167  0.017858353  0.002858353     12.008848 0.131338957
##    diff_Fx_HS Diff_meanvalue_ctrl_HS Diff_meanvalue_HS_ctrl Diff_meanFx_ctrl_HS
## 1 0.004425128             -0.4578180              0.4578180         -0.00839240
## 2 0.062012643             -0.6636391              0.6636391         -0.06353083
## 3 0.116338957              2.7393194             -2.7393194         -0.11348060
##   Diff_meanFx_HS_ctrl
## 1          0.00839240
## 2          0.06353083
## 3          0.11348060The resulting table is then split into a list of tables, one per transcript. This enables parallel processing by transcript for subsequent operations (see nbcpu parameter in each function).
The Area Under the Curve (AUC) values enable comparison of the
nascent RNA signal to a uniform distribution (representing no signal
attenuation, where x = y) or comparison of the difference in signal
distribution between two conditions (dAUC). The difference between
cumulative densities is estimated using a Kolmogorov-Smirnov test with
adjusted p-value. Finally, ?allauc returns results by
transcript, along with a mean value termed MeanValueFull.
For each condition (if more than one), the AUC of diff_Fx
(described in the previous section) is computed using a trapezoidal
approximation (see pracma::trapz):
AUC = pracma::trapz(transcoord, diff_Fx)
where transcoord represents the positions along the
transcript and diff_Fx is the difference from the cumulative
distribution.
## Calculate Area Under Curve (AUC) and Differences of AUC for Transcript Data
resauc <- allauc(bytranslistmean, expdf, nbwindows, verbose = FALSE)
print(head(resauc, 3))##    gene         transcript strand window_size  AUC_ctrl  p_AUC_ctrl D_AUC_ctrl
## 1 AP5S1  ENST00000615891.2      +          41 19.405661 0.008635679      0.165
## 2   DAP ENST00000230895.11      -         410  5.830076 0.987410626      0.045
## 3  EGFR  ENST00000275493.7      +         963  7.515544 0.544142503      0.080
##   MeanValueFull_ctrl   AUC_HS     p_AUC_HS D_AUC_HS MeanValueFull_HS
## 1           3.001846 10.85386 1.122497e-01     0.12         2.561331
## 2           7.927029 65.58938 1.203856e-28     0.57         1.003308
## 3           6.210447 56.67197 3.857500e-22     0.50         3.330179
##   adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS dAUC_Diff_meanFx_HS_ctrl
## 1        0.02590704    1.346996e-01                -8.551806
## 2        0.98741063    3.611568e-28                59.759300
## 3        0.81621375    7.714999e-22                49.156430
##   p_dAUC_Diff_meanFx_HS_ctrl D_dAUC_Diff_meanFx_HS_ctrl
## 1               5.224189e-02                      0.135
## 2               9.396710e-26                      0.540
## 3               7.407064e-21                      0.485
##   adjFDR_p_dAUC_Diff_meanFx_HS_ctrl
## 1                      6.269027e-02
## 2                      5.638026e-25
## 3                      1.481413e-20To assess attenuation in the nascent RNA signal, it’s necessary to
identify significant changes in signal distribution. tepr accomplishes
this by identifying the maximum diff_Fx. In other words, it
pinpoints the greatest variation in the signal progression of the
empirical cumulative distribution (see ?genesECDF and
?meandifference). The window where this maximum change
occurs is termed the knee and is calculated as: knee =
max(diff_Fx). Thus, the knee position is defined as the maximum
difference between the ECDF and the y=x curve.
?kneeid returns both the position (Knee_AUC)
and the value (diff_Fx) of the knee.
##                            transcript knee_AUC_ctrl max_diff_Fx_ctrl
## ENST00000230895.11 ENST00000230895.11            49       0.04433176
## ENST00000274140.10 ENST00000274140.10            49       0.04980327
## ENST00000275493.7   ENST00000275493.7            94       0.07642503
## ENST00000527786.7   ENST00000527786.7            61       0.10785215
## ENST00000549802.5   ENST00000549802.5            59       0.53104881
## ENST00000615891.2   ENST00000615891.2           100       0.16331981
##                    knee_AUC_HS max_diff_Fx_HS
## ENST00000230895.11          34     0.56852906
## ENST00000274140.10          76     0.07198283
## ENST00000275493.7           70     0.49569073
## ENST00000527786.7           78     0.49396784
## ENST00000549802.5           48     0.58517400
## ENST00000615891.2           65     0.11664132The ?attenuation function calculates the mean score of
all window means before the knee location and the mean score of
all window means after the knee location. It then calculates an
“attenuation score” using the following formula: att <- 100 -
(downmean / upmean) x 100 (x for multiplication). The function
returns three values, stored in the columns “UP_mean_,” “DOWN_mean_,”
and “Attenuation_,” representing att, downmean, and
upmean, respectively.
resatt <- attenuation(resauc, resknee, rescountna, bytranslistmean, expdf,
    resmeandiff, verbose = FALSE)
print(head(resatt, 3))##           transcript    gene strand  chr    coor1    coor2   size window_size
## 1 ENST00000230895.11     DAP      - chr5 10679230 10761234  82005         410
## 2 ENST00000274140.10 MARCHF6      + chr5 10353695 10440388  86694         433
## 3  ENST00000275493.7    EGFR      + chr7 55019017 55211628 192612         963
##   AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl    AUC_HS     p_AUC_HS
## 1 5.830076  0.9874106      0.045           7.927029 65.589375 1.203856e-28
## 2 5.046713  0.9639452      0.050           8.032174  7.447405 6.271671e-01
## 3 7.515544  0.5441425      0.080           6.210447 56.671974 3.857500e-22
##   D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1    0.570         1.003308         0.9874106    3.611568e-28
## 2    0.075         9.129133         0.9874106    6.271671e-01
## 3    0.500         3.330179         0.8162138    7.714999e-22
##   dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1                59.759300               9.396710e-26
## 2                 2.400692               9.874106e-01
## 3                49.156430               7.407064e-21
##   D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1                      0.540                      5.638026e-25            49
## 2                      0.045                      9.874106e-01            49
## 3                      0.485                      1.481413e-20            94
##   max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1       0.04433176          34     0.56852906        0     9.375234
## 2       0.04980327          76     0.07198283        0     9.662655
## 3       0.07642503          70     0.49569073        0     7.209353
##   DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1       7.464112         20.38480   4.364260    0.3201154       92.66507
## 2       7.522987         22.14369  10.803586    8.1156322       24.88020
## 3       5.346316         25.84194   8.049177    0.8118397       89.91400Universe
Prior to further downstream analysis, the set of transcripts to be studied must be defined. These transcripts are designated as belonging to the “Universe”. tepr selects a transcript for inclusion in the “Universe” if it meets the following criteria:
Analytically, this can be represented as:
window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
meanstress > meanstressthreshold &
pvaltheory > pvaltheorythresIf only one condition is provided, only the control threshold is used:
window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
pvaltheory > pvaltheorythresAppropriate threshold values for each criterion depend on the
specific experiments. The user can adjust these thresholds throughout
the analysis. A transcript that satisfies the above criteria is defined
as belonging to the “Universe” of transcripts to be analyzed. The
Universe column in the data will contain TRUE
or FALSE to indicate membership.
Groups
The transcripts in the Universe are further categorized into Groups. A transcript belongs to the “Attenuated” group if it meets the following criteria:
Universe == TRUE).Analytically, this is represented as:
If only one condition is provided, no comparison via the
Kolmogorov-Smirnov test with another condition is performed. A
transcript will be considered attenuated if it deviates significantly
from the theoretical cumulative distribution (y = x) and its
AUC value is sufficiently high. The significant deviation has already
been evaluated during the Universe calculation
(pvaltheory > pvaltheorythres). Therefore, with only one
condition, a transcript is considered attenuated if:
A transcript is considered non-attenuated (“Outgroup”) if it meets the following criteria:
Analytically, this is represented as:
Universe == TRUE & pvalks > outgrouppvalksthreshold &
aucctrl > aucctrlthresoldhigher & aucctrl < aucctrlthresholdlowerFor a single-condition analysis, where no Kolmogorov-Smirnov test against another condition is performed, a transcript belongs to the “Outgroup” if it belongs to the Universe and its control AUC falls between two user-defined thresholds:
Note that transcripts belonging to neither “Attenuated” nor
“Outgroup” are assigned NA to the Group column. The
Universe and Group columns are computed with
the ?universegroup function:
##   Universe      Group         transcript    gene strand  chr    coor1    coor2
## 1     TRUE Attenuated ENST00000230895.11     DAP      - chr5 10679230 10761234
## 2     TRUE   Outgroup ENST00000274140.10 MARCHF6      + chr5 10353695 10440388
##    size window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl    AUC_HS
## 1 82005         410 5.830076  0.9874106      0.045           7.927029 65.589375
## 2 86694         433 5.046713  0.9639452      0.050           8.032174  7.447405
##       p_AUC_HS D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 1.203856e-28    0.570         1.003308         0.9874106    3.611568e-28
## 2 6.271671e-01    0.075         9.129133         0.9874106    6.271671e-01
##   dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1                59.759300               9.396710e-26
## 2                 2.400692               9.874106e-01
##   D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1                      0.540                      5.638026e-25            49
## 2                      0.045                      9.874106e-01            49
##   max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1       0.04433176          34     0.56852906        0     9.375234
## 2       0.04980327          76     0.07198283        0     9.662655
##   DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1       7.464112         20.38480    4.36426    0.3201154       92.66507
## 2       7.522987         22.14369   10.80359    8.1156322       24.88020tepr provides visualization capabilities for: the cumulative
transcription density along a selected transcription unit
(?plotecdf); the comparison of AUC between two conditions
(?plotauc); the average transcription density of a
user-selected set of transcripts (?plotmetagenes); and a
histogram of the distance between knees and transcription start sites
(TSS, ?plothistoknee).
plotecdfThe empirical cumulative distribution function (ECDF) for a specific transcript (EGFR in this example) can be visualized using:
colvec <- c("#90AFBB", "#10AFBB", "#FF9A04", "#FC4E07")
plotecdf(resmeandiff, res, expdf, "EGFR", colvec, plot = TRUE, verbose = FALSE)plotaucAUC values between conditions can be compared by highlighting the two conditions on the plot (only 6 points appearing):
## Ignoring unknown labels:
## • legend : "-log10 p-value"
## • colour : "-log10 p-value"The plot generated using the complete dataset is:
AUC group plot
It is also possible to compare the AUC by coloring points according
to the Kolmogorov-Smirnov test adjusted p-values and by indicating
graphically the points corresponding to a group of genes (here defined
by genevec):
genevec <- c("EGFR", "DAP", "FLI1", "MARCHF6", "LINC01619")
plotauc(res, expdf, genevec, plot = TRUE)## Ignoring unknown labels:
## • legend : "-log10 p-value"The plot generated using the complete dataset is:
AUC pval plot
plotmetagenesThe average transcription density along a meta-transcript or meta-gene can be visualized as follows:
The plot on the complete dataset gives:
Attenuation meta
Note that the plottype parameter, set to “attenuation”
in this example, specifies that the distribution should be plotted for
transcripts identified as “attenuated” by the
?universegroup function. Other options for
plottype include “outgroup” (for transcripts categorized as
“outgroup”), “universe” (for all transcripts in the Universe), and “all”
(for all transcripts in the initial table).
plothistokneeThe distribution of knee distances from the TSS can be visualized as a percentage:
The plot generated using the complete dataset is:
histo percent
The distances can also be visualized in kilobases (kb) by setting the
plottype parameter to “kb”:
The plot generated using the complete dataset is:
histo kb
teprmulti AnalysisAnalyzing more than two experimental conditions is possible using the
?teprmulti function. This function performs an “all
vs. all” comparison and returns a list. Each element of this list
corresponds to a specific pairwise comparison and contains another list
with two data frames:
resmeandiff_<comparison>: The results of the
?meandifference function for the specified two-condition
comparison.resunigroupatt_<comparison>: The results of the
?universegroup function for the specified two-condition
comparison.The ?showallcomp function returns a vector of all
comparison names that ?teprmulti will perform. This allows
users to reduce the number of comparisons by using the
dontcompare parameter. dontcompare accepts a
vector of comparison names to exclude. The following code limits
?teprmulti to only three comparisons:
plotmultiUsing the list of results from ?teprmulti, all plots for
all comparisons can be generated at once using ?plotmulti.
This function iterates through each element of resteprmulti
(each element representing a two-condition comparison) and calls the
following functions:
?plotecdf: Generates a figure for each gene specified
in ecdfgenevec.?plotauc: Generates figures with options “group” and
“p-value”. The “p-value” option figure is not generated if
genaucvec = NA.?plotmetagenes: Generates figures with options
“attenuation”, “outgroup”, “universe”, and “all”.?plothistoknee: Generates figures with options
“percent” and “kb”.Figures are written to the directory specified by
outfold, and subfolders are automatically created for each
comparison.
If your dataset contains only one condition (with one or more
replicates), you can analyze the data by comparing the observed signal
to a linear “theoretical” distribution representing a uniform signal y =
x. The calculation of differences in means and AUC is skipped when
running ?meandifference and ?allauc. The
criteria for defining the “Universe” and “Group” columns also differ
(see details in the previous section).
## The experiment table is limited to one condition and one replicate
expdfonecond <- expdf[which(expdf$condition == "HS" & expdf$replicate == 1), ]
## The table obtained by preprocessing is limited to the condition 'HS' and replicate 1
transdfonecond <- transdf[, c(seq_len(9), 18, 19, 20, 21)]
## Computing the object 'res' with tepr on one condition
resonecond <- tepr(expdfonecond, transdfonecond, expthres, controlcondname = "HS", verbose = FALSE)Because AUC and metagene plots require two conditions, they cannot be generated in a single-condition analysis. However, the ECDF for a given gene can be obtained by considering the y = x distribution as follows:
plotecdf(resonecond[[1]], resonecond[[2]], expdfonecond, genename = "EGFR",
    colvec = c("#90AFBB"), plot = TRUE, verbose = FALSE)The histogram of knee positions can be obtained with:
## Randomly marking 3 transcripts as attenuated as a mock example
idxatt <- sample(seq_len(6), 3)
resonecond[[2]][idxatt, "Group"] <- "Attenuated"
resonecond[[2]][idxatt, "Universe"] <- TRUE
plothistoknee(resonecond[[2]], kneename = "knee_AUC_HS", plottype = "percent", plot = TRUE, verbose = FALSE)For processing the complete dataset, parallelization is highly
recommended. We used 20 CPUs for our analysis. The code below assumes
that the file “dTAG_Cugusi_stranded_20230810.tsv” is located in the
current working directory. You can verify the format of the
transdf table using the checkexptab utility
function.
The approximate computation time for each function is indicated below.
library(tepr)
## After downloading with: wget https://zenodo.org/records/15050723/files/cugusi.tsv -P preprocessed/
transpath <- "preprocessed/cugusi.tsv"
exppath <- system.file("extdata", "exptab.csv", package="tepr")
## Reading input files
transdf <- read.delim(transpath, header = FALSE) 
expdf <- read.csv(exppath)
reslist <- tepr(expdf, transdf, expthres = 0.1, nbcpu=5, showtime = TRUE, verbose = TRUE)
         ## Calculating average expression per transcript
         Removing lines with values < expthres
                 -- Analysis performed in: 8.1 secs
         ## Counting NA values
         Splitting the table by each transcript
                 -- Analysis performed in: 38 secs
         ## Computing ecdf
         Filtering to keep only the expressed transcripts
         Splitting the table by each transcript
         Computing ecdf on each transcript
                 -- Analysis performed in: 3.2 mins
         ## Computing meandifference
         Merging columns for condition ctrl
         Calculating average and difference between replicates for columns 'value' of ctrl
         Calculating average and difference between replicates for columns 'Fx' of ctrl
         Merging columns for condition HS
         Calculating average and difference between replicates for columns 'value' of HS
         Calculating average and difference between replicates for columns 'Fx' of HS
         Computing all differences on mean columns
                 -- Analysis performed in: 4.3 secs
         ## Split the results by transcripts
                 -- Analysis performed in: 6.3 secs
         ## Computing AUC and difference
         Computing the differences (d or delta) of AUC
         Computing the Area Under Curve (AUC)
         Merging results
                 -- Analysis performed in: 1.6 mins
         ## Calculating knee
                 -- Analysis performed in: 26 secs
         ## Calculating attenuation
         Merging tables
         Building summary
         Merging summary
         Merging detailed mean table with summary
         Splitting the previous table by transcript
         Computing up and down mean
         Merging attenuation to the complete table
                 -- Analysis performed in: 1.26459248860677
         ## Computing universe and group columns
                 -- Analysis performed in: 0.021 secs
                 -- tepr analysis performed in: 7.4 mins
## Retrieving results
resmeandiff <- reslist[[1]]
res <- reslist[[2]]
## Plotting and saving to current folder
colvec <- c("#90AFBB", "#10AFBB", "#FF9A04", "#FC4E07")
plotecdf(resmeandiff, res, expdf, "EGFR", colvec, formatname = "png")
plotauc(res, expdf, plottype = "groups", outfile = "AUCcompare_group",
    formatname = "png")
genevec <- c("EGFR", "DAP", "FLI1", "MARCHF6", "LINC01619")
plotauc(res, expdf, genevec, plottype = "pval", outfile = "AUCcompare_pval",
    formatname = "png")
plotmetagenes(res, resmeandiff, expdf, plottype = "attenuation", formatname = "png")
plothistoknee(res, formatname = "png")
plothistoknee(res, plottype = "kb", formatname = "png")