Type: | Package |
Title: | Retrieve Genomic and Clinical Data from CBioPortal Including TCGA Data |
Version: | 1.9.1 |
Date: | 2024-01-22 |
Author: | Damiano Fantini |
Maintainer: | Damiano Fantini <damiano.fantini@gmail.com> |
Depends: | R (≥ 3.5.0) |
Imports: | httr, reshape2, jsonlite |
Suggests: | stats, utils, knitr, rmarkdown, ggplot2, dplyr |
VignetteBuilder: | knitr |
Description: | The Cancer Genome Atlas (TCGA) is a program aimed at improving our understanding of Cancer Biology. Several TCGA Datasets are available online. 'TCGAretriever' helps accessing and downloading TCGA data hosted on 'cBioPortal' via its Web Interface (see https://www.cbioportal.org/ for more information). |
License: | GPL-3 |
LazyData: | true |
URL: | https://www.data-pulse.com/dev_site/TCGAretriever/ |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
NeedsCompilation: | no |
Packaged: | 2024-01-23 20:51:47 UTC; dami |
Repository: | CRAN |
Date/Publication: | 2024-01-23 21:52:48 UTC |
Retrieve Genomic and Clinical Data from CBioPortal
Description
The Cancer Genome Atlas (TCGA) is a scientific and medical program aimed at improving our understanding of Cancer Biology. Part of the TCGA Datasets (free-access tier) are hosted on cBioPortal, which is an open-access, open-source resource for interactive exploration of multidimensional cancer genomics data sets. TCGAretriever helps accessing and downloading TCGA data via the cBioPortal API. Features of TCGAretriever are:
it is simple and reliable
it is tailored for downloading large volumes of data
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
See Also
Useful links:
Examples
# List available Adenoid Cystic Carcinoma (acyc) Studies.
# Set `dryrun = FALSE` (default option) in production!
all_studies <- get_cancer_studies(dryrun = TRUE)
acyc_ids <- grep('acyc', all_studies$studyId, value = TRUE)
print(acyc_ids)
# List blca_tcga profiles.
# Set `dryrun = FALSE` (default option) in production!
get_genetic_profiles(csid = 'blca_tcga', dryrun = TRUE)
# List blca_tcga case lists.
# Set `dryrun = FALSE` (default option) in production!
get_case_lists(csid = 'blca_tcga', dryrun = TRUE)
# Retrieve expression data.
# Set `dryrun = FALSE` (default option) in production!
my_genes <- c('PTEN', 'TP53')
get_molecular_data(case_list_id = 'blca_tcga_3way_complete',
gprofile_id = 'blca_tcga_rna_seq_v2_mrna',
glist = my_genes, dryrun = TRUE)
TCGAretriever Examples
Description
A list of objects including examples of the output returned by different 'TCGAretriever' functions. The objects were obtained from the '"blca_tcga"' study (bladder cancer).
Usage
data(blcaOutputExamples)
Format
A list including 7 elements.
- exmpl_1
data.frame (dimensions: 10 by 13). Sample output of the 'get_cancer_studies()' function.
- exmpl_2
data.frame (dimensions: 10 by 5). Sample output of the 'get_cancer_types()' function.
- exmpl_3
data.frame (dimensions: 9 by 5). Sample output of the 'get_case_lists()' function.
- exmpl_4
list including 9 elements. Sample output of the 'expand_cases()' function.
- exmpl_5
data.frame (dimensions: 10 by 94). Sample output of the 'get_clinical_data()' function.
- exmpl_6
data.frame (dimensions: 9 by 8). Sample output of the 'get_genetic_profiles()' function.
- exmpl_7
data.frame (dimensions: 10 by 3). Sample output of the 'get_gene_identifiers()' function.
- exmpl_8
data.frame (dimensions: 2 by 10). Sample output of the 'get_molecular_data()' function.
- exmpl_9
data.frame (dimensions: 6 by 27). Sample output of the 'get_mutation_data()' function.
Details
The object was built using the following lines of code.
blcaOutputExamples <- list(
exmpl_1 = head(get_cancer_studies(), 10),
exmpl_2 = head(get_cancer_types(), 10),
exmpl_3 = head(get_case_lists("blca_tcga"), 10) ,
exmpl_4 = expand_cases("blca_tcga"),
exmpl_5 = head(get_clinical_data("blca_tcga"), 10),
exmpl_6 = head(get_genetic_profiles("blca_tcga") , 10),
exmpl_7 = head(get_gene_identifiers(), 10),
exmpl_8 = get_molecular_data(case_list_id = 'blca_tcga_3way_complete',
gprofile_id = 'blca_tcga_rna_seq_v2_mrna',
glist = c("TP53", "E2F1"))[, 1:10],
exmpl_9 = head(get_mutation_data(case_list_id = 'blca_tcga_sequenced',
gprofile_id = 'blca_tcga_mutations',
glist = c('TP53', 'PTEN'))))
Examples
data(blcaOutputExamples)
blcaOutputExamples$exmpl_1
Core Query Engine
Description
Submit a query by getting or posting to the URL provided as argument (typically a cbioportal.org URL). The function attempts the same query recursively until the content has been completely downloaded. If 'body' is NULL, the query is submitted via GET. Otherwise, the query is submitted via POST.
Usage
cb_query(my_url, body = NULL)
Arguments
my_url |
String. The URL pointing to the cBioPortal API. |
body |
String. Parameters to be passed via POST to the query URL. Can be NULL. |
Details
This is a core function invoked by other functions in the package.
Value
Data.frame including data retrieved from cBioPortal.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# The example below requires an active Internet connection.
## Not run:
my_url <- "https://www.cbioportal.org/api/studies"
x <- TCGAretriever:::cb_query(my_url)
## End(Not run)
Samples Included in a List of Samples.
Description
Each study includes one or more "case lists". Each case list is a collection of samples that were analyzed using one or more platforms/assays. It is possible to obtain a list of all sample identifiers for each case list of interest.
Usage
expand_cases(csid, dryrun = FALSE)
Arguments
csid |
String corresponding to a TCGA Cancer Study identifier. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
list containing as many elements as TCGA case lists available for a given TCGA Study. Each element is named after a case list identifier. Also, each element is a character vector including all sample identifiers (case ids) corresponding to the corresponding case list identifier.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
x <- expand_cases("blca_tcga", dryrun = TRUE)
lapply(x, utils::head)
Fetch All Molecular Data for a Cancer Profile of Interest.
Description
Recursively query cbioportal to retrieve data corresponding to all available genes. Data are returned as a 'data.frame' that can be easily manipulated for downstream analyses.
Usage
fetch_all_tcgadata(case_list_id, gprofile_id, mutations = FALSE)
Arguments
case_list_id |
string corresponding to the identifier of the TCGA Case List of interest |
gprofile_id |
string corresponding to the identifier of the TCGA Profile of interest |
mutations |
logical. If TRUE, extended mutation data are fetched instead of the standard TCGA data |
Value
A data.frame is returned, including the desired TCGA data. Typically, rows are genes and columns are cases. If "extended mutation" data are retrieved (mutations = TRUE), rows correspond to individual mutations while columns are populated with mutation features
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# The examples below require an active Internet connection.
# Note: execution may take several minutes.
## Not run:
# Download all brca_pub mutation data (complete samples)
all_brca_MUT <- fetch_all_tcgadata(case_list_id = "brca_tcga_pub_complete",
gprofile_id = "brca_tcga_pub_mutations",
mutations = TRUE)
# Download all brca_pub RNA expression data (complete samples)
all_brca_RNA <- fetch_all_tcgadata(case_list_id = "brca_tcga_pub_complete",
gprofile_id = "brca_tcga_pub_mrna",
mutations = FALSE)
## End(Not run)
Retrieve the List of Cancer Studies Available at cbioportal.
Description
Retrieve information about the studies or datasets available at cbioportal.org. Information include a 'studyId', description, references, and more.
Usage
get_cancer_studies(dryrun = FALSE)
Arguments
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
Data Frame including cancer study information.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
all_studies <- get_cancer_studies(dryrun = TRUE)
utils::head(all_studies)
Retrieve Cancer Types.
Description
Retrieve information about cancer types and corresponding abbreviations from cbioportal.org. Information include identifiers, names, and parental cancer type.
Usage
get_cancer_types(dryrun = FALSE)
Arguments
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
A data.frame including cancer type information.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
all_canc <- get_cancer_types(dryrun = TRUE)
utils::head(all_canc)
Retrieve Case List Available for a Specific Cancer Study.
Description
Each study includes one or more "case lists". Each case list is a collection of samples that were analyzed using one or more platforms/assays. It is possible to obtain a list of case list identifiers from cbioportal.org for a cancer study of interest. Identifier, name, description and category are returned for each entry.
Usage
get_case_lists(csid, dryrun = FALSE)
Arguments
csid |
String corresponding to the Identifier of the Study of Interest |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
Data Frame including Case List information.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
blca_case_lists <- get_case_lists("blca_tcga", dryrun = TRUE)
blca_case_lists
Retrieve Clinical Information for a Cancer Study.
Description
Retrieve Clinical Information about the samples included in a cancer study of interest. For each sample/case, information about the corresponding cancer patient are returned. These may include sex, age, therapeutic regimen, tumor stage, survival status, as well as other information.
Usage
get_clinical_data(csid, case_list_id = NULL, dryrun = FALSE)
Arguments
csid |
String corresponding to a TCGA Cancer Study identifier. |
case_list_id |
String corresponding to the case_list identifier of interest. This Can be NULL. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
data.frame including clinical information of a list of samples/cases of interest.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
clinic_data <- get_clinical_data("blca_tcga", dryrun = TRUE)
utils::head(clinic_data[, 1:7])
Retrieve All Gene Identifiers
Description
Obtain all valid gene identifiers, including ENTREZ gene identifiers and HUGO gene symbols. Genes are classified according to the gene type (*e.g.*, 'protein-coding', 'pseudogene', 'miRNA', ...). Note that miRNA and phosphoprotein genes are associated with a negative entrezGeneId.
Usage
get_gene_identifiers(dryrun = FALSE)
Arguments
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
Data Frame including gene identifiers.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
x <- get_gene_identifiers(dryrun = TRUE)
Retrieve Genetic Profiles for a TCGA Study of Interest
Description
Retrieve Information about all genetic profiles associated with a cancer study of interest. Each cancer study includes one or more types of molecular analyses. The corresponding assays or platforms are referred to as genetic profiles. A genetic profile identifier is required to download molecular data.
Usage
get_genetic_profiles(csid = NULL, dryrun = FALSE)
Arguments
csid |
String corresponding to the cancer study id of interest |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
data.frame including information about genetic profiles.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
get_genetic_profiles("blca_tcga", dryrun = TRUE)
Retrieve Molecular Data corresponding to a Genetic Profile of Interest.
Description
Retrieve Data corresponding to a Genetic Profile of interest from a cancer study of interest. This function is the workhorse of the TCGAretriever package and can be used to fetch data concerning several genes at once. For retrieving mutation data, please use the 'get_mutation_data()' function. For large queries (more than 500 genes), please use the 'fetch_all_tcgadata()' function.
Usage
get_molecular_data(
case_list_id,
gprofile_id,
glist = c("TP53", "E2F1"),
dryrun = FALSE
)
Arguments
case_list_id |
String corresponding to the Identifier of a list of cases. |
gprofile_id |
String corresponding to the Identifier of a genetic Profile of interest. |
glist |
Vector including one or more gene identifiers (ENTREZID or OFFICIAL_SYMBOL). ENTREZID gene identifiers should be passed as numeric. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
data.frame including the molecular data of interest. Rows are genes, columns are samples.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
x <- get_molecular_data(case_list_id = 'blca_tcga_3way_complete',
gprofile_id = 'blca_tcga_rna_seq_v2_mrna',
glist = c("TP53", "E2F1"), dryrun = TRUE)
x[, 1:10]
Retrieve Mutation Data corresponding to a Genetic Profile of Interest.
Description
Retrieve DNA Sequence Variations (Mutations) identified by exome sequencing projects. This function is the workhorse of the TCGAretriever package for mutation data and can be used to fetch data concerning several genes at once. For retrieving non-mutation data, please use the 'get_molecular_data()' function. For large queries (more than 500 genes), please use the 'fetch_all_tcgadata()' function.
Usage
get_mutation_data(
case_list_id,
gprofile_id,
glist = c("TP53", "E2F1"),
dryrun = FALSE
)
Arguments
case_list_id |
String corresponding to the Identifier of a list of cases. |
gprofile_id |
String corresponding to the Identifier of a genetic Profile of interest. |
glist |
Vector including one or more gene identifiers (ENTREZID or OFFICIAL_SYMOL). ENTREZID gene identifiers should be passed as numeric. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Value
data Frame inluding one row per mutation
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
# Set `dryrun = FALSE` (default option) in production!
x <- get_mutation_data(case_list_id = 'blca_tcga_sequenced',
gprofile_id = 'blca_tcga_mutations',
glist = c('TP53', 'PTEN'), dryrun = TRUE)
utils::head(x[, c(4, 7, 23, 15, 16, 17, 24, 18, 21)])
Split Numeric Vectors in Groups.
Description
Assign each element of a numeric vector to a group. Grouping is based on ranks: numeric values are sorted and then split in 2 or more groups. Values may be sorted in an increasing or decreasing fashion. The vector is returned in the original order. Labels may be assigned to each group.
Usage
make_groups(num_vector, groups, group_labels = NULL, desc = FALSE)
Arguments
num_vector |
numeric vector. It includes the values to be assigned to the different groups |
groups |
integer. The number of groups that will be generated |
group_labels |
character vector. Labels for each group. Note that the length of group_labels has to be equal to the number of groups |
desc |
logical. If TRUE, the sorting is applied in a decreasing fashion |
Value
data.frame including the vector provided as argument in the original order ("value") and the grouping vector ("rank"). If labels are provided as an argument, group labels are also included in the data.frame ("labels").
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/TCGAretriever/
Examples
exprs_geneX <- c(19.1,18.4,22.4,15.5,20.2,17.4,9.4,12.4,31.2,33.2,18.4,22.1)
groups_num <- 3
groups_labels <- c("high", "med", "low")
make_groups(exprs_geneX, groups_num, groups_labels, desc = TRUE)