rsynthbio is an R package that provides a convenient
interface to the Synthesize
Bio API, allowing users to generate realistic gene expression data
based on specified biological conditions. This package enables
researchers to easily access AI-generated transcriptomic data for
various modalities including bulk RNA-seq and single-cell RNA-seq.
Alternatively, you can generate datasets with AI on our web platform.
You can install rsynthbio from CRAN:

install.packages("rsynthbio")

To get the development version, install from GitHub using the
remotes package:
# Install remotes first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")

Once installed, load the package:

library(rsynthbio)
Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()
# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)

Loading your API key for a session:
# In future sessions, load the stored token
load_synthesize_token_from_keyring()
# Check if a token is already set
has_synthesize_token()

You can obtain an API token by registering at Synthesize Bio.
For security reasons, remember to clear your token when you’re done:
# Clear token from current session
clear_synthesize_token()
# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)

Never hard-code your token in scripts that will be shared or committed to version control.
The modality (data type to generate) is specified in the query using
get_valid_query():
- bulk: Bulk RNA-seq (asynchronous under the hood, returned as data frames)
- single-cell: Single-cell RNA-seq (asynchronous under the hood, returned as data frames)

You can check which modalities are available programmatically:
You do not need to specify any internal API slugs. The library maps modalities to the appropriate model endpoints automatically.
The structure of the query required by the API is fixed for the
currently supported model. You can use get_valid_query() to
get a correctly structured example list.
# Get the example query structure
example_query <- get_valid_query()
# Inspect the query structure
str(example_query)

The query consists of:
- modality: The type of gene expression data to generate (“bulk” or “single-cell”)
- mode: The prediction mode that controls how expression data is generated (“sample generation” or “mean estimation”)
- inputs: A list of biological conditions to generate data for

Each input contains metadata (describing the biological sample) and num_samples (how many samples to generate).

See the Query Parameters section below for detailed documentation on mode and other optional query fields.
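To make the structure concrete, here is a minimal sketch of a query built by hand as a plain R list, using only the field names described above. The metadata values are illustrative placeholders, and `get_valid_query()` remains the authoritative source for the expected structure.

```r
# Hand-built list mirroring the documented fields: modality, mode, inputs.
# All metadata values below are illustrative placeholders.
query <- list(
  modality = "bulk",
  mode = "sample generation",
  inputs = list(
    list(
      metadata = list(
        sex = "female",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002107"  # liver
      ),
      num_samples = 3
    ),
    list(
      metadata = list(
        sex = "male",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002107"
      ),
      num_samples = 2
    )
  )
)

# Total number of samples requested across all inputs
total_samples <- sum(vapply(query$inputs, function(x) x$num_samples, numeric(1)))

# Inspect the top two levels of the list
str(query, max.level = 2)
```

Each element of `inputs` describes one biological condition, so comparing two conditions means supplying two inputs rather than one input with mixed metadata.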
Once your query is ready, you can send it to the API to generate gene expression data:
The result is a list of two data frames: metadata and expression.
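The sketch below shows one way to work with a result of this shape, using a mock object rather than a real API response. It assumes one row per generated sample in both data frames, in the same order, with genes as columns of `expression`. The column names `sample_id`, `GENE_A`, and `GENE_B` are hypothetical.

```r
# Mock object standing in for an API result (all values are made up).
result <- list(
  metadata = data.frame(
    sample_id = c("s1", "s2", "s3"),
    sex = c("female", "female", "male")
  ),
  expression = data.frame(
    GENE_A = c(10, 12, 7),
    GENE_B = c(0, 3, 5)
  )
)

# Assumption: rows of the two data frames line up sample by sample.
stopifnot(nrow(result$metadata) == nrow(result$expression))

# Mean expression per gene within each metadata group
group_means <- aggregate(
  result$expression,
  by = list(sex = result$metadata$sex),
  FUN = mean
)
print(group_means)
```

If the real data frames are oriented differently, check `dim()` on both before combining them.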
Behind the scenes, the API uses an asynchronous model to handle queries efficiently:
All of this happens automatically when you call
predict_query().
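For readers unfamiliar with asynchronous APIs, the following is a generic polling loop in base R. It is an illustration of the pattern only, not the package's internal code: `poll_until_ready()`, `check_status`, and the status strings are hypothetical names chosen for this sketch.

```r
# Generic polling loop; NOT the implementation used by rsynthbio.
# `check_status` is a hypothetical function returning "running",
# "ready", or "failed".
poll_until_ready <- function(check_status, poll_interval = 1, timeout = 30) {
  start <- Sys.time()
  repeat {
    status <- check_status()
    if (identical(status, "ready")) return(TRUE)
    if (identical(status, "failed")) stop("Job failed.", call. = FALSE)
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed + poll_interval > timeout) {
      stop("Timed out waiting for job.", call. = FALSE)
    }
    Sys.sleep(poll_interval)  # wait before asking again
  }
}

# Mock status source: reports "ready" on the third check.
make_mock_status <- function(ready_after = 3) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= ready_after) "ready" else "running"
  }
}

done <- poll_until_ready(make_mock_status(3), poll_interval = 0.01, timeout = 5)
```

A shorter interval returns results sooner at the cost of more status requests; a longer timeout suits large queries.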
You can customize the polling behavior if needed:
The input metadata is a list of lists. This is the full list of valid metadata keys:
Biological:

- age_years
- cell_line_ontology_id
- cell_type_ontology_id
- developmental_stage
- disease_ontology_id
- ethnicity
- genotype
- race
- sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
- sex (“male”, “female”)
- tissue_ontology_id

Perturbational:

- perturbation_dose
- perturbation_ontology_id
- perturbation_time
- perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)

Technical:

- study (BioProject ID)
- library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT”; see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
- library_layout (“PAIRED”, “SINGLE”)
- platform (“illumina”)

The following are the valid values or expected formats for selected metadata keys:
| Metadata Field | Requirement / Example |
|---|---|
| cell_line_ontology_id | Requires a Cellosaurus ID. |
| cell_type_ontology_id | Requires a CL ID. |
| disease_ontology_id | Requires a MONDO ID. |
| perturbation_ontology_id | Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606). |
| tissue_ontology_id | Requires a UBERON ID. |
We highly recommend using the EMBL-EBI Ontology Lookup Service to find valid IDs for your metadata.
Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
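Because the API rejects out-of-range values, a quick local check can catch typos before a query is sent. The sketch below uses the keys and enumerated values listed above. `check_metadata()` is a hypothetical helper, not a function provided by rsynthbio, and it does not validate ontology IDs or model-specific ranges.

```r
# Metadata keys as listed in this document
valid_keys <- c(
  "age_years", "cell_line_ontology_id", "cell_type_ontology_id",
  "developmental_stage", "disease_ontology_id", "ethnicity", "genotype",
  "race", "sample_type", "sex", "tissue_ontology_id",
  "perturbation_dose", "perturbation_ontology_id", "perturbation_time",
  "perturbation_type",
  "study", "library_selection", "library_layout", "platform"
)

# Enumerated values for a few keys, as listed in this document
allowed_values <- list(
  sex = c("male", "female"),
  library_layout = c("PAIRED", "SINGLE"),
  platform = "illumina",
  sample_type = c("cell line", "organoid", "other", "primary cells",
                  "primary tissue", "xenograft")
)

# Hypothetical helper: returns a character vector of problems (empty if none).
check_metadata <- function(metadata) {
  problems <- character(0)
  unknown <- setdiff(names(metadata), valid_keys)
  if (length(unknown) > 0) {
    problems <- c(problems, paste("Unknown key:", unknown))
  }
  for (key in intersect(names(metadata), names(allowed_values))) {
    if (!metadata[[key]] %in% allowed_values[[key]]) {
      problems <- c(problems,
                    sprintf("Invalid value for %s: %s", key, metadata[[key]]))
    }
  }
  problems
}

# Example with one misspelled key and one wrongly cased value
issues <- check_metadata(
  list(sex = "female", tissue = "liver", library_layout = "paired")
)
print(issues)
```

The API remains the final authority on what a given model accepts; a local check like this only shortens the feedback loop.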
In addition to metadata, queries support several optional parameters that control the generation process:
Controls the type of prediction the model generates. This parameter is required in all queries.
Available modes:
“sample generation”: The model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled to generate realistic-looking synthetic data that captures the error associated with measurements. This mode is useful when you want data that mimics real experimental measurements.
“mean estimation”: The model creates a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error. The mean of that distribution serves as the prediction. This mode is useful when you want a stable estimate of expected expression levels.
Note: Single-cell queries only support “mean estimation” mode. Bulk queries support both modes.
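The constraint in the note can be written as a small lookup, which is handy when queries are built programmatically. This is a sketch in base R; `supported_modes` and `is_mode_supported()` are hypothetical names, not part of rsynthbio.

```r
# Modes supported by each modality, as stated in the note above.
supported_modes <- list(
  bulk = c("sample generation", "mean estimation"),
  `single-cell` = "mean estimation"
)

# Hypothetical helper: TRUE if the modality/mode pair is allowed.
is_mode_supported <- function(modality, mode) {
  modality %in% names(supported_modes) &&
    mode %in% supported_modes[[modality]]
}

is_mode_supported("bulk", "sample generation")
is_mode_supported("single-cell", "sample generation")
```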
# Bulk query with sample generation (default for bulk)
bulk_query <- get_valid_query(modality = "bulk")
bulk_query$mode <- "sample generation"
# Bulk query with mean estimation
bulk_query_mean <- get_valid_query(modality = "bulk")
bulk_query_mean$mode <- "mean estimation"
# Single-cell query (must use mean estimation)
sc_query <- get_valid_query(modality = "single-cell")
sc_query$mode <- "mean estimation" # Required for single-cell

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.
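The proportional scaling can be seen in a short worked example. The exact transform used by the model is not stated here, so this sketch assumes values of the form log(CPM + 1) with the natural logarithm; `log_cpm_to_counts()` and the gene names are hypothetical.

```r
# Assumption: inputs are log(CPM + 1) with natural log. The model's
# actual log base and pseudocount may differ.
log_cpm_to_counts <- function(log_cpm, library_size) {
  cpm <- expm1(log_cpm)            # invert log(CPM + 1)
  round(cpm * library_size / 1e6)  # rescale from per-million to library size
}

# Three made-up genes with CPM of 50, 200, and 0
log_cpm <- log1p(c(GENE_A = 50, GENE_B = 200, GENE_C = 0))

counts_10m <- log_cpm_to_counts(log_cpm, library_size = 1e7)
counts_20m <- log_cpm_to_counts(log_cpm, library_size = 2e7)

print(rbind(counts_10m, counts_20m))
```

Doubling the library size doubles every count, while the relative abundance of genes stays the same.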
If TRUE, the model uses the mean of each latent
distribution (p(z|metadata) or q(z|x)) instead
of sampling. This removes randomness from latent sampling and produces
deterministic outputs for the same inputs.
Default: FALSE (sampling is enabled).

Random seed for reproducibility when using stochastic sampling.
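The difference between using the latent mean, sampling with a fixed seed, and sampling with a different seed can be illustrated with a toy distribution in base R. This is a conceptual sketch only: `draw_latent()` is hypothetical, and the independent normal distribution stands in for the model's latent space, which is not described here.

```r
# Toy latent distribution: independent normals with given means and sds.
# This illustrates the idea only; it is not the model's latent space.
draw_latent <- function(mu, sigma, deterministic = FALSE, seed = NULL) {
  if (deterministic) return(mu)       # use the mean, no randomness
  if (!is.null(seed)) set.seed(seed)  # fix the random stream if requested
  rnorm(length(mu), mean = mu, sd = sigma)
}

mu <- c(0.5, -1.2, 2.0)
sigma <- c(0.1, 0.3, 0.2)

z_det <- draw_latent(mu, sigma, deterministic = TRUE)  # always equals mu
z_a <- draw_latent(mu, sigma, seed = 42)
z_b <- draw_latent(mu, sigma, seed = 42)               # same seed, same draw
z_c <- draw_latent(mu, sigma, seed = 7)                # different seed

identical(z_a, z_b)
identical(z_a, z_c)
```

Using the mean gives one fixed answer per input, whereas a seed keeps the variability of sampling but makes a particular draw repeatable.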
You can combine multiple parameters in a single query:
You can customize the query inputs to fit your specific research needs:
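As a minimal sketch of this kind of customization, the example below edits a hand-built list that stands in for a query; with the package loaded you would start from `get_valid_query()` instead. The metadata values are illustrative, and whether a given combination is accepted depends on the model.

```r
# Hand-built stand-in for a query (values are illustrative).
query <- list(
  modality = "bulk",
  mode = "mean estimation",
  inputs = list(
    list(
      metadata = list(sex = "female", sample_type = "cell line"),
      num_samples = 1
    )
  )
)

# Change fields of the first input in place
query$inputs[[1]]$num_samples <- 5
query$inputs[[1]]$metadata$perturbation_type <- "crispr"
query$inputs[[1]]$metadata$perturbation_ontology_id <- "ENSG00000156127"

# Append a matched control condition as a second input
control <- query$inputs[[1]]
control$metadata$perturbation_type <- "control"
control$metadata$perturbation_ontology_id <- NULL  # drop the perturbation target
query$inputs <- c(query$inputs, list(control))

str(query$inputs, max.level = 2)
```

Copying an input and changing only the perturbation fields keeps every other metadata value identical between the treated and control conditions.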