Getting Started

rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq and single-cell RNA-seq.

Alternatively, you can generate datasets with AI through our web platform.

How to install

You can install rsynthbio from CRAN:

install.packages("rsynthbio")

To get the development version, install it from GitHub with the remotes package:

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")

Once installed, load the package:

library(rsynthbio)

Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)

To load your stored API token in a later session:

# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()

You can obtain an API token by registering at Synthesize Bio.

Security Best Practices

For security reasons, remember to clear your token when you’re done:

# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)

Never hard-code your token in scripts that will be shared or committed to version control.
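
One common way to keep a token out of source files is to read it from an environment variable at run time. The sketch below is only an illustration: the variable name MY_SYNTHESIZE_TOKEN and the helper read_token_from_env() are hypothetical and not part of rsynthbio. How the value is then handed to the package depends on the set_synthesize_token() arguments, which are not documented here.

```r
# Hypothetical helper (not part of rsynthbio): read a secret from an
# environment variable instead of writing it into the script.
# The variable name "MY_SYNTHESIZE_TOKEN" is an assumption for illustration.
read_token_from_env <- function(var = "MY_SYNTHESIZE_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  token
}
```

The variable itself can be defined outside the project, for example in a user-level .Renviron file that is never committed.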

Designing Queries for Models

Choosing a Modality

The modality (the type of data to generate) is set when you build the query with get_valid_query().

You can check which modalities are available programmatically:

# Check available modalities
get_valid_modalities()

You do not need to specify any internal API slugs. The library maps modalities to the appropriate model endpoints automatically.

# Create a bulk query
bulk_query <- get_valid_query(modality = "bulk")
bulk <- predict_query(bulk_query, as_counts = TRUE)

# Create a single-cell query
sc_query <- get_valid_query(modality = "single-cell")
sc <- predict_query(sc_query, as_counts = TRUE)

Creating a Query

The structure of the query required by the API is fixed for the currently supported model. You can use get_valid_query() to get a correctly structured example list.

# Get the example query structure
example_query <- get_valid_query()

# Inspect the query structure
str(example_query)

The query consists of:

  1. modality: The type of gene expression data to generate (“bulk” or “single-cell”)
  2. mode: The prediction mode that controls how expression data is generated:
    • “sample generation”: Generates realistic-looking synthetic data with measurement error (bulk only)
    • “mean estimation”: Provides stable mean estimates of expression levels (bulk and single-cell)
  3. inputs: A list of biological conditions to generate data for

Each input contains metadata (describing the biological sample) and num_samples (how many samples to generate).
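
As a rough picture of the structure described above, the following hand-built list mirrors the three top-level fields. It is a sketch only: the query returned by get_valid_query() may carry additional fields or different example values, and the metadata values here are illustrative.

```r
# Hand-built illustration of the three top-level query fields.
# The real example query may differ; values here are assumptions.
sketch_query <- list(
  modality = "bulk",
  mode = "sample generation",
  inputs = list(
    list(
      metadata = list(
        sex = "male",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 5
    )
  )
)

# Total number of samples requested across all inputs
total_samples <- sum(
  vapply(sketch_query$inputs, function(input) input$num_samples, numeric(1))
)

# Show the nesting: top-level fields, then one entry per input
str(sketch_query, max.level = 2)
```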

See the Query Parameters section below for detailed documentation on mode and other optional query fields.

Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

result <- predict_query(query, as_counts = TRUE)

The result is a list of two data frames: metadata and expression.

Understanding the Async API

Behind the scenes, the API uses an asynchronous model to handle queries efficiently:

  1. Your query is submitted to the API, which returns a query ID
  2. The function automatically polls the status endpoint (default: every 2 seconds)
  3. When the query completes, results are downloaded from a signed URL
  4. Data is parsed and returned as R data frames

All of this happens automatically when you call predict_query().
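
To make the polling steps concrete, here is a generic sketch of a poll-until-done loop. It is not the package's implementation: poll_until_ready(), the check_status argument, and the status strings "running", "ready", and "failed" are all hypothetical stand-ins. Only the default interval (2 seconds) and timeout (900 seconds) are taken from this guide.

```r
# Generic polling loop. `check_status` is a hypothetical function that
# returns a status string; the status names are assumptions.
poll_until_ready <- function(check_status,
                             poll_interval_seconds = 2,
                             poll_timeout_seconds = 900) {
  started <- Sys.time()
  repeat {
    status <- check_status()
    if (identical(status, "ready")) {
      return(invisible(TRUE))
    }
    if (identical(status, "failed")) {
      stop("Query failed.", call. = FALSE)
    }
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= poll_timeout_seconds) {
      stop("Timed out waiting for the query to finish.", call. = FALSE)
    }
    Sys.sleep(poll_interval_seconds)
  }
}

# Stand-in status function that reports "ready" on its third call
calls <- 0
fake_status <- function() {
  calls <<- calls + 1
  if (calls < 3) "running" else "ready"
}

poll_until_ready(fake_status, poll_interval_seconds = 0.01)
```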

Controlling Async Behavior

You can customize the polling behavior if needed:

# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5 # Check every 5 seconds instead of 2
)

Valid Metadata Keys

The input metadata is a list of lists. This is the full list of valid metadata keys:

Biological:

Perturbational:

Technical:

Valid Metadata Values

The following are the valid values or expected formats for selected metadata keys:

Metadata Field            | Requirement / Example
--------------------------|----------------------
cell_line_ontology_id     | Requires a Cellosaurus ID.
cell_type_ontology_id     | Requires a CL ID.
disease_ontology_id       | Requires a MONDO ID.
perturbation_ontology_id  | Must be a valid Ensembl gene ID (e.g., ENSG00000156127), ChEBI ID (e.g., CHEBI:16681), ChEMBL ID (e.g., CHEMBL1234567), or NCBI Taxonomy ID (e.g., 9606).
tissue_ontology_id        | Requires a UBERON ID.

We highly recommend using the EMBL-EBI Ontology Lookup Service to find valid IDs for your metadata.

Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
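
A quick local format check can catch obvious typos before a query is sent. The sketch below is an assumption-laden illustration: the regular expressions reflect the usual shape of these identifiers (for example, CVCL_ prefixes for Cellosaurus), and looks_like_valid_id() is a hypothetical helper, not an rsynthbio function. Passing this check does not mean the API accepts the value, since each model only accepts a limited range.

```r
# Loose, illustrative patterns for the usual shape of each ID type.
# These check format only, not whether the model accepts the value.
id_patterns <- c(
  cell_line_ontology_id = "^CVCL_[A-Za-z0-9]+$",
  cell_type_ontology_id = "^CL:[0-9]+$",
  disease_ontology_id   = "^MONDO:[0-9]+$",
  tissue_ontology_id    = "^UBERON:[0-9]+$"
)

# Hypothetical helper: TRUE/FALSE for known keys, NA for keys without a pattern
looks_like_valid_id <- function(key, value) {
  if (!key %in% names(id_patterns)) {
    return(NA)
  }
  grepl(id_patterns[[key]], value)
}

looks_like_valid_id("tissue_ontology_id", "UBERON:0002371")
```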

Query Parameters

In addition to metadata, queries accept several parameters that control the generation process. mode is required; the others are optional.

mode (character, required)

Controls the type of prediction the model generates. This parameter is required in all queries.

Available modes:

  • “sample generation”: The model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled to generate realistic-looking synthetic data that captures the error associated with measurements. This mode is useful when you want data that mimics real experimental measurements.

  • “mean estimation”: The model creates a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error. The mean of that distribution serves as the prediction. This mode is useful when you want a stable estimate of expected expression levels.

Note: Single-cell queries only support “mean estimation” mode. Bulk queries support both modes.

# Bulk query with sample generation (default for bulk)
bulk_query <- get_valid_query(modality = "bulk")
bulk_query$mode <- "sample generation"

# Bulk query with mean estimation
bulk_query_mean <- get_valid_query(modality = "bulk")
bulk_query_mean$mode <- "mean estimation"

# Single-cell query (must use mean estimation)
sc_query <- get_valid_query(modality = "single-cell")
sc_query$mode <- "mean estimation" # Required for single-cell
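
The modality and mode rules stated in the note above can be encoded as a small lookup for local checks. This is a sketch: supported_modes and mode_is_supported() are hypothetical names introduced here, and the package's own validation remains the authority.

```r
# Modes supported by each modality, as stated in this guide
supported_modes <- list(
  "bulk" = c("sample generation", "mean estimation"),
  "single-cell" = c("mean estimation")
)

# Hypothetical helper: TRUE only for a known modality with a supported mode
mode_is_supported <- function(modality, mode) {
  modality %in% names(supported_modes) &&
    mode %in% supported_modes[[modality]]
}

mode_is_supported("single-cell", "sample generation")
```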

total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.

# Create a query and add custom total_count
query <- get_valid_query(modality = "bulk")
query$total_count <- 5000000
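
To illustrate why counts scale proportionally with total_count, here is a small worked example. The exact transform used by the API is not specified in this guide, so the sketch assumes values of the form log(CPM + 1) with the natural logarithm. The function log_cpm_to_counts() and the gene names are hypothetical.

```r
# Assumption: inputs are natural-log(CPM + 1); the real transform may differ.
log_cpm_to_counts <- function(log_cpm, total_count) {
  cpm <- expm1(log_cpm)           # invert log(CPM + 1)
  round(cpm * total_count / 1e6)  # CPM means counts per million reads
}

# Three made-up genes at 10, 250, and 4000 CPM
log_cpm <- log1p(c(geneA = 10, geneB = 250, geneC = 4000))

counts_5m  <- log_cpm_to_counts(log_cpm, total_count = 5e6)
counts_10m <- log_cpm_to_counts(log_cpm, total_count = 1e7)
```

Under this assumption, doubling total_count doubles every count, which is the proportional scaling described above.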

deterministic_latents (logical, optional)

If TRUE, the model uses the mean of each latent distribution (p(z|metadata) or q(z|x)) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

  • Default: FALSE (sampling is enabled)

# Create a query and enable deterministic latents
query <- get_valid_query(modality = "bulk")
query$deterministic_latents <- TRUE
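
The difference between sampling a latent and using its mean can be seen in a toy example. This is purely illustrative and says nothing about the model's actual latent distributions: the normal distribution, its parameters, and draw_latent() are assumptions made for the sketch.

```r
# Toy stand-in for a latent distribution: a normal with fixed mean and sd
latent_mean <- 1.5
latent_sd <- 0.3

# Return the mean when deterministic, otherwise a random draw
draw_latent <- function(deterministic) {
  if (deterministic) {
    latent_mean
  } else {
    rnorm(1, mean = latent_mean, sd = latent_sd)
  }
}

set.seed(1)
sampled <- c(draw_latent(FALSE), draw_latent(FALSE)) # two different draws
fixed   <- c(draw_latent(TRUE), draw_latent(TRUE))   # both equal latent_mean
```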

seed (integer, optional)

Random seed for reproducibility when using stochastic sampling.

# Create a query with a specific seed
query <- get_valid_query(modality = "bulk")
query$seed <- 42

You can combine multiple parameters in a single query:

# Create a query and add multiple parameters
query <- get_valid_query(modality = "bulk")
query$total_count <- 8000000
query$deterministic_latents <- TRUE
query$mode <- "mean estimation"

results <- predict_query(query)

Modifying Query Inputs

You can customize the query inputs to fit your specific research needs:

# Get a base query
query <- get_valid_query()

# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10

# Add a new condition after the existing inputs
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
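
If you add conditions often, the append step can be wrapped in a small function. The sketch below uses a hand-built query so it runs without the API; add_condition() and toy_query are hypothetical names, not part of rsynthbio.

```r
# Hypothetical helper: append one condition to a query's inputs
add_condition <- function(query, metadata, num_samples) {
  stopifnot(is.list(metadata), num_samples >= 1)
  query$inputs[[length(query$inputs) + 1]] <- list(
    metadata = metadata,
    num_samples = num_samples
  )
  query
}

# Minimal hand-built query so the example runs offline
toy_query <- list(modality = "bulk", mode = "mean estimation", inputs = list())

toy_query <- add_condition(
  toy_query,
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
```

Appending at length(query$inputs) + 1 avoids leaving empty slots when the base query has fewer inputs than expected.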

Working with Results

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)

You may want to process the data in chunks or save it for later use:

# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
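
A typical next step is to summarize expression by a metadata column. The sketch below uses a mock result so it runs offline, and it assumes that rows of expression correspond to samples in the same order as rows of metadata; check the orientation of your own result before relying on this. The sample_id column and gene names are made up.

```r
# Mock result with the same two-element shape described above.
# Assumption: expression rows are samples, aligned with metadata rows.
mock_result <- list(
  metadata = data.frame(
    sample_id = c("s1", "s2", "s3", "s4"),
    sex = c("male", "male", "female", "female")
  ),
  expression = data.frame(
    geneA = c(10, 14, 3, 5),
    geneB = c(0, 2, 8, 6)
  )
)

# Mean expression per gene within each metadata group
group_means <- aggregate(
  mock_result$expression,
  by = list(sex = mock_result$metadata$sex),
  FUN = mean
)

group_means
```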

Custom Validation

You can validate your queries before sending them to the API:

# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)
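
In scripts, it can be convenient to turn a failed validation into a FALSE result rather than a hard stop. The sketch below assumes that a validator signals problems with an R error, which this guide does not state explicitly. try_validate() and toy_validator() are hypothetical; the toy validator exists only so the example runs without the package.

```r
# Hypothetical wrapper: run a validator and report success as TRUE/FALSE.
# Assumption: the validator signals problems by raising an error.
try_validate <- function(query, validator) {
  tryCatch(
    {
      validator(query)
      TRUE
    },
    error = function(e) {
      message("Validation failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Stand-in validator used only so this example runs offline
toy_validator <- function(query) {
  if (is.null(query[["mode"]])) {
    stop("query is missing 'mode'")
  }
  invisible(TRUE)
}

ok  <- try_validate(list(mode = "mean estimation"), toy_validator)
bad <- try_validate(list(modality = "bulk"), toy_validator)
```

With rsynthbio loaded, the same wrapper could be given validate_query in place of the toy validator, provided the assumption about error signaling holds.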

Session info

sessionInfo()

Additional Resources