---
title: "Define and Summarize Arms in Clinical Trials"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Define and Summarize Arms in Clinical Trials}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  cache = TRUE,
  cache.path = 'cache/defineArms/',
  comment = '#>',
  dpi = 300,
  out.width = '100%'
)
```

```{r setup, message = FALSE, echo = FALSE}
library(dplyr)
library(simdata)
library(TrialSimulator)
set.seed(12345)
```

In `TrialSimulator`, a trial arm is defined as a collection of endpoints (and potentially other covariates or biomarkers) with a data generation process. This vignette demonstrates how to use the following key functions to define and summarize arms in a simulated clinical trial setting.

- `endpoint`: Creates one or more endpoints
- `add_endpoints`: Add one or more endpoints to an arm
- `generate_data`: Generates a dataset from an `Arms` object (for exploratory purpose)
- `print`: Method that displays a summary report of an `Arms` object summarizing all endpoints in the arm

## Define Arm with Mulitple Sets of Endpoints

The function `endpoint` can be used to define one or multiple endpoints simultaneously. These endpoints can be independent or correlated, depending on the `generator` provided. In the following hypothetical example, we construct a custom generator that simulates PFS, OS, PSA levels at baseline and year 1. A pre-specified correlation matrix ensures the endpoints are appropriately correlated. We also ensure that PFS is always less than or equal to OS. 

```{r garalbaiola}
rng <- function(n, pfs_rate, os_rate, psa_mean, psa_sd, corr_matrix){
  
  dist <- list()
  dist[['PFS']] <- function(x) qexp(x, rate = pfs_rate)
  dist[['OS']] <- function(x) qexp(x, rate = os_rate)
  dist[['PSA_baseline']] <- function(x) qnorm(x, mean = psa_mean, sd = psa_sd)
  dist[['PSA_year1']] <- function(x) qnorm(x, mean = psa_mean - 12, sd = psa_sd)
  dsgn = simdata::simdesign_norta(cor_target_final = corr_matrix, 
                                dist = dist, 
                                transform_initial = data.frame,
                                names_final = names(dist), 
                                seed_initial = 1)
  
  simdata::simulate_data(dsgn, n_obs = n) %>% 
    mutate(PFS = pmin(PFS, OS)) %>% 
    mutate(PFS_event = 1, OS_event = 1)
  
}
```

In this generator, 

- Baseline PSA follows a normal distribution with mean 20 and SD 4.0.
- After one year of treatment, the average PSA level decreases to 8. 
- PFS and OS are exponentially distributed with median times of 2.5 and 4.5 years, respectively. 
- A constraint ensures that PFS does not exceed OS. To define correlated PFS and OS that satisfy this constraint in a more natural way, refer to the vignette of [multi-state model](simulatePfsAndOs.html)
- PSA readout times are 0 (at baseline) and 1 (at year 1). 

The following code defines the endpoints and uses the `print` method to generate a summary report based on 10,000 samples from the generator `rng`. 

```{r adldaieafg, results='asis'}
ep1 <- endpoint(name = c('PSA_baseline', 'PSA_year1', 'OS', 'PFS'), 
                type = c('non-tte', 'non-tte', 'tte', 'tte'), 
                readout = c(PSA_baseline = 0, PSA_year1 = 1), 
                generator = rng, 
                pfs_rate = log(2)/2.5, os_rate = log(2)/4.5, 
                psa_mean = 20, psa_sd = 4, 
                corr_matrix = matrix(c(1, .6, -.5, -.4, 
                                       .6, 1, -.4, -.3, 
                                       -.5, -.4, 1, .7, 
                                       -.4, -.3, .7, 1), nrow = 4))

ep1
```

We can define another set of endpoints using a separate call to `endpoint()`. However, keep in mind that any endpoints defined separately are assumed to be independent of those in prior calls (i.e. `PSA_baseline`, `PSA_year1`, `PFS` and `OS`). 

In the following example, we define a biomarker, even though it is actually not an endpoint. In practice, the function `endpoint` is useful in introducing any variables, including covariates, biomarkers, sub-group indicators, etc. Ideally, a biomarker should be integrated into the generator `rng` to capture meaningful correlation with other endpoints. 

```{r ioagjie, results='asis'}
ep2 <- endpoint(name = 'biomarker', 
                type = 'non-tte', 
                readout = c(biomarker = 0), 
                generator = rbinom, 
                size = 1, prob = .3)
ep2
```

We now create a treatment arm by combining `ep1` and `ep2`. The `print` method automatically summarizes the marginal distributions of all endpoints. As seen, the summary report of the arm simply concatenates the two reports of `ep1` and `ep2`.  

```{r dlaieafj, results='asis'}
trt <- arm(name = 'treated')
trt$add_endpoints(ep1, ep2)
trt
```

## Add Inclusion Criteria for the Arm

We can define inclusion criteria for the arm by passing logical filter expressions via the `...` argument in `arm()`. These filters are applied to the generated trial data. For example, the following code restricts enrollment to patients with 

- Baseline PSA > 10
- Positive PSA values at year 1

The summary report will reflect the effect of these inclusion criteria on the simulated population. 

```{r eiaojda, results='asis'}
trt <- arm(name = 'treated', PSA_baseline > 10 & PSA_year1 > 0)
trt$add_endpoints(ep1, ep2)
trt
```

## Simulate Data Explicitly (Not Recommended)

Although `TrialSimulator` allows direct data generation using the `generate_data()` method, it is generally discouraged. One of the core principles of `TrialSimulator` is to separate trial simulation logic from data generation, allowing the framework to manage data generation and truncation (and/or censoring) dynamically based on trial milestones.

Nevertheless, for inspection or debugging, one can call

```{r eiaojfa}
## not recommended
tmp <- trt$generate_data(100)
head(tmp)
```

This gives a preview of the patient-level data generated by the treatment arm configuration (generator, inclusion filters). However, enrollment schedule and dropout are not taken into account, which is another reason why we strongly discourage this way to use `TrialSimulator`.