Type: | Package |
Title: | Continuous and Dichotomized Index Predictors Based on Distribution Quantiles |
Version: | 0.1.7 |
Date: | 2024-11-14 |
Description: | Select optimal functional regression or dichotomized quantile predictors for survival/logistic/numeric outcome and perform optimistic bias correction for any optimally dichotomized numeric predictor(s), as in Yi et al. (2023) <doi:10.1016/j.labinv.2023.100158>. |
RoxygenNote: | 7.3.2 |
Encoding: | UTF-8 |
License: | GPL-2 |
Depends: | R (≥ 4.4) |
Language: | en-US |
Imports: | matrixStats, methods, mgcv, plotly, rpart, survival |
Suggests: | knitr, boot, htmlwidgets, Qindex.data |
NeedsCompilation: | no |
Packaged: | 2024-11-14 19:55:29 UTC; tingtingzhan |
Author: | Tingting Zhan |
Maintainer: | Tingting Zhan <tingtingzhan@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-11-14 20:10:02 UTC |
Continuous and Dichotomized Index Predictors Based on Distribution Quantiles
Description
Continuous and dichotomized index predictors based on distribution quantiles.
Author(s)
Maintainer: Tingting Zhan tingtingzhan@gmail.com (ORCID) [copyright holder]
Authors:
Misung Yi misung.yi@dankook.ac.kr (ORCID) [copyright holder]
Inna Chervoneva Inna.Chervoneva@jefferson.edu (ORCID) [copyright holder]
References
Selection of optimal quantile protein biomarkers based on cell-level immunohistochemistry data. Misung Yi, Tingting Zhan, Amy P. Peck, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, Hai Hu, Yunguang Sun, Hallgeir Rui and Inna Chervoneva. BMC Bioinformatics, 2023. doi:10.1186/s12859-023-05408-8
Quantile index biomarkers based on single-cell expression data. Misung Yi, Tingting Zhan, Amy P. Peck, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, Hai Hu, Yunguang Sun, Hallgeir Rui and Inna Chervoneva. Laboratory Investigation, 2023. doi:10.1016/j.labinv.2023.100158
Examples
### Data Preparation
library(survival)
data(Ki67, package = 'Qindex.data')
Ki67c = within(Ki67[complete.cases(Ki67), , drop = FALSE], expr = {
marker = log1p(Marker); Marker = NULL
PFS = Surv(RECFREESURV_MO, RECURRENCE)
})
(npt = length(unique(Ki67c$PATIENT_ID))) # 592
### Step 1: Cluster-Specific Sample Quantiles
Ki67q = clusterQp(marker ~ . - tissueID - inner_x - inner_y | PATIENT_ID, data = Ki67c)
stopifnot(is.matrix(Ki67q$marker))
head(Ki67q$marker, n = c(4L, 6L))
set.seed(234); id = sort.int(sample.int(n = npt, size = 480L))
Ki67q_0 = Ki67q[id, , drop = FALSE] # training set
Ki67q_1 = Ki67q[-id, , drop = FALSE] # test set
### Step 2 (after Step 1)
## Step 2a: Linear Sign-Adjusted Quantile Indices
(fr = Qindex(PFS ~ marker, data = Ki67q_0))
stopifnot(all.equal.numeric(c(fr), predict(fr)))
integrandSurface(fr)
integrandSurface(fr, newdata = Ki67q_1)
## Step 2b: Non-Linear Sign-Adjusted Quantile Indices
(nlfr = Qindex(PFS ~ marker, data = Ki67q_0, nonlinear = TRUE))
stopifnot(all.equal.numeric(c(nlfr), predict(nlfr)))
integrandSurface(nlfr)
integrandSurface(nlfr, newdata = Ki67q_1)
## view linear and non-linear sign-adjusted quantile indices together
integrandSurface(fr, nlfr)
### Step 2c: Optimal Dichotomizing
set.seed(14837); (m1 = optimSplit_dichotom(
PFS ~ marker, data = Ki67q_0, nsplit = 20L, top = 2L))
predict(m1)
predict(m1, boolean = FALSE)
predict(m1, newdata = Ki67q_1)
### Step 3 (after Step 1 & 2)
Ki67q_0a = within.data.frame(Ki67q_0, expr = {
FR = std_IQR(fr)
nlFR = std_IQR(nlfr)
optS = std_IQR(marker[,'0.27'])
})
Ki67q_1a = within.data.frame(Ki67q_1, expr = {
FR = std_IQR(predict(fr, newdata = Ki67q_1))
nlFR = std_IQR(predict(nlfr, newdata = Ki67q_1))
optS = std_IQR(marker[,'0.27'])
})
# `optS`: use the best quantile but discard the cutoff identified by [optimSplit_dichotom]
# all models below can also be used on training data `Ki67q_0a`
# naive use
summary(coxph(PFS ~ NodeSt + Tstage + FR, data = Ki67q_1a))
summary(coxph(PFS ~ NodeSt + Tstage + nlFR, data = Ki67q_1a))
summary(coxph(PFS ~ NodeSt + Tstage + optS, data = Ki67q_1a))
# set.seed if necessary
summary(BBC_dichotom(PFS ~ NodeSt + Tstage ~ FR, data = Ki67q_1a))
# `NodeSt`, `Tstage`: predictors to be used as-is
# `FR` to be dichotomized
# set.seed if necessary
summary(BBC_dichotom(PFS ~ NodeSt + Tstage ~ nlFR, data = Ki67q_1a))
# set.seed if necessary
summary(BBC_dichotom(PFS ~ NodeSt + Tstage ~ optS, data = Ki67q_1a)) # statistically rigorous
# Option 1
summary(BBC_dichotom(PFS ~ NodeSt + Tstage ~ FR, data = Ki67q_1a))
# Option 2:
summary(tmp <- BBC_dichotom(PFS ~ NodeSt + Tstage ~ FR, data = Ki67q_0a))
#coxph(PFS ~ NodeSt + Tstage + I(FR > attr(tmp, 'apparent_cutoff')), data = Ki67q_1a)
coxph(PFS ~ NodeSt + Tstage + I(FR > matrixStats::colMedians(BBC_cutoff(tmp))), data = Ki67q_1a)
# Option 1 and 2 are also applicable to `nlFR` and `optS`
Bootstrap Cutoff
Description
Extract the bootstrap cutoffs from the returned value of function BBC_dichotom.
Usage
BBC_cutoff(object)
Arguments
object |
returned value from function BBC_dichotom |
Details
Function BBC_cutoff is documented for the output of BBC_dichotom, but it also works on the output of optimism_dichotom.
Value
Function BBC_cutoff returns a matrix of bootstrap cutoffs.
Bootstrap-based Optimism Correction for Dichotomization
Description
Multivariable regression model with bootstrap-based optimism correction on the dichotomized predictors.
Usage
BBC_dichotom(formula, data, ...)
optimism_dichotom(fom, X, data, R = 100L, ...)
coef_dichotom(fom, X., data)
Arguments
formula |
formula, e.g., |
data |
|
... |
additional parameters, currently not in use |
fom |
formula, e.g., |
X |
numeric matrix of |
R |
positive integer scalar,
number of bootstrap replicates |
X. |
logical matrix |
Details
Function BBC_dichotom obtains a multivariable regression model with bootstrap-based optimism correction on the dichotomized predictors. Specifically,
1. Obtain the dichotomizing rules \mathbf{\mathcal{D}} of predictors x_1,\cdots,x_k based on response y (via m_rpartD). Multivariable regression (with additional predictors z, if any) with dichotomized predictors \left(\tilde{x}_1,\cdots,\tilde{x}_k\right) = \mathcal{D}\left(x_1,\cdots,x_k\right) (via helper function coef_dichotom) is the apparent performance.
2. Obtain the bootstrap-based optimism based on R copies of bootstrap samples (via helper function optimism_dichotom). The median of bootstrap-based optimism over R bootstrap copies is the optimism-correction of the dichotomized predictors \tilde{x}_1,\cdots,\tilde{x}_k.
3. Subtract the optimism-correction (in Step 2) from the apparent performance estimates (in Step 1), only for \tilde{x}_1,\cdots,\tilde{x}_k. The apparent performance estimates for additional predictors z's, if any, are not modified. Neither the variance-covariance (vcov) estimates nor the other regression diagnostics, e.g., residuals, logLikelihood, etc., of the apparent performance are modified for now. This coefficient-only, partially-modified regression model is the optimism-corrected performance.
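The arithmetic of Step 3 can be sketched in base R. The objects `apparent` and `optimism` below are hypothetical stand-ins with made-up numbers, not output of BBC_dichotom; the sketch only illustrates subtracting the column-wise median optimism from the apparent coefficients.

```r
# Hypothetical apparent coefficients for k = 2 dichotomized predictors
apparent <- c(x1 = 0.80, x2 = -0.50)

# Hypothetical R x k matrix of bootstrap-based optimism (R = 5 replicates)
optimism <- cbind(
  x1 = c(0.10, 0.30, 0.20, 0.25, 0.15),
  x2 = c(-0.05, -0.10, -0.20, -0.15, -0.10)
)

# Optimism-correction: column-wise median over the R bootstrap replicates
correction <- apply(optimism, MARGIN = 2L, FUN = median)

# Optimism-corrected coefficients: apparent minus correction
corrected <- apparent - correction
corrected
```

Coefficients of any additional predictors z would be left untouched in this scheme, as described in Step 3.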
Value
Function BBC_dichotom returns a coxph, glm or lm regression model, with attributes,
attr(,'optimism')
the returned object from optimism_dichotom
attr(,'apparent_cutoff')
a double vector, cutoff thresholds for the k predictors in the apparent model
Details on Helper Functions
Bootstrap-Based Optimism
Helper function optimism_dichotom computes the bootstrap-based optimism of the dichotomized predictors. Specifically,
1. R copies of bootstrap samples are generated. In the j-th bootstrap sample,
- obtain the dichotomizing rules \mathbf{\mathcal{D}}^{(j)} of predictors x_1^{(j)},\cdots,x_k^{(j)} based on response y^{(j)} (via m_rpartD);
- multivariable regression (with additional predictors z^{(j)}, if any) coefficient estimates \mathbf{\hat{\beta}}^{(j)} = \left(\hat{\beta}_1^{(j)},\cdots,\hat{\beta}_k^{(j)}\right)^t of the dichotomized predictors \left(\tilde{x}_1^{(j)},\cdots,\tilde{x}_k^{(j)}\right) = \mathcal{D}^{(j)}\left(x_1^{(j)},\cdots,x_k^{(j)}\right) (via coef_dichotom) are the bootstrap performance estimate.
2. Dichotomize x_1,\cdots,x_k in the entire data using each of the bootstrap rules \mathcal{D}^{(1)},\cdots,\mathcal{D}^{(R)}. Multivariable regression (with additional predictors z, if any) coefficient estimates \mathbf{\hat{\beta}}^{[j]} = \left(\hat{\beta}_1^{[j]},\cdots,\hat{\beta}_k^{[j]}\right)^t of the dichotomized predictors \left(\tilde{x}_1^{[j]},\cdots,\tilde{x}_k^{[j]}\right) = \mathcal{D}^{(j)}\left(x_1,\cdots,x_k\right) (via coef_dichotom) are the test performance estimate.
3. The difference between the bootstrap and test performance estimates, an R\times k matrix of \left(\mathbf{\hat{\beta}}^{(1)},\cdots,\mathbf{\hat{\beta}}^{(R)}\right) minus another R\times k matrix of \left(\mathbf{\hat{\beta}}^{[1]},\cdots,\mathbf{\hat{\beta}}^{[R]}\right), is the bootstrap-based optimism.
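The bootstrap loop described above can be sketched for a single predictor and a linear model in base R. This is a sketch under stated assumptions, not the package's implementation: `make_rule` is a hypothetical stand-in for m_rpartD (it picks a cutoff from a small quantile grid instead of using rpart), and `coef_dich` is a hypothetical stand-in for coef_dichotom.

```r
set.seed(1)
n <- 200L
x <- rnorm(n)
y <- 0.5 * (x > 0.3) + rnorm(n)

# Hypothetical stand-in for m_rpartD(): choose, from a grid of sample
# quantiles, the cutoff with the largest absolute difference in mean response
make_rule <- function(y, x) {
  cand <- quantile(x, probs = seq(0.2, 0.8, by = 0.1), names = FALSE)
  gap <- vapply(cand, function(a) abs(mean(y[x > a]) - mean(y[x <= a])), 0)
  a <- cand[which.max(gap)]
  function(newx) newx > a
}

# Hypothetical stand-in for coef_dichotom(): coefficient of the
# dichotomized predictor in a linear model
coef_dich <- function(y, xd) unname(coef(lm(y ~ xd))[2L])

R <- 50L
optimism <- vapply(seq_len(R), function(j) {
  id <- sample.int(n, replace = TRUE)            # j-th bootstrap sample
  rule_j <- make_rule(y[id], x[id])              # rule D^(j) from the bootstrap sample
  beta_boot <- coef_dich(y[id], rule_j(x[id]))   # bootstrap performance estimate
  beta_test <- coef_dich(y, rule_j(x))           # test performance on the entire data
  beta_boot - beta_test                          # optimism of replicate j
}, FUN.VALUE = 0)

median(optimism)  # optimism-correction for this single predictor
```

With k predictors, the same loop would return one row of length k per replicate, giving the R\times k matrix described in Step 3.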
Multivariable Regression Coefficient Estimates of Dichotomized Predictors \tilde{x}'s
Helper function coef_dichotom fits a multivariable Cox proportional hazards (coxph) model for Surv response, logistic (glm) regression model for logical response, or linear (lm) regression model for gaussian response, with the dichotomized predictors \tilde{x}_1,\cdots,\tilde{x}_k as well as the additional predictors z's.
It is almost inevitable to have duplicates among the dichotomized predictors \tilde{x}_1,\cdots,\tilde{x}_k. In such a case, the multivariable model is fitted using the unique \tilde{x}'s.
Returns of Helper Functions
Of helper function optimism_dichotom
Helper function optimism_dichotom returns an R\times k double matrix of bootstrap-based optimism, with attributes
attr(,'cutoff')
an R\times k double matrix, the R copies of bootstrap cutoff thresholds for the k predictors. See attribute 'cutoff' of function m_rpartD
Of helper function coef_dichotom
Helper function coef_dichotom returns a double vector of the regression coefficients of dichotomized predictors \tilde{x}'s, with attributes
In the case of duplicated \tilde{x}'s, the regression coefficients of the unique \tilde{x}'s are duplicated for those duplicates in \tilde{x}'s.
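The handling of duplicated dichotomized predictors can be illustrated with a small base-R sketch. The matrix `X.` and the linear model below are hypothetical; the sketch fits the model on the unique columns and then copies each coefficient back to its duplicates, which is the behavior described above, not necessarily how coef_dichotom is coded.

```r
set.seed(2)
n <- 100L
y <- rnorm(n)

# Hypothetical logical matrix of dichotomized predictors;
# column 'c' is an exact duplicate of column 'a'
X. <- cbind(a = rnorm(n) > 0, b = rnorm(n) > 0)
X. <- cbind(X., c = X.[, 'a'])

# For each column, the index of its first identical column
sig <- apply(X., MARGIN = 2L, FUN = paste, collapse = '')
first <- match(sig, sig)
u <- unique(first)

# Fit the model using the unique columns only (as 0/1 numeric)
Xu <- X.[, u, drop = FALSE] + 0
b <- unname(coef(lm(y ~ Xu))[-1L])

# Duplicate the coefficients for the duplicated columns
cf <- b[match(first, u)]
names(cf) <- colnames(X.)
cf
```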
References
For helper function optimism_dichotom
Ewout W. Steyerberg (2009) Clinical Prediction Models. doi:10.1007/978-0-387-77244-8
Frank E. Harrell Jr., Kerry L. Lee, Daniel B. Mark. (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
Examples
library(survival)
data(flchain, package = 'survival') # see more details from ?survival::flchain
head(flchain2 <- within.data.frame(flchain, expr = {
mgus = as.logical(mgus)
}))
dim(flchain3 <- subset(flchain2, futime > 0)) # required by ?rpart::rpart
dim(flchain_Circulatory <- subset(flchain3, chapter == 'Circulatory'))
m1 = BBC_dichotom(Surv(futime, death) ~ age + sex + mgus ~ kappa + lambda,
data = flchain_Circulatory, R = 1e2L)
summary(m1)
matrixStats::colMedians(BBC_cutoff(m1)) # median bootstrap cutoff
attr(m1, 'apparent_cutoff')
Sign-Adjusted Quantile Indices
Description
Sign-adjusted quantile indices based on linear and/or nonlinear functional predictors.
Usage
Qindex(formula, data, sign_prob = 0.5, ...)
Qindex_prefit_(formula, data, family, nonlinear = FALSE, ...)
Arguments
formula |
formula, e.g., |
data |
data.frame, must be a returned object from function clusterQp |
sign_prob |
double scalar between 0 and 1,
user-specified probability |
... |
additional parameters for functions s and ti,
most importantly |
family |
|
nonlinear |
logical scalar,
whether to use nonlinear or linear functional model.
Default |
Value
Function Qindex returns a Qindex object, which is an instance of an S4 class. See section Slots for details.
Slots
.Data
double vector, sign-adjusted quantile indices, see section Details of function integrandSurface
formula
see section Arguments, parameter formula
gam
a gam object
gpf
a 'gam.prefit' object, which is the returned object from function gam with argument fit = FALSE
p.value
numeric scalar, p-value for the test of significance of the functional predictor, based on slot @gam
sign
double scalar of either 1 or -1, sign-adjustment, see section Details of function integrandSurface
sign_prob
double scalar, see section Arguments, parameter sign_prob
Examples
# see ?`Qindex-package`
Visualize Qindex object using R package graphics
Description
Create perspective and contour plots of FR-index integrand using R package graphics.
End users are encouraged to use function integrandSurface instead, which relies on package plotly.
Usage
## S3 method for class 'Qindex'
persp(
x,
n = 31L,
xlab = "Percentages",
ylab = "Quantiles",
zlab = "Integrand of FR-index",
...
)
## S3 method for class 'Qindex'
contour(
x,
n = 501L,
image_col = topo.colors(20L),
xlab = "Percentages",
ylab = "Quantiles",
...
)
Arguments
x |
Qindex object |
n |
integer scalar, fineness of visualization,
default |
xlab , ylab |
character scalars |
zlab |
character scalar, for function persp.Qindex |
... |
.. |
image_col |
argument |
Value
Function persp.Qindex, a method dispatch of S3 generic persp, does not have a return value.
Function contour.Qindex, a method dispatch of S3 generic contour, does not have a return value.
Bootstrap Indices
Description
Generate a series of bootstrap indices.
Usage
bootid(n, R)
Arguments
n |
positive integer scalar, sample size |
R |
positive integer scalar, number of bootstrap replicates |
Details
Function bootid generates the same bootstrap indices as those generated from the default options of function boot (i.e., sim = 'ordinary' and m = 0).
Value
Function bootid returns a length-R list of positive integer vectors. Each element is the length-n indices of each bootstrap sample.
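The shape of this return value can be mimicked with a minimal base-R sketch. The helper `boot_indices` is hypothetical and is not claimed to reproduce the random number stream of bootid or of function boot; it only shows a length-R list of length-n index vectors drawn with replacement.

```r
# Hypothetical helper; not the package's bootid()
boot_indices <- function(n, R) {
  replicate(R, sample.int(n, size = n, replace = TRUE), simplify = FALSE)
}

set.seed(1345)
bt <- boot_indices(n = 10L, R = 3L)
lengths(bt)  # each element holds n indices
```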
See Also
Function bootid is inspired by functions boot:::index.array and boot:::ordinary.array.
Examples
set.seed(1345); (bt1 = boot::boot(data = 1:10, statistic = function(data, ind) ind, R = 3L)[['t']])
set.seed(1345); (bt2 = do.call(rbind, bootid(10L, R = 3L)))
stopifnot(identical(bt1, bt2))
Cluster-Specific Sample Quantiles
Description
Sample quantiles in each cluster of observations.
Usage
clusterQp(
formula,
data,
f_sum_ = mean.default,
probs = seq.int(from = 0.01, to = 0.99, by = 0.01),
...
)
Arguments
formula |
formula,
including response
|
data |
|
f_sum_ |
function to summarize the sample quantiles from
lower-level cluster |
probs |
double vector,
probabilities |
... |
additional parameters of function quantile |
Value
Function clusterQp returns an aggregated data.frame, in which
- the highest cluster c_1 and cluster-specific covariate(s) x's are retained.
  - If the input formula takes form of y ~ . | c1 or y ~ . - z1 | c1, then all covariates (except for z_1) are considered cluster-specific;
  - Sample quantiles from lower-level clusters (e.g., c_2) are point-wise summarized using function f_sum_.
- response y is removed; instead, a double matrix of N columns stores the cluster-specific sample quantiles. This matrix
  - is named after the parsed expression of response y in formula;
  - colnames are the probabilities \mathbf{p}, for the ease of subsequent programming.
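The layout of the quantile matrix can be illustrated with base R only. The data frame `d` below is hypothetical, and the sketch covers just one level of clustering (no lower-level clusters and no f_sum_ summary); it is not the implementation of clusterQp.

```r
set.seed(3)
# Hypothetical cell-level data: 3 patients, 50 cells each
d <- data.frame(
  id = rep(c('p1', 'p2', 'p3'), each = 50L),
  marker = rexp(150L)
)
probs <- c(0.25, 0.5, 0.75)

# One row of sample quantiles per cluster (patient)
Q <- t(vapply(split(d$marker, d$id), quantile, probs = probs, names = FALSE,
              FUN.VALUE = numeric(length(probs))))
colnames(Q) <- probs  # probabilities as column names

dim(Q)
Q[, '0.5']  # cluster-specific sample medians
```

Using the probabilities as column names is what allows selections such as marker[,'0.27'] in the package-level example.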
Examples
# see ?`Qindex-package` for examples
Back Compatibility
Description
Functions that have been .Defunct.
Usage
FRindex(...)
## S3 method for class 'FRindex'
predict(...)
optim_splitSample_dichotom(...)
Arguments
... |
parameters that have been .Defunct. |
Integrand Surface(s) of Sign-Adjusted Quantile Indices Qindex
Description
An interactive htmlwidgets of the perspective plot for Qindex model(s) using package plotly.
Usage
integrandSurface(
...,
newdata = data,
proj_Q_p = TRUE,
proj_S_p = TRUE,
proj_beta = TRUE,
n = 501L,
newid = seq_len(min(50L, .row_names_info(newdata, type = 2L))),
qlim = range(X, newX),
axis_col = c("dodgerblue", "deeppink", "darkolivegreen"),
beta_col = "purple",
surface_col = c("white", "lightgreen")
)
Arguments
... |
one or more Qindex models based on the same training set. |
newdata |
data.frame, with at least
the response |
proj_Q_p |
logical scalar, whether to show
the projection of |
proj_S_p |
logical scalar, whether to show
the projection of |
proj_beta |
logical scalar, whether to show
|
n |
integer scalar, fineness of visualization,
default |
newid |
integer scalar or vector,
row indices of |
qlim |
length-2 double vector,
range on |
axis_col |
|
beta_col |
character scalar, color
of |
surface_col |
length-2 character vector, color of the integrand surface(s), for lowest and highest surface values |
Value
Function integrandSurface returns an htmlwidgets object, created by R package plotly, to showcase the perspective plot of the estimated sign-adjusted integrand surface \hat{S}(p,q).
If a set of training/test subjects is selected (via parameter newid), then
- the estimated sign-adjusted line integrand curve \hat{S}\big(p, Q_i(p)\big) of subject i is displayed on the surface \hat{S}(p,q);
- the quantile curve Q_i(p) is projected on the (p,q)-plane of the 3-dimensional (p,q,s) cube, if proj_Q_p=TRUE (default);
- the user-specified \tilde{p} is marked on the (p,q)-plane of the 3D cube, if proj_Q_p=TRUE (default);
- \hat{S}\big(p, Q_i(p)\big) is projected on the (p,s)-plane of the 3-dimensional (p,q,s) cube, if one and only one Qindex model is provided in input argument ... and proj_S_p=TRUE (default);
- the estimated linear functional coefficient \hat{\beta}(p) is shown on the (p,s)-plane of the 3D cube, if one and only one linear Qindex model is provided in input argument ... and proj_beta=TRUE (default).
Integrand Surface
The quantile index (QI),
\text{QI}=\displaystyle\int_0^1\beta(p)\cdot Q(p)\,dp
with a linear functional coefficient \beta(p) can be estimated by fitting a functional generalized linear model (FGLM, James, 2002) to exponential-family outcomes, or by fitting a linear functional Cox model (LFCM, Gellar et al., 2015) to survival outcomes.
More flexible non-linear quantile index (nlQI)
\text{nlQI}=\displaystyle\int_0^1 F\big(p, Q(p)\big)\,dp
with a bivariate twice differentiable function F(\cdot,\cdot) can be estimated by fitting a functional generalized additive model (FGAM, McLean et al., 2014) to exponential-family outcomes, or by fitting an additive functional Cox model (AFCM, Cui et al., 2021) to survival outcomes.
The estimated integrand surface of quantile indices and non-linear quantile indices, defined on p\in[0,1] and q\in\text{range}\big(Q_i(p)\big) for all training subjects i=1,\cdots,n, is
\hat{S}_0(p,q) =
\begin{cases}
\hat{\beta}(p)\cdot q & \text{for QI}\\
\hat{F}(p,q) & \text{for nlQI}
\end{cases}
Sign-Adjustment
Ideally, we would wish that, in the training set, the estimated linear and/or non-linear quantile indices
\widehat{\text{QI}}_i = \displaystyle\int_0^1 \hat{S}_0\big(p, Q_i(p)\big)\,dp
be positively correlated with a more intuitive quantity, e.g., quantiles Q_i(\tilde{p}) at a user-specified \tilde{p}, for the interpretation of downstream analysis.
Therefore, we define the sign-adjustment term
\hat{c} = \text{sign}\left(\text{corr}\left(Q_i(\tilde{p}), \widehat{\text{QI}}_i\right)\right),\quad i =1,\cdots,n
as the sign of the correlation between the estimated quantile index \widehat{\text{QI}}_i and the quantile Q_i(\tilde{p}), for training subjects i=1,\cdots,n.
The estimated sign-adjusted integrand surface is \hat{S}(p,q) = \hat{c} \cdot \hat{S}_0(p,q).
The estimated sign-adjusted quantile indices \int_0^1 \hat{S}\big(p, Q_i(p)\big)\,dp are positively correlated with subject-specific sample medians (default \tilde{p} = .5) in the training set.
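The integral and the sign-adjustment can be approximated numerically on a grid of probabilities. In the sketch below, the quantile curves `Q` and the coefficient function `beta_hat` are hypothetical (no model is fitted), and the integral is replaced by a Riemann sum; it is meant only to make the definitions of \widehat{\text{QI}}_i and \hat{c} concrete for the linear case.

```r
set.seed(4)
p <- seq.int(from = 0.01, to = 0.99, by = 0.01)  # grid of probabilities
n <- 30L

# Hypothetical subject-specific quantile curves Q_i(p), one row per subject
Q <- t(vapply(seq_len(n), function(i) {
  qnorm(p, mean = rnorm(1L), sd = 1)
}, FUN.VALUE = numeric(length(p))))

# Hypothetical estimated linear functional coefficient beta-hat(p)
beta_hat <- -2 * p

# Riemann-sum approximation of the integral of beta-hat(p) * Q_i(p) over p
dp <- 0.01
QI0 <- drop(Q %*% beta_hat) * dp

# Sign-adjustment: sign of the correlation with the quantile at p-tilde = 0.5
i_med <- which.min(abs(p - 0.5))
c_hat <- sign(cor(Q[, i_med], QI0))
QI <- c_hat * QI0

c_hat
cor(Q[, i_med], QI) > 0  # adjusted indices correlate positively with medians
```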
Note
The maintainer is not aware of any functionality of projection of arbitrary curves in package plotly. Currently, the projection to the (p,q)-plane is hard coded on the (p,q,s=\text{min}(s))-plane.
References
James, G. M. (2002). Generalized Linear Models with Functional Predictors, doi:10.1111/1467-9868.00342
Gellar, J. E., et al. (2015). Cox regression models with functional covariates for survival data, doi:10.1177/1471082X14565526
McLean, M. W., et al. (2014). Functional Generalized Additive Models, doi:10.1080/10618600.2012.729985
Cui, E., et al. (2021). Additive Functional Cox Model, doi:10.1080/10618600.2020.1853550
Examples
# see ?`Qindex-package`
Optimal Dichotomizing Predictors via Repeated Sample Splits
Description
To identify the optimal dichotomizing predictors using repeated sample splits.
Usage
optimSplit_dichotom(
formula,
data,
include = quote(p1 > 0.15 & p1 < 0.85),
top = 1L,
nsplit,
...
)
split_dichotom(y, x, id, ...)
splits_dichotom(y, x, ids = rSplit(y, ...), ...)
## S3 method for class 'splits_dichotom'
quantile(x, probs = 0.5, ...)
Arguments
formula , y , x |
formula, e.g., |
data |
|
include |
(optional) language, inclusion criteria.
Default |
top |
positive integer scalar, number of optimal dichotomizing predictors, default |
nsplit , ... |
additional parameters for function rSplit |
id |
logical vector for helper function split_dichotom, indices of training ( |
ids |
(optional) list of logical vectors for helper function splits_dichotom, multiple copies of indices of repeated training-test sample splits. |
probs |
double scalar for helper function quantile.splits_dichotom, see quantile |
Details
Function optimSplit_dichotom identifies the optimal dichotomizing predictors via repeated sample splits. Specifically,
1. Generate multiple, i.e., repeated, training-test sample splits (via rSplit).
2. For each candidate predictor x_i, find the median-split-dichotomized regression model based on the repeated sample splits, see details in section Details on Helper Functions.
3. Limit the selection of the candidate predictors x's to a user-desired range of p_1 of the split-dichotomized regression models, see explanations of p_1 in section Returns of Helper Functions.
4. Rank the candidate predictors x's by the decreasing order of the absolute values of the regression coefficient estimate of the median-split-dichotomized regression models. On the top of this rank are the optimal dichotomizing predictors.
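Steps 3 and 4 reduce to a filter followed by a sort. A minimal sketch with a hypothetical summary table `cand` (made-up p_1 values and coefficients, one row per candidate predictor), using the default inclusion criterion from the Usage section:

```r
# Hypothetical summary of the median-split-dichotomized models
cand <- data.frame(
  predictor = c('0.10', '0.27', '0.50', '0.90'),
  p1 = c(0.10, 0.40, 0.55, 0.92),
  coef = c(1.20, -0.90, 0.60, 1.50)
)

# Step 3: inclusion criterion on p1
keep <- subset(cand, p1 > 0.15 & p1 < 0.85)

# Step 4: rank by decreasing absolute value of the coefficient estimate
ranked <- keep[order(abs(keep$coef), decreasing = TRUE), ]

top <- 1L
head(ranked$predictor, n = top)
```

Candidates with large coefficients but extreme p_1 (here '0.10' and '0.90') are excluded before ranking.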
Value
Function optimSplit_dichotom returns an object of class 'optimSplit_dichotom', which is a list of dichotomizing functions, with the input formula and data as additional attributes.
Details on Helper Functions
Split-Dichotomized Regression Model
Helper function split_dichotom fits a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by a recursive partitioning of the training set. Specifically, given a training-test sample split,
1. find the dichotomizing rule \mathcal{D} of the predictor x_0 given the response y_0 in the training set (via rpartD);
2. fit a univariable regression model of the response y_1 with the dichotomized predictor \mathcal{D}(x_1) in the test set.
Currently the Cox proportional hazards (coxph) regression for Surv response, logistic (glm) regression for logical response and linear (lm) regression for gaussian response are supported.
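The train-then-test logic can be sketched in base R for a gaussian response. Note the loud assumption: the rule below cuts at the training-set median as a hypothetical stand-in for rpartD, so it illustrates the workflow (rule from training set, regression in test set), not the recursive-partitioning rule itself.

```r
set.seed(5)
n <- 200L
x <- rnorm(n)
y <- 1 * (x > 0) + rnorm(n, sd = 0.5)
id <- rep(c(TRUE, FALSE), times = c(160L, 40L))  # TRUE: training; FALSE: test

# Hypothetical stand-in for rpartD(): cut at the training-set median
a <- median(x[id])
rule <- function(newx) newx > a

# Univariable regression in the test set with the dichotomized predictor
x1d <- rule(x[!id])
fit <- lm(y[!id] ~ x1d)

p1 <- mean(x1d)                # proportion of test subjects with rule(x) == TRUE
cf <- unname(coef(fit)[2L])    # coefficient of the dichotomized predictor
c(p1 = p1, coef = cf)
```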
Split-Dichotomized Regression Models based on Repeated Training-Test Sample Splits
Helper function splits_dichotom fits multiple split-dichotomized regression models split_dichotom on the response y and predictor x, based on each copy of the repeated training-test sample splits.
Quantile of Split-Dichotomized Regression Models
Helper function quantile.splits_dichotom is a method dispatch of the S3 generic function quantile on splits_dichotom object. Specifically,
1. collect the univariable regression coefficient estimate from each one of the split-dichotomized regression models;
2. find the nearest-even (i.e., type = 3) quantile of the coefficients from Step 1. By default, we use the median (i.e., probs = .5);
3. the split-dichotomized regression model corresponding to the selected coefficient quantile in Step 2 is returned.
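The reason a discontinuous quantile type is useful here is that it always returns one of the observed values, so the selected quantile can be traced back to a specific model. A small sketch with hypothetical coefficient estimates:

```r
# Hypothetical coefficient estimates from 6 split-dichotomized models
cf <- c(0.42, 0.10, 0.35, 0.58, 0.27, 0.49)

# Nearest-even order statistic; the result is one of the elements of cf
q <- quantile(cf, probs = 0.5, type = 3, names = FALSE)

# Position of the model whose coefficient equals the selected quantile
which(cf == q)
```

With an interpolating type (e.g., the default type = 7), the median of an even number of coefficients would generally not match any single model.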
Returns of Helper Functions
Helper function split_dichotom returns a split-dichotomized regression model, which is either a Cox proportional hazards (coxph), a logistic (glm), or a linear (lm) regression model, with additional attributes
attr(,'rule')
function, dichotomizing rule \mathcal{D} based on the training set
attr(,'text')
character scalar, human-friendly description of \mathcal{D}
attr(,'p1')
double scalar, p_1 = \text{Pr}(\mathcal{D}(x_1)=1)
attr(,'coef')
double scalar, univariable regression coefficient estimate of y_1\sim\mathcal{D}(x_1)
Helper function splits_dichotom returns a list of split-dichotomized regression models (split_dichotom).
Helper function quantile.splits_dichotom returns a split-dichotomized regression model (split_dichotom).
Examples
# see ?`Qindex-package`
Predicted Sign-Adjusted Quantile Indices
Description
To predict sign-adjusted quantile indices of a test set.
Usage
## S3 method for class 'Qindex'
predict(object, newdata = object@gam$data, ...)
Arguments
object |
a Qindex object based on the training set. |
newdata |
test data.frame, with at least
the response |
... |
additional parameters, currently not in use. |
Details
Function predict.Qindex computes the predicted sign-adjusted quantile indices on the test set, which is the product of function predict.gam return and the correlation sign based on training set (object@sign, see Step 3 of section Details of function Qindex). Multiplication by object@sign is required to ensure that the predicted sign-adjusted quantile indices are positively associated with the training functional predictor values at the selected tabulating grid.
Value
Function predict.Qindex returns a double vector, which is the predicted sign-adjusted quantile indices on the test set.
Regression Models with Optimal Dichotomizing Predictors
Description
Regression models with optimal dichotomizing predictor(s), used either as boolean or continuous predictor(s).
Usage
## S3 method for class 'optimSplit_dichotom'
predict(
object,
formula = attr(object, which = "formula", exact = TRUE),
newdata = attr(object, which = "data", exact = TRUE),
boolean = TRUE,
...
)
Arguments
object |
an optimSplit_dichotom object |
formula |
(optional) formula to specify the response in test data. If missing, the model formula of training data is used |
newdata |
(optional) test data.frame, candidate numeric predictors |
boolean |
logical scalar, whether to use the dichotomized predictor (default, |
... |
additional parameters, currently not in use |
Value
Function predict.optimSplit_dichotom returns a list of regression models, coxph model for Surv response, glm for logical response, and lm model for numeric response.
Examples
# see ?`Qindex-package`
Stratified Random Split Sampling
Description
Random split sampling, stratified based on the type of the response.
Usage
rSplit(y, nsplit, stratify = TRUE, s_ratio = 0.8, ...)
Arguments
y |
a double vector,
a logical vector,
a factor,
or a Surv object,
response |
nsplit |
positive integer scalar, number of replicates of random splits to be performed |
stratify |
logical scalar,
whether stratification based on response |
s_ratio |
double scalar between 0 and 1,
split ratio, i.e., percentage of training subjects |
... |
additional parameters, currently not in use |
Details
Function rSplit performs random split sampling, with or without stratification. Specifically,
- If stratify = FALSE, or if we have a double response y, then split the sample into a training and a test set by odds p/(1-p), without stratification.
- Otherwise, split a Surv response y, stratified by its censoring status. Specifically, split subjects with observed event into a training and a test set by odds p/(1-p), and split the censored subjects into a training and a test set by odds p/(1-p). Then combine the training sets from subjects with observed events and censored subjects, and combine the test sets from subjects with observed events and censored subjects.
- Otherwise, split a logical response y, stratified by itself. Specifically, split the subjects with TRUE response into a training and a test set by odds p/(1-p), and split the subjects with FALSE response into a training and a test set by odds p/(1-p). Then combine the training sets, and the test sets, in a similar fashion as described above.
- Otherwise, split a factor response y, stratified by its levels. Specifically, split the subjects in each level of y into a training and a test set by odds p/(1-p). Then combine the training sets, and the test sets, from all levels of y.
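The stratified case for a logical response can be sketched in base R. The helper `split_logical` is hypothetical and is not the package's rSplit; in particular, taking p to be s_ratio and rounding the training size within each stratum are assumptions made for illustration.

```r
# Hypothetical helper; not the package's rSplit()
split_logical <- function(y, s_ratio = 0.8) {
  train <- logical(length(y))
  for (lev in c(TRUE, FALSE)) {
    idx <- which(y == lev)                    # subjects in this stratum
    n_train <- round(length(idx) * s_ratio)   # rounding rule is an assumption
    train[idx[sample.int(length(idx), size = n_train)]] <- TRUE
  }
  train  # TRUE: training subject; FALSE: test subject
}

set.seed(6)
y <- rep(c(TRUE, FALSE), times = c(20L, 30L))
id <- split_logical(y)
table(response = y, training = id)
```

Each stratum contributes the same proportion of training subjects, so the response distribution is preserved in both the training and the test set.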
Value
Function rSplit returns a length-nsplit list of logical vectors. In each logical vector, the TRUE elements indicate training subjects and the FALSE elements indicate test subjects.
Note
caTools::sample.split is not what we need.
See Also
split, caret::createDataPartition
Examples
rSplit(y = rep(c(TRUE, FALSE), times = c(20, 30)), nsplit = 3L)
Dichotomize via Recursive Partitioning
Description
Dichotomize one or more predictors of a Surv, a logical, or a double response, using recursive partitioning and regression tree rpart.
Usage
rpartD(
y,
x,
check_degeneracy = TRUE,
cp = .Machine$double.eps,
maxdepth = 2L,
...
)
m_rpartD(y, X, check_degeneracy = TRUE, ...)
Arguments
y |
a Surv object,
a logical vector,
or a double vector, the response |
x |
|
check_degeneracy |
logical scalar, whether to allow the
dichotomized value to be all- |
cp |
double scalar, complexity parameter, see rpart.control.
Default |
maxdepth |
positive integer scalar, maximum depth of any node, see rpart.control.
Default |
... |
additional parameters of rpart and/or rpart.control |
X |
numeric matrix,
a set of predictors.
Each column of |
Details
Dichotomize Single Predictor
Function rpartD dichotomizes one predictor in the following steps,
1. Recursive partitioning and regression tree rpart analysis is performed for the response y and the predictor x.
2. The labels.rpart of the first node of the rpart tree is considered as the dichotomizing rule of the double predictor x. The term dichotomizing rule indicates the combination of an inequality sign (>, >=, < and <=) and a double cutoff threshold a.
3. The dichotomizing rule from Step 2 is further processed, such that
- < a is regarded as \geq a
- \leq a is regarded as > a
- > a and \geq a are regarded as is.
This step is necessary for a narrative of greater than or greater than or equal to the threshold a.
4. A warning message is produced, if the dichotomizing rule, applied to a new double predictor newx, creates an all-TRUE or all-FALSE result. We do not make the algorithm stop, as most regression models in R are capable of handling an all-TRUE or all-FALSE predictor, by returning a NA_real_ regression coefficient estimate.
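The idea behind Steps 1 to 3 can be sketched with package rpart directly. This is not the code of rpartD: instead of parsing labels.rpart, the sketch reads the cutoff of the root node's primary split from the `splits` component of the fitted tree and expresses the rule as "greater than or equal to the threshold". The simulated data are hypothetical.

```r
library(rpart)

set.seed(7)
x <- rnorm(200L)
y <- 2 * (x > 0) + rnorm(200L, sd = 0.1)

# Step 1: recursive partitioning of the response on a single predictor
fit <- rpart(y ~ x, cp = .Machine$double.eps, maxdepth = 2L)

# Step 2 (simplified): cutoff of the primary split at the root node
a <- unname(fit$splits[1L, 'index'])

# Step 3: rule expressed as greater than or equal to the threshold
rule <- function(newx) newx >= a

a
table(rule = rule(x), truth = x > 0)
```

Applying `rule` to a new vector that lies entirely on one side of `a` would give an all-TRUE or all-FALSE result, which is the situation Step 4 warns about.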
Dichotomize Multiple Predictors
Function m_rpartD dichotomizes each predictor X[,i] based on the response y using function rpartD.
Applying the multiple dichotomizing rules to a new set of predictors newX,
- A warning message is produced, if at least one of the dichotomized predictors is all-TRUE or all-FALSE.
- We do not check if more than one of the dichotomized predictors are identical to each other. We take care of this situation in helper function coef_dichotom.
Value
Dichotomize Single Predictor
Function rpartD returns a function, with a double vector parameter newx. The returned value of rpartD(y,x)(newx) is a logical vector with attributes
attr(,'cutoff')
double scalar, the cutoff value for newx
Dichotomize Multiple Predictors
Function m_rpartD returns a function, with a double matrix parameter newX. The argument for newX must have the same number of columns and the same column names as the input matrix X. The returned value of m_rpartD(y,X)(newX) is a logical matrix with attributes
Note
In the future, integer and factor predictors will be supported.
Examples
## Dichotomize Single Predictor
data(cu.summary, package = 'rpart') # see more details from ?rpart::cu.summary
with(cu.summary, rpartD(y = Price, x = Mileage, check_degeneracy = FALSE))
(foo = with(cu.summary, rpartD(y = Price, x = Mileage)))
foo(rnorm(10, mean = 24.5))
## Dichotomize Multiple Predictors
library(survival)
data(stagec, package = 'rpart') # see more details from ?rpart::stagec
nrow(stagec) # 146
(foo = with(stagec[1:100,], m_rpartD(y = Surv(pgtime, pgstat), X = cbind(age, g2, gleason))))
foo(as.matrix(stagec[-(1:100), c('age', 'g2', 'gleason')]))
Show Qindex Object
Description
Show Qindex object.
Usage
## S4 method for signature 'Qindex'
show(object)
Arguments
object |
a Qindex object |
Value
The S4 show method of Qindex object does not have a returned value.
Alternative Standardization Methods
Description
Alternative standardization using median, IQR and mad.
Usage
std_IQR(x, na.rm = TRUE, ...)
std_mad(x, na.rm = TRUE, ...)
Arguments
x |
|
na.rm |
logical scalar,
see functions quantile, median and mad.
Default |
... |
Value
Standardize using median and IQR
Function std_IQR returns a numeric vector of the same length as x.
Standardize using median and mad
Function std_mad returns a numeric vector of the same length as x.
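One plausible reading of these two standardizations is centering by the median and scaling by the IQR or by the mad. The functions below are hypothetical re-implementations under that assumption; the package's std_IQR and std_mad may differ in detail.

```r
# Hypothetical sketches; not the package's std_IQR() and std_mad()
std_iqr_sketch <- function(x, na.rm = TRUE) {
  (x - median(x, na.rm = na.rm)) / IQR(x, na.rm = na.rm)
}
std_mad_sketch <- function(x, na.rm = TRUE) {
  (x - median(x, na.rm = na.rm)) / mad(x, na.rm = na.rm)
}

set.seed(8)
x <- rnorm(20L)
z <- std_iqr_sketch(x)
c(median = median(z), IQR = IQR(z))  # centered at 0, unit IQR
```

Compared with the usual mean and standard deviation, both variants are less sensitive to outliers in x.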
Examples
std_IQR(rnorm(20))
std_mad(rnorm(20))