| Type: | Package | 
| Title: | Hierarchical GxE Interactions in a Regularized Regression Model | 
| Version: | 1.0.2 | 
| Date: | 2021-11-28 | 
| Author: | Natalia Zemlianskaia | 
| Maintainer: | Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com> | 
| Description: | The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <doi:10.48550/arXiv.2103.13510>. | 
| License: | MIT + file LICENSE | 
| Imports: | Rcpp (≥ 1.0.3), Matrix, bigmemory, methods | 
| Depends: | dplyr, R (≥ 3.5) | 
| Suggests: | glmnet, testthat, knitr, rmarkdown, ggplot2 | 
| LinkingTo: | Rcpp, RcppEigen, RcppThread, BH, bigmemory | 
| VignetteBuilder: | knitr | 
| NeedsCompilation: | yes | 
| Packaged: | 2021-11-28 17:20:02 UTC; nataliazemlianskaia | 
| Repository: | CRAN | 
| Date/Publication: | 2021-11-30 07:30:02 UTC | 
Hierarchical GxE Interactions in a Regularized Regression Model
Description
The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.
Author(s)
Natalia Zemlianskaia
Maintainer: Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com>
References
"A Scalable Hierarchical Lasso for Gene-Environment Interactions", Natalia Zemlianskaia, W.James Gauderman, Juan Pablo Lewinger https://arxiv.org/abs/2103.13510
Data Generation
Description
Generates genotypes data matrix G (sample_size by p), vector of environmental measurments E, and an outcome vector Y of size sample_size. Simulates training, validation, and test datasets.
Usage
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10, 
         family = "gaussian", mode = "strong_hierarchical", 
         normalize = FALSE, normalize_response = FALSE, 
         seed = 1, pG = 0.2, pE = 0.3,
         n_confounders = NULL)
Arguments
| sample_size | sample size of the data | 
| p | total number of main effects | 
| n_g_non_zero | number of non-zero main effects to generate | 
| n_gxe_non_zero | number of non-zero interaction effects to generate | 
| family | "gaussian" for continous outcome Y and "binomial" for binary 0/1 outcome | 
| mode | either "strong_hierarchical", "hierarchical", or "anti_hierarchical". In the strong hierarchical mode the hierarchical structure is maintained (beta_g = 0 then beta_gxe = 0) and also |beta_g| >= |beta_gxe|. In the hierarchical mode the hierarchical structure is maintained, but |beta_G| < |beta_gxe|. In the anti_hierarchical mode the hierarchical structure is violated (beta_g = 0 then beta_gxe != 0). | 
| normalize | 
 | 
| normalize_response | 
 | 
| pG | genotypes prevalence, value from 0 to 1 | 
| pE | environment prevalence, value from 0 to 1 | 
| seed | random seed | 
| n_confounders | number of confounders to generate, either  | 
Value
A list of simulated datasets and generating coefficients
| G_train,G_valid,G_test | generated genotypes matrices | 
| E_train,E_valid,E_test | generated vectors of environmental values | 
| Y_train,Y_valid,Y_test | generated outcome vectors | 
| C_train,C_valid,C_test | generated confounders matrices | 
| GxE_train,GxE_valid,GxE_test | generated GxE matrix | 
| Beta_G | main effect coefficients vector | 
| Beta_GxE | interaction coefficients vector | 
| beta_0 | intercept coefficient value | 
| beta_E | environment coefficient value | 
| Beta_C | confounders coefficient values | 
| index_beta_non_zero,index_beta_gxe_non_zero,index_beta_zero,index_beta_gxe_zero | inner data generation variables | 
| n_g_non_zero | number of non-zero main effects generated | 
| n_gxe_non_zero | number of non-zero interactions generated | 
| n_total_non_zero | total number of non-zero variables | 
| SNR_g | signal-to-noise ratio for the main effects | 
| SNR_gxe | signal-to-noise ratio for the interactions | 
| family,p,sample_size,mode,seed | input simulation parameters | 
Examples
data = data.gen(sample_size=100, p=100)
G = data$G_train; GxE = data$GxE_train
E = data$E_train; Y = data$Y_train
Get model coefficients
Description
A function to obtain coefficients from the model fit object corresponding to the desired pair of tuning parameters lambda = (lambda_1, lambda_2).
Usage
gesso.coef(fit, lambda)
Arguments
| fit | model fit object obtained either by using function  | 
| lambda | a pair of tuning parameters organized in a tibble (ex:  | 
Value
A list of model coefficients corresponding to lambda values of tuning parameters
| beta_0 | estimated intercept value | 
| beta_e | estimated environmental coefficient value | 
| beta_g | a vector of estimated main effect coefficients | 
| beta_c | a vector of estimated confounders coefficients | 
| beta_gxe | a vector of estimated interaction coefficients | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20, 
        parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g     
Get model coefficients with specified number of non-zero interactions
Description
A function to obtain coefficients with target_b_gxe_non_zero specified to control the desired sparsity of interactions in the model.
Usage
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
Arguments
| cv_model | cross-validated model fit object obtained by using function  | 
| target_b_gxe_non_zero | number of non-zero interactions we want to inlcude in the model | 
| less_than | 
 | 
Value
A list of model coefficients corresponding to the best model that contains at most or at least target_b_gxe_non_zero non-zero interaction terms. 
The target model is selected based on the averaged cross-validation (cv) results: for each pair of  parameters lambda=(lambda_1, lambda_2) in the grid and each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full data with the selected tuning parameters. 
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.
| beta_0 | estimated intercept value | 
| beta_e | estimated environmental coefficient value | 
| beta_g | a vector of estimated main effect coefficients | 
| beta_gxe | a vector of estimated interaction coefficients | 
| beta_c | a vector of estimated confounders coefficients | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
model_coefficients = gesso.coefnum(model, 5)
gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)              
Cross-Validation
Description
Performs nfolds-fold cross-validation to tune hyperparmeters lambda_1 and lambda_2 for the gesso model.
Usage
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
         grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian", 
         type_measure = "loss", fold_ids = NULL, nfolds = 4, 
         parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000, 
         min_working_set_size = 100, verbose = TRUE)
Arguments
| G | matrix of main effects of size  | 
| E | vector of environmental measurments | 
| Y | outcome vector. Set  | 
| C | matrix of confounders of size  | 
| normalize | 
 | 
| normalize_response | 
 | 
| grid | grid sequence for tuning hyperparameters, we use the same grid for  | 
| grid_size | specify  | 
| grid_min_ratio | parameter to determine  | 
| alpha | if  | 
| family | 
 | 
| type_measure | loss to use for cross-validation. Specity  | 
| fold_ids | option to input custom folds assignments | 
| tolerance | tolerance for the dual gap convergence criterion | 
| max_iterations | maximum number of iterations | 
| min_working_set_size | minimum size of the working set | 
| nfolds | number of cross-validation splits | 
| parallel | 
 | 
| seed | set random seed to control random folds assignments | 
| verbose | 
 | 
Value
A list of objects
| cv_result | a tibble with cross-validation results: averaged across folds loss and the number of non-zero coefficients for each value of ( 
 | 
| lambda_min | a tibble of optimal ( | 
| fit | list, return of the function gesso.fit on the full data | 
| grid | vector of values used for hyperparameters tuning | 
| full_cv_result | inner variables | 
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train, 
                      grid_size=20, parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe        
g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g          
gesso fit
Description
Fits gesso model over the two dimentional grid of hyperparmeters lambda_1 and lambda_2, returns estimated coefficients for each pair of hyperparameters.
Usage
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE,
          grid = NULL, grid_size = 20, grid_min_ratio = NULL, 
          alpha = NULL, family = "gaussian", weights = NULL,
          tolerance = 1e-3, max_iterations = 5000, 
          min_working_set_size = 100,
          verbose = FALSE)
Arguments
| G | matrix of main effects of size  | 
| E | vector of environmental measurments | 
| Y | outcome vector. Set  | 
| C | matrix of confounders of size  | 
| normalize | 
 | 
| normalize_response | 
 | 
| grid | grid sequence for tuning hyperparameters, we use the same grid for  | 
| grid_size | specify  | 
| grid_min_ratio | parameter to determine  | 
| alpha | if  | 
| family | 
 | 
| tolerance | tolerance for the dual gap convergence criterion | 
| max_iterations | maximum number of iterations | 
| min_working_set_size | minimum size of the working set | 
| weights | inner fitting parameter | 
| verbose | 
 | 
Value
A list of estimated coefficients and other model fit metrics for each pair of hyperparameters (lambda_1, lambda_2)
| beta_0 | vector of estimated intercept values of size  | 
| beta_e | vector of estimated environment coefficients of size  | 
| beta_g | matrix of estimated main effects coefficients organized by rows, size ( | 
| beta_gxe | matrix of estimated interactions coefficients organized by rows, size ( | 
| beta_c | matrix of estimated confounders coefficients organized by rows, size ( | 
| num_iterations | number of iterations until convergence for each fit | 
| working_set_size | maximum number of variables in the working set for each fit | 
| has_converged | 1 if the model converged within given  | 
| objective_value | objective function (loss) value for each fit | 
| beta_g_nonzero | number of estimated non-zero main effects for each fit | 
| beta_gxe_nonzero | number of estimated non-zero interactions for each fit | 
| lambda_1 | 
 | 
| lambda_2 | 
 | 
| grid | vector of values used for hyperparameters tuning | 
Examples
data = data.gen()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE)
plot(fit$beta_g_nonzero, pch=19, cex=0.4, 
     ylab="num of non-zero features", xlab="lambdas path")
points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
Predict new outcome vector
Description
Predict new outcome vector based on the new data and estimated model coefficients.
Usage
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E, 
                   beta_c=NULL, new_C=NULL, family = "gaussian")
Arguments
| beta_0 | estimated intercept value | 
| beta_e | estimated environmental coefficient value | 
| beta_g | a vector of estimated main effect coefficients | 
| beta_gxe | a vector of estimated interaction coefficients | 
| new_G | matrix of main effects, variables organized by columns | 
| new_E | vector of environmental measurments | 
| beta_c | a vector of estimated confounders coefficients | 
| new_C | matrix of confounders, variables organized by columns | 
| family | set  | 
Value
Returns a vector of predicted values
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train)
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e                   
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe     
new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
cor(new_Y, data$Y_test)^2
Selection metrics
Description
Calculates principal selection metrics for the binary zero/non-zero classification problem (sensitivity, specificity, precision, auc).
Usage
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
Arguments
| true_b_g | vector of true main effect coefficients | 
| true_b_gxe | vector of true interaction coefficients | 
| estimated_b_g | vector of estimated main effect coefficients | 
| estimated_b_gxe | vector of estimated interaction coefficients | 
Value
A list of principal selection metrics
| b_g_non_zero | number of non-zero main effects | 
| b_gxe_non_zero | number of non-zero interactions | 
| mse_b_g | mean squared error for estimation of main effects effect sizes | 
| mse_b_gxe | mean squared error for estimation of interactions effect sizes | 
| sensitivity_g | recall of the non-zero main effects | 
| specificity_g | recall of the zero main effects | 
| precision_g | precision with respect to non-zero main effects | 
| sensitivity_gxe | recall of the non-zero interactions | 
| specificity_gxe | recall of the zero interactions | 
| precision_gxe | precision with respect to non-zero interactions | 
| auc_g | area under the curve for zero/non-zero binary classification problem for main effects | 
| auc_gxe | area under the curve for zero/non-zero binary classification problem for interactions | 
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe                
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g  
selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)