| Type: | Package | 
| Title: | Entrywise Splitting Cross-Validation for Factor Models | 
| Version: | 1.0.1 | 
| Description: | Implements entrywise splitting cross-validation (ECV) and its penalized variant (pECV) for selecting the number of factors in generalized factor models. | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| Language: | en-US | 
| Depends: | R (≥ 3.5.0) | 
| Imports: | stats, Rcpp (≥ 1.0.0), irlba | 
| Suggests: | mirtjml, testthat (≥ 3.0.0) | 
| LinkingTo: | Rcpp, RcppArmadillo | 
| URL: | https://github.com/wangATsu/ECV | 
| BugReports: | https://github.com/wangATsu/ECV/issues | 
| RoxygenNote: | 7.3.2 | 
| Config/testthat/edition: | 3 | 
| ByteCompile: | true | 
| NeedsCompilation: | yes | 
| Packaged: | 2025-08-23 02:29:40 UTC; clswt-wangzhijing | 
| Author: | Zhijing Wang [aut, cre] | 
| Maintainer: | Zhijing Wang <wangzhijing@sjtu.edu.cn> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-08-28 08:50:07 UTC | 
Estimate constraint constant C for continuous data
Description
Data-driven estimation of the constraint constant C used in the alternating maximization algorithm, for continuous data, based on a truncated SVD. The function decomposes the data matrix and estimates C from the maximum row norm of the resulting factor matrices.
Usage
estimate_C(X, qmax = 8, safety = 1.2)
Arguments
| X | n x p continuous data matrix | 
| qmax | Rank for truncated SVD (default 8) | 
| safety | Safety parameter for conservative estimation (default 1.2) | 
Details
The function performs the following steps:
1. Computes the truncated SVD of X with rank qmax
2. Constructs factor matrices A = U * sqrt(D) and B = V * sqrt(D)
3. Calculates row 2-norms for matrices A and B
4. Takes the maximum norm and multiplies it by the safety parameter
For count data, it is recommended to transform the data using log(X + 1) before applying this function.
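The steps listed above can be sketched in base R as follows. This is an illustration only, not the package's implementation: the helper name `sketch_estimate_C` is hypothetical, and it uses `base::svd()` and keeps the leading `qmax` singular triplets, whereas the package may compute the truncated SVD differently (it lists irlba among its imports).

```r
# Hypothetical helper mirroring the four documented steps with base::svd().
sketch_estimate_C <- function(X, qmax = 8, safety = 1.2) {
  k <- min(qmax, nrow(X), ncol(X))     # rank cannot exceed the matrix dimensions
  s <- svd(X, nu = k, nv = k)          # step 1: keep the leading k singular triplets
  d_root <- sqrt(s$d[seq_len(k)])
  A <- sweep(s$u, 2, d_root, `*`)      # step 2: A = U * sqrt(D)
  B <- sweep(s$v, 2, d_root, `*`)      #         B = V * sqrt(D)
  a_norms <- sqrt(rowSums(A^2))        # step 3: row 2-norms
  b_norms <- sqrt(rowSums(B^2))
  C_norm_hat <- max(a_norms, b_norms)  # step 4: largest norm, then inflate
  list(C_norm_hat = C_norm_hat,
       C_est = safety * C_norm_hat,
       a_norms = a_norms,
       b_norms = b_norms)
}

set.seed(1)
X <- matrix(rnorm(20 * 10), 20, 10)
out <- sketch_estimate_C(X, qmax = 3)
print(out$C_est)
```

Because A and B each absorb sqrt(D), the product A %*% t(B) reproduces the rank-qmax approximation of X, so the row norms are on the scale of the factor and loading vectors that the constraint C is meant to bound.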
Value
A list containing:
| qmax | Truncation rank used | 
| safety | Safety parameter applied | 
| C_norm_hat | Original maximum row norm | 
| C_est | Final conservative estimate of C | 
| a_norms | Row norms of factor matrix A | 
| b_norms | Row norms of factor matrix B | 
Examples
# Example 1: Continuous data
set.seed(123)
n <- 100; p <- 50; q <- 3
theta_true <- matrix(runif(n * q), n, q)
A_true <- matrix(runif(p * q), p, q)
X <- theta_true %*% t(A_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
# Estimate C
C_result <- estimate_C(X, qmax = 5)
print(C_result$C_est)
# Example 2: Count data (apply log transformation)
lambda <- exp(theta_true %*% t(A_true))
X_count <- matrix(rpois(n * p, lambda = as.vector(lambda)), n, p)
X_transformed <- log(X_count + 1)
C_count <- estimate_C(X_transformed, qmax = 5)
print(C_count$C_est)
Estimate constraint constant C for binary data
Description
Data-driven estimation of the constraint constant C for binary data using cross-window smoothing and empirical logit transformation.
Usage
estimate_C_binary(X, qmax = 8, safety = 1.5, eps = 1e-12, radius = 1)
Arguments
| X | n x p binary data matrix (0/1 values) | 
| qmax | Rank for truncated SVD (default 8) | 
| safety | Safety parameter for conservative estimation (default 1.5) | 
| eps | Small constant that keeps the logit finite when an estimated probability equals 0 or 1 (default 1e-12) | 
| radius | Radius for cross-window smoothing (default 1) | 
Details
The function performs the following steps:
1. Applies cross-window smoothing to estimate probabilities
2. Performs an empirical logit transformation with smoothing
3. Computes the truncated SVD of the transformed matrix
4. Constructs matrices A and B and calculates row norms
5. Estimates C as the maximum norm times the safety parameter
The cross-window smoothing helps stabilize probability estimates, especially for sparse binary data.
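The first two steps can be illustrated with a small base R sketch. The manual does not define the window shape, so the code below rests on an assumption: each entry is averaged with its neighbours in the same row and the same column within `radius` positions (a cross-shaped window). The helper names `cross_window_smooth` and `empirical_logit`, and the use of `eps` as a clipping bound, are likewise hypothetical and may differ from what the package does.

```r
# Hypothetical helper: cross-shaped moving average around each entry.
cross_window_smooth <- function(X, radius = 1) {
  n <- nrow(X); p <- ncol(X)
  P <- matrix(0, n, p)
  N <- matrix(0L, n, p)
  for (i in seq_len(n)) {
    rows <- max(1, i - radius):min(n, i + radius)
    for (j in seq_len(p)) {
      cols <- max(1, j - radius):min(p, j + radius)
      # Entries in column j over nearby rows, plus row i over nearby columns
      # (the centre entry is counted once).
      vals <- c(X[rows, j], X[i, setdiff(cols, j)])
      P[i, j] <- mean(vals)
      N[i, j] <- length(vals)
    }
  }
  list(P_smooth = P, N_counts = N)
}

# Hypothetical helper: clip probabilities away from 0 and 1, then take the logit.
empirical_logit <- function(P, eps = 1e-12) {
  P <- pmin(pmax(P, eps), 1 - eps)
  log(P / (1 - P))
}

set.seed(1)
X <- matrix(rbinom(30 * 12, 1, 0.3), 30, 12)
sm <- cross_window_smooth(X, radius = 1)
Mhat <- empirical_logit(sm$P_smooth)
range(Mhat)
```

Under this reading, windows at the border of the matrix contain fewer entries than interior ones, which is one reason a per-entry count such as N_counts is useful to report.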
Value
A list containing:
| radius | Cross-window radius used | 
| qmax | Truncation rank used | 
| safety | Safety parameter applied | 
| C0 | Original maximum row norm | 
| C_est | Final conservative estimate of C | 
| a_norms | Row norms of factor matrix A | 
| b_norms | Row norms of factor matrix B | 
| Mhat | Logit-transformed matrix | 
| P_smooth | Smoothed probability matrix | 
| N_counts | Count of values in each smoothing window | 
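A minimal usage sketch, assuming the ECV package is installed. It only combines calls and components documented in this manual (`generate_binary_data()`, `estimate_C_binary()`, and the `resp` and `C_est` components); the object names `sim` and `C_bin` are arbitrary, and the call is guarded so the snippet does nothing when the package is unavailable.

```r
if (requireNamespace("ECV", quietly = TRUE)) {
  set.seed(123)
  # Simulate a 100 x 50 binary matrix with 3 latent factors
  sim <- ECV::generate_binary_data(n = 100, p = 50, q = 3)
  # Estimate the constraint constant with the documented defaults for safety and radius
  C_bin <- ECV::estimate_C_binary(sim$resp, qmax = 5, safety = 1.5, radius = 1)
  print(C_bin$C_est)
}
```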
Generate binary data example
Description
Generate simulated data from a binary (logistic) factor model.
Usage
generate_binary_data(n = 100, p = 50, q = 3)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
Value
A named list with components:
- resp
- Binary matrix (n x p). Generated 0/1 responses. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix (n x q). True latent factor scores. 
- A_true
- Numeric matrix (p x q). True factor loadings. 
- d_true
- Numeric vector (length p). Item intercepts. 
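As a rough picture of what a logistic factor model with these components looks like, the base R sketch below draws responses with success probability plogis(theta_i' a_j + d_j). The distributions chosen for the scores, loadings, and intercepts are assumptions made for illustration; the manual does not state which ones generate_binary_data() uses.

```r
set.seed(42)
n <- 100; p <- 50; q <- 3

theta <- matrix(rnorm(n * q), n, q)        # factor scores (assumed standard normal)
A <- matrix(runif(p * q, 0.5, 1.5), p, q)  # loadings (assumed uniform)
d <- rnorm(p)                              # item intercepts (assumed standard normal)

eta <- sweep(theta %*% t(A), 2, d, `+`)    # add intercept d_j to column j
prob <- plogis(eta)                        # logistic link
resp <- matrix(rbinom(n * p, size = 1, prob = prob), n, p)

dim(resp)
```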
Generate binary data with missing values
Description
Generate simulated data from a binary (logistic) factor model with missing values.
Usage
generate_binary_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). | 
Value
A named list with components:
- resp
- Binary matrix (n x p). Generated 0/1 responses with missing values (NA). 
- resp_complete
- Binary matrix (n x p). Complete data before missingness. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix. True latent factor scores. 
- A_true
- Numeric matrix. True factor loadings. 
- d_true
- Numeric vector (length p). Item intercepts. 
- miss_prop
- Numeric. Proportion of entries set to missing. 
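One simple way to obtain a `resp` / `resp_complete` pair of this kind is to blank out a random subset of entries, as sketched below. The helper name `add_missing` is hypothetical, and selecting entries uniformly at random (missing completely at random) is an assumption; the manual does not describe the missingness mechanism used by the package.

```r
# Hypothetical helper: set a proportion miss_prop of entries to NA, uniformly at random.
add_missing <- function(resp, miss_prop = 0.05) {
  stopifnot(miss_prop > 0, miss_prop < 1)
  n_miss <- round(miss_prop * length(resp))
  idx <- sample.int(length(resp), n_miss)  # positions chosen without replacement
  out <- resp
  out[idx] <- NA
  out
}

set.seed(7)
resp_complete <- matrix(rbinom(100 * 50, 1, 0.5), 100, 50)
resp <- add_missing(resp_complete, miss_prop = 0.05)
mean(is.na(resp))
```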
Generate continuous data example
Description
Generate simulated data from a Gaussian factor model.
Usage
generate_continuous_data(n = 100, p = 50, q = 3, noise_sd = 1)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
| noise_sd | Numeric. Standard deviation of Gaussian noise. | 
Value
A named list with components:
- resp
- Numeric matrix (n x p). Generated observed data. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix (n x (q+1)). True latent factor scores with intercept. 
- A_true
- Numeric matrix (p x (q+1)). True factor loadings. 
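The (q+1)-column shapes above are consistent with a score matrix that carries a constant column for the intercept. A minimal base R sketch of such a Gaussian factor model follows; placing the constant column first and the particular distributions for scores and loadings are assumptions, not a description of generate_continuous_data() itself.

```r
set.seed(42)
n <- 100; p <- 50; q <- 3; noise_sd <- 1

theta <- cbind(1, matrix(rnorm(n * q), n, q))     # n x (q+1): constant column plus q scores
A <- matrix(runif(p * (q + 1), -1, 1), p, q + 1)  # p x (q+1): intercepts plus q loadings

signal <- theta %*% t(A)                          # low-rank mean structure
resp <- signal + matrix(rnorm(n * p, sd = noise_sd), n, p)

dim(theta); dim(A); dim(resp)
```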
Generate continuous data with missing values
Description
Generate simulated data from a Gaussian factor model with missing values.
Usage
generate_continuous_data_miss(
  n = 100,
  p = 50,
  q = 3,
  noise_sd = 1,
  miss_prop = 0.05
)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
| noise_sd | Numeric. Standard deviation of Gaussian noise. | 
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). | 
Value
A named list with components:
- resp
- Numeric matrix (n x p). Generated data with missing values (NA). 
- resp_complete
- Numeric matrix (n x p). Complete data before missingness. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix (n x (q+1)). True latent factor scores with intercept. 
- A_true
- Numeric matrix (p x (q+1)). True factor loadings. 
- miss_prop
- Numeric. Proportion of entries set to missing. 
Generate count data example
Description
Generate simulated data from a Poisson factor model.
Usage
generate_count_data(n = 100, p = 50, q = 3)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
Value
A named list with components:
- resp
- Integer matrix (n x p). Generated Poisson observations. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix (n x (q+1)). True latent factor scores with intercept. 
- A_true
- Numeric matrix (p x (q+1)). True factor loadings. 
Generate count data with missing values
Description
Generate simulated data from a Poisson factor model with missing values.
Usage
generate_count_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
| n | Integer. Number of observations. | 
| p | Integer. Number of variables. | 
| q | Integer. True number of latent factors. | 
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). | 
Value
A named list with components:
- resp
- Integer matrix (n x p). Generated data with missing values (NA). 
- resp_complete
- Integer matrix (n x p). Complete data before missingness. 
- true_q
- Integer. True number of factors used in simulation. 
- theta_true
- Numeric matrix (n x (q+1)). True latent factor scores with intercept. 
- A_true
- Numeric matrix (p x (q+1)). True factor loadings. 
- miss_prop
- Numeric. Proportion of entries set to missing. 
Entrywise Splitting Cross-Validation for Factor Models
Description
Uses (Penalized) Entrywise Splitting Cross-Validation (ECV / pECV) to estimate the number of latent factors in generalized factor models.
Usage
pECV(
  resp,
  C = 5,
  qmax = 8,
  fold = 5,
  tol_val = 0.01,
  theta0 = NULL,
  A0 = NULL,
  seed = 1,
  data_type = NULL
)
Arguments
| resp | Observation data matrix (n x p); can be continuous, count, or binary. | 
| C | Constraint constant, default is 5. | 
| qmax | Maximum number of factors to consider, default is 8. | 
| fold | Number of folds in cross-validation, default is 5. | 
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). | 
| theta0 | Optional initial matrix for factors; sampled from a uniform distribution if not provided. | 
| A0 | Optional initial matrix for loadings; sampled from a uniform distribution if not provided. | 
| seed | Random seed, default is 1. | 
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. | 
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
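To give a sense of what splitting entrywise means, the base R sketch below assigns every cell of the data matrix, rather than every row, to one of `fold` groups and masks one group as held out. It illustrates the general idea only; the fold assignment and masking used inside pECV() are not described in this manual, and the names `fold_id`, `held_out`, and `train` are hypothetical.

```r
set.seed(1)
n <- 6; p <- 4; fold <- 5
resp <- matrix(rpois(n * p, lambda = 3), n, p)

# Assign each of the n * p entries to a fold, with fold sizes as balanced as possible.
fold_id <- matrix(sample(rep_len(seq_len(fold), n * p)), n, p)

# For fold 1: held-out entries are hidden from fitting and used only for scoring.
held_out <- fold_id == 1
train <- resp
train[held_out] <- NA

table(fold_id)
sum(is.na(train))
```

A model with a given number of factors would then be fitted to `train` and scored on the held-out entries, repeating over folds and candidate factor numbers.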
Value
A named list with components:
- ECV
- Integer. Number of factors selected by standard ECV. 
- p1ECV
- Integer. Number of factors selected by ECV with penalty 1. 
- p2ECV
- Integer. Number of factors selected by ECV with penalty 2. 
- p3ECV
- Integer. Number of factors selected by ECV with penalty 3. 
- p4ECV
- Integer. Number of factors selected by ECV with penalty 4. 
- ECV_loss
- Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- data_type
- Character. The detected/used data type: "continuous", "count", or "binary".
The return value has base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
  rpois(length(lambda), lambda = as.vector(lambda)),
  nrow = nrow(lambda), ncol = ncol(lambda)
)
result <- pECV(resp, C = 4, qmax = 4, fold = 5)
print(result)
Entrywise Splitting Cross-Validation with Missing Data
Description
Uses (Penalized) Entrywise Splitting Cross-Validation to estimate the number of latent factors in generalized factor models when the data contain missing values.
Usage
pECV.miss(
  resp,
  C = 5,
  qmax = 8,
  fold = 5,
  tol_val = 0.01,
  theta0 = NULL,
  A0 = NULL,
  seed = 1,
  data_type = NULL
)
Arguments
| resp | Observation data matrix (n x p) with missing values coded as NA. | 
| C | Constraint constant, default is 5. | 
| qmax | Maximum number of factors to consider, default is 8. | 
| fold | Number of folds in cross-validation, default is 5. | 
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). | 
| theta0 | Optional initial matrix for factors; sampled from a uniform distribution if not provided. | 
| A0 | Optional initial matrix for loadings; sampled from a uniform distribution if not provided. | 
| seed | Random seed, default is 1. | 
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. | 
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
Value
A named list with components:
- ECV
- Integer. Number of factors selected by standard ECV. 
- p1ECV
- Integer. Number of factors selected by ECV with penalty 1. 
- p2ECV
- Integer. Number of factors selected by ECV with penalty 2. 
- p3ECV
- Integer. Number of factors selected by ECV with penalty 3. 
- p4ECV
- Integer. Number of factors selected by ECV with penalty 4. 
- ECV_loss
- Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- data_type
- Character. The detected/used data type: "continuous", "count", or "binary".
- miss_percent
- Numeric scalar. Percentage of missing entries in resp.
The return value uses base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data with missing values
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
  rpois(length(lambda), lambda = as.vector(lambda)),
  nrow = nrow(lambda), ncol = ncol(lambda)
)
# Introduce 5% missing values
miss_idx <- sample(1:(n * p), size = 0.05 * n * p)
resp[miss_idx] <- NA
result <- pECV.miss(resp, C = 4, qmax = 4, fold = 5)
print(result)