Type: | Package |
Title: | Quantile Least Mahalanobis Distance Estimator for Tukey g-&-h Mixture |
Version: | 0.1.8 |
Date: | 2025-03-15 |
Description: | Functions for simulation, estimation, and model selection of finite mixtures of Tukey g-and-h distributions. |
License: | GPL-2 |
Imports: | methods, mixtools, tclust, fmx, TukeyGH77 |
Encoding: | UTF-8 |
Language: | en-US |
Depends: | R (≥ 4.4.0) |
Suggests: | knitr, rmarkdown, mixsmsn |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-03-15 18:33:14 UTC; tingtingzhan |
Author: | Tingting Zhan |
Maintainer: | Tingting Zhan <Tingting.Zhan@jefferson.edu> |
Repository: | CRAN |
Date/Publication: | 2025-03-15 18:50:02 UTC |
Quantile Least Mahalanobis Distance Estimator for Tukey g-&-h Mixture
Description
Tools for simulating and fitting finite mixtures of the 4-parameter Tukey g-&-h distributions.
The Tukey g-&-h mixture is highly flexible for modeling multimodal distributions with variable degrees of skewness and kurtosis in the components.
The Quantile Least Mahalanobis Distance estimator (QLMDe) is used for estimating the parameters of finite Tukey g-&-h mixtures.
QLMDe is an indirect estimator that minimizes the Mahalanobis distance between the sample and model-based quantiles.
A backward-forward stepwise model selection algorithm is provided to find
- a parsimonious Tukey g-&-h mixture model, conditional on a given number of components; and
- the optimal number of components within a user-specified range.
Author(s)
Maintainer: Tingting Zhan <Tingting.Zhan@jefferson.edu> (ORCID)
Authors:
Inna Chervoneva <Inna.Chervoneva@jefferson.edu> (ORCID)
Examples
# see ?QLMDe
Quantile Least Mahalanobis Distance estimates
Description
The quantile least Mahalanobis distance algorithm estimates the parameters of single-component or finite mixture distributions by minimizing the Mahalanobis distance between the vectors of sample and theoretical quantiles. See QLMDp for the default selection of probabilities at which the sample and theoretical quantiles are compared.
The default initial values are estimated based on trimmed k-means clustering with re-assignment.
Usage
QLMDe(
x,
distname = c("GH", "norm", "sn"),
K,
data.name = deparse1(substitute(x)),
constraint = character(),
probs = QLMDp(x = x),
init = c("logLik", "letterValue", "normix"),
tol = .Machine$double.eps^0.25,
maxiter = 1000,
...
)
Arguments
x |
numeric vector of observations. |
distname |
character scalar, name of the mixture distribution to be fitted; currently 'GH' (default), 'norm' and 'sn' are supported. |
K |
integer scalar, number of components (e.g., must use 2L instead of 2). |
data.name |
character scalar, name for the observations for user-friendly print out. |
constraint |
character vector, parameters (e.g., 'g1' and 'h2') to be constrained to zero. |
probs |
numeric vector, probabilities at which the sample and theoretical quantiles are to be matched.
See function QLMDp(). |
init |
character scalar for the method of initial value selection,
or an fmx object of initial values.
See functions fmx_cluster(), fmx_normix() and fmx_hybrid(). |
tol , maxiter |
see function vuniroot2 |
... |
additional parameters of optim |
Details
The Quantile Least Mahalanobis Distance estimator fits a single-component or finite mixture distribution by minimizing the Mahalanobis distance between the theoretical and observed quantiles, using the empirical quantile variance-covariance matrix quantile_vcov.
Value
Function QLMDe()
returns an fmx object.
Examples
data(bmi, package = 'mixsmsn')
hist(x <- bmi[[1L]])
QLMDe(x, distname = 'GH', K = 2L)
Forward Selection of the Number of Components K
Description
To compare gh-parsimonious models of Tukey g-&-h mixtures with different numbers of components K (up to a user-specified maximum K_max), and to select the optimal number of components.
Usage
QLMDe_stepK(
x,
distname = c("GH", "norm"),
data.name = deparse1(substitute(x)),
Kmax = 3L,
test = c("BIC", "AIC"),
direction = c("forward", "backward"),
...
)
Arguments
x |
numeric vector of observations. |
distname , data.name |
character scalars,
see parameters of the same names in function QLMDe(). |
Kmax |
integer scalar, the maximum number of components to be considered, default 3L. |
test |
character scalar, criterion to be used, either Akaike's information criterion AIC, or Bayesian information criterion BIC (default). |
direction |
character scalar, direction of selection; see function step_fmx(). |
... |
additional parameters |
Details
Function QLMDe_stepK() compares the gh-parsimonious models with different numbers of components K, and selects the optimal number of components using BIC (default) or AIC.
The forward selection starts with finding the gh-parsimonious model (via function step_fmx()) at K = 1. Let the current number of components be K^c. We compare the gh-parsimonious models with K^c + 1 and K^c components, respectively, using BIC or AIC. If K^c is preferred, the forward selection stops, and K^c is considered the optimal number of components. If K^c + 1 is preferred and K^c + 1 = K_max, the forward selection stops; otherwise, K^c is updated to K^c + 1 and the previous steps are repeated.
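The selection loop described above can be sketched in base R. Here `fit` and `crit` are illustrative stand-ins, not the package's internals: `fit(x, K)` would produce a gh-parsimonious model with K components (in the package, via QLMDe() and step_fmx()), and `crit` would extract its BIC or AIC.

```r
# Hedged sketch of the forward selection over K described above.
stepK_sketch <- function(x, Kmax, fit, crit) {
  models <- list(fit(x, K = 1L))
  K <- 1L
  while (K < Kmax) {
    cand <- fit(x, K = K + 1L)
    if (crit(cand) >= crit(models[[K]])) break  # smaller BIC/AIC preferred
    models[[K + 1L]] <- cand
    K <- K + 1L
  }
  models[[K]]  # gh-parsimonious model at the selected K
}
```

The skeleton stops either when adding a component no longer improves the criterion, or when K_max is reached, mirroring the two stopping rules above.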
Value
Function QLMDe_stepK() returns an object of S3 class 'stepK', which is a list of selected models (in reversed order) with attributes 'direction' and 'test'.
Examples
data(bmi, package = 'mixsmsn')
hist(x <- bmi[[1L]])
QLMDe_stepK(x, distname = 'GH', Kmax = 2L)
Percentages for Quantile Least Mahalanobis Distance estimation
Description
A vector of probabilities to be used in Quantile Least Mahalanobis Distance estimation (QLMDe).
Usage
QLMDp(
from = 0.05,
to = 0.95,
length.out = 15L,
equidistant = c("prob", "quantile"),
extra = c(0.005, 0.01, 0.02, 0.03, 0.97, 0.98, 0.99, 0.995),
x
)
Arguments
from , to |
numeric scalars, minimum and maximum of the equidistant (in probability or quantile) probabilities.
Defaults 0.05 and 0.95, respectively. |
length.out |
non-negative integer scalar, the number of the equidistant (in probability or quantile) probabilities. |
equidistant |
character scalar.
If 'prob' (default), the returned probabilities are equidistant; if 'quantile', the probabilities correspond to equidistant quantiles of x. |
extra |
numeric vector of additional probabilities,
default c(0.005, 0.01, 0.02, 0.03, 0.97, 0.98, 0.99, 0.995). |
x |
numeric vector of observations, only used when equidistant = 'quantile'. |
Details
With its default arguments, function QLMDp() returns the probabilities
c(.005, .01, .02, .03, seq.int(.05, .95, length.out = 15L), .97, .98, .99, .995).
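The default grid can be reproduced in base R; this is a sketch of the composition only, and whether QLMDp() itself sorts or de-duplicates the combined vector is an assumption here.

```r
# Default QLMDp() probabilities: 15 equidistant core probabilities
# plus the 8 'extra' tail probabilities, sorted (assumed behavior).
extra <- c(0.005, 0.01, 0.02, 0.03, 0.97, 0.98, 0.99, 0.995)
core  <- seq.int(from = 0.05, to = 0.95, length.out = 15L)
p <- sort(unique(c(extra, core)))
length(p)  # 23 probabilities spanning 0.005 to 0.995
```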
Value
A numeric vector of probabilities to be supplied to parameter p of the Quantile Least Mahalanobis Distance estimation (QLMDe).
In practice, the length of this probability vector p must be equal to or larger than the number of parameters in the distribution model to be estimated.
Examples
library(fmx)
(d2 = fmx('GH', A = c(1,6), B = 2, g = c(0,.3), h = c(.2,0), w = c(1,2)))
set.seed(100); hist(x2 <- rfmx(n = 1e3L, dist = d2))
# equidistant in probabilities
(p1 = QLMDp())
# equidistant in quantiles
(p2 = QLMDp(equidistant = 'quantile', x = x2))
Determine Nearly-Equal Elements
Description
Determine nearly-equal elements and extract non-nearly-equal elements in a double vector.
Usage
unique_allequal(x, ...)
duplicated_allequal(x, ...)
Arguments
x |
double vector. |
... |
additional parameters, such as the comparison tolerance. |
Value
Function duplicated_allequal() returns a logical vector of the same length as the input vector, indicating whether each element is nearly equal to any of the previous elements.
Function unique_allequal() returns the non-nearly-equal elements in the input vector.
See Also
duplicated.default unique.default
Examples
(x = c(.3, 1-.7, 0, .Machine$double.eps))
duplicated.default(x) # not desired
unique.default(x) # not desired
duplicated_allequal(x)
unique_allequal(x)
unique_allequal(x, tol = .Machine$double.eps/2)
Test if Two double Vectors are Element-Wise (Nearly) Equal
Description
Test if two double vectors are element-wise (nearly) equal.
Usage
allequal_o_(target, current, tolerance = sqrt(.Machine$double.eps), ...)
Arguments
target |
length-n_t double vector, the target values. |
current |
length-n_c double vector, the current values. |
tolerance |
positive double scalar, the comparison tolerance, default sqrt(.Machine$double.eps). |
... |
potential parameters, currently not in use |
Details
Function allequal_o_() differs from all.equal.numeric in that the comparison is performed element-wise, in a fashion similar to function outer, and a logical scalar is returned for each element-wise comparison.
Value
Function allequal_o_() returns an n_t × n_c logical matrix indicating whether the length-n_c vector current is element-wise nearly equal to the length-n_t vector target, within the pre-specified tolerance.
Examples
allequal_o_(target = c(.3, 0), current = c(.3, 1-.7, 0, .Machine$double.eps))
Drop or Add One Parameter from fmx Object
Description
Fit fmx models with a single parameter added or dropped.
Usage
## S3 method for class 'fmx'
drop1(object, ...)
## S3 method for class 'fmx'
add1(object, ...)
Arguments
object |
fmx object |
... |
additional parameters, currently not in use. |
Value
Functions drop1.fmx()
and add1.fmx()
return a list of fmx objects,
in the reverse order of model selection.
Note
Functions drop1.fmx() and add1.fmx() do not return an anova table, unlike other stats:::drop1.* or stats:::add1.* methods.
Examples
# donttest to save time
library(fmx)
(d2 = fmx('GH', A = c(1,6), B = 1.2, g = c(0,.3), h = c(.2,0), w = c(1,2)))
set.seed(312); hist(x2 <- rfmx(n = 1e3L, dist = d2))
system.time(m0 <- QLMDe(x2, distname = 'GH', K = 2L, constraint = c('g1', 'g2', 'h1', 'h2')))
system.time(m1 <- QLMDe(x2, distname = 'GH', K = 2L, constraint = c('g1', 'h2')))
system.time(m2 <- QLMDe(x2, distname = 'GH', K = 2L)) # ~2 secs
d1 = drop1(m1)
d1 # NULL
d2 = drop1(m2)
vapply(d2, FUN = getTeX, FUN.VALUE = '')
a0 = add1(m0)
vapply(a0, FUN = getTeX, FUN.VALUE = '')
a1 = add1(m1)
vapply(a1, FUN = getTeX, FUN.VALUE = '')
Naive Estimates of Finite Mixture Distribution via Clustering
Description
Naive estimates for finite mixture distribution fmx via clustering.
Usage
fmx_cluster(
x,
K,
distname = c("GH", "norm", "sn"),
constraint = character(),
...
)
Arguments
x |
numeric vector of observations. |
K |
integer scalar, number of mixture components |
distname |
character scalar, name of parametric distribution of the mixture components |
constraint |
character vector,
parameters (e.g., 'g1' and 'h2') to be constrained to zero. |
... |
additional parameters, currently not in use |
Details
First of all, if the specified number of components K ≥ 2, trimmed k-means clustering with re-assignment will be performed; otherwise, all observations are considered as one single cluster. The standard k-means clustering is not used, since the heavy tails of the Tukey g-&-h distribution could be mistakenly classified as individual cluster(s).
In each of the one or more clusters, letterValue-based estimates of the Tukey g-&-h distribution (Hoaglin, 2006) are calculated for any K ≥ 1, serving as the starting values for the QLMD algorithm. These estimates are provided by function fmx_cluster().
When K = 1, the median and MAD serve as the starting values for μ and σ (or A and B of the Tukey g-&-h distribution, with g = h = 0) in the QLMD algorithm.
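For the K = 1 case, the median/MAD starting values described above amount to the following minimal base-R sketch; the parameter names A, B, g and h follow the Tukey g-&-h convention used in this package, and the exact form returned by fmx_cluster() may differ.

```r
# Starting values for a single-component Tukey g-&-h fit:
# location A from the median, scale B from the MAD, with g = h = 0.
start_K1 <- function(x) {
  c(A = stats::median(x), B = stats::mad(x), g = 0, h = 0)
}
start_K1(c(1, 2, 3, 4, 100))  # A = 3, B = 1.4826, g = 0, h = 0
```

Both the median and the MAD are robust to the heavy tails that motivate the trimmed clustering above.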
Value
Function fmx_cluster()
returns an fmx object.
Best Naive Estimates for Finite Mixture Distribution
Description
Best estimates for finite mixture distribution fmx.
Usage
fmx_hybrid(x, test = c("logLik", "CvM", "KS"), ...)
Arguments
x |
numeric vector of observations. |
test |
character scalar, criteria for selecting the optimal estimates. See Details. |
... |
additional parameters of functions fmx_cluster() and fmx_normix(). |
Details
Function fmx_hybrid() compares the Tukey g-&-h mixture estimate provided by function fmx_cluster() and the normal mixture estimate by function fmx_normix(), and selects the one with either the maximum likelihood (test = 'logLik', default), the minimum Cramer-von Mises distance (test = 'CvM'), or the minimum Kolmogorov distance (test = 'KS'; see Kolmogorov_fmx).
Value
Function fmx_hybrid()
returns an fmx object.
Examples
library(fmx)
d1 = fmx('norm', mean = c(1, 2), sd = .5, w = c(.4, .6))
set.seed(100); hist(x1 <- rfmx(n = 1e3L, dist = d1))
fmx_normix(x1, distname = 'norm', K = 2L)
fmx_normix(x1, distname = 'GH', K = 2L)
(d2 = fmx('GH', A = c(1,6), B = 2, g = c(0,.3), h = c(.2,0), w = c(1,2)))
set.seed(100); hist(x2 <- rfmx(n = 1e3L, dist = d2))
fmx_cluster(x2, K = 2L)
fmx_cluster(x2, K = 2L, constraint = c('g1', 'h2'))
fmx_normix(x2, K = 2L, distname = 'GH')
fmx_hybrid(x2, distname = 'GH', K = 2L)
Naive Parameter Estimates using Mixture of Normal
Description
Naive parameter estimates for finite mixture distribution fmx using mixture of normal distributions.
Usage
fmx_normix(x, K, distname = c("norm", "GH", "sn"), alpha = 0.05, R = 10L, ...)
Arguments
x |
numeric vector of observations. |
K |
integer scalar, number of mixture components |
distname |
character scalar, name of parametric distribution of the mixture components |
alpha |
numeric scalar, proportion of observations to be trimmed in
trimmed k-means clustering. |
R |
integer scalar, number of normalmixEM replicates |
... |
additional parameters, currently not in use |
Details
Function fmx_normix() provides the cluster centers as the starting values of the μ's for fitting a univariate normal mixture by the EM algorithm. R replicates of the normal mixture estimates are obtained, and the one with the maximum likelihood is selected.
Value
Function fmx_normix()
returns an fmx object.
Data Clusters by (modified) k-Means
Description
To create a list of observations based on a (modified) k-means algorithm.
Usage
klist(x, K, method = c("reassign_tkmeans"), alpha = 0.05, ...)
Arguments
x |
numeric vector of observations. |
K |
integer scalar, number of clusters |
method |
character scalar;
only 'reassign_tkmeans' is currently supported. |
alpha |
numeric scalar, proportion of observations to be trimmed in
trimmed k-means clustering. |
... |
additional parameters, currently not in use |
Value
Function klist()
returns a list of numeric vectors.
Simpler and Faster Mahalanobis Distance
Description
Simpler and faster Mahalanobis distance.
Usage
mahalanobis_int(x, center, invcov)
Arguments
x |
numeric vector, an individual observation. |
center |
numeric vector, the center. |
invcov |
numeric matrix, inverted variance-covariance |
Value
Function mahalanobis_int()
returns a numeric scalar.
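The speed-up comes from accepting a pre-inverted covariance matrix, so repeated calls skip the inversion that stats::mahalanobis performs each time. A hedged base-R equivalent follows; the actual mahalanobis_int() internals may differ.

```r
# Squared Mahalanobis distance with a pre-inverted covariance matrix,
# mirroring stats::mahalanobis (which also returns the squared form).
mahalanobis_sketch <- function(x, center, invcov) {
  z <- x - center
  drop(crossprod(z, invcov %*% z))  # t(z) %*% invcov %*% z, as a scalar
}
mahalanobis_sketch(c(1, 2), center = c(0, 0), invcov = diag(2))  # 5
```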
Variance-Covariance of Quantiles
Description
To compute the variance-covariance matrix of quantiles, based on Theorems 1 and 2 of Mosteller (1946).
Usage
quantile_vcov(p, d)
Arguments
p |
numeric vector, cumulative probabilities at the given quantiles |
d |
numeric vector, probability densities at the given quantiles |
Details
The end user should make sure no density too close to 0 is included in argument d
.
Function quantile_vcov()
must not be used in a compute-intensive way.
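Mosteller's result gives Cov(Q_{p_i}, Q_{p_j}) = p_i (1 − p_j) / (n d_i d_j) for p_i ≤ p_j, where d_i is the density at the i-th quantile. A hedged base-R sketch of that matrix is below; whether quantile_vcov() includes the 1/n factor is an assumption here, and the division by the densities is why values of d near 0 are dangerous.

```r
# Asymptotic variance-covariance of sample quantiles (Mosteller, 1946):
# V[i, j] = p_min * (1 - p_max) / (n * d_i * d_j), with p sorted ascending.
quantile_vcov_sketch <- function(p, d, n = 1L) {
  o <- order(p); p <- p[o]; d <- d[o]
  V <- outer(p, 1 - p)                   # upper triangle: p_i * (1 - p_j)
  V[lower.tri(V)] <- t(V)[lower.tri(V)]  # symmetrize the lower triangle
  V / (n * outer(d, d))
}
```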
Value
Function quantile_vcov()
returns the variance-covariance matrix of quantiles.
References
Frederick Mosteller. On Some Useful "Inefficient" Statistics (1946). doi:10.1214/aoms/1177730881
Re-Assign Observations Trimmed Prior to Trimmed k-Means Clustering
Description
Re-assign the observations that are trimmed in the trimmed k-means algorithm back to the closest cluster, as determined by the smallest Mahalanobis distance.
Usage
reAssign(x, ...)
## S3 method for class 'tkmeans'
reAssign(x, ...)
Arguments
x |
a tkmeans object |
... |
potential parameters, currently not in use. |
Details
Given the tkmeans input, the Mahalanobis distance is computed between each trimmed observation and each cluster center. Each trimmed observation is then assigned to the closest cluster (i.e., the one with the smallest Mahalanobis distance).
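A hedged sketch of that rule in base R, with assumed inputs: a matrix of trimmed observations plus per-cluster centers and covariance matrices. The actual reAssign.tkmeans() extracts these from the tkmeans object instead.

```r
# Assign each trimmed observation (row of xt) to the cluster with the
# smallest squared Mahalanobis distance to that cluster's center.
reassign_sketch <- function(xt, centers, covs) {
  dists <- sapply(seq_along(centers), function(k) {
    inv <- solve(covs[[k]])              # invert once per cluster
    apply(xt, 1L, function(row) {
      z <- row - centers[[k]]
      drop(crossprod(z, inv %*% z))
    })
  })
  max.col(-dists)  # needs >= 2 rows in xt; column of the minimum distance
}
reassign_sketch(
  xt = rbind(c(1, 1), c(9, 9)),
  centers = list(c(0, 0), c(10, 10)),
  covs = list(diag(2), diag(2))
)  # 1 2
```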
Value
Function reAssign.tkmeans() returns a 'reAssign_tkmeans' object, which inherits from the tkmeans class.
Note
Either kmeans or tkmeans is slow for big x.
Examples
library(tclust)
data(geyser2)
clus = tkmeans(geyser2, k = 3L, alpha = .03)
plot(clus, main = 'Before Re-Assigning')
plot(reAssign(clus), main = 'After Re-Assigning')
Forward Selection of gh-parsimonious Model with Fixed Number of Components K
Description
To select the gh-parsimonious mixture model, i.e., with some g and/or h parameters equal to zero, conditionally on a fixed number of components K.
Usage
step_fmx(
object,
test = c("BIC", "AIC"),
direction = c("forward", "backward"),
...
)
Arguments
object |
fmx object |
test |
character scalar, criterion to be used, either Akaike's information criterion AIC-like, or Bayesian information criterion BIC-like (default). |
direction |
character scalar, direction of selection, either 'forward' (default) or 'backward'. |
... |
additional parameters, currently not in use |
Details
The algorithm starts with the quantile least Mahalanobis distance estimates of either the full mixture of Tukey g-&-h distributions model, or a constrained model (i.e., with some g and/or h parameters equal to zero, according to the user input).
Next, each of the non-zero g and/or h parameters is tested using the likelihood ratio test. If all tested g and/or h parameters are significantly different from zero at the level 0.05, the algorithm stops and the initial model is considered gh-parsimonious. Otherwise, the g or h parameter with the largest p-value is constrained to zero for the next iteration of the algorithm.
The algorithm iterates until only the significantly-different-from-zero g and h parameters are retained, which corresponds to the gh-parsimonious Tukey g-&-h mixture model.
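The iteration above can be sketched as a generic loop. Here `pvalues` is an illustrative stand-in for refitting the constrained model and running the likelihood ratio tests (which the package does via QLMDe()); only the control flow is meant to match the description.

```r
# Constrain to zero the g/h parameter with the largest LRT p-value,
# until every remaining free parameter is significant at `alpha`.
# `pvalues(free, constrained)` is a stand-in returning one p-value
# per currently free parameter.
prune_gh_sketch <- function(free, pvalues, alpha = 0.05) {
  constrained <- character()
  repeat {
    fr <- setdiff(free, constrained)
    if (!length(fr)) return(constrained)     # everything constrained
    p <- pvalues(fr, constrained)
    if (all(p < alpha)) return(constrained)  # all remaining significant
    constrained <- c(constrained, fr[which.max(p)])
  }
}
```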
Value
Function step_fmx() returns an object of S3 class 'step_fmx', which is a list of selected models (in reversed order) with attributes 'direction' and 'test'.