Type: Package
Title: Fit Mixture Models Using the Expectation Maximisation (EM) Algorithm
Version: 1.0-10
Date: 2023-01-18
Description: A set of functions which use the Expectation Maximisation (EM) algorithm (Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) <doi:10.1111/j.2517-6161.1977.tb01600.x> Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39(1), 1–22) to take a finite mixture model approach to clustering. The package is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. The method is described in Hunt, L. and Jorgensen, M. (1999) <doi:10.1111/1467-842X.00071> Australian & New Zealand Journal of Statistics 41(2), 153–171 and Hunt, L. and Jorgensen, M. (2003) <doi:10.1016/S0167-9473(02)00190-1> Mixture model clustering for mixed data with missing information, Computational Statistics & Data Analysis, 41(3-4), 429–440.
Depends: mvtnorm, R (≥ 4.0.0)
Encoding: UTF-8
Imports: methods
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
RoxygenNote: 7.2.3
URL: https://github.com/jmcurran/multimix
BugReports: https://github.com/jmcurran/multimix/issues
NeedsCompilation: no
Packaged: 2023-01-18 00:26:06 UTC; james
Author: Murray Jorgensen [aut], James Curran [cre, ctb]
Maintainer: James Curran <j.curran@auckland.ac.nz>
Repository: CRAN
Date/Publication: 2023-01-18 11:40:02 UTC

multimix: Fit Mixture Models Using the Expectation Maximisation (EM) Algorithm

Description

A set of functions which use the Expectation Maximisation (EM) algorithm (Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) doi:10.1111/j.2517-6161.1977.tb01600.x Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39(1), 1–22) to take a finite mixture model approach to clustering. The package is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. The method is described in Hunt, L. and Jorgensen, M. (1999) doi:10.1111/1467-842X.00071 Australian & New Zealand Journal of Statistics 41(2), 153–171 and Hunt, L. and Jorgensen, M. (2003) doi:10.1016/S0167-9473(02)00190-1 Mixture model clustering for mixed data with missing information, Computational Statistics & Data Analysis, 41(3-4), 429–440.

Author(s)

Maintainer: James Curran j.curran@auckland.ac.nz [contributor]

Authors:

Murray Jorgensen

See Also

Useful links:

https://github.com/jmcurran/multimix

Report bugs at https://github.com/jmcurran/multimix/issues


Prostate cancer patient data

Description

Data on 475 prostate cancer patients

Usage

data(cancer.df)

Format

A data.frame with 475 rows and 12 columns:

age

Age in years

wt

Weight in pounds

pf

Patient activity

hx

Family history of cancer

sbp

Systolic blood pressure

dbp

Diastolic blood pressure

ekg

Electrocardiogram code

hg

Serum haemoglobin

sz

Size of primary tumour

sg

Index of tumour stage and histologic grade

ap

Serum prostatic acid phosphatase

bm

Bone metastases

Details

There are twelve pre-trial covariates measured on each patient: seven may be taken to be continuous, four to be discrete, and one variable (SG) is an index nearly all of whose values lie between 7 and 15, and which could be considered either discrete or continuous. We will treat SG as a continuous variable.

A preliminary inspection of the data showed that the size of the primary tumour (SZ) and serum prostatic acid phosphatase (AP) were both skewed variables. These variables have therefore been transformed to achieve approximate normality: a square root transformation was used for SZ, and a logarithmic transformation was used for AP. (As for correlation, skewness over the whole data set does not necessarily mean skewness within clusters, but when clusters were formed, within-cluster skewness was observed for these variables.)
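
The effect of the two transformations can be sketched in base R. The data below are simulated purely for illustration and are not the values in cancer.df; the names sz_raw, ap_raw and skewness are hypothetical.

```r
# Simulated right-skewed measurements (hypothetical, not the cancer.df values)
set.seed(1)
sz_raw <- rexp(200, rate = 0.1)                 # stand-in for tumour size
ap_raw <- rlnorm(200, meanlog = 2, sdlog = 1)   # stand-in for acid phosphatase

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sz_t <- sqrt(sz_raw)   # square root transformation, as used for SZ
ap_t <- log(ap_raw)    # logarithmic transformation, as used for AP

# Compare skewness before and after transforming
round(c(sz_raw = skewness(sz_raw), sz_sqrt = skewness(sz_t),
        ap_raw = skewness(ap_raw), ap_log = skewness(ap_t)), 2)
```

On these simulated data the transformed variables are closer to symmetric than the raw ones, which is the property the transformations are meant to deliver.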

Observations that had missing values in any of the twelve pretreatment covariates were omitted from further analysis, leaving 475 out of the original 506 observations available.

The categorical variable Patient activity had 4 levels: 'Normally Active', 'Bed rest below 50%', 'Bed rest 50% or more', and 'Confined to bed'. The numbers of the 475 patients in these groups were 428, 32, 12, and 3. The two least active groups are combined in our data, giving 3 groups of size 428, 32, and 15.
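
Merging the two least active groups can be sketched in base R as follows. The level labels "A" to "D" are placeholders, and the factor is rebuilt from the counts quoted above rather than taken from cancer.df.

```r
# Counts quoted in the text for the four original activity levels
# (labels are placeholders; the pf column itself is not reproduced here)
pf4 <- factor(rep(c("A", "B", "C", "D"), times = c(428, 32, 12, 3)))

# Merge the two least active levels ("C" and "D") into a single level
pf3 <- pf4
levels(pf3) <- list(A = "A", B = "B", CD = c("C", "D"))

table(pf3)   # group sizes after merging
```

Assigning a named list to levels() is the base R idiom for collapsing factor levels; it leaves the number of observations unchanged.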

Source

D.P. Byar and S.B. Green 'The choice of treatment for cancer patients based on covariate information - application to prostate cancer', Bulletin du Cancer 1980: 67:477–490, reproduced in D.F. Andrews and A.M. Herzberg 'Data: a collection of problems from many fields for the student and research worker' p.261–274 Springer series in statistics, Springer-Verlag. New York.


Contraceptive Method Choice data

Description

This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The cases are 1473 married women who were either not pregnant or did not know if they were at the time of interview.

Usage

data(cmc.df)

Format

A data.frame with 1473 rows and 10 columns:

age

Wife's age

edu

Wife's education

eduh

Husband's education

nborn

Number of children ever born

islam

Wife's religion

working

Wife is now working?

husocc

Husband's occupation

sol

Standard-of-living index

medex

Media exposure

method

Contraceptive method used

Details

The variables 'age' (in years) and 'nborn' (ranging from 0 to 16) would normally be treated as continuous; 'nborn' is skewed and might well be transformed. The remaining 8 variables are categorical.

The variables 'edu', 'eduh' and 'sol' take the values 1, 2, 3, 4; they are ordinal with 1 = low and 4 = high. The variable 'husocc' takes the same 4 values, but it is not clear if the order has any significance.

The variables 'islam', 'working', and 'medex' are binary-valued with 0=Non-Islam, 1=Islam for 'islam'; 0=Yes, 1=No for 'working'; and 0=Good, 1=Not good for 'medex'.

The variable 'method' is ternary: 1=No-use, 2=Long-term, 3=Short-term.
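
The coding described above can be made explicit with labelled factors. The rows below are made up for illustration and are not survey records; whether data_organise expects factors or the raw numeric codes is not stated here, so this is only a sketch of the coding itself.

```r
# A few made-up rows using the coding described above (not real survey records)
toy <- data.frame(edu = c(1, 4, 2), islam = c(1, 0, 1),
                  working = c(0, 1, 1), method = c(1, 3, 2))

# Ordinal variable: 1 = low ... 4 = high
toy$edu <- factor(toy$edu, levels = 1:4, ordered = TRUE)

# Binary and ternary variables with the labels given in the text
toy$islam   <- factor(toy$islam,   levels = 0:1, labels = c("Non-Islam", "Islam"))
toy$working <- factor(toy$working, levels = 0:1, labels = c("Yes", "No"))
toy$method  <- factor(toy$method,  levels = 1:3,
                      labels = c("No-use", "Long-term", "Short-term"))
str(toy)
```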

Source

Tjen-Sien Lim 'Contraceptive Method Choice' 1997, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


Count the number of unique items in a vector x

Description

Count the number of unique items in a vector x

Usage

count.unique(x)

Arguments

x

a vector

Value

the number of unique items in x.

Author(s)

Murray Jorgensen

Examples

x = c(1, 2, 3)
count.unique(x)

x = c(1, 1, 1, 2, 3)
count.unique(x)
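
The package source is not shown in this manual. Assuming count.unique simply counts distinct values, a base R equivalent would look like the sketch below; the name count_unique_sketch is hypothetical.

```r
# Assumed equivalent of count.unique(): number of distinct values in x
count_unique_sketch <- function(x) length(unique(x))

count_unique_sketch(c(1, 2, 3))        # three distinct values
count_unique_sketch(c(1, 1, 1, 2, 3))  # repeats do not add to the count
```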

Prepare data for use with multimix

Description

Prepare data for use with multimix

Usage

data_organise(
  dframe,
  numClusters,
  numIter = 1000,
  cdep = NULL,
  lcdep = NULL,
  minpstar = 1e-09
)

Arguments

dframe

a data frame containing the data set you wish to model.

numClusters

the number of clusters you wish to fit.

numIter

the maximum number of steps that the EM algorithm will run before terminating.

cdep

a list of multivariate normal cells.

lcdep

a list of location cells.

minpstar

Minimum denominator for application of Bayes Rule.

Value

An object of class multimixSettings which is a list with the following elements:

Author(s)

Murray Jorgensen

Examples

data(cancer.df)
D = data_organise(cancer.df, numClusters = 2)

The E(xpectation) step

Description

The E(xpectation) step

Usage

eStep(P, D)

Arguments

P

an object of class multimixParamList; see initParamList for more information.

D

an object of class multimixSettings—see data_organise for more information.

Value

a list containing two elements: a matrix named Z—see mStep for more information, and a scalar llik containing the current value of the log-likelihood.

Author(s)

Murray Jorgensen
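
To illustrate what an E step computes, here is a minimal sketch for a toy mixture of univariate normals. It is not the package's implementation, which handles mixed and multivariate data; the names e_step_sketch, prop, mu and sigma are hypothetical, and using minpstar as a floor on the Bayes' rule denominator is an assumption based on the description of that argument in data_organise.

```r
# E step for a toy mixture of univariate normals.
# x: data vector; prop, mu, sigma: per-component parameters (hypothetical names)
# minpstar: floor on the Bayes' rule denominator (assumed role of minpstar)
e_step_sketch <- function(x, prop, mu, sigma, minpstar = 1e-09) {
  # n by q matrix with entries prop[j] * f_j(x[i])
  dens <- sapply(seq_along(prop),
                 function(j) prop[j] * dnorm(x, mu[j], sigma[j]))
  dens <- matrix(dens, nrow = length(x))
  pstar <- rowSums(dens)              # mixture density at each observation
  Z <- dens / pmax(pstar, minpstar)   # posterior membership probabilities
  list(Z = Z, llik = sum(log(pstar)))
}

x <- c(-2.1, -1.9, 0.1, 2.0, 2.2)
out <- e_step_sketch(x, prop = c(0.5, 0.5), mu = c(-2, 2), sigma = c(1, 1))
round(out$Z, 3)   # one row per observation, one column per component
out$llik          # log-likelihood at the current parameters
```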


Initialise the parameter list.

Description

Although the starting parameter list P may be specified directly, this function calculates an initial P from D and a starting value for Z. Note that any matrices specified directly must be positive definite.

Usage

initParamList(D, Z)

Arguments

D

an object of class multimixSettings—see data_organise for details.

Z

an n by q matrix, where n is the number of rows of dframe and q is the number of components in the mixture. During the fitting Z_{ij} holds the currently estimated probability that observation i belongs to component j. Often Z is initialized to a matrix of indicator columns for a partition of the data. It is also common to initialize Z to be the final Z from the fitting of a simpler model.

Value

an object of class multimixParamList which is a list with the following elements:


Map integer index N>0 back to left member of generating pair.

Description

Map integer index N>0 back to left member of generating pair.

Usage

left(N)

Arguments

N

positive integer scalar

Value

positive integer scalar

Author(s)

Murray Jorgensen

Examples

left(131)
left(57)

The M(aximisation) step

Description

Uses the current group membership probabilities Z to estimate the model parameters.

Usage

mStep(Z, D)

Arguments

Z

an n by q matrix, where n is the number of rows of dframe and q is the number of components in the mixture. During the fitting Z_{ij} holds the currently estimated probability that observation i belongs to component j. Commonly Z is initialized to a matrix of indicator columns for a partition of the data.

D

an object of class multimixSettings—see data_organise for details.

Value

an object of class multimixParamList—see initParamList for more information.

Author(s)

Murray Jorgensen
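
As an illustration of what an M step does with Z, the sketch below computes weighted estimates for a toy mixture of univariate normals. It is not the package's implementation; the names m_step_sketch, prop, mu and sigma are hypothetical.

```r
# M step for a toy mixture of univariate normals.
# x: data vector; Z: n by q matrix of membership probabilities
m_step_sketch <- function(x, Z) {
  nj    <- colSums(Z)                 # effective cluster sizes
  prop  <- nj / length(x)             # mixing proportions
  mu    <- colSums(Z * x) / nj        # weighted means
  # weighted standard deviations about each component mean
  sigma <- sqrt(colSums(Z * outer(x, mu, "-")^2) / nj)
  list(prop = prop, mu = mu, sigma = sigma)
}

x <- c(-2.1, -1.9, 0.1, 2.0, 2.2)
Z <- cbind(c(1, 1, 0.5, 0, 0), c(0, 0, 0.5, 1, 1))   # made-up memberships
m_step_sketch(x, Z)
```

Each observation contributes to every component in proportion to its row of Z, so the third observation here is shared equally between the two components.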


Make initial Z matrix from initial assignment of observations to clusters

Description

Z is an n by numClusters matrix of non-negative numbers whose rows sum to 1. The (i, j)th element z_{ij} is a probability that observation i belongs to cluster j. Rather than requiring an initial assignment of each observation to a single cluster, Multimix allows for a weighted assignment across several clusters.

Usage

make_Z_discrete(d)

Arguments

d

an integer vector giving the initial cluster assignment of each observation

Details

This function yields a 0/1 valued matrix.

Value

a matrix whose entries are non-negative, and whose rows sum to 1.

Author(s)

Murray Jorgensen

Examples

stage = scan(file = system.file('extdata', 'Stage.txt', package = 'multimix'))
stage = stage - 2
Z = make_Z_discrete(stage)
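
The kind of 0/1 matrix described above can be built in a few lines of base R. This is a sketch, not the package's code: the name indicator_sketch is hypothetical, and it assumes the labels in d are the integers 1, 2, ..., q.

```r
# Build a 0/1 indicator matrix from integer cluster labels 1, 2, ..., q
indicator_sketch <- function(d) {
  q <- max(d)
  Z <- matrix(0, nrow = length(d), ncol = q)
  Z[cbind(seq_along(d), d)] <- 1   # set exactly one entry per row
  Z
}

d <- c(1, 2, 2, 1, 3)   # made-up initial assignment
Z <- indicator_sketch(d)
Z
rowSums(Z)              # every row sums to 1
```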

Read Z from FORTRAN output. Make into R matrix

Description

The FORTRAN version of Multimix produces two output files: GENERAL.OUT and GROUPS.OUT. The latter mainly contains the Z matrix.

Usage

make_Z_fortran(gr.out = "groups.out")

Arguments

gr.out

string containing a file name.

Details

This function makes it possible to obtain Multimix R output from Multimix FORTRAN output.

Value

an R matrix containing the Z matrix read from the file.

Author(s)

Murray Jorgensen

Examples

Z <- make_Z_fortran(system.file('extdata', 'GROUPS-BP-Multimixf90.OUT', 
                    package = 'multimix'))

Start from random groups of similar size.

Description

A large number (n) of observations are assigned randomly to q clusters. It is recommended to repeat Multimix runs with a number of different seeds to search for a log-likelihood maximum.

Usage

make_Z_random(D, seed = NULL)

Arguments

D

an object of class multimixSettings – see data_organise for more information.

seed

a positive integer to use as a random number seed.

Details

Also consider making additional clusters from observations with low probabilities of belonging to any cluster in a previous clustering.

Value

a matrix of dimension n by q where n is the number of observations in D$dframe and q is the number of clusters in the model as specified by D$numClusters.

Examples

data(cancer.df)
D = data_organise(cancer.df, numClusters = 2)
Z = make_Z_random(D)
table(Z)
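
A random start with groups of similar size can be sketched in base R as below. This is an illustration rather than the package's code; the name random_Z_sketch is hypothetical, and it takes n and q directly instead of a multimixSettings object.

```r
# Randomly partition n observations into q groups of near-equal size
random_Z_sketch <- function(n, q, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  labels <- sample(rep_len(seq_len(q), n))   # shuffled, balanced labels
  Z <- matrix(0, nrow = n, ncol = q)
  Z[cbind(seq_len(n), labels)] <- 1          # one indicator per row
  Z
}

Z <- random_Z_sketch(n = 475, q = 2, seed = 42)
colSums(Z)   # group sizes differ by at most one
```

Repeating the call with different seeds gives different starting partitions, which is the basis of the multiple-start strategy recommended above.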

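Fit a mixture model using the EM algorithm

Description

Fit a mixture model using the EM algorithm, starting from the initial values Z and P.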

Usage

mmain(D, Z, P, eps = 1e-09)

Arguments

D

an object of class multimixSettings - see data_organise for full description.

Z

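an n by q matrix of cluster membership probabilities; see mStep for more information.

P

an object of class multimixParamList; see initParamList for more information.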

eps

Minimum increase in log-likelihood per EM step. If this is not exceeded, the algorithm will terminate.

Value

an object of class multimixResults, which is a list containing four elements: the multimixSettings object D, the Z matrix, the parameter list P, and a matrix called results, with n rows and numClusters columns.

Author(s)

Murray Jorgensen

Examples

data(cancer.df)
D <- data_organise(cancer.df, numClusters = 2)
stage <- scan(system.file('extdata', 'Stage.txt', package = 'multimix')) - 2
Z <- make_Z_discrete(stage)
P <- initParamList(D,Z) 
zpr <- mmain(D,Z,P)
zpr
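
The stopping rule described for eps can be illustrated with a self-contained EM loop for a two-component mixture of univariate normals. This is a sketch on simulated data, not the package's algorithm; the variable names follow the arguments above only loosely, and alternating an M step then an E step from an initial Z is an assumption about the order of operations.

```r
set.seed(7)
x <- c(rnorm(60, mean = 0), rnorm(40, mean = 5))   # simulated data

# Initial partition: split at the median (a stand-in for an initial Z)
Z <- cbind(as.numeric(x <= median(x)), as.numeric(x > median(x)))

eps <- 1e-09; numIter <- 1000; llik <- -Inf
for (iter in seq_len(numIter)) {
  # M step: weighted proportions, means and standard deviations
  nj    <- colSums(Z)
  prop  <- nj / length(x)
  mu    <- colSums(Z * x) / nj
  sigma <- sqrt(colSums(Z * outer(x, mu, "-")^2) / nj)
  # E step: posterior probabilities and log-likelihood
  dens  <- cbind(prop[1] * dnorm(x, mu[1], sigma[1]),
                 prop[2] * dnorm(x, mu[2], sigma[2]))
  Z     <- dens / rowSums(dens)
  new_llik <- sum(log(rowSums(dens)))
  if (new_llik - llik <= eps) break   # increase too small: stop
  llik <- new_llik                    # otherwise record and continue
}
c(iterations = iter, llik = llik)
round(mu, 2)
```

The loop ends either when the log-likelihood gain falls to eps or below, or after numIter steps, whichever comes first.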

Maps integer pairs (u,v) with 0<u<v bijectively to positive integers.

Description

Used to reduce array dimensions by replacing A(x,y,z) by A*(x,pair.index(y,z))

Usage

pair.index(u, v)

Arguments

u

positive integer scalar

v

positive integer scalar

Value

integer scalar

Author(s)

Murray Jorgensen

Examples

pair.index(11,17)
pair.index(2,12)
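
The manual does not give the formula. One bijection with the stated property is the triangular enumeration sketched below; under it the pairs (11, 17) and (2, 12) map to 131 and 57, the arguments used in the examples for left and right, which suggests, but does not confirm, that the package uses the same ordering. All three function names are hypothetical.

```r
# One bijection from pairs (u, v), 0 < u < v, to positive integers
# (an assumed formula; the package source is not shown in this manual)
pair_index_sketch <- function(u, v) (v - 1) * (v - 2) / 2 + u

# Inverse maps: v is the smallest integer with v * (v - 1) / 2 >= N
right_sketch <- function(N) ceiling((1 + sqrt(1 + 8 * N)) / 2)
left_sketch  <- function(N) {
  v <- right_sketch(N)
  N - (v - 1) * (v - 2) / 2
}

N <- pair_index_sketch(11, 17)
c(N = N, left = left_sketch(N), right = right_sketch(N))
```

Pairs are numbered in order of increasing v, and within each v in order of increasing u, so the index of (u, v) never depends on an upper bound for v.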

S3 method for plotting multimix results objects

Description

S3 method for plotting multimix results objects

Usage

## S3 method for class 'multimixResults'
plot(x, ...)

Arguments

x

an object of class multimixResults – see mmain for more information.

...

any other arguments to be passed to plot. Note that because there are two calls to plot, the ... arguments will be passed to each call, and it is unlikely that this will have the desired effect.

Value

No return value, called for side effects.

Author(s)

James Curran


S3 printing method for multimix parameter results

Description

S3 printing method for multimix parameter results

Usage

## S3 method for class 'multimixParamList'
print(
  x,
  type = c("means", "vars"),
  byLevel = FALSE,
  digits = c(4, 2, 3, 16),
  pedantic = FALSE,
  raw = FALSE,
  ...
)

Arguments

x

an object of class multimixParamList – see initParamList for more information.

type

the statistic you want displayed. If means then the cluster means will be displayed for each univariate continuous variable, the cluster proportions for each level of a categorical variable, and the mean vector for each cluster and each multivariate normal variable.

byLevel

if TRUE then location model summary stats will be printed by the level of the factor in the location model. Otherwise (default), they will be printed cluster by cluster.

digits

a vector of length 4. The first value determines how many decimal places to round categorical proportions to. The second value determines how many significant digits to display means to, and the third how many significant digits to display variances to. By default proportions are rounded to 4 decimal places, means to 2 significant digits, and variances to 3 significant digits. The fourth value is only used if pedantic == TRUE, and is set to 16 significant figures by default.

pedantic

if TRUE then the results are printed to high precision for checking purposes. The precision is given by digits[4], which is 16 significant figures by default.

raw

if TRUE then switches off all of the customised printing and uses the default print methods for lists etc.

...

additional arguments passed to print.

Value

No return value, called for side effects.

Author(s)

James Curran
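
The digits argument mixes two notions of precision: decimal places for proportions and significant digits for means and variances. The difference can be seen with base R's round and signif on made-up values; whether the print method calls these functions internally is an assumption.

```r
p <- 0.123456    # e.g. a categorical proportion
m <- 67.891      # e.g. a cluster mean
v <- 0.0123456   # e.g. a variance

round(p, 4)    # 4 decimal places:     0.1235
signif(m, 2)   # 2 significant digits: 68
signif(v, 3)   # 3 significant digits: 0.0123
```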


S3 method for the printing of multimix results

Description

S3 method for the printing of multimix results

Usage

## S3 method for class 'multimixResults'
print(x, n = FALSE, ...)

Arguments

x

an object of class multimixResults—see mmain for a description.

n

display the last few iterations of the cluster probabilities. If TRUE then the last 5 iterations will be displayed by default. Alternatively, a positive integer can be supplied. If this exceeds the number of actual iterations, the output will be truncated.

...

other parameters passed to print. Not currently used.

Value

No return value, called for side effects.

Author(s)

James Curran


Map integer index N>0 back to right member of generating pair.

Description

Map integer index N>0 back to right member of generating pair.

Usage

right(N)

Arguments

N

positive integer scalar

Value

positive integer scalar

Author(s)

Murray Jorgensen

Examples

right(131)
right(57)