This vignette is organized as follows:
Getting started
Exploring the data
Model specification
Parameter estimation
Robust prediction of the area-level means
Mean square prediction error
In small area estimation (SAE), we distinguish two types of models:
The classification of the models (A or B) is from Rao (2003, Chapter 7). The current version of the package implements the following estimation methods under the unit-level model (model B):
The package can be installed from CRAN using
install.packages("rsae"). Once the rsae
package has been installed, we need to load it to the current session by
library("rsae").
Work flow
The prediction of the area-level means takes three steps.
saemodel(),fitsaemodel(),robpredict(); this step includes the computation of the
mean square prediction error.We use the landsat data of Battese et
al. (1988), which is loaded by
The landsat data is a compilation of survey and
satellite data from Battese et al. (1988). It
consists of data on segments (primary sampling unit; 1 segement approx.
250 hectares) under corn and soybeans for 12 counties in north-central
Iowa.
In the three smallest counties (Cerro Gordo, Hamilton, and Worth), data is available only for one sample segment. All other counties have data for more than one sample segment (i.e., unbalanced data). The largest area (Hardin) covers six units.
The data for the observations 32, 33, and 34 are shown below
> landsat[32:34,]
   SegmentsInCounty SegementID HACorn HASoybeans PixelsCorn PixelsSoybeans
32              556          1  88.59     102.59        220            262
33              556          2  88.59      29.46        340             87
34              556          3 165.35      69.28        355            160
   MeanPixelsCorn MeanPixelsSoybeans outlier CountyName
32         325.99             177.05   FALSE     Hardin
33         325.99             177.05    TRUE     Hardin
34         325.99             177.05   FALSE     HardinWe added the variable outlier to the original data. It
flags observation 33 as an outlier, which is in line with the discussion
in Battese et al. (1988).
We consider estimating the parameters of the basic unit-level model (Battese et al. , 1988)
\[ \begin{equation*} \mathrm{HACorn}_{i,j} = \alpha + \beta_1 \mathrm{PixelsCorn}_{i,j} + \beta_2 \mathrm{PixelsSoybeans}_{i,j} + u_i + e_{i,j}, \end{equation*} \]
where \(j=1,\ldots, n_i\), \(i=1, \ldots,12\), and
The model is defined with the help of
thesaemodel()function:
> bhfmodel <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans,
+                      area = ~ CountyName,
+                      data = subset(landsat, subset = (outlier == FALSE)))where
formula defines the fixed-effect part of the model (the
~ operator separates dependent and independent variables;
by default, the model includes a regression intercept),area specifies the area-level random effect (variable
CountyName serves as area identifier; note that the
argument area is also a formula object),data specifies the data.frame (here, we
consider the subset of observations that are not flagged as
outliers).If you need to know more about a particular model, you can use the
summary() method.
Having specified bhfmodel, we consider estimating its
parameters by different estimation method.
The maximum likelihood (ML) estimates of the parameters are computed by
On print, object mlfit shows the following.
> mlfit
ESTIMATES OF SAE-MODEL (model type B) 
Method:  Maximum likelihood estimation 
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans 
Coefficients: 
   (Intercept)      PixelsCorn  PixelsSoybeans  
       50.9676          0.3286         -0.1337  
---
Random effects
Model: ~1| CountyName 
           (Intercept)  Residual
Std. Dev.  11.00        11.72   
---
Number of Observations:  36 
Number of Areas:  12 Inferential statistics for the object mlfit are computed
by the summary() method.
> summary(mlfit)
ESTIMATION SUMMARY
Method: Maximum likelihood estimation 
---
Fixed effects
                  Value Std.Error  t-value df  p-value    
(Intercept)    50.96756  23.47509  2.17113 22   0.0410 *  
PixelsCorn      0.32858   0.04798  6.84772 22 7.05e-07 ***
PixelsSoybeans -0.13371   0.05306 -2.51984 22   0.0195 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Function coef() extracts the estimated coefficients from
a fitted model.
coef() is called with argument
type = "both", which implies that the fixed effects and the
random effect variances are returned.type = "ranef" or
type = "fixef" returns the random effect variances or the
fixed effects, respectively.Function convergence() can be useful if the algorithm
did not converge; see Appendix.
The Huber-type M-estimator is appropriate for situations where the response variable is supposed to be (moderately) contaminated by outliers. The M-estimator downweights residual outliers, but it does not limit the effect of high-leverage observations.
The Huber-type M-estimator of bhfmodel is
where k is the robustness tuning constant of the Huber
\(\psi\)-function (\(0<k < \infty\)). On print, we
have
> huberfit
ESTIMATES OF SAE-MODEL (model type B) 
Method:  Huber-type M-estimation 
Robustness tuning constant: k = 1.5 
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans 
Coefficients: 
   (Intercept)      PixelsCorn  PixelsSoybeans  
       50.2849          0.3285         -0.1392  
---
Random effects
Model: ~1| CountyName 
           (Intercept)  Residual
Std. Dev.  11.95        12.36   
---
Number of Observations:  36 
Number of Areas:  12 If the algorithm did not converge, see paragraph safe mode
(below) and Appendix. Inferential statistics for
huberfit are computed by the summary()
method.
> summary(huberfit)
ESTIMATION SUMMARY
Method: Huber-type M-estimation 
Robustness tuning constant: k = 1.5 
---
Fixed effects
                  Value Std.Error  t-value df  p-value    
(Intercept)    50.28489  24.84651  2.02382 22   0.0553 .  
PixelsCorn      0.32845   0.05077  6.46906 22 1.65e-06 ***
PixelsSoybeans -0.13916   0.05618 -2.47711 22   0.0214 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
---
Degree of downweighting/winsorization:
                 sum(wgt)/n
fixeff           0.9841    
residual var     0.9723    
area raneff var  0.9841    The output is separated into 2 blocks. The first block shows inferential statistics of the estimated fixed effects. The second block reports the degree of downweighting that is applied to outlying residuals at the final iteration (by estimating equations, EE, separately). The more the value of “sum(wgt)/n” deviates from 1.0, the more downweighting has been applied.
The methods coef() and convergence() are
also available.
In the safe mode, the algorithm is initialized by a
high-breakdown-point regression estimator. Note. In
order to use the safe mode, the robustbase package of Maechler et al. (2021) must be installed.
The safe mode is entered by specifying one the following initialization methods:
init = "lts" : least trimmed squares (LTS) regression
estimator; see Rousseeuw (1984) and Rousseeuw and van Driessen (2006),init = "s" : regression S-estimator; see Rousseeuw and Yohai (1984) and Salibian-Barerra and Yohai (2006)in the call of fitsaemodel(). The safe mode uses a
couple of internal tests to check whether the estimates at consecutive
iterations behave well. Notably, it tries to detect cycles in the
sequence iteratively refined estimates; see Schoch
(2012).
The package implements the following methods to predict the area-level means:
EBLUP and robust predictions are computed by
where
fit is a fitted model (ML estimate or
M-estimate),k is the robustness tuning constant of the Huber \(\psi\)-function for robust prediction. By
default k is NULL which means that the
procedure inherits the tuning constant k that has been used
in fitting the model; see fitsaemodel(). For the ML
estimator, \(k\) is taken as a large
value that “represents” infinity; see argument k_Inf in
fitsaemodel.control().reps and progress_bar are used in the
computation of the mean square prediction error (see below).In the landsat data, the county-specific population
means of pixels of the segments under corn and soybeans are recorded in
the variables MeanPixelsCorn and MeanPixelsSoybeans, respectively. Each
sample segment in a particular county is assigned the county-specific
means. Therefore, the population mean of the variables MeanPixelsCorn
and MeanPixelsSoybeans occurs \(n_i\)
times. The unique county-specific population means are obtained
using
> dat <- unique(landsat[-33, c("MeanPixelsCorn", "MeanPixelsSoybeans", "CountyName")])
> dat <- cbind(rep(1,12), dat)
> rownames(dat) <- dat$CountyName
> dat <- dat[,1:3]
> dat
            rep(1, 12) MeanPixelsCorn MeanPixelsSoybeans
Cerro Gordo          1         295.29             189.70
Hamilton             1         300.40             196.65
Worth                1         289.60             205.28
Humboldt             1         290.74             220.22
Franklin             1         318.21             188.06
Pocahontas           1         257.17             247.13
Winnebago            1         291.77             185.37
Wright               1         301.26             221.36
Webster              1         262.17             247.09
Hancock              1         314.28             198.66
Kossuth              1         298.65             204.61
Hardin               1         325.99             177.05The first column of dat is a dummy variable (because
bhfmodel has a regression intercept). The second and third
column represent, respectively, the county-specific population means of
segments under corn and soybeans.
Consider the ML estimate, mlfit, of the
bhfmodel model. The EBLUP of the area-level means is
computed by
> pred <- robpredict(mlfit, areameans = dat)
> pred
Robustly Estimated/Predicted Area-Level Means:
             raneff   fixeff   area mean
Cerro Gordo   -0.348  122.629  122.281  
Hamilton       2.731  123.379  126.110  
Worth        -11.522  118.677  107.154  
Humboldt      -8.313  117.053  108.741  
Franklin      13.641  130.380  144.021  
Pocahontas     9.529  102.425  111.954  
Winnebago     -9.043  122.052  113.009  
Wright         1.648  120.358  122.006  
Webster       11.082  104.073  115.155  
Hancock       -3.229  127.671  124.442  
Kossuth      -14.621  121.740  107.119  
Hardin         8.445  134.408  142.853  because (by default) k = NULL . By explicitly specifying
k in the call of robpredict(), we can—in
principle—compute robust predictions for the ML estimate instead of the
EBLUP.
Consider the Huber M-estimate huberfit (with
tuning constant k = 1.5, see call of above). If
k is not specified in the call of
robpredict(), robust predictions are computed with \(k\) equal to 1.5; otherwise, the robust
predictions are based on the value of k in the call of
robpredict().
Object pred is a list with slots
"fixeff", "raneff", "means",
"mspe", etc. For instance, the predicted means can be
extracted by pred$means or
pred[["means"]].
A plot() function is available to display the predicted
means.
Function residuals() computes the residuals.
Function robpredict() can be used to compute bootstrap
estimates of the mean squared prediction errors (MSPE) of the predicted
area-level means; see Sinha and Rao (2009). To
compute the MSPE, we must specify the number of bootstrap replicates
(reps). If reps = NULL, the MSPE is not
computed.
Consider (for instance) the ML estimate. EBLUP and MSPE of the EBLUP based on 100 bootstrap replicates are computed by
> pred <- robpredict(mlfit, areameans = dat, reps = 100,
+                    progress_bar = FALSE)
> pred
Robustly Estimated/Predicted Area-Level Means:
             raneff   fixeff   area mean  mspe   
Cerro Gordo   -0.348  122.629  122.281     82.418
Hamilton       2.731  123.379  126.110     99.031
Worth        -11.522  118.677  107.154     60.627
Humboldt      -8.313  117.053  108.741     64.231
Franklin      13.641  130.380  144.021     39.708
Pocahontas     9.529  102.425  111.954     39.070
Winnebago     -9.043  122.052  113.009     43.537
Wright         1.648  120.358  122.006     42.533
Webster       11.082  104.073  115.155     32.976
Hancock       -3.229  127.671  124.442     32.567
Kossuth      -14.621  121.740  107.119     28.018
Hardin         8.445  134.408  142.853     20.681
(mpse: 100 boostrap replicates)The number of reps = 100 is kept small for illustration
purposes; in practice, we should choose larger values. The progress bar
has been disabled as it is not suitable for the automatic vignette
generation.
A visual display of the predicted area-level means obtains by
plot(pred, type = "l", sort = "means").
COPT, S. and M.-P. VICTORIA-FESER (2009). Robust prediction in mixed linear models, Tech. rep., University of Geneva.
BATTESE, G. E., R. M. HARTER and W. A. FULLER (1988). An error component model for prediction of county crop areas using, Journal of the American Statistical Association 83, 28–36. DOI: 10.2307/2288915
FELLNER, W. H. (1986). Robust estimation of variance components, Technometrics 28, 51–60. DOI: 10.1080/00401706.1986.10488097
HERITIER, S., E. CANTONI, S. COPT, and M.-P. VICTORIA-FESER (2009). Robust Methods in Biostatistics, New York: John Wiley & Sons.
MAECHLER, M., P. J. ROUSSEEUW, C. CROUX, V. TODOROC, A. RUCKSTUHL, M. SALIBIAN-BARRERA, T. VERBEKE, M. KOLLER, M. KOLLER, E. L. T. CONCEICAO and M. A. DI PALMA (2023). robustbase: Basic Robust Statistics. R package version 0.99-1. URL: CRAN.R-project.org/package=robustbase.
RAO, J.N.K. (2003). Small Area Estimation, New York: John Wiley and Sons.
RICHARDSON, A. M. and A. H. WELSH (1995). Robust Restricted Maximum Likelihood in Mixed Linear Models, Biometrics 51, 1429–1439. DOI: 10.2307/2533273
ROUSSEEUW, P. J. (1984). Least Median of Squares Regression, Journal of the American Statistical Association 79, 871–880. DOI: 10.2307/2288718
ROUSSEEUW, P. J. and K. VAN DRIESSEN (2006). Computing LTS Regression for Large Data Sets, Data Mining and Knowledge Discovery 12, 29–45. DOI: 10.1007/s10618-005-0024-4
ROUSSEEUW, P. J. and V. YOHAI (1984). Robust Regression by Means of S Estimators, in Robust and Nonlinear Time Series Analysis, ed. by FRANKE, J., W. HÄRDLE AND R. D. MARTIN, New York: Springer, 256–274.
SALIBIAN-BARRERA, M. and V. J. YOHAI (2006). A Fast Algorithm for S-Regression Estimates, Journal of Computational and Graphical Statistics 15, 414–427. DOI: 10.1198/106186006x113629
SCHOCH, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. DOI: 10.17713/ajs.v41i4.1548
SINHA, S.K. and J.N.K. RAO (2009). Robust small area estimation. Canadian Journal of Statistics 37, 381–399. DOI: 10.1002/cjs.10029
Suppose that we computed
The algorithm did not converge and we get the following output
> est
THE ALGORITHM DID NOT CONVERGE!
---
 1) use convergence() of your fitted model to learn more
 2) study the documentation using the command ?fitsaemodel
 3) you may call fitsaemodel with 'init' equal to (either) 'lts'
    or 's' (this works also for ML, though it may no be very efficient)
 4) if it still does not converge, the last resort is to modify
   'acc' and/or 'niter'
ESTIMATES OF SAE-MODEL (model type B) 
Method:  Maximum likelihood estimation 
---
Fixed effects
Model: HACorn ~ (Intercept) + PixelsCorn + PixelsSoybeans 
Coefficients: 
   (Intercept)      PixelsCorn  PixelsSoybeans  
            NA              NA              NA  
---
Random effects
Model: ~1| CountyName 
           (Intercept)  Residual
Std. Dev.  NA           NA      
---
Number of Observations:  36 
Number of Areas:  12 To learn more, we call convergence(est) and get
> convergence(est)
CONVERGENCE REPORT
NOTE: ALGORITHM DID NOT CONVERGE!
---
User specified number of iterations (niter) and
numeric precision (acc):
                niter   acc
overall loop        3 1e-05
fixeff            200 1e-05
residual var      200 1e-05
area raneff var   100 1e-05
---
Number of EE-specific iterations in each call
(given the user-defined specs), reported for
each of the3overall iterations:
  fixeff residual var area raneff var
1      2            2              20
2      2            2              13
3      2            2              14Clearly, we have deliberately set the number of (overall or outer
loop) iterations equal to niter = 3 to illustrate the
behavior; see also “niter = 3” on the line “overall loop” in the above
table. As a consequence, the algorithm failed to converge.
The maximum number of iterations for the fixed effects (“fixeff”) is
200, for the residual variance estimator (“residual var”) is 200, for
the area-level random effect variance is (“area raneff var”) is 100. The
default values can be modified; see documentation of
fitsaemodel.control().
The last table in the above output shows the number of iterations
that the algorithm required for the niter = 3 overall loop
iteration (i.e., in the first loop, we have fixef = 2,
residual var = 2 and area raneff var = 18
iterations). From this table we can learn how to adjust the default
values in case the algorithm does not converge.