We use the api dataset from package survey to illustrate
estimation of a population mean from a sample using a linear regression
model. First let’s estimate the population mean of the academic
performance indicator 2000 from a simple random sample,
apisrs. Using package survey’s GREG estimator (Särndal, Swensson, and Wretman 1992), we
find
library(survey)
data(api)
# define the regression model
model <- api00 ~ ell + meals + stype + hsg + col.grad + grad.sch
# compute corresponding population totals
XpopT <- colSums(model.matrix(model, apipop))
N <- XpopT[["(Intercept)"]]  # population size
# create the survey design object
des <- svydesign(ids=~1, data=apisrs, weights=~pw, fpc=~fpc)
# compute the calibration or GREG estimator
cal <- calibrate(des, formula=model, population=XpopT)
svymean(~ api00, des)  # equally weighted estimate##         mean     SE
## api00 656.58 9.2497##         mean     SE
## api00 663.86 4.1942The true population mean in this case can be obtained from the
apipop dataset:
## [1] 664.7126Note that the GREG estimate is more accurate than the simple equally weighted estimate, which is also reflected by the smaller estimated standard error of the former.
We can run the same linear model using package mcmcsae. In the next
code chunk, function create_sampler sets
up a sampler function that is used as input to function MCMCsim, which runs a
simulation to obtain draws from the posterior distribution of the model
parameters. By default three chains with independently generated
starting values are run over 1250 iterations with the first 250
discarded as burnin. As the starting values for the MCMC simulation are
randomly generated, we set a random seed for reproducibility.
The results of the simulation are subsequently summarized, and the DIC model criterion is computed. The simulation summary shows several statistics for the model parameters, including the posterior mean, standard error, quantiles, as well as the R-hat convergence diagnostic.
library(mcmcsae)
set.seed(1)
sampler <- create_sampler(model, data=apisrs)
sim <- MCMCsim(sampler, verbose=FALSE)
(summ <- summary(sim))## llh_ :
##       Mean   SD t-value   MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1104 2.12    -520 0.0418 -1108 -1104 -1101  2581     1
## 
## sigma_ :
##        Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 60.5 3.11    19.4 0.0631  55.5 60.4    66  2433     1
## 
## reg1 :
##                Mean     SD t-value    MCSE    q0.05    q0.5    q0.95 n_eff R_hat
## (Intercept)  778.32 24.600   31.64 0.44912  737.311  778.33 818.5935  3000 1.000
## ell           -1.72  0.298   -5.79 0.00544   -2.210   -1.72  -1.2382  3000 1.000
## meals         -1.75  0.275   -6.36 0.00502   -2.209   -1.75  -1.3115  3000 1.000
## stypeH      -108.81 14.023   -7.76 0.25603 -131.403 -108.60 -86.1115  3000 1.002
## stypeM       -59.05 12.117   -4.87 0.22122  -78.968  -59.08 -39.2607  3000 0.999
## hsg           -0.70  0.415   -1.69 0.00758   -1.408   -0.70  -0.0101  3000 1.000
## col.grad       1.22  0.487    2.51 0.00889    0.421    1.22   2.0163  3000 1.000
## grad.sch       2.20  0.501    4.39 0.00937    1.402    2.19   3.0274  2859 1.000##         DIC       p_DIC 
## 2217.368326    8.910283The output of function MCMCsim is an object of
class mcdraws. The package provides methods for the generic
functions summary, plot, predict,
residuals and fitted for this class.
To compute a model-based estimate of the population mean, we need to predict the values of the target variable for the unobserved units. Let \(U\) denote the set of population units, \(s \subset U\) the set of sample (observed) units, and let \(y_i\) denote the value of the variable of interest taken by the \(i\)th unit. The population mean of the variable of interest is \[ \bar{Y} = \frac{1}{N}\sum_{i=1}^N y_i = \frac{1}{N}\left(\sum_{i\in s} y_i + \sum_{i\in U\setminus s} y_i \right)\,. \]
Posterior draws for \(\bar{Y}\) can
be obtained by generating draws for the non-sampled population units,
summing them and adding the sample sum. This is done in the next code
chunk, where method predict is used to
generate draws from the posterior predictive distribution for the new,
unobserved, units.
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
# use only a sample of 250 draws from each chain
predictions <- predict(
  sim, newdata=apipop[-m, ], iters=sample(1:1000, 250),
  show.progress=FALSE
)
str(predictions)## List of 3
##  $ : num [1:250, 1:5994] 699 623 586 599 608 ...
##  $ : num [1:250, 1:5994] 704 672 619 746 692 ...
##  $ : num [1:250, 1:5994] 587 701 743 673 582 ...
##  - attr(*, "class")= chr "dc"samplesum <- sum(apisrs$api00)
summary(transform_dc(
  predictions, fun = function(x) (samplesum + sum(x))/N
))##      Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## [1,]  664 4.12     161 0.151   657  664   671   750     1The result for the population mean can also be obtained directly (and more efficient memory-wise) by supplying the appropriate aggregation function to the prediction method:
summary(predict(
  sim, newdata=apipop[-m, ],
  fun=function(x, p) (samplesum + sum(x))/N,
  show.progress=FALSE
))##      Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## [1,]  664 4.21     158 0.083   657  664   671  2565 0.999For any linear model one can obtain the same result more efficiently by precomputing covariate population aggregates. Posterior draws for \(\bar{Y}\) are then computed as
\[ \bar{Y}_r = \frac{1}{N} \left( n\bar{y} + \beta_r'(X - n\bar{x}) + e_r\right)\,, \]
where \(e_r \sim {\cal N}(0, (N-n)\sigma_r^2)\), representing the sum of \(N-n\) independent normal draws. The code to do this is
n <- nrow(apisrs)
XsamT <- colSums(model.matrix(model, apisrs))
XpopR <- matrix(XpopT - XsamT, nrow=1) / (N - n)
predictions <- predict(
  sim, X=list(reg1=XpopR), weights = N-n,
  fun=function(x, p) (samplesum + x)/N,
  show.progress=FALSE
)
summary(predictions)##      Mean  SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## [1,]  664 4.2     158 0.0797   657  664   671  2782 0.999To compute weights corresponding to the population total:
sampler <- create_sampler(model, data=apisrs,
                          linpred=list(reg1=matrix(XpopT/N, nrow=1)),
                          compute.weights=TRUE)
sim <- MCMCsim(sampler, verbose=FALSE)
plot(weights(cal)/N, weights(sim)); abline(0, 1)## [1] 663.8594## linpred_ :
##             Mean      SD t-value      MCSE   q0.05    q0.5   q0.95 n_eff    R_hat
## linpred_ 663.792 4.35517 152.415 0.0795141 656.689 663.682 671.264  3000 0.999854Note the small difference between the weighted sample sum of the target variable and the posterior mean of the linear predictor. This is due to Monte Carlo error; the weighted sum is exact for the simple linear regression case.
One possible way to deal with outliers is to use a Student-t sampling
distribution, which has fatter tails than the normal distribution. In
the next example, the var.model argument of
f_gaussian is used to add local variance parameters with
inverse chi-squared distributions. The marginal sampling distribution
then becomes Student-t. Here the degrees of freedom parameter is
modelled, i.e. assigned a prior distribution and inferred from the
data.
sampler <- create_sampler(
  model,
  family = f_gaussian(var.model = ~vfac(prior=pr_invchisq(df="modeled"))),
  linpred=list(reg1=matrix(XpopR, nrow=1)),
  data=apisrs, compute.weights=TRUE
)
sim <- MCMCsim(sampler, burnin=1000, n.iter=5000, thin=2, verbose=FALSE)
(summ <- summary(sim))## llh_ :
##       Mean   SD t-value  MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1080 8.26    -131 0.569 -1093 -1081 -1066   211  1.03
## 
## sigma_ :
##        Mean  SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 49.3 4.5      11 0.312  41.8 49.3  56.8   207  1.03
## 
## linpred_ :
##          Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## linpred_  664 4.05     164 0.0498   658  664   671  6607     1
## 
## reg1 :
##                 Mean     SD t-value    MCSE   q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  794.075 25.899   30.66 0.54383  751.22  794.131 836.1946  2268  1.00
## ell           -1.474  0.361   -4.08 0.01039   -2.06   -1.480  -0.8706  1207  1.00
## meals         -2.092  0.361   -5.80 0.01158   -2.71   -2.086  -1.5083   971  1.01
## stypeH      -105.416 12.980   -8.12 0.15833 -126.64 -105.395 -83.9306  6720  1.00
## stypeM       -56.574 10.823   -5.23 0.12788  -74.54  -56.487 -38.9582  7163  1.00
## hsg           -0.673  0.458   -1.47 0.00632   -1.43   -0.675   0.0694  5239  1.00
## col.grad       0.968  0.478    2.03 0.00773    0.19    0.966   1.7666  3817  1.00
## grad.sch       2.108  0.465    4.53 0.00582    1.34    2.103   2.8724  6381  1.00
## 
## vfac1_df :
##          Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## vfac1_df 7.47 4.26    1.76 0.376  3.39 6.48  14.8   129  1.05## $vfac1_df
## $vfac1_df[[1]]
## [1] 0.3274
## 
## $vfac1_df[[2]]
## [1] 0.3034
## 
## $vfac1_df[[3]]
## [1] 0.3116##        DIC      p_DIC 
## 2194.78743   34.26584predictions <- predict(sim, newdata=apipop[-m, ], show.progress=FALSE,
                       fun=function(x, p) (samplesum + sum(x))/N)
summary(predictions)##      Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## [1,]  664 3.99     167 0.0491   657  664   671  6583     1##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2058  0.9282  1.0785  1.0002  1.1333  1.1672