Statistical Models used for Smoothing and Scaling
smoothscale-models.Rmd
Introduction
The smoothscale package provides functions for estimating counts and prevalences in small populations. The functions deal with two common problems: sampling errors and measurement errors. Statisticians have developed many methods for estimation in small populations, under the heading of “small area estimation” (Rao and Molina 2015). Many of these methods are complex, and require specialist statistical skills. The methods used in smoothscale are, in contrast, deliberately simple.
This vignette defines sampling and measurement errors, and describes how they are dealt with by functions in smoothscale. The vignette then discusses two further issues: stratification by additional variables such as age and sex/gender, and interactions between sampling and measurement errors.
Sampling errors
The setting
We would like to estimate the prevalence of some attribute, such as employment or school attendance, in each of \(k = 1, \dots, K\) small areas. We are interested in absolute numbers and in underlying probabilities.
We have data on \(n_k\) people from each area, including the number, \(x_k\), of people who are reported to have the attribute in question. In the following hypothetical data, for instance, \(n_k\) is the number of respondents in each area, and \(x_k\) is the number of respondents who attend school:
Area | Respondents | Respondents attending school |
---|---|---|
1001 | 83 | 62 |
1002 | 25 | 18 |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
1156 | 54 | 41 |
If the data come from a census or administrative system, then the number of respondents may be approximately equal to the size of the target population in each area. The number of children captured in the census, for instance, should be close to the actual number of children living in an area.
If, however, the data come from a survey, then the number of respondents is likely to be much smaller than the number of people in the target population. In this case, to estimate counts of people with the attribute in each area, we need an additional set of data, \(N_k\), \(k = 1, \cdots, K\), on the size of the target population in each area. Data on a target population of school-age children, for instance, might look like this:
Area | Total school-age children |
---|---|
1001 | 8301 |
1002 | 2500 |
\(\vdots\) | \(\vdots\) |
1156 | 5402 |
Observed prevalences always, to some extent, reflect random variation. Whether a particular child is involved in child labour, for instance, depends partly on random factors such as the economic status of the child’s family. If the data come from a survey, then the sample selection procedures add further randomness. We can therefore think of the number of people with the attribute as reflecting
- a stable ‘expected’ value, and
- random error.
Similarly, the proportion of people in an area who have an attribute reflects
- an underlying propensity to have the attribute, and
- random error.
In small populations, it can be essential to distinguish between an observed proportion and an underlying propensity. Consider, for instance, a population of 10 people in which, during a given year, no one dies. The observed proportion of people dying for that year is zero. However, assuming the people are mortal, the underlying propensity – the risk of dying that people in fact faced during the year – is not zero.
In large populations, random errors tend to cancel out, so that observed counts can be reliable indicators of underlying propensities. In small populations, there is much less scope for cancellation, so that observed counts are much more noisy. The statistical challenge is to minimise the effects of the noise, and to get at the underlying expected values and propensities.
Statistical model
Function smooth_prob() produces smoothed estimates that try to strip out the effects of random variation. The model draws on ideas from Bayesian statistics (Gelman et al. 2014), though, in contrast to a fully Bayesian analysis, smooth_prob() only yields point estimates.
The estimates are based on a statistical model. The number of people in area \(k\) with the attribute of interest is treated as a random draw from a binomial distribution, \[\begin{equation} x_k \sim \text{Binomial}(n_k, \pi_k), \end{equation}\] where \(\pi_k\) is the probability of having the attribute. The job of smooth_prob() is to estimate \(\pi_k\) for all \(K\) areas.
Function smooth_prob() assumes that the \(\pi_k\) come from a beta distribution, \[\begin{equation} \pi_k \sim \text{Beta}(\alpha, \beta). \end{equation}\] The centre of this distribution, and the extent to which the values concentrate around the centre, are determined by parameters \(\alpha\) and \(\beta\). Values for these parameters are estimated from the data.
Rather than work directly with \(\alpha\) and \(\beta\), we express them in terms of two new parameters, \[\begin{align} \alpha & = \lambda \nu \\ \beta & = (1 - \lambda) \nu. \end{align}\] Parameter \(\lambda\) is the central value for \(\pi_k\), and parameter \(\nu\) governs how tightly the \(\pi_k\) are concentrated around \(\lambda\), with larger values implying greater concentration. We do not place any constraints on \(\lambda\), except to restrict it to the range \(0 < \lambda < 1\). We assume that \(\nu\) comes from a log-normal distribution, \[\begin{equation} \nu \sim \text{LogNormal}(\log 10, 1). \end{equation}\] This distributional assumption has virtually no impact on the estimates of \(\pi_k\), except when both the number of small areas and the population of each area are small. Placing a soft constraint on \(\nu\) does, however, help stabilise the estimates at reasonable values when counts are small.
Fitted values \(\hat{\lambda}\) and \(\hat{\nu}\) can be obtained by finding the values that maximise the quantity \[\begin{equation} \prod_{k=1}^K \text{BetaBinom}(x_k | n_k, \lambda \nu, (1 - \lambda) \nu) \text{LogNormal}(\nu | \log 10, 1). \end{equation}\] This quantity is proportional to the posterior distribution, or, in non-Bayesian terms, to the penalised likelihood. See Wikipedia contributors (2023a) for the definition of the beta-binomial distribution. After the maximisation, values for the \(\pi_k\) can be derived using the properties of the binomial and beta distributions (Wikipedia contributors 2023b). The fitted value for \(\pi_k\) is \[\begin{equation} \hat{\pi}_k = \frac{x_k + \hat{\lambda} \hat{\nu}}{n_k + \hat{\nu}} (\#eq:pihat) \end{equation}\]
A smoothed value for the number of people with the attribute is then \(\hat{\pi}_k N_k\), or, when the data are a complete enumeration of the population, \(\hat{\pi}_k n_k\).
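To make the procedure concrete, here is a minimal sketch in R of the fitting step and of Equation @ref(eq:pihat). It is an illustration only, not the implementation used by smooth_prob(), and the data and population totals are made up.

```r
## Minimal illustrative sketch (not the smooth_prob() internals).
## Made-up data: respondents and respondents with the attribute, by area.
n <- c(83, 25, 54, 40, 120)
x <- c(62, 18, 41, 22, 95)

## Penalised log-likelihood, parameterised as (logit(lambda), log(nu))
## so that the optimisation is unconstrained.
neg_penalised_loglik <- function(par) {
  lambda <- plogis(par[1])
  nu     <- exp(par[2])
  a <- lambda * nu
  b <- (1 - lambda) * nu
  loglik  <- sum(lchoose(n, x) + lbeta(x + a, n - x + b) - lbeta(a, b))
  penalty <- dlnorm(nu, meanlog = log(10), sdlog = 1, log = TRUE)
  -(loglik + penalty)
}

fit        <- optim(c(0, log(10)), neg_penalised_loglik)
lambda_hat <- plogis(fit$par[1])
nu_hat     <- exp(fit$par[2])

## Smoothed probabilities, and smoothed counts for a (made-up) target population
pi_hat    <- (x + lambda_hat * nu_hat) / (n + nu_hat)
N         <- c(8301, 2500, 5402, 4100, 12000)
count_hat <- pi_hat * N
```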
Partial pooling
The formula for \(\hat{\pi}_k\) given in @ref(eq:pihat) can be rewritten as \[\begin{equation} \hat{\pi}_k = \phi_k \frac{x_k}{n_k} + (1 - \phi_k) \hat{\lambda} (\#eq:partial) \end{equation}\] where \[\begin{equation} \phi_k = \frac{n_k}{n_k + \hat{\nu}}. \end{equation}\] The quantity \(\frac{x_k}{n_k}\) in @ref(eq:partial) is the traditional ‘direct’ estimate of the probability in area \(k\). The direct estimate is based entirely on data for area \(k\), with no smoothing. The parameter \(\hat{\lambda}\) is an estimate of the average value across all areas, and uses all available data. The parameter \(\phi_k\) is a type of weight, which is close to 1 when \(n_k\) is large, and close to 0 when \(n_k\) is small.
Equation @ref(eq:partial) shows how \(\hat{\pi}_k\) can be regarded as a compromise between the local estimate \(\frac{x_k}{n_k}\) and the global estimate \(\hat{\lambda}\). The way that the weights \(\phi_k\) are constructed means that the local estimate exerts greater influence when there is more data to support it, and less influence when there is less data to support it. This data-driven balancing of local and global estimates is common in small area estimation, where it is known, among other things, as “partial pooling”. Partial pooling makes intuitive sense, and usually performs well in practice.
A characteristic of partially-pooled estimates is that they are more tightly concentrated than direct estimates. This is a form of smoothing, also referred to as shrinkage or regularisation. The estimator in @ref(eq:partial) has this property. Partial pooling shifts the estimates \(\hat{\pi}_k\) towards \(\hat{\lambda}\), so that they are more concentrated than the direct estimates \(\frac{x_k}{n_k}\).
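A quick numerical check, with made-up values of \(\hat{\lambda}\) and \(\hat{\nu}\), confirms that the partial-pooling form in @ref(eq:partial) gives the same answer as @ref(eq:pihat):

```r
## Check that @ref(eq:partial) reproduces @ref(eq:pihat), using made-up values.
n <- c(83, 25, 54)
x <- c(62, 18, 41)
lambda_hat <- 0.75
nu_hat     <- 12

pi_hat    <- (x + lambda_hat * nu_hat) / (n + nu_hat)   # closed form
phi       <- n / (n + nu_hat)                           # pooling weights
pi_pooled <- phi * (x / n) + (1 - phi) * lambda_hat     # partial-pooling form

all.equal(pi_hat, pi_pooled)   # TRUE: the two expressions coincide
```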
Measurement error
Setting
A second sort of error that often affects small area estimates is measurement error. Measurement error can occur, for instance, when respondents misunderstand the question, when proxy respondents are used, or when the definitions used in the data are different from the ones required by the analysis.
One way of dealing with measurement error is to constrain small area estimates to agree with aggregate ‘benchmark’ values that are assumed to be accurate (Bell, Datta, and Ghosh 2013; Zhang and Bryant 2020). Function scale_prob() takes this approach. A typical source of reliable aggregate-level data is household surveys. Although household surveys do not have large enough samples to provide stable estimates for small areas, they do typically have large enough samples to provide stable estimates at the national level, while also having relatively low measurement errors.
Model
Let \(p_k\) be an unscaled estimate of the probability that a person in area \(k\) possesses the attribute in question. We assume that \(p_k\) differs from the true value \(\pi_k\) because of measurement error, that is, \[\begin{equation} \pi_k = p_k + \epsilon_k, \end{equation}\] where \(\epsilon_k\) is an area-level measurement error. To estimate \(\pi_k\), we need a way of estimating \(\epsilon_k\).
Let \[\begin{equation} w_k = \frac{N_k}{\sum_{k=1}^K N_k} \quad \text{or} \quad \frac{n_k}{\sum_{k=1}^K n_k}, \quad k = 1, \cdots, K, \end{equation}\] be a known set of population weights. Let \[\begin{equation} \bar{p} = \sum_{k=1}^K w_k p_k \end{equation}\] be the national-level probability obtained by averaging the initial area-level estimates, and let \[\begin{equation} \bar{\pi} = \sum_{k=1}^K w_k \pi_k \end{equation}\] be the true national-level probability.
We obtain an accurate estimate of \(\bar{\pi}\) from, for instance, a national-level survey. Using \(\bar{\pi}\), we can calculate the national-level error \[\begin{equation} \bar{\epsilon} = \sum_{k=1}^K w_k \epsilon_k = \sum_{k=1}^K w_k (\pi_k - p_k) = \sum_{k=1}^K w_k \pi_k - \sum_{k=1}^K w_k p_k = \bar{\pi} - \bar{p}. (\#eq:epsilon) \end{equation}\]
We would like to construct an estimator for \(\epsilon_k\) with the form \[\begin{equation} \epsilon_k = \alpha_k \bar{\epsilon} \end{equation}\] where \(\alpha_k\) is an area-specific scaling factor.
The estimator implemented in scale_prob() takes slightly different forms when the average error is positive (so that we need to scale upwards) and when it is negative (so that we need to scale downwards): \[\begin{equation} \alpha_k = \begin{cases} \displaystyle \frac{1 - p_k}{1 - \bar{p}} & \text{if } \bar{\epsilon} > 0 \\ \displaystyle \frac{p_k}{\bar{p}} & \text{if } \bar{\epsilon} < 0. \end{cases} \end{equation}\] The scaling factor \(\alpha_k\) is irrelevant if \(\bar{\epsilon} = 0\), in which case no scaling is necessary.
When national-level error \(\bar{\epsilon}\) is positive, the \(\alpha_k\) are all non-negative, and \(\pi_k\) is always greater than or equal to \(p_k\). Conversely, when \(\bar{\epsilon}\) is negative, the \(\alpha_k\) are all non-positive, and \(\pi_k\) is always less than or equal to \(p_k\).
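As an illustration, the following sketch applies the scaling rule to some made-up unscaled estimates \(p_k\), weights \(w_k\), and benchmark value \(\bar{\pi}\); it is not the scale_prob() implementation.

```r
## Illustrative scaling step with made-up inputs (not the scale_prob() code).
p      <- c(0.70, 0.85, 0.78, 0.90)   # unscaled area-level estimates
w      <- c(0.40, 0.25, 0.20, 0.15)   # population weights (sum to 1)
pi_bar <- 0.88                        # accurate national-level benchmark

p_bar   <- sum(w * p)                 # national level implied by the p_k
eps_bar <- pi_bar - p_bar             # national-level error, @ref(eq:epsilon)

## Area-specific scaling factors: one form for upward, one for downward scaling
alpha <- if (eps_bar > 0) (1 - p) / (1 - p_bar) else p / p_bar
pi_scaled <- p + alpha * eps_bar

pi_scaled                             # scaled estimates, all within [0, 1]
sum(w * (pi_scaled - p))              # equals eps_bar, as required
```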
The scale_prob() estimator for \(\pi_k\) passes some basic sanity checks:
- when adjusting upwards, \(\pi_k\) never exceeds 1,
- when adjusting downwards, \(\pi_k\) is never less than 0, and
- when adjusting in either direction, the \(\epsilon_k\) always sum to \(\bar{\epsilon}\) as defined in @ref(eq:epsilon). (See the Appendix for details.)
The estimator also smooths the estimates, relative to the original \(p_k\), pulling them in towards a central value. When adjusting upwards, the size of each adjustment is proportional to \(1 - p_k\), implying that unusually low values get large adjustments, and unusually high values get small adjustments. When adjusting downwards, the size of each adjustment is proportional to \(p_k\), implying that unusually high values get large adjustments, and unusually low values get small adjustments. The adjustment process pulls extreme values towards the centre, which is reminiscent of the shrinkage that occurs under partial pooling. If we assume that being extreme is a symptom of measurement error, then this smoothing behaviour is, on average, appropriate.
Stratification
Often the data are also disaggregated by subpopulation. For instance, data disaggregated by age and sex might look like this:
Area | Age | Sex | All Children | Children attending school |
---|---|---|---|---|
1001 | 5-9 | Female | 14 | 10 |
1001 | 5-9 | Male | 16 | 13 |
1001 | 10-14 | Female | 28 | 21 |
1001 | 10-14 | Male | 25 | 18 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
1156 | 5-9 | Female | 15 | 11 |
1156 | 5-9 | Male | 13 | 13 |
1156 | 10-14 | Female | 17 | 5 |
1156 | 10-14 | Male | 13 | 8 |
The national-level prevalences required by scale_prob() might look like this:
Age | Sex | Percent of children attending school |
---|---|---|
5-9 | Female | 92 |
5-9 | Male | 91 |
10-14 | Female | 84 |
10-14 | Male | 82 |
smoothscale takes a simple approach to disaggregation: estimates are constructed independently for each combination of the stratifying variables. For instance, with the data above, estimates would be constructed separately for 5-9 year old females, 5-9 year old males, 10-14 year old females, and 10-14 year old males. Although there might be some gains in accuracy or stability from sharing information across combinations of variables, this would mean sacrificing simplicity.
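To illustrate the independent-by-stratum approach, here is a small base-R sketch. The data frame and the smooth_stratum() helper are hypothetical; for brevity the helper sets \(\lambda\) to the pooled stratum-level proportion and holds \(\nu\) fixed, rather than estimating both as in the earlier sketch.

```r
## Hypothetical example data: counts by area, age group, and sex.
dat <- data.frame(
  area = rep(c(1001, 1002), each = 4),
  age  = rep(c("5-9", "5-9", "10-14", "10-14"), times = 2),
  sex  = rep(c("Female", "Male"), times = 4),
  n    = c(14, 16, 28, 25, 30, 27, 22, 21),   # all children
  x    = c(10, 13, 21, 18, 26, 24, 17, 15)    # children attending school
)

## Hypothetical helper: closed-form smoothing within one stratum,
## with lambda set to the pooled proportion and nu held fixed.
smooth_stratum <- function(df, nu = 10) {
  lambda <- sum(df$x) / sum(df$n)
  df$pi_hat <- (df$x + lambda * nu) / (df$n + nu)
  df
}

## Split by age-sex combination, smooth each stratum independently, recombine.
pieces <- split(dat, interaction(dat$age, dat$sex, drop = TRUE))
result <- do.call(rbind, lapply(pieces, smooth_stratum))
```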
Combinations of sampling and measurement error
The derivation of the estimator used by smooth_prob() refers only to sampling error, and the derivation of the estimator used by scale_prob() refers only to measurement error. What if the data are subject to substantial sampling error and substantial measurement error?
In principle, there might be advantages to modelling sampling error and measurement error simultaneously. However, doing so would require extra assumptions about the nature of the measurement errors, and the ways that they interact with sample size, which would typically be difficult and somewhat arbitrary.
The recommended approach with smoothscale is instead to use smooth_prob() to deal with sampling error, and then use scale_prob() to deal with measurement error.
Appendix: Sanity checks for the scale_prob() estimator
When adjusting upwards, \(\pi_k\) is never greater than 1, since \[\begin{equation} \pi_k = p_k + \frac{1 - p_k}{1 - \bar{p}} (\bar{\pi} - \bar{p}) \le p_k + \frac{1 - p_k}{1 - \bar{p}} (1 - \bar{p}) = p_k + 1 - p_k = 1. \end{equation}\] Similarly, when adjusting downwards, \(\pi_k\) is never less than 0, since \[\begin{equation} \pi_k = p_k + \frac{p_k}{\bar{p}} (\bar{\pi} - \bar{p}) \ge p_k + \frac{p_k}{\bar{p}} (-\bar{p}) = p_k - p_k = 0. \end{equation}\] Moreover, when adjusting upwards, \[\begin{equation} \sum_{k=1}^K w_k \epsilon_k = \sum_{k=1}^K w_k \frac{1 - p_k}{1 - \bar{p}} \bar{\epsilon} = \frac{1 - \bar{p}}{1 - \bar{p}} \bar{\epsilon} = \bar{\epsilon}, \end{equation}\] and, when adjusting downwards, \[\begin{equation} \sum_{k=1}^K w_k \epsilon_k = \sum_{k=1}^K w_k \frac{p_k}{\bar{p}} \bar{\epsilon} = \frac{\bar{p}}{\bar{p}} \bar{\epsilon} = \bar{\epsilon}, \end{equation}\] as required.
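These properties can also be checked numerically. The sketch below uses made-up values with a benchmark close to 1, a case where naive proportional scaling \(p_k \bar{\pi} / \bar{p}\) would push the largest estimate above 1:

```r
## Numerical check of the three properties, using made-up values.
p      <- c(0.20, 0.60, 0.95, 0.99)   # unscaled estimates
w      <- rep(0.25, 4)                # equal population weights
pi_bar <- 0.995                       # benchmark close to 1

p_bar   <- sum(w * p)
eps_bar <- pi_bar - p_bar             # positive, so scale upwards
alpha   <- (1 - p) / (1 - p_bar)
pi_new  <- p + alpha * eps_bar

max(p * pi_bar / p_bar)                       # naive scaling: exceeds 1
max(pi_new) <= 1                              # TRUE: bounded above by 1
all(pi_new >= p)                              # TRUE: adjusts upwards only
all.equal(sum(w * (pi_new - p)), eps_bar)     # TRUE: errors sum to eps-bar
```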