Smooth Probabilities — smooth_prob • smoothscale

Calculate probabilities for multiple populations. Given data on the number of 'trials' and 'successes' in each population, calculate the probability of success in each population. For instance, given data on the number of respondents, and number of employed respondents, by area, calculate the probability of being employed in each area. The probabilities are smoothed: the values are shifted towards the overall mean, with values that are based on small sample sizes being shifted further than values that are based on large sample sizes.

Usage

smooth_prob(x, size, prior_cases = 10)

Arguments

x: Number of successes in each population. A numeric vector.
size: Number of trials in each population. A numeric vector.
prior_cases: Parameter controlling smoothing. Default is 10.

Value

A numeric vector with smoothed probabilities.
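
As a minimal sketch of a call, assuming the smoothscale package is installed: the counts below are invented for illustration and are not taken from the package's data. The call is guarded so that the code still runs when the package is not available.

```r
## Hypothetical counts for three areas (invented for illustration)
successes <- c(2, 30, 150)
trials <- c(5, 100, 600)

## Direct (unsmoothed) estimates, for comparison
direct <- successes / trials

## Call smooth_prob() only if smoothscale is available
if (requireNamespace("smoothscale", quietly = TRUE)) {
  smoothed <- smoothscale::smooth_prob(x = successes, size = trials)
  print(rbind(direct, smoothed))
}
```

The result has one element per population, in the same order as x and size.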

Stratifying

It is often appropriate to stratify a population and smooth separately within each stratum. For instance, when estimating probabilities of being employed, it may be appropriate to divide the population into strata defined by age and sex.

The easiest way to do stratified smoothing is to use grouped data frames. See below for an example.

`prior_cases`

The argument prior_cases controls the degree of smoothing. It only has a noticeable effect when the number of areas and the population per area are both small. Larger values for prior_cases produce more smoothing.
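
One way to build intuition for why the population per area matters is a simplified calculation that treats the prior parameters as known. Under the model given in Mathematical details, if $\lambda$ and $\nu$ were fixed, the posterior mean of $\pi_k$ would be $(x_k + \lambda \nu) / (n_k + \nu)$, a weighted average of the direct estimate $x_k / n_k$ and $\lambda$, with weight $\nu / (n_k + \nu)$ on $\lambda$. The sketch below tabulates that weight for a few hypothetical values. It is not what smooth_prob() computes: smooth_prob() estimates $\lambda$ and $\nu$ from the data, and this argument only sets the prior for $\nu$. The helper shrink_weight is a name made up for this illustration.

```r
## Weight placed on the overall level when nu is treated as fixed
## (a simplification: smooth_prob() estimates nu rather than fixing it)
shrink_weight <- function(n, nu) nu / (n + nu)

n <- c(5, 50, 500, 5000)   # hypothetical numbers of trials per area
nu <- c(1, 10, 100)        # hypothetical values for nu

## Rows are values of n, columns are values of nu
weights <- outer(n, nu, shrink_weight)
dimnames(weights) <- list(n = n, nu = nu)
round(weights, 3)
```

In each column the weight falls as n grows, so areas with many trials stay close to their direct estimates whatever the value of $\nu$.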

Mathematical details

The smoothing is based on the model

$$x_k \sim \text{Binom}(n_k, \pi_k)$$

$$\pi_k \sim \text{Beta}(\lambda \nu, (1 - \lambda) \nu)$$

$$\lambda \sim \text{Unif}(0, 1)$$

$$\nu \sim \text{LogNormal}(\log M, 1)$$

where

$k$ indexes area or population,
$x_k$ is the number of successes, which is specified by argument x,
$n_k$ is the number of trials, which is specified by argument size,
$\pi_k$ is the probability of success, and
$M$ controls smoothing, and can be specified by argument prior_cases.

smooth_prob() returns $\hat{\pi}_k$, the maximum posterior density estimate of $\pi_k$.

The "direct" (unsmoothed) estimate of the probability of success is $x_k / n_k$.

For details on the model, see the vignette Statistical Models used for Smoothing and Scaling.
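
To make the model concrete, the following sketch draws one synthetic dataset from it using base R only. The number of areas K, the sample sizes, and the seed are arbitrary choices made for illustration, and M is set to 10 to match the default smoothing value. This simulates from the model; it does not reproduce the fitting done inside smooth_prob().

```r
set.seed(1)

K <- 20                              # number of areas (hypothetical)
M <- 10                              # smoothing parameter in the prior for nu
n <- rep(c(10, 1000), each = K / 2)  # half small areas, half large areas

## Draw the hyperparameters, then the area-level probabilities
lambda <- runif(1)
nu <- rlnorm(1, meanlog = log(M), sdlog = 1)
pi_k <- rbeta(K, lambda * nu, (1 - lambda) * nu)

## Draw the observed successes and form the direct estimates
x <- rbinom(K, size = n, prob = pi_k)
direct <- x / n

## Mean absolute error of the direct estimates, by area size
tapply(abs(direct - pi_k), n, mean)
```

Comparing the two groups shows how much noisier direct estimates tend to be in small areas, which is the situation smoothing is designed to address.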

Examples

## use synthetic census data
census <- smoothscale::syn_census

## smooth all groups towards the national level
smoothed <- smooth_prob(x = census$child_labour,
                        size = census$all_children)
smoothed
#>   [1] 0.3549757 0.3483956 0.2386031 0.1411431 0.2486301 0.2904461 0.2811255
#>   [8] 0.2828551 0.1052609 0.2768318 0.1608094 0.2004207 0.2446564 0.2304158
#>  [15] 0.2110614 0.2561533 0.1905153 0.2172556 0.2017625 0.2611232 0.2173088
#>  [22] 0.2145986 0.2994043 0.2539228 0.1639726 0.4176366 0.2514717 0.2754479
#>  [29] 0.1457469 0.2343583 0.3653098 0.2452549 0.2778678 0.1686767 0.2898178
#>  [36] 0.1665318 0.2057348 0.2526217 0.2426039 0.2149188 0.1887129 0.3004677
#>  [43] 0.2658116 0.1625439 0.1875074 0.1442691 0.2001372 0.2509286 0.2799954
#>  [50] 0.2273524 0.4307438 0.4802877 0.3179332 0.2053953 0.3320846 0.2904461
#>  [57] 0.2732541 0.3250327 0.2609691 0.3508330 0.1937606 0.2262503 0.1815952
#>  [64] 0.2304158 0.3077062 0.3558618 0.3045517 0.2452549 0.2362838 0.2526217
#>  [71] 0.2760318 0.2207024 0.2535269 0.2890901 0.2109253 0.5185639 0.4981704
#>  [78] 0.4011522 0.1990179 0.2697980 0.3160477 0.2811255 0.3643850 0.2644619
#>  [85] 0.4062783 0.2412001 0.1792203 0.3077176 0.2073400 0.2747700 0.2989522
#>  [92] 0.2973610 0.2930483 0.1985837 0.2526217 0.2430911 0.2759489 0.3158740
#>  [99] 0.3851219 0.2584483

## compare smoothed and unsmoothed ("direct") estimates
unsmoothed <- census$child_labour / census$all_children
rbind(head(smoothed), head(unsmoothed))
#>           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
#> [1,] 0.3549757 0.3483956 0.2386031 0.1411431 0.2486301 0.2904461
#> [2,] 0.3602151 0.4000000 0.2371134 0.1333333 0.2450980 0.4000000

## use tidyverse functions to smooth
## each age-sex group towards a
## different average
library(dplyr, warn.conflicts = FALSE)
census |>
  group_by(age, sex) |>
  mutate(smoothed = smooth_prob(x = child_labour,
                                size = all_children))
#> # A tibble: 100 × 6
#> # Groups:   age, sex [4]
#>    area    age   sex    child_labour all_children smoothed
#>    <chr>   <chr> <chr>         <int>        <dbl>    <dbl>
#>  1 Area 01 5-9   Female          134          372   0.351 
#>  2 Area 02 5-9   Female           14           35   0.325 
#>  3 Area 03 5-9   Female           92          388   0.236 
#>  4 Area 04 5-9   Female           46          345   0.140 
#>  5 Area 05 5-9   Female           25          102   0.241 
#>  6 Area 06 5-9   Female            2            5   0.252 
#>  7 Area 07 5-9   Female            4           13   0.252 
#>  8 Area 08 5-9   Female           10           34   0.264 
#>  9 Area 09 5-9   Female            2           52   0.0995
#> 10 Area 10 5-9   Female          578         2087   0.276 
#> # ℹ 90 more rows