Smooth Probabilities
Calculate probabilities for multiple populations. Given data on the number of 'trials' and 'successes' in each population, calculate the probability of success in each population. For instance, given data on the number of respondents, and number of employed respondents, by area, calculate the probability of being employed in each area. The probabilities are smoothed: the values are shifted towards the overall mean, with values that are based on small sample sizes being shifted further than values that are based on large sample sizes.
Arguments
- x
Number of successes in each population. A numeric vector.
- size
Number of trials in each population. A numeric vector.
- prior_cases
Parameter controlling smoothing. Default is 10.
Stratifying
It is often appropriate to stratify a population and smooth separately within each stratum. For instance, when estimating probabilities of being employed, it may be appropriate to divide the population into strata defined by age and sex.
The easiest way to do stratified smoothing is to use grouped data frames. See below for an example.
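The same idea can also be sketched in base R, by splitting row indices by stratum and smoothing each piece separately. The sketch below is illustrative only: toy_smooth() is a hypothetical stand-in that shrinks towards the pooled proportion of whatever data it receives, not the model used by smooth_prob(), and the data frame is made up.

```r
## Hypothetical stand-in for a smoother (NOT smooth_prob()):
## shrink each direct estimate towards the pooled proportion
## of the data passed in, treating 'nu' as a prior sample size.
toy_smooth <- function(x, size, nu = 10) {
  pooled <- sum(x) / sum(size)
  (x + pooled * nu) / (size + nu)
}

## made-up data: two strata with very different success rates
df <- data.frame(stratum = c("young", "young", "old", "old"),
                 x       = c(3, 30, 5, 160),
                 size    = c(10, 300, 10, 200))

## split row indices by stratum, smooth within each stratum,
## then put the results back in the original row order
rows <- split(seq_len(nrow(df)), df$stratum)
pieces <- lapply(rows, function(i) toy_smooth(df$x[i], df$size[i]))
df$smoothed <- unsplit(pieces, df$stratum)
df
```

Because each stratum is smoothed on its own, the small "young" area is pulled towards the low "young" average and the small "old" area towards the high "old" average, rather than both being pulled towards a single overall mean.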
prior_cases
The argument prior_cases controls the degree of smoothing. It only has a noticeable effect when the number of areas and the population per area are both small. Larger values for prior_cases produce more smoothing.
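A toy calculation may make this behaviour more concrete. The sketch below is not the algorithm used by smooth_prob(): it assumes a fixed prior mean and a fixed prior sample size, whereas the package estimates the corresponding quantities from the data. The helper shrink_fixed() and its arguments prior_mean and prior_size are hypothetical names introduced for illustration.

```r
## Hypothetical helper (not part of smoothscale): shrink direct
## estimates towards 'prior_mean', treating 'prior_size' as the
## number of pseudo-observations carried by the prior.
shrink_fixed <- function(x, size, prior_mean, prior_size) {
  (x + prior_mean * prior_size) / (size + prior_size)
}

x <- c(2, 200)       # successes in a small and a large area
size <- c(5, 500)    # trials in the same two areas
direct <- x / size   # both direct estimates equal 0.4

weak <- shrink_fixed(x, size, prior_mean = 0.25, prior_size = 10)
strong <- shrink_fixed(x, size, prior_mean = 0.25, prior_size = 100)

## distance moved away from the direct estimate
shift_weak <- abs(weak - direct)
shift_strong <- abs(strong - direct)
round(rbind(direct, weak, strong), 3)
```

In this toy setup the small area moves much further than the large one, and raising prior_size increases the shift for both areas.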
Mathematical details
The smoothing is based on the model
$$x_k \sim \text{Binom}(n_k, \pi_k)$$
$$\pi_k \sim \text{Beta}(\lambda \nu, (1 - \lambda) \nu)$$
$$\lambda \sim \text{Unif}(0, 1)$$
$$\nu \sim \text{LogNormal}(\log M, 1)$$
where
- \(k\) indexes area or population,
- \(x_k\) is the number of successes, which is specified by argument x,
- \(n_k\) is the number of trials, which is specified by argument size,
- \(\pi_k\) is the probability of success, and
- \(M\) controls smoothing, and can be specified by argument prior_cases.
smooth_prob() returns \(\hat{\pi}_k\), the maximum posterior density estimate of \(\pi_k\).
The "direct" (unsmoothed) estimate of the probability of success is \(x_k / n_k\).
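A short derivation, conditional on fixed values of \(\lambda\) and \(\nu\), may help show why estimates based on small samples move further. This is only the standard beta-binomial conjugacy result: smooth_prob() also estimates \(\lambda\) and \(\nu\), so its output need not equal the expression below. The weight \(w_k\) is notation introduced here for illustration.
$$\pi_k \mid x_k, \lambda, \nu \sim \text{Beta}(x_k + \lambda \nu, \; n_k - x_k + (1 - \lambda) \nu)$$
$$\text{E}[\pi_k \mid x_k, \lambda, \nu] = \frac{x_k + \lambda \nu}{n_k + \nu} = w_k \frac{x_k}{n_k} + (1 - w_k) \lambda, \quad w_k = \frac{n_k}{n_k + \nu}$$
The conditional mean is a weighted average of the direct estimate \(x_k / n_k\) and the prior mean \(\lambda\). When \(n_k\) is small relative to \(\nu\), the weight \(w_k\) is close to 0 and the estimate sits near \(\lambda\). When \(n_k\) is large, \(w_k\) is close to 1 and the estimate stays near the direct estimate.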
For details on the model, see the vignette Statistical Models used for Smoothing and Scaling.
Examples
## use synthetic census data
census <- smoothscale::syn_census
## smooth all groups towards the national level
smoothed <- smooth_prob(x = census$child_labour,
                        size = census$all_children)
smoothed
#> [1] 0.3549757 0.3483956 0.2386031 0.1411431 0.2486301 0.2904461 0.2811255
#> [8] 0.2828551 0.1052609 0.2768318 0.1608094 0.2004207 0.2446564 0.2304158
#> [15] 0.2110614 0.2561533 0.1905153 0.2172556 0.2017625 0.2611232 0.2173088
#> [22] 0.2145986 0.2994043 0.2539228 0.1639726 0.4176366 0.2514717 0.2754479
#> [29] 0.1457469 0.2343583 0.3653098 0.2452549 0.2778678 0.1686767 0.2898178
#> [36] 0.1665318 0.2057348 0.2526217 0.2426039 0.2149188 0.1887129 0.3004677
#> [43] 0.2658116 0.1625439 0.1875074 0.1442691 0.2001372 0.2509286 0.2799954
#> [50] 0.2273524 0.4307438 0.4802877 0.3179332 0.2053953 0.3320846 0.2904461
#> [57] 0.2732541 0.3250327 0.2609691 0.3508330 0.1937606 0.2262503 0.1815952
#> [64] 0.2304158 0.3077062 0.3558618 0.3045517 0.2452549 0.2362838 0.2526217
#> [71] 0.2760318 0.2207024 0.2535269 0.2890901 0.2109253 0.5185639 0.4981704
#> [78] 0.4011522 0.1990179 0.2697980 0.3160477 0.2811255 0.3643850 0.2644619
#> [85] 0.4062783 0.2412001 0.1792203 0.3077176 0.2073400 0.2747700 0.2989522
#> [92] 0.2973610 0.2930483 0.1985837 0.2526217 0.2430911 0.2759489 0.3158740
#> [99] 0.3851219 0.2584483
## compare smoothed and unsmoothed ("direct") estimates
unsmoothed <- census$child_labour / census$all_children
rbind(head(smoothed), head(unsmoothed))
#>           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
#> [1,] 0.3549757 0.3483956 0.2386031 0.1411431 0.2486301 0.2904461
#> [2,] 0.3602151 0.4000000 0.2371134 0.1333333 0.2450980 0.4000000
## use tidyverse functions to smooth
## each age-sex group towards a
## different average
library(dplyr, warn.conflicts = FALSE)
census |>
  group_by(age, sex) |>
  mutate(smoothed = smooth_prob(x = child_labour,
                                size = all_children))
#> # A tibble: 100 × 6
#> # Groups:   age, sex [4]
#>    area    age   sex    child_labour all_children smoothed
#>    <chr>   <chr> <chr>         <int>        <dbl>    <dbl>
#>  1 Area 01 5-9   Female          134          372   0.351
#>  2 Area 02 5-9   Female           14           35   0.325
#>  3 Area 03 5-9   Female           92          388   0.236
#>  4 Area 04 5-9   Female           46          345   0.140
#>  5 Area 05 5-9   Female           25          102   0.241
#>  6 Area 06 5-9   Female            2            5   0.252
#>  7 Area 07 5-9   Female            4           13   0.252
#>  8 Area 08 5-9   Female           10           34   0.264
#>  9 Area 09 5-9   Female            2           52   0.0995
#> 10 Area 10 5-9   Female          578         2087   0.276
#> # ℹ 90 more rows