Introduction
bage (= “Bayesian” + “age”) implements Bayesian hierarchical models for rates, probabilities, and means, where the rates, probabilities and means are classified by variables such as age, sex, and region. Models in bage can be used for estimation and for forecasting.
The models in bage are hierarchical in that they are built out of smaller submodels. A model for mortality rates, for instance, might contain submodels describing how rates vary over age, sex, and time.
Internally, bage draws on package TMB to fit the models. TMB is fast and can handle large datasets – characteristics that are inherited by bage.
This vignette introduces the main features of bage, using data on injuries.
Preliminaries
Packages
We begin by loading the packages that we will need for the analysis of the injuries data. Loading bage automatically loads package rvec, which contains functions for working with draws from probability distributions. Package poputils contains functions for working with demographic data. Packages dplyr and tidyr are core tidyverse packages for manipulating data. We use ggplot2 to graph inputs and outputs.
library(bage)
#> Loading required package: rvec
#>
#> Attaching package: 'rvec'
#> The following objects are masked from 'package:stats':
#>
#> sd, var
#> The following object is masked from 'package:base':
#>
#> rank
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) :
#> object 'type_sum.accel' not found
library(poputils)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
Data
We analyse a dataset called nzl_injuries
that is
included in the bage package. The dataset contains
counts of fatal injuries and population in New Zealand, classified by
age, sex, ethnicity, and year.
head(nzl_injuries)
#> # A tibble: 6 × 6
#> age sex ethnicity year injuries popn
#> <fct> <chr> <chr> <int> <int> <int>
#> 1 0-4 Female Maori 2000 12 35830
#> 2 5-9 Female Maori 2000 6 35120
#> 3 10-14 Female Maori 2000 3 32830
#> 4 15-19 Female Maori 2000 6 27130
#> 5 20-24 Female Maori 2000 6 24380
#> 6 25-29 Female Maori 2000 6 24160
Specifying a model
Functions for specifying models
We specify a model in which, within each combination of age, sex,
ethnicity, and year, injuries are treated as a random draw from a
Poisson distribution. The function used to specify a Poisson model is
mod_pois()
. Binomial models can be specified with function
mod_binom()
and normal models with function
mod_norm()
.
The first argument to mod_pois()
is an R-style formula
identifying the outcome variable and the combinations of predictor
variables to be used in the model.
mod <- mod_pois(injuries ~ age * sex + age * ethnicity + year,
data = nzl_injuries,
exposure = popn)
Model structure
The resulting model has the following structure:
The number of injuries occurring within each combination of age, sex,
ethnicity, and time reflects (i) the population at risk and (ii) an
underlying rate that in bage is referred to as
.fitted
. The expected value for .fitted
is
obtained by summing up values for the intercept, age effect, sex effect,
and so forth. The actual value of .fitted
can diverge from
.expected
: the amount of divergence is governed by the
disp
(dispersion) parameter.
The model terms are all given “prior distributions”. A prior distribution is a submodel capturing features of the unknown quantity or quantities being estimated. Possible features include the range within which the quantity is likely to fall, or the amount of smoothness expected in series of values. Priors distributions are a distinctive feature of Bayesian methods.
Printing the model object
Printing a model object provides information on its structure:
mod
#>
#> ------ Unfitted Poisson model ------
#>
#> injuries ~ age * sex + age * ethnicity + year
#>
#> exposure = popn
#>
#> term prior along n_par n_par_free
#> (Intercept) NFix() - 1 1
#> age RW() age 12 12
#> sex NFix() - 2 2
#> ethnicity NFix() - 2 2
#> year RW() year 19 19
#> age:sex RW() age 24 24
#> age:ethnicity RW() age 24 24
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age var_sexgender
#> 1000 year age sex
The table in the the middle of the printout above shows the default prior distribution assigned to each term. We return to priors in Section @ref(sec:priors).
The bottom row of the printed object shows various model settings.
n_draw
is the number of random draws produced by extractor
functions, which we discuss in Section @ref(sec:outputs).
var_time
, var_age
, and
var_sexgender
are the names of the variables in
nzl_injuries
that represent time, age, and sex or gender.
If bage fails to correctly identify these variables,
they can be identified using functions such as
set_var_time()
.
Fitting a model
fit()
Function mod_pois()
specifies a model, but does not
actually estimate any of the unknown quantities in the model. For that,
we need function fit()
.
mod <- mod |>
fit()
Reprinting the model object
The printout for a fitted model differs from that of an unfitted model.
mod
#>
#> ------ Fitted Poisson model ------
#>
#> injuries ~ age * sex + age * ethnicity + year
#>
#> exposure = popn
#>
#> term prior along n_par n_par_free std_dev
#> (Intercept) NFix() - 1 1 -
#> age RW() age 12 12 0.71
#> sex NFix() - 2 2 0.11
#> ethnicity NFix() - 2 2 0.36
#> year RW() year 19 19 0.09
#> age:sex RW() age 24 24 0.43
#> age:ethnicity RW() age 24 24 0.13
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age var_sexgender optimizer
#> 1000 year age sex multi
#>
#> time_total time_optim time_report iter converged message
#> 0.60 0.24 0.22 13 TRUE relative convergence (4)
Among other things, a new row appears at the bottom of the printout, providing information about the fitting process. The most important value piece of information is whether the calculations have converged.
Extracting outputs
Extractor functions
Function fit()
derives values for the unknown
quantities, but does not return these quantities. For that, we need an
extractor function. The two most important extractor functions in
bage are augment()
and
components()
. Both of these are generic functions that work
on many sorts of R objects (see here and here).
augment()
augment()
returns the original dataset plus some
additional columns of estimated values,
aug <- mod |>
augment()
aug
#> # A tibble: 912 × 9
#> age sex ethnicity year injuries popn .observed
#> <fct> <chr> <chr> <int> <int> <int> <dbl>
#> 1 0-4 Female Maori 2000 12 35830 0.000335
#> 2 5-9 Female Maori 2000 6 35120 0.000171
#> 3 10-14 Female Maori 2000 3 32830 0.0000914
#> 4 15-19 Female Maori 2000 6 27130 0.000221
#> 5 20-24 Female Maori 2000 6 24380 0.000246
#> 6 25-29 Female Maori 2000 6 24160 0.000248
#> 7 30-34 Female Maori 2000 12 22560 0.000532
#> 8 35-39 Female Maori 2000 3 22230 0.000135
#> 9 40-44 Female Maori 2000 6 18130 0.000331
#> 10 45-49 Female Maori 2000 6 13770 0.000436
#> # ℹ 902 more rows
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>
The columns of estimated values are .observed
,
.fitted
, and .expected
.
aug |>
select(.observed, .fitted, .expected)
#> # A tibble: 912 × 3
#> .observed .fitted .expected
#> <dbl> <rdbl<1000>> <rdbl<1000>>
#> 1 0.000335 3e-04 (0.00023, 0.00038) 0.00029 (0.00026, 0.00034)
#> 2 0.000171 7.8e-05 (5.9e-05, 0.00011) 7.5e-05 (6.1e-05, 9e-05)
#> 3 0.0000914 1e-04 (7.5e-05, 0.00013) 1e-04 (8.6e-05, 0.00012)
#> 4 0.000221 0.00041 (0.00032, 0.00053) 0.00045 (4e-04, 0.00051)
#> 5 0.000246 0.00038 (0.00029, 0.00049) 4e-04 (0.00036, 0.00046)
#> 6 0.000248 0.00035 (0.00026, 0.00044) 0.00037 (0.00032, 0.00042)
#> 7 0.000532 0.00038 (0.00029, 0.00048) 0.00036 (0.00032, 0.00042)
#> 8 0.000135 0.00031 (0.00024, 4e-04) 0.00034 (3e-04, 0.00038)
#> 9 0.000331 0.00033 (0.00025, 0.00042) 0.00033 (0.00029, 0.00038)
#> 10 0.000436 0.00031 (0.00024, 0.00041) 3e-04 (0.00026, 0.00035)
#> # ℹ 902 more rows
In a Poisson model with exposures,
-
.observed
is a “direct” estimate of the rate, obtained by dividing the the outcome variable by the population at risk; -
.expected
is a model-based estimate of the rate, based purely on model predictors (eg age, sex, time); and -
.fitted
is a model-based estimate of the rate that is a compromise between.observed
and.expected
.
components()
components()
is used to extract values for higher-level
parameters. It returns values for all the parameters.
comp <- mod |>
components()
comp
#> # A tibble: 89 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 (Intercept) effect (Intercept) -1.7 (-3.4, 0.051)
#> 2 age effect 0-4 -1.7 (-3.4, 0.005)
#> 3 age effect 5-9 -3 (-4.6, -1.2)
#> 4 age effect 10-14 -2.5 (-4.2, -0.76)
#> 5 age effect 15-19 -0.87 (-2.6, 0.89)
#> 6 age effect 20-24 -0.77 (-2.5, 1.1)
#> 7 age effect 25-29 -0.9 (-2.6, 1)
#> 8 age effect 30-34 -0.96 (-2.7, 0.95)
#> 9 age effect 35-39 -0.99 (-2.7, 0.92)
#> 10 age effect 40-44 -0.99 (-2.7, 0.93)
#> # ℹ 79 more rows
A little bit of manipulation is required to extract and graph the parameter of interest,
age_effect <- comp |>
filter(term == "age",
component == "effect") |>
select(age = level, .fitted)
age_effect
#> # A tibble: 12 × 2
#> age .fitted
#> <chr> <rdbl<1000>>
#> 1 0-4 -1.7 (-3.4, 0.005)
#> 2 5-9 -3 (-4.6, -1.2)
#> 3 10-14 -2.5 (-4.2, -0.76)
#> 4 15-19 -0.87 (-2.6, 0.89)
#> 5 20-24 -0.77 (-2.5, 1.1)
#> 6 25-29 -0.9 (-2.6, 1)
#> 7 30-34 -0.96 (-2.7, 0.95)
#> 8 35-39 -0.99 (-2.7, 0.92)
#> 9 40-44 -0.99 (-2.7, 0.93)
#> 10 45-49 -1.1 (-2.8, 0.91)
#> 11 50-54 -1 (-2.8, 0.94)
#> 12 55-59 -0.97 (-2.8, 0.95)
Posterior samples
The output from a Bayesian model typically consists of draws from the “posterior distribution” for unknown quantities. The posterior distribution is a probability distribution that describes what we know about the unknown quantities in the model after applying our combining model assumptions (including priors) and data.
Rvecs
Draws from probability distributions can be awkward to work with, so
bage uses a special type of vector, called an “rvec”,
implemented by package rvec. An rvec contains multiple
draws, but tries to behave as much as possible like a standard numerical
vector. The printout of .fitted
and .expected
in the aug
object above shows medians and 95% credible
intervals.
Graphing outputs
The best way to understand output from a fitted model it to graph it.
We first need to prepare data for the graph. We select values for 2018
and use the draws_ci()
function from rvec
to create 95% credible intervals.
data_plot <- aug |>
filter(year == 2018) |>
mutate(draws_ci(.fitted))
data_plot |>
select(starts_with(".fitted"))
#> # A tibble: 48 × 4
#> .fitted .fitted.lower .fitted.mid .fitted.upper
#> <rdbl<1000>> <dbl> <dbl> <dbl>
#> 1 0.00023 (0.00017, 0.00029) 0.000167 0.000228 0.000293
#> 2 5.8e-05 (4.1e-05, 7.7e-05) 0.0000410 0.0000580 0.0000769
#> 3 8.7e-05 (6.3e-05, 0.00012) 0.0000633 0.0000873 0.000115
#> 4 0.00036 (0.00028, 0.00045) 0.000278 0.000360 0.000449
#> 5 0.00035 (0.00027, 0.00044) 0.000267 0.000345 0.000438
#> 6 0.00032 (0.00025, 0.00042) 0.000250 0.000318 0.000419
#> 7 0.00034 (0.00026, 0.00044) 0.000259 0.000339 0.000435
#> 8 0.00027 (2e-04, 0.00035) 0.000204 0.000273 0.000351
#> 9 0.00028 (0.00021, 0.00037) 0.000213 0.000282 0.000366
#> 10 0.00023 (0.00018, 0.00031) 0.000179 0.000235 0.000312
#> # ℹ 38 more rows
We then use ggplot()
to visualise the outputs. In the
figure below we use use geom_ribbon()
to plot credible
intervals, geom_line()
to plot model-based point estimates,
and geom_point()
to plot direct estimates.
ggplot(data_plot, aes(x = age_mid(age))) +
facet_grid(vars(sex), vars(ethnicity)) +
geom_ribbon(aes(ymin = .fitted.lower,
ymax = .fitted.upper),
fill = "lightblue") +
geom_line(aes(y = .fitted.mid),
color = "darkblue") +
geom_point(aes(y = .observed),
color = "red") +
xlab("age")
Priors
Priors in Bayesian models
In a Bayesian model, every unknown quantity needs a prior distribution. When the data provides lots of information on the value of an unknown quantity, the particular choice of prior distribution for that quantity does not normally have much effect on the ultimate estimates for the quantity. But when the data provides only limited information, different priors can lead to very different estimates. The choice of prior is particularly important when forecasting, since the data provides provides only indirect information about events in the future.
Current priors in bage
help(priors)
produces a list of the priors that have
been implemented so far in bage.
Examples of priors are:
-
NFix()
. Each element of the term being modeled is drawn from a normal distribution with mean0
, and standard deviationsd
. By default,sd
is 1. -
N()
. LikeNFix()
, butsd
is estimated from the data. -
RW()
. The elements of the term being modeled follow a random walk. This is appropriate for terms involving time and age, where neighboring elements are strongly correlated. -
AR1()
. First-order autoregressive process. Suitable for time series that revert to a mean value.
Defaults
bage uses the following rules to assign default priors to a model term:
- if the term has less than 3 elements, use
NFix()
; - otherwise, if the term involves time, use
RW()
, with time as the `along’ dimension; - otherwise, if the term involves age, use
RW()
, with age as the `along’ dimension; - otherwise, use
N()
.
Priors can be over-ridden using set_prior()
:
Replacing a prior deletes any existing estimates and returns a model to an ‘unfitted’ state.
is_fitted(mod)
#> [1] FALSE
So we re-fit the model.
mod <- mod |>
fit()
SVD-based priors
bage implements a special type of prior based on
applying a singular value decomposition (SVD) to data from international
databases. These SVD-based priors represent the age-sex patterns in
demographic processes such as fertility, mortality, and labor force
participation in a parsimonious way. The model below, for instance, uses
SVD-based priors for age effects and age-time interactions in a model of
regional fertility rates in Korea. Note that the number of free
parameters (denoted n_par_free
) for age
and
age:time
in the printout below is much less than the total
number of parameters (denoted n_par
) for age
and age:time
.
mod_births <- mod_pois(births ~ age * region + age * time,
data = kor_births,
exposure = popn) |>
set_prior(age ~ SVD(HFD)) |>
set_prior(age:time ~ SVD_RW(HFD)) |>
fit()
mod_births
#>
#> ------ Fitted Poisson model ------
#>
#> births ~ age * region + age * time
#>
#> exposure = popn
#>
#> term prior along n_par n_par_free std_dev
#> (Intercept) NFix() - 1 1 -
#> age SVD(HFD) - 9 3 1.66
#> region N() - 16 16 0.03
#> time RW() time 13 13 0.46
#> age:region RW() age 144 144 0.57
#> age:time SVD_RW(HFD) time 117 39 1.89
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age optimizer
#> 1000 time age multi
#>
#> time_total time_optim time_report iter converged message
#> 1.56 0.83 0.57 18 TRUE relative convergence (4)
Forecasting
Forecasts can be constructed by calling function
forecast()
on a model object. When output
is
"augment"
(the default), forecast()
produces
values like those produced by augment()
.
aug_forecast <- mod |>
forecast(labels = 2019:2028)
names(aug_forecast)
#> [1] "age" "sex" "ethnicity" "year" "injuries" "popn"
#> [7] ".observed" ".fitted" ".expected"
When output
is "components"
,
forecast()
produces values like those produced by
components()
.
comp_forecast <- mod |>
forecast(labels = 2019:2028,
output = "components")
comp_forecast
#> # A tibble: 10 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 year effect 2019 -0.089 (-0.38, 0.17)
#> 2 year effect 2020 -0.088 (-0.44, 0.24)
#> 3 year effect 2021 -0.081 (-0.45, 0.31)
#> 4 year effect 2022 -0.068 (-0.51, 0.33)
#> 5 year effect 2023 -0.063 (-0.53, 0.38)
#> 6 year effect 2024 -0.062 (-0.55, 0.41)
#> 7 year effect 2025 -0.056 (-0.56, 0.44)
#> 8 year effect 2026 -0.043 (-0.56, 0.44)
#> 9 year effect 2027 -0.032 (-0.6, 0.46)
#> 10 year effect 2028 -0.022 (-0.56, 0.45)
When include_estimates
is TRUE
, the return
value includes historical estimates. This is useful for plotting.
data_forecast <- mod |>
fit() |>
forecast(labels = 2019:2028,
include_estimates = TRUE) |>
filter(sex == "Female",
age %in% c("10-14", "25-29", "40-44")) |>
mutate(draws_ci(.fitted))
ggplot(data_forecast, aes(x = year)) +
facet_grid(vars(age), vars(ethnicity)) +
geom_ribbon(aes(ymin = .fitted.lower,
ymax = .fitted.upper),
fill = "lightblue") +
geom_line(aes(y = .fitted.mid),
color = "darkblue") +
geom_point(aes(y = .observed),
color = "red")
#> Warning: Removed 60 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Imputation
bage automatically accommodates missing values in
the outcome variables. We illustrate with a version of the injuries
dataset where values for 2010–2014 are set to NA
.
years_mis <- 2010:2014
injuries_mis <- nzl_injuries |>
mutate(injuries = if_else(year %in% years_mis, NA, injuries))
We fit our exactly the same model that we use for the complete dataset.
mod_mis <- mod_pois(injuries ~ age * sex + age * ethnicity + year,
data = injuries_mis,
exposure = popn) |>
fit()
bage creates a new variable, called
.injuries
containing imputed values for the missing
outcomes.
mod_mis |>
augment() |>
filter(year %in% years_mis)
#> # A tibble: 240 × 10
#> age sex ethnicity year injuries .injuries popn .observed
#> <fct> <chr> <chr> <int> <int> <rdbl<1000>> <int> <dbl>
#> 1 0-4 Female Maori 2010 NA 11 (5, 20) 43440 NA
#> 2 5-9 Female Maori 2010 NA 3 (0, 7) 36340 NA
#> 3 10-14 Female Maori 2010 NA 3 (0, 8) 35520 NA
#> 4 15-19 Female Maori 2010 NA 14 (7, 23) 34960 NA
#> 5 20-24 Female Maori 2010 NA 12 (5, 19) 31060 NA
#> 6 25-29 Female Maori 2010 NA 8 (3, 15) 24000 NA
#> 7 30-34 Female Maori 2010 NA 8 (3, 14) 23180 NA
#> 8 35-39 Female Maori 2010 NA 7 (2, 13) 24260 NA
#> 9 40-44 Female Maori 2010 NA 7 (2, 13) 22660 NA
#> 10 45-49 Female Maori 2010 NA 6 (2, 12) 21730 NA
#> # ℹ 230 more rows
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>
Rates estimates in years where the outcome is missing have wider credible intervals than rates estimates in years where the outcome is observed.
data_plot_mis <- mod_mis |>
augment() |>
filter(age == "20-24") |>
mutate(draws_ci(.fitted))
ggplot(data_plot_mis, aes(x = year)) +
facet_grid(vars(sex), vars(ethnicity)) +
geom_ribbon(aes(ymin = .fitted.lower,
ymax = .fitted.upper),
fill = "lightblue") +
geom_line(aes(y = .fitted.mid),
color = "darkblue") +
geom_point(aes(y = .observed),
color = "red") +
xlab("age")
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Data models
So far we have ignored the possibility that the input data for our
models might be subject to measurement errors. However,
injuries
variable in the nzl_injuries
dataset
is subject to a special sort of error: Statistics New Zealand has
randomly-rounded the values to base 3 to protect confidentiality.
To deal with measurement error, we add a ‘data model’ to our model.
mod <- mod |>
set_datamod_outcome_rr3() |>
fit()
The results from calling augment()
now include a
variable called .injuries
with estimated values for the
true, unrounded injury counts.
mod |>
augment()
#> # A tibble: 912 × 10
#> age sex ethnicity year injuries .injuries popn .observed
#> <fct> <chr> <chr> <int> <int> <rdbl<1000>> <int> <dbl>
#> 1 0-4 Female Maori 2000 12 12 (10, 14) 35830 0.000335
#> 2 5-9 Female Maori 2000 6 5 (4, 8) 35120 0.000171
#> 3 10-14 Female Maori 2000 3 3 (1, 5) 32830 0.0000914
#> 4 15-19 Female Maori 2000 6 7 (5, 8) 27130 0.000221
#> 5 20-24 Female Maori 2000 6 6 (4, 8) 24380 0.000246
#> 6 25-29 Female Maori 2000 6 6 (4, 8) 24160 0.000248
#> 7 30-34 Female Maori 2000 12 11 (10, 14) 22560 0.000532
#> 8 35-39 Female Maori 2000 3 4 (2, 5) 22230 0.000135
#> 9 40-44 Female Maori 2000 6 6 (4, 8) 18130 0.000331
#> 10 45-49 Female Maori 2000 6 6 (4, 8) 13770 0.000436
#> # ℹ 902 more rows
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>
Model checking
Replicate data
A standard Bayesian approach to checking a model is to use the model
to generate simulate data and see if the simulated data looks like the
actual data. Function replicate_data()
creates multiple
sets of simulated data.
rep_data <- mod |>
replicate_data()
rep_data
#> # A tibble: 18,240 × 7
#> .replicate age sex ethnicity year injuries popn
#> <fct> <fct> <chr> <chr> <int> <dbl> <int>
#> 1 Original 0-4 Female Maori 2000 12 35830
#> 2 Original 5-9 Female Maori 2000 6 35120
#> 3 Original 10-14 Female Maori 2000 3 32830
#> 4 Original 15-19 Female Maori 2000 6 27130
#> 5 Original 20-24 Female Maori 2000 6 24380
#> 6 Original 25-29 Female Maori 2000 6 24160
#> 7 Original 30-34 Female Maori 2000 12 22560
#> 8 Original 35-39 Female Maori 2000 3 22230
#> 9 Original 40-44 Female Maori 2000 6 18130
#> 10 Original 45-49 Female Maori 2000 6 13770
#> # ℹ 18,230 more rows
Comparing full datasets is difficult, so the usual strategy is to calculate summary measures that capture some important feature of the data, and compare those instead. Here we see if the model is properly capturing male-female differences in injury rates.
sex_ratio <- rep_data |>
count(.replicate, year, sex, wt = injuries) |>
pivot_wider(names_from = sex, values_from = n) |>
mutate(ratio = Male / Female)
sex_ratio
#> # A tibble: 380 × 5
#> .replicate year Female Male ratio
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 Original 2000 279 873 3.13
#> 2 Original 2001 276 846 3.07
#> 3 Original 2002 303 855 2.82
#> 4 Original 2003 330 873 2.65
#> 5 Original 2004 306 840 2.75
#> 6 Original 2005 300 876 2.92
#> 7 Original 2006 291 828 2.85
#> 8 Original 2007 321 843 2.63
#> 9 Original 2008 339 864 2.55
#> 10 Original 2009 303 900 2.97
#> # ℹ 370 more rows
We graph the results and see if the original data looks like it was drawn from the same underlying distribution as the simulated data.
ggplot(sex_ratio, aes(x = year, y = ratio)) +
facet_wrap(vars(.replicate)) +
geom_line()
Simulation studies
A simulation study, where we create the data ourselves, and hence
know what the underlying true values are, can be a useful way of
assessing model performance. In the example below, we use function
report_sim()
to perform a simple simulation study where the
true population is generated using a first-order random walk, but the
estimation model assumes that the population is generated using a
second-order random walk.
set.seed(0)
## Create simulated data
fake_data <- data.frame(year = 2001:2010,
population = NA)
## Define the true data-generating model
mod_rw <- mod_pois(population ~ year,
data = fake_data,
exposure = 1) |>
set_prior(`(Intercept)` ~ NFix(sd = 0.1)) |>
set_prior(year ~ RW(s = 0.1, sd = 0.1))
## Define the estimation model
mod_rw2 <- mod_pois(population ~ year,
data = fake_data,
exposure = 1) |>
set_prior(year ~ RW2())
## Run the simulation
report_sim(mod_est = mod_rw2, mod_sim = mod_rw)
#> $components
#> # A tibble: 3 × 7
#> term component .error .cover_50 .cover_95 .length_50 .length_95
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) effect -0.0311 0.95 1 1.13 3.34
#> 2 year effect -0.225 0.779 1 1.51 4.50
#> 3 disp disp -0.0280 0.48 0.87 1.17 4.92
#>
#> $augment
#> # A tibble: 2 × 7
#> .var .observed .error .cover_50 .cover_95 .length_50 .length_95
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 .fitted 1.08 -0.209 0.522 0.965 0.885 2.97
#> 2 .expected 1.08 0.000438 0.59 0.993 1.06 9.90
Prior predictive checks
Another way of gaining insights about a model is to look at estimates
based purely on the priors, without using data on the outcome variable.
In bage, this can be done by calling
augment()
, components()
or
forecast()
on an unfitted version of the model.
mod |>
unfit() |>
components()
#> ℹ Model not fitted, so values drawn straight from prior distribution.
#> # A tibble: 90 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 (Intercept) effect (Intercept) 0.061 (-2.1, 1.9)
#> 2 age effect 0-4 -0.0028 (-1.9, 1.9)
#> 3 age effect 5-9 0.036 (-2.6, 2.6)
#> 4 age effect 10-14 -0.046 (-3.4, 3.5)
#> 5 age effect 15-19 -0.039 (-4.4, 3.9)
#> 6 age effect 20-24 -0.1 (-5.3, 4.8)
#> 7 age effect 25-29 -0.057 (-5.5, 5.3)
#> 8 age effect 30-34 -0.016 (-5.7, 5.8)
#> 9 age effect 35-39 -0.058 (-6.2, 6.8)
#> 10 age effect 40-44 -0.11 (-6.5, 7.2)
#> # ℹ 80 more rows
Future development of bage
Future features
bage is a new package, and still under very active development. Some features that we intend to add are:
- Covariates The ability to add covariates as predictors in a model, eg data on regional incomes in a model of regional mortality.
- Data models Additional data models, eg models that allow for under-reporting of events, or errors in the exposure variable.
- Priors More options for priors, eg a damped linear trend.
- Sets of priors Pre-specified collections of priors for specific purposes such as modelling fertility rates
- Documentation More vignettes and examples.
- Model choice Tools for model comparison and model choice
Experimental status
bage currently has an life cycle badge, to warn users that some features of the bage interface, such as function arguments, are still evolving. From version 0.9.0 onwards, however, we will use a formal deprecation process to manage any changes to existing function arguments or defaults.
Bug reports and feature requests
We would be grateful for bug reports or suggestions for features. The best way to do so is to raise an issue on the bage GitHub repository.