1 Introduction
Covariates can be used to extend standard bage models in two ways:
- incorporating information beyond what is contained in classifying variables, such as age, sex, or region; and
- dealing with subsets of the data with unusual behavior.
This vignette gives a brief description of covariates in bage and then illustrates their use with a case study of births in South Korea.
2 Mathematical details
A model in bage typically has a structure that is something like: \[\begin{align} y_{ast} & \sim \text{Poisson}(\gamma_{ast} w_{ast}) \\ \log \gamma_{ast} & \sim \text{Gamma}(\xi^{-1}, (\mu_{ast} \xi)^{-1}) \\ \log \mu_{ast} & = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} \tag{2.1} \end{align}\] where
- \(y_{ast}\) is the outcome for people in age group \(a\) and sex \(s\) during period \(t\);
- \(w_{ast}\) is exposure;
- \(\gamma_{ast}\) is a rate;
- \(\xi\) governs overall dispersion;
- \(\beta^{(0)}\) is an intercept;
- \(\pmb{\beta}^{\text{age:sex}}\) and \(\pmb{\beta}^{\text{time}}\) are terms formed the classifying variables age, sex, and time;
- \(\beta_{as}^{\text{age:sex}}\) and \(\pmb{\beta}_t^{\text{time}}\) are elements of these terms; and
- the intercept and the terms formed from the classifying variables all have prior distributions.
In a model with covariates, (2.1) changes to \[\begin{equation} \log \mu_{ast} = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} + (\pmb{Z} \pmb{\zeta})_{ast} \end{equation}\]
where
- \(\pmb{Z}\) is a \(N \times P\) covariate matrix;
- \(\pmb{\zeta}\) is a \(P\)-element vector of coefficients; and
- \((\pmb{Z} \pmb{\zeta})_{ast}\) is an element from the vector obtained by multiplying \(\pmb{Z}\) and \(\pmb{\zeta}\).
Change (2.2) to \[\begin{equation} \mu_i = \sum_{m=0}^M \beta_{j_i^m}^{(m)} + (\pmb{Z} \pmb{\eta})_i \tag{2.2} \end{equation}\]
where - \(\pmb{Z}\) is an \(I \times P\) matrix of covariates; and - \(\pmb{\eta}\) is a vector of coefficients.
The covariate matrix \(\pmb{Z}\) is derived from the raw covariate data by scaling any numeric variables to have mean 0 and standard deviation 1, and by converting any categorical variables to sets of indicator variables. The conversion to indicator variables follows the rules that R uses for “treatment” contrasts. If the categorical has \(C\) categories, then \(C-1\) indicator variabls are constructed, with the first category being omitted.
Each element of \(\pmb{\eta}\) has prior \[\begin{equation} \eta_p \sim \text{N}(0, 1) \end{equation}\]
3 Example: Births in South Korea
To illustrate the use of covariates, we will analyse data on births in South Korea.
3.1 Preliminaries
Besides bage itself, we use dplyr and vctrs for data manipulation, and ggplot2 for plotting results.
Our data is the data frame kor_births
, which is part of bage. The data frame gives numbers of births disaggregated by age of mother, region, and year. It also contains a numeric variable called gdp_pc_2023
that gives GDP per capita (in thousands of US dollars) in 2023, and a categorical variable (with levels "Low"
, "Medium"
, and "High"
) describing population density in 2020.
kor_births
#> # A tibble: 1,872 × 7
#> age region time births popn gdp_pc_2023 dens_2020
#> <chr> <fct> <int> <int> <int> <dbl> <chr>
#> 1 10-14 Busan 2011 1 89822 25.7 High
#> 2 10-14 Busan 2012 0 83884 25.7 High
#> 3 10-14 Busan 2013 0 79061 25.7 High
#> 4 10-14 Busan 2014 1 74741 25.7 High
#> 5 10-14 Busan 2015 0 68783 25.7 High
#> 6 10-14 Busan 2016 1 64905 25.7 High
#> 7 10-14 Busan 2017 1 64251 25.7 High
#> 8 10-14 Busan 2018 0 63249 25.7 High
#> 9 10-14 Busan 2019 0 62154 25.7 High
#> 10 10-14 Busan 2020 0 63498 25.7 High
#> # ℹ 1,862 more rows
3.2 Covariates that bring in extra information
The variables gdp_pc_2023
and dens_2020
are both examples of covariates that contribute extra information to the model, beyond what is contained in the outcome, exposure, and classifying variables.
We use function set_covariates()
to instruct mod_pois()
to treat these variables as covariates.
mod_gdp_dens <- mod_pois(births ~ (age + region + time)^2,
data = kor_births,
exposure = popn) |>
set_covariates(~ gdp_pc_2023 + dens_2020) |>
fit()
mod_gdp_dens
#>
#> ------ Fitted Poisson model ------
#>
#> births ~ (age + region + time)^2
#>
#> exposure = popn
#>
#> term prior along n_par n_par_free std_dev
#> (Intercept) NFix() - 1 1 -
#> age RW() age 9 9 2.64
#> region N() - 16 16 0.03
#> time RW() time 13 13 0.20
#> age:region RW() age 144 144 0.20
#> age:time RW() time 117 117 1.20
#> region:time RW() time 208 208 0.19
#>
#> covariates: ~gdp_pc_2023 + dens_2020
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age optimizer
#> 1000 time age nlminb
#>
#> time_total time_optim time_report iter converged message
#> 3.14 1.42 0.96 26 TRUE relative convergence (4)
To obtain estimates of the coefficients (ie estimates of the \(\zeta_p\)) we call function components()
and filter out rows for the "covariates"
term.
mod_gdp_dens |>
components() |>
filter(term == "covariates")
#> # A tibble: 3 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 covariates coef gdp_pc_2023 -0.0029 (-0.67, 0.67)
#> 2 covariates coef dens_2020Low -0.53 (-1.9, 0.85)
#> 3 covariates coef dens_2020Medium -0.77 (-2.1, 0.55)
3.3 Covariates that allow for unusual subsets
In East Asia, years with the Dragon zodiac sign sometimes have larger-than-usual numbers of births. To allow for this possibility, we create a dragon-year covariate, and incorporate it into a new model.
births <- kor_births |>
mutate(is_dragon_year = time == 2012)
mod_dragon <- mod_pois(births ~ (age + region + time)^2,
data = births,
exposure = popn) |>
set_covariates(~ is_dragon_year) |>
fit()
mod_dragon |>
components() |>
filter(term == "covariates")
#> # A tibble: 1 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 covariates coef is_dragon_yearTRUE 0.063 (-0.043, 0.17)
There is some evidence for extra births, though there is substantial uncertainty about the size of the effect.
Next we expand the model to allow the dragon-year effect to differ across age groups. We create a variable that takes the values "baseline"
in all years, except in dragon years, when it takes the name of the age group. We turn this variable into a factor with "baseline"
as its first level.
births <- births |>
mutate(is_dragon_year_age = if_else(time == 2012, age, "baseline"),
is_dragon_year_age = factor(is_dragon_year_age,
levels = c("baseline", unique(age))))
births |>
filter(time %in% 2011:2013)
#> # A tibble: 432 × 9
#> age region time births popn gdp_pc_2023 dens_2020 is_dragon_year
#> <chr> <fct> <int> <int> <int> <dbl> <chr> <lgl>
#> 1 10-14 Busan 2011 1 89822 25.7 High FALSE
#> 2 10-14 Busan 2012 0 83884 25.7 High TRUE
#> 3 10-14 Busan 2013 0 79061 25.7 High FALSE
#> 4 10-14 Chungcheongbuk… 2011 3 47811 40.3 Low FALSE
#> 5 10-14 Chungcheongbuk… 2012 1 44864 40.3 Low TRUE
#> 6 10-14 Chungcheongbuk… 2013 0 42259 40.3 Low FALSE
#> 7 10-14 Chungcheongnam… 2011 0 62308 50.4 Low FALSE
#> 8 10-14 Chungcheongnam… 2012 0 56746 50.4 Low TRUE
#> 9 10-14 Chungcheongnam… 2013 0 54415 50.4 Low FALSE
#> 10 10-14 Daegu 2011 3 75776 22.3 Medium FALSE
#> # ℹ 422 more rows
#> # ℹ 1 more variable: is_dragon_year_age <fct>
We create a new model with the age-sepcific dragon-year indicator.
mod_dragon_age <- mod_pois(births ~ (age + region + time)^2,
data = births,
exposure = popn) |>
set_covariates(~ is_dragon_year_age) |>
fit()
mod_dragon_age
#>
#> ------ Fitted Poisson model ------
#>
#> births ~ (age + region + time)^2
#>
#> exposure = popn
#>
#> term prior along n_par n_par_free std_dev
#> (Intercept) NFix() - 1 1 -
#> age RW() age 9 9 2.64
#> region N() - 16 16 0.02
#> time RW() time 13 13 0.18
#> age:region RW() age 144 144 0.17
#> age:time RW() time 117 117 1.22
#> region:time RW() time 208 208 0.14
#>
#> covariates: ~is_dragon_year_age
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age optimizer
#> 1000 time age nlminb
#>
#> time_total time_optim time_report iter converged message
#> 2.06 1.14 0.81 24 TRUE relative convergence (4)
Rather than a single dragon-year coefficient, we have a coefficient for each age group. We extract them and tidy up the labels.
mod_dragon_age |>
components() |>
filter(term == "covariates") |>
mutate(age = sub("is_dragon_year_age", "", level)) |>
select(age, .fitted)
#> # A tibble: 9 × 2
#> age .fitted
#> <chr> <rdbl<1000>>
#> 1 10-14 0.31 (-0.14, 0.79)
#> 2 15-19 0.0094 (-0.19, 0.18)
#> 3 20-24 0.049 (-0.15, 0.23)
#> 4 25-29 0.075 (-0.1, 0.27)
#> 5 30-34 0.075 (-0.094, 0.26)
#> 6 35-39 0.034 (-0.15, 0.21)
#> 7 40-44 0.058 (-0.12, 0.24)
#> 8 45-49 0.11 (-0.11, 0.33)
#> 9 50-54 0.2 (-0.19, 0.55)
3.4 Forecasting
If all the covariates in a model are fixed, then the model can be forecasted as normal.
mod_gdp_dens |>
forecast(labels = 2024:2025)
#> # A tibble: 288 × 10
#> age region time births popn gdp_pc_2023 dens_2020 .observed
#> <chr> <fct> <int> <dbl> <int> <dbl> <chr> <dbl>
#> 1 10-14 Busan 2024 NA NA 25.7 High NA
#> 2 10-14 Busan 2025 NA NA 25.7 High NA
#> 3 10-14 Chungcheongbuk-do 2024 NA NA 40.3 Low NA
#> 4 10-14 Chungcheongbuk-do 2025 NA NA 40.3 Low NA
#> 5 10-14 Chungcheongnam-do 2024 NA NA 50.4 Low NA
#> 6 10-14 Chungcheongnam-do 2025 NA NA 50.4 Low NA
#> 7 10-14 Daegu 2024 NA NA 22.3 Medium NA
#> 8 10-14 Daegu 2025 NA NA 22.3 Medium NA
#> 9 10-14 Daejeon 2024 NA NA 27.6 Medium NA
#> 10 10-14 Daejeon 2025 NA NA 27.6 Medium NA
#> # ℹ 278 more rows
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>
If, however, a covariate varies over time, forecasting only works if values for future periods are provided. The following code will result in an error:
mod_dragon |>
forecast(labels = 2024:2025)
Instead we need to create a newdata
data frame…
newdata <- expand.grid(age = unique(kor_births$age),
region = unique(kor_births$region),
time = 2024:2025) |>
mutate(is_dragon_year = FALSE)
head(newdata)
#> age region time is_dragon_year
#> 1 10-14 Busan 2024 FALSE
#> 2 15-19 Busan 2024 FALSE
#> 3 20-24 Busan 2024 FALSE
#> 4 25-29 Busan 2024 FALSE
#> 5 30-34 Busan 2024 FALSE
#> 6 35-39 Busan 2024 FALSE
…and supply it to forecast()
.
mod_dragon |>
forecast(newdata = newdata)
#> # A tibble: 288 × 11
#> age region time births popn gdp_pc_2023 dens_2020 is_dragon_year
#> <chr> <fct> <int> <dbl> <int> <dbl> <chr> <lgl>
#> 1 10-14 Busan 2024 NA NA NA NA FALSE
#> 2 15-19 Busan 2024 NA NA NA NA FALSE
#> 3 20-24 Busan 2024 NA NA NA NA FALSE
#> 4 25-29 Busan 2024 NA NA NA NA FALSE
#> 5 30-34 Busan 2024 NA NA NA NA FALSE
#> 6 35-39 Busan 2024 NA NA NA NA FALSE
#> 7 40-44 Busan 2024 NA NA NA NA FALSE
#> 8 45-49 Busan 2024 NA NA NA NA FALSE
#> 9 50-54 Busan 2024 NA NA NA NA FALSE
#> 10 10-14 Chungcheongbuk… 2024 NA NA NA NA FALSE
#> # ℹ 278 more rows
#> # ℹ 3 more variables: .observed <dbl>, .fitted <rdbl<1000>>,
#> # .expected <rdbl<1000>>