Skip to contents

1 Introduction

Simulation studies can be used to assess the performance of a model. The basic idea is to generate some parameter values, use these parameters to generate some data, use the data to try to infer the original parameter values, and then see how close the inferred parameter values are to the actual parameter values.
Simulation study of a model

Figure 1.1: Simulation study of a model

Function report_sim() automates the process of doing a simulation study. We are still experimenting with report_sim(), and the interface may change. Suggestions are welcome, ideally through raising an issue here.

2 Estimation model matches data generation model

The most straightforward type of simulation is when the estimation model' used to do the inference matches thedata generation model’ used to create the data. Even when the estimation model matches the data generation model, the inferred values for the parameters will not exactly reproduce the actual values, since data is drawn at random, and provides a noisy signal about parameters it was generated from. However, if the experiment is repeated many times, with a different randomly-drawn dataset each time, the errors should more or less average out at zero, 50% credible intervals should contain the true values close to 50% of the time, and 95% credible intervals should contain the true values close to 95% of the time.

To illustrate, we use investigate the performance of a model of divorce rates in New Zealand. We reduce the number of ages and time periods to speed up the calculations.

library(bage)
#> Loading required package: rvec
#> 
#> Attaching package: 'rvec'
#> The following objects are masked from 'package:stats':
#> 
#>     sd, var
#> The following object is masked from 'package:base':
#> 
#>     rank
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) : 
#>   object 'type_sum.accel' not found
library(dplyr, warn.conflicts = FALSE)
library(poputils)

divorces_small <- nzl_divorces |>
  filter(age_upper(age) < 40,
         time >= 2016) |>
  droplevels()

mod <- mod_pois(divorces ~ age * sex + time,
                data = divorces_small,
                    exposure = population)
mod     
#> 
#>     ------ Unfitted Poisson model ------
#> 
#>    divorces ~ age * sex + time
#> 
#>   exposure = population
#> 
#>         term  prior along n_par n_par_free
#>  (Intercept) NFix()     -     1          1
#>          age   RW()   age     4          4
#>          sex NFix()     -     2          2
#>         time   RW()  time     6          6
#>      age:sex   RW()   age     8          8
#> 
#>  disp: mean = 1
#> 
#>  n_draw var_time var_age var_sexgender
#>    1000     time     age           sex

To do the simulation study, we pass the model to report_sim(). If only one model is supplied, report_sim() assumes that that model should be used as the estimation model and as the data generation model. By default report_sim() repeats the experiment 100 times, generating a different dataset each time.

set.seed(0)
res <- report_sim(mod_est = mod)
res
#> $components
#> # A tibble: 9 × 7
#>   term        component   .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>       <chr>        <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 (Intercept) effect     0.205       0.47      0.93       1.15       3.36 
#> 2 age         effect    -0.218       0.47      0.928      1.46       4.26 
#> 3 age         hyper      0.0808      0.37      0.75       0.603      1.91 
#> 4 sex         effect     0.00726     0.535     0.915      1.12       3.21 
#> 5 time        effect    -0.0704      0.505     0.953      1.25       3.61 
#> 6 time        hyper      0.0188      0.47      0.87       0.393      1.15 
#> 7 age:sex     effect    -0.00151     0.498     0.961      1.40       4.07 
#> 8 age:sex     hyper     -0.0767      0.46      0.89       0.477      1.62 
#> 9 disp        disp      -0.0301      0.37      0.92       0.220      0.642
#> 
#> $augment
#> # A tibble: 2 × 7
#>   .var      .observed     .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>         <dbl>      <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 .fitted       4358.    0.00255     0.495     0.946     0.0172     0.0501
#> 2 .expected     4358. 1479.          0.477     0.935  3239.     10851.

The output from report_sim() is a list of two data frames. The first data frame contains results for parameters associated with the components() function: main effects and interactions, associated hyper-parameters, and dispersion. The second data frame contains results for parameters associated with the augment() function: the lowest-level rates parameters.

As can be seen in the results, the errors do not average out at exactly zero, 50% credible intervals do not contain the true value exactly 50% of the time, and 95% credible intervals do not contain the true value exactly 95% of the time. However, increasing the number of simulations from the default value of 100 to, say, 1000 will reduce the average size of the errors closer to zero, and bring the actual coverage rates closer to their advertised values. When larger values of n_sim are used, it can be helpful to use parallel processing to speed up calculations, which is done through the n_core argument.

One slightly odd feature of the results is that the mean for .expected is very large. This reflects the fact that the data generation model draws some extreme values. We are developing a set of more informative priors that should avoid this behavior in future versions of bage.

3 Estimation model different from data generation model

In actual applications, no estimation model ever perfectly describes the true data generating process. It can therefore be helpful to see how robust a given model is to misspecification, that is, to cases where the estimation model differs from the data generation model.

With report_sim(), this can be done by using one model for the mod_est argument, and a different model for the mod_sim argument.

Consider, for instance, a case where the time effect is generated from an AR1() prior, while the estimation model continues to use the default value of a RW() prior,

mod_ar1 <- mod |>
  set_prior(time ~ AR1())
mod_ar1 
#> 
#>     ------ Unfitted Poisson model ------
#> 
#>    divorces ~ age * sex + time
#> 
#>   exposure = population
#> 
#>         term  prior along n_par n_par_free
#>  (Intercept) NFix()     -     1          1
#>          age   RW()   age     4          4
#>          sex NFix()     -     2          2
#>         time  AR1()  time     6          6
#>      age:sex   RW()   age     8          8
#> 
#>  disp: mean = 1
#> 
#>  n_draw var_time var_age var_sexgender
#>    1000     time     age           sex

We set the mod_sim argument to mod_ar1 and generate the report.

res_ar1 <- report_sim(mod_est = mod, mod_sim = mod_ar1)
res_ar1
#> $components
#> # A tibble: 8 × 7
#>   term        component  .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>       <chr>       <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 (Intercept) effect    -0.136      0.44      0.94       1.15       3.36 
#> 2 age         effect    -0.130      0.47      0.89       1.45       4.22 
#> 3 age         hyper      0.120      0.43      0.77       0.620      1.95 
#> 4 sex         effect     0.0231     0.485     0.92       1.12       3.20 
#> 5 time        effect    -0.0449     0.518     0.87       1.24       3.60 
#> 6 age:sex     effect     0.209      0.484     0.949      1.39       4.03 
#> 7 age:sex     hyper     -0.0846     0.4       0.89       0.494      1.72 
#> 8 disp        disp      -0.0225     0.5       0.93       0.248      0.722
#> 
#> $augment
#> # A tibble: 2 × 7
#>   .var      .observed   .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>         <dbl>    <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 .fitted        55.8 0.000334     0.498     0.948    0.00641     0.0186
#> 2 .expected      55.8 6.93         0.505     0.937   32.5       109.

In this case, although actual coverage for the hyper-parameters (in the components part of the results) now diverges from the advertised coverage, coverage for the low-level rates (in the augment part of the results) is still close to advertised coverage.

4 The relationship between report_sim() and replicate_data()

Functions report_sim() and replicate_data() overlap, in that both use simulated data to provide insights into model performance. Their aims are, however, different. Typically, report_sim() is used before fitting a model, to assess its performance across a random selection of possible datasets, while replicate_data() is used after fitting a model, to assess its performance on the dataset to hand.