multipleImputation.Rmd

---
title: "Multiple Imputation"
output:
  html_document:
    code_folding: show
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  error = TRUE,
  comment = "",
  class.source = "fold-show")
```

# Preamble

## Install Libraries

```{r}
#install.packages("remotes")
#remotes::install_github("DevPsyLab/petersenlab")
```

## Load Libraries

```{r, message = FALSE, warning = FALSE}
library("tidyverse")
library("psych")
library("mice")
library("micemd")
library("miceadds")
library("mitml")
library("Amelia")
library("jomo")
library("parallel")
library("future")
library("furrr")
library("nlme")
library("MplusAutomation")
```

## Data

```{r}
data(Oxboys, package = "nlme")

Oxboys_addNA <- data.frame(Oxboys)
Oxboys_addNA <- Oxboys_addNA %>% 
  rename(id = Subject)

Oxboys_addNA$id <- as.integer(Oxboys_addNA$id)
Oxboys_addNA$Occasion <- as.integer(Oxboys_addNA$Occasion)

set.seed(52242)
Oxboys_addNA[sample(1:nrow(Oxboys_addNA), 25), "height"] <- NA

dataToImpute <- Oxboys_addNA
varsToImpute <- c("height")
Y <- varsToImpute
```

# Types of Missingness

1. Missing Completely at Random (MCAR)
    - the probability of being missing is the same for all cases
    - missingness is not related to variables in the model
1. (Conditionally) Missing at Random (MAR)
    - missingness is related to variables in the model, but once we condition on the variables in the model, the missingness is haphazard
    - the unobserved values do not play a role
1. Missing Not at Random (MNAR)
    - missingness is related to variables that are not in the model (i.e., for reasons that are unknown)
    - unobserved variables carry information about whether a case will have missing data

# Approaches to Handle Missing Data

## For MCAR/MAR Missingness

1. Full Information Maximum Likelihood (FIML)
1. Multiple Imputation

## For MNAR Missingness

https://stefvanbuuren.name/fimd/sec-nonignorable.html (archived at https://perma.cc/N7WW-HDZF)

https://cran.r-project.org/web/packages/missingHE/vignettes/Fitting_MNAR_models_in_missingHE.html (archived at https://perma.cc/9X25-5D8G)

1. find more data about the causes for the missingness
1. sensitivity analyses (what-if analyses) to see how sensitive the results are under various scenarios
    - https://stefvanbuuren.name/fimd/sec-MCAR.html (archived at https://perma.cc/9KM9-3NX3)
1. selection models
    - simultaneously estimate the focal model and a missingness model, where the missingness model has the missing data indicator as a dependent variable, as predicted by the original dependent variable, the original predictors, and any additional covariates etc.
    - if the missingness model is approximately correct, the focal model adjusts in way that removes nonresponse bias
    - similar to a mediation process
        - X → Y
        - Y → missingness
        - X → missingness
    - https://quantitudepod.org/s4e08-craig-enders/ (archived at https://perma.cc/FY9L-L3F7)
1. pattern-mixture models
    - missing data indicator is a predictor variable
    - similar to a moderation process; subgroups of cases have different parameter estimates

# Approaches to Multiple Imputation

- multiple imputation by joint modeling
    - assumes that the variables in follow a joint distribution, e.g.:
        - multivariate normal (i.e., multivariate normal imputation)
    - `R` packages:
        - `jomo`
        - `pan`
        - `Amelia`
        - `Mplus`
- multiple imputation by chained equations
    - aka:
        - fully conditional specification multiple imputation
        - sequential regression multiple imputation
    - `R` packages:
        - `mice`
            - predictive mean matching (`?mice::mice.impute.pmm`) can be useful to obtain bounded imputations for non-normally distributed variables

# Describe Missing Data

```{r}
describe(dataToImpute)
md.pattern(dataToImpute, rotate.names = TRUE)
```

# Pre-Imputation Setup

## Specify Number of Imputations

```{r}
numImputations <- 5 # generally use 100 imputations; this example uses 5 for speed
```

## Detect Cores

```{r}
numCores <- parallel::detectCores() - 1
```

# Multilevel Multiple Imputation

## Three-Level and Cross-Classified Data

https://simongrund1.github.io/posts/multiple-imputation-for-three-level-and-cross-classified-data/ (archived at https://perma.cc/N4PP-A3V6)

## Methods

### `mice` {#mice}

#### Types

##### Continuous Outcomes

https://stefvanbuuren.name/fimd/sec-multioutcome.html#methods (archived at https://perma.cc/8CDA-TS3K)

```{r, eval = FALSE}
?mice::mice.impute.2l.norm
?mice::mice.impute.2l.pan
?mice::mice.impute.2l.lmer
?miceadds::mice.impute.2l.pmm
?miceadds::mice.impute.2l.contextual.pmm
?miceadds::mice.impute.2l.continuous
?micemd::mice.impute.2l.2stage.norm
?micemd::mice.impute.2l.2stage.pmm
?micemd::mice.impute.2l.glm.norm
?micemd::mice.impute.2l.jomo
```

##### Binary Outcomes

https://stefvanbuuren.name/fimd/sec-catoutcome.html#methods-1 (archived at https://perma.cc/5QHF-YRP6)

```{r, eval = FALSE}
?mice::mice.impute.2l.bin
?miceadds::mice.impute.2l.binary
?miceadds::mice.impute.2l.pmm
?miceadds::mice.impute.2l.contextual.pmm
?micemd::mice.impute.2l.2stage.bin
?micemd::mice.impute.2l.glm.bin
```

##### Ordinal Outcomes

https://stefvanbuuren.name/fimd/sec-multioutcome.html#methods (archived at https://perma.cc/8CDA-TS3K)

```{r, eval = FALSE}
?miceadds::mice.impute.2l.pmm
?miceadds::mice.impute.2l.contextual.pmm
?micemd::mice.impute.2l.2stage.pmm
```

##### Count Outcomes

https://stefvanbuuren.name/fimd/sec-catoutcome.html#methods-1 (archived at https://perma.cc/5QHF-YRP6)

```{r, eval = FALSE}
?micemd::mice.impute.2l.2stage.pois
?micemd::mice.impute.2l.glm.pois
?countimp::mice.impute.2l.poisson
?countimp::mice.impute.2l.nb2
?countimp::mice.impute.2l.zihnb
```

#### Specifying the Imputation Method

```{r}
meth <- make.method(dataToImpute)
meth[1:length(meth)] <- ""
meth[Y] <- "2l.pmm" # specify the imputation method here; this can differ by outcome variable
```

#### Specifying the Predictor Matrix

A predictor matrix is a matrix of values, where:

- columns with non-zero values are predictors of the variable specified in the given row
- the diagonal of the predictor matrix should be zero because a variable cannot predict itself

The values are:

- NOT a predictor of the outcome: `0`
- cluster variable: `-2`
- fixed effect of predictor: `1`
- fixed effect and random effect of predictor: `2`
- include cluster mean of predictor in addition to fixed effect of predictor: `3`
- include cluster mean of predictor in addition to fixed effect and random effect of predictor: `4`

```{r}
pred <- make.predictorMatrix(dataToImpute)
pred[1:nrow(pred), 1:ncol(pred)] <- 0
pred[Y, "id"] <- (-2) # cluster variable
pred[Y, "Occasion"] <- 1 # fixed effect predictor
pred[Y, "age"] <- 2 # random effect predictor
pred[Y, Y] <- 1 # fixed effect predictor

diag(pred) <- 0
pred
```

#### Syntax

```{r}
mi_mice <- mice(
  as.data.frame(dataToImpute),
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  seed = 52242)
```

### `jomo` {#jomo}

```{r}
level1Vars <- c("height")
level2Vars <- c("v3","v4")
clusterVars <- c("id")
fullyObservedCovariates <- c("age","Occasion")

set.seed(52242)

mi_jomo <- jomo(
  Y = data.frame(dataToImpute[, level1Vars]),
  #Y2 = data.frame(dataToImpute[, level2Vars]),
  X = data.frame(dataToImpute[, fullyObservedCovariates]),
  clus = data.frame(dataToImpute[, clusterVars]),
  nimp = numImputations,
  meth = "random"
)
```

### `Amelia` {#amelia}

- `!` in the console output indicates that the current estimated complete data covariance matrix is not invertible
- `*` in the console output indicates that the likelihood has not monotonically increased in that step

```{r}
boundVars <- c("height")
boundCols <- which(names(dataToImpute) %in% boundVars)
boundLower <- 100
boundUpper <- 200

varBounds <- cbind(boundCols, boundLower, boundUpper)

set.seed(52242)

mi_amelia <- amelia(
  dataToImpute,
  m = numImputations,
  ts = "age",
  cs = "id",
  polytime = 1,
  intercs = TRUE,
  #ords = ordinalVars,
  #bounds = varBounds,
  parallel = "snow",
  #ncpus = numCores,
  empri = .01*nrow(dataToImpute)) # ridge prior for numerical stability
```

### `Mplus` {#mplus}

#### Save `Mplus` Data File

Save `R` object as `Mplus` data file:

```{r}
prepareMplusData(dataToImpute, file.path("dataToImpute.dat"))
```

#### `Mplus` Syntax for Multilevel Imputation

`Mplus` syntax for multilevel imputation:

```Mplus
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!  MPLUS SYNTAX LINES CANNOT EXCEED 90 CHARACTERS;
!!!!!  VARIABLE NAMES AND PARAMETER LABELS CANNOT EXCEED 8 CHARACTERS EACH;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

TITLE: Model Title

DATA: FILE = "dataToImpute.dat";

VARIABLE:
  NAMES = id age height Occasion;
  MISSING = .;
  USEVARIABLES ARE age height Occasion;
  !CATEGORICAL ARE INSERT_NAMES_OF_CATEGORICAL_VARIABLES_HERE;
  CLUSTER = id;

ANALYSIS:
  TYPE = twolevel basic;
  bseed = 52242;
  PROCESSORS = 2;
    
DATA IMPUTATION:
  IMPUTE = age(0-100) height Occasion; !put ' (c)' after categorical vars
  NDATASETS = 100;
  SAVE = imp*.dat
```

Putting a range of values after a variable (e.g., `0-100`) sets the lower and upper bounds of the imputed values.
This would save a `implist.dat` file that can be used to run the model on the multiply imputed data, as shown [here](mplus.html#multipleImputation).

# Parallel Processing

https://www.gerkovink.com/miceVignettes/futuremice/Vignette_futuremice.html (archived at https://perma.cc/4SNE-RCSR)

```{r}
mi_parallel_mice <- futuremice(
  dataToImpute,
  method = meth,
  predictorMatrix = pred,
  m = numImputations,
  maxit = 5, # generally use 100 maximum iterations; this example uses 5 for speed
  parallelseed = 52242,
  n.core = numCores,
  packages = "miceadds")
```

# Imputation Diagnostics

## Logged Events

```{r}
mi_mice$loggedEvents
```

## Trace Plots

On convergence, the streams of the trace plots should intermingle and be free of any trend (at the later iterations).
Convergence is diagnosed when the variance between different sequences is no larger than the variance within each individual sequence.

- https://stefvanbuuren.name/fimd/sec-algoptions.html (archived at https://perma.cc/4S54-465R)

```{r}
plot(mi_mice, c("height"))
```

## Density Plots

```{r}
densityplot(mi_mice)
densityplot(mi_mice, ~ height)
```

## Strip Plots

```{r}
stripplot(mi_mice)
stripplot(mi_mice, height ~ .imp)
```

# Post-Processing

## Modify/Create New Variables

### `mice`

```{r}
mi_mice_long <- complete(
  mi_mice,
  action = "long",
  include = TRUE)

mi_mice_long$newVar <- mi_mice_long$age * mi_mice_long$height
```

### `jomo`

```{r}
mi_jomo <- mi_jomo %>% 
  rename(height = dataToImpute...level1Vars.)

mi_jomo$newVar <- mi_jomo$age * mi_jomo$height
```

### `Amelia`

```{r}
for(i in 1:length(mi_amelia$imputations)){
  mi_amelia$imputations[[i]]$newVar <- mi_amelia$imputations[[i]]$age * mi_amelia$imputations[[i]]$height
}
```

## Convert to `mids` object

### `mice`

```{r}
mi_mice_mids <- as.mids(mi_mice_long)
```

### `jomo`

```{r}
mi_jomo_mids <- miceadds::jomo2mids(mi_jomo)
```

### `Amelia`

```{r}
mi_amelia_mids <- miceadds::datlist2mids(mi_amelia$imputations)
```

## Export for `Mplus`

```{r, eval = FALSE}
mids2mplus(mi_mice_mids, path = file.path("InsertFilePathHere", fsep = ""))
mids2mplus(mi_jomo_mids, path = file.path("InsertFilePathHere", fsep = ""))
mids2mplus(mi_amelia_mids, path = file.path("InsertFilePathHere", fsep = ""))
```

# Resources

https://stefvanbuuren.name/fimd (archived at https://perma.cc/46U2-QTM6)

https://www.gerkovink.com/miceVignettes/Multi_level/Multi_level_data.html (archived at https://perma.cc/SF32-D7ZF)