---
title: "Multivariate missingness and monotonicity"
author: "Janick Weberpals"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Multivariate missingness and monotonicity}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
dpi = 150,
fig.width = 6,
fig.height = 4.5
)
```
```{r setup}
library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))
```
# Multivariate missing data in `smdi`
In this article, we want to briefly highlight two aspect regarding **multivariate missingness**:
1. How does `smdi` handle multivariate missingness?
2. What is the link between missing data patterns and missing data mechanisms and how does this affect the behavior and performance of the `smdi` functionality?
## Established taxonomies
In general, there are two classic established missing data taxonomies:
* Mechanisms: Missing completely at random (MCAR), at random (MAR) and not at random (MNAR)
* Patterns: Monotone versus Non-monotone missingness
# How does `smdi` handle multivariate missingness?
In all `smdi` functions, except `smdi_little()`, binary missing indicator variables are created for each partially observed variable (either specified by the analyst using the `covar` parameter or automatically identified via `smdi_check_covar()` if `covar` = NULL) and the columns with the actual variable values are dropped. For the variable importance visualization in `smdi_rf()`, these variables are indicated with a *"_NA"* suffix. Missing values are accordingly indicated with a *1* and complete observations with a *0*. This functionality is controlled via the `smdi_na_indicator()` utility function.
```{r, fig.cap="Illustrating missing indicator variable generation within `smdi` functions"}
smdi_data %>%
smdi_na_indicator(
drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
) %>%
select(
ecog_cat, ecog_cat_NA,
egfr_cat, egfr_cat_NA,
pdl1_num, pdl1_num_NA
) %>%
head() %>%
gt()
```
Now, let's assume we have three partially observed covariates *X*, *Y* and *Z*, which we would like to include in our missingness diagnostics. All `smdi_diagnose()` functions, except `smdi_little()`, create *X_NA*, *Y_NA* and *Z_NA* and *X*, *Y* and *Z* are discarded from the dataset. The functions will then iterate the diagnostics through all *X_NA*, *Y_NA* and *Z_NA* one-by-one. That is, if, for example, *X_NA* is assessed, *Y_NA* and *Z_NA* serve as predictor variables along with all other covariates in the dataset. If *Y_NA* is assessed, *X_NA* and *Z_NA* are included as predictor variables, and so forth.
Important! It is important to notice that this strategy is the default to deal with multivariate missingness in the `smdi` package, however, another possible approach could be to *not* consider the other partially observed variables in the first place (e.g. by dropping them before applying any `smdi` function) and stacking the diagnostics focusing on one partially observed variable at a time. Such a strategy would be advisable in scenarios of monotone missing data patterns (see next section).
# `smdi` in case of monotone missing data patterns
While in the `smdi` package we mainly focus on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the [routine structural missing data diagnostics article](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/b_routine_diagnostics.html#descriptives), where we highlight the importance of describing missingness proportions and patterns before running any of the `smdi` diagnostics.
As mentioned in the section before, in case of monotone missing data patterns, the `smdi` functionality may be misleading.
Monotonicity A missing data pattern is said to be monotone if the variables Yj can be ordered such that if Yj is missing then all variables Yk with k > j are also missing (taken from Stef van Buuren [^1]).
[^1]: For more information on missing data patterns see
A good example for monotone missing data could be clinical blood laboratory tests ("labs") which are often tested together in a lab panel. If one lab is missing, typically the other labs of this panel are also missing.
```{r}
# we simulatea monotone missingness pattern
# following an MCAR mechanism
set.seed(42)
data_monotone <- smdi_data_complete %>%
mutate(
lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
)
data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA
```
```{r}
smdi::gg_miss_upset(data = data_monotone)
```
```{r}
smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
```
In extreme cases of perfect linearity, this can lead to multiple warnings and errors such as `system is exactly singular` or `-InfWarning: Variable has only NA's in at least one stratum`.
In cases in which monotonicity is still clearly present but not as extreme (like in the example above), `smdi` will prompt a message to the analyst to raise awareness of this issue as the `smdi` output can be **highly misleading** in such instances.
```{r}
diagnostics_jointly <- smdi_diagnose(
data = data_monotone,
covar = NULL, # NULL includes all covariates with at least one NA
model = "cox",
form_lhs = "Surv(eventtime, status)"
)
```
```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
diagnostics_jointly %>%
smdi_style_gt()
```
In such cases, it may be advisable to *not* consider including *lab2* in the missingness diagnostics of *lab1* and vice versa and stack the diagnostics focusing on one partially observed variable at a time.
## Lab 1 analyzed without Lab 2
```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
# lab 1
lab1_diagnostics <- smdi_diagnose(
data = data_monotone %>% select(-lab2),
model = "cox",
form_lhs = "Surv(eventtime, status)"
)
lab1_diagnostics %>%
smdi_style_gt()
```
## Lab 2 analyzed without Lab 1
```{r, fig.cap="Diagnostics of lab 2 if analyzed separately."}
# lab 2
lab2_diagnostics <- smdi_diagnose(
data = data_monotone %>% select(-lab1),
model = "cox",
form_lhs = "Surv(eventtime, status)"
)
lab2_diagnostics %>%
smdi_style_gt()
```
## Presented in one table using `smdi_style_gt()`
We can also combine the output of individually stacked `smdi_diagnose` tables and enhance it with a global Little's test that takes into account the multivariate missingness of the entire dataset.
```{r}
# computing a gloabl p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)
# combining two individual lab smdi tables and global Little's test
smdi_style_gt(
smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl),
include_little = little_global
)
```
Since the missingness follows an MCAR mechanism, `smdi_diagnose()` now shows the expected missingness diagnostics patterns one would expect from an MCAR mechanism.