Multibias example: NHANES

Setup

library(multibias)
library(dplyr)

For this demonstration in multi-bias analysis, we’ll be using the 2013-2018 National Health and Nutrition Examination Survey (NHANES) data linked to the CDC’s 2019 public-use mortality data. Within NHANES, we have data linked across demographics, 24-hour dietary recall, alcohol use, and smoking use.

nhanes <- read.csv("nhanes.csv")

We’ll be assessing the risk of alcohol intake on all-cause mortality. We’re specifically going to zoom in on the comparison of “extreme” drinkers to “non-extreme” drinkers. We filter out participants who reported having no alcohol over the past twelve months and create a sex-specific alcohol_extreme feature based off of USDGA guidelines of 14 g/day for women and 28 g/day for men. Participants are “extreme drinkers” if they reported drinking more than 1.5x these guidelines.

nhanes_filtered <- nhanes |>
  mutate(alcohol_extreme = case_when(
    alcohol_day_total > 14 * 1.5 & gender_female == 1 ~ 1,
    alcohol_day_total > 28 * 1.5 & gender_female == 0 ~ 1,
    TRUE ~ 0
  )) |>
  filter(age >= 18) |>
  filter(eligstat == 1) |>
  filter(alcohol_12mo > 0)

Inspecting the exposure-outcome counts, we are dealing with a rare exposure and rare outcome.

nhanes_filtered |>
  group_by(alcohol_extreme, mortstat) |>
  summarize(count = n()) |>
  ungroup() |>
  mutate(proportion = count / sum(count))
#> # A tibble: 4 × 4
#>   alcohol_extreme mortstat count proportion
#>             <dbl>    <int> <int>      <dbl>
#> 1               0        0  8412    0.856  
#> 2               0        1   281    0.0286 
#> 3               1        0  1089    0.111  
#> 4               1        1    44    0.00448

There are no major age/gender differences among the exposures.

nhanes_filtered |>
  group_by(alcohol_extreme) |>
  summarize(
    age_mean = mean(age),
    gender_prop = mean(gender_female)
  )
#> # A tibble: 2 × 3
#>   alcohol_extreme age_mean gender_prop
#>             <dbl>    <dbl>       <dbl>
#> 1               0     46.2       0.486
#> 2               1     45.1       0.454

To no surprise, age is higher in those who die.

nhanes_filtered |>
  group_by(mortstat) |>
  summarize(
    age_mean = mean(age),
    gender_prop = mean(gender_female)
  )
#> # A tibble: 2 × 3
#>   mortstat age_mean gender_prop
#>      <int>    <dbl>       <dbl>
#> 1        0     45.4       0.486
#> 2        1     66.2       0.375

Now tht we’ve loaded and inspected the data, let’s fit our base model for reference: a logistic regression of mortality on alcohol, gender, and age. For our exposure of interest, we observe an odds ratio of 1.6 (95% CI: 1.1, 2.2) indicating increased mortality risk among the extreme drinkers, adjusted for age and gender. Note that many important covariates and considerations are not considered here.

base_mod <- glm(
  mortstat ~ alcohol_extreme + gender_female + age,
  data = nhanes_filtered,
  family = binomial(link = "logit")
)

exp(coef(base_mod))
#>     (Intercept) alcohol_extreme   gender_female             age 
#>    0.0003640025    1.5603553141    0.6838661096    1.0852279381
exp(confint(base_mod))
#> Waiting for profiling to be done...
#>                        2.5 %       97.5 %
#> (Intercept)     0.0001952562 0.0006553778
#> alcohol_extreme 1.0999439503 2.1670092413
#> gender_female   0.5393792722 0.8637743150
#> age             1.0758586614 1.0950985762

The family of functions in multibias requires inputting the observed data as a data_observed() object.

df_obs <- data_observed(
  data = nhanes_filtered,
  exposure = "alcohol_extreme",
  outcome = "mortstat",
  confounders = c("age", "gender_female")
)

print(df_obs)
#> Observed Data
#> ------------------
#> Exposure: alcohol_extreme 
#> Outcome: mortstat 
#> Confounders: age, gender_female 
#> Data head: 
#>   alcohol_extreme mortstat age gender_female
#> 1               0        0  69             0
#> 2               1        1  54             0
#> 3               0        0  56             0
#> 4               0        0  61             1
#> 5               1        0  56             1

1. Adjust for uncontrolled confounding

An important confounder, smoking, was not included in our analysis! Let’s demonstrate how we could use validation data with smoking information to adjust for the uncontrolled confounding. Feature smoked_100cigs corresponds to whether a participant reported having smoked 100 cigarettes throughout their life.

df_temp1 <- nhanes_filtered |>
  filter(!is.na(smoked_100cigs))

df_val1 <- data_validation(
  data = df_temp1,
  true_exposure = "alcohol_extreme",
  true_outcome = "mortstat",
  confounders = c("age", "gender_female", "smoked_100cigs")
)

We now put these two objects into the multibias function named adjust_uc() to adjust for uncontrolled confounding. This new bias-adjusted odds ratio (1.4) is closer to the null than that of our base model. This is not a surprise considering that extreme drinkers are more likely to be smokers, and smoking increases the risk of death. Due to these strong correlations, part of the alcohol effect in our base model included the effect from smoking.

set.seed(1234)
adjust_uc(df_obs, df_val1)
#> $estimate
#> [1] 1.411961
#> 
#> $ci
#> [1] 1.003703 1.986279

2. Adjust for uncontrolled confounding and exposure misclassification

What if we also wanted to account for the fact that people commonly under-report their alcohol use due to social stigma? Let’s bring in the assumption that wealthy, college-educated individuals are most susceptible to this stigma. We’ll say that the actual alcohol intake for this group, which we’ll name alcohol_adj, should be 1.5 times what they reported. Then based on this feature, we’ll re-calculate the exteme drinking indicator: alcohol_extreme_adj.

df_temp2 <- nhanes_filtered |>
  filter(!is.na(smoked_100cigs)) |>
  mutate(alcohol_adj = if_else(
    income == "$100,000 and Over" | education == "College graduate or above",
    alcohol_day_total * 1.5,
    alcohol_day_total
  )) |>
  mutate(alcohol_extreme_adj = case_when(
    alcohol_adj > 14 * 2 & gender_female == 1 ~ 1,
    alcohol_adj > 14 * 4 & gender_female == 0 ~ 1,
    TRUE ~ 0
  ))

df_val2 <- data_validation(
  data = df_temp2,
  true_exposure = "alcohol_extreme_adj",
  true_outcome = "mortstat",
  confounders = c("age", "gender_female", "smoked_100cigs"),
  misclassified_exposure = "alcohol_extreme"
)

We now put these two objects into the multibias function named adjust_uc_em() to adjust for uncontrolled confounding and exposure misclassification. The new bias-adjsted odds ratio is even closer to the null.

set.seed(1234)
adjust_uc_em(df_obs, df_val2)
#> $estimate
#> [1] 1.216906
#> 
#> $ci
#> [1] 0.8335281 1.7766173

3. Adjust for uncontrolled confounding, exposure misclassification, and selection bias

An astute Epidemiologist reading along here may have noticed that we have not taken into account the NHANES sampling weights! NHANES does not select participants in order to have a group representative of the U.S. population, they instead over-sample from under-represented groups to ensure there is representation across all demographic categories. This could create a selection bias if both the exposure and outcome are related to the selection probability.

There are two weights included from the two-day total nutrition survey; we’ll combine them into a single feature.

df_temp3a <- nhanes_filtered |>
  filter(!is.na(smoked_100cigs)) |>
  mutate(alcohol_adj = if_else(
    income == "$100,000 and Over" | education == "College graduate or above",
    alcohol_day_total * 1.5,
    alcohol_day_total
  )) |>
  mutate(alcohol_extreme_adj = case_when(
    alcohol_adj > 14 * 2 & gender_female == 1 ~ 1,
    alcohol_adj > 14 * 4 & gender_female == 0 ~ 1,
    TRUE ~ 0
  )) |>
  mutate(
    weight = if_else(weight_day2 == 0, weight_day1, weight_day2)
  )

First, we sample with replacement based on each participant’s survey weight to get a dataframe representing selected participants. Then we’ll make a dataframe of the participants who are not selected here.

selected_sample <- sample(
  x = df_temp3a$seqn,
  size = nrow(df_temp3a),
  replace = TRUE,
  prob = df_temp3a$weight
)

df_selected_sample <- data.frame()
for (id in selected_sample) {
  df_selected_sample <- rbind(
    df_selected_sample,
    df_temp3a[df_temp3a$seqn == id, ]
  )
}

not_selected_sample <- df_temp3a$seqn[!(df_temp3a$seqn %in% selected_sample)]
df_not_selected_sample <- df_temp3a[df_temp3a$seqn %in% not_selected_sample, ]

Let’s confirm that the sum of these two groups equals the row count of the original dataframe.

length(not_selected_sample) + length(unique(selected_sample)) == nrow(df_temp3a)
#> [1] TRUE

Then we’ll make our ultimate dataframe here with an indicator for whether a given subject was or was not selected based on our sampling procedure.

df_selected_sample$selection <- 1
df_not_selected_sample$selection <- 0
df_temp3b <- rbind(df_selected_sample, df_not_selected_sample)

As usual, we make our data_validation object and specify our selection column, which represents whether the participant was selected into our study.

df_val3 <- data_validation(
  data = df_temp3b,
  true_exposure = "alcohol_extreme_adj",
  true_outcome = "mortstat",
  confounders = c("age", "gender_female", "smoked_100cigs"),
  misclassified_exposure = "alcohol_extreme",
  selection = "selection"
)

Now let’s see if this differential study selection led to a bias of our alcohol risk effect. We now put these two objects into the adjust_uc_em_sel() function to adjust for uncontrolled confounding, exposure misclassification, and selection bias.

set.seed(1234)
adjust_uc_em_sel(df_obs, df_val3)
#> $estimate
#> [1] 1.332608
#> 
#> $ci
#> [1] 0.9950054 1.7847580

Any selection bias based on the sampling weights seems to be pretty minor; the bias-adjusted odds ratio of the effect of alcohol on mortality does not change much relative to the prior adjustment we made.