For this demonstration in multi-bias analysis, we’ll be using the 2013-2018 National Health and Nutrition Examination Survey (NHANES) data linked to the CDC’s 2019 public-use mortality data. Within NHANES, we have data linked across demographics, 24-hour dietary recall, alcohol use, and smoking use.
We’ll be assessing the risk of alcohol intake on all-cause mortality.
We’re specifically going to zoom in on the comparison of “extreme”
drinkers to “non-extreme” drinkers. We filter out participants who
reported having no alcohol over the past twelve months and create a
sex-specific alcohol_extreme
feature based off of USDGA
guidelines of 14 g/day for women and 28 g/day for men. Participants
are “extreme drinkers” if they reported drinking more than 1.5x these
guidelines.
nhanes_filtered <- nhanes |>
mutate(alcohol_extreme = case_when(
alcohol_day_total > 14 * 1.5 & gender_female == 1 ~ 1,
alcohol_day_total > 28 * 1.5 & gender_female == 0 ~ 1,
TRUE ~ 0
)) |>
filter(age >= 18) |>
filter(eligstat == 1) |>
filter(alcohol_12mo > 0)
Inspecting the exposure-outcome counts, we are dealing with a rare exposure and rare outcome.
nhanes_filtered |>
group_by(alcohol_extreme, mortstat) |>
summarize(count = n()) |>
ungroup() |>
mutate(proportion = count / sum(count))
#> # A tibble: 4 × 4
#> alcohol_extreme mortstat count proportion
#> <dbl> <int> <int> <dbl>
#> 1 0 0 8412 0.856
#> 2 0 1 281 0.0286
#> 3 1 0 1089 0.111
#> 4 1 1 44 0.00448
There are no major age/gender differences among the exposures.
nhanes_filtered |>
group_by(alcohol_extreme) |>
summarize(
age_mean = mean(age),
gender_prop = mean(gender_female)
)
#> # A tibble: 2 × 3
#> alcohol_extreme age_mean gender_prop
#> <dbl> <dbl> <dbl>
#> 1 0 46.2 0.486
#> 2 1 45.1 0.454
To no surprise, age is higher in those who die.
nhanes_filtered |>
group_by(mortstat) |>
summarize(
age_mean = mean(age),
gender_prop = mean(gender_female)
)
#> # A tibble: 2 × 3
#> mortstat age_mean gender_prop
#> <int> <dbl> <dbl>
#> 1 0 45.4 0.486
#> 2 1 66.2 0.375
Now tht we’ve loaded and inspected the data, let’s fit our base model for reference: a logistic regression of mortality on alcohol, gender, and age. For our exposure of interest, we observe an odds ratio of 1.6 (95% CI: 1.1, 2.2) indicating increased mortality risk among the extreme drinkers, adjusted for age and gender. Note that many important covariates and considerations are not considered here.
base_mod <- glm(
mortstat ~ alcohol_extreme + gender_female + age,
data = nhanes_filtered,
family = binomial(link = "logit")
)
exp(coef(base_mod))
#> (Intercept) alcohol_extreme gender_female age
#> 0.0003640025 1.5603553141 0.6838661096 1.0852279381
exp(confint(base_mod))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.0001952562 0.0006553778
#> alcohol_extreme 1.0999439503 2.1670092413
#> gender_female 0.5393792722 0.8637743150
#> age 1.0758586614 1.0950985762
The family of functions in multibias
requires inputting
the observed data as a data_observed()
object.
df_obs <- data_observed(
data = nhanes_filtered,
exposure = "alcohol_extreme",
outcome = "mortstat",
confounders = c("age", "gender_female")
)
print(df_obs)
#> Observed Data
#> ------------------
#> Exposure: alcohol_extreme
#> Outcome: mortstat
#> Confounders: age, gender_female
#> Data head:
#> alcohol_extreme mortstat age gender_female
#> 1 0 0 69 0
#> 2 1 1 54 0
#> 3 0 0 56 0
#> 4 0 0 61 1
#> 5 1 0 56 1
An important confounder, smoking, was not included in our analysis!
Let’s demonstrate how we could use validation data with smoking
information to adjust for the uncontrolled confounding. Feature
smoked_100cigs
corresponds to whether a participant
reported having smoked 100 cigarettes throughout their life.
df_temp1 <- nhanes_filtered |>
filter(!is.na(smoked_100cigs))
df_val1 <- data_validation(
data = df_temp1,
true_exposure = "alcohol_extreme",
true_outcome = "mortstat",
confounders = c("age", "gender_female", "smoked_100cigs")
)
We now put these two objects into the multibias
function
named adjust_uc()
to adjust for uncontrolled confounding.
This new bias-adjusted odds ratio (1.4) is closer to the null than that
of our base model. This is not a surprise considering that extreme
drinkers are more likely to be smokers, and smoking increases the risk
of death. Due to these strong correlations, part of the alcohol effect
in our base model included the effect from smoking.
What if we also wanted to account for the fact that people commonly
under-report their alcohol use due to social stigma? Let’s bring in the
assumption that wealthy, college-educated individuals are most
susceptible to this stigma. We’ll say that the actual alcohol intake for
this group, which we’ll name alcohol_adj
, should be 1.5
times what they reported. Then based on this feature, we’ll re-calculate
the exteme drinking indicator: alcohol_extreme_adj
.
df_temp2 <- nhanes_filtered |>
filter(!is.na(smoked_100cigs)) |>
mutate(alcohol_adj = if_else(
income == "$100,000 and Over" | education == "College graduate or above",
alcohol_day_total * 1.5,
alcohol_day_total
)) |>
mutate(alcohol_extreme_adj = case_when(
alcohol_adj > 14 * 2 & gender_female == 1 ~ 1,
alcohol_adj > 14 * 4 & gender_female == 0 ~ 1,
TRUE ~ 0
))
df_val2 <- data_validation(
data = df_temp2,
true_exposure = "alcohol_extreme_adj",
true_outcome = "mortstat",
confounders = c("age", "gender_female", "smoked_100cigs"),
misclassified_exposure = "alcohol_extreme"
)
We now put these two objects into the multibias
function
named adjust_uc_em()
to adjust for uncontrolled confounding
and exposure misclassification. The new bias-adjsted odds ratio is even
closer to the null.
An astute Epidemiologist reading along here may have noticed that we have not taken into account the NHANES sampling weights! NHANES does not select participants in order to have a group representative of the U.S. population, they instead over-sample from under-represented groups to ensure there is representation across all demographic categories. This could create a selection bias if both the exposure and outcome are related to the selection probability.
There are two weights included from the two-day total nutrition survey; we’ll combine them into a single feature.
df_temp3a <- nhanes_filtered |>
filter(!is.na(smoked_100cigs)) |>
mutate(alcohol_adj = if_else(
income == "$100,000 and Over" | education == "College graduate or above",
alcohol_day_total * 1.5,
alcohol_day_total
)) |>
mutate(alcohol_extreme_adj = case_when(
alcohol_adj > 14 * 2 & gender_female == 1 ~ 1,
alcohol_adj > 14 * 4 & gender_female == 0 ~ 1,
TRUE ~ 0
)) |>
mutate(
weight = if_else(weight_day2 == 0, weight_day1, weight_day2)
)
First, we sample with replacement based on each participant’s survey weight to get a dataframe representing selected participants. Then we’ll make a dataframe of the participants who are not selected here.
selected_sample <- sample(
x = df_temp3a$seqn,
size = nrow(df_temp3a),
replace = TRUE,
prob = df_temp3a$weight
)
df_selected_sample <- data.frame()
for (id in selected_sample) {
df_selected_sample <- rbind(
df_selected_sample,
df_temp3a[df_temp3a$seqn == id, ]
)
}
not_selected_sample <- df_temp3a$seqn[!(df_temp3a$seqn %in% selected_sample)]
df_not_selected_sample <- df_temp3a[df_temp3a$seqn %in% not_selected_sample, ]
Let’s confirm that the sum of these two groups equals the row count of the original dataframe.
Then we’ll make our ultimate dataframe here with an indicator for whether a given subject was or was not selected based on our sampling procedure.
df_selected_sample$selection <- 1
df_not_selected_sample$selection <- 0
df_temp3b <- rbind(df_selected_sample, df_not_selected_sample)
As usual, we make our data_validation
object and specify
our selection
column, which represents whether the
participant was selected into our study.
df_val3 <- data_validation(
data = df_temp3b,
true_exposure = "alcohol_extreme_adj",
true_outcome = "mortstat",
confounders = c("age", "gender_female", "smoked_100cigs"),
misclassified_exposure = "alcohol_extreme",
selection = "selection"
)
Now let’s see if this differential study selection led to a bias of
our alcohol risk effect. We now put these two objects into the
adjust_uc_em_sel()
function to adjust for uncontrolled
confounding, exposure misclassification, and selection bias.
set.seed(1234)
adjust_uc_em_sel(df_obs, df_val3)
#> $estimate
#> [1] 1.332608
#>
#> $ci
#> [1] 0.9950054 1.7847580
Any selection bias based on the sampling weights seems to be pretty minor; the bias-adjusted odds ratio of the effect of alcohol on mortality does not change much relative to the prior adjustment we made.