library(anticlust)

In this vignette I explore some ways to incorporate categorical variables into anticlustering, focusing on the main function of anticlust: anticlustering(). Historically, the first option for handling categorical variables was the categories argument (Papenberg & Klau, 2021). It is easy to use: we pass the numeric variables as the first argument (x) and our categorical variable(s) to categories. I will use the penguin data set to illustrate its usage:
data(penguins)
# First exclude cases with missing values
df <- na.omit(palmerpenguins::penguins)
head(df)
#> # A tibble: 6 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> # ℹ 2 more variables: sex <fct>, year <int>
nrow(df)
#> [1] 333

In the data set, each row represents a penguin, and the data set has four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and several categorical variables (species, island, sex) as descriptions of the penguins.
Let’s call anticlustering() to divide the 333 penguins into 3 groups. We use the four numeric variables as the first argument (i.e., the anticlustering objective is computed on the basis of the numeric variables), and the penguins’ sex as the categorical variable:
numeric_vars <- df[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df$sex
)

Let’s check how well our categorical variable is balanced:
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56

A perfect split! Similarly, we could use the species as the categorical variable:
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df$species
)
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40

As good as it could be! Now, let’s use both categorical variables at the same time:
groups <- anticlustering(
numeric_vars,
K = 3,
categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 54 57
#> 2 56 55
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40

The results for the sex variable are worse than before, when we considered only one variable at a time. This is because, when multiple variables are passed to the categories argument, all columns are “merged” into a single column, and each combination of sex / species is treated as a separate category. Some information on the original variables is lost, and the results may become less optimal, although they are still pretty okay here. Alas, using only the categories argument, we cannot improve this balancing even if a better split with regard to both categorical variables would be possible.
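To make the "merging" concrete, here is a minimal base R sketch of the idea. It uses a small made-up data frame (toy is hypothetical, not the penguin data) and interaction() to build one combined factor; this only illustrates the principle and is not necessarily how anticlust combines the columns internally.

```r
# Hypothetical toy data: two categorical variables
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"),
  sex     = c("female", "male", "female", "male", "female", "male")
)

# Merge both columns into one factor; each observed combination is one level
merged <- interaction(toy$species, toy$sex, drop = TRUE)

# One category per species / sex combination
table(merged)
nlevels(merged)
```

Balancing such a merged factor balances each combination on its own, but the sums over combinations (e.g., all females regardless of species) are no longer controlled directly.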
A second possibility to incorporate categorical variables is to treat them as numeric variables and use them as part of the first argument x, which is used to compute the anticlustering objective (e.g., the diversity or variance). This approach can lead to better results when multiple categorical variables are available, and / or if the group sizes are unequal. Since version 0.8.12, we can use categorical variables as part of the first argument when they are defined as factors. Before that, we manually had to convert categorical variables into a binary representation via categories_to_binary(). Manual conversion can still be useful, as shown further below.
In the penguin data set, all variables are already correctly coded, i.e., categorical variables are defined as factors. So I generate a data frame that includes all features – numeric and categorical – and use it as input for anticlustering.
all_features <- data.frame(numeric_vars, df[, c("species", "sex")])
groups <- anticlustering(
all_features,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 22 40
#> 3 49 23 39

The results are quite convincing. In particular, the penguins’ sex is better balanced than previously when we used the argument categories. If we have multiple categorical variables and / or unequal-sized groups, it may be useful to try out using categorical variables as factors, instead of using the categories argument.
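One way to compare such tables at a glance is to compute, per category level, the largest difference in counts between groups. The following base R sketch does this; count_spread() and rebuild() are hypothetical helpers written for this illustration (they are not part of anticlust), and the input vectors are rebuilt from the cell counts of the two sex tables shown above rather than from the raw data.

```r
# Largest between-group difference in counts, per category level
count_spread <- function(groups, category) {
  counts <- table(groups, category)
  apply(counts, 2, function(n) max(n) - min(n))
}

# Rebuild group / sex vectors from the cell counts of a 3 x 2 table
rebuild <- function(female, male) {
  data.frame(
    groups = rep(rep(1:3, times = 2), times = c(female, male)),
    sex    = rep(c("female", "male"), times = c(sum(female), sum(male)))
  )
}

# Counts taken from the tables above
merged_categories <- rebuild(female = c(54, 56, 55), male = c(57, 55, 56))
as_factors        <- rebuild(female = c(55, 55, 55), male = c(56, 56, 56))

count_spread(merged_categories$groups, merged_categories$sex)
count_spread(as_factors$groups, as_factors$sex)
```

A spread of 0 means a level is split perfectly evenly; for levels whose count is not divisible by K, a spread of 1 is the best that can be achieved.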
If we also wish to ensure that the categorical variables in their combination are balanced between groups, we must do some manual data preparation. For anticlustering, categorical variables are converted into a binary representation via “one hot” encoding. The anticlust package has the convenience function categories_to_binary() for this purpose.¹ This conversion is done internally by anticlustering() when categorical variables are part of the data input (as factors). In that case, however, combinations of categorical variables are not considered. To consider combinations, we can manually create our data set with binary categorical variables, setting the optional argument use_combinations of categories_to_binary() to TRUE. First, let’s see how we would manually encode categorical variables without considering their combinations. We will use collection year (2007, 2008, 2009) and species as categorical variables:
binary_categories <- categories_to_binary(df[, c("species", "year")], use_combinations = FALSE)
data_input <- data.frame(binary_categories, numeric_vars)
groups <- anticlustering(
data_input,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$year, df$species)
#> , , = Adelie
#>
#>
#> groups 2007 2008 2009
#> 1 15 17 17
#> 2 14 17 18
#> 3 15 16 17
#>
#> , , = Chinstrap
#>
#>
#> groups 2007 2008 2009
#> 1 9 6 8
#> 2 8 6 8
#> 3 9 6 8
#>
#> , , = Gentoo
#>
#>
#> groups 2007 2008 2009
#> 1 10 15 14
#> 2 12 15 13
#> 3 11 15 14

When setting use_combinations = TRUE, we will also balance the proportions of species collected in each year across groups, which was not explicitly done before:
binary_categories <- categories_to_binary(df[, c("species", "year")], use_combinations = TRUE)
data_input <- data.frame(binary_categories, numeric_vars)
groups <- anticlustering(
data_input,
K = 3,
method = "local-maximum",
standardize = TRUE
)
table(groups, df$year, df$species)
#> , , = Adelie
#>
#>
#> groups 2007 2008 2009
#> 1 15 16 17
#> 2 15 17 17
#> 3 14 17 18
#>
#> , , = Chinstrap
#>
#>
#> groups 2007 2008 2009
#> 1 9 6 8
#> 2 8 6 8
#> 3 9 6 8
#>
#> , , = Gentoo
#>
#>
#> groups 2007 2008 2009
#> 1 11 15 14
#> 2 11 15 14
#> 3 11 15 13

Now the year of data collection is balanced across groups as evenly as the counts allow for each of the three species. This is not accomplished when setting use_combinations = FALSE, nor when using the categories as factors, which internally sets use_combinations = FALSE.
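To see what the two encodings look like, here is a small base R sketch using model.matrix() directly on a made-up data frame (toy is hypothetical). The formulas ~ species + year and ~ species * year are my assumption of how "without" and "with" combinations can be expressed; the exact columns that categories_to_binary() returns may differ.

```r
# Hypothetical toy data: every species / year combination once
toy <- expand.grid(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  year    = c("2007", "2008", "2009")
)

# Binary coding of each variable on its own (intercept column dropped)
main_effects <- model.matrix(~ species + year, data = toy)[, -1]

# Additionally includes one column per species-by-year interaction term
with_combinations <- model.matrix(~ species * year, data = toy)[, -1]

colnames(main_effects)
colnames(with_combinations)
```

The interaction columns are what allow the anticlustering objective to "see" how often each species was observed in each year, rather than only the marginal counts of species and year.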
As of version 0.8.13, we have another option for incorporating categorical variables with anticlustering(): blocking. It is pretty similar to using the categories argument:
groups1 <- anticlustering(
numeric_vars,
K = 3,
categories = df$sex,
standardize = TRUE,
objective = "kplus",
method = "local-maximum"
)
groups2 <- anticlustering(
numeric_vars,
K = 3,
blocks = df$sex,
standardize = TRUE,
objective = "kplus",
method = "local-maximum"
)
table(df$sex, groups1)
#> groups1
#> 1 2 3
#> female 55 55 55
#> male 56 56 56
table(df$sex, groups2)
#> groups2
#> 1 2 3
#> female 55 55 55
#> male 56 56 56

There is one difference: With blocking, we also attempt to balance the numeric variables within each level of the blocking variable across groups; with the categories argument, we only attempt to achieve overall balance. As we can see:
knitr::kable(mean_sd_tab(numeric_vars[df$sex == "female", ], groups1[df$sex == "female"]), row.names = TRUE) # categories argument

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 1 | 42.29 (5.01) | 16.32 (1.75) | 197.82 (12.39) | 3883.18 (685.95) |
| 2 | 42.15 (5.07) | 16.57 (1.94) | 196.75 (12.39) | 3853.18 (623.37) |
| 3 | 41.85 (4.71) | 16.39 (1.70) | 197.53 (12.92) | 3850.45 (698.57) |

knitr::kable(mean_sd_tab(numeric_vars[df$sex == "female", ], groups2[df$sex == "female"]), row.names = TRUE) # blocks argument

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 1 | 42.07 (4.91) | 16.42 (1.81) | 197.36 (12.58) | 3865.91 (674.37) |
| 2 | 42.11 (4.95) | 16.44 (1.81) | 197.44 (12.60) | 3862.73 (671.23) |
| 3 | 42.11 (4.94) | 16.41 (1.80) | 197.29 (12.55) | 3858.18 (665.16) |

knitr::kable(mean_sd_tab(numeric_vars[df$sex == "male", ], groups1[df$sex == "male"]), row.names = TRUE) # categories argument

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 1 | 45.65 (5.45) | 17.99 (1.84) | 204.07 (14.99) | 4523.66 (795.45) |
| 2 | 45.80 (5.33) | 17.76 (1.84) | 205.09 (14.49) | 4552.68 (822.19) |
| 3 | 46.11 (5.41) | 17.92 (1.94) | 204.36 (14.41) | 4560.71 (757.72) |

knitr::kable(mean_sd_tab(numeric_vars[df$sex == "male", ], groups2[df$sex == "male"]), row.names = TRUE) # blocks argument

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 1 | 45.87 (5.40) | 17.89 (1.87) | 204.50 (14.62) | 4544.64 (791.10) |
| 2 | 45.85 (5.40) | 17.88 (1.88) | 204.48 (14.66) | 4545.54 (791.76) |
| 3 | 45.84 (5.40) | 17.90 (1.87) | 204.54 (14.62) | 4546.88 (794.30) |

Within each sex, there is increased similarity between groups when using the blocks argument. Note that this is currently only achieved with the blocks argument; it is also not achieved when using the categorical variables as factors as described above.²
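If you prefer a single overview to four separate tables, the within-level means can also be computed in one call with base R. The sketch below uses a hypothetical helper, within_block_means(), and made-up toy values; it is not part of anticlust and only shows the kind of check performed above.

```r
# Mean of each numeric variable per (block level, group) cell
within_block_means <- function(x, groups, blocks) {
  aggregate(x, by = list(block = blocks, group = groups), FUN = mean)
}

# Hypothetical toy data: one numeric variable, two blocks, two groups
toy_x      <- data.frame(mass = c(1, 2, 3, 4, 5, 6, 7, 8))
toy_blocks <- rep(c("f", "m"), each = 4)
toy_groups <- rep(1:2, times = 4)

res <- within_block_means(toy_x, toy_groups, toy_blocks)
res
```

With real data, the closer the group means are within each block level, the better the within-block balance.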
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301
¹ Internally, categories_to_binary() is a wrapper around the base R function model.matrix().
² It should be possible to manually construct the input matrix x in such a way that it includes an interaction term for the blocking variable and the other variables of interest. In this case, we should also obtain balance within each level of the blocking variable. We are already doing the same thing in categories_to_binary() (however, only with categorical variables) when setting use_combinations = TRUE. With model.matrix(), advanced users may be able to construct interaction variables for categorical and numeric variables; future versions of anticlust may offer convenience support for this option.