Optimising random forest for variable selection

Introduction

Random forest is a particularly prominent method from the field of machine learning that is frequently used for variable selection: the importance of each variable is estimated and the variables with the highest importance estimates are selected. Despite its many advantages, one should take into consideration that random forest is a non-deterministic method, which means that the variable importance estimates can vary when repeating the analysis even with the same data set. To assess the impact of this non-determinism on variable selection, the opt_importance function from the optRF package can be used, which is thoroughly described in this vignette. This function models the relationship between the number of trees and the resulting stability of the random forest and uses this model to determine the optimal number of trees, beyond which adding more trees would increase the computation time but would only marginally increase the stability.

To demonstrate the functionality of the optRF package, we will use the SNPdata data set included in the package. This data set contains the yield of 250 wheat individuals in the first column, followed by 5,000 single nucleotide polymorphisms (SNPs) that can take either the value 0 or 2 (the value 1, which would indicate heterozygous alleles, does not occur here). While this example focuses on genomic research, the optRF package is versatile and can optimise random forest for any type of data set. Before proceeding, ensure the necessary packages are loaded into your R environment. The optRF package depends on the ranger package for efficient random forest implementation.

library(ranger)
library(optRF)
SNPdata[1:5,1:5]
#>           Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
#> ID_001 670.7588        0        0        0        0
#> ID_002 542.5611        0        2        0        0
#> ID_003 591.6631        2        2        0        2
#> ID_004 476.3727        0        0        0        0
#> ID_005 635.9814        2        2        0        2

Variable importance estimation using random forest

Random forest is a powerful tool for estimating the importance of predictor variables (in this case the SNPs) based on their association with a response variable (in this case the yield). In the ranger function, variable importance can be calculated using the importance argument, which can be set to "impurity", "impurity_corrected", or "permutation". In this example, we define variable importance using the "permutation" method. To ensure reproducibility, we set a seed before running the analysis:

set.seed(123) # Set a seed for reproducibility
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], write.forest = TRUE, importance="permutation")
D_VI = data.frame(variable = names(SNPdata)[-1], importance = RF_model$variable.importance)
D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
head(D_VI)
#>          variable importance
#> SNP_0020 SNP_0020   80.22909
#> SNP_0019 SNP_0019   60.37387
#> SNP_0043 SNP_0043   50.52367
#> SNP_0005 SNP_0005   43.47999
#> SNP_0034 SNP_0034   38.52494
#> SNP_0015 SNP_0015   34.88654

The data frame D_VI now contains the variable importance scores for each predictor, sorted in descending order. These scores provide insights into the relative significance of each variable in the data set. Since random forest is a non-deterministic method, repeated analyses using the same data set can yield different variable importance scores. To illustrate this, we repeat the analysis with a different seed:

set.seed(321) # Set a different seed to demonstrate the variability
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]

We can now merge D_VI and D_VI_2 to see how much the variable importance estimates varied simply by repeating random forest after setting a different seed:

M = merge(D_VI, D_VI_2, by="variable")
plot(M$importance.x, M$importance.y,
     xlab="Variable importances in the first run", ylab="Variable importances in the second run")

cor(M$importance.x, M$importance.y)
#> [1] 0.266544

The scatter plot and the correlation coefficient (0.27) reveal substantial variability in the variable importance estimates between the two runs. This demonstrates how the non-deterministic nature of random forest can lead to inconsistencies in variable rankings.

Increasing the stability of random forest

Random forest operates by growing multiple decision trees (or regression trees, in this case) and calculating the variable importance within each tree. Since each tree is built independently, the importance values for a given variable can differ between trees. To provide a more reliable measure, random forest averages these importance values across all trees. Consequently, the stability of the variable importance estimates improves as the number of trees increases. By default, R packages like ranger, randomForest, and Rborist typically use 500 trees. However, as the correlation of 0.27 obtained above indicates, the default of 500 trees is insufficient to achieve stable variable importance estimates for the SNPdata data set.

To evaluate how the number of trees affects both variable importance stability (correlation of variable importance estimates across runs) and computation time, we can test random forests with different tree counts: 500, 2,500, 5,000, 10,000, 15,000, and 20,000 trees. The stability can be calculated as the correlation between variable importance scores obtained from two repeated runs of random forest, each using a different seed. Additionally, the average computation time for generating these random forests can be measured:

num.trees_values = c(500, 2500, 5000, 10000, 15000, 20000)
result = data.frame()
for(i in num.trees_values){
  
  start.time = Sys.time()
  
  set.seed(123)
  RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=i,
                      write.forest = TRUE, importance="permutation")
  D_VI_1 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_1$variable.importance)
  
  set.seed(321)
  RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1],  num.trees=i,
                      write.forest = TRUE, importance="permutation")
  D_VI_2 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_2$variable.importance)
  end.time = Sys.time()
  
  M = merge(D_VI_1, D_VI_2, by="variable")
  result = rbind(result, data.frame(number_of_trees = i,
                                    variable_importance_stability = cor(M$importance.x, M$importance.y),
                                    computation_time = (end.time - start.time)/2))
}
result
#>   number_of_trees variable_importance_stability computation_time
#> 1             500                     0.2665440   0.9322098 secs
#> 2            2500                     0.6022889   3.5782338 secs
#> 3            5000                     0.7274168   6.8280137 secs
#> 4           10000                     0.8496699  13.6059827 secs
#> 5           15000                     0.8962504  19.8898976 secs
#> 6           20000                     0.9178299  27.1540488 secs

As expected, the variable importance stability improves with a higher number of trees. Here, the stability rises from 0.27 at 500 trees to 0.92 at 20,000 trees. However, while the computation time increases linearly with the number of trees, the relationship between the number of trees and the variable importance stability is non-linear. We can plot the stability and computation time to better understand these trends:

par(mfrow=c(1,2))
plot(variable_importance_stability ~ number_of_trees, data=result)
plot(computation_time ~ number_of_trees, data=result)
abline(lm(result$computation_time ~ result$number_of_trees), lwd=2, col="grey")

Initially, the stability increases steeply with additional trees but eventually begins to plateau. This diminishing return indicates that beyond a certain number of trees, the stability gains become marginal. The computation time, in contrast, increases linearly with the number of trees, a predictable outcome since each additional tree contributes directly to the processing time. Together, these two patterns imply an optimal trade-off point beyond which adding more trees to random forest would increase the computation time while improving the stability only marginally.

Using optRF to optimise random forest for variable importance estimation

The optimal number of trees in a random forest model is highly data-dependent and must be estimated for each new data set. To automate this process, the optRF package provides a convenient method for determining the optimal number of trees.

To do so, it determines the stability of random forest with 250, 500, 750, 1,000, and 2,000 trees, where random forest is repeated ten times for each value and the variable importance stability is calculated as the intraclass correlation coefficient between the variable importance scores of these ten repetitions. Subsequently, it uses these data points to model the relationship between the number of trees and the corresponding stability using a two-parameter logistic regression model. Finally, it calculates the increase in stability gained with each additional tree. The optimal number of trees is then defined as the point beyond which each additional tree increases the variable importance stability by only \(10^{-6}\) or less.
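
This principle can be illustrated with the stability values measured above. The following sketch fits a simple saturating curve to the result data frame and locates the point where the estimated gain per additional tree falls below such a threshold. Note that this is only a minimal sketch: it uses an asymptotic regression model (SSasymp) and the pairwise correlations from two runs instead of the two-parameter logistic model and the intraclass correlation over ten repetitions used internally by opt_importance, so the resulting number will differ from the function's recommendation.

# Minimal sketch (not the internal optRF code): fit a saturating curve to the
# measured stability values and find where the estimated gain per additional
# tree drops below 1e-6
trend = nls(variable_importance_stability ~ SSasymp(number_of_trees, Asym, R0, lrc),
            data=result)
trees_grid = seq(500, 100000, by=100)
predicted_stability = predict(trend, newdata=data.frame(number_of_trees=trees_grid))
marginal_gain = diff(predicted_stability)/diff(trees_grid) # gain per additional tree
trees_grid[min(which(marginal_gain <= 1e-6))]              # first value below the threshold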

To determine the optimal number of trees regarding the variable importance stability, the opt_importance function from the optRF package can be used where the response variable is passed to the y argument and the predictor variables are inserted in the X argument:

set.seed(123)
optRF_result = opt_importance(y=SNPdata[,1], X=SNPdata[,-1])

Once the opt_importance function has completed, the recommended number of trees and the expected stability of the random forest model can be easily accessed using the summary function:

summary(optRF_result)
#> Recommended number of trees: 40000
#> Expected variable importance stability: 0.958165
#> Expected selection stability: 0.5856996
#> Expected computation time (sec): 53.51205

Here, the optimal number of trees to obtain stable variable importance estimates is 40,000. Using random forest with this number of trees, one can expect a variable importance stability of 0.96.

The effect of the optimal number of trees

Now that the optimal number of trees has been determined, we can repeat the analysis from above to see how the stability of random forest changed when running random forest with 40,000 trees:

set.seed(123)
RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
                   write.forest = TRUE, importance="permutation")
D_VI_1 = data.frame(variable = names(SNPdata)[-1], 
                    importance = RF_model_1$variable.importance)

set.seed(321)
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
                   write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1], 
                    importance = RF_model_2$variable.importance)

M = merge(D_VI_1, D_VI_2, by="variable")
plot(M$importance.x, M$importance.y,
     xlab="Variable importances in the first run", ylab="Variable importances in the second run")

cor(M$importance.x, M$importance.y)
#> [1] 0.959417

With 40,000 trees, the correlation between the variable importance estimates of the two repetitions is 0.96, a dramatic improvement over the correlation of 0.27 observed with 500 trees. The scatter plot of the variable importance estimates from the two runs shows a tightly clustered pattern, confirming the consistency of the results. Moreover, the high correlation aligns with the stability value predicted by opt_importance, demonstrating the reliability of the function's recommendation.

Using optRF to optimise random forest for variable selection

With stable variable importance estimates obtained, these can now be used to identify and select the most important variables from the data set. The selection process first involves defining a threshold for the number of variables to be selected. Subsequently, in each repetition of random forest, the variables with the highest importance estimates are classified as selected and the remaining variables as rejected. Fleiss' Kappa can then be used to quantify the stability of the variable selection process, as it measures the agreement among these classifications across repeated analyses.
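
As a minimal sketch of this idea (assuming the irr package, which is not part of optRF, is installed), the variables from the two runs with 40,000 trees above can be classified as selected or rejected and the agreement between the two classifications can be quantified with Fleiss' Kappa; opt_importance performs an analogous calculation internally across its ten repetitions:

# Minimal sketch of selection stability; the selection size of 250 corresponds
# to the default of selecting the top 5% of the 5,000 SNPs (see below)
library(irr)
n_top = 250
M = merge(D_VI_1, D_VI_2, by="variable")
selection_run1 = ifelse(rank(-M$importance.x) <= n_top, "selected", "rejected")
selection_run2 = ifelse(rank(-M$importance.y) <= n_top, "selected", "rejected")
kappam.fleiss(cbind(selection_run1, selection_run2))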

The number of variables selected should align with the number of truly important variables in the data set. Instead of selecting an arbitrary number, this choice should be data-driven. By default, opt_importance will determine the optimal number of trees to obtain stable selections when selecting the top 5% of variables from the data set. For this data set, that would mean selecting the 250 most important variables. However, the assumption that 250 truly important variables exist in the data set may not hold.

To better estimate the number of actually important variables in the data set, random forest can be run with the number of trees that was recommended by the opt_importance function to get stable variable importance estimates, which can be visualised using a histogram:

set.seed(123)
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
                 write.forest = TRUE, importance="permutation")
hist(RF_model$variable.importance, xlim=c(-10,50),
     main="Histogram of variable importances", xlab="")

One can see that the majority of variables have a variable importance between -5 and 5. Only a small subset of variables has importance scores above 5, likely indicating the truly important predictors which should be selected. To refine the selection process, we can visualise this threshold in the histogram and count the number of variables with a variable importance above 5:

hist(RF_model$variable.importance, ylim=c(0,20), xlim=c(-10,50),
     main="Histogram of variable importances", xlab="")
abline(v=5, col="red", lwd=4)

selection_size = sum(RF_model$variable.importance>5)
selection_size
#> [1] 39

In this data set, 39 variables showed a variable importance greater than 5. We can therefore rerun the opt_importance function, this time specifying that the 39 most important variables should be selected. This is done via the alpha argument of the function, which accepts either an absolute number of variables or a relative proportion of variables to be selected from the data set:

set.seed(123)
optRF_result_2 = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], 
                              recommendation = "selection",
                              alpha = selection_size)

The results from the new application of the opt_importance function provide an updated recommendation for the optimal number of trees which can be accessed as above:

summary(optRF_result_2)
#> Recommended number of trees: 42000
#> Expected variable importance stability: 0.9600647
#> Expected selection stability: 0.9407908
#> Expected computation time (sec): 55.85653

The results indicate that 42,000 trees are required to achieve stable selection of the 39 variables, resulting in a variable importance stability of 0.96 and a selection stability of 0.94. With the newly determined optimal number of trees, we can repeat the analysis from above to see how often the same variables would be selected in two repeated runs of random forest:

set.seed(123)
RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
                   write.forest = TRUE, importance="permutation")
D_VI_1 = data.frame(variable = names(SNPdata)[-1],
                    importance = RF_model_1$variable.importance)
D_VI_1 = D_VI_1[order(D_VI_1$importance, decreasing=TRUE),]
D_VI_1$variable_ranking = c(1:nrow(D_VI_1))
selected_variables_1 = D_VI_1[1:selection_size,1]

set.seed(321)
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
                   write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1],
                    importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
D_VI_2$variable_ranking = c(1:nrow(D_VI_2))
selected_variables_2 = D_VI_2[1:selection_size,1]
sum(selected_variables_1 %in% selected_variables_2)
#> [1] 36

36 of the 39 selected variables overlap between the two repeated runs of random forest with the recommended number of trees, meaning that 92% of the variables were selected in a stable way.

Fine-tuning the optimal number of trees for variable selection

This approach effectively estimates the number of important variables in the data set and determines the optimal number of trees necessary to achieve stable variable selection. However, it also highlights that even with optimisation, some variability persists in the selected variables across repeated random forest runs. Nonetheless, with the relationship between stability and the number of trees now established, additional functions from the optRF package can be used to further refine the analysis, including estimating computation time specific to the machine on which the opt_importance function was run.

The relationship between stability and the number of trees can be conveniently visualised using the plot_stability function. This visualisation allows users to specify the range of tree counts, the stability measure of interest (variable importance stability or selection stability), and whether the recommended number of trees should be highlighted. For instance, visualising selection stability from 0 to 100,000 trees without indicating the recommended number of trees can be achieved as follows:

plot_stability(optRF_result_2, measure="selection", 
               0, 100000, add.recommendation = FALSE, ylim=c(0, 1))
abline(h=1, col="red")

The plot demonstrates that even with 100,000 trees, perfect selection stability is not achievable. However, since the relationship between stability and number of trees is known for this data set, the number of trees required to achieve specific levels of selection stability can be estimated using the estimate_numtrees function. For example:

estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.9)
#>             selection_stability opt_numtrees computation_time
#> (Intercept)                 0.9        19658         26.25823
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.95)
#>             selection_stability opt_numtrees computation_time
#> (Intercept)                0.95        53329         70.86432
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.99)
#>             selection_stability opt_numtrees computation_time
#> (Intercept)                0.99       483566         640.8252
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.999)
#>             selection_stability opt_numtrees computation_time
#> (Intercept)               0.999     10600894         14043.87

The output of the estimate_numtrees function shows the desired selection stability together with the number of trees necessary to reach it and the computation time that the ranger function would require to run random forest with that many trees. One can see that increasing the selection stability from 0.9 to 0.95 would more than double the computation time of random forest, increasing it from 0.95 to 0.99 would increase the computation time by almost a factor of ten, and increasing it from 0.99 to 0.999 would increase the computation time by a further factor of roughly 20.

It is important to note that the computation time estimated by opt_importance is specific to the machine on which the function was run. On a different system, the computation time may vary. Additionally, these estimates are based on the ranger function which can execute random forest computations in parallel using multiple threads. To provide realistic estimates, the number of threads can be specified using the num.threads argument in both the ranger and opt_importance functions.
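
For example, the same number of threads can be passed to both functions so that the estimated and the actual computation times refer to the same parallel setting (using four threads here is an arbitrary choice):

# Use the same number of threads in opt_importance and ranger so that the
# estimated computation time matches the actual analysis (4 threads is arbitrary)
set.seed(123)
optRF_result_threads = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], num.threads=4)
RF_model_threads = ranger(y=SNPdata[,1], x=SNPdata[,-1],
                          num.trees=optRF_result_threads$recommendation,
                          importance="permutation", num.threads=4)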

Assume that after analysing the results from estimate_numtrees, we decide to use 500,000 trees for variable selection with random forest. Based on the earlier findings, this should yield a selection stability slightly above 0.99. However, to report exact stability metrics, we can use the estimate_stability function with the chosen number of trees:

estimate_stability(optRF_result_2, with_num.trees=500000)
#>             num.trees VI_stability selection_stability computation_time
#> (Intercept)     5e+05    0.9964582           0.9902447         662.5974

The results indicate that when using random forest for variable selection with 500,000 trees and this data set, the variable importance stability is 0.9965 and the selection stability is 0.9903. These values allow others to assess the reproducibility of the results and demonstrate that the analysis is reliable. Thus, the variable importance stability and the selection stability should always be reported when publishing variable selections based on random forest.

Customising the opt_importance function

This vignette illustrated an application of the opt_importance function where one could assume that only around 1% of the variables in the data set affect the response. Using this assumption, the number of trees was adjusted to enable stable variable selection. However, this step becomes unnecessary if a user already knows the exact number of variables to select. In such cases, the desired number of variables can simply be specified directly in the alpha argument of the opt_importance function.

In this example, the optimal number of trees was defined as the number of trees at which adding more trees would increase the variable importance stability by \(10^{-6}\) or less. This threshold is arbitrary and can be customised using the rec.thresh argument to fit specific requirements. Besides setting the recommendation argument to "importance" (the default) or "selection", it can also be set to "none" to analyse the relationship between the number of trees and stability without receiving a specific recommendation. The user can then manually determine the optimal number of trees using the estimate_numtrees, estimate_stability, and plot_stability functions.
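
A brief sketch of these options (the rec.thresh value of \(10^{-7}\) is only an illustration, not a recommendation):

# Use a stricter recommendation threshold than the default of 1e-6
set.seed(123)
optRF_strict = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], rec.thresh=1e-7)

# Or analyse the stability without receiving a recommendation and inspect it manually
set.seed(123)
optRF_norec = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], recommendation="none")
plot_stability(optRF_norec, measure="selection", 0, 100000, add.recommendation=FALSE)
estimate_numtrees(optRF_norec, measure="selection", for_stability=0.95)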

Users can also adjust computational settings to tailor the opt_importance function to their needs. The number.repetitions argument controls how many repetitions are performed; fewer repetitions reduce the computation time but may lead to more variable results. The num.trees_values argument specifies the numbers of trees at which the stability is evaluated; smaller ranges or fewer values speed up the computation but might reduce the precision of the stability analysis. We recommend keeping the default values for these arguments.

In addition to the arguments described here, the user can pass further arguments of the ranger function to customise the application of opt_importance. This makes it possible to adjust the num.threads argument to allow for parallel computing or to set specific values for random forest hyperparameters such as mtry, min.node.size, or sample.fraction.
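
For example (the hyperparameter values shown here are arbitrary illustrations, not tuned recommendations):

# Pass ranger arguments through opt_importance; the values are arbitrary examples
set.seed(123)
optRF_tuned = opt_importance(y=SNPdata[,1], X=SNPdata[,-1],
                             num.threads=4, mtry=1000,
                             min.node.size=5, sample.fraction=0.8)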

Conclusions

While random forest has many advantages and can easily be used to estimate the importance of the variables in a data set, it must be taken into consideration that random forest is a non-deterministic method that can lead to different variable importance estimates even when analysing the same data set. The stability of random forest depends strongly on the number of trees: increasing the number of trees improves the stability but also increases the computation time. The optRF package provides an efficient solution to this problem by modelling the relationship between the number of trees and the corresponding stability of random forest. With its opt_importance function, users can systematically evaluate this relationship for a specific data set and determine the optimal number of trees required to achieve satisfactory stability and improve reproducibility. Researchers are encouraged to evaluate the stability of the chosen number of trees to ensure reliable and reproducible results when using random forest.