Random forest is a particularly prominent method from the field of machine learning that is frequently used for variable selection, where the importance of each variable is estimated and the variables with the highest importance estimates are selected. Despite its many advantages, one should take into consideration that random forest is a non-deterministic method, which means that the variable importance estimates can vary when repeating the analysis even with the same data set. To assess the impact of non-determinism on variable selection, the opt_importance function from the optRF package can be used, which is thoroughly described in this vignette. This function models the relationship between the number of trees and the resulting stability of the random forest and uses this model to determine the optimal number of trees, beyond which adding more trees would increase the computation time but only marginally increase the stability.
To demonstrate the functionality of the optRF package, we will use the SNPdata data set included in the package. Its first column contains the yield of 250 wheat individuals; the remaining columns contain 5,000 single nucleotide polymorphisms (SNPs) that can take either the value 0 or 2 (with 1 indicating heterozygous alleles, which do not occur here). While this example focuses on genomic research, the optRF package is versatile and can optimise random forest for any type of data set. Before proceeding, ensure the necessary packages are loaded into your R environment. The optRF package depends on the ranger package for efficient random forest implementation.
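Both packages can be attached in the usual way:
library(optRF)
library(ranger)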
Random forest is a powerful tool for estimating the importance of predictor variables (in this case the SNPs) based on their association with a response variable (in this case the yield). In the ranger function, variable importance can be calculated using the importance argument, which can be set to "impurity", "impurity_corrected", or "permutation". In this example, we define variable importance using the "permutation" method. To ensure reproducibility, we set a seed before running the analysis:
set.seed(123) # Set a seed for reproducibility
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], write.forest = TRUE, importance="permutation")
D_VI = data.frame(variable = names(SNPdata)[-1], importance = RF_model$variable.importance)
D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
head(D_VI)
#> variable importance
#> SNP_0020 SNP_0020 80.22909
#> SNP_0019 SNP_0019 60.37387
#> SNP_0043 SNP_0043 50.52367
#> SNP_0005 SNP_0005 43.47999
#> SNP_0034 SNP_0034 38.52494
#> SNP_0015 SNP_0015 34.88654
The data frame D_VI now contains the variable importance scores for each predictor, sorted in descending order. These scores provide insights into the relative significance of each variable in the data set. Since random forest is a non-deterministic method, repeated analyses using the same data set can yield different variable importance scores. To illustrate this, we repeat the analysis with a different seed:
set.seed(321) # Set a different seed to show the run-to-run variability
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
We can now merge D_VI and D_VI_2 to see how much the variable importance estimates vary simply by repeating random forest with a different seed:
M = merge(D_VI, D_VI_2, by="variable")
plot(M$importance.x, M$importance.y,
xlab="Variable importances in the first run", ylab="Variable importances in the second run")
The scatter plot and the correlation coefficient (0.27) reveal substantial variability in the variable importance estimates between the two runs. This demonstrates how the non-deterministic nature of random forest can lead to inconsistencies in variable rankings.
Random forest operates by growing multiple decision trees (or regression trees, in this case) and calculating the variable importance within each tree. Since each tree is built independently, the importance values for a given variable can differ between trees. To provide a more reliable measure, random forest averages these importance values across all trees. Consequently, the stability of the variable importance estimates improves as the number of trees increases. By default, R packages like ranger, randomForest, and Rborist typically use 500 trees. However, for the SNPdata data set, the default of 500 trees was insufficient to achieve stable variable importance estimates.
To evaluate how the number of trees affects both variable importance stability (correlation of variable importance estimates across runs) and computation time, we can test random forests with different tree counts: 500, 2,500, 5,000, 10,000, 15,000, and 20,000 trees. The stability can be calculated as the correlation between variable importance scores obtained from two repeated runs of random forest, each using a different seed. Additionally, the average computation time for generating these random forests can be measured:
num.trees_values = c(500, 2500, 5000, 10000, 15000, 20000)
result = data.frame()
for(i in num.trees_values){
  start.time = Sys.time()

  # First random forest run with seed 123
  set.seed(123)
  RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=i,
                      write.forest = TRUE, importance="permutation")
  D_VI_1 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_1$variable.importance)

  # Second random forest run with seed 321
  set.seed(321)
  RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=i,
                      write.forest = TRUE, importance="permutation")
  D_VI_2 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_2$variable.importance)
  end.time = Sys.time()

  # Stability = correlation of the importance scores across the two runs;
  # the elapsed time is halved to get the average time per forest
  M = merge(D_VI_1, D_VI_2, by="variable")
  result = rbind(result, data.frame(number_of_trees = i,
                                    variable_importance_stability = cor(M$importance.x, M$importance.y),
                                    computation_time = (end.time - start.time)/2))
}
result
#> number_of_trees variable_importance_stability computation_time
#> 1 500 0.2665440 0.9322098 secs
#> 2 2500 0.6022889 3.5782338 secs
#> 3 5000 0.7274168 6.8280137 secs
#> 4 10000 0.8496699 13.6059827 secs
#> 5 15000 0.8962504 19.8898976 secs
#> 6 20000 0.9178299 27.1540488 secs
As expected, the variable importance stability improves with a higher number of trees. Here, the stability rises from 0.27 at 500 trees to 0.92 at 20,000 trees. However, while the computation time increases linearly with the number of trees, the relationship between the number of trees and the variable importance stability is non-linear. We can plot the stability and computation time to better understand these trends:
par(mfrow=c(1,2))
plot(variable_importance_stability ~ number_of_trees, data=result)
plot(computation_time ~ number_of_trees, data=result)
abline(lm(result$computation_time ~ result$number_of_trees), lwd=2, col="grey")
Initially, the stability increases steeply with additional trees but
eventually begins to plateau. This diminishing return indicates that
beyond a certain number of trees, the stability gains become marginal.
In contrast to the stability, the computation time increases linearly with an increasing number of trees. This is a predictable outcome, since each additional tree contributes directly to the processing time. Because of these two trends, an optimal trade-off point can be identified beyond which adding more trees to the random forest increases the computation time but improves the stability only marginally.
The optimal number of trees in a random forest model is highly data-dependent and must be estimated for each new data set. To automate this process, the optRF package provides a convenient method for determining the optimal number of trees. To do so, it determines the stability of random forest with 250, 500, 750, 1,000, and 2,000 trees, where random forest is repeated ten times and the variable importance stability is calculated as the intraclass correlation coefficient between the variable importance scores in these ten repetitions. Subsequently, it uses these data points to model the relationship between the number of trees and the corresponding stability using a two-parameter logistic regression model. Next, it calculates the increase in stability with each additional tree added to the random forest. The optimal number of trees is then defined as the point where adding additional trees increases the variable importance stability by only \(10^{-6}\) or less.
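As a rough sketch of the underlying idea (the exact parametrisation used internally by optRF may differ), such a two-parameter logistic model can be written as
\[ s(t) = \frac{1}{1 + e^{-(a + b\,t)}} \]
where \(s(t)\) is the expected stability with \(t\) trees and \(a\) and \(b\) are estimated from the measured stabilities. The recommended number of trees is then the smallest \(t\) at which the slope \(s'(t)\) falls below the \(10^{-6}\) threshold.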
To determine the optimal number of trees regarding the variable importance stability, the opt_importance function from the optRF package can be used, where the response variable is passed to the y argument and the predictor variables are passed to the X argument:
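set.seed(123) # Set a seed for reproducibility
optRF_result = opt_importance(y = SNPdata[,1], X = SNPdata[,-1])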
Once the opt_importance function has completed, the recommended number of trees and the expected stability of the random forest model can be easily accessed using the summary function:
summary(optRF_result)
#> Recommended number of trees: 40000
#> Expected variable importance stability: 0.958165
#> Expected selection stability: 0.5856996
#> Expected computation time (sec): 53.51205
Here, the optimal number of trees to obtain stable variable importance estimates is 40,000. Using random forest with this number of trees, one can expect a variable importance stability of 0.96.
Now that the optimal number of trees has been determined, we can repeat the analysis from above to see how the stability changes when random forest is run with 40,000 trees:
set.seed(123)
RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
write.forest = TRUE, importance="permutation")
D_VI_1 = data.frame(variable = names(SNPdata)[-1],
importance = RF_model_1$variable.importance)
set.seed(321)
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1],
importance = RF_model_2$variable.importance)
M = merge(D_VI_1, D_VI_2, by="variable")
plot(M$importance.x, M$importance.y,
xlab="Variable importances in the first run", ylab="Variable importances in the second run")
With 40,000 trees, the correlation between the variable importance estimates in the two repetitions is 0.96, a dramatic improvement over the correlation of 0.27 observed with 500 trees. The scatter plot of variable importance estimates from the two runs shows a tightly clustered pattern, confirming the consistency of the results, and the high correlation aligns with the stability value predicted by opt_importance, demonstrating the reliability of the function's recommendation.
With stable variable importance estimates obtained, these can now be used to identify and select the most important variables from the data set. The selection process first involves defining a threshold for the number of variables to be selected. Subsequently, in each repetition of random forest, the variables with the highest variable importance estimates can be classified as selected and the rest of the variables as rejected, and Fleiss' Kappa can be used to quantify the stability of the variable selection process, as it measures the agreement among these classifications in repeated analyses.
The number of variables selected should align with the number of truly important variables in the data set. Instead of selecting an arbitrary number, this choice should be data-driven. By default, opt_importance will determine the optimal number of trees to obtain stable selections when selecting the top 5% of variables from the data set. With this data set, this would mean selecting the 250 most important variables. However, the assumption that 250 important variables exist in the data set may not actually hold.
To better estimate the number of actually important variables in the data set, random forest can be run with the number of trees recommended by the opt_importance function to obtain stable variable importance estimates, which can be visualised using a histogram:
set.seed(123)
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
write.forest = TRUE, importance="permutation")
hist(RF_model$variable.importance, xlim=c(-10,50),
main="Histogram of variable importances", xlab="")
One can see that the majority of variables have a variable importance between -5 and 5. Only a small subset of variables has importance scores above 5, likely indicating actually important predictors which should be selected. To refine the selection process, we can mark this threshold in the histogram and then count the number of variables with a variable importance above 5:
hist(RF_model$variable.importance, ylim=c(0,20), xlim=c(-10,50),
main="Histogram of variable importances", xlab="")
abline(v=5, col="red", lwd=4)
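The count itself, which also defines the selection size used in the remainder of this vignette, can be computed directly:
selection_size = sum(RF_model$variable.importance > 5)
selection_size
#> [1] 39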
In this data set, 39 variables show a variable importance greater than 5. Thus, we can rerun the opt_importance function and this time specify that the 39 most important variables should be selected. This is done via the alpha argument of the function, which accepts either an absolute number of variables or a relative proportion of variables to be selected from the data set:
set.seed(123)
optRF_result_2 = opt_importance(y=SNPdata[,1], X=SNPdata[,-1],
recommendation = "selection",
alpha = selection_size)
The results from the new application of the opt_importance function provide an updated recommendation for the optimal number of trees, which can be accessed as above:
summary(optRF_result_2)
#> Recommended number of trees: 42000
#> Expected variable importance stability: 0.9600647
#> Expected selection stability: 0.9407908
#> Expected computation time (sec): 55.85653
The results indicate that 42,000 trees are required to achieve stable selection of the 39 variables, resulting in a variable importance stability of 0.96 and a selection stability of 0.94. With the newly determined optimal number of trees, we can repeat the analysis from above to see how often the same variables would be selected in two repeated runs of random forest:
set.seed(123)
RF_model_1 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
write.forest = TRUE, importance="permutation")
D_VI_1 = data.frame(variable = names(SNPdata)[-1],
importance = RF_model_1$variable.importance)
D_VI_1 = D_VI_1[order(D_VI_1$importance, decreasing=TRUE),]
D_VI_1$variable_ranking = c(1:nrow(D_VI_1))
selected_variables_1 = D_VI_1[1:selection_size,1]
set.seed(321)
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1],
importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
D_VI_2$variable_ranking = c(1:nrow(D_VI_2))
selected_variables_2 = D_VI_2[1:selection_size,1]
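The overlap between the two selections can then be quantified directly:
sum(selected_variables_1 %in% selected_variables_2)
#> [1] 36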
36 out of the 39 selected variables overlap between the two repeated runs of random forest with the recommended number of trees, which means that 92% of the variables are selected in a stable way.
This approach effectively estimates the number of important variables in the data set and determines the optimal number of trees necessary to achieve stable variable selection. However, it also highlights that even with optimisation, some variability persists in the selected variables across repeated random forest runs. Nonetheless, with the relationship between stability and the number of trees now established, additional functions from the optRF package can be used to further refine the analysis, including estimating computation time specific to the machine on which the opt_importance function was run.
The relationship between stability and the number of trees can be conveniently visualised using the plot_stability function. This visualisation allows users to specify the range of tree counts, the stability measure of interest (variable importance stability or selection stability), and whether the recommended number of trees should be highlighted. For instance, visualising the selection stability from 0 to 100,000 trees without indicating the recommended number of trees can be achieved as follows:
plot_stability(optRF_result_2, measure="selection",
0, 100000, add.recommendation = FALSE, ylim=c(0, 1))
abline(h=1, col="red")
The plot demonstrates that even with 100,000 trees, perfect selection stability is not achievable. However, since the relationship between stability and the number of trees is known for this data set, the number of trees required to achieve specific levels of selection stability can be estimated using the estimate_numtrees function. For example:
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.9)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.9 19658 26.25823
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.95)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.95 53329 70.86432
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.99)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.99 483566 640.8252
estimate_numtrees(optRF_result_2, measure="selection", for_stability = 0.999)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.999 10600894 14043.87
The output of the estimate_numtrees function shows the desired selection stability together with the number of trees necessary to reach that stability and the computation time that the ranger function would require to run random forest with that number of trees. One can see that increasing the selection stability from 0.9 to 0.95 would more than double the computation time of random forest. Increasing the selection stability from 0.95 to 0.99 would increase the computation time by almost a factor of ten, and increasing it from 0.99 to 0.999 would increase the computation time by more than a factor of 20.
It is important to note that the computation time estimated by opt_importance is specific to the machine on which the function was run. On a different system, the computation time may vary. Additionally, these estimates are based on the ranger function, which can execute random forest computations in parallel using multiple threads. To provide realistic estimates, the number of threads can be specified using the num.threads argument in both the ranger and opt_importance functions.
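As a minimal sketch (the object name optRF_result_parallel is hypothetical), running opt_importance with four threads could look as follows:
set.seed(123)
optRF_result_parallel = opt_importance(y = SNPdata[,1], X = SNPdata[,-1],
                                       num.threads = 4) # use four CPU threads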
Assume that after analysing the results from estimate_numtrees, we decide to use 500,000 trees for variable selection with random forest. Based on the earlier findings, this would yield a selection stability slightly above 0.99. However, to publish exact stability metrics, we can use the estimate_stability function with the chosen number of trees:
estimate_stability(optRF_result_2, with_num.trees=500000)
#> num.trees VI_stability selection_stability computation_time
#> (Intercept) 5e+05 0.9964582 0.9902447 662.5974
The results indicate that when using random forest for variable selection with 500,000 trees and this data set, the variable importance stability is 0.9965 and the selection stability is 0.9903. These values allow others to assess the reproducibility of the results and show that the results of the analysis are reliable. Thus, the variable importance and selection stability should always be reported when publishing decisions based on random forest results.
This vignette illustrated an application of the opt_importance function where one could assume that only around 1% of the variables in the data set affect the response. Using this assumption, the number of trees was adjusted to enable stable variable selection. However, this step becomes unnecessary if a user already knows the exact number of variables to select. In such cases, the desired number of variables can simply be specified directly in the alpha argument of the opt_importance function.
In this example, the optimal number of trees was defined as the number of trees at which adding more trees would increase the variable importance stability by \(10^{-6}\) or less. This threshold is arbitrary and can be customised using the rec.thresh argument to fit specific requirements. Besides "importance" (the default) and "selection", the recommendation argument can also be set to "none" to analyse the relationship between the number of trees and stability without receiving a specific recommendation. The user can then manually determine the optimal number of trees using the estimate_numtrees, estimate_stability, and plot_stability functions.
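A brief sketch of these options (the object names optRF_strict and optRF_explore are hypothetical):
# Stricter recommendation threshold than the default of 1e-6
set.seed(123)
optRF_strict = opt_importance(y = SNPdata[,1], X = SNPdata[,-1], rec.thresh = 1e-7)

# Fit the stability model without a recommendation and inspect it manually
set.seed(123)
optRF_explore = opt_importance(y = SNPdata[,1], X = SNPdata[,-1], recommendation = "none")
estimate_numtrees(optRF_explore, measure = "selection", for_stability = 0.95)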
Users can also adjust computational settings to tailor the opt_importance function to their needs. Using the number.repetitions argument, the number of repetitions can be controlled. Fewer repetitions will reduce computation time but may lead to more variable results. Using the num.trees_values argument, the numbers of trees to be tested can be specified. Smaller ranges or fewer values will speed up computation but might reduce the precision of the stability analysis. We recommend not adjusting these arguments.
Beyond the arguments described here, the user can also pass arguments from the ranger function to customise the application of opt_importance. This makes it possible to adjust the num.threads argument to allow for parallel computing or to set specific values for random forest hyperparameters such as mtry, min.node.size, or sample.fraction.
While random forest has many advantages and can easily be used to estimate the importance of the variables in a data set, it must be taken into consideration that random forest is a non-deterministic method that can lead to different variable importance estimates even when analysing the same data set. The stability of random forest depends strongly on the number of trees, which increases not only the stability but also the computation time. The optRF package provides an efficient solution to this problem by modelling the relationship between the number of trees and the corresponding stability of random forest. With its opt_importance function, users can systematically evaluate this relationship for a specific data set and determine the optimal number of trees required to achieve satisfactory stability and improve reproducibility. Researchers are encouraged to evaluate the stability at the chosen number of trees to ensure reliable and reproducible results when using random forest.