How to Set the Number of Trees in Random Forest
Scientific publication
T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95.
Follow this LINK to the original publication.
Random Forest: A Powerful Tool for Anyone Working With Data
What Is Random Forest?
Have you ever wished you could make better decisions using data, such as predicting the risk of diseases, forecasting crop yields, or spotting patterns in customer behaviour? That’s where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.
So why is random forest so popular? For one, it’s incredibly versatile. It works well with many kinds of data, whether numbers, categories, or both. It’s also widely used in many fields, from predicting patient outcomes in healthcare to detecting fraud in finance, from improving online shopping experiences to optimising agricultural practices.
Despite the name, random forest has nothing to do with trees in a forest, but it does use something called Decision Trees to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the “forest”), each slightly different, and then combines their results to make one final decision. It’s a bit like asking a group of experts for their opinions and then going with the majority vote.
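To make the majority-vote idea concrete, here is a tiny base R sketch (purely illustrative, not part of any package): ten simulated “trees” each cast a yes/no vote, and the ensemble goes with the majority.

> votes = sample(c("yes", "no"), size = 10, replace = TRUE)
> table(votes)                    # how the votes are distributed
> names(which.max(table(votes)))  # the ensemble's final decision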
But until recently, one question remained unanswered: How many decision trees do I actually need? If each decision tree can lead to different results, averaging many trees should lead to better and more reliable results. But how many are enough? Luckily, the optRF package answers this question!
So let’s have a look at how to optimise Random Forest for predictions and variable selection!
Making Predictions with Random Forests
To optimise and use random forest for making predictions, we can use the open-source statistics programme R. Once we open R, we have to install the two R packages “ranger”, which allows us to use random forests in R, and “optRF”, which optimises random forests. Both packages are open-source and available via the official R repository CRAN. To install and load these packages, the following lines of R code can be run:
> install.packages("ranger")
> install.packages("optRF")
> library(ranger)
> library(optRF)
Now that the packages are installed and loaded into the library, we can use the functions that these packages contain. Furthermore, we can also use the data set included in the optRF package, which is free to use under the GPL licence (just like the optRF package itself). This data set, called SNPdata, contains in its first column the yield of 250 wheat plants as well as 5,000 genomic markers (so-called single nucleotide polymorphisms, or SNPs) that can take either the value 0 or 2.
> SNPdata[1:5,1:5]
Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
ID_001 670.7588 0 0 0 0
ID_002 542.5611 0 2 0 0
ID_003 591.6631 2 2 0 2
ID_004 476.3727 0 0 0 0
ID_005 635.9814 2 2 0 2
This data set is an example of genomic data and can be used for genomic prediction, which is a crucial tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants for which we only have genomic markers.
Therefore, let’s imagine that we have 200 wheat plants where we know both the yield and the genomic markers. This is the so-called training data set. Let’s further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we split the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:
> Training = SNPdata[1:200,]
> Test = SNPdata[201:250,-1]
With these data sets, we can now look at how to make predictions using random forests!
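As a quick sanity check, we can confirm the dimensions of the two data sets: Training should contain 200 rows and 5,001 columns (the yield plus 5,000 markers), while Test should contain 50 rows and 5,000 columns (markers only):

> dim(Training)   # 200 rows, 5001 columns (yield + 5,000 SNPs)
> dim(Test)       # 50 rows, 5000 columns (SNPs only)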
First, we have to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function, we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility, although this is not necessary (we will see later why reproducibility is an issue here):
> set.seed(123)
> optRF_result = opt_prediction(y = Training[,1],
+ X = Training[,-1],
+ X_Test = Test)
Recommended number of trees: 19000
All the results from the opt_prediction function are now saved in the object optRF_result; however, the most important piece of information was already printed to the console: For this data set, we should use 19,000 trees.
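If you are curious about everything else the object contains, base R’s str function gives a compact overview of its elements (shown here without output, since the exact structure is defined by the optRF package):

> str(optRF_result, max.level = 1)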
With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Here, too, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we set the write.forest argument to TRUE and insert the optimal number of trees in the num.trees argument:
> RF_model = ranger(y = Training[,1], x = Training[,-1],
+ write.forest = TRUE, num.trees = 19000)
And that’s it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set where we have the genomic markers but don’t know the yield:
> predictions = predict(RF_model, data=Test)$predictions
> predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)
The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:
> head(predicted_Test)
ID predicted_yield
ID_201 593.6063
ID_202 596.8615
ID_203 591.3695
ID_204 589.3909
ID_205 599.5155
ID_206 608.1031
Variable Selection with Random Forests
A different approach to analysing such a data set would be to find out which variables are most important for predicting the response. In this case, the question would be which genomic markers are most important for predicting the yield. This, too, can be done with random forests!
For such a task, we don’t need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are interested in calculating the variable importance, we use the function opt_importance:
> set.seed(123)
> optRF_result = opt_importance(y=SNPdata[,1],
+ X=SNPdata[,-1])
Recommended number of trees: 40000
One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. With this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before, but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to “permutation” (other options are “impurity” and “impurity_corrected”).
> set.seed(123)
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, num.trees = 40000,
+ importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
The data frame D_VI now contains all the variables, thus all the genomic markers, together with their importance. We have also directly ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. This means that we can have a look at the most important variables using the head function:
> head(D_VI)
variable importance
SNP_0020 45.75302
SNP_0004 38.65594
SNP_0019 36.81254
SNP_0050 34.56292
SNP_0033 30.47347
SNP_0043 28.54312
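If you prefer a visual overview, a simple base R bar plot of, say, the ten most important markers works as well (a minimal sketch; any plotting package would do):

> top10 = head(D_VI, 10)
> barplot(rev(top10$importance), names.arg = rev(top10$variable),
+ horiz = TRUE, las = 1, xlab = "Permutation importance")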
And that’s it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!
Why Do We Need Optimisation?
Now that we’ve seen how easy it is to use random forest and how quickly it can be optimised, it’s time to take a closer look at what’s happening behind the scenes. Specifically, we’ll explore how random forest works and why the results might change from one run to another.
To do this, we’ll use random forest to calculate the importance of each genomic marker, but instead of optimising the number of trees beforehand, we’ll stick with the default settings of the ranger function. By default, ranger uses 500 decision trees. Let’s try it out:
> set.seed(123)
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
> head(D_VI)
variable importance
SNP_0020 80.22909
SNP_0019 60.37387
SNP_0043 50.52367
SNP_0005 43.47999
SNP_0034 38.52494
SNP_0015 34.88654
As expected, everything runs smoothly and quickly! In fact, this run was considerably faster than when we previously used 40,000 trees. But what happens if we run the very same code again, this time with a different seed?
> set.seed(321)
> RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, importance="permutation")
> D_VI2 = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model2$variable.importance)
> D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]
> head(D_VI2)
variable importance
SNP_0050 60.64051
SNP_0043 58.59175
SNP_0033 52.15701
SNP_0020 51.10561
SNP_0015 34.86162
SNP_0019 34.21317
Once again, everything appears to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That’s a significant shift! So what changed?
The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that the results can vary slightly each time you run the algorithm, even with the very same data set.

That’s where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you effectively change the random path the algorithm follows. That’s why, in our example, the most important genomic markers came out differently in each run. This behaviour, where the same process can yield different results due to internal randomness, is a classic example of non-determinism in machine learning.
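You can observe this behaviour with nothing more than base R’s random number generator; no random forest required:

> set.seed(123)
> sample(1:10, 3)   # with this seed, always the same three numbers
> set.seed(123)
> sample(1:10, 3)   # same seed again: identical result
> set.seed(321)
> sample(1:10, 3)   # different seed: a different draw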

Taming the Randomness in Random Forests
As we just saw, random forest models can produce slightly different results every time you run them, even when using the same data, because of the algorithm’s built-in randomness. So, how can we reduce this randomness and make our results more stable?
One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can “average out” the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000: you are more likely to get a reliable answer from the larger group.
With more trees, the model’s predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps tame the randomness. However, there is a catch: more trees also mean more computation time. Training a random forest with 500 trees might take a few seconds, but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer’s performance.
Crucially, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement while doubling the computation time. At some point, you hit a plateau where adding more trees yields diminishing returns: you pay more in computation time but gain very little in stability. That is why it is essential to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.
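To get a feeling for the computation-time side of this trade-off on your own machine, you can time ranger for increasing forest sizes (a rough sketch; the run times will, of course, depend on your hardware):

> for (n in c(500, 2000, 10000, 40000)) {
+   t = system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = n))
+   cat(n, "trees:", round(t["elapsed"], 1), "seconds\n")
+ }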
And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees, the number that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.
Above, we already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees, but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. To do so, we have to insert the name of the optRF object, the measure we are interested in (here, the “importance”), the interval we want to visualise on the x axis, and whether the recommended number of trees should be added:
> plot_stability(optRF_result, measure="importance",
+ from=0, to=50000, add_recommendation=FALSE)

This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is near 1 (which indicates perfectly stable results). Adding more than 40,000 trees would push the stability even closer to 1, but this increase would be only very small while the computation time would keep growing. That is why 40,000 trees is the optimal number of trees for this data set.
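To see the recommendation marked directly in the plot, the same call can be repeated with add_recommendation set to TRUE:

> plot_stability(optRF_result, measure="importance",
+ from=0, to=50000, add_recommendation=TRUE)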
The Takeaway: Optimise Random Forest to Get the Most Out of It
Random forest is a powerful ally for anyone working with data, whether you’re a researcher, analyst, student, or data scientist. It’s easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what’s happening under the hood. In this post, we’ve uncovered one of its hidden quirks: the randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you work in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.