Mixed Effects Machine Learning for High-Cardinality Categorical Variables, Part II: A Demo of the GPBoost Library
High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we empirically compared several machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] having the highest prediction accuracy. In this article, we demonstrate how the GPBoost algorithm, which combines tree-boosting with random effects, can be applied using the Python and R packages of the GPBoost library. GPBoost version 1.2.1 is used in this demo.
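To make "large relative to the sample size" concrete, here is a minimal sketch using only the Python standard library. The sample size, the number of levels, and all variable names are assumptions chosen for illustration; they do not come from the data set used in this demo.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical setting: 1000 samples and a categorical variable with up to 500 levels
n_samples = 1000
n_levels = 500
groups = [random.randrange(n_levels) for _ in range(n_samples)]

counts = Counter(groups)                  # number of samples observed per level
n_observed_levels = len(counts)           # levels that actually occur in the sample
avg_per_level = n_samples / n_observed_levels

# One-hot encoding with a reference level would add this many columns
n_dummy_columns = n_observed_levels - 1

print(f"observed levels: {n_observed_levels}")
print(f"average samples per level: {avg_per_level:.2f}")
print(f"one-hot columns needed: {n_dummy_columns}")
```

With only a handful of samples per level, estimating a separate free coefficient for every level is very noisy. This is the situation in which random effects, which shrink the level-specific effects towards zero, are useful.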
Table of contents
∘ 1 Introduction
∘ 2 Data: description, loading, and sample split
∘ 3 Training a GPBoost model
∘ 4 Choosing tuning parameters
∘ 5 Prediction
∘ 6 Interpretation
∘ 7 Further modeling options
· · 7.1 Interaction between categorical variables and other predictor variables
· · 7.2 (Generalized) linear mixed effects models
∘ 8 Conclusion and references
Applying a GPBoost model involves the following main steps:
- Define a GPModel in which one specifies the following:
  - A random effects model: grouped random effects via group_data and/or Gaussian processes via gp_coords
  - The likelihood (= distribution of the response variable conditional on fixed and random effects)
- Create a Dataset containing the response variable (label) and fixed effects predictor variables (data)
- Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
- Train the model
- Make predictions and/or interpret the trained model
In the following, we go through these points step by step.