Navigating Imbalanced Datasets with Pandas and Scikit-learn


Image by Author | Ideogram
Introduction
Imbalanced datasets, where the majority of data samples belong to one class and the remaining minority belong to others, are not that uncommon. In fact, imbalanced data can stem from a variety of real-world situations, such as fraud detection systems in banking and finance, where fraudulent transactions are much less frequent than legitimate ones, and medical diagnostics, where rare diseases arise far less often than common health conditions.
Here is the catch: imbalanced data usually makes analysis harder, especially for machine learning models, which can easily become biased toward the majority class when trained on data with a markedly unequal class distribution, in the most extreme case ending up as an almost "dummy classifier" that assigns the same class to nearly everything.
This article shows several strategies for navigating and handling imbalanced datasets using two of Python's most stellar libraries for all things data: Pandas and Scikit-learn.
Practical Guide: The Bank Marketing Dataset
To illustrate this practical guide to dealing with imbalanced data in Python, we will consider the Bank Marketing Dataset. This is an openly available imbalanced dataset containing data describing bank customers, labeled with two possible classes: whether or not the client subscribed to a term deposit ("yes" vs. "no") after receiving a marketing call from the bank.
Why is this dataset imbalanced? Because only ~11% of the customers in the dataset subscribed to a term deposit, with the remaining ~89% declining; therefore, the positive class ("yes") is markedly underrepresented.
Let’s begin by loading the dataset:
from ucimlrepo import fetch_ucirepo

# Fetch the Bank Marketing dataset from the UCI repository
bank_marketing = fetch_ucirepo(id=222)

# Separate the target labels from the rest of the features
X = bank_marketing.data.features
y = bank_marketing.data.targets

# Show some dataset metadata
print(bank_marketing.metadata)
print(bank_marketing.variables)
The first and most reasonable thing to do with a presumably imbalanced dataset is to explore its class distribution.
print("Class Distribution:")
print(y.value_counts())
print(f"Percentage: \n{y.value_counts(normalize=True) * 100}")

# Imbalance ratio (minority count / majority count)
imbalance_ratio = y.value_counts().min() / y.value_counts().max()
print(f"Imbalance ratio: {imbalance_ratio:.3f}")
To be precise, 39,922 bank customers refused to subscribe to the offered service, compared to only 5,289 customers who subscribed. That accounts for 88.3% and 11.7% of the data, respectively.
Strategy #1: Inverse Frequency-Dependent Weighting
Time to introduce some strategies for navigating imbalanced datasets. The first strategy is provided by Scikit-learn, and it consists of using classification models with built-in options for being trained on imbalanced data in a more effective and less biased fashion. The class_weight="balanced" argument adjusts instance weights inversely proportional to class frequencies, thereby giving greater weight to minority classes and compensating for class imbalance.
This code trains a balanced random forest classifier on a preprocessed version of the dataset that encodes categorical attributes via one-hot encoding (using Pandas' pd.get_dummies()).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Random forest whose class weights are inversely proportional to class frequencies
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Encode categorical variables, aligning test columns with training columns
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test).reindex(columns=X_train_encoded.columns, fill_value=0)

rf_balanced.fit(X_train_encoded, y_train.values.ravel())
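Although not part of the original walkthrough, a quick way to see how the weighted model behaves is to evaluate it on the held-out test set with Scikit-learn's classification_report, which shows per-class precision and recall and is especially informative for the minority "yes" class. A minimal sketch:

from sklearn.metrics import classification_report

# Per-class precision/recall on the held-out test set
y_pred = rf_balanced.predict(X_test_encoded)
print(classification_report(y_test.values.ravel(), y_pred))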
Strategy #2: Undersampling
Another strategy, this time led by Pandas and focused on the data preprocessing stage before training a machine learning model, is undersampling. This is a common approach to handling situations where certain classes are heavily underrepresented in the dataset, and it entails reducing the number of instances in the majority class to match that of the minority class or classes. The effectiveness of this strategy depends on whether enough instances remain after the majority class has been drastically downsampled, so that too much information from the discarded majority-class instances is not lost. While it reduces the bias of the subsequently trained model toward the majority classes, undersampling may also increase model variance and can sometimes lead to underfitting due to the loss of sufficiently informative instances.
This example shows how to apply undersampling using Pandas — note that the predictor attributes and the label are first joined for easier manipulation:
df_combined = pd.concat([X, y], axis=1)
target_col = y.columns[0]

# Split instances into majority vs. minority class/classes
df_majority = df_combined[df_combined[target_col] == 'no']
df_minority = df_combined[df_combined[target_col] == 'yes']

# Undersampling: keep as many majority instances (n) as minority ones
df_majority_downsampled = df_majority.sample(n=len(df_minority), random_state=42)
df_balanced = pd.concat([df_majority_downsampled, df_minority])

print(f"Original dataset: {len(df_combined)}")
print(f"Balanced dataset: {len(df_balanced)}")
The balanced dataset resulting from undersampling has just about 10.5K instances instead of the roughly 45K instances in the full dataset. Is this enough? The performance of the classification model you train afterwards may give you the answer.
In sum, give undersampling a go if your dataset is large enough that a sufficiently representative and diverse subset of instances will remain after undersampling.
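One way to get that answer is sketched below: train the same kind of classifier on df_balanced and inspect per-class metrics. This is a minimal illustrative sketch (names like X_bal, y_bal, and rf_under are ad hoc); in a real workflow you would typically undersample only the training split and evaluate on an untouched test set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Split the undersampled data back into predictors and label
X_bal = df_balanced.drop(columns=[target_col])
y_bal = df_balanced[target_col]

X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.2, random_state=42, stratify=y_bal)

# One-hot encode, aligning test columns with training columns
X_tr_enc = pd.get_dummies(X_tr)
X_te_enc = pd.get_dummies(X_te).reindex(columns=X_tr_enc.columns, fill_value=0)

rf_under = RandomForestClassifier(random_state=42)
rf_under.fit(X_tr_enc, y_tr)
print(classification_report(y_te, rf_under.predict(X_te_enc)))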
Strategy #3: Oversampling
Conversely, Pandas also allows oversampling the minority classes by randomly replicating instances using sampling with replacement. Use this strategy only if the minority classes are small but representative and, most importantly, in scenarios where adding duplicated instances is unlikely to introduce noise or cause problems like overfitting. Under those conditions, this technique can help mitigate model bias toward the majority classes.
df_minority_upsampled = df_minority.sample(n=len(df_majority), replace=True, random_state=42)
df_oversampled = pd.concat([df_majority, df_minority_upsampled])

print(f"Oversampled dataset: {len(df_oversampled)}")
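As a quick sanity check (a small ad hoc snippet reusing target_col from the undersampling example), you can confirm that both classes are now equally represented:

# Both classes should now appear in equal numbers
print(df_oversampled[target_col].value_counts())

Keep in mind that oversampling is normally applied only to the training portion of the data, so that duplicated instances do not leak into evaluation.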
Wrapping Up
This article examined the class imbalance problem and introduced several common strategies to navigate it using the Pandas and Scikit-learn libraries. We focused on three frequently used approaches: training balanced classification models, undersampling, and oversampling. It is worth noting that there are more techniques for dealing with imbalanced datasets out there, such as Scikit-learn's resampling tools, advanced techniques like SMOTE (Synthetic Minority Oversampling Technique), and more.
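For instance, the Pandas-based sampling above could also be done with Scikit-learn's own resampling utility. The snippet below is a minimal sketch of random oversampling with sklearn.utils.resample, reusing the df_majority, df_minority, and target_col names defined earlier:

import pandas as pd
from sklearn.utils import resample

# Randomly replicate minority instances (with replacement) until they match the majority count
df_minority_resampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
df_oversampled_sk = pd.concat([df_majority, df_minority_resampled])
print(df_oversampled_sk[target_col].value_counts())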