# Gaussian Naive Bayes, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Oct, 2024

## CLASSIFICATION ALGORITHM

`⛳️ More CLASSIFICATION ALGORITHM, explained:`

· Dummy Classifier

· K Nearest Neighbor Classifier

· Bernoulli Naive Bayes

▶ Gaussian Naive Bayes

· Decision Tree Classifier

· Logistic Regression

· Support Vector Classifier

· Multilayer Perceptron (soon!)

Building on our previous article about Bernoulli Naive Bayes, which handles binary data, we now explore Gaussian Naive Bayes for continuous data. Unlike the binary approach, this algorithm assumes each feature follows a normal (Gaussian) distribution within each class.

Here, we’ll see how Gaussian Naive Bayes handles continuous, bell-shaped data, ringing in accurate predictions, all **without getting into the intricate math** of Bayes’ Theorem.

Like other Naive Bayes variants, Gaussian Naive Bayes makes the “naive” assumption of feature independence. It assumes that the features are conditionally independent given the class label.

However, while Bernoulli Naive Bayes is suited for datasets with binary features, Gaussian Naive Bayes assumes that the features follow **a continuous normal (Gaussian)** distribution. Although this assumption may not always hold true in reality, it simplifies the calculations and often leads to surprisingly accurate results.

Throughout this article, we’ll use this artificial golf dataset (made by the author) as an example. This dataset predicts whether a person will play golf based on weather conditions.

    # IMPORTING DATASET #
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import pandas as pd
    import numpy as np

    dataset_dict = {
        'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
        'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
        'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
        'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
        'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
    }
    df = pd.DataFrame(dataset_dict)

    # Set feature matrix X and target vector y
    X, y = df.drop(columns='Play'), df['Play']

    # Split the data into training and testing sets (first half / second half, no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

    print(pd.concat([X_train, y_train], axis=1), end='\n\n')
    print(pd.concat([X_test, y_test], axis=1))

Gaussian Naive Bayes works with continuous data, assuming each feature follows a Gaussian (normal) distribution.

1. Calculate the probability of each class in the training data.
2. For each feature and class, estimate the mean and variance of the feature values within that class.
3. For a new instance:
   a. For each class, calculate the probability density function (PDF) of each feature value under the Gaussian distribution of that feature within the class.
   b. Multiply the class probability by the product of the PDF values for all features.
4. Predict the class with the highest resulting probability.
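The procedure above can be sketched from scratch in a few lines of NumPy. This is only a minimal illustration, not scikit-learn's implementation: the helper names `fit_gnb`, `gaussian_pdf` and `predict_gnb`, as well as the toy values, are assumptions introduced here.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    params = {}
    for cls in np.unique(y):
        X_cls = X[y == cls]
        params[cls] = {
            "prior": len(X_cls) / len(X),
            "mean": X_cls.mean(axis=0),
            "var": X_cls.var(axis=0),  # population variance (ddof=0)
        }
    return params

def gaussian_pdf(x, mean, var):
    """Gaussian density of x for a given mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict_gnb(params, x_new):
    """Score each class as prior * product of feature densities; return the best class and all scores."""
    scores = {
        cls: p["prior"] * np.prod(gaussian_pdf(x_new, p["mean"], p["var"]))
        for cls, p in params.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two continuous features, two classes
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.4],
                  [5.0, 8.0], [6.0, 9.0], [5.5, 7.5]])
y_toy = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])

params = fit_gnb(X_toy, y_toy)
label, scores = predict_gnb(params, np.array([5.2, 8.1]))
print(label)
print(scores)
```

The new point sits close to the second cluster, so the "Yes" score dominates. Note that this sketch skips variance smoothing, so it would divide by zero if a feature were constant within a class.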

## Transforming non-Gaussian distributed data

Remember that this algorithm naively assumes that all the input features have a Gaussian/normal distribution?

Since we aren’t really sure about the distribution of our data, especially for features that clearly don’t follow a Gaussian distribution, applying a power transformation (like Box-Cox) before using Gaussian Naive Bayes can be beneficial. This approach can help make the data more Gaussian-like, which aligns better with the assumptions of the algorithm. Note that scikit-learn’s `PowerTransformer` defaults to the Yeo-Johnson method, a close relative of Box-Cox that also accepts zero and negative values (Box-Cox itself requires strictly positive inputs, and `Rainfall` contains zeros).

    from sklearn.preprocessing import PowerTransformer

    # Initialize and fit the PowerTransformer on the training data only
    pt = PowerTransformer(standardize=True)  # Standard scaling already included
    X_train_transformed = pt.fit_transform(X_train)

    # Reuse the parameters learned on the training set for the test set
    X_test_transformed = pt.transform(X_test)
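To see what the transformation does to a clearly non-Gaussian feature, one option is to compare skewness before and after. The sketch below is an illustration only: it uses the `Rainfall` column from the dataset above, fits on all 28 rows for simplicity (the article's pipeline fits on the training half), and the choice of `scipy.stats.skew` as the yardstick is an assumption made here.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# 'Rainfall' column from the golf dataset (right-skewed, with many zeros)
rainfall = np.array([0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0,
                     0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0,
                     5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0]).reshape(-1, 1)

pt_demo = PowerTransformer(standardize=True)  # default method is 'yeo-johnson'
rainfall_transformed = pt_demo.fit_transform(rainfall)

skew_before = skew(rainfall.ravel())
skew_after = skew(rainfall_transformed.ravel())
print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
```

A skewness closer to zero means a more symmetric distribution. It does not guarantee normality, but it is usually a step toward the bell shape the algorithm assumes.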

Now we’re ready for the training.

1. **Class Probability Calculation**: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)

    from fractions import Fraction

    def calc_target_prob(attr):
        # Count the instances of each class and express each share as a fraction
        total_counts = attr.value_counts().sum()
        prob_series = attr.value_counts().apply(
            lambda x: Fraction(int(x), int(total_counts)).limit_denominator()
        )
        return prob_series

    print(calc_target_prob(y_train))
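As a quick cross-check of the class probabilities, the same numbers can be obtained directly from pandas. This sketch hard-codes the first 14 `Play` labels (the training half produced by the unshuffled 50/50 split) under the hypothetical name `y_train_demo`, so it runs on its own. Nine of those labels are "Yes" and five are "No".

```python
import pandas as pd

# First 14 'Play' labels: the training half of the golf dataset
y_train_demo = pd.Series(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                          'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'], name='Play')

# normalize=True divides each class count by the total number of instances
priors = y_train_demo.value_counts(normalize=True)
print(priors)
```

The result corresponds to P(Yes) = 9/14 and P(No) = 5/14, expressed as floats instead of fractions.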

2. **Feature Probability Calculation**: For each feature and each class, calculate the mean (μ) and standard deviation (σ) of the feature values within that class using the training data. Then, calculate the probability density using the Gaussian Probability Density Function (PDF) formula.

    def calculate_class_probabilities(X_train_transformed, y_train, feature_names):
        classes = y_train.unique()
        equations = pd.DataFrame(index=classes, columns=feature_names)

        for cls in classes:
            # Rows of the transformed training data that belong to this class
            X_class = X_train_transformed[y_train == cls]
            mean = X_class.mean(axis=0)
            std = X_class.std(axis=0)

            # Constants of the Gaussian PDF: k1 * exp(-(x - mean)^2 / k2)
            k1 = 1 / (std * np.sqrt(2 * np.pi))
            k2 = 2 * (std ** 2)

            for i, column in enumerate(feature_names):
                equation = f"{k1[i]:.3f}·exp(-(x-({mean[i]:.2f}))²/{k2[i]:.3f})"
                equations.loc[cls, column] = equation

        return equations

    # Use the function with the transformed training data
    equation_table = calculate_class_probabilities(X_train_transformed, y_train, X.columns)

    # Display the equation table
    print(equation_table)
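Each entry of the table printed above is the Gaussian PDF with the estimated mean and standard deviation of one feature within one class plugged in. Written out, with `k1` and `k2` marking the two constants computed in the code:

```latex
f(x \mid \mu, \sigma)
  = \underbrace{\frac{1}{\sigma\sqrt{2\pi}}}_{k_1}
    \exp\!\left(-\frac{(x-\mu)^2}{\underbrace{2\sigma^2}_{k_2}}\right)
```

Here μ and σ are estimated separately for every (class, feature) pair, which is why the table has one equation per cell.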

3. **Smoothing**: Gaussian Naive Bayes uses a unique smoothing approach. Unlike Laplace smoothing in other variants, it adds a tiny value (0.000000001 times the largest feature variance) to all variances. This prevents numerical instability from division by zero or very small numbers.

Given a new instance with continuous features:

1. **Probability Collection**:

For each possible class:

· Start with the probability of this class occurring (class probability).

· For each feature in the new instance, calculate the probability density function of that feature within the class.

2. **Score Calculation & Prediction**:

For each class:

· Multiply the class probability and all the collected PDF values together.

· The result is the score for this class.

· The class with the highest score is the prediction.

    from scipy.stats import norm

    def calculate_class_probability_products(X_train_transformed, y_train, X_new, feature_names, target_name):
        classes = y_train.unique()
        n_features = X_train_transformed.shape[1]

        # Create column names using actual feature names
        column_names = [target_name] + list(feature_names) + ['Product']
        probability_products = pd.DataFrame(index=classes, columns=column_names)

        for cls in classes:
            X_class = X_train_transformed[y_train == cls]
            mean = X_class.mean(axis=0)
            std = X_class.std(axis=0)

            # Class probability: share of training rows that belong to this class
            prior_prob = np.mean(y_train == cls)
            probability_products.loc[cls, target_name] = prior_prob

            # PDF value of each feature of the new instance within this class
            feature_probs = []
            for i, feature in enumerate(feature_names):
                prob = norm.pdf(X_new[0, i], mean[i], std[i])
                probability_products.loc[cls, feature] = prob
                feature_probs.append(prob)

            # Score: class probability times the product of all PDF values
            product = prior_prob * np.prod(feature_probs)
            probability_products.loc[cls, 'Product'] = product

        return probability_products

    # Assuming X_new is your new sample reshaped to (1, n_features)
    X_new = np.array([-1.28, 1.115, 0.84, 0.68]).reshape(1, -1)

    # Calculate probability products
    prob_products = calculate_class_probability_products(X_train_transformed, y_train, X_new, X.columns, y.name)

    # Display the probability product table
    print(prob_products)

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # Initialize and train the Gaussian Naive Bayes model
    gnb = GaussianNB()
    gnb.fit(X_train_transformed, y_train)

    # Make predictions on the test set
    y_pred = gnb.predict(X_test_transformed)

    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print the accuracy
    print(f"Accuracy: {accuracy:.4f}")

GaussianNB is known for its simplicity and effectiveness. The main things to remember about its parameters are:

- **priors**: This is the most notable parameter, similar to Bernoulli Naive Bayes. In most cases, you don’t need to set it manually. By default, it’s calculated from your training data, which often works well.
- **var_smoothing**: This is a stability parameter that you rarely need to adjust (the default is 0.000000001, i.e. `1e-9`).
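For illustration, here is how the two parameters could be set explicitly. The imbalanced toy data and the model names are hypothetical. The point is only to show the difference between priors learned from the class frequencies and priors supplied by hand.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced toy data: 6 samples of class 0, 2 samples of class 1
X_demo = np.array([[1.0], [1.2], [0.8], [1.1], [0.9], [1.3], [3.0], [3.4]])
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Default: priors are estimated from the class frequencies in the training data
default_model = GaussianNB().fit(X_demo, y_demo)

# Manual: force equal priors and state the (default) smoothing value explicitly
custom_model = GaussianNB(priors=[0.5, 0.5], var_smoothing=1e-9).fit(X_demo, y_demo)

print("Priors learned from data:", default_model.class_prior_)
print("Priors set manually:     ", custom_model.class_prior_)
```

With the defaults, the learned priors are the class frequencies (0.75 and 0.25 here). Overriding them mainly makes sense when the training set is known to misrepresent how often each class occurs in practice.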

The key takeaway is that this algorithm is designed to work well out-of-the-box. In most situations, you can use it without worrying about parameter tuning.

## Pros:

- **Simplicity**: Maintains the easy-to-implement and easy-to-understand trait.
- **Efficiency**: Remains swift in training and prediction, making it suitable for large-scale applications with continuous features.
- **Flexibility with Data**: Handles both small and large datasets well, adapting to the scale of the problem at hand.
- **Continuous Feature Handling**: Thrives with continuous and real-valued features, making it ideal for classification tasks where the inputs are measurements that vary on a continuum.

## Cons:

- **Independence Assumption**: Still assumes that features are conditionally independent given the class, which might not hold in all real-world scenarios.
- **Gaussian Distribution Assumption**: Works best when feature values truly follow a normal distribution. Non-normal distributions may lead to suboptimal performance (though this can be mitigated with the Power Transformation we’ve discussed).
- **Sensitivity to Outliers**: Can be significantly affected by outliers in the training data, as they skew the mean and variance calculations.
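The outlier sensitivity is easy to see in isolation. In this small sketch, the readings are hypothetical values for one feature within one class, and a single extreme value is appended to show how far the estimated mean and standard deviation move.

```python
import numpy as np

# Hypothetical readings of one feature within one class
clean = np.array([2.1, 1.5, 3.3, 2.0, 2.7, 1.6, 3.0])

# The same readings plus a single extreme value
with_outlier = np.append(clean, 30.3)

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:>12}: mean={values.mean():.2f}, std={values.std():.2f}")
```

Because the Gaussian PDF for that class is built entirely from these two numbers, one extreme training value widens and shifts the whole bell curve.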

Gaussian Naive Bayes stands as an efficient classifier for a wide range of applications involving continuous data. Its ability to handle real-valued features extends Naive Bayes beyond problems with binary features, making it a go-to choice for numerous applications.

While it makes some assumptions about data (feature independence and normal distribution), when these conditions are met, it gives robust performance, making it a favorite among both beginners and seasoned data scientists for its balance of simplicity and power.

    import pandas as pd
    from sklearn.naive_bayes import GaussianNB
    from sklearn.preprocessing import PowerTransformer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Load the dataset
    dataset_dict = {
        'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
        'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
        'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
        'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
        'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
    }
    df = pd.DataFrame(dataset_dict)

    # Prepare data for model (encode 'Yes' as 1 and 'No' as 0)
    X, y = df.drop('Play', axis=1), (df['Play'] == 'Yes').astype(int)

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

    # Apply PowerTransformer (fit on the training set, reuse on the test set)
    pt = PowerTransformer(standardize=True)
    X_train_transformed = pt.fit_transform(X_train)
    X_test_transformed = pt.transform(X_test)

    # Train the model
    nb_clf = GaussianNB()
    nb_clf.fit(X_train_transformed, y_train)

    # Make predictions
    y_pred = nb_clf.predict(X_test_transformed)

    # Check accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")