Naive Bayes Classification. In-depth explanation of the Naive Bayes… | by Dr. Roi Yehoshua | Jun, 2023

The event models described above can also be combined when we have a heterogeneous data set, i.e., a data set that contains different types of features (for example, both categorical and continuous features).
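One possible way to combine two event models is sketched below. It assumes the GaussianNB and CategoricalNB classes from sklearn.naive_bayes (listed just below) and a small synthetic data set that is made up purely for illustration. Under the naive independence assumption the posteriors of the two feature groups can be multiplied, as long as the class prior, which both models include, is counted only once:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, CategoricalNB

# Synthetic (made-up) data: two continuous features and one binary categorical feature
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_cont = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
X_cat = (rng.random((n, 1)) < np.where(y[:, None] == 1, 0.8, 0.2)).astype(int)

# Fit one model per feature type
gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

# log P(y|x) = log P(y|x_cont) + log P(y|x_cat) - log P(y) + const
log_prior = np.log(gnb.class_prior_)
log_post = gnb.predict_log_proba(X_cont) + cnb.predict_log_proba(X_cat) - log_prior

y_pred = gnb.classes_[np.argmax(log_post, axis=1)]
accuracy = np.mean(y_pred == y)
print(f'Training accuracy of the combined model: {accuracy:.3f}')
```

This is only one way to do it. Another common option is to discretize the continuous features and use a single CategoricalNB model.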

The module sklearn.naive_bayes provides implementations for all four Naive Bayes classifiers mentioned above:

  1. BernoulliNB implements the Bernoulli Naive Bayes model.
  2. CategoricalNB implements the categorical Naive Bayes model.
  3. MultinomialNB implements the multinomial Naive Bayes model.
  4. GaussianNB implements the Gaussian Naive Bayes model.

The first three classes accept a parameter called alpha that defines the smoothing parameter (by default it is set to 1.0).
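To make the role of alpha concrete, here is a minimal sketch on a made-up count matrix. It checks that the feature probabilities learned by MultinomialNB match the smoothed estimate (N_yj + α) / (N_y + α·n), where N_yj is the total count of feature j in class y, N_y is the total count of all features in class y, and n is the number of features:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (3 documents, 3 words), invented for illustration
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 3, 1]])
y = np.array([0, 0, 1])

alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Manual smoothed estimate of P(x_j | y)
counts = np.vstack([X[y == c].sum(axis=0) for c in clf.classes_])
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])

print(np.allclose(np.exp(clf.feature_log_prob_), manual))
```

Note that the third word never appears in class 0, yet its smoothed probability is not zero.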

In the following demonstration we will use MultinomialNB to solve a document classification task. The data set we are going to use is the 20 newsgroups dataset, which consists of 18,846 newsgroups posts, partitioned (nearly) evenly across 20 different topics. This data set has been widely used in research on text applications in machine learning, including document classification and clustering.

Loading the Data Set

You can use the function fetch_20newsgroups() in Scikit-Learn to download the text documents with their labels. You can either download all the documents as one group, or download the training set and the test set separately (using the subset parameter). The split between the training and the test sets is based on messages posted before or after a specific date.

By default, the text documents contain some metadata such as headers (e.g., the date of the post), footers (signatures) and quotes to other posts. Since these features are not relevant for the text classification task, we will strip them out by using the remove parameter:

from sklearn.datasets import fetch_20newsgroups

train_set = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test_set = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

Note that the first time you call this function it may take a few minutes to download all the documents, after which they will be cached locally in the folder ~/scikit_learn_data.

The output of the function is a dictionary-like object that contains the following attributes:

  • data: the set of documents
  • target: the target labels
  • target_names: the names of the document categories

Let’s store the documents and their labels in proper variables:

X_train, y_train = train_set.data, train_set.target
X_test, y_test = test_set.data, test_set.target

Data Exploration

Let’s do some basic exploration of the data. The number of documents we have in the training and the test sets is:

print('Documents in training set:', len(X_train))
print('Documents in test set:', len(X_test))
Documents in training set: 11314
Documents in test set: 7532

A simple calculation shows that 60% of the documents belong to the training set, and 40% to the test set.
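The calculation can be spelled out in a couple of lines, using the two document counts printed above:

```python
n_train, n_test = 11314, 7532
n_total = n_train + n_test

train_frac = n_train / n_total
test_frac = n_test / n_total

print(f'Total documents: {n_total}')
print(f'Training fraction: {train_frac:.2%}, test fraction: {test_frac:.2%}')
```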

Let’s print the list of categories:

categories = train_set.target_names
print(categories)

As evident, some of the categories are closely related to each other (e.g., comp.sys.mac.hardware and comp.sys.ibm.pc.hardware), while others are highly uncorrelated (e.g., sci.electronics and …).

Finally, let’s examine one of the documents in the training set (e.g., the first one):

print(X_train[0])
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Unsurprisingly, the label of this document is:

print(train_set.target_names[y_train[0]])
rec.autos


Converting Text to Vectors

In order to feed text documents into machine learning models, we first need to convert them into vectors of numerical values (i.e., vectorize the text). This process typically involves preprocessing and cleaning of the text, and then choosing a suitable numerical representation for the words in the text.

Text preprocessing consists of various steps, among which the most common ones are:

  1. Cleaning and normalizing the text. This includes removing punctuation marks and special characters, and converting the text into lower-case.
  2. Text tokenization, i.e., splitting the text into individual words or phrases.
  3. Removal of stop words. Stop words are a set of commonly used words in a given language. For example, stop words in English include words like “the”, “a”, “is”, “and”. These words are usually filtered out since they do not carry useful information.
  4. Stemming or lemmatization. Stemming reduces the word to its lexical root by removing or replacing its suffix, while lemmatization reduces the word to its canonical form (lemma) and also takes into account the context of the word (its part-of-speech). For example, the word computers has the lemma computer, but its lexical root is comput.

The following example demonstrates these steps on a given sentence:

Text preprocessing example
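The four steps can also be sketched in plain Python. The stop-word list and the suffix-stripping rule below are deliberately tiny and hypothetical; they only illustrate the idea and are not a substitute for a real stemmer such as the ones provided by NLTK:

```python
import re

# A tiny illustrative stop-word list (real lists are much longer)
STOP_WORDS = {"the", "a", "is", "and", "are", "on", "of"}

def naive_stem(word):
    """Toy stemmer: strips a few common suffixes (not the Porter algorithm)."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Clean and normalize: lower-case and replace punctuation with spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # 2. Tokenize on whitespace
    tokens = text.split()
    # 3. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stem each remaining token
    return [naive_stem(t) for t in tokens]

tokens = preprocess("The computers are running, and the user is happy!")
print(tokens)  # ['comput', 'runn', 'user', 'happy']
```

The root “runn” shows why real stemmers apply more careful rules than simple suffix removal.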

After cleaning the text, we need to choose how to vectorize it into a numerical vector. The most common approaches are:

  1. Bag-of-words (BOW) model. In this model, each document is represented by a word counts vector (similar to the one we have used in the spam filter example).
  2. TF-IDF (Term Frequency times Inverse Document Frequency) measures how relevant a word is to a document by multiplying two metrics:
    (a) TF (Term Frequency): how many times the word appears in the document.
    (b) IDF (Inverse Document Frequency): the inverse of the frequency with which the word appears in documents across the entire corpus.
    The idea is to decrease the weight of words that occur frequently in the corpus, while increasing the weight of words that occur rarely (and thus are more indicative of the document’s category).
  3. Word embeddings. In this approach, words are mapped into real-valued vectors in such a way that words with similar meaning have close representation in the vector space. This model is typically used in deep learning and will be discussed in a future post.
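To see how the first two representations relate, the sketch below starts from bag-of-words counts on a made-up three-document corpus and computes TF-IDF by hand. It assumes the variant that Scikit-Learn uses by default, namely a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector, and then compares the result with TfidfVectorizer:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ate the fish"]

# Bag-of-words: one row per document, one column per vocabulary word
counts = CountVectorizer().fit_transform(docs).toarray()

# IDF: words that appear in fewer documents get a larger weight
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF

# TF-IDF = TF * IDF, then scale each document vector to unit length
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf, reference))
```

The word “the” appears in every document, so it receives the smallest IDF weight in this corpus.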

Scikit-Learn provides the following two transformers, which support both text preprocessing and vectorization:

  1. CountVectorizer uses the bag-of-words model.
  2. TfidfVectorizer uses the TF-IDF representation.

Important hyperparameters of these transformers include:

  • lowercase: whether to convert all the characters to lowercase before tokenizing (defaults to True).
  • token_pattern: the regular expression used to define what is a token (the default regex selects tokens of two or more alphanumeric characters).
  • stop_words: if ‘english’, uses a built-in stop word list for English. If None (the default), no stop words will be used. You can also provide your own custom stop words list.
  • max_features: if not None, build a vocabulary that includes only the top max_features terms with the highest term frequency across the training corpus. Otherwise, all the features are used (this is the default).

Note that these transformers do not provide advanced preprocessing techniques such as stemming or lemmatization. To apply these techniques, you will have to use other libraries such as NLTK (Natural Language Toolkit) or spaCy.

Since Naive Bayes models are known to work better with TF-IDF representations, we will use the TfidfVectorizer to convert the documents in the training set into TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)

The shape of the extracted TF-IDF vectors is:

print(X_train_vec.shape)
(11314, 101322)

That is, there are 101,322 unique tokens in the vocabulary of the corpus. We can examine these tokens by calling the method get_feature_names_out() of the vectorizer:

vocab = vectorizer.get_feature_names_out()
print(vocab[50000:50010]) # pick a subset of the tokens
['innacurate' 'innappropriate' 'innards' 'innate' 'innately' 'inneficient'
'inner' 'innermost' 'innertubes' 'innervation']

Evidently, there was no automatic spell checker back in the 90s 🙂

The TF-IDF vectors are very sparse, with an average of 67 non-zero components out of more than 100,000:

print(X_train_vec.nnz / X_train_vec.shape[0])

Let’s also vectorize the documents in the test set (note that on the test set we call the transform method instead of fit_transform):

X_test_vec = vectorizer.transform(X_test)

Building the Model

Let’s now build a multinomial Naive Bayes classifier and fit it to the training set:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.01)
clf.fit(X_train_vec, y_train)

Note that we need to set the smoothing parameter α to a very small number, since the TF-IDF values are scaled to be between 0 and 1, so the default α = 1 would cause a dramatic shift of the values.
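The shift can be illustrated on a made-up four-document corpus. The sketch below compares, for one class, the ratio between the largest and the smallest feature probability under the two values of α. The helper prob_spread is a hypothetical name introduced here for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["gun law gun", "gun rights law", "god church faith", "faith god bible"]
labels = [0, 0, 1, 1]
X = TfidfVectorizer().fit_transform(docs)

def prob_spread(alpha):
    """Ratio between the largest and smallest P(x_j | y=0) for a given alpha."""
    clf = MultinomialNB(alpha=alpha).fit(X, labels)
    probs = np.exp(clf.feature_log_prob_[0])
    return probs.max() / probs.min()

spread_small = prob_spread(0.01)
spread_default = prob_spread(1.0)
print(f'alpha=0.01: {spread_small:.1f}, alpha=1.0: {spread_default:.1f}')
```

With α = 1 the smoothing term is as large as the TF-IDF sums themselves, so the probabilities of informative and uninformative words end up much closer to each other.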

Evaluating the Model

Next, let’s evaluate the model on both the training and the test sets.

The accuracy and F1 score of the model on the training set are:

from sklearn.metrics import f1_score

accuracy_train = clf.score(X_train_vec, y_train)
y_train_pred = clf.predict(X_train_vec)
f1_train = f1_score(y_train, y_train_pred, average='macro')

print(f'Accuracy (train): {accuracy_train:.4f}')
print(f'F1 score (train): {f1_train:.4f}')

Accuracy (train): 0.9595
F1 score (train): 0.9622

And the accuracy and F1 score on the test set are:

accuracy_test = clf.score(X_test_vec, y_test)
y_test_pred = clf.predict(X_test_vec)
f1_test = f1_score(y_test, y_test_pred, average='macro')

print(f'Accuracy (test): {accuracy_test:.4f}')
print(f'F1 score (test): {f1_test:.4f}')

Accuracy (test): 0.7010
F1 score (test): 0.6844
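The option average='macro' used above means that the F1 score is computed separately for each class and then averaged with equal weights, regardless of the class sizes. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 score per class
macro_f1 = f1_score(y_true, y_pred, average='macro')    # their unweighted mean
accuracy = accuracy_score(y_true, y_pred)

print('Per-class F1:', np.round(per_class_f1, 4))
print(f'Macro F1: {macro_f1:.4f}, accuracy: {accuracy:.4f}')
```

This is why the accuracy and the macro F1 score can differ: accuracy counts every document equally, while macro F1 counts every class equally.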

The scores on the test set are relatively low compared to the training set. To investigate where the errors come from, let’s plot the confusion matrix of the test documents:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(ax=ax, cmap='Blues')

The confusion matrix on the test set

As we can see, most of the confusions occur between highly correlated topics, for example:

  • 74 confusions between topic 0 (alt.atheism) and topic 15 (soc.religion.christian)
  • 92 confusions between topic 18 (talk.politics.misc) and topic 16 (talk.politics.guns)
  • 89 confusions between topic 19 (talk.religion.misc) and topic 15 (soc.religion.christian)
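Instead of reading such pairs off the plot, the largest off-diagonal cells can be extracted programmatically. The helper below, top_confusions, is a hypothetical function written for this illustration and is shown on made-up labels. It treats each (true, predicted) cell as a separate confusion:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def top_confusions(y_true, y_pred, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cm = confusion_matrix(y_true, y_pred)
    np.fill_diagonal(cm, 0)                      # ignore correct predictions
    flat = np.argsort(cm, axis=None)[::-1][:k]   # largest cells first
    rows, cols = np.unravel_index(flat, cm.shape)
    return [(int(r), int(c), int(cm[r, c])) for r, c in zip(rows, cols)]

y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [1, 1, 0, 1, 0, 2, 2, 0]

print(top_confusions(y_true_demo, y_pred_demo, k=1))  # [(0, 1, 2)]
```

On the newsgroups data, the same function could be called with y_test and y_test_pred.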

In light of these findings, it seems that the Naive Bayes classifier did a pretty good job. Let’s examine how it compares to other standard classification algorithms.


We will benchmark the Naive Bayes model against four other classifiers: logistic regression, KNN, random forest and AdaBoost.

Let’s first write a function that gets a set of classifiers and evaluates them on the given data set and also measures their training time:

import time

def benchmark(classifiers, names, X_train, y_train, X_test, y_test, verbose=True):
    evaluations = []

    for clf, name in zip(classifiers, names):
        evaluation = {}
        evaluation['classifier'] = name

        # Measure the training time
        start_time = time.time()
        clf.fit(X_train, y_train)
        evaluation['training_time'] = time.time() - start_time

        # Evaluate on the test set
        evaluation['accuracy'] = clf.score(X_test, y_test)
        y_test_pred = clf.predict(X_test)
        evaluation['f1_score'] = f1_score(y_test, y_test_pred, average='macro')

        if verbose:
            print(evaluation)
        evaluations.append(evaluation)
    return evaluations

We will now call this function with our five classifiers:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

classifiers = [clf, LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), AdaBoostClassifier()]
names = ['Multinomial NB', 'Logistic Regression', 'KNN', 'Random Forest', 'AdaBoost']

evaluations = benchmark(classifiers, names, X_train_vec, y_train, X_test_vec, y_test)

The output we get is:

{'classifier': 'Multinomial NB', 'training_time': 0.06482672691345215, 'accuracy': 0.7010090281465746, 'f1_score': 0.6844389919212164}
{'classifier': 'Logistic Regression', 'training_time': 39.38498568534851, 'accuracy': 0.6909187466808284, 'f1_score': 0.6778246092753284}
{'classifier': 'KNN', 'training_time': 0.003989696502685547, 'accuracy': 0.08218268720127456, 'f1_score': 0.07567337211476842}
{'classifier': 'Random Forest', 'training_time': 43.847145318984985, 'accuracy': 0.6233404142326076, 'f1_score': 0.6062667217793061}
{'classifier': 'AdaBoost', 'training_time': 6.09197473526001, 'accuracy': 0.36563993627190655, 'f1_score': 0.40123307742451064}

Let’s plot the accuracy and F1 scores of the classifiers:

import pandas as pd

df = pd.DataFrame(evaluations).set_index('classifier')

df.plot.barh(y='accuracy', legend=False)
plt.xlabel('Accuracy (test)')

Accuracy scores on the test set

df.plot.barh(y='f1_score', legend=False)
plt.xlabel('F1 score (test)')

F1 scores on the test set

Multinomial NB achieves both the highest accuracy and F1 scores. Notice that the classifiers were used with their default parameters without any tuning. For a fairer comparison, the algorithms should be compared after fine-tuning their hyperparameters. In addition, some algorithms such as KNN suffer from the curse of dimensionality, and dimensionality reduction is required in order to make them work well.
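As an illustration of what such tuning could look like, the sketch below uses GridSearchCV to search over the smoothing parameter of a TF-IDF + MultinomialNB pipeline. The tiny corpus and the grid of α values are made up for the example; on the real data set you would pass the newsgroups documents and use more folds:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus: 4 car-related and 4 hockey-related sentences
docs = [
    "the engine and the brakes of this car", "new tires for my car engine",
    "car dealer sold the old engine", "brakes and tires for the car",
    "the team won the hockey game", "players scored in the game",
    "hockey season and the team", "the game went to the players",
]
labels = [0] * 4 + [1] * 4

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Hypothetical grid of smoothing values; cv=2 only because the corpus is tiny
param_grid = {'nb__alpha': [0.01, 0.1, 1.0]}
grid = GridSearchCV(pipe, param_grid, cv=2, scoring='f1_macro')
grid.fit(docs, labels)

print('Best alpha:', grid.best_params_['nb__alpha'])
print(f'Best cross-validated macro F1: {grid.best_score_:.4f}')
```

Putting the vectorizer inside the pipeline ensures that it is refitted on the training part of every fold, so no information leaks from the validation part.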

Let’s also plot the training times of the classifiers:

df.plot.barh(y='training_time', legend=False)
plt.xlabel('Training time (sec)')
Training time of the different classifiers

The training of Multinomial NB is so fast that we cannot even see its time in the graph! By inspecting the function’s output from above, we can see that its training time is only about 0.065 seconds. Note that the training of KNN is also very fast (since no model is actually built), but its prediction time (not shown) is very slow.

In conclusion, Multinomial NB achieved the best accuracy and F1 score of the five classifiers, and only KNN (which builds no model) trained faster.

Finding the Most Informative Features

The Naive Bayes model also allows us to get the most informative features of each class, i.e., the features with the highest likelihood P(xⱼ|y).

The MultinomialNB class has an attribute named feature_log_prob_, which provides the log probability of the features for each class in a matrix of shape (n_classes, n_features).

Using this attribute, let’s write a function to find the 10 most informative features (tokens) in each class:

import numpy as np

def show_top_n_features(clf, vectorizer, categories, n=10):
    feature_names = vectorizer.get_feature_names_out()

    for i, category in enumerate(categories):
        # Indices of the n features with the highest log probability in class i
        top_n = np.argsort(clf.feature_log_prob_[i])[-n:]
        print(f"{category}: {' '.join(feature_names[top_n])}")

show_top_n_features(clf, vectorizer, train_set.target_names)

The output we get is:

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does like circuit power use
sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg
sci.space: just lunar earth shuttle like moon launch orbit nasa space
soc.religion.christian: believe faith christian christ bible people christians church jesus god
talk.politics.guns: just law firearms government fbi don weapons people guns gun
talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel
talk.politics.misc: know state clinton president just think tax don government people
talk.religion.misc: think don koresh objective christians bible people christian jesus god

Most of the words seem to be strongly correlated with their corresponding category. However, there are a few generic words such as “just” and “does” that do not provide valuable information. This suggests that our model may be improved by having a better stop-words list. Indeed, Scikit-Learn recommends not to use its own default list, quoting from its documentation: “There are several known issues with ‘english’ and you should consider an alternative”. 😲

Let’s summarize the pros and cons of Naive Bayes as compared to other classification models:

Pros:

  • Extremely fast both in training and prediction
  • Provides class probability estimates
  • Can be used both for binary and multi-class classification problems
  • Requires a small amount of training data to estimate its parameters
  • Highly interpretable
  • Highly scalable (the number of parameters is linear in the number of features)
  • Works well with high-dimensional data
  • Robust to noise (the noisy samples are averaged out when estimating the conditional probabilities)
  • Can deal with missing values (the missing values are ignored when computing the likelihoods of the features)
  • No hyperparameters to tune (except for the smoothing parameter, which is rarely changed)


Cons:

  • Relies on the Naive Bayes assumption, which does not hold in many real-world domains
  • Correlation between the features can degrade the performance of the model
  • Usually outperformed by more complex models
  • The zero frequency problem: if a categorical feature has a category that was not observed in the training set, the model will assign a zero probability to its occurrence. Smoothing alleviates this problem but does not solve it entirely.
  • Cannot handle continuous attributes without discretization or making assumptions on their distribution
  • Can be used only for classification tasks

This is the longest article I’ve written on Medium so far. I hope you enjoyed reading it at least as much as I enjoyed writing it. Let me know in the comments if something was not clear.

You can find the code examples of this article on my GitHub:

All images unless otherwise noted are by the author.

The 20 newsgroups data set info:
