Topics per Class Using BERTopic. How to understand the differences in… | by Mariya Mansurova | Sep, 2023

How to understand the differences in texts by categories

Photo by Fas Khan on Unsplash

Nowadays, working in product analytics, we face a lot of free-form texts:

  • Users leave comments in the AppStore, Google Play or other services;
  • Clients reach out to our Customer Support and describe their problems using natural language;
  • We launch surveys ourselves to get even more feedback, and in most cases, there are some free-form questions to get a better understanding.

We have hundreds of thousands of texts. It would take years to read them all and get some insights. Luckily, there are a lot of DS tools that could help us automate this process. One such tool is Topic Modelling, which I would like to discuss today.

Basic Topic Modelling can give you an understanding of the main topics in your texts (for example, reviews) and their mixture. But it’s challenging to make decisions based on one point. For example, 14.2% of reviews are about too many ads in your app. Is it bad or good? Should we look into it? To tell the truth, I have no idea.

But if we try to segment customers, we may learn that this share is 34.8% for Android users and 3.2% for iOS. Then, it’s apparent that we need to investigate whether we show too many ads on Android or why Android users’ tolerance to ads is lower.
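To make the segmentation idea concrete, here is a minimal sketch in plain Python. The toy records, the `platform` field and the `has_ads_topic` flag are hypothetical; they only illustrate how one overall share can hide very different shares per segment.

```python
from collections import defaultdict

def topic_share_by_segment(records, segment_key, flag_key):
    """Return {segment: share of records where flag_key is True, in %}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[segment_key]] += 1
        if rec[flag_key]:
            hits[rec[segment_key]] += 1
    return {seg: 100.0 * hits[seg] / totals[seg] for seg in totals}

# hypothetical toy data: 10 Android and 10 iOS reviews
reviews = (
    [{"platform": "Android", "has_ads_topic": i < 4} for i in range(10)]
    + [{"platform": "iOS", "has_ads_topic": i < 1} for i in range(10)]
)

# one number for everyone vs. one number per segment
overall_share = 100.0 * sum(r["has_ads_topic"] for r in reviews) / len(reviews)
shares = topic_share_by_segment(reviews, "platform", "has_ads_topic")
print(overall_share, shares)
```

The overall share sits between the two segments, and only the split shows which segment is worth investigating.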

That’s why I would like to share not only how to build a topic model but also how to compare topics across categories. In the end, we will get an insightful graph like this for each topic.

Graph by author

The most common real-life cases of free-form texts are some kind of reviews. So, let’s use a dataset with hotel reviews for this example.

I’ve filtered comments related to several hotel chains in London.

Before starting text analysis, it’s worth getting an overview of our data. In total, we have 12 890 reviews on 7 different hotel chains.

Graph by author

Now we have data and can apply our new fancy tool, Topic Modelling, to get insights from it. As I mentioned in the beginning, we will use Topic Modelling and a powerful and easy-to-use BERTopic package (documentation) for this text analysis.

You might wonder what Topic Modelling is. It’s an unsupervised ML technique related to Natural Language Processing. It allows you to find hidden semantic patterns in texts (usually called documents) and assign “topics” to them. You don’t need to have a list of topics beforehand. The algorithm will define them automatically, usually in the form of a bag of the most important words (tokens) or N-grams.

BERTopic is a package for Topic Modelling using HuggingFace transformers and class-based TF-IDF. BERTopic is a highly flexible modular package, so you can tailor it to your needs.

Image from BERTopic docs (source)

If you want to understand better how it works, I advise you to watch this video from the author of the library.

You can find the full code on GitHub.

According to the documentation, we typically don’t need to preprocess data unless there is a lot of noise, for example, HTML tags or other markup that doesn’t add meaning to the documents. It’s a significant advantage of BERTopic because, for many NLP methods, there is a lot of boilerplate to preprocess your data. If you are interested in how it could look, see this guide for Topic Modelling using LDA.

You can use BERTopic with data in multiple languages by specifying BERTopic(language= "multilingual"). However, from my experience, the model works a bit better with texts translated into one language. So, I will translate all comments into English.

For translation, we will use the deep-translator package (you can install it from PyPI).

Also, it could be interesting to see the distribution by languages; for that, we could use the langdetect package.

import langdetect
from deep_translator import GoogleTranslator

def get_language(text):
    try:
        return langdetect.detect(text)
    except KeyboardInterrupt as e:
        # let the user stop a long-running job
        raise e
    except Exception:
        return '<-- ERROR -->'

def get_translation(text):
    try:
        return GoogleTranslator(source='auto', target='en')\
            .translate(str(text))
    except KeyboardInterrupt as e:
        raise e
    except Exception:
        return '<-- ERROR -->'

# assumes the raw review text is stored in df.reviews
df['language'] = df.reviews.map(get_language)
df['reviews_transl'] = df.reviews.map(get_translation)

In our case, 95+% of comments are already in English.

Graph by author

To understand our data better, let’s look at the distribution of review length. It shows that there are a lot of extremely short (and most likely not meaningful) comments: around 5% of reviews are shorter than 20 symbols.

Graph by author

We can look at the most common examples to make sure that there’s not much information in such comments.

df.reviews_transl.map(lambda x: x.lower().strip()).value_counts().head(10)

none 74
<-- error --> 37
nice resort 12
excellent 8
glorious worth for cash 7
good worth for cash 7
superb resort 6
glorious resort 6
nice location 6
very good resort 5

So we can filter out all comments shorter than 20 symbols: 556 out of 12 890 reviews (4.3%). Then, we will analyse only long statements with more context. It’s an arbitrary threshold based on examples; you can try a couple of levels and see what texts are filtered out.

It’s worth checking whether this filter disproportionately affects some hotels. Shares of short comments are pretty close for different categories. So, the data looks OK.
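The filter and the per-hotel check described above can be sketched in plain Python as follows. The record layout (`hotel` and `text` keys) and the toy reviews are hypothetical; with a DataFrame, you would do the same with a boolean mask and a groupby.

```python
from collections import Counter

MIN_LENGTH = 20  # the same arbitrary threshold as in the text

def split_by_length(reviews, min_length=MIN_LENGTH):
    """Split reviews into (kept, dropped) by text length in symbols."""
    kept = [r for r in reviews if len(r["text"].strip()) >= min_length]
    dropped = [r for r in reviews if len(r["text"].strip()) < min_length]
    return kept, dropped

def short_share_by_hotel(reviews, min_length=MIN_LENGTH):
    """Share of too-short reviews per hotel, in %."""
    totals = Counter(r["hotel"] for r in reviews)
    short = Counter(
        r["hotel"] for r in reviews if len(r["text"].strip()) < min_length
    )
    return {hotel: 100.0 * short[hotel] / totals[hotel] for hotel in totals}

# hypothetical toy data
sample_reviews = [
    {"hotel": "A", "text": "great hotel"},
    {"hotel": "A", "text": "The room was clean and the staff were very helpful."},
    {"hotel": "B", "text": "none"},
    {"hotel": "B", "text": "Close to the tube station, but quite noisy at night."},
]

kept, dropped = split_by_length(sample_reviews)
short_shares = short_share_by_hotel(sample_reviews)
print(len(kept), len(dropped), short_shares)
```

If the shares differ a lot between hotels, the filter would bias the comparison, and a different threshold might be worth trying.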

Graph by author

Now, it’s time to build our first topic model. Let’s start simple with the most basic one to understand how the library works, then we will improve it.

We can train a topic model in just a few lines of code that could be easily understood by anyone who has used at least one ML package before.

from bertopic import BERTopic
docs = list(df.reviews.values)
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

The default model returned 113 topics. We can look at the top topics.

topic_model.get_topic_info().head().set_index('Topic')[['Count', 'Name', 'Representation']]

The biggest group is Topic -1, which corresponds to outliers. By default, BERTopic uses HDBSCAN for clustering, and it doesn’t force all data points to be part of clusters. In our case, 6 356 reviews are outliers (around 49.3% of all reviews). It’s almost half of our data, so we will work with this group later.

A topic representation is usually a set of the most important words specific to this topic and not others. So, the best way to understand a topic is to look at the main terms (in BERTopic, a class-based TF-IDF score is used to rank the words).
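To give a feeling for how class-based TF-IDF ranks words, here is a simplified pure-Python sketch. It follows the formula as I understand it from the BERTopic docs (term frequency within a class, multiplied by log(1 + average words per class / total frequency of the term)). The function name and toy data are hypothetical, and the real implementation works on sparse matrices and has more options.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs: {class_id: list of tokens from all documents of the class}.
    Returns {class_id: {term: score}}."""
    term_counts = {c: Counter(tokens) for c, tokens in class_docs.items()}

    # frequency of each term across all classes
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    avg_words_per_class = sum(total_counts.values()) / len(class_docs)

    scores = {}
    for c, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words)
            * math.log(1 + avg_words_per_class / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# hypothetical toy classes: all documents of a topic joined together
toy_classes = {
    0: "hotel room clean room quiet".split(),
    1: "hotel breakfast tasty breakfast coffee".split(),
}
toy_scores = c_tf_idf(toy_classes)
top_terms = {c: max(s, key=s.get) for c, s in toy_scores.items()}
print(top_terms)
```

Words that are frequent inside one topic and rare elsewhere get the highest scores, while a word shared by all topics (here, "hotel") is pushed down.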

topic_model.visualize_barchart(top_n_topics = 16, n_words = 10)
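Graph by author

BERTopic even has a Topics per Class representation that can solve our task of understanding the differences in hotel reviews.

# assumes the hotel chain of each review is stored in df.hotel
topics_per_class = topic_model.topics_per_class(docs,
    classes=df.hotel)

topic_model.visualize_topics_per_class(topics_per_class,
    top_n_topics=10, normalize_frequency = True)

Graph by author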

If you are wondering how to interpret this graph, you are not alone: I also wasn’t able to guess. However, the author kindly supports this package, and there are a lot of answers on GitHub. From the discussion, I learned that the current normalisation approach doesn’t show the share of different topics for classes. So, it hasn’t completely solved our initial task.
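The number we actually want, the share of each topic within a class, is easy to compute by hand. Below is a minimal sketch in plain Python; the function name and the toy topic and hotel lists are hypothetical, and in practice you would pass the topics returned by fit_transform together with the hotel of each document.

```python
from collections import Counter, defaultdict

def topic_shares_per_class(topics, classes):
    """Return {class: {topic: share of the class's documents, in %}}."""
    counts = defaultdict(Counter)
    for topic, cls in zip(topics, classes):
        counts[cls][topic] += 1
    return {
        cls: {t: 100.0 * n / sum(c.values()) for t, n in c.items()}
        for cls, c in counts.items()
    }

# hypothetical toy data: one topic ID and one hotel per document
toy_topics = [0, 0, 1, -1, 0, 1, 1, 1]
toy_hotels = ["Hilton"] * 4 + ["Travelodge"] * 4

toy_shares = topic_shares_per_class(toy_topics, toy_hotels)
print(toy_shares)
```

Because shares are normalised within each class, hotels with very different numbers of reviews become directly comparable.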

However, we did the first iteration in less than 10 rows of code. It’s fantastic, but there’s some room for improvement.

As we saw earlier, almost 50% of data points are considered outliers. It’s quite a lot, so let’s see what we could do about it.

The documentation provides four different strategies to deal with the outliers:

  • based on topic-document probabilities,
  • based on topic distributions,
  • based on c-TF-IDF representations,
  • based on document and topic embeddings.

You can try different strategies and see which one fits your data the best.

Let’s look at examples of outliers. Even though these reviews are relatively short, they have multiple topics.

BERTopic uses clustering to define topics. It means that not more than one topic is assigned to each document. In most real-life cases, you can have a mixture of topics in your texts. We may be unable to assign a topic to the documents because they have multiple ones.

Luckily, there’s a solution for it: Topic Distributions. With this approach, each document will be split into tokens. Then, we will form subsentences (defined by a sliding window and stride) and assign a topic to each such subsentence.

Let’s try this approach and see whether we will be able to reduce the number of outliers without topics.

However, Topic Distributions are based on the fitted topic model, so let’s enhance it.

First of all, we can use CountVectorizer. It defines how a document will be split into tokens. Also, it can help us to get rid of meaningless words like to, not or the (there are a lot of such words in our first model).

Also, we could improve topics’ representations and even try a couple of different models. I used the KeyBERTInspired model (more details), but you could try other options (for example, LLMs).

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, PartOfSpeech, MaximalMarginalRelevance

main_representation_model = KeyBERTInspired()
aspect_representation_model1 = PartOfSpeech("en_core_web_sm")
# the second step of this chain was lost in the text; MMR with diversity=0.5 is an assumption
aspect_representation_model2 = [KeyBERTInspired(top_n_words=30),
                                MaximalMarginalRelevance(diversity=0.5)]

representation_model = {
    "Main": main_representation_model,
    "Aspect1": aspect_representation_model1,
    "Aspect2": aspect_representation_model2
}

vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')
topic_model = BERTopic(nr_topics = 'auto',
                       vectorizer_model = vectorizer_model,
                       representation_model = representation_model)

topics, ini_probs = topic_model.fit_transform(docs)

I specified nr_topics = 'auto' to reduce the number of topics. Then, all topics with a similarity over the threshold will be merged automatically. With this feature, we got 99 topics.

I’ve created a function to get the top topics and their shares so that we can analyse them more easily. Let’s look at the new set of topics.

def get_topic_stats(topic_model, extra_cols = []):
    topics_info_df = topic_model.get_topic_info().sort_values('Count', ascending = False)
    topics_info_df['Share'] = 100.*topics_info_df['Count']/topics_info_df['Count'].sum()
    topics_info_df['CumulativeShare'] = 100.*topics_info_df['Count'].cumsum()/topics_info_df['Count'].sum()
    return topics_info_df[['Topic', 'Count', 'Share', 'CumulativeShare',
                           'Name', 'Representation'] + extra_cols]

get_topic_stats(topic_model, ['Aspect1', 'Aspect2']).head(10)

Graph by author

We can also look at the Intertopic distance map to better understand our clusters, for example, which are close to each other. You can also use it to define some parent topics and subtopics. It’s called Hierarchical Topic Modelling and you can use other tools for it.

Graph by author

Another insightful way to better understand your topics is to look at the visualize_documents graph (documentation).

We can see that the number of topics has decreased significantly. Also, there are no meaningless stop words in topics’ representations.

However, we still see similar topics in the results. We can investigate and merge such topics manually.

For this, we can draw a Similarity matrix. I specified n_clusters, and our topics were clustered to visualise them better.

topic_model.visualize_heatmap(n_clusters = 20)
Graph by author

There are some pretty close topics. Let’s calculate the pair distances and look at the top topics.

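import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between topic embeddings
distance_matrix = cosine_similarity(np.array(topic_model.topic_embeddings_))
topic_labels = list(topic_model.topic_labels_.values())
dist_df = pd.DataFrame(distance_matrix, columns=topic_labels,
                       index=topic_labels)

# unpivot the matrix into (topic1, topic2, distance) rows
tmp = []
for rec in dist_df.reset_index().to_dict('records'):
    t1 = rec['index']
    for t2 in rec:
        if t2 == 'index':
            continue
        tmp.append({
            'topic1': t1,
            'topic2': t2,
            'distance': rec[t2]
        })

pair_dist_df = pd.DataFrame(tmp)

# exclude the outlier topic and keep each pair only once
pair_dist_df = pair_dist_df[(pair_dist_df.topic1.map(
    lambda x: not x.startswith('-1'))) &
    (pair_dist_df.topic2.map(lambda x: not x.startswith('-1')))]
pair_dist_df = pair_dist_df[pair_dist_df.topic1 < pair_dist_df.topic2]
pair_dist_df.sort_values('distance', ascending = False).head(20)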

I found guidance on how to get the distance matrix in GitHub discussions.
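If the pandas code above looks heavy, the underlying idea fits in a few lines of plain Python. This is only a sketch: the labels and three-dimensional vectors are hypothetical toy values, while real topic embeddings have hundreds of dimensions. Note that the value called distance above is actually a cosine similarity, so higher means closer.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar_pairs(embeddings, top_n=3):
    """embeddings: {topic_label: vector}. Returns the most similar pairs."""
    pairs = [
        (t1, t2, cosine(embeddings[t1], embeddings[t2]))
        for t1, t2 in combinations(sorted(embeddings), 2)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top_n]

# hypothetical toy embeddings
toy_embeddings = {
    "0_location": [1.0, 0.1, 0.0],
    "1_near_tube": [0.9, 0.2, 0.0],
    "2_breakfast": [0.0, 0.1, 1.0],
}
best_pair = top_similar_pairs(toy_embeddings, top_n=1)[0]
print(best_pair)
```

Pairs at the top of this list are the natural candidates for merging.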

We can now see the top pairs of topics by cosine similarity. There are topics with close meanings that we could merge.

topic_model.merge_topics(docs, [[26, 74], [43, 68, 62], [16, 50, 91]])
df['merged_topic'] = topic_model.topics_

Attention: after merging, all topics’ IDs and representations will be recalculated, so it’s worth updating them if you use them.

Now, we’ve improved our initial model and are ready to move on.

With real-life tasks, it’s worth spending more time on merging topics and trying different approaches to representation and clustering to get the best results.

The other potential idea is splitting reviews into separate sentences because comments are rather long.

Let’s calculate topics’ and tokens’ distributions. I’ve used a window equal to 4 (the author advised using 4–8 tokens) and a stride equal to 1.

topic_distr, topic_token_distr = topic_model.approximate_distribution(
    docs, window = 4, calculate_tokens=True)

For example, this comment will be split into subsentences (or sets of four tokens), and the closest of the existing topics will be assigned to each. Then, these topics will be aggregated to calculate probabilities for the whole sentence. You can find more details in the documentation.

The example shows how the split works with a basic CountVectorizer, window = 4 and stride = 1
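The splitting itself can be illustrated with a short sketch. It is a simplified version written for this explanation, not BERTopic’s own code: the function name is hypothetical, tokens are produced by a plain whitespace split, and the library may treat the edges of a document differently.

```python
def sliding_windows(tokens, window=4, stride=1):
    """Split a list of tokens into overlapping subsentences."""
    if len(tokens) <= window:
        return [tokens]
    return [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, stride)
    ]

# hypothetical example sentence
example_tokens = "the location was great and staff friendly".split()
windows = sliding_windows(example_tokens)
for w in windows:
    print(" ".join(w))
```

Each window is then compared with the existing topics, and the per-window results are aggregated into a distribution for the whole document.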

Using this data, we can get the probabilities of different topics for each review.

topic_model.visualize_distribution(topic_distr[doc_id], min_probability=0.05)
Graph by author

We can even see the distribution of terms for each topic and understand why we got this result. For our sentence, best very lovely was the main term for Topic 74, while location near defined a bunch of location-related topics.

vis_df = topic_model.visualize_approximate_distribution(docs[doc_id],
    topic_token_distr[doc_id])
vis_df
Graph by author

This example also shows that we could have spent more time merging topics because there are still pretty similar ones.

Now, we have probabilities for each topic and review. The next task is to select a threshold to filter out irrelevant topics with too low probability.

We can do it as usual, using data. Let’s calculate the distribution of the number of selected topics per review for different threshold levels.

import tqdm
import plotly.express as px

tmp_dfs = []

# iterating through different threshold levels
for thr in tqdm.tqdm(np.arange(0, 0.35, 0.001)):
    # calculating number of topics with probability >= threshold for each document
    tmp_df = pd.DataFrame(list(map(lambda x: len(list(filter(lambda y: y >= thr, x))), topic_distr))).rename(
        columns = {0: 'num_topics'}
    )
    tmp_df['num_docs'] = 1

    tmp_df['num_topics_group'] = tmp_df['num_topics']\
        .map(lambda x: str(x) if x < 5 else '5+')

    # aggregating stats
    tmp_df_aggr = tmp_df.groupby('num_topics_group', as_index = False).num_docs.sum()
    tmp_df_aggr['threshold'] = thr

    tmp_dfs.append(tmp_df_aggr)

num_topics_stats_df = pd.concat(tmp_dfs).pivot(index = 'threshold',
                                               values = 'num_docs',
                                               columns = 'num_topics_group').fillna(0)

num_topics_stats_df = num_topics_stats_df.apply(lambda x: 100.*x/num_topics_stats_df.sum(axis = 1))

# visualisation
colormap = px.colors.sequential.YlGnBu
px.area(num_topics_stats_df,
        title = 'Distribution of number of topics',
        labels = {'num_topics_group': 'number of topics',
                  'value': 'share of reviews, %'},
        color_discrete_map = {
            '0': colormap[0],
            '1': colormap[3],
            '2': colormap[4],
            '3': colormap[5],
            '4': colormap[6],
            '5+': colormap[7]
        })

Graph by author

threshold = 0.05 looks like a good candidate because, with this level, the share of reviews without any topic is still low enough (less than 6%), while the percentage of comments with 4+ topics is also not so high.
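The logic behind this chart can be sketched without pandas. The function names and the toy distributions are hypothetical; in the real analysis, each row would be one document from topic_distr.

```python
from collections import Counter

def topics_above_threshold(doc_distr, threshold):
    """IDs of topics whose probability is at least the threshold."""
    return [topic for topic, p in enumerate(doc_distr) if p >= threshold]

def num_topics_distribution(topic_distr, threshold):
    """Share of documents (in %) by number of topics kept at this threshold."""
    counts = Counter(
        len(topics_above_threshold(d, threshold)) for d in topic_distr
    )
    return {k: 100.0 * v / len(topic_distr) for k, v in sorted(counts.items())}

# hypothetical toy distributions: 4 documents, 3 topics
toy_distr = [
    [0.60, 0.30, 0.02],
    [0.04, 0.03, 0.01],
    [0.10, 0.00, 0.45],
    [0.00, 0.90, 0.00],
]

dist_low = num_topics_distribution(toy_distr, 0.05)
dist_high = num_topics_distribution(toy_distr, 0.35)
print(dist_low, dist_high)
```

Raising the threshold moves documents towards fewer topics, so the choice is a trade-off between reviews left without any topic and reviews with too many.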

This approach has helped us to reduce the number of outliers from 53.4% to 5.8%. So, assigning multiple topics could be an effective way to handle outliers.

Let’s calculate the topics for each doc with this threshold.

threshold = 0.05  # the level selected above

# define topics with probability >= threshold for each document
df['multiple_topics'] = list(map(
    lambda doc_topic_distr: list(map(
        lambda y: y[0], filter(lambda x: x[1] >= threshold,
                               enumerate(doc_topic_distr))
    )), topic_distr
))

# creating a dataset with doc id, topic
tmp_data = []

for rec in df.to_dict('records'):
    if len(rec['multiple_topics']) != 0:
        mult_topics = rec['multiple_topics']
    else:
        mult_topics = [-1]

    for topic in mult_topics:
        tmp_data.append({
            'topic': topic,
            'id': rec['id'],
            # the hotel column is used in the comparison below
            'hotel': rec['hotel'],
            'reviews_transl': rec['reviews_transl']
        })

mult_topics_df = pd.DataFrame(tmp_data)

Now, we have multiple topics mapped to each review and we can compare topics’ mixtures for different hotel chains.

Let’s find cases when a topic has too high or too low a share for a particular hotel. For that, we will calculate, for each topic + hotel pair, the share of comments related to the topic for this hotel vs. all others.

tmp_data = []
for hotel in mult_topics_df.hotel.unique():
    for topic in mult_topics_df.topic.unique():
        tmp_data.append({
            'hotel': hotel,
            'topic_id': topic,
            'total_hotel_reviews': mult_topics_df[mult_topics_df.hotel == hotel].id.nunique(),
            'topic_hotel_reviews': mult_topics_df[(mult_topics_df.hotel == hotel)
                & (mult_topics_df.topic == topic)].id.nunique(),
            'other_hotels_reviews': mult_topics_df[mult_topics_df.hotel != hotel].id.nunique(),
            'topic_other_hotels_reviews': mult_topics_df[(mult_topics_df.hotel != hotel)
                & (mult_topics_df.topic == topic)].id.nunique()
        })

mult_topics_stats_df = pd.DataFrame(tmp_data)
mult_topics_stats_df['topic_hotel_share'] = 100*mult_topics_stats_df.topic_hotel_reviews/mult_topics_stats_df.total_hotel_reviews
mult_topics_stats_df['topic_other_hotels_share'] = 100*mult_topics_stats_df.topic_other_hotels_reviews/mult_topics_stats_df.other_hotels_reviews

However, not all differences are significant for us. We can say that the difference in topics’ distribution is worth looking at if there is

  • statistical significance: the difference is not just by chance,
  • practical significance: the difference is bigger than X% points (I used 1%).

from statsmodels.stats.proportion import proportions_ztest

# p-value of the two-sided test for equal proportions
mult_topics_stats_df['difference_pval'] = list(map(
    lambda x1, x2, n1, n2: proportions_ztest(
        count = [x1, x2],
        nobs = [n1, n2],
        alternative = 'two-sided'
    )[1],
    mult_topics_stats_df.topic_other_hotels_reviews,
    mult_topics_stats_df.topic_hotel_reviews,
    mult_topics_stats_df.other_hotels_reviews,
    mult_topics_stats_df.total_hotel_reviews
))

mult_topics_stats_df['sign_difference'] = mult_topics_stats_df.difference_pval.map(
    lambda x: 1 if x <= 0.05 else 0
)

def get_significance(d, sign):
    sign_percent = 1
    if sign == 0:
        return 'no diff'
    if (d >= -sign_percent) and (d <= sign_percent):
        return 'no diff'
    if d < -sign_percent:
        return 'lower'
    if d > sign_percent:
        return 'higher'

mult_topics_stats_df['diff_significance_total'] = list(map(
    get_significance,
    mult_topics_stats_df.topic_hotel_share - mult_topics_stats_df.topic_other_hotels_share,
    mult_topics_stats_df.sign_difference
))
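If you would like to see what happens inside the significance check, below is a hand-written sketch of a two-sided z-test for two proportions with a pooled variance, which, as far as I know, is what proportions_ztest does by default. The function names and counts are hypothetical, and the classification mirrors the two criteria above (p-value <= 0.05 and a difference above 1 percentage point).

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance).
    Returns (z statistic, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

def classify_difference(share, other_share, p_value, min_diff=1.0, alpha=0.05):
    """Combine statistical and practical significance into one label."""
    diff = share - other_share
    if p_value > alpha or abs(diff) <= min_diff:
        return 'no diff'
    return 'higher' if diff > 0 else 'lower'

# hypothetical counts: 120 of 1 000 reviews for one hotel vs. 600 of 10 000 for the others
z, p = two_proportion_ztest(120, 1000, 600, 10000)
label = classify_difference(12.0, 6.0, p)

# identical shares should give no evidence of a difference
same_z, same_p = two_proportion_ztest(50, 1000, 500, 10000)
print(z, p, label, same_p)
```

Requiring both conditions filters out differences that are statistically detectable on large samples but too small to matter in practice.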

We have all the stats for all topics and hotels, and the last step is to create a visualisation comparing topic shares by categories.

import plotly
import plotly.express as px

# define color depending on difference significance
def get_color_sign(rel):
    if rel == 'no diff':
        return plotly.colors.qualitative.Set2[7]
    if rel == 'lower':
        return plotly.colors.qualitative.Set2[1]
    if rel == 'higher':
        return plotly.colors.qualitative.Set2[0]

# return topic representation in a format suitable for a graph title
def get_topic_representation_title(topic_model, topic):
    data = topic_model.get_topic(topic)
    data = list(map(lambda x: x[0], data))

    return ', '.join(data[:5]) + ', <br> ' + ', '.join(data[5:])

def get_graphs_for_topic(t):
    topic_stats_df = mult_topics_stats_df[mult_topics_stats_df.topic_id == t]\
        .sort_values('total_hotel_reviews', ascending = False).set_index('hotel')

    colors = list(map(
        get_color_sign,
        topic_stats_df.diff_significance_total
    ))

    fig = px.bar(topic_stats_df.reset_index(), x = 'hotel', y = 'topic_hotel_share',
                 title = 'Topic: %s' % get_topic_representation_title(topic_model,
                                                                      topic_stats_df.topic_id.min()),
                 text_auto = '.1f',
                 labels = {'topic_hotel_share': 'share of reviews, %'})
    fig.update_layout(showlegend = False)
    fig.update_traces(marker_color=colors, marker_line_color=colors,
                      marker_line_width=1.5, opacity=0.9)

    # share of the topic across all hotels, drawn as a dotted reference line
    topic_total_share = 100.*((topic_stats_df.topic_hotel_reviews + topic_stats_df.topic_other_hotels_reviews)\
        /(topic_stats_df.total_hotel_reviews + topic_stats_df.other_hotels_reviews)).min()

    fig.add_shape(type="line",
        xref="paper",
        x0=0, y0=topic_total_share,
        x1=1, y1=topic_total_share,
        line=dict(
            width=3, dash="dot"
        )
    )

    fig.show()

Then, we can calculate the list of top topics and make graphs for them.

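top_mult_topics_df = mult_topics_df.groupby('topic', as_index = False).id.nunique()
# share of all reviews; the denominator was lost in the text and is an assumption
top_mult_topics_df['share'] = 100.*top_mult_topics_df.id/mult_topics_df.id.nunique()
top_mult_topics_df['topic_repr'] = top_mult_topics_df.topic.map(
    lambda x: get_topic_representation_title(topic_model, x)
)

for t in top_mult_topics_df.head(32).topic.values:
    get_graphs_for_topic(t)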

Here are a couple of examples of the resulting charts. Let’s try to draw some conclusions based on this data.

We can see that Holiday Inn, Travelodge and Park Inn have better prices and value for money compared to Hilton or Park Plaza.

Graph by author

The other insight is that in Travelodge noise may be a problem.

Graph by author

It’s a bit challenging for me to interpret this result. I’m not sure what this topic is about.

Graph by author

The best practice for such cases is to look at some examples.

  • We stayed in the East tower where the lifts are under renovation, only one works, but there are signs showing the way to service lifts which can be used also.
  • However, the carpet and the furniture could do with a refurbishment.
  • It’s built right over Queensway station. Beware that this tube stop will be closed for refurbishing for one year! So you might consider noise levels.

So, this topic is about cases of temporary issues during the hotel stay or furniture not in the best condition.

You can find the full code on GitHub.

Today, we’ve done an end-to-end Topic Modelling analysis:

  • Built a basic topic model using the BERTopic library.
  • Then, we’ve handled outliers, so only 5.8% of our reviews don’t have a topic assigned.
  • Reduced the number of topics both automatically and manually to have a concise list.
  • Learned how to assign multiple topics to each document because, in most cases, your text will have a mixture of topics.

Finally, we were able to compare reviews for different hotel chains, create inspiring graphs and get some insights.

Thank you a lot for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please leave them in the comments section.

Ganesan, Kavita and Zhai, ChengXiang. (2011). OpinRank Review Dataset.
UCI Machine Learning Repository.
