Create and fine-tune sentence transformers for enhanced classification accuracy


Sentence transformers are powerful deep learning models that convert sentences into high-quality, fixed-length embeddings, capturing their semantic meaning. These embeddings are useful for various natural language processing (NLP) tasks such as text classification, clustering, semantic search, and information retrieval.

In this post, we showcase how to fine-tune a sentence transformer specifically for classifying an Amazon product into its product category (such as toys or sporting goods). We showcase two different sentence transformers, paraphrase-MiniLM-L6-v2 and a proprietary Amazon large language model (LLM) called M5_ASIN_SMALL_V2.0, and compare their results. M5 LLMs are BERT-based LLMs fine-tuned on internal Amazon product catalog data using product title, bullet points, description, and more. They are currently being used for use cases such as automated product classification and similar product recommendations. Our hypothesis is that M5_ASIN_SMALL_V2.0 will perform better for the use case of Amazon product category classification because it was fine-tuned with Amazon product data. We test this hypothesis in the experiment illustrated in this post.

Solution overview

In this post, we demonstrate how to fine-tune a sentence transformer with Amazon product data and how to use the resulting sentence transformer to improve the classification accuracy of product categories using an XGBoost decision tree. For this demonstration, we use a public Amazon product dataset called Amazon Product Dataset 2020 from a Kaggle competition. This dataset contains the following attributes and fields:

  • Domain name – amazon.com
  • Date range – January 1, 2020, through January 31, 2020
  • File extension – CSV
  • Available fields – Uniq Id, Product Name, Brand Name, Asin, Category, Upc Ean Code, List Price, Selling Price, Quantity, Model Number, About Product, Product Specification, Technical Details, Shipping Weight, Product Dimensions, Image, Variants, SKU, Product Url, Stock, Product Details, Dimensions, Color, Ingredients, Direction To Use, Is Amazon Seller, Size Quantity Variant, and Product Description
  • Label field – Category

Prerequisites

Before you begin, install the following packages. You can do this in either an Amazon SageMaker notebook or your local Jupyter notebook by running the following commands:

!pip install sentencepiece --quiet
!pip install sentence_transformers --quiet
!pip install xgboost --quiet
!pip install scikit-learn --quiet

Preprocess the data

The first step needed for fine-tuning a sentence transformer is to preprocess the Amazon product data so that the sentence transformer can consume the data and fine-tune effectively. This involves normalizing the text data, defining the product's main category by extracting the first category from the Category field, and selecting the most important fields from the dataset that contribute to classifying the product's main category accurately. We use the following code for preprocessing:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv')
data.columns = data.columns.str.lower().str.replace(' ', '_')
data['main_category'] = data['category'].str.split("|").str[0]
data["all_text"] = data.apply(
    lambda r: " ".join(
        [
            str(r["product_name"]) if pd.notnull(r["product_name"]) else "",
            str(r["about_product"]) if pd.notnull(r["about_product"]) else "",
            str(r["product_specification"]) if pd.notnull(r["product_specification"]) else "",
            str(r["technical_details"]) if pd.notnull(r["technical_details"]) else ""
        ]
    ),
    axis=1
)
label_encoder = LabelEncoder()
labels_transform = label_encoder.fit_transform(data['main_category'])
data['label'] = labels_transform
data[['all_text','label']]

The following screenshot shows an example of what our dataset looks like after it has been preprocessed.
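
If you don't have the screenshot handy, a quick inspection along the following lines shows the same information. This is a minimal sketch (assuming the preprocessing code above has been run) that also confirms how many distinct main categories the label covers:

# Peek at the preprocessed text and encoded labels.
print(data[['all_text', 'label']].head())

# Confirm how many main categories the label encoder produced.
print("Number of main categories:", data['main_category'].nunique())
print(data['main_category'].value_counts().head(10))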

Fine-tune the sentence transformer paraphrase-MiniLM-L6-v2

The first sentence transformer we fine-tune is called paraphrase-MiniLM-L6-v2. It uses the popular BERT model as its underlying architecture to transform product description text into a 384-dimensional dense vector embedding that will be consumed by our XGBoost classifier for product category classification. We use the following code to fine-tune paraphrase-MiniLM-L6-v2 using the preprocessed Amazon product data:

from sentence_transformers import SentenceTransformer
model_name = "paraphrase-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
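
As a quick sanity check (not part of the original walkthrough), you can encode a sample string and confirm the 384-dimensional output that the XGBoost classifier will later consume:

# Encode a hypothetical product description and inspect the embedding size (sketch).
sample_embedding = model.encode("Wooden building blocks set for toddlers")
print(sample_embedding.shape)  # Expected: (384,) for paraphrase-MiniLM-L6-v2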

The first step is to define a classification head that represents the 24 product categories that an Amazon product can be classified into. This classification head will be used to train the sentence transformer specifically to be more effective at transforming product descriptions according to the 24 product categories. The idea is that product descriptions within the same category should be transformed into vector embeddings that are closer in distance than product descriptions that belong to different categories.
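
To make this intuition concrete, you can compare cosine similarities between embeddings of a few product descriptions; after fine-tuning, descriptions from the same category should score higher against each other than against descriptions from other categories. The following is a minimal sketch using the sentence_transformers cosine similarity utility (the example texts are made up):

from sentence_transformers import util

# Hypothetical product descriptions: two toys and one sporting goods item.
toy_a = "Plush teddy bear for kids, soft and cuddly"
toy_b = "Wooden puzzle toy for toddlers ages 3 and up"
sports_item = "Adjustable dumbbell set for home strength training"

embeddings = model.encode([toy_a, toy_b, sports_item])

# After fine-tuning, we expect sim(toy_a, toy_b) > sim(toy_a, sports_item).
print("toy vs. toy:   ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("toy vs. sports:", util.cos_sim(embeddings[0], embeddings[2]).item())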

The following code is the first part of fine-tuning the sentence transformer, which defines the classification head and attaches it to the model:

import torch.nn as nn

# Define the classification head
class ClassificationHead(nn.Module):
    def __init__(self, embedding_dim, num_classes):
        super(ClassificationHead, self).__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, features):
        x = features['sentence_embedding']
        x = self.linear(x)
        return x

# Define the number of classes for the classification task.
num_classes = 24
print('Number of classes:', num_classes)
classification_head = ClassificationHead(model.get_sentence_embedding_dimension(), num_classes)

# Combine the SentenceTransformer model and the classification head.
class SentenceTransformerWithHead(nn.Module):
    def __init__(self, transformer, head):
        super(SentenceTransformerWithHead, self).__init__()
        self.transformer = transformer
        self.head = head

    def forward(self, input):
        features = self.transformer(input)
        logits = self.head(features)
        return logits

model_with_head = SentenceTransformerWithHead(model, classification_head)

We then set the fine-tuning parameters. For this post, we train for 5 epochs, optimize for cross-entropy loss, and use the AdamW optimization method. We chose 5 epochs because, after testing various epoch values, we observed that the loss was minimized at epoch 5. This made it the optimal number of training iterations for achieving the best classification results.

The following code is the second part of fine-tuning the sentence transformer, which runs the training loop:

import os
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from sentence_transformers import SentenceTransformer, InputExample, LoggingHandler
import torch
from torch.utils.data import DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup

train_sentences = data['all_text']
train_labels = data['label']
# Training parameters
num_epochs = 5
batch_size = 2
learning_rate = 2e-5

# Convert the dataset to PyTorch tensors.
train_examples = [InputExample(texts=[s], label=l) for s, l in zip(train_sentences, train_labels)]

# Customize collate_fn to convert InputExample objects into tensors.
def collate_fn(batch):
    texts = [example.texts[0] for example in batch]
    labels = torch.tensor([example.label for example in batch])
    return texts, labels

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)

# Define the loss function, optimizer, and learning rate scheduler.
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model_with_head.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Training loop
loss_list = []
for epoch in range(num_epochs):
    model_with_head.train()
    for step, (texts, labels) in enumerate(train_dataloader):
        labels = labels.to(model.device)
        optimizer.zero_grad()

        # Encode the text and pass it through the classification head.
        inputs = model.tokenize(texts)
        input_ids = inputs['input_ids'].to(model.device)
        input_attention_mask = inputs['attention_mask'].to(model.device)
        inputs_final = {'input_ids': input_ids, 'attention_mask': input_attention_mask}

        # Move model_with_head to the same device as the base model.
        model_with_head = model_with_head.to(model.device)
        logits = model_with_head(inputs_final)

        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item()}")

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
    model_save_path = f'./intermediate-output/epoch-{epoch}'
    model.save(model_save_path)
    loss_list.append(loss.item())
# Save the final model
model_final_save_path = "st_ft_epoch_5"
model.save(model_final_save_path)
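
To double-check the epoch choice discussed earlier, you can plot the end-of-epoch losses collected in loss_list. The following is a minimal sketch, assuming matplotlib is installed in your notebook environment:

import matplotlib.pyplot as plt

# Plot the end-of-epoch losses recorded during fine-tuning.
plt.plot(range(1, num_epochs + 1), loss_list, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Training loss')
plt.title('Fine-tuning loss per epoch')
plt.show()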

To examine whether our resulting fine-tuned sentence transformer improves our product category classification accuracy, we use it as our text embedder in the XGBoost classifier in the next step.

XGBoost classification

XGBoost (Extreme Gradient Boosting) classification is a machine learning technique used for classification tasks. It's an implementation of the gradient boosting framework designed to be efficient, flexible, and portable. For this post, we have XGBoost consume the product description text embedding output of our sentence transformers and observe the product category classification accuracy. We use the following code to apply the stock paraphrase-MiniLM-L6-v2 sentence transformer, before it was fine-tuned, to classify Amazon products into their respective categories:

from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
data['text_embedding'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding'].tolist(), index=data.index, dtype=float)

# Convert numeric columns stored as strings to floats
numeric_columns = ['selling_price', 'shipping_weight', 'product_dimensions']  # Add more columns as needed
for col in numeric_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# Convert categorical columns to category type
categorical_columns = ['model_number', 'is_amazon_seller']  # Add more columns as needed
for col in categorical_columns:
    data[col] = data[col].astype('category')

X_0 = data[['selling_price','model_number','is_amazon_seller']]
X = pd.concat([X_0, text_embeddings], axis=1)
label_encoder = LabelEncoder()
data['main_category_encoded'] = label_encoder.fit_transform(data['main_category'])
y = data['main_category_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)

# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

param = {
    'max_depth': 6,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': len(label_mapping),
    'eval_metric': 'mlogloss'
}

num_round = 100
bst = xgb.train(param, dtrain, num_round)

# Evaluate the model
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.78

We observe a 78% accuracy using the stock paraphrase-MiniLM-L6-v2 sentence transformer. To examine the results of the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, we need to update the beginning of the code as follows. All other code remains the same.

model = SentenceTransformer('st_ft_epoch_5')
data['text_embedding_finetuned'] = data['all_text'].apply(lambda x: model.encode(str(x)))
text_embeddings = pd.DataFrame(data['text_embedding_finetuned'].tolist(), index=data.index, dtype=float)
X_pa_finetuned = pd.concat([X_0, text_embeddings], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_pa_finetuned, y, test_size=0.2, random_state=42)

# Re-encode the labels to ensure they are consecutive integers starting from 0
unique_labels = sorted(set(y_train) | set(y_test))
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}

y_train = y_train.map(label_mapping)
y_test = y_test.map(label_mapping)

# Build and train the XGBoost model
# Enable categorical support for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

param = {
    'max_depth': 6,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': len(label_mapping),
    'eval_metric': 'mlogloss'
}

num_round = 100
bst = xgb.train(param, dtrain, num_round)

y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Optionally, convert the predicted labels back to the original category labels
inverse_label_mapping = {idx: label for label, idx in label_mapping.items()}
y_pred_labels = pd.Series(y_pred).map(inverse_label_mapping)

Accuracy: 0.94

With the fine-tuned paraphrase-MiniLM-L6-v2 sentence transformer, we observe 94% accuracy, a 16-percentage-point increase over the baseline of 78% accuracy. From this observation, we conclude that fine-tuning paraphrase-MiniLM-L6-v2 is effective for classifying Amazon product data into product categories.

Fine-tune the sentence transformer M5_ASIN_SMALL_V20

Now we create a sentence transformer from a BERT-based model called M5_ASIN_SMALL_V2.0. It's a 40-million-parameter BERT-based model trained at M5, an internal team at Amazon specializing in fine-tuning LLMs using Amazon product data. It was distilled from a larger teacher model (approximately 5 billion parameters), which was pre-trained on a large amount of unlabeled ASIN data and pre-fine-tuned on a set of Amazon supervised learning tasks (multi-task pre-fine-tuning). It is a multi-task, multi-lingual, multi-locale, and multi-modal BERT-based encoder-only model trained on text and structured data input. Its neural network architectural details are as follows:

Model backbone:
  • Hidden size: 384
  • Number of hidden layers: 24
  • Number of attention heads: 16
  • Intermediate size: 1536
  • Vocabulary size: 256,035
Number of backbone parameters: 42,587,904
Number of word embedding parameters (bert.embedding.*): 98,517,504
Total number of parameters: 141,259,023

Because M5_ASIN_SMALL_V20 was pre-trained on Amazon product data specifically, we hypothesize that building a sentence transformer from it will improve the accuracy of product category classification. We complete the following steps to build a sentence transformer from M5_ASIN_SMALL_V20, fine-tune it, and feed its embeddings into an XGBoost classifier to observe the impact on accuracy:

  1. Load a pre-trained M5 model that you want to use as the base encoder.
  2. Use the M5 model within the SentenceTransformer framework to create a sentence transformer.
  3. Add a pooling layer to create fixed-size sentence embeddings from the variable-length output of the BERT model.
  4. Combine the M5 model and pooling layer into a single model.
  5. Fine-tune the model on a relevant dataset.

See the following code for Steps 1–3:

from sentence_transformers import models
from transformers import AutoTokenizer

# Step 1: Load the pre-trained M5 model
model_path = "M5_ASIN_SMALL_V20"  # or your custom model path
transformer_model = models.Transformer(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Step 2: Define the pooling layer
pooling_model = models.Pooling(transformer_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

# Step 3: Create the SentenceTransformer model
model_mean_m5_base = SentenceTransformer(modules=[transformer_model, pooling_model])
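
For Steps 4 and 5, one option is to reuse the classification head and training loop from the previous section with the M5-based sentence transformer swapped in; the following is a minimal sketch under that assumption (the save name matches the model loaded in the next snippet):

# Step 4: Combine the M5-based sentence transformer with a classification head,
# reusing the ClassificationHead and SentenceTransformerWithHead classes defined earlier.
m5_classification_head = ClassificationHead(
    model_mean_m5_base.get_sentence_embedding_dimension(), num_classes
)
m5_model_with_head = SentenceTransformerWithHead(model_mean_m5_base, m5_classification_head)

# Step 5: Fine-tune by running the earlier training loop with m5_model_with_head and
# model_mean_m5_base in place of model_with_head and model, then save the fine-tuned
# encoder under the name loaded in the next snippet:
# model_mean_m5_base.save('m5_ft_epoch_5_mean')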

The rest of the code remains the same as for fine-tuning the paraphrase-MiniLM-L6-v2 sentence transformer, except that we use the fine-tuned M5 sentence transformer instead to create embeddings for the texts in the dataset:

loaded_model = SentenceTransformer('m5_ft_epoch_5_mean')
data['text_embedding_m5'] = data['all_text'].apply(lambda x: loaded_model.encode(str(x)))

Results

We observe results similar to paraphrase-MiniLM-L6-v2 when looking at accuracy before fine-tuning, observing a 78% accuracy for M5_ASIN_SMALL_V20. However, the fine-tuned M5_ASIN_SMALL_V20 sentence transformer performs better than the fine-tuned paraphrase-MiniLM-L6-v2: its accuracy is 98%, compared to 94% for the fine-tuned paraphrase-MiniLM-L6-v2. We fine-tuned the sentence transformers for 5 epochs because experiments showed this was the optimal number to minimize loss. The following graph summarizes our observations of accuracy improvement with fine-tuning for 5 epochs in a single comparison chart.
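
If you want to reproduce a comparison chart like this, a minimal matplotlib sketch using the four accuracy values reported in this post could look like the following (the chart image itself is not included here):

import matplotlib.pyplot as plt

# Accuracy values reported in this post, before and after fine-tuning.
model_names = ['paraphrase-MiniLM-L6-v2', 'M5_ASIN_SMALL_V20']
before_ft = [0.78, 0.78]
after_ft = [0.94, 0.98]

x = range(len(model_names))
width = 0.35
plt.bar([i - width / 2 for i in x], before_ft, width, label='Before fine-tuning')
plt.bar([i + width / 2 for i in x], after_ft, width, label='After fine-tuning')
plt.xticks(list(x), model_names)
plt.ylabel('Classification accuracy')
plt.ylim(0, 1)
plt.legend()
plt.title('XGBoost accuracy by sentence transformer')
plt.show()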

Clean up

We recommend using GPUs to fine-tune the sentence transformers, for example, ml.g5.4xlarge or ml.g4dn.16xlarge. Be sure to clean up resources to avoid incurring additional costs.

In the event you’re utilizing a SageMaker pocket book occasion, check with Clean up Amazon SageMaker notebook instance resources. In the event you’re utilizing Amazon SageMaker Studio, check with Delete or stop your Studio running instances, applications, and spaces.

Conclusion

In this post, we explored sentence transformers and how to use them effectively for text classification tasks. We dived deep into the sentence transformer paraphrase-MiniLM-L6-v2, demonstrated how to use a BERT-based model like M5_ASIN_SMALL_V20 to create a sentence transformer, showed how to fine-tune sentence transformers, and examined the accuracy results of fine-tuning sentence transformers.

Fine-tuning sentence transformers has proven to be highly effective for classifying product descriptions into categories, significantly improving prediction accuracy. As a next step, we encourage you to explore different sentence transformers from Hugging Face.

Finally, if you want to explore M5, note that it is proprietary to Amazon and you can only access it as an Amazon partner or customer as of the time of this publication. Connect with your Amazon point of contact if you're an Amazon partner or customer wanting to use M5, and they will guide you through M5's offerings and how it can be used for your use case.


About the Authors

Kara Yang is a Data Scientist at AWS Professional Services in the San Francisco Bay Area, with extensive experience in AI/ML. She specializes in leveraging cloud computing, machine learning, and generative AI to help customers address complex business challenges across various industries. Kara is passionate about innovation and continuous learning.

Farshad Harirchi is a Principal Data Scientist at AWS Professional Services. He helps customers across industries, from retail to industrial and financial services, with the design and development of generative AI and machine learning solutions. Farshad brings extensive experience in the entire machine learning and MLOps stack. Outside of work, he enjoys traveling, playing outdoor sports, and exploring board games.

James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain, having played many different roles. Currently he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.
