Fine-tune Whisper models on Amazon SageMaker with LoRA

Whisper is an Automatic Speech Recognition (ASR) model that has been trained using 680,000 hours of supervised data from the web, encompassing a range of languages and tasks. One of its limitations is low performance on low-resource languages such as Marathi and the Dravidian languages, which can be remediated with fine-tuning. However, fine-tuning a Whisper model has become a considerable challenge, both in terms of computational resources and storage requirements. Five to ten runs of full fine-tuning for Whisper models demand approximately 100 hours of A100 GPU time (40 GB SXM4; this varies with model size and model parameters), and each fine-tuned checkpoint requires about 7 GB of storage space. This combination of high computational and storage demands can pose significant hurdles, especially in environments with limited resources, often making it exceptionally difficult to achieve meaningful results.

Low-Rank Adaptation, also known as LoRA, takes a unique approach to model fine-tuning. It keeps the pre-trained model weights frozen and introduces trainable rank decomposition matrices into each layer of the Transformer architecture. This method can decrease the number of trainable parameters needed for downstream tasks by up to 10,000 times and reduce the GPU memory requirement by 3 times. In terms of model quality, LoRA has been shown to match or even exceed the performance of traditional fine-tuning methods, despite working with fewer trainable parameters (see the results from the original LoRA paper). It also offers the benefit of increased training throughput. Unlike adapter methods, LoRA doesn’t introduce additional latency during inference, thereby maintaining the efficiency of the model during the deployment phase. Fine-tuning Whisper using LoRA has shown promising results. Take Whisper-Large-v2, for instance: running 3 epochs with a 12-hour Common Voice dataset on an 8 GB memory GPU takes 6–8 hours, which is 5 times faster than full fine-tuning with comparable performance.
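
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted layer. It is an illustration under stated assumptions rather than the peft implementation: the matrix sizes are toy values, and the helper names (matvec, lora_forward, merged_weight) are hypothetical. It shows that the frozen weight plus the scaled low-rank update can be folded into one matrix, which is why no extra inference latency is introduced.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


random.seed(0)
d, k, r, alpha = 6, 6, 2, 4  # toy sizes; real attention layers are far larger
scale = alpha / r            # LoRA scales the low-rank update by alpha / r

# Frozen pre-trained weight W (d x k) and trainable factors B (d x r), A (r x k).
# In practice B starts at zero; random values are used here so the update is visible.
W = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
A = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(r)]
B = [[random.uniform(-1, 1) for _ in range(r)] for _ in range(d)]


def lora_forward(x):
    """Training-time path: frozen W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def merged_weight():
    """Deployment-time path: fold the update into a single matrix W + scale * B A."""
    return [
        [W[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r)) for j in range(k)]
        for i in range(d)
    ]


x = [random.uniform(-1, 1) for _ in range(k)]
h_adapter = lora_forward(x)
h_merged = matvec(merged_weight(), x)

full_params = d * k        # parameters updated by full fine-tuning of this layer
lora_params = r * (d + k)  # parameters updated by LoRA for this layer

# The two paths agree up to floating-point rounding
print(max(abs(p - q) for p, q in zip(h_adapter, h_merged)))
print(full_params, lora_params)  # 36 24
```

At toy sizes the saving is small, but it grows with layer width: for a 1280 x 1280 projection with rank 32, the same formulas give 1,638,400 parameters for full fine-tuning versus 81,920 for LoRA.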

Amazon SageMaker is an ideal platform to implement LoRA fine-tuning of Whisper. Amazon SageMaker enables you to build, train, and deploy machine learning models for any use case with fully managed infrastructure, tools, and workflows. Additional model training benefits can include lower training costs with Managed Spot Training, distributed training libraries to split models and training datasets across AWS GPU instances, and more. The trained SageMaker models can be easily deployed for inference directly on SageMaker. In this post, we present a step-by-step guide to implement LoRA fine-tuning in SageMaker. The source code associated with this implementation can be found on GitHub.

Prepare the dataset for fine-tuning

We use the low-resource language Marathi for the fine-tuning task. Using the Hugging Face datasets library, you can download and split the Common Voice dataset into training and testing datasets. See the following code:

from datasets import load_dataset, DatasetDict

language = "Marathi"
language_abbr = "mr"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

common_voice = DatasetDict()
common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test", use_auth_token=True)

The Whisper speech recognition model expects mono audio sampled at 16 kHz. Because the Common Voice dataset has a 48 kHz sampling rate, you need to downsample the audio files first. Then you need to apply Whisper’s feature extractor to the audio to extract log-mel spectrogram features, and apply Whisper’s tokenizer to convert each sentence in the transcript into token IDs. See the following code:

from datasets import Audio
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer

# Hugging Face model ID for whisper-large-v2, the pre-trained model used in this post
model_name_or_path = "openai/whisper-large-v2"
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)

# resample audio data from 48 kHz to 16 kHz when each sample is loaded
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    # load the audio data (already resampled to 16 kHz by the cast above)
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# apply the data preparation function to all of our fine-tuning dataset samples using dataset's .map method.
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
# save the processed dataset locally so it can be uploaded to Amazon S3
common_voice.save_to_disk("marathi-common-voice-processed")
!aws s3 cp --recursive "marathi-common-voice-processed" s3://<Your-S3-Bucket>
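
As a rough check on what the preprocessing produces, the following sketch works through the arithmetic of resampling and log-mel framing. The constants are assumptions based on the commonly documented defaults of Whisper’s feature extractor for whisper-large-v2 (30-second chunks, hop length of 160 samples, 80 mel bins); they are not stated in this post, and the helper names are hypothetical.

```python
# Assumed defaults of Whisper's feature extractor (not stated in this post)
SOURCE_RATE = 48_000  # Common Voice sampling rate in Hz
TARGET_RATE = 16_000  # sampling rate Whisper expects in Hz
CHUNK_SECONDS = 30    # audio is padded or truncated to this length
HOP_LENGTH = 160      # samples between successive spectrogram frames
N_MELS = 80           # mel frequency bins per frame


def resampled_length(num_samples, source_rate=SOURCE_RATE, target_rate=TARGET_RATE):
    """Number of samples left after resampling a clip to the target rate."""
    return num_samples * target_rate // source_rate


def feature_shape(chunk_seconds=CHUNK_SECONDS, rate=TARGET_RATE,
                  hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Shape (mel bins, frames) of the log-mel features for one padded chunk."""
    samples = chunk_seconds * rate
    return (n_mels, samples // hop_length)


clip_48k = 5 * SOURCE_RATE         # a hypothetical 5-second clip at 48 kHz
print(resampled_length(clip_48k))  # 80000
print(feature_shape())             # (80, 3000)
```

Under these assumptions, every sample is padded or truncated to the same feature shape regardless of the clip’s original duration.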

After you have processed all the training samples, upload the processed data to Amazon S3, so that when using the processed training data in the fine-tuning stage, you can use FastFile to mount the S3 file directly instead of copying it to local disk:

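from sagemaker.inputs import TrainingInput

training = TrainingInput(
    s3_data="s3://<Your-S3-Bucket>",  # S3 location of the processed dataset
    s3_data_type="S3Prefix",  # Available options: S3Prefix | ManifestFile | AugmentedManifestFile
    distribution="FullyReplicated",  # Available options: FullyReplicated | ShardedByS3Key
    input_mode="FastFile",  # stream the data from S3 instead of copying it to local disk
)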

Train the model

For demonstration, we use whisper-large-v2 as the pre-trained model (whisper v3 is now available), which can be imported through the Hugging Face transformers library. You can use 8-bit quantization to further improve training efficiency. 8-bit quantization optimizes memory by rounding floating point values to 8-bit integers. It’s a commonly used model compression technique that reduces memory usage without sacrificing too much precision during inference.

To load the pre-trained model in 8-bit quantized format, we simply add the load_in_8bit=True argument when instantiating the model, as shown in the following code. This loads the model weights quantized to 8 bits, reducing the memory footprint.

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")
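
The rounding idea behind 8-bit quantization can be sketched in a few lines of pure Python. This is a simplified absmax scheme for illustration only, not the exact algorithm that load_in_8bit applies; the weight values and helper names are hypothetical, and the parameter count of roughly 1.55 billion for Whisper large is an assumption used only for the memory estimate.

```python
def quantize_absmax(values):
    """Map floats to the int8 range [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    return [int(round(v * scale)) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q / scale for q in quantized]


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


weights = [0.5, -1.27, 0.031, 0.9, -0.002]  # hypothetical weight values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [50, -127, 3, 90, 0]
print(max_error <= 0.5 / scale + 1e-12)  # rounding error is at most half a step

NUM_PARAMS = 1_550_000_000  # assumed approximate size of Whisper large
print(weight_memory_gb(NUM_PARAMS, 4))  # 32-bit floats
print(weight_memory_gb(NUM_PARAMS, 1))  # 8-bit integers
```

The small value -0.002 collapses to 0, which shows where precision is lost: values much smaller than the largest weight in the group are represented coarsely.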

We use the LoRA implementation from Hugging Face’s peft package. There are four steps to fine-tune a model using LoRA:

  1. Instantiate a base model (as we did in the last step).
  2. Create a configuration (LoraConfig) where LoRA-specific parameters are defined.
  3. Wrap the base model with get_peft_model() to get a trainable PeftModel.
  4. Train the PeftModel as you would the base model.

See the next code:

from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)

training_args = Seq2SeqTrainingArguments(
    # ... (training arguments omitted in this excerpt)
)
trainer = Seq2SeqTrainer(
    # ... (other arguments omitted in this excerpt)
    eval_dataset=train_dataset.get("test", train_dataset["test"]),
)
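
To get a feel for how few parameters this configuration trains, the following sketch counts the LoRA parameters for r=32 on the q_proj and v_proj modules. The architecture numbers are assumptions about Whisper large that are not stated in this post (hidden size 1280, 32 encoder layers, 32 decoder layers, with each decoder layer containing both self-attention and cross-attention), and the total of about 1.55 billion base parameters is approximate.

```python
# Assumed Whisper large architecture (not stated in this post)
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 32
BASE_PARAMS = 1_550_000_000  # approximate

RANK = 32                    # r in the LoraConfig
TARGETS_PER_ATTENTION = 2    # q_proj and v_proj


def lora_params_per_module(d_in, d_out, rank):
    """A (rank x d_in) plus B (d_out x rank); no bias terms with bias='none'."""
    return rank * (d_in + d_out)


# Encoder layers have self-attention only; decoder layers have self- and cross-attention
encoder_modules = ENCODER_LAYERS * 1 * TARGETS_PER_ATTENTION
decoder_modules = DECODER_LAYERS * 2 * TARGETS_PER_ATTENTION

trainable = (encoder_modules + decoder_modules) * lora_params_per_module(D_MODEL, D_MODEL, RANK)
share = trainable / BASE_PARAMS

print(trainable)        # 15728640
print(f"{share:.2%}")   # roughly one percent of the base model
```

Under these assumptions, the optimizer only has to track about 15.7 million parameters, which is the main reason the job fits on a single GPU.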

To run a SageMaker training job, we bring our own Docker container. You can download the Docker image from GitHub, where ffmpeg4 and git-lfs are packaged together with other Python requirements. To learn more about how to adapt your own Docker container to work with SageMaker, refer to Adapting your own training container. Then you can use the Hugging Face Estimator and start a SageMaker training job:


from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.sh",
    output_path=OUTPUT_PATH,
    # transformers_version='4.17.0',
    # pytorch_version='1.10.2',
    metric_definitions=metric_definitions,
    # ... (remaining arguments omitted in this excerpt)
)

huggingface_estimator.fit(job_name=TRAINING_JOB_NAME, wait=False)
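
The metric_definitions argument is a list of name and regular expression pairs that SageMaker applies to the training logs. The post does not show its contents, so the following is only a sketch of what such a list might look like: the metric names, the patterns, and the sample log line are all hypothetical, and the helper extract_metrics merely mimics the matching locally so the patterns can be checked before a job is launched.

```python
import re

# Hypothetical metric definitions; names and patterns are illustrative only
metric_definitions = [
    {"Name": "eval_loss", "Regex": r"'eval_loss': ([0-9.]+)"},
    {"Name": "epoch", "Regex": r"'epoch': ([0-9.]+)"},
]


def extract_metrics(log_line, definitions):
    """Apply each regex to a log line and collect the first captured number."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# A made-up log line in the style of a training loop's evaluation output
sample_log = "{'eval_loss': 0.4321, 'eval_runtime': 12.5, 'epoch': 1.0}"
print(extract_metrics(sample_log, metric_definitions))
# {'eval_loss': 0.4321, 'epoch': 1.0}
```

Testing the patterns against a representative log line first helps avoid a training job that runs to completion without emitting any metrics.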

The implementation of LoRA enabled us to run the Whisper large fine-tuning task on a single GPU instance (for example, ml.g5.2xlarge). In comparison, the Whisper large full fine-tuning task requires multiple GPUs (for example, ml.p4d.24xlarge) and a much longer training time. More specifically, our experiment demonstrated that the full fine-tuning task requires 24 times more GPU hours compared to the LoRA approach.

Evaluate model performance

To evaluate the performance of the fine-tuned Whisper model, we calculate the word error rate (WER) on a held-out test set. WER measures the difference between the predicted transcript and the ground truth transcript. A lower WER indicates better performance. You can run the following script against the pre-trained model and fine-tuned model and compare their WER difference:

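import evaluate
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

metric = evaluate.load("wer")

# data_collator is the same collator used for training
eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            # the generate() call is truncated in the source; passing input_features is an assumption
            generated_tokens = model.generate(
                input_features=batch["input_features"].to("cuda"),
                decoder_input_ids=batch["labels"][:, :4].to("cuda"),
            ).cpu().numpy()
            labels = batch["labels"].cpu().numpy()
            # replace the ignore index (-100) with the pad token so the labels can be decoded
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            # accumulate predictions and references so that metric.compute() has data to score
            metric.add_batch(predictions=decoded_preds, references=decoded_labels)
    del generated_tokens, labels, batch
wer = 100 * metric.compute()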
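
For readers who want to see what the metric measures, here is a minimal pure-Python sketch of WER for a single sentence pair: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is an illustration, not the implementation behind the wer metric used above, which aggregates errors over the whole test set; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution and one deletion
print(f"{100 * word_error_rate(reference, hypothesis):.2f}")  # 33.33
```

Because insertions count as errors, WER can exceed 100 percent when the hypothesis is much longer than the reference.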


In this post, we demonstrated fine-tuning Whisper, a state-of-the-art speech recognition model. In particular, we used Hugging Face’s PEFT LoRA and enabled 8-bit quantization for efficient training. We also demonstrated how to run the training job on SageMaker.

Although this is an important first step, there are several ways you can build on this work to further improve the Whisper model. Going forward, consider using SageMaker distributed training to scale training on a much larger dataset. This will allow the model to train on more varied and comprehensive data, improving accuracy. You can also optimize latency when serving the Whisper model, to enable real-time speech recognition. Additionally, you could expand the work to handle longer audio transcriptions, which requires changes to model architecture and training schemes.


The authors extend their gratitude to Paras Mehra, John Sol and Evandro Franco for their insightful feedback and review of the post.

About the Authors

Jun Shi is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML infrastructure and applications. He has over a decade of experience in the FinTech industry as a software engineer.

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and family.
