Host the Whisper Model on Amazon SageMaker: exploring inference options


OpenAI Whisper is an advanced automatic speech recognition (ASR) model with an MIT license. ASR technology finds use in transcription services, voice assistants, and improving accessibility for people with hearing impairments. This state-of-the-art model is trained on a vast and diverse dataset of multilingual and multitask supervised data collected from the web. Its high accuracy and adaptability make it a valuable asset for a wide array of voice-related tasks.

In the ever-evolving landscape of machine learning and artificial intelligence, Amazon SageMaker provides a comprehensive ecosystem. SageMaker empowers data scientists, developers, and organizations to develop, train, deploy, and manage machine learning models at scale. Offering a wide range of tools and capabilities, it simplifies the entire machine learning workflow, from data pre-processing and model development to deployment and monitoring. SageMaker's user-friendly interface makes it a pivotal platform for unlocking the full potential of AI, establishing it as a game-changing solution in the realm of artificial intelligence.

In this post, we explore SageMaker's capabilities, specifically focusing on hosting Whisper models. We'll dive deep into two methods for doing this: one using the Whisper PyTorch model and the other using the Hugging Face implementation of the Whisper model. Additionally, we'll conduct an in-depth examination of SageMaker's inference options, comparing them across parameters such as speed, cost, payload size, and scalability. This analysis empowers users to make informed decisions when integrating Whisper models into their specific use cases and systems.

Solution overview

The following diagram shows the main components of this solution.

  1. In order to host the model on Amazon SageMaker, the first step is to save the model artifacts. These artifacts refer to the essential components of a machine learning model needed for various applications, including deployment and retraining. They can include model parameters, configuration files, pre-processing components, as well as metadata such as version details, authorship, and any notes related to its performance. It's important to note that Whisper models for the PyTorch and Hugging Face implementations consist of different model artifacts.
  2. Next, we create custom inference scripts. Within these scripts, we define how the model should be loaded and specify the inference process. This is also where we can incorporate custom parameters as needed. Additionally, you can list the required Python packages in a requirements.txt file; during the model's deployment, these packages are automatically installed in the initialization phase. (An illustrative layout of these files is shown after this list.)
  3. Then we select either the PyTorch or Hugging Face deep learning containers (DLCs) provided and maintained by AWS. These containers are pre-built Docker images with deep learning frameworks and other necessary Python packages. For more information, you can check this link.
  4. With the model artifacts, custom inference scripts, and selected DLCs, we create Amazon SageMaker models for PyTorch and Hugging Face respectively.
  5. Finally, the models can be deployed on SageMaker and used with the following options: real-time inference endpoints, batch transform jobs, and asynchronous inference endpoints. We'll dive into these options in more detail later in this post.
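
As a point of reference, the source directory passed to SageMaker typically looks something like the following; the exact file names and package list here are illustrative assumptions rather than the verbatim layout of the sample repository.

code/
├── inference.py       # custom model loading and inference logic (model_fn, etc.)
└── requirements.txt   # extra Python packages installed at container startup, e.g. openai-whisper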

The example notebook and code for this solution are available in this GitHub repository.

Figure 1. Overview of key solution components

Walkthrough

Hosting the Whisper Model on Amazon SageMaker

In this section, we'll explain the steps to host the Whisper model on Amazon SageMaker, using the PyTorch and Hugging Face frameworks, respectively. To experiment with this solution, you need an AWS account and access to the Amazon SageMaker service.

PyTorch framework

  1. Save model artifacts

The first option for hosting the model is to use the official Whisper Python package, which can be installed using pip install openai-whisper. This package provides a PyTorch model. When saving model artifacts in the local repository, the first step is to save the model's learnable parameters, such as the weights and biases of each layer in the neural network, as a 'pt' file. You can choose from different model sizes, including 'tiny', 'base', 'small', 'medium', and 'large'. Larger model sizes offer higher accuracy but come at the cost of longer inference latency. Additionally, you need to save the model state dictionary and dimension dictionary, which contain a Python dictionary mapping each layer or parameter of the PyTorch model to its corresponding learnable parameters, along with other metadata and custom configurations. The code below shows how to save the Whisper PyTorch artifacts.

### PyTorch
import torch
import whisper

# Load the PyTorch model and save it in the local repo
model = whisper.load_model("base")
torch.save(
    {
        'model_state_dict': model.state_dict(),
        'dims': model.dims.__dict__,
    },
    'base.pt'
)
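
Before creating the SageMaker model, the saved artifacts (together with the code directory) are typically packaged as a model.tar.gz archive and uploaded to Amazon S3; the model_uri used later points to this archive. The snippet below is a minimal sketch of that step, assuming a default SageMaker session and bucket; the exact packaging in the sample repository may differ.

import tarfile
import sagemaker

# Package the saved weights into a model.tar.gz archive
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("base.pt")

# Upload the archive to the default SageMaker bucket (the key prefix is an assumption)
sess = sagemaker.Session()
model_uri = sess.upload_data("model.tar.gz", key_prefix="whisper/pytorch")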

  2. Select the DLC

The next step is to select a pre-built DLC from this link. Be careful to choose the correct image by considering the following settings: framework (PyTorch), framework version, task (inference), Python version, and hardware (for example, GPU). It is recommended to use the latest framework and Python versions whenever possible, as this results in better performance and addresses known issues and bugs from earlier releases.
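
Instead of copying an image URI manually, you can also look one up with the SageMaker Python SDK. The following is a minimal sketch; the region, framework, and Python versions shown are assumptions and should be matched to the DLC you actually choose.

from sagemaker import image_uris

# Retrieve a PyTorch inference DLC URI (the versions below are assumptions)
image = image_uris.retrieve(
    framework="pytorch",
    region="us-east-1",
    version="2.0",
    py_version="py310",
    instance_type="ml.g4dn.xlarge",
    image_scope="inference",
)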

  3. Create Amazon SageMaker models

Next, we use the SageMaker Python SDK to create PyTorch models. It's important to remember to add environment variables when creating a PyTorch model. By default, TorchServe can only process file sizes up to 6 MB, regardless of the inference type used.

# Create a PyTorchModel for deployment
from sagemaker.pytorch.model import PyTorchModel

whisper_pytorch_model = PyTorchModel(
    model_data=model_uri,
    image_uri=image,
    role=role,
    entry_point="inference.py",
    source_dir="code",
    name=model_name,
    env = {
        'TS_MAX_REQUEST_SIZE': '100000000',
        'TS_MAX_RESPONSE_SIZE': '100000000',
        'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'
    }
)

The following table shows the settings for different PyTorch versions:

Framework                            Environment variables
PyTorch 1.8 (based on TorchServe)    'TS_MAX_REQUEST_SIZE': '100000000'
                                     'TS_MAX_RESPONSE_SIZE': '100000000'
                                     'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'
PyTorch 1.4 (based on MMS)           'MMS_MAX_REQUEST_SIZE': '1000000000'
                                     'MMS_MAX_RESPONSE_SIZE': '1000000000'
                                     'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
  4. Define the model loading method in inference.py

In the custom inference.py script, we first check for the availability of a CUDA-capable GPU. If such a GPU is available, we assign the 'cuda' device to the DEVICE variable; otherwise, we assign the 'cpu' device. This step ensures that the model is placed on the available hardware for efficient computation. We load the PyTorch model using the Whisper Python package.

### PyTorch
import os

import torch
import whisper

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def model_fn(model_dir):
    """
    Load and return the model
    """
    model = whisper.load_model(os.path.join(model_dir, 'base.pt'))
    model = model.to(DEVICE)
    return model
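
The script also needs to handle incoming requests. Below is a minimal sketch of a transform_fn that writes the incoming audio bytes to a temporary file and calls Whisper's transcribe method; the exact handler in the sample repository may differ, so treat this as an illustrative assumption.

### PyTorch (illustrative request handler)
import json
import tempfile

def transform_fn(model, request_body, content_type, accept="application/json"):
    # Write the raw audio bytes to a temporary file that Whisper/ffmpeg can read
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(request_body)
        tmp.flush()
        # Run transcription on the device selected in model_fn
        result = model.transcribe(tmp.name)
    return json.dumps({"text": result["text"]})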

Hugging Face framework

  1. Save model artifacts

The second option is to use Hugging Face's Whisper implementation. The model can be loaded using the AutoModelForSpeechSeq2Seq transformers class. The learnable parameters are saved in a binary (bin) file using the save_pretrained method. The tokenizer and preprocessor also need to be saved separately to ensure the Hugging Face model works properly. Alternatively, you can deploy a model on Amazon SageMaker directly from the Hugging Face Hub by setting two environment variables: HF_MODEL_ID and HF_TASK. For more information, please refer to this webpage.

### Hugging Face
from transformers import WhisperTokenizer, WhisperProcessor, AutoModelForSpeechSeq2Seq

# Load the pre-trained model
model_name = "openai/whisper-base"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)
tokenizer = WhisperTokenizer.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Define a directory where you want to save the model
save_directory = "./model"

# Save the model to the specified directory
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
processor.save_pretrained(save_directory)
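
For the Hugging Face Hub alternative mentioned above, a minimal sketch looks like the following; the container versions are assumptions and should match an available Hugging Face DLC.

from sagemaker.huggingface.model import HuggingFaceModel

# Deploy directly from the Hugging Face Hub (no model_data needed)
hub_model = HuggingFaceModel(
    role=role,
    transformers_version="4.26",   # assumed versions; pick ones supported by a DLC
    pytorch_version="1.13",
    py_version="py39",
    env={
        "HF_MODEL_ID": "openai/whisper-base",
        "HF_TASK": "automatic-speech-recognition",
    },
)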

  2. Select the DLC

Similar to the PyTorch framework, you can choose a pre-built Hugging Face DLC from the same link. Make sure to select a DLC that supports the latest Hugging Face transformers version and includes GPU support.
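
As with the PyTorch container, the URI can also be looked up programmatically. Here's a minimal sketch; the version numbers are assumptions to be matched against the available Hugging Face DLCs.

from sagemaker import image_uris

# Retrieve a Hugging Face inference DLC URI (the versions below are assumptions)
image = image_uris.retrieve(
    framework="huggingface",
    region="us-east-1",
    version="4.26",                       # transformers version
    base_framework_version="pytorch1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    image_scope="inference",
)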

  3. Create Amazon SageMaker models

Similarly, we use the SageMaker Python SDK to create Hugging Face models. The Hugging Face Whisper model has a default limitation where it can only process audio segments up to 30 seconds. To address this limitation, you can include the chunk_length_s parameter in the environment variables when creating the Hugging Face model, and later pass this parameter into the custom inference script when loading the model. Lastly, set the environment variables to increase the payload size and response timeout for the Hugging Face container.

# Create a HuggingFaceModel for deployment
from sagemaker.huggingface.model import HuggingFaceModel

whisper_hf_model = HuggingFaceModel(
    model_data=model_uri,
    role=role,
    image_uri=image,
    entry_point="inference.py",
    source_dir="code",
    name=model_name,
    env = {
        "chunk_length_s": "30",
        'MMS_MAX_REQUEST_SIZE': '2000000000',
        'MMS_MAX_RESPONSE_SIZE': '2000000000',
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
    }
)

Framework                                          Environment variables
Hugging Face Inference Container (based on MMS)    'MMS_MAX_REQUEST_SIZE': '2000000000'
                                                   'MMS_MAX_RESPONSE_SIZE': '2000000000'
                                                   'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
  4. Define the model loading method in inference.py

When creating the custom inference script for the Hugging Face model, we use a pipeline, which allows us to pass chunk_length_s as a parameter. This parameter enables the model to efficiently process long audio files during inference.

### Hugging Face
import os

import torch
from transformers import pipeline

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
chunk_length_s = int(os.environ.get('chunk_length_s'))

def model_fn(model_dir):
    """
    Load and return the model
    """
    model = pipeline(
        "automatic-speech-recognition",
        model=model_dir,
        chunk_length_s=chunk_length_s,
        device=DEVICE,
        )
    return model
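
A request handler for this pipeline can be quite short, because the automatic-speech-recognition pipeline accepts raw audio bytes directly. The following transform_fn is a minimal sketch rather than the exact handler from the sample repository.

### Hugging Face (illustrative request handler)
import json

def transform_fn(model, request_body, content_type, accept="application/json"):
    # The ASR pipeline decodes the raw audio bytes (via ffmpeg) and chunks long inputs
    result = model(request_body)
    return json.dumps({"text": result["text"]})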

Exploring different inference options on Amazon SageMaker

The steps for selecting inference options are the same for both the PyTorch and Hugging Face models, so we won't differentiate between them below. However, it's worth noting that, at the time of writing this post, the serverless inference option from SageMaker doesn't support GPUs, and as a result, we exclude this option for this use case.

  1. Real-time inference

We can deploy the model as a real-time endpoint, providing responses in milliseconds. However, it's important to note that this option is limited to processing inputs under 6 MB. We define the serializer as an audio serializer, which is responsible for converting the input data into a suitable format for the deployed model. We use a GPU instance for inference, allowing for accelerated processing of audio files. The inference input is an audio file from the local repository.

from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

# Define the serializer and deserializer
audio_serializer = DataSerializer(content_type="audio/x-audio")
deserializer = JSONDeserializer()

# Deploy the model for real-time inference
endpoint_name = f'whisper-real-time-endpoint-{id}'

real_time_predictor = whisper_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name = endpoint_name,
    serializer=audio_serializer,
    deserializer = deserializer
    )

# Perform real-time inference
audio_path = "sample_audio.wav"
response = real_time_predictor.predict(data=audio_path)

  2. Batch transform job

The second inference option is the batch transform job, which can process input payloads of up to 100 MB. However, this method may incur a few minutes of latency. Each instance can handle only one batch request at a time, and instance startup and shutdown also take a few minutes. The inference results are saved in an Amazon Simple Storage Service (Amazon S3) bucket upon completion of the batch transform job.

When configuring the batch transformer, be sure to include max_payload = 100 to handle larger payloads effectively. The inference input should be the Amazon S3 path to an audio file, or an Amazon S3 bucket folder containing a list of audio files, each with a size smaller than 100 MB.

Batch transform partitions the Amazon S3 objects in the input by key and maps each object to an instance. For example, when you have multiple audio files, one instance might process input1.wav while another instance processes input2.wav, which improves scalability. Batch transform also lets you configure max_concurrent_transforms to increase the number of HTTP requests made to each individual transformer container. However, it's important to note that the value of (max_concurrent_transforms * max_payload) must not exceed 100 MB.

# Create a transformer
whisper_transformer = whisper_model.transformer(
    instance_count = 1,
    instance_type = "ml.g4dn.xlarge", 
    output_path="s3://{}/{}/batch-transform/".format(bucket, prefix),
    max_payload = 100
)
# Start the batch transform job
whisper_transformer.transform(data = data, job_name = job_name, wait = False)
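
The data argument above is an Amazon S3 URI. As a hedged example, the input audio files could be staged like this (the local folder and key prefix below are assumptions):

import sagemaker

# Upload local audio files to S3 and point the transform job at that prefix
sess = sagemaker.Session()
data = sess.upload_data("audio_files/", key_prefix=f"{prefix}/batch-input")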

  3. Asynchronous inference

Finally, Amazon SageMaker Asynchronous Inference is ideal for processing multiple requests concurrently, offering moderate latency and supporting input payloads of up to 1 GB. This option provides excellent scalability, enabling the configuration of an autoscaling group for the endpoint. When a surge of requests occurs, it automatically scales up to handle the traffic, and once all requests are processed, the endpoint scales down to 0 to save costs.

With asynchronous inference, the results are automatically saved to an Amazon S3 bucket. In the AsyncInferenceConfig, you can configure notifications for successful or failed completions. The input path points to the Amazon S3 location of the audio file. For additional details, please refer to the code on GitHub.

from sagemaker.async_inference import AsyncInferenceConfig

# Create an AsyncInferenceConfig object
async_config = AsyncInferenceConfig(
    output_path=f"s3://{bucket}/{prefix}/output", 
    max_concurrent_invocations_per_instance = 4,
    # notification_config = {
            #   "SuccessTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            #   "ErrorTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
    #}, #  Notification configuration 
)

# Deploy the model for async inference
endpoint_name = f'whisper-async-endpoint-{id}'
async_predictor = whisper_model.deploy(
    async_inference_config=async_config,
    initial_instance_count=1, 
    instance_type="ml.g4dn.xlarge",
    endpoint_name = endpoint_name
)

# Perform async inference
initial_args = {'ContentType':"audio/x-audio"}
response = async_predictor.predict_async(initial_args = initial_args, input_path=input_path)
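
The predict_async call returns immediately; the transcription is written to the configured Amazon S3 output location once processing finishes. As a small sketch (attribute name per the SageMaker Python SDK's AsyncInferenceResponse):

# The response object references the S3 key where the result will appear
print(response.output_path)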

Optional: As mentioned earlier, we have the option to configure an autoscaling group for the asynchronous inference endpoint, which allows it to handle a sudden surge of inference requests. A code example is available in this GitHub repository, and a minimal sketch follows the figure below. In the following diagram, you can observe a line chart showing two metrics from Amazon CloudWatch: ApproximateBacklogSize and ApproximateBacklogSizePerInstance. Initially, when 1000 requests were triggered, only one instance was available to handle the inference. For 3 minutes, the backlog size consistently exceeded three (note that these numbers can be configured), and the autoscaling group responded by spinning up additional instances to efficiently clear the backlog. This resulted in a significant decrease in ApproximateBacklogSizePerInstance, allowing backlog requests to be processed much faster than during the initial phase.

Figure 2. Line chart illustrating the changes in Amazon CloudWatch metrics over time
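
For reference, the following is a minimal sketch of such an autoscaling configuration using Application Auto Scaling; the capacity limits, target value, and cooldowns are assumptions, and the full example lives in the GitHub repository.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Allow the async endpoint to scale between 0 and 5 instances (limits are assumptions)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

# Scale based on the per-instance backlog of queued requests
autoscaling.put_scaling_policy(
    PolicyName="whisper-async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 3.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)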

Comparative analysis of the inference options

The comparisons of the different inference options are based on common audio processing use cases. Real-time inference offers the fastest inference speed but restricts payload size to 6 MB. This inference type is suitable for audio command systems, where users control or interact with devices or software using voice commands or spoken instructions. Voice commands are typically small in size, and low inference latency is crucial to ensure that transcribed commands can promptly trigger subsequent actions. Batch transform is ideal for scheduled offline tasks, when each audio file is under 100 MB and there's no specific requirement for fast inference response times. Asynchronous inference allows uploads of up to 1 GB and offers moderate inference latency. This inference type is well suited for transcribing movies, TV series, and recorded meetings where larger audio files need to be processed.

Both the real-time and asynchronous inference options provide autoscaling capabilities, allowing the endpoint instances to automatically scale up or down based on the volume of requests. In cases with no requests, autoscaling removes unnecessary instances, helping you avoid costs associated with provisioned instances that aren't actively in use. However, for real-time inference, at least one persistent instance must be retained, which can lead to higher costs if the endpoint operates continuously. In contrast, asynchronous inference allows the instance count to be reduced to 0 when not in use. When configuring a batch transform job, it's possible to use multiple instances to process the job and adjust max_concurrent_transforms to enable one instance to handle multiple requests. Therefore, all three inference options offer good scalability.

Cleaning up

Once you have finished using the solution, make sure to remove the SageMaker endpoints to prevent incurring additional costs. You can use the following code to delete the real-time and asynchronous inference endpoints, respectively.

# Delete real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete asynchronous inference endpoint
async_predictor.delete_endpoint()
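
If the deployed models themselves are no longer needed, they can be removed too; delete_model is part of the SageMaker Python SDK's Predictor interface. A small sketch:

# Optionally delete the associated SageMaker model resources
real_time_predictor.delete_model()
async_predictor.delete_model()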

Conclusion

In this post, we showed how deploying machine learning models for audio processing has become increasingly important across various industries. Taking the Whisper model as an example, we demonstrated how to host open-source ASR models on Amazon SageMaker using the PyTorch or Hugging Face approaches. The exploration covered various inference options on Amazon SageMaker, offering insights into efficiently handling audio data, making predictions, and managing costs effectively. This post aims to provide knowledge for researchers, developers, and data scientists interested in leveraging the Whisper model for audio-related tasks and making informed decisions on inference strategies.

For more detailed information on deploying models on SageMaker, please refer to this Developer guide. Additionally, the Whisper model can be deployed using SageMaker JumpStart. For additional details, please check the Whisper models for automatic speech recognition now available in Amazon SageMaker JumpStart post.

Feel free to check out the notebook and code for this project on GitHub and share your comments with us.


About the Author

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest include deep learning, with a focus on GenAI, computer vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.
