Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR


This post was written with NVIDIA, and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.

Organizations today face the challenge of processing large volumes of audio data, from customer calls and meeting recordings to podcasts and voice messages, to unlock valuable insights. Automatic speech recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto scaling to zero when there is no work and handles spikes in demand without blocking other jobs.

In this blog post, we explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We also highlight the benefits of Parakeet's architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.

NVIDIA speech AI technologies: Parakeet ASR and Riva framework

NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment options. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition capabilities, achieving industry-leading accuracy with low word error rates (WER). The model's architecture uses the Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4× faster processing than standard Conformers while maintaining accuracy.

NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA speech models deliver accurate transcription and natural, expressive voices in over 36 languages, making them well suited for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, supporting both accuracy and brand voice alignment.

Seamless integration with LLMs and NVIDIA NeMo Retriever makes NVIDIA models a strong fit for agentic AI applications, helping your organization stand out with safer, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the necessary dependencies and optimizations.

This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.

Solution overview

The architecture illustrated in the diagram shows a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads. The solution provides a robust, scalable, and cost-effective processing pipeline.

Architecture components

The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, a SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.

  1. The data ingestion process begins when audio files are uploaded to Amazon Simple Storage Service (Amazon S3), triggering AWS Lambda functions that process metadata and initiate the workflow.
  2. For event processing, the SageMaker endpoint automatically sends Amazon Simple Notification Service (Amazon SNS) success and failure notifications through separate topics, enabling proper handling of transcriptions (see the configuration sketch after this list).
  3. Successfully transcribed content on Amazon S3 moves to Amazon Bedrock LLMs for intelligent summarization and additional processing such as classification and insights extraction.
  4. Finally, a comprehensive monitoring system using Amazon DynamoDB stores workflow status and metadata, enabling real-time tracking and analytics across the entire pipeline.
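To make the notification wiring concrete, the following minimal sketch (using the SageMaker Python SDK) shows how an asynchronous endpoint can be configured with separate success and failure SNS topics; the bucket, topic ARNs, instance type, and the model object are placeholders rather than the exact notebook code.

from sagemaker.async_inference import AsyncInferenceConfig

# Asynchronous inference configuration with success/failure SNS notifications
async_config = AsyncInferenceConfig(
    output_path="s3://your-bucket/async-output/",   # where transcription results land
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": "arn:aws:sns:us-east-1:111122223333:success-inf",
        "ErrorTopic": "arn:aws:sns:us-east-1:111122223333:failed-inf",
    },
)

# `model` is assumed to be a sagemaker.model.Model (or PyTorchModel) created earlier
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=async_config,
)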

Detailed implementation walkthrough

In this section, we provide a detailed walkthrough of the solution implementation.

SageMaker asynchronous endpoint prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage the resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker async hosting instances. In this example, we need one ml.g5.xlarge SageMaker async hosting instance and one ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment has GPU compute resources for local testing.

SageMaker asynchronous endpoint configuration

When you deploy a custom model like Parakeet, SageMaker offers a few options:

  • Use a NIM container provided by NVIDIA
  • Use a large model inference (LMI) container
  • Use a prebuilt PyTorch container

We provide examples for all three approaches.

Using an NVIDIA NIM container

NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.

Innovative dual-protocol architecture

The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5 MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a maximum payload of 25 MB) and advanced features like speaker diarization with precise word-level timing information. The auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automatic optimization and fine-grained control based on specific application requirements.
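As an illustration of the routing behavior described above, the following simplified Python sketch shows how such a front-end server might pick a protocol; the 5 MB threshold, path suffixes, and function signature are assumptions for illustration, not the exact NIM container code.

HTTP_MAX_BYTES = 5 * 1024 * 1024  # assumed threshold for the HTTP route

def choose_route(path: str, audio_bytes: bytes, wants_diarization: bool) -> str:
    """Return 'http' or 'grpc' for an incoming invocation."""
    if path.endswith("/invocations/http"):   # caller forces the HTTP route
        return "http"
    if path.endswith("/invocations/grpc"):   # caller forces the gRPC route
        return "grpc"
    # Auto-routing: small, plain transcription requests use HTTP;
    # large files or diarization requests use gRPC.
    if wants_diarization or len(audio_bytes) > HTTP_MAX_BYTES:
        return "grpc"
    return "http"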

Advanced speech recognition and speaker diarization capabilities

The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through NVIDIA Riva's built-in capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving the appropriate speaker label (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the complete pipeline, initializing both the ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and the total speaker count in a structured JSON format.
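For illustration, the structured response could look roughly like the following Python dictionary; the field names are assumptions and will vary with the handler implementation.

# Illustrative response shape only; the actual keys depend on the handler
example_response = {
    "transcription": "hello thank you for calling ...",
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "Speaker_0",
         "text": "hello thank you for calling", "confidence": 0.96},
        {"start": 4.2, "end": 7.8, "speaker": "Speaker_1",
         "text": "hi I have a question about my order", "confidence": 0.94},
    ],
    "speaker_count": 2,
}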

Implementation and deployment

The implementation extends the NVIDIA parakeet-1-1b-ctc-en-us NIM container as its foundation, adding a Python aiohttp server that manages the complete NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests to the appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a pre-configured Dockerfile with the necessary dependencies and optimizations. The system accepts multiple input formats, including multipart form-data (recommended for best compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.

For detailed implementation instructions and working examples, teams can reference the complete implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases, while the combined implementation offers maximum flexibility and automatic optimization.

AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or as open source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This lets organizations choose between fully managed, enterprise-grade endpoints with auto scaling and security, or flexible open source development for research or constrained use cases.

Using an AWS LMI container

LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle concerns like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example, a Python server with these optimizations) and lets you specify model parameters through environment variables.

To use the LMI container for Parakeet, we would typically (see the deployment sketch after this list):

  1. Choose the appropriate LMI image: AWS provides different LMI images for different frameworks. For Parakeet, we would use the DJLServing image for efficient inference. Alternatively, NVIDIA Triton Inference Server (which Riva uses) is an option if we package the model in ONNX or TensorRT format.
  2. Specify the model configuration: With LMI, we usually provide a model_id (if pulling from the Hugging Face Hub) or a path to our model, along with configuration for how to load it (number of GPUs, tensor parallel degree, quantization bits). The container then downloads the model and initializes it with the specified settings. We can also download our own model files from Amazon S3 instead of using the Hub.
  3. Define the inference handler: The LMI container may require a small handler script or configuration to tell it how to process requests. For ASR, this might involve reading the audio input, passing it to the model, and returning text.
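Under those assumptions, a deployment with the SageMaker Python SDK could look like the following sketch; the image URI, environment keys, S3 path, and role variable are placeholders, and the exact options depend on the LMI backend you choose.

from sagemaker.model import Model

lmi_model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:<lmi-tag>",  # placeholder LMI image
    model_data="s3://your-bucket/parakeet/model.tar.gz",  # packaged model and handler script
    role=role,                                            # assumed IAM role variable
    env={
        "OPTION_MODEL_ID": "nvidia/parakeet-ctc-1.1b",    # example Hub ID; omit when using model_data only
        "TENSOR_PARALLEL_DEGREE": "1",
    },
)

predictor = lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)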

AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (vLLM and TensorRT-LLM through a single unified configuration), helping users seamlessly experiment and switch between frameworks to find the optimal performance stack for a specific use case.

Using a SageMaker PyTorch container

SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrate how to extend a prebuilt container to install the packages the model needs. You can download the model directly from Hugging Face during endpoint creation, or download the Parakeet model artifacts, package them with the necessary configuration files into a model.tar.gz archive, and upload it to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point script to define model loading and inference logic, including audio preprocessing and transcription handling. When using the SageMaker Python SDK to create a PyTorchModel, the SDK automatically repackages the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed successfully, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
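A minimal inference.py could follow the standard model_fn/input_fn/predict_fn/output_fn pattern shown below; this sketch assumes the NVIDIA NeMo toolkit is installed in the container and uses one published Parakeet checkpoint name as an example, so adapt it to your packaging.

# inference.py - minimal sketch of the entry point script (assumes NeMo is installed)
import json
import tempfile

import nemo.collections.asr as nemo_asr


def model_fn(model_dir):
    # Load the ASR model once per worker; the checkpoint name is an example
    return nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")


def input_fn(request_body, content_type="audio/wav"):
    # Persist the raw audio bytes so the model can read them from disk
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.write(request_body)
    tmp.close()
    return tmp.name


def predict_fn(audio_path, model):
    # transcribe() takes a list of audio file paths and returns a list of results
    return model.transcribe([audio_path])


def output_fn(prediction, accept="application/json"):
    return json.dumps({"transcription": str(prediction[0])})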

For SageMaker real-time endpoints, the maximum payload size is currently 25 MB, so make sure the container is configured to allow that request size. However, if you plan to use the same model behind an asynchronous endpoint, note that async endpoints support a maximum file size of 1 GB and a response time of up to 1 hour; configure the container for this payload size and timeout accordingly. When using the PyTorch containers, here are some key configuration parameters to consider (a deployment sketch follows the list):

  • SAGEMAKER_MODEL_SERVER_WORKERS: Set the number of model server workers; each worker loads a copy of the model into GPU memory.
  • TS_DEFAULT_RESPONSE_TIMEOUT: Set the timeout for TorchServe workers; for long audio processing, you can set it to a higher value.
  • TS_MAX_REQUEST_SIZE: Set the maximum request size in bytes; use 1 GB for async endpoints.
  • TS_MAX_RESPONSE_SIZE: Set the maximum response size in bytes.
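As a sketch of how these settings can be applied, the snippet below passes them as environment variables when constructing the PyTorchModel; the framework version, S3 path, and role variable are examples only.

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://your-bucket/parakeet/model.tar.gz",
    role=role,                              # assumed IAM role variable
    entry_point="inference.py",
    framework_version="2.1",                # example DLC version
    py_version="py310",
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",   # seconds, for long audio files
        "TS_MAX_REQUEST_SIZE": "1073741824",     # 1 GB, to match the async payload limit
        "TS_MAX_RESPONSE_SIZE": "1073741824",
    },
)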

In the example notebook, we also show how to use the SageMaker local session provided by the SageMaker Python SDK. It lets you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
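A brief sketch of local mode follows, assuming Docker is available on the notebook instance; instance_type "local_gpu" runs the container on the local GPU, while "local" uses CPU only.

from sagemaker.local import LocalSession

local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

# Reuse the PyTorchModel defined earlier, pointed at the local session
pytorch_model.sagemaker_session = local_session
local_predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="local_gpu",   # or "local" for CPU-only testing
)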

CDK pipeline prerequisites

Before deploying this solution, make sure you have:

  1. AWS CLI configured with appropriate permissions – Installation Guide
  2. AWS Cloud Development Kit (AWS CDK) installed – Installation Guide
  3. Node.js 18+ and Python 3.9+ installed
  4. Docker – Installation Guide
  5. SageMaker endpoint deployed with your ML model (Parakeet ASR models or similar)
  6. Amazon SNS topics created for success and failure notifications

CDK pipeline setup

The solution deployment begins with provisioning the necessary AWS resources using Infrastructure as Code (IaC) principles. AWS CDK creates the foundational components, including:

  • DynamoDB table: Configured for on-demand capacity to track invocation metadata, processing status, and results
  • S3 buckets: Secure storage for input audio files, transcription outputs, and summarization results
  • SNS topics: Separate topics for success and failure event handling
  • Lambda functions: Serverless functions for metadata processing, status updates, and workflow orchestration
  • IAM roles and policies: Appropriate permissions for cross-service communication and resource access

Environment setup

Clone the repository and install dependencies:

# Install degit, a tool for downloading specific subdirectories
npm install -g degit

# Clone just the specific folder
npx degit aws-samples/genai-ml-platform-examples/infrastructure/automated-speech-recognition-async-pipeline-sagemaker-ai/sagemaker-async-batch-inference-cdk sagemaker-async-batch-inference-cdk

# Navigate to the folder
cd sagemaker-async-batch-inference-cdk

# Install Node.js dependencies
npm install

# Set up the Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate
pip install -r requirements.txt

Configuration

Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:

vim bin/aws-blog-sagemaker.ts 

# Replace the endpoint name
sageMakerConfig: { 
    endpointName: 'your-sagemaker-endpoint-name',     
    enableSageMakerAccess: true 
}

If you have followed the notebook to deploy the endpoint, you should have already created the two SNS topics. Otherwise, make sure to create the correct SNS topics using the CLI:

# Create SNS topics
aws sns create-topic --name success-inf
aws sns create-topic --name failed-inf

Build and deploy

Before you deploy the AWS CloudFormation template, make sure Docker is running.

# Compile TypeScript to JavaScript
npm run build

# Bootstrap CDK (first time only)
npx cdk bootstrap

# Deploy the stack
npx cdk deploy

Verify deployment

After a successful deployment, note the output values:

  • DynamoDB table name for status tracking
  • Lambda function ARNs for processing and status updates
  • SNS topic ARNs for notifications

Submit an audio file for processing

Processing audio files

Update the upload_audio_invoke_lambda.sh script with your Lambda function ARN and S3 bucket:

LAMBDA_ARN="YOUR_LAMBDA_FUNCTION_ARN"
S3_BUCKET="YOUR_S3_BUCKET_ARN"

Run the script:

AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh

This script will:

  • Download a sample audio file
  • Upload the audio file to your S3 bucket
  • Send the bucket path to Lambda and trigger the transcription and summarization pipeline

Monitoring progress

You can check the results in the DynamoDB table using the following command:

aws dynamodb scan --table-name YOUR_DYNAMODB_TABLE_NAME

Check the processing status in the DynamoDB table:

  • submitted: Successfully queued for inference
  • completed: Transcription completed successfully
  • failed: Processing encountered an error

Audio processing and workflow orchestration

The core processing workflow follows an event-driven pattern:

Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This enables comprehensive tracking from the moment audio content enters the system.

Asynchronous speech recognition: Audio files are processed through the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
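The sketch below shows how a Lambda function could submit an uploaded file to the asynchronous endpoint and record the invocation for tracking; the endpoint, table, and attribute names are placeholders rather than the exact pipeline code.

import boto3

sm_runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table("audio-processing-status")   # placeholder table name

def submit_audio(s3_uri: str, file_id: str) -> str:
    # Queue the audio file for asynchronous transcription
    response = sm_runtime.invoke_endpoint_async(
        EndpointName="your-sagemaker-endpoint-name",   # placeholder endpoint name
        InputLocation=s3_uri,                          # S3 URI of the uploaded audio file
        ContentType="audio/wav",
        InvocationTimeoutSeconds=3600,
    )
    # Record the invocation so status can be tracked end to end
    table.put_item(Item={
        "file_id": file_id,
        "inference_id": response["InferenceId"],
        "output_location": response["OutputLocation"],
        "status": "submitted",
    })
    return response["InferenceId"]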

Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format.
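A minimal summarization call with the Bedrock Converse API might look like the following; the model ID and prompt are illustrative and should be replaced with whatever your pipeline is configured to use.

import boto3

bedrock = boto3.client("bedrock-runtime")

def summarize(transcript: str) -> str:
    # Ask a Bedrock model for a short summary of the transcription
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"Summarize the following call transcript:\n\n{transcript}"}],
        }],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]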

Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update the processing status, and can initiate retry logic for transient failures. This robust error handling minimizes data loss and provides clear visibility into processing issues.

Real-world applications

Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.

Meeting and conference processing: Business teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.

Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.

Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.

Cleanup

Once you have finished using the solution, delete the SageMaker endpoints to avoid incurring additional costs. You can use the provided code to delete the real-time and asynchronous inference endpoints, respectively:

# Delete the real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete the asynchronous inference endpoint
async_predictor.delete_endpoint()

You should also delete the resources created by the CDK stack.

# Delete CDK Stack
cdk destroy

Conclusion

The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR's industry-leading accuracy and speed with the NVIDIA Riva optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution uses managed AWS services (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience the full potential of this solution, we encourage you to explore it and reach out to us if you have specific business requirements and would like to customize the solution for your use case.


About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Tony Trinh is a Senior AI/ML Specialist Architect at AWS. With 13+ years of experience in the IT industry, Tony focuses on architecting scalable, compliance-driven AI and ML solutions, particularly in generative AI, MLOps, and cloud-native data platforms. As part of his PhD, he is doing research in multimodal AI and spatial AI. In his spare time, Tony enjoys hiking, swimming, and experimenting with home improvement.

Alick Wong is a Senior Solutions Architect at Amazon Web Services, where he helps startups and digital-native businesses modernize, optimize, and scale their platforms in the cloud. Drawing on his experience as a former startup CTO, he works closely with founders and engineering leaders to drive growth and innovation on AWS.

Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS, with expertise in Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Derrick Choo is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multimodal systems.

Tim Ma is a Principal Specialist in Generative AI at AWS, where he collaborates with customers to design and deploy cutting-edge machine learning solutions. He also leads go-to-market strategies for generative AI services, helping organizations harness the potential of advanced AI technologies.

Curt Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end-to-end AI workflows using NVIDIA's tooling on AWS. He enjoys making complex AI feel approachable and spending his time exploring the art, music, and outdoors of the Pacific Northwest.

Francesco Ciannella is a senior engineer at NVIDIA, where he works on conversational AI solutions built around large language models (LLMs) and audio language models (ALMs). He holds an M.S. in engineering of telecommunications from the University of Rome “La Sapienza” and an M.S. in language technologies from the School of Computer Science at Carnegie Mellon University.
