Introducing bidirectional streaming for real-time inference on Amazon SageMaker AI
In 2025, generative AI has evolved from text generation to multi-modal use cases ranging from audio transcription and translation to voice agents that require real-time data streaming. Today's applications demand something more: continuous, real-time dialogue between users and models, with data flowing both ways, simultaneously, over a single persistent connection. Consider a speech-to-text use case, where you want to stream audio as input and receive the transcribed text as a continuous stream. Such use cases require bidirectional streaming capability.
We're introducing bidirectional streaming for Amazon SageMaker AI Inference, which transforms inference from a transactional exchange into a continuous conversation. Speech works best with real-time AI when conversations flow naturally without interruptions. With bidirectional streaming, speech-to-text becomes immediate: the model listens and transcribes at the same time, so words appear the moment they are spoken. Picture a caller describing an issue to a support line. As they speak, the live transcript appears in front of the call center agent, giving the agent instant context and letting them respond without waiting for the caller to finish. This kind of continuous exchange makes voice experiences feel fluid, responsive, and human.
This post shows you how to build and deploy a container with bidirectional streaming capability to a SageMaker AI endpoint. We also demonstrate how you can bring your own container or use our partner Deepgram's pre-built models and containers on SageMaker AI to enable the bidirectional streaming feature for real-time inference.
Bidirectional streaming: Deep dive
With bidirectional streaming, data flows both ways at once over a single, persistent connection.
In the traditional approach to inference requests, the client sends a complete question and waits while the model processes the request and returns a complete answer before the client can send the next question.
In bidirectional streaming, the user's speech starts flowing to the model while the model simultaneously begins processing and streaming the transcription back immediately.
Users see results as soon as the model begins producing them. Maintaining one persistent connection replaces hundreds of short-lived connections, which reduces overhead from networking infrastructure, TLS handshakes, and connection management. Models can maintain context across a continuous stream, enabling multi-turn interactions without resending conversation history each time.
SageMaker AI Inference bidirectional streaming capability
SageMaker AI Inference combines HTTP/2 and WebSocket protocols for real-time, two-way communication between clients and models. When you invoke a SageMaker AI Inference endpoint with bidirectional streaming, your request travels through the three-layer infrastructure in SageMaker AI:
- Client to SageMaker AI router: Your application connects to the Amazon SageMaker AI runtime endpoint using HTTP/2, establishing an efficient, multiplexed connection that supports bidirectional streaming.
- SageMaker AI router to model container: The router forwards your request to a sidecar (a lightweight proxy running alongside your model container), which then establishes a WebSocket connection to your model container at ws://localhost:8080/invocations-bidirectional-stream.
Once the connection is established, data flows freely in both directions:
- Request stream: Your application sends input as a sequence of payload chunks over HTTP/2. The SageMaker AI infrastructure converts these into WebSocket data frames, either text (for UTF-8 data) or binary, and forwards them to your model container. The model receives these frames in real time and can begin processing immediately, even before the complete input arrives, which is useful for transcription use cases.
- Response stream: Your model generates output and sends it back as WebSocket frames. SageMaker AI wraps each frame in a response payload and streams it directly to your application over HTTP/2. Users see results as soon as the model produces them: word by word for text, frame by frame for video, or sample by sample for audio.
The WebSocket connection between the sidecar and your model container stays open throughout your session, with built-in health monitoring. To maintain connection health, SageMaker AI sends WebSocket ping frames every 60 seconds to verify the connection is active, and your model container responds with pong frames to confirm it is healthy. If five consecutive pings go unanswered, the connection is gracefully closed.
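To make the container-side contract concrete, the following is a minimal sketch of a WebSocket handler that a model container could expose on port 8080, written with the Python websockets package. It is illustrative only and not the exact code from the demo repository; a real model server would feed incoming frames to an inference loop rather than echoing them back. Most WebSocket libraries, including websockets, answer ping frames with pong frames automatically, which satisfies the health checks described above.

```python
# Minimal sketch of a model container's bidirectional streaming handler.
# Assumes the Python "websockets" package (pip install websockets); names
# and structure are illustrative, not the demo repository's actual code.
import asyncio
import websockets

async def handle_stream(websocket):
    # Each incoming WebSocket frame (text or binary) is one input chunk.
    async for chunk in websocket:
        # A real model would run inference here and emit partial results;
        # this sketch simply streams every received chunk straight back.
        await websocket.send(chunk)

async def main():
    # SageMaker AI connects to ws://localhost:8080/invocations-bidirectional-stream.
    # The websockets server replies to ping frames with pongs automatically.
    async with websockets.serve(handle_stream, "0.0.0.0", 8080):
        await asyncio.Future()  # serve until the container is stopped

if __name__ == "__main__":
    asyncio.run(main())
```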
Building your own container for implementing bidirectional streaming
If you want to use open source or your own models, you can customize your container to support bidirectional streaming. Your container must implement the WebSocket protocol to handle incoming data frames and send response frames back to SageMaker AI.
To get started, let's build an example bidirectional streaming application for the bring-your-own-container use case. With this example we will:
- Build a Docker container with bidirectional streaming capability, a simple echo container that streams back the same bytes it receives as input
- Deploy the container to a SageMaker AI endpoint
- Invoke the SageMaker AI endpoint with the new bidirectional streaming API
Prerequisites
- An AWS account with SageMaker AI permissions
- Docker installed locally
- Python 3.12+
- Install aws-sdk-python for the SageMaker AI Runtime InvokeEndpointWithBidirectionalStream API
Build a Docker container with bidirectional streaming capability
First, clone our demo repository and set up your environment as outlined in the README.md. The steps below create a simple demo Docker image and push it to an Amazon ECR repository in your account.
This creates a container with a Docker label indicating to SageMaker AI that bidirectional streaming is supported in this container.
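Before deploying, you can optionally sanity-check the image locally: run the container, open a WebSocket connection to the bidirectional streaming path, and confirm that the bytes you send are echoed back. The snippet below is a rough sketch of such a check using the Python websockets client; it is not part of the repository's documented steps.

```python
# Optional local sanity check for the echo container (not part of the
# repository's documented steps). Assumes the container is running locally
# and listening on port 8080.
import asyncio
import websockets

async def check_echo():
    uri = "ws://localhost:8080/invocations-bidirectional-stream"
    async with websockets.connect(uri) as ws:
        await ws.send(b"hello bidirectional streaming")
        echoed = await ws.recv()
        print("received:", echoed)

asyncio.run(check_echo())
```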
Deploy the demo bidirectional streaming container to a SageMaker AI endpoint
The following example script creates the SageMaker AI endpoint with the created container:
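The full script lives in the demo repository; as an approximation, the flow looks like the following boto3 sketch, where the image URI, IAM role, resource names, and instance type are placeholders you would replace with your own values.

```python
# Sketch of endpoint creation with boto3; all names below are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

image_uri = "<account-id>.dkr.ecr.us-east-1.amazonaws.com/bidi-streaming-demo:latest"
role_arn = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"

# Register the container image as a SageMaker AI model.
sm.create_model(
    ModelName="bidi-streaming-demo",
    PrimaryContainer={"Image": image_uri},
    ExecutionRoleArn=role_arn,
)

# Define how the model is hosted (instance type and count).
sm.create_endpoint_config(
    EndpointConfigName="bidi-streaming-demo-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "bidi-streaming-demo",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the real-time endpoint and wait until it is InService.
sm.create_endpoint(
    EndpointName="bidi-streaming-demo-endpoint",
    EndpointConfigName="bidi-streaming-demo-config",
)
sm.get_waiter("endpoint_in_service").wait(EndpointName="bidi-streaming-demo-endpoint")
```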
Invoke the SageMaker AI endpoint with the new bidirectional streaming API
Once the SageMaker AI endpoint is InService, we can invoke the endpoint to test the bidirectional streaming functionality of the test container.
The following is sample output from the invocation script, showing the input and output streams. The container echoes incoming data to the output stream, demonstrating bidirectional streaming capability.

SageMaker AI integration with Deepgram models
SageMaker AI and Deepgram have collaborated to build bidirectional streaming support for SageMaker AI endpoints. Deepgram, an AWS Advanced Tier Partner, delivers enterprise-grade voice AI models with industry-leading accuracy and speed. Their models power real-time transcription, text-to-speech, and voice agents for contact centers, media platforms, and conversational AI applications.
For customers with strict compliance requirements where audio processing must never leave their AWS VPC, traditional self-hosted options have required significant operational overhead to set up and maintain. Amazon SageMaker bidirectional streaming transforms this experience so customers can deploy and scale real-time AI applications with just a few actions in the AWS Management Console.
The Deepgram Nova-3 speech-to-text model is available today in the AWS Marketplace for deployment as a SageMaker AI endpoint, with additional models coming soon. Deepgram Nova-3 capabilities include multi-lingual transcription, enterprise-scale performance, and domain-specific recognition. Deepgram is offering a 14-day free trial on Amazon SageMaker AI so developers can prototype applications without incurring software license fees. Infrastructure charges for the selected instance type still apply during this time. For more details, see the Amazon SageMaker AI Pricing documentation.
A high-level overview and sample code are provided in the following section. Refer to the detailed quick start guide on the Deepgram documentation page for more information and examples. Connect with the Deepgram Developer Community if you need additional help with setup.
Set up a Deepgram SageMaker AI real-time inference endpoint
To set up a Deepgram SageMaker AI endpoint:
- Navigate to the AWS Marketplace Model packages section within the Amazon SageMaker AI console and search for Deepgram.
- Subscribe to the product and proceed to the launch wizard on the product page.

- Continue by providing details in the Amazon SageMaker AI real-time endpoint creation wizard. Verify that you edit the production variant to include a valid instance type when creating your endpoint configuration. The edit button may be hidden until you scroll right in the production variant table.
ml.g6.2xlarge is a preferred instance type for initial testing. Refer to the Deepgram documentation for specific hardware requirements and selection guidance.


- On the endpoint summary page, note the endpoint name you provided, as it will be needed in the following section.
Using the Deepgram SageMaker AI real-time inference endpoint
We'll now walk through a sample TypeScript application that streams an audio file to the Deepgram model hosted on a SageMaker AI real-time inference endpoint and prints the transcription streamed back in real time.
- Create a simple function to stream the WAV file. This function opens a local audio file and sends it to Amazon SageMaker AI Inference in small binary chunks (see the sketch after this list).
- Configure the Amazon SageMaker AI runtime client. This section configures the AWS Region, the SageMaker AI endpoint name, and the Deepgram model route inside the container. Update the following values as necessary:
  - region, if not using us-east-1
  - endpointName, noted from the endpoint setup above
  - test.wav, if using a different name for the locally saved audio file
- Invoke the endpoint and print the streaming transcription. This final snippet sends the audio stream to the SageMaker AI endpoint and prints Deepgram's streaming JSON events as they arrive. The application will show live speech-to-text output being generated.
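The walkthrough itself is written in TypeScript; since the original snippets are not reproduced here, the following is a rough Python sketch of the first step, chunked reading of the local WAV file. The file name, chunk size, and function name are illustrative; sending the chunks over the bidirectional stream and handling the responses follow the Deepgram quick start guide.

```python
# Rough Python sketch of step 1 (the original walkthrough is in TypeScript).
# File name, chunk size, and function name are illustrative.
from pathlib import Path
from typing import Iterator

def stream_wav_chunks(path: str = "test.wav", chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield a local audio file as small binary chunks for streaming."""
    with Path(path).open("rb") as audio:
        while chunk := audio.read(chunk_size):
            yield chunk

# Example: iterate over the chunks that would be sent to the endpoint.
for i, chunk in enumerate(stream_wav_chunks()):
    print(f"chunk {i}: {len(chunk)} bytes")
```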
Conclusion
In this post, we provided an overview of building real-time agents with generative AI, the challenges involved, and how SageMaker AI bidirectional streaming helps you address those challenges. We also provided details on how to build your own container that uses the bidirectional streaming feature. We then walked you through the steps to build a sample echo container and to use the real-time speech-to-text model offered by our partner Deepgram, a core component of a real-time voice AI agent application.
Start building bidirectional streaming applications with LLMs and SageMaker AI today.
About the authors
Lingran Xia is a software development engineer at AWS. He currently focuses on improving inference performance of machine learning models. In his free time, he enjoys traveling and skiing.
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and outbound product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and solutions for optimizing inference performance and GPU efficiency for hosting large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Victor Wang is a Sr. Solutions Architect at Amazon Web Services, based in San Francisco, CA, supporting GenAI startups including Deepgram. Victor has spent 7 years at Amazon; previous roles include software developer for AWS Site-to-Site VPN, AWS ProServe consultant for Public Sector Partners, and Technical Program Manager for Amazon Aurora MySQL. His passion is learning new technologies and traveling the world. Victor has flown over two million miles and plans to continue his perpetual journey of exploration.
Chinmay Bapat is an Engineering Manager on the Amazon SageMaker AI Inference team at AWS, where he leads engineering efforts focused on building scalable infrastructure for generative AI inference. His work enables customers to deploy and serve large language models and other AI models efficiently at scale. Outside of work, he enjoys playing board games and is learning to ski.
Deepti Ragha is a Senior Software Development Engineer on the Amazon SageMaker AI team, specializing in ML inference infrastructure and model hosting optimization. She builds features that improve deployment performance, reduce inference costs, and make ML accessible to organizations of all sizes. Outside of work, she enjoys traveling, hiking, and gardening.
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling Gen AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and as a management consultant at McKinsey.
Xu Deng is a Software Engineer Manager on the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and skiing.