How Cisco accelerated the use of generative AI with Amazon SageMaker Inference


This post is co-written with Travis Mehlinger and Karthik Raghunathan from Cisco.

Webex by Cisco is a leading provider of cloud-based collaboration solutions, including video meetings, calling, messaging, events, polling, asynchronous video, and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels its innovation, which uses artificial intelligence (AI) and machine learning (ML) to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps, including AWS.

Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, using large language models (LLMs) to improve user productivity and experiences. In the past year, the team has increasingly focused on building AI capabilities powered by LLMs to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing (NLP), and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLMs grew to contain hundreds of gigabytes of data, the WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.

This post highlights how Cisco implemented new functionalities and migrated existing workloads to Amazon SageMaker inference components for their industry-specific contact center use cases. By integrating generative AI, they can now analyze call transcripts to better understand customer pain points and improve agent productivity. Cisco has also implemented conversational AI experiences, including chatbots and virtual agents that can generate human-like responses, to automate personalized communications based on customer context. Additionally, they are using generative AI to extract key call drivers, optimize agent workflows, and gain deeper insights into customer sentiment. Cisco’s adoption of SageMaker Inference has enabled them to streamline their contact center operations and provide more satisfying, personalized interactions that address customer needs.

In this post, we discuss the following:

  • Cisco’s business use cases and outcomes
  • How Cisco accelerated the use of generative AI powered by LLMs for their contact center use cases with the help of SageMaker Inference
  • Cisco’s generative AI inference architecture, which is built as a robust and secure foundation, using various services and features such as SageMaker Inference, Amazon Bedrock, Kubernetes, Prometheus, Grafana, and more
  • How Cisco uses an LLM router and auto scaling to route requests to appropriate LLMs for different tasks while simultaneously scaling their models for resiliency and performance efficiency
  • How the solutions in this post impacted Cisco’s business roadmap and strategic partnership with AWS
  • How Cisco helped SageMaker Inference build new capabilities to deploy generative AI applications at scale

Enhancing collaboration and customer engagement with generative AI: Webex’s AI-powered solutions

In this section, we discuss Cisco’s AI-powered use cases.

Meeting summaries and insights

For Webex Meetings, the platform uses generative AI to automatically summarize meeting recordings and transcripts. This extracts the key takeaways and action items, helping distributed teams stay informed even if they missed a live session. The AI-generated summaries provide a concise overview of important discussions and decisions, allowing employees to quickly get up to speed. Beyond summaries, Webex’s generative AI capabilities also surface intelligent insights from meeting content. This includes identifying action items, highlighting critical decisions, and generating personalized meeting notes and to-do lists for each participant. These insights help make meetings more productive and hold attendees accountable.

Enhancing contact center experiences

Webex is also applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.

Webex customers realize positive outcomes with generative AI

Webex’s adoption of generative AI is driving tangible benefits for customers. Clients using the platform’s AI-powered meeting summaries and insights have reported productivity gains. Webex customers using the platform’s generative AI for contact centers have handled hundreds of thousands of calls with improved customer satisfaction and reduced handle times, enabling more natural, empathetic conversations between agents and clients. Webex’s strategic integration of generative AI is empowering users to work smarter and deliver exceptional experiences.

For more details on how Webex is harnessing generative AI to enhance collaboration and customer engagement, see Webex | Exceptional Experiences for Every Interaction on the Webex blog.

Using SageMaker Inference to optimize resources for Cisco

Cisco’s WxAI team is dedicated to delivering advanced collaboration experiences powered by cutting-edge ML. The team develops a comprehensive suite of AI and ML features for the Webex ecosystem, including audio intelligence capabilities like noise removal and optimizing speaker voices, language intelligence for transcription and translation, and video intelligence features like virtual backgrounds. At the forefront of WxAI’s innovations is the AI-powered Webex Assistant, a virtual assistant that provides voice-activated control and seamless meeting support in multiple languages. To build these sophisticated capabilities, WxAI uses LLMs, which can contain up to hundreds of gigabytes of training data.

Initially, WxAI embedded LLMs directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio. To address these challenges, the WxAI team turned to SageMaker Inference, a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication features.
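To make the decoupling pattern concrete, the following is a minimal sketch (not Cisco’s actual deployment code) of hosting an LLM on its own SageMaker endpoint, which the application then calls over HTTPS instead of loading the model into its own container image. The model ID, container version, endpoint name, and instance type are illustrative assumptions.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# LLM serving container (Text Generation Inference); version is an assumption
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "google/flan-t5-xl",  # illustrative model to serve
        "SM_NUM_GPUS": "1",
    },
    role=role,
)

# The model now runs and scales on its own instances, independently of the
# application containers that consume it
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="wxai-llm-endpoint",  # illustrative name
)

print(predictor.predict({"inputs": "Summarize: the customer called about ..."}))
```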

“The applications and the models work and scale fundamentally differently, with entirely different cost considerations; by separating them rather than lumping them together, it’s much simpler to solve issues independently.”

– Travis Mehlinger, Principal Engineer at Cisco.

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.

Solution overview: Enhancing efficiency and reducing costs by migrating to SageMaker Inference

To address the scalability and resource utilization challenges faced with embedding LLMs directly into their applications, the WxAI team migrated to SageMaker Inference. By taking advantage of this fully managed service for deploying LLMs, Cisco unlocked significant performance and cost-optimization opportunities. Key benefits include the ability to deploy multiple LLMs behind a single endpoint for faster scaling and improved response latencies, as well as cost savings. Additionally, the WxAI team implemented an LLM proxy to simplify access to LLMs for Webex teams, enable centralized data collection, and reduce operational overhead. With SageMaker Inference, Cisco can efficiently manage and scale their LLM deployments, harnessing the power of generative AI across the Webex portfolio while maintaining optimal performance, scalability, and cost-effectiveness.
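Deploying multiple LLMs behind a single endpoint is what SageMaker inference components provide: each model gets its own compute reservation and copy count on a shared managed fleet. The following is a hedged sketch of that setup with the boto3 API; the endpoint and model names, role ARN, and resource sizes are illustrative assumptions, not Cisco’s configuration.

```python
import boto3

sm = boto3.client("sagemaker")

# An endpoint config with managed instance scaling, as required for
# inference components; the shared fleet can grow from 1 to 30 instances
sm.create_endpoint_config(
    EndpointConfigName="wxai-shared-config",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # assumption
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
        "ManagedInstanceScaling": {"MinInstanceCount": 1, "MaxInstanceCount": 30},
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
sm.create_endpoint(
    EndpointName="wxai-shared-endpoint",
    EndpointConfigName="wxai-shared-config",
)

# Each LLM is attached as its own inference component, so it can be
# updated and scaled independently of the other models on the endpoint
sm.create_inference_component(
    InferenceComponentName="topic-labeling-llm",
    EndpointName="wxai-shared-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama-2-13b-chat",  # a pre-created SageMaker model (assumption)
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 4096,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
```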

The following diagram illustrates the WxAI architecture on AWS.

The architecture is built on a robust and secure AWS foundation:

  • The architecture uses AWS services like Application Load Balancer, AWS WAF, and EKS clusters for seamless ingress, threat mitigation, and containerized workload management.
  • The LLM proxy (a microservice deployed on an EKS pod as part of the Service VPC) simplifies the integration of LLMs for Webex teams, providing a streamlined interface and reducing operational overhead. The LLM proxy supports LLM deployments on SageMaker Inference, Amazon Bedrock, or other LLM providers for Webex teams (see the routing sketch after this list).
  • The architecture uses SageMaker Inference for optimized model deployment, auto scaling, and routing mechanisms.
  • The system integrates Loki for logging, Amazon Managed Service for Prometheus for metrics, and Grafana for unified visualization, seamlessly integrated with Cisco SSO.
  • The Data VPC houses the data layer components, including Amazon ElastiCache for caching and Amazon Relational Database Service (Amazon RDS) for database services, providing efficient data access and management.
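As a rough illustration of the proxy’s role (this is not Cisco’s implementation), the following sketch exposes a single invoke function and routes each request to SageMaker Inference or Amazon Bedrock based on a routing table. The model names, endpoint name, and routing table are hypothetical.

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime")
bedrock = boto3.client("bedrock-runtime")

# Hypothetical routing table: logical model name -> (backend, target)
ROUTES = {
    "call-driver-extractor": ("sagemaker", "wxai-llm-endpoint"),
    "claude-3-sonnet": ("bedrock", "anthropic.claude-3-sonnet-20240229-v1:0"),
}

def invoke(model_name: str, prompt: str) -> str:
    """Route a prompt to whichever backend hosts the requested model."""
    backend, target = ROUTES[model_name]
    if backend == "sagemaker":
        resp = smr.invoke_endpoint(
            EndpointName=target,
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt}),
        )
        return resp["Body"].read().decode()
    # Bedrock path; request/response shapes vary by model provider
    resp = bedrock.invoke_model(
        modelId=target,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(resp["Body"].read())["content"][0]["text"]
```

Centralizing the routing in one service also gives a single place to collect usage data and apply policies across LLM providers.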

Use case overview: Contact center topic analytics

A key focus area for the WxAI team is to enhance the capabilities of the Webex Contact Center platform. A typical Webex Contact Center installation has hundreds of agents handling many interactions through various channels like phone calls and digital channels. Webex’s AI-powered Topic Analytics feature extracts the key reasons customers are calling by analyzing aggregated historical interactions and clustering them into meaningful topic categories, as shown in the following screenshot. The contact center administrator can then use these insights to optimize operations, enhance agent performance, and ultimately deliver a more satisfying customer experience.

The Topic Analytics feature is powered by a pipeline of three models: a call driver extraction model, a topic clustering model, and a topic labeling model, as illustrated in the following diagram.

The model details are as follows:

  • Call driver extraction – This generative model summarizes the primary reason or intent (known as the call driver) behind a customer’s call. Accurate automated tagging of calls with call drivers helps contact center supervisors and administrators quickly understand the primary reason for any historical call. One of the key considerations when solving this problem was selecting the right model to balance quality and operational costs. The WxAI team chose the FLAN-T5 model on SageMaker Inference and instruction fine-tuned it for extracting call drivers from call transcripts. FLAN-T5 is a powerful text-to-text transfer transformer model that performs various natural language understanding and generation tasks. This workload had a global footprint deployed in the us-east-2, eu-west-2, eu-central-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, and ca-central-1 AWS Regions.
  • Topic clustering – Although automatically tagging every contact center interaction with its call driver is a helpful feature in itself, analyzing these call drivers in an aggregated fashion over a large batch of calls can uncover even more interesting trends and insights. The topic clustering model achieves this by clustering all the individually extracted call drivers from a large batch of calls into different topic clusters. It does this by creating a semantic embedding for each call driver and employing an unsupervised hierarchical clustering technique that operates on the vector embeddings (see the clustering sketch after this list). This results in distinct and coherent topic clusters where semantically similar call drivers are grouped together.
  • Topic labeling – The topic labeling model is a generative model that creates a descriptive title to serve as the label for each topic cluster. Several LLMs were prompt-tuned and evaluated in a few-shot setting to choose the best model for the label generation task. Ultimately, Llama2-13b-chat, with its ability to better capture the contextual nuances and semantics of natural language conversation, was used for its accuracy, performance, and cost-effectiveness. Additionally, Llama2-13b-chat was deployed and used on SageMaker inference components, while maintaining relatively low operating costs compared to other LLMs, by using specific hardware like g4dn and g5 instances.
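The following is a simplified sketch of the topic clustering step: embed each extracted call driver, then group semantically similar drivers with unsupervised hierarchical (agglomerative) clustering over the embeddings. The embedding model, sample drivers, and distance threshold are illustrative assumptions, not Cisco’s production choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2

# Call drivers as produced upstream by the extraction model (sample data)
call_drivers = [
    "cancel my subscription",
    "billing charge looks wrong",
    "stop my membership renewal",
    "question about an invoice amount",
]

# Semantic embedding for each call driver (assumed embedding model)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(call_drivers, normalize_embeddings=True)

# Hierarchical clustering with no fixed cluster count: the distance
# threshold controls how granular the topic clusters become
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.8,
    metric="cosine",
    linkage="average",
).fit(embeddings)

for driver, cluster_id in zip(call_drivers, clustering.labels_):
    print(cluster_id, driver)
```

Each resulting cluster is then passed to the topic labeling model, which generates a descriptive title for it.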

This solution also used the auto scaling capabilities of SageMaker to dynamically adjust the number of instances based on a desired minimum of 1 and maximum of 30. This approach provides efficient resource utilization while maintaining high throughput, allowing the WxAI platform to handle batch jobs overnight and scale to hundreds of inferences per minute during peak hours. By deploying the model on SageMaker Inference with auto scaling, the WxAI team was able to deliver reliable and accurate responses to customer interactions for their Topic Analytics use case.

By accurately pinpointing the call driver, the system can suggest appropriate actions, resources, and next steps to the agent, streamlining the customer support process and leading to personalized and accurate responses to customer questions.

To handle fluctuating demand and optimize resource utilization, the WxAI team implemented auto scaling for their SageMaker Inference endpoints. They configured the endpoints to scale from a minimum to a maximum instance count based on GPU utilization. Additionally, the LLM proxy routed requests between the different LLMs deployed on SageMaker Inference. This proxy abstracts the complexities of communicating with various LLM providers and enables centralized data collection and analysis. This led to enhanced generative AI workflows, optimized latency, and personalized use case implementations.
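A GPU-utilization-driven scaling policy of this kind can be expressed through Application Auto Scaling. The following is a hedged sketch under assumed names and thresholds; the 70% target and the cooldown values in particular are illustrative, not Cisco’s configuration.

```python
import boto3

aas = boto3.client("application-autoscaling")

# The endpoint variant to scale (illustrative names)
resource_id = "endpoint/wxai-llm-endpoint/variant/AllTraffic"
dimension = "sagemaker:variant:DesiredInstanceCount"

# Scale between 1 and 30 instances, as described above
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    MinCapacity=1,
    MaxCapacity=30,
)

# Target tracking on the endpoint's GPU utilization CloudWatch metric
aas.put_scaling_policy(
    PolicyName="gpu-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # assumed target average GPU utilization (%)
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": "wxai-llm-endpoint"},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```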

Benefits

Through the strategic adoption of AWS AI services, Cisco’s WxAI team has realized significant benefits, enabling them to build cutting-edge, AI-powered collaboration capabilities more rapidly and cost-effectively:

  • Improved development and deployment cycle time – By decoupling models from applications, the team has streamlined processes like bug fixes, integration testing, and feature rollouts across environments, accelerating their overall development velocity.
  • Simplified engineering and delivery – The clear separation of concerns between the lean application layer and the resource-intensive model layer has simplified engineering efforts and delivery, allowing the team to focus on innovation rather than infrastructure complexities.
  • Reduced costs – By using fully managed services like SageMaker Inference, the team has offloaded infrastructure management overhead. Additionally, capabilities like asynchronous inference and multi-model endpoints have enabled significant cost optimization without compromising performance or availability.
  • Scalability and performance – Services like SageMaker Inference and Amazon Bedrock, combined with technologies like NVIDIA Triton Inference Server on SageMaker, have empowered the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding use cases.
  • Accelerated innovation – The partnership with AWS has given the WxAI team access to cutting-edge AI services and expertise, enabling them to rapidly prototype and deploy innovative capabilities like the AI-powered Webex Assistant and advanced contact center AI features.

Cisco’s contributions to SageMaker Inference: Enhancing generative AI inference capabilities

Building upon the success of their strategic migration to SageMaker Inference, Cisco has been instrumental in partnering with the SageMaker Inference team to build and enhance key generative AI capabilities within the SageMaker platform. Since the early days of generative AI, Cisco has provided the SageMaker Inference team with valuable input and expertise, enabling the introduction of several new features and optimizations:

  • Cost and performance optimizations for generative AI inference – Cisco helped the SageMaker Inference team develop innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. This breakthrough delivers significant cost savings and performance improvements for customers running generative AI workloads on SageMaker.
  • Scaling improvements for generative AI inference – Cisco’s expertise in distributed systems and auto scaling has also helped the SageMaker team develop advanced capabilities to better handle the scaling requirements of generative AI models. These improvements reduce auto scaling times by up to 40% and make auto scaling detection 6 times faster, so customers can rapidly scale their generative AI workloads on SageMaker to meet spikes in demand without compromising performance.
  • Streamlined generative AI model deployment for inference – Recognizing the need for simplified generative AI model deployment, Cisco collaborated with AWS to introduce the ability to deploy open source LLMs and FMs with just a few clicks (see the sketch after this list). This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more customers to harness the power of generative AI.
  • Simplified inference deployment for Kubernetes customers – Cisco’s deep expertise in Kubernetes and container technologies helped the SageMaker team develop new Kubernetes Operator-based inference capabilities. These innovations make it straightforward for customers running applications on Kubernetes to deploy and manage generative AI models, reducing LLM deployment costs by 50% on average.
  • Using NVIDIA Triton Inference Server for generative AI – Cisco worked with AWS to integrate the NVIDIA Triton Inference Server, a high-performance model serving container managed by SageMaker, to power generative AI inference on SageMaker Inference. This enabled the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding generative AI use cases.
  • Packaging generative AI models more efficiently – To further simplify the generative AI model lifecycle, Cisco worked with AWS to enhance the capabilities in SageMaker for packaging LLMs and FMs for deployment. These improvements make it straightforward to prepare and deploy these generative AI models, accelerating their adoption and integration.
  • Improved documentation for generative AI – Recognizing the importance of comprehensive documentation to support the growing generative AI ecosystem, Cisco collaborated with the AWS team to enhance the SageMaker documentation. This includes detailed guides, best practices, and reference materials tailored specifically for generative AI use cases, helping customers quickly ramp up their generative AI initiatives on the SageMaker platform.
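In code, the few-click deployment mentioned in this list corresponds to SageMaker JumpStart, which also exposes pre-packaged open source models through a short Python path. The following is a minimal sketch; the model ID and instance type are illustrative and vary by Region and SDK version.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Pick a pre-packaged open source model by its JumpStart ID (illustrative)
model = JumpStartModel(model_id="meta-textgeneration-llama-2-13b")

# One call provisions the endpoint with a curated container and defaults
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    accept_eula=True,  # required for gated models such as Llama 2
)

# The endpoint can then be invoked like any other SageMaker endpoint;
# the exact request payload depends on the chosen model
```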

By closely partnering with the SageMaker Inference team, Cisco has played a pivotal role in driving the rapid evolution of generative AI inference capabilities in SageMaker. The features and optimizations introduced through this collaboration are empowering AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

“Our partnership with the SageMaker Inference product team goes back to the early days of generative AI, and we believe the features we have built in collaboration, from cost optimizations to high-performance model deployment, will broadly help other enterprises rapidly adopt and scale generative AI workloads on SageMaker, unlocking new frontiers of innovation and business transformation.”

– Travis Mehlinger, Principal Engineer at Cisco.

Conclusion

By using AWS services like SageMaker Inference and Amazon Bedrock for generative AI, Cisco’s WxAI team has been able to optimize their AI/ML infrastructure, enabling them to build and deploy AI-powered solutions more efficiently, reliably, and cost-effectively. This strategic approach has unlocked significant benefits for Cisco in deploying and scaling its generative AI capabilities for the Webex platform. Cisco’s own journey with generative AI, as showcased in this post, offers valuable lessons and insights for other users of SageMaker Inference.

Recognizing the impact of generative AI, Cisco has played a crucial role in shaping the future of these capabilities within SageMaker Inference. By providing valuable insights and hands-on collaboration, Cisco has helped AWS develop a range of powerful features that are making generative AI more accessible and scalable for organizations. From optimizing infrastructure costs and performance to streamlining model deployment and scaling, Cisco’s contributions have been instrumental in enhancing the SageMaker Inference service.

Moving forward, the Cisco-AWS partnership aims to drive further advancements in areas like conversational and generative AI inference. As generative AI adoption accelerates across industries, Cisco’s Webex platform is designed to scale and streamline user experiences through the various use cases discussed in this post and beyond. You can expect to see ongoing innovation from this collaboration in SageMaker Inference capabilities, as Cisco and SageMaker Inference continue to push the boundaries of what’s possible in the world of AI.

For more information on Webex Contact Center’s Topic Analytics feature and related AI capabilities, refer to The Webex Advantage: Navigating Customer Experience in the Age of AI on the Webex blog.


About the Authors

Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-centered AI and ML capabilities to support Webex AI features for customers around the world. In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go-karts.

Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven features for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Ravi Thakur is a Senior Solutions Architect at AWS, based in Charlotte, NC. He specializes in solving complex business challenges using distributed, cloud-centered, and well-architected patterns. Ravi’s expertise includes microservices, containerization, AI/ML, and generative AI. He empowers AWS strategic customers on their digital transformation journeys, delivering bottom-line benefits. In his spare time, Ravi enjoys bike rides, family time, reading, movies, and traveling.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.

Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.
