How Amazon Bedrock Custom Model Import streamlined LLM deployment for Salesforce
This post is cowritten by Salesforce's AI Platform team members Srikanta Prasad, Utkarsh Arora, Raghav Tanaji, Nitin Surya, Gokulakrishnan Gopalakrishnan, and Akhilesh Deepak Gotmare.
Salesforce's Artificial Intelligence (AI) platform team runs customized large language models (LLMs)—fine-tuned versions of Llama, Qwen, and Mistral—for agentic AI applications like Agentforce. Deploying these models creates operational overhead: teams spend months optimizing instance families, serving engines, and configurations. This process is time-consuming, hard to keep up with frequent releases, and expensive because of GPU capacity reservations for peak usage.
Salesforce solved this by adopting Amazon Bedrock Custom Model Import. With Amazon Bedrock Custom Model Import, teams can import and deploy customized models through a unified API, minimizing infrastructure management while integrating with Amazon Bedrock features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. This shift lets Salesforce focus on models and business logic instead of infrastructure.
This post shows how Salesforce integrated Amazon Bedrock Custom Model Import into their machine learning operations (MLOps) workflow, reused existing endpoints without application changes, and benchmarked scalability. We share key metrics on operational efficiency and cost optimization gains, and offer practical insights for simplifying your deployment strategy.
Integration approach
Salesforce's transition from Amazon SageMaker Inference to Amazon Bedrock Custom Model Import required careful integration with their existing MLOps pipeline to avoid disrupting production workloads. The team's primary goal was to maintain their current API endpoints and model serving interfaces, with zero downtime and no required changes to downstream applications. With this approach, they could use the serverless capabilities of Amazon Bedrock while preserving the investment in their existing infrastructure and tooling. The integration strategy focused on creating a seamless bridge between their current deployment workflows and Amazon Bedrock managed services, enabling gradual migration without additional operational risk.
As shown in the following deployment flow diagram, Salesforce enhanced their existing model delivery pipeline with a single additional step to use Amazon Bedrock Custom Model Import. After their continuous integration and continuous delivery (CI/CD) process saves model artifacts to their model store (an Amazon Simple Storage Service (Amazon S3) bucket), they now call the Amazon Bedrock Custom Model Import API to register the model with Amazon Bedrock. This control plane operation is lightweight because Amazon Bedrock pulls the model directly from Amazon S3, adding minimal overhead (5–7 minutes, depending on model size) to their deployment timeline—their overall model release process stays at roughly 1 hour. The integration delivered an immediate performance benefit: SageMaker no longer needs to download weights at container startup because Amazon Bedrock preloads the model. The main configuration changes involved granting Amazon Bedrock permissions for cross-account access to their S3 model bucket and updating AWS Identity and Access Management (IAM) policies to allow inference clients to invoke Amazon Bedrock endpoints. A minimal sketch of this registration step follows.
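The following Python sketch illustrates what such a pipeline stage might look like using the Amazon Bedrock CreateModelImportJob API via boto3. This is our illustration, not Salesforce's pipeline code: the job name, model name, role ARN, and S3 URI are placeholder values you would replace with your own.

```python
import time

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Register model artifacts already saved to the S3 model store.
# All names, ARNs, and URIs below are illustrative placeholders.
response = bedrock.create_model_import_job(
    jobName="apexguru-import-job-example",
    importedModelName="apexguru-custom-model",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {
            "s3Uri": "s3://example-model-store/apexguru/model-artifacts/"
        }
    },
)
job_arn = response["jobArn"]

# Poll until the import job finishes. Amazon Bedrock pulls the weights
# directly from S3, so this control plane step typically takes minutes.
while True:
    job = bedrock.get_model_import_job(jobIdentifier=job_arn)
    if job["status"] in ("Completed", "Failed"):
        break
    time.sleep(30)

print(job["status"], job.get("importedModelArn"))
```

Once the job completes, the returned imported model ARN is what inference clients pass as the model ID when invoking Amazon Bedrock.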

The following inference flow diagram illustrates how Salesforce maintained their existing application interfaces while using Amazon Bedrock serverless capabilities. Client requests flow through their established preprocessing layer for business logic like prompt formatting before reaching Amazon Bedrock, with postprocessing applied to the raw model output. To handle complex processing requirements, they deployed lightweight SageMaker CPU containers that act as intelligent proxies—running their custom model.py logic while forwarding the actual inference to Amazon Bedrock endpoints (see the sketch after this paragraph). This hybrid architecture preserves their existing tooling framework: their prediction service continues calling SageMaker endpoints without routing changes, and they retain mature SageMaker monitoring and logging for preprocessing and postprocessing logic. The trade-off involves an additional network hop adding 5–10 milliseconds of latency and the cost of always-on CPU instances, but this approach delivers backward compatibility with existing integrations while keeping the GPU-intensive inference fully serverless through Amazon Bedrock.
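A rough sketch of this proxy pattern is shown below. This is a minimal illustration under stated assumptions, not Salesforce's actual model.py: the imported model ARN is a placeholder, the pre/postprocessing functions are hypothetical stand-ins, and the inference body schema (prompt and max_tokens fields) varies by imported model architecture. The InvokeModel call itself is the real Amazon Bedrock Runtime API.

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN for a model registered via Custom Model Import.
IMPORTED_MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"


def preprocess(payload: dict) -> str:
    # Business logic such as prompt formatting lives here.
    return f"Instruction: {payload['input']}\nAnswer:"


def postprocess(raw_output: dict) -> dict:
    # Reshape the raw model output for downstream callers.
    return {"completion": raw_output.get("generation", "").strip()}


def handler(request_body: bytes) -> dict:
    """Entry point a SageMaker CPU proxy container could expose."""
    payload = json.loads(request_body)
    prompt = preprocess(payload)

    # Forward the GPU-intensive inference to the serverless Bedrock endpoint.
    # The body schema below is an assumption; match it to your model.
    response = bedrock_runtime.invoke_model(
        modelId=IMPORTED_MODEL_ARN,
        body=json.dumps({"prompt": prompt, "max_tokens": 512}),
        contentType="application/json",
        accept="application/json",
    )
    raw_output = json.loads(response["body"].read())
    return postprocess(raw_output)
```

The design keeps the CPU-bound glue logic in a cheap, always-on container while the model itself scales to zero between invocations.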

Scalability benchmarking
To validate the performance capabilities of Amazon Bedrock Custom Model Import, Salesforce conducted comprehensive load testing across various concurrency scenarios. Their testing methodology focused on measuring how the transparent auto scaling behavior of Amazon Bedrock—where the service automatically spins up model copies on demand and scales out under heavy load—would impact real-world performance. Each test involved sending standardized payloads containing model IDs and input data through their proxy containers to Amazon Bedrock endpoints, measuring latency and throughput under different load patterns. Results (see the following table) show that at low concurrency, Amazon Bedrock achieved 44% lower latency than the ml.g6e.xlarge baseline (bf16 precision). Under higher loads, Amazon Bedrock Custom Model Import maintained consistent throughput with acceptable latency (P95 staying under roughly 10.5 seconds), demonstrating the serverless architecture's ability to handle production workloads without manual scaling.
| Concurrency (Count) | P95 Latency (Seconds) | Throughput (Requests per Minute) |
|---|---|---|
| 1 | 7.2 | 11 |
| 4 | 7.96 | 41 |
| 16 | 9.35 | 133 |
| 32 | 10.44 | 232 |
The results show the P95 latency and throughput performance of the ApexGuru model (fine-tuned QWEN-2.5 13B) at varying concurrency levels. Amazon Bedrock Custom Model Import auto scaled from one to three copies as concurrency reached 32. Each model copy used 1 model unit.
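A concurrency test along these lines can be reproduced with a short script. The sketch below is our illustration rather than Salesforce's benchmark harness: call_model is a hypothetical stand-in for one request through the proxy to Amazon Bedrock, and the script reports P95 latency and throughput in the same units as the table.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_model(prompt: str) -> None:
    """Hypothetical stand-in for one request through the proxy to Bedrock."""
    # In a real test this would be bedrock_runtime.invoke_model(...),
    # as in the earlier sketch. The sleep is a placeholder.
    time.sleep(0.5)


def run_load_test(concurrency: int, total_requests: int = 100):
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        call_model("standardized benchmark payload")
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(total_requests)))
    wall_elapsed = time.perf_counter() - wall_start

    latencies.sort()
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    throughput_rpm = total_requests / wall_elapsed * 60
    return p95, throughput_rpm


for concurrency in (1, 4, 16, 32):
    p95, rpm = run_load_test(concurrency)
    print(f"concurrency={concurrency}: P95={p95:.2f}s, throughput={rpm:.0f} req/min")
```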
Results and metrics
Beyond scalability improvements, Salesforce evaluated Amazon Bedrock Custom Model Import across two critical business dimensions: operational efficiency and cost optimization. The operational efficiency gains were substantial—the team achieved a 30% reduction in time to iterate and deploy models to production. This improvement stemmed from alleviating complex decision-making around instance selection, parameter tuning, and choosing between serving engines like vLLM vs. TensorRT-LLM. The streamlined deployment process allowed developers to focus on model performance rather than infrastructure configuration.
Cost optimization delivered even more dramatic results, with Salesforce achieving up to 40% cost reduction through Amazon Bedrock. These savings were primarily driven by their diverse traffic patterns across generative AI applications—ranging from low to high production traffic—where they previously had to reserve GPU capacity for peak workloads. The pay-per-use model proved especially beneficial for development, performance testing, and staging environments that only required GPU resources during active development cycles, avoiding the need for round-the-clock reserved capacity that often sat idle.
Lessons learned
Salesforce's journey with Amazon Bedrock Custom Model Import revealed several key insights that can guide other organizations considering a similar approach. First, although Amazon Bedrock Custom Model Import supports popular open source model architectures (Qwen, Mistral, Llama) and expands its portfolio frequently based on demand, teams working with cutting-edge architectures might need to wait for support. Organizations fine-tuning with the latest model architectures should therefore verify compatibility before committing to a deployment timeline.
For pre- and post-inference processing, Salesforce evaluated alternative approaches using Amazon API Gateway and AWS Lambda functions, which offer full serverless scaling and pay-per-use pricing down to milliseconds of execution. However, they found this approach less backward-compatible with existing integrations and observed cold start impacts when using larger libraries in their processing logic.
Cold start latency emerged as a critical consideration, particularly for larger (over 7B parameter) models. Salesforce observed cold start delays of a few minutes with 26B parameter models, with latency varying based on model size. For latency-sensitive applications that can't tolerate such delays, they recommend keeping endpoints warm by maintaining at least one model copy active through health check invocations every 14 minutes. This approach balances cost-efficiency with performance requirements for production workloads.
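The keep-warm recommendation can be sketched as follows. This is a minimal illustration, assuming a placeholder model ARN and request body; in production you would more likely drive the same invocation from a scheduled job (for example, an Amazon EventBridge rule triggering a Lambda function) than from a long-running loop.

```python
import json
import time

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN for the imported model to keep warm.
IMPORTED_MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"

# Ping on a 14-minute cadence, per the recommendation above.
KEEP_WARM_INTERVAL_SECONDS = 14 * 60


def ping_model() -> None:
    """Send a minimal invocation so Bedrock keeps one model copy active."""
    # The body schema (prompt/max_tokens) is an assumption; match it to
    # your imported model's architecture.
    bedrock_runtime.invoke_model(
        modelId=IMPORTED_MODEL_ARN,
        body=json.dumps({"prompt": "ping", "max_tokens": 1}),
        contentType="application/json",
        accept="application/json",
    )


if __name__ == "__main__":
    while True:
        ping_model()
        time.sleep(KEEP_WARM_INTERVAL_SECONDS)
```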
Conclusion
Salesforce's adoption of Amazon Bedrock Custom Model Import shows how to simplify LLM deployment without sacrificing scalability or performance. They achieved 30% faster deployments and 40% cost savings while maintaining backward compatibility through their hybrid architecture using SageMaker proxy containers alongside Amazon Bedrock serverless inference. For highly customized models or unsupported architectures, Salesforce continues using SageMaker AI as a managed ML solution.
Their success came from methodical execution: thorough load testing and gradual migration starting with non-critical workloads. The results prove that serverless AI deployment works for production, especially with variable traffic patterns. ApexGuru is now deployed in their production environment.
For teams managing LLMs at scale, this case study provides a clear blueprint: check your model architecture compatibility, plan for cold starts with larger models, and preserve existing interfaces. Amazon Bedrock Custom Model Import offers a proven path to serverless AI that can reduce overhead, speed up deployment, and cut costs while meeting performance requirements.
To learn more about pricing for Amazon Bedrock, refer to Optimizing cost for using foundational models with Amazon Bedrock and Amazon Bedrock pricing.
For help choosing between Amazon Bedrock and SageMaker AI, see Amazon Bedrock or Amazon SageMaker AI?
For more information about Amazon Bedrock Custom Model Import, see How to configure cross-account model deployment using Amazon Bedrock Custom Model Import.
For more details about ApexGuru, refer to Get AI-Powered Insights for Your Apex Code with ApexGuru.
About the authors
Srikanta Prasad is a Senior Manager in Product Management specializing in generative AI solutions at Salesforce. He leads Model Hosting and Inference initiatives, focusing on LLM inference serving, LLMOps, and scalable AI deployments.
Utkarsh Arora is an Associate Member of Technical Staff at Salesforce, combining strong academic grounding from IIIT Delhi with early career contributions in ML engineering and research.
Raghav Tanaji is a Lead Member of Technical Staff at Salesforce, specializing in machine learning, pattern recognition, and statistical learning. He holds an M.Tech from IISc Bangalore.
Akhilesh Deepak Gotmare is a Senior Research Staff Member at Salesforce Research, based in Singapore. He is an AI researcher specializing in deep learning, natural language processing, and code-related applications.
Gokulakrishnan Gopalakrishnan is a Principal Software Engineer at Salesforce, where he leads engineering efforts on ApexGuru. With 15+ years of experience, including at Microsoft, he focuses on building scalable software systems.
Nitin Surya is a Lead Member of Technical Staff at Salesforce with 8+ years in software/ML engineering. He holds a B.Tech in CS from VIT University and an MS in CS (AI/ML focus) from the University of Illinois Chicago.
Hrushikesh Gangur is a Principal Solutions Architect at AWS based in San Francisco, California. He specializes in generative and agentic AI, helping startups and ISVs build and deploy AI applications.