Benchmarking customized models on Amazon Bedrock using LLMPerf and LiteLLM


Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant slice of the effort, often requiring 30% of project time, because engineers must optimize instance types and configure serving parameters through methodical testing. This process can be both complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.

Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment, which keeps deployments performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: when the model is not in use and there are no invocations for 5 minutes, it scales to zero, and you pay only for what you use, in 5-minute increments. It also handles scaling up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations looking to use custom models on Amazon Bedrock, providing simplicity and cost efficiency.

Before deploying these models in production, it's important to evaluate their performance using benchmarking tools. These tools help proactively detect potential production issues such as throttling, and verify that deployments can handle expected production loads.

This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using popular open source tools: LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.

Prerequisites

This post requires an Amazon Bedrock custom model. If you don't have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using the open source tools LLMPerf and LiteLLM for performance benchmarking

To conduct performance benchmarking, you'll use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is broad support for foundation model APIs. This includes LiteLLM, which supports all models available on Amazon Bedrock.

Setting up your custom model invocation with LiteLLM

LiteLLM is a versatile open source tool that can be used both as a Python SDK and as a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider's specific endpoint requirements. It supports Amazon Bedrock APIs, including the InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.

To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.

The provider_route indicates the LiteLLM implementation of the request/response specification to use. DeepSeek R1 models can be invoked with their custom chat template using the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.

The model_arn is the Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
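For instance, here is a minimal boto3 sketch for retrieving the ARN programmatically, assuming your AWS credentials and Region are already configured:

import boto3

# List imported custom models and print their names and ARNs.
bedrock = boto3.client("bedrock")
response = bedrock.list_imported_models()
for model in response["modelSummaries"]:
    print(model["modelName"], model["modelArn"])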

For example, the following script invokes the custom model using the DeepSeek R1 chat template.

import time
from litellm import completion

# model_id is the ARN of your imported model, obtained as described above.
while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
        - Company A's revenue grew from $10M to $15M in 2023
        - Operating costs increased by 20%
        - Initial operating costs were $7M

        Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": "<think>"}],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # Retry while the model scales up from zero and isn't ready yet.
        time.sleep(60)

After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.

Configuring a token benchmark test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray clients, allowing for simulation of various load scenarios and concurrent request patterns. At the same time, each client collects performance metrics during the requests, including latency, throughput, and error rates.
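LLMPerf's internals are more involved, but the following minimal sketch illustrates the underlying pattern: Ray tasks acting as concurrent clients, each timing its own request. The send_request body is a placeholder, not LLMPerf code.

import time
import ray

ray.init()

@ray.remote
def send_request(prompt: str) -> float:
    """One simulated client: sends a request and returns its latency in seconds."""
    start = time.perf_counter()
    # Call the model invocation API here, e.g. litellm.completion(...).
    time.sleep(0.1)  # placeholder for the actual API call
    return time.perf_counter() - start

# Launch concurrent requests and collect per-request latencies.
futures = [send_request.remote(f"prompt {i}") for i in range(8)]
latencies = ray.get(futures)
print(f"mean latency: {sum(latencies) / len(latencies):.3f}s")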

Two critical metrics for performance are latency and throughput (a quick numeric check follows the definitions):

  • Latency refers to the time it takes for a single request to be processed.
  • Throughput measures the number of tokens that are generated per second.

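As a quick check of how the two relate, per-request output throughput can be derived from the token count and end-to-end latency. The numbers below are the aggregate means from the example results reported later in this post:

# Aggregate means taken from the example results later in this post.
mean_output_tokens = 1113.51
mean_end_to_end_latency_s = 10.406450886010116

# Per-request output throughput = tokens generated / end-to-end latency.
throughput_tokens_per_s = mean_output_tokens / mean_end_to_end_latency_s
print(f"{throughput_tokens_per_s:.1f} tokens/s")  # ~107, consistent with the report
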
Selecting the appropriate configuration to serve FMs usually involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and the specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it's still important to verify your deployment's latency and throughput.

Start by configuring token_benchmark.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as:

  • LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
  • Model: Define the route, API, and model ARN to invoke, similar to the previous section.
  • Mean/standard deviation of input tokens: Parameters of the probability distribution from which the number of input tokens will be sampled.
  • Mean/standard deviation of output tokens: Parameters of the probability distribution from which the number of output tokens will be sampled.
  • Number of concurrent requests: The number of users that the application is likely to support when in use.
  • Number of completed requests: The total number of requests to send to the LLM API in the test.

The following script shows an example of how to invoke the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.

python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens {mean_input_tokens} \
--stddev-input-tokens {stddev_input_tokens} \
--mean-output-tokens {mean_output_tokens} \
--stddev-output-tokens {stddev_output_tokens} \
--max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \
--timeout 1800 \
--num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \
--results-dir "${{LLM_PERF_OUTPUT}}" \
--llm-api litellm \
--additional-sampling-params '{{}}'

At the end of the test, LLMPerf outputs two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
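A short sketch for loading both files is shown below. The file-name patterns are an assumption based on LLMPerf's typical output, so adjust them to whatever appears in your results directory:

import glob
import json

# File-name patterns are assumptions; adjust to match your --results-dir.
summary_path = glob.glob("results/*_summary.json")[0]
individual_path = glob.glob("results/*_individual_responses.json")[0]

with open(summary_path) as f:
    summary = json.load(f)        # aggregate metrics
with open(individual_path) as f:
    per_request = json.load(f)    # one entry per invocation

print(sorted(summary.keys()))     # inspect the available aggregate metrics
print(len(per_request), "individual requests recorded")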

Scale to zero and cold-start latency

One thing to remember is that because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you first need to make a request to make sure there's at least one active model copy. If you get an error indicating that the model isn't ready, wait approximately ten seconds and up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it's ready, run a test invocation again, and proceed with benchmarking. The retry loop in the earlier invocation script handles this wait automatically.

Example scenario for DeepSeek-R1-Distill-Llama-8B

Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust the token count parameters for prompts and completions. For example (the corresponding command follows the list):

  • Number of clients: 2
  • Mean input token count: 500
  • Standard deviation of input token count: 25
  • Mean output token count: 1000
  • Standard deviation of output token count: 100
  • Number of requests per client: 50
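
Plugged into the benchmarking script from the previous section, this scenario corresponds to a command like the following. The model ARN is a placeholder, and two clients sending 50 requests each gives 100 completed requests:

python3 token_benchmark_ray.py \
--model "bedrock/llama/arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id" \
--mean-input-tokens 500 \
--stddev-input-tokens 25 \
--mean-output-tokens 1000 \
--stddev-output-tokens 100 \
--max-num-completed-requests 100 \
--timeout 1800 \
--num-concurrent-requests 2 \
--results-dir "results" \
--llm-api litellm \
--additional-sampling-params '{}'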

This illustrative test takes roughly 8 minutes. At the end of the test, you'll obtain a summary of aggregate metrics:

inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034

In addition to the summary, you'll receive metrics for individual requests that can be used to prepare detailed reports like the following histograms for time to first token and token throughput.
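For example, the following sketch plots a time-to-first-token histogram from the per-request file with matplotlib. The file-name pattern and the "ttft_s" key are assumptions, so inspect your own output for the actual names:

import glob
import json
import matplotlib.pyplot as plt

# File name and key name ("ttft_s") are assumptions; inspect your output.
path = glob.glob("results/*_individual_responses.json")[0]
with open(path) as f:
    per_request = json.load(f)

ttfts = [r["ttft_s"] for r in per_request]

plt.hist(ttfts, bins=20)
plt.xlabel("Time to first token (s)")
plt.ylabel("Number of requests")
plt.title("TTFT distribution")
plt.show()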

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.

In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data can help in estimating costs, because billing is based on the number of active model copies at a given time.
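As a sketch, the metric can also be retrieved programmatically with boto3. The dimension name "ModelId" used below is an assumption, so confirm it against the metric's dimensions in the CloudWatch console:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# The dimension name "ModelId" is an assumption; verify it in the console.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelId", "Value": "<imported-model-arn>"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])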

Conclusion

While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.

To learn more, try the example notebook with your custom model.

About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He's passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master's degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.
