Improved ML model deployment using Amazon SageMaker Inference Recommender

Every machine learning (ML) system has a unique service level agreement (SLA) requirement with respect to latency, throughput, and cost metrics. With advancements in hardware design, a wide range of CPU- and GPU-based infrastructures are available to help you speed up inference performance. Also, you can build these ML systems with a combination of ML models, tasks, frameworks, libraries, tools, and inference engines, making it important to evaluate the ML system performance for the best deployment configurations. You need recommendations on finding the most cost-effective ML serving infrastructure and the right combination of software configuration to achieve the best price-performance to scale these applications.

Amazon SageMaker Inference Recommender is a capability of Amazon SageMaker that reduces the time required to get ML models in production by automating load testing and model tuning across SageMaker ML instances. In this post, we highlight some of the recent updates to Inference Recommender:

  • SageMaker Python SDK support for running Inference Recommender
  • Inference Recommender usability improvements
  • New APIs that provide flexibility in running Inference Recommender
  • Deeper integration with Amazon CloudWatch for logging and metrics

Credit card fraud detection use case

Any fraudulent activity that isn't detected and mitigated immediately can cause significant financial loss. Particularly, credit card payment fraud transactions need to be identified right away to protect the individual's and company's financial health. In this post, we discuss a credit card fraud detection use case, and learn how to use Inference Recommender to find the optimal inference instance type and ML system configurations that can detect fraudulent credit card transactions in milliseconds.

We demonstrate how to set up Inference Recommender jobs for a credit card fraud detection use case. We train an XGBoost model for a classification task on a credit card fraud dataset. We use Inference Recommender with a custom load to meet inference SLA requirements to satisfy peak concurrency of 30,000 transactions per minute while serving prediction results in less than 100 milliseconds. Based on Inference Recommender's instance type recommendations, we can find the right real-time serving ML instances that yield the right price-performance for this use case. Finally, we deploy the model to a SageMaker real-time endpoint to get prediction results.

The following table summarizes the details of our use case.

Model Framework: XGBoost
Model Size: 10 MB
End-to-End Latency: 100 milliseconds
Invocations per Second: 500 (30,000 per minute)
ML Task: Binary Classification
Input Payload: 10 KB

We use a synthetically created credit card fraud dataset. The dataset contains 28 numerical features, time of the transaction, transaction amount, and class target variables. The Class column corresponds to whether or not a transaction is fraudulent. The majority of the data is non-fraudulent (284,315 samples), with only 492 samples corresponding to fraudulent examples. In the data, Class is the target classification variable (fraudulent vs. non-fraudulent) in the first column, followed by other variables.
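Given the severe class imbalance described above, a common precaution when training the XGBoost classifier is to weight the positive class. The following sketch, which is not part of the original notebook, derives the usual scale_pos_weight value from the stated sample counts.

```python
# Class counts stated above: 284,315 legitimate vs. 492 fraudulent samples.
negative, positive = 284_315, 492

# XGBoost's scale_pos_weight is typically set to the negative/positive ratio
# to compensate for imbalance in binary classification.
scale_pos_weight = negative / positive

print(f"fraudulent share of data: {positive / (negative + positive):.4%}")
print(f"suggested scale_pos_weight: {scale_pos_weight:.1f}")
```

With a ratio this skewed, accuracy alone is uninformative; the load tests in this post therefore focus on latency and throughput rather than classification quality.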

In the following sections, we show how to use Inference Recommender to get ML hosting instance type recommendations and find optimal model configurations to achieve better price-performance for your inference application.

Which ML instance type and configurations should you select?

With Inference Recommender, you can run two types of jobs: default and advanced.

The default Inference Recommender job runs a set of load tests to recommend the right ML instance types for any ML use case. SageMaker real-time deployment supports a wide range of ML instances to host and serve the credit card fraud detection XGBoost model. The default job can run a load test on a list of instances that you provide in the job configuration. If you have an existing endpoint for this use case, you can run this job to find the cost-optimized performant instance type. Inference Recommender will compile and optimize the model for a specific hardware of inference endpoint instance type using Amazon SageMaker Neo. It's important to note that not all compilation results in improved performance. Inference Recommender will report compilation details when the following conditions are met:

  • Successful compilation of the model using Neo. There could be issues in the compilation process, such as an invalid payload, data type, or more. In this case, compilation information is not available.
  • Successful inference using the compiled model that shows performance improvement, which appears in the inference job response.

An advanced job is a custom load test job that allows you to perform extensive benchmarks based on your ML application SLA requirements, such as latency, concurrency, and traffic pattern. You can configure a custom traffic pattern to simulate credit card transactions. Additionally, you can define the end-to-end model latency to predict if a transaction is fraudulent and define the maximum concurrent transactions to the model for prediction. Inference Recommender uses this information to run a performance benchmark load test. The latency, concurrency, and cost metrics from the advanced job help you make informed decisions about the ML serving infrastructure for mission-critical applications.

Solution overview

The following diagram shows the solution architecture for training an XGBoost model on the credit card fraud dataset, running a default job for instance type recommendation, and performing load testing to decide the optimal inference configuration for the best price-performance.

The diagram shows the following steps:

  1. Train an XGBoost model to classify credit card transactions as fraudulent or legitimate. Deploy the trained model to a SageMaker real-time endpoint. Package the model artifacts and sample payload (.tar.gz format), and upload them to Amazon Simple Storage Service (Amazon S3) so Inference Recommender can use these when the job is run. Note that the training step in this post is optional.
  2. Configure and run a default Inference Recommender job on a list of supported instance types to find the right ML instance type that gives the best price-performance for this use case.
  3. Optionally, run a default Inference Recommender job on an existing endpoint.
  4. Configure and run an advanced Inference Recommender job to perform a custom load test to simulate user interactions with the credit card fraud detection application. This helps you find the right configurations to satisfy latency, concurrency, and cost for this use case.
  5. Analyze the default and advanced Inference Recommender job results, which include ML instance type recommendation latency, performance, and cost metrics.
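The packaging requirement in step 1 can be sketched as follows. File names and the bucket are placeholders, and the S3 upload is shown commented out because it needs live AWS credentials; the notebook performs the equivalent steps with the real trained model.

```python
import pathlib
import tarfile

# Stand-in artifacts; in the notebook these are the trained XGBoost model and a
# 10 KB sample request payload.
workdir = pathlib.Path("ir-artifacts")
workdir.mkdir(exist_ok=True)
(workdir / "xgboost-model").write_bytes(b"<serialized model bytes>")
(workdir / "sample-payload.csv").write_text(",".join(["0.1"] * 29) + "\n")

# Inference Recommender expects both the model and the payload as .tar.gz in S3.
with tarfile.open(workdir / "model.tar.gz", "w:gz") as tar:
    tar.add(workdir / "xgboost-model", arcname="xgboost-model")
with tarfile.open(workdir / "payload.tar.gz", "w:gz") as tar:
    tar.add(workdir / "sample-payload.csv", arcname="sample-payload.csv")

# With credentials configured, the archives would then be uploaded, e.g.:
# boto3.client("s3").upload_file(str(workdir / "model.tar.gz"), "<BUCKET>", "ir/model.tar.gz")
print(sorted(p.name for p in workdir.iterdir()))
```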

A complete example is available in our GitHub notebook.


To use Inference Recommender, make sure to meet the prerequisites.

Python SDK support for Inference Recommender

We recently launched Python SDK support for Inference Recommender. You can now run default and advanced jobs using a single function: right_size. Based on the parameters of the function call, Inference Recommender infers if it should run default or advanced jobs. This greatly simplifies the use of Inference Recommender using the Python SDK. To run the Inference Recommender job, complete the following steps:

  1. Create a SageMaker model by specifying the framework, version, and image scope:
    model = Model(
        image_uri=sagemaker.image_uris.retrieve(framework="xgboost", region=region,
                                                version="1.5-1", image_scope="inference"),
        model_data=model_url,  # S3 URI of model.tar.gz; version/scope values are illustrative
        role=role,
    )

  2. Optionally, register the model in the SageMaker model registry. Note that parameters such as domain and task during model package creation are also optional parameters in the recent release.
    model_package = model.register(
        content_types=["text/csv"],  # remaining arguments shown for illustration
        response_types=["text/csv"],
        model_package_group_name=model_package_group_name,
    )

  3. Run the right_size function on the supported ML inference instance types using the following configuration. Because XGBoost is a memory-intensive algorithm, we provide ml.m5 type instances to get instance type recommendations. You can call the right_size function on the model registry object as well.
    INFO:sagemaker:Advance Job parameters were not specified. Running Default job...
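The call behind that log line looks roughly like the following. The S3 URL and the exact instance list are illustrative assumptions, and the right_size call itself is commented out because it launches a real benchmarking job.

```python
# Keyword arguments for a default Inference Recommender job; only memory-
# optimized ml.m5 instances are benchmarked, per the discussion above.
default_job_kwargs = dict(
    sample_payload_url="s3://<BUCKET>/ir/payload.tar.gz",  # placeholder URI
    supported_content_types=["text/csv"],
    supported_instance_types=[
        "ml.m5.large", "ml.m5.xlarge", "ml.m5.2xlarge",
        "ml.m5.4xlarge", "ml.m5.12xlarge",
    ],
    framework="XGBOOST",
)
# model.right_size(**default_job_kwargs)  # launches the default job
print(len(default_job_kwargs["supported_instance_types"]))
```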

  4. Define additional parameters to the right_size function to run an advanced job and custom load test on the model:
    1. Configure the traffic pattern using the phases parameter. In the first phase, we start the load test with two initial users and create two new users for every minute for 2 minutes. In the following phase, we start the load test with six initial users and create two new users for every minute for 2 minutes. Stopping conditions for the load tests are p95 end-to-end latency of 100 milliseconds and concurrency to support 30,000 transactions per minute or 500 transactions per second.
    2. We tune the endpoint against the environment variable OMP_NUM_THREADS with values [3, 4, 5], and we aim to limit the latency requirement to 100 milliseconds and achieve max concurrency of 30,000 invocations per minute. The goal is to find which value for OMP_NUM_THREADS provides the best performance.
from sagemaker.parameter import CategoricalParameter
from sagemaker.inference_recommender.inference_recommender_mixin import (
    Phase,
    ModelLatencyThreshold,
)

hyperparameter_ranges = [{
    "instance_types": CategoricalParameter(["ml.m5.4xlarge"]),
    "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "5"]),
}]
phases = [
    Phase(duration_in_seconds=120, initial_number_of_users=2, spawn_rate=2),
    Phase(duration_in_seconds=120, initial_number_of_users=6, spawn_rate=2),
]
model_latency_thresholds = [
    ModelLatencyThreshold(percentile="P95", value_in_milliseconds=100),
]
model.right_size(
    # ... payload, content types, and framework as in the default job ...
    hyperparameter_ranges=hyperparameter_ranges,
    phases=phases,  # TrafficPattern
    max_invocations=30000,  # StoppingConditions
    model_latency_thresholds=model_latency_thresholds,
)

INFO:sagemaker:Advance Job parameters were specified. Running Advanced job...

Run Inference Recommender jobs using the Boto3 API

You can use the Boto3 API to launch Inference Recommender default and advanced jobs. You need to use the Boto3 API (create_inference_recommendations_job) to run Inference Recommender jobs on an existing endpoint. Inference Recommender infers the framework and version from the existing SageMaker real-time endpoint. The Python SDK doesn't support running Inference Recommender jobs on existing endpoints.

The following code snippet shows how to create a default job:

    sagemaker_client.create_inference_recommendations_job(
        JobName="credit-card-fraud-default-job",
        JobType="Default",
        RoleArn=<ROLE_ARN>,
        InputConfig={
            'ModelPackageVersionArn': <MODEL_PACKAGE_ARN>,  # optional
            'Endpoints': [{'EndpointName': <ENDPOINT_NAME>}]
        }
    )

Later in this post, we discuss the parameters needed to configure an advanced job.

Configure a traffic pattern using the TrafficPattern parameter. In the first phase, we start a load test with two initial users (InitialNumberOfUsers) and create two new users (SpawnRate) for every minute for 2 minutes (DurationInSeconds). In the following phase, we start the load test with six initial users and create two new users for every minute for 2 minutes. Stopping conditions (StoppingConditions) for the load tests are p95 end-to-end latency (ModelLatencyThresholds) of 100 milliseconds (ValueInMilliseconds) and concurrency to support 30,000 transactions per minute or 500 transactions per second (MaxInvocations). See the following code:

env_parameter_ranges = [{"Name": "OMP_NUM_THREADS", "Value": ["3", "4", "5"]}]

sagemaker_client.create_inference_recommendations_job(
    JobName="credit-card-fraud-advanced-job",
    JobType="Advanced",
    RoleArn=role_arn,
    InputConfig={
        'ModelPackageVersionArn': model_package_arn,  # optional
        'JobDurationInSeconds': 7200,
        'TrafficPattern': {
            'TrafficType': 'PHASES',
            'Phases': [
                {'InitialNumberOfUsers': 2, 'SpawnRate': 2, 'DurationInSeconds': 120},
                {'InitialNumberOfUsers': 6, 'SpawnRate': 2, 'DurationInSeconds': 120},
            ]
        },
        'ResourceLimit': {'MaxNumberOfTests': 10, 'MaxParallelOfTests': 3},
        'EndpointConfigurations': [{
            'InstanceType': 'ml.m5.4xlarge',
            'EnvironmentParameterRanges': {'CategoricalParameterRanges': env_parameter_ranges}
        }]
    },
    StoppingConditions={
        'MaxInvocations': 30000,
        'ModelLatencyThresholds': [{'Percentile': 'P95', 'ValueInMilliseconds': 100}]
    }
)

Inference Recommender job results and metrics

The results of the default Inference Recommender job contain a list of endpoint configuration recommendations, including instance type, instance count, and environment variables. The results contain configurations for SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS associated with the latency, concurrency, and throughput metrics. OMP_NUM_THREADS is the model server tunable environment parameter. As shown in the details in the following table, with an ml.m5.4xlarge instance with SAGEMAKER_MODEL_SERVER_WORKERS=3 and OMP_NUM_THREADS=3, we got a throughput of 32,628 invocations per minute and model latency under 10 milliseconds. ml.m5.4xlarge had a 100% improvement in latency and an approximately 115% increase in concurrency compared to the ml.m5.xlarge instance configuration. Also, it was 66% more cost-effective compared to the ml.m5.12xlarge instance configuration while achieving comparable latency and throughput.

Instance Type | Initial Instance Count | OMP_NUM_THREADS | Cost Per Hour | Max Invocations | Model Latency | CPU Utilization | Memory Utilization | SageMaker Model Server Workers
ml.m5.xlarge | 1 | 2 | 0.23 | 15189 | 18 | 108.864 | 1.62012 | 1
ml.m5.4xlarge | 1 | 3 | 0.922 | 32628 | 9 | 220.57001 | 0.69791 | 3
ml.m5.large | 1 | 2 | 0.115 | 13793 | 19 | 106.34 | 3.24398 | 1
ml.m5.12xlarge | 1 | 4 | 2.765 | 32016 | 4 | 215.32401 | 0.44658 | 7
ml.m5.2xlarge | 1 | 2 | 0.461 | 32427 | 13 | 248.673 | 1.43109 | 3
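The cost comparison quoted above can be checked directly from the table: at each instance's measured peak throughput, cost per million invocations is cost per hour divided by invocations per hour. This sketch repeats that arithmetic on three of the rows.

```python
# (instance, cost_per_hour_usd, max_invocations_per_minute) from the table above
rows = [
    ("ml.m5.xlarge",   0.23,  15189),
    ("ml.m5.4xlarge",  0.922, 32628),
    ("ml.m5.12xlarge", 2.765, 32016),
]

costs = {}
for name, cost_per_hour, invocations_per_minute in rows:
    # invocations per hour = invocations per minute * 60
    costs[name] = cost_per_hour / (invocations_per_minute * 60) * 1_000_000
    print(f"{name}: ${costs[name]:.3f} per 1M invocations")

# ml.m5.4xlarge delivers comparable throughput at roughly a third of the
# per-invocation cost of ml.m5.12xlarge (the roughly 66% savings quoted above).
savings = 1 - costs["ml.m5.4xlarge"] / costs["ml.m5.12xlarge"]
print(f"savings vs. ml.m5.12xlarge: {savings:.0%}")
```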

We have included CloudWatch helper functions in the notebook. You can use the functions to get detailed charts of your endpoints during the load test. The charts have details on invocation metrics like invocations, model latency, overhead latency, and more, and instance metrics such as CPUUtilization and MemoryUtilization. The following example shows the CloudWatch metrics for our ml.m5.4xlarge model configuration.
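For reference, the kind of query those helpers issue can be sketched with plain Boto3 parameters. The endpoint and variant names are placeholders, and the actual API call is left commented out because it requires live AWS credentials.

```python
import datetime

# Parameters for a CloudWatch GetMetricStatistics call covering a two-hour
# load-test window at one-minute resolution.
now = datetime.datetime.now(datetime.timezone.utc)
metric_request = dict(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",          # also: Invocations, OverheadLatency, ...
    Dimensions=[
        {"Name": "EndpointName", "Value": "<ENDPOINT_NAME>"},
        {"Name": "VariantName", "Value": "<VARIANT_NAME>"},
    ],
    StartTime=now - datetime.timedelta(hours=2),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)
# datapoints = boto3.client("cloudwatch").get_metric_statistics(**metric_request)
print(metric_request["Namespace"], metric_request["MetricName"])
```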

You can visualize Inference Recommender job results in Amazon SageMaker Studio by choosing Inference Recommender under Deployments in the navigation pane. With a deployment goal for this use case (high latency, high throughput, default cost), the default Inference Recommender job recommended an ml.m5.4xlarge instance because it provided the best latency performance and throughput to support a maximum of 34,600 invocations per minute (576 TPS). You can use these metrics to analyze and find the best configurations that satisfy latency, concurrency, and cost requirements of your ML application.

We recently launched ListInferenceRecommendationsJobSteps, which allows you to analyze subtasks in an Inference Recommender job. The following code snippet shows how to use the list_inference_recommendations_job_steps Boto3 API to get the list of subtasks. This can help with debugging Inference Recommender job failures at the step level. This functionality is not supported in the Python SDK yet.

sm_client = boto3.client("sagemaker", region_name=region)
list_job_steps_response = sm_client.list_inference_recommendations_job_steps(
    JobName=job_name
)
print(list_job_steps_response)

The following code shows the response:

{
    "Steps": [
        {
            "StepType": "BENCHMARK",
            "JobName": "SMPYTHONSDK-<JOB_NAME>",
            "Status": "COMPLETED",
            "InferenceBenchmark": {
                "Metrics": {
                    "CostPerHour": 1.8359999656677246,
                    "CostPerInference": 1.6814110495033674e-06,
                    "MaxInvocations": 18199,
                    "ModelLatency": 40,
                    "CpuUtilization": 106.06400299072266,
                    "MemoryUtilization": 0.3920480012893677
                },
                "EndpointConfiguration": {
                    "EndpointName": "sm-epc-<ENDPOINTNAME>",
                    "VariantName": "sm-epc-<VARIANTNAME>",
                    "InstanceType": "ml.c5.9xlarge",
                    "InitialInstanceCount": 1
                },
                "ModelConfiguration": {
                    "EnvironmentParameters": [
                        {
                            "Key": "SAGEMAKER_MODEL_SERVER_WORKERS",
                            "ValueType": "String",
                            "Value": "1"
                        },
                        {
                            "Key": "OMP_NUM_THREADS",
                            "ValueType": "String",
                            "Value": "28"
                        }
                    ]
                }
            }
        },
     ...... <TRUNCATED>
    ],
    "ResponseMetadata": {
        "RequestId": "<RequestId>",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "<x-amzn-requestid>",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "1443",
            "date": "Mon, 20 Feb 2023 16:53:30 GMT"
        },
        "RetryAttempts": 0
    }
}

Run an advanced Inference Recommender job

Next, we run an advanced Inference Recommender job to find optimal configurations such as SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS on an ml.m5.4xlarge instance type. We set the hyperparameters of the advanced job to run a load test on different combinations:

hyperparameter_ranges = [{
    "instance_types": CategoricalParameter(["ml.m5.4xlarge"]),
    "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "5"]),
}]

You can view the advanced Inference Recommender job results on the Studio console, as shown in the following screenshot.

Using the Boto3 API or CLI commands, you can access all the metrics from the advanced Inference Recommender job results. InitialInstanceCount is the number of instances that you should provision in the endpoint to meet ModelLatencyThresholds and MaxInvocations mentioned in StoppingConditions. The following table summarizes our results.

Instance Type | Initial Instance Count | OMP_NUM_THREADS | Cost Per Hour | Max Invocations | Model Latency | CPU Utilization | Memory Utilization
ml.m5.2xlarge | 2 | 3 | 0.922 | 39688 | 6 | 86.732803 | 3.04769
ml.m5.2xlarge | 2 | 4 | 0.922 | 42604 | 6 | 177.164993 | 3.05089
ml.m5.2xlarge | 2 | 5 | 0.922 | 39268 | 6 | 125.402 | 3.08665
ml.m5.4xlarge | 2 | 3 | 1.844 | 38174 | 4 | 102.546997 | 2.68003
ml.m5.4xlarge | 2 | 4 | 1.844 | 39452 | 4 | 141.826004 | 2.68136
ml.m5.4xlarge | 2 | 5 | 1.844 | 40472 | 4 | 107.825996 | 2.70936
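Rows like the ones above can be assembled programmatically from the job-description response. This sketch runs against a stubbed response fragment in the shape DescribeInferenceRecommendationsJob returns; with live credentials, the equivalent would be sm_client.describe_inference_recommendations_job(JobName=...).

```python
# Stubbed response fragment; values are taken from one row of the table above.
response = {
    "InferenceRecommendations": [
        {
            "Metrics": {"CostPerHour": 0.922, "MaxInvocations": 42604, "ModelLatency": 6},
            "EndpointConfiguration": {"InstanceType": "ml.m5.2xlarge",
                                      "InitialInstanceCount": 2},
            "ModelConfiguration": {"EnvironmentParameters": [
                {"Key": "OMP_NUM_THREADS", "ValueType": "String", "Value": "4"},
            ]},
        },
    ]
}

# Flatten each recommendation into one table row.
table = []
for rec in response["InferenceRecommendations"]:
    env = {p["Key"]: p["Value"] for p in rec["ModelConfiguration"]["EnvironmentParameters"]}
    table.append((
        rec["EndpointConfiguration"]["InstanceType"],
        rec["EndpointConfiguration"]["InitialInstanceCount"],
        env["OMP_NUM_THREADS"],
        rec["Metrics"]["MaxInvocations"],
    ))
print(table[0])
```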

Clean up

Follow the instructions in the notebook to delete all the resources created as part of this post to avoid incurring additional charges.


Conclusion

Finding the right ML serving infrastructure, including instance type, model configurations, and auto scaling policies, can be tedious. This post showed how you can use the Inference Recommender Python SDK and Boto3 APIs to launch default and advanced jobs to find the optimal inference infrastructure and configurations. We also discussed the new improvements to Inference Recommender, including Python SDK support and usability improvements. Check out our GitHub repository to get started.

About the Authors

Shiva Raaj Kotini works as a Principal Product Manager in the AWS SageMaker inference product portfolio. He focuses on model deployment, performance tuning, and optimization in SageMaker for inference.

John Barboza is a Software Engineer at AWS. He has extensive experience working on distributed systems. His current focus is on improving the SageMaker inference experience. In his spare time, he enjoys cooking and biking.

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like Amazon EMR, Amazon EFA, and Amazon RDS. Currently, he is focused on improving the SageMaker inference experience. In his spare time, he enjoys hiking and marathons.

Ram Vegiraju is an ML Architect with the SageMaker service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, USA. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.
