Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Generative AI models have been experiencing rapid growth in recent months due to their impressive capabilities in creating realistic text, images, code, and audio. Among these models, Stable Diffusion models stand out for their unique strength in creating high-quality images based on text prompts. Stable Diffusion can generate a wide variety of high-quality images, including realistic portraits, landscapes, and even abstract art. And, like other generative AI models, Stable Diffusion models require powerful computing to provide low-latency inference.

In this post, we show how you can run Stable Diffusion models and achieve high performance at the lowest cost in Amazon Elastic Compute Cloud (Amazon EC2) using Amazon EC2 Inf2 instances powered by AWS Inferentia2. We look at the architecture of a Stable Diffusion model and walk through the steps of compiling a Stable Diffusion model using AWS Neuron and deploying it to an Inf2 instance. We also discuss the optimizations that the Neuron SDK automatically makes to improve performance. You can run both Stable Diffusion 2.1 and 1.5 versions on AWS Inferentia2 cost-effectively. Lastly, we show how you can deploy a Stable Diffusion model to an Inf2 instance with Amazon SageMaker.

The Stable Diffusion 2.1 model size is 5 GB in floating point 32 (FP32) and 2.5 GB in bfloat16 (BF16). A single inf2.xlarge instance has one AWS Inferentia2 accelerator with 32 GB of HBM memory, so the Stable Diffusion 2.1 model can fit on a single inf2.xlarge instance. Stable Diffusion is a text-to-image model that you can use to create images of different styles and content simply by providing a text prompt as an input. To learn more about the Stable Diffusion model architecture, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.
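The sizes quoted above follow from the parameter count multiplied by the bytes stored per parameter. The following is a minimal sketch of that arithmetic. The parameter count is a hypothetical round number, chosen only so that the FP32 figure comes to 5 GB; it is not a number taken from this post.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def model_size_gb(num_params: int, dtype: str) -> float:
    """Return the weight footprint in GB (10^9 bytes) for the given data type."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: illustrative parameter count, not a figure from the post.
NUM_PARAMS = 1_250_000_000
HBM_GB = 32  # HBM on one AWS Inferentia2 accelerator, as stated above

fp32_gb = model_size_gb(NUM_PARAMS, "fp32")
bf16_gb = model_size_gb(NUM_PARAMS, "bf16")
fits_in_hbm = bf16_gb < HBM_GB
print(f"FP32: {fp32_gb:.1f} GB, BF16: {bf16_gb:.1f} GB, fits in HBM: {fits_in_hbm}")
```

Halving the bytes per parameter halves the weight footprint, which is why the BF16 model needs half the memory of the FP32 one.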

How the Neuron SDK optimizes Stable Diffusion performance

Before we can deploy the Stable Diffusion 2.1 model on AWS Inferentia2 instances, we need to compile the model components using the Neuron SDK. The Neuron SDK, which includes a deep learning compiler, runtime, and tools, compiles and automatically optimizes deep learning models so they can run efficiently on Inf2 instances and extract the full performance of the AWS Inferentia2 accelerator. We have examples available for the Stable Diffusion 2.1 model on the GitHub repo. This notebook presents an end-to-end example of how to compile a Stable Diffusion model, save the compiled Neuron models, and load them into the runtime for inference.

We use StableDiffusionPipeline from the Hugging Face diffusers library to load and compile the model. We then compile all the components of the model for Neuron using torch_neuronx.trace() and save the optimized model as TorchScript. Compilation can be quite memory-intensive, requiring a significant amount of RAM. To circumvent this, before tracing each model, we create a deepcopy of the part of the pipeline that is being traced. Following this, we delete the pipeline object from memory using del pipe. This technique is particularly useful when compiling on instances with low RAM.
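The copy-then-delete pattern can be illustrated without any model weights. The following sketch uses small stand-in classes (FakePipeline and FakeComponent are hypothetical names, not part of diffusers) and a weak reference to observe that the full pipeline object is released once the only reference to it is deleted, while the copied component stays usable.

```python
import copy
import gc
import weakref


class FakeComponent:
    """Stand-in for one sub-model, for example the text encoder."""

    def __init__(self, name):
        self.name = name


class FakePipeline:
    """Stand-in for the full pipeline that holds several large components."""

    def __init__(self):
        self.text_encoder = FakeComponent("text_encoder")
        self.unet = FakeComponent("unet")


pipe = FakePipeline()
pipe_ref = weakref.ref(pipe)  # lets us observe when the pipeline is freed

# Keep an independent copy of the one part we want to trace.
text_encoder = copy.deepcopy(pipe.text_encoder)

# Drop the only reference to the full pipeline so its memory can be reclaimed.
del pipe
gc.collect()

pipeline_freed = pipe_ref() is None
print(text_encoder.name, pipeline_freed)
```

The deepcopy matters because a plain reference to pipe.text_encoder would share state with the pipeline, whereas the copy is fully independent of the object being deleted.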

Moreover, we also perform optimizations on the Stable Diffusion models. The UNet holds the most computationally intensive part of the inference. The UNet component operates on input tensors that have a batch size of two, producing a corresponding output tensor also with a batch size of two, to produce a single image. The elements within these batches are completely independent of each other. We can take advantage of this behavior to get optimal latency by running one batch on each Neuron core. We compile the UNet for one batch (by using input tensors with one batch), then use the torch_neuronx.DataParallel API to load this single-batch model onto each core. The output of this API is a seamless two-batch module: we can pass to the UNet the inputs of two batches, and a two-batch output is returned, but internally, the two single-batch models are running on the two Neuron cores. This strategy optimizes resource utilization and reduces latency.
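The idea of serving a two-batch request with two single-batch models can be emulated in plain Python. The following sketch is only an illustration of the data-parallel pattern: it uses threads in place of Neuron cores and a toy function in place of the compiled UNet, and it does not call torch_neuronx.DataParallel.

```python
from concurrent.futures import ThreadPoolExecutor


def single_batch_model(sample):
    """Stand-in for a UNet compiled for a batch size of one."""
    return [value * 2 for value in sample]


def run_data_parallel(model, batch, num_cores=2):
    """Send each sample to its own worker and return the outputs in input order."""
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # map() preserves input order, so the caller sees one ordered batch.
        return list(pool.map(model, batch))


two_batch_input = [[1.0, 2.0], [3.0, 4.0]]
two_batch_output = run_data_parallel(single_batch_model, two_batch_input)
print(two_batch_output)
```

From the caller's point of view the wrapper behaves like a single two-batch model, which is the property the paragraph above describes.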

Compile and deploy a Stable Diffusion model on an Inf2 EC2 instance

To compile and deploy the Stable Diffusion model on an Inf2 EC2 instance, sign in to the AWS Management Console and create an inf2.8xlarge instance. Note that an inf2.8xlarge instance is required only for the compilation of the model because compilation requires more host memory. The Stable Diffusion model can be hosted on an inf2.xlarge instance. You can find the latest AMI with Neuron libraries using the following AWS Command Line Interface (AWS CLI) command:

aws ec2 describe-images --region us-east-1 --owners amazon \
--filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' \
--query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
--output text
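The --query expression above sorts the returned images by creation date, reverses the order, and keeps the first image ID. The same selection logic can be sketched in Python as follows. The records below are hypothetical sample data in the shape of a describe-images response, not real AMI IDs.

```python
def latest_image_id(images):
    """Return the ImageId of the most recently created image, or None if empty."""
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    newest_first = sorted(images, key=lambda img: img["CreationDate"], reverse=True)
    return newest_first[0]["ImageId"] if newest_first else None


# Hypothetical records for illustration only.
sample_images = [
    {"ImageId": "ami-aaaa", "CreationDate": "2023-05-01T10:00:00.000Z"},
    {"ImageId": "ami-bbbb", "CreationDate": "2023-07-15T08:30:00.000Z"},
    {"ImageId": "ami-cccc", "CreationDate": "2023-06-20T12:00:00.000Z"},
]

newest = latest_image_id(sample_images)
print(newest)
```

Unlike the CLI query, which returns a one-element list, this helper returns the ID itself.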

For this example, we created an EC2 instance using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04). You can then create a JupyterLab environment by connecting to the instance and running the following steps:

source /opt/aws_neuron_venv_pytorch/bin/activate
pip install jupyterlab

A notebook with all the steps for compiling and hosting the model is located on GitHub.

Let's look at the compilation steps for one of the text encoder blocks. Other blocks that are part of the Stable Diffusion pipeline can be compiled similarly.

The first step is to load the pre-trained model from Hugging Face. The StableDiffusionPipeline.from_pretrained method loads the pre-trained model into our pipeline object, pipe. We then create a deepcopy of the text encoder from our pipeline, effectively cloning it. The del pipe command is then used to delete the original pipeline object, freeing up the memory that was consumed by it. Here, we load the model weights in BF16:

import copy
import torch
from diffusers import StableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

This step involves wrapping our text encoder with the NeuronTextEncoder wrapper. The output of a compiled text encoder module is of type dict. We convert it to a list type using this wrapper:

text_encoder = NeuronTextEncoder(text_encoder)
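To make the role of such a wrapper concrete, the following sketch shows a plain-Python adapter that turns a dict-shaped output into a list. DictToListWrapper, fake_encoder, and the key name last_hidden_state are assumptions for illustration; the actual NeuronTextEncoder class is defined in the notebook and may differ.

```python
class DictToListWrapper:
    """Hypothetical stand-in for a wrapper such as NeuronTextEncoder."""

    def __init__(self, encoder, key="last_hidden_state"):
        self.encoder = encoder
        self.key = key  # assumption: the dict entry the pipeline needs

    def __call__(self, emb):
        output = self.encoder(emb)  # dict-shaped output from the compiled module
        return [output[self.key]]   # list-shaped output for the caller


def fake_encoder(emb):
    """Toy encoder that returns a dict, like the compiled module described above."""
    return {"last_hidden_state": [value + 1 for value in emb]}


wrapped = DictToListWrapper(fake_encoder)
result = wrapped([1, 2, 3])
print(result)
```

Keeping the conversion in a wrapper leaves the compiled module untouched while presenting the output type the rest of the pipeline expects.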

We initialize the PyTorch tensor emb with some values. The emb tensor is used as example input for the torch_neuronx.trace function. This function traces our text encoder and compiles it into a format optimized for Neuron. The directory path for the compiled model is constructed by joining COMPILER_WORKDIR_ROOT with the subdirectory text_encoder:

emb = torch.tensor([...])
text_encoder_neuron = torch_neuronx.trace(
        text_encoder.neuron_text_encoder,
        emb,
        compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
        )

The compiled text encoder is saved as TorchScript under a file name in the text_encoder directory of our compiler's workspace:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

The notebook includes similar steps to compile other components of the model: UNet, VAE decoder, and VAE post_quant_conv. After you have compiled all the models, you can load and run the model following these steps:

  1. Define the paths for the compiled models.
  2. Load a pre-trained StableDiffusionPipeline model, with its configuration specified to use the bfloat16 data type.
  3. Load the UNet model onto two Neuron cores using the torch_neuronx.DataParallel API. This allows data parallel inference to be performed, which can significantly speed up model performance.
  4. Load the remaining parts of the model (text_encoder, decoder, and post_quant_conv) onto a single Neuron core.

You can then run the pipeline by providing input text as prompts. The following are some pictures generated by the model for the prompts:

  • Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw

  • Portrait of old coal miner in 19th century, beautiful painting, with highly detailed face painting by greg rutkowski

  • A castle in the middle of a forest

Host Stable Diffusion 2.1 on AWS Inferentia2 and SageMaker

Hosting Stable Diffusion models with SageMaker also requires compilation with the Neuron SDK. You can complete the compilation ahead of time or during runtime using Large Model Inference (LMI) containers. Compilation ahead of time allows for faster model loading times and is the preferred option.

SageMaker LMI containers provide two ways to deploy the model:

  • A no-code option where we just provide a file with the required configurations
  • Bring your own inference script

We look at both solutions and go over the configurations and the inference script. In this post, we demonstrate the deployment using a pre-compiled model stored in an Amazon Simple Storage Service (Amazon S3) bucket. You can use this pre-compiled model for your deployments.

Configure the model with a provided script

In this section, we show how to configure the LMI container to host the Stable Diffusion models. The SD2.1 notebook is available on GitHub. The first step is to create the model configuration package per the following directory structure. Our aim is to use the minimal model configurations needed to host the model. The directory structure needed is as follows:

<config-root-directory> / 
    └── [OPTIONAL]

Next, we create the file with the following parameters:

%%writefile code_sd/

The parameters specify the following:

  • option.model_id – The LMI containers use s5cmd to load the model from the S3 location, so we need to specify the location of our compiled weights.
  • option.entryPoint – To use the built-in handlers, we specify the transformers-neuronx class. If you have a custom inference script, you need to provide that instead.
  • option.dtype – This specifies the data type in which to load the weights. For this post, we use BF16, which reduces our memory requirements compared to FP32 and lowers our latency as a result.
  • option.tensor_parallel_degree – This parameter specifies the number of Neuron cores we use for this model. The AWS Inferentia2 accelerator has two Neuron cores, so specifying a value of 2 means we use one accelerator (two cores). This means we can now create multiple workers to increase the throughput of the endpoint.
  • option.engine – This is set to Python to indicate we will not be using other compilers like DeepSpeed or Faster Transformer for this hosting.
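The parameters above can be collected into key=value lines, which is the usual shape of a properties-style configuration file. The following sketch renders such text from a dictionary. The key names are copied from the list above; every value is a placeholder chosen for illustration, and the exact values your container expects should be taken from the notebook.

```python
def render_properties(settings):
    """Render a mapping as key=value lines, one setting per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Assumption: all values below are placeholders, not settings from the post.
settings = {
    "option.model_id": "s3://<your-bucket>/<path-to-compiled-weights>/",
    "option.entryPoint": "<built-in-handler-or-custom-script>",
    "option.dtype": "bf16",
    "option.tensor_parallel_degree": 2,
    "option.engine": "Python",
}

properties_text = render_properties(settings)
print(properties_text)
```

Generating the file from a dictionary keeps the settings in one place if you script the packaging step.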

Bring your own script

If you want to bring your own custom inference script, you need to remove option.entryPoint from the configuration file. In that case, the LMI container looks for a script file in the same location as the configuration file and uses it to run the inferencing.

Create your own inference script

Creating your own inference script is relatively straightforward using the LMI container. The container requires your file to have an implementation of the following method:

def handle(inputs: Input) which returns an object of type Outputs

Let's examine some of the critical areas of the attached notebook, which demonstrates the bring your own script function.

Replace the cross_attention module with the optimized version:

# Replace original cross-attention module with custom cross-attention module for better performance
    CrossAttention.get_attention_scores = get_attention_scores

Load the compiled weights for the following:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, '')

These are the names of the compiled weights files we used when creating the compilations. Feel free to change the file names, but make sure your weights file names match what you specify here.

Then we need to load them using the Neuron SDK and set them as the actual model weights. When loading the UNet optimized weights, note that we are also specifying the number of Neuron cores we need to load these onto. Here, we load to a single accelerator with two cores:

# Load the compiled UNet onto two neuron cores.
    pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
    logging.info(f"Loading model: unet:created")
    device_ids = [idx for idx in range(tensor_parallel_degree)]
    pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
    # Load other compiled models onto a single neuron core.
    # - load encoders
    pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
    clip_compiled = torch.jit.load(text_encoder_filename)
    pipe.text_encoder.neuron_text_encoder = clip_compiled
    #- load decoders
    pipe.vae.decoder = torch.jit.load(decoder_filename)
    pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

Running the inference with a prompt invokes the pipe object to generate an image.

Create the SageMaker endpoint

We use Boto3 APIs to create a SageMaker endpoint. Complete the following steps:

  1. Create the tarball with just the serving configuration and the optional files and upload it to Amazon S3.
  2. Create the model using the image container and the model tarball uploaded earlier.
  3. Create the endpoint config using the following key parameters:
    1. Use an ml.inf2.xlarge instance.
    2. Set ContainerStartupHealthCheckTimeoutInSeconds to 240 to ensure the health check starts after the model is deployed.
    3. Set VolumeSizeInGB to a larger value so it can be used for loading the model weights that are 32 GB in size.
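The first step, creating the tarball, can be sketched with the Python standard library. The following example packs the files of a local directory into a gzip-compressed archive. The file name example.properties and the temporary paths are hypothetical, and whether the files should sit at the archive root or under a folder is an assumption here; check the notebook for the layout your container expects. The upload to Amazon S3 is left out because it requires credentials.

```python
import os
import tarfile
import tempfile


def make_tarball(source_dir, tar_path):
    """Pack every entry in source_dir into a gzip-compressed tarball."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            # arcname places each entry at the root of the archive.
            tar.add(os.path.join(source_dir, name), arcname=name)
    return tar_path


# Demo with a throwaway directory and a hypothetical configuration file.
work_dir = tempfile.mkdtemp()
code_dir = os.path.join(work_dir, "code_sd")
os.makedirs(code_dir)
with open(os.path.join(code_dir, "example.properties"), "w") as f:
    f.write("option.dtype=bf16\n")

tar_path = make_tarball(code_dir, os.path.join(work_dir, "model.tar.gz"))
with tarfile.open(tar_path) as tar:
    members = tar.getnames()
print(members)
```

Listing the archive members before uploading is a cheap way to confirm that the package contains only the files you intended.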

Create a SageMaker model

After you create the model.tar.gz file and upload it to Amazon S3, you need to create a SageMaker model. We use the LMI container and the model artifact from the previous step to create the SageMaker model. SageMaker allows us to customize and inject various environment variables. For this workflow, we can leave everything as default. See the following code:

inference_image_uri = (
    f"763104351884.dkr.ecr.{region} djl-serving-inf2"

Create the model object, which essentially creates a lockdown container that is loaded onto the instance and used for inferencing:

model_name = name_from_base(f"inf2-sd")
create_model_response = boto3_sm_client.create_model(
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},

Create a SageMaker endpoint

In this demo, we use an ml.inf2.xlarge instance. We need to set the VolumeSizeInGB parameter to provide the necessary disk space to load the model and the weights. This parameter is applicable to instances supporting Amazon Elastic Block Store (Amazon EBS) volume attachment. We can set the model download timeout and the container startup health check to a higher value, which gives adequate time for the container to pull the weights from Amazon S3 and load them into the AWS Inferentia2 accelerators. For more details, refer to CreateEndpointConfig.

endpoint_config_response = boto3_sm_client.create_endpoint_config(

            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.inf2.xlarge", # - 
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 360, 
            "VolumeSizeInGB": 400
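Because the call above needs AWS credentials to run, the following sketch only assembles the keyword arguments for create_endpoint_config so the structure can be inspected locally. The variant settings mirror the values shown above; the helper name and the two example names passed to it are hypothetical.

```python
def build_endpoint_config(endpoint_config_name, model_name):
    """Return keyword arguments for a SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": endpoint_config_name,
        "ProductionVariants": [
            {
                "VariantName": "variant1",
                "ModelName": model_name,
                "InstanceType": "ml.inf2.xlarge",
                "InitialInstanceCount": 1,
                # Time allowed for the container to pull and load the weights.
                "ContainerStartupHealthCheckTimeoutInSeconds": 360,
                # Disk space for the model artifacts on the attached volume.
                "VolumeSizeInGB": 400,
            }
        ],
    }


# Hypothetical names for illustration.
config_kwargs = build_endpoint_config("inf2-sd-config", "inf2-sd-model")
variant = config_kwargs["ProductionVariants"][0]
print(variant["InstanceType"], variant["VolumeSizeInGB"])

# With credentials configured, the arguments could be passed on as:
# boto3.client("sagemaker").create_endpoint_config(**config_kwargs)
```

Building the arguments separately from the API call makes it easy to review or unit test the configuration before anything is created in your account.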

Finally, we create a SageMaker endpoint:

create_endpoint_response = boto3_sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name

Invoke the model endpoint

This is a generative model, so we pass in the prompt that the model uses to generate the image. The payload is of type JSON:

response_model = boto3_sm_run_client.invoke_endpoint(

            "prompt": "Mountain Landscape", 
            "parameters": {} # 
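The request body can be prepared and checked locally before calling the endpoint. The following sketch serializes a prompt and an empty parameters object to JSON bytes. The helper name is hypothetical, and the commented call at the end assumes a live endpoint and configured credentials.

```python
import json


def build_request_body(prompt, parameters=None):
    """Serialize the prompt and optional parameters as UTF-8 encoded JSON."""
    body = {"prompt": prompt, "parameters": parameters or {}}
    return json.dumps(body).encode("utf-8")


request_body = build_request_body("Mountain Landscape")
decoded = json.loads(request_body)
print(decoded)

# With credentials and a deployed endpoint, the bytes could be sent as:
# boto3.client("sagemaker-runtime").invoke_endpoint(
#     EndpointName=endpoint_name,
#     ContentType="application/json",
#     Body=request_body,
# )
```

Decoding the body again, as done here, is a quick check that the payload is valid JSON with the keys the inference script reads.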

Benchmarking the Stable Diffusion model on Inf2

We ran a few tests to benchmark the Stable Diffusion model with the BF16 data type on Inf2, and we are able to derive latency numbers that rival or exceed some of the other accelerators for Stable Diffusion. This, coupled with the lower cost of AWS Inferentia2 chips, makes this an extremely valuable proposition.

The following numbers are from the Stable Diffusion model deployed on an inf2.xl instance. For more information about costs, refer to Amazon EC2 Inf2 Instances.

| Model | Resolution | Data type | Iterations | P95 latency (ms) | Inf2.xl On-Demand cost per hour | Inf2.xl cost per image |
|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 512×512 | bf16 | 50 | 2,427.4 | $0.76 | $0.0005125 |
| Stable Diffusion 1.5 | 768×768 | bf16 | 50 | 8,235.9 | $0.76 | $0.0017387 |
| Stable Diffusion 1.5 | 512×512 | bf16 | 30 | 1,456.5 | $0.76 | $0.0003075 |
| Stable Diffusion 1.5 | 768×768 | bf16 | 30 | 4,941.6 | $0.76 | $0.0010432 |
| Stable Diffusion 2.1 | 512×512 | bf16 | 50 | 1,976.9 | $0.76 | $0.0004174 |
| Stable Diffusion 2.1 | 768×768 | bf16 | 50 | 6,836.3 | $0.76 | $0.0014432 |
| Stable Diffusion 2.1 | 512×512 | bf16 | 30 | 1,186.2 | $0.76 | $0.0002504 |
| Stable Diffusion 2.1 | 768×768 | bf16 | 30 | 4,101.8 | $0.76 | $0.0008659 |
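The cost per image in the table is consistent with multiplying the hourly price by the fraction of an hour that one image takes at the P95 latency. The following sketch reproduces two of the rows under that assumption, which treats the instance as generating images back to back with no idle time.

```python
def cost_per_image(price_per_hour, latency_ms):
    """Cost of one image, assuming the instance generates images back to back."""
    seconds_per_image = latency_ms / 1000.0
    return price_per_hour * seconds_per_image / 3600.0


# Rows taken from the table: 512x512, 50 iterations.
sd15_512_50 = cost_per_image(0.76, 2427.4)
sd21_512_50 = cost_per_image(0.76, 1976.9)
print(f"{sd15_512_50:.7f} {sd21_512_50:.7f}")
```

The same function can be applied to your own measured latency and instance price to estimate cost for other resolutions or iteration counts.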


In this post, we dove deep into the compilation, optimization, and deployment of the Stable Diffusion 2.1 model using Inf2 instances. We also demonstrated deployment of Stable Diffusion models using SageMaker. Inf2 instances also deliver great price performance for Stable Diffusion 1.5. To learn more about why Inf2 instances are great for generative AI and large language models, refer to Amazon EC2 Inf2 Instances for Low-Cost, High-Performance Generative AI Inference are Now Generally Available. For performance details, refer to Inf2 Performance. Check out more examples on the GitHub repo.

Special thanks to Matthew Mcclain, Beni Hegedus, Kamran Khan, Shruti Koparkar, and Qing Lan for reviewing and providing valuable inputs.

About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.

K.C. Tung is a Senior Solution Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today he helps customers train and deploy large PyTorch and TensorFlow models in the AWS cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.

Rupinder Grewal is a Sr AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
