Cost-effective AI image generation with PixArt-Σ inference on AWS Trainium and AWS Inferentia


PixArt-Sigma is a diffusion transformer model that is capable of image generation at 4K resolution. The model shows significant improvements over previous-generation PixArt models such as PixArt-Alpha, and over other diffusion models, through dataset and architectural improvements. AWS Trainium and AWS Inferentia are purpose-built AI chips that accelerate machine learning (ML) workloads, making them ideal for cost-effective deployment of large generative models. By using these AI chips, you can achieve optimal performance and efficiency when running inference with diffusion transformer models like PixArt-Sigma.

This post is the first in a series in which we run multiple diffusion transformers on Trainium and Inferentia-powered instances. In this post, we show how you can deploy PixArt-Sigma to Trainium and Inferentia-powered instances.

Solution overview

The steps outlined below are used to deploy the PixArt-Sigma model on AWS Trainium and run inference on it to generate high-quality images.

  • Step 1 – Prerequisites and setup
  • Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
  • Step 3 – Deploy the model on AWS Trainium to generate images

Step 1 – Prerequisites and setup

To get started, you need to set up a development environment on a trn1, trn2, or inf2 host. Complete the following steps:

  1. Launch a trn1.32xlarge or trn2.48xlarge instance with a Neuron DLAMI. For instructions on how to get started, refer to Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI.
  2. Launch a Jupyter Notebook server. For instructions to set up a Jupyter server, refer to the following user guide.
  3. Clone the aws-neuron-samples GitHub repository:
    git clone https://github.com/aws-neuron/aws-neuron-samples.git

  4. Navigate to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook:
    cd aws-neuron-samples/torch-neuronx/inference

The provided example script is designed to run on a Trn2 instance, but you can adapt it for Trn1 or Inf2 instances with minimal modifications. Specifically, within the notebook and in each of the component files under the neuron_pixart_sigma directory, you will find commented-out changes to accommodate Trn1 or Inf2 configurations.

Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium

This section provides a step-by-step guide to compiling PixArt-Sigma for AWS Trainium.

Download the model

You will find a helper function in cache_hf_model.py in the GitHub repository mentioned above that shows how to download the PixArt-Sigma model from Hugging Face. If you are using PixArt-Sigma in your own workload and choose not to use the script included in this post, you can use the huggingface-cli to download the model instead.
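As a rough sketch of the huggingface-cli alternative, the following Python snippet builds the download command and only runs it on request. The repository ID and cache directory are the ones used by the pipeline code in Step 3 of this post; the helper names build_download_command and download_model are hypothetical, and actually running the command assumes that huggingface-cli is installed and that the host has network access.

```python
import shlex
import subprocess
from typing import List

# Values taken from the pipeline code in this post
MODEL_ID = "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
CACHE_DIR = "pixart_sigma_hf_cache_dir_1024"


def build_download_command(model_id: str = MODEL_ID, cache_dir: str = CACHE_DIR) -> List[str]:
    """Return the huggingface-cli invocation as an argument list (hypothetical helper)."""
    return ["huggingface-cli", "download", model_id, "--cache-dir", cache_dir]


def download_model(dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False."""
    cmd = build_download_command()
    printable = shlex.join(cmd)
    print(printable)
    if not dry_run:
        # Requires huggingface-cli on PATH and network access
        subprocess.run(cmd, check=True)
    return printable


if __name__ == "__main__":
    download_model(dry_run=True)
```

Keeping the cache directory identical to the one passed to the pipeline matters because the pipeline in this post is loaded with local_files_only=True.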

The Neuron PixArt-Sigma implementation contains a few scripts and classes. The various files and scripts are broken down as follows:

├── compile_latency_optimized.sh # Full Model Compilation script for Latency Optimized
├── compile_throughput_optimized.sh # Full Model Compilation script for Throughput Optimized
├── hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb # Notebook to run Latency Optimized PixArt-Sigma
├── hf_pretrained_pixart_sigma_1k_throughput_optimized.ipynb # Notebook to run Throughput Optimized PixArt-Sigma
├── neuron_pixart_sigma
│ ├── cache_hf_model.py # Model downloading Script
│ ├── compile_decoder.py # Decoder Compilation Script and Wrapper Class
│ ├── compile_text_encoder.py # Text Encoder Compilation Script and Wrapper Class
│ ├── compile_transformer_latency_optimized.py # Latency Optimized Transformer Compilation Script and Wrapper Class
│ ├── compile_transformer_throughput_optimized.py # Throughput Optimized Transformer Compilation Script and Wrapper Class
│ ├── neuron_commons.py # Base Classes and Attention Implementation
│ └── neuron_parallel_utils.py # Sharded Attention Implementation
└── requirements.txt

The notebook helps you download the model, compile the individual component models, and invoke the generation pipeline to generate an image. Although the notebooks can be run as a standalone sample, the next few sections of this post walk through the key implementation details within the component files and scripts that support running PixArt-Sigma on Neuron.

Sharding PixArt linear layers

For each component of PixArt (T5, Transformer, and VAE), the example uses Neuron-specific wrapper classes. These wrapper classes serve two purposes. The first is to allow us to trace the models for compilation:

class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t

    def forward(self, text_input_ids, attention_mask=None):
        # Return only the last hidden state, cast to the target dtype, as a list
        return [self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)]

Please refer to the neuron_commons.py file for all wrapper modules and classes.

The second reason for using wrapper classes is to modify the attention implementation to run on Neuron. Because diffusion models like PixArt are typically compute-bound, you can improve performance by sharding the attention layer across multiple devices. To do this, you replace the linear layers with NeuronX Distributed's RowParallelLinear and ColumnParallelLinear layers:

def shard_t5_self_attention(tp_degree: int, selfAttention: T5Attention):
    orig_inner_dim = selfAttention.q.out_features
    dim_head = orig_inner_dim // selfAttention.n_heads
    original_nheads = selfAttention.n_heads
    # Each rank keeps n_heads / tp_degree attention heads
    selfAttention.n_heads = selfAttention.n_heads // tp_degree
    selfAttention.inner_dim = dim_head * selfAttention.n_heads
    # q, k, and v are sharded along the output dimension (dim 0 of the weight)
    orig_q = selfAttention.q
    selfAttention.q = ColumnParallelLinear(
        selfAttention.q.in_features,
        selfAttention.q.out_features,
        bias=False,
        gather_output=False)
    selfAttention.q.weight.data = get_sharded_data(orig_q.weight.data, 0)
    del(orig_q)
    orig_k = selfAttention.k
    selfAttention.k = ColumnParallelLinear(
        selfAttention.k.in_features,
        selfAttention.k.out_features,
        bias=(selfAttention.k.bias is not None),
        gather_output=False)
    selfAttention.k.weight.data = get_sharded_data(orig_k.weight.data, 0)
    del(orig_k)
    orig_v = selfAttention.v
    selfAttention.v = ColumnParallelLinear(
        selfAttention.v.in_features,
        selfAttention.v.out_features,
        bias=(selfAttention.v.bias is not None),
        gather_output=False)
    selfAttention.v.weight.data = get_sharded_data(orig_v.weight.data, 0)
    del(orig_v)
    # The output projection is sharded along the input dimension (dim 1 of the weight)
    orig_out = selfAttention.o
    selfAttention.o = RowParallelLinear(
        selfAttention.o.in_features,
        selfAttention.o.out_features,
        bias=(selfAttention.o.bias is not None),
        input_is_parallel=True)
    selfAttention.o.weight.data = get_sharded_data(orig_out.weight.data, 1)
    del(orig_out)
    return selfAttention

Please refer to the neuron_parallel_utils.py file for more details on parallel attention.
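To make the pairing of column-parallel and row-parallel layers more concrete, the following pure-Python sketch (no Neuron libraries involved) splits one projection weight along its output dimension and the following output-projection weight along its input dimension, then checks that summing the per-rank partial results matches the unsharded computation. The helpers linear, shard, and sharded_forward are hypothetical stand-ins written for this illustration; in the actual sample, get_sharded_data and the NeuronX Distributed layers take care of slicing and cross-device reduction, and the attention computation that sits between the projections is omitted here.

```python
from typing import List

Matrix = List[List[int]]


def linear(x: List[int], weight: Matrix) -> List[int]:
    """y = x @ weight.T, with weight stored as (out_features, in_features)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def shard(weight: Matrix, dim: int, rank: int, tp_degree: int) -> Matrix:
    """Return the slice of weight owned by rank along dim (0 = rows, 1 = columns)."""
    size = len(weight) if dim == 0 else len(weight[0])
    assert size % tp_degree == 0, "dimension must divide evenly across ranks"
    step = size // tp_degree
    lo, hi = rank * step, (rank + 1) * step
    if dim == 0:
        return [row[:] for row in weight[lo:hi]]
    return [row[lo:hi] for row in weight]


def sharded_forward(x: List[int], w_q: Matrix, w_o: Matrix, tp_degree: int) -> List[int]:
    """Column-parallel projection followed by a row-parallel output projection."""
    total = [0] * len(w_o)
    for rank in range(tp_degree):
        # Column-parallel: each rank computes only its slice of the hidden features
        hidden_part = linear(x, shard(w_q, 0, rank, tp_degree))
        # Row-parallel: each rank consumes its slice and produces a partial output
        partial = linear(hidden_part, shard(w_o, 1, rank, tp_degree))
        # Summing the partial outputs stands in for the cross-device reduction
        total = [t + p for t, p in zip(total, partial)]
    return total


x = [1, 2, 3]
w_q = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]  # (4 hidden, 3 in)
w_o = [[1, 2, 0, 1], [0, 1, 1, 2]]                  # (2 out, 4 hidden)

reference = linear(linear(x, w_q), w_o)
assert sharded_forward(x, w_q, w_o, tp_degree=2) == reference
```

The sketch also suggests why the first layer can leave its output un-gathered (gather_output=False) while the second expects already-split input (input_is_parallel=True): the intermediate result never needs to be assembled in full on any single rank.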

Compile individual sub-models

The PixArt-Sigma model is composed of three components. Each component is compiled so the entire generation pipeline can run on Neuron:

  • Text encoder – A 4-billion-parameter encoder, which translates a human-readable prompt into an embedding. In the text encoder, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
  • Denoising transformer model – A 700-million-parameter transformer, which iteratively denoises a latent (a numerical representation of a compressed image). In the transformer, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
  • Decoder – A VAE decoder that converts the denoiser-generated latent to an output image. For the decoder, the model is deployed with data parallelism.

Now that the model definition is ready, you need to trace a model to run it on Trainium or Inferentia. You can see how to use the trace() function to compile the decoder component model for PixArt in the following code block:

compiled_decoder = torch_neuronx.trace(
    decoder,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/decoder",
    compiler_args=compiler_flags,
    inline_weights_to_neff=False
)

Please refer to the compile_decoder.py file for more on how to instantiate and compile the decoder.

To run models with tensor parallelism, a technique used to split a tensor into chunks across multiple NeuronCores, you need to trace with a pre-specified tp_degree. This tp_degree specifies the number of NeuronCores to shard the model across. The example then uses the parallel_model_trace API to compile the encoder and transformer component models for PixArt:

compiled_text_encoder = neuronx_distributed.trace.parallel_model_trace(
    get_text_encoder_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/text_encoder",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
)

Please refer to the compile_text_encoder.py file for more details on tracing the encoder with tensor parallelism.
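The choice of tp_degree interacts with the head arithmetic at the top of shard_t5_self_attention, which uses integer division to assign heads to each NeuronCore. The sketch below is a hypothetical pre-flight check (the function name and the example numbers are illustrative assumptions, not values read from the sample) that computes the per-core shapes and fails loudly when the head count does not divide evenly.

```python
def per_core_attention_shape(n_heads: int, inner_dim: int, tp_degree: int) -> dict:
    """Mirror the head arithmetic of the sharding function, with explicit checks."""
    if tp_degree <= 0:
        raise ValueError("tp_degree must be a positive integer")
    if inner_dim % n_heads != 0:
        raise ValueError("inner_dim must be a multiple of n_heads")
    if n_heads % tp_degree != 0:
        raise ValueError(
            f"n_heads={n_heads} is not divisible by tp_degree={tp_degree}"
        )
    dim_head = inner_dim // n_heads
    heads_per_core = n_heads // tp_degree
    return {
        "dim_head": dim_head,
        "heads_per_core": heads_per_core,
        "inner_dim_per_core": dim_head * heads_per_core,
    }


if __name__ == "__main__":
    # Illustrative numbers only
    print(per_core_attention_shape(n_heads=64, inner_dim=4096, tp_degree=4))
```

Running a check like this before compilation can save a long compile cycle when experimenting with a different tp_degree for Trn1 or Inf2.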

Finally, you trace the transformer model with tensor parallelism:

compiled_transformer = neuronx_distributed.trace.parallel_model_trace(
    get_transformer_model_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/transformer",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
    inline_weights_to_neff=False,
)

Please refer to the compile_transformer_latency_optimized.py file for more details on tracing the transformer with tensor parallelism.

You will use the compile_latency_optimized.sh script to compile all three models as described in this post, so these functions run automatically when you run through the notebook.

Step 3 – Deploy the model on AWS Trainium to generate images

This section walks through the steps to run inference on PixArt-Sigma on AWS Trainium.

Create a diffusers pipeline object

The Hugging Face diffusers library is a library for pre-trained diffusion models, and includes model-specific pipelines that bundle the components (independently trained models, schedulers, and processors) needed to run a diffusion model. The PixArtSigmaPipeline is specific to the PixArt-Sigma model, and is instantiated as follows:

pipe: PixArtSigmaPipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.bfloat16,
    local_files_only=True,
    cache_dir="pixart_sigma_hf_cache_dir_1024")

Please refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook for details on pipeline execution.

Load compiled component models into the generation pipeline

After each component model has been compiled, load them into the overall generation pipeline for image generation. The VAE model is loaded with data parallelism, which allows us to parallelize image generation across the batch dimension or across multiple images per prompt. For more details, refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook.

vae_decoder_wrapper.model = torch_neuronx.DataParallel(
    torch.jit.load(decoder_model_path), [0, 1, 2, 3], False
)

text_encoder_wrapper.t = neuronx_distributed.trace.parallel_model_load(
    text_encoder_model_path
)
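Conceptually, data parallelism means each NeuronCore holds a full copy of the decoder and processes a different slice of the batch. The following pure-Python sketch illustrates only that idea; it is not how torch_neuronx.DataParallel is implemented, the helper names are hypothetical, and the loop runs the chunks one after another where a real deployment would run them on separate cores.

```python
from typing import Callable, List, Sequence


def split_batch(batch: Sequence, num_workers: int) -> List[list]:
    """Split a batch into at most num_workers contiguous, near-equal chunks."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if len(batch) == 0:
        return []
    chunk = -(-len(batch) // num_workers)  # ceiling division
    return [list(batch[i:i + chunk]) for i in range(0, len(batch), chunk)]


def data_parallel_apply(fn: Callable[[list], list], batch: Sequence, num_workers: int) -> list:
    """Apply fn to each chunk and concatenate the results in the original order."""
    outputs: list = []
    for chunk in split_batch(batch, num_workers):
        # Stand-in for dispatching this chunk to its own model replica
        outputs.extend(fn(chunk))
    return outputs


if __name__ == "__main__":
    latents = ["latent_0", "latent_1", "latent_2", "latent_3"]
    decoded = data_parallel_apply(lambda c: [f"image_from_{v}" for v in c], latents, 4)
    print(decoded)
```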

Finally, the loaded models are added to the generation pipeline:

pipe.text_encoder = text_encoder_wrapper
pipe.transformer = transformer_wrapper
pipe.vae.decoder = vae_decoder_wrapper
pipe.vae.post_quant_conv = vae_post_quant_conv_wrapper

Compose a prompt

Now that the model is ready, you can write a prompt to convey what kind of image you want generated. When creating a prompt, you should always be as specific as possible. You can use a positive prompt to convey what you want in your new image, including a subject, action, style, and location, and a negative prompt to indicate features that should be removed.

For example, you can use the following positive and negative prompts to generate a photo of an astronaut riding a horse on Mars without mountains:

# Subject: astronaut
# Action: riding a horse
# Location: Mars
# Style: photo
prompt = "a photo of an astronaut riding a horse on mars"
negative_prompt = "mountains"

Feel free to edit the prompt in your notebook using prompt engineering to generate an image of your choosing.
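If you plan to try many variations, one option is to assemble prompts from the same four parts listed in the comments above. The helpers below are a hypothetical convenience written for this illustration, not part of the sample or of diffusers; they only build strings.

```python
from typing import Iterable


def compose_prompt(style: str, subject: str, action: str, location: str) -> str:
    """Join style, subject, action, and location into one positive prompt."""
    return f"{style} of {subject} {action} {location}"


def compose_negative_prompt(features: Iterable[str]) -> str:
    """Join the features to suppress into one comma-separated negative prompt."""
    return ", ".join(f.strip() for f in features if f.strip())


prompt = compose_prompt(
    style="a photo",
    subject="an astronaut",
    action="riding a horse",
    location="on mars",
)
negative_prompt = compose_negative_prompt(["mountains"])
print(prompt)
print(negative_prompt)
```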

Generate an image

To generate an image, you pass the prompt to the PixArt model pipeline, and then save the generated image for later reference:

# pipe: variable holding the PixArt generation pipeline with each of
# the compiled component models
images = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_images_per_prompt=1,
        height=1024, # number of pixels
        width=1024, # number of pixels
        num_inference_steps=25 # Number of passes through the denoising model
    ).images

for idx, img in enumerate(images):
    img.save(f"image_{idx}.png")
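Because the sample ships in latency-optimized and throughput-optimized variants, you may want to measure how long a generation call takes on your instance. The following is a minimal timing sketch, assuming any zero-argument callable; time_call is a hypothetical helper, and the warm-up run is there because the first call may include one-time overhead.

```python
import time
from statistics import mean
from typing import Callable, Dict


def time_call(fn: Callable[[], object], *, warmup: int = 1, runs: int = 3) -> Dict[str, float]:
    """Call fn warmup times untimed, then runs times timed, and summarize latency."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    # Stand-in workload; with the pipeline loaded you could pass something like
    # lambda: pipe(prompt=prompt, negative_prompt=negative_prompt,
    #              height=1024, width=1024, num_inference_steps=25)
    print(time_call(lambda: sum(range(100_000))))
```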

Cleanup

To avoid incurring additional costs, stop your EC2 instance using either the AWS Management Console or the AWS Command Line Interface (AWS CLI).
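If you prefer to script the cleanup, the following sketch stops an instance through the boto3 EC2 client's stop_instances call. It assumes boto3 is installed and AWS credentials are configured; the helper names, the Region, and the instance ID shown are placeholders that you would replace with your own values.

```python
def make_ec2_client(region_name: str):
    """Create a boto3 EC2 client (boto3 is imported lazily so the helper below is testable)."""
    import boto3

    return boto3.client("ec2", region_name=region_name)


def stop_instance(ec2_client, instance_id: str) -> dict:
    """Request a stop for a single EC2 instance and return the API response."""
    return ec2_client.stop_instances(InstanceIds=[instance_id])


if __name__ == "__main__":
    # Placeholder Region and instance ID: replace with your own
    client = make_ec2_client("us-west-2")
    print(stop_instance(client, "i-0123456789abcdef0"))
```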

Conclusion

In this post, we walked through how to deploy PixArt-Sigma, a state-of-the-art diffusion transformer, on Trainium instances. This post is the first in a series focused on running diffusion transformers for different generation tasks on Neuron. To learn more about running diffusion transformer models with Neuron, refer to Diffusion Transformers.


About the Authors

Achintya Pinninti is a Solutions Architect at Amazon Web Services. He supports public sector customers, enabling them to achieve their objectives using the cloud. He specializes in building data and machine learning solutions to solve complex problems.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Sadaf Rasool is a Solutions Architect in Annapurna Labs at AWS. Sadaf collaborates with customers to design machine learning solutions that address their critical business challenges. He helps customers train and deploy machine learning models using AWS Trainium or AWS Inferentia chips to accelerate their innovation journey.

John Grey is a Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.
