Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available
This is a guest post by A.K Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm's artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
New DL2q instance highlights
Each DL2q instance contains eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, an accelerator memory capacity of 128 GB, and a memory bandwidth of 1.1 TB per second.
Each DL2q instance has 96 vCPUs and a system memory capacity of 768 GB, and supports a networking bandwidth of 100 Gbps as well as Amazon Elastic Block Store (Amazon EBS) storage bandwidth of 19 Gbps.
Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth |
--- | --- | --- | --- | --- | --- | --- | --- |
DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps |
Qualcomm Cloud AI100 accelerator innovation
The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth, low-latency network-on-chip (NoC) mesh.
The AI100 accelerator supports a broad and comprehensive range of models and use cases. The table below highlights the range of model support.
Model category | Number of models | Examples |
--- | --- | --- |
NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP |
CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack |
Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection |
Total | >300 | |
* Most automotive networks are composite networks consisting of a fusion of individual networks.
The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can use the following strategy to maximize the performance-per-cost:
- Store weights using the MX6 micro-exponent precision in the on-accelerator DDR memory. Using the MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class throughput and latency.
- Compute in FP16 to deliver the required use case accuracy, while using the superior on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
- Use an optimized batching strategy and a higher batch size by using the large on-chip SRAM available to maximize the reuse of weights, while keeping the activations on-chip as much as possible.
DL2q AI Stack and toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack, which delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, giving customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and subsequently deploy the compiled models for production inference use cases in the three steps shown in the following figure.
To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built available DL2q AMI, in four steps.
You can use either a pre-built Qualcomm DLAMI on the instance, or start with an Amazon Linux2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
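If you choose to build your own AMI, the SDK archives need to be pulled down from that bucket first. The following is a minimal sketch using boto3 to download everything under the latest/ prefix; the individual object keys are not listed here, so the code enumerates them, and it assumes the bucket permits unsigned (public) reads.

```python
# Minimal sketch: download the Cloud AI 100 SDK objects from the public bucket.
# Assumes the bucket allows anonymous reads; object keys are discovered by listing.
import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "ec2-linux-qualcomm-ai100-sdks"
PREFIX = "latest/"

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip folder placeholder objects
        local_path = os.path.join("ai100-sdk", key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
        print(f"Downloaded {key}")
```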
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI AMI and follow steps 1 through 4.
Step 1. Set up the environment and install required packages
- Install Python 3.8.
- Set up the Python 3.8 virtual environment.
- Activate the Python 3.8 virtual environment.
- Install the required packages, listed in the requirements.txt file available on the Qualcomm public GitHub.
- Import the required libraries (see the sketch after this list).
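The shell-level setup (installing Python 3.8, creating and activating the virtual environment, and installing from requirements.txt) follows the usual pattern, so only the imports are sketched here. The exact package set and versions are defined by the requirements.txt on the Qualcomm public GitHub; the qaic library ships with the Cloud AI 100 SDK on the DL2q AMI.

```python
# Imports used by the rest of this walkthrough.
import numpy as np
import onnx
import torch
from transformers import BertForMaskedLM, BertTokenizer

import qaic  # Qualcomm Cloud AI100 inference library, used in Step 4
```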
Step 2. Import the model
- Import and tokenize the model.
- Define a sample input and extract the `inputIds` and `attentionMask`.
- Convert the model to ONNX, which can then be passed to the compiler.
- You will run the model in FP16 precision, so you must check whether the model contains any constants beyond the FP16 range. Pass the model to the `fix_onnx_fp16` function to generate the new ONNX file with the required fixes (a sketch of this step follows the list).
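The following is a minimal sketch of this step, assuming the Hugging Face bert-base-cased checkpoint and the `fix_onnx_fp16` helper from the Qualcomm example code (its exact signature may differ); the export path and input names are illustrative.

```python
import os

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Import and tokenize the model.
model_name = "bert-base-cased"
model = BertForMaskedLM.from_pretrained(model_name, return_dict=False)
model.eval()
tokenizer = BertTokenizer.from_pretrained(model_name)

# Define a sample input and extract the inputIds and attentionMask.
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="pt")
input_ids = encodings["input_ids"]
attention_mask = encodings["attention_mask"]

# Convert the model to ONNX so it can be passed to the compiler.
onnx_dir = "bert-base-cased/generatedModels"
os.makedirs(onnx_dir, exist_ok=True)
onnx_path = f"{onnx_dir}/bert-base-cased.onnx"
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    onnx_path,
    opset_version=13,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch_size"},
                  "attention_mask": {0: "batch_size"}},
)

# The model will run in FP16, so clamp any constants outside the FP16 range.
# fix_onnx_fp16 comes from the Qualcomm example code; the call below is assumed.
# fixed_onnx_path = fix_onnx_fp16(onnx_path)
```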
Step 3. Compile the model
The `qaic-exec` command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called QPC, for Qualcomm program container) in the path defined by the `-aic-binary-dir` argument.
In the compile command below, you use four AI compute cores and a batch size of one to compile the model.
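The following is a sketch of that invocation, driven from Python via subprocess. The `-aic-binary-dir` argument is the one documented above; the compiler install path, the input file name, and the remaining flags (core count, batch size, FP16 conversion) are assumptions, so check the Cloud AI 100 compiler documentation for the exact spelling on your SDK version.

```python
import subprocess

# Output of Step 2 after the FP16 range fix (file name assumed).
onnx_model = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx"
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",        # assumed install path on the DLAMI
    f"-m={onnx_model}",                   # input ONNX model (assumed flag name)
    "-aic-num-cores=4",                   # four AI compute cores (assumed flag name)
    "-onnx-define-symbol=batch_size,1",   # batch size of one (assumed flag name)
    "-convert-to-fp16",                   # compute in FP16 (assumed flag name)
    f"-aic-binary-dir={qpc_dir}",         # where the QPC binary is written
]
subprocess.run(compile_cmd, check=True)
```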
The QPC is generated in the `bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc` folder.
Step 4. Run the model
Set up a session to run the inference on a Cloud AI100 Qualcomm accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.
- Use the Session API call to create an instance of session. The Session API call is the entry point to using the qaic Python library.
- Restructure the data from the output buffer with `output_shape` and `output_type`.
- Decode the output produced (a sketch of these steps follows the list).
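The following is a minimal sketch of these steps. The Session entry point, `output_shape`, and `output_type` are described above; the keyword arguments, the binary file name inside the QPC folder, and the buffer dictionary keys are assumptions, so refer to the Cloud AI 100 API documentation for the authoritative signatures.

```python
import numpy as np
from transformers import BertTokenizer

import qaic

# Recreate the sample input from Step 2.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
enc = tokenizer("The dog [MASK] on the mat.", max_length=128, truncation=True,
                padding="max_length", return_tensors="np")

# Create a session bound to the compiled QPC (the entry point to the qaic library).
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"
bert_sess = qaic.Session(model_path=f"{qpc_dir}/programqpc.bin", num_activations=1)
bert_sess.setup()

# Query the output buffer layout so the raw data can be restructured.
output_shape, output_type = bert_sess.model_output_shape_dict["logits"]

# Run inference with the tokenized sample input.
inputs = {"input_ids": enc["input_ids"].astype(np.int64),
          "attention_mask": enc["attention_mask"].astype(np.int64)}
results = bert_sess.run(inputs)

# Restructure the data from the output buffer with output_shape and output_type.
logits = np.frombuffer(results["logits"], dtype=output_type).reshape(output_shape)

# Decode the prediction for the [MASK] position.
mask_pos = int(np.where(enc["input_ids"][0] == tokenizer.mask_token_id)[0][0])
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode([predicted_id]))
```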
Here are the outputs for the input sentence "The dog [MASK] on the mat."
That's it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.
To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.
Available now
You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.
DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance/$ end-to-end solutions for AI inference in the cloud, for a broad range of use cases, including GenAI, LLMs, Auto, and Hybrid AI.
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.