Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available


This is a guest post by A.K Roy from Qualcomm AI.

Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.

With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.

New DL2q instance highlights

Each DL2q instance incorporates eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, an accelerator memory capacity of 128 GB, and a memory bandwidth of 1.1 TB per second.

Each DL2q instance has 96 vCPUs and a system memory capacity of 768 GB, and supports a networking bandwidth of 100 Gbps as well as Amazon Elastic Block Store (Amazon EBS) storage bandwidth of 19 Gbps.

Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth
DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps

Qualcomm Cloud AI100 accelerator innovation

The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth, low-latency network-on-chip (NoC) mesh.

The AI100 accelerator supports a broad and comprehensive range of models and use cases. The table below highlights the range of model support.

Model category | Number of models | Examples
NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE
Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen
Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP
CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet
CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet
CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack
Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection
Total | >300 |

* Most automotive networks are composite networks consisting of a fusion of individual networks.

The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.

The instance user can use the following strategy to maximize performance-per-cost:

  • Store weights using the MX6 micro-exponent precision in the on-accelerator DDR memory. Using the MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class throughput and latency (a back-of-the-envelope memory sketch follows this list).
  • Compute in FP16 to deliver the required use case accuracy, while using the superior on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
  • Use an optimized batching strategy and a higher batch size by using the large on-chip SRAM available to maximize the reuse of weights, while keeping the activations on-chip to the maximum extent possible.
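As a rough, back-of-the-envelope illustration of the first point, the following sketch compares the weight footprint of a model stored in FP16 versus MX6 against the 128 GB of aggregate accelerator memory on a DL2q instance. The ~6.25 bits-per-weight figure for MX6 is an assumption for illustration (element bits plus a shared block exponent), not a published sizing rule, and the 70B-parameter model size is a hypothetical example.

    def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
        # Convert a parameter count and per-weight bit width into gigabytes.
        return num_params * bits_per_weight / 8 / 1e9

    params = 70e9                                  # hypothetical 70B-parameter model
    fp16_gb = weight_footprint_gb(params, 16)      # 16 bits per weight in FP16
    mx6_gb = weight_footprint_gb(params, 6.25)     # ~6.25 bits per weight assumed for MX6

    print(f"FP16 weights: ~{fp16_gb:.0f} GB, MX6 weights: ~{mx6_gb:.0f} GB")
    print(f"Fits in the instance's 128 GB of accelerator memory with MX6: {mx6_gb < 128}")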

DL2q AI Stack and toolchain

The DL2q instance is accompanied by the Qualcomm AI Stack, which delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.

The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and subsequently deploy the compiled models for production inference use cases in the three steps shown in the following figure.

To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.

Get started with DL2q instances

In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built DL2q AMI, in four steps.

You can use either a pre-built Qualcomm DLAMI on the instance, or start with an Amazon Linux 2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
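If you choose to build your own AMI, a minimal sketch of pulling the SDK out of that bucket with the AWS CLI (the local destination directory is an arbitrary choice):

aws s3 cp --recursive s3://ec2-linux-qualcomm-ai100-sdks/latest/ ./qualcomm-ai100-sdk/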

The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.

Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI AMI and follow steps 1 through 4.
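For example (the key pair file and the public DNS name are placeholders; ec2-user is the default user on Amazon Linux 2 based AMIs):

ssh -i my-key-pair.pem ec2-user@<instance-public-dns>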

Step 1. Set up the environment and install required packages

  1. Install Python 3.8.
    sudo amazon-linux-extras install python3.8

  2. Set up the Python 3.8 virtual environment.
    python3.8 -m venv /home/ec2-user/userA/pyenv

  3. Activate the Python 3.8 virtual environment.
    source /home/ec2-user/userA/pyenv/bin/activate

  4. Install the required packages, listed in the requirements.txt file available on the Qualcomm public GitHub site.
    pip3 install -r requirements.txt

  5. Import the required libraries.
    import transformers 
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    import sys
    import qaic
    import os
    import torch
    import onnx
    from onnxsim import simplify
    import argparse
    import numpy as np

Step 2. Import the model

  1. Import the model and tokenizer.
    model_card = 'bert-base-cased'
    model = AutoModelForMaskedLM.from_pretrained(model_card)
    tokenizer = AutoTokenizer.from_pretrained(model_card)

  2. Define a sample input and extract the inputIds and attentionMask.
    sentence = "The dog [MASK] on the mat."
    encodings = tokenizer(sentence, max_length=128, truncation=True, padding="max_length", return_tensors="pt")
    inputIds = encodings["input_ids"]
    attentionMask = encodings["attention_mask"]

  3. Convert the model to ONNX, which can then be passed to the compiler. (An optional graph-simplification sketch follows after this list.)
    # Set dynamic dims and axes.
    dynamic_dims = {0: 'batch', 1 : 'sequence'}
    dynamic_axes = {
        "input_ids" : dynamic_dims,
        "attention_mask" : dynamic_dims,
        "logits" : dynamic_dims
    }
    input_names = ["input_ids", "attention_mask"]
    inputList = [inputIds, attentionMask]
    
    # Output location for the ONNX model; these names are inferred from the
    # paths used in the compile command in step 3.
    model_base_name = model_card
    gen_models_path = f"{model_base_name}/generatedModels"
    os.makedirs(gen_models_path, exist_ok=True)

    torch.onnx.export(
        model,
        args=tuple(inputList),
        f=f"{gen_models_path}/{model_base_name}.onnx",
        verbose=False,
        input_names=input_names,
        output_names=["logits"],
        dynamic_axes=dynamic_axes,
        opset_version=11,
    )

  4. You’ll run the model in FP16 precision, so you must check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate a new ONNX file with the required fixes.
    from onnx import numpy_helper
            
    def fix_onnx_fp16(
        gen_models_path: str,
        model_base_name: str,
    ) -> str:
        finfo = np.finfo(np.float16)
        fp16_max = finfo.max
        fp16_min = finfo.min
        model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
        fp16_fix = False
        for tensor in onnx.external_data_helper._get_all_tensors(model):
            nptensor = numpy_helper.to_array(tensor, gen_models_path)
            if nptensor.dtype == np.float32 and (
                np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
            ):
                # print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
                nptensor = np.clip(nptensor, fp16_min, fp16_max)
                new_tensor = numpy_helper.from_array(nptensor, tensor.name)
                tensor.CopyFrom(new_tensor)
                fp16_fix = True

        if fp16_fix:
            # Save FP16 model
            print("Found constants out of FP16 range, clipped to FP16 range")
            model_base_name += "_fix_outofrange_fp16"
            onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
            print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
        return model_base_name
    
    fp16_model_name = fix_onnx_fp16(gen_models_path=gen_models_path, model_base_name=model_base_name)
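Note that onnxsim is imported in step 1 but not otherwise used in this walkthrough. If you want to simplify the exported graph, a minimal optional sketch is shown below; it would typically run between steps 3 and 4 (before the FP16 range fix), and whether the DL2q toolchain benefits from it is an assumption on my part, not something stated in the original flow.

    # Optional: simplify the exported ONNX graph in place with onnxsim.
    onnx_path = f"{gen_models_path}/{model_base_name}.onnx"
    onnx_model = onnx.load(onnx_path)
    simplified_model, check_ok = simplify(onnx_model)   # simplify() was imported in step 1
    if check_ok:
        onnx.save(simplified_model, onnx_path)          # overwrite with the simplified graph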

Step 3. Compile the model

The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called a QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.

In the compile command below, you use four AI compute cores and a batch size of one to compile the model.

/opt/qti-aic/exec/qaic-exec \
-m=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,1 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc \
-aic-hw -aic-hw-version=2.0 \
-compile-only

The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.
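To experiment with trading latency for throughput, the same ONNX file can be recompiled with a larger batch by reusing only the flags shown above; the batch value of 8 and the output directory name below are illustrative choices on my part, not recommendations from the original walkthrough.

/opt/qti-aic/exec/qaic-exec \
-m=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,8 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc_b8 \
-aic-hw -aic-hw-version=2.0 \
-compile-only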

Step 4. Run the model

Set up a session to run inference on a Cloud AI100 Qualcomm accelerator in the DL2q instance.

The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.

  1. Use the Session API call to create an instance of a session. The Session API call is the entry point to using the qaic Python library.
    qpcPath="bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"
    
    bert_sess = qaic.Session(model_path= qpcPath+'/programqpc.bin', num_activations=1)  
    bert_sess.setup() # Loads the network onto the device.

    # Here we are reading out all the input and output shapes/types
    input_shape, input_type = bert_sess.model_input_shape_dict['input_ids']
    attn_shape, attn_type = bert_sess.model_input_shape_dict['attention_mask']
    output_shape, output_type = bert_sess.model_output_shape_dict['logits']

    # Create the input dictionary for the given input sentence
    input_dict = {"input_ids": inputIds.numpy().astype(input_type), "attention_mask" : attentionMask.numpy().astype(attn_type)}

    # Run inference on Cloud AI 100
    output = bert_sess.run(input_dict)

  2. Restructure the data from the output buffer using output_shape and output_type.
    token_logits = np.frombuffer(output['logits'], dtype=output_type).reshape(output_shape)

  3. Decode the output produced.
    # Position of the [MASK] token in the input (defined here because the
    # indexing below needs it).
    mask_token_index = torch.where(inputIds == tokenizer.mask_token_id)[1].item()

    mask_token_logits = torch.from_numpy(token_logits[0, mask_token_index, :]).unsqueeze(0)
    top_5_results = torch.topk(mask_token_logits, 5, dim=1)
    print("Model output (top5) from Qualcomm Cloud AI 100:")
    for i in range(5):
        idx = top_5_results.indices[0].tolist()[i]
        val = top_5_results.values[0].tolist()[i]
        word = tokenizer.decode([idx])
        print(f"{i+1} :(word={word}, index={idx}, logit={round(val,2)})")

Here are the outputs for the input sentence “The dog [MASK] on the mat.”

1 :(word=sat, index=2068, logit=11.46)
2 :(word=landed, index=4860, logit=11.11)
3 :(word=spat, index=15732, logit=10.95)
4 :(word=settled, index=3035, logit=10.84)
5 :(word=was, index=1108, logit=10.75)
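As an optional sanity check that is not part of the original walkthrough, you can compare these results with the same bert-base-cased model running in FP32 on the instance CPU through Hugging Face; the snippet reuses the model, tokenizer, inputIds, attentionMask, and mask_token_index objects already defined above.

    # Optional: top-5 predictions from the FP32 PyTorch model on the CPU.
    with torch.no_grad():
        cpu_logits = model(input_ids=inputIds, attention_mask=attentionMask).logits

    cpu_mask_logits = cpu_logits[0, mask_token_index, :].unsqueeze(0)
    cpu_top5 = torch.topk(cpu_mask_logits, 5, dim=1)
    print("Model output (top5) from CPU PyTorch:")
    for i in range(5):
        idx = cpu_top5.indices[0].tolist()[i]
        print(f"{i+1} : {tokenizer.decode([idx])}")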

That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.

To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.

Available now

You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.

DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.

To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.


About the authors

A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance/$ end-to-end solutions for AI inference in the cloud, across a broad range of use cases, including GenAI, LLMs, Auto, and Hybrid AI.

Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.
