Run Qwen 2.5 on AWS AI chips using Hugging Face libraries


The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both earlier generations of Qwen models and many of the publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
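
If you do need to compile a model yourself, you can export it ahead of time with the Optimum Neuron library and then point TGI at the exported directory. The following is a minimal sketch; the compiler settings and shapes mirror the .env values used later in this post, and the output directory name is illustrative:

from optimum.neuron import NeuronModelForCausalLM

# compiler and shape settings matching the .env file below:
# 2 NeuronCores, bf16 auto cast, batch size 4, 4,096-token sequences
compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 4, "sequence_length": 4096}

# export=True triggers Neuron compilation (this can take a while when
# no precompiled artifacts are found in the Optimum Neuron cache)
model = NeuronModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    export=True,
    **compiler_args,
    **input_shapes,
)

# save the compiled model; mount this directory into the TGI container
# (for example at /data/exportedmodel) and set MODEL_ID to that path
model.save_pretrained("exportedmodel")

The optimum-cli export neuron command offers the same functionality from the command line.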

You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you'll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you'll define all of the environment parameters that you'll need to deploy your model for inference). You can copy the following files for this use case.

  1. Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

  2. Create a file named docker-compose.yaml with the following content:
version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} # only needed for gated models
    volumes:
      - $PWD:/data # can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"

  3. Use Docker Compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

  4. To verify that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'

  5. To verify that the model can respond in multiple languages, try sending a prompt in Chinese:
# "Tell me how to open an AWS account."
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
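
The /generate endpoint takes a raw prompt, and Qwen2.5-7B-Instruct is an instruction-tuned chat model, so you will generally get better responses if you apply the model's chat template to your message first. The following is a minimal Python sketch of the same request; it assumes the transformers and requests libraries are installed and that the container from the previous steps is listening on 127.0.0.1:8081:

import requests
from transformers import AutoTokenizer

# load the tokenizer to get the Qwen2.5 chat template
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# wrap the user message in the chat template the model was tuned on
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me about AWS."}],
    tokenize=False,
    add_generation_prompt=True,
)

# send the templated prompt to the TGI /generate endpoint
response = requests.post(
    "http://127.0.0.1:8081/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 60}},
)
print(response.json()["generated_text"])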

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face's Optimum Neuron library to quickly deploy models directly from SageMaker using the instructions on the Hugging Face Model Hub.

  1. From the Qwen 2.5 model card on the Hugging Face Model Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

(Screenshots: deploying the model on Amazon SageMaker, and finding the code you'll need to deploy the model using AWS Inferentia and Trainium.)

  2. Copy the example code into a SageMaker notebook, then choose Run.
  3. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.

Terminate a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)
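
If the predictor object is no longer in scope (for example, because the notebook kernel restarted), you can delete the same resources with boto3 instead. The following is a minimal sketch with hypothetical resource names; copy the real endpoint, endpoint configuration, and model names from the SageMaker console:

import boto3

sm = boto3.client("sagemaker")

# hypothetical names; replace with the values shown in the SageMaker console
endpoint_name = "huggingface-pytorch-tgi-inference-example"
model_name = "huggingface-pytorch-tgi-inference-example-model"

sm.delete_endpoint(EndpointName=endpoint_name)
# the SageMaker SDK typically names the endpoint config after the endpoint
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm.delete_model(ModelName=model_name)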

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We're excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business goals, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor's degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager specializing in Startups at AWS. Paul created a team of AWS Startup Solution Architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor's degree in Computer Science from Siena College and has multiple Cyber Security certifications.
