AWS AI chips deliver high performance and low cost for Llama 3.1 models on AWS
Today, we’re excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips, to realize their price-performance benefits.
Overview of Llama 3.1 models
The Llama 3.1 family of multilingual LLMs is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128K) and are optimized for inference with support for grouped query attention (GQA).
The Llama 3.1 instruction tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks. They have been trained to generate tool calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, they support zero-shot tool use.
Llama 3.1 405B is the world’s largest publicly available LLM, according to Meta. The model sets a new standard for artificial intelligence (AI) and is ideal for enterprise-level applications and research and development. It is well suited for tasks like synthetic data generation, where the outputs of the model can be used to improve smaller Llama models after fine-tuning, and for model distillation to transfer knowledge to smaller models from the 405B model. This model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLMs for Llama 3 and Llama 3.1 share the same dense architecture. They are auto-regressive language models that use an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
The responsible use guide from Meta can assist you in implementing additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.
Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is through Amazon Bedrock, which is powered by our purpose-built AI infrastructure, including AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of our purpose-built AI infrastructure and simplifies access to these powerful models so you can focus on building differentiated AI applications.
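As an illustration, a single Bedrock call can look like the following minimal sketch using boto3 (the model ID and Region are assumptions that depend on what is enabled in your account):

```python
import json
import boto3

# Assumption: Llama 3.1 70B Instruct is enabled in your account and
# offered in this Region under this Bedrock model ID.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.invoke_model(
    modelId="meta.llama3-1-70b-instruct-v1:0",
    body=json.dumps({
        "prompt": "Explain grouped query attention in two sentences.",
        "max_gen_len": 256,
        "temperature": 0.5,
    }),
)

print(json.loads(response["body"].read())["generation"])
```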
If you need greater control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 enable high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let’s see how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.
Fine-tune Llama 3.1 on Trainium
To get started with fine-tuning either Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of some of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:
Both samples are built on top of AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is an example Slurm command to initiate training for Llama 3.1 70B:
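A minimal sketch of that submission, assuming a ParallelCluster-managed Trn1 cluster (the node count and runner script name are illustrative assumptions, not the exact sample code):

```bash
# Submit the Llama 3.1 70B fine-tuning job across the Trn1 cluster.
# Node count and runner script name are illustrative assumptions.
sbatch --exclusive \
    --nodes 32 \
    --wrap "srun bash $(pwd)/run_llama3.1_70B_fine_tuning.sh"
```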
Inside the Slurm script, we launch a distributed training process on our cluster. In the runner scripts, we load the pre-trained weights and configuration provided by Meta, and launch the training process:
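In practice, the launch step amounts to a torchrun invocation of the NeuronX Distributed training script, roughly as sketched below (the script name, paths, and parallelism flags are illustrative assumptions):

```bash
# Illustrative sketch: launch distributed fine-tuning with NeuronX Distributed.
# The script name, paths, and parallelism settings are assumptions, not the
# exact sample code. MODEL_PATH points at the pre-trained weights and config
# from Meta; DATA_PATH points at the tokenized fine-tuning dataset.
torchrun --nnodes "$SLURM_NNODES" --nproc_per_node 32 \
    run_llama_nxd.py \
    --model_path "$MODEL_PATH" \
    --data_dir "$DATA_PATH" \
    --tensor_parallel_size 8 \
    --pipeline_parallel_size 4 \
    --train_batch_size 1
```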
Deploy Llama 3.1 on Trainium or Inferentia
When your model is ready to deploy, you can do so by updating the model ID in the previous Llama 3 8B Neuron sample code. For example, the following code deploys the model on an inf2.48xlarge instance.
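A minimal sketch of that change, using the transformers-neuronx sampling API (the batch size, precision, and tp_degree shown are illustrative values for an inf2.48xlarge, not tuned settings):

```python
from transformers_neuronx import LlamaForSampling

# Swap the model ID from the Llama 3 sample to the Llama 3.1 checkpoint.
model_id = "meta-llama/Meta-Llama-3.1-8B"

# Shard the model across the 24 NeuronCores of an inf2.48xlarge;
# batch_size and amp are illustrative, not tuned, settings.
neuron_model = LlamaForSampling.from_pretrained(
    model_id, batch_size=1, tp_degree=24, amp="bf16"
)
neuron_model.to_neuron()  # compile the graph and load it onto the Neuron devices
```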
You can use the same sample inference code:
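That sample’s inference loop looks roughly like the following (a sketch based on the transformers-neuronx sampling examples; the prompt and sampling parameters are illustrative):

```python
import time
import torch
from transformers import AutoTokenizer

# Construct a tokenizer and encode the prompt text.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Hello, I'm a language model and I like to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Run generation with top-k sampling on the Neuron-compiled model.
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(
        input_ids, sequence_length=2048, top_k=50
    )
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f"generated sequences {generated_sequences} in {elapsed} seconds")
```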
For step-by-step details, refer to the new Llama 3.1 examples:
You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker through the Hugging Face Model Hub. From the Llama 3.1 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then choose Run.
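The generated snippet is, in essence, a standard SageMaker Hugging Face deployment against the Neuron serving container. A sketch of its shape follows (the environment values and instance type are illustrative assumptions; use the values the model card generates for you):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Neuron serving configuration; these values are illustrative assumptions
# and should come from the model card's generated snippet.
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HF_NUM_CORES": "8",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Use the Hugging Face Neuron (Optimum Neuron) serving image.
model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=1800,
)

print(predictor.predict({"inputs": "What is machine learning?"}))
```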
Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8B and 70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:
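A minimal sketch of offline inference with vLLM’s Neuron backend (the prompts, sequence limits, and tensor_parallel_size are illustrative; match them to the continuous batching guide for your instance size):

```python
from vllm import LLM, SamplingParams

# Sample prompts and sampling settings (illustrative values).
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Target the Neuron backend; the parallelism and length settings below are
# illustrative and should follow the continuous batching guide.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=8,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```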
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, refer to Model Samples and Tutorials in the AWS Neuron Documentation.
About the Authors
John Gray is a Sr. Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build scalable prototypes using AWS AI chips.
Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Head of Business Development for AWS Inferentia/Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.