Easy guide to training Llama 2 with AWS Trainium on Amazon SageMaker


Large language models (LLMs) are making a significant impact in the realm of artificial intelligence (AI). Their impressive generative abilities have led to widespread adoption across numerous sectors and use cases, including content generation, sentiment analysis, chatbot development, and virtual assistant technology. Llama 2 by Meta is an example of an LLM offered by AWS. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture and is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart.

Many practitioners fine-tune or pre-train these Llama 2 models with their own text data to improve accuracy for their specific use case. However, one challenge practitioners face is the high cost of fine-tuning and training. As organizations strive to push the boundaries of what LLMs can achieve, the demand for cost-effective training solutions has never been more pressing. In this post, we explore how you can use the Neuron distributed training library to fine-tune, continuously pre-train, and reduce the cost of training LLMs such as Llama 2 with AWS Trainium instances on Amazon SageMaker.

AWS Trainium instances for training workloads

SageMaker ml.trn1 and ml.trn1n instances, powered by Trainium accelerators, are purpose-built for high-performance deep learning training and offer up to 50% cost-to-train savings over comparable training-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances. This post implements a solution with the ml.trn1.32xlarge Trainium instance type, typically used for training large-scale models. However, there are also comparable ml.trn1n instances that offer twice as much networking throughput (1,600 Gbps) via Amazon Elastic Fabric Adapter (EFAv2). SageMaker Training supports the availability of ml.trn1 and ml.trn1n instances in the US East (N. Virginia) and US West (Oregon) AWS Regions, and most recently announced general availability in the US East (Ohio) Region. These instances are available in the listed Regions as On-Demand Instances, Reserved Instances, and Spot Instances, or additionally as part of a Savings Plan.

For more information on Trainium accelerator chips, refer to Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker. Additionally, check out AWS Trainium Customers to learn more about customer testimonials, or see Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available to dive into the accelerator highlights and specifications.

Using the Neuron Distributed library with SageMaker

SageMaker is a fully managed service that provides developers, data scientists, and practitioners the ability to build, train, and deploy machine learning (ML) models at scale. SageMaker Training includes features that improve and simplify the ML training experience, including managed infrastructure and images for deep learning, automatic model tuning with hyperparameter optimization, and a pay-for-what-you-use billing structure. This section highlights the advantages of using SageMaker for distributed training with the Neuron Distributed library—specifically, the managed infrastructure, time-to-train, and cost-to-train benefits of its associated resiliency and recovery features. The Neuron Distributed library is part of the AWS Neuron SDK, which is used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances.

In high performance computing (HPC) clusters, such as those used for deep learning model training, hardware resiliency issues can be a potential obstacle. Although hardware failures while training on a single instance may be rare, issues resulting in stalled training become more prevalent as a cluster grows to tens or hundreds of instances. Regular checkpointing helps mitigate wasted compute time, but engineering teams managing their own infrastructure must still closely monitor their workloads and be prepared to remediate a failure at all hours to minimize training downtime. The managed infrastructure of SageMaker Training includes several resiliency features that streamline this monitoring and recovery process:

  • Cluster health checks – Before a training job starts, SageMaker runs health checks and verifies communication on the provisioned instances. It then replaces any faulty instances, if necessary, to make sure the training script starts running on a healthy cluster of instances. Health checks are currently enabled for the TRN1 instance family as well as P* and G* GPU-based instance types.
  • Automatic checkpointing – Checkpoints from a local path (/opt/ml/checkpoints by default) are automatically copied to an Amazon Simple Storage Service (Amazon S3) location specified by the user. When training is restarted, SageMaker automatically copies the previously saved checkpoints from the S3 location back to the local checkpoint directory to make sure the training script can load and resume the last saved checkpoint.
  • Monitoring and tracking training – In the case of a node failure, it’s important to have visibility into where the failure occurs. Using PyTorch Neuron gives data scientists the ability to track training progress in TensorBoard, allowing you to capture the training loss and determine when the training job should be stopped based on the model’s convergence.
  • Built-in retries and cluster repair – You can configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker replaces any instances that encountered unrecoverable errors with fresh instances, reboots all healthy instances, and starts the job again. This results in faster restarts and workload completion. Cluster repair is currently enabled for the TRN1 instance family as well as P and G GPU-based instance types. Practitioners can add their own applicative retry mechanism around the client code that submits the job to handle other types of launch errors, such as exceeding your account quota (see the sketch after this list).
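As an illustration of such an applicative retry (a minimal sketch, not part of the original notebooks), you could wrap the job submission and retry only when the launch fails with a quota error such as ResourceLimitExceeded:

import time
from botocore.exceptions import ClientError

def submit_with_retries(estimator, inputs, max_attempts=3, wait_seconds=300):
    """Retry job submission when the launch fails with a quota error."""
    for attempt in range(1, max_attempts + 1):
        try:
            estimator.fit(inputs)
            return
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code != "ResourceLimitExceeded" or attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed with {code}; retrying in {wait_seconds}s")
            time.sleep(wait_seconds)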

For customers working with large clusters of hundreds of instances for a training job, the resiliency and recovery features of SageMaker Training can reduce the total time for a model to converge by up to 20% through fewer failures and faster recovery. They also spare engineering teams from having to monitor and react to failures at all hours. Although SageMaker training jobs are suitable for general-purpose training use cases with customizable configurations and integration with the broader AWS ecosystem, Amazon SageMaker HyperPod is specifically optimized for efficient and resilient training of foundation models at scale. For more information on SageMaker HyperPod use cases, refer to the SageMaker HyperPod developer guide.

In this post, we use the Neuron Distributed library to continuously pre-train a Llama 2 model using tensor and pipeline parallelism with SageMaker training jobs. To learn more about the resiliency and recovery features of SageMaker Training, refer to Training large language models on Amazon SageMaker: Best practices.

Solution overview

In this solution, we use an ml.t3.medium instance type on a SageMaker Jupyter notebook to run the provided cells. We will be continuously pre-training our llama2-70b model using trn1.32xlarge Trainium instances. First, let’s familiarize ourselves with the techniques we use to handle the distribution of the training job created in our solution to continuously pre-train our llama2-70b model using the Neuron Distributed training library.

The techniques used to convert the pre-trained weights in the convert_pretrained_weights.ipynb notebook into a .pt (PyTorch) weights file are called pipeline parallelism and tensor parallelism:

  • Pipeline parallelism splits the layers of a deep neural network into sequential stages and splits each batch into multiple microbatches, allowing each stage worker to process one microbatch at a time while other stages work in parallel.
  • Tensor parallelism splits individual tensors of a neural network across multiple devices. This technique makes it possible to train models whose large tensors can’t fit into the memory of a single device (see the worked example after this list).
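For example, with the configuration used later in this post, a cluster of 32 trn1.32xlarge instances provides 32 × 32 = 1,024 Neuron core workers. With a tensor parallel degree of 8 and a pipeline parallel degree of 8, each model replica spans 8 × 8 = 64 cores, so the data parallel degree is 1,024 / 64 = 16 replicas. A global batch size of 512 then gives 512 / 16 = 32 samples per replica per step, which are processed as 32 microbatches of one sample each.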

After we convert our pre-trained weights with the preceding techniques in our first notebook, we follow two separate notebooks in the same sagemaker-trainium-examples folder. The second notebook is Training_llama2_70b.ipynb, which walks through the continuous pre-training process, starting from the checkpoint of converted model weights saved by the first notebook. When this step is complete, we can run the Convert_Nxd_to_hf.ipynb notebook, which takes our pre-trained weights from the NeuronX library and converts them into a format readable by Hugging Face to serve inference.

Prerequisites

You need to complete some prerequisites before you can run the first notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer to be used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 8 Trn1 instances up to a maximum of 32 Trn1 instances (depending on the time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

  • Trainium instances (ml.trn1.32xlarge) for training job usage: 8–32
  • ml.trn1.32xlarge for training warm pool usage: 8–32
  • Maximum number of instances per training job: 8–32

It may take up to 24 hours for the quota increase to be approved. However, after submitting the quota increase requests, you can go to the sagemaker-trainium-examples GitHub repo and locate the convert_pretrained_weights.ipynb file. This is the file you use to begin the continuous pre-training process.

Now that you’re ready to begin the process to continuously pre-train the llama2-70b model, you convert the pre-trained weights in the next section to prepare the model and create the checkpoint.

Getting started

Complete the following steps:

  1. Install all the required packages and libraries: SageMaker, Boto3, transformers, and datasets.

These packages make sure that you can set up your environment to access your pre-trained Llama 2 model, download your tokenizer, and get your pre-training dataset.

!pip install -U sagemaker boto3 --quiet
!pip install transformers datasets[s3] --quiet
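The cells in this post also reference a SageMaker session, its default S3 bucket, an execution role, and the AWS Region (sess, sagemaker_session_bucket, role, and region_name). The following is a minimal sketch of that setup, assuming you run the notebook with a role that has SageMaker permissions:

import sagemaker

# Session setup assumed by later cells: session, default S3 bucket,
# execution role, and Region name
sess = sagemaker.Session()
sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region_name = sess.boto_region_name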

  2. After the packages are installed, retrieve your Hugging Face access token, and download and define your tokenizer.

The meta-llama/Llama-2-70b-hf tokenizer is a specialized tokenizer that breaks down text into smaller units for natural language processing. The tokenized data will later be uploaded to Amazon S3 to allow for running your training job.

from huggingface_hub.hf_api import HfFolder
# Update the access token to download the tokenizer
access_token = "hf_insert-key-here"
HfFolder.save_token(access_token)

from transformers import AutoTokenizer
tokenizer_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
block_size = 4096

  3. After running the preceding cells, download the wikicorpus dataset from the Hugging Face Hub.
  4. Tokenize the dataset with the Llama 2 tokenizer that you just initialized.

By tokenizing the data, you prepare your pre-training dataset: exposing the Llama 2 model to the trilingual (Catalan, English, Spanish) text data in the wikicorpus dataset lets it learn intricate patterns and relationships in the dataset.
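The following is a minimal sketch of the download-and-tokenize step. It is an illustration under assumptions: the wikicorpus configuration name ("raw_en") and the grouping helper are not taken from the original notebook, so adapt them to your use case.

from itertools import chain
from datasets import load_dataset

# Download the dataset (the exact wikicorpus configuration used in the
# notebook may differ; "raw_en" is shown for illustration)
dataset = load_dataset("wikicorpus", "raw_en", split="train")

# Tokenize the raw text with the Llama 2 tokenizer defined earlier
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=dataset.column_names,
)

# Concatenate the token stream and cut it into fixed-length blocks of block_size
def group_texts(examples):
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

train_dataset = tokenized.map(group_texts, batched=True)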

After the data is tokenized, run the following cell to store the training dataset in Amazon S3:

# save the training dataset to S3
training_input_path = f's3://{sess.default_bucket()}/neuronx_distributed/data'
print(f"uploading training dataset to: {training_input_path}")
train_dataset.save_to_disk(training_input_path)

print(f"uploaded data to: {training_input_path}")

The preceding cell makes sure that you define training_input_path and upload the data to your S3 bucket. You’re now ready to begin the training job process.

Run the training job

For the training job, we use trn1.32xlarge instances, each with 32 Neuron cores. We use tensor parallelism and pipeline parallelism, which allows you to shard the model across Neuron cores for training.

The following code is the configuration for pre-training llama2-70b with trn1:

# Number of processes per node
PROCESSES_PER_NODE = 32
# Number of instances within the cluster; change this if you want to tweak the instance_count parameter
WORLD_SIZE = 32
# Global batch size
GBS = 512
# Input sequence length
SEQ_LEN = 4096
# Pipeline parallel degree
PP_DEGREE = 8
# Tensor parallel degree
TP_DEGREE = 8
# Data parallel degree
DP = ((PROCESSES_PER_NODE * WORLD_SIZE / TP_DEGREE / PP_DEGREE))
# Batch size per model replica
BS = ((GBS / DP))
# Number of microbatches for pipeline execution. Set to the same value as BS so each microbatch contains a single data sample
NUM_MICROBATCHES = BS
# Number of total steps for which to train the model. This number should be adjusted to the step at which the loss approaches convergence.
MAX_STEPS = 1500
# Timeout in seconds for training. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
MAX_RUN = 2 * (24 * 60 * 60)

Now you can define the hyperparameters for training. Note that adjusting these parameters based on hardware capabilities, dataset characteristics, and convergence requirements can significantly impact training performance and efficiency.

The following is the code for the hyperparameters:

hyperparameters = {}
hyperparameters["train_batch_size"] = int(BS)
hyperparameters["use_meta_device_init"] = 1
hyperparameters["training_dir"] = "/opt/ml/input/data/train" # path where SageMaker uploads the training data
hyperparameters["training_config"] = "config.json" # config file containing the Llama 2 70B configuration; change this to tweak the number of parameters

hyperparameters["max_steps"] = MAX_STEPS
hyperparameters["seq_len"] = SEQ_LEN
hyperparameters["pipeline_parallel_size"] = PP_DEGREE
hyperparameters["tensor_parallel_size"] = TP_DEGREE
hyperparameters["num_microbatches"] = int(NUM_MICROBATCHES)
hyperparameters["lr"] = 0.00015
hyperparameters["min_lr"] = 1e-05
hyperparameters["beta1"] = 0.9
hyperparameters["beta2"] = 0.95
hyperparameters["weight_decay"] = 0.1
hyperparameters["warmup_steps"] = 2000
hyperparameters["constant_steps"] = 0
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["tb_dir"] = "/opt/ml/checkpoints/tensorboard" # The TensorBoard logs are saved here and eventually pushed to S3

Now you specify the Docker image that will be used to train the model on Trainium:

docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.0-ubuntu20.04"

The image we defined is designed for PyTorch training with Neuron optimizations. It uses Neuron SDK version 2.18.0 for enhanced performance and efficiency on Trn1 instances equipped with AWS Trainium chips. The image is also compatible with Python 3.10, indicated by py310, and is based on Ubuntu 20.04.

Prior to starting your training job, you need to configure it by defining all necessary variables. You do so by defining the training job name, checkpoint directory, and cache directory:

import time
# Define the training job name
job_name = f'llama-neuron-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# Define the checkpoint directory that contains the weights and other relevant data for the trained model
checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuron_llama_experiment"
checkpoint_dir = "/opt/ml/checkpoints"
# Define the Neuron cache directory
cache_dir = "/opt/ml/checkpoints/neuron_cache"

These parameters allow you to do the following:

  • The training job name allows you to identify and track individual training jobs based on timestamps
  • The checkpoint directory specifies the S3 URI where the checkpoint data, weights, and other information are stored for the trained model
  • The cache directory helps optimize the training process by storing and reusing previously calculated values from the checkpoint directory, reducing redundancy and improving efficiency
  • The environment variables make sure that the training job is optimally configured, with settings tailored to enable efficient and effective training using features like RDMA, optimized memory allocation, fused operations, and Neuron-specific device optimizations (a sample env dictionary is sketched after this list)
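The env dictionary passed to the estimator in the next cell is not shown in this post. The following is a plausible example for illustration only—an assumption, not the exact configuration from the notebook—so refer to the Neuron SDK documentation and the example repository for the settings appropriate to your workload:

# Example environment variables (illustrative assumption; verify against the
# Neuron SDK documentation and the example notebook)
env = {
    "FI_PROVIDER": "efa",                  # use EFA networking between instances
    "FI_EFA_USE_DEVICE_RDMA": "1",         # enable RDMA over EFA
    "FI_EFA_FORK_SAFE": "1",
    "MALLOC_ARENA_MAX": "64",              # cap glibc malloc arenas to reduce memory overhead
    "NEURON_FUSE_SOFTMAX": "1",            # enable fused softmax in the Neuron compiler
    "NEURON_COMPILE_CACHE_URL": cache_dir  # persist compiled artifacts in the checkpoint directory
}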

After you have defined your training job and configured all directories and environment variables for an optimal training pipeline, you set up your PyTorch estimator to begin the training job on SageMaker. A SageMaker estimator is a high-level interface that handles the end-to-end SageMaker training and deployment tasks.

The entry_point is specified as the Python script run_llama_nxd.py. We use the instance_type ml.trn1.32xlarge, an instance count of 32 (previously defined as a global variable in the configuration code), and input_mode set to FastFile. Fast file mode in SageMaker streams data from Amazon S3 on demand, which optimizes data loading performance by fetching data as needed, reducing overall resource consumption. For more information on input modes, refer to Access Training Data.

from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks
pt_estimator = PyTorch(
    entry_point="run_llama_nxd.py",
    source_dir="./scripts",
    instance_type="ml.trn1.32xlarge",
    image_uri=docker_image,
    instance_count=WORLD_SIZE,
    max_run=MAX_RUN,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name=job_name,
    environment=env,
    input_mode="FastFile",
    disable_output_compression=True,
    keep_alive_period_in_seconds=600, # this is added to enable warm pool capability
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=checkpoint_dir,
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun
)

Finally, you can start the training job with the SageMaker fit() method, which trains the model based on the defined hyperparameters:

# Start the training job
pt_estimator.fit({"train": training_input_path})

You’ve successfully started the process to continuously pre-train a llama2-70b model by converting pre-trained weights and using tokenized data with SageMaker training on Trainium instances.

Continuous pre-training

After following the prerequisites, completing the provided notebook, and converting the pre-trained weights into a checkpoint, you can now begin the continuous pre-training process, using that checkpoint as the point of reference to pre-train the llama2-70b model.

To begin the continuous pre-training process, follow the Training_llama2_70b.ipynb notebook in the sagemaker-trainium-examples repo.

Given the large size of the llama2-70b model, you need to convert the pre-trained weights into a more efficient and usable format (.pt). You can do so by defining hyperparameters in your configuration to store the converted weights and checkpoints. The following are the hyperparameters:

# Use the SageMaker S3 checkpoint mechanism since we need read/write access to the paths
hyperparameters["output_dir"] = "/opt/ml/checkpoints/llama70b_weights"
hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'
hyperparameters["n_layers"] = 80
hyperparameters["convert_from_full_model"] = ""

If you look at the hyperparameters, output_dir is used as a reference for pre-training. By this cell, you should have already followed the Training_llama2_70b.ipynb notebook and gone through the process of setting up your SageMaker client and Docker image and preparing the pre-trained weights for pre-training. You’re now ready to perform the continuous pre-training process on the llama2-70b model.

We use the following parameters to take the pre-trained weights saved in output_dir in the convert_pretrained_weights.ipynb notebook and reuse them for continuous pre-training:

hyperparameters["checkpoint_dir"] = "/decide/ml/checkpoints/checkpts"
hyperparameters["checkpoint_freq"] = 10
hyperparameters["num_kept_checkpoint"] = 1
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["save_load_xser"] = 0
hyperparameters["pretrained_weight_dir"] = "/decide/ml/checkpoints/llama70b_weights"

After these hyperparameters are set, you can run the rest of the notebook cells to complete the continuous pre-training process. After the SageMaker estimator has completed the training job, you can find the new checkpoint in the S3 checkpoint directory containing the weights. You can now locate the convert_Nxd_to_hf.ipynb notebook to get the checkpoint ready for inference.

Convert the Neuron Distributed checkpoint for inference

Checkpoints play an important role in distributed training with the NeuronX library because it offers checkpoint compatibility with Hugging Face Transformers. You can get the training job output ready for inference by taking the output that’s saved as a NeuronX distributed checkpoint and converting the weights into .pt weight files.

To convert the checkpoints to Hugging Face format using NeuronX, you first need to specify the S3 nxd_checkpoint_path directory:

# S3 checkpoint directory that contains the weights and other relevant data from the continuously pre-trained model
checkpoint_s3_uri = "<pre-training-checkpoint-s3-uri>"
nxd_checkpoint_path = f"s3://{checkpoint_s3_uri}/neuronx_llama_experiment/checkpts/step10/model/"
# The checkpoint is saved as part of Notebook 2

After you save the checkpoint in the nxd_checkpoint_path directory, you can save your hyperparameters and configure your SageMaker estimator, which makes sure the conversion process can begin. You can now run the fit() function on the estimator to convert the pre-trained weights into a checkpoint for inference with the following cell:

# Start the SageMaker job
estimator.fit({"checkpoint": nxd_checkpoint_path})
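After the conversion job completes, you can download the converted weights from the checkpoint S3 location and load them with Hugging Face Transformers. The following is a minimal sketch under assumptions: the local directory and file name are hypothetical placeholders, and materializing a 70B-parameter state dict this way requires several hundred GB of CPU memory, so in practice you would typically use sharded weights or a multi-device loading strategy.

import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Build the model skeleton from the Llama 2 70B configuration, then load the
# converted weights (the paths below are hypothetical placeholders)
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
model = LlamaForCausalLM(config)

state_dict = torch.load("./llama70b_hf/checkpoint.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()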

Summary

You’ve successfully performed continuous pre-training on a llama2-70b model by converting your pre-trained weights and checkpoint to be used to serve inference using the Neuron SDK and Trainium instances. By following the solution in this post, you should now know how to configure a pipeline for continuous pre-training of an LLM using SageMaker and Trainium accelerator chips.

For more information on how to use Trainium for your workloads, refer to the Neuron SDK documentation or reach out directly to the team. We value customer feedback and are always looking to engage with ML practitioners and developers. Feel free to leave comments or questions in the comments section.


About the authors


Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyperscale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.


Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.


Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the areas of NLP and CV. Outside of work, he enjoys running and hiking.


Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.


Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.


Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.


Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.
