Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2


In this post, we showcase fine-tuning a Llama 2 model using a Parameter-Efficient Fine-Tuning (PEFT) method and deploying the fine-tuned model on AWS Inferentia2. We use the AWS Neuron software development kit (SDK) to access the AWS Inferentia2 device and benefit from its high performance. We then use a large model inference container powered by Deep Java Library (DJLServing) as our model serving solution.

Solution overview

Efficient fine-tuning of Llama 2 using QLoRA

The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. AWS customers often choose to fine-tune Llama 2 models on their own data to achieve better performance on downstream tasks. However, due to the large number of parameters in the Llama 2 model, full fine-tuning can be prohibitively expensive and time consuming. Parameter-Efficient Fine-Tuning (PEFT) addresses this problem by fine-tuning only a small number of additional model parameters while freezing most parameters of the pre-trained model. For more information on PEFT, you can read this post. In this post, we use QLoRA to fine-tune a Llama 2 7B model.
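To illustrate the idea (this sketch is not part of the original post, and the LoRA settings are illustrative only), wrapping a pre-trained model with a LoRA adapter via Hugging Face's peft library leaves the base weights frozen and adds only a small number of trainable parameters, which the library can report:

# Minimal PEFT/LoRA illustration: the base model stays frozen and only the
# small adapter matrices are trainable. Values here are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
peft_model = get_peft_model(base, LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM"))

# Prints something like: trainable params: ~33M || all params: ~6.7B || trainable%: ~0.5
peft_model.print_trainable_parameters()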

Deploy a fine-tuned model on Inf2 using Amazon SageMaker

AWS Inferentia2 is a purpose-built machine learning (ML) accelerator designed for inference workloads. It delivers high performance at up to 40% lower cost for generative AI and LLM workloads compared to other inference-optimized instances on AWS. In this post, we use an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance, featuring AWS Inferentia2, the second-generation Inferentia accelerator, each containing two NeuronCores-v2. Each NeuronCore-v2 is an independent, heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. It includes on-chip, software-managed SRAM memory for maximizing data locality. Since several blogs on Inf2 have already been published, you can refer to this post and our documentation for more information on Inf2.

To deploy models on Inf2, we need the AWS Neuron SDK as the software layer running on top of the Inf2 hardware. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. It enables the end-to-end ML development lifecycle to build new models, train and optimize them, and deploy them to production. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated with popular frameworks like TensorFlow and PyTorch. In this blog, we use transformers-neuronx, which is part of the AWS Neuron SDK for transformer decoder inference workflows. It supports a range of popular models, including Llama 2.
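As a rough sketch of what transformers-neuronx does under the hood (not code from this post; module paths, checkpoint layout, and arguments vary across Neuron SDK releases, so treat this as an assumption and check the Neuron documentation for your installed version), a Llama 2 checkpoint can be compiled for NeuronCores and sampled from roughly as follows:

# Sketch only: compile and run a Llama 2 checkpoint with transformers-neuronx.
# The local checkpoint path, tp_degree, and sequence lengths are illustrative.
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

checkpoint = "./llama-2-7b-chat-hf"  # local copy of the model weights (assumption)

neuron_model = LlamaForSampling.from_pretrained(
    checkpoint, batch_size=1, tp_degree=2, amp="f16", n_positions=512
)
neuron_model.to_neuron()  # triggers compilation and loads weights onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_ids = tokenizer("What is machine learning?", return_tensors="pt").input_ids
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256)
print(tokenizer.decode(generated[0], skip_special_tokens=True))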

To deploy models on Amazon SageMaker, we usually use a container that contains the required libraries, such as the Neuron SDK and transformers-neuronx, as well as the model serving component. Amazon SageMaker maintains deep learning containers (DLCs) with popular open source libraries for hosting large models. In this post, we use the Large Model Inference Container for Neuron. This container has everything you need to deploy your Llama 2 model on Inf2. For resources to get started with LMI on Amazon SageMaker, refer to many of our existing posts (blog 1, blog 2, blog 3) on this topic. In short, you can run the container without writing any additional code. You can use the default handler for a seamless user experience and pass in one of the supported model names and any load-time configurable parameters. This compiles and serves an LLM on an Inf2 instance. For example, to deploy OpenAssistant/llama2-13b-orca-8k-3319, you can provide the following configuration (as a serving.properties file). In serving.properties, we specify the model ID, the batch size as 4, and the tensor parallel degree as 2, and that's it. For the full list of configurable parameters, refer to All DJL configuration options.

# Engine to use: MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc.
engine = Python 
# Default handler for model serving
option.entryPoint = djl_python.transformers_neuronx
# The Hugging Face ID of a model or the S3 URL of the model artifacts.
option.model_id = meta-llama/Llama-2-7b-chat-hf
# The dynamic batch size; the default is 1.
option.batch_size=4
# This option specifies the number of tensor parallel partitions performed on the model.
option.tensor_parallel_degree=2
# The input sequence length
option.n_positions=512
# Enable iteration-level batching using one of "auto", "scheduler", "lmi-dist"
option.rolling_batch=auto
# The data type to which you plan to cast the model by default
option.dtype=fp16
# Worker model loading timeout
option.model_loading_timeout=1500

Alternatively, you can write your own model handler file as shown in this example, but that requires implementing the model loading and inference methods to serve as a bridge between the DJLServing APIs.
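A custom handler is a Python module (conventionally a model.py placed alongside serving.properties) that DJLServing calls with an Input object and that returns an Output object. The following minimal sketch is not from this post, and the load_model and generate helpers are hypothetical placeholders; it only shows the general shape such a handler takes:

# Minimal sketch of a custom DJLServing Python handler (model.py).
# load_model and generate are hypothetical placeholders; a real handler would
# load the model with transformers-neuronx and run inference on the request.
from djl_python import Input, Output

model = None

def handle(inputs: Input) -> Output:
    global model
    if model is None:
        # Load and compile the model on the first invocation
        model = load_model(inputs.get_properties())
    if inputs.is_empty():
        # Warm-up / ping request from the serving frontend
        return None
    data = inputs.get_as_json()
    prediction = generate(model, data["inputs"], data.get("parameters", {}))
    return Output().add_as_json({"generated_text": prediction})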

Prerequisites

The following list outlines the prerequisites for deploying the model described in this blog post. You can implement them either from the AWS Management Console or using the latest version of the AWS Command Line Interface (AWS CLI).

Walkthrough

In the following sections, we walk through the code in two parts:

  1. Fine-tune a Llama2-7b model and upload the model artifacts to a specified Amazon S3 bucket location.
  2. Deploy the model to an Inferentia2 instance using the DJL serving container hosted in Amazon SageMaker.

The complete code samples with instructions can be found in this GitHub repository.

Part 1: Fine-tune a Llama2-7b model using PEFT

We’re going to use the not too long ago launched methodology within the paper QLoRA: Quantization-aware Low-Rank Adapter Tuning for Language Generation by Tim Dettmers et al. QLoRA is a brand new method to scale back the reminiscence footprint of enormous language fashions throughout fine-tuning, with out sacrificing efficiency.

Note: The fine-tuning of the llama2-7b model shown in the following was tested on an Amazon SageMaker Studio notebook with the PyTorch 2.0 GPU Optimized kernel using an ml.g5.2xlarge instance type. As a best practice, we recommend using an Amazon SageMaker Studio integrated development environment (IDE) launched in your own Amazon Virtual Private Cloud (Amazon VPC). This allows you to control, monitor, and inspect network traffic within and outside your VPC using standard AWS networking and security capabilities. For more information, see Securing Amazon SageMaker Studio connectivity using a private VPC.

Quantize the base model

We first load the base model with 4-bit quantization using the Hugging Face transformers library, as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The base pretrained model for fine-tuning
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Activate 4-bit precision base model loading
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load base model and tokenizer
device_map = {"": 0}  # place the entire model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
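As a quick sanity check (not part of the original code), you can confirm the effect of 4-bit loading by printing the model's memory footprint, which should be a few gigabytes rather than the roughly 13 GB the 7B model needs in FP16:

# Report how much memory the quantized model occupies (in GB)
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")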

Load the training dataset

Next, we load the dataset to feed to the model for the fine-tuning step, as shown in the following:

from datasets import load_dataset

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
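The mlabonne/guanaco-llama2-1k dataset stores each training example as a single text field already formatted with Llama 2 instruction tags, which is why the trainer later points dataset_text_field at "text". A quick inspection (not part of the original code) shows this:

# Inspect the dataset: each row carries a pre-formatted "text" column
print(dataset)
print(dataset[0]["text"][:200])  # first 200 characters of the first example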

Attach an adapter layer

Here we attach a small, trainable adapter layer, configured as a LoraConfig defined in Hugging Face's peft library.

from peft import LoraConfig

# Include linear layers to apply LoRA to
modules = find_all_linear_names(model)

# Setting up the LoRA configuration
# LoRA attention dimension (rank)
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules)
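The helper find_all_linear_names is referenced but not defined in the snippet above. A common implementation (shown here as a sketch, not necessarily the exact helper used in the repository) walks the model and collects the names of every 4-bit linear layer so that LoRA targets all of them:

import bitsandbytes as bnb

def find_all_linear_names(model):
    """Return the names of all 4-bit linear modules, excluding the output head."""
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    # lm_head is kept in higher precision and is not a LoRA target
    lora_module_names.discard("lm_head")
    return list(lora_module_names)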

Train the model

Using the LoRA configuration shown above, we fine-tune the Llama 2 model along with its hyperparameters. A code snippet for training the model is shown in the following:

from transformers import TrainingArguments
from trl import SFTTrainer

# Set training parameters
training_arguments = TrainingArguments(...)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,  # LoRA config
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train the model
trainer.train()

# Save the trained model (LoRA adapter weights)
trainer.model.save_pretrained(new_model)
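The TrainingArguments above are elided in the original snippet. Purely for illustration (these values are assumptions, not the ones used in the post), a typical QLoRA setup might look like the following, along with the max_seq_length, packing, and new_model variables the trainer expects:

# Illustrative hyperparameters only; tune them for your dataset and instance
new_model = "llama-2-7b-qlora-finetuned"
max_seq_length = None   # let SFTTrainer infer the sequence length
packing = False         # do not pack multiple short examples into one sequence

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",   # paged optimizer from bitsandbytes
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
)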

Merge the model weights

The fine-tuning step above produced a new model containing the trained LoRA adapter weights. In the following code snippet, we merge the adapter with the base model so that we can use the fine-tuned model for inference.

from peft import PeftModel

# Reload the base model in FP16 and merge it with the LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

save_dir = "merged_model"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")

# Reload the tokenizer to save it alongside the merged model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
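Before uploading the merged weights, you can optionally run a quick local generation to confirm that the merge worked (this check is not in the original post):

# Generate a short completion with the merged model as a sanity check
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))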

Upload the model weights to Amazon S3

In the final step of Part 1, we save the merged model weights to a specified Amazon S3 location. These weights will be used by a model serving container in Amazon SageMaker to host the model on an Inferentia2 instance.

model_data_s3_location = "s3://<bucket_name>/<prefix>/"
!cd {save_dir} && aws s3 cp --recursive . {model_data_s3_location}

Part 2: Host the QLoRA model for inference on AWS Inf2 using a SageMaker LMI container

In this section, we walk through the steps of deploying the QLoRA fine-tuned model into an Amazon SageMaker hosting environment. We use a DJL serving container from the SageMaker DLCs, which integrates with the transformers-neuronx library to host this model. The setup facilitates loading the model onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

Prepare the model artifacts

DJL supports many deep learning optimization libraries, including DeepSpeed, FasterTransformer, and more. For model-specific configurations, we provide a serving.properties file with key parameters, such as tensor_parallel_degree and model_id, to define the model loading options. The model_id can be a Hugging Face model ID or an Amazon S3 path where the model weights are stored. In our example, we provide the Amazon S3 location of our fine-tuned model. The following code snippet shows the properties used for model serving:

%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=<model data s3 location>
option.batch_size=4
option.neuron_optimize_level=2
option.tensor_parallel_degree=8
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16
option.model_loading_timeout=1500

Refer to this documentation for more details about the configurable options available via serving.properties. Note that we use option.n_positions=512 in this blog for faster AWS Neuron compilation. If you want to try a larger input token length, we recommend pre-compiling the model ahead of time (see AOT Pre-Compile Model on EC2). Otherwise, you might run into a timeout error if the compilation takes too long.

After the serving.properties file is defined, we package the file into tar.gz format, as follows:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

Then we upload the tar.gz file to an Amazon S3 bucket location:

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 code or model tarball uploaded to --- > {code_artifact}")

Create an Amazon SageMaker model endpoint

To use an Inf2 instance for serving, we use an Amazon SageMaker LMI container with DJL NeuronX support. Refer to this post for more details about using a DJL NeuronX container for inference. The following code shows how to deploy a model using the Amazon SageMaker Python SDK:

import sagemaker
from sagemaker import image_uris, serializers

# Retrieve the DJL NeuronX Docker image URI
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sess.boto_session.region_name,
    version="0.24.0"
)

# Define the Inf2 instance type to use for serving
instance_type = "ml.inf2.48xlarge"

endpoint_name = sagemaker.utils.name_from_base("lmi-model")

# Deploy the model for inference
model.deploy(initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1500,
    volume_size=256,
    endpoint_name=endpoint_name)

# Our requests and responses will be in JSON format, so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)
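The model object whose deploy method is called above is not constructed in this snippet. A minimal sketch of how it could be created from the LMI image URI and the uploaded code_artifact (an assumption following common LMI examples, not code from this post) is:

from sagemaker import Model

# IAM role with SageMaker permissions; get_execution_role() works inside Studio notebooks
role = sagemaker.get_execution_role()

# SageMaker Model backed by the LMI NeuronX container, with serving.properties
# pointing at the fine-tuned weights in Amazon S3
model = Model(
    image_uri=image_uri,
    model_data=code_artifact,
    role=role,
    sagemaker_session=sess,
)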

Test the model endpoint

After the model is deployed successfully, we can validate the endpoint by sending a sample request to the predictor:

import json

prompt = "What is machine learning?"
input_data = f"<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\n{prompt} [/INST]"

response = predictor.predict(
    {"inputs": input_data, "parameters": {"max_new_tokens": 300, "do_sample": "True"}}
)

print(json.loads(response)['generated_text'])

A sample output is shown as follows:

In the context of data analysis, Machine Learning (ML) refers to a statistical technique capable of extracting predictive power from a dataset with an increasing complexity and accuracy by iteratively narrowing down the scope of a statistic.

Machine Learning is not a new statistical technique, but rather a combination of existing techniques. Moreover, it has not been designed to be used with a specific dataset or to produce a specific outcome. Rather, it was designed to be flexible enough to adapt to any dataset and to make predictions about any outcome.

Clean up

If you decide that you no longer want to keep the SageMaker endpoint running, you can delete it using the AWS SDK for Python (Boto3), the AWS CLI, or the Amazon SageMaker console. Additionally, you can also shut down the Amazon SageMaker Studio resources that are no longer required.
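For example, with the predictor created earlier, the endpoint and its configuration can be removed directly from the notebook (a minimal sketch; Boto3 or the console work just as well):

# Delete the SageMaker endpoint and endpoint configuration, then the backing model
predictor.delete_endpoint()
predictor.delete_model()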

Conclusion

In this post, we showed you how to fine-tune a Llama2-7b model using a LoRA adapter with 4-bit quantization on a single GPU instance. We then deployed the model to an Inf2 instance hosted on Amazon SageMaker using a DJL serving container. Finally, we validated the Amazon SageMaker model endpoint with a text generation prediction using the SageMaker Python SDK. Go ahead and give it a try; we would love to hear your feedback. Stay tuned for updates on more capabilities and new innovations with AWS Inferentia.

For more examples of AWS Neuron, see aws-neuron-samples.


About the Authors

Wei Teh is a Senior AI/ML Specialist Solutions Architect at AWS. He is passionate about helping customers advance their AWS journey, focusing on Amazon machine learning services and machine learning-based solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor's research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial services and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
