Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker


This post is co-written with Meta’s PyTorch team.

In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. Although foundation models (FMs) offer impressive out-of-the-box capabilities, true competitive advantage often lies in deep model customization through fine-tuning. However, fine-tuning LLMs for complex tasks typically requires advanced AI expertise to align and optimize them effectively. Recognizing this challenge, the PyTorch team developed torchtune, a PyTorch-native library that simplifies authoring, fine-tuning, and experimenting with LLMs, making it more accessible to a broader range of users and applications.

In this post, AWS collaborates with Meta’s PyTorch team to showcase how you can use PyTorch’s torchtune library to fine-tune Meta Llama-like architectures while using a fully managed environment provided by Amazon SageMaker Training. We demonstrate this through a step-by-step implementation of model fine-tuning, inference, quantization, and evaluation. We perform the steps on a Meta Llama 3.1 8B model using the LoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 NVIDIA A100 GPUs).

Before we dive into the step-by-step guide, we first explored the performance of our technical stack by fine-tuning a Meta Llama 3.1 8B model across various configurations and instance types.

As can be seen in the following chart, we found that a single p4d.24xlarge delivers 70% higher performance than two g5.48xlarge instances (each with 8 NVIDIA A10G GPUs) at almost 47% reduced price. We have therefore optimized the example in this post for a p4d.24xlarge configuration. However, you could use the same code to run single-node or multi-node training on different instance configurations by changing the parameters passed to the SageMaker estimator. You can further optimize the time for training in the following graph by using a SageMaker managed warm pool and accessing pre-downloaded models using Amazon Elastic File System (Amazon EFS).

Challenges with fine-tuning LLMs

Generative AI models offer many promising business use cases. However, to maintain factual accuracy and relevance of these LLMs to specific business domains, fine-tuning is required. Due to the growing number of model parameters and the increasing context length of modern LLMs, this process is memory intensive. To address these challenges, fine-tuning strategies like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) limit the number of trainable parameters by adding low-rank parallel structures to the transformer layers. This enables you to train LLMs even on systems with low memory availability like commodity GPUs. However, this leads to increased complexity because new dependencies have to be handled, and training recipes and hyperparameters need to be adapted to the new techniques.
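To make the memory argument concrete, the following minimal sketch compares the number of trainable weights in a dense linear layer with the number in a LoRA adapter attached to it. The layer size (4096 x 4096) and rank (8) are illustrative assumptions, and the helper names are hypothetical; nothing here is read from an actual model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in a dense linear layer (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of weights in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    # Illustrative sizes only (assumption), not taken from a checkpoint.
    d_in, d_out, rank = 4096, 4096, 8
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full: {full:,}  lora: {lora:,}  trainable share: {lora / full:.4%}")
```

Because only the low-rank factors receive gradients and optimizer state, the trainable share of each adapted layer stays well below one percent at this rank.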

What businesses need today is user-friendly training recipes for these popular fine-tuning techniques, which provide abstractions to the end-to-end tuning process, addressing the common pitfalls in the most opinionated way.

How does torchtune help?

torchtune is a PyTorch-native library that aims to democratize and streamline the fine-tuning process for LLMs. By doing so, it makes it straightforward for researchers, developers, and organizations to adapt these powerful LLMs to their specific needs and constraints. It provides training recipes for a variety of fine-tuning techniques, which can be configured through YAML files. The recipes implement common fine-tuning methods (full-weight, LoRA, QLoRA) as well as other common tasks like inference and evaluation. They automatically apply a set of important features (FSDP, activation checkpointing, gradient accumulation, mixed precision) and are specific to a given model family (such as Meta Llama 3/3.1 or Mistral) as well as compute environment (single-node vs. multi-node).

Additionally, torchtune integrates with major libraries and frameworks like Hugging Face datasets, EleutherAI’s Eval Harness, and Weights & Biases. This helps address the requirements of the generative AI fine-tuning lifecycle, from data ingestion and multi-node fine-tuning to inference and evaluation. The following diagram shows a visualization of the steps we describe in this post.

Refer to the installation instructions and PyTorch documentation to learn more about torchtune and its concepts.

Solution overview

This post demonstrates the use of SageMaker Training for running torchtune recipes through task-specific training jobs on separate compute clusters. SageMaker Training is a comprehensive, fully managed ML service that enables scalable model training. It provides flexible compute resource selection, support for custom libraries, a pay-as-you-go pricing model, and self-healing capabilities. By managing workload orchestration, health checks, and infrastructure, SageMaker helps reduce training time and total cost of ownership.

The solution architecture incorporates the following key components to enhance security and efficiency in fine-tuning workflows:

  • Security enhancement – Training jobs are run within private subnets of your virtual private cloud (VPC), significantly improving the security posture of machine learning (ML) workflows.
  • Efficient storage solution – Amazon EFS is used to accelerate model storage and access across various phases of the ML workflow.
  • Customizable environment – We use custom containers in training jobs. The support in SageMaker for custom containers allows you to package all necessary dependencies, specialized frameworks, and libraries into a single artifact, providing full control over your ML environment.

The following diagram illustrates the solution architecture. Users initiate the process by calling the SageMaker control plane through APIs or the command line interface (CLI), or by using the SageMaker SDK for each individual step. In response, SageMaker spins up training jobs with the requested number and type of compute instances to run specific tasks. Each step defined in the diagram accesses torchtune recipes from an Amazon Simple Storage Service (Amazon S3) bucket and uses Amazon EFS to save and access model artifacts across different stages of the workflow.

By decoupling every torchtune step, we achieve a balance between flexibility and integration, allowing for both independent execution of steps and the potential for automating this process using seamless pipeline integration.

In this use case, we fine-tune a Meta Llama 3.1 8B model with LoRA. Subsequently, we run model inference, and optionally quantize and evaluate the model using torchtune and SageMaker Training.

Recipes, configs, datasets, and prompt templates are completely configurable and allow you to align torchtune to your requirements. To demonstrate this, we use a custom prompt template in this use case and combine it with the open source dataset Samsung/samsum from the Hugging Face hub.

We fine-tune the model using torchtune’s multi-device LoRA recipe (lora_finetune_distributed) and use the SageMaker customized version of the Meta Llama 3.1 8B default config (llama3_1/8B_lora).

Prerequisites

You need to complete the following prerequisites before you can run the SageMaker Jupyter notebooks:

  1. Create a Hugging Face access token to get access to the gated repo meta-llama/Meta-Llama-3.1-8B on Hugging Face.
  2. Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring.
  3. Request a SageMaker service quota for 1x ml.p4d.24xlarge and 1x ml.g5.2xlarge.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonEC2FullAccess, AmazonElasticFileSystemFullAccess, and AWSCloudFormationFullAccess to give required access to SageMaker to run the examples. (This is for demonstration purposes. You should adjust this to your specific security requirements for production.)
  5. Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. Refer to the instructions to set permissions for Docker build.
  6. Log in to the notebook console and clone the GitHub repo:
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd sagemaker-distributed-training-workshop/13-torchtune

  7. Run the notebook ipynb to set up VPC and Amazon EFS using an AWS CloudFormation stack.

Review torchtune configs

The following figure illustrates the steps in our workflow.

You can look up the torchtune configs for your use case by directly using the tune CLI. For this post, we provide modified config files aligned with the SageMaker directory path structure:

sh-4.2$ cd config/
sh-4.2$ ls -ltr
-rw-rw-r-- 1 ec2-user ec2-user 1151 Aug 26 18:34 config_l3.1_8b_gen_orig.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1172 Aug 26 18:34 config_l3.1_8b_gen_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user  644 Aug 26 18:49 config_l3.1_8b_quant.yaml
-rw-rw-r-- 1 ec2-user ec2-user 2223 Aug 28 14:53 config_l3.1_8b_lora.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1223 Sep  4 14:28 config_l3.1_8b_eval_trained.yaml
-rw-rw-r-- 1 ec2-user ec2-user 1213 Sep  4 14:29 config_l3.1_8b_eval_original.yaml

torchtune uses these config files to select and configure the components (think models and tokenizers) during the execution of the recipes.
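The idea of selecting a component from a config can be sketched in a few lines. The following is a simplified illustration of resolving a `_component_` dotted path and calling it with the remaining keys as keyword arguments; it is not torchtune's own implementation, the `instantiate` helper is hypothetical, and a standard-library class stands in for a model or tokenizer builder.

```python
import importlib
from typing import Any, Dict


def instantiate(node: Dict[str, Any]) -> Any:
    """Resolve the dotted path under '_component_' and call it with the other keys."""
    kwargs = dict(node)  # copy so the caller's config is left untouched
    dotted_path = kwargs.pop("_component_")
    module_name, _, attr_name = dotted_path.rpartition(".")
    component = getattr(importlib.import_module(module_name), attr_name)
    return component(**kwargs)


if __name__ == "__main__":
    # Stand-in component (assumption): a stdlib class instead of a torchtune builder.
    cfg = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
    print(instantiate(cfg))
```

With this pattern, swapping a model, tokenizer, or dataset becomes a change to the YAML file rather than to the recipe code.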

Build the container

As part of our example, we create a custom container to provide custom libraries like torch nightlies and torchtune. Complete the following steps:

sh-4.2$ cat Dockerfile
# Set the default value for the REGION build argument
ARG REGION=us-west-2
# SageMaker PyTorch image for TRAINING
FROM ${ACCOUNTID}.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
# Uninstall existing PyTorch packages
RUN pip uninstall torch torchvision transformer-engine -y
# Install latest release of PyTorch and torchvision
RUN pip install --force-reinstall torch==2.4.1 torchao==0.4.0 torchvision==0.19.1

Run the 1_build_container.ipynb notebook until the following command to push this file to your ECR repository:

!sm-docker build . --repository accelerate:latest

sm-docker is a CLI tool designed for building Docker images in SageMaker Studio using AWS CodeBuild. We install the library as part of the notebook.

Next, we run the 2_torchtune-llama3_1.ipynb notebook for all fine-tuning workflow tasks.

For every task, we review three artifacts:

  • torchtune configuration file
  • SageMaker task config with compute and torchtune recipe details
  • SageMaker task output

Run the fine-tuning task

In this section, we walk through the steps to run and monitor the fine-tuning task.

Run the fine-tuning job

The following code shows a shortened torchtune recipe configuration highlighting a few key components of the file for a fine-tuning job:

  • Model component including LoRA rank configuration
  • Meta Llama 3 tokenizer to tokenize the data
  • Checkpointer to read and write checkpoints
  • Dataset component to load the dataset
sh-4.2$ cat config_l3.1_8b_lora.yaml
# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /opt/ml/input/data/model/hf-model/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_files: [
    consolidated.00.pth
  ]
  …

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.samsum_dataset
  train_on_input: True
batch_size: 13

# Training
epochs: 1
gradient_accumulation_steps: 2

... and more ...
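As a quick sanity check on these settings, the following sketch computes how many samples contribute to each optimizer step. It assumes that all 8 GPUs of the p4d.24xlarge act as data-parallel workers and that batch_size is a per-device value; the helper name is hypothetical.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples that contribute to one optimizer step across all data-parallel workers."""
    return per_device_batch * grad_accum_steps * world_size


if __name__ == "__main__":
    # batch_size and gradient_accumulation_steps come from the config above;
    # world_size=8 is an assumption based on the 8 GPUs of a p4d.24xlarge.
    print(effective_batch_size(per_device_batch=13, grad_accum_steps=2, world_size=8))
```

When moving to a different instance type, keeping this product roughly constant is one way to keep the optimization behavior comparable while adapting the per-device batch size to the available GPU memory.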

We use Weights & Biases for logging and tracking our training jobs, which helps us follow our model’s performance:

metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger
  …

Next, we define a SageMaker task that will be passed to our utility function in the script create_pytorch_estimator. This script creates the PyTorch estimator with all the defined parameters.

In the task, we use the lora_finetune_distributed torchrun recipe with config config_l3.1_8b_lora.yaml on an ml.p4d.24xlarge instance. Make sure you download the base model from Hugging Face before it’s fine-tuned using the use_downloaded_model parameter. The image_uri parameter defines the URI of the custom container.

sagemaker_tasks={
    "fine-tune":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_lora.yaml",
            "tune_action":"fine-tune",
            "use_downloaded_model":"false",
            "tune_recipe":"lora_finetune_distributed"
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
    # ... and more ...
}

To create and run the task, run the following code:

Task="fine-tune"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)
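The helper create_pytorch_estimator lives in the sample repository and is not reproduced in this post. As a rough, hypothetical sketch of what such a helper has to do, the following function maps a task entry onto keyword arguments accepted by the SageMaker Python SDK PyTorch estimator. The function name, the entry point name, and the role placeholder are assumptions for illustration, and the sketch deliberately avoids importing the SDK so it runs anywhere.

```python
from typing import Any, Dict, Optional


def build_estimator_kwargs(
    hyperparameters: Dict[str, str],
    instance_count: int,
    instance_type: str,
    image_uri: Optional[str] = None,
    role: str = "arn:aws:iam::<accountid>:role/<role-name>",  # placeholder, not a real role
    entry_point: str = "launcher.py",  # hypothetical script name
    keep_alive_seconds: int = 0,
) -> Dict[str, Any]:
    """Translate one sagemaker_tasks entry into estimator keyword arguments."""
    kwargs: Dict[str, Any] = {
        "entry_point": entry_point,
        "role": role,
        "instance_count": instance_count,
        "instance_type": instance_type,
        "hyperparameters": dict(hyperparameters),
    }
    if image_uri is not None:
        kwargs["image_uri"] = image_uri
    if keep_alive_seconds > 0:
        # A positive keep-alive period requests a SageMaker managed warm pool.
        kwargs["keep_alive_period_in_seconds"] = keep_alive_seconds
    return kwargs


if __name__ == "__main__":
    task = {
        "hyperparameters": {"tune_action": "fine-tune", "tune_recipe": "lora_finetune_distributed"},
        "instance_count": 1,
        "instance_type": "ml.p4d.24xlarge",
    }
    print(build_estimator_kwargs(**task))
```

The resulting dictionary could then be passed to sagemaker.pytorch.PyTorch(**kwargs). When image_uri is omitted, that estimator additionally needs framework_version and py_version to pick a prebuilt container.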

The following code shows the task output and reported status:

# Refer-Output

2024-08-16 17:45:32 Starting - Starting the training job...
...
...

1|140|Loss: 1.4883038997650146:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]
1|141|Loss: 1.4621509313583374:  99%|█████████▉| 141/142 [06:26<00:02,  2.47s/it]

Training completed with code: 0
2024-08-26 14:19:09,760 sagemaker-training-toolkit INFO     Reporting training SUCCESS

The final model is saved to Amazon EFS, which makes it available without download time penalties.

Monitor the fine-tuning job

You can monitor various metrics such as loss and learning rate for your training run through the Weights & Biases dashboard. The following figures show the results of the training run where we tracked GPU utilization, GPU memory utilization, and loss curve.

For the following graph, to optimize memory usage, torchtune uses only rank 0 to initially load the model into CPU memory. Rank 0 is therefore responsible for loading the model weights from the checkpoint.

The example is optimized to use GPU memory to its maximum capacity. Increasing the batch size further will lead to CUDA out-of-memory (OOM) errors.

The run took about 13 minutes to complete for one epoch, resulting in the loss curve shown in the following graph.

Run the model generation task

In the next step, we use the previously fine-tuned model weights to generate the answer to a sample prompt and compare it to the base model.

The following code shows the configuration of the generate recipe config_l3.1_8b_gen_trained.yaml. The following are key parameters:

  • FullModelMetaCheckpointer – We use this to load the trained model checkpoint meta_model_0.pt from Amazon EFS
  • CustomTemplate.SummarizeTemplate – We use this to format the prompt for inference
# torchtune - trained model generation config - config_l3.1_8b_gen_trained.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/
  checkpoint_files: [
    meta_model_0.pt
  ]
  …

# Generation arguments; defaults taken from gpt-fast
instruct_template: CustomTemplate.SummarizeTemplate

... and more ...
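The custom template itself is not shown in this post. The following standalone sketch illustrates what a summarization prompt template of this kind could look like. The class name, the template wording, and the format signature are assumptions chosen to match the "---" and "Summary:" markers visible in the task output; the class does not import or subclass anything from torchtune.

```python
from typing import Any, Dict, Mapping, Optional


class SummarizeTemplateSketch:
    """Hypothetical prompt template: wraps a dialogue in a summarization instruction."""

    template = "Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"

    @classmethod
    def format(
        cls,
        sample: Mapping[str, Any],
        column_map: Optional[Dict[str, str]] = None,
    ) -> str:
        # column_map lets a dataset use a different column name for the dialogue text.
        column_map = column_map or {}
        key = column_map.get("dialogue", "dialogue")
        return cls.template.format(dialogue=sample[key])


if __name__ == "__main__":
    sample = {"dialogue": "Amanda: I baked cookies.\nJerry: Sure"}
    print(SummarizeTemplateSketch.format(sample))
```

Keeping the template identical between fine-tuning and inference matters, because the model learns to continue exactly the text that follows the final marker.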

Next, we configure the SageMaker task to run on a single ml.g5.2xlarge instance:

prompt=r'{"dialogue":"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure \r\nAmanda: I will bring you tomorrow :-)"}'

sagemaker_tasks={
    "generate_inference_on_trained":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_gen_trained.yaml",
            "tune_action":"generate-trained",
            "use_downloaded_model":"true",
            "prompt":json.dumps(prompt)
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output of the SageMaker task, we see the model summary output and some stats like tokens per second:

#Refer- Output
...
Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure \r\nAmanda: I will bring you tomorrow :-)

Summary:
Amanda baked cookies. She will bring some to Jerry tomorrow.

INFO:torchtune.utils.logging:Time for inference: 1.71 sec total, 7.61 tokens/sec
INFO:torchtune.utils.logging:Memory used: 18.32 GB

... and more ...

We can generate inference from the original model using the original model artifact consolidated.00.pth:

# torchtune - original model generation config - config_l3.1_8b_gen_orig.yaml
…
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /opt/ml/input/data/model/hf-model/original/
  checkpoint_files: [
    consolidated.00.pth
  ]

... and more ...

The following code shows the comparison output from the base model run with the SageMaker task (generate_inference_on_original). We can see that the fine-tuned model performs subjectively better than the base model by also mentioning that Amanda baked the cookies.

# Refer-Output
---
Summary:
Jerry tells Amanda he wants some cookies. Amanda says she will bring him some cookies tomorrow.

... and more ...

Run the model quantization task

To speed up the inference and decrease the model artifact size, we can apply post-training quantization. torchtune relies on torchao for post-training quantization.

We configure the recipe to use Int8DynActInt4WeightQuantizer, which refers to int8 dynamic per-token activation quantization combined with int4 grouped per-axis weight quantization. For more details, refer to the torchao implementation.
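To give an intuition for the weight half of this scheme, the following pure-Python sketch quantizes one row of weights to 4-bit integer codes in fixed-size groups, with one scale per group. It is a simplified symmetric variant written for illustration, not torchao's implementation; the function names are hypothetical, and the dynamic int8 activation quantization is not covered.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric int4 quantization of one group: codes in [-8, 7] plus a float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], groupsize: int) -> List[Tuple[List[int], float]]:
    """Split a weight row into consecutive groups and quantize each independently."""
    return [quantize_group_int4(row[i:i + groupsize]) for i in range(0, len(row), groupsize)]


if __name__ == "__main__":
    row = [0.1, -0.7, 0.35, 0.0, 2.0, -1.0, 0.5, 0.25]
    for codes, scale in quantize_row(row, groupsize=4):
        print(codes, round(scale, 4), [round(v, 3) for v in dequantize_group(codes, scale)])
```

A smaller group size gives each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales; the groupsize value in the recipe config controls that trade-off.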

# torchtune model quantization config - config_l3.1_8b_quant.yaml
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  …

quantizer:
  _component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256

We again use a single ml.g5.2xlarge instance and use SageMaker warm pool configuration to speed up the spin-up time for the compute nodes:

sagemaker_tasks={
    "quantize_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_quant.yaml",
            "tune_action":"run-quant",
            "use_downloaded_model":"true"
            },
        "instance_count":1,
        "instance_type":"ml.g5.2xlarge",
        "image_uri":"<accountid>.dkr.ecr.<region>.amazonaws.com/accelerate:latest"
    }
}

In the output, we see the location of the quantized model and how much memory we saved due to the process:

#Refer-Output
...

linear: layers.31.mlp.w1, in=4096, out=14336
linear: layers.31.mlp.w2, in=14336, out=4096
linear: layers.31.mlp.w3, in=4096, out=14336
linear: output, in=4096, out=128256
INFO:torchtune.utils.logging:Time for quantization: 7.40 sec
INFO:torchtune.utils.logging:Memory used: 22.97 GB
INFO:torchtune.utils.logging:Model checkpoint of size 8.79 GB saved to /opt/ml/input/data/model/quantized/meta_model_0-8da4w.pt

... and more ...

You can run model inference on the quantized model meta_model_0-8da4w.pt by updating the inference-specific configurations.

Run the model evaluation task

Finally, let’s evaluate our fine-tuned model in an objective manner by running an evaluation on the validation portion of our dataset.

torchtune integrates with EleutherAI’s evaluation harness and provides the eleuther_eval recipe.

For our evaluation, we use a custom task for the evaluation harness to evaluate the dialogue summarizations using the ROUGE metrics.
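For readers unfamiliar with ROUGE, the following sketch computes a simplified ROUGE-1 F1 score as unigram overlap between a reference and a candidate summary. It uses lowercased whitespace tokenization and no stemming, so its values will differ from those of the evaluation harness; the function name and the example strings are illustrative assumptions.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1: F1 over unigram overlap, whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration, not taken from the dataset.
    reference = "amanda baked cookies and will bring some tomorrow"
    candidate = "amanda will bring cookies tomorrow"
    print(round(rouge1_f1(reference, candidate), 4))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L scores the longest common subsequence instead of independent n-gram matches.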

The recipe configuration points the evaluation harness to our custom evaluation task:

# torchtune trained model evaluation config - config_l3.1_8b_eval_trained.yaml

model:
...

include_path: "/opt/ml/input/data/config/tasks"
tasks: ["samsum"]
...

The following code is the SageMaker task that we run on a single ml.p4d.24xlarge instance:

sagemaker_tasks={
    "evaluate_trained_model":{
        "hyperparameters":{
            "tune_config_name":"config_l3.1_8b_eval_trained.yaml",
            "tune_action":"run-eval",
            "use_downloaded_model":"true",
            },
        "instance_count":1,
        "instance_type":"ml.p4d.24xlarge",
    }
}

Run the model evaluation on ml.p4d.24xlarge:

Task="evaluate_trained_model"
estimator=create_pytorch_estimator(**sagemaker_tasks[Task])
execute_task(estimator)

The following tables show the task output for the fine-tuned model as well as the base model.

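The following output is for the fine-tuned model.

|Tasks |Version|Filter|n-shot|Metric|Direction|Value  |   |Stderr|
|------|-------|------|------|------|---------|-------|---|------|
|samsum|2      |none  |None  |rouge1|         |45.8661|±  |N/A   |
|      |       |none  |None  |rouge2|         |23.6071|±  |N/A   |
|      |       |none  |None  |rougeL|         |37.1828|±  |N/A   |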

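The following output is for the base model.

|Tasks |Version|Filter|n-shot|Metric|Direction|Value  |   |Stderr|
|------|-------|------|------|------|---------|-------|---|------|
|samsum|2      |none  |None  |rouge1|         |33.6109|±  |N/A   |
|      |       |none  |None  |rouge2|         |13.0929|±  |N/A   |
|      |       |none  |None  |rougeL|         |26.2371|±  |N/A   |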

Our fine-tuned model achieves a ROUGE-1 score of approximately 46 on the summarization task, which is approximately 12 points better than the baseline.
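The gains can be recomputed directly from the two outputs above. The following short sketch derives the absolute difference in ROUGE points and the relative change for each metric; the scores are copied from the task outputs, and the helper name is hypothetical.

```python
from typing import Dict, Tuple

# Scores copied from the two task outputs above.
FINE_TUNED = {"rouge1": 45.8661, "rouge2": 23.6071, "rougeL": 37.1828}
BASE = {"rouge1": 33.6109, "rouge2": 13.0929, "rougeL": 26.2371}


def compare(base: Dict[str, float], tuned: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain in points, relative gain in percent) per metric."""
    return {
        metric: (tuned[metric] - base[metric], (tuned[metric] - base[metric]) / base[metric] * 100)
        for metric in base
    }


if __name__ == "__main__":
    for metric, (points, percent) in compare(BASE, FINE_TUNED).items():
        print(f"{metric}: +{points:.2f} points ({percent:.1f}% relative)")
```

Reporting both numbers avoids ambiguity, because a gain of about 12 ROUGE-1 points and the corresponding relative change are different quantities.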

Clean up

Complete the following steps to clean up your resources:

  1. Delete any unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. Delete the CloudFormation stack to delete the VPC and Amazon EFS resources.

Conclusion

In this post, we discussed how you can fine-tune Meta Llama-like architectures using various fine-tuning strategies on your preferred compute and libraries, using custom dataset prompt templates with torchtune and SageMaker. This architecture gives you a flexible way of running fine-tuning jobs that are optimized for GPU memory and performance. We demonstrated this through fine-tuning a Meta Llama 3.1 model using P4 and G5 instances on SageMaker and used observability tools like Weights & Biases to monitor the loss curve, as well as CPU and GPU utilization.

We encourage you to use SageMaker training capabilities and PyTorch’s torchtune library to fine-tune Meta Llama-like architectures for your specific business use cases. To stay informed about upcoming releases and new features, refer to the torchtune GitHub repo and the official Amazon SageMaker training documentation.

Special thanks to Kartikay Khandelwal (Software Engineer at Meta), Eli Uriegas (Engineering Manager at Meta), Raj Devnath (Sr. Product Manager Technical at AWS) and Arun Kumar Lokanatha (Sr. ML Solution Architect at AWS) for their support to the launch of this post.


About the Authors

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy large language models efficiently on AWS.

Matthias Reso is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is a co-maintainer of llama-recipes and TorchServe.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
