Efficiently train models with large sequence lengths using Amazon SageMaker model parallel


Large language models (LLMs) have witnessed an unprecedented surge in popularity, with customers increasingly using publicly available models such as Llama, Stable Diffusion, and Mistral. Across diverse industries, including healthcare, finance, and marketing, organizations are now engaged in pre-training and fine-tuning these increasingly larger LLMs, which often boast billions of parameters and larger input sequence lengths. Although these advancements offer remarkable capabilities, they also present significant challenges. Longer sequence lengths and the sheer number of trainable parameters demand innovative approaches to model development and deployment. To maximize performance and optimize training, organizations frequently need to employ advanced distributed training strategies.

In this post, we demonstrate how the Amazon SageMaker model parallel library (SMP) addresses this need through support for new features such as 8-bit floating point (FP8) mixed-precision training for accelerated training performance and context parallelism for processing large input sequence lengths, expanding the list of its existing features.

We guide you through a step-by-step implementation, demonstrating how to accelerate workloads with FP8 and work with longer sequence lengths using context parallelism, with minimal code changes to your existing training workflow.

The implementation of these new SMP features promises several advantages for customers working with LLMs. First, it can lead to lower costs to convergence, allowing for more efficient use of resources during the training process. This results in reduced time to market, allowing organizations to deploy their optimized models more quickly and gain a competitive edge. Second, it enables training with larger dataset records, expanding the scope and complexity of tasks that can be tackled.

The following sections take a deeper look into this.

Business challenge

Businesses today face a significant challenge when training LLMs efficiently and cost-effectively. As models grow larger and more complex, organizations are using fine-tuning and continuous pre-training strategies to train these models with domain-specific data, using larger sequence lengths that can range from 8K to 128K tokens. These longer sequence lengths allow models to better understand long-range dependencies in text, generate more globally coherent outputs, and handle tasks requiring analysis of lengthy documents.

Although there exist various strategies such as Fully Sharded Data Parallelism (FSDP), tensor parallelism (TP), and pipeline parallelism to effectively train models with billions of parameters, these methods are primarily designed to distribute model parameters, gradients, and optimizer states across GPUs, and they don't focus on input data–related optimizations. This approach reduces memory pressure and enables efficient training of large models. However, none of these methods effectively address partitioning along the sequence dimension. As a result, training with longer sequence lengths can still lead to out-of-memory (OOM) errors, despite using FSDP.

As a result, working with larger sequence lengths can result in memory pressure, and it often requires innovative approaches such as FP8 and context parallelism.

How do SMP context parallelism and FP8 help accelerate model training?

SMP addresses the challenges of memory pressure by providing an implementation of context parallelism, a parallelization technique that partitions along the dimension of sequence length. Furthermore, it can work together with other parallelism techniques such as FSDP and TP. SMP also implements FP8 for supported models such as Llama. FP8 is a reduced-precision floating-point format that boosts efficiency by enabling faster matrix multiplications without significant accuracy loss. You can use these techniques together to train complex models orders of magnitude faster and rapidly iterate and deploy innovative AI solutions that drive business value.

The following sections dive deep into the implementation details for each of these features in SMP.

Context parallelism

Context parallelism is a model parallelism technique that allows the model to train with long sequences. It's a parallelization scheme that partitions a model's activations along the sequence dimension. During training with the SMP context parallel strategy, the inputs are partitioned along the sequence dimension before being fed to the model. With activations being partitioned along the sequence dimension, we need to consider how our model's computations are affected. For layers that don't have inter-token dependency during computation, we don't require special considerations. In a transformer architecture, such layers are the embedding layers and the multilayer perceptron (MLP) layers. The layers that have inter-token dependency are the attention layers. For the attention layer, as we see from the attention computation, the query projections (Q) need to interact with the tokens of the key (K) and value (V) projections.

Because we only have a partition of K and V, we require an AllGather operation to collect the keys and values from the other ranks. As detailed in the following figure, we consider a context parallel scheme with context parallel degree 2 for a causal language model. Thus GPU 0 has the first half of the input sequence and GPU 1 has the other half. During the forward pass, the non-attention layers compute their activations as normal. For the attention computation, an AllGather operation is performed for K and V across the context parallel ranks belonging to GPU 0 and GPU 1. To conserve memory, the K and V tensors obtained from the AllGather operation are discarded after the attention computation is completed. Consequently, during the backward pass, we require the same AllGather operation for K and V. Additionally, after the attention backward pass, a ReduceScatter operation is performed to scatter the gradients to the corresponding context parallel ranks.

Unlike other model parallel schemes such as tensor parallelism, context parallelism keeps the model parameters intact. Thus, there are no additional communication collectives for parameters required for context parallelism.
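To make the data layout concrete, the following is a minimal conceptual sketch, not the SMP implementation (which handles this automatically), of how an input sequence is partitioned along the sequence dimension and where the attention-time collectives would occur. The cp_degree, tensor shapes, and cp_group name are illustrative assumptions.

import torch

# Conceptual sketch only: with context parallel degree 2, each rank keeps one
# contiguous slice of the sequence dimension before the forward pass.
cp_degree = 2
batch, seq_len = 1, 16384
input_ids = torch.randint(0, 128256, (batch, seq_len))

# Each context parallel rank would receive one of these shards.
shards = torch.chunk(input_ids, cp_degree, dim=1)  # two [1, 8192] slices

# Embedding and MLP layers have no inter-token dependency, so they operate on
# the local shard unchanged. Attention needs the full K and V, so in a real
# distributed job an AllGather over the context parallel group would run here,
# for example:
#   torch.distributed.all_gather(k_list, k_local, group=cp_group)
#   torch.distributed.all_gather(v_list, v_local, group=cp_group)
# Attention is then computed as softmax(Q_local @ K_full^T / sqrt(d)) @ V_full,
# the gathered K/V tensors are discarded, and the backward pass re-gathers them
# and finishes with a ReduceScatter of the gradients.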

Supported models

SMP supports context parallelism using NVIDIA Transformer Engine, and it seamlessly integrates with other model parallelism techniques such as Fully Sharded Data Parallel and tensor parallelism. SMP v2.6 supports the Llama 3.1 (and prior Llama models) and Mistral model architectures for context parallelism.

Mixed precision training with FP8

As shown in the following figure, FP8 is a datatype supported by NVIDIA's H100 and H200 GPUs that enables efficient deep learning workloads. The FP8 format occupies only 8 bits of memory, half that of its BF16 or FP16 counterparts, significantly reducing computational costs for operations such as matrix multiplication. The compute throughput for running matrix operations such as multiplications and convolutions is significantly higher on 8-bit float tensors compared to 32-bit float tensors. FP8 precision reduces the data footprint and computational requirements, making it ideal for large-scale models where memory and speed are critical.

Delving deeper into FP8's architecture, we discover two distinct subtypes: E4M3 and E5M2. The E4M3 configuration, with its 1 sign bit, 4 exponent bits, and 3 mantissa bits, offers superior precision but a limited dynamic range. This makes it ideal for the forward pass in model training. Conversely, E5M2, featuring 1 sign bit, 5 exponent bits, and 2 mantissa bits, boasts a broader dynamic range at the expense of reduced precision. This configuration excels in the backward pass, where precision is less critical but a wider range proves advantageous.
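The difference in dynamic range is easy to inspect directly. The following is a small illustrative check, assuming a recent PyTorch build (2.1 or later) that exposes the FP8 dtypes; it isn't required for SMP, which configures FP8 through Transformer Engine.

import torch

# Compare the numeric limits of the two FP8 subtypes.
e4m3 = torch.finfo(torch.float8_e4m3fn)
e5m2 = torch.finfo(torch.float8_e5m2)

print(f"E4M3: max={e4m3.max}, smallest normal={e4m3.tiny}")  # max is 448
print(f"E5M2: max={e5m2.max}, smallest normal={e5m2.tiny}")  # max is 57344, a much wider range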

The transition to mixed precision training with FP16 or BF16 has historically necessitated static or dynamic loss scaling to address convergence issues that stemmed from reduced precision in gradient flow. This challenge is further amplified in FP8 because of its narrower range. To combat this, the Transformer Engine introduced an innovative solution called DelayedScaling. This technique selects scaling factors based on the maximum observed value for each tensor from previous iterations. Although DelayedScaling maximizes the performance benefits of FP8 computation, it does come with a memory overhead for storing the tensors' maximum value history. However, despite the additional overhead, the improved throughput observed with 8-bit tensor computations makes this approach worthwhile.
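For reference, the following sketch shows how DelayedScaling is typically configured when using Transformer Engine directly. SMP applies an equivalent recipe for you when FP8 is enabled, so this is illustrative only; the layer sizes and recipe values are assumptions.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A single FP8-capable layer; sizes are arbitrary for illustration.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()

# DelayedScaling picks scaling factors from the amax history of prior iterations.
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,  # E4M3 for forward tensors, E5M2 for gradients
    amax_history_len=16,              # how many iterations of max values to keep
    amax_compute_algo="max",          # use the max over the history window
)

inp = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.sum().backward()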

Supported models

SMP supports FP8 mixed precision training using NVIDIA Transformer Engine and retains compatibility with PyTorch MixedPrecision. This means that you can use FP8 training for supported layers and half-precision using PyTorch Automatic Mixed Precision for others. SMP v2.6 supports the following model architectures for FP8 training: Llama 3.1 (and prior Llama models), Mixtral, and Mistral.

More details about FP8 can be found at FP8 Formats For Deep Learning.

Solution overview

We can use SMP with both Amazon SageMaker Model training jobs and Amazon SageMaker HyperPod.

For this post, we demonstrate the SMP implementation on SageMaker training jobs.

Launching a machine learning (ML) training cluster with Amazon SageMaker training jobs is a seamless process that begins with a straightforward API call, AWS Command Line Interface (AWS CLI) command, or AWS SDK interaction. After they're initiated, SageMaker training jobs spin up the cluster, provisioning the specified number and type of compute instances.

In our example, we use a single ml.p5.48xlarge instance, though we're illustrating the use of four GPUs for demonstration purposes. The training data, securely stored in Amazon Simple Storage Service (Amazon S3), is copied to the cluster. Each record sequence (Seq0) is strategically split into multiple subsequences and assigned to each GPU in our cluster.

Our implementation uses the FP8 capabilities of SMP to execute model training on NVIDIA H100 GPUs and showcases context parallelism capabilities. Because of the flexibility of SageMaker, you can scale your compute resources as needed, accommodating workloads across a range of sizes. SageMaker creates a resilient training cluster, handles orchestration, closely monitors the infrastructure, and recovers from faults, providing a smooth and uninterrupted training experience. Additionally, the cost-effective design of SageMaker training jobs automatically terminates the cluster upon completion of the training job, with billing calculated down to the second of actual training time used. This combination of power, flexibility, and cost-efficiency makes SageMaker an ideal service for ML practitioners of all levels.

The following diagram shows the solution architecture.

The following walkthrough shows you how to train a Llama 3.1 8B Instruct model using the PubMed tokenized dataset with a sequence length of approximately 16K tokens. We use the SMP context parallelism implementation to enable training for this large sequence length. We compare two approaches: one without context parallelism and another with it. This comparison highlights the importance of context parallelism when working with LLMs and datasets containing long sequences.

Additionally, we conduct a comparative run on p5.48xlarge instances with context parallelism enabled, both with FP8 enabled and disabled. This demonstration showcases the incremental throughput benefits we can achieve by enabling FP8-based training alongside context parallelism.

In summary, the implementation follows these four steps:

  1. Set up libraries and process data
  2. Run training without context parallelism
  3. Run training with context parallelism enabled to track memory optimizations
  4. Run training with FP8 enabled to gain further performance

The following flow diagram shows these four steps.

Prerequisites

To perform the solution, you need to have the following prerequisites in place:

  1. Create a Hugging Face User Access Token and get access to the gated repository meta-llama/Llama-3.1-8B on Hugging Face.
  2. Request a Service Quota increase for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker. To request a service quota increase, on the AWS Service Quotas console, choose AWS services, Amazon SageMaker, and then choose one ml.p4d.24xlarge and one ml.p5.48xlarge training job usage.
  3. Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess and AmazonEC2FullAccess to give SageMaker the required access to run the examples.

This walkthrough is for demonstration purposes only. You should adjust it to your specific security requirements for production. Adhere to the principle of least privilege when defining IAM policies in production.

  4. Create an Amazon SageMaker Studio domain (refer to Quick setup to Amazon SageMaker) to access Jupyter notebooks.

Solution walkthrough

To perform the solution, use the instructions in the following steps.

Set up libraries and process data

To set up libraries and process data, follow these instructions. The following flow diagram shows step 1 highlighted.

  1. Enter the following command to install the relevant Hugging Face and SageMaker libraries:
    %pip install --upgrade "sagemaker>=2.233"
    %pip install "datasets==2.14.5"
    %pip install transformers

  2. Load the PubMed dataset and tokenize it

In this example, we use the PubMed Scientific Papers dataset, containing 133,215 biomedical research articles. For our experiment, we select 1,000 papers split 80/20 for training and validation. Using the Meta-Llama-3 tokenizer, we process each paper into sequences of 16,384 tokens.

The dataset undergoes two main processing steps: tokenization with Llama's tokenizer and grouping into fixed-length chunks of 16,384 tokens using the utility function group_texts (a sketch of this helper follows the snippet below). This uniform sequence length enables even distribution across GPUs while maintaining the natural structure of the scientific papers.

import datasets
from datasets import load_dataset, DatasetDict

# Load the PubMed dataset
pubmed_dataset = load_dataset(
    "scientific_papers",
    "pubmed",
    cache_dir="/residence/ec2-user/SageMaker/datasets",
    download_mode="force_redownload"
)

# Create a smaller subset of the dataset for our experiment
train_test = pubmed_dataset['train'].shuffle(seed=42).select(range(1000)).train_test_split(
    test_size=0.2,
    seed=42
)

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)
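The snippet above references tokenized_datasets and the group_texts helper without defining them. The following is a minimal sketch of one common implementation, following the standard Hugging Face causal language modeling recipe; the tokenize_function, the "article" field name, and the block_size value are assumptions for illustration, and model_id refers to the Llama 3.1 8B repository used throughout this post.

from transformers import AutoTokenizer

block_size = 16384  # target sequence length used in this post
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize_function(examples):
    # The PubMed configuration stores the paper body in the "article" field.
    return tokenizer(examples["article"])

tokenized_datasets = train_test.map(
    tokenize_function,
    batched=True,
    remove_columns=train_test["train"].column_names,
)

def group_texts(examples):
    # Concatenate all tokenized texts, then split into fixed-length blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result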

  3. Prepare data for the training job

In this section, we prepare the PubMed dataset for SageMaker training by managing data transfers to Amazon S3. Both training and validation splits are converted to JSON format and uploaded to designated S3 buckets, with separate paths for input data and output artifacts (a sketch of the upload step follows the snippet).

if lm_datasets["train"] is just not None:
    train_dataset = lm_datasets["train"]
    train_dataset.to_json("./coaching.json")
    training_dataset_location = f"s3://{default_bucket}/dataset/prepare/"

if lm_datasets["validation"] is just not None:
    eval_dataset = lm_datasets["validation"]
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"
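The upload itself isn't shown above; a minimal sketch using the SageMaker Python SDK follows. The channel names ("train" and "val") and the use of S3Uploader are assumptions for illustration and should match whatever your training script expects.

from sagemaker.inputs import TrainingInput
from sagemaker.s3 import S3Uploader

# Push the local JSON files to the S3 prefixes defined above.
S3Uploader.upload("./training.json", training_dataset_location)
S3Uploader.upload("./validation.json", validation_dataset_location)

# Input channels later passed to smp_estimator.fit(inputs=data_channels).
data_channels = {
    "train": TrainingInput(s3_data=training_dataset_location),
    "val": TrainingInput(s3_data=validation_dataset_location),
}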

  4. Set up training hyperparameters

In this configuration, we define hyperparameters for training Llama on PubMed, covering memory optimizations, training parameters, model architecture settings, and performance tuning. Starting with conservative settings (batch size=1, BF16 precision), we establish a baseline configuration that will be modified to test different optimization strategies, particularly for the context parallelism experiments.

hyperparameters = {
    # Memory and optimization settings
    "activation_checkpointing": 1,
    "auto_wrap_policy": "transformer_auto_wrap_policy",
    ...
    
    # Training settings
    "train_batch_size": 1,
    "val_batch_size": 1,
    ...
    
    # Model configuration
    "vocab_size": 128256, # Vocab size from the Llama 3.1 config file on Hugging Face
    "hf_pretrained_model_name_or_dir": model_id,
    
    ...
    
}

Run training without context parallelism

To run training without context parallelism, follow these instructions. The following flow diagram shows step 2 highlighted.

In this setup, we configure a baseline training job by disabling context parallelism and FP8 features, while maximizing memory utilization through FP32 precision and larger batch sizes. Each GPU processes the full 16,384-token sequence without splitting, and memory-saving features are disabled to demonstrate the limitations and potential memory constraints when working without advanced optimizations such as context parallelism and FP8.

instance_type= "p4d.24xlarge"
instance_count= 1
hybrid_shard_degree= 8

hyperparameters.update({
    "use_smp_implementation": 0,  # Disable SMP/CP. Only FSDP is active
    "train_batch_size": 1,        # Batch size
    "max_context_width": 16384,   # Full sequence length
    "clean_cache": 0,
    "bf16": 1,                    # Use bf16
    ...
})

smp_estimator = PyTorch(
    entry_point="prepare.py",
    hyperparameters=hyperparameters,
    ...
    instance_type=instance_type,
    volume_size=400,
    instance_count=instance_count,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,  # Allow mannequin parallelism however with minimal parameters
                "parameters": {
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True
                }
            }
        }
    },
    
   ...
)

smp_estimator.fit(inputs=data_channels)

The result of not using context parallelism with a large context width (16,384) is that we get a CUDA out-of-memory error:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.83 GiB. GPU 3 has a total capacity of 39.38 GiB of which 5.53 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use.

Run training with context parallelism enabled to track memory optimizations

To run training with context parallelism enabled to track memory optimizations, follow these instructions. The following flow diagram shows step 3 highlighted.

In this configuration, we enable context parallelism while keeping FP8 disabled. By setting the context parallel degree to 8, we distribute the 16,384-token sequence across all available GPUs for efficient processing. The setup includes the essential context parallelism parameters and launches the training job in a background thread, allowing for unblocked notebook execution while maintaining clear job identification for comparison with other configurations.

instance_type= "p4d.24xlarge"
instance_count= 1
hybrid_shard_degree= 8
context_parallel_degree=8

smp_estimator = PyTorch(
    ...
    entry_point="prepare.py",
    instance_type=instance_type,
    instance_count=instance_count,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "context_parallel_degree": context_parallel_degree,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True,
                }
            }
        }
    },
    ...
)

smp_estimator.fit(inputs=data_channels)

The result of using context parallelism with such a large context width is that the job successfully completes, as shown in the following screenshot.

We also enabled delayed parameter initialization and hybrid sharding capabilities in SMP for both previous configurations. Delayed parameter initialization allows initializing large models on a meta device without attaching data. This can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization. Hybrid sharding is a memory-saving technique that shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to world_size. This results in reduced communication volume because expensive AllGathers and ReduceScatters are only done within a node, which performs better for medium-sized models.
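To illustrate the idea behind delayed parameter initialization in plain PyTorch and Hugging Face terms (SMP handles this for you when delayed_parameter_initialization is set to True), the following sketch builds the model graph on the meta device so that no parameter memory is allocated until the weights are later materialized and sharded; model_id is the Llama 3.1 8B identifier used earlier.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id)
with torch.device("meta"):
    # Parameters exist only as shapes and dtypes; no GPU or CPU memory is used yet.
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # prints "meta"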

Run training with FP8 enabled to gain further performance

To run training with FP8 enabled to gain further performance, follow these instructions. The following flow diagram shows step 4 highlighted.

In this fully optimized configuration, we enable both context parallelism and FP8 training using an NVIDIA P5 instance (ml.p5.48xlarge). This setup combines sequence splitting across GPUs with FP8 precision training, creating a highly efficient training environment. Using P5 instances provides the necessary hardware support for FP8 computation, with the result that we can maximize the benefits of both memory-saving techniques.

instance_type= "p5.48xlarge"
instance_count= 1
hybrid_shard_degree= 8
context_parallel_degree=8

hyperparameters.update({
    "use_smp_implementation": 1,  # Enable SMP/CP
    "max_context_width": 16384,   # Full sequence length
    "fp8": 1,  # Enable FP8 flag
    "distributed_backend": "nccl"  # Add this line to explicitly use NCCL
    ...

})

smp_estimator = PyTorch(
    ...
    entry_point="prepare.py",
    instance_type=instance_type,
    instance_count=instance_count,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "context_parallel_degree": context_parallel_degree,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "delayed_parameter_initialization": True,
                }
            }
        }
    },
   ...
)

smp_estimator.fit(inputs=data_channels)

Start training with context parallelism, without FP8 (on a P5 instance)

To do a fair comparison with and without FP8, we do another run without FP8 but with context parallelism on a ml.p5.48xlarge instance and compare the throughputs for both runs.

instance_type= "p5.48xlarge"
instance_count= 1
hybrid_shard_degree= 8
context_parallel_degree=8

hyperparameters.update({
    "use_smp_implementation": 1,  # Enable SMP/CP
    "max_context_width": 16384,   # Full sequence length
    "bf16": 1,                    # Use BF16
    "distributed_backend": "nccl"  # Add this line to explicitly use NCCL
    ...
})

# This remains the same as in the previous step
smp_estimator = PyTorch(
    ...
    )
    
smp_estimator.fit(inputs=data_channels)

If we compare both runs, we can tell that the same context parallelism enabled job is nearly 10 times faster with FP8.

With FP8, the speed is around 14.6 samples/second, as shown in the following screenshot.

Without FP8, the speed is around 1.4 samples/second, as shown in the following screenshot.

The following table depicts the throughput increment you get in each of the listed cases. All these cases are run on an ml.p5.48xlarge instance.

The throughput might vary based on factors such as the context width or batch size. The following numbers are what we have observed in our testing.

Configuration (ml.p5.48xlarge; CP on 8 GPUs, train batch size 4) | Observed samples speed | Observed throughput
No context parallelism and no FP8 | torch.OutOfMemoryError: CUDA out of memory | torch.OutOfMemoryError: CUDA out of memory
Context parallelism only | 2.03 samples/sec | 247 TFLOPS/GPU
Context parallelism + FP8 | 3.05 samples/sec | 372 TFLOPS/GPU

Cleanup

To clean up your resources and avoid incurring additional charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. Delete any S3 buckets created.
  4. Verify that your training job isn't running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.

To learn more about cleaning up your provisioned resources, check out Clean up.

Conclusion

In this post, we demonstrated the process of setting up and running training jobs for the PubMed dataset using the Llama 3.1 8B Instruct model, both with and without context parallelism. We also showcased how to enable FP8-based training for even faster throughputs.

Key takeaways:

  • For datasets that have long sequence lengths, we observe that using context parallelism helps avoid OOM errors.
  • For faster training, we can enable FP8-based training and combine it with context parallelism to get increased throughput. In this notebook, we observed that throughput goes up nearly tenfold when we enable FP8 with context parallelism.

As next steps, try out the above example by following the notebook steps at sagemaker-distributed-training-workshop.

Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect, for his support on the launch of this post.


About the Authors

Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them improve the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized applications and high-performance computing solutions.

Surya Kari is a Senior Generative AI Data Scientist at AWS. With a background in computer vision and AI devices, his current specializations include LLM training, multi-modal RAG, vision-language models, and edge computing.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in LLM training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.
