Amazon SageMaker model parallel library now accelerates PyTorch FSDP workloads by up to 20%


Large language model (LLM) training has surged in popularity over the last year with the release of several popular models such as Llama 2, Falcon, and Mistral. Customers are now pre-training and fine-tuning LLMs ranging from 1 billion to over 175 billion parameters to optimize model performance for applications across industries, from healthcare to finance and marketing.

Training performant models at this scale can be a challenge. Highly accurate LLMs can require terabytes of training data and thousands or even millions of hours of accelerator compute time to achieve target accuracy. To complete training and launch products in a timely manner, customers rely on parallelism techniques to distribute this enormous workload across up to thousands of accelerator devices. However, these parallelism techniques can be difficult to use: different techniques and libraries are only compatible with certain workloads or restricted to certain model architectures, training performance can be highly sensitive to obscure configurations, and the state of the art is evolving quickly. As a result, machine learning practitioners must spend weeks of preparation to scale their LLM workloads to large clusters of GPUs.

In this post, we highlight new features of the Amazon SageMaker model parallel (SMP) library that simplify the large model training process and help you train LLMs faster. In particular, we cover the SMP library's new simplified user experience that builds on open source PyTorch Fully Sharded Data Parallel (FSDP) APIs, expanded tensor parallel functionality that enables training models with hundreds of billions of parameters, and performance optimizations that reduce model training time and cost by up to 20%.

To learn more about the SageMaker model parallel library, refer to SageMaker model parallelism library v2 documentation. You can also refer to our example notebooks to get started.

New features that simplify and accelerate large model training

This post discusses the latest features included in the v2.0 release of the SageMaker model parallel library. These features improve the usability of the library, expand functionality, and accelerate training. In the following sections, we summarize the new features and discuss how you can use the library to accelerate your large model training.

Aligning SMP with open source PyTorch

Since its launch in 2020, SMP has enabled high-performance, large-scale training on SageMaker compute instances. With this latest major version release, SMP simplifies the user experience by aligning its APIs with open source PyTorch.

PyTorch offers Fully Sharded Data Parallelism (FSDP) as its main method for supporting large training workloads across many compute devices. As demonstrated in the following code snippet, SMP's updated APIs for techniques such as sharded data parallelism mirror those of PyTorch. You can simply run import torch.sagemaker and use it in place of torch.

## training_script.py
import torch.sagemaker as tsm
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

tsm.init()

# Set up a PyTorch model
model = ...

# Wrap the PyTorch model using the PyTorch FSDP module
model = FSDP(
    model,
    ...
)

optimizer = ...
...

With these updates to SMP's APIs, you can now realize the performance benefits of SageMaker and the SMP library without overhauling your existing PyTorch FSDP training scripts. This paradigm also allows you to use the same code base when training on premises as on SageMaker, simplifying the user experience for customers who train in multiple environments.

For more information on how to enable SMP with your existing PyTorch FSDP training scripts, refer to Get started with SMP.

Integrating tensor parallelism to enable training on massive clusters

This release of SMP also expands PyTorch FSDP's capabilities to include tensor parallelism techniques. One problem with using sharded data parallelism alone is that you can encounter convergence problems as you scale up your cluster size. This is because sharding parameters, gradients, and optimizer state across data parallel ranks also increases your global batch size; on large clusters, the global batch size can be pushed beyond the threshold at which the model will still converge. You need to incorporate an additional parallelism technique that doesn't require an increase in global batch size as you scale your cluster.

To mitigate this problem, SMP v2.0 introduces the ability to compose sharded data parallelism with tensor parallelism. Tensor parallelism allows the cluster size to increase without changing the global batch size or affecting model convergence. With this feature, you can safely increase training throughput by provisioning clusters with 256 nodes or more.
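To make the batch size argument concrete, the following back-of-the-envelope calculation uses hypothetical numbers (not benchmarks from this post) to show why composing tensor parallelism keeps the global batch size in check as the cluster grows.

# Illustrative arithmetic only; the cluster and batch sizes below are hypothetical.
gpus = 2048                  # total accelerators in the cluster
per_gpu_batch = 2            # sequences processed by each data parallel rank per step

# Pure sharded data parallelism: every GPU is a data parallel rank,
# so the global batch size grows linearly with the cluster.
global_batch_sharded = gpus * per_gpu_batch                                 # 4,096 sequences

# Composing tensor parallelism of degree 8 splits each model replica across
# 8 GPUs, leaving gpus / 8 data parallel ranks and a smaller global batch.
tensor_parallel_degree = 8
global_batch_composed = (gpus // tensor_parallel_degree) * per_gpu_batch    # 512 sequences

print(global_batch_sharded, global_batch_composed)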

Currently, tensor parallelism with PyTorch FSDP is only available with SMP v2. SMP v2 allows you to enable this technique with a few lines of code change and unlock stable training even on large clusters. SMP v2 integrates with Transformer Engine for its implementation of tensor parallelism and makes it compatible with the PyTorch FSDP APIs. You can enable PyTorch FSDP and SMP tensor parallelism simultaneously without making any changes to your PyTorch model or PyTorch FSDP configuration. The following code snippets show how to set up the SMP configuration dictionary in JSON format and add the SMP initialization module torch.sagemaker.init(), which receives the configuration dictionary on the backend when you launch the training job, to your training script.

The SMP configuration is as follows:

{
    "tensor_parallel_degree": 8,
    "tensor_parallel_seed": 0
}

In your training script, use the following code:

import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(...)
model = tsm.transform(model)
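
When you launch the training job with the SageMaker Python SDK, the configuration dictionary itself is typically supplied through the estimator's distribution argument rather than hard-coded in the training script. The following launch sketch is illustrative only: the role, instance settings, and framework versions are placeholders, and the exact distribution keys should be confirmed against the SMP v2 documentation.

# Launch sketch (assumed distribution keys; verify against the SMP v2 documentation)
from sagemaker.pytorch import PyTorch

smp_config = {
    "tensor_parallel_degree": 8,
    "tensor_parallel_seed": 0,
}

estimator = PyTorch(
    entry_point="training_script.py",
    role="<your-sagemaker-execution-role>",   # placeholder
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="2.0.1",                # example framework version
    py_version="py310",
    distribution={
        "torch_distributed": {"enabled": True},
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_config}},
    },
)

estimator.fit()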

To learn more about using tensor parallelism in SMP, refer to the tensor parallelism section of our documentation.

Use advanced features to accelerate model training by up to 20%

In addition to enabling distributed training on clusters with hundreds of instances, SMP also offers optimization techniques that can accelerate model training by up to 20%. In this section, we highlight a few of these optimizations. To learn more, refer to the core features section of our documentation.

Hybrid sharding

Sharded data parallelism is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across devices. This smaller memory footprint allows you to fit a larger model into your cluster or increase the batch size. However, sharded data parallelism also increases the communication requirements of your training job, because the sharded model artifacts are frequently gathered from different devices during training. The degree of sharding is therefore an important configuration that trades off memory consumption against communication overhead.

By default, PyTorch FSDP shards model artifacts across all of the accelerator devices in your cluster. Depending on your training job, this method of sharding could increase communication overhead and create a bottleneck. To help with this, the SMP library offers configurable hybrid sharded data parallelism on top of PyTorch FSDP. This feature allows you to set the degree of sharding that is optimal for your training workload. Simply specify the degree of sharding in a configuration JSON object and include it in your SMP training script.

The SMP configuration is as follows:

{ "hybrid_shard_degree": 16 }

To learn more about the advantages of hybrid sharded data parallelism, refer to Near-linear scaling of gigantic-model training on AWS. For more information on implementing hybrid sharding with your existing FSDP training script, see hybrid sharded data parallelism in our documentation.
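For context, open source PyTorch FSDP also exposes a hybrid approach through its sharding_strategy argument, which by default shards model state within each node and replicates it across nodes. The sketch below shows that open source option only, not the SMP hybrid_shard_degree configuration above, which lets you choose an arbitrary sharding degree.

# Open source PyTorch FSDP comparison (not the SMP configuration)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = ...  # your PyTorch model

# HYBRID_SHARD shards parameters, gradients, and optimizer state within each
# node and replicates across nodes, reducing inter-node communication.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)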

Use the SMDDP collective communication operations optimized for AWS infrastructure

You can use the SMP library together with the SageMaker distributed data parallelism (SMDDP) library to accelerate your distributed training workloads. SMDDP includes an optimized AllGather collective communication operation designed for best performance on SageMaker p4d and p4de accelerated instances. In distributed training, collective communication operations are used to synchronize information across GPU workers. AllGather is one of the core collective communication operations typically used in sharded data parallelism to materialize the layer parameters before the forward and backward computation steps. For training jobs that are bottlenecked by communication, faster collective operations can reduce training time and cost with no side effects on convergence.

To use the SMDDP library, you only need to add two lines of code to your training script:

import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp") # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()

In addition to SMP, SMDDP supports open source PyTorch FSDP and DeepSpeed. To learn more about the SMDDP library, see Run distributed training with the SageMaker distributed data parallelism library.
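As an illustration of that standalone support, a plain PyTorch FSDP script (without SMP) can adopt SMDDP with the same backend swap. The following is a minimal sketch, assuming you are running inside an SMDDP-enabled SageMaker training container.

# Sketch: open source PyTorch FSDP with the SMDDP backend (no SMP)
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
dist.init_process_group(backend="smddp")

model = ...  # your PyTorch model
model = FSDP(model)  # FSDP's AllGather calls now run on the SMDDP collectives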

Activation offloading

Typically, the forward pass of model training computes activations at each layer and keeps them in GPU memory until the backward pass for the corresponding layer finishes. These stored activations can consume significant GPU memory during training. Activation offloading is a technique that instead moves these tensors to CPU memory after the forward pass and later fetches them back to the GPU when they are needed. This approach can substantially reduce GPU memory usage during training.

Although PyTorch supports activation offloading, its implementation is inefficient and can leave GPUs idle while activations are fetched back from CPU during the backward pass. This can cause significant performance degradation when using activation offloading.

SMP v2 offers an optimized activation offloading algorithm that can improve training performance. SMP's implementation prefetches activations before they are needed on the GPU, reducing idle time.

Because SMP is built on top of PyTorch's APIs, enabling optimized activation offloading requires just a few lines of code change. Simply add the related configurations (the sm_activation_offloading and activation_loading_horizon parameters) and include them in your training script.

The SMP configuration is as follows:

{
    "activation_loading_horizon": 2,
    "sm_activation_offloading": True
}

In the training script, use the following code:

import torch.sagemaker as tsm
tsm.init()

# Native PyTorch module for activation offloading
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    offload_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(...)

# Activation offloading requires activation checkpointing.
apply_activation_checkpointing(
    model,
    check_fn=checkpoint_tformer_layers_policy,  # user-defined policy selecting which layers to checkpoint
)

model = offload_wrapper(model)

To learn more about the open source PyTorch checkpoint tools for activation offloading, see the checkpoint_wrapper.py script in the PyTorch GitHub repository and Activation Checkpointing in the PyTorch blog post Scaling Multimodal Foundation Models in TorchMultimodal with Pytorch Distributed. To learn more about SMP's optimized implementation of activation offloading, see the activation offloading section of our documentation.

Beyond hybrid sharding, SMDDP, and activation offloading, SMP offers additional optimizations that can accelerate your large model training workload. These include optimized activation checkpointing, delayed parameter initialization, and others. To learn more, refer to the core features section of our documentation.
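To illustrate the general idea behind delayed parameter initialization, open source PyTorch lets you construct a model on the meta device so that no real memory is allocated until parameters are materialized later (for example, when FSDP shards them). The sketch below shows only that open source building block with an example architecture, not SMP's own delayed initialization API.

# Open source illustration of deferred allocation using the meta device
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")   # example architecture
with torch.device("meta"):
    # Parameter shapes are recorded, but no CPU or GPU memory is allocated yet
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta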

Conclusion

As datasets, model sizes, and training clusters continue to grow, efficient distributed training is increasingly important for timely and affordable model and product delivery. The latest release of the SageMaker model parallel library helps you achieve this by reducing code change and aligning with PyTorch FSDP APIs, enabling training on massive clusters via tensor parallelism, and providing optimizations that can reduce training time by up to 20%.

To get started with SMP v2, refer to our documentation and our sample notebooks.


About the Authors

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Luis Quintela is the Software Developer Manager for the AWS SageMaker model parallel library. In his spare time, he can be found riding his Harley in the SF Bay Area.

Gautam Kumar is a Software Engineer with AWS AI Deep Learning. He is passionate about building tools and systems for AI. In his spare time, he enjoys biking and reading books.

Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.
