Speed up your model training with managed tiered checkpointing on Amazon SageMaker HyperPod
As organizations scale their AI infrastructure to support trillion-parameter models, they face a difficult trade-off: reduced training time at a lower cost, or faster recovery at a higher cost. When they checkpoint frequently to speed up recovery and minimize lost training time, they incur significantly higher storage costs. When they checkpoint infrequently, they reduce costs at the risk of losing valuable training progress when failures occur.
This challenge is exacerbated in large distributed training environments with thousands of accelerators, where issues can occur frequently. According to an article released by Meta, one failure occurred every 3 hours during Meta Llama 3 model training. GPU issues accounted for 60% of total failures; network, CPU, and disk issues accounted for the other 40%. With infrequent checkpointing, these accumulated failures can result in losing days of training progress over the course of an entire training run, driving up costs and time to market. Frequent checkpoints, on the other hand, can saturate networks, overload storage, and result in unpredictable performance.
To help solve these challenges, AWS announced managed tiered checkpointing in Amazon SageMaker HyperPod, a purpose-built infrastructure to scale and accelerate generative AI model development across thousands of AI accelerators. Managed tiered checkpointing uses CPU memory for high-performance checkpoint storage, with automatic data replication across adjacent compute nodes for enhanced reliability. Although SageMaker HyperPod identifies node issues automatically and replaces those nodes so your training can resume, managed tiered checkpointing helps you implement the best checkpointing strategy and maximize your training throughput.
Managed tiered checkpointing has been tested on large distributed training clusters ranging from hundreds of GPUs to over 15,000 GPUs, with checkpoints being saved within seconds.
In this post, we dive deep into these concepts and explain how to use the managed tiered checkpointing feature.
Solution overview
Checkpointing is the method of saving an intermediate model state during the training process. By saving the model's parameters, optimizer states, and other metadata, you can resume training from a recent checkpoint in the event of an issue. Additionally, you can resolve training problems, such as irregular learning rates, without a full restart by loading an earlier checkpoint state.
Use the following formula to find a rough initial estimate of the total checkpoint size for your model, without the optimizer state:

Model checkpoint size (GB) = (Number of parameters × Bytes per parameter) ÷ 1024³ bytes

For example, if you train a Meta Llama 3 70-billion-parameter model using BFloat16 as the parameter precision, the checkpoint size will be 130 GB. If you train a DeepSeek-R1 671-billion-parameter model using BFloat16, the checkpoint size will be 1.25 TB. All without storing optimizer states.

Checkpoints also include optimizer states, training metadata (such as the step number), and other additional data, resulting in a larger than expected size. When using an Adam optimizer, the optimizer saves three additional float16 statistics per parameter, adding an extra 6 bytes per parameter. With the optimizer state saved, the Meta Llama 3 70B model checkpoint will therefore be roughly 521 GB, and the DeepSeek-R1 671B model checkpoint roughly 5 TB. That is a four-times increase in size, and handling these checkpoints becomes a challenge.
The following table summarizes the checkpoint sizes for each model.
| Model name | Checkpoint size | Checkpoint size + optimizer states |
| Meta Llama 3 70B | 130 GB | 521 GB |
| DeepSeek-R1 671B | 1.25 TB | 5 TB |
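As a quick sanity check, the sizes above can be reproduced with a few lines of standalone Python, using the byte counts discussed in this post (2 bytes per BFloat16 parameter, plus roughly 6 bytes per parameter of Adam optimizer state):

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough checkpoint size in GB, ignoring metadata overhead."""
    return num_params * bytes_per_param / 1024**3

# Meta Llama 3 70B in BFloat16 (2 bytes per parameter): ~130 GB
llama = checkpoint_size_gb(70e9, 2)

# Adam adds three float16 statistics per parameter (~6 extra bytes): ~521 GB
llama_opt = checkpoint_size_gb(70e9, 2 + 6)

# DeepSeek-R1 671B in BFloat16: ~1,250 GB (about 1.25 TB)
deepseek = checkpoint_size_gb(671e9, 2)

# With Adam optimizer state: ~5,000 GB (about 5 TB)
deepseek_opt = checkpoint_size_gb(671e9, 2 + 6)

print(f"Llama 3 70B: {llama:.0f} GB weights, {llama_opt:.0f} GB with optimizer")
print(f"DeepSeek-R1 671B: {deepseek:.0f} GB weights, {deepseek_opt:.0f} GB with optimizer")
```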
It's also important to consider the training strategy. In a Fully Sharded Data Parallel (FSDP) scenario, each rank (a single GPU process in a distributed training job) saves its own part of the checkpoint. This reduces the amount of data each rank has to save during a checkpoint, but it puts pressure on the file system. On a Network File System (NFS) shared file system, these concurrent writes become a bottleneck. Using a distributed file system, such as Amazon FSx for Lustre, can help alleviate that pressure at a higher total cost. In a Distributed Data Parallel (DDP) scenario, a single rank writes the whole checkpoint at once, and all ranks read the checkpoint when loading it back. At the file system level, this means a single writer and multiple readers. On an NFS file system, many readers can be a problem because they are constrained by the file system, network stack, and queue size, while a single writer over the network will not take advantage of all the available network throughput. Here again, a fast, distributed file system like FSx for Lustre can help solve these problems at a higher total cost of ownership.
As we can see, traditional checkpointing methods that rely solely on remote persistent storage create computational overhead during checkpoint creation, because writing terabytes of model parameters to persistent storage can throttle it, consume expensive network bandwidth, and require complex orchestration across distributed systems. By storing checkpoints in fast-access in-memory locations, such as CPU RAM, while maintaining configurable backups to Amazon Simple Storage Service (Amazon S3) for persistence, the system delivers faster recovery times and is a cost-effective solution compared to traditional disk-based approaches.
Managed tiered checkpointing works as follows:
- When training your model, you define the checkpoint frequency.
- Model training uses GPU HBM memory to store the model, its parameters, and intermediate results, and to do the heavy computation.
- Triggering a checkpoint pauses model training. The GPU converts the model weights (tensors) into a state dictionary and copies the data to the instance's CPU; training then resumes while managed tiered checkpointing copies the data to RAM.
- Because RAM is volatile, managed tiered checkpointing asynchronously copies the data from host RAM to adjacent nodes using RDMA over Elastic Fabric Adapter (EFA). If a node experiences an issue, its checkpoint data is available on other nodes too.
- Periodically, it copies the data to a second layer of persistent storage, such as Amazon S3. This helps both when writing to RAM fails and when you want to persistently store the checkpoint data for future use.
With managed tiered checkpointing, you can configure frequency and retention policies for both the in-memory and persistent storage tiers. You use the first layer (in-memory) to save checkpoints at a high frequency and for fast recovery, periodically saving to Amazon S3 for backup. Managed tiered checkpointing provides a file system that integrates seamlessly with PyTorch Distributed Checkpointing (DCP); adding it to your training script only requires a few lines of code. It improves checkpoint performance by using in-memory storage while using other tiers for persistence.

PyTorch DCP solves the problem of saving a model's checkpoint when it uses distributed resources, such as multiple GPUs across multiple compute nodes. Trainers, parameters, and the dataset are partitioned across these nodes and resources, and PyTorch DCP saves and loads from multiple ranks in parallel. PyTorch DCP produces multiple files per checkpoint, at least one per rank. Depending on the number and size of these files, shared and network file systems such as NFS struggle with inode and metadata management. Managed tiered checkpointing helps solve that issue by making it possible to use multiple tiers, reducing intrusion into training time while retaining the benefits of PyTorch DCP, such as deduplication of checkpoint data.
With managed tiered checkpointing in SageMaker HyperPod, you can maintain high training throughput even in large-scale environments prone to failures. It uses your existing SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and its compute nodes, and there are no additional costs to use the library.
In the following sections, we explore how to configure the SageMaker HyperPod cluster and training scripts to use this new feature.
Configure your SageMaker HyperPod cluster for managed tiered checkpointing
SageMaker HyperPod provisions resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). By reducing the complex work of building and maintaining compute clusters that use accelerators like AWS Trainium and NVIDIA H200/B200 GPUs, it speeds up the creation of foundation models. To create a new SageMaker HyperPod cluster, refer to the Amazon SageMaker HyperPod Developer Guide. If you want to accelerate your deployment by using field-hardened assets, refer to the following GitHub repo.
The examples shared in this post are intended to help you learn more about this new feature. If you're considering running these examples in a production environment, have your security team review the content and make sure it adheres to your security standards. At AWS, security is the top priority, and we understand that every customer has their own security framework.

Before creating or updating a cluster to add the managed tiered checkpointing feature, you must set up the EKS pods to access an S3 bucket, either in your own account or across accounts. When working with buckets in the same account as the SageMaker HyperPod EKS cluster, you can use the following policy (change the bucket name before applying it):
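For example, an identity-based policy of the following shape grants read and write access to the checkpoint bucket. This is an illustrative sketch: `amzn-s3-demo-bucket` is a placeholder bucket name, and your security team may scope the actions more narrowly.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```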
If the bucket is in a different account, you must authorize an AWS Identity and Access Management (IAM) principal to access it. The following IAM policy will do that for you. Be sure to change both the bucket name and the IAM principal (for example, your AWS account ID).
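A bucket policy of the following shape, attached to the bucket in the other account, is one way to do this. Again, this is an illustrative sketch: the account ID `111122223333` and bucket name `amzn-s3-demo-bucket` are placeholders to replace with your own values.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```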
To create a new cluster with managed tiered checkpointing, pass the --tiered-storage-config parameter with Mode set to Enable in an AWS Command Line Interface (AWS CLI) command:
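A sketch of the call follows. The cluster name and instance-group file are placeholders, and the exact JSON shape of --tiered-storage-config should be verified against the AWS CLI reference; only the parameter name and the Mode setting come from the description above.

```shell
aws sagemaker create-cluster \
    --cluster-name ml-cluster \
    --instance-groups file://instance-groups.json \
    --tiered-storage-config '{"Mode": "Enable"}'
```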
You can also update an existing cluster using the UpdateCluster API, passing the CachingConfig parameter with the required AllocatedMemory configuration. You can use the CachingConfiguration parameter to define either a fixed value or a percentage of CPU RAM for checkpointing.
Now that your SageMaker HyperPod cluster has the managed tiered checkpointing feature, let's prepare the training scripts.
Install the managed tiered checkpointing libraries and integrate them with your training script
Managed tiered checkpointing integrates with PyTorch DCP. You start by installing the sagemaker-checkpointing library. Then you create and configure a namespace to store the checkpoints based on the defined frequency. Finally, you add the checkpoint function inside your training loop.
To install the library, simply use Python's pip. Make sure you already have the dependencies installed: Python 3.10 or higher, PyTorch with DCP support, and AWS credentials configured correctly. To integrate Amazon S3 as an additional storage layer, you also need s3torchconnector installed.
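The package name below matches the one removed in the cleanup section of this post; s3torchconnector is the additional dependency for the Amazon S3 tier:

```shell
pip install amzn-sagemaker-checkpointing s3torchconnector
```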
Now you can import the library in your script and configure the namespace and frequency for checkpointing:
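A configuration sketch follows. The import path, class name, and the S3-related parameter names (s3_tier_base_path, save_to_s3_frequency) are illustrative assumptions based on the behavior described in this post, not a definitive API; refer to the library documentation for the exact signature.

```python
import os

# Assumed import path and class name; verify against the library documentation
from amzn_sagemaker_checkpointing.config.sagemaker_checkpoint_config import (
    SageMakerCheckpointConfig,
)

checkpoint_config = SageMakerCheckpointConfig(
    # Required: a unique name for this training run's checkpoints
    namespace="llama3-70b-pretraining",
    # Required: total number of ranks (GPUs) in the cluster
    world_size=int(os.environ["WORLD_SIZE"]),
    # Hypothetical parameter names: S3 tier for persistent backup,
    # written every 100 training steps
    s3_tier_base_path="s3://amzn-s3-demo-bucket/checkpoints",
    save_to_s3_frequency=100,
)
```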
In the preceding code snippet, we configured managed tiered checkpointing with the same world_size as the number of ranks in our cluster. When you start a distributed training job, each GPU in the cluster is assigned a rank number, and the total number of available GPUs is the world_size. We set up Amazon S3 as our backup persistent storage, configuring managed tiered checkpointing to store data in Amazon S3 every 100 training steps. Both world_size and namespace are required parameters; the others are optional.
Now that the configuration is ready, let's set up PyTorch DCP and integrate managed tiered checkpointing.
First, configure the storage writer. This component is passed to the PyTorch DCP async_save function alongside the model's state dictionary. We use the SageMakerTieredStorageWriter when writing checkpoints and the SageMakerTieredStorageReader when restoring from them.
Inside your model training loop, add the storage writer configuration, passing both the managed tiered checkpointing configuration and the step number:
You can define the step number explicitly for the storage writer, or you can let it infer the step number from the path where the checkpoint is saved. If you want the storage writer to infer the step number from the base path, don't set the step parameter and make sure your path contains the step number.
Now you can call the PyTorch DCP asynchronous save function, passing the state dictionary and the storage writer configuration:

async_save(state_dict=state_dict, storage_writer=storage_writer)
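Put together, the save path of a training loop might look like the following sketch. The SageMakerTieredStorageWriter class name and its checkpoint configuration and step arguments come from the description above, but the import path, constructor keyword names, and the helpers train_step, dataloader, checkpoint_config, and checkpoint_frequency are illustrative assumptions.

```python
import torch.distributed.checkpoint as dcp

# Assumed import path for the writer; verify against the installed library
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
)

for step, batch in enumerate(dataloader):
    train_step(model, optimizer, batch)  # your forward/backward/optimizer logic

    if step % checkpoint_frequency == 0:
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        }
        # The writer receives the managed tiered checkpointing
        # configuration and the current step number
        storage_writer = SageMakerTieredStorageWriter(
            checkpoint_config=checkpoint_config,
            step=step,
        )
        # Asynchronous save: training resumes while the checkpoint
        # is copied to the in-memory tier (and periodically to S3)
        dcp.async_save(state_dict=state_dict, storage_writer=storage_writer)
```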
We have now set up managed tiered checkpointing to write checkpoints at our desired frequency and location (in-memory). Let's use the storage reader to restore those checkpoints. First, pass the managed tiered checkpointing configuration to the SageMakerTieredStorageReader, then call the PyTorch DCP load function, passing the model state dictionary and the storage reader configuration:
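Restoring follows the same pattern in reverse. As before, the SageMakerTieredStorageReader class name comes from the text, while the import path and keyword names are illustrative assumptions to check against the library documentation.

```python
import torch.distributed.checkpoint as dcp

# Assumed import path for the reader; verify against the installed library
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageReader,
)

# Build a state dictionary with the same structure used when saving
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

# The reader locates the checkpoint across the in-memory and S3 tiers
storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
dcp.load(state_dict=state_dict, storage_reader=storage_reader)

# Load the restored weights back into the model
model.load_state_dict(state_dict["model"])
```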
To work through a complete example, refer to the following GitHub repository, where we've created a simple training script that includes the managed tiered checkpointing feature.
Clean up
After you have finished working with managed tiered checkpointing and want to clean up the environment, simply remove the amzn-sagemaker-checkpointing library by running pip uninstall amzn-sagemaker-checkpointing.
If you installed the solution in a Python virtual environment, deleting the virtual environment will suffice. Managed tiered checkpointing is a free feature that doesn't require additional resources to run; you use your existing SageMaker HyperPod EKS cluster and compute nodes.
Best practices to optimize your checkpoint strategy with managed tiered checkpointing
Managed tiered checkpointing attempts to write to the in-memory tier first. This optimizes write performance, because in-memory storage provides ultra-low-latency checkpoint access. You should configure managed tiered checkpointing to write to a second layer, such as Amazon S3, from time to time. For example, configure it to write to the in-memory layer every 10 steps and to Amazon S3 every 100 steps.
If managed tiered checkpointing fails to write to the in-memory layer and the node experiences an issue, you still have your checkpoint saved in Amazon S3. When writing to Amazon S3, managed tiered checkpointing uses multiple TCP streams (chunks) to optimize S3 writes.
In terms of consistency, managed tiered checkpointing uses an all-or-nothing write strategy. It implements a fallback mechanism that seamlessly transitions between the storage tiers. Checkpoint metadata, such as the step number, is saved alongside the data for every tier.
When troubleshooting managed tiered checkpointing, you can check the log written locally to /var/log/sagemaker_checkpointing/{namespace}_checkpointing.log. It publishes data about the training step, rank number, and operation details.
Managed tiered checkpointing also writes these metrics to the console, making it easy to troubleshoot during development. They contain information about which step number is being written to which storage layer, along with the throughput and total time taken to write the data. With that information, you can fully monitor and troubleshoot managed tiered checkpointing.
When you combine these tools with the SageMaker HyperPod observability stack, you get a complete view of all metrics for your training or inference workload.
Conclusion
The new managed tiered checkpointing feature in SageMaker HyperPod improves FM training efficiency by intelligently distributing checkpoints across multiple storage tiers. This approach places model states in fast-access locations such as CPU RAM, while using persistent storage such as Amazon S3 for cost-effective, long-term persistence. At the time of this launch, managed tiered checkpointing is supported only on SageMaker HyperPod on Amazon EKS.
Managed tiered checkpointing delivers fast recovery times without increased storage costs, avoiding complex trade-offs between resiliency, training efficiency, and storage costs. It has been validated on large distributed training clusters ranging from hundreds of GPUs to more than 15,000 GPUs, with checkpoints being saved within seconds.
Integrating managed tiered checkpointing into your training scripts is straightforward, requiring just a few lines of code, and provides immediate access to sophisticated checkpoint management without requiring deep engineering expertise.
For more information on how managed tiered checkpointing works, how to set it up, and other details, refer to HyperPod managed tier checkpointing.
About the authors
Paulo Aragao is a Principal Worldwide Solutions Architect focused on generative AI in the Specialist Organization at AWS. He helps enterprises and startups build their foundation model strategy and innovate faster by leveraging his extensive knowledge of high performance computing and machine learning. A long-time bass player and natural-born rock fan, Paulo enjoys spending time traveling with his family, scuba diving, and playing real-time strategy and role-playing games.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker HyperPod as the best-in-class choice for generative AI model training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Mandar Kulkarni is a Software Development Engineer II at AWS, where he works on Amazon SageMaker. He specializes in building scalable and performant machine learning libraries and infrastructure solutions, particularly focusing on SageMaker HyperPod. His technical interests span machine learning, artificial intelligence, distributed systems, and application security. When not architecting ML solutions, Mandar enjoys hiking, practicing Indian classical music, sports, and spending quality time with his young family.
Vinay Devadiga is a Software Development Engineer II at AWS with a deep passion for artificial intelligence and cloud computing. He focuses on building scalable, high-performance systems that enable the power of AI and machine learning to solve complex problems. Vinay enjoys staying at the forefront of technology, continuously learning and applying new advances to drive innovation. Outside of work, he likes playing sports and spending quality time with his family.
Vivek Maran is a Software Engineer at AWS. He currently works on the development of Amazon SageMaker HyperPod, a resilient platform for large-scale distributed training and inference. His interests include large-scale distributed systems, network systems, and artificial intelligence. Outside of work, he enjoys music, running, and keeping up to date with business and technology trends.