Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery


Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays.

In this post, we introduce checkpointless training on Amazon SageMaker HyperPod, a paradigm shift in model training that reduces the need for traditional checkpointing by enabling peer-to-peer state recovery. Results from production-scale validation show an 80–93% reduction in recovery time (from 15–30 minutes or more to under 2 minutes) and up to 95% training goodput on clusters with thousands of AI accelerators.

Understanding goodput

Foundation model training is among the most resource-intensive processes in AI, often involving millions of dollars in compute spend across thousands of AI accelerators running for days to months. Because of the inherent all-or-nothing distributed synchrony across all ranks, the loss of even a single rank due to a software or hardware fault brings the training workload to a complete halt. To mitigate such localized faults, the industry has relied on checkpoint-based recovery: periodically saving training state (checkpoints) to a durable store at a user-defined checkpoint interval. When a fault occurs, the training workload resumes by restoring from the most recent saved checkpoint. This traditional restart-to-recover model has become increasingly untenable as model sizes grow from billions to trillions of parameters and training workloads expand from hundreds to thousands of AI accelerators.

This challenge of sustaining efficient training operations at scale has led to the concept of goodput: the actual useful work achieved by an AI training system compared to its theoretical maximum capacity. In foundation model training, goodput is impacted by system failures and recovery overhead. The gap between the system's theoretical maximum throughput and its actual productive output (goodput) grows larger with increased frequency of failures (which rises with cluster size), longer recovery times (which scale with model size and cluster size), and higher costs of idle resources during recovery. This definition helps frame why measuring and optimizing goodput becomes increasingly critical as AI training scales to larger clusters and more complex models, where even small inefficiencies can result in significant financial and time costs.

Consider a pre-training workload on a HyperPod cluster with 256 P5 instances, checkpointing every 20 minutes. Each disruption imposes two costs: 10 minutes of lost work plus 10 minutes for recovery. With ml.p5.24xlarge instances costing $55 per hour, each disruption costs $4,693 in compute time. For a month-long training run, daily disruptions would accumulate to $141,000 in additional costs and delay completion by 10 hours.
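The arithmetic behind these numbers is straightforward. The short sketch below reproduces it using the figures from the example above; the instance pricing and one-disruption-per-day pattern are taken from that example, not general guidance.

# Reproduces the disruption-cost arithmetic from the example above.
instances = 256                 # P5 instances in the cluster
price_per_hour = 55             # USD per instance-hour (example rate)
lost_minutes = 10 + 10          # lost work since last checkpoint + recovery time

cost_per_disruption = instances * price_per_hour * lost_minutes / 60
print(f"${cost_per_disruption:,.0f} per disruption")        # ~ $4,693

days = 30                       # one disruption per day for a month-long run
print(f"${cost_per_disruption * days:,.0f} per month")      # ~ $141,000 (rounded)
print(f"{days * lost_minutes / 60:.0f} hours of added delay")  # 10 hours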

As cluster sizes grow, the probability and frequency of failures can increase.


As training spans thousands of nodes, disruptions caused by faults become increasingly frequent. Meanwhile, recovery becomes slower because the workload reinitialization overhead grows linearly with cluster size. The cumulative impact of large-scale AI training failures can reach millions of dollars annually and translates directly into delayed time-to-market, slower model iteration cycles, and competitive disadvantage. Every hour of idle GPU time is an hour not spent advancing model capabilities.

Checkpoint-based recovery

Checkpoint-based recovery in distributed training is far more complex and time-consuming than commonly understood. When a failure occurs in traditional distributed training, the restart process involves far more than loading the last checkpoint. Understanding what happens during recovery reveals why it takes so long and why the entire cluster must sit idle.

The all-or-nothing cascade

A single failure, whether a GPU error, a network timeout, or a hardware fault, can trigger a complete training cluster shutdown. Because distributed training treats all processes as tightly coupled, any single failure necessitates a complete restart. When any process fails, the orchestration system (for example, TorchElastic or Kubernetes) must terminate every process across the job and restart from scratch. Each restart requires navigating a complex, multi-stage recovery process where every stage is sequential and blocking:

  • Stage 1: Training job restart – The training job orchestrator detects a failure and terminates all processes on all nodes, followed by a cluster-wide restart of the training job.
  • Stage 2: Process and network initialization – Every process must re-execute the training script from the beginning. That includes rank initialization, loading Python modules from a durable store such as Network File System (NFS) or object storage, and establishing the training topology and communication backend through peer discovery and process group creation. Process group initialization alone can take tens of minutes on large clusters.
  • Stage 3: Checkpoint retrieval – Each process must first identify the last fully saved checkpoint, then retrieve it from persistent storage (for example, NFS or object storage) and load multiple state dictionaries: the model's parameters and buffers, the optimizer's internal state (momentum, variance, and so on), the learning rate scheduler, and training loop metadata (epoch, batch number). This step can take tens of minutes or longer depending on cluster and model size.
  • Stage 4: Data loader initialization – The data-loading ranks have the additional responsibility of initializing the data buffers. That includes retrieving the data checkpoint from durable storage such as Amazon FSx or Amazon Simple Storage Service (Amazon S3) and prefetching the training data to start the training loop. Data checkpointing is an important step to avoid processing the same data samples multiple times or skipping samples after a training disruption. Depending on the data mix strategy, data locality, and bandwidth, this process can take a few minutes.
  • Stage 5: First step overhead – After the checkpoint and training data are retrieved and loaded, there is additional overhead to run the first training step, which we call first step overhead (FSO). During this first step, time is typically spent on memory allocation, creating and setting up the CUDA context for communication with GPUs, CUDA graph compilation, and so on.
  • Stage 6: Lost steps overhead – Only after all earlier stages complete successfully can the training loop resume regular progress. Because training resumes from the last saved model checkpoint, all steps computed between the checkpoint and the fault are lost. These lost steps must be recomputed, which we call lost steps overhead (LSO). Following the recomputation phase, the training job resumes productive work that directly contributes to goodput.
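Taken together, these stages add up to the total downtime per failure. The sketch below sums illustrative per-stage durations to show why checkpoint-based recovery can stretch to tens of minutes on large clusters; the numbers are assumptions for illustration, not measurements.

# Illustrative sketch: total downtime per failure under checkpoint-based recovery.
# Per-stage durations are assumed for a large cluster, not measured values.
stage_minutes = {
    "job restart":                  2,   # detect fault, terminate, reschedule
    "process/network init":        10,   # rank init, process group creation
    "checkpoint retrieval":        10,   # pull and load state dicts from storage
    "data loader init":             3,   # restore data checkpoint, prefetch
    "first step overhead (FSO)":    2,   # CUDA context setup, graph compilation
    "lost steps overhead (LSO)":   10,   # recompute steps since last checkpoint
}

total = sum(stage_minutes.values())
print(f"estimated downtime per failure: {total} minutes")
for stage, minutes in stage_minutes.items():
    print(f"  {stage:<28} {minutes:>3} min")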

How checkpointless training eliminates these bottlenecks

The six stages outlined above (job termination and restart, process and network initialization, checkpoint retrieval, data loader initialization, first step overhead, and lost steps recomputation) represent the fundamental bottlenecks in checkpoint-based recovery. Each stage is sequential and blocking, and training recovery can take minutes to several hours for large models. Critically, the entire cluster must wait for every stage to complete before training can resume.

Checkpointless training eliminates this cascade. It preserves model state coherence across the distributed cluster, eliminating the need for periodic snapshots. When failures occur, the system quickly recovers by using healthy peers, avoiding both the storage I/O operations and the full process restarts typically required by traditional checkpointing approaches.

Checkpointless training architecture


Checkpointless training is built on five components that work together to eliminate the traditional checkpoint-restart bottlenecks. Each component addresses a specific bottleneck in the recovery process, and together they enable automated detection and recovery of infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators.

Component 1: TCPStore-less/root-less NCCL and Gloo initialization (optimizing stage 2)

In a typical distributed training setup (for example, using torch.distributed), all ranks must initialize a process group. The process group creates a communication layer, allowing all processes (ranks) to become aware of one another and exchange information. A TCPStore is commonly used as a rendezvous point where all ranks check in to discover one another's connection information. When thousands of ranks try to contact a designated root server (typically rank 0) concurrently, it becomes a bottleneck. The flood of simultaneous network requests to a single root server can cause network congestion, increase latency by tens of minutes, and further slow the communication process.

Checkpointless training eliminates this centralized dependency. Instead of funneling all connection requests through a single root server, the system uses a symmetric address pattern where each rank independently computes peer connection information using a global group counter. Ranks connect directly to one another using predetermined port assignments, avoiding the TCPStore bottleneck. Process group initialization drops from tens of minutes to seconds, even on clusters with thousands of nodes. The system also eliminates the single point of failure inherent in root-based initialization.
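To illustrate the idea (not the actual implementation), the following sketch shows how a rank could derive a peer's rendezvous endpoint deterministically from cluster metadata and a global group counter, so no central TCPStore round trip is needed. The helper names, port scheme, and NODE_HOSTS variable are assumptions for illustration.

# Illustrative sketch only: deterministic, root-less peer addressing.
# The node list, base port, and group counter source are assumptions.
import os

BASE_PORT = 29500          # assumed base port range reserved for rendezvous
PORTS_PER_GROUP = 8        # assumed number of ports reserved per process group

def peer_address(peer_rank: int, group_counter: int, node_hosts: list[str],
                 procs_per_node: int = 8) -> tuple[str, int]:
    """Compute a peer's (host, port) without contacting a central root.

    Every rank runs the same arithmetic, so all ranks agree on the same
    endpoints for a given group counter without any network round trips.
    """
    host = node_hosts[peer_rank // procs_per_node]
    local_rank = peer_rank % procs_per_node
    port = BASE_PORT + (group_counter * PORTS_PER_GROUP) + local_rank
    return host, port

# Example: locate rank 9 for the third process group on a two-node cluster
hosts = os.environ.get("NODE_HOSTS", "node-0,node-1").split(",")
print(peer_address(peer_rank=9, group_counter=2, node_hosts=hosts))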

Component 2: Memory-mapped data loading (optimizing stage 4)

One of the hidden costs in traditional recovery is reloading training data. When a process restarts, it must reload batches from disk, rebuild data loader state, and carefully position itself to avoid processing duplicate samples or skipping data. On large-scale training runs, this data loading can add minutes to every recovery cycle.

Checkpointless training uses memory-mapped data loading to maintain cached data across accelerators. Training data is mapped into shared memory regions that persist even when individual processes fail. When a node recovers, it doesn't reload data from disk; it reconnects to the existing memory-mapped cache. The data loader state is preserved, helping to ensure that training continues from the correct position without duplicate or skipped samples. MMAP also reduces host CPU memory usage by maintaining just one copy of the data per node (compared to eight copies with traditional data loaders on 8-GPU nodes), and training can resume immediately using cached batches while the data loader concurrently prefetches the next data in the background.
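As a rough illustration of the mechanism (not the MMAPDataModule internals), the sketch below memory-maps a per-node cache file under /dev/shm so a restarted worker can reattach to the same bytes instead of re-reading and re-sharding the dataset. The path, cache size, and layout are assumptions.

# Illustrative sketch: a per-node batch cache backed by a memory-mapped file.
# The /dev/shm path and cache size are assumptions for illustration.
import mmap
import os

CACHE_PATH = "/dev/shm/train_batch_cache.bin"   # assumed shared-memory location
CACHE_BYTES = 64 * 1024 * 1024                  # assumed cache size per node

def attach_cache() -> mmap.mmap:
    """Create the cache on first use, or reattach to it after a process restart."""
    first_use = not os.path.exists(CACHE_PATH)
    fd = os.open(CACHE_PATH, os.O_CREAT | os.O_RDWR)
    if first_use:
        os.ftruncate(fd, CACHE_BYTES)           # size the region once per node
    return mmap.mmap(fd, CACHE_BYTES)           # pages outlive the worker process

cache = attach_cache()
cache[0:5] = b"hello"                           # producer writes tokenized batches
# After a crash, the restarted worker calls attach_cache() and reads the same
# bytes back instead of re-downloading and re-sharding the dataset.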

Memory-mapped data loading workflow


Component 3: In-process recovery (optimizing stages 1, 2, and 5)

Traditional checkpoint-based recovery treats failures as job-level events: a single GPU error triggers termination of the entire distributed training job. Every process across the cluster must be killed and restarted, even though only one component failed.

Checkpointless training uses in-process recovery to isolate failures at the process level. When a GPU or process fails, only the failed process executes an in-process recovery to rejoin the training loop within seconds, overcoming recoverable or transient errors. Healthy processes continue running without interruption. The failed process stays alive (avoiding full process teardown), preserving the CUDA context, compiler cache, and GPU state, and thereby eliminating minutes of reinitialization overhead. In cases where the error is non-recoverable (such as a hardware failure), the system automatically swaps the faulty component with a pre-warmed hot spare, enabling training to continue without disruption.

This eliminates the need for full cluster termination and restart, dramatically reducing recovery overhead.
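Conceptually, in-process recovery wraps the training loop in a restart boundary inside the still-running process, as in the sketch below. The exception type, health check, and resync call are placeholders for illustration, not the HPWrapper implementation.

# Illustrative sketch: keep the process (and its CUDA context) alive and
# retry the training loop after a recoverable fault. Names are placeholders.
import time

class RecoverableFault(RuntimeError):
    """Stands in for transient errors such as a collective-communication timeout."""

def gpu_healthy() -> bool:
    return True          # placeholder for a CUDA/driver health probe

def resync_with_peers() -> int:
    return 0             # placeholder: learn the step to resume from healthy peers

def train_from(step: int) -> None:
    ...                  # the user's training loop body

def run_with_inprocess_recovery(max_retries: int = 3) -> None:
    step = 0
    for _attempt in range(max_retries + 1):
        try:
            train_from(step)
            return                           # training finished normally
        except RecoverableFault:
            if not gpu_healthy():
                raise                        # escalate: needs a process or node swap
            time.sleep(1.0)                  # brief backoff; the process stays alive
            step = resync_with_peers()       # rejoin the collective at a safe step
    raise RuntimeError("exhausted in-process recovery attempts")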

Component 4: Peer-to-peer state replication (optimizing stages 3 and 6)

Checkpoint-based recovery requires loading model and optimizer state from persistent storage (such as Amazon S3 or FSx for Lustre). For models with billions to trillions of parameters, this means transferring tens to hundreds of gigabytes over the network, deserializing state dictionaries, and reconstructing optimizer buffers, which can take tens of minutes and create a massive I/O bottleneck.

The most significant innovation in checkpointless training is continuous peer-to-peer state replication. Instead of periodically saving model state to centralized storage, each GPU maintains redundant copies of its model shards on peer GPUs. When a failure occurs, the recovering process doesn't load from Amazon S3. It copies state directly from a healthy peer over the high-speed Elastic Fabric Adapter (EFA) network interconnect. This peer-to-peer architecture eliminates the I/O bottleneck that dominates traditional checkpoint recovery. State transfer happens in seconds, compared to minutes for loading multi-gigabyte checkpoints from storage. The recovering node pulls only the specific shards it needs, further reducing transfer time.
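The sketch below shows the general shape of this recovery path using torch.distributed point-to-point calls: a healthy replica streams its shard of the model state to the recovering rank instead of that rank reading a checkpoint from storage. The replica-pairing scheme and the parameter-by-parameter transfer are simplifying assumptions, not the production implementation.

# Illustrative sketch: recover a model shard from a healthy peer replica
# instead of reading a checkpoint from object storage. Pairing logic is assumed.
import torch
import torch.distributed as dist

def replica_peer(rank: int, replica_group_size: int) -> int:
    """Assume each rank is paired with the rank holding the same shard
    in the neighboring replica group."""
    group = rank // replica_group_size
    offset = rank % replica_group_size
    peer_group = group ^ 1                      # partner replica group
    return peer_group * replica_group_size + offset

def recover_from_peer(model: torch.nn.Module, recovering: bool,
                      replica_group_size: int) -> None:
    """Called by the recovering rank and its healthy partner only."""
    peer = replica_peer(dist.get_rank(), replica_group_size)
    for param in model.parameters():
        if recovering:
            dist.recv(param.data, src=peer)     # pull the shard over EFA/NCCL
        else:
            dist.send(param.data, dst=peer)     # healthy peer streams its copy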

Component 5: SageMaker HyperPod training operator (optimizing all stages)

The SageMaker HyperPod training operator orchestrates the checkpointless training components, serving as the coordination layer that ties together initialization, data loading, checkpointless recovery, and checkpoint fallback mechanisms. It maintains a centralized control plane with a global view of training process health across the entire cluster, coordinating fault detection, recovery decisions, and cluster-wide synchronization.

The operator implements intelligent recovery escalation: it first attempts an in-process restart for failed components, and if that's not possible (for example, because of container crashes or node failures), it escalates to process-level recovery. During a process-level recovery, instead of restarting the entire job, the operator restarts only the training processes, keeping the containers alive. As a result, recovery is faster than a job-level restart, which requires tearing down and recreating the training infrastructure, involving pod rescheduling, container pulls, environment initialization, and reloading from checkpoints. When failures occur, the operator broadcasts coordinated stop signals to prevent cascading timeouts and integrates with the SageMaker HyperPod health-monitoring agent to automatically detect hardware issues and trigger recovery without manual intervention.
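The escalation policy can be summarized with the following sketch. The fault categories and decision order mirror the description above; the enum and function names are illustrative, not the operator's API.

# Illustrative sketch of the operator's recovery escalation order.
# Fault classes and handler descriptions are placeholders for the real control plane.
from enum import Enum, auto

class Fault(Enum):
    TRANSIENT = auto()        # e.g., a recoverable CUDA or communication error
    PROCESS_CRASH = auto()    # training process died, container still healthy
    NODE_FAILURE = auto()     # hardware fault reported by health monitoring

def recover(fault: Fault) -> str:
    if fault is Fault.TRANSIENT:
        return "in-process restart (process and CUDA context stay alive)"
    if fault is Fault.PROCESS_CRASH:
        return "process-level restart (containers and pods stay alive)"
    # Non-recoverable hardware fault: swap in a pre-warmed hot spare node,
    # then let the replacement pull state from healthy peers.
    return "replace node with hot spare, then peer-to-peer state recovery"

for f in Fault:
    print(f.name, "->", recover(f))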

Getting started with checkpointless training

This section guides you through setting up and configuring checkpointless training on SageMaker HyperPod to reduce fault recovery from hours to minutes.

Prerequisites

Before integrating checkpointless training into your training workload, verify that your environment meets the following requirements:

Infrastructure requirements:

Software requirements:

  • Supported frameworks: NeMo, PyTorch, PyTorch Lightning
  • Training data formats: JSON, JSONGZ (compressed JSON), or ARROW
  • Amazon Elastic Container Registry (Amazon ECR) repository for container images. Use the HyperPod checkpointless training container, which is required for rootless NCCL initialization (Tier 1) and peer-to-peer checkpointless recovery (Tier 4):
658645717510.dkr.ecr.<region>.amazonaws.com/sagemaker-hyperpod/pytorch-training:2.3.0-checkpointless

Checkpointless training workflow

Checkpointless training is designed for incremental adoption. You can start with basic capabilities and progressively enable advanced features as your training scales. The integration is organized into four tiers, each building on the previous one:

Tier 1: NCCL initialization optimization

NCCL initialization optimization eliminates the centralized root process bottleneck during initialization. Nodes discover and connect to peers independently using infrastructure signals. This enables faster process group initialization (seconds instead of minutes) and eliminates the single point of failure during startup.

Integration steps: Set the following environment variables in the job specification and verify that the job runs with the checkpointless training container.

# kubernetes job spec
env:
  - name: HPCT_USE_CONN_DATA # Enable rootless initialization
    value: "1"
  - name: TORCH_SKIP_TCPSTORE # Enable TCPStore removal
    value: "1"

Tier 2: Memory-mapped data loading

Memory-mapped data loading keeps training data cached in shared memory across process restarts, eliminating data reload overhead during recovery. This enables instant data access during recovery, with no need to reload or re-shuffle data when a process restarts.

Integration steps: Wrap the existing data loader with a memory-mapped cache.

from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

base_data_module = MY_DATA_MODULE(...)  # Customer's own data module

mmap_config = CacheResumeMMAPConfig(
    cache_dir=self.cfg.mmap.cache_dir,
)

mmap_dm = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=mmap_config,
)

Tier 3: In-process recovery

In-process recovery isolates failures to individual processes instead of requiring full job restarts. Failed processes recover independently while healthy processes continue training. This enables sub-minute recovery from process-level failures: healthy processes stay alive while failed processes recover on their own.

Integration steps:

from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.inprocess.wrap import HPCallWrapper, HPWrapper
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory

@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
)
def re_executable_codeblock():
    # The re-executable code block defined by the user, usually the main
    # function or the training loop
    ...

Tier 4: Checkpointless (peer-to-peer) recovery (NeMo integration)

Checkpointless recovery enables full peer-to-peer state replication and recovery. Failed processes recover model and optimizer state directly from healthy replicas over the high-speed EFA interconnect, without loading from storage, eliminating checkpoint loading entirely.

Integration steps:

from typing import Optional

from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank
# Imports from the Tier 3 example and the checkpointless NeMo classes are
# assumed to be in scope.

wait_rank()  # All ranks wait for rank information from the HyperPod training operator

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        ...
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
        )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller

wait_rank: All ranks wait for the rank information from the HyperPod training operator infrastructure.

HPWrapper: A Python function wrapper that enables restart capabilities for a restart code block (RCB). The implementation uses a context manager instead of a Python decorator because the call wrapper lacks information about the number of RCBs it should monitor.

CudaHealthCheck: Helps ensure that the CUDA context for the current process is in a healthy state. It synchronizes with the GPU and uses the device corresponding to the LOCAL_RANK environment variable, or the main thread's default CUDA device if LOCAL_RANK was not specified in the environment.

HPAgentK8sAPIFactory: The API that checkpointless training uses to learn the training status of the other pods in a Kubernetes training cluster. It also provides an infrastructure-level barrier, which makes sure every rank can successfully perform the abort and restart.

CheckpointManager: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance.

We recommend starting with Tier 1 and validating it in your environment. Add Tier 2 when data loading overhead becomes a bottleneck. Adopt Tier 3 and Tier 4 for maximum resilience on the largest training clusters.

For NeMo users and HyperPod recipe users, Tier 4 is available out of the box with minimal configuration changes for the Llama and GPT open source recipes. NeMo examples for Llama and GPT open source models can be found in SageMaker HyperPod checkpointless training.

Performance results

Checkpointless training has been validated at production scale across multiple cluster configurations. The latest Amazon Nova models were trained using this technology on tens of thousands of AI accelerators.

In this section, we present results from extensive testing across a range of cluster sizes, from 16 GPUs to 2,304 GPUs. Checkpointless training demonstrated significant improvements in recovery time, consistently reducing downtime by 80–93% compared to traditional checkpoint-based recovery.

Cluster (H100s) | Model | Traditional recovery | Checkpointless recovery | Improvement
2,304 GPUs | Internal model | 15–30 minutes | Less than 2 minutes | ~87–93% faster
256 GPUs | Llama-3 70B (pre-training) | 4 min, 52 sec | 47 seconds | ~84% faster
16 GPUs | Llama-3 70B (fine-tuning) | 5 min, 10 sec | 50 seconds | ~84% faster

These recovery time improvements relate directly to ML goodput, defined as the percentage of time your cluster spends making forward progress on training rather than sitting idle during failures. As clusters scale to thousands of nodes, failure frequency increases proportionally. At the same time, traditional checkpoint-based recovery times also increase with cluster size because of growing coordination overhead. This creates a compounding problem: more frequent failures combined with longer recovery times rapidly erode goodput at scale.
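To make the compounding effect concrete, the back-of-the-envelope calculation below estimates goodput from an assumed failure rate and per-failure downtime. The input numbers are illustrative assumptions, not measured values, and the estimate ignores other sources of overhead.

# Illustrative goodput estimate: productive time divided by total wall-clock time,
# counting only failure-related downtime. Inputs are assumed, not measured.
def goodput(failures_per_day: float, minutes_lost_per_failure: float) -> float:
    downtime = failures_per_day * minutes_lost_per_failure
    return 1.0 - downtime / (24 * 60)

# Checkpoint-based recovery: e.g., 4 failures/day, ~30 minutes lost each
# (lost steps + restart + checkpoint reload).
print(f"checkpoint-based: {goodput(4, 30):.1%}")   # ~91.7%

# Checkpointless recovery: same failure rate, ~2 minutes lost each.
print(f"checkpointless:   {goodput(4, 2):.1%}")    # ~99.4%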

Checkpointless training optimizes the entire recovery stack, enabling more than 95% goodput even on clusters with thousands of AI accelerators. Based on our internal studies, we consistently observed goodput upwards of 95% across large-scale deployments exceeding 2,300 GPUs.

We also verified that model training accuracy is not impacted by checkpointless training. Specifically, we measured checksum matching between traditional checkpoint-based training and checkpointless training, and at every training step verified a bit-wise match on training loss. The following is a plot of the training loss for a Llama-3 70B pre-training workload on 32 ml.p5.48xlarge instances for both traditional checkpointing and checkpointless training.

Conclusion

Foundation model training has reached an inflection point. As clusters scale to thousands of AI accelerators and training runs extend to months, the traditional checkpoint-based recovery paradigm is increasingly becoming a bottleneck. A single GPU failure that previously would have caused minutes of downtime now triggers tens of minutes of cluster-wide idle time on thousands of AI accelerators, with cumulative costs reaching millions of dollars annually.

Checkpointless training rethinks this paradigm entirely by treating failures as local, recoverable events rather than cluster-wide catastrophes. Failed processes recover state from healthy peers in seconds, enabling the rest of the cluster to continue making forward progress. The shift is fundamental: from "How do we restart quickly?" to "How do we avoid stopping at all?"

This technology has enabled more than 95% goodput when training on SageMaker HyperPod. Our internal studies on 2,304 GPUs show recovery times dropping from 15–30 minutes to under 90 seconds, translating to over an 80% reduction in idle GPU time per failure.

To get started, explore What is Amazon SageMaker AI?. Sample implementations and recipes are available in the AWS GitHub HyperPod checkpointless training and SageMaker HyperPod recipes repositories.


About the Authors

Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master's in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.

Fei Wu is a Senior Software Developer at AWS on the SageMaker team. Fei's focus is on ML systems and distributed training techniques. He holds a PhD in Electrical Engineering from Stony Brook University. Outside of work, Fei enjoys playing basketball and watching movies. You can connect with Fei on LinkedIn.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infrastructure, resiliency, and observability. He holds a Master's in Computer Science from the Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software system builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening, and nature travel. You can connect with Anirban on LinkedIn.

Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he's not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for snowboarding.
