Introducing Amazon EKS support in Amazon SageMaker HyperPod
We’re thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability lets you add SageMaker HyperPod managed compute to EKS clusters and use automated node and job resiliency features for foundation model (FM) development.
FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
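To put those figures in perspective, the following minimal sketch converts them into an average gap between interruptions. It uses only the numbers quoted above; the function and variable names are illustrative, not from any AWS or Meta tooling.

```python
# Figures quoted in the paragraph above (Meta Llama 3 405B pre-training).
TRAINING_DAYS = 54
UNEXPECTED_INTERRUPTIONS = 419
HARDWARE_SHARE = 0.78  # share attributed to confirmed or suspected hardware issues


def mean_hours_between_interruptions(days: int, interruptions: int) -> float:
    """Average wall-clock hours between two consecutive interruptions."""
    return days * 24 / interruptions


mean_gap_hours = mean_hours_between_interruptions(TRAINING_DAYS, UNEXPECTED_INTERRUPTIONS)
hardware_interruptions = round(UNEXPECTED_INTERRUPTIONS * HARDWARE_SHARE)

print(f"Mean time between interruptions: {mean_gap_hours:.1f} hours")
print(f"Interruptions attributed to hardware: about {hardware_interruptions}")
```

At this scale a run is interrupted roughly every three hours on average, which is why automated recovery matters more than manual intervention.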
Since its inception, SageMaker HyperPod has been designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and the managed Kubernetes control plane on the EKS cluster.
AI startups like Observea and Articul8, and enterprises like Thomson Reuters, use this new feature set to manage their ML model development lifecycle:
“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce the time we spend on undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”
– Observea
“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”
– Articul8 AI
This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
The post is organized into the following three sections:
- Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
- HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially helpful for admins.
- Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.
Overview of EKS support in SageMaker HyperPod
This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.
Architecture overview
Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). There are three virtual private clouds (VPCs) in this architecture, hosting different types of resources:
- Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
- HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
- SageMaker user VPC – A user-managed VPC hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association, in your account.
Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services in your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.
The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.
HyperPod-managed resiliency features
Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:
- Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) tests. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
- Automated node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
- Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.
User experiences
In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on it as part of the Amazon EKS integration:
- Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
- Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also enable SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This allows you to bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.
HyperPod compute setup and node resiliency features
In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.
Prerequisites
You need to have the following in place prior to the HyperPod compute deployment:
- EKS cluster – You can associate HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
- Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to the developer guide for installation.
HyperPod compute setup
With the aforementioned resources successfully deployed, you’re now ready to create the HyperPod compute. The cluster configuration is specified using a JSON file.
The configuration file has two key settings:
- "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
- "NodeRecovery": "Automatic" – Enables HyperPod’s automated node recovery functionality
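To show where these two settings sit, here is a minimal sketch that builds a cluster configuration as a Python dictionary and serializes it to JSON. The field names follow the SageMaker CreateCluster request shape as we understand it. Every value (names, ARNs, S3 URI, instance type and count, security group, subnet) is a hypothetical placeholder, and build_cluster_config is an illustrative helper, not part of any AWS SDK.

```python
import json


def build_cluster_config(cluster_name, eks_cluster_arn, execution_role_arn,
                         lifecycle_s3_uri, security_group_ids, subnet_ids):
    """Assemble a HyperPod cluster configuration (sketch; values are placeholders)."""
    return {
        "ClusterName": cluster_name,
        # Attach the HyperPod compute to an existing EKS cluster.
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",  # hypothetical name
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 2,                     # hypothetical count
                "LifeCycleConfig": {
                    "SourceS3Uri": lifecycle_s3_uri,
                    "OnCreate": "on_create.sh",         # hypothetical script name
                },
                "ExecutionRole": execution_role_arn,
                "ThreadsPerCore": 1,
                # Run deep health checks whenever new instances are added.
                "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
            }
        ],
        "VpcConfig": {"SecurityGroupIds": security_group_ids, "Subnets": subnet_ids},
        # Enable automated node recovery.
        "NodeRecovery": "Automatic",
    }


config = build_cluster_config(
    cluster_name="example-hyperpod-eks",
    eks_cluster_arn="arn:aws:eks:us-west-2:111122223333:cluster/example",
    execution_role_arn="arn:aws:iam::111122223333:role/example-hyperpod-role",
    lifecycle_s3_uri="s3://example-bucket/lifecycle-scripts",
    security_group_ids=["sg-0123456789abcdef0"],
    subnet_ids=["subnet-0123456789abcdef0"],
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Saving the printed JSON to a file gives you a starting point to adapt. Check the field names against the current API reference before using it.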
You can create a HyperPod compute with the AWS CLI command aws sagemaker create-cluster, passing the JSON file through the --cli-input-json option (you need version 2.17.47 or newer).
To verify the cluster status, you can use a command such as aws sagemaker list-clusters, which displays the cluster details, including the cluster name, status, and creation time.
Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
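If you prefer to script the wait instead of watching the console, a polling loop along the following lines may help. This is a sketch under assumptions: wait_for_cluster is a hypothetical helper, the describe callable is injected so the loop can be exercised without AWS access, and the status names ("InService", "Failed") are the cluster-level values we expect from the SageMaker DescribeCluster API, which differ from the per-node Running status shown in the console.

```python
import time
from typing import Callable, Dict


def wait_for_cluster(describe: Callable[[str], Dict], cluster_name: str,
                     target: str = "InService", poll_seconds: float = 30.0,
                     max_polls: int = 120, sleep=time.sleep) -> str:
    """Poll describe(cluster_name) until the cluster reaches the target status."""
    for _ in range(max_polls):
        status = describe(cluster_name)["ClusterStatus"]
        if status == target:
            return status
        if status == "Failed":
            raise RuntimeError(f"Cluster {cluster_name} entered Failed status")
        sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_name} did not reach {target} in time")


# Assumed usage with boto3 (not executed here):
#   import boto3
#   sm = boto3.client("sagemaker")
#   wait_for_cluster(lambda name: sm.describe_cluster(ClusterName=name),
#                    "example-hyperpod-eks")
```

Injecting the describe function keeps the loop independent of any SDK, so you can swap in a different client or a test double.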
Node resiliency features
To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks.
The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason.
You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:
- sagemaker.amazonaws.com/deep-health-check: InProgress
- sagemaker.amazonaws.com/deep-health-check: Passed
- sagemaker.amazonaws.com/deep-health-check: Failed
If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:
sagemaker.amazonaws.com/node-health-status: Schedulable
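The labels described above can also be inspected programmatically. The following sketch groups nodes by health status from the JSON that kubectl get nodes -o json prints. The label keys are the ones named in this post. The node names and the sample payload are hypothetical, and the helper functions are illustrative only.

```python
import json
from collections import defaultdict

HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"
DEEP_CHECK_LABEL = "sagemaker.amazonaws.com/deep-health-check"


def group_nodes_by_health(node_list_json: str) -> dict:
    """Map each node-health-status value to the names of nodes carrying it."""
    groups = defaultdict(list)
    for node in json.loads(node_list_json)["items"]:
        labels = node["metadata"].get("labels", {})
        groups[labels.get(HEALTH_LABEL, "Unknown")].append(node["metadata"]["name"])
    return dict(groups)


def deep_check_status(node_list_json: str) -> dict:
    """Map each node name to its deep health check label value, if present."""
    return {
        node["metadata"]["name"]: node["metadata"].get("labels", {}).get(DEEP_CHECK_LABEL)
        for node in json.loads(node_list_json)["items"]
    }


# Hypothetical stand-in for the output of `kubectl get nodes -o json`.
SAMPLE_NODE_LIST = json.dumps({
    "items": [
        {"metadata": {"name": "node-a", "labels": {HEALTH_LABEL: "Schedulable"}}},
        {"metadata": {"name": "node-b", "labels": {
            HEALTH_LABEL: "Unschedulable", DEEP_CHECK_LABEL: "InProgress"}}},
    ]
})

health_groups = group_nodes_by_health(SAMPLE_NODE_LIST)
deep_checks = deep_check_status(SAMPLE_NODE_LIST)
print(health_groups)
print(deep_checks)
```

In practice you would pipe real kubectl output into these functions instead of the sample payload.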
When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
For the full list of resilience-related Kubernetes labels, refer to the AWS documentation.
Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:
- Example log group name – /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
- Example log stream name – SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>
The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent.
When the deep health checks or the health monitoring agent identify issues in a certain node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.
You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.
You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.
Training job resiliency with the job auto resume functionality
In addition to infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, while the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example in the awsome-distributed-training repository.
To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes job auto resume annotations and a nodeSelector.
With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
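As a rough illustration of where these settings live, the sketch below adds them to a PyTorchJob manifest represented as a Python dictionary. The annotation and label keys are the ones quoted above. The placement (annotations on the job metadata, nodeSelector on each replica's pod template) is our assumption about the manifest layout, and the base manifest, job name, namespace, image, and with_auto_resume helper are all hypothetical.

```python
import copy
import json

HEALTHY_NODE_SELECTOR = {"sagemaker.amazonaws.com/node-health-status": "Schedulable"}


def with_auto_resume(manifest: dict, max_retries: int = 2) -> dict:
    """Return a copy of a PyTorchJob manifest with job auto resume settings added."""
    job = copy.deepcopy(manifest)
    annotations = job.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["sagemaker.amazonaws.com/enable-job-auto-resume"] = "true"
    annotations["sagemaker.amazonaws.com/job-max-retry-count"] = str(max_retries)
    # Restrict every replica type to nodes that are labeled Schedulable.
    for replica in job.get("spec", {}).get("pytorchReplicaSpecs", {}).values():
        pod_spec = replica.setdefault("template", {}).setdefault("spec", {})
        pod_spec.setdefault("nodeSelector", {}).update(HEALTHY_NODE_SELECTOR)
    return job


# Hypothetical minimal manifest; a real fsdp.yaml carries many more fields.
base_manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "fsdp-example", "namespace": "kubeflow"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "example.com/fsdp:latest"}
                ]}},
            }
        }
    },
}

job = with_auto_resume(base_manifest, max_retries=2)
print(json.dumps(job["metadata"], indent=2))
```

Because Kubernetes accepts JSON as well as YAML, the resulting dictionary can be dumped with json.dumps and applied as a manifest file.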
Submit the PyTorchJob using kubectl, for example with kubectl apply -f fsdp.yaml.
With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob with kubectl describe.
In the event of a hardware failure, the Kubeflow training job restarts automatically.
When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed.
Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.
Clean up
To delete your SageMaker HyperPod compute, use either the SageMaker console or the AWS Command Line Interface (AWS CLI) command aws sagemaker delete-cluster with the --cluster-name option.
Cluster deletion can take a few minutes. You can confirm successful deletion when you see no clusters on the SageMaker console.
Conclusion
With support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface in SageMaker HyperPod. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.
For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step instructions, you can start from the aws-do-hyperpod repository.
About the authors
Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.
Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He’s currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He’s also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment easy for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.