Introducing Amazon EKS support in Amazon SageMaker HyperPod


We're thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automatic node and job resiliency features for foundation model (FM) development.

FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under these circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of those interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.

Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With EKS support in HyperPod, you can now also benefit from these resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using HyperPod compute and a managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters, use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce the time we spend on undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes shop, we are thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

The post is organized into the following three sections:

  • Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
  • HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially helpful for admins.
  • Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.

Overview of EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod provides a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources:

  • Amazon EKS VPC – An AWS-managed VPC hosts the EKS control plane. This VPC doesn't appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
  • HyperPod VPC – An AWS-managed VPC hosts the HyperPod compute. This VPC doesn't appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
  • SageMaker user VPC – A user-managed VPC in your account hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services in your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod EKS architecture

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

  • Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
  • Automatic node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
  • Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.

User experiences

In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on it as part of the Amazon EKS integration:

  • Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
  • Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl; a brief sketch of this workflow follows this list. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also use SageMaker HyperPod compute with Amazon EKS support together with third-party tools like KubeRay, which runs on the Kubernetes API. This lets you bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.
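The commands below sketch what that CLI-driven workflow can look like; the command names are illustrative and may differ between CLI versions, so check the HyperPod CLI documentation for the exact commands and options:

# Illustrative HyperPod CLI workflow (command names may vary by CLI version)
hyperpod connect-cluster --cluster-name ml-cluster   # point the CLI at your HyperPod EKS cluster
hyperpod start-job --config-file ./config.yaml       # submit a training job defined in a YAML file
hyperpod list-jobs                                   # list submitted jobs
hyperpod get-job --job-name <job-name>               # describe a specific job
hyperpod cancel-job --job-name <job-name>            # cancel a running job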

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

You need to have the following in place prior to the HyperPod compute deployment:

  • EKS cluster – You can associate HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
  • Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health checks. The HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes (see the sketch after this list). Refer to the developer guide for installation.
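The following is a minimal sketch of that Helm-based installation, assuming the HyperPodHelmCharts are distributed through the aws/sagemaker-hyperpod-cli repository; the repository path, chart name, and release name here are assumptions, so follow the developer guide for the authoritative steps:

# Sketch: deploy the HyperPod dependencies with Helm (paths and names are assumptions)
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart
helm dependency update HyperPodHelmChart                              # pull in chart dependencies (device plugins, training operator, agents)
helm install dependencies HyperPodHelmChart --namespace kube-system   # install the custom resources onto the EKS cluster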

HyperPod compute setup

With the aforementioned resources successfully deployed, you're now ready to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Computerized"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables HyperPod's automatic node recovery functionality

You can create a HyperPod compute with the following aws command (you need version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.

SageMaker Console
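You can also inspect the cluster and its nodes from the AWS CLI; for example:

# Check overall cluster details and per-node status from the AWS CLI
aws sagemaker describe-cluster --cluster-name ml-cluster
aws sagemaker list-cluster-nodes --cluster-name ml-cluster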

Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label shows the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         ...  LABELS
hyperpod-i-023cfe933b3b34369 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
hyperpod-i-045961b6424401838 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-074b81fdb5bf52e19 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-0ae97710b3033cdb1 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
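To list only the nodes that are ready to accept pods, you can filter on that label with a standard Kubernetes label selector:

# Show only nodes that have passed the HyperPod health checks
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable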

The deep health check logs are saved in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimum
threshold :80,NCCL Test output Bw: 30"
}
# Example2
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}

You can check the progress of the deep health checks through the following values of the sagemaker.amazonaws.com/deep-health-check label on each node (see the snippet after this list):

  • sagemaker.amazonaws.com/deep-health-check: InProgress
  • sagemaker.amazonaws.com/deep-health-check: Passed
  • sagemaker.amazonaws.com/deep-health-check: Failed
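A convenient way to watch this progress is to surface the label as an extra column in the kubectl output:

# Display the deep health check stage of each node as an additional column
kubectl get nodes -L sagemaker.amazonaws.com/deep-health-check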

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
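For example, the following sketch marks a node for replacement by setting its health status label to the pending-replacement value mentioned later in this post; confirm the exact label value against the AWS documentation referenced below:

# Sketch: request replacement of a specific node by relabeling it
kubectl label node <node-name> sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace --overwrite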

For the full list of resiliency-related Kubernetes labels, refer to the AWS documentation.

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:

  • Example log group name – /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
  • Example log stream name – SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
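To retrieve these events from the command line, you can query the same log group with the AWS CLI, substituting the placeholders used above:

# Pull the last hour of health monitoring agent events for one node from CloudWatch Logs
aws logs filter-log-events \
    --log-group-name "/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>" \
    --log-stream-name-prefix "SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>" \
    --start-time $(( ($(date +%s) - 3600) * 1000 ))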

When the deep health checks or the health monitoring agent identify issues in a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods on it, and then the node is replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file systems up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

Training job resiliency with the job auto resume functionality

In addition to the infrastructure resiliency features, you can use the job auto resume capability, built on the Kubeflow Training Operator for PyTorch, to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality resolves node failures (through node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example from the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
    name: fsdpjob
    namespace: kubeflow
    # config for HyperPod job auto-resume
    annotations: {
        sagemaker.amazonaws.com/enable-job-auto-resume: "true",
        sagemaker.amazonaws.com/job-max-retry-count: "2"
    }
spec:
  pytorchReplicaSpecs:
  ......
  Worker:
      replicas: 10
      restartPolicy: OnFailure

      template:
          spec:
            nodeSelector:
              sagemaker.amazonaws.com/node-health-status: Schedulable
......

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed the basic health checks and are available for running workloads are used for resumed jobs.

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml
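After submission, you can confirm that the auto-resume annotations were applied to the job:

# Verify the auto-resume annotations on the submitted PyTorchJob
kubectl get pytorchjob fsdpjob -n kubeflow -o jsonpath='{.metadata.annotations}'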

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issue during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Enable job auto-resume 27

Events:
Type Reason Age From
Message
---- ------ ---- ----

Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-worker-0
Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-worker-1
Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-master-0
Warning PyTorchJobRestarting 7m59s pytorchjob-controller
PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-worker-0
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-worker-1
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-master-0
Warning PyTorchJobRestarting 7m58s pytorchjob-controller
PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the complete configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker console.

Conclusion

With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS support further enhances this capability with deep health checks: whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training in the event of unexpected interruptions, handling node replacement and job resubmission.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you're interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.


About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, fighting climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment easy for customers. He holds an MBA from the Haas School of Business and a master's degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.
