Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters
Implementing hardware resiliency in your training infrastructure is crucial to mitigating risks and enabling uninterrupted model training. By implementing solutions such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.
In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect rare occurrences of issues when Neuron devices fail by tailing monitoring logs. It marks the worker nodes with a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By accelerating the speed of issue detection and remediation, it increases the reliability of your ML training and reduces the wasted time and cost due to hardware failure.
This solution is applicable if you're using managed nodes or self-managed node groups (which use Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.
Solution overview
The solution is based on the node problem detector and recovery DaemonSet, a powerful tool designed to automatically detect and report various node-level problems in a Kubernetes cluster.
The node problem detector component continuously monitors the kernel message (kmsg) logs on the worker nodes. If it detects error messages specifically related to the Neuron device (which is the Trainium or AWS Inferentia chip), it will change NodeCondition to NeuronHasError on the Kubernetes API server.
The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating an issue with the Neuron device, it takes automated actions. First, it marks the affected instance in the relevant Auto Scaling group as unhealthy, which will trigger the Auto Scaling group to stop the instance and launch a replacement. Additionally, the node recovery agent publishes Amazon CloudWatch metrics for users to monitor and alert on these events.
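Under the hood, the recovery agent performs the equivalent of the following AWS API calls, shown here as AWS CLI commands purely for illustration; the instance ID and the InstanceId dimension are placeholders, and the exact dimensions the agent publishes may differ.

# Illustration only: the recovery agent calls these AWS APIs directly; the instance ID is a placeholder.
# Mark the affected instance as unhealthy so its Auto Scaling group stops it and launches a replacement.
aws autoscaling set-instance-health \
    --instance-id i-0123456789abcdef0 \
    --health-status Unhealthy

# Publish a CloudWatch metric recording the detected Neuron error for monitoring and alerting.
aws cloudwatch put-metric-data \
    --namespace NeuronHealthCheck \
    --metric-name NeuronHasError_DMA_ERROR \
    --dimensions InstanceId=i-0123456789abcdef0 \
    --value 1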
The following diagram illustrates the solution architecture and workflow.
In the following walkthrough, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for the node problem detector, and inject an error message into a node. We then observe the failing node being stopped and replaced with a new one, and find a metric in CloudWatch indicating the error.
Prerequisites
Before you start, make sure you have installed the following tools on your machine:
Deploy the node problem detection and recovery plugin
Complete the following steps to configure the node problem detection and recovery plugin:
- Create an EKS cluster using the Data on EKS Terraform module:
- Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.
- Create a policy along the lines of the sketch following these steps. Update the Resource key value to match the ARN of the node group that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.
You can get these values from the Amazon EKS console. Choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.
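The following is a minimal sketch of what such a policy might look like, assuming the agent needs to set instance health in the Auto Scaling group, describe and (optionally) terminate instances, and publish CloudWatch metrics; the account ID, Region, ARN, and group name are placeholders, and you should verify the exact actions and condition keys against the plugin's documentation.

# Sketch of an IAM policy for the node recovery agent; replace the placeholder ARN,
# account ID, and Auto Scaling group name, and verify the required actions against
# the Neuron node problem detector documentation.
cat > npd-recovery-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetInstanceHealth",
        "autoscaling:DescribeAutoScalingInstances"
      ],
      "Resource": "arn:aws:autoscaling:us-east-2:111122223333:autoScalingGroup:*:autoScalingGroupName/<your-asg-name>"
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/aws:autoscaling:groupName": "<your-asg-name>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy \
    --policy-name neuron-npd-and-recovery \
    --policy-document file://npd-recovery-policy.json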
This component will be installed as a DaemonSet in your EKS cluster.
The container images in the Kubernetes manifests are hosted in public repositories such as registry.k8s.io and public.ecr.aws. For production environments, it's recommended that customers limit external dependencies of this kind by hosting container images in a private registry and syncing them from the public repositories. For detailed implementation, refer to the blog post: Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry.
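After the manifests are applied, you can confirm that the DaemonSet and its pods are running; the neuron-healthcheck-system namespace below matches the one used later in this post.

# Verify that the node problem detector DaemonSet is running on the Neuron worker nodes.
kubectl get daemonset -n neuron-healthcheck-system
kubectl get pods -n neuron-healthcheck-system -o wide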
By default, the node problem detector will not take any action on a failed node. If you would like the EC2 instance to be terminated automatically by the agent, update the DaemonSet as follows:
kubectl edit -n neuron-healthcheck-system ds/node-problem-detector
...
   env:
   - name: ENABLE_RECOVERY
     value: "true"
Test the node problem detector and recovery solution
After the plugin is installed, you can see the Neuron conditions show up by running kubectl describe node. We simulate a device error by injecting error logs on the instance:
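The exact message the detector matches is defined in the plugin's configuration, so the following is only a hypothetical illustration of injecting a fake Neuron error into the kernel ring buffer on a worker node (for example, over SSH or SSM Session Manager); the message text is a placeholder, not the real log format.

# Hypothetical illustration: write a fake Neuron error line into the kernel log on a worker node.
# The message text is a placeholder; use the patterns from the plugin's configuration for a real test.
echo "neuron-device: DMA_ERROR on device 0 (injected for testing)" | sudo tee /dev/kmsg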
Around 2 minutes later, you can see that the error has been identified:
kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7
Now that the error has been detected by the node problem detector and the recovery agent has automatically taken action to set the node as unhealthy, Amazon EKS will cordon the node and evict the pods on the node:
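You can observe the cordon and eviction with standard kubectl commands; the node name below is the same example node used earlier.

# Watch the affected node move to SchedulingDisabled and its pods get evicted.
kubectl get nodes -w
kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=ip-100-64-58-151.us-east-2.compute.internal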
You can open the CloudWatch console and verify the metric for NeuronHealthCheck. You can see that the CloudWatch metric NeuronHasError_DMA_ERROR has the value 1.
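If you prefer the AWS CLI over the console, a query along these lines should return the same data point; the NeuronHealthCheck namespace is taken from the Metrics Insights query shown later in this post, and you may need to add --dimensions depending on how the metric was published.

# List the Neuron health metrics published by the recovery agent.
aws cloudwatch list-metrics --namespace NeuronHealthCheck

# Retrieve the last 30 minutes of data points for the DMA error metric (GNU date syntax).
aws cloudwatch get-metric-statistics \
    --namespace NeuronHealthCheck \
    --metric-name NeuronHasError_DMA_ERROR \
    --statistics Maximum \
    --period 60 \
    --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"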
After the replacement, you can see that a new worker node has been created:
Let's look at a real-world scenario, in which you're running a distributed training job using an MPI operator as outlined in Llama-2 on Trainium, and there is an irrecoverable Neuron error on one of the nodes. Before the plugin is deployed, the training job would become stuck, resulting in wasted time and computational costs. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. The training script saves checkpoints periodically so that the training can resume from the previous checkpoint.
The following screenshot shows example logs from a distributed training job.
The training has been started. (You can ignore loss=nan for now; it's a known issue and will be removed. For now, refer to the reduced_train_loss metric instead.)
The following screenshot shows the checkpoint created at step 77.
Training stopped after one of the nodes had a problem at step 86. The error was injected manually for testing.
After the faulty node was detected and replaced by the Neuron plugin for node problem detection and recovery, the training process resumed at step 77, which was the last checkpoint.
Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues preventing the launch of replacement nodes. In such cases, training jobs will stall and require manual intervention. However, the stopped node will not incur further charges for the associated EC2 instance.
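If a replacement node doesn't appear, the Auto Scaling group's recent scaling activities are a good place to start troubleshooting; the group name below is a placeholder.

# Inspect recent scaling activities to see why a replacement node failed to launch.
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <your-asg-name> \
    --max-items 10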
If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms watching the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query like SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to sum up these values to evaluate the alarms. The following screenshots show an example.
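As an example, an alarm based on the Metrics Insights query above could be created from the AWS CLI along the following lines; the alarm name and SNS topic ARN are placeholders.

# Sketch: alarm when the average of NeuronHasError_DMA_ERROR across the namespace rises above 0.
aws cloudwatch put-metric-alarm \
    --alarm-name NeuronHasError-DMA-ERROR \
    --alarm-actions arn:aws:sns:us-east-2:111122223333:neuron-alerts \
    --comparison-operator GreaterThanThreshold \
    --threshold 0 \
    --evaluation-periods 1 \
    --metrics '[{"Id":"q1","Expression":"SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck","Period":60,"ReturnData":true}]'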
Clean up
To clean up all the provisioned resources for this post, run the cleanup script:
Conclusion
In this post, we showed how the Neuron problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by Trainium and AWS Inferentia. If you're running Neuron based EC2 instances and using managed nodes or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster and benefit from improved reliability and fault tolerance of your machine learning training workloads in the event of node failure.
About the authors
Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming, and various other sports, and immersing himself in music.
Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.
Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, Container, Observability, and Open Source Technologies. In his spare time, he likes to work out and have fun with his family.