Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience


Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. Multiple users, such as ML researchers, software engineers, data scientists, and cluster administrators, can work concurrently on the same cluster, each managing their own jobs and files without interfering with others.

When using HyperPod, you can use familiar orchestration options such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). This blog post specifically applies to HyperPod clusters that use Slurm as the orchestrator. In these clusters, cluster administrators can add login nodes to facilitate user access. These login nodes serve as the entry point through which users interact with the cluster's computational resources. By using login nodes, users can separate their interactive activities, such as browsing files, submitting jobs, and compiling code, from the cluster's head node. This separation helps prevent any single user's activities from affecting the overall performance of the cluster.

However, although HyperPod provides the capability to use login nodes, it doesn't offer an integrated mechanism for load balancing user activity across these nodes. As a result, users manually select a login node, which can lead to imbalances where some nodes are overutilized while others remain underutilized. This not only affects the efficiency of resource usage but can also lead to uneven performance experiences for different users.

In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.

Solution overview

In HyperPod, login nodes serve as entry points for users to interact with the cluster's computational resources so they can manage their tasks without impacting the head node. Although the default method for accessing these login nodes is through AWS Systems Manager, there are cases where direct Secure Shell (SSH) access is more suitable. SSH provides a more traditional and flexible way of managing interactions, especially for users who require specific networking configurations or need features such as TCP load balancing, which Systems Manager doesn't support.

Given that HyperPod is typically deployed in a virtual private cloud (VPC) using private subnets, direct SSH access to the login nodes requires secure network connectivity into the private subnet. There are several options to achieve this:

  1. AWS Site-to-Site VPN – Establishes a secure connection between your on-premises network and your VPC, suitable for enterprise environments
  2. AWS Direct Connect – Offers a dedicated network connection for high-throughput and low-latency needs
  3. AWS Client VPN – A software-based solution that remote users can use to securely connect to the VPC, providing flexible and easy access to the login nodes

This post demonstrates how to use AWS Client VPN to establish a secure connection to the VPC. We set up a Network Load Balancer (NLB) within the private subnet to evenly distribute SSH traffic across the available login nodes and use the VPN connection to reach the NLB in the VPC. The NLB ensures that user sessions are balanced across the nodes, preventing any single node from becoming a bottleneck and thereby improving overall performance and resource utilization.

For environments where VPN connectivity might not be feasible, an alternative option is to deploy the NLB in a public subnet to allow direct SSH access from the internet. In this configuration, the NLB can be secured by restricting access through a security group that allows SSH traffic only from specified, trusted IP addresses. As a result, authorized users can connect directly to the login nodes while maintaining some level of control over access to the cluster. However, this public-facing approach is outside the scope of this post and isn't recommended for production environments, because exposing SSH access to the internet can introduce additional security risks.

The following diagram provides an overview of the solution architecture.

Solution overview

Prerequisites

Before following the steps in this post, make sure you have the foundational components of a HyperPod cluster setup in place. This includes the core infrastructure for the HyperPod cluster and the network configuration required for secure access. Specifically, you need:

  • HyperPod cluster – This post assumes you already have a HyperPod cluster deployed. If not, refer to Getting started with SageMaker HyperPod and the HyperPod workshop for guidance on creating and configuring your cluster.
  • VPC, subnets, and security group – Your HyperPod cluster should reside within a VPC with associated subnets. To deploy a new VPC and subnets, follow the instructions in the Own Account section of the HyperPod workshop. This process includes deploying an AWS CloudFormation stack to create essential resources such as the VPC, subnets, security group, and an Amazon FSx for Lustre volume for shared storage.

Setting up login nodes for cluster access

Login nodes are dedicated access points through which users interact with the HyperPod cluster's computational resources without impacting the head node. By connecting through login nodes, users can browse files, submit jobs, and compile code independently, promoting a more organized and efficient use of the cluster's resources.

If you haven't set up login nodes yet, refer to the Login Node section of the HyperPod workshop, which provides detailed instructions on adding these nodes to your cluster configuration.

Each login node in a HyperPod cluster has an associated network interface within your VPC. A network interface, also known as an elastic network interface, represents a virtual network card that connects each login node to your VPC, allowing it to communicate over the network. These interfaces have assigned IPv4 addresses, which are essential for routing traffic from the NLB to the login nodes.

To proceed with the load balancer setup, you must obtain the IPv4 addresses of each login node. You can obtain these addresses from the AWS Management Console or by running a command on your HyperPod cluster's head node.

Using the AWS Management Console

To retrieve the IPv4 addresses of your login nodes using the AWS Management Console, follow these steps:

  1. On the Amazon EC2 console, choose Network interfaces in the navigation pane
  2. In the search bar, select VPC ID = (Equals) and choose the ID of the VPC containing the HyperPod cluster
  3. In the search bar, select Description : (Contains) and enter the name of the instance group that includes your login nodes (typically, this is login-group)

Each login node appears as an entry in the list, as shown in the following screenshot. Note down the IPv4 addresses for all login nodes of your cluster. An equivalent lookup with the AWS CLI follows the screenshot.

Search network interfaces
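If you prefer the command line, the same lookup can be sketched with the AWS CLI; the VPC ID below is a placeholder, and the description filter assumes the default login-group instance group name:

# List the private IPv4 addresses of the network interfaces attached to the login nodes.
aws ec2 describe-network-interfaces \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
            "Name=description,Values=*login-group*" \
  --query "NetworkInterfaces[].PrivateIpAddress" \
  --output text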

Using the HyperPod head node

Alternatively, you can retrieve the IPv4 addresses by running the following command on your HyperPod cluster's head node:

sudo cat /opt/ml/config/resource_config.json \
    | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress'

Create a Network Load Balancer

The next step is to create an NLB to distribute traffic across your cluster's login nodes.

For the NLB deployment, you need the IPv4 addresses of the login nodes collected earlier and the appropriate security group configurations. If you deployed your cluster using the HyperPod workshop instructions, a security group that allows communication between all cluster nodes should already be in place.

This security group can be applied to the load balancer, as demonstrated in the following instructions. Alternatively, you can choose to create a dedicated security group that grants access specifically to the login nodes.
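If you opt for a dedicated security group, a minimal AWS CLI sketch could look like the following; the VPC ID is a placeholder, and the 10.0.0.0/16 CIDR range assumes the workshop defaults:

# Create a security group for SSH access to the login nodes.
SG_ID=$(aws ec2 create-security-group \
  --group-name smhp-login-node-sg \
  --description "SSH access to HyperPod login nodes" \
  --vpc-id vpc-0123456789abcdef0 \
  --query GroupId --output text)

# Allow inbound SSH from within the HyperPod VPC.
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" \
  --protocol tcp --port 22 \
  --cidr 10.0.0.0/16

Keep in mind that the login nodes' own security group must also permit inbound SSH from the load balancer so that health checks and client connections succeed.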

Create target group

First, we create the target group that will be used by the NLB. An equivalent AWS CLI sketch follows the console steps.

  1. On the Amazon EC2 console, choose Target groups in the navigation pane
  2. Choose Create target group
  3. Create a target group with the following parameters:
    1. For Choose a target type, choose IP addresses
    2. For Target group name, enter smhp-login-node-tg
    3. For Protocol : Port, choose TCP and enter 22
    4. For IP address type, choose IPv4
    5. For VPC, choose the SageMaker HyperPod VPC (which was created with the CloudFormation template)
    6. For Health check protocol, choose TCP
  4. Choose Next, as shown in the following screenshot

Create NLB target group - Step 1

  5. In the Register targets section, register the login node IP addresses as the targets
  6. For Ports, enter 22 and choose Include as pending below, as shown in the following screenshot

Create NLB target group - Step 2

  7. The login node IPs will appear as targets with Pending health status. Choose Create target group, as shown in the following screenshot

Create NLB target group - Step 3
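As an alternative to the console steps, the target group can also be created and populated with the AWS CLI. The following is a rough sketch; the VPC ID and the login node IP addresses are placeholders:

# Create a TCP target group with IP targets on port 22.
TG_ARN=$(aws elbv2 create-target-group \
  --name smhp-login-node-tg \
  --protocol TCP --port 22 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip \
  --health-check-protocol TCP \
  --query "TargetGroups[0].TargetGroupArn" --output text)

# Register the login node IP addresses collected earlier.
aws elbv2 register-targets \
  --target-group-arn "$TG_ARN" \
  --targets Id=10.0.1.11,Port=22 Id=10.0.1.12,Port=22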

Create load balancer

To create the load balancer, follow these steps (an equivalent AWS CLI sketch follows the console walkthrough):

  1. On the Amazon EC2 console, choose Load Balancers in the navigation pane
  2. Choose Create load balancer
  3. Choose Network Load Balancer and choose Create, as shown in the following screenshot

Create load balancer selection dialog

  4. Provide a name (for example, smhp-login-node-lb) and choose Internal as the Scheme

Create NLB - Step 1

  5. For Network mapping, select the VPC that contains your HyperPod cluster and an associated private subnet, as shown in the following screenshot

Create NLB - Step 2

  6. Select a security group that allows access on port 22 to the login nodes. If you deployed your cluster using the HyperPod workshop instructions, you can use the security group from that deployment.
  7. Select the target group that you created before and choose TCP as the Protocol and 22 for the Port, as shown in the following screenshot

Create NLB - Step 3

  8. Choose Create load balancer

After the load balancer has been created, you can find its DNS name on the load balancer's detail page, as shown in the following screenshot.

Find DNS name after NLB creation
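The same load balancer and listener setup can be sketched with the AWS CLI as shown below; the subnet and security group IDs are placeholders, and TG_ARN refers to the target group ARN from the earlier sketch:

# Create an internal Network Load Balancer in the HyperPod private subnet.
NLB_ARN=$(aws elbv2 create-load-balancer \
  --name smhp-login-node-lb \
  --type network \
  --scheme internal \
  --subnets subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0 \
  --query "LoadBalancers[0].LoadBalancerArn" --output text)

# Forward TCP traffic on port 22 to the login node target group.
aws elbv2 create-listener \
  --load-balancer-arn "$NLB_ARN" \
  --protocol TCP --port 22 \
  --default-actions Type=forward,TargetGroupArn="$TG_ARN"

# Retrieve the DNS name that clients will use for SSH connections.
aws elbv2 describe-load-balancers \
  --load-balancer-arns "$NLB_ARN" \
  --query "LoadBalancers[0].DNSName" --output text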

Ensuring host keys are consistent across login nodes

When using multiple login nodes in a load-balanced environment, it's important to maintain consistent SSH host keys across all nodes. SSH host keys are unique identifiers that each server uses to prove its identity to connecting clients. If each login node has a different host key, users will encounter SSH host key warnings (such as OpenSSH's "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!") whenever they connect to a different node, causing confusion and potentially leading users to question the security of the connection.

To avoid these warnings, configure the same SSH host keys on all login nodes in the load balancing rotation. This setup makes sure that users won't receive host key mismatch alerts when routed to a different node by the load balancer.

You can run the following script on the cluster's head node to copy the SSH host keys from the first login node to the other login nodes in your HyperPod cluster:

#!/bin/bash

SUDOER_USER="ubuntu"

# Collect the IPv4 addresses of all login nodes from the HyperPod resource config.
login_nodes=($(sudo cat /opt/ml/config/resource_config.json | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress' | tr '\n' ' ' | tr -d '"'))
source_node="${login_nodes[0]}"
key_paths=("/etc/ssh/ssh_host_rsa_key"
           "/etc/ssh/ssh_host_rsa_key.pub"
           "/etc/ssh/ssh_host_ecdsa_key"
           "/etc/ssh/ssh_host_ecdsa_key.pub"
           "/etc/ssh/ssh_host_ed25519_key"
           "/etc/ssh/ssh_host_ed25519_key.pub")

tmp_dir="/tmp/ssh_host_keys_$(uuidgen)"

# Build a command that stages the host keys of the first login node in a temporary directory.
copy_cmd=""
for key_path in "${key_paths[@]}"; do
  copy_cmd="sudo cp $key_path $tmp_dir/;$copy_cmd"
done

ssh $source_node "mkdir -p $tmp_dir;${copy_cmd} sudo chown -R $SUDOER_USER $tmp_dir;"

# Copy the staged host keys to every other login node.
for node in "${login_nodes[@]:1}"; do
  echo "Copying SSH host keys from $source_node to $node..."
  scp -r $source_node:$tmp_dir $node:$tmp_dir
  ssh $node "sudo chown -R root:root $tmp_dir; sudo mv $tmp_dir/ssh_host_* /etc/ssh/;"
done

# Remove the temporary directory from all login nodes.
for node in "${login_nodes[@]}"; do
  echo "Cleaning up tmp dir $tmp_dir on $node..."
  ssh $node "sudo rm -r $tmp_dir"
done
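After the script completes, the SSH service on the updated nodes may need to be restarted (for example, with sudo systemctl restart ssh) before the new host keys take effect, because sshd typically loads host keys at startup. You can then confirm that all login nodes present the same fingerprint, for example with ssh-keyscan; the IP addresses below are placeholders:

# Compare the ed25519 host key fingerprint reported by each login node.
for node in 10.0.1.11 10.0.1.12; do
  echo -n "$node: "
  ssh-keyscan -t ed25519 "$node" 2>/dev/null | ssh-keygen -lf -
done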

Create an AWS Client VPN endpoint

Because the NLB has been created with the Internal scheme, it's only accessible from within the HyperPod VPC. To access the VPC and send requests to the NLB, we use AWS Client VPN in this post.

AWS Client VPN is a managed client-based VPN service that enables secure access to your AWS resources and to resources in your on-premises network.

We'll set up an AWS Client VPN endpoint that provides clients with access to the HyperPod VPC and uses mutual authentication. With mutual authentication, Client VPN uses certificates to perform authentication between clients and the Client VPN endpoint.

To deploy a Client VPN endpoint with mutual authentication, you can follow the steps outlined in Get started with AWS Client VPN. When configuring the Client VPN endpoint to access the HyperPod VPC and the login nodes, keep these adaptations to the referenced steps in mind (a rough AWS CLI sketch of the same settings follows the list):

  • Step 2 (create a Client VPN endpoint) – By default, all client traffic is routed through the Client VPN tunnel. To allow internet access without routing traffic through the VPN, you can enable the option Enable split-tunnel when creating the endpoint. When this option is enabled, only traffic destined for networks matching a route in the Client VPN endpoint route table is routed through the VPN tunnel. For more details, refer to Split-tunnel on Client VPN endpoints.
  • Step 3 (target network associations) – Select the VPC and private subnet used by your HyperPod cluster, which contains the cluster login nodes.
  • Step 4 (authorization rules) – Choose the Classless Inter-Domain Routing (CIDR) range associated with the HyperPod VPC. If you followed the HyperPod workshop instructions, the CIDR range is 10.0.0.0/16.
  • Step 6 (security groups) – Select the security group that you previously used when creating the NLB.
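Under the assumption that the server and client certificates have already been imported into AWS Certificate Manager, the corresponding AWS CLI calls could look roughly like this; all ARNs, IDs, and CIDR ranges are placeholders:

# Create a split-tunnel Client VPN endpoint with mutual (certificate) authentication.
VPN_ID=$(aws ec2 create-client-vpn-endpoint \
  --client-cidr-block 172.16.0.0/22 \
  --server-certificate-arn arn:aws:acm:us-east-1:111122223333:certificate/server-cert-id \
  --authentication-options 'Type=certificate-authentication,MutualAuthentication={ClientRootCertificateChainArn=arn:aws:acm:us-east-1:111122223333:certificate/client-cert-id}' \
  --connection-log-options Enabled=false \
  --split-tunnel \
  --vpc-id vpc-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --query ClientVpnEndpointId --output text)

# Associate the HyperPod private subnet and authorize access to the VPC CIDR range.
aws ec2 associate-client-vpn-target-network \
  --client-vpn-endpoint-id "$VPN_ID" \
  --subnet-id subnet-0123456789abcdef0

aws ec2 authorize-client-vpn-ingress \
  --client-vpn-endpoint-id "$VPN_ID" \
  --target-network-cidr 10.0.0.0/16 \
  --authorize-all-groups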

Connecting to the login nodes

After the AWS Client VPN endpoint is configured, clients can establish a VPN connection to the HyperPod VPC. With the VPN connection in place, clients can use SSH to connect to the NLB, which will route them to one of the login nodes.

ssh -i /path/to/your/private-key.pem user@<NLB-IP-or-DNS>
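For convenience, users can also add an entry to their local SSH configuration so that the load balancer endpoint is reachable through a short alias; the host alias, DNS name, user name, and key path below are placeholders:

# ~/.ssh/config on the client machine
Host hyperpod-login
    HostName smhp-login-node-lb-0123456789abcdef0.elb.us-east-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/private-key.pem

With this entry in place, connecting is as simple as running ssh hyperpod-login.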

To allow SSH access to the login nodes, you must create user accounts on the cluster and add their public keys to the authorized_keys file on each login node (or on all nodes, if necessary). For detailed instructions on managing multi-user access, refer to the Multi-User section of the HyperPod workshop.

In addition to using AWS Client VPN, you can also access the NLB from other AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, if they meet the following requirements:

  • VPC connectivity – The EC2 instances must be either in the same VPC as the NLB or able to access the HyperPod VPC through a peering connection or similar network setup.
  • Security group configuration – The EC2 instance's security group must allow outbound connections on port 22 to the NLB security group. Likewise, the NLB security group should be configured to accept inbound SSH traffic on port 22 from the EC2 instance's security group.
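Assuming both security groups already exist, the inbound side of the second requirement could be implemented along these lines; both group IDs are placeholders:

# Allow inbound SSH to the NLB security group from the EC2 instance's security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaabbbbcccc1111 \
  --protocol tcp --port 22 \
  --source-group sg-0dddd2222eeee3333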

Clean up

To remove the deployed resources, you can clean them up in the following order (an equivalent AWS CLI sketch follows the list):

  1. Delete the Client VPN endpoint
  2. Delete the Network Load Balancer
  3. Delete the target group associated with the load balancer
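A rough AWS CLI sketch of the same cleanup looks like the following; the Client VPN endpoint ID is a placeholder, and NLB_ARN and TG_ARN refer to the ARNs from the earlier sketches:

# Delete the Client VPN endpoint (disassociate its target networks first if they are still associated).
aws ec2 delete-client-vpn-endpoint \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0

# Delete the Network Load Balancer, then its target group once no listener references it.
aws elbv2 delete-load-balancer \
  --load-balancer-arn "$NLB_ARN"

aws elbv2 delete-target-group \
  --target-group-arn "$TG_ARN"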

If you also want to delete the HyperPod cluster, follow these additional steps:

  1. Delete the HyperPod cluster
  2. Delete the CloudFormation stack, which includes the VPC, subnets, security group, and FSx for Lustre volume

Conclusion

In this post, we explored how to implement login node load balancing for SageMaker HyperPod clusters. By using a Network Load Balancer to distribute user traffic across login nodes, you can optimize resource utilization and enhance the overall multi-user experience, providing seamless access to cluster resources for each user.

This approach represents just one way to customize your HyperPod cluster. Thanks to the flexibility of SageMaker HyperPod, you can adapt configurations to your unique needs while benefiting from a managed, resilient environment. Whether you need to scale foundation model workloads, share compute resources across different tasks, or support long-running training jobs, SageMaker HyperPod offers a versatile solution that can evolve with your requirements.

For more details on getting the most out of SageMaker HyperPod, dive into the HyperPod workshop and explore additional blog posts covering HyperPod.


About the Authors

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in using AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
