Integrate HyperPod clusters with Active Directory for seamless multi-user login


Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.

Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other's work. To achieve this multi-user environment, you can take advantage of Linux's user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.

To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.

In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.

Solution overview

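The solution uses the following AWS services and resources: SageMaker HyperPod for the cluster, AWS Managed Microsoft AD for the directory, a Network Load Balancer (NLB) placed in front of the directory for TLS termination, AWS Certificate Manager (ACM) to hold the certificate, and an Amazon EC2 Windows instance to administer users and groups.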

We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.

The following diagram illustrates the high-level solution architecture.

Architecture diagram for HyperPod and Active Directory integration

In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We use TLS termination by installing a certificate to the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client software for LDAP/LDAPS.

Prerequisites

This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.

Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don't have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.

Create a VPC, subnets, and a security group

Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets with different Availability Zones.

In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks between the cluster and directory service, make sure security groups and route tables are configured so that they can communicate with each other.

Create AWS Managed Microsoft AD on Directory Service

Complete the following steps to set up your directory:

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory type, select AWS Managed Microsoft AD.
  4. Choose Next.
  5. For Edition, select Standard Edition.
  6. For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
  7. For Admin password, set a password and save it for later use.
  8. Choose Next.
  9. In the Networking section, specify the VPC and two private subnets you created.
  10. Choose Next.
  11. Review the configuration and pricing, then choose Create directory.
    The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
  12. When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
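
The directory DNS name you choose here determines the LDAP base DN that the cluster configuration uses later (for example, hyperpod.abc123.com corresponds to dc=hyperpod,dc=abc123,dc=com). The following is a minimal sketch of that mapping in Python. The function names are illustrative and are not part of any AWS tooling.

```python
def dns_name_to_base_dn(dns_name: str) -> str:
    """Convert a directory DNS name into an LDAP base DN made of dc= components."""
    labels = [label for label in dns_name.strip(".").split(".") if label]
    if not labels:
        raise ValueError("directory DNS name must not be empty")
    return ",".join(f"dc={label}" for label in labels)


def short_name(dns_name: str) -> str:
    """Return the first label of the DNS name (the part used as the logon prefix)."""
    return dns_name.strip(".").split(".")[0]


if __name__ == "__main__":
    print(dns_name_to_base_dn("hyperpod.abc123.com"))
    print(short_name("hyperpod.abc123.com"))
```

Keeping this derivation in one place can help avoid typos when the same name is reused in the search base and bind DN.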

Create an NLB in front of Directory Service

To create the NLB, complete the following steps:

  1. On the Amazon EC2 console, choose Target groups in the navigation pane.
  2. Choose Create target group.
  3. Create a target group with the following parameters:
    1. For Choose a target type, select IP addresses.
    2. For Target group name, enter LDAP.
    3. For Protocol: Port, choose TCP and enter 389.
    4. For IP address type, select IPv4.
    5. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    6. For Health check protocol, choose TCP.
  4. Choose Next.
  5. In the Register targets section, register the directory service's DNS addresses as the targets.
  6. For Ports, choose Include as pending below.
    The addresses are added in the Review targets section with Pending status.
  7. Choose Create target group.
  8. On the Load Balancers console, choose Create load balancer.
  9. Under Network Load Balancer, choose Create.
  10. Configure an NLB with the following parameters:
    1. For Load balancer name, enter a name (for example, nlb-ds).
    2. For Scheme, select Internal.
    3. For IP address type, select IPv4.
    4. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    5. Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
    6. For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  11. In the Listeners and routing section, specify the following parameters:
    1. For Protocol, choose TCP.
    2. For Port, enter 389.
    3. For Default action, choose the target group named LDAP.

    Here, we are adding a listener for LDAP. We will add LDAPS later.

  12. Choose Create load balancer.
    Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
  13. When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.

Create a self-signed certificate and import it to Certificate Manager

To create a self-signed certificate, complete the following steps:

  1. In your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
    $ openssl genrsa 2048 > ldaps.key
    
    $ openssl req -new -key ldaps.key -out ldaps_server.csr
    
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:US
    State or Province Name (full name) [Some-State]:Washington
    Locality Name (eg, city) []:Bellevue
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
    Organizational Unit Name (eg, section) []:OrgName
    Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
    Email Address []:your@email.address.com

    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:
    
    $ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
    
    Certificate request self-signature ok
    subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com
    
    $ chmod 600 ldaps.key

  2. On the Certificate Manager console, choose Import.
  3. Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
  4. Choose Next.
  5. Add any optional tags, then choose Next.
  6. Review the configuration and choose Import.
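
In the example above, the certificate's Common Name is the NLB DNS name, which is also the host that the cluster later uses in its LDAPS URI. Assuming the LDAP client compares the two, a mismatch is a plausible cause of connection failures, so a quick check before importing may save time. The following sketch parses the subject line printed by OpenSSL and compares its CN with the host in a URI. The helper names are hypothetical, and the parsing assumes the comma-separated subject format shown above.

```python
from urllib.parse import urlparse


def common_name(subject_line: str):
    """Extract the CN value from an OpenSSL 'subject=...' line, or None if absent."""
    text = subject_line.strip()
    if text.startswith("subject="):
        text = text[len("subject="):]
    for part in text.split(","):
        key, _, value = part.partition("=")
        if key.strip() == "CN":
            return value.strip()
    return None


def cn_matches_uri(subject_line: str, ldap_uri: str) -> bool:
    """Return True if the certificate CN equals the host name in the LDAP URI."""
    cn = common_name(subject_line)
    host = urlparse(ldap_uri).hostname  # urlparse lowercases the host name
    return cn is not None and host is not None and cn.lower() == host


if __name__ == "__main__":
    subject = ("subject=C = US, ST = Washington, L = Bellevue, O = CorpName, "
               "OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com")
    print(cn_matches_uri(subject, "ldaps://nlb-ds-abcd1234.elb.region.amazonaws.com"))
```

The subject line can be obtained by running openssl x509 -noout -subject -in ldaps.crt.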

Add an LDAPS listener

We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:

  1. On the Load Balancers console, navigate to the NLB details page.
  2. On the Listeners tab, choose Add listener.
  3. Configure the listener with the following parameters:
    1. For Protocol, choose TLS.
    2. For Port, enter 636.
    3. For Default action, choose LDAP.
    4. For Certificate source, select From ACM.
    5. For Certificate, enter what you imported in ACM.
  4. Choose Add.
    Now the NLB listens to both LDAP and LDAPS. We recommend deleting the LDAP listener because, unlike LDAPS, it transmits data without encryption.
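
To confirm which listeners are reachable, you can try a plain TCP connection to ports 389 and 636 from a machine inside the VPC (the NLB is internal, so this will not work from outside). The following is a minimal sketch using only the Python standard library. The function names are illustrative, and a successful connection only shows that the listener accepts TCP connections, not that LDAP or TLS is working end to end.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_listeners(host: str, ports=(389, 636)) -> dict:
    """Map each port to whether a TCP connection could be opened."""
    return {port: port_is_open(host, port) for port in ports}


# Example call (replace the host with the NLB DNS name you noted earlier):
# print(check_listeners("xyzxyz.elb.region-name.amazonaws.com"))
```

After you delete the LDAP listener, you would expect port 389 to be reported as closed and port 636 as open.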

Create an EC2 Windows instance to administer users and groups in the AD

To create and maintain users and groups in the AD, complete the following steps:

  1. On the Amazon EC2 console, choose Instances in the navigation pane.
  2. Choose Launch instances.
  3. For Name, enter a name for your instance.
  4. For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
  5. For Instance type, choose t2.micro.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose either of the two subnets you created with the CloudFormation template.
    3. For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  7. For Configure storage, set storage to 30 GB gp2.
  8. In the Advanced details section, for Domain join directory, choose the AD you created.
  9. For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
  10. Review the summary and choose Launch instance.

Create users and groups in AD using the EC2 Windows instance

With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. We recommend an RDP client over a browser-based Remote Desktop so that you can exchange clipboard contents with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.

If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.

  1. When the Windows desktop screen opens, choose Server Manager from the Start menu.
  2. Choose Local Server in the navigation pane, and confirm that the domain is what you specified to the directory service.
  3. On the Manage menu, choose Add Roles and Features.
  4. Choose Next until you are on the Features page.
  5. Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
  6. Choose Next and Install.
    Feature installation starts.
  7. When the installation is complete, choose Close.
  8. Open Active Directory Users and Computers from the Start menu.
  9. Under hyperpod.abc123.com, expand hyperpod.
  10. Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
  11. Create an organizational unit called Groups.
  12. Choose (right-click) Groups, choose New, and choose Group.
  13. Create a group called ClusterAdmin.
  14. Create a second group called ClusterDev.
  15. Choose (right-click) Users, choose New, and choose User.
  16. Create a new user.
  17. Choose (right-click) the user and choose Add to a group.
  18. Add your users to the groups ClusterAdmin or ClusterDev.
    Users added to the ClusterAdmin group will have sudo privilege on the cluster.

Create a ReadOnly user in AD

Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.

Take note of the password for later use.

(For SSH public key authentication) Add SSH public keys to users

By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH's ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.

  1. In Active Directory Users and Computers, on the View menu, enable Advanced Features.
  2. Open the Properties dialog of the user.
  3. On the Attribute Editor tab, choose altSecurityIdentities and choose Edit.
  4. For Value to add, choose Add.
  5. For Values, add an SSH public key.
  6. Choose OK.
    Confirm that the SSH public key appears as an attribute.
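
Because the key is pasted by hand into the attribute editor, a truncated or mangled value is easy to miss. The following sketch checks that a line has the usual OpenSSH public key shape (key type, base64 data, optional comment) and that the key type encoded inside the data matches the declared one. The function name is illustrative, and the check does not verify the cryptographic validity of the key.

```python
import base64
import binascii
import struct


def parse_ssh_public_key(line: str):
    """Return (key_type, comment) if the line looks like an OpenSSH public key.

    Raises ValueError when the line is malformed.
    """
    fields = line.strip().split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64> [comment]'")
    key_type, b64_blob = fields[0], fields[1]
    comment = fields[2] if len(fields) == 3 else ""

    try:
        blob = base64.b64decode(b64_blob, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("key data is not valid base64") from exc

    # The decoded data starts with a 4-byte big-endian length and the key type name.
    if len(blob) < 4:
        raise ValueError("key data is too short")
    (name_len,) = struct.unpack(">I", blob[:4])
    embedded = blob[4:4 + name_len].decode("ascii", errors="replace")
    if embedded != key_type:
        raise ValueError("declared key type does not match the encoded key data")
    return key_type, comment
```

You could run this against the contents of the .pub file produced by ssh-keygen before pasting it into the attribute.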

Get an obfuscated password for the ReadOnly user

To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).

Install the sssd-tools package on the Linux machine to install the Python module pysss for obfuscation:

# Ubuntu
$ sudo apt install sssd-tools

# Amazon Linux
$ sudo yum install sssd-tools

Run the following one-line Python script. Enter the password of the ReadOnly user. You will get the obfuscated password.

$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password)
AAAQACK2....

Create a HyperPod cluster with an SSSD-enabled lifecycle script

Next, you create a HyperPod cluster with LDAPS/Active Directory integration.

  1. Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
    1. Set True for enable_sssd to enable setting up SSSD.
    2. The SssdConfig class contains configuration parameters for SSSD.
    3. Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
    # Basic configuration parameters
    class Config:
             :
        # Set true if you want to install SSSD for ActiveDirectory/LDAP integration.
        # You need to configure parameters in SssdConfig as well.
        enable_sssd = True
    # Configuration parameters for ActiveDirectory/LDAP/SSSD
    class SssdConfig:

        # Name of domain. Can be default if you are not sure.
        domain = "default"

        # Comma separated list of LDAP server URIs
        ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"

        # The default base DN to use for performing LDAP user operations
        ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"

        # The default bind DN to use for performing LDAP operations
        ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"

        # "password" or "obfuscated_password". Obfuscated password is recommended.
        ldap_default_authtok_type = "obfuscated_password"

        # You need to modify this parameter with the obfuscated password, not plain text password
        ldap_default_authtok = "placeholder"

        # SSH authentication method - "password" or "publickey"
        ssh_auth_method = "publickey"

        # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
        override_homedir = "/fsx/%u"

        # Group names to accept SSH login
        ssh_allow_groups = {
            "controller" : ["ClusterAdmin", "ubuntu"],
            "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
            "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        }
    
        # Group names for sudoers
        sudoers_groups = {
            "controller" : ["ClusterAdmin", "ClusterDev"],
            "compute" : ["ClusterAdmin", "ClusterDev"],
            "login" : ["ClusterAdmin", "ClusterDev"],
        }
    

  2. Copy the certificate file ldaps.crt to the same directory (where config.py exists).
  3. Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
  4. Wait until the status changes to InService.
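
Because a mistake in these settings may only surface after the cluster has been created, it can be worth checking the values before uploading. The following sketch validates a few of the SssdConfig properties passed in as a plain dictionary. The function name and the individual checks are assumptions made for illustration (for example, that the bind DN sits under the search base, as it does in the example above), and they are not part of the lifecycle scripts.

```python
from urllib.parse import urlparse


def find_config_problems(cfg: dict) -> list:
    """Return a list of human-readable problems found in SSSD-related settings."""
    problems = []

    # ldap_uri may hold several comma-separated URIs; each should use ldaps://.
    uris = [u.strip() for u in cfg.get("ldap_uri", "").split(",") if u.strip()]
    if not uris:
        problems.append("ldap_uri is empty")
    for uri in uris:
        parsed = urlparse(uri)
        if parsed.scheme != "ldaps" or not parsed.hostname:
            problems.append(f"ldap_uri entry is not an ldaps:// URI: {uri}")

    # Assumption: the bind DN is located under the search base, as in the example.
    base = cfg.get("ldap_search_base", "").replace(" ", "").lower()
    bind = cfg.get("ldap_default_bind_dn", "").replace(" ", "").lower()
    if not base or not bind.endswith(base):
        problems.append("ldap_default_bind_dn is not under ldap_search_base")

    # The shipped placeholder must be replaced with the obfuscated password.
    if cfg.get("ldap_default_authtok", "placeholder") in ("", "placeholder"):
        problems.append("ldap_default_authtok still holds the placeholder")

    if cfg.get("ssh_auth_method") not in ("password", "publickey"):
        problems.append("ssh_auth_method must be 'password' or 'publickey'")

    return problems
```

An empty list means none of these particular checks failed. It does not guarantee that SSSD can bind to the directory.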

Verification

Let's verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can't directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.

Option 1: SSH login through AWS Systems Manager

You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format of sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user's SSH private key. This field is not required if you chose password authentication.

Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p
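
If you maintain entries for several nodes or users, assembling the target name by hand is error-prone. The following sketch builds the Systems Manager target name in the format described above and renders a matching host entry. The function names and the default profile are illustrative choices, not part of the AWS CLI or HyperPod tooling.

```python
def ssm_target(cluster_id: str, instance_group: str, instance_id: str) -> str:
    """Build the target name: sagemaker-cluster:[cluster-id]_[group]-[instance-id]."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"


def ssh_config_entry(alias: str, target: str, user: str, region: str,
                     profile: str = "default", identity_file: str = None) -> str:
    """Render an ~/.ssh/config host entry that proxies SSH through Systems Manager."""
    lines = [
        f"Host {alias}",
        f"    HostName {target}",
        f"    User {user}",
    ]
    # IdentityFile is only needed for public key authentication.
    if identity_file:
        lines.append(f"    IdentityFile {identity_file}")
    lines.append(
        f"    ProxyCommand aws --profile {profile} --region {region} "
        "ssm start-session --target %h "
        "--document-name AWS-StartSSHSession --parameters portNumber=%p"
    )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    target = ssm_target("abcd1234", "LoginGroup", "i-01234567890abcdef")
    print(ssh_config_entry("MyCluster-LoginNode", target, "user1", "us-west-2",
                           identity_file="~/keys/my-cluster-ssh-key.pem"))
```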

Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.

$ ssh MyCluster-LoginNode
   :
   :
   ____              __  ___     __             __ __                  ___          __
  / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
 _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
/___/_,_/_, /__/_/  /_/_,_/_/___/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
         /___/                                    /___/_/
You are on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$

At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}

For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
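
If you manage more than one cluster, Region, or account, you may prefer to generate this policy rather than edit the ARNs by hand. The following sketch builds the same policy document from three parameters. The function name is illustrative, and the example values mirror the placeholders used in the policy above.

```python
import json


def ssh_only_session_policy(region: str, account_id: str, cluster_id: str) -> dict:
    """Build an IAM policy that allows sessions only through AWS-StartSSHSession."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession", "ssm:TerminateSession"],
                "Resource": [
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    f"arn:aws:ssm:{region}:{account_id}:document/AWS-StartSSHSession",
                ],
                "Condition": {
                    "BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = ssh_only_session_policy("us-west-2", "123456789012", "abcd1234efgh")
    print(json.dumps(policy, indent=4))
```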

Option 2: SSH login through bastion host

Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn't have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.

  1. Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
  2. Update the security group for the cluster to allow inbound SSH access from the bastion security group.
  3. Create an EC2 Linux instance.
  4. For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  5. For Instance type, choose t3.small.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose the public subnet you created with the CloudFormation template.
    3. For Common security groups, choose the bastion security group you created.
  7. For Configure storage, set storage to 8 GB.
  8. Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config, by referring to the following example:
    Host Bastion
        HostName 11.22.33.44
        User ubuntu
        IdentityFile ~/keys/my-bastion-ssh-key.pem

    Host MyCluster-LoginNode-with-Proxy
        HostName 10.1.111.222
        User user1
        IdentityFile ~/keys/my-cluster-ssh-key.pem
        ProxyCommand ssh -q -W %h:%p Bastion

  9. Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
    $ ssh MyCluster-LoginNode-with-Proxy
       :
       :
       ____              __  ___     __             __ __                  ___          __
      / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
     _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
    /___/_,_/_, /__/_/  /_/_,_/_/___/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
             /___/                                    /___/_/
    You are on the controller
    Instance Type: ml.m5.xlarge
    user1@ip-10-1-111-222:~$

Clean up

Clean up the resources in the following order:

  1. Delete the HyperPod cluster.
  2. Delete the Network Load Balancer.
  3. Delete the load balancing target group.
  4. Delete the certificate imported to Certificate Manager.
  5. Delete the EC2 Windows instance.
  6. Delete the EC2 Linux instance for the bastion host.
  7. Delete the AWS Managed Microsoft AD.
  8. Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.

Conclusion

This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.

For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.


About the Authors

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he applies his in-depth skills to cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.

Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing almost a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.
