Multi-account support for Amazon SageMaker HyperPod task governance

GPUs are a precious resource; they are both short in supply and much more expensive than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (for both internal and external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than maintaining siloed infrastructure that may be underutilized.

Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.
The exact reasons and setup can vary depending on the size, structure, and requirements of the enterprise. But in general, a multi-account strategy provides better flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogeneous workloads. We use SageMaker HyperPod task governance to enable this capability.
Solution overview
SageMaker HyperPod task governance streamlines resource allocation and gives cluster administrators the capability to set up policies that maximize compute utilization in a cluster. Task governance can be used to create distinct teams with their own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team’s compute quota using role-based access control.
In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.
The following diagram illustrates the solution architecture.
In this architecture, one organization is splitting resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A’s SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access to prepared data.
Cross-account access for data scientists
When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create an AWS Identity and Access Management (IAM) role per team, called cluster access roles, which are then scoped to access only the team’s task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure that the data science members of Team A will not be able to submit tasks on behalf of Team B.
To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A. The cluster access role will have only the permissions data scientists need to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.
Next, you will need to assume the cluster access role from a role in Account B. The cluster access role in Account A will then need to have a trust policy for the data scientist role in Account B. The data scientist role is the role in Account B that will be used to assume the cluster access role in Account A. The following is an example policy statement for the data scientist role so that it can assume the cluster access role in Account A (the account ID and role name are placeholders):
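{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name>"
        }
    ]
}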
The following is an example trust policy for the cluster access role so that it allows the data scientist role to assume it (the account ID and role name are placeholders):
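{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account-B-ID>:role/<Data-Scientist-Role-Name>"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}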
The final step is to create an access entry for the team’s cluster access role in the EKS cluster. This access entry should also have an access policy, such as AmazonEKSEditPolicy, that is scoped to the namespace of the team. This makes sure that Team A users in Account B can’t launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
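As a reference, the following AWS CLI sketch shows one way to create a namespace-scoped access entry for a team’s cluster access role; the cluster name, role name, and namespace are placeholders:

# Register the team’s cluster access role with the EKS cluster
aws eks create-access-entry \
    --cluster-name <cluster-name> \
    --principal-arn arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name>

# Attach the edit policy, scoped to the team’s namespace
aws eks associate-access-policy \
    --cluster-name <cluster-name> \
    --principal-arn arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name> \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
    --access-scope type=namespace,namespaces=<team-namespace>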
For users in Account B, you can repeat the same setup for each team. You must create a unique cluster access role per team so that each access role is aligned with its associated namespace. To summarize, we use two different IAM roles:
- Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
- Cluster access role – The role in Account A used to grant access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.
Cross-account access to prepared data
In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A’s EKS cluster have access to data stored in Account C. EKS Pod Identity lets you map an IAM role to a service account in a namespace. If a pod uses the service account that has this association, Amazon EKS will set the credential environment variables in the containers of the pod.
S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They provide a way to grant fine-grained access to specific users or applications working with a shared dataset in an S3 bucket, without requiring those users or applications to have full access to the entire bucket. Permissions to an access point are granted through S3 access point policies. Each S3 Access Point is configured with an access policy specific to a use case or application. Because the HyperPod cluster in this blog post is shared by multiple teams, each team will have its own S3 access point and access point policy.
Before following these steps, make sure you have the EKS Pod Identity Add-on installed on your EKS cluster.
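If it isn’t installed yet, one way to add it with the AWS CLI (the cluster name is a placeholder):

# Install the Pod Identity agent as an EKS add-on
aws eks create-addon \
    --cluster-name <cluster-name> \
    --addon-name eks-pod-identity-agent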
- In Account A, create an IAM role that has S3 permissions (such as s3:ListBucket and s3:GetObject on the access point resource) and a trust relationship with Pod Identity; this will be your data access role. The following is an example trust policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
- In Account C, create an S3 access point by following the steps here.
- Next, configure your S3 access point to allow access to the role created in step 1. The following is an example access point policy that gives Account A permission to the access point in Account C:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>",
                "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/object/*"
            ]
        }
    ]
}
- Make sure your S3 bucket policy is updated to allow Account A access. The following is an example S3 bucket policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ],
            "Condition": {
                "StringEquals": {
                    "s3:DataAccessPointAccount": "<Account-C-ID>"
                }
            }
        }
    ]
}
- In Account A, create a pod identity association in your EKS cluster using the AWS CLI. The following is an example command (the cluster, namespace, service account, and role names are placeholders):
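# Map the data access role to the team’s service account via Pod Identity
aws eks create-pod-identity-association \
    --cluster-name <cluster-name> \
    --namespace <team-namespace> \
    --service-account <service-account-name> \
    --role-arn arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>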
- Pods accessing cross-account S3 buckets need to reference the service account name in their pod specification. The following is a minimal example pod spec that uses the service account from the association above (names are placeholders):
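apiVersion: v1
kind: Pod
metadata:
  name: s3-test-pod
  namespace: <team-namespace>
spec:
  # Service account tied to the data access role through the Pod Identity association
  serviceAccountName: <service-account-name>
  containers:
    - name: aws-cli
      image: amazon/aws-cli:latest
      command: ["sleep", "3600"]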
You can test cross-account data access by spinning up a test pod and then executing into the pod to run Amazon S3 commands. For example, assuming the test pod above and placeholder access point details:
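# List the shared dataset through the Account C access point (ARN components are placeholders)
kubectl exec -it s3-test-pod -n <team-namespace> -- \
    aws s3 ls s3://arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/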
This example shows how to create a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 must be in the same AWS Region, and the FSx for Lustre file system must be in the same Availability Zone as your SageMaker HyperPod cluster.
Conclusion
In this post, we provided guidance on how to set up cross-account access for data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance documentation.
About the Authors
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads at Expedia, and a management consultant at McKinsey.
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages of their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.