Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans


Today, organizations are constantly seeking ways to use advanced large language models (LLMs) for their specific needs. These organizations are engaging in both pre-training and fine-tuning massive LLMs, with parameter counts in the billions. This process aims to enhance model efficacy for a wide array of applications across diverse sectors, including healthcare, financial services, and marketing. However, customizing these larger models requires access to the latest accelerated compute resources.

In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans, which can bring down your training cluster procurement wait time. A training plan provides simple and predictable access to accelerated compute resources (supporting P4d, P5, P5e, P5en, and trn2 as of the time of writing), allowing you to use this compute capacity to run model training on either Amazon SageMaker training jobs or SageMaker HyperPod.

We guide you through a step-by-step implementation of how you can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to find, review, and create optimal training plans for your specific compute and timeline needs. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.

You can check out the launch of this new feature in Meet your training timelines and budget with new Amazon SageMaker HyperPod flexible training plans.

Business challenges

As organizations strive to harness the power of LLMs for competitive advantage, they face a significant hurdle: securing sufficient and reliable compute capacity for model training. The scale of these models demands cutting-edge accelerated compute hardware. However, the high cost and limited availability of such resources create a bottleneck for many businesses. This scarcity not only impacts timelines, but also stretches budgets, potentially delaying critical AI initiatives. As a result, organizations are seeking solutions that can provide consistent, scalable, and cost-effective access to high-performance computing resources, enabling them to train and fine-tune LLMs without compromising on speed or quality.

Solution overview

SageMaker HyperPod training plans, a new SageMaker capability, address this challenge by offering you a simple-to-use console UI or AWS CLI experience to search, review, create, and manage training plans.

Capacity provisioned by SageMaker training plans can be used with either SageMaker training jobs or SageMaker HyperPod. If you want to focus on model development rather than infrastructure management and prefer ease of use with a managed experience, SageMaker training jobs are an excellent choice. For organizations requiring granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal solution. To better understand these services and choose the one most appropriate for your use case, refer to Generative AI foundation model training on Amazon SageMaker, which provides detailed information about both options.

The following diagram provides an overview of the main steps involved in requesting capacity using SageMaker training plans for SageMaker training jobs.

Workflow for securing training plans

Figure 1: The main steps involved in procuring capacity through SageMaker HyperPod training plans. Note: This workflow arbitrarily uses SageMaker training jobs as the target; you may choose to use SageMaker HyperPod too.

At a high level, the steps to create a training plan are as follows:

  1. Search the training plans that best match your capacity requirements, such as instance type, instance count, start time, and duration. SageMaker finds the optimal plans across multiple segments.
  2. After reviewing the available training plan options, you can reserve the plan that meets your requirements.
  3. Schedule your SageMaker training jobs by using a training plan with a training-job target resource. Note that we use training-job only for illustration purposes. You can also use hyperpod-cluster as your target resource.
  4. Describe and list your existing training plans. When the capacity is available, it will be allocated to the scheduled training job.

In the following sections, we shift our focus to the solution walkthrough associated with training plans.

Prerequisites

Complete the following prerequisite steps:

  1. If you’re using an AWS Identity and Access Management (IAM) user for this solution, make sure that your user has the AmazonSageMakerFullAccess policy attached to it. To learn more about how to attach a policy to an IAM user, see Adding IAM identity permissions (console).
  2. If you’re setting up the AWS CLI for the first time, follow the instructions at Getting started with the AWS CLI.
  3. If you choose to use the AWS CLI, make sure you are on the most up-to-date AWS CLI version.

Create a training plan

In this post, we discuss two ways to create a training plan: using the SageMaker console or the AWS CLI.

Create a SageMaker training plan using the SageMaker console

The SageMaker console user experience for creating a training plan is similar for both training jobs and SageMaker HyperPod. In this post, for demonstration purposes, we show how to create a training plan for a SageMaker HyperPod cluster.

  1. On the SageMaker console, choose Training plans in the navigation pane.
  2. Create a new training plan.
  3. For Target, select HyperPod cluster.
  4. Under Instance attributes, specify your instance type (ml.p5.48xlarge) and instance count (16).
  5. Under Date settings to search for an available plan, choose your preferred training date and duration (for example, 10 days).
  6. Choose Find training plan.

Figure 2: You can search for available training plan offerings through the SageMaker console. Choose your target, select your instance type and count, and specify the duration.

SageMaker suggests a training plan that is split into two 5-day segments. This includes the total upfront price for the plan as well as the estimated data transfer cost based on the data location you provided.

Figure 3: SageMaker suggests a training plan based on your inputs. In this example, SageMaker suggests a training plan split across two 5-day segments. You will also see the total upfront price.

  7. Review and purchase your plan.

Figure 4: Once you’re happy with your selection, you can review and purchase your training plan.

After you create the training plan, you can see the list of training plans created. The plan initially enters a Pending state, awaiting payment. Once the payment is processed (unless the payment cycle has changed), the plan transitions to the Scheduled state. At this point, you can begin queuing jobs or creating clusters using the plan. On the plan’s start date, it becomes Active, and resources are allocated. Your training tasks can then start running (pending resource availability).

Make sure you pay for the training plan using the AWS Billing and Cost Management console so that your plan shows up on your SageMaker console. You will receive an invoice to resolve before being able to proceed.

Figure 5: You can list your training plans on the SageMaker console. You can start using your plan once it transitions to the Active state.
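The plan lifecycle described above can be summarized in a few lines of Python. This is a minimal sketch that models only the three states named in this post (Pending, Scheduled, and Active); the function names are hypothetical and are not part of any AWS SDK.

```python
# Illustrative model of the training plan states described in this post.
# The helper names are hypothetical; they are not part of any AWS SDK.

def can_queue_work(status: str) -> bool:
    """Jobs can be queued or clusters created once the plan is Scheduled or Active."""
    return status in ("Scheduled", "Active")


def resources_allocated(status: str) -> bool:
    """Compute resources are allocated only once the plan is Active."""
    return status == "Active"


for status in ("Pending", "Scheduled", "Active"):
    print(status, can_queue_work(status), resources_allocated(status))
```

A check like this can be useful in automation that should wait for payment to clear before it tries to attach the plan to a job or cluster.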

Create a SageMaker training plan using the AWS CLI

Complete the following steps to create a training plan using the AWS CLI:

  1. Start by calling the SearchTrainingPlanOfferings API, passing your capacity requirements as input parameters, to search for all matching training plan offerings.

The following example searches for training plan offerings suitable for two ml.p5.48xlarge instances for 96 hours in the us-west-2 Region. In this example, we also filter on the time frame in which we want to use the training plan, and we use the target-resources parameter to filter for training plans that can be used for SageMaker HyperPod cluster workloads:

# Required: instance type and instance count, target resources, region
# Optional: duration hours, start time after, and end time before.

aws sagemaker search-training-plan-offerings \
  --region "us-west-2" \
  --instance-type 'ml.p5.48xlarge' \
  --instance-count 2 \
  --target-resources 'hyperpod-cluster' \
  --duration-hours 96 \
  --start-time-after "2025-01-01T00:00:00" \
  --end-time-before "2025-12-31T23:59:59"

Each TrainingPlanOffering returned in the response is identified by a unique TrainingPlanOfferingId. The first offering in the list represents the best match for your requirements. In this case, the SageMaker SearchTrainingPlanOfferings API returns a single available TrainingPlanOffering that matches the specified capacity requirements:

{
    'TrainingPlanOfferings': [
      { 
          'TrainingPlanOfferingId': 'tpo-abc123',
          'TargetResources': ['hyperpod-cluster'],
          'RequestedStartTimeAfter': 
          datetime.datetime(2024, 11, 18, 11, 40, 47, 928000, tzinfo=tzlocal()),
          'DurationHours': 96,
          'DurationMinutes': 0,
          'Upfront': 'xx.yy',
          'CurrencyCode': 'USD',
          'ReservedCapacityOfferings': [
            {
                'InstanceType': 'ml.p5.48xlarge',
                'InstanceCount': 2,
                'AvailabilityZone': 'us-east-1a',
                'DurationHours': 96,
                'DurationMinutes': 0,
                'StartTime': datetime.datetime(2024, 11, 21, 3, 30, tzinfo=tzlocal()),
                'EndTime': datetime.datetime(2024, 11, 22, 3, 30, tzinfo=tzlocal())
            }
          ]
      }
    ]
}

Make sure that your SageMaker HyperPod training job subnets are in the same Availability Zone as your training plan.
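If you process the search response in Python rather than reading it by eye, the selection logic might look like the following. This is a minimal sketch that works on a plain dictionary shaped like the response shown above (trimmed to the fields it needs); the helper names pick_offering and offering_zones are hypothetical, and no AWS call is made.

```python
from typing import Dict, List


def pick_offering(response: Dict) -> Dict:
    """Return the first offering, which this post describes as the best match."""
    offerings = response.get("TrainingPlanOfferings", [])
    if not offerings:
        raise ValueError("no training plan offerings matched the search")
    return offerings[0]


def offering_zones(offering: Dict) -> List[str]:
    """Collect the Availability Zones of the reserved capacity segments, in order."""
    zones: List[str] = []
    for segment in offering.get("ReservedCapacityOfferings", []):
        zone = segment["AvailabilityZone"]
        if zone not in zones:
            zones.append(zone)
    return zones


# Trimmed copy of the sample response from this post.
sample_response = {
    "TrainingPlanOfferings": [
        {
            "TrainingPlanOfferingId": "tpo-abc123",
            "TargetResources": ["hyperpod-cluster"],
            "ReservedCapacityOfferings": [
                {
                    "InstanceType": "ml.p5.48xlarge",
                    "InstanceCount": 2,
                    "AvailabilityZone": "us-east-1a",
                }
            ],
        }
    ]
}

best = pick_offering(sample_response)
print(best["TrainingPlanOfferingId"], offering_zones(best))
```

The zones returned by offering_zones are the ones your subnets need to match before you reserve the plan.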

  2. After you choose the training plan that best fits your schedule and requirements, you can reserve it by calling the CreateTrainingPlan API as follows:
# Required: training-plan-offering-id, training-plan-name
# Optional: target-services (leverages training-job by default)
aws sagemaker create-training-plan \
  --training-plan-offering-id "tpo-abc123" \
  --training-plan-name "p5-training-plan" \
  --region "us-west-2"

You will see an output that looks like the following:

{
    "TrainingPlanArn":"arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan"
}

After you create the training plan, you will need to pay for it. Be on the lookout for an invoice. You can also find it on the AWS Billing and Cost Management console.

  3. You can list all the training plans that are created in your AWS account (and Region) by calling the ListTrainingPlans API:
aws sagemaker list-training-plans

This gives you a summary of the training plans in your account. After you have your training plan (the newly created p5-training-plan), you can check its details using either the console or the DescribeTrainingPlan API as follows:

export TRAINING_PLAN="p5-training-plan"
TRAINING_PLAN_DESCRIPTION=$(aws sagemaker describe-training-plan --training-plan-name "$TRAINING_PLAN")
echo "$TRAINING_PLAN_DESCRIPTION"

# Picking out individual parameters from the DescribeTrainingPlan API
TRAINING_PLAN_ARN=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TrainingPlanArn')
AVAILABLE_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.AvailableInstanceCount')
TOTAL_INSTANCE_COUNT=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.TotalInstanceCount')

# Note: You may have multiple AZs for your TrainingPlans, so adjust the jq command below accordingly!
TRAINING_PLAN_AZ=$(echo "$TRAINING_PLAN_DESCRIPTION" | jq -r '.ReservedCapacitySummaries[0].AvailabilityZone')
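The same fields can be extracted in Python when jq is not available. The following is a minimal sketch that parses a JSON string shaped like the DescribeTrainingPlan output used above; the function name summarize_plan is hypothetical and the sample values are placeholders. Unlike the jq command, it collects every Availability Zone rather than only the first.

```python
import json


def summarize_plan(description_json: str) -> dict:
    """Extract the ARN, instance counts, and all Availability Zones from the JSON."""
    plan = json.loads(description_json)
    zones = sorted(
        {summary["AvailabilityZone"] for summary in plan.get("ReservedCapacitySummaries", [])}
    )
    return {
        "arn": plan["TrainingPlanArn"],
        "available": plan["AvailableInstanceCount"],
        "total": plan["TotalInstanceCount"],
        "zones": zones,
    }


# Placeholder description; real output contains more fields.
sample_description = json.dumps(
    {
        "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan",
        "AvailableInstanceCount": 2,
        "TotalInstanceCount": 2,
        "ReservedCapacitySummaries": [{"AvailabilityZone": "us-west-2a"}],
    }
)

summary = summarize_plan(sample_description)
print(summary["arn"], summary["available"], summary["total"], summary["zones"])
```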

Use a training plan with SageMaker HyperPod

When your training plan status transitions to Scheduled, you can use it for new instance groups in either a new or existing SageMaker HyperPod cluster. You can use the CreateCluster and UpdateCluster APIs to create a new SageMaker HyperPod cluster with your training plan or to update an existing cluster, respectively. You can also choose to use the SageMaker console directly.

For a given SageMaker HyperPod cluster, training plans are attached at the instance group level, separately for each instance group. If desired, one SageMaker HyperPod cluster can have multiple training plans attached to multiple instance groups. You always have the option to omit a training plan and instead continue using On-Demand capacity as before for other instance groups. However, you can’t mix training plan capacity with On-Demand capacity within the same instance group. You can also choose to have a partial cluster launch for every instance group. This means that even if all the requested capacity isn’t available, you can still spin up a cluster with the capacity already available to you.

A training plan is active during the time window in which the TrainingPlanOfferings within it are scheduled to start and stop. Each time a TrainingPlanOffering starts, instance groups automatically scale up to the specified count, and the instance group TrainingPlanStatus shows as Active. When a TrainingPlanOffering is scheduled to stop, your cluster’s instance groups automatically scale down to zero, and the instance group TrainingPlanStatus shows as Expired.
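To make the scale-up and scale-down behavior concrete, the following sketch computes how many instances an instance group would be expected to have at a given moment. It is only a model of the behavior described above, not the service's implementation: the function name is hypothetical, the segment dictionaries reuse the StartTime, EndTime, and InstanceCount field names from the search response, and treating the end time as exclusive is an assumption.

```python
from datetime import datetime, timedelta


def expected_instance_count(segments, when):
    """Return the instance count expected at `when`, or 0 outside every segment."""
    for segment in segments:
        # Assumption: a segment covers its start time but not its end time.
        if segment["StartTime"] <= when < segment["EndTime"]:
            return segment["InstanceCount"]
    return 0


# Hypothetical plan: two 5-day segments separated by a 2-day gap.
plan_start = datetime(2025, 1, 1)
segments = [
    {"InstanceCount": 2, "StartTime": plan_start, "EndTime": plan_start + timedelta(days=5)},
    {
        "InstanceCount": 2,
        "StartTime": plan_start + timedelta(days=7),
        "EndTime": plan_start + timedelta(days=12),
    },
]

print(expected_instance_count(segments, plan_start + timedelta(days=1)))  # inside segment 1
print(expected_instance_count(segments, plan_start + timedelta(days=6)))  # in the gap
```

With a split plan like this one, jobs need to checkpoint before the first segment ends, because the group is expected to drop to zero instances during the gap.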

Use a training plan with SageMaker HyperPod on the console

You can choose to either create a new cluster and create an instance group, or edit an existing cluster and edit an existing instance group. In the configuration, choose the same instance type that was chosen for a training plan and specify the desired instance count. The Instance capacity option appears only when you choose an instance type that is supported for training plans. Choose the dropdown menu to scroll through valid training plans. The available training plan options are listed by name and are filtered for only those that match the selected instance type, that have at least the specified instance count, that were created with hyperpod-cluster as the target resource, and that currently have a status of Scheduled or Active. Double-check these conditions if you don’t see an expected training plan name, and make sure that the expected training plan was created in the same account and in the same Region. The default selection is to use no training plan. Repeat the process for each instance group that should have a training plan.

HyperPod console training plans

Figure 6: You can create an instance group for a SageMaker HyperPod cluster with the instances in your training plan. Make sure to choose the right training plan listed under “Instance capacity”.
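If an expected plan is missing from the dropdown, it can help to walk through the filter conditions explicitly. The following is a minimal sketch of those four conditions; it uses a simplified, hypothetical record per plan (Name, InstanceType, InstanceCount, TargetResources, Status) rather than the exact shape of any API response.

```python
def eligible_plans(plans, instance_type, instance_count):
    """Return the names of plans that satisfy the console's filter conditions."""
    return [
        plan["Name"]
        for plan in plans
        if plan["InstanceType"] == instance_type
        and plan["InstanceCount"] >= instance_count
        and "hyperpod-cluster" in plan["TargetResources"]
        and plan["Status"] in ("Scheduled", "Active")
    ]


# Hypothetical plans, one of which fails each kind of condition.
plans = [
    {"Name": "p5-training-plan", "InstanceType": "ml.p5.48xlarge", "InstanceCount": 2,
     "TargetResources": ["hyperpod-cluster"], "Status": "Scheduled"},
    {"Name": "p5-jobs-plan", "InstanceType": "ml.p5.48xlarge", "InstanceCount": 4,
     "TargetResources": ["training-job"], "Status": "Active"},
    {"Name": "p5-unpaid-plan", "InstanceType": "ml.p5.48xlarge", "InstanceCount": 2,
     "TargetResources": ["hyperpod-cluster"], "Status": "Pending"},
    {"Name": "p4d-plan", "InstanceType": "ml.p4d.24xlarge", "InstanceCount": 8,
     "TargetResources": ["hyperpod-cluster"], "Status": "Active"},
]

print(eligible_plans(plans, "ml.p5.48xlarge", 2))
```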

Use a training plan with SageMaker HyperPod with the AWS CLI

Complete the following steps to use your training plan with the AWS CLI:

  1. Create a SageMaker HyperPod cluster from scratch. For instructions, refer to the Amazon SageMaker HyperPod workshop or the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

The following cluster configuration file defines a SageMaker HyperPod SLURM cluster named ml-cluster. The steps for using training plans are the same, regardless of whether you choose SLURM or Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator. This cluster contains an instance group named controller-machine with 1 ml.m5.12xlarge instance as the head node of a SLURM cluster, and it does not use a training plan for the controller-machine instance group. We also define a worker instance group named worker-group-1 that specifies 2 ml.p5.48xlarge instances, which are sourced from your training plan. Note the line "TrainingPlanArn"; this is where you specify your training plan by the full Amazon Resource Name (ARN). If you followed the steps in the prior sections, this should be the value of the environment variable TRAINING_PLAN_ARN. The following cluster configuration also skips some configuration parameters, such as VpcConfig and InstanceStorageConfig. Refer to the workshop or the following script for a complete SageMaker HyperPod cluster configuration file.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
          "InstanceGroupName": "controller-machine",
          "InstanceType": "ml.m5.12xlarge",
          "InstanceCount": 1,
          ...
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 2,
        "TrainingPlanArn": "<ENTER TRAINING PLAN ARN HERE>",
        ...
      }
    ],
    ...
}
EOL

You can then create the cluster using the following code:

aws sagemaker create-cluster \
  --cli-input-json file://cluster-config.json \
  --region $AWS_REGION

The next steps assume that you already have a SageMaker HyperPod cluster created. This section is relevant if you’d like to add an instance group that uses your training plan reserved instances to your existing cluster.

  2. To update an existing cluster, you can define another file called update-cluster-config.json as follows. If you followed the instructions in the workshop to provision the cluster, you can use the provided create_config.sh to get the values for your env_vars before sourcing them.
# Source environment variables
source env_vars

# Create additional worker group configuration
additional_worker_group=$(cat <<EOF
{
    "InstanceGroupName": "worker-group-2",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 2,
    "TrainingPlanArn": "<ENTER TRAINING PLAN ARN HERE>",
    ...
}
EOF
)

# Copy cluster-config.json to a temporary file
cp cluster-config.json temp-cluster-config.json

# Add additional worker group and remove VpcConfig section
jq --argjson additional_worker_group "$additional_worker_group" '.InstanceGroups += [$additional_worker_group] | del(.VpcConfig)' temp-cluster-config.json > update-cluster-config.json

# Remove the temporary file
rm temp-cluster-config.json

In this file, we define an additional worker group named worker-group-2 consisting of two ml.p5.48xlarge instances. Again, notice the line “TrainingPlanArn”; this is where you specify your training plan by the full ARN.
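For readers who prefer Python to jq, the same transformation (append an instance group, then drop VpcConfig) can be sketched as follows. The function name build_update_config is hypothetical, the sample configuration is a placeholder that omits the elided fields, and the ARN is a stand-in for your own.

```python
import copy
import json


def build_update_config(cluster_config: dict, new_group: dict) -> dict:
    """Return a copy of the config with one more instance group and no VpcConfig."""
    updated = copy.deepcopy(cluster_config)
    updated.setdefault("InstanceGroups", []).append(new_group)
    # Mirrors del(.VpcConfig) in the jq command; a no-op if the key is absent.
    updated.pop("VpcConfig", None)
    return updated


# Placeholder cluster configuration with only the fields needed here.
cluster_config = {
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "InstanceType": "ml.m5.12xlarge", "InstanceCount": 1},
    ],
    "VpcConfig": {"SecurityGroupIds": [], "Subnets": []},
}

worker_group_2 = {
    "InstanceGroupName": "worker-group-2",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 2,
    "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan",
}

update_config = build_update_config(cluster_config, worker_group_2)
print(json.dumps(update_config, indent=4))
```

Working on a deep copy leaves the original configuration untouched, which matches the shell version's use of a temporary file.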

Make sure that you also update provisioning_parameters.json, and upload the updated file to your S3 bucket for SageMaker to use while provisioning the new worker group:

  3. Because this file is uploaded to Amazon Simple Storage Service (Amazon S3) for SageMaker to use while provisioning your cluster, you must first copy that file over from Amazon S3:

aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json provisioning_parameters.json

  4. Assuming your existing cluster has a controller machine group and a worker group with an ml.g5.48xlarge, you can add the worker-group-2 entry to your existing JSON file:
{
    ...
    "controller_group": "controller-machine",
    "worker_groups": [
      {
          "instance_group_name": "worker-group-1",
          "partition_name": "ml.g5.48xlarge"
      },
      {
          "instance_group_name": "worker-group-2",
          "partition_name": "ml.p5.48xlarge"
      }
    ],
    ...
}

This step adds the new worker group that you just created, which consists of the 2 ml.p5.48xlarge nodes from your training plan.
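Editing the file by hand works, but the change can also be scripted. The following is a minimal sketch that adds a worker group entry to a dictionary shaped like the provisioning parameters shown above; the function name add_worker_group is hypothetical, and skipping a group that is already listed is a design choice of this sketch so that reruns do not create duplicates.

```python
def add_worker_group(params: dict, instance_group_name: str, partition_name: str) -> dict:
    """Return a copy of the parameters with the worker group added if it is missing."""
    updated = dict(params)
    groups = list(params.get("worker_groups", []))
    already_listed = any(g["instance_group_name"] == instance_group_name for g in groups)
    if not already_listed:
        groups.append(
            {"instance_group_name": instance_group_name, "partition_name": partition_name}
        )
    updated["worker_groups"] = groups
    return updated


# Placeholder parameters with only the fields shown in this post.
provisioning_params = {
    "controller_group": "controller-machine",
    "worker_groups": [
        {"instance_group_name": "worker-group-1", "partition_name": "ml.g5.48xlarge"},
    ],
}

new_params = add_worker_group(provisioning_params, "worker-group-2", "ml.p5.48xlarge")
print([g["instance_group_name"] for g in new_params["worker_groups"]])
```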

  5. Now you can re-upload the updated provisioning_parameters.json file to Amazon S3:
# Copy to the S3 bucket
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

  6. Now, with both cluster-config.json (now update-cluster-config.json) and provisioning_parameters.json updated, you can add the training plan nodes to the cluster:
aws sagemaker update-cluster \
  --cli-input-json file://update-cluster-config.json \
  --region $AWS_REGION

Use a training plan with a SageMaker training job

SageMaker training jobs offer two primary methods for execution: an AWS CLI command and the Python SDK. The AWS CLI approach provides direct control and is ideal for scripting, allowing you to create training jobs with a single command. The Python SDK offers a more programmatic interface, enabling seamless integration with existing Python workflows and using the high-level features in SageMaker. In this section, we look at how you can use a training plan with both options.

Run a training job on a training plan using the AWS CLI

The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan using the TrainingPlanArn field of the resource-config parameter in the create-training-job AWS CLI command:

# Create a training job
aws sagemaker create-training-job \
  --training-job-name training-job-name \
  ...
  --resource-config '{
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 8,
      "VolumeSizeInGB": 10,
      "TrainingPlanArn": "Enter training plan arn"
  }' \
  ...

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

aws sagemaker describe-training-job --training-job-name training-job-name
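If you generate the resource configuration programmatically, a small helper can keep the training plan ARN optional and make the verification step explicit. This is a minimal sketch: the function names are hypothetical, the ARN is a placeholder, and it assumes that the describe output echoes the ARN back under ResourceConfig, which you should confirm against your own output.

```python
import json


def build_resource_config(instance_type, instance_count, volume_size_gb, training_plan_arn=None):
    """Build the resource configuration, adding TrainingPlanArn only when one is given."""
    config = {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": volume_size_gb,
    }
    if training_plan_arn:
        config["TrainingPlanArn"] = training_plan_arn
    return config


def job_uses_plan(description: dict, training_plan_arn: str) -> bool:
    """Check a describe-style dict; assumes the ARN appears under ResourceConfig."""
    return description.get("ResourceConfig", {}).get("TrainingPlanArn") == training_plan_arn


plan_arn = "arn:aws:sagemaker:us-west-2:123456789123:training-plan/p5-training-plan"
resource_config = build_resource_config("ml.p5.48xlarge", 8, 10, plan_arn)

# The JSON string can be passed as the value of --resource-config.
print(json.dumps(resource_config))
```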

Run a training job on a training plan using the SageMaker Python SDK

The following example demonstrates how to create a SageMaker training job using the SageMaker Python SDK’s Training estimator. It also shows how to associate the job with a provided training plan by using the training_plan argument of the estimator object.

For more information on the SageMaker estimator, see Use a SageMaker estimator to run a training job.

Make sure that the SageMaker Python SDK is updated to the latest version.

from sagemaker.estimator import Estimator

# Create Estimator
estimator = Estimator(
    entry_point="train.py",
    image_uri="123456789123.dkr.ecr.{}.amazonaws.com/image:tag",
    role=role,
    instance_count=4,
    instance_type="ml.p5.48xlarge",
    training_plan="Enter training plan arn",
    ...
)

# Run the training job
estimator.fit(inputs=trainingInput, job_name=job_name)

After creating the training job, you can verify that it was properly assigned to the training plan by calling the DescribeTrainingJob API:

# Check job details
sagemaker_session.describe_training_job(TrainingJobName=job_name)

Clean up

To clean up your resources and avoid incurring additional charges, complete the following steps:

  1. Delete the SageMaker HyperPod cluster and associated resources such as storage, VPC, and IAM roles.
    1. If using SLURM, refer to Cleanup.
    2. If using Amazon EKS, refer to Cleanup.
  2. Delete any S3 buckets created.
  3. Make sure that the training plan created is used and completes the fulfillment lifecycle.

Conclusion

SageMaker training plans represent a significant leap forward in addressing the compute capacity challenges faced by organizations working with LLMs. By providing quick access to high-performance GPU resources, they streamline the process of model training and fine-tuning. This solution not only reduces wait times for cluster provisioning, but also offers flexibility in choosing between SageMaker training jobs and SageMaker HyperPod, catering to diverse organizational needs. Ultimately, SageMaker training plans empower businesses to overcome resource constraints and accelerate their AI initiatives, leading to more efficient and effective usage of advanced language models across various industries.

To get started with a SageMaker training plan and explore its capabilities for your specific LLM training needs, refer to Reserve capacity with training plans and try out the step-by-step implementation guide provided in this post.

Special thanks to Fei Ge, Oscar Hsu, Takuma Yoshitani, and Yiting Li for their support in the launch of this post.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Sean Smith is a Sr Specialist Solution Architect at AWS for HPC and generative AI. Prior to that, Sean worked as a Software Engineer on AWS Batch and CfnCluster, becoming the first engineer on the team that created AWS ParallelCluster.

Ty Bergstrom is a Software Engineer at Amazon Web Services. He works on the HyperPod Clusters platform for Amazon SageMaker.
