Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK
Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn't be complex for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod simplify how you can use the service's distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize your ML workflows. Developers can use the SDK's Python interface to precisely configure training and deployment parameters while maintaining the simplicity of working with familiar Python objects.
In this post, we demonstrate how to use both the CLI and SDK to train and deploy large language models (LLMs) on SageMaker HyperPod. We walk through practical examples of distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, showcasing how these tools streamline the development of production-ready generative AI applications.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
Because the use cases that we demonstrate involve training and deploying LLMs with the SageMaker HyperPod CLI and SDK, you must also install the following Kubernetes operators in the cluster:
Install the SageMaker HyperPod CLI
First, install the latest version of the SageMaker HyperPod CLI and SDK (the examples in this post are based on version 3.1.0). From your local environment, run the following command (you can also install it in a Python virtual environment):
This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package installed (`sagemaker-hyperpod>=3.1.0`) to be able to use the relevant set of features. To verify that the CLI is installed correctly, you can run the `hyp` command and check the output:
The output will be similar to the following, and includes instructions on how to use the CLI:
For more information on CLI usage and the available commands and their respective parameters, refer to the CLI reference documentation.
Set the cluster context
The SageMaker HyperPod CLI and SDK use the Kubernetes API to interact with the cluster. Therefore, make sure the underlying Kubernetes Python client is configured to execute API calls against your cluster by setting the cluster context.
Use the CLI to list the clusters available in your AWS account:
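Assuming the cluster listing subcommand takes the following shape (run `hyp` without arguments to confirm the exact command set on your version):

```shell
hyp list-cluster
```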
Set the cluster context, specifying the cluster name as input (in our case, we use ml-cluster as <cluster_name>):
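For example (the subcommand and flag names here are assumptions; confirm with the CLI help):

```shell
hyp set-cluster-context --cluster-name ml-cluster
```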
Train models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides a straightforward way to submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster. In the following example, we schedule a Meta Llama 3.1 8B model training job with FSDP.
The CLI runs training using the HyperPodPyTorchJob Kubernetes custom resource, which is implemented by the HyperPod training operator; the operator must be installed in the cluster as discussed in the prerequisites section.
First, clone the awsome-distributed-training repository and create the Docker image that you will use for the training job:
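A sketch of these steps follows. The repository URL is the aws-samples project referenced above, but the test case directory and image tag are assumptions to adapt to your checkout:

```shell
git clone https://github.com/aws-samples/awsome-distributed-training.git
# The directory below is an assumption; locate the FSDP test case in your checkout
cd awsome-distributed-training/3.test_cases/pytorch/FSDP
docker build -t fsdp-training:latest .
```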
Then, log in to Amazon Elastic Container Registry (Amazon ECR) to pull the base image and build the new container:
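A typical login sequence looks like the following; the registry account ID and Region are assumptions that depend on the base image referenced in the Dockerfile:

```shell
# 763104351884 is the account that hosts AWS Deep Learning Containers; adjust the Region as needed
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
```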
The Dockerfile in the awsome-distributed-training repository referenced in the preceding code already contains the HyperPod elastic agent, which orchestrates the lifecycles of training workers on each container and communicates with the HyperPod training operator. If you're using a different Dockerfile, install the HyperPod elastic agent following the instructions in HyperPod elastic agent.
Next, create a new registry for your training image if needed and push the built image to it:
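For example (the account ID, Region, and repository name are placeholders; you also need to authenticate Docker against your own registry before pushing, analogous to the login step above):

```shell
aws ecr create-repository --repository-name fsdp-training --region us-east-1
docker tag fsdp-training:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/fsdp-training:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/fsdp-training:latest
```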
After you have successfully created the Docker image, you can submit the training job using the SageMaker HyperPod CLI.
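A submission might look like the following sketch. The flag names are those discussed below; the image URI and the exact value syntax for `--command`, `--args`, `--environment`, and `--volume` are placeholders to adapt to your setup:

```shell
hyp create hyp-pytorch-job \
  --job-name fsdp-llama3-1-8b \
  --image <account-id>.dkr.ecr.<region>.amazonaws.com/fsdp-training:latest \
  --command hyperpodrun \
  --args <training-script-arguments> \
  --environment <environment-variable-definitions> \
  --max-retry 3 \
  --volume <volume-specification>
```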
Internally, the SageMaker HyperPod CLI uses the Kubernetes Python client to build a HyperPodPyTorchJob custom resource and then create it on the Kubernetes cluster.
You can modify the CLI command for other Meta Llama configurations by exchanging the `--args` for the desired arguments and values; examples can be found in the Kubernetes manifests in the awsome-distributed-training repository.
In the given configuration, the training job will write checkpoints to /fsx/checkpoints on the FSx for Lustre PVC.
The `hyp create hyp-pytorch-job` command supports additional arguments, which can be discovered by running the following:
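Assuming the standard `--help` flag:

```shell
hyp create hyp-pytorch-job --help
```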
The preceding example code contains the following relevant arguments:
- `--command` and `--args` provide flexibility in setting the command to be executed in the container. The command executed is `hyperpodrun`, implemented by the HyperPod elastic agent that is installed in the training container. The HyperPod elastic agent extends PyTorch's ElasticAgent and manages the communication of the various workers with the HyperPod training operator. For more information, refer to HyperPod elastic agent.
- `--environment` defines environment variables and customizes the training execution.
- `--max-retry` indicates the maximum number of restarts at the process level that will be attempted by the HyperPod training operator. For more information, refer to Using the training operator to run jobs.
- `--volume` is used to map persistent or ephemeral volumes to the container.
If successful, the command will output the following:
You can monitor the status of the training job through the CLI. Running `hyp list hyp-pytorch-job` will show the status first as Created and then as Running after the containers have started:
To list the pods that are created by this training job, run the following command:
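One way to do this is with kubectl, filtering on the job name (the CLI may also provide a dedicated subcommand; check `hyp --help`):

```shell
kubectl get pods | grep fsdp-llama3-1-8b
```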
You can follow the logs of one of the training pods that get spawned by running the following command:
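For example, with kubectl (`<pod-name>` is one of the pods returned by the listing in the previous step):

```shell
kubectl logs -f <pod-name>
```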
We elaborate on more advanced debugging and observability features at the end of this section.
Alternatively, if you prefer a programmatic experience and more advanced customization options, you can submit the training job using the SageMaker HyperPod Python SDK. For more information, refer to the SDK reference documentation. The following code will yield the equivalent training job submission to the preceding CLI example:
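The following pseudocode sketch illustrates what such a submission can look like. The module, class, and parameter names are assumptions that mirror the CLI flags above; consult the SDK reference documentation for the exact interface:

```python
# Pseudocode sketch: names below are assumptions, not the verified SDK surface
from sagemaker.hyperpod.training import HyperPodPytorchJob  # assumed import path

job = HyperPodPytorchJob(
    name="fsdp-llama3-1-8b",
    image="<account-id>.dkr.ecr.<region>.amazonaws.com/fsdp-training:latest",
    command=["hyperpodrun"],            # entry point provided by the HyperPod elastic agent
    args=["<training-script-arguments>"],
    environment={"<env-var-name>": "<value>"},
    max_retry=3,
    volumes=["<volume-specification>"],
)
job.create()  # assumed method: builds the HyperPodPyTorchJob custom resource and applies it
```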
Debugging training jobs
In addition to monitoring the training pod logs as described earlier, there are several other useful ways of debugging training jobs:
- You can submit training jobs with an additional `--debug True` flag, which will print the Kubernetes YAML to the console when the job starts so it can be inspected.
- You can view a list of current training jobs by running `hyp list hyp-pytorch-job`.
- You can view the status and corresponding events of the job by running `hyp describe hyp-pytorch-job --job-name fsdp-llama3-1-8b`.
- If the HyperPod observability stack is deployed to the cluster, run `hyp get-monitoring --grafana` and `hyp get-monitoring --prometheus` to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view cluster and job metrics.
- To monitor GPU utilization or view directory contents, it can be useful to execute commands in the pods or open an interactive shell into them. For example, run `kubectl exec -it <pod-name> -- nvtop` to run `nvtop` for visibility into GPU utilization, or open an interactive shell by running `kubectl exec -it <pod-name> -- /bin/bash`.
- The logs of the HyperPod training operator controller pod contain valuable information about scheduling. To view them, run `kubectl get pods -n aws-hyperpod | grep hp-training-controller-manager` to find the controller pod name, and then run `kubectl logs -n aws-hyperpod <controller-pod-name>` to view the corresponding logs.
Deploy models with the SageMaker HyperPod CLI and SDK
The SageMaker HyperPod CLI provides commands to quickly deploy models to your SageMaker HyperPod cluster for inference. You can deploy both foundation models (FMs) available on Amazon SageMaker JumpStart and custom models with artifacts stored on Amazon S3 or FSx for Lustre file systems.
This functionality automatically deploys the selected model to the SageMaker HyperPod cluster through Kubernetes custom resources, which are implemented by the HyperPod inference operator; the operator must be installed in the cluster as discussed in the prerequisites section. Optionally, you can automatically create a SageMaker inference endpoint as well as an Application Load Balancer (ALB), which can be used to invoke the model directly over HTTPS with a generated TLS certificate.
Deploy SageMaker JumpStart models
You can deploy an FM that is available on SageMaker JumpStart with the following command:
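A deployment command can look like the following sketch; the model ID, instance type, and endpoint name are illustrative, and the S3 URI is a placeholder:

```shell
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name deepseek-r1-distill-qwen-1-5b \
  --tls-certificate-output-s3-uri s3://<bucket-name>/certificates/ \
  --namespace default
```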
The preceding code includes the following parameters:
- `--model-id` is the model ID in the SageMaker JumpStart model hub. In this example, we deploy a DeepSeek R1-distilled version of Qwen 1.5B, which is available on SageMaker JumpStart.
- `--instance-type` is the target instance type in your SageMaker HyperPod cluster where you want to deploy the model. This instance type must be supported by the selected model.
- `--endpoint-name` is the name that the SageMaker inference endpoint will have. This name must be unique. SageMaker inference endpoint creation is optional.
- `--tls-certificate-output-s3-uri` is the S3 bucket location where the TLS certificate for the ALB will be stored. This can be used to invoke the model directly through HTTPS. You can use S3 buckets that are accessible by the HyperPod inference operator IAM role.
- `--namespace` is the Kubernetes namespace the model will be deployed to. The default value is set to `default`.
The CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which can be seen by running the following command:
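As with the training command, the additional parameters can be listed with the standard `--help` flag:

```shell
hyp create hyp-jumpstart-endpoint --help
```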
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which can be observed through the CLI. Running `hyp list hyp-jumpstart-endpoint` will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get more visibility into the deployment pod, run the following commands to find the pod name and inspect the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
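The invocation might look like the following; the `hyp invoke` subcommand shape and the request body are assumptions, and `<endpoint-name>` is the name used at deployment:

```shell
hyp invoke hyp-jumpstart-endpoint \
  --endpoint-name <endpoint-name> \
  --body '{"inputs": "What is machine learning?"}'
```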
You will get an output similar to the following:
Alternatively, if you prefer a programmatic experience and advanced customization options, you can use the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
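The following pseudocode sketch mirrors the CLI parameters above; the module, class, and method names are assumptions, so consult the SDK reference documentation for the exact interface:

```python
# Pseudocode sketch: names below are assumptions, not the verified SDK surface
from sagemaker.hyperpod.inference import HyperPodJumpStartEndpoint  # assumed import path

endpoint = HyperPodJumpStartEndpoint(
    model_id="deepseek-llm-r1-distill-qwen-1-5b",   # SageMaker JumpStart model hub ID
    instance_type="ml.g5.8xlarge",                  # must be supported by the model
    endpoint_name="<endpoint-name>",                # unique SageMaker endpoint name
    tls_certificate_output_s3_uri="s3://<bucket-name>/certificates/",
    namespace="default",
)
endpoint.create()  # assumed method: applies the Kubernetes custom resources
```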
Deploy custom models
You can also use the CLI to deploy custom models with model artifacts stored on either Amazon S3 or FSx for Lustre. This is useful for models that have been fine-tuned on custom data. You must provide the storage location of the model artifacts as well as a container image for inference that is compatible with the model artifacts and SageMaker inference endpoints. In the following example, we deploy a TinyLlama 1.1B model from Amazon S3 using the DJL Large Model Inference container image.
In preparation, download the model artifacts locally and push them to an S3 bucket:
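For example, using the Hugging Face CLI and the AWS CLI (the repository name is an assumption for a TinyLlama 1.1B checkpoint, and the bucket is a placeholder):

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama-1.1b
aws s3 sync ./tinyllama-1.1b s3://<bucket-name>/models/tinyllama-1.1b/
```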
Now you can deploy the model with the following command:
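The following sketch uses the parameters described below; the inference image URI, instance type, and bucket are placeholders, and `--image-uri` stands in for whichever flag your CLI version uses to specify the inference container image:

```shell
hyp create hyp-custom-endpoint \
  --model-name tinyllama-1-1b \
  --model-source-type s3 \
  --model-location models/tinyllama-1.1b/ \
  --s3-bucket-name <bucket-name> \
  --s3-region <region> \
  --image-uri <djl-lmi-inference-container-uri> \
  --instance-type ml.g5.8xlarge \
  --endpoint-name tinyllama-1-1b \
  --tls-certificate-output-s3-uri s3://<bucket-name>/certificates/ \
  --namespace default
```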
The preceding code contains the following key parameters:
- `--model-name` is the name of the model that will be created in SageMaker
- `--model-source-type` specifies either `fsx` or `s3` for the location of the model artifacts
- `--model-location` specifies the prefix or folder where the model artifacts are located
- `--s3-bucket-name` and `--s3-region` specify the S3 bucket name and AWS Region, respectively
- `--instance-type`, `--endpoint-name`, `--namespace`, and `--tls-certificate` behave the same as for the deployment of SageMaker JumpStart models
Similar to SageMaker JumpStart model deployment, the CLI supports more advanced deployment configurations, including auto scaling, through additional parameters, which you can view by running the following command:
If successful, the command will output the following:
After a few minutes, both the ALB and the SageMaker inference endpoint will be available, which you can observe through the CLI. Running `hyp list hyp-custom-endpoint` will show the status first as DeploymentInProgress and then as DeploymentComplete when the endpoint is ready to be used:
To get more visibility into the deployment pod, run the following commands to find the pod name and inspect the corresponding logs:
The output will look similar to the following:
You can invoke the SageMaker inference endpoint you created through the CLI by running the following command:
You will get an output similar to the following:
Alternatively, you can deploy using the SageMaker HyperPod Python SDK. The following code will yield the equivalent deployment to the preceding CLI example:
Debugging inference deployments
In addition to monitoring the inference pod logs, there are several other useful ways of debugging inference deployments:
- You can access the HyperPod inference operator controller logs through the SageMaker HyperPod CLI. Run `hyp get-operator-logs <hyp-custom-endpoint/hyp-jumpstart-endpoint> --since-hours 0.5` to access the operator logs for custom and SageMaker JumpStart deployments, respectively.
- You can view a list of inference deployments by running `hyp list <hyp-custom-endpoint/hyp-jumpstart-endpoint>`.
- You can view the status and corresponding events of deployments by running `hyp describe <hyp-custom-endpoint/hyp-jumpstart-endpoint> --name <deployment-name>` for custom and SageMaker JumpStart deployments, respectively.
- If the HyperPod observability stack is deployed to the cluster, run `hyp get-monitoring --grafana` and `hyp get-monitoring --prometheus` to get the Grafana dashboard and Prometheus workspace URLs, respectively, to view inference metrics as well.
- To monitor GPU utilization or view directory contents, it can be useful to execute commands in the pods or open an interactive shell into them. For example, run `kubectl exec -it <pod-name> -- nvtop` to run `nvtop` for visibility into GPU utilization, or open an interactive shell by running `kubectl exec -it <pod-name> -- /bin/bash`.
For more information on the inference deployment features in SageMaker HyperPod, see Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle and Deploying models on Amazon SageMaker HyperPod.
Clean up
To delete the training job from the corresponding example, use the following CLI command:
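Assuming a `hyp delete` subcommand symmetric to `hyp create` (confirm with `hyp --help`):

```shell
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b
```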
To delete the model deployments from the inference examples, use the following CLI commands for SageMaker JumpStart and custom model deployments, respectively:
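Assuming the same `hyp delete` subcommand shape, with `<endpoint-name>` as the name used at deployment:

```shell
hyp delete hyp-jumpstart-endpoint --name <endpoint-name>
hyp delete hyp-custom-endpoint --name <endpoint-name>
```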
To avoid incurring ongoing costs for the instances running in your cluster, you can scale down the instances or delete instances.
Conclusion
The new SageMaker HyperPod CLI and SDK can significantly streamline the process of training and deploying large-scale AI models. Through the examples in this post, we have demonstrated how these tools provide the following benefits:
- Simplified workflows – The CLI offers straightforward commands for common tasks like distributed training and model deployment, making the powerful capabilities of SageMaker HyperPod accessible to data scientists without requiring deep infrastructure knowledge.
- Flexible development options – While the CLI handles common scenarios, the SDK enables fine-grained control and customization for more complex requirements, so developers can programmatically configure every aspect of their distributed ML workloads.
- Comprehensive observability – Both interfaces provide robust monitoring and debugging capabilities through system logs and integration with the SageMaker HyperPod observability stack, helping you quickly identify and resolve issues during development.
- Production-ready deployment – The tools support end-to-end workflows from experimentation to production, including features like automatic TLS certificate generation for secure model endpoints and integration with SageMaker inference endpoints.
Getting started with these tools is as simple as installing the sagemaker-hyperpod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
For more information about SageMaker HyperPod and these development tools, refer to the SageMaker HyperPod CLI and SDK documentation or explore the example notebooks.
About the authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in various domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.
Shweta Singh is a Senior Product Manager on the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, where he helps customers unlock the full potential of AI and ML in the cloud. He holds a PhD in Engineering from TU Darmstadt in Germany, where his research focused on the reliability, concept drift detection, and MLOps of industrial ML applications. Nicolas has extensive hands-on experience across industries, including autonomous driving, drones, and manufacturing, having worked in roles ranging from research scientist to engineering manager. He has contributed to award-winning research, holds patents in object detection and anomaly detection, and is passionate about applying cutting-edge AI to solve complex real-world problems.