Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio


Modern generative AI model providers require unprecedented computational scale, with pre-training often involving thousands of accelerators running continuously for days, and sometimes months. Foundation Models (FMs) demand distributed training clusters, coordinated groups of accelerated compute instances using frameworks like PyTorch, to parallelize workloads across hundreds of accelerators (like AWS Trainium and AWS Inferentia chips or NVIDIA GPUs).

Orchestrators like SLURM and Kubernetes manage these complex workloads, scheduling jobs across nodes, managing cluster resources, and processing requests. Paired with AWS infrastructure like Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances, Elastic Fabric Adapter (EFA), and distributed file systems like Amazon Elastic File System (Amazon EFS) and Amazon FSx, these ultra clusters can run large-scale machine learning (ML) training and inference, handling parallelism, gradient synchronization, collective communications, and even routing and load balancing. However, at scale, even robust orchestrators face challenges around cluster resilience. Distributed training workloads in particular run synchronously, because each training step requires participating instances to complete their calculations before proceeding to the next step. This means that if a single instance fails, the entire job fails. The likelihood of such failures increases with the size of the cluster.
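To make the scaling effect concrete, consider a back-of-the-envelope calculation with purely illustrative numbers: if each instance has even a small chance of failing during a run, the probability that a synchronous job is interrupted at least once grows quickly with cluster size.

# Illustrative sketch only (hypothetical failure rate): probability that a
# synchronous training job is interrupted by at least one instance failure.
p_instance_failure = 0.001  # assumed chance that a single instance fails during the run

for cluster_size in (8, 64, 256, 1024):
    # The job fails if any instance fails; instances are assumed independent.
    p_job_interrupted = 1 - (1 - p_instance_failure) ** cluster_size
    print(f"{cluster_size:5d} instances -> P(at least one failure) = {p_job_interrupted:.1%}")

Even small per-instance failure probabilities therefore translate into frequent interruptions for jobs that span hundreds or thousands of accelerators.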

Although resilience and infrastructure reliability can be a challenge, developer experience remains equally pivotal. Traditional ML workflows create silos, where data and research scientists prototype on local Jupyter notebooks or Visual Studio Code instances, lacking access to cluster-scale storage, while engineers manage production jobs through separate SLURM or Kubernetes (kubectl or helm, for example) interfaces. This fragmentation has consequences, including mismatches between notebook and production environments, lack of local access to cluster storage, and most importantly, suboptimal use of ultra clusters.

In this post, we explore these challenges. In particular, we propose a solution to enhance the data scientist experience on Amazon SageMaker HyperPod, a resilient ultra cluster solution.

Amazon SageMaker HyperPod

SageMaker HyperPod is a compute environment purpose-built for large-scale frontier model training. You can build resilient clusters for ML workloads and develop state-of-the-art frontier models. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual intervention, which means you can train in distributed settings for weeks or months with minimal disruption.

To learn more about the resilience and Total Cost of Ownership (TCO) benefits of SageMaker HyperPod, check out Reduce ML training costs with Amazon SageMaker HyperPod. As of writing this post, SageMaker HyperPod supports both SLURM and Amazon Elastic Kubernetes Service (Amazon EKS) as orchestrators.

To deploy a SageMaker HyperPod cluster, refer to the SageMaker HyperPod workshops (SLURM, Amazon EKS). To learn more about what's being deployed, check out the architecture diagrams later in this post. You can choose to use either of the two orchestrators based on your preference.

Amazon SageMaker Studio

Amazon SageMaker Studio is a fully integrated development environment (IDE) designed to streamline the end-to-end ML lifecycle. It provides a unified, web-based interface where data scientists and developers can perform ML tasks, including data preparation, model building, training, tuning, evaluation, deployment, and monitoring.

By centralizing these capabilities, SageMaker Studio eliminates the need to switch between multiple tools, significantly improving productivity and collaboration. SageMaker Studio supports a variety of IDEs, such as JupyterLab Notebooks, Code Editor based on Code-OSS, Visual Studio Code Open Source, and RStudio, offering flexibility for different development preferences. SageMaker Studio supports private and shared spaces, so teams can collaborate effectively while optimizing resource allocation. Shared spaces allow multiple users to access the same compute resources across profiles, and private spaces provide dedicated environments for individual users. This flexibility empowers data scientists and developers to seamlessly scale their compute resources and enhance collaboration within SageMaker Studio. Additionally, it integrates with advanced tooling like managed MLflow and Partner AI Apps to streamline experiment tracking and accelerate AI-driven innovation.

Distributed file systems: Amazon FSx

Amazon FSx for Lustre is a fully managed file storage service designed to provide high-performance, scalable, and cost-effective storage for compute-intensive workloads. Powered by the Lustre architecture, it's optimized for applications requiring access to fast storage, such as ML, high-performance computing, video processing, financial modeling, and big data analytics.

FSx for Lustre delivers sub-millisecond latencies, throughput that scales up to 1 GBps per TiB of storage, and millions of IOPS. This makes it ideal for workloads demanding rapid data access and processing. The service integrates with Amazon Simple Storage Service (Amazon S3), enabling seamless access to S3 objects as files and facilitating fast data transfers between Amazon FSx and Amazon S3. Updates in S3 buckets are automatically reflected in FSx file systems and vice versa. For more information on this integration, check out Exporting files using HSM commands and Linking your file system to an Amazon S3 bucket.
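For example, on an FSx for Lustre Persistent 2 file system you can link a directory of the file system to an S3 prefix through a data repository association. The following boto3 sketch illustrates the idea; the file system ID and bucket name are placeholders, and the FSx documentation remains the reference for the full set of options.

import boto3

fsx = boto3.client("fsx")

# Link a directory in the Lustre file system to an S3 prefix so that S3 objects
# appear as files in the file system and new files can be exported back to S3.
response = fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",                      # placeholder file system ID
    FileSystemPath="/datasets",                               # directory inside the Lustre file system
    DataRepositoryPath="s3://my-training-bucket/datasets/",   # placeholder S3 prefix
    BatchImportMetaDataOnCreate=True,                         # import metadata of existing S3 objects
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
print(response["Association"]["AssociationId"])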

Theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces

You can use FSx for Lustre as a shared high-performance file system to connect SageMaker Studio domains with SageMaker HyperPod clusters, streamlining ML workflows for data scientists and researchers. By using FSx for Lustre as a shared volume, you can build and refine your training or fine-tuning code using IDEs like JupyterLab and Code Editor in SageMaker Studio, prepare datasets, and save your work directly in the FSx for Lustre volume. This same volume is mounted by SageMaker HyperPod during the execution of training workloads, enabling direct access to prepared data and code without the need for repetitive data transfers or custom image creation. Data scientists can iteratively make changes, prepare data, and submit training workloads directly from SageMaker Studio, providing consistency across development and execution environments while improving productivity. This integration removes the overhead of moving data between environments and provides a seamless workflow for large-scale ML projects requiring high throughput and low-latency storage. You can configure FSx for Lustre volumes to provide file system access to SageMaker Studio user profiles in two distinct ways, each tailored to different collaboration and data management needs.

Option 1: Shared file system partition across every user profile

Infrastructure administrators can set up a single FSx for Lustre file system partition shared across user profiles within a SageMaker Studio domain, as illustrated in the following diagram.

Figure 1: An FSx for Lustre file system partition shared across multiple user profiles within a single SageMaker Studio domain

  • Shared project directories – Teams working on large-scale projects can collaborate seamlessly by accessing a shared partition. This makes it possible for multiple users to work on the same files, datasets, and FMs without duplicating resources.
  • Simplified file management – You don't need to manage private storage; instead, you can rely on the shared directory for your file-related needs, reducing complexity.
  • Improved data governance and security – The shared FSx for Lustre partition is centrally managed by the infrastructure admin, enabling robust access controls and data policies to maintain the security and integrity of shared resources.

Option 2: Dedicated file system partition per user profile

Alternatively, administrators can configure dedicated FSx for Lustre file system partitions for each individual user profile in SageMaker Studio, as illustrated in the following diagram.

Figure 2: An FSx for Lustre file system with a dedicated partition per user

This setup provides personalized storage and facilitates data isolation. Key benefits include:

  • Individual data storage and analysis – Each user gets a private partition to store personal datasets, models, and files. This facilitates independent work on projects with clear segregation by user profile.
  • Centralized data management – Administrators retain centralized control over the FSx for Lustre file system, facilitating secure backups and direct access while maintaining data protection for users.
  • Cross-instance file sharing – You can access your own files across multiple SageMaker Studio spaces and IDEs, because the FSx for Lustre partition provides persistent storage at the user profile level.

Solution overview

The following diagram illustrates the architecture of SageMaker HyperPod with SLURM integration.

Figure 3: Architecture diagram for SageMaker HyperPod with Slurm as the orchestrator

The following diagram illustrates the architecture of SageMaker HyperPod with Amazon EKS integration.

Figure 4: Architecture diagram for SageMaker HyperPod with EKS as the orchestrator

These diagrams illustrate what you'll provision as part of this solution. In addition to the SageMaker HyperPod cluster you already have, you provision a SageMaker Studio domain and attach the cluster's FSx for Lustre file system to that domain. Depending on whether or not you choose a SharedFSx, you can either attach the file system to be mounted with a single partition shared across user profiles (that you configure) within your SageMaker domain, or attach it to be mounted with multiple partitions for multiple isolated users. To learn more about this distinction, refer to the earlier section of this post discussing the theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces.

In the following sections, we present a walkthrough of this integration by demonstrating, on a SageMaker HyperPod with Amazon EKS cluster, how you can:

  1. Attach a SageMaker Studio domain.
  2. Use that domain to fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Prerequisites

This post assumes that you have a SageMaker HyperPod cluster.

Deploy resources using AWS CloudFormation

As part of this integration, we provide an AWS CloudFormation stack template (SLURM, Amazon EKS). Before deploying the stack, make sure you have a SageMaker HyperPod cluster set up.

In the stack for SageMaker HyperPod with SLURM, you create the following resources:

  • A SageMaker Studio domain.
  • Lifecycle configurations for installing necessary packages for the SageMaker Studio IDE, including SLURM. Lifecycle configurations will be created for both JupyterLab and Code Editor. We set them up so that your Code Editor or JupyterLab instance is essentially configured as a login node for your SageMaker HyperPod cluster.
  • An AWS Lambda function that:
    • Associates the created security-group-for-inbound-nfs security group to the SageMaker Studio domain.
    • Associates the security-group-for-inbound-nfs security group to the FSx for Lustre ENIs.
    • Optional:
      • If SharedFSx is set to True, the created partition is shared in the FSx for Lustre volume and associated with the SageMaker Studio domain.
      • If SharedFSx is set to False, a Lambda function creates the partition /{user_profile_name} and associates it with the SageMaker Studio user profile.

In the stack for SageMaker HyperPod with Amazon EKS, you create the following resources:

  • A SageMaker Studio domain.
  • Lifecycle configurations for installing necessary packages for the SageMaker Studio IDE, such as kubectl and jq. Lifecycle configurations will be created for both JupyterLab and Code Editor.
  • A Lambda function that:
    • Associates the created security-group-for-inbound-nfs security group to the SageMaker Studio domain.
    • Associates the security-group-for-inbound-nfs security group to the FSx for Lustre ENIs.
    • Optional:
      • If SharedFSx is set to True, the created partition is shared in the FSx for Lustre volume and associated with the SageMaker Studio domain.
      • If SharedFSx is set to False, a Lambda function creates the partition /{user_profile_name} and associates it with the SageMaker Studio user profile.

The main difference between the two implementations lies in the lifecycle configurations for the JupyterLab or Code Editor servers, because of the difference in how you interact with the cluster using the different orchestrators (kubectl or helm for Amazon EKS, and ssm or ssh for SLURM). In addition to mounting your cluster's FSx for Lustre file system, for SageMaker HyperPod with Amazon EKS the lifecycle scripts configure your JupyterLab or Code Editor server to run the well-known Kubernetes-based command line interfaces, including kubectl, eksctl, and helm. They also preconfigure your context, so that your cluster is ready to use as soon as your JupyterLab or Code Editor instance is up.

You can find the lifecycle configuration for SageMaker HyperPod with Amazon EKS in the deployed CloudFormation stack template. SLURM works a bit differently. We designed the lifecycle configuration so that your JupyterLab or Code Editor instance serves as a login node for your SageMaker HyperPod with SLURM cluster. Login nodes allow you to log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. This also makes it possible to run monitoring servers like aim, TensorBoard, Grafana, or Prometheus. Therefore, the lifecycle configuration here automatically installs SLURM and configures it so you can interface with your cluster from your JupyterLab or Code Editor instance. You can find the script used to configure SLURM on these instances on GitHub.

Both of these configurations use the same logic to mount the file systems. The instructions found in Adding a custom file system to a domain are implemented in a custom resource (Lambda function) defined in the CloudFormation stack template.
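As a rough illustration of what that custom resource does, the following boto3 sketch associates a dedicated partition of the cluster's FSx for Lustre file system with a Studio user profile (the SharedFSx set to False case). The IDs are placeholders, the FSxLustreFileSystemConfig custom file system type is assumed to be available in your SageMaker API version, and the Lambda function in the stack template remains the source of truth.

import boto3

sm = boto3.client("sagemaker")

domain_id = "d-xxxxxxxxxxxx"              # placeholder Studio domain ID
user_profile_name = "Data-Scientist"      # placeholder user profile name
file_system_id = "fs-0123456789abcdef0"   # the cluster's FSx for Lustre file system

# Attach the /{user_profile_name} partition of the Lustre file system to the
# user profile as a custom file system, so spaces owned by this profile mount it.
sm.update_user_profile(
    DomainId=domain_id,
    UserProfileName=user_profile_name,
    UserSettings={
        "CustomFileSystemConfigs": [
            {
                "FSxLustreFileSystemConfig": {
                    "FileSystemId": file_system_id,
                    "FileSystemPath": f"/{user_profile_name}",
                }
            }
        ]
    },
)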

For more details on deploying these provided stacks, check out the respective workshop pages for SageMaker HyperPod with SLURM and SageMaker HyperPod with Amazon EKS.

Data science journey on SageMaker HyperPod with SageMaker Studio

As a data scientist, after you set up the SageMaker HyperPod and SageMaker Studio integration, you can log in to the SageMaker Studio environment through your user profile.

Figure 5: You can log in to your SageMaker Studio environment through your created user profile

In SageMaker Studio, you can select your preferred IDE to start prototyping your fine-tuning workload, and create the MLflow tracking server to track training and system metrics during the execution of the workload.

Figure 6: Select your preferred IDE to connect to your HyperPod cluster

The SageMaker HyperPod clusters page provides information about the available clusters and details on the nodes.

Figures 7, 8: You can also see information about your SageMaker HyperPod cluster in SageMaker Studio

For this post, we selected Code Editor as our preferred IDE. The automation provided by this solution preconfigured the FSx for Lustre file system and the lifecycle configuration to install the necessary modules for submitting workloads on the cluster using the hyperpod-cli or kubectl. For the instance type, you can choose from a range of available instances. In our case, we opted for the default ml.t3.medium.

Figure 9: Code Editor configuration

The development environment already presents the partition mounted as a file system, where you can start prototyping your code for data preparation or model fine-tuning. For the purpose of this example, we fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Figure 10: Your cluster's files are accessible directly in your Code Editor space, thanks to your file system being mounted onto your Code Editor space! This means you can develop locally and deploy onto your ultra cluster.

The repository is organized as follows:

  • download_model.py – The script to download the open source model directly into the FSx for Lustre volume. This way, we provide faster and consistent execution of the training workload on SageMaker HyperPod.
  • scripts/dataprep.py – The script to download and prepare the dataset for the fine-tuning workload. In the script, we format the dataset using the prompt style defined for the DeepSeek R1 models and save the dataset in the FSx for Lustre volume (see the sketch after this list). This way, we provide faster execution of the training workload by avoiding asset copies from other data repositories.
  • scripts/train.py – The script containing the fine-tuning logic, using open source modules like Hugging Face Transformers and optimization and distribution techniques such as FSDP and QLoRA.
  • scripts/evaluation.py – The script to run ROUGE evaluation on the fine-tuned model.
  • pod-finetuning.yaml – The manifest file containing the definition of the container used to execute the fine-tuning workload on the SageMaker HyperPod cluster.
  • pod-evaluation.yaml – The manifest file containing the definition of the container used to execute the evaluation workload on the SageMaker HyperPod cluster.
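The following sketch illustrates the kind of logic found in the data preparation script. It is not the exact script from the repository; the column names and prompt template are assumptions based on the dataset card, and the paths assume the Data-Scientist partition used throughout this post.

from datasets import load_dataset

# Assumed DeepSeek R1-style prompt template with an explicit reasoning section.
PROMPT_TEMPLATE = """Below is a medical question. Reason step by step, then answer.
### Question:
{question}
### Response:
<think>
{reasoning}
</think>
{answer}"""

def format_example(example):
    # Column names (Question, Complex_CoT, Response) are assumptions from the dataset card.
    return {
        "text": PROMPT_TEMPLATE.format(
            question=example["Question"],
            reasoning=example["Complex_CoT"],
            answer=example["Response"],
        )
    }

# Download the English split of the dataset and apply the prompt formatting.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.map(format_example)
splits = dataset.train_test_split(test_size=0.1, seed=42)

# Save to the FSx for Lustre partition so the HyperPod workload can read it directly.
base_path = "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data"
splits["train"].save_to_disk(f"{base_path}/train/")
splits["test"].save_to_disk(f"{base_path}/test/")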

After downloading the model and preparing the dataset for fine-tuning, you can start prototyping the fine-tuning script directly in the IDE.

Figure 11: You can start developing locally!

The updates made to the script are automatically reflected in the container that executes the workload. When you're ready, you can define the manifest file for the execution of the workload on SageMaker HyperPod. In the following code, we highlight the key components of the manifest. For a complete example of a Kubernetes manifest file, refer to the awsome-distributed-training GitHub repository.

...

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: deepseek-r1-qwen-14b-fine-tuning
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 8
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: deepseek-r1-distill-qwen-14b-fine-tuning
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
            - name: local
              hostPath:
                path: /mnt/k8s-disks/0
            - name: fsx-volume
              persistentVolumeClaim:
                claimName: fsx-claim
          serviceAccountName: eks-hyperpod-sa
          containers:
            - name: pytorch
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-ec2
              imagePullPolicy: Always
              resources:
                requests:
                  nvidia.com/gpu: 1
                  vpc.amazonaws.com/efa: 1
                limits:
                  nvidia.com/gpu: 1
                  vpc.amazonaws.com/efa: 1
              ...
              command:
                - /bin/bash
                - -c
                - |
                  pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt && \
                  torchrun \
                  --nnodes=8 \
                  --nproc_per_node=1 \
                  /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py \
                  --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
                - name: local
                  mountPath: /local
                - name: fsx-volume
                  mountPath: /data

The key components are as follows:

  • replicas: 8 – This specifies that eight worker pods will be created for this PyTorchJob. This is particularly important for distributed training because it determines the scale of your training job. Having eight replicas means your PyTorch training will be distributed across eight separate pods, allowing for parallel processing and faster training times.
  • Persistent volume configuration – This consists of the following:
    • name: fsx-volume – Defines a named volume that will be used for storage.
    • persistentVolumeClaim – Indicates this is using Kubernetes's persistent storage mechanism.
    • claimName: fsx-claim – References a pre-created PersistentVolumeClaim, pointing to the FSx for Lustre file system used in the SageMaker Studio environment.
  • Container image – The training container, in this case a PyTorch training Deep Learning Container image hosted in Amazon ECR (the image field in the manifest).
  • Training command – The highlighted command shows the execution instructions for the training workload:
    • pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt – Installs dependencies at runtime, to customize the container with the packages and modules required for the fine-tuning workload.
    • torchrun … /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py – The actual training script, pointing to the shared FSx for Lustre file system, in the partition created for the SageMaker Studio user profile Data-Scientist.
    • --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml – Arguments provided to the training script, containing the definition of the training parameters and additional variables used during the execution of the workload.

The args-fine-tuning.yaml file contains the definition of the training parameters to provide to the script. In addition, the training script was written to save training and system metrics to the managed MLflow server in SageMaker Studio, in case the Amazon Resource Name (ARN) and experiment name are provided:

# Location in the FSx for Lustre file system where the base model was saved
model_id: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/DeepSeek-R1-Distill-Qwen-14B"
mlflow_uri: "${MLFLOW_ARN}"
mlflow_experiment_name: "deepseek-r1-distill-llama-8b-agent"
# sagemaker specific parameters
# File system path where the workload will store the model
output_dir: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/model/"
# File system path where the workload can access the train dataset
train_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/train/"
# File system path where the workload can access the test dataset
test_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/test/"
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
learning_rate: 2e-4                    # learning rate scheduler
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 2          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config:
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: true

The parameters model_id, output_dir, train_dataset_path, and test_dataset_path follow the same logic described for the manifest file and refer to the location where the FSx for Lustre volume is mounted in the container, under the Data-Scientist partition created for the SageMaker Studio user profile.
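The MLflow integration in the training script can be as simple as the following sketch (illustrative, not the exact code in train.py). It assumes the sagemaker-mlflow plugin is installed in the container, so that the managed MLflow tracking server ARN can be used directly as the tracking URI.

import mlflow

def setup_mlflow(config):
    # Only enable tracking when the ARN and experiment name are provided in the config,
    # mirroring the optional behavior described above.
    if config.get("mlflow_uri") and config.get("mlflow_experiment_name"):
        mlflow.set_tracking_uri(config["mlflow_uri"])          # managed MLflow tracking server ARN
        mlflow.set_experiment(config["mlflow_experiment_name"])
        return True
    return False

# Inside the training loop (illustrative parameter and metric names):
# with mlflow.start_run():
#     mlflow.log_params({"learning_rate": 2e-4, "lora_r": 8})
#     mlflow.log_metric("train_loss", loss_value, step=global_step)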

When you've finished developing the fine-tuning script and defined the training parameters for the workload, you can deploy the workload with the following commands:

$ kubectl apply -f pod-finetuning.yaml
service/etcd unchanged
deployment.apps/etcd unchanged
pytorchjob.kubeflow.org/deepseek-r1-qwen-14b-fine-tuning created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
deepseek-r1-qwen-14b-fine-tuning-worker-0 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-1 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-2 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-3 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-4 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-5 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-6 1/1 Running 0 2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-7 1/1 Running 0 2m7s
...

You can explore the logs of the workload execution directly from the SageMaker Studio IDE.

Figure 12: View the logs of the submitted training run directly in your Code Editor terminal

You can track training and system metrics from the managed MLflow server in SageMaker Studio.

Figure 13: SageMaker Studio directly integrates with a managed MLflow server. You can use it to track training and system metrics directly from your Studio domain

In the SageMaker HyperPod clusters section, you can explore cluster metrics thanks to the integration of SageMaker Studio with SageMaker HyperPod observability.

Figure 14: You can view additional cluster-level and infrastructure metrics, including GPU utilization, in the Compute > SageMaker HyperPod clusters section

At the conclusion of the fine-tuning workload, you can use the same cluster to run batch evaluation workloads on the model. Deploying the pod-evaluation.yaml manifest file runs an evaluation of the fine-tuned model using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text.

The evaluation script uses the same SageMaker HyperPod cluster and compares results against the previously downloaded base model.
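For reference, a minimal version of that ROUGE comparison might look like the following sketch, using the Hugging Face evaluate library. The example predictions and references are hypothetical; the repository's evaluation script remains the source of truth.

import evaluate

rouge = evaluate.load("rouge")

# Hypothetical generated answers from the fine-tuned and base models,
# compared against a reference answer from the test split.
references = ["The patient most likely has iron deficiency anemia."]
finetuned_predictions = ["The findings are most consistent with iron deficiency anemia."]
base_predictions = ["The patient may have a blood disorder."]

# compute() reports rouge1, rouge2, rougeL, and rougeLsum scores for each model.
finetuned_scores = rouge.compute(predictions=finetuned_predictions, references=references)
base_scores = rouge.compute(predictions=base_predictions, references=references)

print("fine-tuned:", finetuned_scores)
print("base model:", base_scores)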

Clean up

To clean up your resources and avoid incurring additional charges, follow these steps:

  1. Delete unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. If you created a SageMaker HyperPod cluster, delete the cluster to stop incurring costs.
  4. If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion

In this post, we discussed how SageMaker HyperPod and SageMaker Studio can improve and speed up the development experience of data scientists by combining the IDEs and tooling of SageMaker Studio with the scalability and resiliency of SageMaker HyperPod with Amazon EKS. The solution also simplifies setup for the administrator of the centralized system by using the governance and security capabilities offered by the AWS services.

We recommend starting your journey by exploring the workshops Amazon EKS Support in Amazon SageMaker HyperPod and Amazon SageMaker HyperPod, and prototyping your customized large language model by using the resources available in the awsome-distributed-training GitHub repository.

A special thanks to our colleagues Nisha Nadkarni (Sr. WW Specialist SA GenAI), Anoop Saha (Sr. Specialist WW Foundation Models), and Mair Hasco (Sr. WW GenAI/ML Specialist) in the AWS ML Frameworks team, for their support in the publication of this post.


About the authors

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
