Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Half 2

This can be a visitor put up co-written with Meta’s PyTorch group and is a continuation of Part 1 of this sequence, the place we reveal the efficiency and ease of working PyTorch 2.0 on AWS.

Machine studying (ML) analysis has confirmed that giant language fashions (LLMs) skilled with considerably massive datasets lead to higher mannequin high quality. In the previous few years, the scale of present era fashions has elevated considerably, and so they require trendy instruments and infrastructure to be skilled effectively and at scale. PyTorch Distributed Knowledge Parallelism (DDP) helps course of information at scale in a easy and strong method, but it surely requires the mannequin to suit on one GPU. The PyTorch Totally Sharded Knowledge Parallel (FSDP) library breaks this barrier by enabling mannequin sharding to coach massive fashions throughout information parallel employees.

Distributed mannequin coaching requires a cluster of employee nodes that may scale. Amazon Elastic Kubernetes Service (Amazon EKS) is a well-liked Kubernetes-conformant service that significantly simplifies the method of working AI/ML workloads, making it extra manageable and fewer time-consuming.

On this weblog put up, AWS collaborates with Meta’s PyTorch group to debate the best way to use the PyTorch FSDP library to realize linear scaling of deep studying fashions on AWS seamlessly utilizing Amazon EKS and AWS Deep Learning Containers (DLCs). We reveal this by a step-by-step implementation of coaching 7B, 13B, and 70B Llama2 fashions utilizing Amazon EKS with 16 Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge situations (every with 8 NVIDIA A100 Tensor Core GPUs and every GPU with 80 GB HBM2e reminiscence) or 16 EC2 p5.48xlarge situations (every with 8 NVIDIA H100 Tensor Core GPUs and every GPU with 80 GB HBM3 reminiscence), attaining close to linear scaling in throughput and finally enabling sooner coaching time.

The next scaling chart exhibits that the p5.48xlarge situations supply 87% scaling effectivity with FSDP Llama2 fine-tuning in a 16-node cluster configuration.

Challenges of coaching LLMs

Companies are more and more adopting LLMs for a variety of duties, together with digital assistants, translation, content material creation, and laptop imaginative and prescient, to boost the effectivity and accuracy in quite a lot of functions.

Nonetheless, coaching or fine-tuning these massive fashions for a customized use case requires a considerable amount of information and compute energy, which provides to the general engineering complexity of the ML stack. That is additionally attributable to restricted reminiscence accessible on a single GPU, which restricts the scale of the mannequin that may be skilled, and likewise limits the per-GPU batch measurement used throughout coaching.

To handle this problem, numerous mannequin parallelism strategies akin to DeepSpeed ZeRO and PyTorch FSDP had been created to assist you to overcome this barrier of restricted GPU reminiscence. That is accomplished by adopting a sharded information parallel approach, the place every accelerator holds only a slice (a shard) of a mannequin reproduction as a substitute of the whole mannequin reproduction, which dramatically reduces the reminiscence footprint of the coaching job.

This put up demonstrates how you need to use PyTorch FSDP to fine-tune the Llama2 mannequin utilizing Amazon EKS. We obtain this by scaling out compute and GPU capability to handle the mannequin necessities.

FSDP overview

In PyTorch DDP coaching, every GPU (known as a employee within the context of PyTorch) holds an entire copy of the mannequin, together with the mannequin weights, gradients, and optimizer states. Every employee processes a batch of knowledge and, on the finish of the backward cross, makes use of an all-reduce operation to synchronize gradients throughout totally different employees.

Having a reproduction of the mannequin on every GPU restricts the scale of the mannequin that may be accommodated in a DDP workflow. FSDP helps overcome this limitation by sharding mannequin parameters, optimizer states, and gradients throughout information parallel employees whereas nonetheless preserving the simplicity of knowledge parallelism.

That is demonstrated within the following diagram, the place within the case of DDP, every GPU holds an entire copy of the mannequin state, together with the optimizer state (OS), gradients (G), and parameters (P): M(OS + G + P). In FSDP, every GPU holds solely a slice of the mannequin state, together with the optimizer state (OS), gradients (G), and parameters (P): M<partition quantity>(OS + G + P). Utilizing FSDP ends in a considerably smaller GPU reminiscence footprint in comparison with DDP throughout all employees, enabling the coaching of very massive fashions or utilizing bigger batch sizes for coaching jobs.

This, nonetheless, comes at the price of elevated communication overhead, which is mitigated by FSDP optimizations akin to overlapping communication and computation processes with options like pre-fetching. For extra detailed info, check with Getting Started with Fully Sharded Data Parallel (FSDP).

FSDP affords numerous parameters that assist you to tune the efficiency and effectivity of your coaching jobs. Among the key options and capabilities of FSDP embrace:

  • Transformer wrapping coverage
  • Versatile combined precision
  • Activation checkpointing
  • Varied sharding methods to swimsuit totally different community speeds and cluster topologies:
    • FULL_SHARD – Shard mannequin parameters, gradients, and optimizer states
    • HYBRID_SHARD – Full shard inside a node DDP throughout nodes; helps a versatile sharding group for a full reproduction of the mannequin (HSDP)
    • SHARD_GRAD_OP – Shard solely gradients and optimizer states
    • NO_SHARD – Much like DDP

For extra details about FSDP, check with Efficient Large-Scale Training with Pytorch FSDP and AWS.

The next determine exhibits how FSDP works for 2 information parallel processes.

Resolution overview

On this put up, we arrange a compute cluster utilizing Amazon EKS, which is a managed service to run Kubernetes within the AWS Cloud and on-premises information facilities. Many purchasers are embracing Amazon EKS to run Kubernetes-based AI/ML workloads, making the most of its efficiency, scalability, reliability, and availability, in addition to its integrations with AWS networking, safety and different companies.

For our FSDP use case, we use the Kubeflow Training Operator on Amazon EKS, which is a Kubernetes-native undertaking that facilitates fine-tuning and scalable distributed coaching for ML fashions. It helps numerous ML frameworks, together with PyTorch, which you need to use to deploy and handle PyTorch coaching jobs at scale.

Using the PyTorchJob customized useful resource of Kubeflow Coaching Operator, we run coaching jobs on Kubernetes with a configurable variety of employee replicas which permits us to optimize useful resource utilization.

The next are a couple of elements of the coaching operator that play a job in our Llama2 fine-tuning use case:

  • A centralized Kubernetes controller that orchestrates distributed coaching jobs for PyTorch.
  • PyTorchJob, a Kubernetes customized useful resource for PyTorch, offered by the Kubeflow Coaching Operator, to outline and deploy Llama2 coaching jobs on Kubernetes.
  • etcd, which is said to the implementation of the rendezvous mechanism for coordinating the distributed coaching of PyTorch fashions. Thisetcdserver, as a part of the rendezvous course of, facilitates the coordination and synchronization of the taking part employees throughout distributed coaching.

The next diagram illustrates the answer structure.

A lot of the particulars will probably be abstracted by the automation scripts that we use to run the Llama2 instance.

We use the next code references on this use case:

What’s Llama2?

Llama2 is a LLM pre-trained on 2 trillion tokens of textual content and code. It is likely one of the largest and strongest LLMs accessible at this time You need to use Llama2 for quite a lot of duties, together with pure language processing (NLP), textual content era, and translation. For extra info, check with Getting started with Llama.

Llama2 is out there in three totally different mannequin sizes:

  • Llama2-70b – That is the biggest Llama2 mannequin, with 70 billion parameters. It’s the strongest Llama2 mannequin and can be utilized for essentially the most demanding duties.
  • Llama2-13b – This can be a medium-sized Llama2 mannequin, with 13 billion parameters. It’s a good steadiness between efficiency and effectivity, and can be utilized for quite a lot of duties.
  • Llama2-7b – That is the smallest Llama2 mannequin, with 7 billion parameters. It’s the most effective Llama2 mannequin, and can be utilized for duties that don’t require the best stage of efficiency.

This put up allows you to fine-tune all of those fashions on Amazon EKS. To supply a easy and reproducible expertise of making an EKS cluster and working FSDP jobs on it, we use the aws-do-eks undertaking. The instance can even work with a pre-existing EKS cluster.

A scripted walkthrough is out there on GitHub for an out-of-the-box expertise. Within the following sections, we clarify the end-to-end course of in additional element.

Provision the answer infrastructure

For the experiments described on this put up, we use clusters with p4de (A100 GPU) and p5 (H100 GPU) nodes.

Cluster with p4de.24xlarge nodes

For our cluster with p4de nodes, we use the next eks-gpu-p4de-odcr.yaml script:

export ODCR_ID=<your-capacityreservation-id>

cat > ./eks-gpu-p4de-odcr.yaml <<EOF
variety: ClusterConfig
  title: do-eks-yaml-p4de-odcr
  model: "1.28"
  area: us-east-1
  tags: do-eks-yaml-p4de-odcr
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
  - title: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
        autoScaler: true
        cloudWatch: true
  - title: p4de-odcr
    instanceType: p4de.24xlarge
    instancePrefix: p4de-odcr
    privateNetworking: true
      - us-east-1c
    efaEnabled: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 64
    volumeSize: 500
        capacityReservationID: $ODCR_ID
        cloudWatch: true
        ebs: true
        fsx: true
  withOIDC: true

Utilizing eksctl and the previous cluster manifest, we create a cluster with p4de nodes:

eksctl create cluster -f ./eks-gpu-p4de-odcr.yaml

Cluster with p5.48xlarge nodes

A terraform template for an EKS cluster with P5 nodes is positioned within the following GitHub repo.

You’ll be able to customise the cluster through the file after which create it through the Terraform CLI:

terraform init && terraform plan -out tfplan && terraform apply tfplan

You’ll be able to confirm the cluster availability by working a easy kubectl command:

The cluster is wholesome if the output of this command exhibits the anticipated variety of nodes in Prepared standing.

Deploy conditions

To run FSDP on Amazon EKS, we use the PyTorchJob customized useful resource. It requires etcd and Kubeflow Training Operator as conditions.

Deploy etcd with the next code:

kubectl apply -f

Deploy Kubeflow Coaching Operator with the next code:

kubectl apply -k ""

Construct and push an FSDP container picture to Amazon ECR

Use the next code to construct an FSDP container picture and push it to Amazon Elastic Container Registry (Amazon ECR):

# Obtain Dockerfile
curl -L -o ./Dockerfile.llama2-efa

# Construct Picture
AWS_REGION=$(aws configure get area)
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output textual content)

docker construct --progress=plain -t ${REGISTRY}${IMAGE}${TAG} -f ./Dockerfile.llama2-efa .

# Log in to ECR, create registry, push picture
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
aws ecr create-repository --repository-name ${IMAGE}
docker picture push ${REGISTRY}${IMAGE}${TAG}

Create the FSDP PyTorchJob manifest

Insert your Hugging Face token within the following snippet previous to working it:


Configure your PyTorchJob with .env file or immediately in your surroundings variables as under:


CMD="huggingface-cli login --token ${HF_TOKEN} && torchrun --nproc_per_node=${GPU_PER_WORKER} --nnodes=${NUM_WORKERS} examples/ --num_epochs=5 --batch_size_training=3 --enable_fsdp --model_name $MODEL_NAME --output_dir ."

Generate the PyTorchJob manifest utilizing the fsdp template and script or create it immediately utilizing the script under:

cat > ./fsdp.yaml <<EOF
variety: PyTorchJob
  title: $JOB_NAME
    rdzvBackend: etcd
    rdzvHost: $RDZV_HOST
    rdzvPort: $RDZV_PORT
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
      - kind: Useful resource
        useful resource:
          title: cpu
            kind: Utilization
            averageUtilization: 90
      replicas: $NUM_WORKERS
      restartPolicy: OnFailure
            app: $JOB_NAME
            - title: shmem
                path: /dev/shm
            - title: pytorch
              picture: '${REGISTRY}${IMAGE}${TAG}'
              imagePullPolicy: All the time
                - title: LOGLEVEL
                  worth: DEBUG
                - title: NCCL_DEBUG
                  worth: INFO
                - title: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  worth: '1'
                - bash
                - '-c'
                - '${CMD}'
                - title: shmem
                  mountPath: /dev/shm

Run the PyTorchJob

Run the PyTorchJob with the next code:

kubectl apply -f ./fsdp.yaml

You will notice the desired variety of FDSP employee pods created and, after pulling the picture, they are going to enter right into a Operating state.

To see the standing of the PyTorchJob, use the next code:

kubectl describe -f ./fsdp.yaml

To cease the PyTorchJob, use the next code:

kubectl delete -f ./fsdp.yaml

After a job is full, it must be deleted earlier than initiating a brand new run. We’ve additionally noticed that deleting theetcdpod and letting it restart previous to launching a brand new job helps keep away from a RendezvousClosedError.

Scale the cluster

You’ll be able to repeat the previous steps of making and working jobs whereas various the quantity and occasion kind of employee nodes within the cluster. This allows you to produce scaling charts just like the one proven earlier. Usually, you need to see a discount in GPU reminiscence footprint, discount in epoch time, and enhance in throughput when extra nodes are added to the cluster. The earlier chart was produced by conducting a number of experiments utilizing a p5 node group various from 1–16 nodes in measurement.

Observe the FSDP coaching workload

Observability of generative synthetic intelligence workloads is essential to permit visibility into your working jobs in addition to assist in maximizing the utilization of your compute sources. On this put up, we use a couple of Kubernetes-native and open supply observability instruments for this objective. These instruments allow you to trace errors, statistics, and mannequin habits, making AI observability a vital a part of any enterprise use case. On this part, we present numerous approaches for observing FSDP coaching jobs.

Employee pod logs

On the most elementary stage, you want to have the ability to see the logs of your coaching pods. This may simply be accomplished through the use of Kubernetes-native instructions.
First, retrieve an inventory of pods and find the title of the one that you simply wish to see logs for:

Then view the logs for the chosen pod:

kubectl logs -f <pod_name>

Just one employee (elected chief) pod log will record the general job statistics. The title of the elected chief pod is out there in the beginning of every employee pod log, recognized by the important thing master_addr=.

CPU utilization

Distributed coaching workloads require each CPU and GPU sources. To optimize these workloads, it’s essential to grasp how these sources are utilized. Fortuitously, some nice open supply utilities can be found that assist visualize CPU and GPU utilization. For viewing CPU utilization, you need to usehtop. In case your employee pods include this utility, you need to use the under command to open a shell right into a pod after which runhtop.

kubectl exec -it <pod_name> -- bash

Alternatively, you’ll be able to deploy an htopdaemonsetjust like the one offered within the following GitHub repo.

Thedaemonsetwill run a light-weight htop pod on every node. You’ll be able to exec into any of those pods and run thehtopcommand:

kubectl exec -it <htop_pod_name> -- htop

The next screenshot exhibits the CPU utilization on one of many nodes within the cluster. On this case, we’re a P5.48xlarge occasion, which has 192 vCPUs. The processor cores are idle whereas the mannequin weights are downloaded, and we see rising utilization whereas the mannequin weights are being loaded to GPU reminiscence.

GPU utilization

If thenvtoputility is out there in your pod, you might exec into it utilizing under after which runnvtop.

kubectl exec -it <pod_name> -- bash

Alternatively, you’ll be able to deploy a nvtopdaemonsetjust like the one offered within the following GitHub repo.

This may run anvtoppod on every node. You’ll be able to exec into any of these pods and runnvtop:

kubectl exec -it <nvtop_pod_name> -- nvtop

The next screenshot exhibits the GPU utilization on one of many nodes within the coaching cluster. On this case, we’re a P5.48xlarge occasion, which has 8 NVIDIA H100 GPUs. The GPUs are idle whereas the mannequin weights are downloaded, then GPU reminiscence utilization will increase because the mannequin weights are loaded onto the GPU, and GPU utilization spikes to 100% whereas the coaching iterations are underway.

Grafana dashboard

Now that you simply perceive how your system works on the pod and node stage, it’s additionally essential to take a look at metrics on the cluster stage. Aggregated utilization metrics will be collected by NVIDIA DCGM Exporter and Prometheus and visualized in Grafana.

An instance Prometheus-Grafana deployment is out there within the following GitHub repo.

An instance DCGM exporter deployment is out there within the following GitHub repo.

A easy Grafana dashboard is proven within the following screenshot. It was constructed by deciding on the next DCGM metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_MEM_COPY_UTIL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_POWER_USAGE. The dashboard will be imported into Prometheus from GitHub.

The next dashboard exhibits one run of a Llama2 7b single epoch coaching job. The graphs present that because the streaming multiprocessor (SM) clock will increase, the facility draw and temperature of the GPUs enhance as effectively, along with GPU and reminiscence utilization. You may as well see that there have been no XID errors and the GPUs had been wholesome throughout this run.

Since March 2024 GPU observability for EKS is supported natively in CloudWatch Container Insights. To allow this performance simply deploy the CloudWatch Observability Add-on in your EKS cluster. Then it is possible for you to to browse pod, node, and cluster stage metrics by pre-configured and customizable dashboards in Container Insights.

Clear up

If you happen to created your cluster utilizing the examples offered on this weblog, you’ll be able to execute the next code to delete the cluster and any sources related to it, together with the VPC:
For eksctl:

eksctl delete cluster -f ./eks-gpu-p4de-odcr.yaml

For terraform:

Upcoming options

FSDP is predicted to incorporate a per-parameter sharding characteristic, aiming to additional enhance its reminiscence footprint per GPU. Moreover, the continuing improvement of FP8 assist goals to enhance FSDP efficiency on H100 GPUs. Lastly, when FSDP is built-in withtorch.compile, we hope to see further efficiency enhancements and enablement of options like selective activation checkpointing.


On this put up, we mentioned how FSDP reduces the reminiscence footprint on every GPU, enabling the coaching of bigger fashions extra effectively and attaining close to linear scaling in throughput. We demonstrated this by a step-by-step implementation of coaching a Llama2 mannequin utilizing Amazon EKS on P4de and P5 situations and used observability instruments like kubectl, htop, nvtop, and dcgm to observe logs, in addition to CPU and GPU utilization.

We encourage you to reap the benefits of PyTorch FSDP on your personal LLM coaching jobs. Get began at aws-do-fsdp.

Concerning the Authors

Kanwaljit Khurmi is a Principal AI/ML Options Architect at Amazon Net Providers. He works with AWS clients to offer steerage and technical help, serving to them enhance the worth of their machine studying options on AWS. Kanwaljit focuses on serving to clients with containerized, distributed computing and deep studying functions.

Alex Iankoulski is a Principal Options Architect, Self-managed Machine Studying at AWS. He’s a full-stack software program and infrastructure engineer who likes to do deep, hands-on work. In his function, he focuses on serving to clients with containerization and orchestration of ML and AI workloads on container-powered AWS companies. He’s additionally the creator of the open supply do framework and a Docker captain who loves making use of container applied sciences to speed up the tempo of innovation whereas fixing the world’s largest challenges.

Ana Simoes is a Principal Machine Studying Specialist, ML Frameworks at AWS. She helps clients deploying AI, ML, and generative AI at a big scale on HPC infrastructure within the cloud. Ana focuses on supporting clients to realize price-performance for brand new workloads and use circumstances for generative AI and machine studying.

Hamid Shojanazeri is a Accomplice Engineer at PyTorch engaged on open supply, high-performance mannequin optimization, distributed coaching (FSDP), and inference. He’s the co-creator of llama-recipe and contributor to TorchServe. His important curiosity is to enhance cost-efficiency, making AI extra accessible to the broader group.

Much less Wright is an AI/Accomplice Engineer in PyTorch. He works on Triton/CUDA kernels (Accelerating Dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).

Leave a Reply

Your email address will not be published. Required fields are marked *