Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod


This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance on Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

This post is organized as follows:

  • Overview of Llama 3.3 Swallow
  • Architecture for Llama 3.3 Swallow training
  • Software stack and optimizations employed in Llama 3.3 Swallow training
  • Experiment management

We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continual pre-training and supervised fine-tuning approaches.

Overview of Llama 3.3 Swallow

Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

The model is available in two variants on Hugging Face:

  • A base model, produced by continual pre-training of Meta Llama 3.3 70B Instruct on Japanese-centric data
  • An instruction-tuned model, produced by supervised fine-tuning of the base model on Japanese dialogue and code generation data

Both variants are accessible through the tokyotech-llm organization on Hugging Face, giving researchers and developers flexible options for different application needs.

Training methodology

The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data consisted primarily of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for base model training, approximately 314 billion tokens in total. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100, 80 GB, 256 GPUs), and continual pre-training took 16 days and 6 hours.

For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created by supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the earlier Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain the focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

Performance and benchmarks

The base model has demonstrated remarkable performance on Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model’s performance across these benchmarks (original image).

The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).

Licensing and usage

The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

Training infrastructure architecture

The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store the lifecycle scripts used to configure the SageMaker HyperPod cluster.
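
If you are replicating this setup, the stack can be deployed with the AWS CLI. The following is a minimal sketch in which the stack name and template URL are illustrative placeholders rather than the exact artifacts used in this project:

# Deploy the base infrastructure stack (VPC, FSx for Lustre, and S3 bucket for lifecycle scripts)
# The stack name and template URL below are illustrative placeholders.
aws cloudformation create-stack \
    --stack-name hyperpod-base-infra \
    --template-url https://<template-bucket>.s3.amazonaws.com/hyperpod-infra.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --region ${AWS_REGION}

# Block until the stack has finished creating before provisioning the cluster
aws cloudformation wait stack-create-complete \
    --stack-name hyperpod-base-infra \
    --region ${AWS_REGION}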

Before deploying this infrastructure, it’s essential to confirm that the AWS account has the appropriate service quotas. Deploying SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quotas against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.
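
As a quick way to review the relevant limits, the Service Quotas API can be queried from the AWS CLI. The following sketch filters SageMaker quotas for P5-related entries; the filter string and the commented quota-increase command use placeholder values:

# List SageMaker service quotas and filter for P5-related entries (filter is illustrative)
aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --region ${AWS_REGION} \
    --query 'Quotas[].{Name:QuotaName,Value:Value}' \
    --output table | grep -i "p5"

# Submit an increase request if the current value is below the planned cluster size
# aws service-quotas request-service-quota-increase \
#     --service-code sagemaker --quota-code <QUOTA_CODE> --desired-value 32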

The following diagram illustrates the high-level architecture of the training infrastructure.

Compute and network configuration

The compute infrastructure is based on SageMaker HyperPod, using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single-spine configuration to minimize latency between instances. All communication between GPUs is handled by NCCL over Elastic Fabric Adapter (EFA), providing the high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.
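
For illustration, a HyperPod cluster of this shape can be created with the AWS CLI. The sketch below shows only the worker instance group; the controller and login instance groups required for a Slurm cluster are omitted, and all names, ARNs, and S3 paths are placeholders:

# Create the HyperPod worker instance group (names, role ARN, and S3 paths are placeholders;
# the Slurm controller/login instance groups are omitted for brevity)
aws sagemaker create-cluster \
    --cluster-name <cluster-name> \
    --region ${AWS_REGION} \
    --instance-groups '[{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 32,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://<lifecycle-bucket>/lifecycle-scripts/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-execution-role>",
        "ThreadsPerCore": 1
    }]'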

Storage architecture

The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, which is crucial for handling the massive datasets required for the 70-billion-parameter model.

The following diagram illustrates the storage hierarchy implementation.

The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:


aws fsx create-data-repository-association \
    --file-system-id ${FSX_ID} \
    --file-system-path "/hsmtest" \
    --data-repository-path s3://${BUCKET_NAME_DATA} \
    --s3 AutoImportPolicy='{Events=[NEW,CHANGED,DELETED]}',AutoExportPolicy='{Events=[NEW,CHANGED,DELETED]}' \
    --batch-import-meta-data-on-create \
    --region ${AWS_REGION}

Observability stack

The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated the DCGM Exporter for GPU metrics and the EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows continuous monitoring of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. The following screenshot shows an example GPU health dashboard.
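
Before wiring the exporters into Prometheus, it can be useful to confirm on a compute node that metrics are being emitted. The following check is a sketch that assumes the DCGM Exporter is running on its default port (9400); DCGM_FI_DEV_GPU_UTIL is one of the standard GPU utilization fields it exposes:

# Confirm the DCGM Exporter is serving GPU metrics on its default port (9400)
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# Illustrative output: one gauge per GPU reporting utilization in percent
# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",...} 98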

Software stack and training optimizations

The training environment is built on the SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu-based Amazon Machine Image (AMI) with the essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.
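
A typical way to route NCCL traffic over EFA through the AWS-OFI-NCCL (Libfabric) plugin is via environment variables. The values below are an illustrative baseline rather than the project’s exact settings, and should be checked against the EFA documentation for your AMI version:

# Route NCCL collectives over EFA through the aws-ofi-nccl (Libfabric) plugin
export FI_PROVIDER=efa                  # use the EFA Libfabric provider
export FI_EFA_USE_DEVICE_RDMA=1         # enable GPUDirect RDMA on supported instance types
export NCCL_DEBUG=INFO                  # log NCCL initialization to confirm EFA is selected
export NCCL_SOCKET_IFNAME=^lo,docker    # exclude loopback and container interfaces from bootstrap traffic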

Distributed training implementation

The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, together with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

Advanced parallelism and communication

The team used Megatron-LM’s comprehensive 4D parallelism strategy, which maximizes GPU utilization through careful optimization of communication patterns across multiple dimensions: data, tensor, pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension, together enabling efficient training of massive models.

The implementation overlaps communication across the data parallel, tensor parallel, and pipeline parallel domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).

Megatron-LM enables fine-grained communication overlapping through several configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor-parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.
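
As a sketch of how these options fit together (the parallel sizes shown are placeholders, not the team’s published configuration), the relevant Megatron-LM arguments for a 4D-parallel run with communication overlap look like the following; the data-parallel degree is implied by the world size divided by the tensor- and pipeline-parallel sizes:

# Illustrative Megatron-LM parallelism and overlap arguments (sizes are placeholders)
MEGATRON_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8      # shard each layer across the 8 GPUs of a node
    --pipeline-model-parallel-size 4    # split the layer stack into 4 pipeline stages
    --sequence-parallel                 # partition the sequence dimension alongside tensor parallelism
    --use-distributed-optimizer         # shard optimizer state across data-parallel ranks
    --overlap-grad-reduce               # overlap data-parallel gradient reduction with the backward pass
    --overlap-param-gather              # overlap parameter all-gather with the forward pass
    --tp-comm-overlap                   # overlap tensor-parallel communication with GEMM computation
)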

Checkpointing strategy

The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by the tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

  • Save checkpoints up to 10 times faster than synchronous approaches
  • Minimize training interruption by offloading I/O operations
  • Scale checkpoint performance with the total number of GPUs
  • Maintain consistency through coordinated distributed saves

The checkpointing system automatically saves model states to the FSx for Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3.

For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).
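
In Megatron-LM, the checkpoint format, interval, and asynchronous saving are selected through command-line arguments. The fragment below is a sketch: the save path is a placeholder, and flag names can differ between Megatron-LM releases, so verify them against the arguments of your pinned version:

# Illustrative checkpoint arguments (path is a placeholder; verify flag names for your Megatron-LM version)
MEGATRON_CKPT_ARGS=(
    --save /fsx/checkpoints/llama-3.3-swallow-70b    # write checkpoints to the FSx for Lustre mount
    --save-interval 500                              # checkpoint every 500 iterations
    --ckpt-format torch_dist                         # use the PyTorch Distributed Checkpoint (DCP) backend
    --async-save                                     # offload checkpoint I/O so training iterations are not blocked
)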

Experiment management

In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all potential 4D parallelism configurations. Based on proven algorithmic research, the tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.
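
To give a flavor of the kind of estimate such a tool produces, the following is a deliberately simplified first-order sketch (not the team’s tool): it counts only model state, roughly 16 bytes per parameter for BF16 weights and gradients plus FP32 Adam states, divided across the tensor- and pipeline-parallel dimensions, and ignores activation memory, which the real predictor also models:

# Rough per-GPU model-state estimate: ~16 bytes/param (BF16 weights and grads + FP32 master weights and Adam moments)
# Simplified sketch only: ignores activations, distributed-optimizer sharding, and per-stage imbalance
PARAMS=70e9; TP=8; PP=4
awk -v p="$PARAMS" -v tp="$TP" -v pp="$PP" \
    'BEGIN { printf "approx. %.1f GiB of model state per GPU\n", (p * 16) / (tp * pp) / 2^30 }'
# -> approx. 32.6 GiB per GPU, leaving headroom on an 80 GB H100 for activations and buffers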

The following screenshot shows an example of the memory consumption prediction tool’s interface (original image).

Training pipeline management

The success of the training process relied heavily on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.

For experiment planning, version control was critical. The team first fixed the versions of the pre-training and instruction-tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to ensure reproducibility (a minimal example is sketched after the following list). After the versions were locked, the team conducted short-duration training runs to:

  • Measure throughput with different numbers of GPU nodes
  • Search for optimal configurations among the distributed training settings identified by the memory prediction library
  • Establish accurate training time estimates for scheduling
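
The following is a minimal sketch of that pinning step; the repository URL and tag are placeholders:

# Pin a training library to an exact, reproducible revision (URL and tag are placeholders)
git clone https://github.com/<org>/<training-library>.git
cd <training-library>
git checkout <tag-or-commit-sha>
git rev-parse HEAD > ../pinned-revision.txt    # record the exact commit alongside the experiment metadata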

The following screenshot shows an example experiment schedule with GPU node allocation, expected training duration, and key milestones across different training phases (original image).

To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

# Preload training data into the Lustre file system (8 parallel restore requests)
find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs hsm_restore

Monitoring and performance management

The team implemented a comprehensive monitoring system centered on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).

The monitoring infrastructure excels at identifying both job failures and performance bottlenecks such as stragglers. The following figure presents an example of straggler detection, showing training throughput degradation.

Conclusion

The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs on cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.

Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, including continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

Resources and references

This section provides key resources and references for understanding and replicating the work described in this post. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

Documentation

The following resources provide detailed information about the technologies and frameworks used in this project:

Model resources

For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:


About the Authors

Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

Daisuke Miyamato is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He mainly supports HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development and advanced computing initiatives.

Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.
