Beyond accelerators: Lessons from building foundation models on AWS with Japan’s GENIAC program


In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator Challenge (GENIAC), a Japanese national program to boost generative AI by providing companies with funding, mentorship, and massive compute resources for foundation model (FM) development. AWS was selected as the cloud provider for GENIAC’s second cycle (cycle 2), providing infrastructure and technical guidance for 12 participating organizations. On paper, the challenge appeared simple: give each team access to hundreds of GPUs and Trainium chips and let innovation ensue. In practice, successful FM training required far more than raw hardware.

AWS discovered that allocating over 1,000 accelerators was merely the starting point; the real challenge lay in architecting a reliable system and overcoming the obstacles of distributed training. During GENIAC cycle 2, 12 customers successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 Tensor Core GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium1 servers) in a single day. Over the following 6 months, multiple large-scale models were trained, including notable projects such as Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B.

This post shares key insights from this engagement and valuable lessons for enterprises or national initiatives aiming to build FMs at scale.

Cross-functional engagement teams

A critical early lesson from the GENIAC technical engagement was that running a multi-organization, national-scale machine learning (ML) initiative requires coordinated support across diverse internal teams. AWS established a virtual team that brought together account teams, specialist Solutions Architects, and service teams. The GENIAC engagement model thrives on close collaboration between customers and a multi-layered AWS team structure, as illustrated in the following figure.

cross-functional-team-engagement

Customers (Cx) typically consist of business and technical leads, including ML and platform engineers, and are responsible for executing training workloads. AWS account teams (Solutions Architects and Account Managers) manage the relationship, maintain documentation, and keep communication flowing between customers and internal specialists. The Worldwide Specialist Organization (WWSO) Frameworks team specializes in large-scale ML workloads, with a focus on core HPC and container services such as AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker HyperPod. The WWSO Frameworks team is responsible for establishing this engagement structure and supervising technical engagements in this program. They lead the engagement in partnership with other stakeholders and serve as an escalation point for them. They work directly with the service teams for Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon FSx, and SageMaker HyperPod to help navigate engagements and escalations (business and technical) and to make sure the engagement framework is in working order. They provide guidance on training and inference to customers and educate other teams on the technology. The WWSO Frameworks team worked closely with Lead Solutions Architects (Lead SAs), a role specifically designated to support GENIAC engagements. These Lead SAs serve as a cornerstone of the engagement: they are an extension of the Frameworks specialist team, work directly with customers and the account teams, and engage their Frameworks specialist counterparts when clarification or further expertise is required for in-depth technical discussions or troubleshooting. With this layered structure, AWS can scale technical guidance effectively across complex FM training workloads.

Another critical success factor for GENIAC was establishing robust communication channels between customers and AWS team members. The foundation of our communication strategy was a dedicated internal Slack channel for GENIAC program coordination, connecting AWS account teams with Lead SAs. This channel enabled real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues to the appropriate technical specialists and service team members. Complementing this was an external Slack channel that bridged AWS teams and customers, creating a collaborative environment where participants could ask questions, share insights, and receive immediate support. This direct line of communication significantly reduced resolution times and fostered a community of practice among participants.

AWS maintained comprehensive workload tracking documents that captured each customer’s training implementation details (model architecture, distributed training frameworks, and related software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including Amazon FSx for Lustre and Amazon S3). This tracking system also maintained a chronological history of customer interactions and support cases. In addition, the engagement team held weekly review meetings to track outstanding customer inquiries and technical issues. This regular cadence made it possible for team members to share lessons learned and apply them to their own customer engagements, fostering continuous improvement and knowledge transfer across the program.

With a structured approach to communication and documentation, we could identify common challenges, such as a misconfigured NCCL library impacting multi-node performance, share solutions across teams, and continuously refine our engagement model. The detailed tracking system provided valuable insights for future GENIAC cycles, helping us anticipate customer needs and proactively address potential bottlenecks in the FM development process.

Reference architectures

Another early takeaway was the importance of robust reference architectures. Rather than let each team configure its own cluster from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster (for a user-managed HPC cluster) and SageMaker HyperPod (for a managed, resilient cluster service). These reference architectures covered the full stack, from compute, network, and storage to container environments and monitoring, and were delivered as a GitHub repository so teams could deploy them with minimal friction.

AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training. It automates the setup of a Slurm-based HPC cluster on AWS, simplifying provisioning with a simple YAML configuration that stands up the entire environment. For the GENIAC program, AWS also offered SageMaker HyperPod as another option for some teams. SageMaker HyperPod is a managed service that provisions GPU and Trainium clusters for large-scale ML. HyperPod integrates with orchestrators such as Slurm or Kubernetes (Amazon EKS) for scheduling, and provides additional managed functionality around cluster resiliency. By including reference architectures for both AWS ParallelCluster and SageMaker HyperPod, the GENIAC program gave participants flexibility: some opted for the fine-grained control of managing their own HPC cluster, while others preferred the convenience and resilience of a managed SageMaker HyperPod cluster.
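To make the ParallelCluster approach concrete, the following is a minimal sketch of a Slurm-based P5 cluster configuration. The subnet IDs, key pair, queue size, and file system ID are placeholders, and production deployments (including those in the GENIAC reference repository) typically add capacity reservations and further tuning:

```yaml
Region: ap-northeast-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: subnet-aaaa1111            # placeholder public subnet
  Ssh:
    KeyName: my-keypair                  # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: p5
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: 16                   # illustrative queue size
          Efa:
            Enabled: true                # EFA for low-latency inter-node traffic
      Networking:
        SubnetIds:
          - subnet-bbbb2222              # placeholder private subnet
        PlacementGroup:
          Enabled: true                  # pack nodes for network locality
SharedStorage:
  - MountDir: /fsx
    Name: fsx-lustre
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0123456789abcdef0 # existing FSx for Lustre file system
```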

The reference architecture (shown in the following diagram) seamlessly combines compute, networking, storage, and monitoring into an integrated system specifically designed for large-scale FM training.

Cluster Reference Architecture

The base infrastructure is available as an AWS CloudFormation template that provisions the entire stack with minimal effort. This template automatically configures a dedicated virtual private cloud (VPC) with optimized networking settings and implements a high-performance FSx for Lustre file system for training data (complemented by optional Amazon FSx for OpenZFS support for shared home directories). The architecture is completed by an S3 bucket that provides durable, long-term storage for datasets and model checkpoints, maintaining data availability well beyond individual training cycles. This reference architecture employs a hierarchical storage approach that balances performance and cost-effectiveness. It uses Amazon S3 for durable, long-term storage of training data and checkpoints, and links this bucket to the Lustre file system through a data repository association (DRA). The DRA enables automatic and transparent data transfer between Amazon S3 and FSx for Lustre, allowing high-performance access without manual copying. You can use the following CloudFormation template to create the S3 bucket used in this architecture.
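That template remains the authoritative version; purely as a sketch of how the S3-to-Lustre link can be expressed in CloudFormation, the snippet below pairs a bucket with a DRA on an existing file system. The bucket name, file system ID, and paths are hypothetical:

```yaml
Resources:
  TrainingDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-fm-training-data            # hypothetical bucket name

  TrainingDataDRA:
    Type: AWS::FSx::DataRepositoryAssociation
    Properties:
      FileSystemId: fs-0123456789abcdef0         # placeholder FSx for Lustre ID
      FileSystemPath: /data                      # where S3 objects appear on Lustre
      DataRepositoryPath: !Sub "s3://${TrainingDataBucket}"
      BatchImportMetaDataOnCreate: true          # import existing S3 metadata up front
      S3:
        AutoImportPolicy:
          Events: [NEW, CHANGED, DELETED]        # reflect S3 changes on the file system
        AutoExportPolicy:
          Events: [NEW, CHANGED, DELETED]        # export new files and checkpoints to S3
```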

The optional monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana (or a self-managed Grafana server running on Amazon EC2) to provide comprehensive observability. It integrates the DCGM Exporter for GPU metrics and the EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous monitoring of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. For example, the GPU Health Dashboard (see the following screenshot) provides metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi), helping users identify hardware failures as quickly as possible.

xid-error-dashboard
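To show how alerting on these dashboard signals can be wired up, here is a minimal Prometheus alerting rule built on the DCGM Exporter’s XID metric. The threshold, duration, and labels are illustrative choices rather than values from the GENIAC deployment:

```yaml
groups:
  - name: gpu-health
    rules:
      - alert: GpuXidErrorDetected
        # DCGM_FI_DEV_XID_ERRORS reports the most recent XID error code per GPU;
        # any nonzero value indicates a driver-reported GPU fault.
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "XID error on {{ $labels.instance }} (GPU {{ $labels.gpu }})"
          description: "GPU reported XID code {{ $value }}; inspect nvidia-smi and DCGM logs."
```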

Reproducible deployment guides and structured enablement sessions

Even the best reference architectures are only useful if teams know how to use them. A critical element of GENIAC’s success was reproducible deployment guides and structured enablement through workshops. On October 3, 2024, AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, inviting Frameworks team members from the US to share best practices for FM training on AWS.

The enablement session welcomed over 80 participants and offered a comprehensive mix of lectures, hands-on labs, and group discussions, earning a CSAT score of 4.75 and reflecting its strong impact and relevance to attendees. The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with the software components necessary to build and train large-scale FMs on AWS. The sessions highlighted practical challenges in FM development, including massive compute requirements, scalable networking, and high-throughput storage, and mapped them to appropriate AWS services and best practices. (For more information, see the slide deck from the lecture session.) Another session focused on best practices, where attendees learned to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA’s DCGM toolkit and custom Grafana dashboards, drawing on the Frameworks team’s experience managing a cluster with 2,000 P5 instances.

In addition, the WWSO team prepared workshops for both AWS ParallelCluster (Machine Learning on AWS ParallelCluster) and SageMaker HyperPod (Amazon SageMaker HyperPod Workshop), providing detailed deployment guides for the aforementioned reference architectures. Using these materials, participants conducted hands-on exercises deploying their training clusters using Slurm with file systems including FSx for Lustre and FSx for OpenZFS, and running multi-node PyTorch distributed training. Another segment of the workshop focused on observability and performance tuning, teaching participants how to monitor resource utilization, network throughput (EFA traffic), and system health. By the end of these enablement sessions, customers and supporting AWS engineers had established a shared baseline of knowledge and a toolkit of best practices. Using the assets and knowledge gained during the workshops, customers then participated in onboarding sessions: structured, hands-on meetings with their Lead SAs. These sessions differed from the earlier workshops by focusing on customer-specific cluster deployments tailored to each team’s unique use case. During each session, Lead SAs worked directly with teams to deploy training environments, validate the setup using NCCL tests (a sketch of such a validation job follows), and resolve technical issues in real time.
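A minimal Slurm batch script for that kind of NCCL validation might look like the following sketch, assuming the nccl-tests binaries and the AWS OFI NCCL plugin were baked into the cluster image at the paths shown:

```bash
#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2                # two nodes are enough to exercise inter-node EFA paths
#SBATCH --ntasks-per-node=8      # one task per GPU on a p5.48xlarge
#SBATCH --exclusive

# Paths and variables below are assumptions that depend on how the cluster
# image was built; adjust them to match the actual installation.
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
export FI_PROVIDER=efa           # route NCCL traffic over EFA
export NCCL_DEBUG=INFO           # log transport selection so EFA usage can be verified

# Sweep all-reduce message sizes from 8 B to 2 GB; healthy inter-node bus
# bandwidth indicates EFA and NCCL are configured correctly.
srun /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 1
```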

Customer feedback

“To fundamentally solve data entry challenges, we significantly improved processing accuracy and cost-efficiency by applying two-stage reasoning and autonomous learning with SLM and LLM for standard items, and visual learning with VLM using 100,000 synthetic data samples for detailed items. We also utilized Amazon EC2 P5 instances to enhance research and development efficiency. These ambitious initiatives were made possible thanks to the support of many people, including AWS. We are deeply grateful for their extensive support.”

– Takuma Inoue, Executive Officer, CTO at AI Inside

“Future chose AWS to develop large-scale language models specialized for Japanese and software development at GENIAC. When training large-scale models using multiple nodes, Future had concerns about environment settings such as inter-node communication, but AWS offered a wide range of tools, such as AWS ParallelCluster, and we received strong support from AWS Solutions Architects, which enabled us to start large-scale training quickly.”

– Makoto Morishita, Chief Research Engineer at Future

Results and looking ahead

GENIAC has demonstrated that training FMs at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and a cross-functional engagement team (the WWSO Frameworks team, Lead SAs, and account teams), even small teams can successfully execute massive workloads in the cloud. Thanks to this structure, 12 customers launched over 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Several large language models (LLMs) and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.

The technical engagement framework established in GENIAC Cycle 2 has provided crucial insights into large-scale FM development. Building on this experience, AWS is advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. We are strengthening cross-functional collaboration and systematizing knowledge sharing to establish a more efficient support structure. Reference architectures and automated training templates continue to be enhanced, and practical technical workshops and best practices are being codified based on lessons learned.

AWS has already begun preparations for the next cycle of GENIAC. As part of the onboarding process, AWS hosted a comprehensive technical event in Tokyo on April 3, 2025, to equip FM builders with hands-on expertise and architectural guidance. The event, attended by over 50 participants, showcased AWS’s commitment to supporting scalable, resilient generative AI infrastructure.

geniac-event

The event highlighted the AWS technical engagement model for GENIAC, alongside other support mechanisms, including the LLM Development Support Program and the Generative AI Accelerator. The day featured an intensive workshop on SageMaker HyperPod and Slurm, where participants gained hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools. Sessions covered essential topics, including containerized ML, distributed training strategies, and AWS purpose-built silicon solutions. Classmethod Inc. shared practical SageMaker HyperPod insights, and AWS engineers demonstrated architectural patterns for large-scale GPU workloads. The event showcased the end-to-end generative AI support landscape at AWS, from infrastructure to deployment tools, setting the stage for GENIAC Cycle 3. As AWS continues to expand its support for FM development, the success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively.

Through these initiatives, AWS will continue to provide robust technical support, facilitating the smooth execution of large-scale FM training. We remain committed to contributing to the advancement of generative AI development around the world through our technical expertise.

This post was contributed by AWS GENIAC Cycle 2 core members Masato Kobayashi, Kenta Ichiyanagi, and Satoshi Shirasawa, Accelerated Computing Specialist Mai Kiuchi, as well as Lead SAs Daisuke Miyamoto, Yoshitaka Haribara, Kei Sasaki, Soh Ohara, and Hiroshi Tokoyo, with executive sponsorship from Toshi Yasuda. Hiroshi Hata and Tatsuya Urabe also provided support as core member and Lead SA during their time at AWS.

The authors extend their gratitude to WWSO Frameworks members Maxime Hugues, Matthew Nightingale, Aman Shanbhag, Alex Iankoulski, Anoop Saha, Yashesh Shroff, Natarajan Chennimalai Kumar, Shubha Kumbadakone, and Sundar Ranganathan for their technical contributions. Pierre-Yves Aquilanti provided in-depth support during his time at AWS.


About the authors

Keita Watanabe is a Senior Specialist Solutions Architect on the AWS WWSO Frameworks team. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. He leads GENIAC technical engagements.

Masaru Isaka is a Principal Business Development on the AWS WWSO Frameworks team, specializing in machine learning and generative AI solutions. Having engaged with GENIAC since its inception, he leads go-to-market strategies for AWS’s generative AI offerings.
