Metagenomi generates millions of novel enzymes cost-effectively using AWS Inferentia
This post was written with Audra Devoto, Owen Janson, and Christopher Brown of Metagenomi, and Adam Perry of Tennex.
A promising technique to expand on the extensive natural diversity of high-value enzymes is to use generative AI, specifically protein language models (pLMs) trained on known enzymes, to create orders of magnitude more predicted examples of given enzyme classes. Expanding natural enzyme diversity through generative AI has many advantages, including providing access to numerous enzyme variants that may offer enhanced stability, specificity, or efficacy in human cells. However, high-throughput generation can be costly depending on the size of the model used and the number of enzyme variants needed.
At Metagenomi, we're developing potentially curative therapeutics with proprietary CRISPR gene editing enzymes. We use the extensive natural diversity of enzymes in our database (MGXdb) to identify natural enzyme candidates and to train the protein language models used for generative AI. Expanding natural enzyme classes with generative AI gives us access to additional variants of a given enzyme class, which are filtered with multi-model workflows to predict key enzyme characteristics and used to enable protein engineering campaigns that improve enzyme performance in a given context.
In this blog post, we detail methods for reducing the cost of high-throughput protein generative AI workflows by implementing the Progen2 model on AWS Inferentia, which enabled high-throughput generation of enzyme variants at up to 56% lower cost on AWS Batch and Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances. This work was done in partnership with the AWS Neuron team and engineers at Tennex.
Progen2 on AWS Inferentia
PyTorch models can use AWS Neuron cores as accelerators, which led us to adopt AWS Inferentia powered EC2 Inf2 instance types for our high-throughput protein design workflow to take advantage of their cost-effectiveness and higher availability as Spot Instances. We chose to implement the autoregressive transformer model Progen2 on EC2 Inf2 instance types because it met our needs for high-throughput synthetic protein generation from fine-tuned models, based on previous work running Progen2 on EC2 instances with NVIDIA L40S GPUs (g6e.xlarge), and because there is established support in Neuron for transformer decoder models. To implement Progen2 on EC2 Inf2 instances, we traced custom Progen2 checkpoints trained on proprietary enzymes using the bucketing technique. Tracing models at multiple sizes optimizes performance: sequences are generated on successively larger traced models, with the output of the previous model passed as a prompt to the next one, which minimizes the inference time required to generate each token.
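The bucket selection and padding that tracing requires can be sketched in pure Python as follows (the actual fixed-shape models are produced with `torch_neuronx.trace`); the bucket sizes, pad token id, and helper names are illustrative assumptions, not taken from the Progen2 codebase.

```python
# Illustrative bucket sizes; real values depend on the traced checkpoints.
BUCKET_SIZES = [128, 256, 512, 1024]
# Assumption: Progen2-base was not trained with padding tokens, so a pad id
# must be chosen when padding inputs to a fixed shape for Inf2.
PAD_TOKEN_ID = 0

def pick_bucket(prompt_len: int) -> int:
    """Return the smallest bucket size that fits the prompt."""
    for size in BUCKET_SIZES:
        if prompt_len <= size:
            return size
    raise ValueError(f"prompt of length {prompt_len} exceeds largest bucket")

def pad_to_bucket(token_ids: list) -> list:
    """Right-pad a prompt to its bucket size so the traced model
    always sees the fixed input shape it was compiled for."""
    bucket = pick_bucket(len(token_ids))
    return token_ids + [PAD_TOKEN_ID] * (bucket - len(token_ids))
```

For example, a 50-token prompt (the prompt length used in the experiments below) lands in the 128-token bucket and is padded out to 128 positions.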
However, the tracing and bucketing approach used to enable Progen2 to run on EC2 Inf2 instances introduces some modifications that could affect model accuracy. For example, Progen2-base was not trained with padding tokens, which must be added to input tensors when running on EC2 Inf2 instances. To test the effects of our tracing and bucketing approach, we compared the perplexity and lengths of sequences generated using the progen2-base model on EC2 Inf2 instances to those generated using the native implementation of the progen2-base model on NVIDIA GPUs.
To compare the models, we generated 1,000 protein sequences for each of 10 prompts under both the tracing and bucketing (AWS Inferentia) implementation and the native (NVIDIA GPU) implementation. The set of prompts was created by drawing 10 well-characterized proteins from UniProtKB and truncating each to the first 50 amino acids. To check for abnormalities in the generated sequences, all generated sequences from both models were then run through a forward pass of the native implementation of progen2-base, and the mean perplexity of each sequence was calculated. The following figure illustrates this methodology.
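Given the per-token log-probabilities returned by that forward pass, the perplexity of one sequence can be computed as follows; the function name and input format are illustrative.

```python
import math

def mean_perplexity(token_logprobs: list) -> float:
    """Perplexity of a single sequence: the exponential of the mean
    negative log-likelihood of its tokens under the scoring model."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

As a sanity check, if the model assigned a uniform probability over a 32-token vocabulary to every position, the perplexity would be exactly 32.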

As shown in the following figure, we found that for each prompt, the lengths and perplexities of sequences generated using the tracing and bucketing implementation and the native implementation were comparable.

Scaled inference on AWS Batch
With the basic inference logic for Progen2 on EC2 Inf2 instances worked out, the next step was to massively scale inference across a large fleet of compute nodes. AWS Batch was the ideal service to scale this workflow because it can efficiently run hundreds of thousands of batch computing jobs and dynamically provision the optimal quantity and type of compute resources (such as Amazon EC2 On-Demand or Spot Instances) based on the volume and resource requirements of the submitted jobs.
Progen2 was implemented as a batch workload following best practices. Jobs are submitted by a user and run on a dedicated compute environment that orchestrates Amazon EC2 inf2.xlarge Spot Instances. Custom Docker containers are stored in Amazon Elastic Container Registry (Amazon ECR). Models are pulled down from Amazon Simple Storage Service (Amazon S3) by each job, and generated sequences in the form of a FASTA file are placed in Amazon S3 at the end of each job. Optionally, Nextflow can be used to orchestrate jobs, handle automatic Spot retries, and automate downstream or upstream tasks. The following figure illustrates the solution architecture.
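Under this pattern, submitting one generation job per prompt can be sketched with boto3; the job queue name, job definition name, and environment variable names below are placeholders, not resources from this project.

```python
def make_job_kwargs(prompt_id: str, n_sequences: int = 100) -> dict:
    """Build the arguments for an AWS Batch submit_job call for one
    generation job. Queue and job definition names are illustrative."""
    return {
        "jobName": f"progen2-generate-{prompt_id}",
        "jobQueue": "inf2-spot-queue",       # placeholder queue name
        "jobDefinition": "progen2-neuron",   # placeholder job definition
        "containerOverrides": {
            "environment": [
                {"name": "PROMPT_ID", "value": prompt_id},
                {"name": "NUM_SEQUENCES", "value": str(n_sequences)},
            ]
        },
    }
```

A driver script would then call `boto3.client("batch").submit_job(**make_job_kwargs("P12345"))` once per prompt; with Nextflow, the equivalent submissions and Spot retries are handled by the pipeline instead.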

As shown in the following figure, within each individual batch job, sequences are generated by first loading the smallest bucket size from the available traced models. Sequences are generated out to the maximum tokens for that bucket, and sequences that have generated a stop token or a start token are dropped from further generation. The remaining unfinished sequences are passed to subsequently larger bucket sizes for further generation, until the parameter max_length is reached. Tensors are stacked and decoded at the end and written to a FASTA file that is placed in Amazon S3.
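The per-job loop described above can be sketched as follows; `generate_fn` stands in for a call into a traced Progen2 model for one bucket size, and the stop-token handling is deliberately simplified.

```python
def generate_with_buckets(prompts, buckets, generate_fn,
                          stop_token="*", max_length=1024):
    """Sketch of the per-job loop: generate on the smallest traced model
    first, set finished sequences aside, and pass unfinished ones to
    successively larger buckets until max_length is reached."""
    finished, active = [], list(prompts)
    for bucket in sorted(b for b in buckets if b <= max_length):
        if not active:
            break
        extended = generate_fn(active, bucket)  # traced-model call stand-in
        active = []
        for seq in extended:
            (finished if stop_token in seq else active).append(seq)
    return finished + active  # remaining sequences stopped at max_length
```

In the real workflow, the per-bucket outputs are token tensors that are stacked and decoded once at the end rather than strings.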

The following is an example configuration sketch for setting up an AWS Batch environment using EC2 Inf2 instances that can run Progen2-neuron workloads.
DISCLAIMER: This is a sample configuration for non-production usage. You should work with your security and legal teams to adhere to your organizational security, regulatory, and compliance requirements before deployment.
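As one illustrative shape for that configuration, the following boto3-style payload sketches a managed Spot compute environment pinned to inf2.xlarge. All ARNs, subnet IDs, and security group IDs are placeholders, and a Neuron-compatible ECS AMI (for example, supplied through a launch template) is assumed.

```python
# Payload for boto3's batch.create_compute_environment; all identifiers
# below are placeholders to be replaced with your own resources.
compute_environment = {
    "computeEnvironmentName": "progen2-inf2-spot",
    "type": "MANAGED",
    "computeResources": {
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,          # scale to zero when no jobs are queued
        "maxvCpus": 1024,       # upper bound on fleet size
        "instanceTypes": ["inf2.xlarge"],
        "subnets": ["subnet-EXAMPLE"],
        "securityGroupIds": ["sg-EXAMPLE"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
    "serviceRole": "arn:aws:iam::123456789012:role/AWSBatchServiceRole",
}
```

A job queue referencing this compute environment and a job definition pointing at the container image in Amazon ECR complete the setup.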
Prerequisites
Before implementing this configuration, make sure you have the following prerequisites in place:
The following is an example Dockerfile for containerizing Progen2-neuron. This container image builds on Amazon Linux 2023 and includes the necessary components for running Progen2 on AWS Neuron, including the Neuron SDK, Python dependencies, and the PyTorch and Transformers libraries. It also configures offline mode for Hugging Face operations and includes the required sequence generation scripts.
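A sketch of such a Dockerfile is shown below. Package selections, versions, and script names are illustrative; check the AWS Neuron documentation for the currently supported packages and the Neuron pip repository URL.

```dockerfile
FROM amazonlinux:2023

# Python and build tooling
RUN dnf install -y python3.11 python3.11-pip gcc-c++ && dnf clean all

# PyTorch, Transformers, and the Neuron SDK packages from the Neuron pip repository
RUN pip3.11 install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
    torch-neuronx neuronx-cc torch transformers

# Run Hugging Face libraries in offline mode; models are pulled from Amazon S3 at job start
ENV HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1

# Sequence generation scripts (names are illustrative)
COPY generate_sequences.sh generate.py /opt/progen2/
RUN chmod +x /opt/progen2/generate_sequences.sh
ENTRYPOINT ["/opt/progen2/generate_sequences.sh"]
```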
The following is an example generate_sequences.sh that orchestrates the sequence generation workflow for Progen2 on EC2 Inf2 instances.
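A sketch of such a script is shown below; the environment variable names and the generate.py entry point are illustrative, and the variables are assumed to be supplied by the Batch job definition (for example, through containerOverrides).

```bash
#!/bin/bash
# Illustrative orchestration: pull traced models from S3, generate, push FASTA back.
set -euo pipefail

: "${MODEL_S3_URI:?}"    # s3:// prefix holding the traced model buckets
: "${OUTPUT_S3_URI:?}"   # s3:// prefix for generated FASTA files
: "${PROMPT_ID:?}"       # which prompt this job generates from
: "${NUM_SEQUENCES:=100}"

WORKDIR=$(mktemp -d)

# Pull the traced checkpoints for every bucket size
aws s3 cp --recursive "${MODEL_S3_URI}" "${WORKDIR}/model/"

# Generate sequences on the Neuron cores (generate.py is an illustrative entry point)
python3 /opt/progen2/generate.py \
    --model-dir "${WORKDIR}/model" \
    --prompt-id "${PROMPT_ID}" \
    --num-sequences "${NUM_SEQUENCES}" \
    --output "${WORKDIR}/${PROMPT_ID}.fasta"

# Place the resulting FASTA file in Amazon S3
aws s3 cp "${WORKDIR}/${PROMPT_ID}.fasta" "${OUTPUT_S3_URI}/${PROMPT_ID}.fasta"
```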
Cost comparisons
The primary goal of this project was to lower the cost of generating protein sequences with Progen2 so that we could use this model to significantly expand the diversity of multiple enzyme classes. To compare the cost of generating sequences using both services, we generated 10,000 sequences based on prompts derived from 10 common sequences in UniProtKB, using a temperature of 1.0. Batch jobs were run in parallel, with each job generating 100 sequences for a single prompt. We observed that the implementation of Progen2 on EC2 Inf2 Spot Instances was significantly cheaper than the implementation on Amazon EC2 G6e Spot Instances for longer sequences, representing savings of up to 56%. These cost estimates include expected Amazon EC2 Spot interruption frequencies of 20% for Amazon EC2 g6e.xlarge instances powered by NVIDIA L40S Tensor Core GPUs and 5% for EC2 inf2.xlarge instances. The following figure illustrates total cost, where gray bars represent the average length of generated sequences.

Additional cost savings can be achieved by running jobs at half precision, which appeared to produce equivalent results, as shown in the following figure.

It is important to note that because the time it takes to add new tokens to a sequence during generation scales quadratically with sequence length, the time and cost to generate sequences is highly dependent on the types of sequences you're trying to generate. Overall, we observed that the cost of generating sequences depends on the specific model checkpoint used and the distribution of sequence lengths the model generated. For cost estimates, we recommend generating a small subset of enzymes with your chosen model and extrapolating costs for a larger generation set, instead of attempting to calculate based on costs from previous experiments.
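A back-of-the-envelope extrapolation from such a pilot run might look like the following; the interruption-rate adjustment is a rough illustrative model of retry overhead, not a measured quantity.

```python
def extrapolate_cost(pilot_cost_usd: float, pilot_sequences: int,
                     target_sequences: int,
                     spot_interruption_rate: float = 0.05) -> float:
    """Estimate the cost of a large generation run from a small pilot with
    the same checkpoint and prompts. The interruption rate (5% assumed here
    for inf2.xlarge Spot) inflates the estimate to cover retried jobs."""
    per_sequence = pilot_cost_usd / pilot_sequences
    return per_sequence * target_sequences / (1.0 - spot_interruption_rate)
```

For example, a pilot of 1,000 sequences costing $3 would extrapolate to roughly $3,158 for 1 million sequences at a 5% interruption rate.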
Scaling generation to millions of proteins
To test the scalability of our solution, Metagenomi fine-tuned a model on natural examples of a valuable but rare class of enzymes sourced from Metagenomi's large, proprietary database of metagenomic data. The model was fine-tuned using traditional GPU instances, then traced for AWS Inferentia chips. Using our AWS Batch and Nextflow pipeline, we launched batch jobs to generate well over 1 million enzymes, varying generation parameters between jobs to test the effect of different sampling methods, temperatures, and precisions. The total compute cost of generation using our optimized AWS AI pipeline, including costs incurred from EC2 Inf2 Spot retries, was $2,613 (see the preceding note on estimating costs for your workloads). Generated sequences were validated with a pipeline that used a combination of AI and traditional sequence validation techniques. Sequences were dereplicated using mmseqs, filtered for appropriate length, checked for proper domain structures with hmmsearch, folded using ESMFold, and embedded using AMPLIFY_350M. Structures were used for comparison to known enzymes in the class, and embeddings were used to validate intrinsic enzyme fitness. Results of the generation are shown in the following figure, with several hundred thousand generative AI enzymes plotted in the embedding space. Natural, characterized enzymes used as prompts are shown in red, generative AI enzymes passing all filters are shown in green, and generative AI enzymes not passing filters are shown in orange.
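As a simplified illustration of the first two validation steps, the following pure-Python sketch removes exact duplicates and applies a length filter; the real pipeline dereplicates with mmseqs (which also clusters near-identical sequences), and the hmmsearch, ESMFold, and AMPLIFY_350M stages require the external tools, so they are not sketched here. The length bounds are hypothetical.

```python
def basic_sequence_filters(seqs, min_len=200, max_len=1500):
    """Keep the first occurrence of each sequence whose length falls
    within [min_len, max_len]; order of the input is preserved."""
    seen, kept = set(), []
    for seq in seqs:
        if seq in seen or not (min_len <= len(seq) <= max_len):
            continue
        seen.add(seq)
        kept.append(seq)
    return kept
```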

Conclusion
In this post, we outlined methods to reduce the cost of large-scale protein design projects by up to 56% using Amazon EC2 Inf2 instances, which has allowed Metagenomi to generate millions of novel enzymes across multiple high-value protein classes using models trained on our proprietary protein dataset. This implementation showcases how AWS Inferentia can make large-scale protein generation more accessible and economical for biotechnology applications. To learn more about EC2 Inf2 instances and to start implementing your own workflows on AWS Neuron, see the AWS Neuron documentation. To read more about some of the novel enzymes Metagenomi has discovered, see Metagenomi's publications and posters.
About the authors
Audra Devoto is a Data Scientist with a background in metagenomics and many years of experience working with large genomics datasets on AWS. At Metagenomi, she builds out infrastructure to support large-scale analysis projects and enables discovery of novel enzymes from MGXdb.
Owen Janson is a Bioinformatics Engineer at Metagenomi who focuses on building tools and cloud infrastructure to support analysis of massive genomic datasets.
Adam Perry is a seasoned cloud architect with deep expertise in AWS, where he has designed and automated complex cloud solutions for hundreds of businesses. As a co-founder of Tennex, he has led technology strategy, built custom tools, and collaborated closely with his team to help early-stage biotech companies scale securely and efficiently in the cloud.
Christopher Brown, PhD, is head of the Discovery team at Metagenomi. He is an accomplished scientist and expert in metagenomics, and has led the discovery and characterization of numerous novel enzyme systems for gene editing applications.
Jamal Arif is a Senior Solutions Architect and Generative AI Specialist at AWS with over a decade of experience helping customers design and operationalize next-generation AI and cloud-native architectures. His work focuses on agentic AI, Kubernetes, and modernization frameworks, guiding enterprises through scalable adoption strategies and production-ready design patterns. Jamal builds thought-leadership content and speaks at AWS Summits, re:Invent, and industry conferences, sharing best practices for building secure, resilient, and high-impact AI solutions.
Pavel Novichkov, PhD, is a Senior Solutions Architect at AWS specializing in genomics and life sciences. He brings over 15 years of bioinformatics and cloud development experience to help healthcare and life sciences startups design and implement cloud-based solutions on AWS. He completed his postdoc at the National Center for Biotechnology Information (NIH) and served as a Computational Research Scientist at Berkeley Lab for over 12 years, where he co-developed innovative NGS-based technology that was recognized among Berkeley Lab's top 90 breakthroughs in its history.