Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker


Genomic language models are a new and exciting application of large language models to challenges in genomics. In this blog post and open source project, we show you how to pre-train a genomic language model, HyenaDNA, using your genomic data in the AWS Cloud. Here, we use AWS HealthOmics storage as a convenient and cost-effective omic data store and Amazon SageMaker as a fully managed machine learning (ML) service to train and deploy the model.

Genomic language models

Genomic language models represent a new approach in the field of genomics, offering a way to understand the language of DNA. These models use the transformer architecture, originally developed for natural language processing (NLP), to interpret the vast amount of genomic information available, allowing researchers and scientists to extract meaningful insights more accurately than with existing in silico approaches and more cost-effectively than with existing in situ methods.

By bridging the gap between raw genetic data and actionable knowledge, genomic language models hold immense promise for various industries and research areas, including whole-genome analysis, healthcare, pharmaceuticals, and agriculture. They facilitate the discovery of novel gene functions, the identification of disease-causing mutations, and the development of personalized treatment strategies, ultimately driving innovation and advancement in genomics-driven fields. The ability to effectively analyze and interpret genomic data at scale is key to precision medicine, agricultural optimization, and biotechnological breakthroughs, making genomic language models a potential new foundational technology in these industries.

Some of the pioneering genomic language models include:

  • DNABERT, which was one of the first attempts to use the transformer architecture to learn the language of DNA. DNABERT used a Bidirectional Encoder Representations from Transformers (BERT, encoder-only) architecture pre-trained on a human reference genome and showed promising results on downstream supervised tasks.
  • Nucleotide Transformer, which has a similar architecture to DNABERT and showed that pre-training on more data and increasing the context window size improves the model's accuracy on downstream tasks.
  • HyenaDNA, which uses the transformer architecture, like other genomic models, except that it replaces each self-attention layer with a Hyena operator. This widens the context window to allow processing of up to 1 million tokens, significantly more than prior models, allowing it to learn longer-range interactions in DNA.

In our exploration of cutting-edge models that push the boundaries of genetic sequence analysis, we focused on HyenaDNA. Pretrained HyenaDNA models are readily accessible on Hugging Face, making it easy to integrate them into existing projects or use them as a starting point for new explorations in genetic sequence analysis.
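As a quick illustration of that accessibility, the following sketch loads a published checkpoint with the Hugging Face transformers library and runs a short DNA fragment through it. The exact model ID and the need for trust_remote_code are assumptions based on how the HyenaDNA checkpoints are commonly published; substitute the checkpoint you actually intend to use.

# Minimal sketch: load a pretrained HyenaDNA checkpoint from Hugging Face.
# The model ID below is an assumption -- adjust it to your chosen checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LongSafari/hyenadna-small-32k-seqlen-hf"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Tokenize a short DNA fragment and run a forward pass to inspect the logits
sequence = "ATGCGTACGTTAGC"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)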

AWS HealthOmics and sequence stores

AWS HealthOmics is a purpose-built service that helps healthcare and life science organizations and their software partners store, query, and analyze genomic, transcriptomic, and other omics data, and then generate insights from that data to improve health and drive deeper biological understanding. It supports large-scale analysis and collaborative research through HealthOmics storage, analytics, and workflow capabilities.

With HealthOmics storage, a managed omics-focused findable, accessible, interoperable, and reusable (FAIR) data store, users can affordably store, organize, share, and access petabytes of bioinformatics data efficiently at a low cost per gigabase. HealthOmics sequence stores deliver cost savings through automatic tiering and compression of files based on usage, enable sharing and findability through biologically focused metadata and provenance tracking, and provide instant access to frequently used data through low-latency Amazon Simple Storage Service (Amazon S3) compatible APIs or HealthOmics native APIs. All of this is delivered by HealthOmics, removing the burden of managing compression, tiering, metadata, and file organization from customers.
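Creating a sequence store is a small amount of boto3 code. The following is a minimal sketch; the store name, description, and region are placeholders.

# Sketch: create a HealthOmics sequence store with boto3 (names are placeholders).
import boto3

omics = boto3.client("omics", region_name="us-east-1")  # assumed region

response = omics.create_sequence_store(
    name="my-genomics-sequence-store",  # placeholder store name
    description="Sequence store for genomic LM pre-training data",
)
print(response["id"], response["arn"])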

Amazon SageMaker

Amazon SageMaker is a fully managed ML service offered by AWS, designed to reduce the time and cost associated with training and tuning ML models at scale.

With SageMaker Training, a managed batch ML compute service, users can efficiently train models without having to manage the underlying infrastructure. SageMaker notably supports popular deep learning frameworks, including PyTorch, which is integral to the solution presented here.

SageMaker also provides a broad selection of ML infrastructure and model deployment options to help meet all of your ML inference needs.

Solution overview

In this blog post, we focus on pre-training a genomic language model on an assembled genome. This genomic data can be either public (for example, GenBank) or your own proprietary data. The following diagram illustrates the workflow:

Architecture diagram: genomic data from GenBank is loaded into an AWS HealthOmics sequence store from a SageMaker notebook; a SageMaker training job reads the data through the sequence store's S3 access point, downloads the HyenaDNA checkpoint from Hugging Face, saves the trained model, and deploys it to a SageMaker real-time endpoint for inference.

  1. We start with genomic data. For the purposes of this blog post, we're using a public non-reference mouse genome from GenBank. The dataset is part of The Mouse Genomes Project and represents a consensus genome sequence of inbred mouse strains. This type of genomic data could readily be interchanged with proprietary datasets that you might be working with in your research.
  2. We use a SageMaker notebook to process the genomic files and import them into a HealthOmics sequence store.
  3. A second SageMaker notebook is used to start the training job on SageMaker.
  4. Inside the managed SageMaker environment, the training job first downloads the mouse genome using the S3 URI supplied by HealthOmics.
  5. Then the training job retrieves the checkpoint weights of the HyenaDNA model from Hugging Face. These weights are pretrained on the human reference genome. This pretraining allows the model to understand and predict genomic sequences, providing a comprehensive baseline for further specialized training on a variety of genomic tasks.
  6. Using these resources, the HyenaDNA model is trained, using the mouse genome to refine its parameters. After pre-training is complete and validation results are satisfactory, the trained model is saved to Amazon S3.
  7. Then we deploy that model as a SageMaker real-time inference endpoint.
  8. Finally, the model is tested against a set of known genome sequences using some inference API calls.

Data preparation and loading into the sequence store

The initial step in our machine learning workflow focuses on preparing the data. We start by importing the genomic sequences into a HealthOmics sequence store. Although FASTA files are the standard format for storing reference sequences, we convert these to FASTQ format. This conversion better reflects the format expected to store the assembled data of a sequenced sample.

In the sample Jupyter notebook, we show how to download FASTA files from GenBank, convert them into FASTQ files, and then load them into a HealthOmics sequence store. You can skip this step if you already have your own genomic data in a sequence store.
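The sketch below outlines the same two steps in a condensed form: converting a FASTA file to FASTQ with Biopython (assigning a constant placeholder quality score, since assembled sequences have no real base qualities) and then starting a HealthOmics read set import job with boto3. The file names, bucket, role ARN, and sample identifiers are placeholders; see the sample notebook for the full workflow.

# Sketch: convert FASTA to FASTQ and import it into a HealthOmics sequence store.
# File names, bucket names, IDs, and ARNs below are placeholders.
import boto3
from Bio import SeqIO

def fasta_to_fastq(fasta_path: str, fastq_path: str, placeholder_quality: int = 40) -> None:
    """Copy FASTA records into FASTQ, assigning a constant placeholder quality."""
    with open(fastq_path, "w") as out:
        for record in SeqIO.parse(fasta_path, "fasta"):
            record.letter_annotations["phred_quality"] = [placeholder_quality] * len(record.seq)
            SeqIO.write(record, out, "fastq")

fasta_to_fastq("mouse_chr19.fasta", "mouse_chr19.fastq")

# Upload the FASTQ to S3, then import it as a read set.
s3 = boto3.client("s3")
s3.upload_file("mouse_chr19.fastq", "my-staging-bucket", "fastq/mouse_chr19.fastq")

omics = boto3.client("omics")
omics.start_read_set_import_job(
    sequenceStoreId="4308389581",  # your sequence store ID
    roleArn="arn:aws:iam::111122223333:role/OmicsImportRole",  # placeholder role
    sources=[{
        "sourceFiles": {"source1": "s3://my-staging-bucket/fastq/mouse_chr19.fastq"},
        "sourceFileType": "FASTQ",
        "subjectId": "mouse-subject-1",
        "sampleId": "mouse-sample-1",
        "name": "mouse-chr19",
    }],
)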

Training on SageMaker

We use PyTorch and Amazon SageMaker script mode to train this model. Script mode's compatibility with PyTorch was crucial, allowing us to use our existing scripts with minimal modifications. For training, we extract the training data from the sequence store through the sequence store's provided S3 URIs. You can, for example, use the boto3 library to obtain this S3 URI.

import boto3

# HealthOmics client and the ID of the sequence store that holds the read sets
omics = boto3.client("omics")
seq_store_id = "4308389581"

# Look up the store's S3 access point so the training job can read directly from it
seq_store_info = omics.get_sequence_store(id=seq_store_id)
s3_uri = seq_store_info["s3Access"]["s3Uri"]
s3_arn = seq_store_info["s3Access"]["s3AccessPointArn"]
key_arn = seq_store_info["sseConfig"]["keyArn"]
s3_uri, s3_arn, key_arn

S3_DATA_URI = f"{s3_uri}readSet/"
S3_DATA_URI

When you provide this to the SageMaker estimator, the training job takes care of downloading the data from the sequence store through its S3 URI. Following Nguyen et al., we train on chromosomes 2, 4, 6, 8, X, and 14–19; cross-validate on chromosomes 1, 3, 12, and 13; and test on chromosomes 5, 7, and 9–11.
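The split itself amounts to a lookup on the chromosome name of each record. Here is a minimal sketch of that logic (the actual implementation lives in the training script; the function and variable names are illustrative):

# Sketch: assign records to train/validation/test splits by chromosome name,
# following the split used by Nguyen et al. Names here are illustrative.
TRAIN_CHROMS = {"2", "4", "6", "8", "X"} | {str(c) for c in range(14, 20)}
VAL_CHROMS = {"1", "3", "12", "13"}
TEST_CHROMS = {"5", "7"} | {str(c) for c in range(9, 12)}

def assign_split(chromosome: str) -> str:
    """Return the split name for a chromosome, e.g. '19' -> 'train'."""
    if chromosome in TRAIN_CHROMS:
        return "train"
    if chromosome in VAL_CHROMS:
        return "validation"
    if chromosome in TEST_CHROMS:
        return "test"
    raise ValueError(f"Chromosome {chromosome} is not part of any split")

print(assign_split("19"))  # -> "train"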

To maximize the training efficiency of our HyenaDNA model, we use distributed data parallel (DDP). DDP is a technique that facilitates the parallel processing of training tasks across multiple GPUs. To efficiently implement DDP, we used the Hugging Face Accelerate library. Accelerate simplifies running distributed training by abstracting away the complexity typically associated with setting up DDP.
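At its core, Accelerate only asks you to wrap the model, optimizer, and data loader with accelerator.prepare and to replace loss.backward() with accelerator.backward(loss). The following is a condensed sketch of that pattern, shown with a tiny stand-in model so it runs end to end; it is not the project's training script.

# Condensed sketch of the DDP pattern with Hugging Face Accelerate,
# using a tiny stand-in model in place of HyenaDNA.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # discovers the GPUs launched by the distributed job

# Stand-ins for the HyenaDNA model and tokenized genomic batches
model = nn.Linear(16, 4)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
loader = DataLoader(dataset, batch_size=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

# accelerator.prepare wraps everything for distributed data parallel execution
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward() under DDP
    optimizer.step()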

After you have defined your training script, you can configure and submit a SageMaker training job.

First, let's define the hyperparameters, starting with model_checkpoint. This parameter refers to a Hugging Face model ID for a specific pre-trained model. Notably, the HyenaDNA model lineup includes checkpoints that can handle up to 1 million tokens. However, for demonstration purposes, we're using the hyenadna-small-32k-seqlen-hf model, which has a context window of 32,000 tokens, indicated by the max_length setting. Different model IDs and corresponding max_length settings can be chosen to use models with smaller or larger context windows, depending on your computational needs and objectives.

The species parameter is set to mouse, specifying the type of organism the genomic training data represents.

hyperparameters = {
    "species" : "mouse",
    "epochs": 150,
    "model_checkpoint": MODEL_ID,
    "max_length": 32_000,
    "batch_size": 4,
    "learning_rate": 6e-4,
    "weight_decay" : 0.1,
    "log_level" : "INFO",
    "log_interval" : 100
}

Next, define which metrics, specifically the training and validation perplexity, to capture from the training logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"}
]
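These regular expressions assume that the training script prints log lines in a matching format. For example, output of the following form (illustrative; the exact wording must agree with what train_hf_accelerate.py emits) would be captured as metrics:

Epoch: 12
Step: 1200
Train Loss: 1.23
Train Perplexity: 3.42
Eval Average Loss: 1.19
Eval Perplexity: 3.29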

Finally, define a PyTorch estimator and submit a training job that refers to the data location obtained from the HealthOmics sequence store.

hyenaDNA_estimator = PyTorch(
    base_job_name=TRAINING_JOB_NAME,
    entry_point="train_hf_accelerate.py",
    source_dir="scripts/",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    image_uri=pytorch_image_uri,
    role=SAGEMAKER_EXECUTION_ROLE,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    sagemaker_session=sagemaker_session,
    distribution={"torch_distributed": {"enabled": True}},
    tags=[{"Key": "project", "Value": "genomics-model-pretraining"}],
    keep_alive_period_in_seconds=1800,
    tensorboard_output_config=tensorboard_output_config,
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hyenaDNA_estimator.fit(
        {
            "information": TrainingInput(
                s3_data=S3_DATA_URI, input_mode="File"
            ),
        },
        wait=True,
    )

Results

In our training cycle for the model, we processed a dataset consisting of one mouse genome with 10,000 entries. The computational resources included a cluster configured with one ml.g5.12xlarge instance, which houses four NVIDIA A10G GPUs. The 32k sequence-length model was trained using a batch size of 4 per GPU (24 GB of VRAM per GPU). With this setup, we completed 150 epochs and report the results below.

Evaluation metrics: The evaluation perplexity and loss graphs show a downward trend at the outset, which then plateaus. The initial steep decrease indicates that the model rapidly learned from the training data, improving its predictive performance. As training progressed, the rate of improvement slowed, as evidenced by the plateau, which is typical in the later stages of training as the model converges.
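Perplexity here is the exponential of the average per-token cross-entropy loss, so the loss and perplexity curves move together. A one-line check (the loss value is illustrative):

import math

# Perplexity is the exponentiated average cross-entropy (per-token) loss.
eval_loss = 1.19  # illustrative average evaluation loss
print(round(math.exp(eval_loss), 2))  # -> 3.29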

Figure: evaluation loss of the HyenaDNA model over training epochs, decreasing steeply at first and then plateauing as the model converges.

Figure: evaluation perplexity of the HyenaDNA model over training epochs, dropping quickly at first and then stabilizing.

Training metrics: Similarly, the training perplexity and loss graphs indicate an initial sharp improvement followed by a gradual plateau. This shows that the model effectively learned from the data. The slight fluctuations in training loss suggest that the model continued to fine-tune its parameters in response to the inherent complexities in the training dataset.

Figure: training perplexity over training steps, decreasing sharply early on and then stabilizing around 3.2.

Deployment

Upon completion of training, we deployed the model on a SageMaker real-time endpoint. SageMaker real-time endpoints provide an on-demand, scalable way to generate embeddings for genomic sequences.

In our SageMaker real-time endpoint setup, we need to modify the default configuration to handle large payload sizes, specifically 32k context windows for both requests and responses. Because the default payload size of 6.5 MB isn't sufficient, we increase it to a bit over 50 MB:

hyenaDNAModel = PyTorchModel(
    model_data=model_data,
    role=SAGEMAKER_EXECUTION_ROLE,
    image_uri=pytorch_deployment_uri,
    entry_point="inference.py",
    source_dir="scripts/",
    sagemaker_session=sagemaker_session,
    name=endpoint_name,
    env = {
        'TS_MAX_RESPONSE_SIZE':'60000000',
        'TS_MAX_REQUEST_SIZE':'60000000',
    }
)

# deploy the model to a SageMaker real-time endpoint
realtime_predictor = hyenaDNAModel.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    endpoint_name=endpoint_name,
)

By submitting a sequence to the endpoint, users can quickly obtain the corresponding embeddings generated by HyenaDNA. These embeddings encapsulate the complex patterns and relationships learned during training, representing the genetic sequences in a form that's conducive to further analysis and predictive modeling. Here is an example of how to invoke the model.

import json
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

sample_genome_data = []
with open("./sample_mouse_data.json") as file:
    for line in file:
        sample_genome_data.append(json.loads(line))
len(sample_genome_data)

data = [sample_genome_data[0]]
realtime_predictor.serializer = JSONSerializer()
realtime_predictor.deserializer = JSONDeserializer()
realtime_predictor.predict(data=data)

When you submit a sample genomic sequence to the model, it returns the embeddings of that sequence:

{'embeddings': [[-0.50390625, 0.447265625,-1.03125, 0.546875, 0.50390625, -0.53125, 0.59375, 0.71875, 0.349609375, -0.404296875, -4.8125, 0.84375, 0.359375, 1.2265625,………]]}
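Once the embeddings are in hand, they can feed a lightweight downstream model. The following sketch mean-pools the returned embeddings into fixed-length feature vectors and fits a scikit-learn classifier; the labels and the pooling choice are illustrative assumptions, not part of the original solution.

# Sketch: use the returned embeddings as features for a downstream classifier.
# Labels and pooling strategy are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each endpoint response holds the embeddings for one sequence;
# mean-pool them into a single fixed-length feature vector.
def pool(response: dict) -> np.ndarray:
    return np.asarray(response["embeddings"]).mean(axis=0)

features = np.stack(
    [pool(realtime_predictor.predict(data=[seq])) for seq in sample_genome_data]
)
labels = np.random.randint(0, 2, size=len(features))  # placeholder binary labels

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.score(features, labels))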

Conclusion

We've shown how to pre-train a HyenaDNA model with a 32k context window and produce embeddings that can be used for downstream predictive tasks. Using the methods shown here, you can also pre-train a HyenaDNA model with context windows of other sizes (for example, 1 million tokens) and on other genomic data (for example, proprietary genomic sequence data).

Pre-training genomic models on large, diverse datasets is a foundational step in preparing them for downstream tasks, such as identifying genetic variants linked to diseases or predicting gene expression levels. In this blog post, you've learned how AWS facilitates this pre-training process by providing scalable and cost-efficient infrastructure through HealthOmics and SageMaker. Looking ahead, researchers can use these pre-trained models to fast-track their projects, fine-tuning them with specific datasets to gain deeper insights into genetic research.

To explore further details and try your hand at using these resources, we invite you to visit our GitHub repository. Additionally, we encourage you to learn more by visiting the Amazon SageMaker documentation and the AWS HealthOmics documentation.


About the authors

Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), focuses on Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the adoption of Large Language Models (LLMs) for healthcare and life sciences domains with a focus on distributed training. Beyond his professional commitments, Shamika passionately pursues snowboarding and off-roading adventures.

Simon Handley, PhD, is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 25 years of experience in biotechnology and machine learning and is passionate about helping customers solve their machine learning and genomic challenges. In his spare time, he enjoys horseback riding and playing ice hockey.
