Import a fine-tuned Meta Llama 3 model for SQL query generation on Amazon Bedrock


Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI practices.

Some FMs are publicly available, which allows for customization tailored to specific use cases and domains. However, deploying customized FMs to support generative AI applications in a secure and scalable manner isn't a trivial task. Hosting large models involves complexity around the selection of instance type and deployment parameters. To address this challenge, AWS recently announced the preview of Amazon Bedrock Custom Model Import, a feature that you can use to import customized models created in other environments, such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises, into Amazon Bedrock. This feature abstracts the complexity of the deployment process through simple APIs for model deployment and invocation. Currently, Custom Model Import supports importing custom weights for selected model architectures (Meta Llama 2 and Llama 3, Flan, and Mistral) and precisions (FP32, FP16, and BF16), and serving the models on demand and with provisioned throughput.

Customizing FMs can unlock significant value by tailoring their capabilities to specific domains or tasks. This is the first in a series of posts about model customization scenarios that can be imported into Amazon Bedrock to simplify the process of building scalable and secure generative AI applications. By demonstrating the process of deploying fine-tuned models, we aim to empower data scientists, ML engineers, and application developers to harness the full potential of FMs while addressing unique application requirements.

In this post, we demonstrate the process of fine-tuning Meta Llama 3 8B on SageMaker to specialize it in the generation of SQL queries (text-to-SQL). Meta Llama 3 8B is a relatively small model that offers a balance between performance and resource efficiency. AWS customers have explored fine-tuning Meta Llama 3 8B for the generation of SQL queries, especially when using non-standard SQL dialects, and have requested methods to import their customized models into Amazon Bedrock to benefit from the managed infrastructure and security that Amazon Bedrock provides when serving these models.

Solution overview

We walk through the steps of fine-tuning an FM using SageMaker, then importing and evaluating the fine-tuned FM for SQL query generation using Amazon Bedrock. The end-to-end flow is shown in the following figure and covers these steps:

  1. The user invokes a SageMaker training job to fine-tune the model using QLoRA and store the weights in an Amazon Simple Storage Service (Amazon S3) bucket in the user's account.
  2. When the fine-tuning job is complete, the user runs the model import job using the Amazon Bedrock console. This step runs Steps 3–5 automatically.
  3. Amazon Bedrock starts an import job in an AWS operated deployment account.
  4. Model artifacts are copied from the user's account into an AWS managed S3 bucket.
  5. When the import job is complete, the fine-tuned model is made available to be invoked.

Bedrock custom model import architecture

All data stays within the selected AWS Region, the model artifacts are imported into the AWS operated deployment account using a VPC endpoint, and you can encrypt your model data with your own AWS Key Management Service (AWS KMS) keys. The scripts for fine-tuning and evaluation are available on the GitHub repository.

A copy of your model artifacts is stored in an AWS operated deployment account. This copy will remain until the custom model is deleted. Deleting artifacts in the user's account won't delete the model or the artifacts in the AWS operated account. If different versions of a model are imported into Amazon Bedrock, each version will be managed as an independent project with its own set of artifacts. You can apply tags to models and import jobs to keep track of different projects and versions.

Meta Llama 3 8B is a gated model on Hugging Face, which means that users must be granted access before they're allowed to download and customize the model. Sign in to your Hugging Face account, read the Meta Llama 3 Acceptable Use Policy, and submit your contact information to be granted access. This process might take a couple of hours.

We use the sql-create-context dataset available on Hugging Face for fine-tuning. The dataset contains 78,577 tuples of context (table schema), question (query expressed in natural language), and answer (SQL query). Refer to the licensing information regarding this dataset before proceeding further.

We use Amazon SageMaker Studio to create a remote fine-tuning job, which will run as a SageMaker training job. SageMaker Studio is a single web-based interface for end-to-end machine learning (ML) development. If you need help configuring your SageMaker Studio domain and your JupyterLab environment, see Launch Amazon SageMaker Studio. The training job will use QLoRA and the PyTorch FullyShardedDataParallel API (FSDP) to fine-tune the Meta Llama 3 model. QLoRA quantizes a pretrained language model to 4 bits and attaches smaller low-rank adapters (LoRA), which are fine-tuned with our training data. PyTorch FSDP is a parallelism technique that shards the model across GPUs for efficient training. See the following notebook for the complete code sample.
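
For reference, FSDP is typically enabled in a Hugging Face training script through TrainingArguments. The following is a minimal, illustrative sketch; the exact values used in this post live in run_fsdp_qlora.py:

from transformers import TrainingArguments

# Illustrative FSDP settings; the actual configuration lives in run_fsdp_qlora.py
training_args = TrainingArguments(
    output_dir="/opt/ml/model",      # SageMaker's default model output path
    num_train_epochs=2,
    per_device_train_batch_size=1,
    bf16=True,
    fsdp="full_shard auto_wrap",     # shard parameters, gradients, and optimizer state across GPUs
)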

Data preparation

In the data preparation stage, we use the following prompt template to insert specific instructions for interpreting the context and fulfilling the request, and store the modified training dataset as JSON files that are uploaded to Amazon S3:

system_message = """You're a highly effective text-to-SQL mannequin. Your job is to reply questions on a database."""

def create_conversation(report):
    pattern = {"messages": [
        {"role": "system", "content": system_message + f"""You can use the following table schema for context: {record["context"]}"""},
        {"position": "person", "content material": f"""Return the SQL question that solutions the next query: {report["question"]}"""},
        {"position" : "assistant", "content material": f"""{report["answer"]}"""}
    ]}
    return pattern
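
As a rough sketch of this step, the template can be applied with the Hugging Face datasets library and the results uploaded to Amazon S3. The dataset ID b-mc2/sql-create-context matches the Hugging Face dataset page, and the bucket path is an illustrative placeholder:

from datasets import load_dataset
from sagemaker.s3 import S3Uploader

# Load the text-to-SQL dataset and apply the prompt template to every record
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.map(create_conversation, remove_columns=list(dataset.features))
dataset = dataset.train_test_split(test_size=0.05)

# Persist as JSON and upload to the bucket read by the training job (bucket name is illustrative)
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
S3Uploader.upload("train_dataset.json", "s3://my-training-bucket/datasets/sql-create-context")
S3Uploader.upload("test_dataset.json", "s3://my-training-bucket/datasets/sql-create-context")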

Fine-tune the Meta Llama 3 8B model

Refer to the run_fsdp_qlora.py file defined in the notebook for a full description of the fine-tuning script. The following snippets describe the configuration of the QLoRA job:

if script_args.use_qlora:
    print(f"Utilizing QLoRA - {torch_dtype}")
    quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )
else:
    quantization_config = None

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

The trainer class is based on the Supervised Fine-tuning Trainer (SFT Trainer) from Hugging Face, which is an API to create your SFT models and train them with a few lines of code:

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    eval_dataset=test_dataset,
    peft_config=peft_config,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # No need to add an additional separator token
    },
)

Once the adapter is trained, it's merged with the original model before persisting the weights. Custom Model Import doesn't support LoRA adapters at the moment.

model = model.merge_and_unload()
model.save_pretrained(
    sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
)

For this use case, we use an ml.g5.12xlarge instance, which has four NVIDIA A10G accelerators. The key configurations are as follows:

huggingface_estimator = HuggingFace(
    entry_point="run_fsdp_qlora.py",    # train script
    source_dir="scripts/trl/",      # directory with all the files needed for training
    instance_type="ml.g5.12xlarge",   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.36.0',          # the transformers version used in the training job
    pytorch_version      = '2.1.0',           # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    disable_output_compression = True,        # don't compress output to save training time and cost
    distribution={"torch_distributed": {"enabled": True}},
    environment          = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", # set env variable to cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),       # Hugging Face token used to download the gated base model
        "ACCELERATE_USE_FSDP": "1",
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"
    },
)
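
The remote job is then launched with a call to fit(). A minimal sketch follows; the training channel name and S3 path are illustrative:

# Launch the remote training job; the channel name and S3 path are illustrative
huggingface_estimator.fit(
    {"training": "s3://my-training-bucket/datasets/sql-create-context"},
    wait=True,
)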

In our testing, the training job completed two epochs in approximately 2.5 hours on a single ml.g5.12xlarge instance, which incurred approximately $18 in training costs. After training is complete, the model weights in the Hugging Face safetensors format, the tokenizer, and the configuration file will be uploaded to the S3 bucket defined in the training script. This path should be saved to be used as the base directory for the import job in the next section.

s3_files_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

The configuration file config.json tells Amazon Bedrock how to load the weights from the safetensors files. Some parameters to keep in mind are the model_type, which must be one of the types currently supported by Amazon Bedrock; max_position_embeddings, which sets the maximum length of input sequence that the model can handle; the model dimensions (hidden_size, intermediate_size, num_hidden_layers, and num_attention_heads); and rotary position embedding (RoPE) parameters, which describe the encoding of position information. See the following configuration:

{
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 128256
}

Import the fine-tuned model into Amazon Bedrock

To import the fine-tuned Meta Llama 3 model into Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models in the navigation pane.
  2. Choose Import model.
  3. For Model name, enter llama-3-8b-text-to-sql.
  4. For Model import settings, enter the Amazon S3 location from the previous steps.
  5. Choose Import model.
    The model import job should take 15–18 minutes to complete.
  6. When it's done, choose Models to see your model.
  7. Copy the model Amazon Resource Name (ARN) so you can invoke the model with the AWS SDK in the next section.
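
Alternatively, the import job can be started programmatically with the AWS SDK. The following is a minimal sketch, assuming an IAM role that Amazon Bedrock can assume to read from your S3 bucket; the job name and role ARN are illustrative:

import boto3

bedrock = boto3.client("bedrock")  # Amazon Bedrock control-plane client

# Start the import job; the job name and role ARN are illustrative placeholders
response = bedrock.create_model_import_job(
    jobName="llama-3-8b-text-to-sql-import",
    importedModelName="llama-3-8b-text-to-sql",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={"s3DataSource": {"s3Uri": s3_files_path}},
)
print(response["jobArn"])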

Evaluate SQL queries generated by the fine-tuned model

In this section, we provide two examples to evaluate the SQL queries generated by the fine-tuned model: one using the Amazon Bedrock Text Playground and one using a large language model (LLM) as a judge.

Using the Amazon Bedrock Text Playground

You can test the model using the Amazon Bedrock Text Playground. For optimal results, use the same prompt template used to preprocess your training data:

<s>[INST] <<SYS>>You are a powerful text-to-SQL model. Your job is to answer questions about a database. You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)<</SYS>>

[INST]Human: Return the SQL query that answers the following question: Which tournament has A in 1987?[/INST]

Assistant:

The following animation shows the results.

Using LLM as a judge

In the same example notebook, we used the Amazon Bedrock InvokeModel API to call our imported model on demand to generate SQL queries for records in our test dataset. We use the same prompt template used with the training data in the fine-tuning step. The imported model only supports parameters that were supported by the base model (max_tokens, top_p, and temperature). Imported models don't support penalty terms (repetition_penalty or length_penalty) or the use of token sampling instead of greedy decoding (do_sample). See the following code:

def get_sql_query(system_prompt, user_question):
    """
    Generate a SQL query using Llama 3 8B
    Remember to use the same template used in fine-tuning
    """
    formatted_prompt = f"<s>[INST] <<SYS>>{system_prompt}<</SYS>>\n\n[INST]Human: {user_question}[/INST]\n\nAssistant:"
    native_request = {
        "prompt": formatted_prompt,
        "max_tokens": 100,
        "top_p": 0.9,
        "temperature": 0.1
    }
    response = client.invoke_model(modelId=model_id,
                                   body=json.dumps(native_request))
    response_text = json.loads(response.get('body').read())["outputs"][0]["text"]

    return response_text
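
For example, the helper can be exercised against a single record as follows. The runtime client setup and the model ARN are illustrative; use the ARN you copied in the previous section:

import json
import boto3

client = boto3.client("bedrock-runtime")  # runtime client used by get_sql_query
model_id = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abcd1234"  # illustrative ARN

system_prompt = ("You are a powerful text-to-SQL model. Your job is to answer questions about a database. "
                 "You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)")
user_question = "Return the SQL query that answers the following question: Which tournament has A in 1987?"
print(get_sql_query(system_prompt, user_question))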

After we generate model predictions, we use a different (more powerful) model to act as a judge and evaluate our fine-tuned model's responses. For this example, we use the Anthropic Claude 3 Sonnet LLM on Amazon Bedrock to measure the similarity between the desired answer and the predicted answer using the following prompt:

formatted_prompt = f"""You're a knowledge science instructor that's introducing college students to SQL. Contemplate the next query and schema:
<query>{query}</query>
<schema>{db_schema}</schema>
    
Right here is the proper reply:
<correct_answer>{correct_answer}</correct_answer>
    
Right here is the coed's reply:
<student_answer>{test_answer}<student_answer>

Please present a numeric rating from 0 to 100 on how properly the coed's reply matches the proper reply for this query.
The rating needs to be excessive if the solutions say primarily the identical factor.
The rating needs to be decrease if some components are lacking, or if additional pointless components have been included.
The rating needs to be 0 for a wholly unsuitable reply. Put the rating in <SCORE> XML tags.
Don't take into account your individual reply to the query, however as a substitute rating primarily based solely on the proper reply above.
"""

The predicted score based on our holdout split of the dataset was 96.65%, which is excellent for a small model tuned to a specific task.

Clean up

The model will spin down to zero after a period of no activity and your cost will stop accruing. However, we recommend deleting the imported model using the Amazon Bedrock console (or programmatically, as sketched below). Remember to also delete model artifacts from your S3 bucket when the fine-tuned model is no longer needed to prevent incurring costs.
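
A minimal cleanup sketch; the model name, bucket, and prefix are illustrative:

import boto3

# Delete the imported model from Amazon Bedrock
bedrock = boto3.client("bedrock")
bedrock.delete_imported_model(modelIdentifier="llama-3-8b-text-to-sql")

# Delete the training artifacts from the S3 bucket (names are illustrative)
s3 = boto3.resource("s3")
s3.Bucket("my-training-bucket").objects.filter(Prefix="llama-3-8b-text-to-sql/").delete()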

Conclusion

This post presented an overview of the process of fine-tuning a small model using SageMaker to help generate more accurate SQL queries based on questions asked in natural language, and then importing the fine-tuned model into Amazon Bedrock using the Custom Model Import feature. After we imported the model, it was made available on demand through the Amazon Bedrock Playground and the InvokeModel API, which was used to evaluate the performance of the fine-tuned model against a holdout dataset using an LLM as a judge.

The following are recommended best practices that may be helpful when using fine-tuned FMs for code generation tasks:

  • Select a dataset that's relevant and diverse enough for your code generation task
  • Monitor the training job and PEFT parameters to prevent overfitting and catastrophic forgetting
  • Preprocess the training data with a consistent instruction template
  • Store model weights using safetensors for fast loading
  • Invoke the model using the same instruction template used in fine-tuning, using only inference parameters that are supported by the base model and the Custom Model Import feature in Amazon Bedrock

Explore the Amazon Bedrock Custom Model Import feature as a way to deploy FMs fine-tuned for code generation tasks in a secure and scalable manner. Visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he functions as the Global Generative AI Lead Architect and also the Lead Architect for Supply Chain Solutions with AABG. As an Information Technology Leader, Jay specializes in artificial intelligence, data integration, business intelligence, and user interface domains. He has 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, artificial intelligence, machine learning, and system design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand, and understands what it takes to build and launch scalable and secure generative AI products. Prior to Bedrock, he worked on numerous products at Amazon, ranging from devices to ads to robotics.
