Metadata filtering for tabular data with Knowledge Bases for Amazon Bedrock


Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation. However, information about one dataset can reside in another dataset, known as metadata. Without using metadata, your retrieval process can return unrelated results, thereby decreasing FM accuracy and increasing cost in the FM prompt tokens.

On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering and also changed the default engine. This feature allows you to use metadata fields during the retrieval process. However, the metadata fields must be configured during the knowledge base ingestion process. Sometimes, you might have tabular data where details about one field are available in another field. Also, you might have a requirement to cite the exact text document or text field to prevent hallucination. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.

Solution overview

The solution contains the following high-level steps:

  1. Prepare data for metadata filtering.
  2. Create and ingest data and metadata into the knowledge base.
  3. Retrieve data from the knowledge base using metadata filtering.

Prepare data for metadata filtering

As of this writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.

For this post, we create a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.

The TotalTime column is in ISO 8601 format. You can convert it to minutes using the following logic:

import re

# Function to convert an ISO 8601 duration (for example, PT1H30M) to minutes
def convert_to_minutes(duration):
    hours = 0
    minutes = 0

    # Find hours and minutes using regex
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)

    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))

    # Convert total time to minutes
    total_minutes = hours * 60 + minutes
    return total_minutes

df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)
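As a quick sanity check of the conversion logic, the function behaves as follows on a few sample durations (the function is reproduced here so the snippet runs standalone):

```python
import re

# Reproduced from above: convert an ISO 8601 duration such as PT1H30M to minutes
def convert_to_minutes(duration):
    hours = minutes = 0
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))
    return hours * 60 + minutes

print(convert_to_minutes('PT1H30M'))  # 90
print(convert_to_minutes('PT45M'))    # 45
print(convert_to_minutes('PT8H'))     # 480
```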

After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.

To enable the FM to point to a specific menu with a link (cite the document), we split each row of the tabular data into a single text file, with each file containing RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata must be kept in a separate JSON file with the same name as the data file and .metadata.json appended to its name. For example, if the data file name is 100.txt, the metadata file name must be 100.txt.metadata.json. For more details, see Add metadata to your files to allow for filtering. Also, the content in the metadata file must be in the following format:

{
    "metadataAttributes": {
        "${attribute1}": "${value1}",
        "${attribute2}": "${value2}",
        ...
    }
}
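For instance, a filled-in metadata file for one recipe could look like the following (the attribute values here are illustrative, not taken from the actual dataset):

```python
import json

# Illustrative metadata document for a single recipe file (example values only)
meta = {
    "metadataAttributes": {
        "Name": "Example Frozen Dessert",
        "TotalTimeInMinutes": "25",
        "CholesterolContent": "1.7",
        "SugarContent": "30.2",
    }
}
print(json.dumps(meta, indent=4))
```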

For the sake of simplicity, we only process the top 2,000 rows to create the knowledge base.

  1. After you import the required libraries, create a local directory using the following Python code:
    import pandas as pd
    import os, json, tqdm, boto3
    
    metafolder = "multi_file_recipe_data"
    os.mkdir(metafolder)

  2. Iterate through the top 2,000 rows to create data and metadata files to store in the local folder:
    for i in tqdm.trange(2000):
        desc = str(df['RecipeInstructions'][i])
        meta = {
            "metadataAttributes": {
                "Name": str(df['Name'][i]),
                "TotalTimeInMinutes": str(df['TotalTimeInMinutes'][i]),
                "CholesterolContent": str(df['CholesterolContent'][i]),
                "SugarContent": str(df['SugarContent'][i]),
            }
        }
        filename = metafolder + '/' + str(i+1) + '.txt'
        with open(filename, 'w') as f:
            f.write(desc)
        metafilename = filename + '.metadata.json'
        with open(metafilename, 'w') as f:
            json.dump(meta, f)
    

  3. Create an Amazon Simple Storage Service (Amazon S3) bucket named recipe-kb and upload the data:
    # Upload data to S3
    s3_client = boto3.client("s3")
    bucket_name = "recipe-kb"
    data_root = metafolder + '/'
    def uploadDirectory(path, bucket_name):
        for root, dirs, files in os.walk(path):
            for file in tqdm.tqdm(files):
                s3_client.upload_file(os.path.join(root, file), bucket_name, file)
    
    uploadDirectory(data_root, bucket_name)
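Because ingestion silently skips metadata that isn't paired correctly, it can help to validate the folder from step 2 before uploading, confirming that every data file has its matching .metadata.json companion. A small check we added for this (demonstrated on a throwaway folder so the snippet runs standalone):

```python
import json
import os
import tempfile

def check_metadata_pairs(folder):
    """Return the .txt files in folder that lack a matching .metadata.json companion."""
    txt_files = [f for f in os.listdir(folder) if f.endswith('.txt')]
    return sorted(f for f in txt_files
                  if not os.path.exists(os.path.join(folder, f + '.metadata.json')))

# Demonstration on a throwaway folder: one complete pair plus one orphaned data file
with tempfile.TemporaryDirectory() as demo:
    with open(os.path.join(demo, '1.txt'), 'w') as f:
        f.write('instructions')
    with open(os.path.join(demo, '1.txt.metadata.json'), 'w') as f:
        json.dump({"metadataAttributes": {"Name": "demo"}}, f)
    with open(os.path.join(demo, '2.txt'), 'w') as f:
        f.write('orphan')
    print(check_metadata_pairs(demo))  # ['2.txt']
```

An empty result means the folder is ready to upload.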

Create and ingest data and metadata into the knowledge base

When the S3 folder is ready, you can create the knowledge base on the Amazon Bedrock console or by using the SDK, as shown in this example notebook.
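For orientation, the request shapes the notebook passes to the bedrock-agent SDK look roughly like the following. This is a sketch only: the ARNs and names are placeholders, and the notebook also creates the OpenSearch Serverless collection, vector index, and IAM execution role before this step.

```python
# Sketch of create_knowledge_base / create_data_source request shapes.
# All ARNs and names below are placeholders, not real resources.
kb_request = {
    "name": "recipe-kb",
    "roleArn": "arn:aws:iam::111122223333:role/BedrockKBExecutionRole",  # placeholder
    "knowledgeBaseConfiguration": {
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
        },
    },
    "storageConfiguration": {
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/abc123",  # placeholder
            "vectorIndexName": "bedrock-kb-index",
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
}

ds_request = {
    "name": "recipe-data",
    "dataSourceConfiguration": {
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::recipe-kb"},
    },
}

# These dicts would be passed as keyword arguments, for example:
# bedrock_agent = boto3.client("bedrock-agent")
# kb = bedrock_agent.create_knowledge_base(**kb_request)
# ds = bedrock_agent.create_data_source(
#     knowledgeBaseId=kb["knowledgeBase"]["knowledgeBaseId"], **ds_request)
# then start_ingestion_job(...) to ingest the S3 documents and their metadata
print(kb_request["storageConfiguration"]["type"])
```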

Retrieve data from the knowledge base using metadata filtering

Now let’s retrieve some data from the knowledge base. For this post, we use Anthropic Claude Sonnet on Amazon Bedrock for our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or from the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.

Set the required Amazon Bedrock parameters using the following code:

import boto3
import pprint
from botocore.client import Config
import json

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
region = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name=region)
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config, region_name=region)
kb_id = "EIBBXVFDQP"
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

# Retrieve API for fetching only the relevant context

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"

relevant_documents = bedrock_agent_client.retrieve(
    retrievalQuery={
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 2
        }
    }
)
pp.pprint(relevant_documents["retrievalResults"])

The following is the output of the retrieval from the knowledge base without metadata filtering for the query “Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10.” As we can see, of the two recipes, the preparation durations are 30 and 480 minutes, respectively, and the cholesterol contents are 86 and 112.4, respectively. Therefore, the retrieval isn’t following the query exactly.
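Because each item in retrievalResults carries the document text along with the metadata attributes we attached at ingestion, you can check the constraint violations programmatically. A minimal sketch over a mocked response fragment (field names follow the Retrieve API response shape; real items also include location and score fields):

```python
# Mocked Retrieve response fragment with the metadata we attached at ingestion
retrieval_results = [
    {"content": {"text": "Recipe A instructions ..."},
     "metadata": {"TotalTimeInMinutes": "30", "CholesterolContent": "86"}},
    {"content": {"text": "Recipe B instructions ..."},
     "metadata": {"TotalTimeInMinutes": "480", "CholesterolContent": "112.4"}},
]

# Flag results that break the query's constraints (< 30 minutes, cholesterol < 10)
violations = [
    r["metadata"] for r in retrieval_results
    if float(r["metadata"]["TotalTimeInMinutes"]) >= 30
    or float(r["metadata"]["CholesterolContent"]) >= 10
]
print(len(violations))  # 2 -- both unfiltered results break the constraints
```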

The following code demonstrates how to use the Retrieve API with the metadata filters set to a cholesterol content less than 10 and a preparation time less than 30 minutes for the same query:

def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                'filter': {
                    'andAll': [
                        {
                            'lessThan': {
                                'key': 'CholesterolContent',
                                'value': 10
                            }
                        },
                        {
                            'lessThan': {
                                'key': 'TotalTimeInMinutes',
                                'value': 30
                            }
                        }
                    ]
                }
            }
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve(query, kb_id, 2)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

As we can see in the following results, of the two recipes, the preparation times are 27 and 20 minutes, respectively, and the cholesterol contents are both 0. With metadata filtering, we get more accurate results.
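Because the filter document gets verbose as conditions accumulate, a small helper (our own convenience function, not part of the SDK) can assemble an andAll filter from a dict of upper bounds:

```python
def build_less_than_filter(upper_bounds):
    """Build a Knowledge Bases retrieval filter requiring each key to be less than its bound.

    A single condition is returned bare; multiple conditions are wrapped in 'andAll'.
    """
    conditions = [{"lessThan": {"key": k, "value": v}} for k, v in upper_bounds.items()]
    return conditions[0] if len(conditions) == 1 else {"andAll": conditions}

filter_doc = build_less_than_filter({"CholesterolContent": 10, "TotalTimeInMinutes": 30})
print(filter_doc)
```

The result can be passed as the 'filter' value inside vectorSearchConfiguration in place of the hand-written dict above.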

The following code shows how to get accurate output using the same metadata filtering with the retrieve_and_generate API. First, we set the prompt, then we set up the API with metadata filtering:

prompt = f"""
Human: You have great knowledge about food, so provide answers to questions by using facts.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Assistant:"""

def retrieve_and_generate(query, kb_id, model_id, numberOfResults=10):
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': query,
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': f"{prompt} $search_results$"
                    }
                },
                'knowledgeBaseId': kb_id,
                'modelArn': model_id,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'HYBRID',
                        'filter': {
                            'andAll': [
                                {
                                    'lessThan': {
                                        'key': 'CholesterolContent',
                                        'value': 10
                                    }
                                },
                                {
                                    'lessThan': {
                                        'key': 'TotalTimeInMinutes',
                                        'value': 30
                                    }
                                }
                            ]
                        },
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve_and_generate(query, kb_id, model_id, numberOfResults=10)
pp.pprint(response['output']['text'])

As we can see in the following output, the model returns a detailed recipe that follows the specified metadata filtering of less than 30 minutes of preparation time and a cholesterol content less than 10.

Clean up

Make sure to comment out the following section if you’re planning to use the knowledge base that you created for building your RAG application. If you only wanted to try out creating the knowledge base using the SDK, make sure to delete all the resources that were created, because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:

bedrock_agent_client.delete_data_source(dataSourceId=ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
oss_client.indices.delete(index=index_name)
aoss_client.delete_collection(id=collection_id)
aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])
# Delete roles and policies
iam_client.delete_role(RoleName=bedrock_kb_execution_role)
iam_client.delete_policy(PolicyArn=policy_arn)

Conclusion

In this post, we explained how to split a large tabular dataset into rows to set up a knowledge base with metadata for each of those records, and how to then retrieve outputs with metadata filtering. We also showed how retrieving results with metadata filtering is more accurate than retrieving results without it. Finally, we showed how to use the result with an FM to get accurate results.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:


About the Author

Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.
