Information Bases for Amazon Bedrock now helps metadata filtering to enhance retrieval accuracy


At AWS re:Invent 2023, we introduced the overall availability of Knowledge Bases for Amazon Bedrock. With Information Bases for Amazon Bedrock, you possibly can securely join basis fashions (FMs) in Amazon Bedrock to your organization knowledge utilizing a totally managed Retrieval Augmented Technology (RAG) mannequin.

For RAG-based purposes, the accuracy of the generated responses from FMs rely on the context offered to the mannequin. Contexts are retrieved from vector shops primarily based on consumer queries. Within the just lately launched function for Information Bases for Amazon Bedrock, hybrid search, you possibly can mix semantic search with key phrase search. Nevertheless, in lots of conditions, it’s possible you’ll must retrieve paperwork created in an outlined interval or tagged with sure classes. To refine the search outcomes, you possibly can filter primarily based on doc metadata to enhance retrieval accuracy, which in flip results in extra related FM generations aligned along with your pursuits.

On this submit, we talk about the brand new customized metadata filtering function in Information Bases for Amazon Bedrock, which you should utilize to enhance search outcomes by pre-filtering your retrievals from vector shops.

Metadata filtering overview

Previous to the discharge of metadata filtering, all semantically related chunks as much as the pre-set most could be returned as context for the FM to make use of to generate a response. Now, with metadata filters, you possibly can retrieve not solely semantically related chunks however a well-defined subset of these related chucks primarily based on utilized metadata filters and related values.

With this function, now you can provide a customized metadata file (every as much as 10 KB) for every doc within the data base. You may apply filters to your retrievals, instructing the vector retailer to pre-filter primarily based on doc metadata after which seek for related paperwork. This manner, you might have management over the retrieved paperwork, particularly in case your queries are ambiguous. For instance, you should utilize authorized paperwork with comparable phrases for various contexts, or films which have the same plot launched in numerous years. As well as, by lowering the variety of chunks which are being searched over, you obtain efficiency benefits like a discount in CPU cycles and value of querying the vector retailer, along with enchancment in accuracy.

To make use of the metadata filtering function, it is advisable to present metadata information alongside the supply knowledge information with the identical identify because the supply knowledge file and .metadata.json suffix. Metadata may be string, quantity, or Boolean. The next is an instance of the metadata file content material:

{
    "metadataAttributes" : { 
        "tag" : "challenge EVE",
        "12 months" :  2016,
        "group": "ninjas"
    }
}

The metadata filtering function of Information Bases for Amazon Bedrock is on the market in AWS Areas US East (N. Virginia) and US West (Oregon).

The next are frequent use circumstances for metadata filtering:

  • Doc chatbot for a software program firm – This permits customers to search out product data and troubleshooting guides. Filters on the working system or software model, for instance, may also help keep away from retrieving out of date or irrelevant paperwork.
  • Conversational search of a corporation’s software – This permits customers to go looking via paperwork, kanbans, assembly recording transcripts, and different property. Utilizing metadata filters on work teams, enterprise models, or challenge IDs, you possibly can personalize the chat expertise and enhance collaboration. An instance could be, “What’s the standing of challenge Sphinx and dangers raised,” the place customers can filter paperwork for a particular challenge or supply sort (similar to e-mail or assembly paperwork).
  • Clever seek for software program builders – This permits builders to search for data of a particular launch. Filters on the discharge model, doc sort (similar to code, API reference, or challenge) may also help pinpoint related paperwork.

Answer overview

Within the following sections, we exhibit the best way to put together a dataset to make use of as a data base, after which question with metadata filtering. You may question utilizing both the AWS Management Console or SDK.

Put together a dataset for Information Bases for Amazon Bedrock

For this submit, we use a sample dataset about fictional video video games for example the best way to ingest and retrieve metadata utilizing Information Bases for Amazon Bedrock. If you wish to comply with alongside in your individual AWS account, obtain the file.

If you wish to add metadata to your paperwork in an current data base, create the metadata information with the anticipated filename and schema, then skip to the step to sync your knowledge with the data base to start out the incremental ingestion.

In our pattern dataset, every recreation’s doc is a separate CSV file (for instance, s3://$bucket_name/video_game/$game_id.csv) with the next columns:

title, description, genres, 12 months, writer, rating

Every recreation’s metadata has the suffix .metadata.json (for instance, s3://$bucket_name/video_game/$game_id.csv.metadata.json) with the next schema:

{
  "metadataAttributes": {
    "id": quantity, 
    "genres": string,
    "12 months": quantity,
    "writer": string,
    "rating": quantity
  }
}

Create a data base for Amazon Bedrock

For directions to create a brand new data base, see Create a knowledge base. For this instance, we use the next settings:

  • On the Arrange knowledge supply web page, underneath Chunking technique, choose No chunking, since you’ve already preprocessed the paperwork within the earlier step.
  • Within the Embeddings mannequin part, select Titan G1 Embeddings – Textual content.
  • Within the Vector database part, select Fast create a brand new vector retailer. The metadata filtering function is on the market for all supported vector shops.

Synchronize the dataset with the data base

After you create the data base, and your knowledge information and metadata information are in an Amazon Simple Storage Service (Amazon S3) bucket, you can begin the incremental ingestion. For directions, see Sync to ingest your data sources into the knowledge base.

Question with metadata filtering on the Amazon Bedrock console

To make use of the metadata filtering choices on the Amazon Bedrock console, full the next steps:

  1. On the Amazon Bedrock console, select Information bases within the navigation pane.
  2. Select the data base you created.
  3. Select Take a look at data base.
  4. Select the Configurations icon, then increase Filters.
  5. Enter a situation utilizing the format: key = worth (for instance, genres = Technique) and press Enter.
  6. To vary the important thing, worth, or operator, select the situation.
  7. Proceed with the remaining situations (for instance, (genres = Technique AND 12 months >= 2023) OR (score >= 9))
  8. When completed, enter your question within the message field, then select Run.

For this submit, we enter the question “A technique recreation with cool graphic launched after 2023.”

Question with metadata filtering utilizing the SDK

To make use of the SDK, first create the consumer for the Agents for Amazon Bedrock runtime:

import boto3

bedrock_agent_runtime = boto3.consumer(
    service_name = "bedrock-agent-runtime"
)

Then assemble the filter (the next are some examples):

# genres = Technique
single_filter= {
    "equals": {
        "key": "genres",
        "worth": "Technique"
    }
}

# genres = Technique AND 12 months >= 2023
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "genres",
                "value": "Strategy"
            }
        },
        {
            "GreaterThanOrEquals": {
                "key": "year",
                "value": 2023
            }
        }
    ]
}

# (genres = Technique AND 12 months >=2023) OR rating >= 9
two_group_filter = {
    "orAll": [
        {
            "andAll": [
                {
                    "equals": {
                        "key": "genres",
                        "value": "Strategy"
                    }
                },
                {
                    "GreaterThanOrEquals": {
                        "key": "year",
                        "value": 2023
                    }
                }
            ]
        },
        {
            "GreaterThanOrEquals": {
                "key": "rating",
                "worth": "9"
            }
        }
    ]
}

Move the filter to retrievalConfiguration of the Retrieval API or RetrieveAndGenerate API:

retrievalConfiguration={
        "vectorSearchConfiguration": {
            "filter": metadata_filter
        }
    }

The next desk lists a number of responses with totally different metadata filtering situations.

Question Metadata Filtering Retrieved Paperwork Observations
“A technique recreation with cool graphic launched after 2023” Off

* Viking Saga: The Sea Raider, 12 months:2023, genres: Technique

* Medieval Fort: Siege and Conquest, 12 months:2022, genres: Technique
* Fantasy Kingdoms: Chronicles of Eldoria, 12 months:2023, genres: Technique

* Cybernetic Revolution: Rise of the Machines, 12 months:2022, genres: Technique
* Steampunk Chronicles: Clockwork Empires, 12 months:2021, genres: Metropolis-Constructing

2/5 video games meet the situation (genres = Technique and 12 months >= 2023)
On * Viking Saga: The Sea Raider, 12 months:2023, genres: Technique
* Fantasy Kingdoms: Chronicles of Eldoria, 12 months:2023, genres: Technique
2/2 video games meet the situation (genres = Technique and 12 months >= 2023)

Along with customized metadata, you too can filter utilizing S3 prefixes (which is a built-in metadata, so that you don’t want to supply any metadata information). For instance, if you happen to set up the sport paperwork into prefixes by writer (for instance, s3://$bucket_name/video_game/$writer/$game_id.csv), you possibly can filter with the particular writer (for instance, neo_tokyo_games) utilizing the next syntax:

publisher_filter = {
    "startsWith": {
                    "key": "x-amz-bedrock-kb-source-uri",
                    "worth": "s3://$bucket_name/video_game/neo_tokyo_games/"
                }
}

Clear up

To scrub up your sources, full the next steps:

  1. Delete the data base:
    1. On the Amazon Bedrock console, select Information bases underneath Orchestration within the navigation pane.
    2. Select the data base you created.
    3. Be aware of the AWS Identity and Access Management (IAM) service position identify within the Information base overview part.
    4. Within the Vector database part, be aware of the gathering ARN.
    5. Select Delete, then enter delete to verify.
  2. Delete the vector database:
    1. On the Amazon OpenSearch Service console, select Collections underneath Serverless within the navigation pane.
    2. Enter the gathering ARN you saved within the search bar.
    3. Choose the gathering and selected Delete.
    4. Enter affirm within the affirmation immediate, then select Delete.
  3. Delete the IAM service position:
    1. On the IAM console, select Roles within the navigation pane.
    2. Seek for the position identify you famous earlier.
    3. Choose the position and select Delete.
    4. Enter the position identify within the affirmation immediate and delete the position.
  4. Delete the pattern dataset:
    1. On the Amazon S3 console, navigate to the S3 bucket you used.
    2. Choose the prefix and information, then select Delete.
    3. Enter completely delete within the affirmation immediate to delete.

Conclusion

On this submit, we coated the metadata filtering function in Information Bases for Amazon Bedrock. You discovered the best way to add customized metadata to paperwork and use them as filters whereas retrieving and querying the paperwork utilizing the Amazon Bedrock console and the SDK. This helps enhance context accuracy, making question responses much more related whereas reaching a discount in price of querying the vector database.

For extra sources, discuss with the next:


In regards to the Authors

Corvus Lee is a Senior GenAI Labs Options Architect primarily based in London. He’s obsessed with designing and growing prototypes that use generative AI to resolve buyer issues. He additionally retains up with the most recent developments in generative AI and retrieval methods by making use of them to real-world situations.

Ahmed Ewis is a Senior Options Architect at AWS GenAI Labs, serving to clients construct generative AI prototypes to resolve enterprise issues. When not collaborating with clients, he enjoys taking part in along with his youngsters and cooking.

Chris Pecora is a Generative AI Knowledge Scientist at Amazon Net Providers. He’s obsessed with constructing revolutionary merchandise and options whereas additionally specializing in customer-obsessed science. When not working experiments and maintaining with the most recent developments in GenAI, he loves spending time along with his youngsters.

Leave a Reply

Your email address will not be published. Required fields are marked *