Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1


With the advent of generative AI, today’s foundation models (FMs), such as the large language models (LLMs) Claude 2 and Llama 2, can perform a range of generative tasks such as question answering, summarization, and content creation on text data. However, real-world data exists in multiple modalities, such as text, images, video, and audio. Take a PowerPoint slide deck, for example. It could contain information in the form of text, or embedded in graphs, tables, and pictures.

In this post, we present a solution that uses multimodal FMs such as the Amazon Titan Multimodal Embeddings model and LLaVA 1.5 and AWS services including Amazon Bedrock and Amazon SageMaker to perform similar generative tasks on multimodal data.

Solution overview

The solution provides an implementation for answering questions using information contained in the text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by LLMs. In this post, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

There are different ways to design a RAG solution that includes images. We have presented one approach here and will follow up with an alternate approach in the second post of this three-part series.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This FM is used to generate embeddings for the content in the slide deck used in this post. As a multimodal model, this Titan model can process text, images, or a combination as input and generate embeddings. The Titan Multimodal Embeddings model generates vectors (embeddings) of 1,024 dimensions and is accessed via Amazon Bedrock.
  • Large Language and Vision Assistant (LLaVA) – LLaVA is an open source multimodal model for visual and language understanding and is used to interpret the data in the slides, including visual elements such as graphs and tables. We use the 7-billion parameter version LLaVA 1.5-7b in this solution.
  • Amazon SageMaker – The LLaVA model is deployed on a SageMaker endpoint using SageMaker hosting services, and we use the resulting endpoint to run inferences against the LLaVA model. We also use SageMaker notebooks to orchestrate and demonstrate this solution end to end.
  • Amazon OpenSearch Serverless – OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Titan Multimodal Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) – OSI is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline to deliver data to the OpenSearch Serverless vector store.

Solution architecture

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generate embeddings for these images, and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into embeddings and a similarity search is run on the vector database to find a slide that could potentially contain answers to the user question. We then provide this slide (in the form of an image file) to the LLaVA model along with the user question as a prompt to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

Ingestion architecture diagram

The workflow steps are as follows:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Titan Multimodal Embeddings model to generate embeddings. In this post, we use the slide deck titled Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023, to demonstrate the solution. The sample deck has 31 slides, so we generate 31 sets of vector embeddings, each with 1,024 dimensions. We add additional metadata fields to these generated vector embeddings and create a JSON file. These additional metadata fields can be used to perform rich search queries using OpenSearch’s powerful search capabilities.
  2. The generated embeddings are put together in a single JSON file that is uploaded to Amazon Simple Storage Service (Amazon S3); a sketch of this step follows this list.
  3. Through Amazon S3 Event Notifications, an event is put in an Amazon Simple Queue Service (Amazon SQS) queue.
  4. This event in the SQS queue acts as a trigger to run the OSI pipeline, which in turn ingests the data (JSON file) as documents into the OpenSearch Serverless index. Note that the OpenSearch Serverless index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.
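The following is a minimal sketch of how step 2 could be implemented: the embeddings and metadata are assembled into documents that mirror the sample JSON shown later in this post, written to a single file, and uploaded to the S3 prefix that triggers the OSI pipeline. The bucket name and the slide_embeddings variable are placeholders, not values from the actual notebook.

import json

import boto3

# Placeholder inputs -- in the actual notebook these come from the Titan
# Multimodal Embeddings calls made for each slide image.
BUCKET = "<your-bucket-name>"
slide_embeddings = {"slide_1.jpg": [0.1] * 1024}  # filename -> 1,024-dimension vector

documents = [
    {
        "image_path": f"s3://{BUCKET}/multimodal/img/{filename}",
        "metadata": {
            "slide_filename": filename,
            "model_id": "amazon.titan-embed-image-v1",
            "slide_description": "",
        },
        "vector_embedding": embedding,
    }
    for filename, embedding in slide_embeddings.items()
]

# Uploading to this prefix emits an S3 event notification, which lands in the
# SQS queue and triggers the OSI pipeline.
s3 = boto3.client("s3")
s3.put_object(
    Bucket=BUCKET,
    Key="multimodal/osi-embeddings-json/slide_embeddings.json",
    Body=json.dumps(documents),
)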

The following diagram illustrates the user interaction architecture.

User interaction architecture

The workflow steps are as follows:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Titan Multimodal Embeddings model accessed via Amazon Bedrock. An OpenSearch vector search is performed using these embeddings. We perform a k-nearest neighbor (k=1) search to retrieve the most relevant embedding matching the user query. Setting k=1 retrieves the slide most relevant to the user question (a sketch of this query follows this list).
  3. The metadata of the response from OpenSearch Serverless contains a path to the image corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image path and provided to LLaVA hosted on SageMaker. The LLaVA model is able to understand the user question and answer it by analyzing the data in the image.
  5. The result of this inference is returned to the user.
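As an illustration of step 2, the following is a minimal sketch of the k-nearest neighbor query body that could be sent to the OpenSearch Serverless index. The os_client connection, index name, and text_embeddings variable are the ones created in the ingestion and inference notebooks (shown later in this post); the exact query built inside the solution’s find_similar_data helper may differ.

# k-NN query against the "vector_embedding" field defined in the index mapping
knn_query = {
    "size": 1,  # k=1: return only the most relevant slide
    "query": {
        "knn": {
            "vector_embedding": {
                "vector": text_embeddings[0].tolist(),  # embeddings of the user question
                "k": 1
            }
        }
    }
}
response = os_client.search(index=index_name, body=knn_query)
image_path = response["hits"]["hits"][0]["_source"]["image_path"]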

These steps are discussed in detail in the following sections. See the Results section for screenshots and details on the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Titan Multimodal Embeddings model. Make sure that this model is enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access in the navigation pane. If Titan Multimodal Embeddings is enabled, the access status will state Access granted.

Manage model access in Amazon Bedrock

If the model isn’t available, enable access to the model by choosing Manage model access, selecting Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Request model access in Amazon Bedrock
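Optionally, you can verify access programmatically. The following is a small sketch (not part of the solution notebooks) that invokes the Titan Multimodal Embeddings model with a short text-only input through the Bedrock runtime; a successful response indicates access has been granted. The Region shown is an assumption—use the Region where you deploy the solution.

import json

import boto3

# Quick access check: invoke Titan Multimodal Embeddings with a text-only input
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=json.dumps({"inputText": "access check"}),
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(f"received embedding with {len(embedding)} dimensions")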

Use an AWS CloudFormation template to create the solution stack

Use one of the following AWS CloudFormation templates (depending on your Region) to launch the solution resources.

AWS Region    Link
us-east-1     Link to CloudFormation template for us-east-1
us-west-2     Link to CloudFormation template for us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the value for MultimodalCollectionEndpoint, which we use in subsequent steps.

Resources created by the CloudFormation template

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions.
    • SMExecutionRole with Amazon S3, SageMaker, OpenSearch Service, and Bedrock full access.
    • OSPipelineExecutionRole with access to specific Amazon SQS and OSI actions.
  • SageMaker notebook – All the code for this post is run via this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.
  • SQS queue – The events for triggering the OSI pipeline run are put in this queue.

The CloudFormation template configures the OSI pipeline with Amazon S3 and Amazon SQS processing as source and an OpenSearch Serverless index as sink. Any objects created in the specified S3 bucket and prefix (multimodal/osi-embeddings-json) will trigger SQS notifications, which are used by the OSI pipeline to ingest data into OpenSearch Serverless.

The CloudFormation template also creates the network, encryption, and data access policies required for the OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

Note that the CloudFormation template name is referenced in the SageMaker notebooks. If the default template name is changed, make sure you update the same in globals.py.
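For reference, the following is an illustrative sketch of the kinds of constants globals.py holds, inferred from how the g module is used in the notebooks; the exact names and values in the repo may differ, and the placeholders here are not real values.

# globals.py (illustrative sketch -- see the repo for the actual contents)
AWS_REGION = "us-east-1"                             # Region where the stack is deployed
OS_SERVICE = "aoss"                                  # service name used for SigV4 signing of OpenSearch Serverless requests
FMC_MODEL_ID = "amazon.titan-embed-image-v1"         # Titan Multimodal Embeddings model ID
ACCEPT_ENCODING = "application/json"
CONTENT_ENCODING = "application/json"
CFN_STACK_NAME = "<your-stack-name>"                 # CloudFormation template/stack name referenced above (placeholder)
SLIDE_DECK = "https://<url-to-your-slide-deck>.pdf"  # URL of the slide deck to ingest (placeholder)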

Test the solution

After the prerequisite steps are complete and the CloudFormation stack has been created successfully, you’re now ready to test the solution:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select the MultimodalNotebookInstance notebook instance and choose Open JupyterLab.
    Notebook instance in Amazon SageMaker
  3. In File Browser, traverse to the notebooks folder to see the notebooks and supporting files.

The notebooks are numbered in the sequence in which they are run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 0_deploy_llava.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook deploys the LLaVA-v1.5-7B model to a SageMaker endpoint. In this notebook, we download the LLaVA-v1.5-7B model from HuggingFace Hub, replace the inference.py script with llava_inference.py, and create a model.tar.gz file for this model. The model.tar.gz file is uploaded to Amazon S3 and used for deploying the model on a SageMaker endpoint. The llava_inference.py script has additional code to allow reading an image file from Amazon S3 and running inference on it.
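The following is a minimal sketch of what this deployment could look like with the SageMaker Python SDK, assuming the packaged model.tar.gz has already been uploaded to S3. The S3 path, instance type, framework versions, and endpoint name are illustrative assumptions, not the exact values used in the notebook.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# model.tar.gz contains the LLaVA-v1.5-7B weights plus llava_inference.py
llava_model = HuggingFaceModel(
    model_data="s3://<your-bucket-name>/llava-v1.5-7b/model.tar.gz",  # placeholder path
    role=role,
    transformers_version="4.28",  # illustrative framework versions
    pytorch_version="2.0",
    py_version="py310",
)

# Deploy to a GPU instance; the resulting predictor is used for RAG inference later
predictor = llava_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumption: a GPU instance with enough memory for a 7B model
    endpoint_name="llava-v1-5-7b",  # placeholder name
)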

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook downloads the slide deck, converts each slide into JPG file format, and uploads these to the S3 bucket used for this post.
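A minimal sketch of this conversion step is shown below, assuming the slide deck is available as a PDF and using the pdf2image library; the actual notebook may use a different conversion approach, and the bucket, prefix, and URL are placeholders.

import boto3
import requests
from pdf2image import convert_from_bytes  # requires the poppler system package

BUCKET = "<your-bucket-name>"                         # placeholder
SLIDE_DECK = "https://<url-to-your-slide-deck>.pdf"   # placeholder, see globals.py

# Download the deck and render each page (slide) as an image
pdf_bytes = requests.get(SLIDE_DECK, timeout=60).content
slides = convert_from_bytes(pdf_bytes)

s3 = boto3.client("s3")
for i, slide in enumerate(slides, start=1):
    filename = f"slide_{i}.jpg"
    slide.save(filename, "JPEG")
    s3.upload_file(filename, BUCKET, f"multimodal/img/{filename}")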

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

We do the following in this notebook:

  • We create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:
# Imports assumed by this snippet; host and index_name are defined earlier in the notebook
import json
import logging

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

import globals as g  # solution constants such as AWS_REGION and OS_SERVICE

logger = logging.getLogger(__name__)

session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
      "index.knn": true
  },
  "mappings": {
      "properties": {
          "vector_embedding": {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                  "name": "hnsw",
                  "engine": "nmslib",
                  "parameters": {}
              }
          },
          "image_path": {
              "type": "text"
          },
          "metadata": {
              "properties": {
                  "slide_filename": {
                      "type": "text"
                  },
                  "model_id": {
                      "type": "text"
                  },
                  "slide_description": {
                      "type": "text"
                  }
              }
          }
      }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")

  • We use the Titan Multimodal Embeddings model to convert the JPG images created in the previous notebook into vector embeddings. These embeddings and additional metadata (such as the S3 path of the image file) are stored in a JSON file and uploaded to Amazon S3. Note that a single JSON file is created, which contains documents for all the slides (images) converted into embeddings. The following code snippet shows how an image (in the form of a Base64 encoded string) is converted into embeddings:
# botocore and numpy imports added for completeness; g and logger are defined as in the previous snippet
import botocore
import numpy as np

def get_multimodal_embeddings(bedrock: botocore.client, image: str) -> np.ndarray:
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        logger.error(f"exception while image(truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings
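
For context, the following is a small usage sketch of this function for a single slide image, assuming a Bedrock runtime client and the constants in globals.py; the image is read from local disk and Base64 encoded before being passed in.

import base64
import boto3

bedrock = boto3.client("bedrock-runtime", region_name=g.AWS_REGION)

# Base64 encode a slide image and request its embedding
with open("slide_1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

embedding = get_multimodal_embeddings(bedrock, image_b64)
print(embedding.shape)  # (1, 1024)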

  • This action triggers the OpenSearch Ingestion pipeline, which processes the file and ingests it into the OpenSearch Serverless index. The following is a sample of the JSON file created. (A truncated vector is shown in the example code. The Titan Multimodal Embeddings model generates 1,024 dimensions.)
[
  {
    "image_path": "s3://<your-bucket-name>/path/to/file1.json",
    "metadata": {
      "slide_filename": "mypowerpoint1.pptx",
      "model_id": "amazon.titan-embed-image-v1",
      "slide_description": "This is a test slide deck"
    },
    "vector_embedding": [
      657.6052386529958,
      0.8865137233123771,
      763.870264592026
    ]
  }
] 

  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: we convert the user question into embeddings, find a similar image (slide) from the vector database, and provide the retrieved image to LLaVA to generate an answer to the user question. We use the following prompt template:

prompt_template: str = """Pretend that you are a helpful assistant that answers questions about content in a slide deck.
  Using only the information in the provided slide image answer the following question.
  If you do not find the answer in the image then say I did not find the answer to this question in the slide deck.

  {question}
"""

The following code snippet provides the RAG workflow:

# create prompt and convert to embeddings
question: str = "As per the AI/ML flywheel, what do the AWS AI/ML services provide?"
prompt = prompt_template.format(question=question)
text_embeddings = get_text_embeddings(bedrock, question)

# vector db search
vector_db_response: Dict = find_similar_data(text_embeddings)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f"going to answer the question=\"{question}\" using the image \"{s3_img_path}\"")

!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))

# Ask LLaVA
data = {
    "image" : s3_img_path,
    "question" : prompt,
    "temperature" : 0.1,
}
output = predictor.predict(data)
logger.info(f"Image={s3_img_path}\nQuestion: {question}\nAnswer: {output}\n\n")

Results

The following is a compilation of some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by LLaVA. Image is the k-nearest (k=1) slide match returned by the OpenSearch Serverless vector search. This slide is interpreted by LLaVA to provide the answer.

Multimodal RAG results

Question | Answer | Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers. | According to the slide deck, Inf2 instances by AWS Inferentia2 offer up to 4x higher throughput and 10x lower latency compared to comparable EC2 instances. | Image for Question 1 in the Results table
As per the AI/ML flywheel, what do the AWS AI/ML services provide? | The AWS AI/ML services provide better $/perfer capabilities, new capabilities, and investment in innovation. | Image for Question 2 in the Results table
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3? | According to the slide, GPT-3 has 175 billion parameters, while GPT-2 has 1.5 billion parameters. The numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion. | Image for Question 3 in the Results table
What are quarks in particle physics? | I did not find the answer to this question in the slide deck. | Image for Question 4 in the Results table

Feel free to extend this solution to your slide decks. Simply update the SLIDE_DECK variable in globals.py with a URL to your slide deck and run the ingestion steps detailed in the previous section.

Tip

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data. The following screenshot shows an OpenSearch Dashboards GET example.

View of OpenSearch Dashboards
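If you prefer to run a similar quick check from the notebook, the following is a small sketch using the os_client connection created during ingestion; it fetches one document and prints the match count (the index name is whatever was used during ingestion).

# Quick sanity check on the ingested data, equivalent to a simple GET in Dashboards
response = os_client.search(index=index_name, body={"size": 1, "query": {"match_all": {}}})
print(f"total documents matched: {response['hits']['total']['value']}")
print(f"sample image_path: {response['hits']['hits'][0]['_source']['image_path']}")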

Clean up

To avoid incurring future charges, delete the resources you created. You can do this by deleting the stack via the CloudFormation console.

Deletion of the CloudFormation stack

Additionally, delete the SageMaker inference endpoint created for LLaVA inferencing. You can do this by uncommenting the cleanup step in 3_rag_inference.ipynb and running the cell, or by deleting the endpoint via the SageMaker console: choose Inference and Endpoints in the navigation pane, then select the endpoint and delete it.
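For reference, deleting the endpoint programmatically can look like the following minimal sketch, using either the predictor object from the deployment notebook or the endpoint name (the name shown here is the placeholder used earlier in this post).

import boto3

# Option 1: via the SageMaker Python SDK predictor created at deployment time
predictor.delete_endpoint()

# Option 2: via boto3, using the endpoint name
sm_client = boto3.client("sagemaker")
sm_client.delete_endpoint(EndpointName="llava-v1-5-7b")                # placeholder name
sm_client.delete_endpoint_config(EndpointConfigName="llava-v1-5-7b")   # assumes the config shares the name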

Conclusion

Enterprises generate new content all the time, and slide decks are a common mechanism used to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks. You can use this solution and the power of multimodal FMs such as the Titan Multimodal Embeddings model and LLaVA to discover new information or uncover new perspectives on content in slide decks.

We encourage you to learn more by exploring Amazon SageMaker JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service, and building a solution using the sample implementation provided in this post.

Look out for two additional posts as part of this series. Part 2 covers another approach you could take to talk to your slide deck. This approach generates and stores LLaVA inferences and uses those stored inferences to respond to user queries. Part 3 compares the two approaches.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect within Strategic Accounts at Amazon Web Services. She focuses on providing technical guidance in a variety of domains, including AI/ML, to a marquee M&E customer. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup.

Archana Inapudi is a Senior Solutions Architect at AWS supporting strategic customers. She has over a decade of experience helping customers design and build data analytics and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital native customers.
