Improve productivity when processing scanned PDFs using Amazon Q Business


Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content in digital as well as scanned PDF documents in your enterprise data sources without needing to extract the text first.

Customers across industries such as finance, insurance, healthcare and life sciences, and more need to derive insights from various document types, such as receipts, healthcare plans, or tax statements, which are frequently in scanned PDF format. These document types often have a semi-structured or unstructured format, which requires processing to extract text before indexing with Amazon Q Business.

The launch of scanned PDF document support with Amazon Q Business can help you seamlessly process a variety of multi-modal document types through the AWS Management Console and APIs, across all supported Amazon Q Business AWS Regions. You can ingest documents, including scanned PDFs, from your data sources using supported connectors, index them, and then use the documents to answer questions, provide summaries, and generate content securely and accurately from your enterprise systems. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, and improves the document processing pipeline for building your generative artificial intelligence (AI) assistant with Amazon Q Business.

In this post, we show how to asynchronously index and run real-time queries with scanned PDF documents using Amazon Q Business.

Solution overview

You can use Amazon Q Business for scanned PDF documents from the console, AWS SDKs, or AWS Command Line Interface (AWS CLI).

Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, empowering you to develop generative AI solutions with minimal setup and configuration. To learn more, visit Amazon Q Business, now generally available, helps boost workforce productivity with generative AI.

After your Amazon Q Business application is ready to use, you can directly upload the scanned PDFs into an Amazon Q Business index using either the console or the APIs. Amazon Q Business offers multiple data source connectors that can integrate and synchronize data from multiple data repositories into a single index. For this post, we demonstrate two scenarios to use documents: one with the direct document upload option, and another using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, refer to Supported connectors for details on connecting additional data sources.

Index the documents

In this post, we use three scanned PDF documents as examples: an invoice, a health plan summary, and an employment verification form, along with some text documents.

The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. For this example, we upload the scanned PDFs.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Upload Files.
  4. Upload the scanned PDF files.

You can monitor the uploaded files on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, at which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows the successfully indexed PDFs.

Indexed documents in uploaded files section.

The following steps demonstrate how to integrate and synchronize documents using an Amazon S3 connector with Amazon Q Business. For this example, we index the text documents.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Amazon S3 for the connector.
  4. Enter the information for Name, VPC and security group settings, IAM role, and Sync mode.
  5. To finish connecting your data source to Amazon Q Business, choose Add data source.
  6. In the Data source details section of your connector details page, choose Sync now to allow Amazon Q Business to begin syncing (crawling and ingesting) data from your data source.

When the sync job is complete, your data source is ready to use. The following screenshot shows that all five documents (scanned and digital PDFs, and text files) are successfully indexed.

Amazon S3 connector

The following screenshot shows a comprehensive view of the two data sources: the directly uploaded documents and the documents ingested through the Amazon S3 connector.

Amazon Q Business data sources.

Now let’s run some queries with Amazon Q Business on our data sources.

Queries on dense, unstructured, scanned PDF documents

Your documents might be dense, unstructured, scanned PDF document types. Amazon Q Business can identify and extract the most salient information-dense text from them. In this example, we use the multi-page health plan summary PDF we indexed earlier. The following screenshot shows an example page.

Health plan summary document.

This is an example of a health plan summary document.

In the Amazon Q Business web UI, we ask “What is the annual total out-of-pocket maximum, mentioned in the health plan summary?”

Amazon Q Business searches the indexed document, retrieves the relevant information, and generates an answer while citing the source for its information. The following screenshot shows the sample output.

Amazon Q Business output

Queries on structured, tabular, scanned PDF documents

Documents might also contain structured data elements in tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve user queries. In the following example, we use the invoice PDF we indexed earlier. The following screenshot shows an example.

Invoice

This is an example of an invoice.

In the Amazon Q Business web UI, we ask “How much were the headphones charged in the invoice?”

Amazon Q Business searches the indexed document and retrieves the answer with reference to the source document. The following screenshot shows that Amazon Q Business is able to extract bill information from the invoice.

Amazon Q Business output

Queries on semi-structured forms

Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately satisfy queries related to these data elements by extracting specific fields or attributes that are meaningful for the queries. In this example, we use the employment verification PDF. The following screenshot shows an example.

Employment verification sample

This is an example of an employment verification form.

In the Amazon Q Business web UI, we ask “What is the applicant’s date of employment in the employment verification form?” Amazon Q Business searches the indexed employment verification document and retrieves the answer with reference to the source document.

Amazon Q Business output
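
The questions in the preceding sections were asked in the web UI. As a rough sketch of how the same kind of query might be sent from the AWS SDK for Python, the following assumes the boto3 `qbusiness` client and its `chat_sync` operation, and assumes the response carries the answer in `systemMessage` and the citations in `sourceAttributions`. The helper names `format_answer` and `ask` are hypothetical, and how credentials and user identity are supplied depends on how your application is set up, which is not shown here.

```python
def format_answer(response):
    """Render a ChatSync-style response as the answer followed by its cited sources."""
    lines = [response.get("systemMessage", "")]
    for source in response.get("sourceAttributions", []):
        # Each attribution is assumed to expose a citation number and a document title.
        lines.append(f"[{source.get('citationNumber')}] {source.get('title')}")
    return "\n".join(lines)


def ask(application_id, question, region):
    """Send one question to an Amazon Q Business application (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so format_answer can be used without the SDK installed

    client = boto3.client("qbusiness", region_name=region)
    return client.chat_sync(applicationId=application_id, userMessage=question)
```

Calling `print(format_answer(ask("<application-id>", "<your question>", "<region>")))` would print the generated answer with one line per cited source document.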

Index documents using the AWS CLI

In this section, we show you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly retrieve detailed information about your documents, including their statuses and any errors that occurred during indexing. If you’re an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs and other supported types, and you now want to reindex the scanned documents, complete the following steps:

  1. Check the status of each document to filter failed documents according to the status "DOCUMENT_FAILED_TO_INDEX". You can filter the documents based on this error message:

"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."

If you’re a new user and haven’t indexed any documents, you can skip this step.

The following is an example of using the ListDocuments API to filter documents with a specific status and their error messages:

aws qbusiness list-documents --region <region> \
--application-id <application-id> \
--index-id <index-id> \
--query "documentDetailList[?status=='DOCUMENT_FAILED_TO_INDEX'].{DocumentId:documentId, ErrorMessage:error.errorMessage}" \
--output json

The following screenshot shows the AWS CLI output with a list of failed documents with error messages.

List of failed documents
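
If you save the unfiltered ListDocuments output as JSON, the same filter can be applied in a script. The following is a minimal sketch that mirrors the `--query` expression above, using only the field names that appear in it (`documentDetailList`, `status`, `documentId`, `error.errorMessage`). The function names `failed_documents` and `with_error_containing` are hypothetical.

```python
def failed_documents(response):
    """Keep documents whose status is DOCUMENT_FAILED_TO_INDEX, with ID and error message."""
    return [
        {
            "DocumentId": detail.get("documentId"),
            "ErrorMessage": detail.get("error", {}).get("errorMessage"),
        }
        for detail in response.get("documentDetailList", [])
        if detail.get("status") == "DOCUMENT_FAILED_TO_INDEX"
    ]


def with_error_containing(failed, text):
    """Return the IDs of failed documents whose error message contains the given text."""
    return [f["DocumentId"] for f in failed if text in (f["ErrorMessage"] or "")]
```

For example, after `response = json.load(open("documents.json"))`, passing the result of `failed_documents(response)` to `with_error_containing` with a fragment of the error message narrows the list to the documents you want to reindex.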

Now you batch-process the documents. Amazon Q Business supports adding multiple documents to an Amazon Q Business index.

  2. Use the BatchPutDocument API to ingest multiple scanned documents stored in an S3 bucket into the index:
    aws qbusiness batch-put-document --region <region> \
    --documents '[{ "id":"s3://<your-bucket-path>/<scanned-pdf-document1>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document1>"}}}, { "id":"s3://<your-bucket-path>/<scanned-pdf-document2>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document2>"}}}]' \
    --application-id <application-id> \
    --index-id <index-id> \
    --endpoint-url <application-endpoint-url> \
    --role-arn <role-arn> \
    --no-verify-ssl

The following screenshot shows the AWS CLI output. You should see failed documents as an empty list.

List of failed documents
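
Writing the `--documents` JSON by hand becomes error-prone with more than a few files. The following sketch generates it from a bucket name and a list of object keys, following the shape used in the command above. Two things here are assumptions rather than part of the original walkthrough: the document ID is built as `s3://<bucket>/<key>`, and the default batch size of 10 is an assumed per-call limit that you should check against the BatchPutDocument API reference. The function names are hypothetical.

```python
import json


def build_documents(bucket, keys):
    """Build one document entry per S3 object key, in the shape passed to --documents."""
    return [
        {
            "id": f"s3://{bucket}/{key}",  # assumed ID scheme; any unique string per document works
            "content": {"s3": {"bucket": bucket, "key": key}},
        }
        for key in keys
    ]


def batches(items, size=10):
    """Split a list into consecutive chunks of at most `size` items (assumed per-call limit)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def documents_argument(bucket, keys):
    """Serialize the document entries as the JSON string for the --documents option."""
    return json.dumps(build_documents(bucket, keys))
```

Each chunk returned by `batches(build_documents(bucket, keys))` can then be serialized with `json.dumps` and passed to one `batch-put-document` call.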

  3. Finally, use the ListDocuments API again to review if all documents were indexed properly:
    aws qbusiness list-documents --region <region> \
    --application-id <application-id> \
    --index-id <index-id> \
    --endpoint-url <application-endpoint-url> \
    --no-verify-ssl

The following screenshot shows that the documents are indexed in the data source.

List of indexed documents
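
For an index with many documents, this final check can be scripted rather than read from the output. The following sketch assumes a boto3 `qbusiness` client whose `list_documents` operation returns `documentDetailList` and, when more pages exist, a `nextToken`. It also assumes that the API reports the console's Indexed and Updated states as `INDEXED` and `UPDATED`. The function names are hypothetical.

```python
from collections import Counter

# Assumed API spellings of the console's "Indexed" and "Updated" states.
DONE_STATUSES = {"INDEXED", "UPDATED"}


def list_all_documents(client, application_id, index_id):
    """Collect document details from every page of ListDocuments results."""
    details, token = [], None
    while True:
        kwargs = {"applicationId": application_id, "indexId": index_id}
        if token:
            kwargs["nextToken"] = token
        page = client.list_documents(**kwargs)
        details.extend(page.get("documentDetailList", []))
        token = page.get("nextToken")
        if not token:
            return details


def status_summary(details):
    """Count documents per status."""
    return Counter(detail.get("status") for detail in details)


def not_yet_indexed(details):
    """Return the IDs of documents that are not in a finished state."""
    return [d.get("documentId") for d in details if d.get("status") not in DONE_STATUSES]
```

An empty result from `not_yet_indexed` indicates that every document reached a finished state, and `status_summary` gives a quick breakdown when some did not.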

Clean up

If you created a new Amazon Q Business application and don’t plan to use it further, unsubscribe and remove assigned users from the application and delete it so that your AWS account doesn’t accumulate costs. Moreover, if you don’t need to use the indexed data sources further, refer to Managing Amazon Q Business data sources for instructions to delete your indexed data sources.

Conclusion

This post demonstrated the support for scanned PDF document types with Amazon Q Business. We highlighted the steps to sync, index, and query supported document types, now including scanned PDF documents, using generative AI with Amazon Q Business. We also showed examples of queries on structured, unstructured, or semi-structured multi-modal scanned documents using the Amazon Q Business web UI and AWS CLI.

To learn more about this feature, refer to Supported document formats in Amazon Q Business. Give it a try on the Amazon Q Business console today! For more information, visit Amazon Q Business and the Amazon Q Business User Guide. You can send feedback to AWS re:Post for Amazon Q or through your usual AWS support contacts.


About the Authors

Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Himesh Kumar is a seasoned Senior Software Engineer, currently working at Amazon Q Business in AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to developing scalable and efficient systems, ensuring high availability, performance, and reliability. Beyond the technical skills, he is dedicated to continuous learning and staying at the forefront of technological advancements in AI and machine learning.

Qing Wei is a Senior Software Developer for the Amazon Q Business team in AWS, and is passionate about building modern applications using AWS technologies. He loves community-driven learning and sharing of technology, especially for machine learning hosting and inference related topics. His main focus right now is on building serverless and event-driven architectures for RAG data ingestion.
