Build a receipt and invoice processing pipeline with Amazon Textract


In today's business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. At a high level, the accounts payable process includes receiving and scanning invoices, extracting the relevant data from scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Each invoice and receipt looks different. The labels are imperfect and inconsistent. The most important pieces of information, such as price, vendor name, vendor address, and payment terms, are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.

In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.

Solution overview

The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine whether to automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.

Solution Architecture

The following sections take you through the process of creating the solution.

Prerequisites

To deploy this solution, you must have the following:

  • An AWS account.
  • An AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.

To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to the IDE. You're now ready to use the AWS Cloud9 environment.

Deploy the solution

To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.

  1. In your AWS Cloud9 IDE terminal, clone the GitHub repository and install the dependencies. Run the following commands to deploy the InvoiceProcessor stack:
git clone https://github.com/aws-samples/amazon-textract-invoice-processor.git
cd amazon-textract-invoice-processor
pip install -r requirements.txt
cdk bootstrap
cdk deploy

The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.

  2. After the AWS CDK deployment is complete, create expense validation rules in an Amazon DynamoDB table. You can use the same AWS Cloud9 terminal to run the following commands:
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 1, 'type': 'regex', 'field': 'INVOICE_RECEIPT_ID', 'check': '(?i)[0-9]{3}[a-z]{3}[0-9]{3}$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'}"
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 2, 'type': 'regex', 'field': 'PO_NUMBER', 'check': '(?i)[a-z0-9]+$', 'errorTxt': 'PO number is not present'}"

  3. In the S3 bucket that starts with invoiceprocessorworkflow-invoiceprocessorbucketf1-*, create an uploads folder.

In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool*. We use this user pool to create a new user.

  1. On the Amazon Cognito console, navigate to the user pool OpenSearchResourcesCognitoUserPool*.
  2. Create a new Amazon Cognito user.
  3. Provide a user name and password of your choice and note them for later use.
  4. Upload the documents random_invoice1 and random_invoice2 to the S3 uploads folder to start the workflows.

Now let's dive into each of the document processing steps.

Document capture

Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels such as hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in highly scalable storage such as an S3 bucket.

Upload sample invoices

Extraction

The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due or paid, and so on.

AnalyzeExpense is an API dedicated to processing invoice and receipt documents. It's available as both a synchronous and an asynchronous API. The synchronous API allows you to send images in bytes format, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense API response consists of three distinct sections:

  • Summary fields – This section includes both normalized keys and the explicitly mentioned keys along with their values. AnalyzeExpense normalizes the keys for contact-related information such as vendor name and vendor address, tax ID-related keys such as tax payer ID, payment-related keys such as amount due and discount, and general keys such as invoice ID, delivery date, and account number. Keys that aren't normalized still appear in the summary fields as key-value pairs. For a complete list of supported expense fields, refer to Analyzing Invoices and Receipts.
  • Line items – This section includes normalized line item keys such as item description, unit price, quantity, and product code.
  • OCR block – The block contains the raw text extracted from the invoice page. The raw text can be used for postprocessing and for identifying information that isn't covered by the summary and line item fields.
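As an illustration of the response sections described above, the following is a minimal sketch (not the code used by the deployed stack) that flattens an AnalyzeExpense response into summary fields and line items. The helper names summarize_expense and analyze_s3_document are hypothetical. The sketch assumes that fields without a normalized key are reported with the type OTHER, in which case it falls back to the label detected on the page.

```python
from typing import Any


def summarize_expense(response: dict[str, Any]) -> dict[str, Any]:
    """Flatten an AnalyzeExpense response into summary fields and line items."""
    summary: dict[str, str] = {}
    line_items: list[dict[str, str]] = []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            # Prefer the normalized type; fall back to the label printed on the page.
            key = field.get("Type", {}).get("Text") or "OTHER"
            if key == "OTHER":
                key = field.get("LabelDetection", {}).get("Text", "OTHER")
            summary[key] = field.get("ValueDetection", {}).get("Text", "")
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                # One dict per line item, keyed by the normalized line item type.
                line_items.append({
                    f.get("Type", {}).get("Text", "OTHER"):
                        f.get("ValueDetection", {}).get("Text", "")
                    for f in item.get("LineItemExpenseFields", [])
                })
    return {"summary": summary, "line_items": line_items}


def analyze_s3_document(bucket: str, key: str) -> dict[str, Any]:
    """Call the synchronous API for an image in S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parser above works without AWS access

    client = boto3.client("textract")
    return client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
```

With credentials configured, summarize_expense(analyze_s3_document(bucket, key)) would return a dictionary whose summary entry maps keys such as VENDOR_NAME to their detected values. If two fields share a key, this sketch keeps only the last one.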

This post uses the Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allow you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components that enable the definition of IDP processes on AWS and are published to GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.

The following figure shows the Step Functions workflow.

Step function workflow

The extraction workflow includes the following steps:

  • InvoiceProcessor-Decider – An AWS Lambda function that verifies if the input document format is supported by Amazon Textract. For more details about supported formats, refer to Input Documents.
  • DocumentSplitter – A Lambda function that generates 2,500-page (max) chunks from documents and can process large multi-page documents.
  • Map State – A Step Functions Map state that processes each chunk in parallel.
  • TextractAsync – This task calls Amazon Textract using the asynchronous API following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and uses OutputConfig to store the Amazon Textract JSON output in the S3 bucket you created earlier. It consists of two Lambda functions: one to submit the document for processing and one that is triggered on the SNS notification.
  • TextractAsyncToJSON2 – Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.
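To make the splitting step concrete, here is a minimal sketch of how page ranges of at most 2,500 pages could be computed before the chunks are processed in parallel. This is an illustration only: chunk_page_ranges is a hypothetical helper, and the actual DocumentSplitter function in the constructs may work differently.

```python
def chunk_page_ranges(total_pages: int, max_pages: int = 2500) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges of at most max_pages pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    # A 6,000-page document yields three chunks.
    print(chunk_page_ranges(6000))
```

Each range can then be handed to one iteration of the parallel step, and the per-chunk outputs combined afterward.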

We discuss the details of the next three steps in the following sections.

Verification and approval

For the verification stage, the SetMetaData Lambda function verifies whether the uploaded file is a valid expense according to the rules configured previously in the DynamoDB table. For this post, you use the following sample rules:

  • Verification is successful if INVOICE_RECEIPT_ID is present and matches the regex (?i)[0-9]{3}[a-z]{3}[0-9]{3}$ and if PO_NUMBER is present and matches the regex (?i)[a-z0-9]+$.
  • Verification is unsuccessful if either PO_NUMBER or INVOICE_RECEIPT_ID is incorrect or missing in the document.
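The two sample rules can be sketched in Python as follows. This is an assumption-laden illustration rather than the SetMetaData function itself: validate_expense is a hypothetical helper, the rule attribute names (type, field, check, errorTxt) are assumed to mirror the items stored in the rules table, and re.search is used because the sample patterns anchor only at the end of the value. The deployed function may match differently.

```python
import re

# Sample rules; attribute names are assumed to mirror the DynamoDB items.
RULES = [
    {
        "ruleId": 1,
        "type": "regex",
        "field": "INVOICE_RECEIPT_ID",
        "check": r"(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
        "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456",
    },
    {
        "ruleId": 2,
        "type": "regex",
        "field": "PO_NUMBER",
        "check": r"(?i)[a-z0-9]+$",
        "errorTxt": "PO number is not present",
    },
]


def validate_expense(fields: dict[str, str], rules: list[dict] = RULES) -> list[str]:
    """Return the error text of every rule that fails; an empty list means approved."""
    errors = []
    for rule in rules:
        if rule["type"] != "regex":
            continue  # only regex rules are handled in this sketch
        value = fields.get(rule["field"])
        # A missing or empty field fails the rule, as does a value that does not match.
        if not value or not re.search(rule["check"], value):
            errors.append(rule["errorTxt"])
    return errors
```

A document whose extracted fields produce an empty error list would be treated as approved; any returned error text indicates why it would be declined.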

After the files are processed, the expense verification function moves the input files to either the approved or declined folder in the same S3 bucket.

S3 output

For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.

Intelligent index and search

With the OpenSearchPushInvoke Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.

The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.

After the OpenSearch Service index is created, you can search for keywords from the extracted text through OpenSearch Dashboards.
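To see why clearing the context matters, the following sketch estimates the serialized size of a state payload and compares it with the Step Functions limit of 262,144 bytes (256 KiB) of UTF-8 encoded data. The helper names are hypothetical, and the size is an estimate because Step Functions may serialize the payload slightly differently than json.dumps does.

```python
import json
from typing import Any

# Step Functions quota for input or output of a task, state, or execution.
STATE_PAYLOAD_QUOTA_BYTES = 256 * 1024  # 262,144 bytes


def payload_size_bytes(payload: Any) -> int:
    """Approximate size of a payload once serialized to a UTF-8 JSON string."""
    return len(json.dumps(payload, ensure_ascii=False).encode("utf-8"))


def exceeds_payload_quota(payload: Any, quota: int = STATE_PAYLOAD_QUOTA_BYTES) -> bool:
    """Return True if the serialized payload is larger than the quota."""
    return payload_size_bytes(payload) > quota
```

A common way to stay under the limit is to keep large results, such as the full Amazon Textract JSON, in Amazon S3 and pass only the object location between states.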

OpenSearch document search

Archival, audit, and analytics

To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from the Standard to the Intelligent-Tiering storage class. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven't been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
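As a rough sketch of such a lifecycle rule, the following builds a configuration that transitions objects under a prefix to Intelligent-Tiering and applies it with boto3. The helper names, the approved/ prefix, and the default of 0 days are assumptions made for illustration; the deployed stack may already configure an equivalent rule.

```python
from typing import Any


def intelligent_tiering_rule(prefix: str, days: int = 0) -> dict[str, Any]:
    """Build a lifecycle configuration that moves objects under prefix to Intelligent-Tiering."""
    return {
        "Rules": [
            {
                "ID": f"move-{prefix.strip('/')}-to-intelligent-tiering",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Transition objects once they are at least `days` days old.
                "Transitions": [
                    {"Days": days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict[str, Any]) -> None:
    """Apply the configuration (needs boto3 and AWS credentials).

    Note that this call replaces any lifecycle configuration already on the bucket.
    """
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

For example, apply_lifecycle(bucket_name, intelligent_tiering_rule("approved/")) would target only approved documents, assuming that is the folder name used in your bucket.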

For auditing and analytics, this solution uses OpenSearch Service to run analytics on invoice requests. OpenSearch Service enables you to ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.

Log in to OpenSearch Dashboards and navigate to Stack Management, Saved objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.

OpenSearch import

Refresh the page and navigate to Home, Dashboard, and open Invoices. You can now select and apply filters and expand the time window to explore past invoices.

OpenSearch dashboard

Clean up

When you're finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:

  1. Delete all content from the S3 bucket invoiceprocessorworkflow-invoiceprocessorbucketf1-*.
  2. In AWS Cloud9, run the following commands to delete Amazon Cognito resources and CloudFormation stacks:
cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text)
echo $cognito_user_pool
cdk destroy
aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool

  3. Delete the AWS Cloud9 environment that you created from the AWS Cloud9 console.

Conclusion

In this post, we provided an overview of how to build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense API to extract critical fields from an invoice.

To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.


About the Authors

Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.

Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.

Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he likes to travel and ride his motorcycle in Texas Hill Country.
