Automate invoice processing with Streamlit and Amazon Bedrock


Invoice processing is a vital but often cumbersome activity for companies of all sizes, particularly for large enterprises dealing with invoices from multiple vendors in varying formats. The sheer volume of data, coupled with the need for accuracy and efficiency, can make invoice processing a significant challenge. Invoices can vary widely in format, structure, and content, making efficient processing at scale difficult. Traditional methods relying on manual data entry or custom scripts for each vendor's format can not only lead to inefficiencies, but can also increase the potential for errors, resulting in financial discrepancies, operational bottlenecks, and backlogs.

To extract key details such as invoice numbers, dates, and amounts, we use Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we provide a step-by-step guide with the building blocks needed to create a Streamlit application that processes and reviews invoices from multiple vendors. Streamlit is an open source framework for data scientists to efficiently create interactive web-based data applications in pure Python. We use Anthropic's Claude 3 Sonnet model in Amazon Bedrock and Streamlit for building the application front-end.

Solution overview

This solution uses the Amazon Bedrock Knowledge Bases chat with document feature to analyze and extract key details from your invoices, without needing a knowledge base. The results are shown in a Streamlit app, with the invoices and extracted information displayed side by side for quick review. Importantly, your document and data are not stored after processing.

The storage layer uses Amazon Simple Storage Service (Amazon S3) to hold the invoices that business users upload. After uploading, you can set up a regular batch job to process these invoices, extract key information, and save the results in a JSON file. In this post, we save the data in JSON format, but you can also choose to store it in your preferred SQL or NoSQL database.

The application layer uses Streamlit to display the PDF invoices alongside the extracted data from Amazon Bedrock. For simplicity, we deploy the app locally, but you can also run it on Amazon SageMaker Studio, Amazon Elastic Compute Cloud (Amazon EC2), or Amazon Elastic Container Service (Amazon ECS) if needed.

Prerequisites

To perform this solution, complete the setup described in the following sections.

Install dependencies and clone the example

To get started, install the necessary packages on your local machine or on an EC2 instance. If you're new to Amazon EC2, refer to the Amazon EC2 User Guide. This tutorial uses the local machine for project setup.

To install dependencies and clone the example, follow these steps:

  1. Clone the repository into a local folder:
    git clone https://github.com/aws-samples/genai-invoice-processor.git

  2. Install Python dependencies:
    • Navigate to the project directory:
      cd </path/to/your/folder>/genai-invoice-processor

    • Upgrade pip:
      python3 -m pip install --upgrade pip

    • (Optional) Create a virtual environment to isolate dependencies:
      python3 -m venv venv

    • Activate the virtual environment:
      1. Mac/Linux:
         source venv/bin/activate
      2. Windows:
         venv\Scripts\activate

  3. In the cloned directory, run the following to install the necessary Python packages:
    pip install -r requirements.txt

    This installs the necessary packages, including Boto3 (the AWS SDK for Python), Streamlit, and other dependencies.

  4. Update the Region in the config.yaml file to the same Region set in your AWS CLI where Amazon Bedrock and Anthropic's Claude 3 Sonnet model are available.
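
    For example, if your AWS CLI is configured for us-west-2 (you can check with aws configure get region), the aws section of config.yaml should match:

      aws:
          region_name: us-west-2
          model_id: anthropic.claude-3-sonnet-20240229-v1:0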

After completing these steps, the invoice processor code will be set up in your local environment and ready for the next stages to process invoices using Amazon Bedrock.

Process invoices using Amazon Bedrock

Now that the environment setup is complete, you're ready to start processing invoices and deploying the Streamlit app. To process invoices using Amazon Bedrock, follow these steps:

Store invoices in Amazon S3

Store invoices from different vendors in an S3 bucket. You can upload them directly using the console, the API, or as part of your regular business process (a programmatic boto3 alternative is sketched after these steps). Follow these steps to upload using the AWS CLI:

  1. Create an S3 bucket:
    aws s3 mb s3://<your-bucket-name> --region <your-region>

    Replace <your-bucket-name> with the name of your bucket and <your-region> with the Region set in your AWS CLI and in config.yaml (for example, us-east-1).

  2. Upload invoices to the S3 bucket. Use one of the following commands to upload the invoices to Amazon S3.
    • To upload invoices to the root of the bucket:
      aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/ --recursive

    • To upload invoices to a specific folder (for example, invoices):
      aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/<prefix>/ --recursive

    • Validate the upload:
      aws s3 ls s3://<your-bucket-name>/
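
If you prefer to upload programmatically, the following is a minimal boto3 sketch that mirrors the recursive CLI copy above (the folder path, bucket name, and prefix are placeholders):

      import os
      import boto3

      s3 = boto3.client('s3')
      local_folder = '/path/to/your/folder'   # placeholder, as in the CLI example
      bucket = '<your-bucket-name>'
      prefix = 'invoices/'                    # optional key prefix

      # Walk the local folder and upload each PDF under the given prefix
      for root, _, files in os.walk(local_folder):
          for name in files:
              if name.lower().endswith('.pdf'):
                  path = os.path.join(root, name)
                  s3.upload_file(path, bucket, prefix + name)
                  print(f"Uploaded {path} to s3://{bucket}/{prefix}{name}")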

Process invoices with Amazon Bedrock

In this section, you'll process the invoices in Amazon S3 and store the results in a JSON file (processed_invoice_output.json). You'll extract the key details from the invoices (such as invoice numbers, dates, and amounts) and generate summaries.

You can trigger the processing of these invoices using the AWS CLI or automate the process with an Amazon EventBridge rule or AWS Lambda trigger. For this walkthrough, we use the AWS CLI to trigger the processing.
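
If you choose the Lambda route, a minimal handler sketch might look like the following (hypothetical; it assumes the functions from invoices_processor.py are packaged with the Lambda function and that the bucket sends S3 event notifications):

      # Hypothetical sketch of an S3-triggered Lambda handler
      from invoices_processor import initialize_aws_clients, process_invoice

      s3_client, bedrock_client = initialize_aws_clients()

      def lambda_handler(event, context):
          # S3 event notifications deliver one or more records per invocation
          for record in event.get('Records', []):
              bucket = record['s3']['bucket']['name']
              key = record['s3']['object']['key']
              if key.lower().endswith('.pdf'):
                  result = process_invoice(s3_client, bedrock_client, bucket, key)
                  print(f"Processed s3://{bucket}/{key}: {result}")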

We packaged the processing logic in the Python script invoices_processor.py, which can be run as follows:

python invoices_processor.py --bucket_name=<your-bucket-name> --prefix=<your-folder>

The --prefix argument is optional. If omitted, all of the PDFs in the bucket will be processed. For example:

python invoices_processor.py --bucket_name='gen-ai-demo-bucket'

or

python invoices_processor.py --bucket_name="gen-ai-demo-bucket" --prefix='invoice'

Use the solution

This section examines the invoices_processor.py code. You can chat with your document either on the Amazon Bedrock console or by using the Amazon Bedrock RetrieveAndGenerate API (SDK). In this tutorial, we use the API approach.

    1. Initialize the environment: The script imports the necessary libraries and initializes the Amazon Bedrock and Amazon S3 clients.
      import boto3
      import os
      import json
      import shutil
      import argparse
      import time
      import datetime
      import yaml
      from typing import Dict, Any, Tuple
      from concurrent.futures import ThreadPoolExecutor, as_completed
      from threading import Lock
      from mypy_boto3_bedrock_runtime.client import BedrockRuntimeClient
      from mypy_boto3_s3.client import S3Client
      
      # Load configuration from YAML file
      def load_config():
          """
          Load and return the configuration from the 'config.yaml' file.
          """
          with open('config.yaml', 'r') as file:
              return yaml.safe_load(file)
      
      CONFIG = load_config()
      
      write_lock = Lock() # Lock for managing concurrent writes to the output file
      
      def initialize_aws_clients() -> Tuple[S3Client, BedrockRuntimeClient]:
          """
          Initialize and return AWS S3 and Bedrock clients.
      
          Returns:
              Tuple[S3Client, BedrockRuntimeClient]
          """
          return (
              boto3.client('s3', region_name=CONFIG['aws']['region_name']),
              boto3.client(service_name="bedrock-agent-runtime", 
                           region_name=CONFIG['aws']['region_name'])
          )

    2. Configure: The config.yaml file specifies the model ID, Region, prompts for entity extraction, and the output file location for processing.
      aws: 
          region_name: us-west-2 
          model_id: anthropic.claude-3-sonnet-20240229-v1:0
          prompts: 
              full: Extract data from attached invoice in key-value format. 
              structured: | 
                  Process the pdf invoice and list all metadata and values in json format for the variables with descriptions in <variables></variables> tags. The result should be returned as JSON as given in the <output></output> tags. 

                  <variables> 
                      Vendor: Name of the company or entity the invoice is from. 
                      InvoiceDate: Date the invoice was created.
                      DueDate: Date the invoice is due and needs to be paid by. 
                      CurrencyCode: Currency code for the invoice amount based on the symbol and vendor details.
                      TotalAmountDue: Total amount due for the invoice
                      Description: a concise summary of the invoice description within 20 words 
                  </variables> 

                  Format your analysis as a JSON object in the following structure: 
                      <output> {
                      "Vendor": "<vendor name>", 
                      "InvoiceDate":"<DD-MM-YYYY>", 
                      "DueDate":"<DD-MM-YYYY>",
                      "CurrencyCode":"<Currency code based on the symbol and vendor details>", 
                      "TotalAmountDue":"<100.90>", # should be a decimal number in string 
                      "Description":"<Concise summary of the invoice description within 20 words>" 
                      } </output> 
                  Please proceed with the analysis based on the above instructions. Please don't state "Based on the .."
              summary: Process the pdf invoice and summarize the invoice under 3 lines 
      
      processing: 
          output_file: processed_invoice_output.json
          local_download_folder: invoices

    3. Set up API calls: The RetrieveAndGenerate API fetches the invoice from Amazon S3 and processes it using the FM. It takes several parameters, such as the prompt, source type (S3), model ID, AWS Region, and S3 URI of the invoice.
      def retrieve_and_generate(bedrock_client: BedrockRuntimeClient, input_prompt: str, document_s3_uri: str) -> Dict[str, Any]: 
          """ 
          Use Amazon Bedrock to retrieve and generate invoice data based on the provided prompt and S3 document URI.
      
          Args: 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              input_prompt (str): Prompt for the AI model
              document_s3_uri (str): S3 URI of the invoice document 
      
          Returns: 
              Dict[str, Any]: Generated data from Bedrock 
          """ 
          model_arn = f'arn:aws:bedrock:{CONFIG["aws"]["region_name"]}::foundation-model/{CONFIG["aws"]["model_id"]}' 
          return bedrock_client.retrieve_and_generate( 
              input={'text': input_prompt}, retrieveAndGenerateConfiguration={ 
                  'type': 'EXTERNAL_SOURCES',
                  'externalSourcesConfiguration': { 
                      'modelArn': model_arn, 
                      'sources': [ 
                          { 
                              "sourceType": "S3", 
                              "s3Location": {"uri": document_s3_uri} 
                          }
                      ] 
                  } 
              } 
          )

    4. Batch processing: The batch_process_s3_bucket_invoices function batch processes the invoices in parallel in the specified S3 bucket and writes the results to the output file (processed_invoice_output.json, as specified by output_file in config.yaml). It relies on the process_invoice function, which calls the Amazon Bedrock RetrieveAndGenerate API for each invoice and prompt. The write_to_json_file helper used for output isn't shown here; a sketch appears after these steps.
      def process_invoice(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, pdf_file_key: str) -> Dict[str, str]: 
          """ 
          Process a single invoice by downloading it from S3 and using Bedrock to analyze it. 
      
          Args: 
              s3_client (S3Client): AWS S3 client 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              bucket_name (str): Name of the S3 bucket
              pdf_file_key (str): S3 key of the PDF invoice 
      
          Returns: 
              Dict[str, Any]: Processed invoice data 
          """ 
          document_uri = f"s3://{bucket_name}/{pdf_file_key}"
          local_file_path = os.path.join(CONFIG['processing']['local_download_folder'], pdf_file_key) 
      
          # Ensure the local directory exists and download the invoice from S3
          os.makedirs(os.path.dirname(local_file_path), exist_ok=True) 
          s3_client.download_file(bucket_name, pdf_file_key, local_file_path) 
      
          # Process the invoice with different prompts 
          results = {} 
          for prompt_name in ["full", "structured", "summary"]:
              response = retrieve_and_generate(bedrock_client, CONFIG['aws']['prompts'][prompt_name], document_uri)
              results[prompt_name] = response['output']['text']
      
          return results

      def batch_process_s3_bucket_invoices(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, prefix: str = "") -> int: 
          """ 
          Batch process all invoices in an S3 bucket or a specific prefix within the bucket. 
      
          Args: 
              s3_client (S3Client): AWS S3 client 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              bucket_name (str): Name of the S3 bucket 
              prefix (str, optional): S3 prefix to filter invoices. Defaults to "". 
      
          Returns: 
              int: Number of processed invoices 
          """ 
          # Clear and recreate the local download folder
          shutil.rmtree(CONFIG['processing']['local_download_folder'], ignore_errors=True)
          os.makedirs(CONFIG['processing']['local_download_folder'], exist_ok=True) 
      
          # Prepare to iterate through all objects in the S3 bucket
          continuation_token = None # Pagination handling
          pdf_file_keys = [] 
      
          while True: 
              list_kwargs = {'Bucket': bucket_name, 'Prefix': prefix}
              if continuation_token:
                  list_kwargs['ContinuationToken'] = continuation_token 
      
              response = s3_client.list_objects_v2(**list_kwargs)
      
              for obj in response.get('Contents', []): 
                  pdf_file_key = obj['Key'] 
                  if pdf_file_key.lower().endswith('.pdf'): # Skip folders or non-PDF files
                      pdf_file_keys.append(pdf_file_key) 
      
              if not response.get('IsTruncated'): 
                  break 
              continuation_token = response.get('NextContinuationToken') 
      
          # Process invoices in parallel 
          processed_count = 0 
          with ThreadPoolExecutor() as executor: 
              future_to_key = { 
                  executor.submit(process_invoice, s3_client, bedrock_client, bucket_name, pdf_file_key): pdf_file_key
                  for pdf_file_key in pdf_file_keys 
              } 
      
              for future in as_completed(future_to_key):
                  pdf_file_key = future_to_key[future] 
                  try: 
                      result = future.result() 
                      # Write the result to the JSON output file as soon as it's available 
                      write_to_json_file(CONFIG['processing']['output_file'], {pdf_file_key: result}) 
                      processed_count += 1 
                      print(f"Processed file: s3://{bucket_name}/{pdf_file_key}") 
                  except Exception as e: 
                      print(f"Failed to process s3://{bucket_name}/{pdf_file_key}: {str(e)}") 
      
          return processed_count

    5. Post-processing: The extracted data in processed_invoice_output.json can be further structured or customized to suit your needs.
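
The script also depends on a write_to_json_file helper that isn't shown above. The repository provides its own implementation; a minimal thread-safe sketch, reusing the write_lock and imports defined in step 1, could look like this:

      def write_to_json_file(file_path: str, data: Dict[str, Any]) -> None:
          """Merge data into the JSON output file, guarding against concurrent writes."""
          with write_lock:  # serialize writes from the thread pool
              existing: Dict[str, Any] = {}
              if os.path.exists(file_path):
                  with open(file_path, 'r') as f:
                      try:
                          existing = json.load(f)
                      except json.JSONDecodeError:
                          existing = {}  # start fresh if the file is empty or corrupt
              existing.update(data)
              with open(file_path, 'w') as f:
                  json.dump(existing, f, indent=4)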

This approach allows invoice handling from multiple vendors, each with its own unique format and structure. By using large language models (LLMs), it extracts important details such as invoice numbers, dates, amounts, and vendor information without requiring custom scripts for each vendor format.

Run the Streamlit demo

Now that you have the components in place and the invoices processed using Amazon Bedrock, it's time to deploy the Streamlit application. You can launch the app by running the following command:

streamlit run review-invoice-data.py

or

python -m streamlit run review-invoice-data.py

When the app is up, it will open in your default web browser. From there, you can review the invoices and the extracted data side by side. Use the Previous and Next arrows to seamlessly navigate through the processed invoices so you can interact with and analyze the results efficiently. The following screenshot shows the UI.
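
For reference, the core of such a review app is quite small. The following is a simplified, hypothetical sketch of the side-by-side layout and Previous/Next navigation; the actual review-invoice-data.py in the repository is more complete:

      # Hypothetical, simplified sketch of the review app
      import json
      import streamlit as st

      OUTPUT_FILE = "processed_invoice_output.json"  # matches config.yaml
      INVOICE_FOLDER = "invoices"                    # local_download_folder

      with open(OUTPUT_FILE) as f:
          results = json.load(f)
      keys = list(results.keys())

      # Track which invoice is displayed across Streamlit reruns
      if "idx" not in st.session_state:
          st.session_state.idx = 0

      prev_col, next_col = st.columns(2)
      if prev_col.button("Previous") and st.session_state.idx > 0:
          st.session_state.idx -= 1
      if next_col.button("Next") and st.session_state.idx < len(keys) - 1:
          st.session_state.idx += 1

      key = keys[st.session_state.idx]
      left, right = st.columns(2)
      left.subheader(key)
      # Offer the locally downloaded PDF alongside the extracted fields
      with open(f"{INVOICE_FOLDER}/{key}", "rb") as pdf:
          left.download_button("Download PDF", pdf, file_name=key.split("/")[-1])
      right.json(results[key])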

There are quotas for Amazon Bedrock (some of which are adjustable) that you need to consider when building at scale with Amazon Bedrock.
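
If you batch large volumes, requests can be throttled once you reach those quotas. A simple retry wrapper with exponential backoff (a sketch, not part of the sample code) can smooth this out:

      import time
      from botocore.exceptions import ClientError

      def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
          """Call fn(), retrying with exponential backoff on throttling errors."""
          for attempt in range(1, max_attempts + 1):
              try:
                  return fn()
              except ClientError as e:
                  code = e.response.get('Error', {}).get('Code', '')
                  if code != 'ThrottlingException' or attempt == max_attempts:
                      raise
                  time.sleep(base_delay * 2 ** (attempt - 1))  # back off, then retry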

Cleanup

To clean up after running the demo, follow these steps:

  • Delete the S3 bucket containing your invoices using the following command:
    aws s3 rb s3://<your-bucket-name> --force

  • If you set up a virtual environment, deactivate it by running deactivate
  • Remove any local files created during the process, including the cloned repository and output files
  • If you used any AWS resources such as an EC2 instance, terminate them to avoid unnecessary charges

Conclusion

In this post, we walked through a step-by-step guide to automating invoice processing using Streamlit and Amazon Bedrock, addressing the challenge of handling invoices from multiple vendors with different formats. We showed how to set up the environment, process invoices stored in Amazon S3, and deploy a user-friendly Streamlit application to review and interact with the processed data.

If you're looking to further enhance this solution, consider integrating additional features or deploying the app on scalable AWS services such as Amazon SageMaker, Amazon EC2, or Amazon ECS. Thanks to this flexibility, your invoice processing solution can evolve with your business, providing long-term value and efficiency.

We encourage you to learn more by exploring Amazon Bedrock, Access Amazon Bedrock foundation models, RetrieveAndGenerate API, and Quotas for Amazon Bedrock, and to build a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.


About the Authors

Deepika Kumar is a Solution Architect at AWS. She has over 13 years of experience in the technology industry and has helped enterprises and SaaS organizations build and securely deploy their workloads on the cloud. She is passionate about using generative AI in a responsible manner, whether that's driving product innovation, boosting productivity, or enhancing customer experiences.

Jobandeep Singh is an Associate Solution Architect at AWS specializing in Machine Learning. He helps customers across a wide range of industries use AWS, driving innovation and efficiency in their operations. In his free time, he enjoys playing sports, with a particular love for hockey.

Ratan Kumar is a solutions architect based out of Auckland, New Zealand. He works with large enterprise customers, helping them design and build secure, cost-effective, and reliable internet-scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.
