Automate caption creation and search for images at enterprise scale using generative AI and Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and text from various data sources. In this post, we focus on extending the document support in Amazon Kendra to make images searchable by their displayed content. Images can often be searched using supplemented metadata such as keywords. However, it takes a lot of manual effort to add detailed metadata to potentially thousands of images. Generative AI can help by generating the metadata automatically. By generating textual captions, the Generative AI caption predictions offer descriptive metadata for images. The Amazon Kendra index can then be enriched with the generated metadata during document ingestion to enable searching the images without any manual effort.

As an example, a Generative AI model can be used to generate a textual description for the following image as “a dog laying on the ground under an umbrella” during document ingestion of the image.

Image of a dog laying under an umbrella as an example of what can be searched in this solution

An object recognition model can still detect keywords such as “dog” and “umbrella,” but a Generative AI model offers a deeper understanding of what is represented in the image by identifying that the dog lies under the umbrella. This helps us build more refined searches in the image search process. The textual description is added as metadata to an Amazon Kendra search index via an automated custom document enrichment (CDE). Users searching for terms like “dog” or “umbrella” will then be able to find the image, as shown in the following screenshot.

Image of Kendra search tool

In this post, we show how to use CDE in Amazon Kendra using a Generative AI model deployed on Amazon SageMaker. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account. The solution allows users to quickly and easily find the images they need without having to manually tag or categorize them. It can also be customized and scaled to meet the needs of different applications and industries.

Image captioning with Generative AI

Image description with Generative AI involves using ML algorithms to generate textual descriptions of images. The process is also known as image captioning, and operates at the intersection of computer vision and natural language processing (NLP). It has applications in areas where data is multi-modal, such as ecommerce, where data contains text in the form of metadata as well as images, or healthcare, where data could contain MRIs or CT scans along with doctor’s notes and diagnoses, to name a few use cases.

Generative AI models learn to recognize objects and features within the images, and then generate descriptions of those objects and features in natural language. The state-of-the-art models use an encoder-decoder architecture, where the image information is encoded in the intermediate layers of the neural network and decoded into textual descriptions. These can be considered as two distinct stages: feature extraction from images and textual caption generation. In the feature extraction stage (encoder), the Generative AI model processes the image to extract relevant visual features, such as object shapes, colors, and textures. In the caption generation stage (decoder), the model generates a natural language description of the image based on the extracted visual features.

Generative AI models are typically trained on vast amounts of data, which makes them suitable for various tasks without additional training. Adapting to custom datasets and new domains is also easily achievable through few-shot learning. Pre-training methods allow multi-modal applications to be easily trained using state-of-the-art language and image models. These pre-training methods also allow you to mix and match the vision model and language model that best fit your data.

The quality of the generated image descriptions depends on the quality and size of the training data, the architecture of the Generative AI model, and the quality of the feature extraction and caption generation algorithms. Although image description with Generative AI is an active area of research, it shows very good results in a wide range of applications, such as image search, visual storytelling, and accessibility for people with visual impairments.

Use cases

Generative AI image captioning is useful in the following use cases:

  • Ecommerce – A common industry use case where images and text occur together is retail. Ecommerce in particular stores vast amounts of data as product images along with textual descriptions. The textual description or metadata is important to ensure that the best products are displayed to the user based on the search queries. Moreover, with the trend of ecommerce sites obtaining data from third-party (3P) vendors, the product descriptions are often incomplete, amounting to numerous manual hours and huge overhead resulting from tagging the right information in the metadata columns. Generative-AI-based image captioning is particularly useful for automating this laborious process. Fine-tuning the model on custom fashion data, such as fashion images along with text describing the attributes of fashion products, can be used to generate metadata that then improves a user’s search experience.
  • Marketing – Another use case of image search is digital asset management. Marketing firms store vast amounts of digital data that needs to be centralized, easily searchable, and scalable, enabled by data catalogs. A centralized data lake with informative data catalogs would reduce duplication efforts and enable wider sharing of creative content and consistency between teams. For graphic design platforms popularly used for social media content generation, or presentations in corporate settings, a faster search could result in an improved user experience by rendering the correct search results for the images that users want to look for and enabling users to search using natural language queries.
  • Manufacturing – The manufacturing industry stores a lot of image data like architecture blueprints of components, buildings, hardware, and equipment. The ability to search through such data enables product teams to easily recreate designs from a starting point that already exists and eliminates a lot of design overhead, thereby speeding up the process of design generation.
  • Healthcare – Doctors and medical researchers can catalog and search through MRIs and CT scans, specimen samples, images of ailments such as rashes and deformities, along with doctor’s notes, diagnoses, and clinical trial details.
  • Metaverse or augmented reality – Advertising a product is about creating a story that users can imagine and relate to. With AI-powered tools and analytics, it has become easier than ever to build not just one story but customized stories to appeal to end-users’ unique tastes and sensibilities. This is where image-to-text models can be a game changer. Visual storytelling can assist in creating characters, adapting them to different styles, and captioning them. It can also be used to power stimulating experiences in the metaverse or augmented reality and immersive content including video games. Image search enables developers, designers, and teams to search their content using natural language queries, which can maintain consistency of content between various teams.
  • Accessibility of digital content for blind and low vision – This is primarily enabled by assistive technologies such as screenreaders, Braille systems that allow touch reading and writing, and special keyboards for navigating websites and applications across the internet. Images, however, need to be delivered as textual content that can then be communicated as speech. Image captioning using Generative AI algorithms is a crucial piece for redesigning the internet and making it more inclusive by providing everyone a chance to access, understand, and interact with online content.

Model details and model fine-tuning for custom datasets

In this solution, we take advantage of the vit-gpt2-image-captioning model available from Hugging Face, which is licensed under Apache 2.0, without performing any further fine-tuning. ViT is a foundational model for image data, and GPT-2 is a foundational model for language. The multi-modal combination of the two offers the capability of image captioning. Hugging Face hosts state-of-the-art image captioning models, which can be deployed in AWS in a few clicks and offer simple-to-deploy inference endpoints. Although we can use this pre-trained model directly, we can also customize the model to fit domain-specific datasets, more data types such as video or spatial data, and unique use cases. There are several Generative AI models, some of which perform best with certain datasets, or your team might already be using vision and language models. This solution offers the flexibility of choosing the best-performing vision and language model as the image captioning model through straightforward replacement of the model we have used.

For customization of the models to unique industry applications, open-source models available on AWS through Hugging Face offer several possibilities. A pre-trained model can be tested on the unique dataset or trained on samples of the labeled data to fine-tune it. Novel research methods also allow any combination of vision and language models to be combined efficiently and trained on your dataset. This newly trained model can then be deployed in SageMaker for the image captioning described in this solution.

An example of a customized image search is Enterprise Resource Planning (ERP). In ERP, image data collected from different stages of logistics or supply chain management could include tax receipts, vendor orders, payslips, and more, which need to be automatically categorized for the purview of different teams within the organization. Another example is to use medical scans and doctor diagnoses to predict new medical images for automatic classification. The vision model extracts features from the MRI, CT, or X-ray images and the text model captions them with the medical diagnoses.

Solution overview

The following diagram shows the architecture for image search with Generative AI and Amazon Kendra.

Architecture of proposed solution

We ingest images from Amazon Simple Storage Service (Amazon S3) into Amazon Kendra. During ingestion to Amazon Kendra, the Generative AI model hosted on SageMaker is invoked to generate an image description. Additionally, text visible in an image is extracted by Amazon Textract. The image description and the extracted text are stored as metadata and made available to the Amazon Kendra search index. After ingestion, images can be searched via the Amazon Kendra search console, API, or SDK.
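As a rough illustration of the text extraction part of this flow, the following Python sketch collects the visible text from an Amazon Textract DetectDocumentText response. The helper name extract_visible_text and the sample response are hypothetical and are not taken from the solution's code; only the Blocks, BlockType, and Text fields follow the documented response shape.

```python
from typing import Any, Dict, List


def extract_visible_text(textract_response: Dict[str, Any]) -> str:
    """Join the text of all LINE blocks in a DetectDocumentText response."""
    lines: List[str] = [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        # LINE blocks already contain the words of a line, so WORD blocks
        # are skipped to avoid duplicating text.
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return " ".join(lines)


# Hypothetical, hand-written response used for illustration only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Happy Birthday"},
        {"BlockType": "WORD", "Text": "Happy"},
        {"BlockType": "WORD", "Text": "Birthday"},
        {"BlockType": "LINE", "Text": "Max"},
    ]
}

visible_text = extract_visible_text(sample_response)
print(visible_text)
```

With boto3, a real response would typically come from calling detect_document_text on a Textract client with an S3Object reference to the uploaded image.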

We use the advanced operations of CDE in Amazon Kendra to call the Generative AI model and Amazon Textract during the image ingestion step. However, we can use CDE for a wider range of use cases. With CDE, you can create, modify, or delete document attributes and content when you ingest your documents into Amazon Kendra. This means you can manipulate and ingest your data as needed. This can be achieved by invoking pre- and post-extraction AWS Lambda functions during ingestion, which allows for data enrichment or modification. For example, we can use Amazon Comprehend Medical when ingesting medical textual data to add ML-generated insights to the search metadata.
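To make the Lambda-based enrichment more concrete, here is a minimal sketch of how a pre-extraction Lambda function might return the generated caption and the extracted text as document attributes. It is not the code deployed by this solution. The attribute names image_caption and image_text are hypothetical, and the response fields (version, s3ObjectKey, metadataUpdates) reflect our reading of the CDE Lambda data contract, so check them against the Amazon Kendra documentation before relying on them.

```python
from typing import Any, Dict


def build_pre_extraction_response(
    event: Dict[str, Any], caption: str, visible_text: str
) -> Dict[str, Any]:
    """Build a CDE Lambda response that adds two string attributes.

    `event` is the payload Amazon Kendra passes to the Lambda function;
    only its `s3ObjectKey` field is used here.
    """
    return {
        "version": "v0",  # assumed contract version
        "s3ObjectKey": event["s3ObjectKey"],
        "metadataUpdates": [
            # Hypothetical attribute names, chosen for illustration.
            {"name": "image_caption", "value": {"stringValue": caption}},
            {"name": "image_text", "value": {"stringValue": visible_text}},
        ],
    }


# Hypothetical event and model outputs for illustration only.
example_event = {"s3Bucket": "example-bucket", "s3ObjectKey": "images/dog.jpg"}
response = build_pre_extraction_response(
    example_event,
    caption="a dog laying on the ground under an umbrella",
    visible_text="",
)
print(response["metadataUpdates"][0])
```

In a real handler, the caption would come from the SageMaker endpoint and the visible text from Amazon Textract before this response is returned to Amazon Kendra.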

You can use our solution to search images through Amazon Kendra by following these steps:

  1. Upload images to an image repository like an S3 bucket.
  2. The image repository is then indexed by Amazon Kendra, which is a search engine that can be used to search for structured and unstructured data. During indexing, the Generative AI model as well as Amazon Textract are invoked to generate the image metadata. You can trigger the indexing manually or on a predefined schedule.
  3. You can then search for images using natural language queries, such as “Find images of red roses” or “Show me pictures of dogs playing in the park,” through the Amazon Kendra console, SDK, or API. These queries are processed by Amazon Kendra, which uses ML algorithms to understand the meaning behind the queries and retrieve relevant images from the indexed repository.
  4. The search results are presented to you, along with their corresponding textual descriptions, allowing you to quickly and easily find the images you are looking for.
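The search step through the SDK can be sketched as follows. The function search_images is a hypothetical helper, not part of the solution. It expects a client object that exposes the Amazon Kendra query operation (for example, one created with boto3.client("kendra")), and a stub client with made-up data stands in here so the sketch runs without an AWS account.

```python
from typing import Any, Dict, List


def search_images(kendra_client: Any, index_id: str, query_text: str) -> List[Dict[str, str]]:
    """Run a query and return the ID, title, and excerpt of each result."""
    response = kendra_client.query(IndexId=index_id, QueryText=query_text)
    results = []
    for item in response.get("ResultItems", []):
        results.append(
            {
                "document_id": item.get("DocumentId", ""),
                "title": item.get("DocumentTitle", {}).get("Text", ""),
                "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
            }
        )
    return results


class StubKendraClient:
    """Offline stand-in for a Kendra client, for illustration only."""

    def query(self, IndexId: str, QueryText: str) -> Dict[str, Any]:
        # Made-up result shaped like a Kendra query response.
        return {
            "ResultItems": [
                {
                    "DocumentId": "s3://example-bucket/images/dog-umbrella.jpg",
                    "DocumentTitle": {"Text": "dog-umbrella.jpg"},
                    "DocumentExcerpt": {
                        "Text": "a dog laying on the ground under an umbrella"
                    },
                }
            ]
        }


hits = search_images(StubKendraClient(), "example-index-id", "dog under an umbrella")
for hit in hits:
    print(hit["title"], "->", hit["excerpt"])
```

Passing the client in as an argument keeps the helper easy to try offline. Against a real index, you would replace the stub with a boto3 Kendra client and the ID of your own index.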


Prerequisites

You must have the following prerequisites:

  • An AWS account
  • Permissions to provision and invoke the following services via AWS CloudFormation: Amazon S3, Amazon Kendra, Lambda, and Amazon Textract.

Cost estimate

The cost of deploying this solution as a proof of concept is projected in the following table. To keep costs low, we use Amazon Kendra with the Developer Edition, which is not recommended for production workloads but provides a low-cost option for developers. We assume that the search functionality of Amazon Kendra is used for 20 working days for 3 hours each day, and therefore calculate associated costs for 60 monthly active hours.

| Service | Resources Consumed | Cost Estimate per Month |
| --- | --- | --- |
| Amazon S3 | Storage of 10 GB with data transfer | 2.30 USD |
| Amazon Kendra | Developer Edition with 60 hours/month | 67.90 USD |
| Amazon Textract | 100% detect document text on 10,000 images | 15.00 USD |
| Amazon SageMaker | Real-time inference with ml.g4dn.xlarge for one model deployed on one endpoint for 3 hours every day for 20 days | 44.00 USD |
| Total | | 129.20 USD |
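The arithmetic behind the estimate can be checked with a few lines of Python. The figures are copied from the table above, and Decimal is used only to avoid floating-point rounding in the sum.

```python
from decimal import Decimal

# Monthly cost per service in USD, as listed in the table.
monthly_costs_usd = {
    "Amazon S3": Decimal("2.30"),
    "Amazon Kendra": Decimal("67.90"),
    "Amazon Textract": Decimal("15.00"),
    "Amazon SageMaker": Decimal("44.00"),
}

# 20 working days at 3 hours per day.
active_hours = 20 * 3

total = sum(monthly_costs_usd.values(), Decimal("0"))
print(f"{active_hours} active hours, {total} USD per month")
```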

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys the following resources:

  • A Lambda function that downloads the image captioning model from Hugging Face hub and subsequently builds the model assets
  • A Lambda function that populates the inference code and zipped model artifacts to a destination S3 bucket
  • An S3 bucket for storing the zipped model artifacts and inference code
  • An S3 bucket for storing the uploaded images and Amazon Kendra documents
  • An Amazon Kendra index for searching through the generated image captions
  • A SageMaker real-time inference endpoint for deploying the Hugging Face image captioning model
  • A Lambda function that is triggered while enriching the Amazon Kendra index on demand. It invokes Amazon Textract and a SageMaker real-time inference endpoint.

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, a VPC along with subnets, a security group, and an internet gateway in which the custom resource Lambda function is run.

Complete the following steps to provision your resources:

  1. Choose Launch stack to launch the CloudFormation template in the us-east-1 Region:
    Launch stack
  2. Choose Next.
  3. On the Specify stack details page, leave the template URL and S3 URI of the parameters file at their defaults, then choose Next.
  4. Continue to choose Next on the subsequent pages.
  5. Choose Create stack to deploy the stack.

Monitor the status of the stack. When the status shows as CREATE_COMPLETE, the deployment is complete.
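If you prefer to monitor the stack from a script instead of the console, the polling logic might look like the following sketch. The helper wait_for_stack is hypothetical. It expects a client that exposes the CloudFormation describe_stacks operation (for example, boto3.client("cloudformation")), and a stub client with a scripted status sequence is used here so the example runs offline.

```python
import time
from typing import Any, Dict, List


def wait_for_stack(
    cfn_client: Any,
    stack_name: str,
    target: str = "CREATE_COMPLETE",
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll the stack status until it reaches `target` or clearly fails."""
    for _ in range(max_polls):
        stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
        status = stacks[0]["StackStatus"]
        if status == target:
            return status
        # Treat any failed or rollback status as a failed deployment.
        if status.endswith("FAILED") or "ROLLBACK" in status:
            raise RuntimeError(f"Stack {stack_name} ended in status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Stack {stack_name} did not reach {target} in time")


class StubCloudFormationClient:
    """Offline stand-in that replays a fixed list of statuses."""

    def __init__(self, statuses: List[str]) -> None:
        self._statuses = iter(statuses)

    def describe_stacks(self, StackName: str) -> Dict[str, Any]:
        return {"Stacks": [{"StackName": StackName, "StackStatus": next(self._statuses)}]}


stub = StubCloudFormationClient(["CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
final_status = wait_for_stack(stub, "kendra-genai-image-search", poll_seconds=0)
print(final_status)
```

boto3 also ships a built-in stack_create_complete waiter on the CloudFormation client, which is usually the simpler choice. The explicit loop is shown only to make the status check visible.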

Ingest and search example images

Complete the following steps to ingest and search your images:

  1. On the Amazon S3 console, create a folder called images in the kendra-image-search-stack-imagecaptions S3 bucket in the us-east-1 Region.
  2. Upload the following images to the images folder.

Image of a beach to test with the kendra image search using automated text captioning
Image of a dog celebrating a birthday to test with the kendra image search using automated text captioning
Image of a dog under an umbrella to test with the kendra image search using automated text captioning
Image of a tablet, notebook and coffee on a desk to test with the kendra image search using automated text captioning

  3. Navigate to the Amazon Kendra console in the us-east-1 Region.
  4. In the navigation pane, choose Indexes, then choose your index (kendra-index).
  5. Choose Data sources, then choose generated_image_captions.
  6. Choose Sync now.

Wait for the synchronization to be complete before continuing to the next steps.

  7. In the navigation pane, choose Indexes, then choose kendra-index.
  8. Navigate to the search console.
  9. Try the following queries individually or combined: “dog,” “umbrella,” and “publication,” and find out which images are ranked high by Amazon Kendra.

Feel free to test your own queries that fit the uploaded images.
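The Sync now step in the procedure above can also be started from code. The following sketch is not part of the solution. The helper sync_and_wait is hypothetical, the index and data source IDs are placeholders, and the set of in-progress status values is our assumption about the values Amazon Kendra reports. The two client calls (start_data_source_sync_job and list_data_source_sync_jobs) are Amazon Kendra API operations, and a stub client is used so the sketch runs offline.

```python
import time
from typing import Any, Dict, List

# Status values assumed to mean the sync job is still running.
IN_PROGRESS = {"SYNCING", "SYNCING_INDEXING", "STOPPING"}


def sync_and_wait(
    kendra_client: Any,
    index_id: str,
    data_source_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Start a data source sync job and poll until it leaves the running states."""
    execution_id = kendra_client.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]
    for _ in range(max_polls):
        history = kendra_client.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # Find the job we started; assume it is still syncing if not listed yet.
        status = next(
            (job["Status"] for job in history if job.get("ExecutionId") == execution_id),
            "SYNCING",
        )
        if status not in IN_PROGRESS:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Sync job did not finish in time")


class StubKendraSyncClient:
    """Offline stand-in that replays a fixed list of job statuses."""

    def __init__(self, statuses: List[str]) -> None:
        self._statuses = iter(statuses)

    def start_data_source_sync_job(self, Id: str, IndexId: str) -> Dict[str, str]:
        return {"ExecutionId": "example-execution-id"}

    def list_data_source_sync_jobs(self, Id: str, IndexId: str) -> Dict[str, Any]:
        return {
            "History": [
                {"ExecutionId": "example-execution-id", "Status": next(self._statuses)}
            ]
        }


stub_client = StubKendraSyncClient(["SYNCING", "SYNCING_INDEXING", "SUCCEEDED"])
sync_status = sync_and_wait(
    stub_client, "example-index-id", "example-data-source-id", poll_seconds=0
)
print(sync_status)
```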

Clean up

To deprovision all the resources, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack kendra-genai-image-search and choose Delete.

Wait until the stack status changes to DELETE_COMPLETE.


Conclusion

In this post, we saw how Amazon Kendra and Generative AI can be combined to automate the creation of meaningful metadata for images. State-of-the-art Generative AI models are extremely useful for generating text captions describing the content of an image. This has several industry use cases, ranging from healthcare and life sciences to retail and ecommerce, digital asset platforms, and media. Image captioning is also crucial for building a more inclusive digital world and redesigning the internet, metaverse, and immersive technologies to cater to the needs of visually challenged sections of society.

Image search enabled through captions allows digital content to be easily searchable without manual effort for these applications, and removes duplication efforts. The CloudFormation template we provided makes it straightforward to deploy this solution to enable image search using Amazon Kendra. A simple architecture of images stored in Amazon S3 and Generative AI to create textual descriptions of the images can be used with CDE in Amazon Kendra to power this solution.

This is only one application of Generative AI with Amazon Kendra. To dive deeper into how to build Generative AI applications with Amazon Kendra, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models. For building and scaling Generative AI applications, we recommend checking out Amazon Bedrock.

About the Authors

Charalampos Grouzakis is a Data Scientist within AWS Professional Services. He has over 11 years of experience in developing and leading data science, machine learning, and big data initiatives. Currently he is helping enterprise customers modernize their AI/ML workloads within the cloud using industry best practices. Prior to joining AWS, he was consulting customers in various industries such as Automotive, Manufacturing, Telecommunications, Media & Entertainment, Retail and Financial Services. He is passionate about enabling customers to accelerate their AI/ML journey in the cloud and to drive tangible business outcomes.

Bharathi Srinivasan is a Data Scientist at AWS Professional Services where she loves to build cool things on SageMaker. She is passionate about driving business value from machine learning applications, with a focus on ethical AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.

Tanvi Singhal is a Data Scientist within AWS Professional Services. Her skills and areas of expertise include data science, machine learning, and big data. She supports customers in developing machine learning models and MLOps solutions within the cloud. Prior to joining AWS, she was also a consultant in various industries such as Transportation Networking, Retail and Financial Services. She is passionate about enabling customers on their data/AI journey to the cloud.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with a focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, Generative AI, and NLP. He has around 10 years of experience in building Data & AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work he enjoys making portraits of family & friends, and loves creating artworks.
