How to Implement Image Captioning with Vision Transformer (ViT) and Hugging Face Transformers
Image by Author
Image captioning is a popular multimodal task that combines computer vision and natural language processing. The field has seen considerable research over the years, and the models available today are strong enough to handle a large variety of cases.
In this article, we will explore how to use Hugging Face's transformers library to work with the latest sequence-to-sequence models that pair a Vision Transformer encoder with a GPT-based decoder. We will see how Hugging Face makes it simple to use openly available models to perform image captioning.
Model Selection and Architecture
We use the ViT-GPT2-image-captioning pre-trained model by nlpconnect, available on Hugging Face. Image captioning takes an image as input and outputs a textual description of it. For this task, we use a multimodal model split into two parts: an encoder and a decoder. The encoder takes the raw image pixels as input and uses a neural network to transform them into a one-dimensional compressed latent representation. In the chosen model, the encoder is based on the Vision Transformer (ViT), which applies the state-of-the-art transformer architecture to image patches. The encoder output is then passed as input to a language model called the decoder. The decoder, in our case GPT-2, runs auto-regressively, producing one output token at a time. When the model is trained end-to-end on an image-description dataset, we get an image captioning model that generates tokens describing the image.
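To make this split concrete, here is a minimal inspection sketch (our own addition, assuming the packages installed in the next section are already available); it loads the checkpoint and prints a few fields of its encoder and decoder configurations.

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# The encoder is a ViT that splits the image into fixed-size patches.
print(type(model.encoder).__name__)        # ViTModel
print(model.config.encoder.image_size)     # input resolution the encoder expects
print(model.config.encoder.patch_size)     # side length of each square patch

# The decoder is a GPT-2 language-model head that generates the caption token by token.
print(type(model.decoder).__name__)        # GPT2LMHeadModel
print(model.config.decoder.n_layer)        # number of transformer blocks in the decoder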
Setup and Inference
We first set up a clean Python environment and install all required packages to run the model. In our case, we just need the Hugging Face transformers library, which runs on a PyTorch backend. Run the commands below for a fresh installation:
python -m venv venv
source venv/bin/activate
pip install transformers torch Pillow
From the transformers package, we need to import VisionEncoderDecoderModel, ViTImageProcessor, and AutoTokenizer.
The VisionEncoderDecoderModel provides an implementation to load and run a sequence-to-sequence model in Hugging Face. It makes it easy to load the model and generate tokens using built-in functions. The ViTImageProcessor resizes, rescales, and normalizes the raw image pixels to preprocess them for the ViT encoder. The AutoTokenizer will be used at the end to convert the generated token IDs into human-readable strings.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image
We can now load the open-source model in Python. We load all three components from the pre-trained nlpconnect model. It is trained end-to-end for the image captioning task and performs better as a result of this end-to-end training. However, Hugging Face also provides functionality to load a separate encoder and decoder (a short sketch of this follows the loading code below). Note that the tokenizer must be compatible with the decoder used, since the generated token IDs must match for correct decoding.
MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
feature_extractor = ViTImageProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
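As a contrast to the end-to-end checkpoint above, a model can also be assembled from a separately pre-trained encoder and decoder. The sketch below shows one way to do it; the two checkpoint names are just example choices, and a model assembled this way would still need fine-tuning on an image-description dataset before it captions well.

# Illustrative only: assemble a model from separately pre-trained parts.
custom_model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # example vision encoder checkpoint
    "gpt2",                               # example language-model decoder checkpoint
)
# The tokenizer must match the decoder, and the image processor must match the encoder.
custom_tokenizer = AutoTokenizer.from_pretrained("gpt2")
custom_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")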
Using the above components, we can generate captions for any image with a simple function defined as follows:
def generate_caption(img_path: str):
    # Load the image and make sure it has three color channels.
    i_img = Image.open(img_path).convert("RGB")
    # Preprocess the raw pixels for the ViT encoder.
    pixel_values = feature_extractor(images=i_img, return_tensors="pt").pixel_values
    # Run the encoder-decoder model to generate caption token IDs.
    output_ids = model.generate(pixel_values)
    # Convert the token IDs back into a human-readable string.
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response.strip()
The function takes a local image path and uses the Pillow library to load the image. First, we need to process the image and get the raw pixels that can be passed to the ViT encoder. The feature extractor resizes the image and normalizes the pixel values, returning image pixels of size 224 by 224. This is the standard size for ViT-based architectures, but you can change it based on your model.
The image pixels are then passed to the image captioning model, which automatically applies the encoder-decoder model to output a list of generated token IDs. We use the tokenizer to decode the integer IDs into their corresponding words and obtain the generated image caption.
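If you want to peek at what the function passes between these two steps, the optional sketch below (our own addition, reusing the objects loaded above) runs the same preprocessing on a blank placeholder image and shows how standard decoding options such as max_length and num_beams can be passed to generate(); the values used here are illustrative, not required.

# Optional check on a blank placeholder image (stands in for a real photo).
dummy_img = Image.new("RGB", (640, 480), color="white")
dummy_pixels = feature_extractor(images=dummy_img, return_tensors="pt").pixel_values
print(dummy_pixels.shape)  # typically torch.Size([1, 3, 224, 224]) for this checkpoint

# generate() accepts standard decoding options; beam search often yields more fluent captions.
dummy_ids = model.generate(dummy_pixels, max_length=16, num_beams=4)
print(tokenizer.decode(dummy_ids[0], skip_special_tokens=True))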
Call the above function on any image to try it out!
IMG_PATH = "PATH_TO_IMG_FILE"
response = generate_caption(IMG_PATH)
A sample output is shown below:
Generated Caption: a large elephant standing on top of a lush green field
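The same components also work on several images at once. Below is one possible batched variant; the helper name generate_captions_batch is our own, and it simply mirrors the single-image function above.

def generate_captions_batch(img_paths):
    # Load every image and make sure each one has three color channels.
    images = [Image.open(p).convert("RGB") for p in img_paths]
    # The processor accepts a list of images and stacks them into one batch.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values)
    # batch_decode converts every generated sequence in the batch at once.
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]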
Conclusion
In this article, we explored the basic use of Hugging Face for image captioning tasks. The transformers library provides flexibility and abstraction over the above process, and there is a large database of publicly available models. You can tweak the process in several ways and apply the same pipeline to various models to see what suits you best.
Feel free to try any model and architecture, as new models are pushed every day and you may find better ones each day!
Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.