Multimodal RAG Implementation with Hugging Face


Image by Author | Ideogram

 

Large language models (LLMs) have changed the way many people work. These models generate complex text from simple inputs, and the technology has become standard in many applications, such as chatbots and planner generators.

However, LLMs can hallucinate, meaning the model produces output that is incorrect or not factually grounded. That's why a technique called retrieval-augmented generation (RAG) was developed to improve LLM output.

RAG is a technique that combines retrieval-based methods with an LLM to improve its responses. By fetching the relevant text or documents from an external knowledge base, the LLM can use the retrieved data to generate a more accurate result.

Classically, RAG works only by retrieving and generating text data. However, a few models have now been developed to support multimodal functionality.

This article will explore how to develop a multimodal RAG implementation with Hugging Face, specifically for visual and text data.

Let’s get into it.
 

Multimodal RAG Implementation

 
In this tutorial, we will use Google Colab with access to a GPU. More specifically, we will use the A100 GPU, as the memory requirement for this article is quite high.

Let's start by installing the necessary Python packages. Run the following code for the installation.

!pip install byaldi pdf2image qwen-vl-utils transformers
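
 

Note that pdf2image is only a wrapper around the Poppler command-line tools, which are not installed on Google Colab by default. If the PDF-to-image conversion later fails, installing Poppler first should fix it (assuming an Ubuntu-based runtime):

# Poppler provides the pdftoppm/pdftocairo binaries that pdf2image calls
!apt-get install -y poppler-utils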

 

With the packages installed, we will build our knowledge base. For this example, we will use a collection of PDF design guides about building design.

import requests
import os

pdfs = {
    "Window": "https://www.westoxon.gov.uk/media/ksqgvl4b/10-design-guide-windows-and-doors.pdf",
    "Roofs": "https://www.westoxon.gov.uk/media/d3ohnpd1/9-design-guide-roofs-and-roofing-materials.pdf",
    "Extensions": "https://www.westoxon.gov.uk/media/pekfogvr/14-design-guide-extensions-and-alterations.pdf",
    "Greener": "https://www.westoxon.gov.uk/media/thplpsay/16-design-guide-greener-traditional-buildings.pdf",
    "Sustainable": "https://www.westoxon.gov.uk/media/nk5bvv0v/12-design-guide-sustainable-building-design.pdf"
}

output_dir = "dataset"
os.makedirs(output_dir, exist_ok=True)

# Download each PDF and save it into the dataset folder
for title, url in pdfs.items():
    response = requests.get(url)
    pdf_path = os.path.join(output_dir, f"{title}.pdf")

    with open(pdf_path, "wb") as f:
        f.write(response.content)
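
 

The loop above assumes every download succeeds. As an optional safeguard, you can make failed downloads raise an error before anything is written to disk; a minimal variation:

for title, url in pdfs.items():
    response = requests.get(url)
    response.raise_for_status()  # fail fast on HTTP errors
    pdf_path = os.path.join(output_dir, f"{title}.pdf")

    with open(pdf_path, "wb") as f:
        f.write(response.content)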

 

Once all the files are downloaded, we transform every PDF page into an image, since our multimodal document-retrieval model needs to represent the documents as images.

import os
from pdf2image import convert_from_path

def convert_pdfs_to_images(folder):
    # Collect the PDF files and render each page to a PIL image
    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        images = convert_from_path(pdf_path, dpi=100)
        all_images[doc_id] = images

    return all_images

all_images = convert_pdfs_to_images("/content/dataset/")
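
 

The doc_id keys simply follow the order in which os.listdir returns the PDF files, so it can help to keep a small mapping from doc_id to filename for later inspection; an optional sketch (the dictionary name is illustrative):

# Map doc_id back to the original PDF filename, using the same listing order as above
pdf_files = [f for f in os.listdir("/content/dataset/") if f.endswith('.pdf')]
doc_id_to_name = {doc_id: name for doc_id, name in enumerate(pdf_files)}
print(doc_id_to_name)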

 

All of the documents are now transformed into image files, so we can inspect their content in image format.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 4, figsize=(15, 10))

for i, ax in enumerate(axes.flat):
    img = all_images[0][i]
    ax.imshow(img)
    ax.axis('off')

plt.tight_layout()
plt.show()

 
 


 
Next, we will initialize the RAG system with Byaldi and the document-retrieval model ColPali. ColPali is a retrieval model that fetches documents by working on the page images directly instead of breaking them down through a text-chunking process.

We will use the Byaldi package, a simple wrapper around ColPali, to facilitate the RAG implementation. Let's use the code below for that.

from byaldi import RAGMultiModalModel

colpali_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

 

Once the model has been downloaded, we will use the following code to index our image data and build the knowledge base.

colpali_model.index(
    input_path="dataset/",
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True
)
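
 

Indexing can take a while, as every page is embedded by ColPali. Byaldi persists the index to disk (under a .byaldi/ folder by default), so in a later session you should be able to reload it instead of rebuilding it; a sketch, assuming your Byaldi version exposes the from_index loader:

# Reload a previously built index instead of re-indexing the PDFs
colpali_model = RAGMultiModalModel.from_index("image_index")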

 

With the retrieval model ready, let's try out how it retrieves documents for a text query.

query = "How should we design greener and sustainable houses?"

results = colpali_model.search(query, k=2)
results

 

Output:

[{'doc_id': 1, 'page_num': 3, 'score': 12.0625, 'metadata': {}, 'base64': None},
 {'doc_id': 1, 'page_num': 9, 'score': 11.875, 'metadata': {}, 'base64': None}]

 

Let's take a look at the documents retrieved in the output above. Note that base64 is None because we set store_collection_with_index=False when indexing, so we map the results back to our own page images instead.

import matplotlib.pyplot as plt

def get_result_images(results, all_images):
    # Collect the page images referenced by the search results
    grouped_images = []

    for result in results:
        doc_id = result['doc_id']
        page_num = result['page_num']
        grouped_images.append(all_images[doc_id][page_num - 1])
    return grouped_images

result_images = get_result_images(results, all_images)

fig, axes = plt.subplots(1, 2, figsize=(15, 10))

for i, ax in enumerate(axes.flat):
    img = result_images[i]
    ax.imshow(img)
    ax.axis('off')

plt.tight_layout()
plt.show()

 

 


 
The retrieval model successfully retrieves the most relevant documents for our query.

Next, we will use Qwen2-VL as our generative model. Qwen2-VL is a vision-language model that can understand our images and produce text output. To do that, we will use the following code.

from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from qwen_vl_utils import process_vision_info
import torch

vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
)
vl_model.cuda().eval()
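
 

If you are not on an A100 and run into GPU memory limits, one common alternative is to let Hugging Face Accelerate place the weights for you instead of calling .cuda() manually; a sketch of that variation (it assumes the accelerate package is installed):

vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
).eval()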

 

Next, we set up the Qwen2-VL image processor and set the pixel bounds for GPU memory optimization. The min_pixels and max_pixels values limit how large or small each page image is resized before it is encoded.

min_pixels = 256*256
max_pixels = 1024*1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)

 

Then, we will create the chat structure for our generative model.

chat_template = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": result_images[0],
            },
            {
                "type": "image",
                "image": result_images[1],
            },
            {
                "type": "text",
                "text": query
            },
        ],
    }
]

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)
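
 

If you are curious what the processor will actually feed the model, you can print the rendered prompt; it contains the chat markup plus placeholder tokens where the two page images will be injected:

# Inspect the chat-formatted prompt string
print(text)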

 

Finally, we will set up the input processing that turns the images and text into model inputs.

image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

 

When everything is ready, we will try out the multimodal RAG system.

generated_ids = vl_model.generate(**inputs, max_new_tokens=100) 

generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0]) 

 

Output:

To design greener and sustainable houses, we should consider the following principles:

1. **Minimizing the use of scarce resources**: Use building materials, fossil fuels, and water efficiently.
2. **Economic operation**: Ensure the building is cost-effective throughout its life cycle and aligns with the needs of the local community.
3. **Energy and carbon efficiency**: Design the building to minimize energy consumption with effective insulation, heating, and cooling systems.
4. **Preserving and enhancing site character

 

The result is good and follows the PDFs we provided earlier. To keep the output short, we use a maximum of 100 new tokens, but you can always increase it. Also, I only pass the top two retrieved page images; you can increase that number to improve the output accuracy.
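
To experiment more quickly, the whole retrieve-and-generate flow can be wrapped into a single helper that reuses the objects defined above. This is only a convenience sketch (the function name, defaults, and example question are my own, not part of the libraries):

def answer_query(query, k=2, max_new_tokens=100):
    # Retrieve the top-k page images for the query
    results = colpali_model.search(query, k=k)
    images = get_result_images(results, all_images)

    # Build the multimodal chat prompt from the retrieved pages plus the question
    chat = [{
        "role": "user",
        "content": [{"type": "image", "image": img} for img in images]
                   + [{"type": "text", "text": query}],
    }]
    prompt = vl_model_processor.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(chat)
    inputs = vl_model_processor(
        text=[prompt], images=image_inputs, padding=True, return_tensors="pt"
    ).to("cuda")

    # Generate and decode only the newly produced tokens
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return vl_model_processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

print(answer_query("What roofing materials are recommended for traditional buildings?"))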

That's all you need to know about setting up multimodal RAG. You can always try out other parameters and models to improve your results.
 

Conclusion

 
Retrieval-augmented generation, or RAG, is a technique that combines retrieval-based methods with an LLM to improve its responses. Usually, it works only with text data, but this article explored the possibility of using image data as input.

By combining ColPali and the Qwen2-VL series, we established a RAG system that accepts both image and text data and can answer our query.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
