A Coding Implementation to Build a Document Search Agent (DocSearchAgent) with Hugging Face, ChromaDB, and Langchain

In today's information-rich world, finding relevant documents quickly is crucial. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:
- Hugging Face's embedding models to convert text into rich vector representations
- Chroma DB as our vector database for efficient similarity search
- Sentence transformers for high-quality text embeddings
This implementation enables semantic search capabilities, finding documents based on meaning rather than just keyword matching. By the end of this tutorial, you'll have a working document search engine that can:
- Process and embed text documents
- Store these embeddings efficiently
- Retrieve the most semantically similar documents to any query
- Handle a variety of document types and search needs
Please follow the detailed steps below in sequence to implement DocSearchAgent.
First, we need to install the required libraries.
!pip install chromadb sentence-transformers langchain datasets
Let's start by importing the libraries we'll use:
import os
import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time
For this tutorial, we'll use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.
dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")
print(f"Loaded {len(dataset)} Wikipedia articles")
documents = []
for i, article in enumerate(dataset):
    doc = {
        "id": f"doc_{i}",
        "title": article["title"],
        "text": article["text"],
        "url": article["url"]
    }
    documents.append(doc)

df = pd.DataFrame(documents)
df.head(3)
Now, let's split our documents into smaller chunks for more granular searching:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = []
chunk_ids = []
chunk_sources = []
for i, doc in enumerate(documents):
    doc_chunks = text_splitter.split_text(doc["text"])
    chunks.extend(doc_chunks)
    chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))])
    chunk_sources.extend([doc["title"]] * len(doc_chunks))

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
We'll use a pre-trained sentence transformer model from Hugging Face to create our embeddings:
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
sample_text = "It is a pattern textual content to check our embedding mannequin."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")
Now, let's set up Chroma DB, a lightweight vector database perfect for our search engine:
chroma_client = chromadb.Client()

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

collection = chroma_client.create_collection(
    name="document_search",
    embedding_function=embedding_function
)
batch_size = 100
for i in range(0, len(chunks), batch_size):
    end_idx = min(i + batch_size, len(chunks))
    batch_ids = chunk_ids[i:end_idx]
    batch_chunks = chunks[i:end_idx]
    batch_sources = chunk_sources[i:end_idx]

    collection.add(
        ids=batch_ids,
        documents=batch_chunks,
        metadatas=[{"source": source} for source in batch_sources]
    )
    print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")

print(f"Total documents in collection: {collection.count()}")
Now comes the exciting part: searching through our documents:
def search_documents(query, n_results=5):
    """
    Search for documents similar to the query.

    Args:
        query (str): The search query
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    start_time = time.time()

    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )

    end_time = time.time()
    search_time = end_time - start_time
    print(f"Search completed in {search_time:.4f} seconds")

    return results
queries = [
    "What are the effects of climate change?",
    "History of artificial intelligence",
    "Space exploration missions"
]
for query in queries:
    print(f"\nQuery: {query}")
    results = search_documents(query)

    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        print(f"\nResult {i+1} from {metadata['source']}:")
        print(f"{doc[:200]}...")
Let's create a simple function to provide a better user experience:
def interactive_search():
    """
    Interactive search interface for the document search engine.
    """
    while True:
        query = input("\nEnter your search query (or 'quit' to exit): ")

        if query.lower() == 'quit':
            print("Exiting search interface...")
            break

        n_results = int(input("How many results would you like? "))

        results = search_documents(query, n_results)

        print(f"\nFound {len(results['documents'][0])} results for '{query}':")

        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            relevance = 1 - distance  # Chroma returns distances; convert to a rough relevance score

            print(f"\n--- Result {i+1} ---")
            print(f"Source: {metadata['source']}")
            print(f"Relevance: {relevance:.2f}")
            print(f"Excerpt: {doc[:300]}...")
            print("-" * 50)

interactive_search()
Let's add the ability to filter our search results by metadata:
def filtered_search(query, filter_source=None, n_results=5):
    """
    Search with optional filtering by source.

    Args:
        query (str): The search query
        filter_source (str): Optional source to filter by
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    where_clause = {"source": filter_source} if filter_source else None

    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause
    )

    return results
unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])

if len(unique_sources) > 0:
    filter_source = unique_sources[0]
    query = "important concepts and principles"
    print(f"\nFiltered search for '{query}' in source '{filter_source}':")
    results = filtered_search(query, filter_source=filter_source)

    for i, doc in enumerate(results['documents'][0]):
        print(f"\nResult {i+1}:")
        print(f"{doc[:200]}...")
In conclusion, we demonstrate how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. The system retrieves documents based on meaning rather than just keywords by transforming text into vector representations. The implementation processes Wikipedia articles, chunks them for granularity, embeds them using sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive searching, metadata filtering, and relevance ranking.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.