Creating a Useful Voice-Activated Fully Local RAG System


Image by Author | Ideogram.ai

 

RAG, or retrieval augmented generation, is a technique that uses external knowledge as additional context passed into the large language model (LLM) to improve the model's accuracy and relevance. It is a much more reliable way to improve generative AI results than constantly fine-tuning the model.

Traditionally, RAG systems rely on user text queries to search the vector database. The relevant documents retrieved are then used as context input for the generative AI, which produces the result in text format. However, we can extend them even further so that they accept and produce output in voice form.
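To make that flow concrete, here is a minimal sketch of a text-only RAG loop; the function names vector_db_search and llm_generate are placeholders for illustration, not part of the project code below.

# Illustrative text-only RAG flow (placeholder function names)
def answer_with_rag(question: str) -> str:
    context_chunks = vector_db_search(question, top_k=3)  # retrieve relevant chunks
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)  # generate an answer grounded in the retrieved context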

This article will explore setting up the RAG system and making it fully voice-activated.

 

Building a Fully Voice-Activated RAG System

 
In this article, I will assume the reader has some knowledge about LLMs and RAG systems, so I will not explain them further.

To build a RAG system with full voice features, we will structure it around three key components:

  1. Voice Receiver and Transcription
  2. Knowledge Base
  3. Audio File Response Generation

Overall, the project workflow will follow the image below.

[Project workflow diagram]

If you are ready, let's get started setting up everything we need for this project to succeed.

First, we won’t use the Pocket book IDE for this mission, as we would like the RAG system to work like issues in manufacturing.
Subsequently, a typical programming language IDE similar to Visible Studio Code (VS Code) must be ready.

Next, we also want to create a virtual environment for our project. You can use any method, such as Python's venv or Conda.

python -m venv rag-env-audio
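
If you prefer Conda, a roughly equivalent setup would look like the following (the Python version here is just an example; pick whichever recent version you use):

conda create -n rag-env-audio python=3.10
conda activate rag-env-audio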

 

With your virtual environment ready, we must install all the necessary libraries for this tutorial.

pip install openai-whisper chromadb sentence-transformers sounddevice numpy scipy PyPDF2 transformers torch langchain-core langchain-community

 

If you have GPU access, you can also install the GPU version of the PyTorch library.

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
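
To confirm that the GPU build is actually usable, a quick check is:

python -c "import torch; print(torch.cuda.is_available())"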

 

With everything ready, we will start building our voice-activated RAG system. As a note, the project repository containing all the code and the dataset is available in this repository.

We will start by importing all the necessary libraries and setting the variables with the following code.

import os
import whisper
import chromadb
from sentence_transformers import SentenceTransformer
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter  
import torch

AUDIO_FILE = "user_input.wav"
RESPONSE_AUDIO_FILE = "response.wav"  
PDF_FILE = "Insurance_Handbook_20103.pdf"  
SAMPLE_RATE = 16000
WAKE_WORD = "Hello"  
SIMILARITY_THRESHOLD = 0.4  
MAX_ATTEMPTS = 5 

 

Each variable will be explained when it is used in its respective code. For now, let's keep them as they are.

After importing all the necessary libraries, we will set up all the functions required for the RAG system. I will break them down individually so you can understand what happens in our project.

The first step is to create a feature that records our voice input and transcribes it into text data. We will use the sounddevice library for audio recording and OpenAI Whisper for transcription.

# For recording audio input
def record_audio(filename, duration=5, samplerate=SAMPLE_RATE):
    print("Listening... Speak now!")
    audio = sd.rec(int(duration * samplerate), samplerate=samplerate, channels=1, dtype="float32")
    sd.wait()
    print("Recording finished.")
    write(filename, samplerate, (audio * 32767).astype(np.int16))

# Transcribe the input audio into text
def transcribe_audio(filename):
    print("Transcribing audio...")
    model = whisper.load_model("base.en")
    result = model.transcribe(filename)
    return result["text"].strip().lower()

 

The functions above become the foundation for accepting our voice and returning it as text data. We will use them multiple times throughout the project, so keep them in mind.
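
As a quick standalone test (assuming a working microphone; the filename here is arbitrary), you can record a short clip and print its transcription:

record_audio("mic_test.wav", duration=3)
print(transcribe_audio("mic_test.wav"))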

With the functions to accept audio ready, we will create an entry feature for our RAG system. In the next code, we create a voice-activation function that gates access to the system using WAKE_WORD. The wake word can be anything; set it up as required.

The idea behind the voice activation is that the RAG system is activated if the recorded, transcribed voice matches the wake word. However, requiring the transcription to match the wake word exactly is not feasible, as the transcription system is very likely to produce the text in a slightly different form. We could standardize the transcribed output for that, but for now the idea is to use embedding similarity so the system is still activated even with slightly different word compositions.

# Detect the wake word to activate the RAG system
def detect_wake_word(max_attempts=MAX_ATTEMPTS):
    print("Waiting for wake word...")
    text_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    wake_word_embedding = text_embedding_model.encode(WAKE_WORD).reshape(1, -1)

    attempts = 0
    while attempts < max_attempts:
        # Record a short clip and transcribe it
        record_audio(AUDIO_FILE, duration=3)
        transcription = transcribe_audio(AUDIO_FILE)
        print(f"Heard: {transcription}")

        # Compare the transcription to the wake word using embedding similarity
        transcription_embedding = text_embedding_model.encode(transcription).reshape(1, -1)
        similarity = cosine_similarity(wake_word_embedding, transcription_embedding)[0][0]
        if similarity >= SIMILARITY_THRESHOLD:
            print(f"Wake word detected: {WAKE_WORD}")
            return True
        attempts += 1
        print(f"Attempt {attempts}/{max_attempts}. Please try again.")
    print("Wake word not detected. Exiting.")
    return False

 

By combining the WAKE_WORD and SIMILARITY_THRESHOLD variables, we end up with the voice-activation feature.
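
As a rough illustration of why embedding similarity is more forgiving than exact matching, you can compare the wake word against a slightly different phrase (the phrases below are just examples):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
wake = model.encode("Hello").reshape(1, -1)
heard = model.encode("hello there").reshape(1, -1)
print(cosine_similarity(wake, heard)[0][0])  # typically well above the 0.4 threshold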
Next, let's build our knowledge base using our PDF file. To do that, we will prepare a function that extracts text from the file and splits it into chunks.

def load_and_chunk_pdf(pdf_file):
    from PyPDF2 import PdfReader
    print("Loading and chunking PDF...")
    reader = PdfReader(pdf_file)
    all_text = ""
    for page in reader.pages:
        text = page.extract_text()
        if text:
            all_text += text + "\n"

    # Split the text into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=250,  # Size of each chunk
        chunk_overlap=50,  # Overlap between chunks to maintain context
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_text(all_text)
    return chunks

 

You can change the chunk size to suit your needs. There are no exact numbers to use, so experiment to see which parameters work best.
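
For example, a quick way to compare settings is to print how many chunks a given configuration produces and preview one of them:

chunks = load_and_chunk_pdf(PDF_FILE)
print(f"Number of chunks: {len(chunks)}")
print(chunks[0][:200])  # preview the first chunk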

The chunks from the function above will then be passed into the vector database. We will use the ChromaDB vector database and SentenceTransformer to access the embedding model.

def setup_chromadb(chunks):
    print("Setting up ChromaDB...")
    client = chromadb.PersistentClient(path="chroma_db")
    text_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    # Delete the existing collection (if needed)
    try:
        client.delete_collection(name="knowledge_base")
        print("Deleted existing collection: knowledge_base")
    except Exception as e:
        print(f"Collection does not exist or could not be deleted: {e}")

    collection = client.create_collection(name="knowledge_base")

    for i, chunk in enumerate(chunks):
        embedding = text_embedding_model.encode(chunk).tolist()
        collection.add(
            ids=[f"chunk_{i}"],
            embeddings=[embedding],
            metadatas=[{"source": "pdf", "chunk_id": i}],
            documents=[chunk]
        )
    print("Text chunks and embeddings stored in ChromaDB.")
    return collection

Additionally, we will prepare the function that retrieves relevant chunks from ChromaDB given a text query.

def query_chromadb(collection, query, top_k=3):
    """Query ChromaDB for relevant chunks."""
    text_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = text_embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    relevant_chunks = [chunk for sublist in results["documents"] for chunk in sublist]
    return relevant_chunks
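
A minimal sanity check of the retrieval step (the question below is just an example against the insurance handbook) could look like this:

collection = setup_chromadb(load_and_chunk_pdf(PDF_FILE))
for chunk in query_chromadb(collection, "What is a deductible?", top_k=3):
    print(chunk[:120])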

 

Then, we need to prepare the generation feature to complete the RAG system. In this case, I will use the Qwen1.5-0.5B-Chat model hosted on Hugging Face. You can tweak the prompt and the generation model as you please.

def generate_response(query, context_chunks):

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_name = "Qwen/Qwen1.5-0.5B-Chat"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Format the prompt with the query and context
    context = "\n".join(context_chunks)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Use the following context to answer the question:\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"}
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer(
        [text],
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(device)

    # Generate the response
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

 

Finally, the exciting part: transforming the generated response into an audio file with a text-to-speech model. For our example, we will use the Suno Bark model hosted on Hugging Face. After generating the audio, we will play the audio response to complete the pipeline.

def text_to_speech(text, output_file):
    from transformers import AutoProcessor, BarkModel
    print("Generating speech...")

    processor = AutoProcessor.from_pretrained("suno/bark-small")
    model = BarkModel.from_pretrained("suno/bark-small")

    inputs = processor(text, return_tensors="pt")

    audio_array = model.generate(**inputs)
    audio = audio_array.cpu().numpy().squeeze()

    # Save the audio to a file
    write(output_file, 22050, (audio * 32767).astype(np.int16))
    print(f"Audio response saved to {output_file}")
    return audio

def play_audio(audio, samplerate=22050):
    print("Playing response...")
    sd.play(audio, samplerate=samplerate)
    sd.wait()

 

Those are all the features we need to complete the fully voice-activated RAG pipeline. Let's combine them into a cohesive structure.

def main():
    # Step 1: Load and chunk the PDF
    chunks = load_and_chunk_pdf(PDF_FILE)

    # Step 2: Set up ChromaDB
    collection = setup_chromadb(chunks)

    # Step 3: Detect the wake word with embedding similarity
    if not detect_wake_word():
        return  # Exit if the wake word is not detected

    # Step 4: Record and transcribe the user input
    record_audio(AUDIO_FILE, duration=5)
    user_input = transcribe_audio(AUDIO_FILE)
    print(f"User Input: {user_input}")

    # Step 5: Query ChromaDB for relevant chunks
    relevant_chunks = query_chromadb(collection, user_input)
    print(f"Relevant Chunks: {relevant_chunks}")

    # Step 6: Generate a response using a Hugging Face model
    response = generate_response(user_input, relevant_chunks)
    print(f"Generated Response: {response}")

    # Step 7: Convert the response to speech, save it, and play it
    audio = text_to_speech(response, RESPONSE_AUDIO_FILE)
    play_audio(audio)

    # Clean up
    os.remove(AUDIO_FILE)  # Delete the temporary audio file

if __name__ == "__main__":
    main()

 

I have saved the whole code in a script called app.py, and we can activate the system with the following command.
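
python app.py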

 

Try it yourself, and you will get a response audio file that you can use for review.

That's all you need to build a local RAG system with voice activation. You can evolve the project even further by building an application on top of the system and deploying it to production.

 

Conclusion

 
Building a RAG system with voice activation involves a series of advanced techniques and multiple models working together as one. On top of the retrieval and generative features used to build the RAG system, this project adds another layer by embedding audio capability at several steps. With the foundation we have built, you can evolve the project even further depending on your needs.
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
