Creating a Qwen-Powered Lightweight Personal Assistant


Creating a Lightweight Personal Assistant Powered by a Qwen Language Model
Image by Editor | Midjourney
Introduction
The Qwen family of language models offers powerful, open-source large language models for a variety of natural language processing tasks.
This article shows you how to set up and run a personal assistant application in Python powered by a Qwen model — specifically the Qwen1.5-7B-Chat
model, an efficient and relatively lightweight 7-billion-parameter chat model optimized for conversational use cases. The code shown is ready to be used in a Python notebook such as Google Colab, but can easily be adapted to run locally if preferred.
Coding Solution
Since building a Qwen-powered assistant requires several dependencies and libraries to be installed, we start by installing them and verifying the installed versions, to ensure compatibility, as much as possible, with any versions you may have pre-installed.

pip install -q transformers accelerate bitsandbytes einops ipywidgets

We also enable GPU use, if available, to ensure faster model inference the first time the model is called during execution.
These preliminary setup steps are shown in the code below:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
import sys
import os

# Verifying installed packages and dependencies
try:
    import bitsandbytes as bnb
    print("Successfully imported bitsandbytes")
except ImportError:
    print("Error importing bitsandbytes. Attempting to install again...")
    !pip install -q bitsandbytes --upgrade
    import bitsandbytes as bnb

# Installing required packages (you may want to comment the line below if you already have these installed)
!pip install -q transformers accelerate bitsandbytes einops

# Set device, prioritizing GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
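If you want to double-check the environment, a minimal sketch like the one below prints the versions that actually got installed (assuming the imports above succeeded; these __version__ attributes are standard for these packages):

# Optional: print installed versions for a quick compatibility check
import transformers
print(f"transformers: {transformers.__version__}")
print(f"torch: {torch.__version__}")
print(f"bitsandbytes: {bnb.__version__}")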
Now it's time to load and configure the model:
- We use Qwen/Qwen1.5-7B-Chat, which allows for faster first-time inference compared to heavier models like Qwen2.5-Omni, which is a real powerhouse but not as lightweight as other versions of this family of models.
- As usual when loading a pre-trained language model, we need a tokenizer that converts text inputs into a format the model can read. Fortunately, the AutoTokenizer from Hugging Face's Transformers library smooths this process.
- To enhance efficiency, we try to configure 4-bit quantization, which optimizes memory usage.
# Load Qwen1.5-7B-Chat model - publicly available and efficient to run in Google Colab with a T4 GPU
model_name = "Qwen/Qwen1.5-7B-Chat"

print(f"Loading {model_name}...")
start_time = time.time()

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Trying to load the model with 4-bit quantization for efficiency
try:
    print("Attempting to load model with 4-bit quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance
        device_map="auto",
        trust_remote_code=True,
        quantization_config={"load_in_4bit": True}  # 4-bit quantization for memory efficiency
    )
except Exception as e:
    print(f"4-bit quantization failed with error: {str(e)}")
    print("Falling back to 8-bit quantization...")
    try:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True,
            load_in_8bit=True  # Try 8-bit quantization instead
        )
    except Exception as e2:
        print(f"8-bit quantization failed with error: {str(e2)}")
        print("Falling back to standard loading (will use more memory)...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True
        )

load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds")
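Note that passing a plain dictionary as quantization_config works in older transformers releases, but recent versions expect an explicit BitsAndBytesConfig object. A minimal sketch of that variant (shown here as an alternative, not as part of the original fallback logic) could look like this:

# Alternative: explicit BitsAndBytesConfig object (expected by recent transformers releases)
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized operations
    bnb_4bit_quant_type="nf4"               # commonly used 4-bit quantization type
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)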
When building our own conversational assistant, it is often good practice to craft a default system prompt that accompanies each specific request, adapting the model's behavior and generated responses to our needs. Here is one possible default prompt:
system_prompt = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should be engaging and fun.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""
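To see how this system prompt gets woven into an actual request, you can render the chat template without generating anything. The following is a small illustrative sketch (the exact output format depends on the tokenizer's built-in template):

# Preview how the tokenizer formats the system prompt plus a user turn
preview = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Hello!"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(preview)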
The next function we will define encapsulates the heaviest part of the execution flow, as this is where the model takes the user input and is called to perform inference and generate a response. Importantly, we will run a conversation in which we can sequentially make multiple requests; therefore, it is important to manage the chat history accordingly and incorporate it as part of each new request.
def generate_response(user_input, chat_history=None):
    if chat_history is None:
        chat_history = []

    # Formatting the conversation for the model
    messages = [{"role": "system", "content": system_prompt}]

    # Adding chat history for a full context of the conversation
    for message in chat_history:
        messages.append(message)

    # Adding the current user input
    messages.append({"role": "user", "content": user_input})

    # Tokenization: converting messages to model input format
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generating response: this part may take a while to execute at first
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id
        )

    # Decoding the generated response
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extracting only the assistant's response, instead of the full raw output
    assistant_response = full_response.split(user_input)[-1].strip()

    # Further cleaning up the response if it contains role markers or other artifacts
    if "assistant" in assistant_response.lower()[:20]:
        assistant_response = assistant_response.split(":", 1)[-1].strip()

    return assistant_response
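Before wiring up any interface, you can sanity-check generate_response on its own. The snippet below is a hypothetical usage sketch (the questions are placeholders) that maintains the chat history manually, exactly as the interfaces defined next will do:

# Hypothetical direct usage: ask a question, record the exchange, then follow up
history = []
question = "Suggest three weekend project ideas."
reply = generate_response(question, history)
print(reply)

# Store the exchange so the next request carries the full conversation context
history.append({"role": "user", "content": question})
history.append({"role": "assistant", "content": reply})

print(generate_response("Tell me more about the second idea.", history))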
Once the key function to generate responses has been defined, we can build a simple user interface to run and interact with the assistant.
The interface will consist of an output display area that shows the conversation, an input text box where the user can ask questions, and two buttons for sending a request and clearing the chat. Notice the use of the ipywidgets library for these elements.
# Create a simple UI for the personal assistant
def create_assistant_ui():
    output = widgets.Output()
    input_box = widgets.Text(
        value='',
        placeholder='Ask me anything...',
        description='Question:',
        layout=widgets.Layout(width='80%')
    )
    send_button = widgets.Button(description="Send")
    clear_button = widgets.Button(description="Clear Chat")

    chat_history = []

    def on_send_button_clicked(b):
        user_input = input_box.value
        if not user_input.strip():
            return

        with output:
            print(f"You: {user_input}")

            # Show thinking indicator
            print("Assistant: Thinking...", end="\r")

            # Generate response
            start_time = time.time()
            try:
                response = generate_response(user_input, chat_history)
                end_time = time.time()

                # Clear the "thinking" message
                clear_output(wait=True)

                # Display the exchange
                print(f"You: {user_input}")
                print(f"Assistant: {response}")
                print(f"\n(Response generated in {end_time - start_time:.2f} seconds)")

                # Update chat history
                chat_history.append({"role": "user", "content": user_input})
                chat_history.append({"role": "assistant", "content": response})
            except Exception as e:
                clear_output(wait=True)
                print(f"You: {user_input}")
                print(f"Error generating response: {str(e)}")
                import traceback
                traceback.print_exc()

        # Clear input box
        input_box.value = ''

    def on_clear_button_clicked(b):
        with output:
            clear_output()
            print("Chat cleared!")
        chat_history.clear()

    # Connect button clicks to functions
    send_button.on_click(on_send_button_clicked)
    clear_button.on_click(on_clear_button_clicked)

    # Handle Enter key in input box
    def on_enter(sender):
        on_send_button_clicked(None)
    input_box.on_submit(on_enter)

    # Arrange UI components
    input_row = widgets.HBox([input_box, send_button, clear_button])
    ui = widgets.VBox([output, input_row])

    return ui
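Note that create_assistant_ui only builds and returns the widget tree; nothing is rendered until it is displayed. If you want to launch the interface on its own in a notebook cell (the run_assistant function defined later does this for you after a quick test), it is as simple as:

# Render the assistant UI directly in a notebook cell
assistant_ui = create_assistant_ui()
display(assistant_ui)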
Alternatively, we can also set up the option of using a command-line interface (CLI) for the chat workflow:
# Example of a simpler way to use the model (command-line interface)
def cli_chat():
    print("\n=== Starting CLI Chat (type 'exit' to quit) ===")
    chat_history = []

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ['exit', 'quit', 'q']:
            print("Goodbye!")
            break

        print("Assistant: ", end="")
        try:
            start_time = time.time()
            response = generate_response(user_input, chat_history)
            end_time = time.time()

            print(f"{response}")
            print(f"(Generated in {end_time - start_time:.2f} seconds)")

            # Update chat history
            chat_history.append({"role": "user", "content": user_input})
            chat_history.append({"role": "assistant", "content": response})
        except Exception as e:
            print(f"Error: {str(e)}")
            import traceback
            traceback.print_exc()
Almost done. We will define two final functions:
- One for performing a quick test to ensure that both the model and dependencies are set up properly.
- An overarching function to run the entire conversational assistant application. Here, the user can choose which kind of interface to use (UI vs. CLI).
# Trying a simple test query to make sure everything is working
def quick_test():
    test_question = "What can you help me with?"
    print(f"\nTest Question: {test_question}")

    start_time = time.time()
    try:
        response = generate_response(test_question)
        end_time = time.time()

        print(f"Response: {response}")
        print(f"Generation time: {end_time - start_time:.2f} seconds")
        return True
    except Exception as e:
        print(f"Test failed with error: {str(e)}")
        import traceback
        traceback.print_exc()  # Print the full stack trace for debugging
        return False

# Overarching function for our application: we can choose here which interface to use
def run_assistant():
    print("\nRunning quick test...")
    test_success = quick_test()

    if test_success:
        # Ask user which interface they prefer
        interface_choice = input("\nChoose interface (1 for UI, 2 for CLI): ")

        if interface_choice == "2":
            cli_chat()
        else:
            print("\nStarting the personal assistant UI...")
            assistant_ui = create_assistant_ui()
            display(assistant_ui)

            # Usage instructions
            print("\n--- Instructions ---")
            print("1. Type your question in the text box")
            print("2. Press Enter or click 'Send'")
            print("3. Wait for the assistant's response")
            print("4. Click 'Clear Chat' to start a new conversation")
            print("----------------------")
    else:
        print("\nSkipping UI launch due to test failure.")
        print("You may want to try the CLI interface by calling cli_chat() directly")

# Running the conversational assistant
run_assistant()
Trying It Out
If everything has gone well, it is now time to have fun and interact with our newly built assistant. Here is an example excerpt of the conversational workflow.
Running quick test...

Test Question: What can you help me with?
Response: 1. General knowledge: I can provide information on a wide range of topics, from history and science to pop culture, current events, and more.
2. Problem-solving: Need help with a math problem, figuring out how to do something, or troubleshooting an issue? I'm here to guide you.
3. Research: If you have a specific topic or question in mind, I can help you find reliable sources and summarize the information for you.
4. Language assistance: Need help with writing, grammar, spelling, or translation? I can assist with that.
5. Fun facts and trivia: Want to impress your friends with fascinating facts or just looking for a good laugh? I've got you covered!
6. Time management and organization: Strategies to help you stay on top of your tasks and projects.
7. Personal development: Tips for learning new skills, setting goals, or managing your emotions.
Just let me know what you need, and I'll do my best to assist you! Remember, I can't always give away all the answers, but I'll certainly try to make the process as enjoyable and informative as possible.
Generation time: 18.04 seconds
Choose interface (1 for UI, 2 for CLI):
Below is an example of a live interaction through the UI.

Qwen-based conversational assistant’s UI
Image by Author
Conclusion
In this article, we demonstrated how to build a simple conversational assistant application powered by a lightweight yet powerful Qwen language model. The application is designed to run and be tried out efficiently in a GPU setting like those provided by Google Colab notebook environments.