Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)
by Maarten Grootendorst, Nov 2023


Exploring Pre-Quantized Large Language Models

Throughout the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we now have many different standards and ways of working with LLMs.

In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.

Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).

🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and clearing your cache like so:

# Delete any models previously created
del model, tokenizer, pipe

# Empty VRAM cache
import torch
torch.cuda.empty_cache()
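Depending on your environment, the deleted objects may not be released immediately; explicitly running Python's garbage collector before emptying the cache can help. A minimal sketch (an optional extra step, not required in every setup):

# Force garbage collection so deleted models are actually freed
import gc
gc.collect()
torch.cuda.empty_cache()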

You can also follow along with the Google Colab Notebook to make sure everything works as intended.

The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!

We will start by installing HuggingFace, among others, from its main branch to support newer models:

# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers

After installation, we can use the following pipeline to easily load our LLM:

from torch import bfloat16
from transformers import pipeline

# Load your LLM without any compression tricks
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=bfloat16,
    device_map="auto"
)
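To check that the model loaded correctly, you can run a quick generation. A minimal sketch, where the prompt text and sampling parameters are illustrative assumptions rather than prescribed values; Zephyr's chat template is applied via the tokenizer so the prompt matches the format the model was fine-tuned on:

# Illustrative example prompt (any user message works here)
messages = [
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."}
]

# Format the message with Zephyr's chat template, then generate
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])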
