Are GPTs Good Embedding Models? A surprising experiment to show that… | by Yu-Cheng Tsai | May 2024
With the growing number of embedding models available, choosing the right one for your machine learning applications can be challenging. Fortunately, the MTEB leaderboard provides a comprehensive range of ranking metrics for various natural language processing tasks.
When you visit the site, you will notice that the top five embedding models are Generative Pre-trained Transformers (GPTs). This might lead you to think that GPT models are the best choice for embeddings. But is this really true? Let's conduct an experiment to find out.
Embeddings are tensor representations of text: the text is converted into token IDs, which are then projected into a tensor space.
By feeding text into a neural network model and performing a forward pass, you can obtain embedding vectors. However, the actual process is a bit more involved. Let's break it down step by step:
- Convert the text into token IDs
- Pass the token IDs into a neural network
- Return the outputs of the neural network
In the first step, I use a tokenizer to achieve this. model_inputs is the tensor representation of the text content, "some questions."
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {
        "role": "user",
        "content": "some questions.",
    },
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to("cuda")
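To sanity-check this first step, you can inspect what the tokenizer produced. This is just a quick check; the exact token IDs and wrapping depend on the tokenizer version:

# model_inputs is a tensor of token IDs with shape (batch_size, sequence_length)
print(model_inputs.shape)
# Decoding the IDs back to text shows the chat template applied around the content
print(tokenizer.decode(model_inputs[0]))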
The second step is straightforward: forward-pass the model_inputs into a neural network. The logits of the generated tokens can be accessed via .logits. torch.no_grad() disables gradient tracking because the model is in inference mode and its weights should not be updated.
import torch

with torch.no_grad():
    logits = model(model_inputs).logits
The third step is a bit tricky. GPT models are decoder-only, and their token generation is autoregressive. In simple terms, the last token of a completed sentence has seen all the preceding tokens in the sentence. Therefore, the output of the last token contains all the affinity scores (attentions) from the preceding tokens.
Bingo! You are most interested in the last token because of the attention mechanism in the transformers.
The output dimension of the GPT models implemented in Hugging Face is (batch size, input token size, vocabulary size). To get the last-token output for all the batches, I can perform a tensor slice.
import torch

with torch.no_grad():
    last_token_logits = model(model_inputs).logits[:, -1, :]
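As a quick sanity check, note that this sliced tensor has one score per vocabulary entry, so these last-token embeddings are vocabulary-sized vectors (32,000 dimensions for Mistral-7B):

# One row per prompt in the batch, one column per vocabulary token
print(last_token_logits.shape)  # torch.Size([1, 32000]) for a single prompt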
To measure the quality of these GPT embeddings, you can use cosine similarity. The higher the cosine similarity, the closer the semantic meaning of the sentences.
import torch

def compute_cosine_similarity(vec1, vec2):
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(vec1, vec2)
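As a toy check of this helper (using random tensors rather than real embeddings), a vector compared with itself scores 1, while two unrelated random vectors score near 0:

# Identical vectors give a cosine similarity of 1.0
v = torch.randn(1, 32000)
print(compute_cosine_similarity(v, v))
# Independent random high-dimensional vectors are nearly orthogonal
w = torch.randn(1, 32000)
print(compute_cosine_similarity(v, w))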
Let's create some utility functions that allow us to loop through a list of question-and-answer pairs and see the results. Mistral 7B Instruct v0.1, one of the great open-source models, is used for this experiment.
import torch
from termcolor import colored
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
)
model.to("cuda")  # place the model on the GPU so it matches the device of the inputs
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def generate_last_token_embeddings(question):
    messages = [
        {
            "role": "user",
            "content": question,
        },
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to("cuda")
    with torch.no_grad():
        return model(model_inputs).logits[:, -1, :]

def get_similarities(questions, answers):
    for question in questions:
        for answer in answers:
            q_embedding, a_embedding = (
                generate_last_token_embeddings(question),
                generate_last_token_embeddings(answer),
            )
            similarity = compute_cosine_similarity(q_embedding, a_embedding)
            print(colored(f"question: {question} and ans: {answer}", "green"))
            print(colored(f"result: {similarity}", "blue"))

questions = ["Where is the headquarter of OpenAI?", "What is GPU?"]
answers = [
    "OpenAI is based at San Francisco.",
    "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly",
]
get_similarities(questions, answers)
For the first question-and-answer pair:
- Question: "Where is the headquarter of OpenAI?"
- Answer: "OpenAI is based at San Francisco."
- Cosine similarity: 0.96
For the second question-and-answer pair:
- Question: "What is GPU?"
- Answer: "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly."
- Cosine similarity: 0.94
For an irrelevant pair:
- Question: "Where is the headquarter of OpenAI?"
- Answer: "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly."
- Cosine similarity: 0.90
For the worst pair:
- Question: "What is GPU?"
- Answer: "OpenAI is based at San Francisco."
- Cosine similarity: 0.93
These results suggest that using GPT models, in this case Mistral 7B Instruct v0.1, as embedding models may not yield great results when it comes to distinguishing between relevant and irrelevant pairs. But why are GPT models still among the top five embedding models?
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "intfloat/e5-mistral-7b-instruct"
)
Repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, which is one of the top open-source models on the MTEB leaderboard and is fine-tuned from Mistral 7B Instruct, I find that the cosine similarity for the relevant question-and-answer pairs is 0.88 and 0.84 for the OpenAI and GPU questions, respectively. For the irrelevant question-and-answer pairs, the similarity drops to 0.56 and 0.67. These findings suggest that e5-mistral-7b-instruct is a much-improved model for embeddings. What accounts for such an improvement?
Delving into the paper behind e5-mistral-7b-instruct, the key is the use of contrastive loss to further fine-tune the Mistral model.
Unlike GPTs, which are trained or further fine-tuned with a cross-entropy loss between predicted tokens and labeled tokens, contrastive loss aims to maximize the distance between negative pairs and minimize the distance between positive pairs.
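Written out, the objective has roughly the following InfoNCE-style form, where sim is the similarity function, q is a query, p+ is its positive (relevant) passage, the n_i are negative (irrelevant) passages, and τ is a temperature. This is a sketch of the general recipe rather than the exact formula from the paper:

$$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(q, p^{+})/\tau\right)}{\exp\left(\mathrm{sim}(q, p^{+})/\tau\right) + \sum_{i}\exp\left(\mathrm{sim}(q, n_{i})/\tau\right)}$$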
This blog post covers the concept in greater detail. The sim function calculates the cosine similarity between two vectors. In the contrastive loss, the denominator contains the similarities of both the positive and the negative pairs. The rationale behind contrastive loss is that we want the positive pair's score to dominate, pushing the fraction inside the log as close to 1 as possible, since log(1) = 0 represents the optimal loss.
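A minimal sketch of this loss in code, assuming a single positive and a handful of negatives per query plus a temperature hyperparameter; it illustrates the general recipe, not the actual training code of e5-mistral-7b-instruct:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # query_emb: (dim,) embedding of the query
    # pos_emb:   (dim,) embedding of the relevant passage
    # neg_embs:  (num_negatives, dim) embeddings of irrelevant passages
    pos_sim = F.cosine_similarity(query_emb, pos_emb, dim=0) / temperature
    neg_sims = F.cosine_similarity(query_emb.unsqueeze(0), neg_embs, dim=1) / temperature
    # Softmax over [positive, negatives]: the loss is smallest when the
    # positive similarity dominates the denominator
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims])
    return -F.log_softmax(logits, dim=0)[0]

# Toy usage with random vectors standing in for real embeddings
q = torch.randn(128)
p = q + 0.1 * torch.randn(128)   # a "positive" close to the query
n = torch.randn(4, 128)          # four random "negatives"
print(contrastive_loss(q, p, n))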
In this post, I have highlighted a common pitfall of using GPTs as embedding models without fine-tuning. My evaluation suggests that after fine-tuning GPTs with contrastive loss, the embeddings become more meaningful and discriminative. By understanding the strengths and limitations of GPT models, and by leveraging customized losses like contrastive loss, you can make more informed decisions when selecting and using embedding models for your machine learning projects. I hope this post helps you choose GPT models wisely for your applications, and I look forward to hearing your feedback! 🙂