Decoding Strategies in Large Language Models
The tokenizer, Byte-Pair Encoding in this instance, translates each token in the input text into a corresponding token ID. Then, GPT-2 uses these token IDs as input and tries to predict the next most likely token. Finally, the model generates logits, which are converted into probabilities using a softmax function.
For example, the model assigns a probability of 17% to the token for “of” being the next token after “I have a dream”. This output essentially represents a ranked list of potential next tokens in the sequence. More formally, we denote this probability as P(of | I have a dream) = 17%.
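For reference, here is a minimal sketch of the setup the snippets below rely on: they reference model, tokenizer, input_ids, and text without defining them in this section, so we assume GPT-2 loaded through the transformers library. The exact probability printed for “ of” depends on the model checkpoint.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "I have a dream"
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

# Logits for the last position, converted into a probability distribution
with torch.no_grad():
    logits = model(input_ids).logits[0, -1, :]
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Probability of " of" being the next token
of_id = tokenizer.encode(" of")[0]
print(f"P(of | I have a dream) = {probabilities[of_id].item():.2%}")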
Autoregressive models like GPT predict the next token in a sequence based on the preceding tokens. Consider a sequence of tokens w = (w₁, w₂, …, wₜ). The joint probability of this sequence P(w) can be broken down as:
P(w) = P(w₁) P(w₂ | w₁) P(w₃ | w₁, w₂) … P(wₜ | w₁, …, wₜ₋₁) = ∏ᵢ₌₁ᵗ P(wᵢ | w₁, …, wᵢ₋₁)
For each token wᵢ in the sequence, P(wᵢ | w₁, w₂, …, wᵢ₋₁) represents the conditional probability of wᵢ given all the preceding tokens (w₁, w₂, …, wᵢ₋₁). GPT-2 calculates this conditional probability for each of the 50,257 tokens in its vocabulary.
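To make this factorization concrete, here is a minimal sketch that scores a continuation by summing conditional log probabilities, one token at a time. It assumes the model, tokenizer, input_ids, and device from the setup above; the helper name sequence_log_prob is ours, not part of any library.
# Score a continuation by accumulating conditional log probabilities,
# mirroring the chain-rule factorization of P(w)
def sequence_log_prob(model, input_ids, continuation_ids):
    log_prob = 0.0
    ids = input_ids
    for token_id in continuation_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1, :]  # logits for the next position
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
        log_prob += log_probs[token_id].item()
        # Condition the next step on everything generated so far
        ids = torch.cat([ids, token_id.view(1, 1)], dim=-1)
    return log_prob

# Example: log P(" of being" | "I have a dream")
continuation_ids = torch.tensor(tokenizer.encode(" of being")).to(device)
print(sequence_log_prob(model, input_ids, continuation_ids))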
This leads to the question: how do we use these probabilities to generate text? This is where decoding strategies, such as greedy search and beam search, come into play.
Greedy search is a decoding method that takes the most probable token at each step as the next token in the sequence. To put it simply, it only retains the most likely token at each stage, discarding all other potential options. Using our example:
- Step 1: Input: “I have a dream” → Most likely token: “ of”
- Step 2: Input: “I have a dream of” → Most likely token: “ being”
- Step 3: Input: “I have a dream of being” → Most likely token: “ a”
- Step 4: Input: “I have a dream of being a” → Most likely token: “ doctor”
- Step 5: Input: “I have a dream of being a doctor” → Most likely token: “.”
While this approach might sound intuitive, it is essential to note that greedy search is short-sighted: it only considers the most probable token at each step without considering the overall effect on the sequence. This property makes it fast and efficient, as it doesn't need to keep track of multiple sequences, but it also means that it can miss out on better sequences that might have appeared with slightly less probable next tokens.
Next, let's illustrate the greedy search implementation using graphviz and networkx. We select the ID with the highest score, compute its log probability (we take the log to simplify calculations), and add it to the tree. We'll repeat this process for five tokens.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import torch

def get_log_prob(logits, token_id):
    # Compute the softmax of the logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    log_probabilities = torch.log(probabilities)

    # Get the log probability of the token
    token_log_probability = log_probabilities[token_id].item()
    return token_log_probability
def greedy_search(input_ids, node, length=5):
    if length == 0:
        return input_ids

    outputs = model(input_ids)
    predictions = outputs.logits

    # Get the predicted next token (greedy: pick the argmax of the logits)
    logits = predictions[0, -1, :]
    token_id = torch.argmax(logits).unsqueeze(0)

    # Compute the score of the predicted token
    token_score = get_log_prob(logits, token_id)

    # Append the predicted token to the list of input ids
    new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0)], dim=-1)

    # Add node and edge to the graph
    next_token = tokenizer.decode(token_id, skip_special_tokens=True)
    current_node = list(graph.successors(node))[0]
    graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
    graph.nodes[current_node]['token'] = next_token + f"_{length}"

    # Recursive call
    input_ids = greedy_search(new_input_ids, current_node, length - 1)
    return input_ids
# Parameters
length = 5
beams = 1

# Create a balanced tree with height 'length'
graph = nx.balanced_tree(1, length, create_using=nx.DiGraph())

# Add 'tokenscore', 'cumscore', and 'token' attributes to each node
for node in graph.nodes:
    graph.nodes[node]['tokenscore'] = 100
    graph.nodes[node]['token'] = text

# Start generating text
output_ids = greedy_search(input_ids, 0, length=length)
output = tokenizer.decode(output_ids.squeeze().tolist(), skip_special_tokens=True)
print(f"Generated text: {output}")
Generated text: I have a dream of being a doctor.
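As a point of comparison, this is roughly what the equivalent library call looks like; a minimal sketch assuming transformers' generate API with sampling disabled, which defaults to greedy decoding (the exact call used for the comparison is not shown in this section):
# Greedy decoding with the transformers library, for comparison
outputs = model.generate(input_ids, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))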
Our greedy search generates the same text as the one from the transformers library: “I have a dream of being a doctor.” Let's visualize the tree we created.
import matplotlib.pyplot as plt
import networkx as nx
import matplotlib.colors as mcolors
from matplotlib.colors import LinearSegmentedColormap

def plot_graph(graph, length, beams, score):
    fig, ax = plt.subplots(figsize=(3 + 1.2 * beams**length, max(5, 2 + length)), dpi=300, facecolor='white')

    # Create positions for each node
    pos = nx.nx_agraph.graphviz_layout(graph, prog="dot")

    # Normalize the colors along the range of token scores
    if score == 'token':
        scores = [data['tokenscore'] for _, data in graph.nodes(data=True) if data['token'] is not None]
    elif score == 'sequence':
        scores = [data['sequencescore'] for _, data in graph.nodes(data=True) if data['token'] is not None]
    vmin = min(scores)
    vmax = max(scores)
    norm = mcolors.Normalize(vmin=vmin, vmax=vmax)
    cmap = LinearSegmentedColormap.from_list('rg', ["r", "y", "g"], N=256)

    # Draw the nodes
    nx.draw_networkx_nodes(graph, pos, node_size=2000, node_shape='o', alpha=1, linewidths=4,
                           node_color=scores, cmap=cmap)

    # Draw the edges
    nx.draw_networkx_edges(graph, pos)

    # Draw the labels
    if score == 'token':
        labels = {node: data['token'].split('_')[0] + f"\n{data['tokenscore']:.2f}%" for node, data in graph.nodes(data=True) if data['token'] is not None}
    elif score == 'sequence':
        labels = {node: data['token'].split('_')[0] + f"\n{data['sequencescore']:.2f}" for node, data in graph.nodes(data=True) if data['token'] is not None}
    nx.draw_networkx_labels(graph, pos, labels=labels, font_size=10)
    plt.box(False)

    # Add a colorbar
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    if score == 'token':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Token probability (%)')
    elif score == 'sequence':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Sequence score')
    plt.show()

# Plot the graph
plot_graph(graph, length, 1.5, 'token')