Positional Encodings in Transformer Models

Natural language processing (NLP) has advanced significantly with transformer-based models. A key innovation in these models is positional encodings, which help capture the sequential nature of language. In this post, you will learn:
- Why positional encodings are necessary in transformer models
- Different types of positional encodings and their characteristics
- How to implement various positional encoding schemes
- How positional encodings are used in modern language models
Let's get started!

Positional Encodings in Language Models
Photo by Svetlana Gumerova. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Understanding Positional Encodings
- Sinusoidal Positional Encodings
- Learned Positional Encodings
- Rotary Positional Encodings (RoPE)
- Relative Positional Encodings
Understanding Positional Encodings
Consider these two sentences: “The fox jumps over the dog” and “The dog jumps over the fox”. They contain the same words but in different orders. In recurrent neural networks, the model processes words sequentially, naturally capturing this difference. However, transformer models process all words in parallel, making them unable to distinguish between these sentences without additional information.
Positional encodings solve this problem by providing information about each token’s position in the sequence. Each token is converted into a vector by the model’s embedding layer, with the vector size known as the “hidden dimension”. Positional encoding adds position information by creating a vector of the same hidden dimension.
The positional encodings are added to the input of the attention module. During the dot-product operation, these encodings emphasize relationships between nearby tokens, helping the model understand context. This allows the model to distinguish between sentences with the same words in different orders.
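As a concrete sketch of this idea (all names and sizes here are illustrative, not from any particular model), the snippet below builds token embeddings and adds a position vector of the same hidden dimension to each of them:

import torch
import torch.nn as nn

# Illustrative sizes
vocab_size, hidden_dim, seq_len = 1000, 16, 6

token_ids = torch.tensor([[3, 14, 159, 26, 53, 58]])   # (batch=1, seq_len)
embedding = nn.Embedding(vocab_size, hidden_dim)

token_vectors = embedding(token_ids)                    # (1, seq_len, hidden_dim)
position_vectors = torch.randn(seq_len, hidden_dim)     # placeholder for any of the schemes below

# Position information is added element-wise, so the shape stays the same
x = token_vectors + position_vectors                    # (1, seq_len, hidden_dim)
print(x.shape)                                          # torch.Size([1, 6, 16])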
The most common types of positional encodings are:
- Sinusoidal Positional Encodings (used in the original Transformer): Constant vectors built with sine and cosine functions
- Learned Positional Encodings (used in BERT and GPT): Vectors learned during training
- Rotary Positional Encodings (RoPE, used in Llama models): Constant vectors built with rotation matrices
- Relative Positional Encodings (used in T5 and MPT): Based on distances between tokens rather than absolute positions
- Attention with Linear Biases (ALiBi, used in Falcon models): A bias term added to attention scores based on token distances
Each type has unique advantages and limitations, which we'll explore in detail.
Sinusoidal Positional Encodings
The original Transformer paper introduced sinusoidal positional encodings. Deterministic functions are used to generate unique patterns for each position, as shown in the following formulas:
$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$
where $d$ is the hidden dimension (it must be even), and $i$ ranges from 0 to $d/2 - 1$. The positional encoding $PE(p, k)$ is the $k$-th element of the vector for position $p$. The constant 10000 was suggested by the original Transformer paper; it should be larger than the maximum sequence length.
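As a quick worked example (not from the original post), take $d = 4$: the two angular frequencies are $1/10000^{0} = 1$ and $1/10000^{2/4} = 1/100$, so the encoding for position $p = 2$ is

$$
PE(2) = \left(\sin(2),\ \cos(2),\ \sin(0.02),\ \cos(0.02)\right) \approx (0.909,\ -0.416,\ 0.020,\ 1.000)
$$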
Right here’s the PyTorch implementation:
import torch
import numpy as np

def create_sinusoidal_encodings(seq_len, dim):
    N = 10000
    i = torch.arange(0, dim // 2)
    div_term = torch.exp(-np.log(N) * (2 * i / dim))
    position = torch.arange(seq_len).unsqueeze(1)

    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
seq_len = 512
dim = 768
positional_encodings = create_sinusoidal_encodings(seq_len, dim)
sequence = torch.zeros(seq_len, dim)  # stand-in for the token embeddings
sequence = sequence + positional_encodings
In this implementation, div_term computes $1/N^{2i/d}$ for $i = 0$ to $d/2 - 1$. The position matrix has shape (seq_len, 1). Their product inside the sine and cosine functions is a matrix of shape (seq_len, dim//2). The results are interleaved in the output matrix pe of shape (seq_len, dim).
Sinusoidal encodings have two key advantages: they are deterministic, and they can extrapolate to sequences longer than those seen during training. The relative position between tokens can be easily computed from the dot product of their positional encoding vectors, thanks to the properties of sinusoidal functions.
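Here is a small demonstration of that property, assuming the create_sinusoidal_encodings function defined above: the dot product $PE(p) \cdot PE(q) = \sum_i \cos\big((p-q)/10000^{2i/d}\big)$ depends only on the offset $p - q$.

# Dot products of sinusoidal encodings depend only on the distance between positions
pe = create_sinusoidal_encodings(seq_len=512, dim=64)

print(torch.dot(pe[10], pe[15]).item())    # distance 5
print(torch.dot(pe[100], pe[105]).item())  # distance 5: approximately the same value
print(torch.dot(pe[10], pe[30]).item())    # distance 20: a different value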
However, these encodings do not adapt to data characteristics and may be less effective for very long sequences.
Learned Positional Encodings
Models like GPT-2 use learned positional encodings. Here's the PyTorch implementation:
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, dim):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_seq_len, dim)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device).expand(x.size(0), -1)
        position_embeddings = self.position_embeddings(positions)
        return x + position_embeddings

# Example usage
model = LearnedPositionalEncoding(max_seq_len=512, dim=768)
The nn.Embedding layer acts as a lookup table mapping integer indices to vectors of size dim. In the forward() function, the positions tensor has shape (batch_size, seq_len); looking it up in the embedding table produces positional embeddings of shape (batch_size, seq_len, dim), matching the input x. The positional encoding is added to x before the attention operation.
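A brief usage sketch (shapes are illustrative) showing that the module simply adds position embeddings to whatever batch of token embeddings it receives:

batch_size, seq_len, dim = 2, 16, 768
x = torch.randn(batch_size, seq_len, dim)   # stand-in for token embeddings

model = LearnedPositionalEncoding(max_seq_len=512, dim=dim)
out = model(x)
print(out.shape)                            # torch.Size([2, 16, 768])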
Learned positional encodings adapt to data characteristics through training, potentially offering better performance when trained properly. However, they cannot extrapolate to longer sequences and may overfit. They also increase model size, since they are part of the model parameters.
Rotary Positional Encodings (RoPE)
Most modern large language models use rotary positional encodings (RoPE). These encode relative positions through rotation matrices, with each position corresponding to a geometric progression of angles. The formulas are:
$$
\begin{aligned}
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) - x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) + x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$
where $\theta_i = 10000^{-2i/d}$, with $d$ the embedding dimension, $m$ the position index, and $i$ ranging from 0 to $d/2 - 1$. In matrix form, it is:
$$
\mathbf{\hat{x}}_m = \mathbf{R}_m\mathbf{x}_m = \begin{bmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \cos(m\theta_i)
\end{bmatrix} \mathbf{x}_m
$$
for $\mathbf{x}_m$ representing a pair $(i, d/2+i)$ of elements of the vector at position $m$.
Here's the PyTorch implementation:
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=512):
        super().__init__()
        N = 10000
        inv_freq = 1. / (N ** (torch.arange(0, dim, 2).float() / dim))
        position = torch.arange(max_seq_len).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        # Duplicate the angles so the cached cos/sin cover all dim elements,
        # matching x and rotate_half(x) in apply_rotary_pos_emb
        angles = torch.cat((sinusoid_inp, sinusoid_inp), dim=-1)
        self.register_buffer("cos", angles.cos())
        self.register_buffer("sin", angles.sin())

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.size(1)
        cos = self.cos[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin[:seq_len].view(1, seq_len, 1, -1)
        return apply_rotary_pos_emb(x, cos, sin)
The register_buffer() calls cache the sine and cosine values for efficiency. The inv_freq variable computes $\theta_i$ for all $i$, position represents $m$ (indices from 0 to max_seq_len - 1), and sinusoid_inp holds $m\theta_i$ in a matrix of shape (max_seq_len, dim//2); the angles are duplicated along the last dimension so that the cached cos and sin tensors cover all dim elements of the input. The rotate_half() function converts a vector $(x_1, \dots, x_{d/2}, x_{d/2+1}, \dots, x_d)$ into $(-x_{d/2+1}, \dots, -x_d, x_1, \dots, x_{d/2})$. Then apply_rotary_pos_emb() applies the rotation matrix to the input.
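A usage sketch follows, under the common convention (assumed here, not stated in the post) that queries and keys have shape (batch, seq_len, num_heads, head_dim):

batch_size, seq_len, num_heads, head_dim = 2, 16, 8, 64

q = torch.randn(batch_size, seq_len, num_heads, head_dim)
k = torch.randn(batch_size, seq_len, num_heads, head_dim)

rope = RotaryPositionalEncoding(dim=head_dim, max_seq_len=512)
q_rot, k_rot = rope(q), rope(k)             # shapes are unchanged
print(q_rot.shape, k_rot.shape)             # torch.Size([2, 16, 8, 64]) twice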
RoPE offers several advantages:
- The rotation matrix $\mathbf{R}_m$ geometrically rotates the 2D input vector by an angle $m\theta_i$
- The transpose $\mathbf{R}_m^\top = \mathbf{R}_m^{-1}$ represents the reverse rotation, so relative positions can easily be computed as $\mathbf{R}_{m-n} = \mathbf{R}_m\mathbf{R}_n^\top$
- It can extrapolate to longer sequences thanks to the geometric progression of angles
- Since $\cos^2 t + \sin^2 t = 1$, RoPE preserves the norm of $\mathbf{x}_m$, which aids training stability
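Here is a small numeric check of the relative-position property, reusing the RotaryPositionalEncoding module above: if the same vector is placed at every position, the pairwise dot products after rotation depend only on the distance between positions.

head_dim, seq_len = 64, 32
rope = RotaryPositionalEncoding(dim=head_dim, max_seq_len=seq_len)

# The same vector at every position: (batch=1, seq_len, heads=1, head_dim)
x = torch.randn(1, 1, 1, head_dim).expand(1, seq_len, 1, head_dim)

rotated = rope(x).squeeze()                         # (seq_len, head_dim)
scores = rotated @ rotated.T                        # pairwise dot products

print(scores[2, 5].item(), scores[10, 13].item())   # distance 3 in both: approximately equal
print(scores[2, 9].item())                          # distance 7: a different value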
Relative Positional Encodings
While the earlier implementations use absolute token positions, often what matters is the relative position between tokens. Here is a simplified implementation of relative positional encodings:
import torch
import torch.nn as nn

class RelativePositionalEncoding(nn.Module):
    def __init__(self, max_relative_position, d_model):
        super().__init__()
        self.max_relative_position = max_relative_position
        self.relative_attention_bias = nn.Parameter(
            torch.randn(2 * max_relative_position + 1, d_model)
        )

    def forward(self, length):
        context_position = torch.arange(length, dtype=torch.long)[:, None]
        memory_position = torch.arange(length, dtype=torch.long)[None, :]
        relative_position = memory_position - context_position
        # Clip distances so sequences longer than max_relative_position stay in range
        relative_position = relative_position.clamp(
            -self.max_relative_position, self.max_relative_position
        )
        relative_position_bucket = relative_position + self.max_relative_position
        return self.relative_attention_bias[relative_position_bucket]
The relative_position matrix has shape (length, length), with each element representing the relative position between tokens $i$ and $j$. It is computed by subtracting an $N \times 1$ matrix context_position from a $1 \times N$ matrix memory_position.
The relative_position_bucket shifts the values to be non-negative, and the positional encoding vectors are looked up from the relative_attention_bias tensor.
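A brief usage sketch with illustrative numbers: the module returns one d_model-dimensional vector per (query position, key position) pair.

rel_pe = RelativePositionalEncoding(max_relative_position=128, d_model=768)
bias = rel_pe(16)                # positional vectors for a length-16 sequence
print(bias.shape)                # torch.Size([16, 16, 768])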
Relative positional encodings naturally handle variable-length sequences and work well for tasks like translation, making them the choice for models like T5.
Attention with Linear Biases (ALiBi) is a related approach that adds a bias matrix to the attention scores instead of manipulating the input sequence. In the code above, relative_position_bucket is used to look up a sequence of vectors as the positional encoding, which is then added to the input sequence in the attention module. In ALiBi, the input sequence is used directly to compute the attention scores; afterwards, a matrix based on relative token distances is scaled and added to the attention score matrix before the softmax operation. The scaling factor in ALiBi is computed as $m_h = 1/2^{8h/H}$, where $h$ is the head index and $H$ is the total number of attention heads.
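The post does not include ALiBi code, but a minimal sketch of the idea, with illustrative names and a symmetric (non-causal) distance matrix, might look like this:

import torch

def alibi_bias(seq_len, num_heads):
    # Negative distance between query position i and key position j
    context = torch.arange(seq_len)[:, None]
    memory = torch.arange(seq_len)[None, :]
    distance = -(context - memory).abs()            # (seq_len, seq_len), zeros on the diagonal

    # Per-head slopes m_h = 1 / 2^(8h/H) for h = 1, ..., H
    h = torch.arange(1, num_heads + 1).float()
    slopes = 1.0 / (2 ** (8 * h / num_heads))       # (num_heads,)

    return slopes[:, None, None] * distance         # (num_heads, seq_len, seq_len)

# The bias is added to the raw attention scores before the softmax
num_heads, seq_len = 8, 16
scores = torch.randn(num_heads, seq_len, seq_len)   # attention scores for one example
attn = torch.softmax(scores + alibi_bias(seq_len, num_heads), dim=-1)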
Summary
In this article, you learned about positional encodings and their importance in transformer models. Specifically, you learned:
- Positional encodings are necessary because transformers process tokens in parallel
- Different types of positional encodings have different advantages and limitations
- Sinusoidal encodings are deterministic and can extrapolate to longer sequences
- Learned encodings are simple but cannot extrapolate
- RoPE provides better performance on long sequences
- Relative positional encodings capture inter-token distances