LLaMA in R with Keras and TensorFlow

OpenAI’s ChatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities, (and new worries). It seems we’re in the early stages of a
Cambrian explosion of LLMs and LLM-powered tools; it’s not yet clear how
LLMs will impact and influence our professional and personal lives, but
it seems clear that they will, in some way.
Since LLMs are here to stay, it’s worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist’s
toolbox, as common as an lm() function call).
And what better way is there to learn than by doing. So with that
preamble, in this post we’ll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023)
specifically, in TensorFlow and Keras, with the goal being to develop
understanding first, capability second.
Why LLaMA? With the sheer volume of LLM-related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there’s a new model announced. Browsing some hubs of LLM
activity (HuggingFace,
TFHub,
reddit,
HackerNews) muddies the waters even
more. How to pick a specific model?
Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of
LLaMA,
a modern, foundational LLM made available to the public by Meta AI in
February 2023. On common benchmarks, LLaMA outperforms OpenAI’s GPT-3,
while being substantially smaller (though still large).
LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in
“Attention Is All You Need”
published from Google (Vaswani et al. 2017). Four different sizes of
LLaMA have been released: 7 billion and 13 billion parameter models
trained on 1 trillion tokens, and 33 billion and 65 billion parameter
models trained on 1.4 trillion tokens. This is an enormous amount of
training data these models have seen–the largest 65B model has been
trained on approximately the “Chinchilla
compute-optimum” (Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are substantially
beyond that optimum. In this blog post we’ll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64Gb of RAM.
While not strictly necessary, to follow along locally, you’ll probably
want to acquire the pre-trained LLaMA weights one
way or
another. Note, the
weights do come with their own license, which you can preview
here.
So, without further ado, let’s get started.
Setup
First, we’ll want to install the required R and Python packages, and
configure a virtual environment:
remotes::install_github(c("rstudio/reticulate",
"rstudio/tensorflow",
"rstudio/keras"))
reticulate::virtualenv_create("./.venv", version = "3.10")
tensorflow::install_tensorflow(envname = "./.venv", version = "release")
With that out of the way, let’s load some packages and prepare our R
session:
library(purrr)
library(envir)
library(tensorflow)
library(tfautograph)
library(keras)
use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)
attach_eval({
import_from(glue, glue)
import_from(jsonlite, read_json)
import_from(withr, with_dir, with_options)
import_from(keras$layers, Dense)
np <- reticulate::import("numpy", convert = FALSE)
seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})
If you’ve acquired the pre-trained weights, it’ll be convenient to
convert them from the torch checkpoint format to something that’s more
framework agnostic (you only need to do this once, of course):
# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
pretrained_weights <- torch$load("consolidated.00.pth",
map_location = "cpu")
for (name in names(pretrained_weights)) {
filename <- sprintf("%s.npy", name)
array <- pretrained_weights[[name]]$numpy()
np$save(filename, array)
message(glue(
"wrote: '{basename(filename)}' with shape: {array$shape}"))
}
})
We’ll also define a helper function so we can avoid having to retype the
full path to our weights:
weights_path <- function(filename) normalizePath(file.path(
"~/github/facebookresearch/llama/weights/LLaMA/",
glue(filename, .envir = parent.frame())), mustWork = TRUE)
And load the model configuration parameters specific to the 7B LLaMA,
which we’ll use to build the model.
params <- read_json(weights_path("7B/params.json"))
str(params)
List of 6
$ dim : int 4096
$ multiple_of: int 256
$ n_heads : int 32
$ n_layers : int 32
$ norm_eps : num 1e-06
$ vocab_size : int -1
Tokenizer
The first component to LLaMA is the tokenizer, which converts text to a
sequence of integers. The LLaMA model uses the
SentencePiece tokenizer from
Google. SentencePiece is available as a TensorFlow graph operation
through
tf_text.SentencepieceTokenizer,
and also as a Keras layer in
keras_nlp.tokenizers.SentencePieceTokenizer.
By flip of a coin, we’ll use the lower-level tf_text interface.
tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
add_bos = TRUE, add_eos = FALSE,
)
Let’s test it out with a prompt:
prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)
tf.Tensor([ 1 450 1900 982 304 13978 367 267], shape=(8), dtype=int32)
prompt |> tokenizer$tokenize() |> tokenizer$detokenize()
tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)
Let’s define a show_tokens() helper function and play with the
tokenizer a little.
show_tokens <- function(what) {
  if (is.character(what))
    what <- what |> tokenizer$tokenize()
  token_ids <- as.integer(what)
  tokens <- token_ids |>
    map_chr(\(id) tokenizer$detokenize(list(id)) |> as.character())
  names(tokens) <- token_ids
  tokens
}
show_tokens(prompt)
1 450 1900 982 304 13978 367 267
"" "The" "best" "way" "to" "attract" "be" "es"
Note that “bees” is two tokens. Not every token corresponds to a word.
For example, one non-word token we can reliably expect to show up in a
tokenizer trained on a corpus of English text is “ing.” However, when the
“ing” token shows up will not always follow your intuitions, because
common words get their own token id, even if they can be decomposed into
multiple tokens.
show_tokens("ing")
1 2348
"" "ing"
show_tokens("working")
1 1985
"" "working"
show_tokens("flexing")
1 8525 292
"" "flex" "ing"
show_tokens("wonking")
1 2113 9292
"" "won" "king"
Another thing to note about the tokenizer is that each token sequence
starts with token id 1. This is a special beginning-of-sequence
token that we requested be added when we loaded the tokenizer with
add_bos = TRUE. There are two other such special tokens that we will
encounter later: an end-of-sequence special token with id 2, and an
unknown-token with id 0.
as.character(tokenizer$id_to_string(0L))
[1] "<unk>"
as.character(tokenizer$id_to_string(1L))
[1] "<s>"
as.character(tokenizer$id_to_string(2L))
[1] "</s>"
show_tokens(c(1, 0, 2))
1 0 2
"" " ⁇ " ""
Overall, there are 32,000 tokens.
as.integer(tokenizer$vocab_size())
[1] 32000
One last observation is that the more frequently encountered tokens are
assigned lower ids.
show_tokens(seq(50, len = 10))
50 51 52 53 54 55 56 57 58 59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
show_tokens(seq(100, len = 10))
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
show_tokens(seq(1000, len = 10))
1000 1001 1002 1003 1004 1005 1006 1007 1008 1009
"ied" "ER" "stat" "fig" "me" "von" "inter" "roid" "ater" "their"
show_tokens(seq(10000, len = 10))
10000 10001 10002 10003 10004 10005 10006 10007
"ång" "citep" "Ill" "rank" "sender" "beim" "рак" "compat"
10008 10009
"occurs" "diese"
show_tokens(seq(20000, len = 10))
20000 20001 20002 20003 20004 20005 20006 20007
"admit" "Comment" "стя" "Vien" "ці" "permut" "cgi" "crít"
20008 20009
"Console" "ctic"
show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
"ὀ" "げ" "べ" "边" "还" "黃" "왕" "收" "弘" "给"
Moving on, the next step after tokenization is embedding. An embedding
layer is effectively a dictionary lookup that converts an integer (token
id) to a 1-d float array. For this we can use the standard keras
Embedding layer.
tok_embeddings <- keras$layers$Embedding(
input_dim = tokenizer$vocab_size(),
output_dim = params$dim,
embeddings_initializer =
\(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)
tok_embeddings(3L) |> str()
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
prompt |> # "The best way to attract bees"
tokenizer$tokenize() |>
tok_embeddings() |>
str()
<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
TransformerBlock
Once it’s tokenized and embedded, the input then passes through the bulk
of the model, a sequence of repeating TransformerBlock layers. The 7B
model has 32 of these TransformerBlock layers, while the 65B model has
80 of them.
weights_path("7B/params.json") |> read_json() |> _$n_layers
[1] 32
weights_path("65B/params.json") |> read_json() |> _$n_layers
[1] 80
Here is what the transformer block looks like:
TransformerBlock(keras$layers$Layer) %py_class% {
initialize <- function(attn_head_size, attn_n_heads,
norm_eps = k_epsilon(), ...,
block_id = NULL) {
super$initialize(...)
self$attention <- Attention(attn_head_size, attn_n_heads,
block_id = block_id)
self$feed_forward <- FeedForward(
hidden_dim = 4 * attn_head_size * attn_n_heads,
block_id = block_id)
self$attention_norm <- RMSNorm(eps = norm_eps,
block_id = block_id,
feeds_into = "attention")
self$feed_forward_norm <- RMSNorm(eps = norm_eps,
block_id = block_id,
feeds_into = "ffn")
}
call <- function(x) {
# norm and attention
x2 <- x %>%
self$attention_norm() %>%
self$attention()
x <- x + x2 # add residual
# norm and swiglu
x2 <- x %>%
self$feed_forward_norm() %>%
self$feed_forward()
x <- x + x2 # residual again
x
}
While there is not a lot of code, there are a lot of ideas packed in
there. This block forms the main trunk of the model, so it’s worth
taking the time to go through it slowly.
We implement the TransformerBlock as a subclassed
keras.layers.Layer. This gives us some niceties like the ability to
compose with other Keras layers, but these are mostly irrelevant to the
purpose of this blog post; we could just as easily implement this as,
for example, a vanilla R6 class. Our TransformerBlock class has two
methods: initialize, called when we first create the block, and
call, called when we run the forward pass of the block.
In initialize, we create 4 layers: an Attention layer, a
FeedForward layer, and 2 RMSNorm layers. We’ll take a close look at
each of these soon, but even before we do so, we can see how they fit
together by looking at the TransformerBlock$call() method.
The call method has a few simple ideas. In no particular order, the
first one to observe is the composition pattern of adding residuals.
x2 <- x |> ...
x <- x + x2 # add residual x to x2
This is a common pattern that helps with model training, and especially
to help with the vanishing gradient
problem. It’s
a skip-connection in the otherwise-linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
“pass-through” or “preserve” information in x, allowing the weights to
instead focus on learning transformations that are, (in corporatese
vernacular), value-adding.
The next composition pattern to note is the repeating usage of a
normalization layer:
x2 <- x |> norm() |> ...
x <- x + x2
There are many kinds of normalization layers, but to slightly
over-generalize, they can all be thought of as a stabilizer that helps
with training. Like their deep-learning cousins the regularizers, their
main function is to keep values passing through in a sensible range–in
the ball park of (-1, 1), typically. We’ll take a closer look at
RMSNorm soon.
Stripped of two tricks that are mostly there to help the model train,
residuals and normalization, the core of the TransformerBlock is just
this:
x |> attention() |> feed_forward()
In a moment we’ll see that feed_forward is a slightly fancier
variation of a conventional sequence of Dense layers. Before we get
there, we can safely skip ahead to distill the following intuition: a
TransformerBlock is basically an Attention layer followed by a few
(fancy) dense layers, with some simple composition patterns (tricks)
that help with training. Attention is the heart of the model: it’s the
most interesting, and also the most involved.
With the framing in place, let’s go through and take a closer look at
RMSNorm, FeedForward, and then with the foundation in place, we’ll
turn our attention to Attention.
RMSNorm
RMSNorm(keras$layers$Layer) %py_class% {
initialize <-
function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
super$initialize(...)
self$eps <- eps
self$block_id <- block_id
self$feeds_into <- feeds_into
}
build <- function(input_shape) {
# input_shape == (batch_size, seqlen, params$dim)
# self$w will broadcast over batch_size and seqlen dims.
# w_shape == (1, 1, params$dim)
w_shape <- rep(1L, length(input_shape))
w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)
# define a local function that will load
# the pretrained-weights if we supplied `block_id` and `feeds_into`
import_from({self}, block_id, feeds_into)
initializer <- if (is.null(block_id))
"ones"
else if (block_id >= 0) {
\(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
np$load() |> np$expand_dims(0:1)
} else if (block_id == -1)
# load weights for the final output normalization layer, which is not
# part of a TransformerBlock
\(...) weights_path("7B/norm.weight.npy") |>
np$load() |> np$expand_dims(0:1)
self$w <- self$add_weight(shape = w_shape,
initializer = initializer,
trainable = TRUE)
}
rrms <- function(x) {
# reciprocal root mean square along the last axis
x %>% # (batch_size, seqlen, n_features)
tf$math$square() %>%
tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
tf$math$add(self$eps) %>% # for numerical stability
tf$math$rsqrt()
}
call <- function(x) {
x * self$rrms(x) * self$w
}
}
RMSnorm() has a single trainable tensor w. In the forward pass, each
value in the input is multiplied by the reciprocal-root-mean-square of
all the values in the feature axis and by w. Certainly a mouthful, but
just a simple sequence of arithmetic transformations in the end,
designed for the express purpose of adjusting the range of values
passing through.
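To make that arithmetic concrete, here is a minimal base-R sketch of the
same calculation (an illustration added here; the helper name is ours, and
it assumes the trainable weight w is at its initial value of ones):
rms_norm_once <- function(x, eps = 1e-6) {
  # divide by the root mean square of the features
  x / sqrt(mean(x^2) + eps)
}
rms_norm_once(c(0, 2)) # ~ c(0, 1.414), matching the first row of norm(m) below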
Let’s kick the tires on it:
norm <- RMSNorm()
m <- matrix(c(0, 1,
2, 3), nrow = 2)
norm(m)
tf.Tensor(
[[0. 1.4142132 ]
[0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)
norm(m * 10)
tf.Tensor(
[[0. 1.4142137 ]
[0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)
norm(m * 100)
tf.Tensor(
[[0. 1.4142137]
[0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
The three outputs are (nearly) identical: RMSNorm is invariant to the
scale of its input.
FeedForward
Next up is FeedForward()
FeedForward(keras$layers$Layer) %py_class% {
initialize <- function(hidden_dim, multiple_of = 256L,
..., block_id = NULL) {
super$initialize()
if(!is.null(multiple_of)) {
hidden_dim <- hidden_dim %>%
{ as.integer( . * (2/3)) } %>%
{ (. + multiple_of - 1) %/% multiple_of } %>%
{ . * multiple_of }
}
self$hidden_dim <- hidden_dim
self$block_id <- block_id
}
build <- function(input_shape) {
output_dim <- input_shape |> as.integer() |> tail(1)
if (is.null(self$block_id))
load_weight <- \(...) NULL
else
load_weight <- \(name) \(...) np$load(weights_path(
"7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`
self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
kernel_initializer = load_weight("w1"))
self$w2 <- Dense(output_dim, use_bias = FALSE,
kernel_initializer = load_weight("w2"))
self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
kernel_initializer = load_weight("w3"))
super$build(input_shape)
}
call <- function(x) {
import_from({self}, w1, w2, w3)
import_from(tf$nn, silu)
x %>%
{ silu(w1(.)) * w3(.) } %>% # SwiGLU
w2()
}
}
FeedForward consists of three Dense layers. initialize does some
simple arithmetic, munging on the input value hidden_dim to ensure the
size is a performant multiple of 256, and build is mostly boilerplate
for creating the layers and loading the weights.
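As a quick side calculation (added here for illustration, assuming the 7B
configuration where params$dim == 4096 and TransformerBlock passes in
hidden_dim = 4 * 4096), the munging works out like this:
hidden_dim <- 4L * 4096L                               # as passed in by TransformerBlock
hidden_dim <- as.integer(hidden_dim * (2/3))           # 10922
hidden_dim <- (hidden_dim + 256L - 1L) %/% 256L * 256L # round up to a multiple of 256
hidden_dim                                             # 11008, the hidden width used by LLaMA 7B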
The novelty of FeedForward() is in the call() method, where rather
than composing the Dense layers in a conventional sequential model
with, say, ReLU activations in between and maybe some dropout, the
layers are composed to form a “SwiGLU” unit. The publication by Shazeer (2020)
of SwiGLU and other variations on GLU is an exemplar of the kinds
of explorations and improvements around the Transformer architecture
since its initial publication in
2017; a steady accretion of
improvements that has brought us to today. The FeedForward$call() is
just a single SwiGLU followed by a linear projection. In its essence,
it’s a clever composition of three (learned) linear projections, an
element-wise multiplication, and a silu()
activation
function.
Perhaps the most surprising observation to make here is the relative
dearth of activation functions, and even non-linearities, not just in
FeedForward, but overall. The silu() in this feedforward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!
Attention
Finally, let’s turn our attention to Attention().
Attention(keras$layers$Layer) %py_class% {
initialize <- function(head_size, n_heads,
..., block_id = NULL) {
super$initialize(...)
self$head_size <- head_size
self$n_heads <- n_heads
if (is.null(block_id))
load_weight <- function(name) NULL
else
load_weight <- \(name) \(...) np$load(weights_path(
"7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`
Dense <- function(name) keras$layers$Dense(
units = n_heads * head_size,
use_bias = FALSE,
kernel_initializer = load_weight(name)
)
self$wq <- Dense("wq")
self$wk <- Dense("wk")
self$wv <- Dense("wv")
self$wo <- Dense("wo")
}
call <- function(x) {
c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))
# 1. project (linear transform) x into
# query, key, and value tensors
# 2. reshape q k v, splitting out the last dim (n_features)
# into n_heads independent subspaces,
# each with size head_size.
# (n_features == head_size * n_heads)
split_heads_shape <- c(batch_size, seqlen,
self$n_heads, self$head_size)
q <- x |> self$wq() |> tf$reshape(split_heads_shape)
k <- x |> self$wk() |> tf$reshape(split_heads_shape)
v <- x |> self$wv() |> tf$reshape(split_heads_shape)
# embed positional information in query and key
# (bsz, seqlen, n_heads, head_size)
q %<>% apply_rotary_embedding()
k %<>% apply_rotary_embedding()
# reshape:
# move heads out of the last 2 axes,
# so later matmuls are performed across the subspaces (heads)
# between (seqlen, head_size) axes
v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)
# calculate and normalize attention scores
scores <- q %*% k # (bsz, n_heads, seqlen, seqlen)
scores <- scores / sqrt(self$head_size) # scale
# apply causal mask, so the model can't "look ahead" during training
mask <- make_mask(seqlen, dtype = scores$dtype)
scores %<>% { . + mask }
scores <- tf$nn$softmax(scores, axis = -1L)
# adjust values tensor with attention scores
# scores (bsz, n_heads, seqlen, seqlen)
# v (bsz, n_heads, seqlen, head_size)
output <- scores %*% v # (bsz, n_heads, seqlen, head_size)
# combine heads back into a single features dim,
# so Attention output_shape==input_shape
output <- output |>
tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
tf$reshape(tf$shape(x)) # (bsz, seqlen, n_heads * head_size)
# one more trainable linear projection for good luck
output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)
output
}
}
Attention in LLaMA is similar but not identical to the Attention
described in the original Transformers
paper (and available as a keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we’ll
describe shortly. The additional novelty is balanced by the simplicity
that comes from the layer performing self-attention—we don’t need
to pass in different query, key, and value tensors (or reason about what
that means), since the same input serves all three roles. Note that the
conventional MultiHeadAttention() layer is covered quite thoroughly in
the 2nd Edition of Deep Learning with R,
including a full implementation of attention in base R.
To develop an understanding of the mechanics in a layer like this, it’s
helpful to temporarily unsee some of the minutiae that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
vital as they are), this is what’s left:
call <- function(x) {
# project x into query, key, and value tensors
q <- x |> self$wq()
k <- x |> self$wk()
v <- x |> self$wv()
# rotate q,k to inject position information.
# cross q,k to calculate an attention score for each token pair.
scores <- rotate(q) %*% rotate(k)
# reweight the value tensor by the (softmax-normalized) scores
output <- softmax(scores) %*% v
self$wo(output) # one more trainable linear projection for good luck
}
Returning to the transpose()s and reshape()s, you can observe that
their purpose is to make it so that the attention calculations are
performed across n_heads independent subspaces, rather than in a
single larger space. The same reasoning drives this decision as that
driving usage of depthwise-separable convolutions in image models.
Empirically, for a fixed compute budget, factoring features into
independent subspaces performs better than doing the same core
operations in a single larger feature space. As with all things, there is
a balance to strike between n_heads (the number of subspaces) and
head_dim (the size of each subspace). The LLaMA authors have struck
the balance like this at the various model sizes:
lapply(c("7B", "13B", "30B", "65B"), (size) {
p <- read_json(weights_path("{size}/params.json"))
with(p, list(llama_size = size,
n_heads = n_heads,
head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()
# A tibble: 4 × 3
llama_size n_heads head_dim
<chr> <int> <int>
1 7B 32 128
2 13B 40 128
3 30B 52 128
4 65B 64 128
Next let’s turn our attention to the causal attention mask.
make_mask <- function(seqlen, dtype = k_floatx()) {
x <- tf$range(seqlen)
mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
tf$constant(-Inf, dtype = dtype),
tf$constant(0, dtype = dtype))
# broadcast over batch and heads dim
mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}
The mask is a strictly upper triangular matrix filled with -Inf
values. Adding the mask to the attention scores prevents the model from
being able to “look ahead” and see the attention score for a token
pairing it hasn’t seen yet at a particular position in the sequence.
This need for a mask is best thought of as a vestige from training,
an apparatus that the model needed to learn with and now can’t function without.
During training, gradients are calculated for predictions from all
token positions in a sequence, including positions where the correct
answer is right there, as the very next token in the same sequence. The mask
prevents the model from being able to cheat and look ahead into the future,
something it won’t be able to do once we’re running it for inference.
make_mask(seqlen = 5L)
tf.Tensor(
[[[[ 0. -inf -inf -inf -inf]
[ 0. 0. -inf -inf -inf]
[ 0. 0. 0. -inf -inf]
[ 0. 0. 0. 0. -inf]
[ 0. 0. 0. 0. 0.]]]], shape=(1, 1, 5, 5), dtype=float32)
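To see why adding -Inf to a score works as a mask, note what softmax()
does with it (a quick check added here for illustration):
tf$nn$softmax(tf$constant(c(0, 0, -Inf, -Inf))) # -> 0.5 0.5 0.0 0.0
The masked (future) positions receive exactly zero attention weight.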
Rotary Position Embedding
Next let’s turn our attention to apply_rotary_embedding(). This core
innovation was published by Su et al. (2022) in the paper titled
“RoFormer: Enhanced Transformer with Rotary Position Embedding”.
Some context:
- The bare Attention() mechanism doesn’t leave any possibility for a
token’s position in a sequence to affect the attention scores, since
only token-pairs are scored. Attention treats its input like a
bag-of-tokens.
- The position of a token in a sequence is clearly important, and the
attention layer should have access to that information.
- The absolute position of a token in a sequence is less important
than the relative position between tokens. (Especially so for long
sequences.)
Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:
Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding
Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embeddings
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.
In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.
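Before looking at the implementation, here is a small base-R illustration
of why rotation encodes relative position (a toy example added here, not
part of the model code): if we rotate two complex features by angles
proportional to their positions, the angle between them depends only on
the position gap.
rotate <- function(z, position, theta = 0.1) z * exp(1i * theta * position)
angle_between <- function(a, b) Arg(a * Conj(b))
z1 <- 1 + 0i; z2 <- 1 + 1i # two made-up token features
angle_between(rotate(z1, 3), rotate(z2, 5))   # positions 3 and 5 ...
angle_between(rotate(z1, 10), rotate(z2, 12)) # ... same angle as positions 10 and 12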
Here is the code:
apply_rotary_embedding <- function(x) {
c(., seqlen, ., head_size) %<-%
tf$unstack(tf$shape(x))
rotation_matrix <- compute_rotation_matrix(seqlen, head_size)
x %>%
view_as_complex() %>%
{ . * rotation_matrix } %>%
view_as_real()
}
compute_rotation_matrix <-
function(seqlen, feature_dim, theta = 10000) {
# `feature_dim` here is going to be attention$head_size
# `seqlen` is going to match the token sequence length.
t <- tf$range(seqlen, dtype = tf$float32)
freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
dtype = tf$float32)
tf_assert(tf$size(freqs) == feature_dim %/% 2)
freqs <- 1.0 / (theta ^ freqs)
# outer product; (seqlen, head_size/2)
freqs <- tf$einsum('a,b->ab', t, freqs)
rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))
# the positional embedding will be broadcast across batch and heads dim
rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
}
view_as_complex <- function(x) {
tf$complex(x[all_dims(), `::2`],
x[all_dims(), `2::2`])
}
view_as_real <- function(x) {
# xs = (..., f); xs2 = (..., f*2)
xs <- tf$shape(x)
xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
xs[length(xs), drop = FALSE] * 2L),
axis = 0L)
x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)
# (..., f, 2) -> (..., f*2)
tf$reshape(x2, xs2)
}
As you can see, to imagine the embedding features as existing in the
complex plane, we merely treat adjacent pairs of floats in the
underlying array as the real and imaginary part of a complex number. We
rotate the embeddings in the complex plane, then go back to imagining
the features as existing in the real plane. Again, the job of
interpreting the meaning of the features after rotation is left to the
model to learn.
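To make the pairing convention concrete, here is a quick sanity check (a
small illustration added here, using an arbitrary 8-feature tensor):
x <- tf$range(8L) |> tf$cast("float32") |> tf$reshape(shape(1, 1, 1, 8))
view_as_complex(x) # pairs (0,1), (2,3), (4,5), (6,7) become four complex numbers
all(view_as_real(view_as_complex(x)) == x) # and view_as_real() round-trips back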
We can quickly confirm that the rotary embeddings only rotate features
and don’t scale them:
near <- function (x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))
tf.Tensor(True, shape=(), dtype=bool)
There is one more trick to observe before moving on: because of some of
the mathematical properties of the rotation matrix, it’s possible to
avoid doing a full complex multiply operation and still arrive at the
same result. Also, since the rotation matrix never changes, it makes
sense to only compute it once and cache it, like so:
precomputed_rotation_matrix <- compute_rotation_matrix(
seqlen = 2048L, # LLaMA max seqlen
feature_dim = with(params, dim %/% n_heads) # head_size
)
apply_rotary_embedding_faster <- function(x) {
rotate_every_two <- function(x) {
x1 <- x[all_dims(), `::2`]
x2 <- x[all_dims(), `2::2`]
x_ <- tf$stack(list(-x2, x1), axis = -1L)
tf$reshape(x_, tf$shape(x))
}
repeat_each_twice <- function(x) {
tf$`repeat`(x, 2L, axis = -1L)
}
seqlen <- tf$shape(x)[2]
rot <- precomputed_rotation_matrix[, NA:seqlen, , ]
cos <- Re(rot) |> repeat_each_twice()
sin <- Im(rot) |> repeat_each_twice()
(x * cos) + (rotate_every_two(x) * sin)
}
rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
apply_rotary_embedding_faster(rand))
tf.Tensor(True, shape=(), dtype=bool)
apply_rotary_embedding <- apply_rotary_embedding_faster
Finally, note that the rotary positional embeddings are applied within
each Attention layer. This is different from the original Transformer
implementation, where a positional embedding was only added once at the
head of the model. Similar to residual connections, you can think of the
presence of these repeated injections of positional information as
relieving the remaining trainable layers from the burden of allocating
some of their weights to the task of “passing through” or “preserving”
the positional information for later layers.
Positional embeddings are a rich subject that also comes up in other
deep learning architectures, like denoising diffusion (Falbel and Keydana 2023),
so time spent understanding them better is time well
spent. For the purposes of this blog post we’ve covered the points
needed, and we’ll move on to tying all the pieces together. To go deeper
and develop a more mathematically informed understanding of RoPE, an
excellent starting point is the RoFormer paper (Su et al. 2022) cited above.
Tying it all together
With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward and apply_rotary_embedding) all covered,
it’s time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it’s just as easy to move over to using the Keras functional API at this
point.
layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)
# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")
x <- input |>
tok_embeddings() # instantiated earlier in the blog-post
for (block_id in seq_len0(params$n_layers)) {
x <- x |>
layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
attn_n_heads = params$n_heads,
norm_eps = params$norm_eps,
block_id = block_id)
}
# final output projection into logits of output tokens
x <- x |>
layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
layer_dense(
tokenizer$vocab_size(), use_bias = FALSE,
kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
)
# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
output <- x[, -1, ]
})
llama <- keras_model(input, output) %>%
compile(jit_compile = TRUE)
The input to the model is tokenized text and the output is the
(unnormalized) probabilities for each token in tokenizer$vocab_size()
being the next token in the sequence.
next_token_probs <- prompt %>%
tokenizer$tokenize() %>%
llama()
next_token_probs
tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00 1.3200411e+01 ... 4.8804146e-01
-1.3277926e+00 9.9985600e-03]], shape=(1, 32000), dtype=float32)
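If you want actual probabilities, you can softmax() the logits and peek
at the top candidates (a quick illustration added here; which tokens show
up depends on the weights):
probs <- tf$nn$softmax(next_token_probs, axis = -1L)
top5 <- tf$math$top_k(probs, k = 5L)
tokenizer$id_to_string(top5$indices[1, ]) # the five most likely next tokens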
Sampling strategies for selecting a token from the token logits is a
rich topic (also covered thoroughly in the Deep Learning with
R book), but this blog post is long enough
already. So for now, let’s just take the argmax().
sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")
(next_token <- sampler(next_token_probs))
tf.Tensor([304], shape=(1), dtype=int32)
tokenizer$detokenize(next_token) |> as.character()
[1] "to"
Let’s run it for a few tokens and let LLaMA finish the sentence:
prompt_tokens <- tokenizer$tokenize("The best way to attract bees")
for (i in 1:20) {
next_token_probs <- prompt_tokens |> llama()
next_token <- sampler(next_token_probs)
prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }
# end of sentence
if (as.logical(next_token == tokenizer$string_to_id(".")))
break
}
prompt_tokens |>
tokenizer$detokenize() |>
as.character() |>
strwrap(60) |> writeLines()
The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.
Wrapping up
In this blog post we’ve walked through the LLaMA architecture
implemented in R TensorFlow, including how to load pretrained weights,
and then run the model to generate a sentence. Note, much of the code in
this blog post is tailored for didactic purposes. While the
implementation of the LLaMA architecture covered in this blog post is
appropriate for training, there are a few modifications you’ll want to
make before doing a lot of text generation. Those include things like:
- In the Attention layer, caching the k and v tensors. Then,
after the first forward pass with the initial prompt, only feeding
the model the one new token from the sampler(), rather than
feeding the model all the tokens of the full prompt on each forward
pass.
- Only generating the causal mask make_mask() and rotary_matrix
slices once per forward pass, instead of within each Attention
call.
- Updating the TransformerBlock to be cache-aware and to pass
through the appropriate arguments to Attention().
- Wrapping all the additional book-keeping logic in a custom
TransformerDecoder() class.
The changes required to implement these optimizations for inference
balloon the code size and are mostly about book-keeping, so we won’t go
through them in this blog post. However, you can find a fuller
implementation of LLaMA in R TensorFlow, including a cache-aware
generate() method that only feeds the model one token at a time during
the main inference loop, (and compiles to XLA!),
here.
That’s all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!
Photo by Sébastien Goldberg on Unsplash