LLaMA in R with Keras and TensorFlow
OpenAI’s chatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities (and new worries). It seems we’re in the early stages of a
Cambrian explosion of LLMs and LLM-powered tools; it’s not yet clear how
LLMs will impact and influence our professional and personal lives, but
it seems clear that they will, in some way.
Since LLMs are here to stay, it’s worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist’s
toolbox, as common as an lm() function call.)
And what better way is there to learn than by doing? So with that
preamble, in this post we’ll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023) specifically, in TensorFlow and Keras, with
the goal being to develop understanding first, capability second.
Why LLaMA? With the sheer volume of LLM-related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there is a new model announced. Browsing some hubs of LLM
activity (HuggingFace, TFHub, reddit, HackerNews) muddies the waters even
more. How to pick a specific model?
Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of LLaMA,
a modern, foundational LLM made available to the public by Meta AI in
February 2023. On common benchmarks, LLaMA outperforms OpenAI’s GPT-3,
while being substantially smaller (though still large).
LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in
“Attention Is All You Need”,
published by Google (Vaswani et al. 2017). Four different sizes of
LLaMA have been released: 7 billion and 13 billion parameter models
trained on 1 trillion tokens, and 33 billion and 65 billion parameter
models trained on 1.4 trillion tokens. This is an enormous amount of
training data these models have seen: the largest 65B model has been
trained on approximately the “Chinchilla
compute-optimum” (Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are substantially
beyond that optimum. In this blog post we’ll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64 GB of RAM.
While not strictly necessary, to follow along locally you’ll probably
want to acquire the pre-trained LLaMA weights one way or another. Note, the
weights do come with their own license, which you can preview here.
So, without further ado, let’s get started.
Setup
First, we’ll want to install the required R and Python packages, and
configure a virtual environment:
remotes::install_github(c("rstudio/reticulate",
                          "rstudio/tensorflow",
                          "rstudio/keras"))
reticulate::virtualenv_create("./.venv", version = "3.10")
tensorflow::install_tensorflow(envname = "./.venv", version = "release")
With that out of the way, let’s load some packages and prepare our R
session:
library(purrr)
library(envir)
library(tensorflow)
library(tfautograph)
library(keras)
use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)
attach_eval({
  import_from(glue, glue)
  import_from(jsonlite, read_json)
  import_from(withr, with_dir, with_options)
  import_from(keras$layers, Dense)
  np <- reticulate::import("numpy", convert = FALSE)

  seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})
If you’ve acquired the pre-trained weights, it’ll be convenient to
convert them from the torch checkpoint format to something that’s more
framework agnostic (you only need to do this once, of course):
# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
  pretrained_weights <- torch$load("consolidated.00.pth",
                                   map_location = "cpu")
  for (name in names(pretrained_weights)) {
    filename <- sprintf("%s.npy", name)
    # convert each torch tensor to a numpy array and save it to disk
    array <- pretrained_weights[[name]]$numpy()
    np$save(filename, array)
    message(glue(
      "wrote: '{basename(filename)}' with shape: {array$shape}"))
  }
})
We’ll also define a helper function so we can avoid having to retype the
full path to our weights:
weights_path <- function(filename) normalizePath(file.path(
  "~/github/facebookresearch/llama/weights/LLaMA/",
  glue(filename, .envir = parent.frame())), mustWork = TRUE)
And load the model configuration parameters specific to the 7B LLaMA,
which we’ll use to build the model.
params <- read_json(weights_path("7B/params.json"))
str(params)
List of 6
$ dim : int 4096
$ multiple_of: int 256
$ n_heads : int 32
$ n_layers : int 32
$ norm_eps : num 1e-06
$ vocab_size : int -1
Tokenizer
The first component to LLaMA is the tokenizer, which converts text to a
sequence of integers. The LLaMA model uses the
SentencePiece tokenizer from
Google. SentencePiece is available as a TensorFlow graph operation
through tf_text.SentencepieceTokenizer,
and also as a Keras layer in keras_nlp.tokenizers.SentencepieceTokenizer.
By choice of a coin flip, we’ll use the lower-level tf_text
interface.
tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
  tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
  add_bos = TRUE, add_eos = FALSE
)
Let’s test it out with a prompt:
prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)
tf.Tensor([ 1 450 1900 982 304 13978 367 267], shape=(8), dtype=int32)
prompt |> tokenizer$tokenize() |> tokenizer$detokenize()
tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)
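Let’s define a show_tokens() helper function and play with the
tokenizer a little.
show_tokens <- function(what) {
  # reconstructed sketch (assumed body): accepts text or token ids,
  # and returns the token strings, named by their ids
  ids <- if (is.character(what)) as.integer(tokenizer$tokenize(what)) else as.integer(what)
  tokens <- map_chr(ids, \(id) as.character(tokenizer$detokenize(as_tensor(id, shape = c(1)))))
  setNames(tokens, ids)
}
show_tokens(prompt)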
1 450 1900 982 304 13978 367 267
"" "The" "best" "way" "to" "attract" "be" "es"
Note that “bees” is two tokens. Not every token corresponds to a word.
For example, one non-word token we can reliably expect to show up in a
tokenizer trained on a corpus of English text is “ing.” However, when the
“ing” token shows up will not always follow your intuitions, because
common words get their own token id, even if they can be decomposed into
multiple tokens.
show_tokens("ing")
1 2348
"" "ing"
show_tokens("working")
1 1985
"" "working"
show_tokens("flexing")
1 8525 292
"" "flex" "ing"
show_tokens("wonking")
1 2113 9292
"" "won" "king"
Another thing to note about the tokenizer is that each token sequence
starts with token id 1. This is a special beginning-of-sequence
token that we requested be added when we loaded the tokenizer with
add_bos = TRUE. There are two other such special tokens that we will
encounter later: an end-of-sequence special token with id 2, and an
unknown-token with id 0.
as.character(tokenizer$id_to_string(0L))
[1] "<unk>"
as.character(tokenizer$id_to_string(1L))
[1] "<s>"
as.character(tokenizer$id_to_string(2L))
[1] "</s>"
show_tokens(c(1, 0, 2))
1 0 2
"" " ⁇ " ""
Overall, there are 32,000 tokens.
as.integer(tokenizer$vocab_size())
[1] 32000
One last observation is that the more frequently encountered tokens are
assigned lower ids.
show_tokens(seq(50, len = 10))
50 51 52 53 54 55 56 57 58 59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
show_tokens(seq(100, len = 10))
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
show_tokens(seq(1000, len = 10))
1000 1001 1002 1003 1004 1005 1006 1007 1008 1009
"ied" "ER" "stat" "fig" "me" "von" "inter" "roid" "ater" "their"
show_tokens(seq(10000, len = 10))
10000 10001 10002 10003 10004 10005 10006 10007
"ång" "citep" "Ill" "rank" "sender" "beim" "рак" "compat"
10008 10009
"occurs" "diese"
show_tokens(seq(20000, len = 10))
20000 20001 20002 20003 20004 20005 20006 20007
"admit" "Comment" "стя" "Vien" "ці" "permut" "cgi" "crít"
20008 20009
"Console" "ctic"
show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
"ὀ" "げ" "べ" "边" "还" "黃" "왕" "收" "弘" "给"
Moving on, the next step after tokenization is embedding. An embedding
layer is effectively a dictionary lookup that converts an integer (token
id) to a 1-d float array. For this we can use the standard keras
Embedding layer.
tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)
tok_embeddings(3L) |> str()
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
prompt |> # "The best way to attract bees"
  tokenizer$tokenize() |>
  tok_embeddings() |>
  str()
<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
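To make the “dictionary lookup” description concrete, here is a minimal base-R sketch. It uses a made-up vocabulary of 5 tokens and 3 features per token (toy numbers, not LLaMA weights), and the names embedding_table and embed_tokens are hypothetical. It assumes 0-based token ids, as the tokenizer produces, so 1 is added before indexing the rows of an R matrix.

```r
# Toy embedding table: one row per token id (vocab of 5, embedding dim 3).
# All values here are invented for illustration.
set.seed(1)
embedding_table <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)

embed_tokens <- function(token_ids, table) {
  # token ids are 0-based, R rows are 1-based
  table[token_ids + 1L, , drop = FALSE]
}

token_ids <- c(1L, 4L, 4L, 0L)
embedded <- embed_tokens(token_ids, embedding_table)
dim(embedded)  # one row per token, one column per feature
```

Repeated token ids map to identical rows, which is all an embedding layer does before the rest of the model runs.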
TransformerBlock
Once it’s tokenized and embedded, the input then passes through the bulk
of the model, a sequence of repeating TransformerBlock layers. The 7B
model has 32 of these TransformerBlock layers, while the 65B model has
80 of them.
weights_path("7B/params.json") |> read_json() |> _$n_layers
[1] 32
weights_path("65B/params.json") |> read_json() |> _$n_layers
[1] 80
Here is what the transformer block looks like:
TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {

    # norm and attention
    x2 <- x %>%
      self$attention_norm() %>%
      self$attention()

    x <- x + x2 # add residual

    # norm and swiglu
    x2 <- x %>%
      self$feed_forward_norm() %>%
      self$feed_forward()

    x <- x + x2 # residual again

    x
  }
}
While there is not a lot of code, there are a lot of ideas packed in
there. This block forms the main trunk of the model, so it’s worth
taking the time to go through it slowly.
We implement the TransformerBlock as a subclassed
keras.layers.Layer. This gives us some niceties like the ability to
compose with other Keras layers, but these are mostly irrelevant to the
purpose of this blog post; we could just as easily implement this as,
for example, a vanilla R6 class. Our TransformerBlock class has two
methods: initialize, called when we first create the block, and
call, called when we run the forward pass of the block.
In initialize, we create 4 layers: an Attention layer, a
FeedForward layer, and 2 RMSNorm layers. We’ll take a close look at
each of these soon, but even before we do so, we can see how they fit
together by looking at the TransformerBlock$call() method.
The call method has a few simple ideas. In no particular order, the
first one to observe is the composition pattern of adding residuals.
x2 <- x |> ...
x <- x + x2 # add residual x to x2
This is a common pattern that helps with model training, and especially
with the vanishing gradient problem. It’s
a skip-connection in the otherwise linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
“pass-through” or “preserve” information in x, allowing the weights to
instead focus on learning transformations that are (in corporatese
vernacular) value-adding.
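As a small illustration of the residual pattern, the following base-R sketch uses a plain linear map as a stand-in for the learnable layers. The names layer and residual_block and all weights are hypothetical toy values, not part of the model code in this post.

```r
# A stand-in for a learnable layer: a plain linear map with weights W.
layer <- function(x, W) x %*% W

residual_block <- function(x, W) {
  x2 <- layer(x, W)
  x + x2  # add residual
}

x <- matrix(c(1, 2, 3, 4), nrow = 1)

# With all-zero weights the layer contributes nothing,
# yet x still reaches the output through the skip connection.
W_zero <- matrix(0, nrow = 4, ncol = 4)
out_zero <- residual_block(x, W_zero)

# With small weights, the output is x plus a small learned adjustment.
W_small <- diag(0.1, 4)
out_small <- residual_block(x, W_small)
```

Here out_zero equals x exactly, so the layer never has to learn an identity mapping just to keep x alive.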
The next composition pattern to note is the repeating usage of a
normalization layer:
x2 <- x |> norm() |> ...
x <- x + x2
There are many kinds of normalization layers, but to slightly
over-generalize, they can all be thought of as a stabilizer that helps
with training. Like their deep-learning cousins the regularizers, their
main function is to keep values passing through in a sensible range–in
the ball park of (-1, 1), typically. We’ll take a closer look at
RMSNorm soon.
Stripped of two tricks that are mostly there to help the model train,
residuals and normalization, the core of the TransformerBlock is just
this:
x |> attention() |> feed_forward()
In a moment we’ll see that feed_forward is a slightly fancier
variation of a conventional sequence of Dense layers. Before we get
there, we can safely skip ahead to distill the following intuition: a
TransformerBlock is basically an Attention layer followed by a few
(fancy) dense layers, with some simple composition patterns (tricks)
that help with training. Attention is the heart of the model: it’s the
most interesting, and also the most involved.
With the framing in place, let’s go through and take a closer look at
RMSNorm and FeedForward, and then, with the foundation in place, we’ll
turn our attention to Attention.
RMSNorm
RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local function that will load
    # the pretrained-weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <- if (is.null(block_id))
      "ones"
    else if (block_id >= 0) {
      \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)
    } else if (block_id == -1)
      # load weights for the final output normalization layer, which is not
      # part of a TransformerBlock
      \(...) weights_path("7B/norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>% # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>% # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}
RMSNorm() has a single trainable tensor w. In the forward pass, each
value in the input is multiplied by the reciprocal-root-mean-square of
all the values in the feature axis and by w. Certainly a mouthful, but
just a simple sequence of arithmetic transformations in the end,
designed for the express purpose of adjusting the range of values
passing through.
Let’s kick the tires on it:
norm <- RMSNorm()
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
norm(m)
tf.Tensor(
[[0. 1.4142132 ]
[0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)
norm(m*10)
tf.Tensor(
[[0. 1.4142137 ]
[0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)
norm(m*100)
tf.Tensor(
[[0. 1.4142137]
[0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
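The same arithmetic can be checked without TensorFlow. The sketch below is a base-R approximation of the forward pass for a 2-d input, with the hypothetical name rms_norm and a weight vector w of all ones (the untrained default). It works in double precision, so the last digits can differ slightly from the float32 output above.

```r
rms_norm <- function(x, w = rep(1, ncol(x)), eps = 1e-6) {
  # reciprocal root mean square along the feature axis, one value per row
  rrms <- 1 / sqrt(rowMeans(x^2) + eps)
  # x * rrms scales row i by rrms[i]; sweep() then scales column j by w[j]
  sweep(x * rrms, 2, w, `*`)
}

m <- matrix(c(0, 1,
              2, 3), nrow = 2)

out <- rms_norm(m)
out_scaled <- rms_norm(m * 100)
```

Scaling the input by 100 leaves the result essentially unchanged, which is the range-adjusting behaviour described above.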
FeedForward
Next up is FeedForward():
FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    if(!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if(is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)

    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }

}
FeedForward consists of three Dense layers. initialize does some
simple arithmetic, munging on the input value hidden_dim to ensure the
size is a performant multiple of 256, and build is mostly boilerplate
for creating the layers and loading the weights.
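The hidden_dim arithmetic is easy to follow with the 7B numbers plugged in. This base-R sketch repeats the three steps from initialize as a standalone function (the name ffn_hidden_dim is hypothetical), using dim = 4096 from params.json and the 4 * dim value that TransformerBlock passes in.

```r
ffn_hidden_dim <- function(hidden_dim, multiple_of = 256L) {
  hidden_dim <- as.integer(hidden_dim * (2 / 3))               # shrink to 2/3
  n_blocks <- (hidden_dim + multiple_of - 1L) %/% multiple_of  # ceiling division
  n_blocks * multiple_of                                       # round up to a multiple
}

# 7B model: dim = 4096, and TransformerBlock passes 4 * dim
ffn_hidden_dim(4L * 4096L)
# [1] 11008
```

So 16384 becomes 10922 after the 2/3 scaling, which is then rounded up to 43 * 256 = 11008.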
The novelty of FeedForward() is in the call() method, where rather
than composing the Dense layers in a conventional sequential model
with, say, ReLU activations in between and maybe some dropout, the
layers are composed to form a “SwiGLU” unit. The publication by Shazeer (2020)
of SwiGLU and other variations on GLU is an exemplar of the types
of explorations and improvements around the Transformer architecture
since its initial publication in 2017; a steady accretion of
enhancements that has brought us to today. The FeedForward$call() is
just a single SwiGLU followed by a linear projection. In its essence,
it’s a clever composition of three (learned) linear projections, an
element-wise multiplication, and a silu() activation function.
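Written out with plain matrices, the SwiGLU composition looks like the base-R sketch below. The function name swiglu_ffn, the toy sizes and the random weights are all assumptions for illustration; silu() is defined here as x * sigmoid(x), which is what tf$nn$silu computes.

```r
silu <- function(x) x / (1 + exp(-x))   # x * sigmoid(x)

swiglu_ffn <- function(x, w1, w2, w3) {
  gate <- silu(x %*% w1)     # non-linear gate
  up   <- x %*% w3           # plain linear projection
  (gate * up) %*% w2         # element-wise product, then project back
}

set.seed(42)
dim_model <- 4; hidden_dim <- 6   # toy sizes, not LLaMA's 4096 and 11008
w1 <- matrix(rnorm(dim_model * hidden_dim), dim_model, hidden_dim)
w3 <- matrix(rnorm(dim_model * hidden_dim), dim_model, hidden_dim)
w2 <- matrix(rnorm(hidden_dim * dim_model), hidden_dim, dim_model)

x <- matrix(rnorm(3 * dim_model), nrow = 3)  # 3 tokens
y <- swiglu_ffn(x, w1, w2, w3)
dim(y)  # same shape as x
```

The output has the same shape as the input, which is what lets the block add it back onto the residual stream.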
Perhaps the most surprising observation to make here is the relative
dearth of activation functions, or even non-linearities, not just in
FeedForward, but overall. The silu() in this feedforward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!
Attention
Finally, let’s turn our attention to Attention().
Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
    # scores (bsz, n_heads, seqlen, seqlen)
    # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v   # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape==input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}
Attention in LLaMA is similar but not identical to the Attention
described in the original Transformers
paper (and available as a keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we’ll
describe shortly. The additional novelty is balanced by the simplicity
from the fact that the layer is performing self-attention: we don’t need
to pass in different query, key, and value tensors (or reason about what
that means), since the same input serves all three roles. Note that the
conventional MultiHeadAttention() layer is covered quite thoroughly in
the 2nd Edition of Deep Learning with R,
including a full implementation of attention in base R.
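To develop an understanding of the mechanics in a layer like this, it’s
helpful to temporarily unsee some of the minutia that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
vital as they are), this is what’s left:
call <- function(x) {
  # pseudocode: rotate() and normalize_scores() stand in for the steps above
  q <- self$wq(x); k <- self$wk(x); v <- self$wv(x)  # three learned linear projections
  # rotate q,k to inject position information.
  # cross q,k to calculate an attention score for each token pair.
  scores <- normalize_scores(rotate(q) %*% rotate(k))  # scale, mask, softmax
  output <- scores %*% v  # adjust the value projection with the attention scores
  self$wo(output)         # one more learned linear projection
}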
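For readers who prefer something runnable, here is a single-head version of that core computation in base R, with the rotation and the causal mask deliberately left out. The names softmax_rows and attention and the random inputs are hypothetical; q, k and v are ordinary (seqlen, head_size) matrices.

```r
softmax_rows <- function(s) {
  e <- exp(s - apply(s, 1, max))   # subtract each row's max for stability
  e / rowSums(e)
}

attention <- function(q, k, v) {
  scores <- q %*% t(k) / sqrt(ncol(q))  # (seqlen, seqlen), scaled
  weights <- softmax_rows(scores)       # each row sums to 1
  weights %*% v                         # weighted average of value rows
}

set.seed(11)
seqlen <- 4; head_size <- 8
q <- matrix(rnorm(seqlen * head_size), seqlen)
k <- matrix(rnorm(seqlen * head_size), seqlen)
v <- matrix(rnorm(seqlen * head_size), seqlen)

out <- attention(q, k, v)
dim(out)  # (seqlen, head_size)
```

Each output row is a weighted average of the rows of v, with weights decided by how well that token’s query matches every key.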
Returning to the transpose()s and reshape()s, you can observe that
their purpose is to make it so that the attention calculations are
performed across n_heads independent subspaces, rather than in a
single larger space. The same reasoning drives this decision as that
driving usage of depthwise-separable convolutions in image models.
Empirically, for a fixed compute budget, factoring features into
independent subspaces performs better than doing the same core
operations in a single larger feature space. As with all things, there is
a balance to strike between n_heads (the number of subspaces) and
head_dim (the size of each subspace). The LLaMA authors have struck
the balance like this at the various model sizes:
lapply(c("7B", "13B", "30B", "65B"), \(size) {
  p <- read_json(weights_path("{size}/params.json"))
  with(p, list(llama_size = size,
               n_heads = n_heads,
               head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()
# A tibble: 4 × 3
llama_size n_heads head_dim
<chr> <int> <int>
1 7B 32 128
2 13B 40 128
3 30B 52 128
4 65B 64 128
Next, let’s turn our attention to the causal attention mask.
make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}
The mask is a strictly upper triangular matrix filled with -Inf
values. Adding the mask to the attention scores prevents the model from
being able to “look ahead” and see the attention score for a token
pairing it hasn’t seen yet at a particular position in the sequence.
This need for a mask is best thought of as a vestige from training,
an apparatus that the model needed to learn with and now can’t function without.
During training, gradients are calculated for predictions from all
token positions in a sequence, including predictions for tokens where the correct
answer is right there, as the very next token in the same sequence. The mask
prevents the model from being able to cheat and look ahead into the future,
something it won’t be able to do once we’re running it for inference.
make_mask(seqlen = 5L)
tf.Tensor(
[[[[ 0. -inf -inf -inf -inf]
[ 0. 0. -inf -inf -inf]
[ 0. 0. 0. -inf -inf]
[ 0. 0. 0. 0. -inf]
[ 0. 0. 0. 0. 0.]]]], shape=(1, 1, 5, 5), dtype=float32)
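To see what adding the mask does after the softmax, here is a base-R sketch with the hypothetical names make_mask_r and softmax_rows. It starts from all-zero attention scores (an assumption chosen so the effect of the mask is easy to read) and shows that each position ends up attending only to itself and earlier positions.

```r
make_mask_r <- function(seqlen) {
  idx <- seq_len(seqlen)
  # -Inf strictly above the diagonal (column index > row index), 0 elsewhere
  ifelse(outer(idx, idx, `<`), -Inf, 0)
}

softmax_rows <- function(s) {
  e <- exp(s - apply(s, 1, max))  # exp(-Inf) is 0, so masked entries vanish
  e / rowSums(e)
}

scores <- matrix(0, 4, 4)                         # toy scores, all equal
weights <- softmax_rows(scores + make_mask_r(4))
weights
```

Row i of weights spreads its attention evenly over the first i positions and puts exactly zero weight on everything after it.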
Rotary Position Embedding
Next, let’s turn our attention to apply_rotary_embedding(). This core
innovation was published by Su et al. (2022) in the paper titled
“RoFormer: Enhanced Transformer with Rotary Position Embedding”.
Some context:
- The bare Attention() mechanism doesn’t leave any possibility for a
  token’s position in a sequence to affect the attention scores, since
  only token-pairs are scored. Attention treats its input like a
  bag-of-tokens.
- The position of a token in a sequence is clearly important, and the
  attention layer should have access to that information.
- The absolute position of a token in a sequence is less important
  than the relative position between tokens. (Especially so for long
  sequences.)
Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:
“Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding”
Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embedding
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.
In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.
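The invariance claim can be checked numerically with base R’s built-in complex numbers. In this sketch the names rope_rotate and rope_score, the toy head size of 8, and the random q and k are all assumptions; the frequencies follow the same 1 / theta^(0, 1/n, 2/n, ...) schedule with theta = 10000.

```r
# rotate complex features z, sitting at position pos, by pos * freqs radians
rope_rotate <- function(z, pos, freqs) z * exp(1i * pos * freqs)

# score between a rotated query at position m and a rotated key at position n:
# the real part of the complex inner product equals the ordinary dot product
# of the underlying (real, imaginary) pairs
rope_score <- function(q, k, m, n, freqs) {
  Re(sum(rope_rotate(q, m, freqs) * Conj(rope_rotate(k, n, freqs))))
}

head_size <- 8
n_pairs <- head_size / 2
freqs <- 1 / (10000 ^ seq(0, by = 1 / n_pairs, length.out = n_pairs))

set.seed(7)
q <- complex(real = rnorm(n_pairs), imaginary = rnorm(n_pairs))
k <- complex(real = rnorm(n_pairs), imaginary = rnorm(n_pairs))

s_near <- rope_score(q, k, m = 3,   n = 1,   freqs)
s_far  <- rope_score(q, k, m = 103, n = 101, freqs)  # same offset, shifted by 100
all.equal(s_near, s_far)
```

Both pairs of positions are 2 tokens apart, so the two scores agree up to floating-point error even though the absolute positions differ by 100.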
Here is the code:
apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()

}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
  }

view_as_complex <- function(x) {
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f); xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}
As you can see, to imagine the embedding features as existing in the
complex plane, we merely treat adjacent pairs of floats in the
underlying array as the real and imaginary part of a complex number. We
rotate the embeddings in the complex plane, then go back to imagining
the features as existing in the real plane. Again, the job of
interpreting the meaning of the features after rotation is left to the
model to learn.
We can quickly confirm that the rotary embeddings only rotate features
and don’t scale them:
near <- function (x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))
tf.Tensor(True, shape=(), dtype=bool)
There is one more trick to observe before moving on: because of some of
the mathematical properties of the rotation matrix, it’s possible to
avoid doing a full complex multiply operation and still arrive at the
same result. Also, since the rotation matrix never changes, it makes
sense to only compute it once and cache it, like so:
precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L, # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads) # head_size
)

apply_rotary_embedding_faster <- function(x) {

  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}

rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
    apply_rotary_embedding_faster(rand))
tf.Tensor(True, shape=(), dtype=bool)
apply_rotary_embedding <- apply_rotary_embedding_faster
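The identity behind the faster version can be verified on a tiny vector in base R. This sketch uses hypothetical toy values: four features, treated as two adjacent (real, imaginary) pairs, and one made-up rotation angle per pair. It computes the rotation once with a complex multiply and once with only real arithmetic.

```r
x <- c(0.5, -1.2, 2.0, 0.3)   # two adjacent (real, imaginary) pairs
angle <- c(0.4, 1.3)          # one rotation angle per pair (toy values)

odd  <- c(TRUE, FALSE)        # positions 1, 3, ... (the real parts)
even <- c(FALSE, TRUE)        # positions 2, 4, ... (the imaginary parts)

# complex route: view as complex, multiply by exp(i * angle), view as real
z <- complex(real = x[odd], imaginary = x[even]) * exp(1i * angle)
via_complex <- as.vector(rbind(Re(z), Im(z)))   # re-interleave the pairs

# real-only route: x * cos + rotate_every_two(x) * sin
rotate_every_two <- function(x) as.vector(rbind(-x[even], x[odd]))
via_real <- x * rep(cos(angle), each = 2) +
  rotate_every_two(x) * rep(sin(angle), each = 2)

all.equal(via_complex, via_real)
```

The two routes agree because (a + ib)(cos t + i sin t) expands to (a cos t - b sin t) + i(b cos t + a sin t), which is exactly what the real-only expression computes pair by pair.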
Finally, note that the rotary positional embeddings are applied within
each Attention layer. This is different from the original Transformer
implementation, where a positional embedding was only added once at the
head of the model. Similar to residual connections, you can think of the
presence of these repeated injections of positional information as
relieving the remaining trainable layers from the burden of allocating
some of their weights to the task of “passing through” or “preserving”
the positional information for later layers.
Positional embeddings are a rich subject that also comes up in other
deep learning architectures, like denoising diffusion (Falbel and Keydana 2023),
so time spent understanding them better is time well
spent. For the purposes of this blog post we’ve covered the points
needed and we’ll move on to tying all pieces together. To go deeper and
develop a more mathematically informed understanding of RoPE, two excellent
starting points are:
Tying it all together
With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward, and apply_rotary_embedding) all covered,
it’s time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it’s just as easy to move over to using the Keras functional API at this
point.
layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings() # instantiated earlier in the blog-post

for(block_id in seq_len0(params$n_layers)) {
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)
}

# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)
The input to the model is tokenized text and the output is the
(unnormalized) probabilities for each token in tokenizer$vocab_size()
being the next token in the sequence.
next_token_probs <- prompt %>%
  tokenizer$tokenize() %>%
  llama()

next_token_probs
tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00 1.3200411e+01 ... 4.8804146e-01
-1.3277926e+00 9.9985600e-03]], shape=(1, 32000), dtype=float32)
Sampling strategies for selecting a token from the token logits is a
rich topic (also covered thoroughly in the Deep Learning with
R book), but this blog post is long enough
already. So for now, let’s just take the argmax().
sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")

(next_token <- sampler(next_token_probs))
tf.Tensor([304], shape=(1), dtype=int32)
tokenizer$detokenize(next_token) |> as.character()
[1] "to"
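For a feel of what “unnormalized probabilities” and argmax sampling mean in practice, here is a base-R sketch on a made-up 6-token vocabulary. The logits are invented, and temperature_sampler is shown only as one common alternative to argmax, not something this post’s model code uses.

```r
softmax <- function(logits) {
  e <- exp(logits - max(logits))  # subtract the max for numerical stability
  e / sum(e)
}

# greedy: always pick the highest logit (returned as a 0-based token id)
greedy_sampler <- function(logits) which.max(logits) - 1L

# temperature sampling: draw from the softmax distribution instead
temperature_sampler <- function(logits, temperature = 1) {
  probs <- softmax(logits / temperature)
  sample.int(length(probs), size = 1L, prob = probs) - 1L
}

logits <- c(-2.45, -3.45, 13.2, 0.49, -1.33, 0.01)  # toy logits
greedy_sampler(logits)
# [1] 2
```

Because softmax preserves ordering, taking the argmax of the raw logits picks the same token as taking the argmax of the normalized probabilities.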
Let’s run it for a few tokens and let LLaMA finish the sentence:
prompt_tokens <- tokenizer$tokenize("The best way to attract bees")

for (i in 1:20) {

  next_token_probs <- prompt_tokens |> llama()
  next_token <- sampler(next_token_probs)

  prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }

  # end of sentence
  if (as.logical(next_token == tokenizer$string_to_id(".")))
    break
}

prompt_tokens |>
  tokenizer$detokenize() |>
  as.character() |>
  strwrap(60) |> writeLines()
The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.
Wrapping up
In this blog post we’ve walked through the LLaMA architecture
implemented in R TensorFlow, including how to load pretrained weights,
and then run the model to generate a sentence. Note, much of the code in
this blog post is tailored for didactic purposes. While the
implementation of the LLaMA architecture covered in this blog post is
appropriate for training, there are a few modifications you’ll want to
make before doing a lot of text generation. Those include things like:
- In the Attention layer, caching the k and v tensors. Then,
  after the first forward pass with the initial prompt, only feeding
  the model the one new token from the sampler(), rather than
  feeding the model all the tokens of the full prompt on each forward
  pass.
- Only generating the causal mask make_mask() and rotary_matrix
  slices once per forward pass, instead of within each Attention
  call.
- Updating the TransformerBlock to be cache-aware and to pass
  through the appropriate arguments to Attention().
- Wrapping all the additional book-keeping logic in a custom
  TransformerDecoder() class.
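The first of these optimizations, caching k and v, can be illustrated in base R. The sketch below is a simplified single-head example with no rotation and no learned projections; the names causal_attention and attend_with_cache and the random inputs are hypothetical. It checks that attending with a growing cache, one token at a time, matches recomputing causal attention over the full sequence.

```r
softmax <- function(s) { e <- exp(s - max(s)); e / sum(e) }

# full causal self-attention: position i only sees positions 1..i
causal_attention <- function(q, k, v) {
  n <- nrow(q)
  out <- matrix(0, n, ncol(v))
  for (i in seq_len(n)) {
    w <- softmax(as.vector(k[1:i, , drop = FALSE] %*% q[i, ]) / sqrt(ncol(q)))
    out[i, ] <- w %*% v[1:i, , drop = FALSE]
  }
  out
}

# incremental step: only the newest token's query, plus the cached k and v
attend_with_cache <- function(q_new, cache) {
  w <- softmax(as.vector(cache$k %*% q_new) / sqrt(length(q_new)))
  as.vector(w %*% cache$v)
}

set.seed(3)
q <- matrix(rnorm(5 * 4), 5)
k <- matrix(rnorm(5 * 4), 5)
v <- matrix(rnorm(5 * 4), 5)

full <- causal_attention(q, k, v)

cache <- list(k = NULL, v = NULL)
incremental <- matrix(0, 5, 4)
for (i in 1:5) {
  cache$k <- rbind(cache$k, k[i, ])   # append this token's key
  cache$v <- rbind(cache$v, v[i, ])   # append this token's value
  incremental[i, ] <- attend_with_cache(q[i, ], cache)
}

all.equal(full, incremental)
```

With the cache, each generation step does work proportional to the sequence length instead of recomputing every earlier position, which is where the savings in a cache-aware generate() loop come from.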
The changes required to implement these optimizations for inference
balloon the code size and are mostly about book-keeping, so we won’t go
through them in this blog post. However, you can find a fuller
implementation of LLaMA in R TensorFlow, including a cache-aware
generate() method that only feeds the model one token at a time during
the main inference loop (and compiles to XLA!),
here.
That’s all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!
Photo by Sébastien Goldberg on Unsplash