NumPy-style broadcasting for R TensorFlow users
We develop, train, and deploy TensorFlow models from R. But that doesn't mean we don't make use of documentation, blog posts, and examples written in Python. We look up specific functionality in the official TensorFlow API docs; we get inspiration from other people's code.
Depending on how comfortable you are with Python, there's a problem. For example: You're supposed to know how broadcasting works. And perhaps, you'd say you're vaguely familiar with it: So when arrays have different shapes, some elements get duplicated until their shapes match and ... and isn't R vectorized anyway?
While such a global notion may work in general, like when skimming a blog post, it's not enough to understand, say, examples in the TensorFlow API docs. In this post, we'll try to arrive at a more exact understanding, and check it on concrete examples.
Speaking of examples, here are two motivating ones.
Broadcasting in action
The first one uses TensorFlow's matmul to multiply two tensors. Would you like to guess the result – not the numbers, but how it comes about in general? Does this even run without error – shouldn't matrices be two-dimensional (rank-2 tensors, in TensorFlow speak)?
a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a
# tf.Tensor(
# [[[ 1. 2. 3.]
# [ 4. 5. 6.]]
#
# [[ 7. 8. 9.]
# [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)
b <- tf$constant(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b
# tf.Tensor(
# [[[101. 102.]
# [103. 104.]
# [105. 106.]]], shape=(1, 3, 2), dtype=float64)
c <- tf$matmul(a, b)
Second, here is a "real example" from a TensorFlow Probability (TFP) GitHub issue. (Translated to R, but keeping the semantics.)
In TFP, we can have batches of distributions. That, per se, is not surprising. But look at this:
library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Normal("Normal", batch_shape=[2, 2], event_shape=[], dtype=float64)
We create a batch of four normal distributions: each with a different scale (1.5, 2.5, 3.5, 4.5). But wait: there are only two location parameters given. So what are their means, respectively?
Fortunately, TFP developers Brian Patton and Chris Suter explained how it works: TFP actually does broadcasting – with distributions – just like with tensors!
We'll get back to both examples at the end of this post. Our main focus will be to explain broadcasting as done in NumPy, as NumPy-style broadcasting is what numerous other frameworks have adopted (e.g., TensorFlow).
Before that though, let's quickly review a few basics about NumPy arrays: how to index or slice them (indexing generally referring to single-element extraction, while slicing would yield – well – slices containing several elements); how to parse their shapes; some terminology and related background.
Though not complicated per se, these are the kinds of things that can be confusing to infrequent Python users; yet they're often a prerequisite to successfully applying Python documentation.
Stated upfront, we'll really restrict ourselves to the basics here; for example, we won't touch advanced indexing which – just like lots more – can be looked up in detail in the NumPy documentation.
A few facts about NumPy
Basic slicing
For simplicity, we'll use the terms indexing and slicing more or less synonymously from now on. The basic device here is a slice, namely, a start:stop structure indicating, for a single dimension, which range of elements to include in the selection.
In contrast to R, Python indexing is zero-based, and the end index is exclusive:
import numpy as np
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[1:7]
# array([1, 2, 3, 4, 5, 6])
Minus, to R users, is a false friend; it means we start counting from the end (the last element being -1):
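For instance, with the x defined above (a minimal illustration; as with positive indices, the end of a slice is exclusive):
x[-1]
# 9

x[-3:-1]
# array([7, 8])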
Leaving out start (stop, resp.) selects all elements from the start (until the end).
This may feel so convenient that Python users might miss it in R:
x[5:]
# array([5, 6, 7, 8, 9])

x[:7]
# array([0, 1, 2, 3, 4, 5, 6])
Just to make a point about the syntax, we could leave out both the start and the stop indices, in this one-dimensional case effectively resulting in a no-op:
x[:]
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Going on to two dimensions – without commenting on array creation just yet –, we can immediately apply the "colon trick" here too. This will select the second row with all its columns:
x = np.array([[1, 2], [3, 4], [5, 6]])
x
# array([[1, 2],
#        [3, 4],
#        [5, 6]])

x[1, :]
# array([3, 4])
While this, arguably, is the easiest way to achieve that result and thus, the way you'd write it yourself, it's good to know that there are two alternative ways that do the same:
x[1]
# array([3, 4])

x[1, ]
# array([3, 4])
While the second one sure looks a bit like R, the mechanism is different. Technically, these start:stop things are parts of a Python tuple – that list-like, but immutable data structure that can be written with or without parentheses, e.g., 1,2 or (1,2) – and whenever we have more dimensions in the array than elements in the tuple, NumPy will assume we meant : for that dimension: just select everything.
We can see that moving on to three dimensions. Here is a 2 x 3 x 1-dimensional array:
x = np.array([[[1], [2], [3]], [[4], [5], [6]]])
x
# array([[[1],
#         [2],
#         [3]],
#
#        [[4],
#         [5],
#         [6]]])

x.shape
# (2, 3, 1)
In R, this would throw an error, while in Python it works:
x[0, ]
# array([[1],
#        [2],
#        [3]])
In such a case, for enhanced readability we could instead use the so-called Ellipsis, explicitly asking Python to "use up all dimensions required to make this work":
x[0, ...]
# array([[1],
#        [2],
#        [3]])
We stop here with our selection of essential (yet possibly confusing, to infrequent Python users) NumPy indexing features; speaking of "possibly confusing" though, here are a few remarks about array creation.
Syntax for array creation
Creating a higher-dimensional NumPy array is not that hard – depending on how you do it. The trick is to use reshape to tell NumPy exactly what shape you want. For example, to create an array of all zeros, of dimensions 4 x 3 x 2:
np.zeros(24).reshape(4, 3, 2)
But we also want to understand what others might write. And then, you might see things like these:
c1 = np.array([[[0, 0, 0]]])
c2 = np.array([[[0], [0], [0]]])
c3 = np.array([[[0]], [[0]], [[0]]])
These are all 3-dimensional, and all have three elements, so their shapes must be 1 x 1 x 3, 1 x 3 x 1, and 3 x 1 x 1, in some order. Of course, shape is there to tell us:
c1.shape # (1, 1, 3)
c2.shape # (1, 3, 1)
c3.shape # (3, 1, 1)
but we’d like to be able to “parse” internally without executing the code. One way to think about it would be processing the brackets like a state machine, every opening bracket moving one axis to the right and every closing bracket moving back left by one axis. Let us know if you can think of other – possibly more helpful – mnemonics!
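If it helps, the bracket-counting mnemonic can also be spelled out as a tiny helper function – purely illustrative, defined just for this post (it descends into the first element at every nesting level):
def nested_shape(x):
    # each opening bracket adds one axis; follow the first element down
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)

nested_shape([[[0, 0, 0]]])          # (1, 1, 3)
nested_shape([[[0]], [[0]], [[0]]])  # (3, 1, 1)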
Above, we purposely used "left" and "right" referring to the array axes; "out there" though, you'll also hear "outmost" and "innermost". Which, then, is which?
A bit of terminology
In common Python (TensorFlow, for example) usage, when talking of an array shape like (2, 6, 7), outmost is left and innermost is right. Why?
Let's take a simpler, two-dimensional example of shape (2, 3).
a = np.array([[1, 2, 3], [4, 5, 6]])
a
# array([[1, 2, 3],
#        [4, 5, 6]])
Computer memory is conceptually one-dimensional, a sequence of locations; so when we create arrays in a high-level programming language, their contents are effectively “flattened” into a vector. That flattening could occur “by row” (row-major, C-style, the default in NumPy), resulting in the above array ending up like this
1 2 3 4 5 6
or “by column” (column-major, Fortran-style, the ordering used in R), yielding
1 4 2 5 3 6
for the above example.
Now if we see “outmost” as the axis whose index varies the least often, and “innermost” as the one that changes most quickly, in row-major ordering the left axis is “outer”, and the right one is “inner”.
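To see the two orderings side by side, we can ask NumPy to flatten the above a explicitly (a quick illustrative check; flatten accepts an order argument):
a.flatten(order="C")  # row-major, the NumPy default
# array([1, 2, 3, 4, 5, 6])

a.flatten(order="F")  # column-major, the way R stores matrices
# array([1, 4, 2, 5, 3, 6])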
Just as a (cool!) aside, NumPy arrays have an attribute called strides that stores how many bytes have to be traversed, for each axis, to arrive at its next element. For our above examples:
c1 = np.array([[[0, 0, 0]]])
c1.shape    # (1, 1, 3)
c1.strides  # (24, 24, 8)

c2 = np.array([[[0], [0], [0]]])
c2.shape    # (1, 3, 1)
c2.strides  # (24, 8, 8)

c3 = np.array([[[0]], [[0]], [[0]]])
c3.shape    # (3, 1, 1)
c3.strides  # (8, 8, 8)
For array c3, every element is on its own on the outmost level; so for axis 0, to jump from one element to the next, it's just 8 bytes. For c2 and c1 though, everything is "squished" in the first element of axis 0 (there is just a single element there). So if we wanted to jump to another, nonexisting-as-yet, outmost item, it'd take us 3 * 8 = 24 bytes.
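In fact, for a row-major (C-ordered) array, each stride is just the product of the sizes of all axes to its right, times the item size – 8 bytes for these int64 arrays. A tiny helper, again written only for illustration (it is not a NumPy function), makes that explicit:
def c_strides(shape, itemsize=8):
    # stride per axis = product of all sizes to its right * itemsize
    strides = []
    acc = itemsize
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))

c_strides((1, 3, 1))  # (24, 8, 8) -- matches c2.strides
c_strides((3, 1, 1))  # (8, 8, 8)  -- matches c3.strides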
At this point, we’re ready to talk about broadcasting. We first stay with NumPy and then, examine some TensorFlow examples.
NumPy Broadcasting
What happens if we add a scalar to an array? This won’t be surprising for R users:
a = np.array([1, 2, 3])
b = 1

a + b
# array([2, 3, 4])
Technically, this is already broadcasting in action; b is virtually (not physically!) expanded to shape (3,) in order to match the shape of a.
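We can even make that virtual expansion visible, using np.broadcast_to (a plain NumPy function that returns a read-only, stretched view): the stride of 0 along the broadcast axis shows that no data gets copied:
b_exp = np.broadcast_to(b, (3,))
b_exp
# array([1, 1, 1])

b_exp.strides
# (0,)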
How about two arrays, one of shape (2, 3) – two rows, three columns – the other one-dimensional, of shape (3,)?
a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])

a + b
# array([[2, 4, 6],
#        [5, 7, 9]])
The one-dimensional array gets added to both rows. If a were length-two instead, would it get added to every column?
a = np.array([1, 2])
b = np.array([[1, 2, 3], [4, 5, 6]])

a + b
# ValueError: operands could not be broadcast together with shapes (2,) (2,3)
So now it is time for the broadcasting rule. For broadcasting (virtual expansion) to happen, the following is required.
1. We align array shapes, starting from the right.

   # array 1, shape:     8  1  6  1
   # array 2, shape:        7  1  5

2. Starting to look from the right, the sizes along aligned axes either have to match exactly, or one of them has to be 1: in which case the latter is broadcast to the one not equal to 1.

3. If, on the left, one of the arrays has an additional axis (or more than one), the other is virtually expanded to have a 1 in that place, in which case broadcasting will happen as stated in (2).
Stated like this, it probably sounds incredibly simple. Maybe it is, and it only seems complicated because it presupposes correct parsing of array shapes (which as shown above, can be confusing)?
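To spell the rule out in code, here is a toy helper – written only for illustration, and not how NumPy implements broadcasting internally (NumPy itself provides np.broadcast_shapes for such queries):
def broadcast_shape(s1, s2):
    # rule (3): left-pad the shorter shape with 1s
    n = max(len(s1), len(s2))
    s1 = (1,) * (n - len(s1)) + tuple(s1)
    s2 = (1,) * (n - len(s2)) + tuple(s2)
    result = []
    for d1, d2 in zip(s1, s2):
        # rule (2): sizes must match exactly, or one of them must be 1
        if d1 == d2 or d1 == 1 or d2 == 1:
            result.append(max(d1, d2))
        else:
            raise ValueError("cannot broadcast sizes %d and %d" % (d1, d2))
    return tuple(result)

broadcast_shape((8, 1, 6, 1), (7, 1, 5))  # (8, 7, 6, 5)
broadcast_shape((2, 3), (3,))             # (2, 3)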
Here again is a quick example to test our understanding:
a = np.zeros([2, 3])  # shape (2, 3)
b = np.zeros([2])     # shape (2,)
c = np.zeros([3])     # shape (3,)

a + b  # error

a + c
# array([[0., 0., 0.],
#        [0., 0., 0.]])
All in accord with the rules. Maybe there’s something else that makes it confusing?
From linear algebra, we are used to thinking in terms of column vectors (often seen as the default) and row vectors (accordingly, seen as their transposes). What, then, is an array of shape – as we've seen a few times by now – (2,)? Really it's neither, it's just some one-dimensional array structure. We can create row vectors and column vectors though, in the sense of 1 x n and n x 1 matrices, by explicitly adding a second axis. Any of these would create a column vector:
# start with the above "non-vector"
c = np.array([0, 0])
c.shape
# (2,)

# way 1: reshape
c.reshape(2, 1).shape
# (2, 1)

# np.newaxis inserts new axis
c[:, np.newaxis].shape
# (2, 1)

# None does the same
c[:, None].shape
# (2, 1)

# or construct directly as (2, 1), paying attention to the parentheses...
c = np.array([[0], [0]])
c.shape
# (2, 1)
And analogously for row vectors. Now these “more explicit”, to a human reader, shapes should make it easier to assess where broadcasting will work, and where it won’t.
c = np.array([[0], [0]])
c.shape
# (2, 1)

a = np.zeros([2, 3])
a.shape
# (2, 3)

a + c
# array([[0., 0., 0.],
#        [0., 0., 0.]])

a = np.zeros([3, 2])
a.shape
# (3, 2)

a + c
# ValueError: operands could not be broadcast together with shapes (3,2) (2,1)
Before we jump to TensorFlow, let’s see a simple practical application: computing an outer product.
a = np.array([0.0, 10.0, 20.0, 30.0])
a.shape
# (4,)

b = np.array([1.0, 2.0, 3.0])
b.shape
# (3,)

a[:, np.newaxis] * b
# array([[ 0.,  0.,  0.],
#        [10., 20., 30.],
#        [20., 40., 60.],
#        [30., 60., 90.]])
TensorFlow
If by now, you're feeling less than enthusiastic about hearing a detailed exposition of how TensorFlow broadcasting differs from NumPy's, there is good news: Basically, the rules are the same. However, when matrix operations work on batches – as in the case of matmul and friends – things may still get complicated; the best advice here probably is to carefully read the documentation (and as always, try things out).
Before revisiting our introductory matmul example, we quickly check that really, things work just like in NumPy. Thanks to the tensorflow R package, there is no reason to do this in Python; so at this point, we switch to R – attention, it's 1-based indexing from here.
First check – (4, 1) added to (4,) should yield (4, 4):
a <- tf$ones(shape = c(4L, 1L))
a
# tf.Tensor(
# [[1.]
# [1.]
# [1.]
# [1.]], shape=(4, 1), dtype=float32)
b <- tf$constant(c(1, 2, 3, 4))
b
# tf.Tensor([1. 2. 3. 4.], shape=(4,), dtype=float32)
a + b
# tf.Tensor(
# [[2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]
# [2. 3. 4. 5.]], shape=(4, 4), dtype=float32)
And second, when we add tensors of shapes (3, 3) and (3,), the 1-d tensor should get added to every row (not every column):
a <- tf$constant(matrix(1:9, ncol = 3, byrow = TRUE), dtype = tf$float32)
a
# tf.Tensor(
# [[1. 2. 3.]
# [4. 5. 6.]
# [7. 8. 9.]], shape=(3, 3), dtype=float32)
b <- tf$constant(c(100, 200, 300))
b
# tf.Tensor([100. 200. 300.], shape=(3,), dtype=float32)
a + b
# tf.Tensor(
# [[101. 202. 303.]
# [104. 205. 306.]
# [107. 208. 309.]], shape=(3, 3), dtype=float32)
Now back to the initial matmul example.
Back to the puzzles
The documentation for matmul says,
The inputs must, following any transpositions, be tensors of rank >= 2 where the inner 2 dimensions specify valid matrix multiplication dimensions, and any further outer dimensions specify matching batch size.
So here (see the code just below), the inner two dimensions look good – (2, 3) and (3, 2) – while the one (one and only, in this case) batch dimension shows mismatching values 2 and 1, respectively.
A case for broadcasting, thus: Both "batches" of a get matrix-multiplied with b.
a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a
# tf.Tensor(
# [[[ 1. 2. 3.]
# [ 4. 5. 6.]]
#
# [[ 7. 8. 9.]
# [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)
b <- tf$constant(keras::array_reshape(101:106, dim = c(1, 3, 2)))
b
# tf.Tensor(
# [[[101. 102.]
# [103. 104.]
# [105. 106.]]], shape=(1, 3, 2), dtype=float64)
c <- tf$matmul(a, b)
c
# tf.Tensor(
# [[[ 622. 628.]
# [1549. 1564.]]
#
# [[2476. 2500.]
# [3403. 3436.]]], shape=(2, 2, 2), dtype=float64)
Let's quickly check that this is really what happens, by multiplying both batches individually:
tf$matmul(a[1, , ], b)
# tf.Tensor(
# [[[ 622. 628.]
# [1549. 1564.]]], shape=(1, 2, 2), dtype=float64)
tf$matmul(a[2, , ], b)
# tf.Tensor(
# [[[2476. 2500.]
# [3403. 3436.]]], shape=(1, 2, 2), dtype=float64)
Is it too weird to wonder if broadcasting would also happen for matrix dimensions? E.g., could we try matmul-ing tensors of shapes (2, 4, 1) and (2, 3, 1), where the 4 x 1 matrix would be broadcast to 4 x 3? – A quick test shows that no.
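Such a quick test could look like the sketch below (illustrative only; we don't reproduce the exact error message here). The point is simply that tf$matmul complains instead of broadcasting the matrix dimensions:
a <- tf$ones(shape = c(2L, 4L, 1L))
b <- tf$ones(shape = c(2L, 3L, 1L))
# tf$matmul(a, b)
# errors: the inner matrix dimensions (4 x 1 vs. 3 x 1) do not match and are not broadcast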
To see how, when dealing with TensorFlow operations, it really pays off to overcome one's initial reluctance and actually consult the documentation, let's try another one.
In the documentation for matvec, we are told:
Multiplies matrix a by vector b, producing a * b.
The matrix a must, following any transpositions, be a tensor of rank >= 2, with shape(a)[-1] == shape(b)[-1], and shape(a)[:-2] able to broadcast with shape(b)[:-1].
In our understanding, given input tensors of shapes (2, 2, 3) and (2, 3), matvec should perform two matrix-vector multiplications: once for each batch, as indexed by each input's leftmost dimension. Let's check this – so far, there is no broadcasting involved:
# two matrices
a <- tf$constant(keras::array_reshape(1:12, dim = c(2, 2, 3)))
a
# tf.Tensor(
# [[[ 1. 2. 3.]
# [ 4. 5. 6.]]
#
# [[ 7. 8. 9.]
# [10. 11. 12.]]], shape=(2, 2, 3), dtype=float64)
b = tf$constant(keras::array_reshape(101:106, dim = c(2, 3)))
b
# tf.Tensor(
# [[101. 102. 103.]
# [104. 105. 106.]], shape=(2, 3), dtype=float64)
c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
# [2522. 3467.]], shape=(2, 2), dtype=float64)
Doublechecking, we manually multiply the corresponding matrices and vectors, and get:
tf$linalg$matvec(a[1, , ], b[1, ])
# tf.Tensor([ 614. 1532.], shape=(2,), dtype=float64)
tf$linalg$matvec(a[2, , ], b[2, ])
# tf.Tensor([2522. 3467.], shape=(2,), dtype=float64)
The same. Now, do we see broadcasting if b has just a single batch?
b = tf$constant(keras::array_reshape(101:103, dim = c(1, 3)))
b
# tf.Tensor([[101. 102. 103.]], shape=(1, 3), dtype=float64)
c <- tf$linalg$matvec(a, b)
c
# tf.Tensor(
# [[ 614. 1532.]
# [2450. 3368.]], shape=(2, 2), dtype=float64)
Multiplying every batch of a with b, for comparison:
tf$linalg$matvec(a[1, , ], b)
# tf.Tensor([ 614. 1532.], shape=(2,), dtype=float64)
tf$linalg$matvec(a[2, , ], b)
# tf.Tensor([[2450. 3368.]], shape=(1, 2), dtype=float64)
It worked!
Now, on to the other motivating example, using tfprobability.
Broadcasting everywhere
Here again is the setup:
library(tfprobability)
d <- tfd_normal(loc = c(0, 1), scale = matrix(1.5:4.5, ncol = 2, byrow = TRUE))
d
# tfp.distributions.Normal("Normal", batch_shape=[2, 2], event_shape=[], dtype=float64)
What’s going on? Let’s examine location and scale individually:
d$loc
# tf.Tensor([0. 1.], shape=(2,), dtype=float64)
d$scale
# tf.Tensor(
# [[1.5 2.5]
# [3.5 4.5]], shape=(2, 2), dtype=float64)
Just focusing on these tensors and their shapes, and having been told that there's broadcasting going on, we can reason like this: Aligning both shapes on the right and extending loc's shape by 1 (on the left), we have (1, 2), which may be broadcast with (2, 2) – in matrix-speak, loc is treated as a row and duplicated.
Meaning: We have two distributions with mean 0 (one of scale 1.5, the other of scale 3.5), and also two with mean 1 (corresponding scales being 2.5 and 4.5).
Here's a more direct way to see this:
d$mean()
# tf.Tensor(
# [[0. 1.]
# [0. 1.]], shape=(2, 2), dtype=float64)
d$stddev()
# tf.Tensor(
# [[1.5 2.5]
# [3.5 4.5]], shape=(2, 2), dtype=float64)
Puzzle solved!
Summing up, broadcasting is simple "in principle" (its rules are), but may take some practicing to get it right. Especially in conjunction with the fact that functions / operators have their own views on which parts of their inputs should broadcast, and which shouldn't. Really, there is no way around looking up the actual behaviors in the documentation.
Hopefully though, you've found this post to be a good start into the topic. Maybe, like the author, you feel like you might see broadcasting going on anywhere in the world now. Thanks for reading!