Flash Attention (Fast and Memory-Efficient Exact Attention with IO-Awareness): A Deep Dive
Flash attention is an optimization of the transformer attention mechanism that provides roughly a 15% improvement in wall-clock speed with no approximation.
Transformer models are slow and memory-hungry on long sequences (time and memory complexity are quadratic in sequence length); flash attention (paper) delivers a 15% end-to-end wall-clock speedup on BERT-large and a 3x speedup on GPT-2.
Considering the huge amount of energy consumed in training these large models, the 15% efficiency that flash attention achieves through combined software and hardware optimization is a big win.
The discussion below explains some of the basic ideas behind flash attention and how it is implemented.
Basic concepts around compute & memory
Before we dive deeper, let's revisit what compute and memory mean:
What is compute?
- Time spent on your GPU computing actual floating point operations (FLOPS)
What is memory?
- Time spent transferring tensors within a GPU
Ideally, we want our GPU to be doing matrix multiplication all the time, never limited by memory. In reality, though, compute has improved much faster than memory bandwidth, and we are in a world where the GPU sits idle waiting for data to be loaded. This is usually called a memory-bound operation. Refer to the illustrative diagram below: matrix multiplication is the compute, and memory stores the data (think of it as a warehouse). The compute needs data to process, and memory bandwidth has to keep up with that demand.
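To make this trade-off concrete, here is a rough back-of-the-envelope sketch (not from the paper; the sequence length, head dimension and A100 figures are illustrative assumptions) comparing how many floating point operations each byte moved supports for a matmul versus a row-wise softmax:

```python
# Rough arithmetic-intensity estimate (illustrative numbers, not a benchmark).
# An operation tends to be memory-bound when its FLOPs per byte moved is far
# below the hardware's compute-to-bandwidth ratio (~156 FLOPs/byte for an
# A100 at ~312 TFLOP/s fp16 and ~2 TB/s HBM bandwidth).

N, d = 4096, 64      # assumed sequence length and head dimension
bytes_per_el = 2     # fp16

# Matmul Q @ K^T: 2*N*N*d FLOPs; reads Q and K, writes the N x N score matrix
matmul_flops = 2 * N * N * d
matmul_bytes = bytes_per_el * (2 * N * d + N * N)

# Row-wise softmax over the N x N score matrix: a handful of FLOPs per element,
# but it must read and write the full N x N matrix
softmax_flops = 5 * N * N
softmax_bytes = bytes_per_el * (2 * N * N)

print("matmul  FLOPs/byte:", matmul_flops / matmul_bytes)    # ~62
print("softmax FLOPs/byte:", softmax_flops / softmax_bytes)  # ~1.25 -> heavily memory-bound
```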
What is the memory hierarchy?
The A100 GPU has 40–80 GB of high bandwidth memory (HBM) with a bandwidth of 1.5–2.0 TB/s, and 192 KB of on-chip SRAM in each of its 108 streaming multiprocessors, with bandwidth estimated around 19 TB/s.
With the above context in mind, the self-attention architecture is memory-bound.
Looking at the attention math, the softmax operation is what makes it memory-bound.
- Quantitative evidence: as shown below, operations like softmax, dropout, and masking take the majority of the time compared to matrix multiplication (Matmul).
Why does softmax become a memory-bound operation?
The scale at which it operates is our biggest bottleneck. In the diagram below:
- N -> number of tokens
- d -> number of embedding dimensions
- When Query and Key' are multiplied, the attention matrix blows up to N * N, which takes a lot of memory. For reference: d ~128; N ~128k tokens; Google Gemini: ~1 million tokens (a quick size calculation follows below).
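As a quick sanity check on those sizes, here is a small calculation of how big the N x N score matrix gets (fp16, single attention head; the helper and numbers are illustrative, not from the paper):

```python
# Size of the N x N attention score matrix in fp16, single head
# (illustrative helper and numbers, not from the paper).
def score_matrix_gib(n_tokens: int, bytes_per_el: int = 2) -> float:
    return n_tokens * n_tokens * bytes_per_el / 2**30

print(score_matrix_gib(4_096))      # ~0.03 GiB
print(score_matrix_gib(128_000))    # ~30.5 GiB  -- nowhere near fitting in on-chip SRAM
print(score_matrix_gib(1_000_000))  # ~1863 GiB  -- cannot be materialized at all
```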
Below is the algorithm for implementing the self-attention mechanism.
As noted in the section above, moving data to HBM (writing S to HBM), loading it back from HBM onto the GPU to compute the softmax, and then writing the result back to HBM again means a lot of data is traveling, which makes this a memory-bound operation.
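To see why this is expensive, here is a rough count of the HBM traffic the standard algorithm generates (a sketch under simple assumptions: fp16, a single head, and counting only tensor reads and writes, not exact kernel behavior):

```python
# Back-of-the-envelope HBM traffic for the standard (non-fused) attention
# forward pass, counting tensor reads and writes in fp16 for a single head.
def standard_attention_hbm_bytes(N: int, d: int, bytes_per_el: int = 2) -> int:
    traffic = 3 * N * d   # load Q, K, V
    traffic += N * N      # write S = QK^T to HBM
    traffic += N * N      # read S back to compute softmax
    traffic += N * N      # write P = softmax(S) to HBM
    traffic += N * N      # read P back to compute P @ V
    traffic += N * d      # write the output O
    return bytes_per_el * traffic

# The four N*N terms dominate: ~130 MiB moved for a single 4096-token head
print(standard_attention_hbm_bytes(4096, 64) / 2**20, "MiB")
```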
Along with the diagram, the steps below explain how self-attention is computed through matrix multiplication (a short NumPy walkthrough follows after step 3).
Step 1:
- I have simplified this. In practice, each token is combined with a positional encoding to produce an embedding, which is fed into a linear layer to generate <key, query and value>. For illustration I used a dimension of 3 (in practice it typically ranges from 64–128). This is the standard transformer architecture input.
Step 2
- Key -> Key' (transpose) is computed and multiplied with Query to give QK', which is N*N. This holds the attention of each token with respect to every other token; the diagram below shows the relationship as well. Since these are tokens and we need to compute the importance of each token with respect to the others, a softmax operation is applied row-wise to normalize the values between 0 and 1.
- This step requires movement to HBM and is the most expensive operation, as discussed. The entire flash attention paper is about how to optimize this process.
Step 3
- Softmax(QK') * V is computed as the final output matrix. Its dimensions are the same as the input embeddings of key, query and value.
- Final rows in the output matrix:
- 1*5: the embedding of “this” is modified to incorporate its relations with the other tokens.
- 2*5: the embedding of “is” is modified to incorporate its relations with the other tokens.
- The same applies to the rest of the rows.
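For readers who prefer code, here is a tiny NumPy walkthrough of steps 1–3 using toy sizes similar to the diagram (4 tokens, 3-dimensional embeddings; the random values are made up purely for illustration):

```python
import numpy as np

# Tiny NumPy walkthrough of steps 1-3 with toy sizes:
# 4 tokens, 3-dimensional embeddings (random values, purely illustrative).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))   # query
K = rng.normal(size=(4, 3))   # key
V = rng.normal(size=(4, 3))   # value

# Step 2: QK' is 4 x 4 -- the attention of every token with every other token,
# normalized row-wise with softmax so each row sums to 1
scores = Q @ K.T
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

# Step 3: each output row is a weighted mix of the value vectors,
# so the output has the same shape as the input embeddings
out = weights @ V
print(scores.shape, weights.sum(axis=1), out.shape)   # (4, 4) [1. 1. 1. 1.] (4, 3)
```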
The basic idea is explained in the diagram below: blocks of key, query and value are moved from HBM to SRAM, and through some mathematical tricks (explained below) the computation done here is not an approximation but the exact answer.
With this implementation, the paper is able to reduce wall-clock time by accessing data in blocks without sacrificing correctness.
Algorithm behind the paper: how is flash attention implemented?
This is the most complex part of the paper. Let's break the problem into sub-parts and dive deeper.
The diagram below breaks the matrices into blocks and shows how each block is used to compute a partial softmax and then the exact softmax.
- Initial input tokens: This is flash attention paper
- Key: 4 (tokens) x 3 (dimensions), Query: 4 (tokens) x 3 (dimensions) and Value: 4 (tokens) x 3 (dimensions)
Step 0
- Assume the available SRAM is 24 bytes.
- SRAM is divided into 4 blocks (query, key, value and the output matrix).
- Query, key, value and output therefore get 6 bytes each to store their data (24 bytes / 4).
- Each embedding has dimension 3, and an embedding cannot be split, so
- Query: 6 bytes / 3 (dimensions) = 2 rows. The same holds for key, value and output.
- Hence, [M/4d] gives the size of each block. In this case the block size is 2, meaning 2 rows can be fetched into SRAM at a time.
- In general, the block size is [M/4d] and the number of blocks is [N*4d/M] (see the small calculation after this list).
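The same step-0 arithmetic in a few lines of Python (the 24-byte SRAM budget and the one-element-per-byte unit are the toy assumptions from the example above):

```python
# Step-0 block-size arithmetic (toy assumptions from the example above).
M = 24              # on-chip memory budget
d = 3               # embedding dimension
num_matrices = 4    # Q, K, V and the output share SRAM equally

bytes_per_matrix = M // num_matrices   # 24 / 4 = 6
block_rows = bytes_per_matrix // d     # M / (4 * d) = 2 rows per block

N = 4                                  # total tokens
num_blocks = -(-N // block_rows)       # ceil(N * 4 * d / M) = 2 blocks
print(block_rows, num_blocks)          # 2 2
```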
Step 1 & 2: Including a desk beneath which illustrates steps 1 and a pair of on how flash consideration works and evaluate reminiscence and computation side of it.
Beneath diagram helps visualize matrix multiplication (block by block) utilized in flash consideration.
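To make steps 1 and 2 concrete, below is a simplified NumPy sketch of the blocked forward pass. It only tiles over the key/value blocks, keeps a running row max and exponent sum, and omits masking, dropout and the backward pass, so it is an illustration of the idea rather than the paper's CUDA kernel:

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=2):
    """Simplified blocked attention forward pass (NumPy sketch, not the CUDA kernel).

    Keys and values are processed block by block; running row statistics
    (max m and exponent sum l) are kept so the result equals the exact
    softmax(Q K^T) V with no approximation.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    l = np.zeros(N)              # running sum of exponents per row
    m = np.full(N, -np.inf)      # running max per row

    for start in range(0, N, block_size):      # loop over K/V blocks
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = Q @ Kb.T                            # scores for this block only (never N x N)

        m_new = np.maximum(m, S.max(axis=1))    # updated row max
        P = np.exp(S - m_new[:, None])          # this block's softmax numerators
        scale = np.exp(m - m_new)               # rescale old stats to the new max
        l = scale * l + P.sum(axis=1)
        O = scale[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]

# Quick check against the unblocked computation
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 6, 3))
S = Q @ K.T
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
print(np.allclose(flash_attention_forward(Q, K, V), ref))   # True
```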
What is the mathematical trick behind the softmax?
One of the most important aspects of the paper is how breaking the matrices into blocks still yields the exact softmax. The mathematical example below shows how the results from two different blocks can be combined to recover the softmax.
Intuition
- This is the beautiful property of exponents that is leveraged here.
- Each softmax is computed individually, but alongside it the maximum value of the row and the summed exponent value are stored.
- When merging with another block, we need to check how much each local max differs from the global max across both blocks. Because of the exponent, both numerator and denominator are adjusted by e^(current_max − global_max) to account for this.
The logic is fairly intricate, so an example is given below to work through; once you are familiar with the example, the intuition above will make a lot of sense.
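Here is a small numeric version of that merge (the scores are made up; the point is only that the blockwise result matches the direct softmax):

```python
import numpy as np

# Merging two partial softmaxes over a single row of scores (values made up).
# Each block keeps its local max and sum of exponents; the merge rescales both
# by exp(local_max - global_max), so the final result is exact.
x1 = np.array([1.0, 3.0])   # scores in the first block
x2 = np.array([2.0, 5.0])   # scores in the second block

m1, m2 = x1.max(), x2.max()
l1 = np.exp(x1 - m1).sum()  # partial exponent sums under the local maxes
l2 = np.exp(x2 - m2).sum()

m = max(m1, m2)                                   # global max of the full row
l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2     # corrected global denominator

merged = np.concatenate([np.exp(x1 - m), np.exp(x2 - m)]) / l
direct = np.exp(np.concatenate([x1, x2]) - m)
direct = direct / direct.sum()
print(np.allclose(merged, direct))   # True -- blockwise softmax matches the direct one
```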
Let's look at the complexity analysis to get a sense of how things changed.
Self-attention
- While computing S = QK', the result becomes an N*N matrix that must be written back to HBM and then read back from HBM.
- Hence the HBM access is O(N*N + N*N) = O(N*N).
Flash attention
- Outer loop: key and query are accessed O(Nd) times.
- Inner loop: only O(Nd/M) passes are needed to load from HBM, since we operate on blocks.
- Overall: O(N*N*d*d/M) HBM accesses.
- In practice, d is much smaller than M: d is in the range 64–128 while M is around 100 KB, so HBM access drops sharply.
- We started with the goal of reducing HBM access, and this complexity analysis shows that the paper scales HBM accesses down by a factor of d*d/M with no approximation (a quick plug-in of numbers follows below).
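Plugging illustrative numbers into the two counts above (N and d are assumptions; M is the SRAM budget expressed in element counts for simplicity) shows the scale of the saving:

```python
# Plugging illustrative numbers into the two HBM-access counts above.
N = 4096        # tokens
d = 64          # head dimension
M = 100_000     # on-chip SRAM budget (in elements)

standard = N * N              # O(N^2): materialize and re-read the score matrix
flash = N * N * d * d / M     # O(N^2 * d^2 / M): block-wise accesses

print(standard, int(flash))          # 16777216 687194
print(round(standard / flash, 1))    # ~24.4x fewer HBM accesses
```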
It is a complex paper with a huge improvement in efficiency. I hope the explanation above gives some intuition into how flash attention optimizes and improves performance. I have not covered block-sparse flash attention, how it compares with other optimization techniques, forward pass optimization, and so on; hopefully I will cover those in a future post.