How to Optimize LLM Inference
- LLM inference is a memory-bound workload. Keeping the batch size high keeps GPU utilization high.
- Tensor and pipeline parallelism, quantization, and advanced attention mechanisms can significantly reduce the memory bottlenecks.
- Continuous batching operates at the system level and ensures GPUs aren't idling.
- Speculative decoding can give an additional speedup by parallelizing the otherwise sequential autoregressive iterations.
Large Language Model (LLM) inference at scale is challenging because it involves moving massive amounts of model parameters and data and performing computations on large tensors. Coupled with the low-latency needs of many applications, we're forced to push the hardware to its limits, in memory bandwidth (measured in bytes/s) as well as compute capability (measured in FLOPs, short for "floating point operations per second").
Have you ever wondered how LLM providers like OpenAI, Hugging Face, and Anthropic get an answer back to you this quickly, given that they're processing millions of requests concurrently? In this article, we'll explore the characteristics of LLM inference as a computational workload and discuss approaches such as key-value caching, quantization, and various forms of parallelization.
Understanding the LLM workload at inference
Generally, all LLMs follow the same schema: embedding input tokens, then processing the embeddings in N identical transformer blocks, before transforming the output back into the input space and sampling from the resulting probability distribution.
In the following, we'll use the Llama model family architecture as a specific example to understand the LLM workload at inference.

The following table shows the number of floating-point operations (FLOPs) required to compute the output of a Llama transformer block. s is the sequence length, b the batch size, and d_model the model's hidden dimension. The feed-forward layer has an inner dimension d_FFN.
| Operation | FLOPs |
| --- | --- |
| Q, K, V projections | 3 · b · s · d_model · d_model |
| Feed forward | 3 · b · s · d_model · d_FFN |
| Attention | 2 · b · s² · d_model |
We see that the FLOPs of the Q, K, and V projections, as well as of the feed-forward layers, increase linearly with the sequence length s and dominate the FLOPs for short sequences (s < d_model, s < d_FFN). Matrix multiplications dominate the attention block's FLOPs. (The softmax FLOPs are negligible and not shown.) For long sequences, computing the attention dominates, scaling quadratically with the sequence length s.
During autoregressive generation, we need to process the entire sequence to obtain the next token. Thus, over a full generation, the Q, K, and V projections and the feed-forward layers scale as O(s²), while the attention scales as O(s³). The attention computation dominates the overall scaling and becomes intractable even for modest sequence lengths, which is why it is the focus of optimizations.
The memory required to store the model weights depends on the precision at which they are stored. Common floating-point precisions are FP8 (8 bits), FP16 (16 bits), and FP32 (32 bits). Accordingly, we need roughly 16 GB of memory to store the eight billion parameters of a Llama 3.1 8B model in FP16 precision. The 400-billion-parameter Llama 4 Maverick model requires 800 GB at the same precision, exceeding the capacity of the largest available GPUs by a wide margin. Hence, managing and potentially reducing memory demands is another crucial area of LLM inference optimization.
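These estimates are easy to reproduce. The following back-of-the-envelope sketch plugs the table's formulas and the FP16 weight-memory rule into plain Python (d_model = 4096 and d_FFN = 14336 are the Llama 3 8B dimensions used later in this article):

```python
# Back-of-the-envelope FLOPs and weight-memory estimates for one transformer block.
# Dimensions correspond to Llama 3 8B: d_model = 4096, d_FFN = 14336.

def block_flops(b: int, s: int, d_model: int, d_ffn: int) -> dict:
    """FLOPs per transformer block, following the table above."""
    return {
        "qkv_projections": 3 * b * s * d_model * d_model,
        "feed_forward": 3 * b * s * d_model * d_ffn,
        "attention": 2 * b * s**2 * d_model,
    }

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to store the model weights."""
    return num_params * bytes_per_param / 1e9

if __name__ == "__main__":
    for s in (1_024, 32_768):
        flops = block_flops(b=1, s=s, d_model=4096, d_ffn=14336)
        total = sum(flops.values())
        share = flops["attention"] / total
        print(f"s={s}: {total / 1e9:.1f} GFLOPs per block, attention share {share:.0%}")

    # 8 billion parameters at 2 bytes each (FP16) is roughly 16 GB.
    print(f"Llama 3.1 8B weights in FP16: ~{weight_memory_gb(8e9, 2):.0f} GB")
```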
These back-of-the-envelope numbers will suffice for our exploration of LLM inference optimization. For a far more detailed analysis of the LLM workload at inference, see the chapter All About Transformer Inference in the book How to Scale Your Model, published by Google DeepMind.
A quick primer on hardware for LLM inference
A typical LLM inference cluster consists of multiple nodes, each with a multi-core CPU and several accelerator devices, most commonly GPUs. The GPUs perform the actual tensor computations, while the CPU handles data transfer and inter-node communication.
Each GPU executes instructions independently but can synchronize and communicate with others through collective operations such as AllReduce, Gather, or Scatter. The GPUs are connected with high-speed interconnects, enabling them to communicate directly without going through the CPU. The bandwidth varies between hardware generations. For example, Nvidia GPUs communicating over fifth-generation NVLink reach up to 1.8 TB/s.
The primary building blocks of a GPU are streaming multiprocessors (SMs) that handle parallel computation. Each SM is designed to execute many threads concurrently. On Nvidia's H100, which we'll use as our reference, there are up to 132 SMs (the exact number depends on the board's form factor).
Each SM includes:
- CUDA cores: Execute standard floating-point and integer arithmetic operations. An H100 SM contains 128 FP32 CUDA cores.
- Tensor Cores: Specialized cores for matrix multiply-accumulate operations. These handle the vast majority of operations in LLM inference. On the H100, there are four Tensor Cores per SM.
- Warp schedulers: Manage groups of threads called "warps" (32 threads per warp on the H100) and issue instructions to CUDA cores and Tensor Cores. The warp schedulers operate in a SIMT (Single Instruction, Multiple Threads) fashion, meaning that in a given cycle, every thread in a warp performs the same operation.
- L1 cache: Low-latency memory local to each SM. On the H100, the L1 cache per SM is roughly 256 KB.
All SMs share:
- L2 cache: Larger and slower than the L1 cache, but significantly faster than the HBM and shared between all SMs. The H100 has an L2 cache between 50 MB and 60 MB with about 5.5 TB/s full-duplex bandwidth (i.e., this bandwidth can be reached simultaneously in both directions).
- High-Bandwidth Memory (HBM): Off-chip memory shared across all SMs. H100s have 80 GB of HBM and a bandwidth between Tensor Cores and HBM of 3.35 TB/s.
The HBM is connected to the CPU's main memory, which can be significantly larger, but the communication bandwidth is about an order of magnitude smaller.
Again, for a more detailed analysis, see the chapter How to Think About GPUs in Google DeepMind's How to Scale Your Model book.


The main challenge when working with accelerators is keeping their utilization high. Low utilization often arises from data transfer overheads between CPU and GPU, limited GPU memory capacity restricting model size, and mismatched workloads where computational tasks don't fully exploit the GPU's parallel processing capabilities. Addressing these issues requires workload balancing, optimized memory management, and efficient communication pipelines.
Optimizing the attention mechanism
Since the attention mechanism scales quadratically with the sequence length s, it dominates the computation. During autoregressive generation, we have to compute the attention over all previous tokens in every iteration, leading to O(s³) scaling overall.

Key-value caching
Let's look at the attention computation in more detail: For every subsequent token, the Q, K, and V matrices each gain a new row, and the QKᵀ matrix gains an additional row and column. The crucial part: all other rows and columns stay the same because their queries and keys haven't changed.
To generate new tokens, we only need to compute the attention of the newest query to all previous tokens, whose information is encoded in the K and V matrices. Only the last rows (tensors) in the K and V matrices are new, while all the others have already been computed in previous iterations. Thus, we can cache these tensors at runtime, an optimization known as key-value caching (KV caching).

Moreover, all data from previously generated tokens, apart from the K and V matrices, is redundant. In every iteration, we only need to consider the newest token and compute its attention over all previous tokens.

If we load K and V from a cache, we can pass just the newest token into the model. Only the newest query tensor is used, producing a single row of attention scores. This improves the scaling of autoregressive generation to O(s²).
However, this doesn't come for free: KV caching increases memory usage linearly with the sequence length s, as we now have to store, rather than recompute, the K and V matrix entries for the previous tokens.
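A minimal single-head sketch of this pattern (no projections, batching, or masking; the shapes are made up for illustration): each decode step appends the newest key and value to the cache and computes attention only for the latest query.

```python
import torch
import torch.nn.functional as F

d_head = 64

# Simulated prefill: keys and values for the prompt tokens are computed once.
prompt_len = 16
k_cache = torch.randn(prompt_len, d_head)  # cached keys, grows by one row per step
v_cache = torch.randn(prompt_len, d_head)  # cached values

def decode_step(q_new: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
    """One decode step: append the new K/V row, attend with the latest query only."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new[None, :]], dim=0)   # (s + 1, d_head)
    v_cache = torch.cat([v_cache, v_new[None, :]], dim=0)
    scores = (q_new @ k_cache.T) / d_head**0.5              # (s + 1,)
    weights = F.softmax(scores, dim=-1)
    return weights @ v_cache                                 # (d_head,) context vector

# Generate a few tokens: only the newest token's q/k/v enter the computation.
for _ in range(4):
    q, k, v = (torch.randn(d_head) for _ in range(3))
    out = decode_step(q, k, v)
print("cache length:", k_cache.shape[0])  # prompt_len + 4
```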
When using KV caching, we can distinguish two phases of an LLM's operation:
- Prefill phase: The model processes the initial input tokens (e.g., a user's prompt). It computes the K and V matrices for all tokens in the input sequence simultaneously. During this phase, all input tokens are processed, and the KV cache is populated.
In the prefill phase, we're usually compute-bound because we can compute the attention for all input tokens together in a single forward pass, leading to large matrix multiplications for which modern accelerators are optimized.
- Decode phase: After the prefill phase, the model generates tokens one by one autoregressively. At each decoding step, a single token comes in, and a single token is predicted. For all previous tokens, we reuse the cached keys and values.
Now, the query is an embedding of only a single token at a time, leading to a much lower computational intensity. Instead, we spend more time moving data around, e.g., loading K and V from the cache and moving the weights and activations from high-bandwidth memory (HBM) to GPU SRAM (the memory closest to the compute units). Thus, we're memory-bound.
For the overall application runtime, it's generally better to be compute-bound than memory-bound. Not fully utilizing the compute capacity wastes power, as idle cores still draw power. Also, if we're compute-bound, we can scale the number of devices to speed things up.
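A rough way to see the difference is to compare the arithmetic intensity (FLOPs per byte moved) of the two phases with the hardware's ratio of peak compute to memory bandwidth. The sketch below uses the 3.35 TB/s H100 HBM bandwidth from the hardware primer plus an assumed ballpark of ~990 TFLOP/s of dense BF16 Tensor Core throughput:

```python
# Rough roofline-style comparison of the prefill and decode phases.
# HBM bandwidth of ~3.35 TB/s is the H100 figure from above; the ~990 TFLOP/s
# dense BF16 Tensor Core throughput is an assumed ballpark number.
PEAK_FLOPS = 990e12
HBM_BANDWIDTH = 3.35e12
RIDGE_POINT = PEAK_FLOPS / HBM_BANDWIDTH  # FLOPs/byte needed to stay compute-bound

def matmul_intensity(m: int, k: int, n: int, bytes_per_el: int = 2) -> float:
    """Arithmetic intensity of an (m x k) @ (k x n) matmul in FP16/BF16."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

d_model = 4096
prefill = matmul_intensity(m=2048, k=d_model, n=d_model)  # whole 2,048-token prompt at once
decode = matmul_intensity(m=1, k=d_model, n=d_model)      # one token at a time

print(f"ridge point:       {RIDGE_POINT:6.0f} FLOPs/byte")
print(f"prefill intensity: {prefill:6.0f} FLOPs/byte  -> compute-bound")
print(f"decode intensity:  {decode:6.0f} FLOPs/byte  -> memory-bound")
```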
Efficient attention mechanisms
We've shifted from compute-bound to memory-bound. KV caching cuts the FLOPs per step, but the attention computation now spends most of its time moving and storing K/V states. The next wins come from reducing what we keep in memory and how we access it, compared to vanilla multi-head attention (MHA):
- Multi-query attention (MQA) and grouped-query attention (GQA) lead to fewer parameters and a smaller KV cache. MQA shares a single K/V across all heads, minimizing parameters and cache size (lowest memory consumption, with a possible quality hit). GQA shares K/V within groups of heads, landing between MHA and MQA (a better quality/memory balance).
- FlashAttention is an optimization for faster and leaner memory access. It reorganizes the attention computation into tiled blocks that stay in on-chip memory, slashing reads and writes to HBM. It does the same math but causes far less memory traffic. FlashAttention is orthogonal to MQA/GQA: pair it with either of the above to reduce memory access overhead.


Leveraging FlashAttention, the large QKᵀ attention matrix never has to be fully materialized, leading to a huge memory reduction.
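As a concrete illustration, the following sketch combines grouped-query attention with PyTorch's fused `scaled_dot_product_attention`, which dispatches to a FlashAttention-style kernel when one is available (the shapes and head counts are made up for the example):

```python
import torch
import torch.nn.functional as F

b, s, d_head = 2, 1024, 64
n_q_heads, n_kv_heads = 32, 8          # GQA: 4 query heads share each K/V head

q = torch.randn(b, n_q_heads, s, d_head)
k = torch.randn(b, n_kv_heads, s, d_head)   # the KV cache is 4x smaller than with MHA
v = torch.randn(b, n_kv_heads, s, d_head)

# Expand K/V so every query head sees its group's shared K/V head.
# (Newer PyTorch releases can skip this step via enable_gqa=True.)
group_size = n_q_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

# Fused attention kernel: the full (s x s) attention matrix is never materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 1024, 64])
```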

Parallelizing the inference workload
The LLM inference workload can be parallelized across devices in many orthogonal ways. These can be used together or separately, depending on the scenario and the infrastructure.
The simplest kind of parallelism is data parallelism. We create multiple model replicas on different devices and feed different inputs to them. This approach is ideal for processing large datasets with smaller models that fit onto a single device. For example, in a chatbot application, different users' chats can be sent to different model replicas.
The other two common parallelism techniques used in LLM training and at inference are tensor and pipeline parallelism, because they allow us to scale large models that wouldn't fit on a single GPU across many devices.
Combining X parallelism techniques at once is usually dubbed "XD parallelism".
Tensor parallelism
Tensor parallelism (TP, also known as model parallelism or horizontal parallelism) was introduced in the 2020 Megatron-LM paper to alleviate the memory bottlenecks of the large linear layers in the feed-forward block.
The linear layers' weights are split ("sharded") across devices such that each device performs a subset of the computations. Tensor parallelism also relieves the memory bandwidth pressure because every device only needs to load a slice of the weights.

In general, a linear layer (i.e., a matrix multiplication) can be parallelized column-wise or row-wise:
- In column parallelism, the weights are split column-wise, and the input is copied to all devices. Performing the computation on the tiles produces output columns that must be concatenated together.
- In row parallelism, the weights are split row-wise, and the input must be split column-wise. After the tiled matrix multiplications are finished, the output matrices must be summed up ("reduced").
In LLMs, row- and column-wise parallelism are used together. For example, the feed-forward blocks in Llama 3 consist of three linear layers, w1, w2, and w3, and an activation function (SiLU):
Matrices w1 and w3 project the input x into a higher intermediate dimension, and w2 projects the intermediate tensor back to the original dimension.
For example, a Llama 3 8B model has a model dimension of 4096 and an intermediate dimension of 14336. To parallelize this computation, we can shard w1 and w3 column-wise, with each device producing a subset of the channels. Each device applies the SiLU activation and the elementwise multiplication to its shard of the data. The w2 matrix is then sharded row-wise so that the subset of channels is down-projected again. This way, each device performs the entire forward pass on only a part of the data. In the end, all shards are summed up.
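The following sketch simulates this sharding scheme on a single device with plain tensor slices (2-way parallelism, toy dimensions) and checks that the sharded result matches the unsharded forward pass:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ffn, tp = 8, 32, 2           # toy dimensions, 2-way tensor parallelism

x = torch.randn(4, d_model)             # (tokens, d_model)
w1 = torch.randn(d_model, d_ffn)        # up-projection (column-parallel)
w3 = torch.randn(d_model, d_ffn)        # gate projection (column-parallel)
w2 = torch.randn(d_ffn, d_model)        # down-projection (row-parallel)

# Reference: unsharded Llama-style feed-forward block.
reference = (F.silu(x @ w1) * (x @ w3)) @ w2

# Sharded: each "device" holds a column slice of w1/w3 and a row slice of w2.
partials = []
for rank in range(tp):
    cols = slice(rank * d_ffn // tp, (rank + 1) * d_ffn // tp)
    h = F.silu(x @ w1[:, cols]) * (x @ w3[:, cols])   # local intermediate shard
    partials.append(h @ w2[cols, :])                  # local partial output
sharded = sum(partials)                               # the final reduction (AllReduce sum)

print(torch.allclose(reference, sharded, atol=1e-4))  # True
```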


The degree of parallelism, which is the number of devices to parallelize over, needs to be tuned to achieve maximum device utilization. TP=1 means no parallelism, and TP=4 (also called "4-way parallelism") means that the matrices are split into four shards.
The decisive factor in optimizing the degree of tensor parallelism is the communication overhead between devices. The shards must first be distributed ("scattered") across devices and "gathered" or "reduced" at the end.
The guiding principle is keeping devices busy with computations: scale TP until compute time dominates transfer time for the given batch size, memory capacity, and link bandwidth.
Pipeline parallelism
In pipeline parallelism (PP, also known as vertical parallelism), different layers are assigned to different devices. The intermediate activations flow from one device to the next.
Like tensor parallelism, PP can be used to alleviate memory capacity issues. For example, Llama 3 405B (roughly 810 GB of parameters in FP16) can be split across 64 Nvidia T4 GPUs with just 16 GB of VRAM each, totaling 1 TB.
The main challenge of PP is scheduling the workload such that idle periods (called "bubbles"), where one device waits for the output of another, are minimized. Such regions can be discovered by profiling the workload.

To reduce the idle time, the communication between devices needs to be optimally overlapped with the independent computations that can run in parallel.
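Below is a deliberately naive sketch of pipeline parallelism in PyTorch: the first half of the layers lives on one device, the second half on another, and activations are transferred between stages. Real pipeline schedulers additionally split each batch into micro-batches so that both devices can work at the same time:

```python
import torch
import torch.nn as nn

# Two pipeline stages; fall back to CPU-only if fewer than two GPUs are present.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

d_model, n_layers = 512, 8
layers = [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
          for _ in range(n_layers)]

stage0 = nn.Sequential(*layers[: n_layers // 2]).to(dev0)   # first half of the layers
stage1 = nn.Sequential(*layers[n_layers // 2:]).to(dev1)    # second half

x = torch.randn(4, 128, d_model, device=dev0)   # (batch, seq, d_model)
with torch.no_grad():
    h = stage0(x)      # runs on device 0
    h = h.to(dev1)     # activation transfer; without micro-batching, device 0 now idles (a "bubble")
    out = stage1(h)    # runs on device 1
print(out.shape)
```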
Other parallelisms
Beyond tensor and pipeline parallelism, two other forms of parallelism are commonly applied for LLM inference:
- In "sequence parallelism," long input sequences that require more memory than a single device provides are split across devices, so that each device computes the attention scores for only a subset of the input tokens. While this enables inference on longer sequences than a single device could handle and keeps most computations local, it requires substantial synchronization effort.
- "Expert parallelism," specific to the mixture of experts (MoE) architecture, distributes the "experts" across devices. At runtime, the model dynamically routes the inputs to the appropriate experts. For example, a model with 64 experts per layer served on 8 devices would place 8 experts on each device; large MoE models like DeepSeek-V3 rely on this technique at scale.
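A tiny sketch of the routing step in a mixture-of-experts layer (toy dimensions, top-2 routing, all experts kept on one device for simplicity): a gating network scores the experts per token, and each token is sent only to its top-scoring experts, which under expert parallelism would live on different devices:

```python
import torch
import torch.nn.functional as F

n_experts, top_k, d_model = 8, 2, 16
tokens = torch.randn(5, d_model)                   # 5 token embeddings
gate = torch.nn.Linear(d_model, n_experts)         # router ("gating network")
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

with torch.no_grad():
    scores = F.softmax(gate(tokens), dim=-1)       # (5, n_experts)
    weights, idx = scores.topk(top_k, dim=-1)      # each token picks its top-2 experts

    out = torch.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            # Under expert parallelism, experts[e] may live on another device,
            # so this dispatch becomes a communication (all-to-all) step.
            out[t] += w * experts[int(e)](tokens[t])

print(out.shape)  # torch.Size([5, 16])
```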
Quantization
Another way of reducing the memory and compute bottlenecks is to use fewer bits for the weights and activations. This is called quantization. The lower the bit-width, the more memory we save. However, this comes at the risk of degrading the model's output accuracy.
The numeric data types used in neural networks are integer (INT), floating-point (FP), and logarithmic data types.
IEEE FP16 and BF16 are two prominent 16-bit floating-point formats. BF16 ("brain float") was developed by Google Brain (now part of Google DeepMind) and retains the same dynamic range as FP32 but sacrifices precision, so it cannot represent values as exactly.
The bit-width of the data type directly determines its memory usage. An IEEE 754 FP32 value takes up 4 bytes. Replacing it with an FP16 data type immediately halves the memory needed. Moreover, if we're memory-bound (e.g., in the decode phase), quantization frees up memory bandwidth, directly leading to runtime improvements.
Beyond the memory savings, quantized data formats can also speed up the computation if the hardware supports it.
For example, matrix multiplication is a common bottleneck in LLMs. At its core, matrix multiplication is a series of multiplications and accumulations, which is computed on hardware using multipliers and accumulators with a certain bit-width, e.g., 32 bits. Memory transfers and the compute capabilities of the hardware are optimized for this bit-width.
However, since 2017, when Nvidia introduced the Volta architecture, hardware vendors have added native support for the lower-bit-width matrix multiplication workloads present in ML models. AMD calls these units "Matrix Cores" and Nvidia "Tensor Cores". The table below compares theoretical FLOPS for AMD's MI300X and Nvidia's H200 NVL (PCIe version) using these specialized cores. You can see that halving the bit-width doubles the available FLOPS.
Quantization methods
Model quantization can significantly improve efficiency, but it usually comes with a tradeoff in output quality, as reducing the bit-width means reducing the amount of information that can be represented. When applying quantization, it's essential to test its effects on realistic data to assess whether the gain in computational efficiency is worth the drop in task performance.
Quantization methods are distinguished by:
- When quantization happens: during training (Quantization-Aware Training, QAT) or after training (Post-Training Quantization, PTQ).
- How scaling and outliers are handled to avoid range clipping and reduce quantization errors.
- How quantization parameters are determined: statically (offline, fixed) or dynamically (online, at runtime).
Quantization-Aware Training (QAT) is applied during training while parameters are being updated. A typical example is training an LLM in BF16. In Post-Training Quantization (PTQ), the model is already trained, and the process relies on a calibration dataset to quantize it, e.g., to set parameters such as scaling factors, per-layer bit-widths, and group sizes.
Scaling plays a crucial role in avoiding range-clipping errors. For instance, the maximum representable value in FP16 is roughly 65,000, whereas a commonly used FP8 format tops out around 448. Converting directly from FP16 to FP8 would clamp anything above that limit, introducing large errors. Scaling the values before quantizing, performing the computation in FP8, and then rescaling afterwards preserves more of the model's dynamic range.
This approach, demonstrated, for example, in a Gist by Nikita Shulga, scales and quantizes two FP16 tensors before an FP8 matrix multiplication.
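A minimal sketch of this scale-quantize-multiply-rescale pattern (not the Gist's code verbatim): it assumes a PyTorch version that provides the float8_e4m3fn dtype, and the dequantize-and-multiply step stands in for a dedicated hardware FP8 matmul kernel such as torch._scaled_mm:

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in float8_e4m3fn

def to_float8(x: torch.Tensor):
    """Scale a tensor into the FP8 range and quantize it; return the tensor and its scale."""
    scale = E4M3_MAX / x.float().abs().max().clamp(min=1e-12)
    x_fp8 = (x.float() * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

a = torch.randn(128, 64, dtype=torch.float16) * 10.0
b = torch.randn(64, 32, dtype=torch.float16) * 10.0

a_fp8, scale_a = to_float8(a)
b_fp8, scale_b = to_float8(b)

# Stand-in for a hardware FP8 matmul: dequantize, multiply, and undo both scales.
out = (a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)) / (scale_a * scale_b)

ref = a.float() @ b.float()
print("max relative error:", ((out - ref).abs().max() / ref.abs().max()).item())
```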
The timing of when quantization parameters are determined matters as well. In static quantization, parameters are computed offline using a calibration dataset. This adds no runtime overhead, but quality can degrade if the actual runtime data differs from what was seen during calibration. For example, larger runtime values can cause clipping if the scaling is insufficient. In dynamic quantization, parameters are computed at runtime, allowing the system to adapt to changing data distributions at the cost of extra computation. In the example above, dynamic quantization would mean recalculating the scaling factors every time the tensors are quantized.
Making (activation) quantization work
Until now, we haven't differentiated between weights and activations when discussing quantization.
It turns out that quantizing weights is much simpler than quantizing activations. Weights are static, so we can quantize them offline. Moreover, due to the use of regularization that penalizes large weights during training, weights generally have distributions with small amplitudes.
In contrast, LLM activation tensors have outliers. Outliers are channels with large absolute values, which are difficult to quantize because they have a big impact on the scaling factor. When we divide all the numbers in a tensor by its maximum value, and this maximum is much larger than the other values, the division squashes the other values so much that they can no longer be resolved at the available precision.


Outliers in activations can be handled by leveraging the observation that they aren't random but occur in the same channels for all input tokens. We can split the channels into "outlier" and regular channels and quantize them with different scaling factors. We can even split the layer, compute the outlier channels in full precision, and quantize only the rest.
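A simplified sketch of this channel-splitting idea (in the spirit of an LLM.int8()-style decomposition; the threshold and shapes are made up): activation columns whose amplitude exceeds a threshold stay in full precision, while the remaining columns are quantized to INT8 with a per-tensor scale.

```python
import torch

def mixed_precision_matmul(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    """Split activation channels into outliers (full precision) and regular channels (INT8)."""
    outlier_cols = x.abs().amax(dim=0) > threshold           # per-channel amplitude check
    x_out, w_out = x[:, outlier_cols], w[outlier_cols, :]    # outlier path stays full precision
    x_reg, w_reg = x[:, ~outlier_cols], w[~outlier_cols, :]

    # Quantize the regular part to INT8 with simple per-tensor scales.
    sx = 127.0 / x_reg.abs().max().clamp(min=1e-12)
    sw = 127.0 / w_reg.abs().max().clamp(min=1e-12)
    xq = (x_reg * sx).round().clamp(-127, 127).to(torch.int8)
    wq = (w_reg * sw).round().clamp(-127, 127).to(torch.int8)

    reg = (xq.to(torch.float32) @ wq.to(torch.float32)) / (sx * sw)
    return reg + x_out @ w_out                               # combine both paths

x = torch.randn(8, 64)
x[:, 3] *= 50.0                     # inject an outlier channel
w = torch.randn(64, 32)
err = (mixed_precision_matmul(x, w) - x @ w).abs().max()
print("max abs error:", err.item())
```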
Conclusion
In this article, we have explored ways of optimizing LLM inference. KV caching avoids recomputing the K and V matrices, while advanced attention mechanisms, like FlashAttention, accelerate the attention computation. To alleviate memory bottlenecks, we can quantize the model's parameters or parallelize the model across devices in different ways. If our hardware supports computation at lower bit-widths, e.g., FP8 matrix multiplication, we get an additional speedup. On top of all that, continuous batching and speculative decoding enable efficient deployment.
By combining these approaches, you can unlock faster and more resource-efficient LLM inference for your application, serving more users better.