Structured-Then-Unstructured Pruning for Scalable MoE Pruning [Paper Reflection]

Mixture-of-Experts (MoE) architectures offer a promising solution by sparsely activating specific parts of the model, reducing inference overhead. However, even with MoEs, the sheer number of parameters and experts makes deployment and serving expensive.
Pruning is an established technique to reduce the number of parameters of a trained model while maintaining its task performance. Typically, we distinguish two types of approaches: unstructured pruning removes individual weights, while structured pruning removes entire model components.
Because of their clear structure, structured pruning seems to be an ideal match for MoEs. By removing redundant experts, we can shrink the total model size. However, existing approaches for expert pruning require many forward passes, whose number grows exponentially with the number of experts. Further, structured pruning does not reduce the number of active weights during inference.
In our paper STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, which was accepted for presentation at ACL 2025, we combine the two classes of pruning methods and introduce an approach that works exceptionally well for MoEs with over 100 experts. In a nutshell, STUN first removes redundant experts and then performs unstructured pruning within individual experts.
Scaling limitations for Mixture-of-Experts models
MoEs are an effective way to increase the total number of model parameters while keeping computational demands in check. By dividing the model into specialized structures, called experts, and selectively activating them based on the input, MoEs achieve efficiency gains in training and inference.
More experts allow the model to capture a broader range of representations and specializations, improving performance on diverse tasks or complex data. Unsurprisingly, we see a clear trend toward an increasing number of experts in MoEs. To illustrate this evolution, Mistral's Mixtral 8x7B (December 2023) builds on eight experts, Databricks' DBRX (March 2024) on 16, and Snowflake's Arctic (April 2024) uses 128 experts.
However, as models scale further, the efficiency gains provided by the MoE architecture alone are insufficient. Here, pruning becomes essential, refining the architecture by removing redundant parameters without compromising overall performance. Combining MoEs with pruning methods can optimize inference speed and memory consumption, making it a promising path for scaling models further.
Solving the exponential scaling challenge in structured MoE pruning
Structured pruning removes specific patterns, such as rows or entire weight tensors. In the context of MoEs, since the expert structures that emerge from MoE training correspond to such patterns, pruning experts is a natural match for structured pruning.
While an increase from 8 to 128 experts may seem modest, it renders existing pruning methods unviable. Roughly speaking, they take a "combinatorial" approach to deciding which structures to remove, requiring the enumeration of all possible subsets of experts to determine the optimal configuration. For instance, when the number of experts increases from 8 to 128, the forward passes of combinatorial pruning algorithms grow exponentially, from 70 to 2.4 × 10³⁷.
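These numbers correspond to enumerating every way of keeping half of the experts, i.e., the binomial coefficients C(8, 4) = 70 and C(128, 64) ≈ 2.4 × 10³⁷. The short Python sketch below reproduces this count; the assumption that exactly half of the experts are kept is ours, purely for illustration.

```python
from math import comb

# Number of expert subsets a combinatorial pruning method would have to evaluate,
# assuming (for illustration) that half of the experts are kept.
for n_experts in (8, 128):
    n_keep = n_experts // 2
    print(f"{n_experts} experts -> {comb(n_experts, n_keep):.3e} candidate subsets")
    # 8 experts  -> 7.000e+01 candidate subsets
    # 128 experts -> 2.398e+37 candidate subsets
```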
In contrast, STUN leverages the behavioral similarity between experts to make expert pruning decisions. Specifically, it first identifies clusters of similar experts based on their behavioral similarity. We can determine this similarity at minimal cost by inspecting the model's weights: if two rows have similar values, this indicates a high pairwise similarity between the two corresponding experts. Such an expert pair tends to activate on similar inputs and produce similar outputs, thus forming a cluster.
By pruning all but one representative expert from each cluster, STUN effectively reduces the model size while preserving its overall functionality. This approach reduces the exponential complexity of exhaustively enumerating combinations to constant O(1), making it highly scalable for large MoEs.
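The following minimal sketch illustrates this idea. It assumes that expert similarity is measured as the cosine similarity of flattened expert weight matrices and that clusters are formed greedily with a fixed threshold; the paper's actual similarity measure and clustering procedure may differ, and the names below are hypothetical.

```python
import numpy as np

def select_representative_experts(expert_weights, threshold=0.95):
    """Greedily cluster experts by pairwise weight similarity and keep one
    representative per cluster (illustrative sketch, not the paper's exact method)."""
    flat = [w.ravel() / np.linalg.norm(w.ravel()) for w in expert_weights]
    keep = []  # indices of representative experts
    for i, w in enumerate(flat):
        # If expert i is very similar to an already-kept expert, treat it as redundant.
        if any(float(w @ flat[j]) >= threshold for j in keep):
            continue
        keep.append(i)
    return keep

# Toy example: four experts, where expert 1 is a near-duplicate of expert 0.
rng = np.random.default_rng(0)
experts = [rng.normal(size=(16, 16)) for _ in range(3)]
experts.insert(1, experts[0] + 0.01 * rng.normal(size=(16, 16)))
print(select_representative_experts(experts))  # [0, 2, 3] -- the duplicate is pruned
```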
Exploring the potential of a two-phase approach to MoE pruning
A key question in our research was: How much can we gain from an additional unstructured pruning phase? Once we remove all redundant experts, there could be less "margin" for further pruning compared to a scenario where we only apply unstructured pruning.
We can quantify this margin as the kurtosis of the model weights' distribution, colloquially known as its "tailedness." As unstructured pruning removes near-zero weights, it reduces the weight distribution's kurtosis.
Unlike unstructured pruning, which selectively targets weights that minimally affect the model's output, structured pruning removes groups of parameters (in our case, experts) based on redundancy or low importance. Thus, structured pruning does not significantly decrease kurtosis, leaving plenty of margin for unstructured pruning.
For example, if two experts in an MoE perform identically, one can be removed without altering the model's output. However, this does not significantly affect the overall weight distribution; it only reduces the model's size.
Since structured pruning primarily reduces architectural redundancy rather than reshaping the underlying weight distribution, our two-phase approach, which applies unstructured pruning after structured pruning, outperforms unstructured-only pruning.
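To make the "margin" argument concrete, here is a small, self-contained illustration (ours, not from the paper): magnitude-based unstructured pruning removes the near-zero center of a weight distribution and lowers its kurtosis, while dropping a redundant copy of the weights, the analogue of removing a duplicate expert, leaves the distribution unchanged.

```python
import numpy as np

def kurtosis(x):
    """Pearson's kurtosis E[(x - mu)^4] / sigma^4 (about 3.0 for a Gaussian)."""
    x = np.asarray(x).ravel()
    return float(np.mean((x - x.mean()) ** 4) / x.var() ** 2)

rng = np.random.default_rng(0)
weights = rng.normal(size=100_000)          # stand-in for a model's weights
model = np.concatenate([weights, weights])  # second copy plays the role of a redundant expert

# Structured pruning: drop the redundant copy -- the distribution is unchanged.
structured = model[: len(weights)]

# Unstructured pruning: drop the 40% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(structured), 0.40)
unstructured = structured[np.abs(structured) > threshold]

print(f"kurtosis before pruning:           {kurtosis(model):.2f}")         # ~3.0
print(f"kurtosis after structured pruning: {kurtosis(structured):.2f}")    # ~3.0, margin intact
print(f"kurtosis after unstructured step:  {kurtosis(unstructured):.2f}")  # ~1.9, margin consumed
```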
Putting STUN to the test
Our evaluations show that STUN achieves high sparsity with no loss in performance on various MoE architectures, including Snowflake's Arctic, a 480B-parameter MoE with 128 experts.
We achieved practically no loss in performance at 40% sparsity, even on challenging generative tasks like GSM8K (Grade School Math 8K), a widely adopted question-answering benchmark of mathematical problems that require multi-step reasoning.

In some cases, STUN performed orders of magnitude better than unstructured pruning methods. Our O(1) expert pruning method also outperformed existing, more computationally expensive methods, such as Lu et al. (2024), highlighting the effectiveness of our approach.
What's next in MoE pruning?
Since STUN does not make any assumptions about the base MoE model, it generalizes to other MoE families, such as Mixtral. Our code is available on GitHub. We encourage you to read our paper and adapt it to your MoE models.
Beyond applying and evaluating STUN, an important next area of optimization is hardware acceleration for unstructuredly pruned models. Unstructured pruning removes individual weights without considering their location or arrangement within the model. Because of this, the resulting model's sparsity is random and unaligned: some rows, columns, or even small sections may become very sparse, while others remain dense.
This irregularity is challenging because hardware like GPUs or TPUs assumes regular, contiguous memory layouts. While structured pruning yields a predictable sparsity pattern that allows for memory optimization, the irregularly sparse models resulting from unstructured pruning prevent efficient memory access and parallel processing.
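As a toy illustration of this difference (ours, not from the paper), the sparsity masks produced by the two pruning styles look very different from the hardware's perspective:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 8))

# Unstructured pruning: zero out the 50% smallest-magnitude weights, wherever they are.
mask_unstructured = np.abs(weights) > np.quantile(np.abs(weights), 0.5)

# Structured pruning: zero out entire rows (e.g., whole neurons or experts).
mask_structured = np.ones_like(weights, dtype=bool)
mask_structured[[1, 4], :] = False

print(mask_unstructured.astype(int))  # scattered zeros, no contiguous pattern
print(mask_structured.astype(int))    # whole rows removed, easy to skip in memory
```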
Specialized hardware support can reorganize memory access patterns to reduce the overhead caused by this irregularity. Such co-evolution of hardware and software will likely further establish pruning as a cornerstone of scaling and applying MoE models.