arXiv Papers

Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

Published: 2025-09-08 05:38:10

Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

Categories: cs.LG, cs.AI

Abstract:
Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts (a small group with outsized impact on performance), leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.

Summary (gpt-4o-mini — added 2025-09-11 16:00 UTC)

{
  "tldr": "The paper introduces Ban&Pick, a post-training strategy that enhances performance and speeds up inference in Mixture-of-Experts (MoE) large language models without retraining. It identifies key experts for improved accuracy and prunes redundant ones for efficiency.",
  "whats_new": [
    "• Introduces Ban&Pick, a novel post-training routing strategy for MoE models.",
    "• Identifies and reinforces key experts that significantly impact performance.",
    "• Implements dynamic pruning to reduce redundancy in expert selection.",
    "• Achieves performance gains and inference speedup without architectural changes."
  ],
  "method": "Ban&Pick consists of two modules: Pick, which identifies and enhances key experts based on their impact on performance, and Ban, which dynamically prunes less effective experts during inference. This approach is applied post-training, avoiding the need for retraining or architectural redesign.",
  "results": [
    "• On Qwen3-30B-A3B, accuracy improved from 80.67 to 84.66 on AIME2024.",
    "• Achieved a 1.25× speedup in inference time under vLLM.",
    "• Average performance improvement of 1.41% across five datasets.",
    "• Pick alone resulted in a 2.83% performance boost on average."
  ],
  "limitations": [
    "• The effectiveness of the method may vary across different MoE architectures.",
    "• Potential for over-reliance on identified key experts could limit generalization.",
    "• Dynamic pruning may lead to performance drops in certain contexts."
  ],
  "why_useful": "This paper is significant as it provides a practical method for enhancing the efficiency and effectiveness of MoE models, which are increasingly used in large-scale language tasks. The findings can inform future research and applications in optimizing model performance without extensive retraining.",
  "provenance": {
    "page_citations": [],
    "code_data_availability": "Not provided."
  }
}
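
Illustrative sketch (not the authors' implementation): the abstract and summary describe Pick as reinforcing a few key experts and Ban as dropping redundant experts per token, but give no formulas. The Python below shows one way such a post-training router adjustment could look for a single token. Every name and knob in it (`route`, `key_experts`, `pick_bonus`, `ban_ratio`) is a hypothetical stand-in. In particular, the relative-probability cutoff used for Ban is an assumed proxy for the paper's layer and token sensitivity, which a real system might set per layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, top_k, key_experts=(), pick_bonus=0.0, ban_ratio=0.0):
    """Return {expert_index: weight} for one token.

    Hypothetical knobs, assumed for illustration only:
      key_experts, pick_bonus: experts whose router logits get a fixed
          boost before selection (a stand-in for "Pick").
      ban_ratio: after top-k selection, drop experts whose probability
          is below ban_ratio * (highest probability) (a stand-in for "Ban").
    """
    # Pick: boost the logits of the assumed key experts.
    boosted = [x + (pick_bonus if i in key_experts else 0.0)
               for i, x in enumerate(logits)]
    probs = softmax(boosted)

    # Standard top-k selection, highest probability first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = ranked[:top_k]

    # Ban: prune selected experts that contribute little for this token.
    cutoff = ban_ratio * probs[selected[0]]
    kept = [i for i in selected if probs[i] >= cutoff]

    # Renormalize so the kept experts' weights sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Example router logits for one token over six experts (made-up numbers).
logits = [2.0, 1.9, 0.1, -1.0, 1.5, 0.0]

baseline = route(logits, top_k=4)
adjusted = route(logits, top_k=4, key_experts={4}, pick_bonus=1.0, ban_ratio=0.2)

print(sorted(baseline))  # experts used with plain top-4 routing
print(sorted(adjusted))  # experts used after the Pick boost and Ban cutoff
```

With these made-up logits, plain top-4 routing activates four experts, while the adjusted call promotes expert 4 to the top weight and drops the weakest selected expert, so only three experts run. Fewer active experts per token is the kind of saving the reported inference speedup would come from, though the actual selection criteria are defined in the paper, not here.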
