What is MoE 2.0? Update Your Knowledge about Mixture-of-Experts
The fresh angle on current Mixture-of-Experts. We discuss what new MoE techniques like S'MoRE, Symbolic-MoE, and others mean for next-generation AI.
Even the most powerful techniques require rethinking to align with new trends. MoE is a fascinating framework that reshaped how we build and understand scalable AI systems. It has rapidly gained attention because it enables massive model growth, up to trillion-parameter models, without overwhelming hardware. What makes MoE especially powerful is its ability to dynamically select experts based on the input, allowing the model to specialize in different subdomains or tasks. It's already a backbone of many systems: DeepSeek-V3 incorporates an impressive 671 billion parameters using MoE; Google's Gemini 1.5 Pro employs a sparse MoE Transformer to handle a million-token context efficiently; Mistral's Mixtral 8x22B routes tokens across 8 experts per layer and outperforms dense models on cost and speed; Alibaba's Qwen2.5-Max, a 325B MoE trained on 20T tokens, ranks near the top of Chatbot Arena with standout reasoning and coding skills; and Meta's Llama 4 introduces a MoE architecture across its models, including the 400B-parameter Maverick and the 2T-parameter Behemoth, both designed for multimodal and multilingual tasks.
We started this AI 101 series by explaining what Mixture-of-Experts (MoE) is. Today, we discuss a fresh angle on current MoE developments that most readers haven't seen dissected yet. Why is MoE suddenly back in the spotlight?
A lot of lab chatter and industry roadmaps right now revolve around next-generation MoE designs. A pair of brand-new papers dropped this month: 1) Structural Mixture of Residual Experts (S'MoRE), April's release from Meta, shows how you can fuse LoRA-style low-rank adapters with a hierarchical MoE tree, introducing an exponential "structural flexibility" gain that dense models can't match; 2) Symbolic-MoE from UNC Chapel Hill moves MoE out of gradient space and into pure language space, achieving better accuracy than GPT-4o-mini and running 16 experts on a single GPU thanks to batched inference. There is also a bunch of fresh MoE developments optimizing inference of MoE models, such as eMoE, MoEShard, Speculative-MoE, and MoE-Gen.
What can these innovative methods teach us about rethinking the efficiency of next-gen MoE models? Let's break down what makes these developments special and why they might be the clearest path to open-source models that scale.
Welcome to MoE 2.0!
Click follow! If you want to receive our articles straight to your inbox, please subscribe here
Also follow us on YouTube and Twitter
In today's episode, we will cover:
- Structural Mixture of Residual Experts (S'MoRE)
- How does S'MoRE work?
- Performance of S'MoRE
- Not without limitations
- Symbolic-MoE
- How does Symbolic-MoE work?
- Results and advantages of Symbolic-MoE
- Limitations
- What these two methods buy you
- Other notable shifts to MoE 2.0
- Conclusion: Why does this new MoE shift matter right now?
- Sources and further reading
Structural Mixture of Residual Experts (S'MoRE)
Meta AI's April 8 release introduced a new approach to effective LLM learning and fine-tuning. They took two popular techniques that can be called fundamental in AI, LoRA (Low-Rank Adaptation) and MoE, and mixed them together. The result is an interesting, nontrivial development: Structural Mixture of Residual Experts (S'MoRE). It fuses LoRA-style low-rank adapters with a hierarchical MoE tree, which lets the model benefit from both approaches: efficiency from LoRA, because everything remains low-rank, and flexibility and power from MoE, plus some additional upgrades. Let's see how this works together.
But first, a quick reminder about LoRA. It's a lightweight and efficient way to fine-tune LLMs with minimal added parameters and computation. Instead of changing all the millions, or even billions, of parameters in a model, LoRA freezes the original weights and adds small, trainable layers (in the form of low-rank matrices) that adjust the model's behavior.
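To make the idea concrete, here is a minimal sketch of a LoRA-style layer in PyTorch. The class name, dimensions, and scaling are illustrative assumptions, not code from the S'MoRE paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank residual: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)        # freeze the original weights
        self.A = nn.Linear(in_dim, rank, bias=False)   # down-projection (trainable)
        self.B = nn.Linear(rank, out_dim, bias=False)  # up-projection (trainable)
        nn.init.zeros_(self.B.weight)                  # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(in_dim=4096, out_dim=4096, rank=8)
x = torch.randn(2, 4096)
print(layer(x).shape)  # torch.Size([2, 4096])
```

Only A and B are trained, so the number of trainable parameters scales with the rank rather than with the full weight matrix.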
How does S'MoRE work?
Here is the entire workflow of the S'MoRE system:
- It breaks down the model's "experts" into layers of small adjustments, called residuals. Each residual uses a low-rank update to modify the input.
- These residuals are connected in a tree-like structure, like branches, so the model can decide how to route information through them.
- The router sends each token down a dynamically chosen sub-tree of "residual experts," and S'MoRE computes the output by flowing through that tree.
By mixing and matching the paths through the tree, S'MoRE can act like it has many more experts than it actually does. S'MoRE reuses small modules across layers, letting different expert paths share the same building blocks (see the image on the left below).
Image Credit: S'MoRE original paper
And a little bit more about routing in S'MoRE. It's a top-down, step-by-step process, where each choice helps guide the next layer below. The router uses a small neural network (an MLP) that "looks at":
- The token itself (its embedding), and
- A key from the residual in the higher layer, also called a parent helper.
The router uses this info to calculate which lower-level helpers (children) make the most sense to activate, based on which parent was chosen before.
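As a rough illustration of this conditional, layer-by-layer routing, here is a simplified sketch; the MLP shape, the concatenation of the token embedding with the parent key, and the top-k selection are our assumptions rather than the S'MoRE implementation:

```python
import torch
import torch.nn as nn

class ChildRouter(nn.Module):
    """Scores child residual experts given the token embedding and the selected parent's key."""
    def __init__(self, token_dim, key_dim, num_children, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(token_dim + key_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_children),
        )

    def forward(self, token_emb, parent_key, top_k=2):
        # Score the children conditioned on the token and on which parent was chosen.
        scores = self.mlp(torch.cat([token_emb, parent_key], dim=-1))
        weights = torch.softmax(scores, dim=-1)
        topk = torch.topk(weights, k=top_k, dim=-1)  # activate only a few children
        return topk.indices, topk.values

router = ChildRouter(token_dim=512, key_dim=64, num_children=8)
idx, w = router(torch.randn(4, 512), torch.randn(4, 64))
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```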
So S'MoRE allows you to get the capacity of dozens of experts while instantiating only a handful of tiny matrices. And what's the actual result?
Performance of S'MoRE
With the same number of parameters as older methods, S'MoRE is far more flexible and effective, helping to fine-tune LLMs better. S'MoRE was tested on multiple tasks and models (like LLaMA-3 variants), and it consistently beat the best existing approaches in two ways:
- Higher accuracy by up to 2.1%.
- It uses about 16% fewer trainable parameters, because it only trains small low-rank matrices and lightweight projection layers.
Other benefits of S'MoRE include:
- Structural flexibility: Its tree-like multi-layer structure allows the model to combine and reuse small expert pieces in many different combinations. S'MoRE chooses both the experts and how they are connected, which exponentially increases the number of meaningful expert combinations.
Image Credit: S'MoRE original paper
- Low computation overhead: S'MoRE's total compute cost is nearly the same as LoRA's, and routing overhead is minimal, often less than 5-10%. It stays efficient even when using 2-3 layers of experts.
- Scalable design: Adding more layers often boosts performance at low extra cost, just 1-7%. For example, going from 2 layers to 3, S'MoRE achieved better accuracy while reducing parameter count by 27% on some tasks.
- Flexible routing mechanism: S'MoRE supports multiple types of gates, such as dense, noisy top-k, and switch-transformer-style gates (a generic noisy top-k gate is sketched below).
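For reference, here is a minimal sketch of a generic noisy top-k gate in the spirit of Shazeer et al.; it is not S'MoRE's exact gating code, and all dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Generic noisy top-k gate: add learned noise, keep the k largest logits, softmax the rest away."""
    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.w_noise = nn.Linear(dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        clean = self.w_gate(x)
        noise = torch.randn_like(clean) * F.softplus(self.w_noise(x))
        logits = clean + noise if self.training else clean
        topk_val, topk_idx = torch.topk(logits, self.k, dim=-1)
        # keep only the top-k logits, set the rest to -inf before the softmax
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_val)
        return torch.softmax(masked, dim=-1)  # sparse mixture weights

gate = NoisyTopKGate(dim=512, num_experts=8, k=2)
weights = gate(torch.randn(4, 512))
print((weights > 0).sum(dim=-1))  # tensor([2, 2, 2, 2]): two active experts per token
```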
However, S'MoRE has some limitations that we should also take into account.
Not without limitations
- Increased architectural complexity: S'MoRE's more complex routing system and multi-layer structure can be harder to implement and tune than simple LoRA or flat MoE models.
- The router adds some compute and memory cost, especially in low-rank settings, where routing becomes relatively more significant.
- Performance in much larger models or real-world applications is not yet fully tested.
- Tuning deeper S'MoRE configurations (with 3 layers or more) may require careful hyperparameter search, which adds to development cost.
Still, despite these limitations, S'MoRE shows how to shift MoE to the next level with "structural flexibility", that is, how smartly we arrange and use the model's pieces. This might be the key to making LLMs even better at fine-tuning for specific tasks without needing bigger and more expensive models.
And what does Symbolic-MoE offer?
Symbolic-MoE
Traditional MoE approaches let a group of specialized models combine their strengths but require retraining them from scratch, which is costly and impractical. On March 7, UNC Chapel Hill proposed a way to avoid this constant retraining and efficiently mix the outputs of multiple models: Symbolic-MoE.
It selects the best experts for each query based on their specific skills, focusing on the individual question instead of the overall task. For example, if a question is about algebra, it picks algebra experts, and if it's about probability, it picks experts with that skill. It groups queries and runs all of them for each selected model in a single batch. Because of this, Symbolic-MoE doesn't need to load models repeatedly, which makes it faster and less demanding on resources.
As a result, Symbolic-MoE can handle up to 16 models on a single GPU, or even scale across multiple GPUs if needed.
Image Credit: Symbolic-MoE original paper
Why is it called Symbolic? While traditional MoE frameworks operate in the parameter space of the models, Symbolic-MoE operates in the output space by leveraging text-based reasoning to integrate diverse model responses. In other words, it uses symbolic representations, in the form of natural language, to represent the expertise of the models. Symbolic-MoE reminded us of the modular architectures that Google's Jeff Dean considers the future of AI.
For now, let's explore how it works from the technical side.
How does Symbolic-MoE work?
Symbolic-MoE works in two stages: preprocessing, and inference with final answer generation. Let's break each stage down:
- Preprocessing
Before it starts solving problems, Symbolic-MoE sets everything up. It uses a small validation set of problems and a pool of available models. It runs each model on this validation set to create model profiles that show what each model is good at (like geometry or biology). For example, a model's profile might show that it's strong in algebra but weak in chemistry.
A "Keyword LLM" identifies the key skills for each question, like algebra or calculus for a math problem.
Symbolic-MoE also selects an aggregator based on its ability to combine answers from different experts into a final, high-quality response.
Image Credit: Symbolic-MoE original paper
- Inference
When a new problem comes in, Symbolic-MoE "looks" at the model profiles to figure out which experts are the best for the job, based on the skills needed for the problem.
A suitability score is calculated for each model, which helps decide which models to recruit for the job. This process is dynamic: each new question gets a different set of experts based on its specific needs. A global competency term ensures that the models selected for a given problem are not only good at the required skills but also strong performers overall.
Selected experts generate their reasoning by producing Chain-of-Thought (CoT) responses. The aggregator then takes these reasoning outputs and combines them into the final answer. This approach avoids the need for multiple rounds of back-and-forth discussion between models, making the process faster and more efficient.
To make things faster, Symbolic-MoE uses a special trick: a batch inference strategy. Instead of repeatedly loading and unloading models for each question, it groups questions that require the same set of experts and processes them all at once, so each model only gets loaded once per batch. This reduces the time spent on loading models and helps optimize the use of GPU memory.
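Here is a toy, plain-Python sketch of the recruit-and-batch idea; the profile values, skill names, scoring rule, and function names are hypothetical simplifications of the paper's suitability scoring and batched inference:

```python
from collections import defaultdict

# Hypothetical skill profiles built on a validation set: model -> skill -> accuracy
profiles = {
    "model_a": {"algebra": 0.82, "probability": 0.55, "biology": 0.40},
    "model_b": {"algebra": 0.50, "probability": 0.78, "biology": 0.70},
    "model_c": {"algebra": 0.65, "probability": 0.60, "biology": 0.85},
}

def recruit(skills, k=2):
    """Score each model by its average accuracy on the skills the query needs, pick the top k."""
    scores = {m: sum(p.get(s, 0.0) for s in skills) / len(skills) for m, p in profiles.items()}
    return tuple(sorted(sorted(scores, key=scores.get, reverse=True)[:k]))

queries = [
    ("Solve x^2 - 5x + 6 = 0", ["algebra"]),
    ("P(two heads in three flips)?", ["probability"]),
    ("Factor 2x^2 + 7x + 3", ["algebra"]),
]

# Group queries that recruit the same expert set so each model is loaded once per batch.
batches = defaultdict(list)
for question, skills in queries:
    batches[recruit(skills)].append(question)

for experts, qs in batches.items():
    print(experts, "->", len(qs), "queries in one batch")
```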
Results and advantages of Symbolic-MoE
By automatically selecting the best experts for each query and the best aggregator, and using a batch processing approach, Symbolic-MoE outperforms existing systems that require more complex multi-agent discussions, offering a simpler and more effective solution. And here is how:
- Performance improvement: Symbolic-MoE outperforms the best multi-agent baseline by an average of 8.15% across all tested benchmarks, such as MMLU-Pro, AIME, GPQA, and MedMCQA. It shows even higher accuracy than GPT-4o-mini.
Image Credit: Symbolic-MoE original paper
- Competitive with larger models: Primarily using 7-8B-parameter models, Symbolic-MoE matches or exceeds the performance of larger 70B models. This efficiency makes it accessible to users with limited hardware resources.
- Efficiency: It reduces run-time by 44% compared to multi-agent baselines like MoA (Mixture-of-Agents) when run on a single GPU. Thanks to batched inference, it can handle up to 16 models on a single GPU or even scale across multiple GPUs if needed. On 4 GPUs, it achieves almost a 2x speedup over MoA.
- Scalability: Symbolic-MoE scales efficiently, even when using a large number of experts, partly because the batch inference strategy reduces the need for frequent model loading and offloading.
- Flexibility: This approach is modular, allowing it to adapt to different tasks without needing to modify or retrain the models. Also, it can be easily updated and adapted as new models are introduced without retraining from scratch.
However, as usual, not everything is perfect.
Limitations
- Symbolic-MoE still requires running multiple models in parallel, which increases the inference cost.
- Dependency on skill inference: The system uses a small validation set to create skill-based model profiles, relying on the quality of the skill inference mechanism, the "Keyword LLM". An inaccurate or insufficiently trained inference mechanism can harm expert selection and performance.
- It is also limited by the quality of the models/experts in the pool. If they are not specialized enough or lack the required domain expertise, the framework may not achieve optimal performance.
- Dealing with a large pool of models to identify the most suitable experts for each query might introduce some overhead.
Overall, Symbolic-MoE demonstrates how model skill profiles gathered through language-based inference can lead to better efficiency than traditional MoE systems, which typically rely on parameter-based selection.
What these two methods buy you
Together, the two approaches, S'MoRE and Symbolic-MoE, introduce three innovative ideas that are gaining traction:
- Hierarchical residual routing from S'MoRE expands the expert-choice space without increasing the parameter count.
- Skill-based recruiting at query time from Symbolic-MoE selects only the experts needed for each specific question.
- GPU-friendly batching/sharding tricks keep latency low, even when activating 10 or more experts.
Are there any other signals that show MoE is entering a new stage of growth?
Other notable shifts to MoE 2.0
Firstly, let's talk about the recent top model release. Yes, it's Meta's Llama 4 Scout and Maverick models, which are notable for being the company's first models released with a MoE architecture. The MoE approach allows the Llama 4 Scout model, with 17 billion active parameters and 16 experts, to operate on a single NVIDIA H100 GPU. In addition to performance gains, MoE architectures offer potential cost reductions. As Bloomberg reported, Meta views MoE as a primary strategy for reducing expenses in large-scale inference tasks, particularly in high-performance applications.
Secondly, there is a consistent focus on optimizing MoE inference. Just take a look at some interesting developments:
eMoE
eMoE is a memory-efficient inference system for MoE-based LLMs introduced by researchers from the University of Virginia and Georgia Institute of Technology. It uses a predictive model to forecast which experts will be needed for future inputs, based on recurring token-to-expert routing patterns, and preloads only the most likely experts. To reduce overhead, eMoE invokes the expert predictor periodically, every few prompts.
It also leverages a clever trick: eMoE schedules tasks based on their specific requirements, like token generation length and sensitivity to expert routing, to ensure optimal use of resources.
As a result, eMoE reduces memory consumption by up to 80% while maintaining accuracy and improving inference latency by up to 17%.
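The preloading idea can be sketched roughly as follows; the frequency-based predictor, cache size, and refresh interval are illustrative assumptions, not eMoE's actual predictor:

```python
from collections import Counter

class ExpertPreloader:
    """Toy predictive preloading: track which experts recent prompts routed to,
    and keep only the most frequently used ones resident on the GPU."""
    def __init__(self, cache_size=4, refresh_every=8):
        self.counts = Counter()
        self.cache_size = cache_size
        self.refresh_every = refresh_every
        self.prompts_seen = 0
        self.resident = set()  # expert ids currently kept in GPU memory

    def observe(self, routed_experts):
        self.counts.update(routed_experts)
        self.prompts_seen += 1
        if self.prompts_seen % self.refresh_every == 0:  # refresh periodically, not per prompt
            self.resident = {e for e, _ in self.counts.most_common(self.cache_size)}

    def is_resident(self, expert_id):
        return expert_id in self.resident

preloader = ExpertPreloader()
for routed in [[0, 3], [0, 5], [3, 5], [0, 3], [1, 3], [0, 5], [3, 5], [0, 3]]:
    preloader.observe(routed)
print(preloader.resident)  # the most frequently routed experts stay loaded
```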
MoEShard
MoEShard is an inference system designed by researchers from EPFL and McGill University to address the challenge of load imbalance across multiple GPUs in MoE. It employs tensor sharding: the expert matrices are split between GPUs, with each GPU holding a part of every expert, ensuring that computation is evenly distributed. This also allows MoEShard to retain all tokens, unlike other methods that drop tokens to reduce memory usage.
MoEShard can achieve up to 6.4x faster time-to-first-token (TTFT) compared to systems like DeepSpeed.
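A single-process sketch of the sharding layout; the split axis and sizes are assumptions, and a real deployment would place the shards on different GPUs and combine partial results with distributed collectives:

```python
import torch

num_experts, d_model, d_ff, num_shards = 4, 16, 64, 2

# Full expert weights: one up-projection matrix per expert.
experts = [torch.randn(d_model, d_ff) for _ in range(num_experts)]

# Shard along the hidden dimension: every shard holds a slice of EVERY expert,
# so tokens routed to any expert can be processed on any shard.
shards = [
    [w[:, s * (d_ff // num_shards):(s + 1) * (d_ff // num_shards)] for w in experts]
    for s in range(num_shards)
]

x = torch.randn(3, d_model)                # three tokens routed to expert 1
full = x @ experts[1]
partials = [x @ shards[s][1] for s in range(num_shards)]
reassembled = torch.cat(partials, dim=-1)  # concatenating shard outputs recovers the full result
print(torch.allclose(full, reassembled))   # True
```

Because every shard holds a valid slice of every expert, any token can be processed anywhere and nothing needs to be dropped.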
DeepSpeed-MoE
Microsoft's DeepSpeed-MoE is already a classic example, as it was developed back in 2022. It combines several techniques to efficiently handle MoE models at scale:
- Pyramid-Residual MoE (PR-MoE): Integrates residual connections with MoE layers. By maintaining key parameters in these residual connections and sharing weights across layers, PR-MoE reduces the overall model size by up to 3x without compromising the quality.
- Optimized inference system: Includes features like expert parallelism, tensor slicing (dividing the model's parameters into smaller, manageable pieces), and memory bandwidth management (optimizing the flow of data between the GPU and memory).
- Mixture-of-Students (MoS): Ensures that the system can run smaller, compressed versions of MoE models.
Thanks to these features, DeepSpeed-MoE achieves up to 7.3x reduction in inference latency and cost, as well as 4.5x faster and 9x cheaper inference compared to dense models of similar quality. But later MoE methods significantly outperform DeepSpeed-MoE.
Speculative-MoE (s-MoE)
s-MoE by Huawei aims to improve communication efficiency in parallel MoE inference. It employs two mechanisms:
- Speculative Token Shuffling (s-TS): Predicts the routing paths for tokens early, allowing tokens to be shuffled and sent to their most likely experts in advance. This reduces the need for expensive communication between GPUs during routing.
- Speculative Expert Pre-grouping (s-EG): Experts likely to be activated together are grouped and placed on the same GPU, minimizing cross-device communication and enhancing local activation rates.
The system also uses dynamic co-clustering, grouping tokens and experts based on predicted activation patterns.
Together, these mechanisms allow s-MoE to minimize inter-GPU communication, achieving up to a 75% reduction in communication costs and lower latency. It also significantly boosts inference throughput, by up to 2.37x compared to DeepSpeed-MoE.
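To illustrate only the pre-grouping intuition, here is a greedy toy heuristic; the co-activation log, GPU capacities, and placement rule are hypothetical and not Huawei's actual algorithm:

```python
from collections import Counter

# Hypothetical log of which experts past tokens activated together (top-2 routing).
activations = [(0, 1), (0, 1), (2, 3), (0, 1), (2, 3), (1, 4), (2, 3), (4, 5)]

# Count how often each pair of experts is activated together.
co_counts = Counter(tuple(sorted(pair)) for pair in activations)

# Greedily place frequently co-activated experts on the same GPU.
num_gpus, capacity = 2, 3
gpus = [set() for _ in range(num_gpus)]
for (a, b), _ in co_counts.most_common():
    for gpu in gpus:
        if len(gpu | {a, b}) <= capacity:
            gpu.update({a, b})
            break

# Place any experts that are still unassigned wherever there is room.
all_experts = {e for pair in activations for e in pair}
for e in sorted(all_experts - set().union(*gpus)):
    min(gpus, key=len).add(e)

print(gpus)  # [{0, 1, 4}, {2, 3, 5}]: co-activated experts end up co-located
```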
MoE-Gen
MoE-Gen from the University of Edinburgh also focuses on achieving high throughput on a single GPU to optimize inference of MoE models. It uses module-based batching. Instead of processing entire model batches at once, MoE-Gen divides the model into its attention and expert modules. It accumulates tokens in host memory and dynamically batches them for GPU processing, adjusting the batch size for each module based on GPU capabilities.
MoE-Gen offloads key-value (KV) caches and model parameters to host memory, reducing GPU memory pressure and allowing for the use of larger batch sizes, which in turn increases throughput by 8-31x compared to other methods, like DeepSpeed-MoE.
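A schematic of module-based batching; the module stand-ins, chunk sizes, and the host-to-GPU movement noted in the comments are simplified assumptions, not MoE-Gen's implementation:

```python
import torch

def run_in_chunks(module_fn, inputs, chunk_size):
    """Accumulate inputs on the host and feed the GPU in module-specific batch sizes."""
    outputs = []
    for start in range(0, inputs.shape[0], chunk_size):
        chunk = inputs[start:start + chunk_size]  # would be moved host -> GPU here
        outputs.append(module_fn(chunk))          # and offloaded back after compute
    return torch.cat(outputs, dim=0)

d_model, d_ff = 16, 64
attn_proj = torch.nn.Linear(d_model, d_model)  # stand-in for the attention module
expert_ffn = torch.nn.Linear(d_model, d_ff)    # stand-in for an expert module

tokens = torch.randn(1024, d_model)            # accumulated in host memory

# Attention is memory-hungry (KV cache), so it gets a smaller batch;
# expert FFNs are compute-bound, so they get a much larger one.
hidden = run_in_chunks(attn_proj, tokens, chunk_size=128)
expert_out = run_in_chunks(expert_ffn, hidden, chunk_size=512)
print(expert_out.shape)  # torch.Size([1024, 64])
```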
This wide range of MoE inference optimization methods shows that we can significantly boost MoE model efficiency, making MoE a powerful tool for scaling AI systems while cutting resource consumption, computational costs, and latency.
Conclusion: Why does this new MoE shift matter right now?
Today, large proprietary labs are moving toward trillion-parameter "jagged" models, where only the right 1-2% of parameters are activated for each token. Meanwhile, the open-source community seeks similar efficiency improvements but lacks the resources for such large-scale training. As these companies continue to bet on scaling AI systems, MoE offers a practical solution for achieving high performance without the typical rise in computational costs. Techniques like S'MoRE and Symbolic-MoE address this challenge directly: they allow you to start from a smaller, dense model with, for example, 8B parameters, incorporate specialized low-rank experts or plug-in models, and create a powerful system that performs well beyond expectations, without needing a massive GPU farm.
Additionally, as many developers switch their focus to the inference stage and its efficiency, methods like eMoE, MoEShard, DeepSpeed-MoE, Speculative-MoE, and MoE-Gen are pushing the boundaries of MoE model inference. These advancements show how we can adapt the fundamental MoE technique to current trends. And the potential is still far from its limits.
Author: Alyona Vert Editor: Ksenia Se
Sources and further reading
- S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning
- Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning by Elias Stengel-Eskin et al.
- eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- Accelerating MoE Model Inference with Expert Sharding
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling
- MOE-GEN: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- A Survey on Mixture of Experts in Large Language Models by Juyong Jiang et al.
- A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications
Resources from the Turing Post
If you want to receive our articles straight to your inbox, please subscribe here