AI Breakthrough: Faster Models, Huge Gains!
April 20, 2026
Summary
A research team from Moonshot AI and Tsinghua University is exploring a new approach to serving large language models. Their "Prefill-as-a-Service" architecture splits the inference pipeline, offloading compute-heavy prefill work to dedicated clusters of H200 GPUs and shipping the resulting KVCache over commodity Ethernet to local prefill-decode (PD) clusters of H20 GPUs for decoding. Tests with a 1-trillion-parameter hybrid model showed a 54% increase in serving throughput over a standard setup, and a 15% gain at equal hardware cost. The key idea is separating compute-intensive prefill from memory-bandwidth-bound decode so that each phase runs on hardware suited to it; this also yields large KVCache memory savings, up to 36x, for models with linear-complexity attention stacks.
Insights
PREFILL-AS-A-SERVICE (PrfaaS): A NEW ARCHITECTURE FOR LARGE LANGUAGE MODEL SERVING
The emergence of large language models (LLMs) has created significant inference-efficiency challenges, particularly around the computationally intensive prefill phase. Traditionally, prefill and decode have run inside the same datacenter over high-bandwidth RDMA networks, binding both phases to a single network fabric and creating a scaling bottleneck. The research presented here introduces Prefill-as-a-Service (PrfaaS), a novel architecture that relaxes this constraint by offloading prefill to dedicated clusters and transporting the KVCache over commodity Ethernet, ultimately yielding substantial improvements in serving throughput.
DISAGGREGATING PREFILL AND DECODE FOR OPTIMIZED PERFORMANCE
The fundamental challenge in LLM serving lies in the distinct resource profiles of the prefill and decode phases. Prefill, which processes the input tokens and produces the KVCache, is compute-intensive; decode, which generates output tokens one at a time, is memory-bandwidth-bound. Traditional prefill-decode (PD) disaggregation improves utilization but introduces a transport problem: the KVCache produced by prefill must reach the decode side before output generation can begin. With dense-attention models based on Grouped Query Attention (GQA), the resulting KVCache is so large that this transfer demands RDMA-class interconnects, tightly binding PD disaggregation to a single datacenter network fabric. The research identifies a shift in model architecture, the adoption of hybrid attention stacks, as the key enabler for PrfaaS.
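A rough way to see why the two phases stress different resources is to compare estimated FLOPs against bytes of weights read per step. The sketch below is back-of-envelope arithmetic under stated assumptions, not figures from the study; the 1-trillion-parameter size echoes the case study's model, but the prompt length is illustrative.

```python
# Back-of-envelope arithmetic intensity for prefill vs. decode.
# All parameter values are illustrative assumptions.

def prefill_intensity(params: float, batch_tokens: int) -> float:
    """FLOPs per byte of weights read when processing a whole prompt at once.

    A dense forward pass costs ~2 * params FLOPs per token, and the fp16
    weights (2 bytes each) are streamed once for the entire token batch.
    """
    flops = 2 * params * batch_tokens
    bytes_moved = 2 * params  # weights read once, reused across all tokens
    return flops / bytes_moved

def decode_intensity(params: float) -> float:
    """Same ratio when generating one token per step: the full weight set
    is re-read for every single output token, so reuse collapses to 1."""
    return 2 * params / (2 * params)

P = 1e12  # a 1-trillion-parameter model, as in the case study
print(f"prefill (8K-token prompt): {prefill_intensity(P, 8192):.0f} FLOPs/byte")
print(f"decode  (1 token/step):    {decode_intensity(P):.0f} FLOPs/byte")
```

The thousands-fold gap is the whole story: prefill amortizes each weight read across every prompt token (compute-bound), while decode re-reads the weights per generated token (memory-bandwidth-bound), so the two phases benefit from different GPUs.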
HYBRID ATTENTION STACKS AND KV THROUGHPUT REDUCTIONS
A growing number of modern LLMs, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, employ hybrid attention stacks that mix full-attention layers with linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). These architectures dramatically shrink the KVCache because only the full-attention layers produce data that scales with sequence length. The resulting reduction in KV throughput, defined as KVCache size divided by prefill latency, is substantial: MiMo-V2-Flash produces 13x less KVCache than MiniMax-M2.5 at 32K tokens, and Qwen3.5-397B sees a 4x reduction versus Qwen3-235B. The Ring-2.5-1T model combines KV memory savings from MLA with a further 8x reduction from its 7:1 hybrid ratio, for a total saving of approximately 36x. This shift fundamentally alters the landscape of LLM serving, making cross-datacenter PD disaggregation feasible.
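The effect of a hybrid ratio on cache size is simple to sketch. Below, only the 7:1 full-to-linear layer ratio comes from the article; the layer count, KV-head count, head dimension, and fp16 dtype are placeholder assumptions for a generic GQA-style stack.

```python
# KVCache size for a dense-attention stack vs. a 7:1 hybrid stack.
# Only the 7:1 ratio is from the article; everything else is assumed.

def kvcache_bytes(layers: int, kv_heads: int, head_dim: int,
                  seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-request KVCache: K and V tensors for every attention layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

LAYERS, KV_HEADS, HEAD_DIM, SEQ = 64, 8, 128, 32_768

dense = kvcache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ)
# In a 7:1 hybrid, only 1 layer in 8 is full attention and contributes
# cache that grows with sequence length; linear layers keep bounded state.
hybrid = kvcache_bytes(LAYERS // 8, KV_HEADS, HEAD_DIM, SEQ)

print(f"dense : {dense  / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB  ({dense // hybrid}x smaller)")
```

The 8x factor from the hybrid ratio alone then multiplies with whatever per-layer compression (e.g. MLA) the full-attention layers use, which is how totals like 36x arise.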
OPTIMIZING DATA FLOW THROUGH SCHEDULING
The core of this research is the intelligent rebalancing of computational resources between the PrfaaS cluster and the local PD cluster. At extended timescales, the scheduler dynamically adjusts the number of prefill and decode nodes within the local PD cluster in response to evolving traffic patterns, keeping the system close to its optimal throughput. The case study paired a PrfaaS cluster of 32 H200 GPUs with a local PD cluster of 64 H20 GPUs, connected over a VPC network offering roughly 100 Gbps of cross-cluster bandwidth. Under the optimized configuration, total egress from the PrfaaS cluster reaches approximately 13 Gbps, only 13% of the available Ethernet capacity, leaving substantial bandwidth headroom for future growth and increased demand.
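The ~13 Gbps figure is easy to sanity-check: egress is just KVCache shipped per request times request rate. The 100 Gbps link and the ~13 Gbps load are from the article; the per-request cache size and request rate below are back-solved assumptions chosen to land near that operating point, not published numbers.

```python
# Egress load on the PrfaaS cluster: per-request KVCache times request rate.
# The 100 Gbps link is from the article; the other inputs are assumptions.

def egress_gbps(kv_gib_per_request: float, requests_per_sec: float) -> float:
    """Sustained cross-cluster bandwidth in Gbps (1 GiB = 8 * 2**30 bits)."""
    return kv_gib_per_request * requests_per_sec * 8 * 2**30 / 1e9

link_gbps = 100.0
load = egress_gbps(kv_gib_per_request=0.5, requests_per_sec=3.0)
print(f"egress: {load:.1f} Gbps ({load / link_gbps:.0%} of the VPC link)")
```

Any combination of cache size and request rate with the same product gives the same load, which is why hybrid models' small per-request caches are what make a commodity 100 Gbps link sufficient.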
SCALABILITY AND THROUGHPUT PROJECTIONS
The findings extend beyond the initial case study, projecting significant scalability for the PrfaaS architecture. Even at a deployment scale of 10,000 GPUs, the aggregate egress bandwidth required for KVCache transfers is estimated at a mere 1.8 Tbps, comfortably within the capacity of modern inter-datacenter links. The latency metrics also improve markedly: mean time to first token (TTFT) decreases by 50% and P90 TTFT drops by 64% compared to a homogeneous baseline configuration. This substantial reduction highlights the effectiveness of the heterogeneous architecture and the scheduling layer's role in optimizing data flow.
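A naive linear extrapolation from the case study shows the projection is at least the right order of magnitude. The 96-GPU, ~13 Gbps operating point is from the article; scaling egress linearly per GPU is this sketch's own assumption, and the article's 1.8 Tbps estimate presumably bakes in different traffic or cluster-ratio assumptions that it does not spell out.

```python
# Linear extrapolation of KVCache egress to a 10,000-GPU deployment.
# The 96-GPU / ~13 Gbps point is from the article; linearity is assumed.

case_study_gpus = 32 + 64  # PrfaaS H200s + local PD H20s
case_study_gbps = 13.0

per_gpu_gbps = case_study_gbps / case_study_gpus
projected_tbps = per_gpu_gbps * 10_000 / 1000
print(f"projected egress at 10k GPUs: ~{projected_tbps:.1f} Tbps")
```

This lands around 1.4 Tbps, in the same low-single-digit-Tbps range as the article's 1.8 Tbps figure, supporting the claim that modern inter-datacenter links have ample capacity.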
THE VALUE OF A LAYERED APPROACH
The comparative analysis reveals the crucial contribution of the scheduling layer. A naive static configuration, with all prefill tasks assigned to H200 GPUs and all decode tasks to H20 GPUs but no intelligent scheduling, achieves only a 1.16x throughput improvement over the baseline; the full PrfaaS-PD system, with the scheduling logic included, delivers 1.54x. The gap between 1.16x and 1.54x demonstrates that the scheduling layer accounts for the majority of the performance gains. The research team positions PrfaaS as a design viable today and argues it will only grow more relevant as context windows expand, KVCache compression techniques mature, and specialized hardware, such as NVIDIA's Rubin CPX for prefill and LPU-style chips for decode, becomes more prevalent. This layered approach suggests a strong future for cross-datacenter PD disaggregation.
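The "majority of the gains" claim follows directly from the two reported speedups, which can be decomposed into the share from hardware placement alone and the share added by the scheduler:

```python
# Decompose the end-to-end speedup into the contribution of hardware
# placement alone vs. the scheduling layer. Both speedups are from the
# article; the decomposition itself is simple arithmetic.

naive_speedup = 1.16  # H200s for prefill, H20s for decode, no scheduler
full_speedup = 1.54   # full PrfaaS-PD system with the scheduling layer

total_gain = full_speedup - 1.0
scheduler_share = (full_speedup - naive_speedup) / total_gain
print(f"scheduling layer contributes {scheduler_share:.0%} of the gain")
```

Roughly 70% of the overall gain comes from the scheduling layer rather than from simply matching each phase to its preferred GPU.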
Our editorial team uses AI tools to aggregate and synthesize global reporting. Data is cross-referenced with public records as of April 2026.