AI Breakthrough 🚀: Faster Models, Huge Gains! 🤯


April 20, 2026


🧐 Quick Intel


  • Moonshot AI and Tsinghua University developed Prefill-as-a-Service (PrfaaS) to selectively offload prefill to compute-dense clusters and transfer KVCache via commodity Ethernet.
  • Using an internal 1T-parameter hybrid model, the research team achieved 54% higher serving throughput compared to a homogeneous PD baseline and 32% higher than a naive heterogeneous setup.
  • At equal hardware cost, the throughput gain was approximately 15%, using H200 GPUs for prefill and H20 GPUs for decode.
  • Benchmarking MiniMax-M2.5, whose KVCache uses Grouped Query Attention (GQA), showed 60 Gbps of KVCache throughput for a 32K-token request on an 8×H200 instance.
  • MiMo-V2-Flash produced KVCache at 4.66 Gbps, a 13× reduction compared to MiniMax-M2.5, and Qwen3.5-397B reached 8.25 Gbps versus 33.35 Gbps for Qwen3-235B.
  • Ring-2.5-1T achieved a 36× KV memory saving, combining an 8× reduction from its 7:1 hybrid ratio with MLA's 4.5× compression over GQA.
  • RDMA-class interconnects are necessary for data transfer to avoid compute stalling within the PD disaggregation architecture.
  • ๐Ÿ“Summary


    A research team from Moonshot AI and Tsinghua University is exploring a new approach to large language model serving. Their "Prefill-as-a-Service" architecture strategically divides the process, offloading lengthy prefill tasks to dedicated clusters equipped with H200 GPUs. The resulting KVCache is then transferred via commodity Ethernet to local PD clusters for decoding on H20 GPUs. Tests with a 1-trillion-parameter hybrid model demonstrated a 54% increase in serving throughput compared to a standard setup, and a 15% gain at equal hardware cost. The innovation centers on separating compute-intensive prefill from memory-bandwidth-demanding decode, optimizing hardware utilization and yielding memory savings of up to 36 times, particularly with models employing linear-complexity attention stacks.

    💡 Insights



    PREFILL-AS-A-SERVICE (PrfaaS): A NEW ARCHITECTURE FOR LARGE LANGUAGE MODEL SERVING
    The emergence of large language models (LLMs) has presented significant challenges in inference efficiency, particularly in the computationally intensive prefill phase. Traditionally, prefill and decode operations have been confined to the same datacenter, utilizing high-bandwidth RDMA networks. This coupling limits how the two phases can be scaled and provisioned independently. The research presented here introduces Prefill-as-a-Service (PrfaaS), a novel architecture designed to overcome these constraints by strategically offloading prefill to dedicated clusters and leveraging commodity Ethernet for KVCache transport, ultimately leading to substantial improvements in serving throughput.

    DISAGGREGATING PREFILL AND DECODE FOR OPTIMIZED PERFORMANCE
    The fundamental challenge in LLM serving lies in the distinct computational and memory-bandwidth requirements of the prefill and decode phases. Prefill, responsible for processing input tokens and generating the KVCache, is inherently compute-intensive. Decode, focused on generating output tokens, is memory-bandwidth-demanding. Traditional Prefill-Decode (PD) disaggregation, while improving utilization, introduces a significant transport problem: the KVCache generated by prefill must be transferred to the decode side before output generation can begin. Because dense-attention models using Grouped Query Attention (GQA) produce enormous KVCaches, this transfer has required RDMA-class interconnects, tightly binding PD disaggregation to a single datacenter network fabric. The research highlights a shift in model architecture as the key enabler for PrfaaS: the adoption of hybrid attention stacks.
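    The KVCache sizes that force RDMA-class interconnects can be seen with simple arithmetic. A minimal sketch, where the model shape (layers, KV heads, head dimension, dtype) is a hypothetical example and not a configuration from the paper:

```python
# Illustrative sketch: per-request KVCache size for a dense GQA model.
# The model shape below is an assumed example, not a config from the paper.

def kvcache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V tensors for every layer; size scales linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shape: 61 layers, 8 KV heads, head_dim 128, fp16 (2 bytes/element).
size = kvcache_bytes(seq_len=32_768, n_layers=61, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # multiple GiB for a single 32K-token request
```

    Several gigabytes per request is why moving this cache between machines has traditionally demanded RDMA-class bandwidth rather than commodity Ethernet.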

    HYBRID ATTENTION STACKS AND KV THROUGHPUT REDUCTIONS
    A growing number of modern LLMs, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, employ hybrid attention stacks that integrate full-attention layers with linear-complexity or bounded-state layers like Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). These architectures dramatically reduce KVCache size because only the full-attention layers produce data that scales with sequence length. The resulting reduction in KV throughput, defined as KVCache size divided by prefill latency, is substantial. For example, MiMo-V2-Flash achieves a 13x reduction in KVCache production compared to MiniMax-M2.5 at 32K tokens, while Qwen3.5-397B sees a 4x reduction versus Qwen3-235B. Ring-2.5-1T demonstrates a total KV memory saving of approximately 36x, combining an 8x reduction from its 7:1 hybrid ratio with MLA's 4.5x compression over GQA. This shift fundamentally alters the landscape of LLM serving, making cross-datacenter PD disaggregation feasible.
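    The reduction factors quoted above follow directly from the reported KV-throughput figures; a quick arithmetic check using only numbers from the text:

```python
# Cross-checking the quoted reduction factors from the reported KV throughputs
# (KV throughput = KVCache bytes produced / prefill latency).
minimax_gbps, mimo_gbps = 60.0, 4.66    # MiniMax-M2.5 vs MiMo-V2-Flash
qwen3_gbps, qwen35_gbps = 33.35, 8.25   # Qwen3-235B vs Qwen3.5-397B
print(round(minimax_gbps / mimo_gbps))  # -> 13 (the quoted 13x reduction)
print(round(qwen3_gbps / qwen35_gbps))  # -> 4  (the quoted 4x reduction)

# Ring-2.5-1T's ~36x total saving decomposes into the two quoted factors:
hybrid_saving = 8.0    # 7:1 hybrid ratio: only 1 in 8 layers keeps full attention
mla_compression = 4.5  # MLA vs GQA per-layer cache compression
print(hybrid_saving * mla_compression)  # -> 36.0
```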

    OPTIMIZING DATA FLOW THROUGH SCHEDULING
    The core of this research centers on the intelligent rebalancing of computational resources within a distributed system, specifically the interaction between PrfaaS and local PD clusters. At extended timescales, the scheduler dynamically adjusts the number of prefill and decode nodes within the local PD cluster, responding to evolving traffic patterns. This proactive approach keeps the system close to its optimal throughput capacity. A key component of the study involved a PrfaaS cluster comprising 32 H200 GPUs paired with a local PD cluster of 64 H20 GPUs, connected via a VPC network offering approximately 100 Gbps of cross-cluster bandwidth. The overall egress load on the PrfaaS cluster, operating under the optimized configuration, reaches approximately 13 Gbps, only 13% of the available Ethernet capacity. The substantial bandwidth headroom that remains indicates the system's capacity for future growth and increased demand.
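    As a back-of-envelope check of the utilization figure above, using the two bandwidth numbers reported in the text:

```python
# Egress utilization of the cross-cluster Ethernet link, per the figures above.
egress_gbps = 13.0  # PrfaaS-cluster egress under the optimized configuration
link_gbps = 100.0   # approximate cross-cluster VPC bandwidth
utilization = egress_gbps / link_gbps
print(f"{utilization:.0%} of link capacity used")  # -> 13% of link capacity used
```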

    SCALABILITY AND THROUGHPUT PROJECTIONS
    The findings extend beyond the initial case study, projecting significant scalability for the PrfaaS architecture. Even at a deployment scale of 10,000 GPUs, the aggregate egress bandwidth required for KVCache transfers is estimated at only 1.8 Tbps, comfortably within the capacity of modern inter-datacenter links. This projection underscores the inherent efficiency of the system's design. Furthermore, the research demonstrates a marked improvement in latency: mean Time to First Token (TTFT) decreases by 50% and P90 TTFT drops by 64% compared to a homogeneous baseline configuration. This substantial reduction highlights the effectiveness of the heterogeneous architecture and the scheduling layer's role in optimizing data flow.

    THE VALUE OF A LAYERED APPROACH
    The comparative analysis reveals the crucial contribution of the scheduling layer. A naive heterogeneous configuration, with all prefill tasks assigned to H200 GPUs and all decode tasks to H20 GPUs, achieves only a 1.16x throughput improvement over the baseline. The full PrfaaS-PD system, incorporating the intelligent scheduling logic, delivers a 1.54x throughput boost. The gap between 1.16x and 1.54x demonstrates the scheduling layer's impact, which accounts for the majority of the performance gains. The research team positions PrfaaS as a viable design available today, advocating for its continued relevance as context windows expand, KVCache compression techniques mature, and specialized hardware like NVIDIA's Rubin CPX for prefill and LPU-style chips for decode becomes more prevalent. This layered approach suggests a strong future for cross-datacenter PD disaggregation.
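    The scheduling layer's share of the total gain follows from the two quoted multipliers; a quick calculation, treating each throughput multiplier as a gain over the 1.00x baseline:

```python
# How much of the total throughput gain comes from the scheduling layer?
baseline, naive_hetero, full_prfaas = 1.00, 1.16, 1.54
scheduling_share = (full_prfaas - naive_hetero) / (full_prfaas - baseline)
print(f"{scheduling_share:.0%}")  # roughly 70% of the total gain
```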

    Our editorial team uses AI tools to aggregate and synthesize global reporting. Data is cross-referenced with public records as of April 2026.