🤯AI's Network Revolution: Faster, Smarter GPUs! 🚀

May 07, 2026 | AI


🧠 Quick Intel


  • OpenAI released MRC, a new networking protocol developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA over two years, published through the Open Compute Project.
  • More than 900 million people use ChatGPT weekly.
  • MRC extends RDMA over Converged Ethernet (RoCE) and utilizes SRv6-based source routing, fundamentally changing cluster architecture.
  • The protocol allows one network interface to connect to eight different switches, enabling a two-tier fabric that links approximately 131,000 GPUs.
  • MRC is implemented across 400 and 800 Gb/s RDMA NICs, including NVIDIA ConnectX-8, AMD Pollara, AMD Vulcano, and Broadcom Thor Ultra.
  • AMD contributed the NSCC congestion control algorithm; MRC is currently running in production on OpenAI’s NVIDIA GB200 supercomputers at Oracle Cloud Infrastructure in Abilene, Texas and on Microsoft’s Fairwater supercomputers in Atlanta and Wisconsin.
📝 Summary


    OpenAI recently unveiled MRC, a new networking protocol developed over two years with key partners including AMD, Broadcom, Intel, Microsoft, and NVIDIA. The specification, released through the Open Compute Project, addresses the growing challenges of training advanced AI models. MRC extends RDMA over Converged Ethernet, leveraging techniques from the Ultra Ethernet Consortium and SRv6 source routing. It fundamentally alters cluster architecture: a single network interface connects to eight switches, enabling a two-tier fabric of approximately 131,000 GPUs. Currently deployed across OpenAI’s NVIDIA GB200 supercomputers, including those at Oracle Cloud Infrastructure in Texas and Microsoft’s Fairwater supercomputers, MRC represents a significant advancement in high-performance networking for AI development.

    💡 Insights



    THE RISE OF MRC: OPENAI’S NETWORK REVOLUTION
    OpenAI has unveiled MRC (Multipath Reliable Connection), a novel networking protocol designed to address the escalating challenges of training large AI models. Developed over two years with contributions from AMD, Broadcom, Intel, Microsoft, and NVIDIA, MRC represents a significant shift in how AI infrastructure is approached, moving beyond simple compute scaling to a robust and predictable network architecture. The protocol’s publication through the Open Compute Project (OCP) facilitates broader industry adoption and innovation.

    UNDERSTANDING THE CHALLENGES OF LARGE-SCALE AI TRAINING
    Training massive AI models, such as those powering ChatGPT, demands the transfer of vast quantities of data – often millions of transfers per step. Traditional network architectures struggle to handle this volume reliably, leading to bottlenecks, delays, and ultimately, idle GPU time. Network congestion, link failures, and device issues are common culprits, creating a compounding infrastructure problem. The sheer scale of ChatGPT’s user base (over 900 million weekly users) amplifies the importance of minimizing these inefficiencies.
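    To make that scale concrete, here is a rough back-of-envelope sketch. The cluster size and the flat ring all-reduce pattern are illustrative assumptions, not figures from the article:

    ```python
    # Rough transfer count for one ring all-reduce of the gradients.
    # All numbers here are illustrative assumptions, not from the article.

    num_gpus = 1024                     # hypothetical data-parallel group
    sends_per_gpu = 2 * (num_gpus - 1)  # reduce-scatter + all-gather phases

    total_transfers = num_gpus * sends_per_gpu
    print(f"{total_transfers:,} point-to-point transfers")  # 2,095,104
    ```

    Multiply that by several gradient buckets per step, plus pipeline- and expert-parallel traffic, and per-step transfer counts sit comfortably in the millions, so even rare per-transfer stalls add up to meaningful idle GPU time.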

    MRC: A MULTI-PATH, FAIL-SAFE NETWORK SOLUTION
    MRC’s core innovation lies in distributing data transfers across hundreds of simultaneous network paths, which dramatically reduces congestion in the network’s core and yields a far more resilient and efficient training environment. Unlike traditional RoCEv2, which pins each flow to a single path, MRC employs intelligent packet-spray load balancing to dynamically route traffic around failed paths, ensuring continuous training. The protocol leverages SRv6 (Segment Routing over IPv6) to encode the exact route directly within the packet header, eliminating the need for complex switch routing calculations and minimizing processing load.
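    To illustrate what source routing means at the byte level, here is a minimal sketch of a Segment Routing Header encoder following RFC 8754. The segment addresses are hypothetical, and this is a generic SRv6 illustration, not OpenAI's implementation:

    ```python
    import struct
    from ipaddress import IPv6Address

    def build_srh(segments: list[str], next_header: int = 17) -> bytes:
        """Encode an SRv6 Segment Routing Header (RFC 8754).

        The sender lists every hop the packet must traverse, so switches
        simply forward toward the active segment instead of computing routes.
        """
        n = len(segments)
        hdr_ext_len = 2 * n        # length in 8-octet units, excluding the first 8
        header = struct.pack(
            "!BBBBBBH",
            next_header,           # e.g., 17 = UDP payload follows
            hdr_ext_len,
            4,                     # routing type 4 = segment routing
            n - 1,                 # Segments Left: next segment to visit
            n - 1,                 # Last Entry: index of last list element
            0, 0,                  # flags, tag
        )
        # The segment list is encoded in reverse order (final hop first).
        for seg in reversed(segments):
            header += IPv6Address(seg).packed
        return header

    # Hypothetical three-hop path: leaf -> spine -> destination NIC.
    srh = build_srh(["2001:db8::a", "2001:db8::b", "2001:db8::c"])
    print(srh.hex())
    ```

    Because the full route rides in the header, a NIC can spray successive packets of one connection across many distinct segment lists, which is exactly the multipath behavior described above.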

    TECHNICAL ARCHITECTURE AND KEY COMPONENTS
    MRC’s architecture is built by extending RDMA over Converged Ethernet (RoCE), a standard that enables hardware-accelerated remote direct memory access. It incorporates techniques from the Ultra Ethernet Consortium (UEC) and expands them with SRv6-based source routing. A crucial design decision places routing intelligence in the Network Interface Cards (NICs) rather than the switches, preventing the adaptive mechanisms of the two layers from interfering with each other. This unconventional approach allows MRC to detect and recover from failures within microseconds, a stark contrast to the seconds or tens of seconds often experienced with conventional network fabrics.
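    As a toy illustration of NIC-resident failure detection, the sketch below tracks per-path liveness from ACK arrival times. The class name, timeout, and interfaces are invented for illustration and are not MRC's actual logic:

    ```python
    import time

    class PathMonitor:
        """Toy NIC-side path health tracker (illustrative, not MRC itself).

        A path is considered dead if no ACK has arrived within a small
        timeout, so failover decisions stay local to the NIC rather than
        depending on the switches.
        """

        def __init__(self, paths, timeout_us=500):
            self.timeout_s = timeout_us / 1e6
            self.last_ack = {p: time.monotonic() for p in paths}

        def on_ack(self, path):
            self.last_ack[path] = time.monotonic()

        def live_paths(self):
            now = time.monotonic()
            return [p for p, t in self.last_ack.items()
                    if now - t < self.timeout_s]
    ```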

    DYNAMIC LOAD BALANCING AND FAILOVER MECHANISMS
    MRC’s dynamic load balancing and failover mechanisms are central to its performance. If a network path, link, or switch fails, MRC detects the issue and reroutes traffic across the remaining paths on a microsecond timescale, minimizing disruption and maintaining training progress. For paths that are congested or degraded rather than fully failed, the system adapts by reducing the rate of data transfer through the affected path. Once a failed path is restored, it typically returns to service within a minute, allowing the network to seamlessly resume normal operation.
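    A minimal sketch of how spraying, backoff, and failover can interact is shown below; the weights and policy are assumptions for illustration, not the protocol's actual algorithm:

    ```python
    import random

    class SprayScheduler:
        """Illustrative packet-spray path selector (not MRC's algorithm).

        Healthy paths carry equal weight; congested paths keep a reduced
        weight so a trickle of traffic still probes them; failed paths get
        zero weight until they recover.
        """

        def __init__(self, paths):
            self.weights = {p: 1.0 for p in paths}

        def mark_failed(self, path):
            self.weights[path] = 0.0       # reroute around the failure

        def mark_congested(self, path):
            self.weights[path] = max(0.1, self.weights[path] * 0.5)

        def mark_recovered(self, path):
            self.weights[path] = 1.0       # return the path to service

        def pick_path(self):
            paths = list(self.weights)
            return random.choices(paths, [self.weights[p] for p in paths])[0]
    ```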

    RE-THINKING CLUSTER ARCHITECTURE: A TWO-TIER APPROACH
    MRC fundamentally alters cluster architecture by dividing network interfaces into multiple smaller links. For example, a single interface can connect to eight different switches, enabling the creation of a network fully connecting approximately 131,000 GPUs with just two tiers of switches. This dramatically reduces the number of switch tiers required compared to traditional 800 Gb/s networks, leading to lower latency and a smaller blast radius when components fail. The two-tier design reduces the number of optics and switches by two-thirds and three-fifths, respectively, compared to a three-tier network.
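    The ~131,000 figure falls out of standard two-tier Clos arithmetic if one assumes radix-512 switches (for instance, a 51.2 Tb/s ASIC broken out as 512 × 100 Gb/s ports); the radix is an assumption for illustration, not a number from the article:

    ```python
    # Two-tier (leaf/spine) Clos capacity with radix-R switches, half of
    # each leaf's ports facing GPUs and half facing spines.
    radix = 512                              # assumed switch radix

    gpus_per_leaf = radix // 2               # downlinks per leaf
    num_leaves = radix                       # each spine port feeds one leaf
    max_gpus = gpus_per_leaf * num_leaves    # R**2 / 2

    print(max_gpus)                          # 131072 ~ "approximately 131,000"
    ```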

    CURRENT IMPLEMENTATION AND PRODUCTION DEPLOYMENT
    MRC is already in production across OpenAI’s largest NVIDIA GB200 supercomputers, including those deployed on Oracle Cloud Infrastructure (OCI) in Abilene, Texas, and Microsoft’s Fairwater supercomputers in Atlanta and Wisconsin. The protocol is implemented across 400 and 800 Gb/s RDMA NICs from NVIDIA, AMD, and Broadcom, utilizing SRv6 switch support on NVIDIA Spectrum-4 and Spectrum-5 (running Cumulus and SONiC) and Broadcom Tomahawk 5 via Arista EOS. AMD’s NSCC congestion control algorithm is integrated, alongside IB/RDMA transport semantic layer extensions. MRC is actively used to train multiple OpenAI models, including those powering ChatGPT and Codex, leveraging hardware from NVIDIA and Broadcom.
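    The article does not describe NSCC's internals. As a generic stand-in for the kind of per-path feedback loop such congestion control builds on, here is a simple additive-increase/multiplicative-decrease rate controller; it is explicitly not NSCC:

    ```python
    class AIMDRateController:
        """Generic AIMD rate control (a stand-in, NOT AMD's NSCC).

        The per-path sending rate creeps up while the path stays clean and
        is cut multiplicatively when congestion feedback (e.g., ECN marks)
        arrives.
        """

        def __init__(self, line_rate_gbps=400.0):
            self.line_rate = line_rate_gbps
            self.rate = line_rate_gbps / 2     # start conservatively

        def on_clean_rtt(self):                # no congestion this round trip
            self.rate = min(self.line_rate, self.rate + 1.0)

        def on_congestion_signal(self):        # e.g., ECN-marked packets seen
            self.rate = max(1.0, self.rate / 2)
    ```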

    FUTURE DIRECTIONS AND COMMUNITY ENGAGEMENT
    The successful deployment of MRC highlights OpenAI's commitment to pushing the boundaries of AI infrastructure. Continued research and development, together with community engagement around the openly published Open Compute Project specification, will be crucial for the ongoing evolution and adoption of this transformative networking protocol.