🤯 AI Breakthrough: Nested Models Unlocked! 🚀

May 10, 2026 | AI


🧠 Quick Intel


  • NVIDIA researchers propose Star Elastic, a post-training method for embedding multiple nested submodels within a single reasoning model.
  • Applied to Nemotron Nano v3 (30B parameter hybrid Mamba–Transformer–MoE model), Star Elastic produced 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens.
  • Star Elastic uses importance estimation and Router-Weighted Expert Activation Pruning (REAP) for MoE layers, with a differentiable router that determines each nested submodel's architecture.
  • The training process employs a two-stage curriculum: a short-context phase (8,192 tokens) and an extended-context phase (49,152 tokens) with uniform and non-uniform sampling, respectively.
  • Gumbel-Softmax facilitates gradient flow through discrete architectural decisions during joint training with knowledge distillation and a router loss.
  • Ablations on Nano v2 demonstrated gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant in Stage 2.
  • The method allows all three variants (23B, 12B, and 30B) to reside in one checkpoint and be extracted without fine-tuning.
  • ๐Ÿ“Summary


    NVIDIA researchers have developed Star Elastic, a post-training method designed to embed multiple submodels within a single reasoning model. The research focused on optimizing compute costs associated with large language models like Nemotron Nano v3, a 30B parameter hybrid model. Utilizing a single training run, the team created 23B and 12B nested variants, trained with approximately 160B tokens, all residing within a single checkpoint. Employing importance estimation and a differentiable router, the process utilized a two-stage curriculum with uniform and non-uniform sampling. Ablation studies on Nano v2 demonstrated improvements, particularly in the 6B and 12B variants, suggesting a promising approach to efficient LLM scaling.

    💡 Insights



    NESTING LARGE LANGUAGE MODELS: A NEW APPROACH
    Star Elastic represents a paradigm shift in training and deploying large language models (LLMs), moving away from the traditional approach of training separate models for each parameter size. This innovative method utilizes a single training run to embed multiple nested submodels within a larger parent model, offering significant efficiency gains.

    THE CORE PRINCIPLE: ELASTIC WEIGHT-SHARING
    The fundamental concept behind Star Elastic is elastic or nested architectures, where smaller submodels are contained within a larger parent model. This approach leverages importance estimation to identify the most crucial components of the larger model, allowing smaller-budget submodels to reuse these weights. This nested weight-sharing property dramatically reduces the computational burden of training and deploying multiple LLM variants.
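    The paper's exact scoring rule isn't reproduced here, but the nesting idea can be illustrated with a short PyTorch sketch: rank the output channels of a parent weight matrix by an assumed activation-times-weight-norm importance proxy, and let each budget reuse only the top-ranked slice of the same tensor. The channel counts and scoring formula below are illustrative assumptions, not the authors' criteria.

```python
import torch

def channel_importance(weight: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
    """Illustrative importance proxy: mean |activation| per output channel,
    scaled by the L2 norm of the corresponding weight row (assumed, not the paper's rule)."""
    act_score = activations.abs().mean(dim=0)   # (d_out,)
    weight_score = weight.norm(dim=1)            # (d_out,)
    return act_score * weight_score

def nested_slices(weight: torch.Tensor, importance: torch.Tensor,
                  budget_channels: dict[str, int]) -> dict[str, torch.Tensor]:
    """Each budget reuses the top-k most important rows of the SAME parent weight,
    so smaller submodels are strict subsets of the larger ones."""
    order = importance.argsort(descending=True)
    return {name: weight[order[:k]] for name, k in budget_channels.items()}

# Toy usage: one 4096-wide projection shared by three nested budgets.
w = torch.randn(4096, 4096)
acts = torch.randn(1024, 4096)                   # hypothetical calibration activations
imp = channel_importance(w, acts)
subsets = nested_slices(w, imp, {"12B": 1536, "23B": 3072, "30B": 4096})
```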

    IMPLEMENTATION WITH NEMOTRON NANO V3
    The research team applied Star Elastic to Nemotron Nano v3, a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters. Through training with approximately 160B tokens, they produced 23B and 12B nested variants, each operating at a significantly reduced parameter budget. These variants can be extracted from a single checkpoint without any additional fine-tuning, streamlining deployment.
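    Because all variants live in one checkpoint, extraction amounts to applying stored per-budget masks to the parent weights. The sketch below shows roughly what that could look like; the checkpoint layout, key names, and row-only slicing are hypothetical simplifications.

```python
import torch

def extract_submodel(checkpoint: dict, budget: str) -> dict:
    """Build a standalone state dict for one nested budget by slicing the parent
    weights with the masks saved for that budget. No fine-tuning is needed
    because the selected weights were trained in place."""
    parent = checkpoint["parent_state_dict"]        # hypothetical layout
    masks = checkpoint["budget_masks"][budget]      # hypothetical layout
    sub = {}
    for name, weight in parent.items():
        mask = masks.get(name)
        if mask is None:                            # layer kept whole at this budget
            sub[name] = weight.clone()
        else:                                       # keep only the selected channels
            sub[name] = weight[mask.bool()].clone()
    return sub

# e.g. twelve_b = extract_submodel(torch.load("nano_v3_elastic.pt"), "12B")
```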

    ROUTER-WEIGHTED EXPERT ACTIVATION PRUNING (REAP)
    A key component of Star Elastic is Router-Weighted Expert Activation Pruning (REAP), particularly for MoE layers. This technique ranks experts based on both routing gate values and expert output magnitudes, providing a more principled signal than naive frequency-based pruning. The router dynamically determines which experts are active at each budget level, further optimizing model performance.
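    A minimal sketch of a REAP-style saliency score, assuming each expert is scored by its routing gate value times the magnitude of its output, averaged over tokens (the exact weighting in the paper may differ):

```python
import torch

def reap_scores(gate_probs: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """gate_probs:     (tokens, n_experts)           router probabilities
    expert_outputs: (tokens, n_experts, d_model)  per-expert outputs
    Returns one saliency score per expert: routing weight times output norm,
    averaged over tokens. The lowest-scoring experts are pruned first."""
    out_norm = expert_outputs.norm(dim=-1)        # (tokens, n_experts)
    return (gate_probs * out_norm).mean(dim=0)    # (n_experts,)

def experts_to_keep(scores: torch.Tensor, n_keep: int) -> torch.Tensor:
    return scores.topk(n_keep).indices

# Toy usage: keep the 6 highest-scoring of 16 experts for a smaller budget.
g = torch.rand(2048, 16).softmax(dim=-1)
o = torch.randn(2048, 16, 512)
keep = experts_to_keep(reap_scores(g, o), n_keep=6)
```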

    END-TO-END TRAINABLE ROUTER AND GUMBEL-SOFTMAX
    The router itself is end-to-end trainable, taking a target budget as a one-hot input and outputting differentiable masks that select active components. These masks are trained jointly with the model using Gumbel-Softmax, enabling gradient flow through discrete architectural decisions. This approach allows the router to learn optimal architecture choices that improve accuracy under knowledge distillation.
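    How such a router stays differentiable is easiest to see in code. The sketch below uses an assumed interface: the router maps a one-hot budget vector to logits over a few discrete width choices per layer, and `F.gumbel_softmax(..., hard=True)` yields one-hot masks in the forward pass while gradients flow through the soft relaxation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Maps a one-hot budget vector to a discrete width choice per layer,
    keeping the choice differentiable via Gumbel-Softmax."""
    def __init__(self, n_budgets: int, n_layers: int, n_choices: int):
        super().__init__()
        self.logits = nn.Linear(n_budgets, n_layers * n_choices)
        self.n_layers, self.n_choices = n_layers, n_choices

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.logits(budget_onehot).view(self.n_layers, self.n_choices)
        # hard=True: one-hot masks in the forward pass, straight-through
        # gradients through the soft sample in the backward pass.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)

# Toy usage: 3 budgets (12B/23B/30B), 48 layers, 4 width options per layer.
router = BudgetRouter(n_budgets=3, n_layers=48, n_choices=4)
masks = router(torch.tensor([0.0, 1.0, 0.0]))   # select the middle budget
```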

    A TWO-STAGE CURRICULUM FOR OPTIMAL TRAINING
    The training process employs a two-stage curriculum to maximize performance. Initially, a short-context phase (8,192 tokens) with uniform budget sampling is used, followed by an extended-context phase (49,152 tokens) with non-uniform sampling that prioritizes the full 30B model. This staged approach ensures the model develops robust reasoning capabilities.
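    One plausible reading of that schedule, sketched with assumed sampling weights (the source only states that Stage 1 is uniform and Stage 2 prioritizes the full 30B model):

```python
import random

STAGES = {
    # stage: (max sequence length, sampling weights over the three nested budgets)
    "stage1_short_context": (8_192,  {"12B": 1 / 3, "23B": 1 / 3, "30B": 1 / 3}),
    "stage2_long_context":  (49_152, {"12B": 0.2, "23B": 0.2, "30B": 0.6}),  # weights assumed
}

def sample_step(stage: str) -> tuple[int, str]:
    """Pick the context length and the nested budget to train at this step."""
    seq_len, weights = STAGES[stage]
    budget = random.choices(list(weights), weights=list(weights.values()))[0]
    return seq_len, budget

seq_len, budget = sample_step("stage2_long_context")
```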

    ABLATION STUDIES AND PERFORMANCE GAINS
    Ablation studies on Nano v2, which serve as the empirical basis for the Nano v3 work, demonstrated significant gains: up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from Stage 2 alone. These results highlight the effectiveness of the Star Elastic approach.

    OPTIMAL CONFIGURATION: MS → ML
    The researchers identified an optimal configuration, called MS → ML (small model for thinking, large model for answering), which allocates a cheaper model to generate extended reasoning traces and reserves the full-capacity model for synthesizing the final answer. This configuration achieved up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control.
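    The split is easy to express as a two-call pipeline. The wrappers and `generate` calls below are placeholders for whatever inference API serves the elastic checkpoint; the point is that the cheap submodel produces the long chain of thought and the full model only reads it once to write the final answer.

```python
def ms_to_ml_answer(prompt: str, small_model, large_model,
                    max_think_tokens: int = 8192) -> str:
    """MS -> ML: think with the small nested submodel, answer with the full model.
    `small_model` / `large_model` are hypothetical wrappers around two budgets
    of the same elastic checkpoint."""
    # 1. The cheap submodel generates the (long) reasoning trace.
    thinking = small_model.generate(prompt, max_new_tokens=max_think_tokens)
    # 2. The full-capacity model conditions on the trace and writes the answer.
    answer_prompt = f"{prompt}\n<think>\n{thinking}\n</think>\n"
    return large_model.generate(answer_prompt, max_new_tokens=1024)
```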

    QUANTIZATION-AWARE DISTILLATION (QAD) FOR EFFICIENT DEPLOYMENT
    To facilitate efficient deployment, the team utilized Quantization-Aware Distillation (QAD) directly on the elastic checkpoint, preserving the nested mask hierarchy throughout. This allowed for the use of FP8 (E4M3 format) with 98.69% of BF16 accuracy on the 30B variant, and NVFP4 (NVIDIA's 4-bit floating-point format) with recovery to 97.79% after a short nested QAD phase.
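    Schematically, quantization-aware distillation keeps a fake-quantize step in the student's forward pass while distilling against the unquantized teacher. The sketch below uses a generic symmetric integer fake-quant with a straight-through estimator purely as a stand-in for the FP8/NVFP4 formats actually used.

```python
import torch
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulate low-precision weights with a straight-through estimator:
    the forward pass sees quantized values, the backward pass sees full precision.
    (Simplified stand-in; not the FP8 E4M3 / NVFP4 formats themselves.)"""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the fake-quantized student and the BF16 teacher."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
```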

    MEMORY EFFICIENCY THROUGH ELASTIC CHECKPOINTS
    The use of a single elastic checkpoint significantly reduces storage requirements compared to storing separate 12B, 23B, and 30B BF16 checkpoints, saving approximately 58.9 GB of memory. This enables the 12B NVFP4 variant to run on an RTX 5080, where every BF16 configuration runs out of memory.

    WIDTH COMPRESSION VS. DEPTH COMPRESSION
    The research explored two compression strategies: depth compression (removing layers) versus width compression (reducing internal dimensions). Width compression, with a 15% parameter reduction and 25B tokens of knowledge distillation, recovered 98.1% of baseline performance while depth compression recovered only 95.2%, highlighting the advantages of width-based elasticity.

    THE SIGNIFICANCE OF CONTEXT LENGTH
    The extended-context phase (49,152 tokens) is critical for reasoning performance, enabling the model to develop more sophisticated understanding of complex tasks. This longer context window significantly improves the model's ability to generate accurate and nuanced responses.