🤯 AI Video Magic: Beauty in Motion? 🎬
May 16, 2026 | Author: ABR-INSIGHTS Tech Hub
🧠 Quick Intel
📝 Summary
NVIDIA’s SANA-WM demonstrates a new approach to video generation. The system, built upon the SANA-Video codebase and a 2.6B-parameter Diffusion Transformer, leverages frame-wise Gated DeltaNet blocks to produce minute-long, high-resolution video. Training utilized 64 H100 GPUs and processed 212,975 clips, achieving a Camera Motion Consistency of 0.2453. A second-stage refiner, initialized from a 17B LTX-2 model, minimized visual drift. The pipeline generated 961 latent frames at 720p, showcasing the potential of this technology.
💡 Insights
SANA-WM: SCALING WORLD MODEL GENERATION
SANA-WM represents a significant advancement in world models, addressing key bottlenecks in generating realistic, minute-long video sequences. The core innovation lies in its architecture, designed to enable high-resolution, 6-DoF camera-controlled synthesis without the need for massive computational resources. This system, built upon the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, leverages a 2.6B-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale 6-DoF camera control, offering three single-GPU inference variants for flexibility.
FRAME-WISE GDN AND KEY-SCALING FOR STABILITY
A critical element of SANA-WM’s design is the replacement of standard attention mechanisms with frame-wise Gated DeltaNets (GDNs). Unlike token-wise GDN implementations common in language models, SANA-WM processes an entire latent frame per recurrent step, maintaining a recurrent state of constant D×D size regardless of video length. This is achieved through a decay gate (γ) that down-weights stale past frames and a delta-rule correction that updates only the residual between the target value and the current state prediction. Furthermore, the research team introduced an algebraic key-scaling approach, scaling keys by 1/√(D·S), to keep spectral norms bounded and prevent the NaN divergence events seen with standard L2 key normalization. This key-scaling strategy keeps training stable even at longer sequence lengths.
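In symbols, the per-frame update being described is roughly the following, where S_t is the D×D state, k_t and v_t are the current frame's key and value summaries, γ_t is the decay gate, and β_t is a write strength (β_t and the exact gating form are our reading of the description, not notation taken from the paper):

S_t = γ_t · S_{t−1} + β_t · (v_t − S_{t−1} k_t) k_tᵀ,  with keys rescaled as k_t ← k_t / √(D·S)

Here S_{t−1} k_t is the state's current prediction for the frame, so only the unexplained residual is written back, and the 1/√(D·S) scaling keeps the spectral norm of S_t bounded as frames accumulate.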
REFINER AND DATA PIPELINE FOR HIGH-FIDELITY OUTPUT
To mitigate structural artifacts that can emerge in long-sequence synthetic videos, SANA-WM incorporates a second-stage refiner. This refiner, initialized from the 17B LTX-2 model with rank-384 LoRA adapters, is fine-tuned on paired synthetic and real video data. The refiner utilizes truncated-σ flow matching, perturbing stage-1 latents with a large starting noise (σ_start = 0.9) and learning to map this noisy input toward the high-fidelity target. Only three Euler denoising steps are needed at inference, significantly reducing the computational cost. The research team also developed a robust annotation pipeline, modifying VIPE to integrate Pi3X and MoGe-2 for accurate metric-scale pose annotations, extending bundle adjustment to treat focal lengths and principal points as per-frame variables. This pipeline utilizes a training corpus of 212,975 clips sourced from SpatialVID-HQ, DL3DV, OmniWorld, Sekai Game, Sekai Walking-HQ, and MiraData, generating a comprehensive dataset for training.
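To make the refinement step concrete, here is a minimal sketch of what truncated-σ flow matching with three Euler steps could look like at inference. The `refiner` callable, its `(x, sigma)` signature, and the linear interpolation used for the initial perturbation are assumptions for illustration, not the released LTX-2/LoRA interface:

```python
import torch

def refine_stage1_latents(stage1_latents, refiner, sigma_start=0.9, num_steps=3):
    """Illustrative truncated-sigma flow-matching refinement (3 Euler steps)."""
    # Perturb the stage-1 latents with a large starting noise level (sigma_start = 0.9).
    noise = torch.randn_like(stage1_latents)
    x = (1.0 - sigma_start) * stage1_latents + sigma_start * noise

    # Walk sigma from sigma_start down to 0 with plain Euler steps.
    sigmas = torch.linspace(sigma_start, 0.0, num_steps + 1)
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        velocity = refiner(x, sigma)             # assumed to predict a velocity toward the clean latent
        x = x + (sigma_next - sigma) * velocity  # Euler update; sigma decreases toward the target
    return x
```

Because only three Euler steps are taken, the refiner's cost is a small fraction of the stage-1 generation, which is what makes a 17B-parameter refiner practical in this pipeline.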
SANA-WM: A Novel Video Generation Architecture
SANA-WM addresses the memory limitations of standard video generation models with a multi-faceted architecture. The design integrates four key components: hybrid linear attention, dual-branch camera control, a two-stage refinement pipeline, and a robust data annotation pipeline, enabling efficient, high-quality video synthesis at scale.
Frame-Wise Gated DeltaNet (GDN) and Scaled Attention
The core innovation of SANA-WM lies in the Frame-Wise Gated DeltaNet (GDN). Unlike token-wise GDN, commonly used in large language models, the frame-wise variant processes an entire latent frame in each recurrent step. This mitigates the cumulative drift observed in SANA-Video, which used cumulative ReLU-based linear attention with a constant-size state. To stabilize training and prevent gradient instability, keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This scaling prevents NaN events, particularly during the initial training steps.
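A minimal single-head PyTorch sketch of this mechanism is shown below; the tensor layout, the scalar gates, and the query read-out are our simplifications for illustration, not the released NVlabs/Sana implementation:

```python
import torch

def frame_wise_gdn_step(state, K, V, gamma, beta):
    """One recurrent step over a whole latent frame (single head, illustrative).

    state: (D, D) recurrent memory; its size never grows with video length
    K, V:  (S, D) keys/values for all S spatial tokens of the current frame
    gamma: scalar decay gate in (0, 1) that down-weights stale past frames
    beta:  scalar write strength for the delta-rule correction
    """
    S, D = K.shape
    # Key scaling by 1/sqrt(D*S) keeps the state's spectral norm bounded
    # (replacing the L2 key normalization that caused NaN events).
    K = K / (D * S) ** 0.5

    prediction = K @ state           # (S, D): what the memory already predicts for this frame
    residual = V - prediction        # delta rule: write only the unexplained residual
    new_state = gamma * state + beta * (K.transpose(0, 1) @ residual)
    return new_state

# Usage: roll the state across latent frames, then read it out with queries.
D, S = 64, 256
state = torch.zeros(D, D)
for _ in range(10):                  # e.g. 10 latent frames
    K, V = torch.randn(S, D), torch.randn(S, D)
    state = frame_wise_gdn_step(state, K, V, gamma=0.95, beta=1.0)
Q = torch.randn(S, D)
out = Q @ state                      # (S, D) per-token read-out
```

The key point is that the loop can run for arbitrarily many frames while the memory stays a fixed D×D matrix, which is what makes minute-long generation tractable on a single GPU.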
Camera-Controlled World Modeling for Precise Trajectory Tracking
SANA-WM’s camera-controlled world modeling demands faithful adherence to a continuous 6-DoF trajectory, going beyond simple text-based motion descriptions. This is achieved through two complementary branches operating at different temporal rates: one computes a ray-local camera basis from the camera-to-world pose and intrinsics, while the other captures the global 6-DoF trajectory structure. Unified Camera Positional Encoding (UCPE) is applied to the geometric channels of each attention head, further helping the model maintain accurate spatial relationships. A crucial element is handling the temporal-compression mismatch: each latent token summarizes 8 raw frames, each with a distinct camera pose.
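For reference, the kind of per-pixel ray map that a ray-local camera branch typically consumes can be derived from the camera-to-world pose and pinhole intrinsics as sketched below. This is a generic construction, not SANA-WM's exact formulation; UCPE and the aggregation of multiple poses per latent token are not shown:

```python
import torch

def camera_rays_world(c2w, fx, fy, cx, cy, H, W):
    """Per-pixel ray origins and directions in world space (generic sketch).

    c2w: (4, 4) camera-to-world matrix; fx, fy, cx, cy: pinhole intrinsics.
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    # Back-project pixel centers to camera-space ray directions.
    dirs_cam = torch.stack([(xs + 0.5 - cx) / fx,
                            (ys + 0.5 - cy) / fy,
                            torch.ones_like(xs)], dim=-1)          # (H, W, 3)
    # Rotate into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].transpose(0, 1)
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = c2w[:3, 3].expand(H, W, 3)                           # camera center per pixel
    return origins, dirs_world
```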
Two-Stage Refinement Pipeline and Metric-Scale Annotations
The initial output from Stage 1 of SANA-WM can contain structural artifacts, particularly over longer sequences. A dedicated second-stage refiner corrects these artifacts. Refinement quality is measured with the ΔIQ (imaging-quality) metric, which compares imaging-quality scores in the first and last 10 seconds of a video; lower ΔIQ values indicate less degradation over the minute. Training camera-controlled generation requires metric-scale 6-DoF pose annotations, a level of detail not typically found in standard video datasets. To obtain them, the team modified the VIPE pose annotation engine, fusing Pi3X (long-sequence-consistent 3D structure) with MoGe-2 (accurate per-frame metric scale). This involves computing a per-frame scale factor that minimizes a weighted depth error, smoothed with an exponential moving average (momentum 0.99). The system also extends bundle adjustment to treat focal lengths and principal points as per-frame variables, enabling robust annotation of internet video with varying focal lengths.
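Reading that description literally, the per-frame scale alignment plus EMA smoothing might look something like the sketch below. The closed-form weighted least-squares scale and the confidence weighting are our guess at "minimizing weighted depth error", not the paper's exact formulation:

```python
import torch

def ema_metric_scales(pi3x_depths, moge2_depths, weights, momentum=0.99):
    """Sketch of per-frame metric-scale alignment with EMA smoothing.

    pi3x_depths, moge2_depths: lists of (H, W) depth maps (relative vs. metric)
    weights: list of (H, W) confidence maps weighting the depth error
    Returns one smoothed scale factor per frame.
    """
    scales, ema = [], None
    for d_rel, d_metric, w in zip(pi3x_depths, moge2_depths, weights):
        # Weighted least-squares scale minimizing ||w * (s * d_rel - d_metric)||^2.
        s = (w * d_rel * d_metric).sum() / (w * d_rel * d_rel).sum().clamp_min(1e-8)
        # Exponential moving average (momentum 0.99) smooths scales across frames.
        ema = s if ema is None else momentum * ema + (1.0 - momentum) * s
        scales.append(ema)
    return torch.stack(scales)
```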
Training Infrastructure and Model Variants
The training process uses 64 H100 GPUs over roughly 15 days. An initial VAE pre-adaptation phase takes about 3.5 days for 50,000 steps, after which the main DiT (Diffusion Transformer) training proceeds in four progressive stages, evaluated on 80 diverse scenes covering various camera trajectories (Simple/Hard splits). Memory consumption tops out at 74.7 GB, staying within the 80 GB H100 budget.
Open-Source Availability and Key Variants
SANA-WM is an open-source project, available through the NVlabs/Sana GitHub repository under an Apache 2.0 license for the code. The repository also hosts SANA, SANA-1.5, SANA-Sprint, and SANA-Video. SANA-WM itself ships in three single-GPU inference variants: bidirectional high-quality offline synthesis (49.2 GB), chunk-causal autoregressive (AR) generation for sequential streaming (51.1 GB), and a distilled AR variant with NVFP4 for faster inference (34 s per 60 s clip on an RTX 5090).
Limitations and Future Directions
Despite its advancements, SANA-WM has certain limitations. Notably, it lacks explicit 3D scene memory, potentially leading to drift in dynamic scenes or rare viewpoints. Further research will focus on incorporating mechanisms for dynamic scene understanding and memory management to address these challenges.