Deleting Objects from Video? 🤯 AI Magic Just Got REAL! ✨

Summary

Researchers from Netflix, INSAIT, and Sofia University have released VOID, a model that automatically removes objects from videos while accurately accounting for induced physical interactions, including secondary effects like shadows. VOID surpasses standard inpainting by reasoning about causality and collisions. Built on a CogVideoX 3D Transformer, the system processes input using a quadmask and a text prompt. Its core innovation is a two-pass architecture: an initial base inpainting pass followed by a second pass that corrects object morphing using flow-warped noise. This advanced approach, trained on synthetic counterfactual videos, marks a significant leap in video understanding and manipulation.

INSIGHTS


THE CHALLENGE OF CAUSAL VIDEO EDITING
Traditional video inpainting models function merely as sophisticated background painters, trained only to fill the pixel region where an object was removed, ignoring the physical context of the scene. This limitation means that while existing methods can correct superficial artifacts like shadows and reflections, they fail catastrophically when the removed object has significant physical interactions, such as collisions or support structures. For example, standard models cannot deduce that if a person holding a guitar is removed, the instrument must fall due to gravity, resulting in implausible and physically incorrect output.

THE VOID BREAKTHROUGH IN INTERACTION-AWARE DELETION
Researchers from Netflix and INSAIT at Sofia University introduced the VOID (Video Object and Interaction Deletion) model to solve this causality problem. VOID moves beyond simple pixel filling by reasoning about physical plausibility, automatically removing an object along with every interaction it induces on the scene. Its key insight is that removal is not just about filling space but about maintaining consistent scene dynamics. The model is designed to handle complex physical scenarios, such as simulating the natural fall of a prop when its support is removed, thereby producing realistic, physically grounded video edits.

ADVANCED MASKING AND ARCHITECTURAL FOUNDATIONS
The technical backbone of VOID is CogVideoX, a 3D Transformer-based video generation model with 5 billion parameters, roughly a temporal analogue of Stable Diffusion, fine-tuned here for video inpainting with interaction-aware conditioning. That conditioning arrives as a highly structured input called the quadmask, which goes well beyond a simple binary mask. Instead of merely marking what to remove, the quadmask is a 4-value semantic map encoding four distinct regions: the primary object being removed (0), overlap areas (63), interaction-affected regions that will move (127), and the background to keep (255).
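To make the quadmask concrete, here is a minimal sketch of how such a 4-value map could be assembled from two boolean masks. The function name and the rule that overlap pixels are those covered by both masks are illustrative assumptions; only the four pixel values (0, 63, 127, 255) and their meanings come from the article.

```python
import numpy as np

# Pixel values per the article's quadmask description.
QUAD_OBJECT, QUAD_OVERLAP, QUAD_INTERACT, QUAD_BACKGROUND = 0, 63, 127, 255

def build_quadmask(object_mask, interaction_mask):
    """Combine boolean masks into a single 4-value quadmask frame.

    object_mask      -- True where the object to remove sits
    interaction_mask -- True where the scene will move after removal
    """
    quad = np.full(object_mask.shape, QUAD_BACKGROUND, dtype=np.uint8)
    quad[interaction_mask] = QUAD_INTERACT           # regions that must move
    quad[object_mask] = QUAD_OBJECT                  # the object itself
    quad[object_mask & interaction_mask] = QUAD_OVERLAP  # covered by both
    return quad
```

A per-frame stack of such maps, paired with the text prompt, would then condition the inpainting Transformer.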

TRAINING ON SYNTHETIC PHYSICS SIMULATIONS
Because real-world paired data—videos of the exact same scene, one with and one without the object, where physics plays out correctly—is practically nonexistent, the VOID team generated its training data synthetically. They utilized two advanced sources: HUMOTO, which simulates human-object interactions using Blender and motion-capture data; and Kubric, developed by Google Research, which handles object-only collisions. By running a Blender re-simulation, the process ensures that when the human or object is removed, the resulting counterfactual video accurately reflects how physics would naturally govern the scene.

THE TWO-PASS INFERENCE AND STABILITY MECHANISMS
To ensure maximum temporal consistency and correct known failure modes, VOID uses a two-pass inference system. Pass 1 serves as the base inpainting model and is sufficient for most tasks. Pass 2 is a specialized refinement pass designed specifically to correct object morphing—the gradual warping or deformation of objects across frames, a common artifact in video diffusion. Pass 2 achieves this by using optical flow to warp the latent space from Pass 1, stabilizing the shape of synthesized objects frame-to-frame and anchoring them to their correct trajectories.
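The flow-warping idea behind Pass 2 can be sketched in a few lines. This is a simplified nearest-neighbour stand-in, not VOID's actual implementation: the function names, the blend coefficient, and the frame-by-frame propagation scheme are all assumptions; only the core idea of warping Pass 1 latents along optical flow to anchor object shapes comes from the article.

```python
import numpy as np

def warp_latent(latent, flow):
    """Backward-warp a latent frame by an optical-flow field.

    latent -- (H, W, C) array; flow -- (H, W, 2) of (dy, dx) offsets
    pointing from the current frame back to its source pixels.
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    H, W, _ = latent.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return latent[src_y, src_x]

def flow_warped_noise(pass1_latents, flows, blend=0.7, rng=None):
    """Build Pass-2 initial noise: each frame mixes fresh Gaussian noise
    with the flow-warped previous frame, so synthesized objects stay
    anchored to their trajectories instead of morphing frame-to-frame."""
    rng = np.random.default_rng(rng)
    out = [pass1_latents[0]]
    for t in range(1, len(pass1_latents)):
        warped = warp_latent(out[-1], flows[t - 1])
        fresh = rng.standard_normal(warped.shape)
        out.append(blend * warped + (1.0 - blend) * fresh)
    return np.stack(out)
```

With `blend` near 1 the noise is strongly correlated across frames (stable shapes); near 0 it degenerates to independent noise, i.e. ordinary single-pass sampling.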

PERFORMANCE AND IMPLEMENTATION SPECIFICATIONS
For optimal performance, VOID is designed to operate at a default resolution of 384×672, processing up to 197 frames using the DDIM scheduler and optimized with BF16 and FP8 quantization for memory efficiency. The full system requires the base model, CogVideoX-Fun-V1.5-5b-InP, from Alibaba PAI, which must be downloaded separately. VOID was rigorously evaluated against leading competitors, including ProPainter, DiffuEraser, Runway, and MiniMax-Remover, demonstrating superior ability to preserve consistent scene dynamics after complex object removal.
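The reported settings can be collected into a back-of-the-envelope sketch. The config keys and the helper function are illustrative, not VOID's actual API; the VAE compression factors (8x spatial, 4x temporal) and 2x2 patchification are assumptions based on the standard CogVideoX design, not stated in the article.

```python
# Hypothetical inference configuration mirroring the article's reported
# settings; key names are illustrative, not VOID's actual API.
VOID_INFERENCE_CONFIG = {
    "base_model": "CogVideoX-Fun-V1.5-5b-InP",  # from Alibaba PAI, downloaded separately
    "height": 384,
    "width": 672,
    "max_frames": 197,
    "scheduler": "DDIM",
    "weight_dtype": "bf16",   # BF16 compute
    "quantization": "fp8",    # FP8 weights for memory efficiency
    "num_passes": 2,          # base inpainting + morphing-correction refinement
}

def estimate_latent_tokens(cfg, patch=2, vae_spatial=8, vae_temporal=4):
    """Rough token count seen by the 3D Transformer, assuming a
    CogVideoX-style causal VAE (8x spatial, 4x temporal compression)
    and 2x2 spatial patchification."""
    h = cfg["height"] // vae_spatial // patch   # 384 -> 24
    w = cfg["width"] // vae_spatial // patch    # 672 -> 42
    t = cfg["max_frames"] // vae_temporal + 1   # 197 -> 50 latent frames
    return h * w * t
```

Under these assumptions a full-length clip yields on the order of fifty thousand tokens per denoising step, which is why the BF16/FP8 memory optimizations matter for a 5B-parameter model.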

This article is AI-synthesized from public sources and may not reflect original reporting.