🤯 Self-Command Machines: AI's Next Leap 🚀


🧠 Quick Intel

  • Researchers from Meta AI and KAUST introduced Neural Computers (NCs), a machine form where a neural network acts as the running computer.
  • NCCLIGen and NCGUIWorld prototypes were built on top of Wan2.1, a state-of-the-art video generation model.
  • CLIGen (General) contains 823,989 video streams (roughly 1,100 hours) sourced from public asciinema .cast recordings.
  • Training NCCLIGen on CLIGen (General) required approximately 15,000 H100 GPU hours.
  • Reconstruction quality on CLIGen (General) reached an average PSNR of 40.77 dB and SSIM of 0.989 at a 13px font size.
  • NCGUIWorld operates on Ubuntu 22.04 with XFCE4 at 15 FPS, using a dataset totaling roughly 1,510 hours.
  • Training NCGUIWorld used 64 GPUs for approximately 15 days per run, totaling roughly 23,000 GPU hours per full pass.

Summary

Researchers at Meta AI and KAUST have developed Neural Computers, a novel machine architecture where a neural network directly functions as the computer. The project produced two prototypes, NCCLIGen and NCGUIWorld, utilizing a video generation model named Wan2.1. NCCLIGen, trained on approximately 1,100 hours of asciinema recordings, demonstrated text-and-image-to-video interaction via a command-line interface. NCGUIWorld, operating at 1024x768 resolution, modeled full desktop interactions with a dataset totaling 1,510 hours. Internal conditioning during training yielded the most consistent results, achieving a structural similarity of 0.863. These initial demonstrations represent a significant step towards machines that can directly process and respond to visual and textual input.

INSIGHTS


THE CONCEPT OF NEURAL COMPUTERS: A REVOLUTIONARY APPROACH
The research presented by Meta AI and KAUST introduces Neural Computers (NCs), a novel machine architecture fundamentally different from traditional computing paradigms. Instead of functioning as a layer within a larger system, the NC itself acts as the running computer, offering a potentially transformative shift in how machines process information. This approach moves away from conventional methods like explicit programs, AI agents utilizing existing software stacks, or world models predicting environmental evolution.

DEFINING NEURAL COMPUTERS: FRAMEWORK AND PROTOTYPES
Neural Computers are formally defined by an update function F_θ and a decoder G_θ operating on a latent runtime state h_t. At each step the NC updates its internal state from the current observation x_t and user action u_t, effectively carrying the contents of an operating system stack (executable context, working memory, and interface state) within the model itself. The two video-based prototypes, NCCLIGen and NCGUIWorld, serve as initial proof-of-concept demonstrations of these runtime primitives in CLI and GUI settings respectively.
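The update/decode loop described above can be sketched in a few lines of Python. This is a toy stand-in, not the paper's implementation: the real F_θ and G_θ are neural networks operating on latent video state, while the `update` and `decode` functions below are hypothetical placeholders that just make the closed-loop data flow concrete.

```python
def update(state, observation, action):
    """Stand-in for F_θ: fold the latest observation and user action
    into the latent runtime state (here, just an event history)."""
    return state + [(observation, action)]

def decode(state):
    """Stand-in for G_θ: render the latent state to an interface frame."""
    if not state:
        return "<blank screen>"
    obs, act = state[-1]
    return f"frame showing {obs!r} after action {act!r}"

# Closed loop: the model itself, rather than an OS, carries executable
# context, working memory, and interface state between steps.
state = []
for obs, act in [("$ ls", "enter"), ("file.txt", "click")]:
    state = update(state, obs, act)
    frame = decode(state)
```

The point of the loop is that no external software stack is consulted between steps: every new frame is decoded purely from the state the model carries forward.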

TECHNICAL ARCHITECTURE AND TRAINING
The development of NCCLIGen and NCGUIWorld relied heavily on Wan2.1, a state-of-the-art video generation model, extended with conditioning and action modules tailored for NC functionality. The two models were trained separately, without shared parameters, and evaluated in open-loop mode using recorded prompts and action streams. NCCLIGen, focused on CLI generation, was trained on the CLIGen (General) dataset of 823,989 video streams, a run that required approximately 15,000 H100 GPU hours; a more controlled run on CLIGen (Clean) consumed roughly 7,000 H100 GPU hours.

NCGUIWORLD: FULL DESKTOP INTERACTION
NCGUIWorld, in contrast, targets full desktop interaction, modeling each session as a synchronized sequence of RGB frames and input events. It was trained on a dataset combining random slow and fast exploration (1,400 hours) with 110 hours of goal-directed trajectories collected using Claude CUA, using 64 GPUs for roughly 15 days per run, about 23,000 GPU hours per full pass. The team experimented with four action injection schemes (external, contextual, residual, and internal) to determine the most effective way to integrate action embeddings into the diffusion backbone.
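The four injection schemes are only named in the reporting, so the following is a speculative sketch of where an action embedding could enter a backbone under each scheme. `block` and `run_backbone` are hypothetical stand-ins operating on plain lists of floats, not the actual diffusion modules; the sketch only illustrates the point of injection, not the real computation.

```python
def block(hidden, action=None):
    """Toy backbone block: optionally mixes the action embedding into
    its own computation (the 'internal' scheme), then transforms."""
    if action is not None:
        hidden = [h + a for h, a in zip(hidden, action)]
    return [2 * h for h in hidden]

def run_backbone(state, action, scheme, depth=3):
    hidden = list(state)
    if scheme == "external":
        # Action conditions the input once, before any block runs.
        hidden = [h + a for h, a in zip(hidden, action)]
    elif scheme == "contextual":
        # Action embedding is appended to the token/context sequence.
        hidden = hidden + list(action)
    for _ in range(depth):
        if scheme == "residual":
            # Action is re-added to the hidden state between blocks.
            hidden = [h + a for h, a in zip(block(hidden), action)]
        elif scheme == "internal":
            # Action participates inside every block's computation.
            hidden = block(hidden, action)
        else:
            hidden = block(hidden)
    return hidden
```

Even in this toy form, the schemes diverge: the later and the more often the action enters the computation, the more strongly it shapes the final hidden state, which is one plausible reason internal conditioning gave the most consistent results.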

EVALUATION AND KEY FINDINGS
Evaluation of both models highlighted several key findings. NCCLIGen's character-level accuracy rose from 0.03 to 0.54 when detailed captions were used, illustrating the impact of caption specificity on text-to-pixel alignment. Training on the CLIGen (Clean) dataset plateaued around 25,000 steps, yielding no further gains. On symbolic computation, NCCLIGen achieved 4% accuracy on a held-out math problem set, compared with 71% for Sora-2 and 2% for Veo3.1; remarkably, re-prompting with the correct answer boosted accuracy to 83% without modifying the model. NCGUIWorld showed the best structural consistency with internal conditioning and demonstrated the crucial role of explicit cursor supervision, reaching 98.7% accuracy in cursor control.
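For readers unfamiliar with the character-level accuracy metric cited above, one simple way to compute such a score is position-wise agreement between reference and generated screen text. The paper's exact protocol (OCR, alignment, padding) is not described here, so the `char_accuracy` function below is an illustrative assumption, not the authors' evaluation code.

```python
def char_accuracy(reference: str, generated: str) -> float:
    """Fraction of character positions where the two strings agree,
    padding the shorter string with spaces so every position counts."""
    n = max(len(reference), len(generated))
    if n == 0:
        return 1.0
    ref = reference.ljust(n)
    gen = generated.ljust(n)
    return sum(r == g for r, g in zip(ref, gen)) / n
```

Under a metric like this, a score of 0.54 would mean roughly every other character on the generated screen matches the reference, which makes the jump from 0.03 concrete.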

DATA QUALITY AND SAMPLE EFFICIENCY
The research underscored the critical importance of data quality, demonstrating that curated, goal-directed data – specifically the 110-hour Claude CUA dataset – significantly outperformed passive random exploration across all metrics. This finding highlights the need for more efficient data collection strategies in the development of Neural Computers.

Our editorial team uses AI tools to aggregate and synthesize global reporting. Data is cross-referenced with public records as of April 2026.