🤯 AI Breakthrough: Phi-4 - Reason & Vision! 🚀



Summary

Microsoft has released Phi-4-reasoning-vision-15B, a model designed for image and text tasks demanding both perception and selective reasoning. Built on the Phi-4-Reasoning language backbone and the SigLIP-2 vision encoder with a mid-fusion architecture, the model was trained on 200 billion multimodal tokens. Researchers leveraged a dynamic-resolution vision encoder capable of processing up to 3,600 visual tokens. Training followed a hybrid approach, with approximately 20% of the data consisting of reasoning samples. Performance benchmarks, generated using Eureka ML Insights and VLMEvalKit, yielded scores of 84.8 on AI2D (test), 83.3 on ChartQA (test), and 76.0 on OCRBench, among others. The model's design prioritizes accurate perception, enabling efficient direct responses in situations where extended reasoning does not improve outcomes.

INSIGHTS


PHI-4-REASONING-VISION-15B: A NEW MULTIMODAL MODEL
Phi-4-reasoning-vision-15B represents a significant advancement in open-weight multimodal reasoning models. Developed by Microsoft, this 15-billion-parameter model is specifically engineered for tasks demanding both visual perception and selective reasoning, with notable strengths in scientific and mathematical reasoning as well as user-interface understanding. The design balances reasoning quality, computational efficiency, and the resources required for training. This compact architecture counters a growing trend in vision-language models, which have historically increased in size and complexity, driving up latency and deployment costs. The core innovation lies in its hybrid approach: the Phi-4-Reasoning language backbone is combined with a SigLIP-2 vision encoder through a mid-fusion architecture. The vision encoder converts images into tokens, which are then projected into the language model's embedding space for processing. The model was trained on a substantial dataset of 200 billion multimodal tokens, building on the Phi-4-Reasoning model (16 billion tokens) and the Phi-4 base model (400 billion unique tokens), demonstrating a scalable approach to multimodal learning.
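To make the fusion step concrete, here is a minimal sketch of the general connector pattern the article describes: vision-encoder features are projected into the language model's embedding space before being interleaved with text embeddings. The module names, dimensions, and two-layer MLP projector are illustrative assumptions, not details confirmed for Phi-4-reasoning-vision-15B.

```python
# Minimal sketch of a mid-fusion connector: vision features projected into
# the language model's embedding space. All dimensions are illustrative.
import torch
import torch.nn as nn

class MidFusionConnector(nn.Module):
    def __init__(self, vision_dim: int = 1152, lm_dim: int = 5120):
        super().__init__()
        # Two-layer MLP projector, a common choice for VLM connectors.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_visual_tokens, vision_dim), e.g. patch
        # features from a SigLIP-style encoder.
        return self.proj(visual_tokens)  # -> (batch, n_visual_tokens, lm_dim)

connector = MidFusionConnector()
vision_feats = torch.randn(1, 3600, 1152)  # up to 3,600 visual tokens
lm_ready = connector(vision_feats)         # (1, 3600, 5120)
```

The projected tokens would then be spliced into the text embedding sequence at the image-placeholder positions before the language model runs, which is what "projected into the embedding space" amounts to in practice.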

KEY DESIGN ELEMENTS AND TRAINING STRATEGIES
Several key design decisions contribute to the performance of Phi-4-reasoning-vision-15B. First, the model incorporates a dynamic-resolution vision encoder capable of handling up to 3,600 visual tokens, enabling the high-resolution understanding crucial for tasks like GUI grounding and fine-grained document analysis. The team emphasizes that accurate perception is a foundational requirement for robust reasoning. Second, the model employs a mixed reasoning and non-reasoning training strategy. Rather than forcing chain-of-thought reasoning across all tasks, the training data is strategically divided: reasoning samples, marked with explicit reasoning-trace tags, are interspersed with non-reasoning samples that begin with a direct-answer marker and target perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Approximately 20% of the training data consists of reasoning samples, so the model responds directly on tasks where extended reasoning doesn't improve accuracy, while still engaging structured reasoning for tasks like math and science. Furthermore, users can override the default behavior by prompting explicitly with the reasoning-on or reasoning-off control tokens, providing flexibility and control.
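The data mix can be illustrated with a short sketch. The source does not preserve the actual control-tag strings (they were lost in extraction), so the <think>-style markers, helper names, and sampling logic below are placeholders for the general scheme: roughly 20% traced reasoning samples interleaved with direct-answer perception samples.

```python
# Hypothetical sketch of the ~20/80 reasoning / perception training mix.
# Tag strings and helper names are illustrative, not Microsoft's format.
import random

REASONING_FRACTION = 0.20  # ~20% of samples carry explicit reasoning traces

def format_sample(question: str, answer: str, trace: str | None = None) -> str:
    if trace is not None:
        # Reasoning sample: a structured trace precedes the final answer.
        return f"{question}\n<think>{trace}</think>\n{answer}"
    # Perception-focused sample (captioning, grounding, OCR, simple VQA):
    # the model learns to answer directly, with no trace.
    return f"{question}\n{answer}"

def draw_batch(reasoning_pool, perception_pool, batch_size, rng=random):
    """Interleave reasoning and non-reasoning samples at the target ratio."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < REASONING_FRACTION:
            q, a, t = rng.choice(reasoning_pool)
            batch.append(format_sample(q, a, t))
        else:
            q, a = rng.choice(perception_pool)
            batch.append(format_sample(q, a))
    return batch
```

At inference time, the same kind of markers would support the override behavior described above: prepending a reasoning-on tag forces a trace, while a direct-answer tag suppresses one (again, the exact tokens are not specified in the source).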

PERFORMANCE AND APPLICATION AREAS
Phi-4-reasoning-vision-15B has demonstrated strong performance across a range of benchmarks. The Microsoft team reports scores of 84.8 on AI2D (test), 83.3 on ChartQA (test), 44.9 on MathVerse (mini), 36.2 on MathVision (mini), 75.2 on MathVista (mini), 54.3 on MMMU (val), 64.5 on MMStar, 76.0 on OCRBench, and 88.2 on ScreenSpot-v2. Importantly, these results were generated using Eureka ML Insights and VLMEvalKit with fixed evaluation settings, and are presented as comparison results rather than definitive leaderboard claims. The model's primary application areas are scientific and mathematical reasoning over visual inputs, including handwritten equations, diagrams, charts, tables, and quantitative documents, and computer-use agent tasks, where the model interprets screen content, localizes GUI elements, and supports interaction with desktop, web, or mobile interfaces. Detailed information, including the technical report, repository, and model weights, is available through the links in the original post.
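For readers who want to reproduce comparable numbers, the sketch below shows how benchmarks like these are typically driven through VLMEvalKit's run.py entry point. The dataset keys follow VLMEvalKit's naming conventions, but the model registry name used here is an assumption; consult the official repository for the identifier actually wired up for this model.

```python
# Hypothetical sketch: running a few of the reported benchmarks through
# VLMEvalKit. Execute from a checkout of the VLMEvalKit repository.
import subprocess

BENCHMARKS = ["AI2D_TEST", "ChartQA_TEST", "OCRBench", "MMStar"]
MODEL = "Phi-4-reasoning-vision-15B"  # assumed registry name, not confirmed

for data in BENCHMARKS:
    # One process per benchmark; results land in the configured work dir.
    subprocess.run(
        ["python", "run.py", "--data", data, "--model", MODEL],
        check=True,
    )
```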

This article is AI-synthesized from public sources and may not reflect original reporting.