🤯 AI Crushes OCR: New Model Wins! 🏆


Summary

Researchers from Zhipu AI and Tsinghua University have introduced GLM-OCR, a compact multimodal model designed for document understanding. The system combines a visual encoder, a lightweight connector, and a language decoder, utilizing Multi-Token Prediction and a two-stage pipeline. Training involved four stages, incorporating reinforcement learning with specific reward designs. Initial evaluations on public benchmarks, including OmniDocBench v1.5, OCRBench (Text), UniMERNet, PubTabNet, and TEDS_TEST, demonstrated strong performance. GLM-OCR achieved the highest reported scores among non-reference models. Throughput testing indicated 0.67 images and 1.86 PDF pages per second. The MaaS API operates at a rate of 0.2 RMB per million tokens. These results highlight a significant advancement in document understanding technology, prioritizing both accuracy and efficiency.

INSIGHTS


GLM-OCR: A Compact Solution for Document Understanding
The research team at Zhipu AI and Tsinghua University has introduced GLM-OCR, a 0.9B-parameter multimodal model designed to address the persistent challenges of Optical Character Recognition (OCR). This approach prioritizes a balance between recognition quality and practical deployment constraints, moving beyond the limitations of larger, more computationally intensive systems.

The Core Innovation: Multi-Token Prediction (MTP)
Traditional autoregressive decoding, where OCR systems predict one token at a time, is ill-suited for the deterministic, locally structured outputs common in OCR tasks. GLM-OCR overcomes this by employing Multi-Token Prediction (MTP), enabling the model to predict multiple tokens per step. This strategy results in a 50% throughput improvement, significantly enhancing processing speed while maintaining accuracy.
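The throughput gain from MTP comes down to simple arithmetic: emitting more than one token per decoder pass shrinks the number of passes needed per page. The sketch below illustrates that arithmetic only; the actual number of tokens GLM-OCR predicts per step and its acceptance mechanism are not detailed in the article, so `tokens_per_step` here is a hypothetical parameter.

```python
def decode_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Number of forward passes needed to emit num_tokens,
    emitting tokens_per_step tokens on each pass (ceiling division)."""
    return -(-num_tokens // tokens_per_step)

# Illustrative comparison for a 600-token page transcription:
baseline = decode_steps(600, 1)   # standard autoregressive: 600 passes
mtp = decode_steps(600, 2)        # two tokens per pass: 300 passes
speedup = baseline / mtp          # 2.0x fewer decoder passes
```

In practice the realized speedup is lower than the raw reduction in passes (verification and memory overheads eat into it), which is consistent with the reported 50% throughput improvement.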

A Two-Stage Pipeline for Enhanced Efficiency
GLM-OCR’s architecture utilizes a two-stage pipeline to optimize document understanding. Initially, PP-DocLayout-V3 performs layout analysis, identifying and delineating structured regions within the page. Subsequently, parallel region-level recognition is applied to these defined areas, avoiding the inefficiencies of processing an entire page as a single, monolithic entity. This approach improves both speed and robustness, particularly in documents with complex layouts.
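The two-stage flow can be sketched as a short driver function: stage one detects layout regions, stage two recognizes each region independently so the work parallelizes. The function names `detect_layout` and `recognize_region` are stand-ins for the real models (PP-DocLayout-V3 and the GLM-OCR decoder), not actual APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page(page_image, detect_layout, recognize_region):
    """Two-stage sketch: (1) layout analysis yields structured regions,
    (2) each region is recognized independently and in parallel."""
    regions = detect_layout(page_image)   # stage 1: layout analysis
    with ThreadPoolExecutor() as pool:
        texts = list(pool.map(recognize_region, regions))  # stage 2: parallel recognition
    return list(zip(regions, texts))

# Toy stand-ins to illustrate the flow:
regions_of = lambda img: ["title", "table", "footnote"]
ocr = lambda region: region.upper()
result = parse_page("page.png", regions_of, ocr)
```

Because each region is a small, self-contained recognition task, a slow or malformed region degrades only its own output rather than the whole page, which is where the robustness gain on complex layouts comes from.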

Structured Generation: Parsing and Key Information Extraction
The system’s flexibility is further enhanced through a distinct approach to two key document tasks. For document parsing, GLM-OCR leverages layout detection and region processing to generate structured outputs, such as Markdown and JSON. For Key Information Extraction (KIE), by contrast, the model directly generates JSON containing the extracted fields from the full document image, streamlining the extraction process.

A Multi-Stage Training Recipe for Optimal Performance
The development of GLM-OCR involved a carefully designed, four-stage training recipe. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks, including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO.

Reward Design and Optimization
The reinforcement learning component uses task-specific reward design: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE. These task rewards are paired with structural penalties for repetition and malformed output, plus JSON validation constraints.
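Two of these rewards are simple enough to sketch end to end: a text reward based on normalized edit distance, and a KIE reward that combines field-level F1 with a hard JSON-validity constraint. This is a minimal reconstruction from the reward descriptions above, not GLM-OCR's actual reward code; exact normalization and penalty weights are assumptions.

```python
import json

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def text_reward(pred: str, ref: str) -> float:
    """1 minus normalized edit distance, clipped to [0, 1]."""
    if not ref and not pred:
        return 1.0
    dist = levenshtein(pred, ref)
    return max(0.0, 1.0 - dist / max(len(ref), len(pred), 1))

def kie_reward(pred_json: str, ref: dict) -> float:
    """Field-level F1, with invalid JSON scored 0 (the validity constraint)."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0
    hits = sum(1 for k, v in pred.items() if ref.get(k) == v)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring malformed JSON as zero gives the policy a strong gradient toward always emitting parseable structure, which is the point of pairing task rewards with structural constraints.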

Benchmark Performance and Competitive Positioning
On public benchmarks, GLM-OCR demonstrates strong results across multiple document tasks. It achieves 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE.

Operational Considerations and Cost Modeling
The research team highlights GLM-OCR's support for vLLM, SGLang, and Ollama, and its ability to be fine-tuned through LLaMA-Factory. Performance metrics include a throughput of 0.67 images/second and 1.86 PDF pages/second under their evaluation setup. Furthermore, a MaaS API is offered at a price of 0.2 RMB per million tokens, providing cost estimates for scanned images and simple-layout PDFs.
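At 0.2 RMB per million tokens, estimating a batch's cost is a one-line calculation once you know the token consumption per page. The per-page token figure below is a hypothetical sizing assumption for illustration, not a number reported in the article.

```python
def maas_cost_rmb(total_tokens: int, rate_per_million: float = 0.2) -> float:
    """Cost of a batch at the listed MaaS rate of 0.2 RMB per million tokens."""
    return total_tokens / 1_000_000 * rate_per_million

# Hypothetical sizing: if one scanned page consumed ~2,000 tokens
# (an assumption, not an article figure), 10,000 pages would cost:
cost = maas_cost_rmb(10_000 * 2_000)   # 4.0 RMB
```

Simple-layout PDFs would typically consume fewer tokens per page than dense scans, so real costs depend heavily on document mix.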

This article is AI-synthesized from public sources and may not reflect original reporting.