🤯 AI Crushes OCR: New Model Wins! 🏆


Summary

Researchers from Zhipu AI and Tsinghua University have introduced GLM-OCR, a compact multimodal model designed for document understanding. The system combines a visual encoder, a lightweight connector, and a language decoder, utilizing Multi-Token Prediction and a two-stage pipeline. Training involved four stages, incorporating reinforcement learning with specific reward designs. Initial evaluations on public benchmarks, including OmniDocBench v1.5, OCRBench (Text), UniMERNet, PubTabNet, and TEDS_TEST, demonstrated strong performance. GLM-OCR achieved the highest reported scores among non-reference models. Throughput testing indicated 0.67 images and 1.86 PDF pages per second. The MaaS API operates at a rate of 0.2 RMB per million tokens. These results highlight a significant advancement in document understanding technology, prioritizing both accuracy and efficiency.

INSIGHTS


GLM-OCR: A Compact Solution for Document Understanding
The research team at Zhipu AI and Tsinghua University has introduced GLM-OCR, a 0.9B-parameter multimodal model designed to address the persistent challenges of Optical Character Recognition (OCR). This approach prioritizes a balance between recognition quality and practical deployment constraints, moving beyond the limitations of larger, more computationally intensive systems.

The Core Innovation: Multi-Token Prediction (MTP)
Traditional autoregressive decoding, where OCR systems predict one token at a time, is ill-suited for the deterministic, locally structured outputs common in OCR tasks. GLM-OCR overcomes this by employing Multi-Token Prediction (MTP), enabling the model to predict multiple tokens per step. This strategy results in a 50% throughput improvement, significantly enhancing processing speed while maintaining accuracy.
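The throughput gain from MTP comes down to simple arithmetic: emitting more than one token per decoder pass shrinks the number of passes needed per page. The sketch below illustrates that arithmetic only; the actual number of tokens GLM-OCR predicts per step and its acceptance mechanism are not detailed in the article, so `tokens_per_step` here is a hypothetical parameter.

```python
def decode_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Number of forward passes needed to emit num_tokens,
    emitting tokens_per_step tokens on each pass (ceiling division)."""
    return -(-num_tokens // tokens_per_step)

# Illustrative comparison for a 600-token page transcription:
baseline = decode_steps(600, 1)   # standard autoregressive: 600 passes
mtp = decode_steps(600, 2)        # two tokens per pass: 300 passes
speedup = baseline / mtp          # 2.0x fewer decoder passes
```

In practice the realized speedup is lower than the raw reduction in passes (verification and memory overheads eat into it), which is consistent with the reported 50% throughput improvement.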

A Two-Stage Pipeline for Enhanced Efficiency
GLM-OCR’s architecture utilizes a two-stage pipeline to optimize document understanding. Initially, PP-DocLayout-V3 performs layout analysis, identifying and delineating structured regions within the page. Subsequently, parallel region-level recognition is applied to these defined areas, avoiding the inefficiencies of processing an entire page as a single, monolithic entity. This approach improves both speed and robustness, particularly in documents with complex layouts.
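The two-stage flow can be sketched as a short driver function: stage one detects layout regions, stage two recognizes each region independently so the work parallelizes. The function names `detect_layout` and `recognize_region` are stand-ins for the real models (PP-DocLayout-V3 and the GLM-OCR decoder), not actual APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page(page_image, detect_layout, recognize_region):
    """Two-stage sketch: (1) layout analysis yields structured regions,
    (2) each region is recognized independently and in parallel."""
    regions = detect_layout(page_image)   # stage 1: layout analysis
    with ThreadPoolExecutor() as pool:
        texts = list(pool.map(recognize_region, regions))  # stage 2: parallel recognition
    return list(zip(regions, texts))

# Toy stand-ins to illustrate the flow:
regions_of = lambda img: ["title", "table", "footnote"]
ocr = lambda region: region.upper()
result = parse_page("page.png", regions_of, ocr)
```

Because each region is a small, self-contained recognition task, a slow or malformed region degrades only its own output rather than the whole page, which is where the robustness gain on complex layouts comes from.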

Structured Generation: Parsing and Key Information Extraction
The system’s flexibility is further enhanced through a distinct approach to two key document tasks. For document parsing, GLM-OCR leverages layout detection and region processing to generate structured outputs, such as Markdown and JSON. For Key Information Extraction (KIE), by contrast, the model directly generates JSON containing the extracted fields from the full document image, streamlining the extraction process.

A Multi-Stage Training Recipe for Optimal Performance
The development of GLM-OCR involved a carefully designed, four-stage training recipe. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks, including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO.

Reward Design and Optimization
The reinforcement learning component uses task-specific reward design: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE. These task rewards are paired with structural penalties for repetition and malformed output, plus JSON validation constraints.
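Two of these rewards are simple enough to sketch end to end: a text reward based on normalized edit distance, and a KIE reward that combines field-level F1 with a hard JSON-validity constraint. This is a minimal reconstruction from the reward descriptions above, not GLM-OCR's actual reward code; exact normalization and penalty weights are assumptions.

```python
import json

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def text_reward(pred: str, ref: str) -> float:
    """1 minus normalized edit distance, clipped to [0, 1]."""
    if not ref and not pred:
        return 1.0
    dist = levenshtein(pred, ref)
    return max(0.0, 1.0 - dist / max(len(ref), len(pred), 1))

def kie_reward(pred_json: str, ref: dict) -> float:
    """Field-level F1, with invalid JSON scored 0 (the validity constraint)."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0
    hits = sum(1 for k, v in pred.items() if ref.get(k) == v)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring malformed JSON as zero gives the policy a strong gradient toward always emitting parseable structure, which is the point of pairing task rewards with structural constraints.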

Benchmark Performance and Competitive Positioning
On public benchmarks, GLM-OCR demonstrates strong results across multiple document tasks. It achieves 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE.

Operational Considerations and Cost Modeling
The research team highlights GLM-OCR's support for vLLM, SGLang, and Ollama, and its ability to be fine-tuned through LLaMA-Factory. Performance metrics include a throughput of 0.67 images/second and 1.86 PDF pages/second under their evaluation setup. Furthermore, a MaaS API is offered at a price of 0.2 RMB per million tokens, providing cost estimates for scanned images and simple-layout PDFs.
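At 0.2 RMB per million tokens, estimating a batch's cost is a one-line calculation once you know the token consumption per page. The per-page token figure below is a hypothetical sizing assumption for illustration, not a number reported in the article.

```python
def maas_cost_rmb(total_tokens: int, rate_per_million: float = 0.2) -> float:
    """Cost of a batch at the listed MaaS rate of 0.2 RMB per million tokens."""
    return total_tokens / 1_000_000 * rate_per_million

# Hypothetical sizing: if one scanned page consumed ~2,000 tokens
# (an assumption, not an article figure), 10,000 pages would cost:
cost = maas_cost_rmb(10_000 * 2_000)   # 4.0 RMB
```

Simple-layout PDFs would typically consume fewer tokens per page than dense scans, so real costs depend heavily on document mix.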

This article is AI-synthesized from public sources and may not reflect original reporting.