🤯 IBM Granite 4.0: Speech AI Breakthrough! 🚀

IBM has released Granite 4.0 1B Speech, a compact speech-language model for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). The model targets enterprise and edge deployments, where memory footprint, latency, and compute efficiency matter most. It adds Japanese ASR and keyword list biasing and improves English transcription accuracy, results attributed to adaptation and multimodal training. The model supports English, French, German, Spanish, Portuguese, and Japanese, and ranks #1 on the OpenASR leaderboard with an average word error rate (WER) of 5.52. Its design is modular: speech recognition comes first, followed by separate language-level post-processing. The release supports Python inference and API-style serving through transformers>=4.52.1 and vLLM, and it can run in lower-resource environments subject to limits on model length and audio prompts.
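Since the leaderboard result is stated as a WER, a short illustration of how that metric is usually computed may help. The sketch below is a minimal, assumption-laden version: it splits on whitespace and counts word-level substitutions, deletions, and insertions with a standard edit-distance table. The function name `word_error_rate` is hypothetical, and the sketch skips the text normalization (casing, punctuation) that leaderboards typically apply before scoring, so it will not reproduce published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words, reported as a percentage.
print(f"{100 * word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2f}")  # prints 16.67
```

Read on this percentage scale, a WER of 5.52 corresponds to roughly five to six word errors per hundred reference words, averaged over the leaderboard's test sets.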
GRANITE 4.0 1B SPEECH: A NEW STANDARD IN MULTILINGUAL ASR AND AST
Granite 4.0 1B Speech is designed for enterprise and edge deployments where efficiency is paramount. IBM's core objective with this release was to reduce model size while keeping the capabilities expected of modern multilingual systems. The model has half the number of parameters of granite-speech-3.3-2b, yet it adds Japanese ASR, keyword list biasing, and improved English transcription accuracy. Inference is also faster, a gain attributed to improved encoder training and speculative decoding. The focus thus shifts from scaling model size to balancing efficiency and quality for practical deployment scenarios. The model's architecture is built on a two-pass design, which gives developers a modular and flexible approach to speech processing workflows.
KEY FEATURES AND ARCHITECTURAL DESIGN
Granite 4.0 1B Speech is a compact and efficient speech-language model trained for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). The training data mixes public ASR and AST corpora with synthetic data tailored to support Japanese ASR, keyword-biased ASR, and speech translation. This reflects IBM's approach: rather than building a completely new speech stack, it adapted a Granite 4.0 base language model through alignment and multimodal training. The supported languages are English, French, German, Spanish, Portuguese, and Japanese, covering speech-to-text and speech translation to and from English, along with specific scenarios such as English-to-Italian and English-to-Mandarin translation. The model is released under the Apache 2.0 license, which gives teams more freedom to evaluate and deploy it than the restrictions often attached to commercial speech systems. In the two-pass architecture, an initial transcription is followed by a separate language model call, which allows a modular and adaptable pipeline design.
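The two-pass idea can be sketched in a few lines. This is an illustration of the pattern only, not IBM's implementation: `run_two_pass`, `TwoPassResult`, and both stub functions are hypothetical names, and the stubs stand in for a real speech model call and a real language model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TwoPassResult:
    transcript: str  # output of pass 1 (speech recognition)
    processed: str   # output of pass 2 (language-level post-processing)


def run_two_pass(
    audio: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],
    post_process: Callable[[str], str],
) -> TwoPassResult:
    """Run transcription first, then hand the text to a separate second call."""
    transcript = transcribe(audio)
    processed = post_process(transcript)
    return TwoPassResult(transcript=transcript, processed=processed)


# Stubs for illustration only; a real pipeline would call the speech
# model in the first function and a language model in the second.
def stub_transcribe(audio: Sequence[float]) -> str:
    return "hello from granite"


def stub_post_process(text: str) -> str:
    return text.capitalize() + "."


result = run_two_pass([0.0, 0.1, -0.1], stub_transcribe, stub_post_process)
print(result.transcript)  # prints: hello from granite
print(result.processed)   # prints: Hello from granite.
```

Because each pass is an ordinary callable, either one can be swapped or skipped without touching the other, which is the modularity the design is meant to provide.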
DEPLOYMENT AND TECHNICAL SPECIFICATIONS
Granite 4.0 1B Speech recently took the top ranking on the OpenASR leaderboard, with an average word error rate (WER) of 5.52 and an inverse real-time factor (RTFx) of 280.02. RTFx is the duration of the audio divided by the time taken to process it, so higher values mean faster transcription. Per-dataset WER includes 1.42 on LibriSpeech Clean, 2.85 on LibriSpeech Other, 3.89 on SPGISpeech, 3.1 on Tedlium, and 5.84 on VoxPopuli. The model is natively supported in transformers>=4.52.1 and can be served through vLLM, which covers both standard Python inference and API-style serving. The model expects mono 16 kHz audio, and the user prompt must begin with `<|audio|>`. Keyword biasing can be applied directly in the prompt using the format `Keywords:
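The input conventions mentioned above (mono 16 kHz audio and a user prompt that begins with `<|audio|>`) can be illustrated with a small, dependency-free sketch. All function names are hypothetical and the instruction text is an assumed example, not wording taken from the release. The linear-interpolation resampler is a deliberate simplification: a production pipeline would more likely use a dedicated audio library with proper anti-alias filtering.

```python
from typing import List, Sequence

TARGET_RATE = 16_000       # sample rate the article says the model expects
AUDIO_TOKEN = "<|audio|>"  # prefix the article says the user prompt begins with


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame into a single sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(
    samples: Sequence[float], src_rate: int, dst_rate: int = TARGET_RATE
) -> List[float]:
    """Resample by linear interpolation (simplified; no anti-alias filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    out: List[float] = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)
        if i >= len(samples) - 1:
            out.append(samples[-1])  # clamp at the final sample
        else:
            frac = pos - i
            out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


def build_user_prompt(instruction: str) -> str:
    """Place the audio token first, followed by the instruction text."""
    return f"{AUDIO_TOKEN}{instruction}"


# Two stereo frames captured at 32 kHz, prepared for a 16 kHz mono model.
stereo = [[1.0, 3.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0]]
mono_16k = resample_linear(to_mono(stereo), src_rate=32_000)
prompt = build_user_prompt("Transcribe this recording.")  # assumed instruction wording
```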
This article is AI-synthesized from public sources and may not reflect original reporting.