🤯 AI Breakthrough: Covo-Audio Speaks! 🗣️
Tencent AI Lab recently released Covo-Audio, a 7B-parameter Large Audio Language Model whose end-to-end architecture directly processes continuous audio inputs and generates audio outputs. A key innovation is the Hierarchical Tri-modal Speech-Text Interleaving strategy, which aligns continuous acoustic features, discrete speech tokens, and text. Training followed a two-stage pipeline over 2T tokens and used an Intelligence Speaker Decoupling strategy to reduce the cost of dialogue data. Covo-Audio achieved competitive results, scoring 75.30% on the MMAU benchmark and 66.64% on the MMSU benchmark, with notably strong performance in music understanding. A further variant, Covo-Audio-Chat-FD, uses a 1:4 chunk-interleaving ratio for full-duplex dual-stream communication. The model incorporates Chain-of-Thought reasoning and Group Relative Policy Optimization, achieving state-of-the-art empathetic responses in Mandarin for emotions such as anger, sadness, and anxiety. Initial testing revealed a sensitivity to prolonged silences that can lead to premature responses.
COVO-AUDIO: A NOVEL LARGE AUDIO LANGUAGE MODEL
Covo-Audio, developed by Tencent AI Lab, represents a significant advancement in Large Audio Language Models (LALMs). This 7B-parameter model distinguishes itself through an end-to-end architecture that integrates speech processing and language intelligence, directly handling continuous audio inputs and generating audio outputs. The core innovation is the Hierarchical Tri-modal Speech-Text Interleaving strategy. Unlike traditional methods limited to word- or character-level alignment, Covo-Audio simultaneously aligns continuous acoustic features (a_c), discrete speech tokens (a_d), and natural language text (t), a critical step for genuinely understanding and generating audio. The model applies this alignment at two levels: phrase-level interleaving for fine-grained alignment, and sentence-level interleaving to maintain global semantic integrity, which is particularly important for longer utterances.

The two-stage training pipeline processed a total of 2T tokens and used an Intelligence Speaker Decoupling strategy to mitigate the cost of creating large-scale, speaker-specific dialogue datasets. This decoupling separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data. The research team reformatted high-quality TTS recordings into pseudo-conversations and applied a masked text loss to preserve reasoning abilities while inheriting the naturalness of the TTS speaker, allowing personalized interaction without extensive speaker-specific datasets.
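Neither the interleaving format nor the masked text loss has been published as code, but both ideas can be sketched. The first sketch below is a minimal, hypothetical illustration of the two interleaving granularities; the Phrase container, feature types, and ordering conventions are assumptions, not the released format.

```python
# Hypothetical sketch of Hierarchical Tri-modal Speech-Text Interleaving.
# Container names, feature shapes, and ordering are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Union

Token = Union[float, int, str]  # an a_c frame, an a_d token, or a text span

@dataclass
class Phrase:
    acoustic: List[float]  # continuous acoustic features a_c
    speech_ids: List[int]  # discrete speech tokens a_d
    text: str              # aligned transcript t

def phrase_level(phrases: List[Phrase]) -> List[Token]:
    """Fine-grained alignment: a_c, a_d, and t alternate per phrase."""
    seq: List[Token] = []
    for p in phrases:
        seq.extend(p.acoustic)
        seq.extend(p.speech_ids)
        seq.append(p.text)
    return seq

def sentence_level(sentence: List[Phrase]) -> List[Token]:
    """Coarser alignment: all audio for the sentence first, then the full
    text, preserving global semantics over long utterances."""
    seq: List[Token] = []
    for p in sentence:
        seq.extend(p.acoustic)
        seq.extend(p.speech_ids)
    seq.append(" ".join(p.text for p in sentence))
    return seq
```

The second sketch illustrates one plausible reading of the masked text loss: cross-entropy is computed only over non-text positions of the pseudo-conversation, so the model learns the TTS speaker's voice without fitting its reasoning to templated text. The tensor shapes and per-token mask are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_text_loss(logits: torch.Tensor,    # (batch, seq, vocab)
                     targets: torch.Tensor,   # (batch, seq)
                     text_mask: torch.Tensor  # (batch, seq), True on text
                     ) -> torch.Tensor:
    """Cross-entropy restricted to non-text positions (assumed formulation)."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    keep = (~text_mask).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```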
ADVANCED ARCHITECTURAL FEATURES AND OPTIMIZATION
Covo-Audio has evolved into Covo-Audio-Chat-FD, a variant optimized for simultaneous dual-stream (full-duplex) communication. The audio encoder uses a chunk-streaming format in which user and model streams are interleaved at a 1:4 ratio, with each chunk representing 0.16 seconds of audio. The model manages conversational state through a recursive context-filling strategy, carrying continuous audio features from user input and generated tokens from previous turns forward as historical context. To strengthen complex reasoning, the model incorporates Chain-of-Thought (CoT) reasoning and Group Relative Policy Optimization (GRPO), optimized with a verifiable composite reward function that targets four objectives: correctness (R_accuracy), structured output adherence (R_format), logical coherence (R_consistency), and reasoning depth (R_thinking). This optimization strategy contributes to the model's strong performance across benchmarks, and the research team emphasizes that Covo-Audio (7B) delivers competitive or superior results compared to models of comparable scale on selected speech/audio tasks.
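As a concrete illustration of the dual-stream layout, the following minimal sketch interleaves one user chunk with four model chunks per step, matching the 1:4 ratio above; the list-based representation and padding assumption are illustrative, not the released implementation.

```python
CHUNK_SECONDS = 0.16  # duration of audio covered by each chunk

def interleave_duplex(user_chunks, model_chunks, ratio=4):
    """Chunk-streaming layout: [u0, m0..m3, u1, m4..m7, ...].
    Assumes model_chunks has been padded to ratio * len(user_chunks)."""
    assert len(model_chunks) == ratio * len(user_chunks)
    stream = []
    for i, u in enumerate(user_chunks):
        stream.append(u)
        stream.extend(model_chunks[i * ratio:(i + 1) * ratio])
    return stream

# Example: 1.6 s of user audio -> 10 user chunks, interleaved with
# 40 model chunks.
```

The composite reward can be sketched in the same spirit. The paper names the four verifiable objectives but not their scoring rules or weights, so the <think>/<answer> tag schema, the exact-match accuracy check, and the equal weights below are placeholders.

```python
import re

def score_format(text: str) -> float:
    """R_format: reward adherence to a structured <think>/<answer> layout
    (a common CoT convention; the actual schema is not published)."""
    ok = re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", text, re.S)
    return float(ok is not None)

def score_accuracy(text: str, reference: str) -> float:
    """R_accuracy: exact match of the extracted answer against a
    verifiable reference answer."""
    m = re.search(r"<answer>(.+?)</answer>", text, re.S)
    return float(m is not None and m.group(1).strip() == reference.strip())

def composite_reward(text: str, reference: str,
                     r_consistency: float, r_thinking: float,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of R_accuracy, R_format, R_consistency, R_thinking.
    Consistency and thinking-depth scores are assumed to come from
    rule-based or learned judges; equal weights are a placeholder."""
    parts = (score_accuracy(text, reference), score_format(text),
             r_consistency, r_thinking)
    return sum(w * p for w, p in zip(weights, parts))
```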
PERFORMANCE AND FUTURE DIRECTIONS
Covo-Audio achieved significant results on several benchmarks. On the MMAU benchmark, it attained an average score of 75.30%, the highest among evaluated 7B-scale models, with exceptional performance in music understanding (76.05%). On the MMSU benchmark, it achieved a leading 66.64% average accuracy. Among its conversational variants, Covo-Audio-Chat performed strongly on URO-Bench, notably outperforming models such as Qwen3-Omni on the Chinese track. For empathetic interaction on the VStyle benchmark, it achieved state-of-the-art results in Mandarin for anger (4.89), sadness (4.93), and anxiety (5.00). However, the research team identified an 'early-response' issue in the GaokaoEval full-duplex setting: unusually long silent pauses between vocal fragments could trigger premature responses. This behavior is tied to the model's pause-handling metric and marks a critical area for future optimization. Further details are available in the accompanying paper, the model release on Hugging Face, and the associated repository.
This article is AI-synthesized from public sources and may not reflect original reporting.