🤯 AI Audio Future: GPT-Realtime-2 Unlocked! 🚀

May 08, 2026

AI


🧐 Quick Intel


  • OpenAI launched three new audio models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, marking the exit of the Realtime API from beta.
  • GPT-Realtime-2 incorporates a 128K token context window and offers adjustable reasoning effort across five levels, achieving 96.6% on Big Bench Audio and 48.5% on Audio MultiChallenge instruction following.
  • GPT-Realtime-Translate supports translation from 70+ input languages into 13 output languages at a rate of $0.034 per minute.
  • GPT-Realtime-Whisper provides low-latency streaming speech-to-text transcription at a rate of $0.017 per minute.
  • Two new voices, Cedar and Marin, were added to the available API voices.
  • The OpenAI Realtime API is generally available starting today.
    📝 Summary


    OpenAI has announced the general availability of three new audio models through its Realtime API. The models – GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper – are now accessible following a beta period. GPT-Realtime-2, boasting a 128K token context window, was designed for voice agents with advanced reasoning, achieving high scores on audio benchmarks. Simultaneously, GPT-Realtime-Translate offers live speech translation across 70+ languages, while GPT-Realtime-Whisper provides low-latency streaming transcription. The addition of Cedar and Marin voices expands the API's capabilities. These advancements represent a significant step in real-time audio processing technology.

    💡 Insights



    GPT-REALTIME-2: THE FLAGSHIP VOICE MODEL
    OpenAI has unveiled three new audio models designed for real-time voice applications: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. These models represent a significant advancement in voice agent capabilities, moving beyond simple question-and-answer interactions toward systems capable of reasoning, translating, transcribing, and acting within a single conversation. The core of this innovation lies in GPT-Realtime-2, described by OpenAI as its first voice model with GPT-5-class reasoning abilities. This model is engineered to handle complex requests, manage interruptions, and maintain conversational flow naturally, addressing a persistent weakness of previous voice models.

    EXPANDED CONTEXT WINDOW AND INTELLIGENT DESIGN
    A key upgrade within GPT-Realtime-2 is its dramatically expanded context window, now boasting 128K tokens compared to the previous 32K. This allows the model to sustain longer, more intricate conversations and maintain context throughout multi-step tasks without losing critical information. Previously, voice models often struggled with complex, multi-stage requests or would abruptly lose context during extended sessions. The new architecture ensures a smoother, more natural conversational experience.

    DYNAMIC CONTROL AND TAILORED PERFORMANCE
    Developers now have granular control over the model's reasoning effort, offering five distinct levels – minimal, low, medium, high, and xhigh – to optimize performance for various use cases. The default setting is “low” for simple requests, minimizing latency, while more demanding tasks can leverage the “xhigh” setting for increased computational power. This dynamic adjustment allows teams to fine-tune the performance-latency tradeoff, ensuring optimal efficiency for applications ranging from quick customer lookups to complex travel booking workflows.
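    As a concrete sketch, effort could be chosen per session based on task complexity. The `reasoning_effort` field name and payload shape below are assumptions for illustration only, not confirmed API parameters:

```python
# Hypothetical session payload builder; "reasoning_effort" is an assumed
# parameter name, and the mapping of task complexity to level is illustrative.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_config(task_complexity: str) -> dict:
    """Map a rough task-complexity label to a sketch of a session payload."""
    effort = {
        "simple": "low",      # default: quick lookups, minimal latency
        "moderate": "medium",
        "complex": "xhigh",   # multi-step workflows, e.g. travel booking
    }.get(task_complexity, "low")
    assert effort in EFFORT_LEVELS
    return {
        "model": "gpt-realtime-2",   # model name as given in the announcement
        "reasoning_effort": effort,  # assumed field name
    }

print(session_config("complex")["reasoning_effort"])  # xhigh
```

    The point is the shape of the tradeoff: default to the cheap, low-latency level and escalate only for requests that need deeper reasoning.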

    TONE AND INDUSTRY-SPECIFIC KNOWLEDGE
    Beyond reasoning capabilities, GPT-Realtime-2 incorporates advanced tone control, adapting its speaking style based on the conversation's context. The model can switch between calm and empathetic tones during problem-solving, shifting to an upbeat demeanor after a successful outcome. Furthermore, the model demonstrates enhanced understanding of industry-specific terminology, including sophisticated healthcare vocabulary and proper nouns, improving accuracy and relevance across diverse applications.

    BENCHMARK IMPROVEMENTS AND SCALABLE PERFORMANCE
    Rigorous benchmarking validates the improvements in GPT-Realtime-2. With high reasoning enabled, the model achieved a 96.6% score on Big Bench Audio, a significant jump from the 81.4% score of GPT-Realtime-1.5. Similarly, with “xhigh” reasoning, it scored 48.5% on Audio MultiChallenge instruction following, surpassing the 34.7% score of the previous model. These metrics demonstrate a substantial leap in the model's ability to understand and respond to complex audio input.

    PRICING AND AVAILABLE MODELS
    GPT-Realtime-2 is priced at $32 per 1 million audio input tokens ($0.40 per 1 million cached input tokens) and $64 per 1 million audio output tokens. Alongside this flagship model, OpenAI has introduced two specialized options: GPT-Realtime-Translate and GPT-Realtime-Whisper.
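    Those rates make per-session costs easy to estimate. A minimal sketch, assuming the cached-input figure is billed per 1 million tokens (the natural reading of the announcement's parenthetical):

```python
# Cost estimate from the stated GPT-Realtime-2 rates:
#   $32 per 1M audio input tokens
#   $0.40 per 1M cached input tokens (assumed per-1M reading)
#   $64 per 1M audio output tokens

def realtime2_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated USD cost for one session; cached_tokens is the cached subset of input."""
    fresh = input_tokens - cached_tokens
    return (fresh * 32 + cached_tokens * 0.40 + output_tokens * 64) / 1_000_000

# Example: 50K input tokens (10K of them cached) plus 20K output tokens.
print(realtime2_cost(50_000, 20_000, cached_tokens=10_000))  # 2.564
```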

    GPT-REALTIME-TRANSLATE: LIVE SPEECH TRANSLATION
    GPT-Realtime-Translate is a dedicated translation model designed for real-time speech translation, converting audio from 70+ input languages into 13 output languages. Unlike GPT-Realtime-2, this model focuses solely on translation, streamlining the development process for applications requiring bilingual communication. It's a purpose-built solution for scenarios such as customer support flows or live interpreter services. Pricing for this model is $0.034 per minute.

    GPT-REALTIME-WHISPER: REAL-TIME SPEECH-TO-TEXT
    GPT-Realtime-Whisper is a streaming speech-to-text model, offering low-latency transcription directly from audio. This contrasts with the original Whisper model, which was designed for post-session transcription. The streaming capability makes it ideal for applications demanding immediate text output, such as live broadcast captions, meeting notes generated in real-time, and voice agents requiring continuous understanding of the user's input. Latency control allows developers to adjust the delay setting for optimal balance between transcription quality and responsiveness, priced at $0.017 per minute.
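    Unlike GPT-Realtime-2, both specialized models are billed on audio duration rather than tokens, which makes budgeting straightforward. A small sketch using the stated per-minute rates:

```python
# Per-minute rates stated in the announcement:
PER_MINUTE_RATE = {
    "gpt-realtime-translate": 0.034,  # live speech translation
    "gpt-realtime-whisper": 0.017,    # streaming transcription
}

def audio_cost(model: str, seconds: float) -> float:
    """Estimated USD cost for `seconds` of audio on a duration-billed model."""
    return PER_MINUTE_RATE[model] * seconds / 60

# Transcribing a one-hour meeting live:
print(round(audio_cost("gpt-realtime-whisper", 3600), 2))  # 1.02
```

    At these rates, an hour of live transcription costs roughly a dollar, and an hour of live translation about twice that.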

    NEW VOICES AND API INTEGRATION
    The release also introduces two new voices, Cedar and Marin, exclusively available through the OpenAI Realtime API. All three models – GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper – are immediately accessible through the API, now generally available, offering developers a comprehensive suite of tools for building sophisticated voice applications.

    SESSION TYPES AND CUSTOMIZATION
    Developers can select from three session types depending on the use case: a voice-agent session for an assistant that responds to the user, a translation session for an interpreter, and a transcription session for text from audio. This flexibility allows for tailored solutions across a broad range of applications.
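    A minimal sketch of routing use cases to the three session kinds. The string values are assumptions for illustration, since the announcement names the kinds of session but not their exact API identifiers:

```python
# Map a use case to one of the three session kinds described above.
# The string values are hypothetical, not confirmed API constants.
SESSION_KIND = {
    "assistant": "voice_agent",     # an assistant that responds to the user
    "interpreter": "translation",   # live speech translation
    "captioning": "transcription",  # text from audio
}

def pick_session(use_case: str) -> str:
    try:
        return SESSION_KIND[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None

print(pick_session("interpreter"))  # translation
```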