OpenAI's Spring Update on May 13th introduced GPT-4o, a model that was demoed tutoring someone through a maths problem with natural interruptions and genuine-sounding enthusiasm, sparking comparisons to the movie 'Her'.
The previous voice mode for ChatGPT relied on a pipeline of three separate models: speech to text, text through GPT-4, and text to speech. GPT-4o, on the other hand, processes speech, text, and images natively in a single model, eliminating the need for transcription and text-to-speech steps.
This change resulted in a significant drop in end-to-end voice latency, from around 2.8 seconds to an average of 320 milliseconds. To put this into perspective, human conversation pauses typically last around 200ms, making GPT-4o's response time remarkably close to natural conversation pacing.
GPT-4o can also detect tone, laughter, and interruptions, allowing it to respond accordingly. These capabilities emerged from training on audio directly, rather than text transcriptions of audio.
For instance, I have seen this capability being used in a customer service voice bot built using the GPT-4o model and the Dialogflow platform, where the bot can detect the tone of the customer's voice and respond with empathy, which significantly improved the customer experience. The bot was able to handle around 80% of the customer inquiries without human intervention, resulting in a 30% reduction in support tickets.
The comparison to 'Her' is more than just a joke. The 2013 film depicted an operating system that was so natural to converse with that the protagonist developed a genuine emotional relationship with it. When latency drops below a certain threshold, the interaction changes qualitatively, and the cognitive load of 'talking to an AI' partially disappears.
However, it's worth noting that achieving such low latency requires significant computational resources, and the cost of using GPT-4o can be substantial. For example, using the GPT-4o model on a cloud platform like AWS can cost around $0.06 per minute of audio processed, which can add up quickly for high-volume applications. This trade-off between latency and cost is something that developers need to carefully consider when designing their applications.
The demo voice, Sky, was noted to sound like Scarlett Johansson, who voiced the OS in 'Her'. This coincidence, intentional or not, reinforced the cultural moment, and OpenAI's Samantha Altman had to clarify and eventually pull the voice after Johansson's lawyers got involved.
The implications for developers are significant. GPT-4o's audio API is available through the same API endpoint, with different input/output formats. This means that real-time voice applications that previously required stitching together three models now have a single model option.
Using tools like PyTorch and TensorFlow, developers can fine-tune the GPT-4o model for their specific use cases, which can result in even better performance and more accurate results. For example, a developer can use transfer learning to adapt the GPT-4o model to a specific accent or dialect, which can improve the model's performance in certain regions or communities.
With native audio processing, applications such as customer service voice bots, accessibility tools, and voice-first interfaces become architecturally simpler and qualitatively better. This development has the potential to transform the way we interact with AI, making it more natural and intuitive.