OpenAI's Spring Update on May 13th introduced GPT-4o. In the live demo, the model tutored someone through a maths problem, handled being interrupted mid-sentence, paused when asked to, and expressed what sounded like genuine enthusiasm. It was immediately compared to the movie "Her," and that comparison is worth taking seriously rather than dismissing.

What actually changed technically

ChatGPT's previous voice mode used a pipeline: speech to text, text through GPT-3.5 or GPT-4, then text to speech. Three separate models. GPT-4o processes speech, text, and images natively in a single model, which removes the transcription step and the text-to-speech step from the loop. End-to-end voice latency dropped from an average of 2.8 seconds (GPT-3.5) or 5.4 seconds (GPT-4) to an average of 320 milliseconds. Typical pauses between speakers in human conversation are around 200ms, so at 320ms the model is within range of natural conversational pacing.
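To see why collapsing the pipeline matters, it helps to sketch the latency budget. Sequential stages add up, so the pipeline's floor is the sum of its parts. The per-stage figures below are illustrative assumptions, not published numbers; only the ~2.8 s and 320 ms totals echo the text above.

```python
# Latency budget sketch: three-model pipeline vs. a single native-audio model.
# Per-stage figures are illustrative assumptions; only the totals echo the
# numbers cited in the text (~2.8 s pipeline, 320 ms native).

PIPELINE_STAGES_MS = {
    "speech_to_text": 900,    # assumed: transcribe the user's audio
    "llm_generation": 1400,   # assumed: the text model produces a reply
    "text_to_speech": 500,    # assumed: synthesize the reply as audio
}

NATIVE_MODEL_MS = 320    # GPT-4o's average voice latency, per the announcement
HUMAN_TURN_GAP_MS = 200  # typical pause between speakers in conversation


def pipeline_latency_ms(stages: dict) -> int:
    """Stages run sequentially, so their latencies simply add."""
    return sum(stages.values())


total = pipeline_latency_ms(PIPELINE_STAGES_MS)
print(f"pipeline: {total} ms, native: {NATIVE_MODEL_MS} ms")
print(f"native is {NATIVE_MODEL_MS - HUMAN_TURN_GAP_MS} ms above human pacing")
```

The point of the sketch: no amount of optimizing one stage gets a sequential pipeline near the 200ms human turn gap, whereas a single model has only its own generation latency to pay.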

The model can also hear tone, detect laughter, and respond accordingly. It can be interrupted mid-sentence and understand that it was interrupted. These capabilities emerge from training directly on audio rather than on text transcriptions of it.

Why the "Her" comparison landed

The 2013 film "Her" depicted an operating system so natural to converse with that the protagonist developed a genuine emotional relationship with it. The comparison was meant partly as a joke but it points at something real. When the latency drops below the threshold where you are consciously aware of a computer responding, the interaction changes qualitatively. The cognitive load of "talking to an AI" partially disappears.

Sam Altman then had to address the fact that one of the demo voices (Sky) sounded like Scarlett Johansson, who voiced the OS in "Her," and OpenAI pulled that voice after Johansson's lawyers got involved. The resemblance, intentional or not, reinforced the cultural moment.

The developer implications

GPT-4o reaches developers through the same API endpoint as earlier models; text and vision shipped at launch, while audio input and output were initially rolled out to a small group of partners. Real-time voice applications that previously needed to stitch together three models now have a single-model option. Customer service voice bots, accessibility tools, voice-first interfaces: all of these become architecturally simpler and qualitatively better with native audio processing.
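The "single model, different input/output formats" idea can be made concrete with a sketch of the request a voice app would send: audio goes in as a message content part, and the request asks for both text and audio back. The payload shape below mirrors the chat-completions audio format that OpenAI later documented; treat the exact field names as assumptions rather than a definitive reference, since audio access was still being rolled out.

```python
import base64


def build_voice_request(audio_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build one chat-completions-style request that carries audio in and
    asks for a spoken reply out. Field names follow the chat-completions
    audio format but should be treated as assumptions, not a reference."""
    return {
        "model": model,
        "modalities": ["text", "audio"],               # ask for text + spoken reply
        "audio": {"voice": "alloy", "format": "wav"},  # output voice settings
        "messages": [{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    # audio is sent base64-encoded inside the JSON body
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                    "format": "wav",
                },
            }],
        }],
    }


req = build_voice_request(b"\x00\x01fake-pcm-bytes")
print(sorted(req.keys()))
```

Contrast this with the old architecture, where the same interaction needed three network round trips (transcription, completion, synthesis), each adding its own latency and failure mode.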