OpenAI released demonstrations of Sora on February 15th: a text-to-video model that generates up to 60 seconds of HD video with consistent physics, character continuity, and camera motion. The videos are striking in the way DALL-E 3's images were.

What Sora can do

The demonstrations show scenes with complex camera movements, multiple characters interacting, consistent lighting across cuts, and physically plausible object behaviour. Previous text-to-video models produced clips where objects morphed, characters changed appearance mid-frame, and backgrounds shifted incoherently. Sora's outputs have a degree of temporal consistency that is qualitatively different. A character who walks out of frame walks back in looking the same. Water flows naturally. Snow falls consistently.

The transformer-on-video architecture

OpenAI's technical report describes Sora as a diffusion transformer that operates on spacetime patches of video. Rather than generating video frame by frame, it treats video as a sequence of visual patches across both space and time, applying the transformer architecture that has been so effective in language and image models. This lets the model learn relationships between content at different points in time in the same way language transformers learn relationships between tokens at different positions in a sequence.
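To make the patch idea concrete, here is a minimal sketch of how a video tensor could be cut into spacetime patches and flattened into a token sequence for a transformer. This is an illustration, not OpenAI's implementation: the technical report describes patchifying a compressed latent representation rather than raw pixels, and the actual patch sizes and embedding details are unpublished. The function name and all shapes below are hypothetical.

```python
import torch

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Cut a video into (pt x ph x pw) spacetime patches and flatten each
    into a vector, producing a token sequence a transformer can attend over.

    video: (T, C, H, W) tensor; assumes T, H, W divide evenly by pt, ph, pw.
    All sizes here are illustrative -- Sora's real patch geometry is unpublished.
    """
    T, C, H, W = video.shape
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group the three patch-grid axes together, then the within-patch axes.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)        # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)    # one token per spacetime patch

video = torch.randn(16, 3, 256, 256)          # 16 frames of 256x256 RGB
tokens = to_spacetime_patches(video)          # shape: (1024, 3072)
```

Each row of `tokens` would then be linearly projected into the model dimension and processed by a transformer whose attention spans every patch in the sequence, so a patch late in the clip can attend directly to a patch at the start. That global attention across time is one plausible source of the temporal consistency described above, and it is exactly what frame-by-frame generation lacked.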

What is not in the demos

The demos were curated. They show the best outputs. Red-teaming results, failure rates, generation times, and the compute cost per second of video are not published. For a product that is being compared to a Hollywood production pipeline, the relevant questions are: how often does it produce the result you want, how long does it take, and what does it cost? OpenAI did not release Sora publicly in February. Access is limited to red teamers and creative professionals. The production economics are unknown.

Creative industry implications

The advertising, stock footage, concept visualisation, and indie film production spaces are the immediate addressable market. A text-to-video pipeline that can produce concept shots changes the pre-production economics for commercial content. A brand that needs 20 variations of a product shot in different settings can generate them in hours rather than scheduling multiple production days. The displacement risk for stock footage libraries is more immediate than for human cinematography.