OpenAI rolled out GPT-4V, the vision-capable version of GPT-4, to ChatGPT Plus subscribers in late September 2023, with API access following in November via the gpt-4-vision-preview model. The practical applications emerging since then are broader than many expected.
What vision capability actually means
GPT-4V accepts images as inputs alongside text. You can ask it to describe an image, answer questions about what is in it, interpret a chart or diagram, read text from a screenshot, or reason about the spatial relationships in a photo. The vision capability is not a separate model bolted on: it is a multimodal understanding of the combined text and image input.
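In the chat-completions API, an image travels as one element of a mixed content array alongside the text prompt, so a single message carries both modalities. A minimal sketch of building such a request body in Python (the field layout follows OpenAI's documented vision format; the helper name and the base64 data-URL approach are illustrative choices, not the only option):

```python
import base64


def build_vision_message(prompt: str, image_path: str) -> list:
    """Build a chat message pairing a text prompt with an inline image.

    The content array mixes a "text" part and an "image_url" part; here the
    image is embedded as a base64 data URL rather than a hosted URL.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]
```

The returned list is what you would pass as `messages` in a chat-completions call; nothing here is model-specific beyond the content-array shape.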
Real use cases emerging
The most immediately productive applications are in document processing: invoice reading, form extraction, diagram interpretation. Previous computer vision approaches required training custom models for each document type. GPT-4V can interpret an invoice it has never seen before by understanding the semantic meaning of the layout. For enterprise document workflows, this changes the build vs buy calculation significantly.
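For a field-extraction workflow like invoice reading, a common pattern is to ask the model for a strict JSON reply and then parse it defensively, since replies can omit fields or add extras. A hedged sketch of that pattern (the field names and prompt wording are illustrative, not from any particular pipeline):

```python
import json

# Illustrative field set for an invoice-extraction prompt.
INVOICE_FIELDS = ["vendor", "invoice_number", "date", "total"]


def extraction_prompt(fields: list) -> str:
    """Ask for exactly one JSON object, with null for anything not visible."""
    return (
        "Extract these fields from the invoice image and reply with a "
        "single JSON object, using null for anything not visible: "
        + ", ".join(fields)
    )


def parse_reply(reply_text: str, fields: list) -> dict:
    """Parse the model's JSON reply, keeping only the requested fields
    and filling absent ones with None so downstream code sees a fixed shape."""
    data = json.loads(reply_text)
    return {f: data.get(f) for f in fields}
```

Pairing the prompt with a fixed-shape parser is what lets one prompt cover invoices the model has never seen: the schema lives in the prompt, not in a trained layout model.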
Code from screenshots
Developers have been using GPT-4V to generate HTML from design mockups and to recreate UI components from screenshots. The quality is not production-ready without editing, but the scaffolding value is high. Describing a layout in words and getting mediocre code is less useful than showing the layout as an image and getting code that captures the visual structure.
The limitations
GPT-4V is not reliable for tasks requiring precise spatial measurements, exact value reading from charts (it captures trends but misreads specific numbers), or small text in complex images. Hallucination rates are higher for visual inputs than for text-only ones. It should not be the sole step in a pipeline that depends on accurate visual information; validation against the source data matters.
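One cheap validation step along those lines is a consistency check that recomputes a model-extracted total from the extracted line items and flags mismatches for human review. A sketch under assumed conventions (the function, the string cleanup, and the tolerance are illustrative, not a prescribed method):

```python
def total_is_consistent(line_items: list, reported_total: str,
                        tolerance: float = 0.01) -> bool:
    """Cross-check a model-extracted invoice total against its line items.

    Returns False on unparseable totals or on mismatches beyond `tolerance`,
    so the document can be routed to manual review instead of trusted blindly.
    """
    try:
        # Illustrative cleanup for strings like "$1,234.56".
        reported = float(str(reported_total).replace(",", "").lstrip("$"))
    except ValueError:
        return False
    return abs(sum(line_items) - reported) <= tolerance
```

A failed check does not tell you whether the model misread the total or a line item, only that the extraction cannot be trusted as-is, which is exactly the signal a review queue needs.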