2023 was the year everyone figured out that AI was real and actually useful. 2024 is the year the hype settles and the hard engineering begins. The trends are clear, the infrastructure is emerging, and the questions have shifted from "can we do this?" to "how do we do this efficiently?" If you're building AI-powered features in 2024, here's what actually matters.
The prompt engineering era is ending
In 2023, the key skill for working with LLMs was prompt engineering: crafting inputs that reliably produced useful outputs from a model that had no knowledge of your specific context. In 2024, the patterns that matter are retrieval-augmented generation (RAG), embeddings, vector databases, and tool use. Models are increasingly capable of invoking structured tools rather than just generating text. The engineering challenge shifts from getting the model to say the right thing to building the scaffolding that gives the model the right information and actions.
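The "actions" half of that scaffolding is a dispatch layer: the model emits a structured tool call (a name plus JSON arguments) and your code executes it. A minimal sketch, where the tool names, argument shapes, and returned data are all hypothetical placeholders rather than any specific provider's function-calling API:

```python
import json

# Illustrative tool registry: each entry maps a tool name to the
# function that actually performs the action.
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "lookup_price": lambda sku: {"sku": sku, "price_usd": 19.99},
}

def dispatch_tool_call(raw_call: str) -> dict:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return tool(**call["arguments"])

# Simulated model output: in production this JSON would come from the LLM.
result = dispatch_tool_call(
    '{"name": "get_order_status", "arguments": {"order_id": "A123"}}'
)
```

The point is that the hard engineering lives in this layer (validation, unknown-tool handling, argument schemas), not in the wording of the prompt.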
The embedding infrastructure build-out
Every organisation building AI-powered features in 2024 will need an embedding pipeline: a way to convert their documents, products, conversations, and other content into vector representations that can be searched semantically. The ecosystem of vector databases (Pinecone, Weaviate, Qdrant, pgvector, and others) matured significantly in 2023. In 2024, the build-vs-buy question for this infrastructure layer is being answered: most teams will use a managed vector store rather than rolling their own. The engineering work moves to the pipeline that generates and maintains the embeddings.
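The shape of that pipeline is the same regardless of which store you buy: embed each piece of content, index the vectors, and answer queries by similarity. A self-contained sketch, where `embed()` is a toy bag-of-words stand-in for a real embedding model (normally an API call or a local model) and the in-memory store stands in for a managed vector database:

```python
import math

# Toy vocabulary for the stand-in embedding function; a real embedding
# model produces dense vectors with no fixed vocabulary.
VOCAB = ["refund", "shipping", "invoice", "password", "delivery"]

def embed(text: str) -> list[float]:
    """Illustrative placeholder for a real embedding model call."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """In-memory stand-in for a managed vector database."""
    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
store.add("Our refund policy covers returns within 30 days.")
store.add("Shipping and delivery usually take 3-5 business days.")
top = store.search("when will my delivery arrive?")[0]
```

The "maintains" part is the real work in production: re-embedding content when it changes, and re-indexing everything when you switch embedding models, since vectors from different models are not comparable.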
Fine-tuning vs RAG vs prompting
The three approaches to making a general-purpose LLM behave well for your specific use case are prompt engineering, retrieval-augmented generation, and fine-tuning on your data. The lesson of 2023 is that RAG handles most use cases better than fine-tuning, at lower cost and without the risk of the model forgetting its general capabilities. Fine-tuning remains valuable for style, format, and specialised domain knowledge that does not change frequently. The default architecture in 2024 is RAG with a well-designed prompt, with fine-tuning reserved for specific gaps that RAG does not address.
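The "RAG with a well-designed prompt" default is concretely just two steps: retrieve the relevant context, then assemble a prompt around it. A minimal sketch, where `retrieve()` uses toy lexical overlap as a stand-in for the semantic search a vector store would provide, and the prompt template is illustrative:

```python
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy word-overlap retriever standing in for a vector-store query."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, corpus: list[str]) -> str:
    """Assemble the retrieved chunks into a grounded prompt for the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, corpus))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

corpus = [
    "Premium plans include priority support.",
    "The free tier allows 100 requests per day.",
]
prompt = build_prompt("how many requests does the free tier allow?", corpus)
```

The prompt-engineering skill of 2023 survives inside `build_prompt`; the new work is everything feeding it.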
The cost optimisation opportunity
GPT-4 at the API prices of mid-2023 was prohibitively expensive for many production use cases at scale. OpenAI reduced prices significantly, but the more important development is the maturity of smaller, cheaper models for appropriate tasks. A pipeline that routes simple tasks to GPT-3.5 or a fine-tuned Mistral 7B, and complex tasks to GPT-4, can reduce AI API costs by 80-90% with minimal quality impact. Cost optimisation of AI pipelines is a meaningful engineering opportunity in 2024.
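A routing pipeline like that can be sketched in a few lines. The prices, heuristic, and savings figure below are illustrative placeholders (not current API rates), chosen only to show how blended cost compares against sending everything to the expensive model:

```python
# Assumed prices in USD per 1K tokens, for illustration only.
CHEAP_PRICE = 0.0005     # stand-in for GPT-3.5 / fine-tuned Mistral 7B
EXPENSIVE_PRICE = 0.03   # stand-in for GPT-4

def route(prompt: str) -> str:
    """Naive complexity heuristic: long prompts or reasoning-flavoured
    keywords go to the expensive model; everything else goes cheap.
    Real routers often use a small classifier instead."""
    complex_markers = ("analyze", "prove", "step by step")
    if len(prompt) > 500 or any(m in prompt.lower() for m in complex_markers):
        return "expensive"
    return "cheap"

def blended_cost(prompts: list[str], tokens_each: int = 1000) -> float:
    """Total cost when each prompt is sent to its routed model."""
    cost = 0.0
    for p in prompts:
        price = EXPENSIVE_PRICE if route(p) == "expensive" else CHEAP_PRICE
        cost += price * tokens_each / 1000
    return cost

# Hypothetical workload: 9 simple tickets, 1 complex analysis.
prompts = ["summarize this support ticket"] * 9 + [
    "analyze the failure step by step"
]
routed = blended_cost(prompts)
all_expensive = EXPENSIVE_PRICE * len(prompts)  # 1K tokens each
savings = 1 - routed / all_expensive
```

With this made-up workload the blended cost is roughly 88% lower than sending everything to the expensive model, in line with the 80-90% range above; the real figure depends entirely on your traffic mix and how accurately the router classifies complexity.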