AI Infrastructure Cycle Hits Engineering Teams

2023 was the year AI proved itself useful. 2024 is about settling into the hard engineering work. The questions now are about efficient implementation, not just possibility. If you're building AI-powered features, here's what matters.

The prompt engineering era is ending. In 2023, working with LLMs meant crafting good inputs. Now, it's about retrieval-augmented generation, embeddings, vector databases, and tool use. Models can use structured tools, not just generate text. The challenge shifts to building scaffolding that gives models the right information and actions.
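To make "structured tools" concrete, here is a minimal sketch of the scaffolding side. It assumes the model emits a JSON object with a `name` and an `arguments` field; that shape, the `get_stock_level` tool, and the `dispatch_tool_call` helper are all hypothetical and not any particular vendor's API.

```python
import json

# Hypothetical tool: a real one would query an inventory service.
def get_stock_level(sku: str) -> dict:
    fake_inventory = {"SKU-1": 12, "SKU-2": 0}
    return {"sku": sku, "in_stock": fake_inventory.get(sku, 0)}

# Registry of the tools the model is allowed to request, keyed by name.
TOOLS = {"get_stock_level": get_stock_level}

def dispatch_tool_call(raw_call: str) -> dict:
    """Parse a model-emitted JSON tool call and run the matching function."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"error": "tool call was not valid JSON"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}

    try:
        return {"result": TOOLS[name](**args)}
    except TypeError as exc:
        # Typically raised when the arguments do not match the signature.
        return {"error": f"bad arguments: {exc}"}
```

The point is that the model only proposes an action; your code decides which actions exist, validates the request, and returns errors the model can react to.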

Every organisation building AI-powered features needs an embedding pipeline. This means converting documents, products, conversations, and content into vector representations for semantic search. The vector database ecosystem, including Pinecone, Weaviate, Qdrant, and pgvector, matured in 2023. Most teams will use a managed vector store rather than building their own.

With build versus buy largely settled for the storage layer, the engineering work now focuses on the pipeline that generates embeddings and keeps them current as the underlying content changes.
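One way to picture the maintenance half of that pipeline is a sync step that re-embeds only what changed. The sketch below is an assumption-laden illustration: the in-memory `index` dict stands in for a vector store, and `toy_embed` is a placeholder, not a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]

def content_hash(text: str) -> str:
    """Fingerprint used to detect whether a document changed since last sync."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_embeddings(
    documents: Dict[str, str],
    index: Dict[str, Tuple[str, Vector]],
    embed: Callable[[str], Vector],
) -> List[str]:
    """Embed new or changed documents, drop deleted ones, return re-embedded ids."""
    updated = []
    for doc_id, text in documents.items():
        digest = content_hash(text)
        stored = index.get(doc_id)
        if stored is None or stored[0] != digest:
            index[doc_id] = (digest, embed(text))
            updated.append(doc_id)
    # Remove vectors whose source document no longer exists.
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]
    return updated

def toy_embed(text: str) -> Vector:
    # Placeholder only: derives 8 numbers from a hash, with no semantic meaning.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]
```

Storing a content hash next to each vector means a nightly run only pays for documents that actually changed, which matters once the corpus is large.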

To give you an idea of the scale, consider an e-commerce company with 10 million products. Generating embeddings for these products using a model like sentence-transformers/all-MiniLM-L6-v2 requires significant computational resources. Assume a single Nvidia A100 GPU handles around 100,000 products per hour (actual throughput varies widely with text length and batch size); one full pass over the catalogue then takes about 100 GPU-hours. A managed solution like Pinecone lets you offload vector storage and search and get started quickly, but costs can add up. For example, if you re-embed the whole catalogue every day, that is roughly 3,000 GPU-hours a month, which at a few dollars per GPU-hour comes to around $10,000 per month for the compute resources alone.
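The arithmetic behind that estimate is easy to check. In the sketch below, the catalogue size and throughput come from the example above; the $3.50 hourly GPU price, the 30-day month, and the 2% daily change rate are assumptions chosen purely for illustration.

```python
def monthly_embedding_cost(
    num_items: int,
    items_per_gpu_hour: int,
    runs_per_month: int,
    usd_per_gpu_hour: float,
) -> float:
    """Estimated monthly compute cost of repeatedly embedding num_items."""
    gpu_hours_per_run = num_items / items_per_gpu_hour
    return gpu_hours_per_run * runs_per_month * usd_per_gpu_hour

# Full re-embed of 10 million products every day (assumed $3.50 per GPU-hour).
full_refresh = monthly_embedding_cost(10_000_000, 100_000, 30, 3.50)

# Assumed alternative: only 2% of products change per day and get re-embedded.
incremental = monthly_embedding_cost(200_000, 100_000, 30, 3.50)

print(f"full refresh: ${full_refresh:,.0f}/month")  # full refresh: $10,500/month
print(f"incremental:  ${incremental:,.0f}/month")   # incremental:  $210/month
```

Under these assumptions, the daily full refresh is what drives the bill; embedding only changed items cuts it by the same factor as the change rate.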

There are three approaches to making a general-purpose LLM work for your use case: prompt engineering, retrieval-augmented generation, and fine-tuning on your data. Retrieval-augmented generation handles most use cases better than fine-tuning, at lower cost and with less risk, because the knowledge lives in an index you can update rather than in model weights. Fine-tuning remains valuable for style, format, and specialised domain knowledge that doesn't change often.

The default architecture in 2024 is retrieval-augmented generation with a well-designed prompt. Fine-tuning is reserved for specific gaps that retrieval-augmented generation doesn't address. When implementing retrieval-augmented generation, one major trade-off is between the complexity of the retrieval system and the quality of the results. For instance, you might choose exact brute-force search (a flat index in a library like FAISS), which is simple and returns the true nearest neighbours but slows down linearly as the corpus grows, or opt for an approximate index like HNSW, which is more complex to tune and trades a small amount of recall for much faster queries at scale.
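The retrieval-plus-prompt pattern can be sketched in a few lines. This example uses exact brute-force cosine similarity in plain Python, the simplest end of the trade-off; the function names, the prompt wording, and the tiny two-dimensional vectors are illustrative assumptions, and real embeddings would come from a model.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(
    query_vec: Sequence[float],
    corpus: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Exact search: score every passage against the query, keep the best k."""
    scored = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hand-made vectors standing in for real embeddings.
corpus = [
    ("Returns are accepted within 30 days.", [1.0, 0.0]),
    ("Shipping takes 3 days.", [0.0, 1.0]),
    ("Refunds go to the original card.", [0.9, 0.1]),
]
passages = top_k([1.0, 0.0], corpus, k=2)
prompt = build_prompt("How do returns work?", passages)
```

Because `top_k` scores every passage, its cost grows with corpus size; at larger scale an approximate index would replace that one function while the prompt assembly stays the same.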

At mid-2023 API prices, GPT-4 was too expensive for many production use cases at scale. OpenAI has since cut prices, and smaller, cheaper models have matured for the tasks that suit them. A pipeline routing simple tasks to GPT-3.5 or a fine-tuned Mistral 7B, and complex tasks to GPT-4, can cut AI API costs by 80-90% with minimal quality impact, provided most of your traffic is simple enough for the cheaper tier.
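A routing layer like this can start out very simple. The sketch below is a hypothetical heuristic, not a recommended policy: the tier names, the per-token prices, the task kinds, and the 4,000-character cutoff are all placeholder assumptions rather than real vendor figures.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_tokens: float  # placeholder price, not a real vendor rate

CHEAP = ModelTier("small-model", 0.001)
STRONG = ModelTier("large-model", 0.03)

# Assumed set of task kinds that a small model handles well enough.
SIMPLE_KINDS = {"classify", "extract", "summarise"}

def route(task_kind: str, prompt: str) -> ModelTier:
    """Send short, simple tasks to the cheap tier and everything else to the strong one."""
    if task_kind in SIMPLE_KINDS and len(prompt) < 4000:
        return CHEAP
    return STRONG

def blended_cost(tasks: Iterable[Tuple[str, str]], tokens_per_task: int = 1000) -> float:
    """Total cost of a batch when each task is priced at its routed tier."""
    return sum(
        route(kind, prompt).usd_per_1k_tokens * tokens_per_task / 1000
        for kind, prompt in tasks
    )

# Assumed workload: nine simple tasks for every complex one.
tasks = [("classify", "Is this review positive?")] * 9 + [("plan", "Draft a migration plan.")]
routed = blended_cost(tasks)
all_strong = len(tasks) * STRONG.usd_per_1k_tokens
saving = 1 - routed / all_strong
```

With these placeholder numbers the saving lands at roughly 87%, but the result is driven almost entirely by the share of traffic that can safely go to the cheap tier, so measure that share on your own workload before assuming a figure.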

Cost optimisation of AI pipelines is a significant engineering opportunity in 2024. This involves using the right models for the right tasks and optimising the pipeline for efficiency and cost-effectiveness.