Prompt Engineering Matters

I've seen a clear difference in application quality between developers who understand how to effectively instruct LLMs and those who don't. This difference is especially noticeable in 2023 as prompt engineering emerges as a genuine skill.

When working with LLMs, it's crucial to be specific about the format and length of the output you want. Otherwise, you'll get variable-format outputs that are hard to work with. So specify the format, length, and structure you need.

I've found that few-shot examples outperform description alone when it comes to getting the model to produce the output you want. So, if you want the model to classify support tickets into five categories, provide three examples of each category alongside the classification.

The model learns the boundary conditions from examples better than from abstract descriptions. That's why showing the model examples of the input-output behaviour you want is more reliable than describing it.

Role prompting and persona are also important considerations, as telling the model it's a specific type of expert activates the model's knowledge about how that type of expert reasons and communicates, and produces more domain-appropriate outputs than generic prompts.

For instance, "Review this code for security issues" produces different outputs than "You are a security engineer specialising in web applications, review this code". This demonstrates the value of using role prompting and persona.

In production, I've seen systems fail because the model's output wasn't structured correctly. For example, when using GPT-3.5 to generate JSON for a ticketing system, the initial prompts resulted in inconsistent keys like 'issue_type' vs 'category'. Switching to JSON mode with a strict schema cut parsing errors by 70% and reduced downstream validation code. The cost was higher token usage, but the reliability gain justified it.

When it comes to complex tasks that require multi-step reasoning, asking the model to "think step by step" before giving a final answer produces higher accuracy, as the intermediate reasoning steps keep the model on track and make errors more detectable.

Another trade-off is the number of examples. In a 2023 project, we tested 2, 5, and 10 examples for a classification task. Accuracy rose from 72% to 85% with 5 examples, but 10 examples only added 2% and doubled token cost. Beyond 5, the model started overfitting to examples, leading to poor generalization. This taught us to cap few-shot examples at 5 unless the task is highly ambiguous.

This is especially important for mathematical and logical tasks where the final answer without reasoning is hard to verify. So it's essential to use chain of thought prompting to get the most out of your LLM.