"Should we fine-tune?" is the most common question we get on a new agent project, and it is usually the wrong first question. Most teams reach for fine-tuning when what they actually need is retrieval. Here is the decision tree we run before writing a line of code.
What each one actually changes
Retrieval-augmented generation (RAG) changes what the model knows at answer time. You keep a knowledge base, fetch the relevant pieces for a given question, and put them in the prompt with citations. Fine-tuning changes how the model behaves — its format, tone, or a narrow skill — by adjusting weights on examples.
The confusion comes from treating them as substitutes. They are not. They solve different problems, and on most agent projects you want RAG first and fine-tuning only occasionally.
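To make the retrieval half concrete, here is a minimal sketch of "fetch the relevant pieces and put them in the prompt with citations." Everything in it is an assumption for illustration: the `Chunk` type, the source ids, the sample texts, and the word-overlap ranking, which stands in for whatever embedding or search index a real project would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str  # hypothetical id, used as the citation label
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real text matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, store: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by word overlap with the question and keep the top k."""
    q = tokenize(question)
    scored = [(len(q & tokenize(c.text)), c) for c in store]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Put the retrieved text in the prompt, each piece tagged with its source id."""
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


# Made-up knowledge base entries, for illustration only.
STORE = [
    Chunk("policy-v3#2", "Refunds are accepted within 30 days of purchase."),
    Chunk("pricing-2024#1", "The team plan costs 40 dollars per seat."),
    Chunk("policy-v3#5", "Gift cards are not eligible for refunds."),
]

question = "Are gift cards eligible for refunds?"
prompt = build_prompt(question, retrieve(question, STORE))
```

Updating the knowledge here means editing `STORE`; no weights change. That is the property the rest of this post leans on.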
The decision tree
- Does the answer depend on facts that change? Pricing, inventory, a policy document, a customer's record — anything that updates. → RAG. Retrain-on-every-change is not a strategy.
- Do you need citations or traceability? Regulated workflows, support, anything auditable. → RAG. Fine-tuning bakes knowledge in with no source to point at.
- Is the problem about format or behavior, not facts? "Always return this JSON shape," "match this house style." → fine-tuning is reasonable, once prompting plateaus.
- Is latency or cost from long prompts the bottleneck, and is the knowledge stable? → fine-tuning can help by moving stable knowledge into weights. Measure the latency and cost first.
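The four questions above can be written down as a small function, which is sometimes useful as a checklist in a project kickoff. This is a sketch, not a tool we ship: the `Project` fields and the returned labels are hypothetical names, "knowledge is stable" is approximated as "facts do not change", and the fallback to plain prompting when no branch fires is our assumption.

```python
from dataclasses import dataclass


@dataclass
class Project:
    facts_change: bool            # pricing, inventory, policies, customer records
    needs_citations: bool         # regulated, support, anything auditable
    behavior_problem: bool        # format or tone, not facts
    prompting_plateaued: bool     # prompting alone has stopped improving
    long_prompt_bottleneck: bool  # latency or cost driven by prompt length


def recommend(p: Project) -> list[str]:
    """Walk the decision tree and collect every recommendation that applies."""
    recs: list[str] = []
    if p.facts_change or p.needs_citations:
        recs.append("rag")
    if p.behavior_problem:
        # Fine-tuning is only reasonable once prompting has plateaued.
        recs.append("fine-tune" if p.prompting_plateaued else "prompting")
    if p.long_prompt_bottleneck and not p.facts_change:
        recs.append("measure, then consider fine-tune")
    # Assumption: with no signal either way, start with plain prompting.
    return recs or ["prompting"]


support_bot = Project(
    facts_change=True,
    needs_citations=True,
    behavior_problem=True,
    prompting_plateaued=False,
    long_prompt_bottleneck=False,
)
print(recommend(support_bot))  # ['rag', 'prompting']
```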
Why retrieval wins more often
If the knowledge changes, retrieval is the only honest answer. A model fine-tuned on last quarter's policy will confidently cite a rule that no longer exists.
A retrieval pipeline is also inspectable. When an agent answers wrong, you can look at exactly which chunks it retrieved and fix the index, the chunking, or the ranking — a normal engineering loop. A fine-tune that answers wrong is a black box and a retraining cycle.
Here is the shape of a retrieval step inside a plan, with the sources it grounded on:
$ bytevon plan step --name answer-policy-question
retrieve: k=6 store=standards-index
grounded: 3 chunks · cited · doc-rev tracked
◉ answer returned with citations · 0 ungrounded claims
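One way to think about an "ungrounded claims" count is as a check that every sentence in the answer cites a chunk that was actually retrieved. The sketch below is our own illustration of that idea, not how the tool above computes its number. It assumes citations appear as `[source-id]` inside the sentence, before the closing punctuation, and the function name and sample ids are hypothetical.

```python
import re


def ungrounded_sentences(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return sentences that cite nothing, or cite an id that was not retrieved."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        cited = set(re.findall(r"\[([^\]]+)\]", sentence))
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged


retrieved = {"policy-v3#2", "policy-v3#5"}
answer = (
    "Refunds are accepted within 30 days [policy-v3#2]. "
    "Gift cards are excluded [policy-v3#5]. "
    "Shipping is always free."
)
print(ungrounded_sentences(answer, retrieved))  # ['Shipping is always free.']
```

A check like this only proves a citation is present and points at a retrieved chunk. Whether the chunk actually supports the sentence still needs a human or a stronger evaluator.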
Where they combine
The strong pattern on serious projects: RAG for knowledge, light fine-tuning (or just good prompting) for behavior. Retrieve the facts; shape the output with a small amount of format tuning if prompting alone cannot get there. Start with retrieval, measure, and only reach for weights when you can name the specific behavior you cannot get any other way.
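"Measure" can be as simple as tracking how often prompting alone produces the output shape you need. The sketch below scores a batch of model outputs against a required JSON shape; the `REQUIRED` keys and the sample outputs are invented for illustration, and the threshold at which you decide prompting has plateaued is a project decision, not something this code answers.

```python
import json


def matches_shape(raw: str, required: dict[str, type]) -> bool:
    """True if raw parses as a JSON object with every required key and type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def compliance_rate(outputs: list[str], required: dict[str, type]) -> float:
    """Fraction of outputs that match the required shape (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(matches_shape(o, required) for o in outputs) / len(outputs)


# Hypothetical target shape and sample outputs.
REQUIRED = {"answer": str, "sources": list}
outputs = [
    '{"answer": "30 days", "sources": ["policy-v3#2"]}',
    "Sure! Here is the answer: 30 days",
    '{"answer": "yes"}',
    '{"answer": "no", "sources": []}',
]
print(compliance_rate(outputs, REQUIRED))  # 0.5
```

If the rate stays flat across several rounds of prompt changes, that is the "specific behavior you cannot get any other way," and a small format-tuning set becomes easier to justify.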