Inference
Running a trained model to generate output. The recurring, day-to-day cost of any production LLM system, distinct from the one-time cost of training.
Last updated: April 26, 2026
Definition
Inference is the act of running an already-trained model on a new input to generate output. For LLMs, this is the API call made every time the model produces a response. Inference cost dominates total LLM spend by a wide margin: training cost is large but paid once, while inference cost scales linearly with usage and recurs for the life of the system. Three factors drive it: model size (larger models cost more per token), input length (more tokens to process), and output length (more tokens to generate). Cost-engineering an LLM system is, in practice, the work of optimizing inference cost.
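The three cost drivers above combine into a simple per-request model. A minimal sketch, using placeholder per-token prices (illustrative assumptions, not any provider's real rates):

```python
# Minimal per-request inference cost model.
# Prices below are illustrative placeholders, not real provider rates.

PRICE_PER_1K_INPUT = 0.003   # assumed $ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.015  # assumed $ per 1,000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one inference call: scales with input and output length.

    Model size enters through the per-token prices themselves, since
    larger models are priced higher per token.
    """
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

# Example: a 2,000-token prompt with a 500-token response.
cost = request_cost(2000, 500)
```

Note that output tokens are priced several times higher than input tokens here, which mirrors a common pricing pattern: generation is more expensive than processing, so controlling output length often pays off more than trimming prompts.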
When To Use
Track inference cost from day one. The single most useful metric is cost per user interaction: if that number rises without a commensurate increase in the value delivered, you have a problem.
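Tracking that metric can be as simple as accumulating cost and interaction counts per day and dividing. A hypothetical sketch (the `CostTracker` class and its method names are made up for illustration):

```python
# Hypothetical tracker for cost per user interaction, bucketed by day.
from collections import defaultdict

class CostTracker:
    def __init__(self) -> None:
        self.total_cost: dict[str, float] = defaultdict(float)
        self.interactions: dict[str, int] = defaultdict(int)

    def record(self, day: str, cost: float) -> None:
        """Log the inference cost of one user interaction."""
        self.total_cost[day] += cost
        self.interactions[day] += 1

    def cost_per_interaction(self, day: str) -> float:
        """The metric to watch: average inference cost per interaction."""
        n = self.interactions[day]
        return self.total_cost[day] / n if n else 0.0

tracker = CostTracker()
tracker.record("2026-04-26", 0.0135)
tracker.record("2026-04-26", 0.0090)
avg = tracker.cost_per_interaction("2026-04-26")
```

Comparing this average day over day is what surfaces the warning sign described above: a rising trend with flat value delivered.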
Related Terms
Building with Inference?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.