Inference
Running a trained model to generate output. The recurring, day-to-day cost of any production LLM system, distinct from the one-time cost of training.
Last updated: April 26, 2026
Definition
Inference is the act of running an already-trained model on a new input to generate output. For LLMs, this is the API call made every time the model produces a response. Inference cost dominates total LLM spend by a wide margin: training cost is large but paid once, while inference cost scales linearly with usage and recurs for the life of the system. Three factors drive it: model size (larger models cost more per token), input length (more tokens to process), and output length (more tokens to generate). Cost-engineering an LLM system is, in practice, the work of optimizing inference cost.
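The three cost drivers above combine into a simple per-request model. A minimal sketch, using placeholder per-token prices (illustrative assumptions, not any provider's real rates):

```python
# Minimal per-request inference cost model.
# Prices below are illustrative placeholders, not real provider rates.

PRICE_PER_1K_INPUT = 0.003   # assumed $ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.015  # assumed $ per 1,000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one inference call: scales with input and output length.

    Model size enters through the per-token prices themselves, since
    larger models are priced higher per token.
    """
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

# Example: a 2,000-token prompt with a 500-token response.
cost = request_cost(2000, 500)
```

Note that output tokens are priced several times higher than input tokens here, which mirrors a common pricing pattern: generation is more expensive than processing, so controlling output length often pays off more than trimming prompts.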
When To Use
Track inference cost from day one. The single most useful metric is cost per user interaction: if that number rises without a commensurate increase in the value delivered, you have a problem.
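Tracking that metric can be as simple as accumulating cost and interaction counts per day and dividing. A hypothetical sketch (the `CostTracker` class and its method names are made up for illustration):

```python
# Hypothetical tracker for cost per user interaction, bucketed by day.
from collections import defaultdict

class CostTracker:
    def __init__(self) -> None:
        self.total_cost: dict[str, float] = defaultdict(float)
        self.interactions: dict[str, int] = defaultdict(int)

    def record(self, day: str, cost: float) -> None:
        """Log the inference cost of one user interaction."""
        self.total_cost[day] += cost
        self.interactions[day] += 1

    def cost_per_interaction(self, day: str) -> float:
        """The metric to watch: average inference cost per interaction."""
        n = self.interactions[day]
        return self.total_cost[day] / n if n else 0.0

tracker = CostTracker()
tracker.record("2026-04-26", 0.0135)
tracker.record("2026-04-26", 0.0090)
avg = tracker.cost_per_interaction("2026-04-26")
```

Comparing this average day over day is what surfaces the warning sign described above: a rising trend with flat value delivered.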
Related Terms
Building with Inference?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.