Jahanzaib

Quantization

Compressing a neural network by reducing the numerical precision of its weights (e.g., 16-bit float to 4-bit integer) for faster, cheaper inference.

Last updated: April 26, 2026

Definition

Quantization shrinks model weights from their training-time precision (typically FP16 or BF16) to lower-precision formats: INT8, INT4, or even lower. The compressed model uses a fraction of the memory, runs faster on the same hardware, and costs less to serve. For foundation models the quality loss is typically small (1-3 percent on benchmarks for INT8, 3-7 percent for INT4), and for narrow workloads it is often invisible. Quantization is the technology that makes serious LLMs runnable on laptops, phones, and edge devices. llama.cpp, Ollama, and most modern inference servers support quantized model formats out of the box.
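The core mechanic is simple enough to show in a few lines. Here is a minimal numpy sketch of symmetric per-tensor INT8 quantization: pick a scale so the largest weight maps to 127, round everything to integers, and dequantize by multiplying the scale back. Function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

q, scale = quantize_int8(w)   # 1 byte per weight instead of 4
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
max_err = np.max(np.abs(w - w_hat))
```

Real systems refine this basic idea (per-channel or per-group scales, asymmetric zero points, calibration data), but the memory math is the same: INT8 stores each weight in 1 byte versus 2 for FP16, plus one scale per tensor or group.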

When To Use

Use quantized models for any self-hosted deployment where latency or cost matters. INT8 is the safe default; INT4 is fine for most chat workloads but verify quality on your evals first.
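Before committing to INT4, it is worth checking how much reconstruction error each bit width introduces on your own weights. A hedged sketch of that check, simulating round-trip ("fake") quantization at a given bit width; the helper name `fake_quant` is illustrative:

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric round-trip quantization at the given bit width;
    # returns mean absolute reconstruction error.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return np.mean(np.abs(w - q * scale))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

err8 = fake_quant(w, 8)   # INT8: 255 levels
err4 = fake_quant(w, 4)   # INT4: 15 levels
```

Weight-level error is only a proxy: the numbers above tell you INT4 is coarser (each halving of bit width roughly doubles the step size), but only a run through your actual eval set tells you whether the end task notices.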

Building with Quantization?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.