Jahanzaib

Quantization

Compressing a neural network by reducing the numerical precision of its weights (e.g., 16-bit float to 4-bit integer) for faster, cheaper inference.

Last updated: April 26, 2026

Definition

Quantization shrinks model weights from their training-time precision (typically FP16 or BF16) to lower-precision formats: INT8, INT4, or even lower. The compressed model uses a fraction of the memory, runs faster on the same hardware, and costs less to serve. For foundation models the quality loss is typically small (1-3 percent on benchmarks for INT8, 3-7 percent for INT4), and for narrow workloads it is often invisible. Quantization is the technology that makes serious LLMs runnable on laptops, phones, and edge devices. llama.cpp, Ollama, and most modern inference servers support quantized model formats out of the box.
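The core mechanic is simple enough to show in a few lines. Here is a minimal numpy sketch of symmetric per-tensor INT8 quantization: pick a scale so the largest weight maps to 127, round everything to integers, and dequantize by multiplying the scale back. Function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

q, scale = quantize_int8(w)   # 1 byte per weight instead of 4
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
max_err = np.max(np.abs(w - w_hat))
```

Real systems refine this basic idea (per-channel or per-group scales, asymmetric zero points, calibration data), but the memory math is the same: INT8 stores each weight in 1 byte versus 2 for FP16, plus one scale per tensor or group.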

When To Use

Use quantized models for any self-hosted deployment where latency or cost matters. INT8 is the safe default; INT4 is fine for most chat workloads but verify quality on your evals first.
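Before committing to INT4, it is worth checking how much reconstruction error each bit width introduces on your own weights. A hedged sketch of that check, simulating round-trip ("fake") quantization at a given bit width; the helper name `fake_quant` is illustrative:

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric round-trip quantization at the given bit width;
    # returns mean absolute reconstruction error.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return np.mean(np.abs(w - q * scale))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

err8 = fake_quant(w, 8)   # INT8: 255 levels
err4 = fake_quant(w, 4)   # INT4: 15 levels
```

Weight-level error is only a proxy: the numbers above tell you INT4 is coarser (each halving of bit width roughly doubles the step size), but only a run through your actual eval set tells you whether the end task notices.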

Building with Quantization?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.