Small Language Model (SLM)
A lightweight LLM (typically around 1–15B parameters) optimized for low cost, low latency, and on-device or edge deployment.
Last updated: April 26, 2026
Definition
Small language models trade some capability for large wins in cost, latency, and deployment flexibility. Common SLMs in 2026 include Phi-4 (Microsoft, ~14B), Llama 3.1 8B (Meta), Gemma 3 (Google), Qwen 2.5 (Alibaba), and Mistral 7B. They run on a single consumer GPU (or even a CPU with quantization), can return a first token in well under 100 ms, and cost essentially nothing per call when self-hosted. On hard reasoning they sit meaningfully below frontier models, but on narrow tasks (classification, structured extraction, intent detection, fast routing) they often match or beat frontier models.
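The "fits on a consumer GPU" claim follows from simple arithmetic: weight memory is just parameter count times bytes per parameter. A minimal sketch (model sizes and precisions below are illustrative; KV cache and activations are ignored):

```python
# Back-of-envelope memory footprint for a model's weights.
# Formula: parameters x bits per parameter / 8 = bytes.
# The sizes used below are illustrative assumptions.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# An 8B model in fp16 needs ~16 GB of weights; quantized to 4-bit it
# fits in ~4 GB, which is why quantized SLMs run on consumer hardware.
print(weight_memory_gb(8, 16))  # 16.0
print(weight_memory_gb(8, 4))   # 4.0
```

This is why 4-bit quantization is the default deployment mode for SLMs: it turns a 16 GB fp16 checkpoint into a ~4 GB artifact that fits comfortably in the VRAM of a mid-range consumer GPU, or in ordinary system RAM for CPU inference.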
When To Use
Use SLMs for high-volume, low-complexity tasks: classification, routing, extraction, simple summarization. Fall back to a frontier model only for the cases the SLM cannot handle.
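The SLM-first, frontier-fallback pattern above can be sketched as a confidence-gated router. Everything here is a stand-in: `classify_with_slm`, `call_frontier_model`, and the threshold value are hypothetical names and numbers you would replace with your own models and tune per task.

```python
# Sketch of SLM-first routing with frontier-model fallback.
# The model calls are stubs; in practice classify_with_slm would hit a
# self-hosted SLM and call_frontier_model a paid API. The threshold is
# an assumption to calibrate against a labeled evaluation set.

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff

def classify_with_slm(text: str) -> tuple[str, float]:
    """Stand-in for a self-hosted SLM returning (label, confidence)."""
    label = "billing" if "invoice" in text.lower() else "other"
    confidence = 0.95 if label == "billing" else 0.40
    return label, confidence

def call_frontier_model(text: str) -> str:
    """Stand-in for the expensive frontier-model fallback."""
    return "general_inquiry"

def route(text: str) -> str:
    label, confidence = classify_with_slm(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                    # cheap, fast path: SLM handles it
    return call_frontier_model(text)    # rare, expensive fallback

print(route("Where is my invoice?"))  # billing (SLM path)
print(route("Explain my contract"))   # general_inquiry (fallback path)
```

The economics of this pattern come from the traffic split: if the SLM confidently handles 90% of requests, the frontier model's per-call cost applies to only the remaining 10%.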
Building with Small Language Model (SLM)?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.