Jahanzaib

Small Language Model (SLM)

A lightweight LLM (typically 1 to 14B parameters) optimized for low cost, low latency, and on-device or edge deployment.

Last updated: April 26, 2026

Definition

Small language models trade some capability for big wins in cost, latency, and deployment flexibility. Common SLMs in 2026: Phi-4 (Microsoft, ~14B), Llama 3.1 8B (Meta), Gemma 3 (Google), Qwen 2.5 (Alibaba), Mistral 7B. They run on a single consumer GPU (or even a CPU with quantization), can respond in under 100 ms, and cost essentially nothing per call when self-hosted. Capability is meaningfully below frontier models on hard reasoning, but on narrow tasks (classification, structured extraction, intent detection, fast routing) they often match or beat frontier models.
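The "single consumer GPU" claim follows from simple memory arithmetic. A rough sketch (the helper function is mine; it counts weights only and ignores KV cache and activation overhead):

```python
# Back-of-the-envelope memory math for SLM deployment:
# weight footprint = parameter count x bits per parameter / 8.
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB, weights only."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# An 8B model in fp16 needs ~16 GB -- too big for a 12 GB consumer card.
print(weight_memory_gb(8, 16))  # 16.0
# Quantized to 4 bits it needs ~4 GB and fits with room for the KV cache.
print(weight_memory_gb(8, 4))   # 4.0
```

This is why 4-bit quantization, not just small parameter counts, is what makes on-device and edge deployment practical.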

When To Use

Use SLMs for high-volume, low-complexity tasks: classification, routing, extraction, simple summarization. Fall back to a frontier model only for the cases the SLM cannot handle.
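The SLM-first fallback pattern can be sketched as a confidence-gated cascade. Everything here is illustrative: `call_slm`, `call_frontier`, and the threshold value are hypothetical stand-ins for your real model clients and a calibrated cutoff.

```python
# SLM-first cascade: try the cheap local model, escalate to a frontier
# model only when the SLM's confidence falls below a threshold.
from dataclasses import dataclass

@dataclass
class Result:
    label: str
    confidence: float  # model-reported or calibrated score in [0, 1]

def call_slm(text: str) -> Result:
    # Placeholder: imagine a local 8B classifier behind this function.
    return Result(label="refund_request", confidence=0.93)

def call_frontier(text: str) -> Result:
    # Placeholder: expensive hosted model, used only on escalation.
    return Result(label="refund_request", confidence=0.99)

def classify(text: str, threshold: float = 0.85) -> tuple[Result, str]:
    """Return (result, which_model); escalate when the SLM is unsure."""
    r = call_slm(text)
    if r.confidence >= threshold:
        return r, "slm"
    return call_frontier(text), "frontier"

result, route = classify("I want my money back for order #123")
print(route)  # "slm" -- the cheap path handled it
```

The key design choice is the threshold: set it from a labeled validation set so that escalated traffic (the expensive path) stays a small, bounded fraction of total volume.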

Building with Small Language Model (SLM)?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.