Small Language Model (SLM)
A lightweight LLM (typically around 1–15B parameters) optimized for low cost, low latency, and on-device or edge deployment.
Last updated: April 26, 2026
Definition
Small language models trade some capability for large wins in cost, latency, and deployment flexibility. Common SLMs in 2026 include Phi-4 (Microsoft, ~14B), Llama 3.1 8B (Meta), Gemma 3 (Google), Qwen 2.5 (Alibaba), and Mistral 7B. They run on a single consumer GPU (or even a CPU with quantization), can return a first token in well under 100 ms, and cost essentially nothing per call when self-hosted. On hard reasoning they sit meaningfully below frontier models, but on narrow tasks (classification, structured extraction, intent detection, fast routing) they often match or beat frontier models.
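The "fits on a consumer GPU" claim follows from simple arithmetic: weight memory is just parameter count times bytes per parameter. A minimal sketch (model sizes and precisions below are illustrative; KV cache and activations are ignored):

```python
# Back-of-envelope memory footprint for a model's weights.
# Formula: parameters x bits per parameter / 8 = bytes.
# The sizes used below are illustrative assumptions.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# An 8B model in fp16 needs ~16 GB of weights; quantized to 4-bit it
# fits in ~4 GB, which is why quantized SLMs run on consumer hardware.
print(weight_memory_gb(8, 16))  # 16.0
print(weight_memory_gb(8, 4))   # 4.0
```

This is why 4-bit quantization is the default deployment mode for SLMs: it turns a 16 GB fp16 checkpoint into a ~4 GB artifact that fits comfortably in the VRAM of a mid-range consumer GPU, or in ordinary system RAM for CPU inference.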
When To Use
Use SLMs for high-volume, low-complexity tasks: classification, routing, extraction, simple summarization. Fall back to a frontier model only for the cases the SLM cannot handle.
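The SLM-first, frontier-fallback pattern above can be sketched as a confidence-gated router. Everything here is a stand-in: `classify_with_slm`, `call_frontier_model`, and the threshold value are hypothetical names and numbers you would replace with your own models and tune per task.

```python
# Sketch of SLM-first routing with frontier-model fallback.
# The model calls are stubs; in practice classify_with_slm would hit a
# self-hosted SLM and call_frontier_model a paid API. The threshold is
# an assumption to calibrate against a labeled evaluation set.

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cutoff

def classify_with_slm(text: str) -> tuple[str, float]:
    """Stand-in for a self-hosted SLM returning (label, confidence)."""
    label = "billing" if "invoice" in text.lower() else "other"
    confidence = 0.95 if label == "billing" else 0.40
    return label, confidence

def call_frontier_model(text: str) -> str:
    """Stand-in for the expensive frontier-model fallback."""
    return "general_inquiry"

def route(text: str) -> str:
    label, confidence = classify_with_slm(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                    # cheap, fast path: SLM handles it
    return call_frontier_model(text)    # rare, expensive fallback

print(route("Where is my invoice?"))  # billing (SLM path)
print(route("Explain my contract"))   # general_inquiry (fallback path)
```

The economics of this pattern come from the traffic split: if the SLM confidently handles 90% of requests, the frontier model's per-call cost applies to only the remaining 10%.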
Building with Small Language Model (SLM)?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.