
Self-Correction

An agent's ability to detect errors in its own outputs and revise them without external intervention.

Last updated: April 26, 2026

Definition

Self-correction is the practice of having an agent check its own work and fix mistakes before delivering it. The simplest pattern is structural: after generating an answer, ask the same model, "Is this correct? If not, what is wrong, and what is the corrected version?" Frontier models often catch their own arithmetic errors, format violations, and schema mismatches through this loop. More advanced patterns use a separate critic model (Constitutional AI, LLM-as-judge) to evaluate the original output independently, which mitigates the generating model's self-confirmation bias.
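The structural loop described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: call_model is a hypothetical stand-in for whatever LLM client you use, and the "REVISED:"/"OK" convention is just one arbitrary way to make the critique parseable.

```python
def self_correct(prompt, call_model, max_rounds=2):
    """Generate an answer, then ask the same model to critique and revise it."""
    answer = call_model(prompt)
    for _ in range(max_rounds):
        critique = call_model(
            f"Question: {prompt}\nAnswer: {answer}\n"
            "Is this correct? If not, state what is wrong and give ONLY the "
            "corrected answer, prefixed with 'REVISED:'. If correct, reply 'OK'."
        )
        if critique.strip() == "OK":
            break  # model endorses its own answer; stop iterating
        if critique.startswith("REVISED:"):
            answer = critique[len("REVISED:"):].strip()
    return answer
```

Capping the loop with max_rounds matters in practice: without it, a model that keeps second-guessing itself can oscillate between two answers indefinitely.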

Self-correction has a known limit: a model cannot reliably catch errors on topics where it does not know the right answer. Asking GPT-5 to self-correct a math problem it got wrong helps; asking it to self-correct a claim about a 2026 event absent from its training data does not. For high-stakes self-correction, pair it with grounding (RAG retrieval, tool verification) so the critic has external truth to check against. Self-correction is best for catching format errors, off-by-one bugs, and constraint violations, not factual hallucinations.
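Pairing the critic with grounding might look like the sketch below. Everything here is an assumption for illustration: retrieve is a hypothetical hook into your RAG stack, call_model again stands in for the LLM client, and the "SUPPORTED"/"CONTRADICTED:" verdict protocol is one possible convention.

```python
def grounded_correct(question, answer, retrieve, call_model):
    """Critic with external evidence: fetch supporting text, then ask the
    model to check the answer against it rather than against its own priors."""
    evidence = "\n".join(retrieve(question))
    verdict = call_model(
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer: {answer}\n"
        "Using ONLY the evidence above, reply 'SUPPORTED' or "
        "'CONTRADICTED: <correct answer>'."
    )
    if verdict.startswith("CONTRADICTED:"):
        return verdict[len("CONTRADICTED:"):].strip()
    return answer
```

The key design point is that the critic prompt forbids reasoning from the model's parametric knowledge, which is exactly what an ungrounded self-check would fall back on.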

When To Use

Add a self-correction pass before any irreversible action or any user-facing output where format/structure correctness matters. Pair with grounding for fact-correctness use cases.
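For user-facing output where structure matters, the pre-delivery pass can be as simple as parse-then-repair. A sketch, assuming JSON output; call_model is a hypothetical client wrapper and the repair prompt is illustrative.

```python
import json

def deliver_json(prompt, call_model, max_repairs=2):
    """Validate model output as JSON before delivering; on failure, feed the
    parse error back to the model for a repair attempt."""
    out = call_model(prompt)
    for _ in range(max_repairs):
        try:
            return json.loads(out)
        except json.JSONDecodeError as err:
            out = call_model(
                "The following was supposed to be valid JSON but failed to "
                f"parse ({err.msg}). Return ONLY the corrected JSON:\n{out}"
            )
    # Still broken after repairs: raise rather than deliver malformed output.
    return json.loads(out)
```

Because the validator is deterministic code, this variant sidesteps the self-confirmation problem entirely: the model is only asked to fix a concrete, externally detected error.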

