
Exponential Backoff

Retry strategy that doubles the wait between attempts to give a struggling system room to recover.

Last updated: April 26, 2026

Definition

When an LLM API returns 429 (rate limited) or 5xx (transient error), retrying immediately makes things worse. Exponential backoff waits longer between each retry: 1s, 2s, 4s, 8s. Add jitter (random ±25 percent) to prevent thundering-herd retries. Cap total retry time so requests fail fast on permanent errors. The Anthropic and OpenAI SDKs have backoff built in; your job is mostly to not disable it. Custom code paths (background jobs, webhook handlers) need explicit backoff.
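The doubling, jitter, and cap described above can be sketched as a small helper. This is illustrative, not from either SDK; the `base`, `cap`, and `jitter` defaults are assumptions:

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25):
    # doubling schedule: 1s, 2s, 4s, ... clamped at the cap
    delay = min(cap, base * (2 ** attempt))
    # +/-25% jitter spreads simultaneous retries so they don't re-collide
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```

For attempt 0 this yields roughly 0.75-1.25s; by attempt 5 the 30s cap dominates and only the jitter varies.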

Code Example

python
import asyncio, random

from anthropic import RateLimitError  # or your SDK's 429 exception

async def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error to the caller
            # exponential wait (1s, 2s, 4s, ...) plus jitter to spread retries
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)

Waits of 1s, 2s, 4s, 8s (each plus up to 1s of jitter) between the five attempts, then the error surfaces. Cap retries so permanent failures fail fast.

When To Use

Required on every LLM call in production. The default in official SDKs is good. Just don't disable it.
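In SDK terms, "don't disable it" mostly means leaving the client's retry setting at or above its default. A sketch, assuming the Anthropic Python SDK's `max_retries` constructor option (the value shown is an arbitrary choice):

```python
from anthropic import Anthropic

# the SDK retries 429s and transient 5xx responses with backoff on its own;
# setting max_retries=0 is the thing to avoid
client = Anthropic(max_retries=4)
```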

Building with Exponential Backoff?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.