Service resilience — fallback chains, alerts, friendly errors

The hardest part of running an AI-heavy product is provider variance. One day Anthropic is up, the next day a content classifier false-positives. One week OpenAI deprecates dall-e-3 overnight. One Tuesday afternoon a KIE outage kills image generation across every venue. None of these are catastrophic on their own — but they’re catastrophic if the operator opens the Composer, hits Generate, and sees a red Server Error panel with no path forward.

Service resilience is three connected systems that work together to keep the surface calm even when the underlying providers misbehave.

The three layers

1. Fallback chains decide what happens when a provider fails. Instead of “primary + one backup hardcoded in source”, the platform walks a super-admin-curated list of providers in order. KIE first, then a second KIE model, then a third, then OpenAI. The first one that returns a result wins.

2. Service alerts notify the super admin when an entire chain has been exhausted — every candidate failed for one request. A row lands in the Alerts inbox and (if Slack is configured) a webhook pings. Silent failures are the failure mode this prevents.

3. Friendly errors decide what the operator sees if a chain truly does exhaust. Not a raw stack trace. Not a red [CONVEX A(...)] Server Error. An amber banner with a calm sentence — “We couldn’t generate your image right now. Try again in a minute, or pick a different style” — and a collapsible “Show technical details” expander for support tickets.

The rule

The operator sees calm copy. The super admin sees everything.

Two surfaces, two audiences. The Composer’s amber banner is what the venue’s marketing person sees when KIE has a hiccup. The super-admin Alerts inbox is what we see, with the full cascade audit (“KIE nano-banana → ‘sensitive’”, “OpenAI gpt-image-1 → 429”), the timestamps, the venue ID, the failure reason. The first surface protects trust; the second surface protects us.

How the fallback chains work

Open Super Admin → AI Fallbacks. Four tabs run across the top: Image, Chat / Text, OCR, Embedding. Each tab holds the list of provider candidates for that service kind, tried from top to bottom.

Each candidate is a card showing:

A priority number (1, 2, 3, …) on the top-left
The display label (e.g. “KIE Nano Banana 2”)
The provider chip strip — clicking a different provider chip changes which lab this slot uses
The model dropdown — populated from the provider registry, filtered by capability
A Ready badge when the env key is configured, or a Skipped badge when the env key is missing or the candidate is disabled
Reorder controls (▲ / ▼) to change the order
An enable toggle as a chip — flip off without removing
A delete button — remove the candidate entirely
An Advanced collapsible — env key name, base URL override, internal notes

At the bottom of the list, two buttons: Add provider (inserts a blank row at the end) and Reset to defaults (replaces the chain with the platform’s seeded baseline). The “Save changes” button on the right commits the changes to the database.
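The card fields above map naturally onto a small record type. The following is a minimal sketch, not the platform's actual schema: the `FallbackCandidate` shape and the `renumber`, `move`, and `setModel` helpers are hypothetical names chosen for illustration. It shows how reordering and a model swap could behave in the editor before Save commits the list.

```typescript
// Hypothetical shape of one fallback candidate. Field names are illustrative
// and are not taken from the platform's real schema.
interface FallbackCandidate {
  priority: number;     // 1-based position shown on the card
  label: string;        // display label, e.g. "KIE Nano Banana 2"
  provider: string;     // selected provider chip
  model: string;        // selected model from the dropdown
  enabled: boolean;     // enable toggle; false keeps the row but skips it
  envKeyName?: string;  // Advanced: env key name
  baseUrl?: string;     // Advanced: base URL override
  notes?: string;       // Advanced: internal notes
}

// Renumber priorities so they always read 1, 2, 3, ... in list order.
function renumber(chain: FallbackCandidate[]): FallbackCandidate[] {
  return chain.map((c, i) => ({ ...c, priority: i + 1 }));
}

// Move the candidate at `index` one slot up (-1) or down (+1), as the
// reorder controls would. Out-of-range moves return the chain unchanged.
function move(
  chain: FallbackCandidate[],
  index: number,
  delta: -1 | 1,
): FallbackCandidate[] {
  const target = index + delta;
  if (index < 0 || index >= chain.length || target < 0 || target >= chain.length) {
    return chain;
  }
  const next = [...chain];
  [next[index], next[target]] = [next[target], next[index]];
  return renumber(next);
}

// Swap the model on one row without touching the others.
function setModel(
  chain: FallbackCandidate[],
  index: number,
  model: string,
): FallbackCandidate[] {
  return chain.map((c, i) => (i === index ? { ...c, model } : c));
}
```

Each helper returns a new array instead of mutating the input, which keeps unsaved edits easy to discard until Save is clicked.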

What the runtime does with the chain

When the Composer’s Generate action runs:

It resolves the venue’s chosen primary (whatever the venue picked in their AI Settings).
It loads the super-admin chain for that service kind (Image, Chat, etc.).
It tries the primary first; if that fails, it walks the chain top to bottom.
For each candidate, if the first call fails with a sensitivity / safety / moderation flag AND a reference image was attached, it retries WITHOUT the reference image (most KIE false-positives clear on text-only retry).
The first candidate that returns a result wins — the operator gets their image and doesn’t know which leg of the chain answered.
If every candidate fails, the action throws a friendly error AND writes an alert row.
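The steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the platform's implementation: `Candidate`, `runCascade`, `ChainExhaustedError`, and the `writeAlert` callback are hypothetical names, and detecting a moderation flag by matching words in the error message is a simplification.

```typescript
// All names in this sketch are hypothetical.
interface Candidate {
  label: string;
  enabled: boolean;
  // Resolves with an image URL, or rejects with an Error describing the failure.
  generate: (prompt: string, referenceImage?: string) => Promise<string>;
}

interface Attempt {
  label: string;
  error: string;
}

class ChainExhaustedError extends Error {
  readonly attempts: Attempt[];
  constructor(attempts: Attempt[]) {
    super("All fallback candidates failed");
    this.name = "ChainExhaustedError";
    this.attempts = attempts;
  }
}

function messageOf(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Simplified heuristic: treat these words in the message as a moderation flag.
function isModerationFlag(message: string): boolean {
  return /sensitiv|safety|moderation/i.test(message);
}

async function tryCandidate(
  c: Candidate,
  prompt: string,
  referenceImage?: string,
): Promise<string> {
  try {
    return await c.generate(prompt, referenceImage);
  } catch (err) {
    // Retry text-only when a moderation flag fired and a reference image was attached.
    if (referenceImage !== undefined && isModerationFlag(messageOf(err))) {
      return await c.generate(prompt);
    }
    throw err;
  }
}

async function runCascade(opts: {
  primary: Candidate;
  chain: Candidate[];
  prompt: string;
  referenceImage?: string;
  writeAlert: (attempts: Attempt[]) => void;
}): Promise<string> {
  const attempts: Attempt[] = [];
  // Primary first, then the super-admin chain top to bottom.
  for (const c of [opts.primary, ...opts.chain]) {
    if (!c.enabled) continue; // disabled candidates are skipped
    try {
      return await tryCandidate(c, opts.prompt, opts.referenceImage);
    } catch (err) {
      attempts.push({ label: c.label, error: messageOf(err) });
    }
  }
  // Every candidate failed: record the audit trail, then throw.
  opts.writeAlert(attempts);
  throw new ChainExhaustedError(attempts);
}
```

The `attempts` array is what would feed the per-attempt audit in the Alerts inbox; the caller is responsible for turning the thrown error into operator-facing copy.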

The credit gate auto-refunds when the cascade exhausts, so the operator isn’t charged for a failed request.
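A refund-on-throw gate can be sketched as a wrapper around the action. The `Ledger` type and `withCreditGate` function below are hypothetical and keep the balance in memory; the real credit gate is not shown in this document and presumably persists to the database.

```typescript
// Hypothetical in-memory ledger, for illustration only.
interface Ledger {
  balance: number;
}

async function withCreditGate<T>(
  ledger: Ledger,
  cost: number,
  action: () => Promise<T>,
): Promise<T> {
  if (ledger.balance < cost) {
    throw new Error("Insufficient credits");
  }
  ledger.balance -= cost; // charge up front
  try {
    return await action();
  } catch (err) {
    ledger.balance += cost; // refund on any throw, including chain exhaustion
    throw err;
  }
}
```

Because the refund sits in the `catch` branch, the error still propagates to the caller after the balance is restored, so the friendly-error path is unaffected.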

Worked example: the dall-e-3 outage

Tuesday morning, OpenAI deprecates dall-e-3 without notice. Our existing fallback chain has dall-e-3 as the backup behind KIE. Tuesday afternoon, KIE’s content classifier flags a meat-pasta image as “sensitive”. The chain falls back to dall-e-3, gets a 404 (“The model ‘dall-e-3’ does not exist”), and exhausts.

Without service resilience: the operator sees a red Server Error. Doesn’t know what to do. Stops using the Composer. Tells their friend “the AI thing in BiteTheMenu is broken.”

With service resilience:

The operator sees an amber banner: “We couldn’t generate your image right now. Try again in a minute, or pick a different style.”
The super admin sees a row in /admin/super/alerts with the cascade audit: “KIE nano-banana → ‘sensitive’”, “OpenAI dall-e-3 → 404 ‘model does not exist’”.
The super admin opens /admin/super/ai-fallbacks, swaps dall-e-3 → gpt-image-1 on the Image tab, hits Save. Done. No code deploy.
The next operator who clicks Generate succeeds, and so does the first operator when they hit Generate again 30 seconds later.

The whole loop, start to finish, takes about two minutes — and it doesn’t need an engineer.

How to configure the chain

The platform ships with sensible defaults. You probably don’t need to touch them unless a model gets deprecated or you want to try a new provider.

Default Image chain

KIE Nano Banana 2 — Gemini 3.1 Flash, fast, up to 4K. Primary.
KIE Nano Banana Pro — Gemini 3 Pro, higher quality. Use when Banana 2 underwhelms.
KIE Nano Banana Edit — Reference-image-aware. Useful when we want the venue’s own dish photo in the loop.
OpenAI gpt-image-1 — Last-ditch outside KIE. More expensive per call; only fires if every KIE leg fails.

Default Chat chain

Claude Sonnet 4.6 (via OpenRouter) — Anthropic flagship for writing voice.
GPT-4o (via OpenRouter) — OpenAI flagship as second.
Qwen3 235B (via OpenRouter) — Alibaba flagship for Chinese / EN bilingual diversity.
DeepSeek V3 (via OpenRouter) — Cheap.
DeepSeek V4 direct — Cheapest floor, no OpenRouter markup.

Adding a custom provider

Click Add provider at the bottom of the list. Pick a provider from the chip strip (OpenAI, OpenRouter, Anthropic, etc.). Pick a model from the dropdown. Open the Advanced collapsible to confirm the env key name (e.g. OPENROUTER_API_KEY) and base URL.

If the provider needs an env key that isn’t set yet, the card shows a “Skipped” badge — the runtime silently skips it until the key is set in Convex env.
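The badge logic can be sketched as a pure function over the candidate and the environment. The names `Slot`, `badgeFor`, and `runnable` are hypothetical, and treating both a missing key and a disabled toggle as Skipped is an assumption made for this sketch.

```typescript
type Badge = "Ready" | "Skipped";

// Hypothetical minimal view of a candidate; only the fields the badge needs.
interface Slot {
  enabled: boolean;
  envKeyName?: string; // e.g. "OPENROUTER_API_KEY"
}

type Env = Record<string, string | undefined>;

function badgeFor(slot: Slot, env: Env): Badge {
  if (!slot.enabled) return "Skipped";
  // A named env key that is unset or empty means the provider cannot be called.
  if (slot.envKeyName !== undefined && !env[slot.envKeyName]) return "Skipped";
  return "Ready";
}

// The runtime would only walk candidates that are Ready.
function runnable<T extends Slot>(chain: T[], env: Env): T[] {
  return chain.filter((slot) => badgeFor(slot, env) === "Ready");
}
```

In a backend function the `env` argument would typically be `process.env`; passing it in explicitly keeps the check easy to test.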

How alerts work

Open Super Admin → Alerts. Every time a fallback chain exhausts, a row lands here. The sidebar tab shows an amber badge with the count of unacknowledged alerts.

Each row shows:

The kind (Image / Chat / OCR / Embedding)
The service name (e.g. social.generateImage)
The last error (truncated to fit on the card)
The timestamp
An amber dot if unacknowledged

Click a row to expand the per-attempt audit — exactly which candidate hit which error. Then click Acknowledge to clear it from the badge, or Acknowledge all for bulk.
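An alert row and the badge count can be modeled with a few lines. The `AlertRow` shape and the helper names below are hypothetical, chosen to mirror the fields listed above; the 80-character truncation limit is likewise an arbitrary value for illustration.

```typescript
// Hypothetical alert row; field names are illustrative.
interface AlertRow {
  kind: "Image" | "Chat" | "OCR" | "Embedding";
  service: string; // e.g. "social.generateImage"
  attempts: { label: string; error: string }[]; // per-attempt audit
  createdAt: number; // epoch milliseconds
  acknowledgedAt?: number; // unset while the amber dot is showing
}

// The card shows the last error, truncated to fit.
function lastError(row: AlertRow, maxLen = 80): string {
  const last = row.attempts[row.attempts.length - 1];
  const text = last ? last.error : "";
  return text.length > maxLen ? text.slice(0, maxLen - 3) + "..." : text;
}

// Drives the amber badge on the sidebar tab.
function unacknowledgedCount(rows: AlertRow[]): number {
  return rows.filter((r) => r.acknowledgedAt === undefined).length;
}

// "Acknowledge all": stamp every open row, leave earlier stamps untouched.
function acknowledgeAll(rows: AlertRow[], now: number): AlertRow[] {
  return rows.map((r) =>
    r.acknowledgedAt === undefined ? { ...r, acknowledgedAt: now } : r,
  );
}
```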

Slack notifications (optional)

If you set the environment variable SLACK_ALERTS_WEBHOOK_URL to a Slack Incoming Webhook URL, every alert is also posted to Slack, so it can reach you as a phone notification. The webhook is fire-and-forget, so Slack downtime can’t break the user-facing flow.

To enable, run from the repo root:

CONVEX_DEPLOY_KEY=$(grep CONVEX_DEPLOY_KEY .env.deploy.local | cut -d= -f2-) npx convex env set SLACK_ALERTS_WEBHOOK_URL https://hooks.slack.com/services/...
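On the sending side, fire-and-forget means the webhook call is never allowed to throw into the request path. A minimal sketch, assuming a hypothetical `notifySlack` helper that receives its HTTP function as a parameter (in production that would be `fetch`); the only Slack-specific detail used is that Incoming Webhooks accept a JSON body with a `text` field.

```typescript
// Minimal HTTP function type; the global `fetch` satisfies it.
type Post = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<unknown>;

// Hypothetical helper: posts the alert text and swallows every failure.
function notifySlack(webhookUrl: string | undefined, text: string, post: Post): void {
  if (!webhookUrl) return; // Slack is optional: no URL, no ping

  // Promise.resolve().then(...) also captures synchronous throws from `post`.
  void Promise.resolve()
    .then(() =>
      post(webhookUrl, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text }),
      }),
    )
    .catch(() => {
      // Deliberately ignored: Slack downtime must not affect the operator.
    });
}

// Example call site (commented out so the sketch has no side effects):
// notifySlack(process.env.SLACK_ALERTS_WEBHOOK_URL, "Image chain exhausted", fetch);
```

Some serverless runtimes may not let un-awaited work finish after the handler returns. Where that applies, awaiting the call inside its own try/catch gives the same isolation from Slack failures.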

How friendly errors work

Backend actions that catch a cascade exhaustion throw a structured error with three fields: code, userMessage, technicalDetail. The Composer’s error banner shows the userMessage prominently and tucks the technical detail behind a “Show technical details” expander.
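The three-field error and its mapping to the banner can be sketched as follows. `FriendlyError`, `toBanner`, and the `IMAGE_CHAIN_EXHAUSTED` code are hypothetical names; in a Convex backend a payload like this might travel as the data of a `ConvexError`, but that is an assumption, and the sketch uses a plain `Error` subclass to stay self-contained. Falling back to generic copy for unrecognized errors is also an assumption.

```typescript
interface FriendlyErrorData {
  code: string;            // machine-readable, e.g. "IMAGE_CHAIN_EXHAUSTED" (hypothetical)
  userMessage: string;     // calm copy shown in the amber banner
  technicalDetail: string; // shown only behind "Show technical details"
}

class FriendlyError extends Error implements FriendlyErrorData {
  readonly code: string;
  readonly userMessage: string;
  readonly technicalDetail: string;
  constructor(data: FriendlyErrorData) {
    super(data.userMessage);
    this.name = "FriendlyError";
    this.code = data.code;
    this.userMessage = data.userMessage;
    this.technicalDetail = data.technicalDetail;
  }
}

interface Banner {
  tone: "amber";
  message: string;  // always visible
  details?: string; // collapsed by default
}

// Structural check, so it also works on errors deserialized from the backend.
function isFriendly(err: unknown): err is FriendlyErrorData {
  if (typeof err !== "object" || err === null) return false;
  const e = err as Record<string, unknown>;
  return (
    typeof e.code === "string" &&
    typeof e.userMessage === "string" &&
    typeof e.technicalDetail === "string"
  );
}

const GENERIC_MESSAGE = "Something went wrong. Try again in a minute.";

function toBanner(err: unknown): Banner {
  if (isFriendly(err)) {
    return { tone: "amber", message: err.userMessage, details: err.technicalDetail };
  }
  // Unknown errors never leak raw text into the headline.
  const details = err instanceof Error ? err.message : String(err);
  return { tone: "amber", message: GENERIC_MESSAGE, details };
}
```

Keeping `userMessage` and `technicalDetail` as separate fields is what lets the banner stay calm while support tickets still get the full cascade audit.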

The tone is amber — not red — because most AI failures are transient and red over-alarms. Red is reserved for actual destructive states (failed publish that left a calendar post in a bad status, expired Meta token blocking auto-publish).

What you don’t need to do

Manually retry failed requests — the chain handles retries silently.
Watch the Convex logs — the Alerts inbox is the single source of truth for “things are going wrong.”
Worry about credits when a cascade fails — the credit gate auto-refunds on throw. Operators aren’t charged for failed requests.
Touch the chain to add a new venue — chains are platform-wide. New venue inherits whatever you’ve configured.

Related

Composer: where the chain is invoked on most user clicks.
Image editor: also routes through the image chain on Save.
Library: surfaces every successful AI output regardless of which chain leg answered.