Mehman Ismayilov

Fallbacks Are Not Resilience

A fallback is a code path you do not run, cannot easily test, and will absolutely need under the worst possible conditions. Most fallbacks make outages worse, not better.

A service has retries, a timeout budget, and a fallback to cached data when the database is slow. Looks resilient. Then someone asks:

How often does the fallback path actually run?

Nobody knows. You check. It hasn’t run in production. Not once in six months. The cache it falls back to was deleted three months ago during a cleanup sprint. If the database actually failed, the fallback would crash the service in a more confusing way than just propagating the original error.

The moment you write a fallback, you’ve added a code path that you don’t run, can’t easily test, and absolutely will need under the worst possible conditions.

Fallbacks are the default answer when a dependency fails. Sometimes that works. Sometimes it makes things worse.

The intuition is wrong

The case for fallbacks sounds airtight. Service A depends on Service B. Service B sometimes fails. So when B fails, do something else. Returning something feels safer than returning an error. That assumption causes a lot of outages.

Sometimes a fallback genuinely helps. But “sometimes” is doing a lot of work there. A fallback only helps if all of the following are true:

  • The alternate path produces a correct-enough answer.
  • The alternate path is itself reliable.
  • Your downstream consumers can deal with the degraded result.
  • You will know, operationally, when the system is running on the fallback.

If any one of those is false, your fallback is making things worse. At least one is almost always false. The trade-off you think you’re making, slightly stale data instead of an error, is rarely the trade-off you’re actually making.

The four ways fallbacks bite

1. They serve wrong answers quietly

A few years ago a service I worked near read user entitlements from Postgres. Someone added a fallback: if the primary was unreachable, read from a replica. Reasonable, except the replica lag wasn’t bounded. During a long primary failover, the replica drifted by about twenty minutes. For twenty minutes, users who had upgraded their plan still saw the free-tier limits. Support tickets started rolling in. From the service’s perspective, everything was fine. The fallback was working perfectly.

A loud error would have been better. The user could have refreshed, or we could have shown a “try again in a moment” page. Instead the system silently lied to thousands of users.

2. They cascade

This is the failure mode that gets the most airtime, and for good reason. The classic shape: A calls B. B is slow. A retries. A’s retries pile up. With two retries per request, B is now serving up to three times the traffic, so it gets slower. A’s threads exhaust. A is now down. A’s callers retry against A. The blast radius keeps growing.

Fallbacks have the same shape. If B is down and A falls back to C, every request that used to go to B now goes to C. C was sized for its normal load. C falls over. You’ve turned a partial outage into a total one.

A concrete version I’ve seen: a service reads user data from a cache that a Kafka consumer keeps in sync. Someone adds a fallback. If the cache entry is too stale, skip it and read straight from Postgres. Fine in isolation. The day Kafka has a hiccup, cache entries across every service instance go stale at roughly the same moment, and every read request lands on Postgres at once. Postgres was sized for normal write load plus a trickle of reads, not the full firehose. By the time Kafka recovers, Postgres is the new outage.

3. They hide the real problem

A fallback exists to make a failure invisible to the caller. That’s the whole point. The catch is that it tends to make the failure invisible to you too.

Take a service that calls an internal personalization service to tailor a user’s homepage. The call has a fallback: if personalization is unavailable, return a generic list of popular items instead.

The obvious objection is “just alert on the inner call.” In practice, teams rarely do. Once a fallback exists, the inner failure becomes “expected” — the call was a bit flaky to begin with, which is part of why someone wrote the fallback. The metric is non-zero by design, the threshold gets set loose or never set at all, and any alert that does exist sits in a channel nobody pages on because “the fallback handles it.” The outer SLO stays green while the product quietly gets worse.

Meanwhile, every user is seeing the same generic homepage. Engagement is decaying. Click-through is down. The first real signal anyone gets is a product review at the end of the week: “why is engagement off 12%?”

The fallback didn’t make the failure unmeasurable. It made the failure feel acceptable. Those are different things, and only the first one is solved by a Grafana dashboard.

4. They have their own bugs

Fallback paths rot.

They run less than the primary path, so configuration drift, schema changes, and edge cases accumulate silently. The unit test still passes. Production doesn’t. The worst time to discover your fallback is broken is during the outage that activates it.

What to do instead

There’s no single answer here. The right thing depends on what you’re building. But these are the moves I reach for before I reach for a fallback.

Fail fast and loud. The most underrated pattern in distributed systems. If a dependency is unhealthy, return an error. Quickly. Don’t hold the request thread waiting. Don’t try to be clever. A clear 503 is honest. It gives the caller a chance to do its own thing (back off, switch traffic, alert), and it gives operators a real signal to debug.
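A minimal sketch of what failing fast can look like, assuming the dependency call is exposed as a CompletableFuture. The class name FailFast, the method statusFor, and the budget passed in are hypothetical and not tied to any framework.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class FailFast {
    /**
     * Waits for the dependency for at most `budget`, then maps the outcome to
     * an HTTP status. Nothing is substituted on failure: the caller gets a 503.
     */
    static int statusFor(CompletableFuture<String> dependencyCall, Duration budget) {
        try {
            dependencyCall.get(budget.toMillis(), TimeUnit.MILLISECONDS);
            return 200;
        } catch (TimeoutException | ExecutionException e) {
            // Slow or failed dependency: stop waiting and say so.
            return 503;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and still report the failure honestly.
            Thread.currentThread().interrupt();
            return 503;
        }
    }
}
```

The point of the sketch is what is missing: there is no branch that invents a substitute answer. The caller sees the 503 and decides what to do with it.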

Make timeouts first-class. Every network call should have a timeout shorter than your request budget. Every single one. In Java this means setting both connection and socket timeouts on your HTTP clients, configuring statement_timeout on Postgres connections, and setting request.timeout.ms, delivery.timeout.ms, and friends on your Kafka producers. The defaults in many libraries are either “wait forever” (Postgres ships with statement_timeout disabled, and Java’s built-in HttpClient sets no timeout unless you ask for one) or far longer than any request budget (Kafka’s delivery.timeout.ms defaults to two minutes). Both are the wrong default for a request path.
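As a sketch, here is how explicit timeouts might be set with the JDK's built-in java.net.http client, plus the two Kafka producer settings named above as plain properties. The class name TimeoutConfig and every number are assumptions, chosen for a hypothetical service with a 2-second request budget.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.Properties;

final class TimeoutConfig {
    // Hypothetical values; each one is well under the assumed 2-second budget.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(250);
    static final Duration CALL_TIMEOUT = Duration.ofMillis(800);

    /** Client that gives up on establishing a connection after CONNECT_TIMEOUT. */
    static HttpClient httpClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    /** Request that fails with HttpTimeoutException if no response arrives in time. */
    static HttpRequest getRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(CALL_TIMEOUT)
                .GET()
                .build();
    }

    /**
     * Timeout-related producer settings only; a real producer also needs
     * bootstrap servers and serializers. Kafka requires delivery.timeout.ms
     * to be at least linger.ms + request.timeout.ms.
     */
    static Properties kafkaProducerTimeouts() {
        Properties props = new Properties();
        props.setProperty("request.timeout.ms", "1000");
        props.setProperty("delivery.timeout.ms", "1500");
        return props;
    }
}
```

For Postgres, the equivalent is running something like `SET statement_timeout = '500ms'` on each connection, with the value again picked to fit your own budget.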

Use backpressure. When a dependency is struggling, the instinct is to keep serving the request anyway — serve stale data, return a default, paper over the gap somehow. Backpressure is the opposite instinct: stop accepting work and push the slowness back up the chain. Kafka consumers can pause partitions. Java thread pools can reject work with a clear signal. HTTP servers can shed load with 429s. This is more honest than absorbing the failure quietly. You’re telling the system “I’m degraded” instead of pretending you aren’t.
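One way to get a Java thread pool to “reject work with a clear signal” is a bounded queue combined with ThreadPoolExecutor.AbortPolicy. The sketch below assumes that setup; the wrapper class BoundedWorkerPool and its trySubmit method are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedWorkerPool {
    private final ThreadPoolExecutor executor;

    BoundedWorkerPool(int threads, int queueCapacity) {
        // Fixed thread count and a bounded queue: once both are full,
        // AbortPolicy throws instead of letting work pile up without limit.
        this.executor = new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /**
     * Returns true if the task was accepted. Returns false when the pool is
     * saturated, which the caller can translate into a 429 or 503.
     */
    boolean trySubmit(Runnable task) {
        try {
            executor.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    void shutdownNow() {
        executor.shutdownNow();
    }
}
```

With one thread and a queue of one, the first task runs, the second waits, and the third is refused immediately. That refusal is the “I’m degraded” signal pushed back up the chain.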

If you must have a fallback, exercise it. Constantly. Not once a quarter in a game day. Daily. Hourly. Route some real traffic through the fallback path so you find the bugs before your customers do. An untested fallback is a guaranteed future outage. The fallback path should be just as well-exercised in production as the primary, or you should delete it.
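A rough sketch of routing a slice of real traffic through the fallback, so that it breaks while the primary is still healthy enough to cover for it. Everything here is an assumption for illustration: the ExercisedFallback class, the exerciseRate parameter, and the choice to serve from the primary when an exercised fallback throws.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class ExercisedFallback<T> {
    private final Supplier<T> primary;
    private final Supplier<T> fallback;
    private final double exerciseRate; // fraction of requests, 0.0 to 1.0

    // Counters a real service would export to its metrics system.
    final AtomicLong fallbackServed = new AtomicLong();
    final AtomicLong fallbackFailed = new AtomicLong();

    ExercisedFallback(Supplier<T> primary, Supplier<T> fallback, double exerciseRate) {
        this.primary = primary;
        this.fallback = fallback;
        this.exerciseRate = exerciseRate;
    }

    T get() {
        // Deliberately send a sample of healthy traffic down the fallback path.
        if (ThreadLocalRandom.current().nextDouble() < exerciseRate) {
            try {
                return serveFallback();
            } catch (RuntimeException e) {
                // Fallback is broken while the primary is fine: record it
                // and continue to the primary so the user is not affected.
                fallbackFailed.incrementAndGet();
            }
        }
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // The situation the fallback was written for.
            return serveFallback();
        }
    }

    private T serveFallback() {
        T value = fallback.get();
        fallbackServed.incrementAndGet();
        return value;
    }
}
```

A non-zero fallbackFailed count on an ordinary day is the bug report you would otherwise have received mid-outage.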

Decide whether the dependency is actually required. Most fallbacks exist in a fuzzy middle ground: the call isn’t critical enough to fail the request when it breaks, but it’s not optional enough to drop entirely. That ambiguity is what the fallback papers over. Force the decision instead. If the call is truly optional, design the system so the request doesn’t depend on it — make it async, fire-and-forget, or move it out of the request path entirely. If the call is genuinely required, fail the request when it fails. The honest answers are “we don’t need this” or “we need this and we’ll surface the failure.” Fallbacks are mostly the cost of refusing to pick one.
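To make the two honest answers concrete, here is a sketch in which the optional call is handed to a side executor and never awaited, while the required call has no safety net at all. RequestHandler, handle, and the optionalFailures counter are hypothetical names; the executor is whatever your service already uses for background work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class RequestHandler {
    private final Executor sideWork;

    // Counted so the failure stays visible even though it never reaches the caller.
    final AtomicLong optionalFailures = new AtomicLong();

    RequestHandler(Executor sideWork) {
        this.sideWork = sideWork;
    }

    <T> T handle(Supplier<T> requiredCall, Runnable optionalCall) {
        // "We don't need this": fire and forget. The response never waits
        // on the result, and a failure is recorded rather than returned.
        CompletableFuture.runAsync(optionalCall, sideWork)
                .whenComplete((ignored, error) -> {
                    if (error != null) {
                        optionalFailures.incrementAndGet();
                    }
                });

        // "We need this": no try/catch, so a failure here fails the request.
        return requiredCall.get();
    }
}
```

A fuller version would also have to decide what happens when the side executor itself is saturated, since a rejecting executor would throw from runAsync.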

The honest version of “graceful degradation”

There’s a legitimate version of all this. None of the above means “never run degraded.”

Graceful degradation works when:

  • The degraded mode is a real product decision, not a code afterthought.
  • It’s visible to operators, with its own dashboards and alerts.
  • It’s visible to users when appropriate, with a banner or a different layout.
  • It’s exercised regularly.
  • The system that serves it is independent enough not to share failure modes with the primary.

“Show a generic homepage when personalization is down, with a banner telling the user the section is temporarily unavailable” is graceful degradation. “Silently read from a stale replica” is not. The difference is whether the degraded state is a thing the system knows about and the team has signed off on, versus a hidden branch nobody thinks about until it lights up at 3 AM.
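The homepage example might be sketched like this, assuming the degraded state is carried in the response itself and counted on the way out. The Homepage and HomepageService classes, the degraded flag, and the degradedResponses counter are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class Homepage {
    final List<String> items;
    final boolean degraded; // the UI shows the "temporarily unavailable" banner when true

    Homepage(List<String> items, boolean degraded) {
        this.items = items;
        this.degraded = degraded;
    }
}

final class HomepageService {
    // Exported to dashboards and alerts so operators see degraded mode.
    final AtomicLong degradedResponses = new AtomicLong();
    private final List<String> popularItems;

    HomepageService(List<String> popularItems) {
        this.popularItems = popularItems;
    }

    Homepage render(Supplier<List<String>> personalization) {
        try {
            return new Homepage(personalization.get(), false);
        } catch (RuntimeException e) {
            // Degraded on purpose: counted for operators, flagged for users.
            degradedResponses.incrementAndGet();
            return new Homepage(popularItems, true);
        }
    }
}
```

The generic list is the same one a silent fallback would return. What changes is that the response says it is degraded and a counter moves every time it happens.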

Some domains are naturally tolerant of degraded answers. Recommendations, search autocomplete, feed personalization, cached content delivery — these are good places for graceful degradation because the consequence of a slightly worse answer is, at most, a slightly worse experience. The danger starts when fallbacks creep into correctness-critical paths: billing, entitlements, permissions, transactional state. There, “silently degraded” is just “silently wrong.”

The question to ask

Whenever I see a fallback in code review now, I ask: what happens if this fallback runs on every request? Not the primary path. The fallback. If the system would survive that, the fallback is probably fine. If the system would collapse, the fallback is a trap door we’ve installed in our own floor.

Most fallbacks fail that test. Delete them. Return the error. Let the caller decide. The on-call rotation will thank you.