Designing for Degradation: How to Build AI Features That Stay Useful When the Model Fails

David Liu

12 min read

There’s a particular kind of failure that doesn’t show up in your uptime dashboard. Your API is healthy. Your error rate is zero. But somewhere in your product, a user just got a blank response from your AI feature, shrugged, and closed the tab. Not an error. Not a timeout. Just… nothing useful.

This is what happens when AI features aren’t designed to degrade gracefully. And unlike traditional software failures, they’re quiet. They don’t page you. They just slowly erode trust, one hollow interaction at a time.

This post is about building AI features that stay useful when things go wrong — when the model returns low-confidence output, when latency spikes, when the context is garbage, when the user has no idea what to ask. Graceful degradation for AI isn’t a nice-to-have. It’s the difference between a feature users trust and one they quietly stop using.


Why AI Features Fail Differently

Traditional software has a clean failure mode: it either works or it throws an error. You can write tests for it. You can instrument it. When something breaks, you know.

AI features fail on a spectrum. A model can return something syntactically valid, semantically coherent, and completely wrong for the user’s situation. It can be 90% right in a way that’s worse than 0% right. It can be confidently unhelpful.

There are roughly four failure modes you’re actually designing against:

  • Hard failures: Timeout, API error, rate limit. Easy to detect. Harder to handle well.
  • Quality failures: The model returns output, but it’s wrong, generic, or off-topic. Invisible to your monitoring.
  • Context failures: The model didn’t have what it needed to do the job — missing user history, ambiguous prompt, bad retrieval. It answered the question it was given, not the question the user meant.
  • Confidence failures: The model doesn’t know that it doesn’t know. It generates fluent nonsense and presents it with the same tone as accurate output.

Most engineering teams solve for hard failures and ship. The other three are what separate products users trust from ones they eventually stop opening.


The Core Principle: Every AI Feature Needs a Non-AI Fallback

Before writing a single line of AI code, ask: if this feature returns nothing useful, what does the user do instead?

If the answer is “they’re stuck” — you have a design problem, not an engineering problem. No amount of retry logic or error handling fixes a feature with no fallback path.

This isn’t pessimism about AI. It’s the same discipline you’d apply to any external dependency. You don’t assume the database will always respond in under 50ms. You don’t assume the payment gateway will always be available. The same thinking applies here.

Fallbacks don’t have to be complex. They’re often just honest UI. A few patterns that work:

Show the input, not the failure. If your AI feature is supposed to summarize a document and it returns garbage, surface the original document instead. The user can still read it. You haven’t made them worse off.

Offer a manual path. If AI-assisted search fails, show a standard search box. If AI-generated suggestions are low-quality, show a list of recent or popular items instead. Users understand that suggestions are sometimes off. They do not understand having no path forward.

Be explicit about uncertainty. “Here’s what I found, but I’m not confident this is right — you might want to double-check” is a vastly better user experience than a confidently wrong answer. It also sets expectations that make users more forgiving of imperfections.
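
The patterns above can be combined in one small wrapper. What follows is a minimal sketch, not a definitive implementation: FeatureResult, summarize_with_fallback, and the min_chars threshold are hypothetical names and values chosen for illustration, and summarize stands in for whatever model call your feature makes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FeatureResult:
    kind: str                      # "summary" or "original"
    text: str                      # what the UI should display
    notice: Optional[str] = None   # honest message for the user, if any


def summarize_with_fallback(
    document: str,
    summarize: Callable[[str], str],  # hypothetical model call
    min_chars: int = 40,              # assumed threshold; tune per feature
) -> FeatureResult:
    """Return a summary, or the original document if the summary is unusable."""
    try:
        summary = summarize(document).strip()
    except Exception:
        summary = ""  # treat a hard failure like an empty response

    if len(summary) < min_chars:
        # Show the input, not the failure: the user can still read the document.
        return FeatureResult(
            kind="original",
            text=document,
            notice="We couldn't generate a summary, so here is the full document.",
        )
    return FeatureResult(kind="summary", text=summary)
```

The caller always gets something it can render, and the notice field keeps the UI honest about which path was taken.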


Confidence Signals: What to Actually Instrument

One of the practical gaps in most AI feature implementations is that teams instrument the call — latency, success rate, token usage — but not the output quality. You know your model responded. You don’t know if it responded usefully.

You can’t measure output quality perfectly without human evaluation at scale, but you can build signals that correlate with it:

Output length as a proxy

For generative features where the expected output has a known shape, very short or very long responses are often a signal of failure. A summary that’s three words long or eight paragraphs long probably didn’t hit the mark. This is a cheap heuristic you can add to your logging immediately.
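
As a rough illustration, the check can be a few lines attached to your logging. The function name length_signal and the word bounds are assumptions; the right range depends entirely on the feature.

```python
def length_signal(text: str, min_words: int, max_words: int) -> str:
    """Classify output length against the range expected for this feature."""
    n = len(text.split())  # whitespace-separated word count
    if n < min_words:
        return "too_short"
    if n > max_words:
        return "too_long"
    return "ok"
```

For a summary feature you might log length_signal(summary, 20, 150) next to each response and alert on the share of results that are not "ok".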

User behavior as feedback

Did the user immediately dismiss the output? Did they edit it heavily before using it? Did they retry? These behavioral signals are noisier than explicit thumbs-up/thumbs-down feedback, but they’re available without asking users to do extra work. Wire them up from day one.

Retrieval quality for RAG features

If your feature uses retrieval-augmented generation, the quality of what you retrieved is often the dominant factor in output quality. Log your top retrieval results. Measure the semantic distance between the query and the retrieved chunks. If retrieval is consistently returning low-relevance results for a query pattern, the model can’t fix that.
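
One way to log this is sketched below, assuming you already have embedding vectors for the query and for each retrieved chunk. The function names and the shape of the report are hypothetical; most vector stores return similarity scores directly, in which case you can skip the first function.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieval_report(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
) -> Dict[str, Optional[float]]:
    """Summarize how close the retrieved chunks are to the query, for logging."""
    scores = [cosine_similarity(query_vec, c) for c in chunk_vecs]
    if not scores:
        return {"top_score": None, "mean_score": None, "num_chunks": 0}
    return {
        "top_score": max(scores),
        "mean_score": sum(scores) / len(scores),
        "num_chunks": len(scores),
    }
```

Logged per request and grouped by query pattern, these numbers show where retrieval is consistently weak.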

Prompt injection detection

For any feature that passes user input into your prompt, build a detection layer for injection attempts. This isn’t just a security concern — it’s a reliability one. A user who pastes in an adversarial prompt (sometimes accidentally) can produce output that breaks your downstream processing. Know when this is happening.
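
A first detection layer can be as simple as pattern matching, with the caveat that this is a deliberately naive sketch. The patterns below are illustrative assumptions, they will miss many real attempts and flag some innocent input, and they are useful mainly as a logging signal rather than as a security control.

```python
import re
from typing import List

# Illustrative patterns only; real injection attempts vary widely.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the |your )?(system|previous) prompt", re.I),
    re.compile(r"reveal (the |your )?system prompt", re.I),
    re.compile(r"you are now\b", re.I),
]


def injection_flags(user_input: str) -> List[str]:
    """Return the patterns that matched, so the event can be logged and reviewed."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(user_input)]
```

An empty list means nothing matched, not that the input is safe.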


Handling Latency: The Forgotten UX Problem

AI features are slow relative to what users expect from software. A typical chat response takes one to five seconds. A complex RAG pipeline can take ten. Users have been conditioned by decades of software that responds in milliseconds.

Latency isn’t just a performance problem. It’s a trust problem. When users wait and get back something unhelpful, the wait amplifies the disappointment. Even when they wait and get something excellent, they will tolerate the wait only up to a point.

A few patterns that help:

Stream by default. If your output is text and your model supports streaming, stream it. Perceived latency drops dramatically when users see the first tokens in under a second, even if the full response takes five seconds. This is the single highest-leverage change you can make to AI feature UX for most applications.
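
The sketch below shows the consuming side of streaming under some assumptions: chunks is any iterator of text pieces from your model client, render is a hypothetical UI callback, and stream_to_ui is an illustrative name. It also records time to first chunk, which is the number that tracks perceived latency.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


def stream_to_ui(chunks: Iterable[str], render: Callable[[str], None]) -> Dict:
    """Render text incrementally and record when the first chunk arrived."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    parts: List[str] = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
        render("".join(parts))  # re-render with everything received so far
    return {
        "text": "".join(parts),
        "time_to_first_chunk": first_chunk_at,  # None if nothing arrived
        "total_time": time.perf_counter() - start,
    }
```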

Optimize the critical path. Look at where your latency is actually going. In many RAG applications, the retrieval step accounts for a large share of the total, sometimes the largest. For calls to a fine-tuned model, it’s usually inference time. Profile before you optimize. “Make the model call faster” often isn’t where the time is.
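
Profiling does not need special tooling to start. Here is a minimal sketch using a context manager to time each stage of a request; timed_stage and the stage names are hypothetical, and the sleep calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_stage(name: str, timings: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock duration of the wrapped block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failures are still measured.
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


timings: Dict[str, float] = {}
with timed_stage("retrieval", timings):
    time.sleep(0.01)  # stand-in for a retrieval call
with timed_stage("generation", timings):
    time.sleep(0.02)  # stand-in for a model call

slowest = max(timings, key=timings.get)  # the stage to look at first
```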

Set a timeout that degrades gracefully. Decide what “too long” is for your feature. If a response takes more than eight seconds, consider surfacing the fallback path rather than making the user wait indefinitely. An eight-second cap with a fallback is better UX than a fifteen-second response that may or may not arrive.
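
With an async client, the cap can be expressed with asyncio.wait_for from the standard library. This is a sketch under assumptions: call_with_timeout is a hypothetical helper, model_call is whatever coroutine function wraps your model request, and the eight-second default simply mirrors the example above.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def call_with_timeout(
    model_call: Callable[[], Awaitable[str]],  # hypothetical model request
    fallback: str,
    timeout_s: float = 8.0,
) -> Tuple[str, bool]:
    """Return (text, used_fallback); fall back if the call exceeds timeout_s."""
    try:
        text = await asyncio.wait_for(model_call(), timeout=timeout_s)
        return text, False
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising.
        return fallback, True
```

The boolean lets the caller log how often the fallback path is taken, which is itself a quality signal.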

Precompute where you can. Not every AI feature needs to be generated on demand. User-facing recommendations, content summaries, similar-item suggestions: many of these can be generated asynchronously and served from cache. An instant precomputed result is almost always a better experience than one generated in real time.
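
A precomputed store can start as a keyed cache with an expiry. The class below is an in-memory sketch for illustration only; PrecomputedStore and its methods are hypothetical, and a production version would more likely sit on a shared cache or database filled by a background job.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PrecomputedStore:
    """Holds results generated ahead of time and drops them after ttl_s seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested
        self._items: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._items[key]  # stale results are treated as missing
            return None
        return value
```

A None from get is the cue to generate on demand or show the non-AI fallback.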


Context Management: Garbage In, Garbage Out

Most AI quality failures aren’t model failures. They’re context failures. The model was given an ambiguous prompt, insufficient user context, or low-quality retrieved content, and it did exactly what it should do with what it had. It generated a plausible response. It just wasn’t the right one.

Context management is the unsexy work that makes the difference between a demo that impresses and a product that earns retention.

Be intentional about what goes in the context window

More context isn’t always better. A 128k context window doesn’t mean you should fill it. Irrelevant context confuses the model. Contradictory context produces unpredictable outputs. Stale context is often worse than no context.

For each call, ask: what is the minimum information this model needs to do this specific task well? Build a function that constructs that context deliberately, not a pipeline that dumps everything you have.
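
Such a function might look like the sketch below. ContextItem, build_context, the relevance scores, and both thresholds are assumptions for illustration; how you score relevance is a separate decision, and a real budget would usually be counted in tokens rather than characters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    text: str
    relevance: float  # higher is more relevant; the scoring method is up to you


def build_context(
    items: List[ContextItem],
    max_chars: int,
    min_relevance: float = 0.5,  # assumed cutoff
) -> str:
    """Pick the most relevant items that fit the budget, most relevant first."""
    selected: List[str] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if item.relevance < min_relevance:
            break  # everything after this is less relevant still
        if used + len(item.text) > max_chars:
            continue  # too big for the remaining budget; try smaller items
        selected.append(item.text)
        used += len(item.text)  # separators are not counted against the budget
    return "\n\n".join(selected)
```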

Handle the cold start problem

New users have no history. No preferences. No behavioral signals. Your AI features will perform worst for the people you most need to impress: users in their first session.

Design for this explicitly. What does a useful response look like with zero user context? Often the answer is: show popular or curated content, ask clarifying questions, or offer explicit prompts that help the user teach the system. Don’t just let the model hallucinate a generic response and call it personalization.
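
Making the cold-start decision explicit in code keeps it from being an accident of the prompt. The strategy names and the min_history cutoff below are hypothetical; the point of the sketch is only that the branch exists and is chosen deliberately.

```python
def choose_strategy(interaction_count: int, min_history: int = 5) -> str:
    """Pick how to respond based on how much we know about the user."""
    if interaction_count == 0:
        return "curated"          # popular or hand-picked content, no model guesswork
    if interaction_count < min_history:
        return "ask_clarifying"   # let the user teach the system
    return "personalized"         # enough signal to use history in the context
```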

Validate retrieval before generation

For RAG features, add a validation step between retrieval and generation. If your top retrieved chunk has a cosine similarity below some threshold, treat it as a retrieval failure and either surface an “I couldn’t find relevant information” response or fall back to a non-RAG baseline. Generating from low-relevance retrieved content often produces convincingly wrong answers, the most dangerous kind.
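
The gate itself is small. In this sketch, scored_chunks is assumed to be a list of (text, similarity) pairs from your retriever, generate is a hypothetical model call, and the 0.75 threshold is a placeholder that needs tuning against your own embeddings and data.

```python
from typing import Callable, List, Optional, Sequence, Tuple

NO_RESULTS_MESSAGE = "I couldn't find relevant information for that question."


def gate_retrieval(
    scored_chunks: Sequence[Tuple[str, float]],
    min_similarity: float = 0.75,  # assumed threshold
) -> Optional[List[str]]:
    """Return chunks that clear the threshold, or None if retrieval failed."""
    if not scored_chunks:
        return None
    top_score = max(score for _, score in scored_chunks)
    if top_score < min_similarity:
        return None
    return [text for text, score in scored_chunks if score >= min_similarity]


def answer(
    question: str,
    scored_chunks: Sequence[Tuple[str, float]],
    generate: Callable[[str, List[str]], str],  # hypothetical model call
) -> str:
    chunks = gate_retrieval(scored_chunks)
    if chunks is None:
        return NO_RESULTS_MESSAGE  # never generate from irrelevant context
    return generate(question, chunks)
```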


Communication Patterns That Build Trust

The way you frame AI output shapes how users engage with it. Some communication patterns build trust over time. Others erode it, even when the model is performing well.

Be explicit about what the AI is and isn’t doing. “Based on your last 30 interactions” is more trustworthy than “Recommended for you.” It tells the user why they’re seeing something and lets them evaluate whether the reasoning makes sense. Opacity feels suspicious over time, even when the outputs are good.

Acknowledge the limits of the feature. “This summary covers the main points but may miss nuance in technical sections” is better than presenting a summary as if it’s complete. Users who understand the limitations are more forgiving of errors and more likely to stay engaged rather than leaving when they catch one.

Give users control over the AI’s behavior. This doesn’t have to be complex. A simple “regenerate” button, a tone selector, or the ability to edit an AI-generated draft before using it are all forms of user control. They signal that the AI is a tool, not an oracle, and they create a feedback loop that surfaces failure cases you’d otherwise miss.

Separate generation from presentation. Don’t stream AI output directly into final UI state. Buffer it, validate the shape, and then render it. A markdown table that failed to close properly shouldn’t break your layout. A response that returned a hallucinated URL shouldn’t render as a clickable link to nowhere. Your rendering layer is the last line of defense.
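
A validation pass between the buffer and the renderer might look like the following sketch. It covers two cases only: an unclosed fenced code block and links to hosts you have not approved. The name prepare_for_render and the host allowlist are assumptions; an allowlist is one possible way to handle hallucinated URLs, not the only one.

```python
import re
from typing import Set
from urllib.parse import urlparse

FENCE = "`" * 3  # markdown code fence marker
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def prepare_for_render(text: str, allowed_hosts: Set[str]) -> str:
    """Repair obvious shape problems before the text reaches the UI."""
    # An odd number of fence markers means a code block was never closed.
    if text.count(FENCE) % 2 == 1:
        text = text.rstrip() + "\n" + FENCE

    def neutralize(match: "re.Match[str]") -> str:
        url = match.group(0)
        try:
            host = urlparse(url).hostname or ""
        except ValueError:
            host = ""  # malformed URL; treat as untrusted
        if host in allowed_hosts:
            return url
        return "[link removed: %s]" % (host or "unknown")

    return URL_PATTERN.sub(neutralize, text)
```

The same pass is a natural place for other shape checks, such as table rows with mismatched column counts.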


The Operational Side: Monitoring AI Features in Production

Most teams monitor AI features the same way they monitor everything else: error rates, latency percentiles, throughput. These metrics will tell you when your API is down. They won’t tell you when your feature is quietly failing its users.

Build a separate monitoring layer for AI feature quality. The specifics depend on your application, but some dimensions to track:

  • Completion rate: Of users who trigger the AI feature, what percentage receive a response they use? (For generative features, “use” might mean copying the output, leaving it undismissed, or accepting a suggestion.)
  • Retry rate: How often do users regenerate? High retry rates on a specific feature or input type signal a quality problem.
  • Edit distance for drafts: If your feature generates draft content that users then edit, measure how much they’re editing. Heavy editing on average means the drafts aren’t useful.
  • Exit rate after failure: What percentage of users who hit an AI failure (empty response, obvious error, loading spinner that exceeds threshold) leave within the next 60 seconds?

None of these are hard to instrument. Most teams just don’t do it. Set up these metrics before you launch, not after you notice the retention problem.
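
To show how little code the first three metrics need, here is a sketch that computes them from logged interactions. The Interaction record and its fields are hypothetical, and edit_ratio uses difflib's similarity ratio as a stand-in for a true edit distance. Exit rate after failure is left out because it depends on session data this sketch does not model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class Interaction:
    used: bool                   # did the user copy, accept, or keep the output?
    retries: int                 # how many times they regenerated
    draft: Optional[str] = None  # AI-generated draft, if any
    final: Optional[str] = None  # what the user ended up with


def edit_ratio(draft: str, final: str) -> float:
    """0.0 means unchanged, 1.0 means nothing in common (approximate)."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()


def quality_metrics(interactions: List[Interaction]) -> Dict[str, Optional[float]]:
    total = len(interactions)
    if total == 0:
        return {"completion_rate": None, "retry_rate": None, "mean_edit_ratio": None}
    edits = [
        edit_ratio(i.draft, i.final)
        for i in interactions
        if i.draft is not None and i.final is not None
    ]
    return {
        "completion_rate": sum(i.used for i in interactions) / total,
        "retry_rate": sum(i.retries > 0 for i in interactions) / total,
        "mean_edit_ratio": sum(edits) / len(edits) if edits else None,
    }
```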


A Practical Checklist Before You Ship

Before any AI feature goes to production, walk through these:

  1. What happens if the model call times out? Is there a UI state for this? Does it degrade to something useful?
  2. What happens if the model returns an empty string? Does your rendering handle it? Does the user see something sensible?
  3. What happens if the retrieved context is low quality? Do you validate similarity scores? Do you have a no-results state?
  4. What does the feature look like for a new user with no history? Is it still useful?
  5. Are you streaming, and should you be? If the response takes more than two seconds and it’s text, stream it.
  6. Can users tell the AI is uncertain? Are you surfacing confidence signals or just presenting all outputs as equally reliable?
  7. What are you logging beyond latency and error rate? Do you have behavioral signals that will tell you if the feature is working?
  8. Is there a manual path? If the AI returns nothing useful, does the user still have a way to accomplish their goal?

You don’t need perfect answers to all of these before you ship. But you need answers. “We’ll figure it out if it’s a problem” is how you end up with a feature that quietly stops getting used while your dashboards show green.


The Bigger Point

AI features are still software features. They have the same basic obligations: be useful to the user, behave predictably, communicate failures honestly, and not make the experience worse when things go wrong.

What changes is the failure surface. Traditional software fails in ways you can reproduce and test. AI features fail in ways that are probabilistic, contextual, and often invisible to your observability stack. Designing for this isn’t harder than traditional reliability engineering — it’s just different. It requires more intentionality about fallbacks, more instrumentation of outputs rather than just infrastructure, and more humility in how you present what the model produces.

The teams building AI products that hold up in production aren’t the ones with the most sophisticated models. They’re the ones who’ve thought carefully about what happens when the model isn’t at its best — and built a product that stays useful anyway.