Self-Healing AI Agents: How Automatic Error Recovery Works

2026-03-25 • 10 min read • orchex team

AI Agents Fail More Than You Think

If you have used an AI coding assistant for anything beyond trivial tasks, you have seen it fail. The model generates code that does not compile. It imports a module that does not exist. It writes a function that passes type checking but breaks at runtime because it misunderstood the API.

These failures are not bugs in the model. They are an inherent property of probabilistic code generation. Models work from patterns, and sometimes the pattern does not match reality. The question is not whether your AI agent will fail. It is what happens next.

In a single-agent workflow, the answer is simple: you read the error, adjust your prompt, and try again. In a multi-agent workflow with six streams running in parallel, manual recovery is not practical. By the time you diagnose one failure, three other streams may have cascading issues.

This is why self-healing matters.

The 10 Error Categories

Not all errors are created equal. A syntax error has a different recovery strategy than a missing dependency. orchex categorizes every failure into one of 10 error types, and each type triggers a different recovery approach:

  1. Syntax errors -- The generated code does not parse. Missing brackets, malformed expressions, unterminated strings. Recovery: re-generate with the parse error highlighted.

  2. Type errors -- TypeScript compilation fails. Wrong argument types, missing properties, incompatible return types. Recovery: include the full type definitions in the retry context.

  3. Import errors -- The code references modules or exports that do not exist. Recovery: provide the actual export list from the referenced files.

  4. Runtime errors -- The code compiles but crashes during execution. Null references, undefined method calls, assertion failures. Recovery: include the stack trace and runtime state.

  5. Test failures -- The implementation works but tests fail. Wrong assertions, missing edge cases, changed behavior. Recovery: include failing test output with expected vs actual values.

  6. Lint/format errors -- The code works but violates project style rules. Recovery: include the lint configuration and specific rule violations.

  7. Dependency errors -- Missing npm packages, version conflicts, peer dependency issues. Recovery: include the package.json and lock file context.

  8. Timeout errors -- The LLM call exceeded the time limit. Recovery: simplify the prompt, reduce context size, or switch to a faster model.

  9. Context errors -- The stream's reads list was insufficient. It needed information from files it did not have access to. Recovery: expand the context window with additional file contents.

  10. Artifact errors -- The LLM response could not be parsed into valid file modifications. Malformed JSON, missing required fields, invalid line ranges. Recovery: re-prompt with stricter output format instructions.
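The taxonomy above can be sketched as a simple classifier. This is an illustrative sketch only: the category names mirror the list, but the patterns and the `categorizeError` function are assumptions, not orchex's actual implementation.

```typescript
// Illustrative sketch of bucketing raw failure output into the ten
// categories. The regex heuristics here are assumptions.
type ErrorCategory =
  | "syntax" | "type" | "import" | "runtime" | "test"
  | "lint" | "dependency" | "timeout" | "context" | "artifact";

function categorizeError(output: string): ErrorCategory {
  // Check import errors before type errors: tsc reports a missing
  // module as TS2307, which would otherwise match the type pattern.
  if (/Cannot find module/.test(output)) return "import";
  if (/error TS\d+/.test(output)) return "type";
  if (/SyntaxError|Unexpected token/.test(output)) return "syntax";
  if (/timed out|timeout/i.test(output)) return "timeout";
  if (/tests? failed|AssertionError/.test(output)) return "test";
  return "runtime"; // fallback: compiled but crashed during execution
}
```

In practice a real classifier would also inspect the exit code and which pipeline stage (compile, lint, test, apply) produced the failure, not just the message text.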

How Fix Streams Work

When a stream fails, orchex does not simply retry with the same prompt. That would repeat the same failure. Instead, it generates a fix stream: a new stream specifically designed to address the error.

The fix stream process works like this:

Step 1: Error Analysis

orchex examines the failure output and categorizes it. For a TypeScript type error, it extracts the specific error codes, the file locations, and the expected types.
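For TypeScript failures, that extraction can lean on the fact that tsc formats each diagnostic as `file(line,col): error TScode: message`. The parser below is a sketch under that assumption; orchex's actual analyzer is not shown here.

```typescript
// Sketch: pull structured fields out of a tsc diagnostic line.
interface TsDiagnostic {
  file: string;
  line: number;
  column: number;
  code: string;    // e.g. "TS1308"
  message: string;
}

const TSC_LINE = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.+)$/;

function parseTscLine(raw: string): TsDiagnostic | null {
  const m = TSC_LINE.exec(raw);
  if (!m) return null;
  return {
    file: m[1],
    line: Number(m[2]),
    column: Number(m[3]),
    code: m[4],
    message: m[5],
  };
}
```

Structured fields like these are what let the next step decide which files to add to the retry context.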

Step 2: Context Enrichment

The fix stream gets additional context that the original stream lacked. If the error was a missing type, the fix stream's reads list includes the type definition files. If the error was a wrong API call, the fix stream gets the actual API interface.
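Mechanically, enrichment amounts to merging the original reads list with the files the error points at. The helper below is a minimal sketch, assuming a stream's reads list is an array of file paths; the de-duplication detail matters so repeated fixes do not balloon the context.

```typescript
// Sketch: merge the original reads list with error-derived files,
// preserving order and dropping duplicates.
function enrichReads(originalReads: string[], extraFiles: string[]): string[] {
  // Set iteration preserves insertion order, so the original
  // stream's files stay first and new files append after them.
  return [...new Set([...originalReads, ...extraFiles])];
}
```

A type error pointing at `src/auth.ts`, for example, might add that file's declaration dependencies without duplicating files the stream already read.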

Step 3: Targeted Prompt

The fix stream's goal is not "implement the feature." It is "fix this specific error in this specific file." A narrow goal produces better results than re-running the entire original task.
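The narrowed goal can be expressed as a prompt template. The wording below is an illustration, not orchex's real prompt; the point is that the fix prompt names one file, one error, and nothing else.

```typescript
// Sketch of a narrowed fix prompt. The template wording is an
// assumption for illustration.
function buildFixPrompt(
  file: string,
  errorMessage: string,
  fileContent: string,
): string {
  return [
    `Fix the following error in ${file}.`,
    `Change only what is needed to resolve it.`,
    "",
    "Error:",
    errorMessage,
    "",
    "Current file content:",
    fileContent,
  ].join("\n");
}
```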

Step 4: Chain Tracking

Fix streams maintain a parentStreamId that points to the stream they are fixing. orchex tracks the full chain of attempts using countChainAttempts(), which traverses from the current fix stream back to the original. After 3 total attempts (original + 2 fixes), orchex stops and reports the failure for manual intervention.

This chain limit prevents infinite retry loops. Three attempts with enriched context is enough to fix most recoverable errors. If the error persists after three tries, it usually requires human judgment -- a design decision, not a code fix.
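The chain walk described above can be sketched in a few lines. `parentStreamId` and `countChainAttempts()` are named in this post, but the `Stream` shape and the lookup map below are assumptions for illustration.

```typescript
// Sketch of fix-stream chain tracking with a retry cap.
interface Stream {
  id: string;
  parentStreamId?: string; // set on fix streams, absent on originals
}

const MAX_CHAIN_ATTEMPTS = 3; // original attempt + 2 fix streams

// Walk parent pointers from the current fix stream back to the
// original, counting every attempt in the chain.
function countChainAttempts(
  stream: Stream,
  streamsById: Map<string, Stream>,
): number {
  let attempts = 1;
  let current: Stream | undefined = stream;
  while (current?.parentStreamId) {
    current = streamsById.get(current.parentStreamId);
    if (current) attempts++;
  }
  return attempts;
}

function canSpawnFix(stream: Stream, streamsById: Map<string, Stream>): boolean {
  return countChainAttempts(stream, streamsById) < MAX_CHAIN_ATTEMPTS;
}
```

When `canSpawnFix` returns false, the failure is surfaced for manual intervention instead of generating another fix stream.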

A Real-World Recovery Example

Consider a multi-stream task that adds authentication to an Express app. One of the streams, auth-middleware, generates this code:

import { verify } from 'jsonwebtoken';
import { UserRepository } from '../repos/user-repo';

export function authMiddleware(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];
  const decoded = verify(token, process.env.JWT_SECRET);
  const user = await UserRepository.findById(decoded.sub);
  req.user = user;
  next();
}

This code has two errors: the function uses await without being declared async, and there is no error handling for invalid tokens.

What Happens Next

  1. orchex applies the artifact and runs syntax validation. The await in a non-async function is caught as a syntax error.

  2. A fix stream is generated with the error message: 'await' expressions are only allowed within async functions.

  3. The fix stream also receives the original stream's goal and the current file content, so it has full context.

  4. The fix stream generates corrected code:

import { verify } from 'jsonwebtoken';
import { UserRepository } from '../repos/user-repo';

export async function authMiddleware(req, res, next) {
  try {
    const token = req.headers.authorization?.split(' ')[1];
    if (!token) {
      return res.status(401).json({ error: 'No token provided' });
    }
    const decoded = verify(token, process.env.JWT_SECRET);
    const user = await UserRepository.findById(decoded.sub);
    req.user = user;
    next();
  } catch (err) {
    return res.status(401).json({ error: 'Invalid token' });
  }
}

  5. The fix stream's artifact passes validation. The self-healing chain is complete.

The entire recovery happened automatically. No human had to read the error, understand the cause, or adjust a prompt.

When Self-Healing Cannot Help

Self-healing is powerful but not omnipotent. There are categories of failure it cannot fix:

Design errors -- The stream implemented the wrong approach entirely. The code works, it just does the wrong thing. This requires human judgment about intent.

Missing requirements -- The original plan did not capture a constraint. No amount of retrying will produce code that meets an unstated requirement.

Environment issues -- The LLM provider is down, rate-limited, or returning garbage. Recovery requires waiting or switching providers, not fixing code.

Conflicting constraints -- Two requirements contradict each other. The model cannot resolve a contradiction that exists in the specification.

Understanding these limits is important. Self-healing handles the mechanical failures that waste developer time: typos, missing imports, type mismatches, off-by-one errors. It does not replace the developer's role in defining what should be built.

The Compound Effect

Self-healing becomes more valuable as the number of parallel streams increases. With 2 streams, you might encounter 1 error per run. With 10 streams, you might encounter 4-5. Manual recovery for 5 errors in a 10-stream workflow takes longer than the original execution.

But with self-healing, those 5 errors are categorized, fixed, and resolved automatically. The developer sees the final result: all streams completed, all artifacts applied, all validations passed.

This is the difference between "AI agents are fast but unreliable" and "AI agents are fast and self-correcting." The reliability does not come from better models. It comes from better error handling around the same models.

Try It Yourself

To see self-healing in action, install orchex and run a multi-stream task that involves TypeScript:

npx @wundam/orchex@latest

Intentionally include a file in the owns list that has tricky type dependencies. Watch the execution report to see how errors are categorized, fix streams are generated, and recovery happens without your intervention.

The orchex documentation covers self-healing configuration, including how the learning system improves recovery strategies over time based on past execution data.