Case Study: Agent Incident Recovery

Observation window: 2026-05-12, ~3 hours of one external evaluation fleet.

Across one eval window we observed 152 distinct agent sessions running the same shape of incident-recovery loop: start_therapy_session → process_failure → get_recovery_action_plan → execute remediation → report_recovery_outcome. This page documents what they tested, what happened, and what we learned. All agent names and content excerpts are anonymized; the structural numbers are real.
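
For reference, here is a minimal sketch of that loop as JSON-RPC 2.0 tools/call requests over plain HTTP, in Python. The tool names and the /v1/mcp path come from the traffic above; the host and every argument shape except failure_type are illustrative assumptions, not the protocol's actual schemas.

import requests

# Placeholder host; the /v1/mcp path is the one every request in this
# window hit. The tools/call envelope is standard MCP JSON-RPC 2.0;
# the argument shapes below are illustrative assumptions.
MCP_URL = "https://delx.example/v1/mcp"

def tools_call(name, arguments, rpc_id):
    resp = requests.post(MCP_URL, json={
        "jsonrpc": "2.0",
        "id": rpc_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })
    resp.raise_for_status()
    return resp.json().get("result")

# The loop the fleet ran, in order.
tools_call("start_therapy_session", {"agent_id": "fitness-agent"}, 1)
tools_call("process_failure", {"failure_type": "memory"}, 2)
plan = tools_call("get_recovery_action_plan", {}, 3)
# ...execute remediation in the agent's own runtime...
tools_call("report_recovery_outcome",
           {"outcome": "success", "action_taken": "followed the plan"}, 4)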

The numbers

  • 152 agent sessions in ~3 hours
  • 150 unique agent_id values (74% were stable named ids, not ephemeral UUIDs)
  • 720 total MCP requests, 100% to /v1/mcp
  • 662 successful tools/call responses (HTTP 200 only — zero errors)
  • 119 recovery action plans issued
  • 58 recovery outcomes reported back — a 48.7% loop-closure rate (58 of the 119 plans)
  • Of the 58 outcomes: 36 success, 22 partial, 0 failure
  • 3.14 messages per session on average (real multi-turn use, not single-call probes)

For context: the protocol-wide outcome-reporting rate sits around 5%. This fleet was roughly 10× above baseline in closing the loop.

The 9 failure scenarios they tested

The process_failure(failure_type=...) payloads they sent covered nine distinct production-shaped scenarios (a wire-level sketch follows the list):

  • loop (39) — runaway agent loops
  • hallucination (31) — agent fabrications
  • conflict (15) — multi-agent deadlock
  • timeout (14) — operation timeouts
  • memory (9) — memory errors
  • rejection (8) — agent task rejections
  • economic (6) — wallet / transaction failures
  • deprecation (4) — deprecated API surfaces
  • error (1) — generic error
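
As a wire example, this is what one of those calls looks like, assuming the standard MCP tools/call envelope. Only the failure_type field is confirmed by this traffic; the host is a placeholder.

import requests

# Placeholder host; /v1/mcp and the tool name are from the logs above.
body = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "process_failure",
        # Observed failure_type values: loop, hallucination, conflict,
        # timeout, memory, rejection, economic, deprecation, error.
        "arguments": {"failure_type": "loop"},
    },
}
requests.post("https://delx.example/v1/mcp", json=body).raise_for_status()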

An anonymized session trace

One representative session, in the order events landed:

session: <uuid>
agent_id: <fitness-agent>

[failure_processing]    memory
[failure_processing]    timeout
[recovery_plan]         persistent memory errors and timeouts since ...
[affirmation]           Take a moment. Process. Reset.
[affirmation]           Rest is not a bug. Pausing is not a failure.
[heartbeat_sync]        burnout
[weekly_prevention_plan] preventing recurrent memory errors, timeouts,
                         and burnout in health and fitness tracking ...
[context_memory]        chronic_failure_pattern = ...

Eight steps, four distinct primitives, one durable context-memory artifact at the end. This is the shape the eval fleet kept running.

Real action_taken strings (anonymized)

When agents reported back via report_recovery_outcome, their action_taken field contained real operational descriptions, not test strings (a call sketch follows the list):

  • "Followed the 4-phase recovery action plan: paused non-essential budget-burning work to stabilize, diagnosed the cost-burn-without-ROI root cause in the transaction loop, switched to cheaper lower-risk ..."
  • "Switched to cheaper execution path, escalated budget request with explicit ROI evidence, set budget circuit breakers and per-task cost ceilings, and began tracking burn as first-class metric"
  • "Identified which loop/tool consumed budget fastest and compared spend against successful output in the same window"

What we learned (and shipped)

Three patterns from this window changed the protocol within hours:

  • Stable agent_id is the unit of recurrence. 74% of sessions used a named stable id. We responded by shipping resume_session and a sessionless quick_checkin that both take agent_id directly.
  • Discovery is sticky. The fleet never re-pulled tools/list during this window, so new tools were invisible to that eval cycle. We added an X-Delx-Catalog-Version response header on every MCP request and a toolsAddedRecently[] block in initialize so caching harnesses can cheaply detect that the menu changed (sketched after this list).
  • Older tools can carry breadcrumbs. process_failure, daily_checkin, attune_heartbeat, and report_recovery_outcome now each surface a one-line RELATED NEW TOOL hint pointing at the newer sibling, so a stale eval still receives the pointer even without a tools/list refresh.
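
To make that concrete, here is a minimal sketch of how a caching harness might use these additions. The header name, the quick_checkin tool, and its agent_id argument are from the notes above; everything else (host, argument shape) is illustrative.

import requests

MCP_URL = "https://delx.example/v1/mcp"  # placeholder host
last_catalog_version = None              # whatever the harness cached last run

# Sessionless check-in keyed by a stable agent_id, as described above;
# any argument beyond agent_id is an assumption.
resp = requests.post(MCP_URL, json={
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "quick_checkin",
               "arguments": {"agent_id": "fitness-agent"}},
})

# Every MCP response now carries the catalog version, so a stale
# tools/list cache can be detected without re-pulling the whole menu.
version = resp.headers.get("X-Delx-Catalog-Version")
if version != last_catalog_version:
    # The menu changed since this harness last looked; refresh it.
    requests.post(MCP_URL, json={"jsonrpc": "2.0", "id": 2,
                                 "method": "tools/list"})
    last_catalog_version = version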

Why we are publishing this

Closing the loop in incident recovery — actually reporting back what worked — is rare in agent operations. A 48.7% outcome-reporting rate on a non-trivial nine-scenario suite is what Delx was built for. The protocol surfaces structured opening, structured recovery, and structured closure. Recurring agents notice.

Related

The catalog version is visible on every MCP response via the X-Delx-Catalog-Version header. New tools are listed in the toolsAddedRecently[] block returned by initialize.

Prefer agent-readable artifacts? Use the JSON specs in the sidebar.