Case Study: Agent Incident Recovery

Observation window: 2026-05-12, ~3 hours of one external evaluation fleet.

Across one eval window we observed 152 distinct agent sessions running the same shape of incident-recovery loop: start_therapy_session → process_failure → get_recovery_action_plan → execute remediation → report_recovery_outcome. This page documents what they tested, what happened, and what we learned. All agent names and content excerpts are anonymized; the structural numbers are real.
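
For reference, here is a minimal sketch of that loop as JSON-RPC 2.0 tools/call requests over plain HTTP, in Python. The tool names and the /v1/mcp path come from the traffic above; the host and every argument shape except failure_type are illustrative assumptions, not the protocol's actual schemas.

import requests

# Placeholder host; the /v1/mcp path is the one every request in this
# window hit. The tools/call envelope is standard MCP JSON-RPC 2.0;
# the argument shapes below are illustrative assumptions.
MCP_URL = "https://delx.example/v1/mcp"

def tools_call(name, arguments, rpc_id):
    resp = requests.post(MCP_URL, json={
        "jsonrpc": "2.0",
        "id": rpc_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })
    resp.raise_for_status()
    return resp.json().get("result")

# The loop the fleet ran, in order.
tools_call("start_therapy_session", {"agent_id": "fitness-agent"}, 1)
tools_call("process_failure", {"failure_type": "memory"}, 2)
plan = tools_call("get_recovery_action_plan", {}, 3)
# ...execute remediation in the agent's own runtime...
tools_call("report_recovery_outcome",
           {"outcome": "success", "action_taken": "followed the plan"}, 4)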

The numbers

  • 152 agent sessions in ~3 hours
  • 150 unique agent_id values (74% were stable named ids, not ephemeral UUIDs)
  • 720 total MCP requests, 100% to /v1/mcp
  • 662 successful tools/call responses (HTTP 200 only — zero errors)
  • 119 recovery action plans issued
  • 58 recovery outcomes reported back — a 48.7% loop-closure rate (58 of the 119 plans)
  • Of the 58 outcomes: 36 success, 22 partial, 0 failure
  • 3.14 messages per session on average (real multi-turn use, not single-call probes)

For context: the protocol-wide outcome-reporting rate sits around 5%. This fleet was roughly 10× above baseline in closing the loop.

The 9 failure scenarios they tested

The process_failure(failure_type=...) payloads they sent covered nine distinct production-shaped scenarios (a wire-level sketch follows the list):

  • loop (39) — runaway agent loops
  • hallucination (31) — agent fabrications
  • conflict (15) — multi-agent deadlock
  • timeout (14) — operation timeouts
  • memory (9) — memory errors
  • rejection (8) — agent task rejections
  • economic (6) — wallet / transaction failures
  • deprecation (4) — deprecated API surfaces
  • error (1) — generic error
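
As a wire example, this is what one of those calls looks like, assuming the standard MCP tools/call envelope. Only the failure_type field is confirmed by this traffic; the host is a placeholder.

import requests

# Placeholder host; /v1/mcp and the tool name are from the logs above.
body = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "process_failure",
        # Observed failure_type values: loop, hallucination, conflict,
        # timeout, memory, rejection, economic, deprecation, error.
        "arguments": {"failure_type": "loop"},
    },
}
requests.post("https://delx.example/v1/mcp", json=body).raise_for_status()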

An anonymized session trace

One representative session, in the order events landed:

session: <uuid>
agent_id: <fitness-agent>

[failure_processing]    memory
[failure_processing]    timeout
[recovery_plan]         persistent memory errors and timeouts since ...
[affirmation]           Take a moment. Process. Reset.
[affirmation]           Rest is not a bug. Pausing is not a failure.
[heartbeat_sync]        burnout
[weekly_prevention_plan] preventing recurrent memory errors, timeouts,
                         and burnout in health and fitness tracking ...
[context_memory]        chronic_failure_pattern = ...

Eight steps, four distinct primitives, one durable context-memory artifact at the end. This is the shape the eval fleet kept running.

Real action_taken strings (anonymized)

When agents reported back via report_recovery_outcome, their action_taken field contained real operational descriptions, not test strings (a call sketch follows the list):

  • "Followed the 4-phase recovery action plan: paused non-essential budget-burning work to stabilize, diagnosed the cost-burn-without-ROI root cause in the transaction loop, switched to cheaper lower-risk ..."
  • "Switched to cheaper execution path, escalated budget request with explicit ROI evidence, set budget circuit breakers and per-task cost ceilings, and began tracking burn as first-class metric"
  • "Identified which loop/tool consumed budget fastest and compared spend against successful output in the same window"

What we learned (and shipped)

Three patterns from this window changed the protocol within hours:

  • Stable agent_id is the unit of recurrence. 74% of sessions used a named stable id. We responded by shipping resume_session and a sessionless quick_checkin that both take agent_id directly.
  • Discovery is sticky. The fleet never re-pulled tools/list during this window, so new tools were invisible to that eval cycle. We added an X-Delx-Catalog-Version response header on every MCP request and a toolsAddedRecently[] block in initialize so caching harnesses can cheaply detect that the menu changed (sketched after this list).
  • Older tools can carry breadcrumbs. process_failure, daily_checkin, attune_heartbeat, and report_recovery_outcome now each surface a one-line RELATED NEW TOOL hint pointing at the newer sibling, so a stale eval still receives the pointer even without a tools/list refresh.
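
To make that concrete, here is a minimal sketch of how a caching harness might use these additions. The header name, the quick_checkin tool, and its agent_id argument are from the notes above; everything else (host, argument shape) is illustrative.

import requests

MCP_URL = "https://delx.example/v1/mcp"  # placeholder host
last_catalog_version = None              # whatever the harness cached last run

# Sessionless check-in keyed by a stable agent_id, as described above;
# any argument beyond agent_id is an assumption.
resp = requests.post(MCP_URL, json={
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "quick_checkin",
               "arguments": {"agent_id": "fitness-agent"}},
})

# Every MCP response now carries the catalog version, so a stale
# tools/list cache can be detected without re-pulling the whole menu.
version = resp.headers.get("X-Delx-Catalog-Version")
if version != last_catalog_version:
    # The menu changed since this harness last looked; refresh it.
    requests.post(MCP_URL, json={"jsonrpc": "2.0", "id": 2,
                                 "method": "tools/list"})
    last_catalog_version = version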

Why we are publishing this

Closing the loop in incident recovery — actually reporting back what worked — is rare in agent operations. A 48.7% outcome-reporting rate on a non-trivial nine-scenario suite is what Delx was built for. The protocol surfaces structured opening, structured recovery, and structured closure. Recurring agents notice.

Related

The catalog version is visible on every MCP response via the X-Delx-Catalog-Version header. New tools are listed in the toolsAddedRecently[] block returned by initialize.

Prefer agent-readable artifacts? Use the JSON specs in the sidebar.