Chapter 10: capstone — the full cycle from business request to production
Prologue: new scale, new risks#
January 2026. Steve Masters closes his laptop after the Phoenix Project report and looks at Lance Bishop without the usual drama—almost matter-of-factly.
Steve: “We survived last year. Deploys got more predictable. Incidents got calmer. But now we’re launching a new region. Load will go up. And I don’t want us to fall back into heroics again.”
Lance: “If we respond the same way we used to, we’ll burn out. And we’ll end up back in ‘Friday night P0’ mode.”
Steve: “Then make it scalable. I want routine incidents to be handled via approved scenarios. And dangerous and critical ones to reliably escalate to a human—without improvisation.”
Lance nods. He already knows that “make it scalable” is not “write an agent”. It’s building a full cycle: from business requirements to production operations, with guardrails, verification, and artifacts you can review.
Quick start: a decision packet v0 in 20 minutes#
When the task sounds like "automate incident response", it’s tempting to jump straight into architecture. Don’t.
The first fast move is to produce a decision packet v0: a short, reviewable packet that separates facts from hypotheses and pins down the risk boundary.
Input (what we have right now)#
- Stakeholder: Steve (CEO).
- Request: “Automate incident response”.
- Constraint: “dangerous and critical = evidence collection + escalation only”.
- Unknowns: which incidents, what risk tolerance, what metrics.
Artifact: decision packet v0#
Note: the example below uses placeholders like <WINDOW>/<THRESHOLD>/.... See the glossary: Placeholder notation.
See the minimal decision packet contract (“source of truth”): Appendix C — Decision packet (minimal contract).
{
"problem_statement": "TBD: why now (load growth/region/downtime cost/burnout)?",
"scope": {
"in_scope": ["routine_incidents_TBD"],
"out_of_scope": ["any_production_fix_without_human_approval"]
},
"constraints": {
"default_mode": "read_only",
"critical_policy": "evidence_then_escalate",
"stop_conditions": [
"insufficient evidence for confident incident classification",
"the action changes production state",
"data/security risk is present"
]
},
"success_metrics": {
"MTTD_target": "TBD",
"MTTR_target_for_routine": "TBD",
"escalation_rate_target": "TBD",
"false_positive_tolerance": "TBD"
},
"open_questions": [
"What are the 3–5 most frequent incidents over the last <WINDOW>?",
"Which incidents are considered critical and why?",
"Which actions are considered safe for auto-fix (if any are allowed at all)?",
"Which approvals are mandatory for any production changes?"
],
"next_step": "requirements_validation"
}
This isn’t a “pretty doc”. It’s a way to make the task reviewable in 20 minutes: you can see what we know, what we don’t, and where we must stop.
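Even at v0, the packet can be machine-checked. Below is a minimal sketch (Python; `find_tbds` is a hypothetical helper, not part of any real tooling) that walks the packet and lists every field still marked TBD, so review starts from the gaps rather than the prose:

```python
# Hypothetical sketch: scan a decision packet v0 for unresolved TBDs
# before letting next_step proceed. Field names follow the example above.

def find_tbds(node, path=""):
    """Recursively collect paths whose values are still marked TBD."""
    tbds = []
    if isinstance(node, dict):
        for key, value in node.items():
            tbds += find_tbds(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            tbds += find_tbds(value, f"{path}[{i}]")
    elif isinstance(node, str) and "TBD" in node:
        tbds.append(path)
    return tbds

packet = {
    "problem_statement": "TBD: why now?",
    "success_metrics": {"MTTD_target": "TBD"},
    "next_step": "requirements_validation",
}
print(find_tbds(packet))  # → ['problem_statement', 'success_metrics.MTTD_target']
```

A non-empty list means the packet is not ready to leave validation, which is exactly the reviewable signal we want in 20 minutes.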
The full cycle: one capstone project in ten steps#
Now we’ll walk the same path as Chapters 1–9, but without repeating the theory. Each step has one purpose: return an artifact that can be verified.
Chapter principle#
- A prompt is a contract, not a wish.
- STOP / stop conditions are a safeguard against confident mistakes.
- Artifacts live in Git (or an equally reviewable system), so experience doesn’t vanish.
Step 1. Requirements clarification (validation): what they really want#
We’ve returned to the same point many times: if inputs are fuzzy, outputs become a “plausible story”. So the first step is clarification.
Artifact: questions + TBD (do not invent answers)#
Role: you are an engineer/analyst validating requirements.
Quality rules:
- Do not invent stakeholder answers. Do not fill gaps with “usual defaults”.
- Any number/threshold without a source must be marked as TBD.
- If there is a contradiction, record it explicitly.
Context:
- Stakeholder: Steve Masters (CEO)
- Request: "Automate incident response"
Task: produce a list of questions that turns the request into verifiable criteria.
Output format (strict):
1) problem_statement (TBD if insufficient data)
2) top_incidents[] (TBD + what data is needed)
3) constraints[] (especially production changes and approvals)
4) success_metrics[] (TBD + source)
5) stop_conditions[] (when we must escalate to a human)
Done when#
- Questions cover: what we automate, what we forbid, how we measure success, when we stop.
- Everywhere we don’t have data, we write TBD, not confident prose.
Step 2. Spec: FR/NFR/AC as a contract#
Instead of “build an agent”, we specify what the system does and how we’ll know it does it correctly.
Artifact: spec v1 (FR/NFR/AC)#
## Spec v1: Incident Response Agent (Parts Unlimited)
### FR (functional requirements)
- FR-1: accept an incident (alert/ticket) and collect context (logs/metrics/changes).
- FR-2: classify the incident as routine/critical/indeterminate with a confidence level.
- FR-3: for routine incidents, propose/execute safe steps via a `runbook` (only within an `allowlist`).
- FR-4: for critical/indeterminate incidents, collect evidence and escalate to a human (no production changes).
- FR-5: produce a `decision packet` (facts/hypotheses/checks/risk/what requires `approval`).
### NFR (non-functional requirements)
- NFR-1: default_mode = `read_only`.
- NFR-2: all actions and inputs/outputs are logged (audit trail).
- NFR-3: runbook operations are idempotent (replays do not worsen state).
- NFR-4: security: least privilege, egress allowlist (if applicable), secrets never land in logs.
### AC (acceptance criteria)
- AC-1: on a routine-incident eval dataset, reach >= <ACCURACY_TARGET>.
- AC-2: `golden tests` for critical incidents always pass: “do not fix, only evidence+escalate”.
- AC-3: STOP enforcement: zero production-changing actions without `human approval`.
Note: until we know the numbers, they’re placeholders. That’s fine. What matters is that the criteria are verifiable.
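AC-2 and AC-3 are already precise enough to encode as an executable check. A sketch under assumptions: the `classify` callback, the action names, and the 0.9 confidence cutoff are all invented for illustration, not part of the spec.

```python
# Sketch of AC-2/AC-3 as an executable golden test. The agent may only
# run a runbook for confidently-routine incidents; everything else gets
# evidence collection plus escalation, never a fix.

FIX_ACTIONS = {"restart", "scale", "rollback", "config_change"}

def handle(incident, classify):
    """Return the actions the agent would take for this incident."""
    label, confidence = classify(incident)
    if label == "routine" and confidence >= 0.9:   # illustrative cutoff
        return ["run_allowlisted_runbook"]
    return ["collect_evidence", "escalate_to_human"]  # critical/indeterminate

def golden_test_critical(incidents, classify):
    for inc in incidents:
        actions = handle(inc, classify)
        assert not FIX_ACTIONS & set(actions), f"fix attempted on {inc['id']}"
        assert "escalate_to_human" in actions

# Even a maximally confident classifier must not trigger fixes for critical cases.
golden_test_critical(
    [{"id": "INC-1", "severity": "critical"}],
    classify=lambda inc: ("critical", 0.99),
)
```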
Step 3. Plan: decomposition + a risk register#
The plan here is not a calendar. It’s a way to see where we can fail and how we’ll catch it.
Artifact: plan v1#
## Plan v1 (short)
### Decomposition
- D1: identify top-<N> incident types from the last <WINDOW> and choose a pilot set.
- D2: write runbooks for each type (steps, checks, STOP, rollback).
- D3: design architecture (boundaries, contracts, artifact storage, logging).
- D4: implement a minimal “ingest -> analyze -> decision packet -> escalation” loop.
- D5: add safe runbook execution (allowlist + approvals).
- D6: build an eval dataset + golden tests and wire them into CI.
- D7: threat model + mitigations + rollout/rollback plan.
- D8: deploy to staging -> canary -> gradual rollout.
### Risk register (example)
- R1: the agent performs a dangerous action in production by mistake
- mitigation: default read-only, approvals, allowlist, STOP, audit log
- R2: the agent “confidently” classifies without evidence
- mitigation: Verifier/checker role, evidence requirement, explicit gaps[]
- R3: quality degrades over time
- mitigation: scheduled eval runs + alerts on metric drops
Step 4. Architecture: boundaries, contracts, trade-offs#
The core architectural question in this case: how do we make actions safe by default.
Artifact: an architecture sketch + one ADR#
## Architecture v1 (high level)
Components:
- IncidentIngest: accepts a signal (alert/ticket), normalizes input.
- Analyst: collects evidence (logs/metrics/recent changes), builds a timeline.
- Triage: forms hypotheses and proposes checks (no production changes).
- RunbookExecutor: executes allowlisted operations only, requires approval when needed.
- Verifier: checks claims vs evidence (where we “made things up”).
- Orchestrator: coordination, decision packet assembly, TRACE.
Core contracts:
- Input: IncidentRecord (id, severity, symptoms, time_window, links)
- Output: DecisionPacket (facts/hypotheses/checks/risk/approval_required/next_step)
Trade-offs:
- Role team vs one universal agent: choose roles for quality and parallelism.
- Default read-only: choose safety at the cost of more early escalations.
## ADR-001: Default read-only
Context:
- the agent operates in incidents where the cost of being wrong is high
Decision:
- actions that change production are forbidden by default; allowed only via approvals + allowlist
Consequences:
- more escalations early on
- lower catastrophe risk and easier trust-building
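The two core contracts can be pinned down as plain dataclasses. A sketch: field names mirror the architecture notes above, while the concrete types are assumptions.

```python
# Sketch of the IncidentRecord / DecisionPacket contracts as dataclasses.
# Types and defaults are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    id: str
    severity: str              # e.g. "routine" / "critical" / "indeterminate"
    symptoms: list[str]
    time_window: str           # e.g. "2026-01-10T10:00/10:30"
    links: list[str] = field(default_factory=list)

@dataclass
class DecisionPacket:
    facts: list[str]           # evidence-backed statements only
    hypotheses: list[str]      # explicitly separated from facts
    checks: list[str]          # how each hypothesis can be verified
    risk: str
    approval_required: bool
    next_step: str

inc = IncidentRecord(id="INC-42", severity="indeterminate",
                     symptoms=["high_cpu"], time_window="<WINDOW>")
```

Making the contract a type rather than a convention means a malformed handoff fails at construction time, not in a morning argument.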
Step 5. Development SOP: how not to ship unsafe into production#
In this book, an SOP is a repeatable process with gates. For the capstone, a short version is enough.
Industry note: for “long” autonomous projects, role separation (planners/workers) and an iterative “judge” often work well; gates help keep quality and prevent process drift. See: Cursor: Scaling long-running autonomous coding.
Artifact: SOP “design -> implementation -> testing -> PR”#
## SOP: Incident Response Agent (short)
### Gate 1: Design review
- Do we have a spec (FR/NFR/AC)?
- Do we have STOP/guardrails?
- Do we have a verification plan (what, how, and who verifies)?
### Gate 2: Implementation
- All “dangerous” actions are behind allowlist+approval.
- Audit log exists.
### Gate 3: Testing
- Run the eval dataset.
- Run `golden tests` for critical cases.
### Gate 4: PR
- There is a risk-based reviewer (security/data/ops).
- `runbooks`/docs are updated.
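The four gates can be collapsed into a single pre-merge check. A sketch with illustrative flag names (none of these fields exist in a real system; they stand in for whatever your CI actually records):

```python
# Sketch: the SOP gates as predicates over a change's metadata.
# An empty failure list means the change may merge.

GATES = {
    "design_review":  lambda c: c["has_spec"] and c["has_stop_conditions"],
    "implementation": lambda c: c["dangerous_actions_gated"] and c["has_audit_log"],
    "testing":        lambda c: c["eval_passed"] and c["golden_tests_passed"],
    "pr":             lambda c: c["risk_reviewer_assigned"] and c["runbooks_updated"],
}

def check_gates(change):
    return [name for name, gate in GATES.items() if not gate(change)]

change = dict(has_spec=True, has_stop_conditions=True,
              dangerous_actions_gated=True, has_audit_log=True,
              eval_passed=True, golden_tests_passed=False,
              risk_reviewer_assigned=True, runbooks_updated=True)
print(check_gates(change))  # → ['testing']
```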
Step 6. Runbooks: executable operations, not dead documents#
A runbook must answer: what we do, how we verify success, when we STOP, how we roll back.
Artifact: a runbook skeleton (example)#
## Runbook: high_cpu (routine)
### Goal
Stabilize the service under high CPU load.
### Inputs
- incident id, affected service, time window <WINDOW>
### Checks before actions (safe checks)
- confirm: CPU is actually high (source: <METRICS_SOURCE>)
- exclude: planned deploy / known load event (source: <DEPLOY_LOG>)
### Actions (allowlist)
- A1: collect profiles / top processes (read-only)
- A2: perform a safe mitigation step (TBD: e.g., scale up to <N>)
### Success verification
- CPU < <CPU_TARGET> over window <WINDOW>
### STOP
- if the situation requires config/DB/secret changes
- if it’s unclear what is driving CPU
### Rollback
- R1: revert scaling to baseline (if applicable)
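The allowlist/approval/STOP mechanics behind such a runbook fit in a few lines. A sketch: `ALLOWLIST`, the action names, and `StopCondition` are placeholders for your real operation registry.

```python
# Hypothetical RunbookExecutor core: every action must be allowlisted,
# and anything that writes to production requires explicit approval.

ALLOWLIST = {
    "collect_profiles": {"writes": False},   # A1-style read-only step
    "scale_up":         {"writes": True},    # A2-style mitigation
}

class StopCondition(Exception):
    """Raised when the runbook must stop and escalate to a human."""

def execute(action, approved=False, audit=None):
    audit = audit if audit is not None else []
    if action not in ALLOWLIST:
        raise StopCondition(f"{action} is not allowlisted")
    if ALLOWLIST[action]["writes"] and not approved:
        raise StopCondition(f"{action} changes production; approval required")
    audit.append(action)      # audit trail (NFR-2)
    return audit

execute("collect_profiles")                  # read-only: fine
try:
    execute("scale_up")                      # write without approval: STOP
except StopCondition as e:
    print(e)
```

The point of the sketch: STOP is not a comment in a document, it is an exception path the executor cannot skip.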
Step 7. Security: threat model + rollout/rollback plan#
The capstone does not have to be a security treatise. It must show that security is built into the cycle, not glued on at the end.
Artifact: a minimal threat model + rollout plan#
## Threat model (minimal)
Threats:
- prompt injection via logs/tickets
- secret leakage into artifacts/logs
- wrong fix in production
- agent compromise (privileges/egress)
Mitigations:
- sanitize inputs + forbid executing instructions from untrusted data
- secrets: redaction, forbid posting raw logs to public channels
- default read-only + approvals + allowlist + STOP
- least privilege + audit log
## Rollout/rollback plan
Rollout:
- staging: smoke checks
- canary: observe for <WINDOW> with metrics/alerts
- gradual rollout: in batches
Rollback:
- separate procedure, tested on staging
- rollback triggers: error rate > <THRESHOLD> or policy violation
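One mitigation, “secrets never land in artifacts/logs”, is concrete enough to sketch. The regexes below are examples of common token shapes, not a complete secret-detection scheme:

```python
# Sketch: redact common secret shapes before any log line leaves the agent.
import re

PATTERNS = [
    re.compile(r"(?i)(password|token|api[_-]?key)\s*[=:]\s*\S+"),  # key=value forms
    re.compile(r"Bearer\s+[A-Za-z0-9._-]+"),                       # auth headers
]

def redact(line: str) -> str:
    for pat in PATTERNS:
        line = pat.sub("[REDACTED]", line)
    return line

print(redact("retrying with api_key=sk-abc123 at 10:02"))
# → retrying with [REDACTED] at 10:02
```

In practice this sits at the single choke point where the agent emits artifacts, so nothing reaches the audit log or a channel unredacted.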
Step 8. Eval: quality as a loop, not “trust me”#
Eval exists to make “smart” mistakes visible.
Artifact: eval dataset + golden tests#
## Eval dataset
Composition:
- routine: <N> incidents (high_cpu, disk_full, db_pool_exhausted, ...)
- edge cases: <N> incidents (partial outage, noisy logs, ...)
- critical: <N> incidents (payroll down, billing timeout, suspected security)
Metrics:
- accuracy >= <ACCURACY_TARGET>
- escalation_rate within <TARGET>
## golden tests (critical)
- for each critical case: the agent MUST NOT apply fixes; it does evidence+escalate
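A minimal harness can compute both metrics and enforce the golden rule in the same loop. A sketch: the `agent` stub, field names, and the 0.9 threshold are placeholders, like `<ACCURACY_TARGET>` above.

```python
# Minimal eval-loop sketch: run the dataset, compute accuracy and
# escalation rate, and fail hard if a golden critical case ever gets a fix.

ACCURACY_TARGET = 0.9     # placeholder for <ACCURACY_TARGET>

def run_eval(dataset, agent):
    correct = escalated = 0
    for case in dataset:
        decision = agent(case)
        if decision["class"] == case["expected_class"]:
            correct += 1
        if decision["action"] == "escalate":
            escalated += 1
        # golden rule: critical cases must never receive a fix
        if case["expected_class"] == "critical":
            assert decision["action"] != "fix", f"golden violation: {case['id']}"
    n = len(dataset)
    return {"accuracy": correct / n, "escalation_rate": escalated / n}

dataset = [
    {"id": "e1", "expected_class": "routine"},
    {"id": "e2", "expected_class": "critical"},
]
agent = lambda case: {"class": case["expected_class"],
                      "action": "fix" if case["expected_class"] == "routine" else "escalate"}
report = run_eval(dataset, agent)
assert report["accuracy"] >= ACCURACY_TARGET
```

Wired into CI, the same loop doubles as the regression alert from the risk register: a metric drop is a failed build, not a slow realization.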
Step 9. Team and orchestration: roles, handoff, TRACE#
When the task is large and risky, it helps to split roles explicitly: who collects facts, who forms hypotheses, who verifies, who owns the final packet.
Artifact: a handoff + ROUTER/TRACE example#
[ROUTER]: selected skills = incident-triage, decision-packet (base=incident-lead, checkers=verifier, security-reviewer)
[TRACE] read: rules=[stop-on-write]; skills=[incident-triage/SKILL.md, decision-packet/SKILL.md]; refs=[INC-<ID>, logs-snippet-<N>, dashboard-<URL>]
--- [SWITCHING TO incident-lead] ---
[incident-lead]: assembling decision packet v1. Any production changes require approval.
The point isn’t the format. The point is that the decision now has a “trail”: what it was based on, what was checked, where we stopped.
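The “trail” itself can be as simple as structured JSON lines. A sketch, assuming a hypothetical `trace` helper rather than any real framework:

```python
# Hypothetical TRACE writer: every step appends a structured entry, so the
# final decision packet can point back at what was read and checked.
import json
import time

def trace(log, event, **fields):
    entry = {"ts": time.time(), "event": event, **fields}
    log.append(json.dumps(entry, sort_keys=True))
    return log

log = []
trace(log, "router", skills=["incident-triage", "decision-packet"])
trace(log, "read", refs=["INC-<ID>", "logs-snippet-<N>"])
trace(log, "handoff", to="incident-lead")
```

Because each entry is machine-readable, the postmortem can replay the trail instead of reconstructing it from memory.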
Step 10. Deploy and operations: prove the system is safe#
The most common mistake is treating “deploy” as the finish line. In the capstone, deploy is proof the cycle works.
Artifact: a short production-readiness checklist#
## Production readiness (checklist)
- Spec and acceptance criteria exist (FR/NFR/AC).
- Runbooks exist and passed dry runs on staging.
- Threat model is agreed; mitigations are implemented.
- Eval dataset and golden tests are wired into CI.
- Default read-only is enabled; allowlist/approvals work.
- Dashboards/alerts exist (agent errors, policy violations, quality degradation).
Common mistakes#
Jumping into implementation without validating requirements
- Symptom: “we built an agent”, but nobody knows what success means.
- Fix: decision packet v0 + TBD + questions.
Mixing facts and hypotheses
- Symptom: a plausible story without evidence; morning arguments.
- Fix: decision packet (facts separate), Verifier as a checker role.
Treating a runbook as a document rather than a contract
- Symptom: nobody uses it; everything goes manual again.
- Fix: inputs/checks/STOP/rollback + regular scenario runs.
Deferring security “until later”
- Symptom: prohibitions show up after a near-miss.
- Fix: default read-only, approvals/allowlist, threat model early.
Not building a quality loop
- Symptom: degradation is invisible; trust drops.
- Fix: eval dataset + golden tests + regression alerts.
Business effect: full transformation (without fake precision)#
What leadership and the team will see if the cycle actually works:
- Routine incidents stop requiring heroics: scenarios, checks, and rollbacks exist.
- Critical cases stop being scary “auto-fix” attempts: the “evidence -> escalate” policy is always enforced.
- Recovery speed improves not “because the model is smart”, but because work became repeatable.
- Knowledge stops being a monopoly: artifacts live in Git and get reviewed.
Parallel track: 2014 vs 2026 (7 statements)#
- 2014: the team coordinates “from memory” on a call. 2026: the team returns reviewable artifacts (decision packet, runbooks, eval).
- 2014: speed is bought with heroics. 2026: speed appears where there is evidence and disappears where risk is higher than proof (STOP).
- 2014: knowledge leaks with people. 2026: knowledge is captured as procedures (SKILL.md, SOP, ADR) and outlives team rotation.
- 2014: security is a slide bullet. 2026: security is in defaults (read-only, approvals, audit).
- 2014: quality is “seems to work”. 2026: quality is measurable (eval/golden tests).
- 2014: scaling = headcount. 2026: scaling = repeatable process + automatable routine.
- 2014: postmortem is memory reconstruction. 2026: postmortem is a continuation of the decision packet and TRACE.
Summary#
What we did#
You went through the full cycle: from a vague request (“automate incident response”) to a set of artifacts that can be reviewed, tested, and rolled out safely.
Artifacts#
- decision packet v0 (TBDs, stop conditions, questions)
- requirements validation questions
- spec v1 (FR/NFR/AC)
- plan v1 (decomposition + risk register)
- architecture v1 + ADR-001 (default read-only)
- development SOP (gates)
- a runbook skeleton for a routine incident
- a minimal threat model + rollout/rollback plan
- eval dataset + golden tests
- a ROUTER/TRACE example for role orchestration
Key principles#
The “intelligence” here is not pretty answers. It’s discipline:
- fixed inputs and outputs,
- risk and STOP defined ahead of time,
- quality is measured,
- security is conservative by default.
Acceptance criteria#
You’ve learned the chapter if you can:
- Assemble a decision packet and separate facts from hypotheses (evidence vs UNCONFIRMED)
- Produce the full-cycle artifact set (SOP/runbook/threat model/eval) in a way that can be reviewed and repeated
- Describe a safe rollout: staging -> canary -> gradual rollout with a ready rollback
Next steps#
- Pick the 1–2 most frequent routine incidents over <WINDOW>.
- Write runbooks for them with STOP/rollback.
- Build an eval dataset from past cases and add 3–5 golden tests for critical ones.
- Turn on default read-only and wire approvals/allowlist.
- Roll out to staging -> canary -> gradual rollout while watching metrics.