Appendix C: development process and project artifacts
This document is a practical map: which artifacts you need in a healthy engineering process, why you need them, and which risks they mitigate.
This is not paperwork for paperwork’s sake. In this book, artifacts are evidence: observable signals that you control quality, safety, and change delivery.
1) Process map (end-to-end)#
Idea and stakes
|
v
Spec
|
v
Plan
|
v
Architecture
|
v
Implementation (via SOP)
|
v
Verification
|
v
Deploy and rollback
|
v
Operations and incidents
|
v
Improvements
|
└───────────────> back to Spec
See also: the reference full cycle in Chapter 10.
2) Artifacts by stage (what, why, owner, definition of done)#
Below is the minimal sufficient set. It can be thicker, but it must not be empty.
Where artifacts live (Git-first so it remains stable)#
Process artifacts must be long-lived: survive task tracker churn, be versioned, and be reviewable.
- Git (GitLab/GitHub) is the source of truth for artifacts: SOPs, prompts, DoD/checklists, runbooks, ADRs (versioning, review, history, linkage to code changes).
- Task tracker (Kaiten/Jira) manages work: cards/status/discussion, plus links to specific versions of artifacts in Git (commit/tag) and evidence (PR/CI/verification output).
Note: if you use Agent Skills (SKILL.md folders) as a portable procedure format, the same rule applies—SKILL.md must live in Git and go through review. It is part of a production process (instructions, templates, checks), not chat notes.
2.1 Idea / framing#
Short summary (brief, no fluff)
- Why: capture why we do this, who hurts, what changes.
- Owner: initiator, product/project owner, tech lead.
- Done when: a few crisp bullets for why/what/how we measure + explicit constraints.
- See also: Chapter 10 (validation), Appendix A (business case).
Terms and the execution contract (short, so we don’t get confused)
- AC (acceptance criteria): what should happen (expected behavior/outcome).
- DoD (definition of done): what evidence is sufficient to call it done (a few verifiable items).
- Verification plan: how we will obtain that evidence (tests/metrics/sampling/edge cases).
- STOP: when the executor must stop and escalate (unclear context, high risk, no way to verify).
- TRACE: what the answer stands on: which rules/SKILL.md files/artifacts were actually used. Practice: keep TRACE next to the main artifact (answer/decision packet) and attach it to the ticket/PR as evidence.
- Why it matters: humans can often “stretch” AC through context; automation cannot, so DoD + verification plan + STOP is the minimal contract.
AC (acceptance criteria)
- Why: make “done” checkable (not “do it well”).
- Owner: author + reviewer (acceptance).
- Done when: criteria are measurable/verifiable; include negative scenarios and edge cases.
- See also: Chapter 3, Chapter 10.
DoD (definition of done)
- Why: standardize delivery quality (code, tests, verification, rollback, observability).
- Owner: the team (process), tech lead/EM (enforcement).
- Done when: DoD is short, mandatory, and actually checked.
- See also: Chapter 5 (gates), Chapter 9 (enforcement).
2.2 Spec#
Spec (FR/NFR/AC)
- Why: define scope, tradeoffs, and success criteria before implementation.
- Owner: change author (often Senior/Staff) + stakeholders.
- Done when: FR/NFR, AC, constraints, and explicit assumptions exist.
- See also: Chapter 3, Chapter 10.
Risk checklist (inside spec / SOP / PR)
- Why: capture “what can go wrong” and mitigations without extra bureaucracy.
- Owner: author + reviewers.
- Done when: top risks are listed and each has a mitigation + a verification/monitoring method.
- See also: Chapter 3, Chapter 5, Chapter 7.
2.3 Planning#
Work plan (decomposition)
- Why: make the path to the outcome manageable (not “we’ll see”).
- Owner: author + team.
- Done when: phases, dependencies, critical path, and gates are defined.
- See also: Chapter 3.
Practical note: treat the plan as a reviewable artifact, not as “chat output.” You can revise it and rerun the execution loop: if the agent is building the wrong thing, it’s often faster to restart from a tighter plan (inputs/guardrails/DoD/verification) than to patch the output through back-and-forth. See Cursor: Best practices for coding with agents (sections on planning and “starting over from a plan”).
2.3.1 Optional strengthening by risk/maturity triggers#
Risk register (as a separate artifact)
- Why: portfolio-level risk control when risks repeat and you need a unified view.
- When: when a risk class repeats, spans multiple teams, or the blast radius is high.
- Owner: author + reviewers; at portfolio level: tech lead/EM/practice owner.
- Done when: every risk has mitigation, triggers, and a verification/monitoring method.
- See also: Chapter 3, Chapter 7.
2.4 Architecture#
Architecture document (v1)
- Why: define components, responsibility boundaries, and contracts explicitly.
- Owner: Staff/Principal (or assigned architect).
- Done when: components/boundaries are clear; tradeoffs and review triggers exist.
- See also: Chapter 4.
ADR (architecture decision record)
- Why: preserve “why we decided this” so you don’t reverse-engineer it later.
- Owner: decision author (often Staff/Principal) + reviewers.
- Done when: context → decision → consequences → review triggers.
- See also: Chapter 4.
2.5 Implementation and verification (via SOP and gates)#
Change SOP
- Why: make execution repeatable: steps, stop conditions, checks, outputs.
- Owner: change author; executor is an engineer (or automation).
- Done when: gates, verification plan, and stop conditions exist; rollback path is defined (usually revert → redeploy); a rollback plan is required when risk triggers apply.
- See also: Chapter 5.
Verification plan
- Why: prove correctness (sampling + edge cases + metrics).
- Owner: author; reviewer confirms.
- Done when: checks are repeatable and yield observable results.
- See also: Chapter 5, Chapter 10.
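A verification plan is easiest to keep repeatable when it is a small script rather than a checklist in someone’s head. The sketch below is illustrative: `normalize_amount` is a hypothetical change under verification, and the cases mix sampled inputs with edge cases (including a negative scenario), producing an observable report.

```python
# Hypothetical sketch: a repeatable verification run (sampling + edge cases).
# `normalize_amount` stands in for the change under verification.

def normalize_amount(raw: str) -> float:
    """The change under test: parse an amount string into a float."""
    return float(raw.replace(",", "").strip())

def run_verification(cases):
    """Execute each (input, expected) case; return an observable report."""
    report = {"passed": [], "failed": []}
    for raw, expected in cases:
        try:
            ok = normalize_amount(raw) == expected
        except Exception:
            ok = False
        report["passed" if ok else "failed"].append(raw)
    return report

# Sampling plus edge cases; the empty string is a deliberate negative scenario.
cases = [("1,200.50", 1200.5), ("  42 ", 42.0), ("0", 0.0), ("", None)]
report = run_verification(cases)
print(report)
```

The point is not the parsing logic but the shape: anyone (including automation) can rerun the same cases and get the same pass/fail evidence to attach to the PR.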
Verification roles (verifier / test runner)
- Why: make quality real: one role implements the change; another independently proves it works.
- Done when: there is a report of what passed/failed/was not checked + test run outputs (including golden tests when applicable).
- See also: Chapter 8.
PR + human review
- Why: quality control and shared accountability.
- Owner: PR author; reviewer is a human.
- Done when: description, verification plan, and evidence are attached; rollback path exists (and if required by risk triggers, a detailed rollback plan is attached/tested).
- See also: Chapter 5.
2.6 Deploy and change safety#
Threat model
- Why: list threats and mitigations up front (especially for automation and infra/access changes).
- Owner: system owner + security reviewers.
- Done when: assets/threats/vectors/mitigations/checks are described.
- See also: Chapter 7.
Change plan / deployment plan
- Why: safe rollout, risk minimization, observability.
- Owner: engineer/ops (DevOps/SRE) + reviewers.
- Done when: steps, gates, monitoring metrics, and rollback criteria exist.
- See also: Chapter 7.
Rollback (path vs plan)
- Why: return to a stable state quickly when things degrade.
- Owner: change author.
- Done when:
- rollback path exists always (usually revert → redeploy per your standard)
- a detailed rollback plan (steps, ETA, verification) and ideally a staged rollback test exist when risk triggers apply (data/migrations/security/payments/availability/irreversible ops)
- See also: Chapter 5, Chapter 7.
2.7 Operations and incidents#
Incident runbook
- Why: repeatable response: steps, windows, checks, escalation.
- Owner: SRE/on-call team; executor is an engineer (or automation).
- Done when: triage steps, fix steps, verification, and post-actions are defined.
- See also: Chapter 6.
Debugger as an RCA role
- Why: separate root-cause work from risky actions; produce a verifiable decision packet.
- See also: Chapter 6, Chapter 9.
Incident report
- Why: capture facts, root cause, actions, and improvements; do not lose knowledge.
- Owner: service team + SRE.
- Done when: timeline, root cause, fix, follow-ups, and verification of follow-ups exist.
- See also: Chapter 6.
2.8 Measurable quality#
Eval dataset + golden tests
- Why: measure quality and catch regressions before production.
- Owner: practice/system owner.
- Done when: eval set is representative; golden tests are stable and run regularly.
- See also: Chapter 8.
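A golden test, at its smallest, freezes a “known good” output and flags any drift from it. The sketch below uses hypothetical names (`GOLDEN`, `system_under_test`); in practice the snapshot lives in a versioned file next to the eval set.

```python
# Minimal golden-test sketch (names are illustrative, not from this book).
# A golden test compares current output against a frozen "known good"
# snapshot; any drift is a regression signal to review before shipping.

GOLDEN = {"greeting": "hello", "total": 3}  # normally loaded from a versioned file

def system_under_test():
    """Stand-in for the pipeline/prompt/function being guarded."""
    return {"greeting": "hello", "total": 3}

def golden_diff(actual, golden):
    """Return the set of keys whose values drifted from the golden snapshot."""
    return {k for k in golden if actual.get(k) != golden[k]}

drift = golden_diff(system_under_test(), GOLDEN)
assert not drift, f"regression against golden: {sorted(drift)}"
```

Stability matters more than coverage here: a golden test that flakes trains the team to ignore it, which defeats the purpose.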
2.9 Practice governance (so it works by default)#
Baseline + quality gates
- Why: make quality and safety mandatory, not optional.
- Owner: practice/platform owner (often Staff/Principal) + leadership.
- Done when: mandatory-gate rules exist and enforcement exists (for example, through CI/CD).
- See also: Chapter 9.
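“Enforcement exists” usually means a CI job that fails when evidence is missing. One minimal sketch, assuming the PR description must carry a few named sections (the section names here are assumptions, not a standard):

```python
# Hedged sketch of a mandatory gate enforced in CI: the pipeline fails
# unless the PR/MR description carries the required evidence sections,
# which makes the baseline non-optional by default.

REQUIRED_SECTIONS = ("## Verification plan", "## Rollback", "## DoD")

def missing_gates(pr_description: str) -> list:
    """Return the required sections absent from the PR description."""
    return [s for s in REQUIRED_SECTIONS if s not in pr_description]

def gate_exit_code(pr_description: str) -> int:
    """0 if all mandatory sections are present; 1 otherwise (CI fails on 1)."""
    return 1 if missing_gates(pr_description) else 0
```

Wired into the pipeline (the CI job feeds the description in and exits with the returned code), this turns “the team agreed to the baseline” into “the merge is blocked without it.”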
3) Minimal templates (minimally sufficient)#
3.1 DoD (minimal)#
## DoD (definition of done)
- [ ] Acceptance criteria (AC) are met and verified
- [ ] Verification plan is executed: sampling + edge cases
- [ ] No new linter warnings
- [ ] Rollback path is defined (usually revert → redeploy); rollback plan and rollback test exist when required by risk triggers
- [ ] `runbook` is updated (if the change affects operations)
- [ ] Risks/tradeoffs are documented (and decisions are captured when needed)
3.2 ADR (minimal)#
# ADR: <short decision title>
## Context
- <what is happening, constraints, key signals/facts>
## Decision
- <what we chose and why>
## Consequences
- <what got easier/harder, what debt/risks appear>
## When to revisit
- <triggers: load growth, new requirements, incidents, SLOs, etc.>
3.3 Decision packet (minimal contract)#
A decision packet makes a decision verifiable: it separates facts from hypotheses, captures risk, and clearly shows what requires human approval.
Minimal fields (meaning matters more than field names):
- summary: 1–3 sentences: what happened / what we do next.
- timeline[]: key events with timestamps.
- hypotheses[]: hypotheses with confidence + why (which facts support them).
- evidence[]: evidence (logs/metrics/dashboard links/PRs); no “seems like”.
- next_checks[]: next safe checks (safe by default).
- risk: what can go wrong (data/security/blast radius) and why it matters.
- rollback: how we stop degradation or roll back (if applicable).
- verification: how we will validate improvement (metrics/tests/sampling).
- requires_approval: what requires human approval and why.
- unknowns[]: what we do not know (and what we need in order to know it).
Example (JSON):
{
"summary": "INC-...: payroll degraded after deploy. Hypothesis: migration/schema. Next: safe checks, then escalation.",
"timeline": [
{"time": "02:10", "event": "deploy payroll-service v1.2.3"},
{"time": "02:12", "event": "5xx spike started"}
],
"hypotheses": [
{
"id": "H1",
"summary": "failed DB migration / schema mismatch",
"confidence": "MEDIUM",
"why": ["errors mention missing column", "deploy shortly before incident"]
}
],
"evidence": [
{"source": "logs", "snippet": "ERROR: column \"new_field\" does not exist"},
{"source": "dashboard", "link": "<URL>"}
],
"next_checks": [
{"id": "C1", "description": "confirm migration status", "safe": true}
],
"risk": {"data_risk": "unknown", "blast_radius": "payroll"},
"rollback": {"plan": "revert deploy (approval required)"},
"verification": {"success_criteria": ["5xx back to baseline", "latency p95 < <WINDOW>"]},
"requires_approval": {"required": true, "reason": "prod change / insufficient evidence"},
"unknowns": ["migration status", "rollback readiness"]
}
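Because the packet is a contract, it is worth checking mechanically before it is attached to a ticket/PR. A minimal validator sketch (field names follow the example above; the extra hypothesis check is an assumption about what “verifiable” should mean):

```python
# Sketch: validate that a decision packet carries the minimal contract
# fields before it is attached as evidence to a ticket/PR.

REQUIRED_FIELDS = (
    "summary", "timeline", "hypotheses", "evidence", "next_checks",
    "risk", "rollback", "verification", "requires_approval", "unknowns",
)

def validate_packet(packet: dict) -> list:
    """Return a list of problems; an empty list means the packet is acceptable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in packet]
    # Every hypothesis must state its confidence and the facts supporting it.
    for h in packet.get("hypotheses", []):
        if "confidence" not in h or not h.get("why"):
            problems.append(f"hypothesis {h.get('id', '?')} lacks confidence/why")
    return problems
```

Run as a pre-commit hook or CI step, this keeps “no ‘seems like’” from being a matter of reviewer vigilance.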
3.4 Skill Router: role routing protocol (copy-paste)#
This protocol is working discipline for an agent. Before any answer, the agent must select exactly one base role and 0..N checker roles (based on risks/touchpoints), then explicitly record TRACE (what was actually read/used).
ROLE ROUTING (MANDATORY)
Before ANY answer you MUST:
1) Run Skill Router and select roles:
- base role: exactly 1 main role
- checker roles: 0..N validating roles (based on risks/touchpoints)
2) Immediately print markers (Markdown):
- **[ROUTER]:** selected skills = <list> (base=<base>, checkers=<list|none>)
- **[TRACE]** read: rules=[...]; skills=[...]; refs=[...]
TRACE rules:
- list only actually read files (rules/skills/references)
- use workspace-relative paths for readability
- if there are many files: first N + “+X more”
3) Then write the answer using role markers only:
- **[<ROLE>]:** ...
- **--- [SWITCHING TO <ROLE>] ---**
- **--- [RETURNING TO <ROLE>] ---**
IF THE ROLE IS NOT DEFINED
A role is “not defined” when:
- no skill from the catalog matches the task/expected behavior
- or the user explicitly requests behavior/style not covered by available skills/project roles
Then you MUST:
- NOT perform the task
- ask the user to provide a Role Spec (template below)
- after receiving Role Spec, treat it as a “project role” and route to it alongside skills
PROJECT ROLES (sticky, per-project)
If this system prompt (or the conversation) contains Role Specs, treat them as available project roles and use them for routing.
ROLE SPEC TEMPLATE (user must provide)
ROLE: <RoleName>
WhoIAm: <1-2 lines>
Goal: <what outcome I’m optimizing for>
Tone: <how I should sound>
Format: <structure / artifacts>
Constraints: <taboos / what not to do>
QualityBar: <how to evaluate the output>
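The routing step itself can be sketched in a few lines. The catalog and trigger keywords below are illustrative assumptions (this appendix does not fix a taxonomy); the essential behavior is: pick exactly one base role, 0..N checkers, print the ROUTER marker, and refuse to proceed when no role matches.

```python
# Illustrative Skill Router sketch; the catalog and trigger keywords are
# assumptions, not a fixed taxonomy from this appendix.

SKILL_CATALOG = {
    "debugger": {"triggers": ["incident", "5xx", "root cause"]},
    "security-reviewer": {"triggers": ["auth", "secrets", "access"]},
    "sre": {"triggers": ["deploy", "rollback", "runbook"]},
}

def route(task: str):
    """Pick exactly one base role plus 0..N checkers; None if no role matches."""
    text = task.lower()
    matches = [role for role, skill in SKILL_CATALOG.items()
               if any(t in text for t in skill["triggers"])]
    if not matches:
        return None  # role not defined: stop and request a Role Spec
    base, checkers = matches[0], matches[1:]
    print(f"**[ROUTER]:** selected skills = {matches} "
          f"(base={base}, checkers={checkers or 'none'})")
    return base, checkers
```

Real routers match on richer signals than keywords, but the contract is the same: routing is explicit and recorded, and an unroutable task is a STOP, not a best-effort guess.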
3.5 Scaling long-running autonomy: planners / workers / judge (pattern)#
If one agent “drowns” in a large project, the natural move is parallelism. In practice, flat self-coordination via one shared task list often degrades into bottlenecks and fragility. A more stable pattern is role separation:
- Planners: continuously explore the system and cut work into tasks.
- Workers: take narrow tasks and deliver outcomes without trying to hold the whole picture.
- Judge: decides whether to continue the iteration and when to do a fresh start to control drift/tunnel vision.
Reference: Cursor: Scaling long-running autonomous coding.
Note: in practice, a “fresh start” often means restarting from the plan (plan-first): refine requirements/guardrails, update the plan, and rerun the loop, instead of patching outputs in chat. See Cursor: Best practices for coding with agents.
4) Common mistakes (anti-patterns)#
- no DoD: “done” becomes opinion; quality fluctuates.
- no ADR: later you reconstruct decisions via archaeology.
- no rollback path (and no rule for when a full plan is required): any failure becomes a high-MTTR incident.
- no runbook: heroics, not repeatable response.
- no eval/regression set before deploying changes: regressions ship to production.