Chapter 1: first prompt + verification plan

Prologue: Parts Unlimited, 2014

Remember Bill Palmer from The Phoenix Project?

2014. Bill is VP of IT Operations at Parts Unlimited, a company on the edge of disaster. Deployments take too long and fail too often. Every incident means sleepless nights, manual log triage, and cross-team coordination through email threads and phone calls.

Bill fights fires by hand. Every day is a new crisis.

2026. Meet Lance Bishop.

Lance is the program lead for Phoenix Project delivery at a company much like Parts Unlimited. The problems are the same: Phoenix Project is slipping, deployments keep breaking, incidents repeat, infrastructure has grown, and the team is small.

But there is one difference: Lance has AI agents—as a delivery accelerator and as a way to make quality repeatable.

Not instead of expertise. On top of expertise.

A scene from “The Phoenix Project” (2014)

Chapters 3–4: “Why is every deployment a catastrophe?”

Steve Masters, the CEO, calls Bill Palmer into his office. The question is simple and brutal: “Why does Phoenix Project keep slipping? Why does every deployment break?”

Bill realizes he needs data. But where does he get it?

Over the next few days, Bill and his team gather information manually:

Wes Davis digs through Excel sheets with deployment history (mixed formats, no standard)
Patty McKee assembles email threads with incident reports (lots of email; context gets lost)
Bill looks for patterns in Jira tickets (noise, weak labeling, hard to filter)

The outcome of the analysis (manual teamwork):

The deployment pipeline has many manual steps (some descriptions are stale)
Failure rate is high (exact numbers are unclear)
Deployment and rollback time is long and unpredictable
Top failure reasons are qualitative (“probably DB migrations”), not backed by strict statistics

Bill shows the results to Steve. The CEO asks: “Is this accurate? Can you prove it?”

Bill answers honestly: “It’s the best we can assemble manually. Accuracy… maybe 70%.”

Core problems with the 2014 approach:

Time: days of manual work to analyze a limited window
Accuracy: “roughly”, because manual collection means human mistakes
Repeatability: none (you have to re-collect every time)
Patterns: qualitative (“DB migrations, probably”), not quantitative

The same problem in 2026: how Lance solves it with AI agents

Context: the CEO (2026) asks the same question: “Why do deployments keep failing?”

Lance Bishop’s approach (2026):

Lance starts a dialogue with an agent and writes a prompt:

Role: you are an agent for deployment analysis.

Task: analyze the last <N> deployments from CI/CD logs.

For each deployment, extract:
- Date/time
- Status: `success`/`failed`
- Duration
- Failure step: `failed_step` if `status = failed`

Output format: JSON array + a statistics table (top failure reasons without false precision)

Guardrails:
- Work only with the specified files/artifacts; do not make network calls and do not modify data.
- Do not include secrets/PII and raw logs in the answer; use short quotes and redaction.

STOP CONDITIONS:
- If logs are unavailable, stop and report it
- If the log format is unclear, ask a question
- Do NOT invent missing data

Log location: ./ci-cd-logs/deployments/*.log

Time to execute:

The agent generates a parser: fast
Parsing deployments: fast
Verification (spot checks): fast
Total: noticeably faster than manual collection

Result:

{
  "total_deployments": "<VALUE>",
  "success_rate": "<VALUE>",
  "failed_rate": "<VALUE>",
  "top_failures": [
    {"step": "database_migration", "count": "<N>", "percentage": "<VALUE>"},
    {"step": "config_validation", "count": "<N>", "percentage": "<VALUE>"},
    {"step": "service_timeout", "count": "<N>", "percentage": "<VALUE>"}
  ]
}

Richard, an engineer on the team, verifies the result with a spot check: he compares 3 random deployments against the raw logs by hand. They match.

They show it to the CEO. The CEO asks: “Is this accurate?”

Lance answers: “A verification plan is included. We spot-checked three cases and they matched. If you want more confidence, here’s a list of IDs for additional checks.”

2014 vs 2026

Metric	Bill Palmer (2014)	Lance Bishop (2026)
Time	days of manual work	quick run + verification
Method	manual collection: Excel, email, Jira	agent + verification
Accuracy	“roughly” (human error)	confirmed via spot checks
Output	qualitative (“probably migrations”)	quantitative (structured breakdown)
Repeatable?	no (rebuild every time)	yes (prompt + SOP)
Verifiable?	no (trust me)	yes (verification plan)

What changed:

Speed: days of manual work → a fast run with verification
Accuracy: higher (verification plan is mandatory)
Repeatability: you can reuse the prompt for the next 50 deployments
Trustworthiness: verification turns output into something you can act on

What did not change:

Responsibility stays with the human: Lance makes the result verifiable (does not trust blindly)
You still need Senior+ experience: pipeline intuition, DB migrations, what signals matter
Trust, but verify: agents help; humans control quality

In this chapter you will learn how to:

phrase your first prompt so results are verifiable
verify an agent’s outputs (verification plan)
get value quickly without blind trust

Who you are (and why it matters)

This book is for Senior/Staff/Principal+ engineers who:

have led teams or made architecture decisions
have worked with production (deploys, incidents, on-call)
know that “fast” without quality control means “redo it Friday night”

You do not need an explanation of CI/CD, SLOs, or code review. You already know them.

You do need to learn how to apply that experience to working with agents.

Quick start: your first prompt

Goal

Get structured data from deployment logs quickly—and verify it.

Remember the prologue? Bill Palmer spent days collecting deployment data with shaky accuracy. You will do the same job quickly, with a verifiable result.

Your task (2026)

Context: the deployment pipeline breaks weekly.
VP Engineering asks: “Why do deployments fail?”
Task: find the top 3 failure reasons.

Input

deployment logs (CI/CD logs)
any AI agent: a tool in agent mode (with file and command access) or in chat mode

Prompt (copy-paste ready)

Role: you are an agent for deployment analysis.

Context:
- We have logs for the last 10 deployments in a CI/CD format
- Deployments often fail, but the pattern is unclear

Guardrails:
- Analyze in read-only mode: do not modify files/environment and do not make network calls.
- Do not publish raw logs; redact secrets/PII and quote only relevant lines.

Task:
1) Analyze the logs and for each deployment extract:
   - Date/time
   - Status: `success`/`failed`
   - Duration
   - Failure step (if `status = failed`)
   
2) Output the result as a JSON array, sorted by date (newest first)

3) Compute statistics:
   - How many successful vs failed deployments
   - Top 3 failure reasons by `failed_step` with percentages

Output format:
- JSON deployment data
- A statistics table

STOP CONDITIONS:
- If logs are unavailable or the path is missing, stop and report it
- If the log format is unclear, ask a clarifying question
- Do NOT invent missing data

Log location: ./deployment-logs/*.log

What the agent does:

The agent does not just “think about an answer”. It runs commands:

Reads the log files (like cat deployment-logs/*.log)
Parses data (generates and runs a Python/bash script)
Produces output: JSON + a table

In 2014 Bill did the same manually: opened Excel, used grep, wrote one-liners. The agent uses the same tools, but autonomously.
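
As an illustration of the kind of script an agent might generate for the parsing step, here is a minimal sketch. The log format is an assumption made purely for this example (one `key=value` pair per line, with hypothetical keys `timestamp`, `status`, `duration_seconds`, and `failed_step`); real CI/CD logs will look different. The names `parse_deployment_log` and `StopCondition` are also illustrative.

```python
from datetime import datetime

REQUIRED_KEYS = ("timestamp", "status", "duration_seconds")  # assumed key names


class StopCondition(Exception):
    """Raised instead of guessing when a log cannot be parsed."""


def parse_deployment_log(text, source="<memory>"):
    """Parse one log in the ASSUMED key=value format into a record."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not key=value pairs
            fields[key.strip()] = value.strip()

    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise StopCondition(f"{source}: missing fields {missing}")
    if fields["status"] not in ("success", "failed"):
        raise StopCondition(f"{source}: unknown status {fields['status']!r}")
    if fields["status"] == "failed" and "failed_step" not in fields:
        raise StopCondition(f"{source}: failed deployment without failed_step")

    try:
        record = {
            "date": datetime.fromisoformat(fields["timestamp"]).isoformat(),
            "status": fields["status"],
            "duration_seconds": int(fields["duration_seconds"]),
        }
    except ValueError as exc:
        raise StopCondition(f"{source}: unparseable value ({exc})") from exc

    if record["status"] == "failed":
        record["failed_step"] = fields["failed_step"]
    return record
```

The design choice worth noticing is that every gap raises `StopCondition` rather than falling back to a default, which mirrors the prompt's "Do NOT invent missing data" rule.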

Steps

Paste the prompt into the chat
Provide the real log path (replace ./deployment-logs/*.log)
Run the agent—it will generate parsing code/scripts
Review the code—confirm it does not perform risky operations (writes, deletes, network calls)
Run the script—get JSON output
Verify the result (see “Verification plan” below)
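
The "review the code" step can be supported by a quick static scan before you read the script in full. The sketch below assumes the generated script is Python and uses the standard `ast` module; the deny-lists (`RISKY_MODULES`, `RISKY_CALLS`) and the function name `find_risky_operations` are illustrative assumptions, not a complete policy. Treat it as a heuristic first filter: it will miss some write paths (for example `Path.open("w")`), so it does not replace reading the code.

```python
import ast

# Illustrative deny-lists; extend them for your environment.
RISKY_MODULES = {"socket", "requests", "urllib", "http", "subprocess", "shutil"}
RISKY_CALLS = {"remove", "unlink", "rmdir", "rmtree", "system", "write_text", "write_bytes"}
WRITE_MODE_CHARS = set("wax+")


def _open_mode(call):
    """Return the literal mode of an open() call, 'r' if omitted, None if dynamic."""
    mode_node = call.args[1] if len(call.args) >= 2 else None
    for keyword in call.keywords:
        if keyword.arg == "mode":
            mode_node = keyword.value
    if mode_node is None:
        return "r"
    if isinstance(mode_node, ast.Constant) and isinstance(mode_node.value, str):
        return mode_node.value
    return None


def find_risky_operations(source):
    """Return human-readable findings for imports and calls that deserve a closer look."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in RISKY_MODULES:
                findings.append(f"line {node.lineno}: imports {name}")

        if isinstance(node, ast.Call):
            func = node.func
            called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if called in RISKY_CALLS:
                findings.append(f"line {node.lineno}: calls {called}()")
            if called == "open":
                mode = _open_mode(node)
                if mode is None or WRITE_MODE_CHARS & set(mode):
                    findings.append(f"line {node.lineno}: open() may write (mode={mode!r})")
    return findings
```

An empty result means "nothing on the deny-list was found", not "the script is safe".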

Example output

The agent parsed logs (wrote and ran a Python script) and returned:

[
  {
    "date": "2026-01-15",
    "status": "failed",
    "duration": "<DURATION>",
    "failed_step": "database_migration"
  },
  {
    "date": "2026-01-14",
    "status": "success",
    "duration": "<DURATION>"
  },
  {
    "date": "2026-01-13",
    "status": "failed",
    "duration": "<DURATION>",
    "failed_step": "config_validation"
  }
]

Statistics:

Failure step (`failed_step`)	Count	Share
database_migration	5/10	50%
config_validation	3/10	30%
timeout	2/10	20%
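
A statistics table like the one above can be recomputed from the JSON records in a few lines. This is a minimal sketch, not the script the agent actually wrote; the function name `failure_statistics` is hypothetical, and it assumes one particular convention: shares are reported relative to failed deployments, with the denominator stored next to each count so a reader can see which convention was used.

```python
from collections import Counter


def failure_statistics(records, top_n=3):
    """Summarize success/failure counts and the most common failed steps."""
    failed = [r for r in records if r["status"] == "failed"]
    counts = Counter(r["failed_step"] for r in failed)

    top_failures = []
    for step, count in counts.most_common(top_n):
        top_failures.append({
            "failed_step": step,
            "count": count,
            "of_failed": len(failed),  # explicit denominator (assumed convention)
            "share_pct": round(100 * count / len(failed)),
        })

    return {
        "total": len(records),
        "succeeded": len(records) - len(failed),
        "failed": len(failed),
        "top_failures": top_failures,
    }
```

Keeping the denominator in the output avoids the ambiguity between "share of all deployments" and "share of failures" when someone later reads the table.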

Verification plan

Question: how do you check the agent didn’t lie?

Method 1: spot checks

Pick 2–3 deployments from the output
Open the original logs manually
Verify: do date, status, and failed_step match?
If the sample matches, the agent is likely parsing correctly
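
Choosing the spot-check sample by hand invites picking the "easy" cases. A minimal sketch of a random picker and a field-by-field comparison follows; the function names `pick_spot_checks` and `mismatched_fields` are hypothetical, and the record fields follow the example output above.

```python
import random


def pick_spot_checks(records, sample_size=3, seed=None):
    """Return a random sample of records to compare against the raw logs."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    return rng.sample(records, min(sample_size, len(records)))


def mismatched_fields(agent_record, manual_record,
                      fields=("date", "status", "failed_step")):
    """List the fields where the agent's record differs from your manual reading."""
    return [f for f in fields if agent_record.get(f) != manual_record.get(f)]
```

Recording the seed alongside the result lets another engineer re-draw the same sample when reviewing the verification.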

Method 2: edge cases

Verify the first and last deployment (by date)
Verify the longest and shortest deployment (by duration)
These cases often expose parsing errors
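
The edge cases can be selected mechanically. This sketch assumes each record carries an ISO-formatted `date` string (which sorts correctly as text) and a numeric `duration_seconds` field; both the field name `duration_seconds` and the function name `edge_case_records` are assumptions for illustration.

```python
def edge_case_records(records):
    """Pick the records most likely to expose parsing errors."""
    if not records:
        raise ValueError("no records to check")
    by_date = sorted(records, key=lambda r: r["date"])
    by_duration = sorted(records, key=lambda r: r["duration_seconds"])
    return {
        "earliest": by_date[0],
        "latest": by_date[-1],
        "shortest": by_duration[0],
        "longest": by_duration[-1],
    }
```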

Method 3: sanity checks

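The agent says “50% database_migration”
Recount by hand: if the denominator is 10 deployments, exactly 5 must have failed at database_migration
Also confirm which denominator the agent used (all deployments or only failed ones)
Matches? Good.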
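
The same sanity check can be expressed as a recount over the JSON records. A minimal sketch, assuming the agent's claimed counts are available as a plain mapping from step name to count; the function name `recount_mismatches` is hypothetical.

```python
from collections import Counter


def recount_mismatches(records, claimed_counts):
    """Compare the agent's claimed per-step failure counts with a recount."""
    actual = Counter(r["failed_step"] for r in records if r["status"] == "failed")
    mismatches = {}
    for step in sorted(set(actual) | set(claimed_counts)):
        claimed = claimed_counts.get(step, 0)
        if actual.get(step, 0) != claimed:
            mismatches[step] = {"claimed": claimed, "actual": actual.get(step, 0)}
    return mismatches
```

An empty dictionary means the claimed counts agree with the records; it says nothing about whether the records agree with the raw logs, which is what the spot checks are for.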

Red flags (when not to trust it):

The agent returns perfectly “pretty” splits like 50%/50%
All dates look identical or unnaturally sequential (1, 2, 3, 4…)
Failure steps don’t look real (for example, “step_1”, “step_2”)
Stats do not add up (counts exceed the number of deployments, or shares across all failure reasons do not sum to 100%)

Result validation (checklist)

JSON is valid (syntax is correct)
Required fields exist (date, status, duration; failed_step for failed deployments)
Dates are sorted correctly (newest first)
Spot check: 2–3 deployments match the raw logs
Stats add up (counts match the totals; shares across all failure reasons sum to 100%)
The agent’s code is safe (no write/delete/network operations)
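
The mechanical items on this checklist (valid JSON, required fields, sort order) lend themselves to a small script. The sketch below is one possible shape, with the hypothetical function name `validate_agent_output`; it uses the field names from the example output and assumes dates are ISO strings. Spot checks and the code review remain manual.

```python
import json

REQUIRED_FIELDS = ("date", "status", "duration")


def validate_agent_output(raw_json):
    """Return a list of checklist violations; an empty list means none were found."""
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(records, list):
        return ["top-level value is not a JSON array"]

    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {index}: missing {field}")
        if record.get("status") == "failed" and "failed_step" not in record:
            problems.append(f"record {index}: failed without failed_step")

    # ISO dates compare correctly as strings, so a plain sort is enough here.
    dates = [r["date"] for r in records if isinstance(r, dict) and "date" in r]
    if dates != sorted(dates, reverse=True):
        problems.append("dates are not sorted newest first")
    return problems
```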

Expected outcome

Artifact: JSON deployment data + a statistics table

Time:

Agent: fast (instead of hours manually)
Verification: fast

Value:

Fast: structured data without “days of Excel work”
Verifiable: you know how to validate the result
Repeatable: you can reuse the prompt for the next 10 deployments

Theory: what did we just do?

Concept 1: a prompt as a contract

You already work with contracts: API contracts, requirements, acceptance criteria.

A prompt is a contract between you and the agent:

role: who the agent is (analyst, engineer, reviewer)
context: what it knows (logs, formats, guardrails)
task: what it must do (steps, output format)
stop conditions: when it must stop and ask
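
One way to make the contract idea concrete is to hold the four parts as data and render the prompt text from them. This is a minimal sketch: the class name `PromptContract` and its layout are assumptions for illustration, and the rendered format simply mirrors the prompt shown earlier in this chapter.

```python
from dataclasses import dataclass


@dataclass
class PromptContract:
    role: str
    context: list[str]
    task: list[str]
    stop_conditions: list[str]

    def render(self) -> str:
        """Render the contract as prompt text; refuse if stop conditions are missing."""
        if not self.stop_conditions:
            raise ValueError("a contract needs at least one stop condition")
        lines = [f"Role: {self.role}", "", "Context:"]
        lines += [f"- {item}" for item in self.context]
        lines += ["", "Task:"]
        lines += [f"{n}) {step}" for n, step in enumerate(self.task, start=1)]
        lines += ["", "STOP CONDITIONS:"]
        lines += [f"- {item}" for item in self.stop_conditions]
        return "\n".join(lines)


contract = PromptContract(
    role="you are an agent for deployment analysis.",
    context=["Analyze in read-only mode; do not make network calls."],
    task=["Extract date/time, status, duration for each deployment",
          "Output a JSON array"],
    stop_conditions=["If logs are unavailable, stop and report it",
                     "Do NOT invent missing data"],
)
print(contract.render())
```

Keeping the contract as structured data makes it reviewable and versionable in the same way as an API schema.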

Why this matters:

With humans, you can say “analyze the logs” and they infer the missing context (where logs are, what “analysis” means).

An agent does not infer that reliably. It needs an explicit contract.

In 2014 Bill Palmer rebuilt deployment data every time: days of searching Excel sheets, email chains, Jira tickets—no repeatable process. Next time, same days again. A prompt makes the process explicit: write the contract once → run it when needed → get results quickly.

A real story:

On the VIKI project we asked an agent to “analyze API responses and find problems.” The agent replied:

“200 OK — all good”
“500 Internal Server Error — there’s a problem”

Not helpful. The real issue was latency: p95 was above the acceptable threshold, but the agent did not know latency counted as a problem.

Fix: we added an explicit criterion: “p95 latency above the threshold counts as a problem.”

Takeaway: the agent does not “think”. It executes a contract. If the contract is incomplete, results are unpredictable.
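
To show what "explicit criterion" means in practice, the latency rule from the story can be written as a check. This is a sketch under assumptions: the response fields (`status`, `path`, `latency_ms`), the function names, and the threshold parameter are all hypothetical, and the percentile uses the default method of Python's `statistics.quantiles`, which needs at least two data points.

```python
import statistics


def p95(values):
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is p95."""
    return statistics.quantiles(values, n=100)[94]


def find_problems(responses, p95_threshold_ms):
    """Apply both criteria: server errors AND p95 latency above the threshold."""
    problems = [f"{r['status']} on {r['path']}" for r in responses if r["status"] >= 500]
    latency = p95([r["latency_ms"] for r in responses])
    if latency > p95_threshold_ms:
        problems.append(f"p95 latency {latency:.0f} ms exceeds {p95_threshold_ms} ms")
    return problems
```

Without the second rule, a set of slow but successful responses would produce an empty problem list, which is exactly the "200 OK, all good" answer from the story.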

Concept 2: trust, but verify

You have led teams. You know the principle of “trust, but verify”:

give a junior a task
review the pull request before merge

With agents, it is the same.

Why:

Agents fail differently than humans.

In 2014 Bill spent days collecting data and told the CEO: “Failures are high, probably migrations.” The CEO asked: “Is this accurate?” Bill said: “Best we can do manually. Accuracy… 70%.” Human problems: copy/paste mistakes, mixed formats, qualitative conclusions.

In 2026 the agent produces data fast. But agents have their own failure modes:

hallucinations: inventing data
context misses: ignoring key constraints
logic errors: wrong conclusions
format errors: broken JSON

A verification plan makes results actionable: spot-check 2–3 cases → confirm they match raw logs.

How to reduce risk:

spot-check a sample manually (2–3 cases for a small run; about 10–20% for larger ones)
check edge cases (first/last, min/max)
sanity-check statistics
automate verification (treat outputs like code; write checks)

Tradeoff:

Verification costs time, but:

without it you may act on wrong data
with it you gain confidence for production-impacting decisions

When you can skip verification:

trivial changes (rename a variable, add a comment)
low-stakes output (a doc draft, brainstorming)

When verification is mandatory:

production data (logs, metrics, configs)
business decisions (risk analysis, cost estimates)
security work (threat models, access control)

Concept 3: stop conditions

You have used circuit breakers: if a service is failing, stop retrying blindly.

Stop conditions for agents work the same way.

The agent must know when to stop and ask a human:

logs are unavailable → ask where they are
format is unclear → show an example and ask how to parse
data is insufficient → list what is needed

In 2014 Bill filled gaps by guessing (“DB migrations, probably”) because he had to show something fast. In 2026 the agent can and should stop: “STOP: logs unavailable at path X. Where should I look?” Instead of hallucinating, it asks.

Why this matters:

Without stop conditions, the agent will guess and hallucinate.

A real story:

On the ASIMOV project, one deployment log file was corrupted. The agent made up values:

Date: “2026-01-01” (because “start of the year”)
Status: “success” (optimistic default)

We accepted the output and made a wrong claim: “the Jan 1 deploy succeeded.”

Reality: there was no deploy on Jan 1. The file was corrupted.

Fix: add a stop condition: “If a log is corrupted/unreadable, stop and report it.”

Takeaway: agents must not invent missing data. If data is missing, they must stop and ask.
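
In a parsing script, the same stop condition means failing loudly on an unreadable file instead of substituting defaults. A minimal sketch, assuming logs are UTF-8 text; the names `StopCondition` and `read_log_or_stop` are hypothetical.

```python
from pathlib import Path


class StopCondition(Exception):
    """Signals that a human must decide; the script must not guess."""


def read_log_or_stop(path: Path) -> str:
    """Return the log text, or raise StopCondition if it cannot be trusted."""
    try:
        text = path.read_bytes().decode("utf-8")  # assumption: logs are UTF-8
    except OSError as exc:
        raise StopCondition(f"{path}: cannot read file ({exc})") from exc
    except UnicodeDecodeError as exc:
        raise StopCondition(f"{path}: not valid UTF-8, possibly corrupted") from exc
    if not text.strip():
        raise StopCondition(f"{path}: file is empty")
    return text
```

There is deliberately no fallback branch that returns a default date or status: the only two outcomes are real content or an explicit stop.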

Practice: an SOP for your first prompt

In 2014 Bill analyzed deployments without a process: each time he rebuilt Excel sheets, email chains, Jira tickets—hours or days. Next time, same again.

In 2026 an SOP makes it repeatable: write the procedure once → fast runs thereafter. The difference is obvious in both time and repeatability.

Purpose

A repeatable process for writing and validating a prompt for a typical task (log analysis, parsing data, generating reports).

Inputs (artifacts)

the task (for example: “find the top 3 deployment failure reasons”)
data source (logs, metrics, configs)
output format (JSON, table, text)

Procedure

Step 1: define the agent’s role

What to do:

describe the role in one sentence: “you are an agent for deployment analysis”
a role sets expectations: an analyst behaves differently than an implementer

Why: The role helps the agent match the expected output style.

Quality gate 1: role check

Checklist:

role is one sentence
role matches the task (analysis → analyst; code → developer)

Failure scenario: On the MORPHEUS project we did not specify a role. The agent acted as a generic assistant and suggested writing unit tests instead of analyzing logs.

Step 2: add context

What to do:

describe what the agent knows: data format, guardrails, assumptions
context reduces wrong assumptions

Why: Without context, the agent will assume things that may be false.

Quality gate 2: context check

Checklist:

data format is described (JSON, plain text, CSV)
guardrails are explicit (read-only, no network calls)
assumptions are explicit (for example, “logs are in chronological order”)

Failure scenario: On the VOSTOK project we did not state that logs could be unsorted. The agent assumed chronological order and returned a misordered result.

Step 3: write the task in steps

What to do:

break the work into 3–7 steps
each step must be concrete (“extract date”, “compute stats”)

Why: Step-by-step tasks reduce error: the agent knows what to do at each stage.

Quality gate 3: task check

Checklist:

3–7 steps
each step is concrete (not abstract)
output format is explicit (JSON, table)

Failure scenario: On the SEVER project we said “analyze the logs”. The agent returned prose (“deployments fail due to migrations”), not structured data.

Step 4: add stop conditions

What to do:

list situations where the agent must stop and ask
explicit examples: “logs unavailable”, “format unclear”

Why: Stop conditions prevent hallucinations.

Quality gate 4: stop conditions check

Checklist:

at least 2–3 stop conditions
they cover common issues (missing data, unclear format)
explicit prohibition on guessing: “Do NOT invent missing data”

STOP CONDITION: If the task is too complex for one prompt (too many steps), stop and split it into multiple prompts.

Step 5: run the agent and verify the output

What to do:

run the agent
apply the verification plan (spot checks, edge cases, sanity checks)

Why: Verification is mandatory. Without it, the result is not trustworthy.

Quality gate 5: verification complete

Checklist:

spot checks: 2–3 cases match the original
edge cases are checked (first/last, min/max)
sanity checks pass (stats add up)

Failure scenario: On the ZAPAD project we skipped verification. The agent parsed the `duration` value (<DURATION>) incorrectly; we accepted it and drew the wrong performance conclusions.

Outputs (artifacts)

the prompt (reusable)
the output (JSON, table)
a verification report (what was checked, which cases passed)

Evidence

How to prove the SOP was followed:

the prompt contains role, context, task, and stop conditions
the output passed verification (spot checks, edge cases, sanity checks)
the artifact is repeatable: another engineer can run the prompt and get comparable output
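
The first evidence item can be checked automatically before a prompt is committed to a shared repository. This sketch is a simple text lint, assuming prompts use the section labels from this chapter ("Role:", "Context:", "Task:", "STOP CONDITIONS:"); the function name `missing_prompt_sections` is hypothetical, and a passing result says nothing about the quality of each section.

```python
REQUIRED_MARKERS = {
    "role": "role:",
    "context": "context:",
    "task": "task:",
    "stop conditions": "stop conditions:",
}


def missing_prompt_sections(prompt):
    """Return the names of contract parts that the prompt text does not contain."""
    lowered = prompt.lower()
    missing = [name for name, marker in REQUIRED_MARKERS.items() if marker not in lowered]
    if "do not invent" not in lowered:
        missing.append("explicit prohibition on guessing")
    return missing
```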

Parallel track: the evolution from 2014 to 2026

How it looked in 2014 (Bill Palmer, without agents)

Scene from the book (Chapters 3–4 of “The Phoenix Project”):

Situation: Steve Masters, the CEO, calls Bill into his office: “Why does Phoenix Project keep slipping? Why does every deployment break? I need data for a board meeting very soon.”

What Bill did (manually):

Day 1: Wes Davis digs through Excel sheets (80+ deployment rows, mixed formats, incomplete data)
Day 1: Patty McKee assembles email chains with incident reports (lots of email, manual analysis)
Day 2: Bill searches for patterns in Jira (150+ tickets; half missing tags)
Day 2 (evening): the team consolidates everything into one table and tries to find patterns

Time cost: days of manual team work

Consequences:

Output is imprecise: assumptions, incomplete data
The CEO cannot make a confident decision for the board meeting
The team is frustrated: “Excel drudgery” instead of fixing the system
Patterns stay qualitative: “DB migrations, probably” without numbers

Why it is so slow and expensive:

Root cause 1: manual data collection across sources (Excel, email, Jira)—no single source of truth
Root cause 2: human counting errors (copy/paste mistakes, missed rows)
Root cause 3: no repeatable process—each time is different

Bill’s reflection:

“This is the best we can assemble manually in a couple of days. Accuracy is… conditional.”
“I know it isn’t ideal, but we don’t have time for ideal: we’re firefighting.”
“Steve needs data soon, and we already burned days of team time on this analysis.”

Bill’s end state (by the end of “The Phoenix Project”):

automated collection: deployment logs → CI/CD dashboard
time: days → much faster (automation)
accuracy: “roughly” → much better (automated parsing, fewer human errors)
business: Phoenix Project ships; the business impact is tangible

Bill’s key win: he made fact collection repeatable through automation and process.

How it looks in 2026 (Lance Bishop, with agents)

Same problem, with agents:

Situation: the CEO (2026) asks the same question: “Why do deployments keep failing? I need data for the board meeting tomorrow.”

What Lance does with an agent:

Lance opens a chat with the agent and writes a prompt:

Role: an agent for deployment analysis.
Task: analyze the last <N> deployments from CI/CD logs.
Output: JSON + a statistics table (top failure reasons without false precision)
STOP CONDITIONS: if logs are unavailable, stop and report it
Location: ./ci-cd-logs/deployments/*.log

the agent generates a parser and parses deployments
verification plan: spot check → matches on the sample

Time cost: much lower than manual collection

Consequences:

the output is verifiable (automation + verification plan)
the CEO gets data fast → can decide confidently for the board meeting
the team stays focused: less toil → more strategic work
patterns become formalizable without made-up percentages

Why it is faster and cheaper:

the agent addresses root cause 1: parses sources automatically (CI/CD logs, Jira APIs) into one pipeline
the agent addresses root cause 2: fewer human errors (no fatigue; consistent parsing)
the agent addresses root cause 3: the prompt is a repeatable process

Lance’s reflection:

“In 2014 Bill Palmer spent days and got conditional accuracy.”
“I spent less time with an agent and got a verifiable result.”
“This is not magic. It is Bill’s automation (2014) plus an intelligence layer (2026):”
“Bill taught us to automate toil (scripts, dashboards)”
“Agents add intelligence (parsing messy data, pattern recognition, adaptability)”
“I’m standing on Bill Palmer’s shoulders. Agents are an amplifier, not a replacement.”

Multiplier effect: how agents amplify Phoenix Project improvements

Aspect	Bill 2014 (start)	Bill 2014 (end, after automation)	Lance 2026 (with agents)	Notes
Time	days (manual)	faster (automation)	faster (agents + automation)	no made-up numbers
Cost	direct + opportunity cost	scripts and automation	prompt + agent	lower execution cost
Accuracy	rough (manual errors)	higher (automated parsing)	high with verification	depends on data quality and checks
Repeatability	0%	100% (scripts documented)	100% (prompt documented)	repeatability remains
Adaptability	low	medium (update scripts)	high (update prompt; agent adapts)	a key advantage

Core idea:

Agents do not replace what Bill achieved. They amplify it:

Bill accelerated through automation
Lance accelerates through agents
agents add speed and an intelligence layer (adaptability to new formats)

What agents add on top of Bill’s 2014 automation:

Intelligence: agents can handle unstructured inputs (logs, email) without a rigid schema
Adaptability: new log format? update the prompt (fast) vs rewriting scripts (slower)
Natural-language interface: a prompt instead of code—usable by any engineer, not only DevOps
Built-in verification: verification plan is explicit, not forgotten

Evolution, not revolution

The key difference:

2014: Bill Palmer is a hero who collects data manually (heroics) → automation via scripts
2026: Lance Bishop is an orchestrator who writes prompts and verifies outputs (systematic)

Not a replacement, an amplifier:

What it does	Bill 2014 (automation)	Lance 2026 (agents)
Toil	scripts (structured log parsing)	agents (structured + unstructured parsing)
Intelligence	hardcoded if/else rules	pattern recognition via ML models
Adaptability	low	high
Human role	write scripts + run + verify	write prompts + verify + iterate
Accountability	human (script author)	human (prompt author + orchestrator)
Speed	much faster than manual	faster due to adaptability
Quality	depends on parser correctness	depends on verification and input quality

Parallels with people management:

Bill Palmer (2014) managed humans:

“Wes, collect deployment data from Excel”
“Is this accurate? Check again”
“Add a breakdown by time of day”
time: days of coordination

Lance Bishop (2026) orchestrates an agent:

a prompt: “analyze deployments; JSON output; stop conditions”
verification: sample of 3 cases → 100% match
iteration: “add a time-of-day breakdown” (update prompt, rerun)
time: much lower

The transferable skill:

writing clear tasks (brief → prompt)
setting quality criteria (DoD → verification plan)
reviewing outputs (review → evidence via spot checks)

Bill did it with people. Lance does it with agents.

Principles are the same. Execution is usually faster (by roughly an order of magnitude, as an estimate; the actual gain depends on data quality and verification discipline).

Comparing Bill’s path and an agent-driven path

Bill Palmer’s transformation (2014, “The Phoenix Project”):

Timeline:

Month 0: crisis (deployments fail; Phoenix Project slips)
Month 3: early improvements (monitoring, basic scripts)
Month 6: major improvements (CI/CD pipeline, deployment automation)
Month 12: success: Phoenix Project shipped; business impact is visible

Key milestones:

analysis time: days → faster (automated scripts)
deployment success rate: 40% → 85% (CI/CD + testing)
MTTR: hours → lower (runbooks + coordination)

Investment:

a long transformation
3–5 engineers full-time on automation
substantial investment (tools, training, process change)

Results:

Phoenix Project shipped on time → business impact
market position recovered
competitive advantage (deploy frequency: weekly → daily)

An engineer’s path (2026, with agents):

Timeline:

Week 0: the same crisis (deployments fail; CEO demands data)
Day 1: first prompt (analysis becomes fast)
Week 2: system prompts (multiple analyses standardized)
Month 3: AI-amplified CI/CD (agents + Bill-style automation)

Key milestones:

analysis time: days → fast (agents on top of automation)
deployment success rate: 40% → 90% (agents catch issues earlier)
MTTR: hours → lower (agent-assisted triage; see Chapter 6)

Investment:

the first prompt is fast
time to train the team to work with agents
overall investment: time and attention, without magic numbers

Results:

the same analysis, immediately: Day 1
Phoenix Project ships faster (agents across the cycle)
competitive advantage: deploy frequency grows

Agent ROI should be computed in your context (see the template in Appendix A).

Common mistakes

Mistake 1: a prompt without stop conditions

Symptom: the agent invents missing data or hallucinates.

Example: Prompt: “Analyze the last 10 deployments and find patterns.”
Result: the agent returns 10 deployments, but one is invented (a non-existent date, a “success” status with no evidence).

Why it happens: Agents are optimized to “complete the task”, not to “stop and ask”. Without explicit stop conditions, they guess.

Consequence: You accept wrong data → you make wrong decisions.

How to avoid it: Always add stop conditions:

“If logs are unavailable, stop and report it”
“If the log format is unclear, ask a clarifying question”
“Do NOT invent missing data”

Wrong:

Role: you are an agent for deployment analysis.
Task: analyze the last 10 deployments and find patterns.

Right:

Role: you are an agent for deployment analysis.
Task: analyze the last 10 deployments and find patterns.

STOP CONDITIONS:
- If logs are unavailable, stop and report it
- If the log format is unclear, ask a question
- Do NOT invent missing data

Mistake 2: skipping verification

Symptom: you trust the agent’s output without checking it.

Example: The agent reports: “database_migration: 50%, config_validation: 30%, timeout: 20%”. You take it into a VP presentation. In the meeting it turns out the agent was wrong; config_validation was actually 40%.

Why it happens: Agents look convincing. They output tables, percentages, structured JSON. It is easy to assume “it must be correct.”

Consequence:

Wrong decisions (invest in fixing database_migration while the real issue is config_validation)
Loss of trust (you become “the person who brought bad data”)

How to avoid it: Always apply a verification plan:

spot check 2–3 cases manually
check edge cases (first/last, min/max)
sanity-check that statistics add up

In 2014 Bill told the CEO “we’re at ~70% accuracy.” In 2026, with a verification plan, you can say: “Spot-checked three cases; they match the raw logs.”

Wrong: Use the agent’s output without verification.

Right: Verify 2–3 cases manually, then use the result.

Mistake 3: the task is too abstract

Symptom: the agent outputs prose instead of structured data.

Example: Prompt: “Analyze the deployment logs.”
Result: “Deployments fail due to database migrations. I recommend improving the migration process.”

This is not actionable: no data, no stats, not verifiable.

Why it happens: The task is underspecified. The agent does not know what you want: JSON, a table, a list, or text.

Consequence: You get generic words instead of concrete data.

How to avoid it: Write a step-by-step task and specify the output format:

“Extract date, status, duration”
“Output JSON”
“Compute top-3 failure reasons with percentages”

Wrong:

Role: you are an agent for deployment analysis.
Task: analyze deployment logs.

Right:

Role: you are an agent for deployment analysis.
Task:
1) Extract date/time, status, duration for each deployment
2) Output a JSON array
3) Compute top-3 failure reasons with percentages

Business impact: what changes after the first iterations

Time savings

Deployment analysis (the baseline task):

Before: days of manual team work
After: a fast run with verification
Reduction: by multiples (through automation and a repeatable template)
Value: reclaimed opportunity cost on every analysis

Decision-making cycle:

Before: several days (collect data + analyze + review)
After: fast (agent + review + CEO-ready summary)
Reduction: by multiples (a shorter “collect → analyze → decide” loop)
Value: faster response, competitive advantage

Net impact: noticeable reduction in repeated manual work.

Quality improvements

Data accuracy:

Before: rough (manual collection, human errors, incomplete data)
After: higher (automated parsing + verification plan via spot checks)
Effect: the CEO can make more confident board-level decisions

Repeatability:

Before: 0% (each time from scratch; inconsistent methods; no documentation)
After: 100% (prompt documented; SOP repeatable; any engineer can run it)
Effect: knowledge survives attrition; the process becomes resilient

Coverage:

Before: last 10 deploys (manual work limits the sample; bias)
After: last 50 deploys (the agent does not get tired; processes the full dataset)
Effect: higher confidence; rare patterns become visible

Organizational impact

Decision speed:

Before: “Let’s gather data for a couple of days, then meet” → meeting slips
After: “Here’s data for the last 50 deployments” → decisions can be made immediately
Example: Phoenix Project prioritization in one day instead of one week

Team morale:

Before: days of “Excel drudgery”; frustration and burnout risk
After: fast agent runs; more time for strategic engineering (architecture, optimization)
Observation: workload becomes more sustainable (first-order benefit)

Scalability:

Before: bus factor = 1 (Bill is the only one who can assemble this picture)
After: bus factor > 1 (the prompt is documented; Wes and Patty can rerun the analysis)
Meaning: Bill can take vacation without the analysis capability disappearing

Economic model (your context)

Do not copy universal numbers. Use your model:

baseline: how much toil and coordination goes into fact-finding
target: what is delegated to the agent; where verification is mandatory; where STOP applies
cost: adoption time + maintaining templates/checks
value: reclaimed engineering time + lower risk + faster decisions

Use the template in Appendix A.

Comparison to Phoenix Project

Bill Palmer (2014, end state after automation):

investment: process change + automation tooling
results: Phoenix Project ships; business impact becomes visible
time-to-value: delayed (organizational changes)

Lance Bishop (2026, with agents):

investment: prompt + verification setup
results: the same task becomes faster and more repeatable; decisions become more confident
time-to-value: fast, because it does not require heavy infrastructure change up front

Agent multiplier on time-to-value: Bill spent months on organizational change before automation paid off. Lance gets incremental value from the first verified prompt, without waiting for that change to finish.

Organizational transformation has started#

After the first iterations:

Team understanding:

the team sees agents can do useful work (not a toy)
the team sees the verification plan is non-negotiable
the team sees a prompt is a repeatable process (any engineer can use it)

Foundation for scaling:

a prompt template is established: role + context + task + format + stop conditions
a verification template is established: spot checks + edge cases + sanity checks
the team is trained: Wes, Patty, Bill can write new prompts
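
One way to keep the prompt template from eroding is to encode its five sections as a structure that refuses to render when a section is empty. The sketch below is an illustration under stated assumptions: `PromptContract`, its field names, and the example text are hypothetical and are not taken from the chapter's own prompt.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PromptContract:
    role: str
    context: str
    task: str
    output_format: str
    stop_conditions: str

    def render(self) -> str:
        """Build the prompt text; fail loudly if any section is blank."""
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"empty sections: {', '.join(missing)}")
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").upper()
            parts.append(f"{title}:\n{getattr(self, f.name).strip()}")
        return "\n\n".join(parts)


# Hypothetical example content.
contract = PromptContract(
    role="You are an operations analyst reviewing deployment logs.",
    context="Logs cover the most recent deployments of one service.",
    task="List each deployment with its status and failure reason.",
    output_format="A table with columns: deploy_id, status, reason.",
    stop_conditions="If a log is unreadable or a field is missing, stop and ask.",
)
print(contract.render())
```

The design choice here is that a prompt without stop conditions is treated as an error rather than a style issue, which matches the idea of a prompt as a contract.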

Trust grows:

the CEO sees verifiable data → trusts outputs
the board sees faster decisions → competitive advantage
the team sees less toil → more engineering capacity

Next wave:

identify 5 more recurring analysis tasks (for example: incident patterns, configuration drift, performance bottlenecks)
potential: more savings and freed capacity across adjacent analysis work
plan: iterate prompts, then scale as the practice matures

Agents vs alternatives#

Alternative 1: hire a data analyst

cost: hiring and onboarding (time and money)
time-to-value: delayed
risk: human error remains; the analyst can leave

Alternative 2: buy an analytics tool

cost: licenses + integration + maintenance
time-to-value: delayed
risk: custom formats in Parts Unlimited may still need custom code

Alternative 3: agents

cost: time for prompts and verification, plus regular template updates as formats and processes change
time-to-value: fast if you have data access and a clear process
risk: agent errors remain possible; verification reduces them

Agent advantage (with correct guardrails and verification):

faster path to first value (delegate → verify → repeat)
repeatable: the same prompt defines the same process for the same inputs and rules (revisit it when formats change)

Compounding effect (Chapter 1 → Chapter 2 → … → Chapter 10)#

The compounding effect matters more than cumulative numbers:

processes become repeatable
quality and security are enforced through gates, verification, and eval loops
toil moves into delegation, freeing time for engineering decisions

Chapter 1 is the first step toward scalable value.

Prep for the next chapter#

Do this today:

Pick one recurring analysis task in your team (2+ hours manually, repeats regularly).
Write the first prompt using the template (role, context, task, format, stop conditions).
Write a verification plan (2–3 cases; compare to manual ground truth).
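
The comparison against manual ground truth in the last step can be a very small script. This is a minimal sketch: the function name and the fields `deploy_id`, `status`, and `duration_min` are hypothetical, and both the agent output and the manually checked cases are assumed to be already parsed into lists of dictionaries.

```python
def verify_against_ground_truth(agent_rows, manual_rows, key="deploy_id"):
    """Compare agent output with manually verified cases.

    Returns a list of mismatch descriptions; an empty list means every
    checked case agrees. Only the manually verified rows are checked.
    """
    agent_by_key = {row[key]: row for row in agent_rows}
    problems = []
    for expected in manual_rows:
        actual = agent_by_key.get(expected[key])
        if actual is None:
            problems.append(f"{expected[key]}: missing from agent output")
            continue
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                problems.append(f"{expected[key]}: {field} is {got!r}, expected {want!r}")
    return problems


# Hypothetical data: two ordinary cases and one edge case (d3).
agent_output = [
    {"deploy_id": "d1", "status": "failed", "duration_min": 42},
    {"deploy_id": "d2", "status": "ok", "duration_min": 10},
]
manual_cases = [
    {"deploy_id": "d1", "status": "failed", "duration_min": 42},
    {"deploy_id": "d2", "status": "failed", "duration_min": 10},
    {"deploy_id": "d3", "status": "ok", "duration_min": 5},
]
for problem in verify_against_ground_truth(agent_output, manual_cases):
    print(problem)
```

Any non-empty result means the prompt or the input data needs another look before the numbers are shown to anyone.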

Bring to Chapter 2:

the first prompt is written and verified → in Chapter 2 we add guardrails: what the agent must NOT do
the verification plan works → in Chapter 2 we strengthen stop conditions
the team saw the result → in Chapter 2 we scale the practice with a dialogue SOP

Expected compounding effect after Chapter 2:

Chapter 1: faster analysis (less toil, more predictability)
Chapter 2: safety (guardrails, lower risk)
Together: compounding value, chapter by chapter

Summary#

What we did#

Wrote a first prompt as a contract (role, context, task, output format, stop conditions)
Got structured data from logs quickly (instead of spending hours on manual work)
Learned to verify agent outputs (spot checks, edge cases, sanity checks)

Artifacts#

Prompt: a reusable template for log analysis
Verification plan: how to validate agent outputs
SOP: a repeatable process to write and validate prompts

Key principles#

A prompt as a contract: explicit role, context, task, output format, stop conditions
Trust, but verify: spot checks are mandatory
Stop conditions: the agent must know when to stop and ask

Acceptance criteria#

You have mastered the chapter if you can:

Write a prompt using the template (role, context, step-by-step task, output format, stop conditions)
Verify the output on 2–3 cases (including at least one edge case)
Describe explicitly what the agent does when data is missing (it stops and asks clarifying questions)
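
The stop-and-ask behavior in the last criterion can also be enforced outside the prompt, as a check that runs before any analysis. The sketch below is one possible shape for such a check, not the chapter's own mechanism: `REQUIRED_FIELDS`, `stop_or_proceed`, and the record layout are hypothetical.

```python
# Hypothetical fields the analysis cannot work without.
REQUIRED_FIELDS = ("deploy_id", "timestamp", "status")


def stop_or_proceed(records):
    """Return ("proceed", []) or ("stop", questions) for the given input records."""
    questions = []
    if not records:
        questions.append("No records were provided. Which log source should be analyzed?")
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
        if missing:
            questions.append(
                f"Record {index} is missing {', '.join(missing)}. "
                "Should it be skipped, or is there another source for these values?"
            )
    return ("stop", questions) if questions else ("proceed", [])


decision, questions = stop_or_proceed([
    {"deploy_id": "d1", "timestamp": "2026-01-10T12:00:00Z", "status": "ok"},
    {"deploy_id": "d2", "timestamp": "", "status": "failed"},
])
print(decision)
for question in questions:
    print(question)
```

The point of returning questions instead of raising an error is that the missing data becomes a clarifying request to a person rather than a silent guess.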

Next steps#

In Chapter 2 you will learn how to:

write a role system prompt (how the agent should behave)
add guardrails (what the agent must NOT do)
build a dialogue SOP (how the agent should clarify requirements)

Hook: you can get value quickly. Next: how do you control what an agent does, and what it must not do?

From a single prompt to a system in 10 chapters. You have taken the first step.