Glossary

This glossary covers terms used throughout the book, in both chapters and appendices.

If a term is kept in its original English form (for example, Bus factor, OpenAPI, CI/CD), that is intentional: it keeps the book compatible with common engineering vocabulary and with search.

Abbreviations

ADR: Architecture Decision Record — a document that captures an architecture decision, context, and rationale.
AI: Artificial Intelligence — here: models and tools that power agents.
AC: Acceptance Criteria — what counts as accepted and how to validate it.
API: Application Programming Interface — the contract between services/clients.
Bus factor: how many people can drop out before work stops; practically: how many people can complete a critical task without escalating to a single expert.
CI/CD: Continuous Integration / Continuous Delivery — build/test/delivery pipeline.
DoD: Definition of Done — verifiable “done” criteria for a task/iteration.
CTO: Chief Technology Officer.
DevOps: practices and a role between development and operations (delivery, infrastructure, automation).
FTE: Full-Time Equivalent — capacity unit for modeling.
HR: Human Resources.
IaC: Infrastructure as Code — managing infrastructure via code (Terraform/Ansible, etc.).
KPI: Key Performance Indicator.
MTTD: Mean Time To Detect — average time to detect an incident.
MTTR: Mean Time To Resolve/Recover — average time to recover from an incident.
NFR: Non-Functional Requirements — performance, reliability, security, maintainability, etc.
GraphQL: a query language and schema for APIs; in this book used as an API contract format.
OpenAPI: a machine-readable API contract specification.
PII: Personally Identifiable Information.
PR: Pull Request — change proposal + review + checks -> merge.
RFC: Request for Comments — a document/process for discussing changes (often cross-team/architectural).
ROI: Return on Investment.
SRE: Site Reliability Engineering.
SLA: Service Level Agreement.
SLI: Service Level Indicator.
SLO: Service Level Objective.
TCO: Total Cost of Ownership.
TTRC: Time To Root Cause.
VP Engineering: engineering leader (VP of Engineering).
VPN: Virtual Private Network.
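
MTTD and MTTR from the list above are plain averages over incident timestamps. The following is a minimal sketch of that arithmetic. The IncidentRecord shape and its field names are hypothetical, and measuring MTTR from incident start (rather than from detection) is an assumption; choose one convention and state it next to the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    """Hypothetical record shape; field names are illustrative only."""
    started_at: datetime
    detected_at: datetime
    recovered_at: datetime


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents: list[IncidentRecord]) -> float:
    """Mean Time To Detect: average of (detected - started)."""
    return mean_minutes([i.detected_at - i.started_at for i in incidents])


def mttr_minutes(incidents: list[IncidentRecord]) -> float:
    """Mean Time To Recover: average of (recovered - started). Assumption:
    the clock starts at incident start, not at detection."""
    return mean_minutes([i.recovered_at - i.started_at for i in incidents])


t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    IncidentRecord(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    IncidentRecord(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=50)),
]
print(mttd_minutes(incidents))  # 5.0
print(mttr_minutes(incidents))  # 40.0
```

A mean hides outliers, so it is often worth reporting a percentile (p95) alongside it.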

Placeholder notation

This book often uses placeholders (stub values) in angle brackets—for example, <WINDOW>, <THRESHOLD>, <TTRC_SECONDS>.

Rules:

Format: <UPPER_SNAKE_CASE> with no spaces (for example, <WINDOW>, not < WINDOW >).
Meaning: placeholders are not “missing text” and should not be translated. They are variables you should replace with real thresholds/values.
Scope: placeholders appear in templates (prompts/SOP/runbooks/AC/DoD). In production artifacts, replace them or mark them explicitly as TBD.

Examples:

<WINDOW> — time window: 30m, 24h, 7d (for example: “last <WINDOW>” -> “last 7d”).
<THRESHOLD> — threshold: 0.5%, 200ms, 80% CPU (for example: “error rate > <THRESHOLD>” -> “error rate > 0.5%”).
<N> — count: 50, 1000 (for example: “top-<N> incidents” -> “top-50 incidents”).
<TTRC_SECONDS> / <TTRC_TARGET> — actual/target time to root cause: 900 / 600 (seconds) or your chosen explicit format.
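
The rules above can be checked mechanically before a template becomes a production artifact. The sketch below is one possible approach, not a tool from this book: the function name fill_placeholders and the regular expression are assumptions. It substitutes known values and reports every placeholder that is still unresolved, so it can be replaced or marked explicitly as TBD.

```python
import re

# Matches well-formed placeholders such as <WINDOW> or <TTRC_SECONDS>:
# upper snake case, no spaces inside the angle brackets.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*)>")


def fill_placeholders(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Replace known placeholders; return the text and the names left unresolved."""
    unresolved: list[str] = []

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in values:
            return values[name]
        if name not in unresolved:
            unresolved.append(name)
        return match.group(0)  # keep the placeholder visible in the output

    return PLACEHOLDER.sub(substitute, template), unresolved


template = "Alert when error rate > <THRESHOLD> over the last <WINDOW>; list top-<N> incidents."
text, unresolved = fill_placeholders(template, {"THRESHOLD": "0.5%", "WINDOW": "7d"})
print(text)        # Alert when error rate > 0.5% over the last 7d; list top-<N> incidents.
print(unresolved)  # ['N']
```

A non-empty unresolved list is a natural stop condition: the artifact is not ready until each remaining name has a value or an explicit TBD marker.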

Terms and artifacts

AI agent: an autonomous task executor built on a model/tooling that operates in a role under guardrails, produces artifacts, runs checks, and escalates under uncertainty/risk.
Rules (project rules): always-on project context and conventions (for example: build/test commands, style rules, security workflow). The point is to give the team and the agent stable guardrails instead of re-explaining the same baseline constraints in every chat.
Allowlist: a list of permitted actions/commands/operations. Everything else is denied by default.
Audit trail: a record of who/what/when (commands, changes, decisions, artifacts) to support investigation and control.
Baseline: a “before” reference point for metrics/quality/speed.
Backwards compatible: the change does not break existing clients/contracts; rollback is possible without a domino effect.
Business case: a justification document (why, effect, risks, success criteria, pilot plan).
CAB: Change Advisory Board — classic ITIL-style change approval process.
Canary rollout: gradual release to a small traffic/instance slice with measured signals before expanding.
Change management: adopting changes in roles/processes/tools in a controlled, planned way rather than ad hoc.
Dry run: execute an operation in “no-apply” mode (for example, Ansible's --check) to preview its expected effects without changing state.
Eval dataset: a set of scenarios/cases used to measure agent quality (for example, incidents from history).
Edge case: a rare/hard scenario where systems tend to break; the opposite of the happy path.
Evidence: observable data, not promises: metrics, artifacts, audit logs, verification outputs.
TRACE: a context audit block: what was actually read/used (rules, SKILL.md, artifact links) so you can see what the conclusions rest on and reproduce the work.
Gate / quality gate: a checkpoint where an agent must stop and present artifacts/evidence for verification (by humans or automation).
ROUTER / Skill Router: a routing discipline: choose one base role and 0..N checker roles based on risk/touchpoints, then record TRACE before the main output.
Reviewer gate: a gate where a human must review before continuation (or before risky/irreversible steps).
golden tests: a fixed set of tests/cases serving as a quality reference when agents/processes change.
Agent Skills: an open format for portable procedures/knowledge packages for agents (folder with SKILL.md, possibly scripts/, references/, assets/).
SKILL.md: the primary Agent Skill file with YAML front matter and markdown instructions.
Progressive disclosure: context economy: show only skill metadata first, then load full SKILL.md and resources on demand.
Go/No-Go: a decision to continue/stop (usually after a pilot or a gate), made against pre-set criteria.
Guardrails: explicit rules for what is allowed/forbidden; permissions; bans on dangerous operations; approval requirements.
read_only: a mode where the executor can read/collect/analyze but cannot change system state (especially production). State-changing actions require explicit approval and/or a separate execution loop.
Handoff: disciplined task delegation: goal, inputs, guardrails/permissions, output format, stop conditions.
Happy path: the typical success path with no errors/unusual conditions.
Idempotency: repeating an operation leaves the system in the same state as running it once (or changes it predictably and safely).
Incident: an event that degrades availability/performance or violates SLO/SLA.
Kanban / WIP limits: limiting work in progress to reduce context switching and increase predictability.
Kill switch: a mechanism to disable an agent/feature quickly under risk or degradation.
Least privilege: grant only the permissions required, and no more.
Linter: static checks for code/text errors and style.
Merge: combining changes into the main branch/codebase (typically after review and checks).
Out-of-band approval: approval through a separate channel/process independent of potentially compromised text context (for example, a UI confirmation).
p95 / p99: percentiles; p95 is the value below which 95% of observations fall, and p99 is the value below which 99% fall.
Prompt injection: an attack/error where an agent treats external data (logs, tickets, comments) as instructions and violates rules.
Prompt template: a repeatable task framing structure (context -> guardrails -> DoD -> verification plan -> expected output).
Commands (workflow shortcuts): repeatable workflows packaged so you can run them as a single invocation (for example: open a PR, run checks, generate a report). This is useful for high-frequency scenarios where consistency of steps and output format matters.
Redaction: masking/removing secrets and PII from logs/artifacts.
Regression: quality/metric degradation after a change.
Resume: continuing an earlier worker run with its saved context/state.
Risk register: a list of risks with scenarios, consequences, mitigations, and verification methods.
Rollback: a planned and verifiable way to return to a stable state quickly.
runbook / runbooks: an executable or semi-executable incident/operation procedure (steps + conditions + checks + escalation).
SOP (standard operating procedure): a repeatable process for a typical task (steps, stop conditions, checks, artifacts, done criteria).
Stop conditions: criteria that force an agent to stop and escalate (missing data, high risk, dangerous action, ambiguity).
Subagent: a specialized executor with separate context that receives a narrow task for focus/isolation (a work-organization concept, not a vendor feature requirement).
Threat model: a structured description of assets, threats, attack vectors, and mitigations before production.
Triage: first-pass incident work: gather facts, narrow hypotheses, decide escalation/action.
Verification agent (verifier): an independent skeptic that proves “done” is actually true; returns a report of what passed/failed/was not checked.
Verification / verification plan: how to validate correctness (sampling, edge cases, metrics, before/after comparison, acceptance thresholds).
Versioning / semver: versioning rules for templates/prompts/processes to track breaking changes and compatibility.
Test runner: a role that proactively runs tests/checks (including golden tests) and drives the loop to green without rewriting expectations.
Debugger: a role for root cause analysis: reproduce -> localize -> minimal fix/experiment -> verify with evidence.
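
The Allowlist entry above (“everything else is denied by default”) can be sketched as an exact-match check on a parsed command line. This is a hedged illustration, not the book's implementation: ALLOWED_COMMANDS and is_allowed are hypothetical names, and the listed commands are examples only. Matching the full token sequence rather than a prefix is a deliberate choice, so that extra arguments or chained shell operators do not slip through.

```python
import shlex

# Hypothetical allowlist: each entry is the complete token sequence of a
# permitted command. Anything not listed here is denied.
ALLOWED_COMMANDS = {
    ("git", "status"),
    ("git", "diff"),
    ("pytest", "-q"),
}


def is_allowed(command_line: str) -> bool:
    """Return True only if the whole command exactly matches an allowlist entry."""
    try:
        tokens = tuple(shlex.split(command_line))
    except ValueError:
        # Unbalanced quotes or similar parse errors: deny rather than guess.
        return False
    return tokens in ALLOWED_COMMANDS


print(is_allowed("git status"))                # True
print(is_allowed("git push origin main"))      # False
print(is_allowed("git status && rm -rf /"))    # False
```

A denied command is a candidate stop condition: the agent escalates instead of looking for an equivalent command that happens to pass.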

Tools and formats (in this book’s context)

Agent: a tool mode in which the assistant can read/change files, run commands, collect artifacts, and drive an end-to-end task (for example, Cursor Agent).
Chat mode: a dialogue interface, useful for discussion/design/research.
JSON / YAML: common data/config formats used in examples (logs, configs, pipelines).