Glossary
This glossary covers terms used throughout the book, in both chapters and appendices.
If a term is kept in its original English form (for example, Bus factor, OpenAPI, CI/CD), that is intentional: it keeps the book compatible with common engineering vocabulary and with search.
Abbreviations
- ADR: Architecture Decision Record — a document that captures an architecture decision, context, and rationale.
- AI: Artificial Intelligence — here: models and tools that power agents.
- AC: Acceptance Criteria — what counts as accepted and how to validate it.
- API: Application Programming Interface — the contract between services/clients.
- Bus factor: how many people can drop out before work stops; practically: how many people can complete a critical task without escalating to a single expert.
- CI/CD: Continuous Integration / Continuous Delivery — build/test/delivery pipeline.
- DoD: Definition of Done — verifiable “done” criteria for a task/iteration.
- CTO: Chief Technology Officer.
- DevOps: practices and a role between development and operations (delivery, infrastructure, automation).
- FTE: Full-Time Equivalent — capacity unit for modeling.
- HR: Human Resources.
- IaC: Infrastructure as Code — managing infrastructure via code (Terraform/Ansible, etc.).
- KPI: Key Performance Indicator.
- MTTD: Mean Time To Detect — average time to detect an incident.
- MTTR: Mean Time To Resolve/Recover — average time to recover from an incident.
- NFR: Non-Functional Requirements — performance, reliability, security, maintainability, etc.
- GraphQL: a query language and schema for APIs; in this book used as an API contract format.
- OpenAPI: a machine-readable API contract specification.
- PII: Personally Identifiable Information.
- PR: Pull Request — change proposal + review + checks -> merge.
- RFC: Request for Comments — a document/process for discussing changes (often cross-team/architectural).
- ROI: Return on Investment.
- SRE: Site Reliability Engineering.
- SLA: Service Level Agreement.
- SLI: Service Level Indicator.
- SLO: Service Level Objective.
- TCO: Total Cost of Ownership.
- TTRC: Time To Root Cause.
- VP Engineering: engineering leader (VP of Engineering).
- VPN: Virtual Private Network.
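As an illustration of the MTTD and MTTR entries above, here is a minimal sketch that averages detection and recovery times over incident records. The records are made up, and measuring MTTR from the incident start (rather than from detection) is an assumption; teams differ on this convention.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 34)),
    (datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 22, 25), datetime(2024, 1, 9, 23, 15)),
]


def mean_minutes(deltas):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# MTTD: average of (detected - started).
mttd = mean_minutes(detected - started for started, detected, _ in incidents)
# MTTR: average of (resolved - started); assumed convention, see the note above.
mttr = mean_minutes(resolved - started for started, _, resolved in incidents)

print(f"MTTD: {mttd:.1f} min")  # MTTD: 7.0 min
print(f"MTTR: {mttr:.1f} min")  # MTTR: 47.0 min
```

Whichever convention you choose, record it next to the metric so that before/after comparisons stay meaningful.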
Placeholder notation
This book often uses placeholders (stub values) in angle brackets—for example, <WINDOW>, <THRESHOLD>, <TTRC_SECONDS>.
Rules:
- Format: <UPPER_SNAKE_CASE> with no spaces (for example, <WINDOW>, not < WINDOW >).
- Meaning: placeholders are not “missing text” and should not be translated. They are variables you should replace with real thresholds/values.
- Scope: placeholders appear in templates (prompts/SOP/runbooks/AC/DoD). In production artifacts, replace them or mark them explicitly as TBD.
Examples:
- <WINDOW>: time window, such as 30m, 24h, 7d (for example: “last <WINDOW>” -> “last 7d”).
- <THRESHOLD>: threshold, such as 0.5%, 200ms, 80% CPU (for example: “error rate > <THRESHOLD>” -> “error rate > 0.5%”).
- <N>: count, such as 50, 1000 (for example: “top-<N> incidents” -> “top-50 incidents”).
- <TTRC_SECONDS> / <TTRC_TARGET>: actual/target time to root cause, such as 900 / 600 (seconds), or your chosen explicit format.
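The substitution rule above can be sketched in a few lines. This is an illustrative helper, not tooling from the book: the function names `fill_placeholders` and `unresolved` and the sample template are assumptions. It replaces known placeholders and reports any that remain, so a production artifact can be blocked or marked TBD.

```python
import re

# Matches <UPPER_SNAKE_CASE> with no spaces inside the brackets.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replace each <NAME> with values["NAME"]; leave unknown names untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def unresolved(text: str) -> list[str]:
    """Return placeholder names still present in text, in order of first appearance."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical template line using the placeholders described above.
template = "Alert if error rate > <THRESHOLD> over the last <WINDOW>; list top-<N> incidents."
filled = fill_placeholders(template, {"THRESHOLD": "0.5%", "WINDOW": "7d"})

print(filled)              # Alert if error rate > 0.5% over the last 7d; list top-<N> incidents.
print(unresolved(filled))  # ['N']
```

A check such as `unresolved(...)` fits naturally into a linter or quality gate: a non-empty result means the artifact still contains template variables.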
Terms and artifacts
- AI agent: an autonomous task executor built on a model/tooling that operates in a role under guardrails, produces artifacts, runs checks, and escalates under uncertainty/risk.
- Rules (project rules): always-on project context and conventions (for example: build/test commands, style rules, security workflow). The point is to give the team and the agent stable guardrails instead of re-explaining the same baseline constraints in every chat.
- Allowlist: a list of permitted actions/commands/operations. Everything else is denied by default.
- Audit trail: a record of who/what/when (commands, changes, decisions, artifacts) to support investigation and control.
- Baseline: a “before” reference point for metrics/quality/speed.
- Backwards compatible: the change does not break existing clients/contracts; rollback is possible without a domino effect.
- Business case: a justification document (why, effect, risks, success criteria, pilot plan).
- CAB: Change Advisory Board — classic ITIL-style change approval process.
- Canary rollout: gradual release to a small traffic/instance slice with measured signals before expanding.
- Change management: adopting changes in roles/processes/tools in a controlled way (not chaos).
- Dry run: execute an operation in “no-apply” mode (for example, --check) to validate expected effects.
- Eval dataset: a set of scenarios/cases used to measure agent quality (for example, incidents from history).
- Edge case: a rare/hard scenario where systems tend to break; the opposite of the happy path.
- Evidence: observable data, not promises: metrics, artifacts, audit logs, verification outputs.
- TRACE: a context audit block: what was actually read/used (rules, SKILL.md, artifact links) so you can see what conclusions stand on and reproduce the work.
- Gate / quality gate: a gate where an agent must stop and present artifacts/evidence for verification (by humans or automation).
- ROUTER / Skill Router: a routing discipline: choose one base role and 0..N checker roles based on risk/touchpoints, then record TRACE before the main output.
- Reviewer gate: a gate where a human must review before continuation (or before risky/irreversible steps).
- Golden tests: a fixed set of tests/cases serving as a quality reference when agents/processes change.
- Agent Skills: an open format for portable procedures/knowledge packages for agents (folder with SKILL.md, possibly scripts/, references/, assets/).
- SKILL.md: the primary Agent Skill file with YAML front matter and markdown instructions.
- Progressive disclosure: context economy: show only skill metadata first, then load full SKILL.md and resources on demand.
- Go/No-Go: a decision to continue/stop (usually after a pilot or a gate), made against pre-set criteria.
- Guardrails: explicit rules for what is allowed/forbidden; permissions; bans on dangerous operations; approval requirements.
- read_only: a mode where the executor can read/collect/analyze but cannot change system state (especially production). State-changing actions require explicit approval and/or a separate execution loop.
- Handoff: disciplined task delegation: goal, inputs, guardrails/permissions, output format, stop conditions.
- Happy path: the typical success path with no errors/unusual conditions.
- Idempotency: repeating an operation yields the same result as running it once (or changes state only in a predictable, safe way).
- Incident: an event that degrades availability/performance or violates SLO/SLA.
- Kanban / WIP limits: limiting work in progress to reduce context switching and increase predictability.
- Kill switch: a mechanism to disable an agent/feature quickly under risk or degradation.
- Least privilege: grant only the permissions required, and no more.
- Linter: static checks for code/text errors and style.
- Merge: combining changes into the main branch/codebase (typically after review and checks).
- Out-of-band approval: approval through a separate channel/process independent of potentially compromised text context (for example, a UI confirmation).
- p95 / p99: percentiles; p95 is the value below which 95% of observations fall.
- Prompt injection: an attack/error where an agent treats external data (logs, tickets, comments) as instructions and violates rules.
- Prompt template: a repeatable task framing structure (context -> guardrails -> DoD -> verification plan -> expected output).
- Commands (workflow shortcuts): repeatable workflows packaged so you can run them as a single invocation (for example: open a PR, run checks, generate a report). This is useful for high-frequency scenarios where consistency of steps and output format matters.
- Redaction: masking/removing secrets and PII from logs/artifacts.
- Regression: quality/metric degradation after a change.
- Resume: continuing an earlier worker run with its saved context/state.
- Risk register: a list of risks with scenarios, consequences, mitigations, and verification methods.
- Rollback: a planned and verifiable way to return to a stable state quickly.
- runbook/runbooks: an executable or semi-executable incident/operation procedure (steps + conditions + checks + escalation).
- SOP (standard operating procedure): a repeatable process for a typical task (steps, stop conditions, checks, artifacts, done criteria).
- Stop conditions: criteria that force an agent to stop and escalate (missing data, high risk, dangerous action, ambiguity).
- Subagent: a specialized executor with separate context that receives a narrow task for focus/isolation (a work-organization concept, not a vendor feature requirement).
- Threat model: a structured description of assets, threats, attack vectors, and mitigations before production.
- Triage: first-pass incident work: gather facts, narrow hypotheses, decide escalation/action.
- Verification agent (verifier): an independent skeptic that proves “done” is actually true; returns a report of what passed/failed/was not checked.
- Verification / verification plan: how to validate correctness (sampling, edge cases, metrics, before/after comparison, acceptance thresholds).
- Versioning / semver: versioning rules for templates/prompts/processes to track breaking changes and compatibility.
- Test runner: a role that proactively runs tests/checks (including golden tests) and drives the loop to green without rewriting expectations.
- Debugger: a role for root cause analysis: reproduce -> localize -> minimal fix/experiment -> verify with evidence.
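Several of the terms above (Allowlist, Guardrails, read_only, Least privilege) combine into a single decision: may the agent run this command? The sketch below is one possible shape for that check, under stated assumptions. The policy table, the `decide` function, and the three verdict strings are hypothetical, and the prefix matching is deliberately naive.

```python
import shlex

# Hypothetical policy: command prefix -> True if the command changes system state.
ALLOWLIST: dict[tuple[str, ...], bool] = {
    ("git", "status"): False,
    ("git", "diff"): False,
    ("ls",): False,
    ("git", "push"): True,
}


def decide(command: str, read_only: bool = True) -> str:
    """Return "allow", "needs_approval", or "deny" for a shell-style command string."""
    argv = tuple(shlex.split(command))
    for prefix, changes_state in ALLOWLIST.items():
        if argv[: len(prefix)] == prefix:
            # Allowlisted; state-changing commands still need approval in read_only mode.
            return "needs_approval" if changes_state and read_only else "allow"
    # Deny by default: anything not on the allowlist is refused.
    return "deny"


print(decide("git status"))                  # allow
print(decide("git push origin main"))        # needs_approval
print(decide("git push", read_only=False))   # allow
print(decide("rm -rf /tmp/x"))               # deny
```

The "needs_approval" verdict is where an out-of-band approval or reviewer gate would plug in. A real implementation would also need to handle shell metacharacters, argument validation, and audit logging, none of which this sketch attempts.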
Tools and formats (in this book’s context)
- Agent: an agent that can read/change files, run commands, collect artifacts, and drive an end-to-end task (for example, Cursor Agent).
- Chat mode: a dialogue interface, useful for discussion/design/research.
- JSON / YAML: common data/config formats used in examples (logs, configs, pipelines).