22. Prompt and Program Management¶
Why This Chapter?¶
You changed the prompt, and the agent got worse. You can't tell exactly what changed, and you can't roll back. Without prompt management, you lose control over agent behavior.
In production, prompts are code. They define agent behavior just like functions and conditionals. They deserve the same treatment: versioning, testing, rollback, monitoring.
Real-World Case Study¶
Situation: You updated the system prompt to improve response quality. A day later, users complain that the agent handles incidents worse.
Problem: No prompt versioning. You don't know which version was running yesterday. You can't roll back.
Solution: A centralized prompt registry with versions. Evals validate each version before deployment. A/B testing shows which version performs better. Rollback is one command.
Theory in Simple Terms¶
Prompt as Artifact¶
A prompt is not "text in code." It's an artifact that:
- Changes more often than code
- Affects behavior unpredictably (small change → big effect)
- Must be tested on every change
- Must be linked to specific runs/traces for debugging
What Are Prompt Regressions?¶
A prompt regression is agent quality degradation after a prompt change. A single word can break behavior. Evals catch regressions before deployment.
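In code, a regression gate can be as simple as comparing pass rates on a fixed eval suite. A minimal sketch, assuming a hypothetical runAgentOnce helper that executes a single agent turn:

type EvalCase struct {
	Input    string
	Validate func(output string) bool // Checks the agent's answer
}

// passRate returns the fraction of cases whose output passes validation.
// runAgentOnce is injected so the sketch stays independent of any client.
func passRate(systemPrompt string, cases []EvalCase, runAgentOnce func(system, input string) string) float64 {
	passed := 0
	for _, c := range cases {
		if c.Validate(runAgentOnce(systemPrompt, c.Input)) {
			passed++
		}
	}
	return float64(passed) / float64(len(cases))
}
// Ship a candidate version only if its pass rate is >= the baseline's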
How It Works (Step by Step)¶
Step 1: Centralized Prompt Registry¶
All prompts live in one place with metadata:
type PromptRegistry struct {
store map[string][]PromptVersion // promptID → versions
}
type PromptVersion struct {
ID string `json:"id"`
PromptID string `json:"prompt_id"`
Version string `json:"version"` // "1.0.0", "1.1.0"
Content string `json:"content"`
Variables []string `json:"variables"` // Variables in the prompt
Author string `json:"author"`
CreatedAt time.Time `json:"created_at"`
Description string `json:"description"` // What changed
Tags map[string]string `json:"tags"` // "model": "gpt-4o", "domain": "devops"
IsActive bool `json:"is_active"` // Currently used in production
}
func (r *PromptRegistry) Get(promptID, version string) (*PromptVersion, error) {
versions, ok := r.store[promptID]
if !ok {
return nil, fmt.Errorf("prompt %s not found", promptID)
}
for i := len(versions) - 1; i >= 0; i-- {
if version == "latest" || versions[i].Version == version {
return &versions[i], nil
}
}
return nil, fmt.Errorf("version %s not found", version)
}
func (r *PromptRegistry) Rollback(promptID, toVersion string) error {
	version, err := r.Get(promptID, toVersion)
	if err != nil {
		return err
	}
	// Deactivate the current version, activate the rollback target
	r.deactivateAll(promptID)
	version.IsActive = true
	return nil
}
func (r *PromptRegistry) deactivateAll(promptID string) {
	// Get returns a pointer into this slice, so flag changes persist
	versions := r.store[promptID]
	for i := range versions {
		versions[i].IsActive = false
	}
}
Step 2: Versioning with Semantic Versioning¶
Apply semver to prompts:
- MAJOR (1.0.0 → 2.0.0): Structural change (new role, new response format)
- MINOR (1.0.0 → 1.1.0): Adding instructions (new edge case, clarification)
- PATCH (1.0.0 → 1.0.1): Typo fix, formatting
// Diff between versions (naive positional comparison; swap in a real
// diff library for anything beyond short prompts)
func (r *PromptRegistry) Diff(promptID, v1, v2 string) string {
	pv1, _ := r.Get(promptID, v1)
	pv2, _ := r.Get(promptID, v2)
	lines1 := strings.Split(pv1.Content, "\n")
	lines2 := strings.Split(pv2.Content, "\n")
	var diff strings.Builder
	for i := 0; i < len(lines1) || i < len(lines2); i++ {
		var a, b string
		if i < len(lines1) {
			a = lines1[i]
		}
		if i < len(lines2) {
			b = lines2[i]
		}
		if a != b {
			fmt.Fprintf(&diff, "- %s\n+ %s\n", a, b)
		}
	}
	return diff.String()
}
Step 3: Templating¶
Prompts often contain variables. Separate the template from the data:
type PromptTemplate struct {
Template string // "You are a {{.Role}}. Your tools: {{.ToolList}}"
Defaults map[string]string // Default values
}
func (pt *PromptTemplate) Render(vars map[string]string) (string, error) {
tmpl, err := template.New("prompt").Parse(pt.Template)
if err != nil {
return "", err
}
// Merge defaults with provided variables
merged := make(map[string]string)
for k, v := range pt.Defaults {
merged[k] = v
}
for k, v := range vars {
merged[k] = v
}
var buf bytes.Buffer
if err := tmpl.Execute(&buf, merged); err != nil {
return "", err
}
return buf.String(), nil
}
// Usage
tmpl := PromptTemplate{
Template: `You are a {{.Role}} agent.
Available tools: {{.ToolList}}
SOP: {{.SOP}}
Constraints: {{.Constraints}}`,
Defaults: map[string]string{
"Constraints": "Always ask for confirmation before destructive actions.",
},
}
prompt, _ := tmpl.Render(map[string]string{
"Role": "DevOps",
"ToolList": "ping, check_status, restart_service",
"SOP": "1. Diagnose 2. Fix 3. Verify",
})
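One caveat worth knowing: by default, Go's text/template renders a missing map key as "<no value>", which silently corrupts the prompt. If you'd rather fail fast, configure the template in Render accordingly:

// Fail fast on unresolved variables instead of rendering "<no value>"
tmpl, err := template.New("prompt").
	Option("missingkey=error").
	Parse(pt.Template)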
Step 4: Prompt Playground¶
A Prompt Playground is an environment for testing prompts before deployment. You can run a prompt against several test inputs and see the results.
type PlaygroundRequest struct {
PromptVersion string `json:"prompt_version"`
TestInputs []string `json:"test_inputs"` // Test queries
Model string `json:"model"`
}
type PlaygroundResult struct {
Input string `json:"input"`
Output string `json:"output"`
Tokens int `json:"tokens"`
Latency float64 `json:"latency_ms"`
HasError bool `json:"has_error"`
}
func runPlayground(ctx context.Context, req PlaygroundRequest, client *openai.Client) []PlaygroundResult {
	prompt, err := registry.Get("system", req.PromptVersion)
	if err != nil {
		return nil // Unknown version: nothing to test
	}
	var results []PlaygroundResult
for _, input := range req.TestInputs {
start := time.Now()
resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
Model: req.Model,
Messages: []openai.ChatCompletionMessage{
{Role: openai.ChatMessageRoleSystem, Content: prompt.Content},
{Role: openai.ChatMessageRoleUser, Content: input},
},
})
result := PlaygroundResult{
Input: input,
Latency: float64(time.Since(start).Milliseconds()),
}
if err != nil {
result.HasError = true
} else {
result.Output = resp.Choices[0].Message.Content
result.Tokens = resp.Usage.TotalTokens
}
results = append(results, result)
}
return results
}
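A usage sketch for smoke-testing a candidate version (the test inputs and model name are placeholders):

results := runPlayground(ctx, PlaygroundRequest{
	PromptVersion: "1.1.0",
	TestInputs:    []string{"web-1 is down", "disk usage at 95% on db-2"},
	Model:         "gpt-4o",
}, client)
for _, r := range results {
	fmt.Printf("%q → %d tokens, %.0f ms, error=%v\n", r.Input, r.Tokens, r.Latency, r.HasError)
}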
Step 5: A/B Testing Prompts¶
Test two prompt versions in parallel on live traffic:
type ABTest struct {
Name string `json:"name"`
VersionA string `json:"version_a"` // Control group
VersionB string `json:"version_b"` // Experimental group
TrafficPct float64 `json:"traffic_pct"` // % of traffic to version B (0.0 - 1.0)
StartedAt time.Time
}
func (ab *ABTest) SelectVersion(requestID string) string {
// Deterministic selection based on requestID (for reproducibility)
hash := fnv.New32a()
hash.Write([]byte(requestID))
bucket := float64(hash.Sum32()) / float64(math.MaxUint32)
if bucket < ab.TrafficPct {
return ab.VersionB
}
return ab.VersionA
}
// Usage in the agent loop
abTest := ABTest{
Name: "improved_sop_prompt",
VersionA: "1.0.0", // Current version
VersionB: "1.1.0", // New version
TrafficPct: 0.1, // 10% of traffic to the new version
}
selectedVersion := abTest.SelectVersion(runID)
prompt, _ := registry.Get("incident_sop", selectedVersion)
Metrics for comparison:
type ABMetrics struct {
Version string
PassRate float64 // % of successful tasks
AvgLatency float64 // Average latency
AvgTokens float64 // Average token consumption
UserRating float64 // User rating (if available)
}
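A sketch of filling ABMetrics from logged runs; the Run type is an assumption about what your trace store returns:

// Aggregate per-version metrics from logged runs (hypothetical Run type)
type Run struct {
	Version string
	Passed  bool
	Latency float64
	Tokens  int
}

func aggregate(runs []Run, version string) ABMetrics {
	m := ABMetrics{Version: version}
	var n float64
	for _, r := range runs {
		if r.Version != version {
			continue
		}
		n++
		if r.Passed {
			m.PassRate++
		}
		m.AvgLatency += r.Latency
		m.AvgTokens += float64(r.Tokens)
	}
	if n > 0 {
		m.PassRate /= n
		m.AvgLatency /= n
		m.AvgTokens /= n
	}
	return m
}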
Step 6: MCP for Prompts¶
An MCP server can serve prompts to agents. This is useful when multiple agents share common prompts:
// MCP server exposes prompts as resources
type PromptMCPServer struct {
registry *PromptRegistry
}
// Agent requests a prompt via MCP
func (s *PromptMCPServer) GetResource(uri string) (string, error) {
	// URI: "prompt://incident_sop/latest"
	path := strings.TrimPrefix(uri, "prompt://")
	parts := strings.SplitN(path, "/", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("invalid prompt URI: %s", uri)
	}
	pv, err := s.registry.Get(parts[0], parts[1])
	if err != nil {
		return "", err
	}
	return pv.Content, nil
}
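On the agent side, fetching a prompt is a single resource call (the server value is assumed to be wired up at startup):

content, err := server.GetResource("prompt://incident_sop/latest")
if err != nil {
	log.Printf("failed to fetch prompt over MCP: %v", err)
}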
For more on MCP, see Chapter 18: Tool Protocols and Servers.
Step 7: Link to Traces¶
Every agent run records which prompt version was used. This lets you tie behavior to a specific version:
type RunMetadata struct {
RunID string `json:"run_id"`
PromptID string `json:"prompt_id"`
PromptVersion string `json:"prompt_version"`
Model string `json:"model"`
Timestamp time.Time
}
func logRunMetadata(runID string, prompt *PromptVersion, model string) {
metadata := RunMetadata{
RunID: runID,
PromptID: prompt.PromptID,
PromptVersion: prompt.Version,
Model: model,
Timestamp: time.Now(),
}
// Write to the trace
tracing.LogMetadata(metadata)
}
// Now when investigating an incident:
// "Which prompt version was used in run_id=abc123?"
// → prompt_id=incident_sop, version=1.1.0
For more on tracing, see Chapter 19: Observability and Tracing.
Minimal Code Example¶
A minimal example: load a prompt by version + feature flag:
func getSystemPrompt(flags FeatureFlags) string {
version := "1.0.0"
if flags.UseNewPrompt {
version = "1.1.0"
}
prompt, err := registry.Get("system_devops", version)
if err != nil {
log.Printf("Failed to get prompt version %s: %v, using default", version, err)
return defaultPrompt
}
return prompt.Content
}
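For completeness, the example assumes a FeatureFlags struct and a compiled-in fallback prompt; a minimal sketch of both:

type FeatureFlags struct {
	UseNewPrompt bool // e.g., read from a config service or an env var
}

// Fallback if the registry is unreachable (illustrative content)
const defaultPrompt = "You are a DevOps agent. Ask for confirmation before destructive actions."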
Common Errors¶
Error 1: Prompts Hardcoded in Source¶
Symptom: Changing a prompt requires a code change, code review, and a full deploy.
Cause: Prompts are stored as string constants in Go files.
Solution:
// BAD: Prompt in code
const systemPrompt = "You are a DevOps agent..."
// GOOD: Prompt from the registry
prompt, _ := registry.Get("system_devops", "latest")
Error 2: No Evals Before Deploying a Prompt¶
Symptom: A new prompt version breaks agent behavior. You find out after deployment.
Cause: Prompts are deployed without testing.
Solution:
// BAD: Deploy without validation
registry.SetActive("system_devops", "2.0.0")
// GOOD: Run evals before activation
passRate := runEvalsForPrompt("system_devops", "2.0.0")
if passRate < 0.95 {
log.Printf("Prompt 2.0.0 failed evals: %.2f < 0.95", passRate)
return // Don't activate
}
registry.SetActive("system_devops", "2.0.0")
Error 3: A/B Test Without Statistical Significance¶
Symptom: You switched 100% of traffic to the new version after "a test on 10 requests showed it's better."
Cause: Not enough data for a statistically significant comparison.
Solution:
// BAD: deciding after 10 requests
// GOOD: require a statistically significant sample before deciding
if sampleSize < 100 {
	log.Println("Not enough data for A/B decision")
	return
}
// Minimum 100-500 requests per version
// Compare across multiple metrics (pass rate, latency, tokens)
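To make "statistically significant" concrete: for pass rates you can apply a standard two-proportion z-test. A minimal sketch (assumes both samples are large and the rates are not degenerate 0/1):

// passA/passB: counts of successful runs; nA/nB: total runs per version
func zScore(passA, nA, passB, nB float64) float64 {
	pA := passA / nA
	pB := passB / nB
	pooled := (passA + passB) / (nA + nB)
	se := math.Sqrt(pooled * (1 - pooled) * (1/nA + 1/nB))
	return (pB - pA) / se
}
// |zScore| > 1.96 → the difference is significant at the ~95% level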
Error 4: No Link Between Prompt and Trace¶
Symptom: A user complains about a bad response. You don't know which prompt version was used.
Cause: Run metadata doesn't record the prompt version.
Solution:
// BAD: Run the agent without recording the prompt version
runAgent(prompt.Content, ...)
// GOOD: Log the version in the trace
logRunMetadata(runID, prompt, model)
runAgent(prompt.Content, ...)
Error 5: String Formatting Instead of Templates¶
Symptom: The prompt is built with fmt.Sprintf and 10+ arguments. You can't predict what the final prompt looks like.
Cause: No templating.
Solution:
// BAD
prompt := fmt.Sprintf("You are a %s. Tools: %s. SOP: %s. Constraints: %s.", role, tools, sop, constraints)
// GOOD
tmpl := PromptTemplate{
Template: `You are a {{.Role}}.
Tools: {{.ToolList}}
SOP: {{.SOP}}
Constraints: {{.Constraints}}`,
}
prompt, _ := tmpl.Render(vars)
Exercises¶
Exercise 1: Implement a PromptRegistry¶
Build a prompt store with Get, Add, and Rollback methods:
type PromptRegistry struct {
// Your code
}
func (r *PromptRegistry) Get(id, version string) (*PromptVersion, error) { ... }
func (r *PromptRegistry) Add(pv PromptVersion) error { ... }
func (r *PromptRegistry) Rollback(id, version string) error { ... }
Expected result:
- You can add multiple versions of a prompt
- You can retrieve a specific version or "latest"
- Rollback deactivates the current version and activates the target
Exercise 2: Implement A/B Testing¶
Implement prompt version selection by requestID:
func selectVersion(requestID string, trafficPct float64) string {
// Your code: deterministic selection based on hash(requestID)
}
Expected result:
- The same requestID always gets the same version
- trafficPct=0.1 routes ~10% of requests to version B
Exercise 3: Implement a Playground¶
Implement a function that runs a prompt against multiple test inputs (reuse PlaygroundRequest and PlaygroundResult from Step 4):
func runPlayground(ctx context.Context, req PlaygroundRequest, client *openai.Client) []PlaygroundResult { ... }
Expected result:
- Each input is tested with the given prompt
- The result contains output, tokens, latency
Completion Criteria / Checklist¶
Completed:
- Prompts are stored in a centralized registry with versions
- Each version passes evals before activation
- Templating is used for prompt variables
- The prompt version is recorded in the trace of every run
- A rollback mechanism exists
- Feature flags allow enabling/disabling versions without a deploy
Not completed:
- Prompts are hardcoded in source
- No evals for validating changes
- No link between prompt and trace
- A/B tests run without a statistically significant sample
Deep Dive¶
Prompt as Code vs Prompt as Config¶
Two approaches to prompt management:
- Prompt as Code: Prompts live in Git, changes go through PRs. Pro — full audit trail. Con — slow iteration cycle.
- Prompt as Config: Prompts live in a DB/API, changes happen through a UI. Pro — fast iteration. Con — harder to track.
The sweet spot: Prompt as Code for the system prompt (changes rarely), Prompt as Config for few-shot examples and SOPs (changes often).
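A minimal sketch of that hybrid, assuming a prompts/ directory in the repo and a few_shot_incidents prompt ID (both illustrative):

import _ "embed"

//go:embed prompts/system_devops.txt
var systemPrompt string // Prompt as Code: reviewed in PRs, shipped with the binary

func buildPrompt(registry *PromptRegistry) (string, error) {
	// Prompt as Config: few-shot examples and SOPs fetched at runtime
	sop, err := registry.Get("few_shot_incidents", "latest")
	if err != nil {
		return "", err
	}
	return systemPrompt + "\n\n" + sop.Content, nil
}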
Connection with Other Chapters¶
- Chapter 02: Prompt Engineering — how to write effective prompts
- Chapter 08: Evals and Reliability — how to test prompts
- Chapter 18: Tool Protocols and Servers — MCP for serving prompts
- Chapter 19: Observability and Tracing — linking prompts to traces
- Chapter 23: Evals in CI/CD — automated validation in the pipeline
- Agent Skills — the SKILL.md format as a standard way to package reusable prompts and instructions
What's Next?¶
After learning prompt management, move on to:
- 23. Evals in CI/CD — automated quality checks in the pipeline