11. State Management¶
Why This Chapter?¶
An agent runs a long task (say, a deployment), and then the server reboots. The task is gone. The user waits, but nothing happens. Without state management, you can't:
- Resume execution after failure
- Guarantee idempotency (repeated call doesn't create duplicates)
- Handle errors with retry
- Set deadlines for long tasks
State management is what makes long-lived agents reliable. Without it, tasks that take minutes or hours fall apart.
Real-World Case Study¶
Situation: An agent deploys an application. The process takes 10 minutes. In the 8th minute, the server reboots.
Problem: The task is lost. The user doesn't know what happened. On restart, the agent starts from the beginning and creates duplicates.
Solution: Persist state in a DB, make operations idempotent, add retry with backoff, and enforce deadlines. Now the agent can resume from where it stopped, and repeated calls won't create duplicates.
Theory in Simple Terms¶
What Is State Management?¶
State management means persisting agent state so that it survives restarts. This lets you:
- Resume execution after failure
- Track task progress
- Guarantee idempotency
What Is Idempotency?¶
Idempotency is a property of an operation: calling it again leaves the system in the same state as calling it once. For example, "create file" isn't idempotent (a repeated call creates duplicates), but "create file if it doesn't exist" is idempotent.
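One common way to get this property for arbitrary operations is an idempotency key: the caller names the operation, and the executor remembers which keys have already succeeded. The sketch below is illustrative only. `IdempotentRunner`, `Do`, and the key format are hypothetical names, and the in-memory map stands in for a persistent store.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentRunner remembers which operation keys have already succeeded.
// Hypothetical helper: in production the map would live in a database.
type IdempotentRunner struct {
	mu   sync.Mutex
	done map[string]string // operation key -> stored result
}

func NewIdempotentRunner() *IdempotentRunner {
	return &IdempotentRunner{done: make(map[string]string)}
}

// Do runs fn at most once per key. A repeated call with the same key
// returns the stored result instead of repeating the side effect.
func (r *IdempotentRunner) Do(key string, fn func() (string, error)) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if res, ok := r.done[key]; ok {
		return res, nil
	}
	res, err := fn()
	if err != nil {
		return "", err // not recorded, so the caller may retry
	}
	r.done[key] = res
	return res, nil
}

func main() {
	runner := NewIdempotentRunner()
	calls := 0
	create := func() (string, error) {
		calls++
		return fmt.Sprintf("release-%d", calls), nil
	}

	first, _ := runner.Do("deploy:service-x:v42", create)
	second, _ := runner.Do("deploy:service-x:v42", create)
	fmt.Println(first, second, calls) // release-1 release-1 1
}
```

Failed calls are deliberately not recorded, so a retry with the same key runs the operation again.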
Connection with Planning¶
State Management is closely related to Planning, but it focuses on execution reliability rather than task decomposition. Planning creates a plan; State Management makes sure that plan executes reliably.
How It Works (Step by Step)¶
Step 0: Agent state as a contract (AgentState)¶
In the examples below we store a task state (Task). For production agents it's often helpful to also define a canonical agent run state. This makes long-running loops easier to operate.
What you get:
- resume after restarts,
- revise the plan when new facts arrive,
- enforce human-in-the-loop (HITL) approval based on policy,
- keep the context small by using artifacts.
A minimal shape (simplified):
{
"goal": "Deploy service X to staging",
"constraints": {
"human_in_the_loop": { "required_for_risk_levels": ["write_local", "external_action"] }
},
"budget": {
"max_steps": 20,
"max_wall_time_ms": 300000,
"max_llm_tokens": 200000,
"max_artifact_bytes_in_context": 8000
},
"plan": ["Check current status", "Collect config", "Apply changes", "Verify again"],
"known_facts": [{ "key": "service", "value": "X", "source": "user" }],
"open_questions": ["Which namespace should we deploy to?"],
"artifacts": [{ "artifact_id": "log_123", "type": "tool_result.logs", "summary": "nginx error log", "bytes": 48231 }],
"risk_flags": ["budget_pressure"]
}
To update this state between steps, use a StatePatch: "append facts", "replace plan", "add open questions". This also supports a clean split of responsibilities: one component normalizes observations, another selects the next action.
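A patch-based update could look roughly like the sketch below. The chapter does not define the shape of StatePatch, so the field names (`AppendFacts`, `ReplacePlan`, `AddOpenQuestions`) and the `Apply` method are assumptions, and `AgentState` is trimmed to the fields the patch touches.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fact mirrors one entry of "known_facts" in the JSON shape above.
type Fact struct {
	Key    string `json:"key"`
	Value  string `json:"value"`
	Source string `json:"source"`
}

// AgentState is a trimmed version of the JSON shape above.
type AgentState struct {
	Goal          string   `json:"goal"`
	Plan          []string `json:"plan"`
	KnownFacts    []Fact   `json:"known_facts"`
	OpenQuestions []string `json:"open_questions"`
}

// StatePatch describes one update. Field names are assumed for illustration.
type StatePatch struct {
	AppendFacts      []Fact   `json:"append_facts,omitempty"`
	ReplacePlan      []string `json:"replace_plan,omitempty"`
	AddOpenQuestions []string `json:"add_open_questions,omitempty"`
}

// Apply returns a new state and leaves the receiver's slices untouched,
// so the previous state can still be logged or checkpointed.
func (s AgentState) Apply(p StatePatch) AgentState {
	next := s
	next.KnownFacts = append(append([]Fact{}, s.KnownFacts...), p.AppendFacts...)
	next.OpenQuestions = append(append([]string{}, s.OpenQuestions...), p.AddOpenQuestions...)
	if p.ReplacePlan != nil {
		next.Plan = append([]string{}, p.ReplacePlan...)
	}
	return next
}

func main() {
	state := AgentState{
		Goal:          "Deploy service X to staging",
		Plan:          []string{"Check current status", "Collect config"},
		OpenQuestions: []string{"Which namespace should we deploy to?"},
	}
	patch := StatePatch{
		AppendFacts: []Fact{{Key: "namespace", Value: "staging", Source: "user"}},
		ReplacePlan: []string{"Apply changes", "Verify again"},
	}

	next := state.Apply(patch)
	out, _ := json.Marshal(next)
	fmt.Println(string(out))
	fmt.Println(len(state.KnownFacts), len(next.KnownFacts)) // 0 1
}
```

Returning a new value instead of mutating in place keeps each step's input and output states separate, which makes them easy to diff or persist.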
Step 1: Task Structure with State¶
Create a structure to store task state:
type TaskState string
const (
TaskPending TaskState = "pending"
TaskRunning TaskState = "running"
TaskCompleted TaskState = "completed"
TaskFailed TaskState = "failed"
)
type Task struct {
ID string `json:"id"`
UserInput string `json:"user_input"`
State TaskState `json:"state"`
Result string `json:"result,omitempty"`
Error string `json:"error,omitempty"`
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
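With explicit states it can help to also write down which transitions are legal, so a bug cannot move a completed task back to running. The table below is an assumed policy for illustration, not something the chapter prescribes; `allowedTransitions` and `canTransition` are hypothetical names.

```go
package main

import "fmt"

type TaskState string

const (
	TaskPending   TaskState = "pending"
	TaskRunning   TaskState = "running"
	TaskCompleted TaskState = "completed"
	TaskFailed    TaskState = "failed"
)

// allowedTransitions lists the states a task may move to from each state.
// Assumed policy: failed tasks go back to pending for a retry,
// and completed is terminal.
var allowedTransitions = map[TaskState][]TaskState{
	TaskPending:   {TaskRunning},
	TaskRunning:   {TaskCompleted, TaskFailed},
	TaskFailed:    {TaskPending},
	TaskCompleted: {},
}

// canTransition reports whether moving from one state to another is legal.
func canTransition(from, to TaskState) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(TaskPending, TaskRunning))   // true
	fmt.Println(canTransition(TaskCompleted, TaskRunning)) // false
}
```

A check like this would typically run just before the state is saved. Re-saving the same state, for example a task that is still running after a restart, is not a transition and needs no check.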
Step 2: Operation Idempotency¶
Check if the task was already executed:
func executeTask(id string) error {
// Load task from DB
task, exists := getTask(id)
if !exists {
return fmt.Errorf("task not found: %s", id)
}
// Check idempotency
if task.State == TaskCompleted {
return nil // Already executed, do nothing
}
// Set state to "running"
task.State = TaskRunning
task.UpdatedAt = time.Now()
saveTask(task)
// Execute task...
result, err := doWork(task.UserInput)
if err != nil {
task.State = TaskFailed
task.Error = err.Error()
} else {
task.State = TaskCompleted
task.Result = result
}
task.UpdatedAt = time.Now()
saveTask(task)
return err
}
Step 3: Retry with Exponential Backoff¶
Retry the call on error, doubling the delay after each failed attempt:
func executeWithRetry(fn func() error, maxRetries int) error {
var lastErr error
for i := 0; i < maxRetries; i++ {
err := fn()
if err == nil {
return nil
}
lastErr = err
// Don't backoff after last attempt
if i < maxRetries-1 {
backoff := time.Duration(1<<i) * time.Second // 1s, 2s, 4s, 8s...
time.Sleep(backoff)
}
}
// %w keeps the original error available to errors.Is / errors.As
return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}
Step 4: Deadlines¶
Set a timeout for the entire agent run and check it before each step:
func runAgentWithDeadline(ctx context.Context, client *openai.Client, userInput string) (string, error) {
	// Deadline for the entire agent run (5 minutes)
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()
	// ... agent loop ...
	// maxIterations is assumed to be a package-level constant
	for i := 0; i < maxIterations; i++ {
		// Check the deadline before each iteration
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("agent run stopped: %w", ctx.Err())
		default:
		}
		// ... execution (pass ctx to every LLM and tool call) ...
	}
	return "", fmt.Errorf("no final answer after %d iterations", maxIterations)
}
Step 5: State Persistence¶
Save task state to DB (or file for simplicity):
// Simple file-based implementation
var tasks = make(map[string]*Task)
var tasksMutex sync.RWMutex
func saveTask(task *Task) {
tasksMutex.Lock()
defer tasksMutex.Unlock()
task.UpdatedAt = time.Now()
tasks[task.ID] = task
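// Save to file (for simplicity). Errors are reported on stderr rather than
// returned, to keep the saveTask signature used throughout this chapter.
data, err := json.Marshal(tasks)
if err != nil {
fmt.Fprintf(os.Stderr, "saveTask: encode: %v\n", err)
return
}
if err := os.WriteFile("tasks.json", data, 0644); err != nil {
fmt.Fprintf(os.Stderr, "saveTask: write: %v\n", err)
}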
}
func getTask(id string) (*Task, bool) {
tasksMutex.RLock()
defer tasksMutex.RUnlock()
task, exists := tasks[id]
return task, exists
}
Step 6: Resume Execution¶
Continue task execution after failure:
func resumeTask(taskID string) error {
task, exists := getTask(taskID)
if !exists {
return fmt.Errorf("task not found: %s", taskID)
}
// If task already completed, do nothing
if task.State == TaskCompleted {
return nil
}
// If task failed, can retry
if task.State == TaskFailed {
task.State = TaskPending
saveTask(task)
}
// Continue execution
return executeTask(taskID)
}
Where to Integrate This in Our Code¶
Integration Point 1: Agent Loop¶
In labs/lab04-autonomy/main.go add state persistence:
// At start of agent run:
taskID := generateTaskID()
task := &Task{
ID: taskID,
UserInput: userInput,
State: TaskRunning,
CreatedAt: time.Now(),
}
saveTask(task)
// In loop save progress:
task.State = TaskRunning
saveTask(task)
// After completion:
task.State = TaskCompleted
task.Result = finalAnswer
saveTask(task)
Integration Point 2: Tool Execution¶
In labs/lab02-tools/main.go add retry for tools:
func executeToolWithRetry(toolCall openai.ToolCall) (string, error) {
	// result is captured by the closure so the last successful value survives
	var result string
	err := executeWithRetry(func() error {
		var err error
		result, err = executeTool(toolCall)
		return err
	}, 3)
	return result, err
}
Mini Code Example¶
Complete example with workflow and state management based on labs/lab04-autonomy/main.go:
package main
import (
"context"
"encoding/json"
"fmt"
"os"
"sync"
"time"
"github.com/sashabaranov/go-openai"
)
type TaskState string
const (
TaskPending TaskState = "pending"
TaskRunning TaskState = "running"
TaskCompleted TaskState = "completed"
TaskFailed TaskState = "failed"
)
type Task struct {
ID string `json:"id"`
UserInput string `json:"user_input"`
State TaskState `json:"state"`
Result string `json:"result,omitempty"`
Error string `json:"error,omitempty"`
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
var tasks = make(map[string]*Task)
var tasksMutex sync.RWMutex
func generateTaskID() string {
return fmt.Sprintf("task-%d", time.Now().UnixNano())
}
func saveTask(task *Task) {
tasksMutex.Lock()
defer tasksMutex.Unlock()
task.UpdatedAt = time.Now()
tasks[task.ID] = task
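// Persistence errors are reported on stderr instead of being dropped
data, err := json.Marshal(tasks)
if err != nil {
fmt.Fprintf(os.Stderr, "saveTask: encode: %v\n", err)
return
}
if err := os.WriteFile("tasks.json", data, 0644); err != nil {
fmt.Fprintf(os.Stderr, "saveTask: write: %v\n", err)
}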
}
func getTask(id string) (*Task, bool) {
tasksMutex.RLock()
defer tasksMutex.RUnlock()
task, exists := tasks[id]
return task, exists
}
func executeWithRetry(fn func() error, maxRetries int) error {
var lastErr error
for i := 0; i < maxRetries; i++ {
err := fn()
if err == nil {
return nil
}
lastErr = err
if i < maxRetries-1 {
backoff := time.Duration(1<<i) * time.Second
fmt.Printf("Retry %d/%d after %v...\n", i+1, maxRetries, backoff)
time.Sleep(backoff)
}
}
return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}
func checkDisk() string { return "Disk Usage: 95% (CRITICAL). Large folder: /var/log" }
func cleanLogs() string { return "Logs cleaned. Freed 20GB." }
func main() {
token := os.Getenv("OPENAI_API_KEY")
baseURL := os.Getenv("OPENAI_BASE_URL")
if token == "" {
token = "dummy"
}
config := openai.DefaultConfig(token)
if baseURL != "" {
config.BaseURL = baseURL
}
client := openai.NewClientWithConfig(config)
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(5*time.Minute))
defer cancel()
userInput := "I'm out of disk space. Fix it."
// Create task
taskID := generateTaskID()
task := &Task{
ID: taskID,
UserInput: userInput,
State: TaskRunning,
CreatedAt: time.Now(),
}
saveTask(task)
tools := []openai.Tool{
{
Type: openai.ToolTypeFunction,
Function: &openai.FunctionDefinition{
Name: "check_disk",
Description: "Check current disk usage",
},
},
{
Type: openai.ToolTypeFunction,
Function: &openai.FunctionDefinition{
Name: "clean_logs",
Description: "Delete old logs to free space",
},
},
}
messages := []openai.ChatCompletionMessage{
{Role: openai.ChatMessageRoleSystem, Content: "You are an autonomous DevOps agent."},
{Role: openai.ChatMessageRoleUser, Content: userInput},
}
fmt.Printf("Starting Agent Loop (task_id: %s)...\n", taskID)
for i := 0; i < 5; i++ {
// Check deadline
select {
case <-ctx.Done():
task.State = TaskFailed
task.Error = "deadline exceeded"
saveTask(task)
fmt.Println("Deadline exceeded")
return
default:
}
req := openai.ChatCompletionRequest{
Model: "gpt-4o-mini",
Messages: messages,
Tools: tools,
}
resp, err := client.CreateChatCompletion(ctx, req)
if err != nil {
task.State = TaskFailed
task.Error = err.Error()
saveTask(task)
panic(fmt.Sprintf("API Error: %v", err))
}
msg := resp.Choices[0].Message
messages = append(messages, msg)
if len(msg.ToolCalls) == 0 {
task.State = TaskCompleted
task.Result = msg.Content
saveTask(task)
fmt.Println("AI:", msg.Content)
break
}
for _, toolCall := range msg.ToolCalls {
fmt.Printf("Executing tool: %s\n", toolCall.Function.Name)
var result string
err := executeWithRetry(func() error {
switch toolCall.Function.Name {
case "check_disk":
result = checkDisk()
case "clean_logs":
result = cleanLogs()
default:
return fmt.Errorf("unknown tool: %s", toolCall.Function.Name)
}
return nil
}, 3)
if err != nil {
// Every tool call in the assistant message needs a matching tool message,
// so report the failure to the model instead of skipping the reply.
fmt.Printf("Tool execution failed: %v\n", err)
result = fmt.Sprintf("Error: %v", err)
}
fmt.Println("Tool Output:", result)
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleTool,
Content: result,
ToolCallID: toolCall.ID,
})
}
}
}
Store: Database-Backed Storage¶
The examples above use a tasks.json file to store state. This works for learning, but file storage is unreliable in production.
Why a File Isn't Production-Ready¶
File storage has three problems:
- No atomicity. If the process crashes during a write, the file can be left truncated or corrupted.
- No concurrent access. Two agents can't safely write to the same file.
- No queries. To find all incomplete tasks, you have to read the entire file.
Databases solve all three problems. PostgreSQL is a solid choice for production. SQLite works well for local development.
StateStore: Storage Interface¶
Separate the interface from the implementation. This lets you swap the store in tests and change it without rewriting agent logic.
// StateStore defines the contract for agent state storage.
// Implementations can use PostgreSQL, SQLite, or in-memory storage.
type StateStore interface {
Save(ctx context.Context, task *Task) error
Get(ctx context.Context, id string) (*Task, error)
ListByState(ctx context.Context, state TaskState) ([]*Task, error)
}
PostgreSQL Implementation¶
type PgStateStore struct {
db *sql.DB
}
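// NewPgStateStore opens a connection pool and verifies it with a ping.
// The "pgx" driver name assumes the pgx stdlib adapter is registered via
// a blank import: import _ "github.com/jackc/pgx/v5/stdlib"
func NewPgStateStore(dsn string) (*PgStateStore, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, fmt.Errorf("open postgres: %w", err)
	}
	// sql.Open may only validate its arguments; Ping forces a real connection
	if err := db.Ping(); err != nil {
		db.Close()
		return nil, fmt.Errorf("connect to postgres: %w", err)
	}
	return &PgStateStore{db: db}, nil
}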
func (s *PgStateStore) Save(ctx context.Context, task *Task) error {
// UPSERT: insert a new task or update an existing one
query := `
INSERT INTO agent_tasks (id, user_input, state, result, error, created_at, updated_at)
VALUES ($1, $2, $3, $4, $5, $6, now())
ON CONFLICT (id) DO UPDATE SET
state = EXCLUDED.state,
result = EXCLUDED.result,
error = EXCLUDED.error,
updated_at = now()`
_, err := s.db.ExecContext(ctx, query,
task.ID, task.UserInput, task.State,
task.Result, task.Error, task.CreatedAt,
)
return err
}
func (s *PgStateStore) Get(ctx context.Context, id string) (*Task, error) {
task := &Task{}
err := s.db.QueryRowContext(ctx,
`SELECT id, user_input, state, result, error, created_at, updated_at
FROM agent_tasks WHERE id = $1`, id,
).Scan(
&task.ID, &task.UserInput, &task.State,
&task.Result, &task.Error,
&task.CreatedAt, &task.UpdatedAt,
)
if errors.Is(err, sql.ErrNoRows) {
return nil, nil
}
return task, err
}
func (s *PgStateStore) ListByState(ctx context.Context, state TaskState) ([]*Task, error) {
rows, err := s.db.QueryContext(ctx,
`SELECT id, user_input, state, result, error, created_at, updated_at
FROM agent_tasks WHERE state = $1 ORDER BY created_at`, state,
)
if err != nil {
return nil, err
}
defer rows.Close()
var tasks []*Task
for rows.Next() {
t := &Task{}
if err := rows.Scan(
&t.ID, &t.UserInput, &t.State,
&t.Result, &t.Error,
&t.CreatedAt, &t.UpdatedAt,
); err != nil {
return nil, err
}
tasks = append(tasks, t)
}
return tasks, rows.Err()
}
Transactions for Atomic Updates¶
When the agent executes a step, state must update atomically. If the step fails, the state must remain unchanged.
func (s *PgStateStore) ExecuteStep(ctx context.Context, taskID string, stepFn func(*Task) error) error {
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback()
// SELECT ... FOR UPDATE locks the row for the duration of the step.
// Another agent won't be able to modify this task concurrently.
task := &Task{}
err = tx.QueryRowContext(ctx,
`SELECT id, user_input, state, result, error, created_at, updated_at
FROM agent_tasks WHERE id = $1 FOR UPDATE`, taskID,
).Scan(
&task.ID, &task.UserInput, &task.State,
&task.Result, &task.Error,
&task.CreatedAt, &task.UpdatedAt,
)
if err != nil {
return fmt.Errorf("lock task: %w", err)
}
// Execute step business logic
if err := stepFn(task); err != nil {
return fmt.Errorf("step failed: %w", err)
}
// Save updated state inside the transaction
_, err = tx.ExecContext(ctx,
`UPDATE agent_tasks SET state=$1, result=$2, error=$3, updated_at=now() WHERE id=$4`,
task.State, task.Result, task.Error, task.ID,
)
if err != nil {
return fmt.Errorf("save state: %w", err)
}
return tx.Commit()
}
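The state update either commits in full or rolls back, so the database never holds a "half-done" state. The transaction covers only the database, though: side effects that stepFn performs outside it (API calls, deployments) are not undone by a rollback, which is why the step itself should be idempotent. The row lock is also held for as long as stepFn runs, so keep steps short.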
MCP for State¶
Model Context Protocol (MCP) lets you store and share agent state through standardized resources. For more on MCP, see Chapter 18: Tool Protocols and Tool Servers.
Why MCP for State?¶
An MCP server acts as a single source of truth for state. Any agent or tool accesses it by URI. This solves two problems:
- Shared access. Multiple agents read and update the same state.
- Standard protocol. No need to write a custom API for each store.
State Resource¶
Agent state is represented as an MCP resource with a URI of the form state://agents/{agent_id}/tasks/{task_id}.
One agent writes progress, another reads it and continues the work.
Example: Reading Shared State¶
// MCPStateResource represents agent state as an MCP resource.
type MCPStateResource struct {
URI string `json:"uri"`
AgentID string `json:"agent_id"`
TaskID string `json:"task_id"`
State TaskState `json:"state"`
Plan []string `json:"plan,omitempty"`
Artifacts []string `json:"artifacts,omitempty"`
}
// readSharedState reads another agent's state via MCP.
// Agent A wrote progress, agent B reads it and continues.
func readSharedState(
ctx context.Context,
mcpClient *mcp.Client,
agentID, taskID string,
) (*MCPStateResource, error) {
uri := fmt.Sprintf("state://agents/%s/tasks/%s", agentID, taskID)
resource, err := mcpClient.ReadResource(ctx, uri)
if err != nil {
return nil, fmt.Errorf("read MCP resource %s: %w", uri, err)
}
var state MCPStateResource
if err := json.Unmarshal(resource.Content, &state); err != nil {
return nil, fmt.Errorf("decode state: %w", err)
}
return &state, nil
}
This approach is useful in multi-agent systems, where multiple agents work on the same task.
Dynamic Context: Selecting Relevant State¶
Problem: Not Everything Fits in the Context¶
The agent accumulates artifacts: logs, command outputs, intermediate data. Over time their volume exceeds the LLM context window. Send everything and the model loses focus. Send nothing and the model can't make decisions.
The solution is to select only the relevant state for the current step.
Filtering by Relevance¶
The strategy is simple: first take data from the current step, then fill the remainder with the most recent facts.
// ContextSlice is a slice of state that fits inside the context window.
type ContextSlice struct {
Goal string `json:"goal"`
CurrentStep string `json:"current_step"`
Facts []Fact `json:"facts"`
Artifacts []Artifact `json:"artifacts"`
OpenQuestions []string `json:"open_questions"`
}
// filterRelevantState selects only what the current step needs
// from the full state.
// maxBytes caps the size to avoid overflowing the context window.
func filterRelevantState(state *AgentState, currentStep string, maxBytes int) *ContextSlice {
slice := &ContextSlice{
Goal: state.Goal,
CurrentStep: currentStep,
OpenQuestions: state.OpenQuestions,
}
usedBytes := 0
// Priority 1: artifacts for the current step
// (assumes each Artifact records the plan step that produced it in a Step field)
for _, a := range state.Artifacts {
if a.Step == currentStep && usedBytes+a.Bytes <= maxBytes {
slice.Artifacts = append(slice.Artifacts, a)
usedBytes += a.Bytes
}
}
// Priority 2: most recent facts (fresh data is more likely relevant)
for i := len(state.KnownFacts) - 1; i >= 0; i-- {
factSize := len(state.KnownFacts[i].Value)
if usedBytes+factSize > maxBytes {
break
}
slice.Facts = append(slice.Facts, state.KnownFacts[i])
usedBytes += factSize
}
return slice
}
When You Need This¶
Filtering becomes critical when the agent runs for more than 5-10 steps. For short tasks you can skip it. For more on context management, see Chapter 20: Cost & Latency Engineering.
Advanced Checkpoint Strategies¶
The basic Checkpoint implementation (structure, save/load, agent loop integration) is covered in Chapter 09: Agent Architecture. Here we look at advanced strategies for production.
When to Save: Checkpoint Granularity¶
A checkpoint is a snapshot of state you can return to. Save frequency is a trade-off between reliability and performance:
| Strategy | When to save | Pros | Cons |
|---|---|---|---|
| `every_step` | After every tool call | Minimal progress loss | Many DB writes |
| `every_iteration` | After every loop iteration | Balance of reliability and I/O | Lose intermediate steps |
| `on_state_change` | Only on state transition | Minimal I/O | Lose progress within a state |
CheckpointManager¶
type CheckpointStrategy string
const (
CheckpointEveryStep CheckpointStrategy = "every_step"
CheckpointEveryIteration CheckpointStrategy = "every_iteration"
CheckpointOnStateChange CheckpointStrategy = "on_state_change"
)
type CheckpointManager struct {
store StateStore
strategy CheckpointStrategy
maxAge time.Duration // Maximum checkpoint age
maxCount int // How many checkpoints to keep per task
}
// MaybeSave saves a checkpoint if the current trigger matches the strategy.
func (cm *CheckpointManager) MaybeSave(
ctx context.Context,
task *Task,
trigger CheckpointStrategy,
) error {
if trigger != cm.strategy {
return nil // Not our trigger — skip
}
return cm.store.Save(ctx, task)
}
Validation Before Resume¶
You can't blindly resume a task from a checkpoint. The checkpoint may be stale, and the state may be invalid.
// ValidateAndResume loads a checkpoint and verifies it's usable.
func (cm *CheckpointManager) ValidateAndResume(ctx context.Context, taskID string) (*Task, error) {
task, err := cm.store.Get(ctx, taskID)
if err != nil {
return nil, fmt.Errorf("load checkpoint: %w", err)
}
if task == nil {
return nil, fmt.Errorf("checkpoint not found: %s", taskID)
}
// Check 1: checkpoint is not expired
age := time.Since(task.UpdatedAt)
if age > cm.maxAge {
return nil, fmt.Errorf("checkpoint expired: age %v exceeds max %v", age, cm.maxAge)
}
// Check 2: state allows resumption
switch task.State {
case TaskCompleted:
return task, nil // Already done, no re-execution needed
case TaskPending, TaskRunning, TaskFailed:
return task, nil // Not finished yet, can start or resume
default:
return nil, fmt.Errorf("cannot resume from state: %s", task.State)
}
}
Checkpoint Rotation¶
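Checkpoints accumulate. Without cleanup they waste storage and complicate recovery. Rotation deletes expired checkpoints but always keeps the last N, even if those are older than the age limit.
// Cleanup removes expired checkpoints, keeping the last maxCount.
// It assumes checkpoints are stored in a separate agent_checkpoints table.
func (cm *CheckpointManager) Cleanup(ctx context.Context, taskID string) (int64, error) {
// Rotation needs raw SQL access, so it only works with the PostgreSQL store.
// The checked assertion avoids a panic when another StateStore is plugged in.
pg, ok := cm.store.(*PgStateStore)
if !ok {
return 0, fmt.Errorf("cleanup requires *PgStateStore, got %T", cm.store)
}
result, err := pg.db.ExecContext(ctx,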
`DELETE FROM agent_checkpoints
WHERE task_id = $1
AND created_at < $2
AND id NOT IN (
SELECT id FROM agent_checkpoints
WHERE task_id = $1
ORDER BY created_at DESC
LIMIT $3
)`,
taskID, time.Now().Add(-cm.maxAge), cm.maxCount,
)
if err != nil {
return 0, fmt.Errorf("cleanup checkpoints: %w", err)
}
return result.RowsAffected()
}
Good practice: run rotation after every successful save or on a schedule.
Common Errors¶
Error 1: No Idempotency¶
Symptom: Repeated call creates duplicates (e.g., creates two files instead of one).
Cause: Operations don't check if they were already executed.
Solution:
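// BAD: blindly repeats the side effect. The write error is ignored, and a
// retry overwrites whatever an earlier call (or another process) wrote.
func createFile(filename string) error {
	os.WriteFile(filename, []byte("data"), 0644)
	return nil
}
// GOOD: O_EXCL makes "create only if absent" a single atomic step, so there
// is no gap between the existence check and the write.
func createFileIfNotExists(filename string) error {
	f, err := os.OpenFile(filename, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0644)
	if errors.Is(err, os.ErrExist) {
		return nil // Already exists
	}
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write([]byte("data"))
	return err
}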
Error 2: No Retry on Errors¶
Symptom: Agent fails on first temporary error (network error, timeout).
Cause: No retries on errors.
Solution:
// BAD
result, err := executeTool(toolCall)
if err != nil {
return "", err // Immediately return error
}
// GOOD
var result string
err := executeWithRetry(func() error {
var err error
result, err = executeTool(toolCall) // assign to the outer result
return err
}, 3)
Error 3: No Deadlines¶
Symptom: Agent hangs forever, user waits.
Cause: No timeout for operations.
Solution:
// BAD
resp, _ := client.CreateChatCompletion(ctx, req)
// May hang forever
// GOOD
ctx, cancel := context.WithDeadline(ctx, time.Now().Add(5*time.Minute))
defer cancel()
resp, err := client.CreateChatCompletion(ctx, req)
Error 4: State Not Persisted¶
Symptom: After restart, agent starts from beginning, losing progress.
Cause: State stored only in memory.
Solution:
// BAD
var taskState = "running" // Only in memory
// GOOD
task.State = TaskRunning
saveTask(task) // Save to DB/file
Mini-Exercises¶
Exercise 1: Implement Retry with Backoff¶
Implement a retry execution function:
func executeWithRetry(fn func() error, maxRetries int) error {
// Your code here
// Retry call with exponential backoff
}
Expected result:
- Function retries call on error
- Uses exponential backoff (1s, 2s, 4s...)
- Function returns error after retries exhausted
Exercise 2: Implement Idempotency¶
Create a function that checks if task was already executed:
func executeTaskIfNotDone(taskID string) error {
// Your code here
// Check task state before execution
}
Expected result:
- If task already completed, function returns nil without execution
- If task not completed, function executes it and saves state
Completion Criteria / Checklist¶
Completed (production ready):
- Operation idempotency implemented (repeated call doesn't create duplicates)
- Retries with exponential backoff implemented
- Deadlines set for agent run and individual operations
- Task state persisted between restarts
- Can resume task execution after failure
Not completed:
- No idempotency
- No retry on errors
- No deadlines
- State not persisted
Connection with Other Chapters¶
- Chapter 04: Autonomy and Loops — Basic agent loop
- Chapter 10: Planning and Workflow Patterns — State Management guarantees reliable plan execution
- Chapter 19: Observability and Tracing — Logging task state
- Chapter 20: Cost & Latency Engineering — Cost control for long tasks
What's Next?¶
After understanding state management, proceed to:
- 12. Agent Memory Systems — Learn how agents remember and retrieve information