
20. Cost & Latency Engineering

Why This Chapter?

Your agent works, but the LLM API bill grows uncontrollably. One request costs $5, another costs $0.10. Why? Without cost control and latency optimization, you can't:

  • Predict monthly budget
  • Optimize expensive requests
  • Guarantee response time to user

Cost and latency engineering gives you control over budget and performance. Without it, you risk spending thousands of dollars on simple requests, or shipping an agent so slow that users abandon it.

Real-World Case Study

Situation: A DevOps agent has been running in production for a week. The LLM API bill comes to $5,000 per month instead of the expected $500.

Problem: The agent uses GPT-4 for every request, even simple status checks. There are no token limits, no caching, and no fallback to cheaper models. A single "check server status" request consumes 50,000 tokens because of the large context.

Solution: Token budgets, iteration limits, caching, and model routing by task complexity. Simple requests now use GPT-3.5 ($0.002 per 1K tokens), complex ones use GPT-4 ($0.03 per 1K tokens). The bill drops to $600 per month.

Theory in Simple Terms

What Is Cost Engineering?

Cost Engineering is the control and optimization of LLM API spend. The main levers (a rough cost-estimate sketch follows the list):

  1. Model selection — GPT-4 is 15x more expensive than GPT-3.5
  2. Token count — larger context = more expensive
  3. Request count — each LLM call costs money
  4. Caching — identical requests can be skipped
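
To put rough numbers on these levers, here is a minimal cost-estimate sketch. The pricePer1KTokens map and estimateCost helper are illustrative names (not from the labs), and the per-1K prices are the ones quoted in the case study above; actual pricing depends on your provider and model version.

import "github.com/sashabaranov/go-openai"

// Illustrative per-1K-token prices in USD, taken from the case study above.
var pricePer1KTokens = map[string]float64{
    openai.GPT4:          0.03,
    openai.GPT3Dot5Turbo: 0.002,
}

// estimateCost approximates the cost of a single request in USD.
func estimateCost(model string, totalTokens int) float64 {
    return pricePer1KTokens[model] * float64(totalTokens) / 1000.0
}

For the 50,000-token "check server status" request from the case study, estimateCost(openai.GPT4, 50000) gives about $1.50, while the same request on GPT-3.5 costs roughly $0.10, which is the 15x difference behind the first lever.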

What Is Latency Engineering?

Latency Engineering is the control of response time. The main factors (a simple measurement sketch follows the list):

  1. Model — GPT-4 is slower than GPT-3.5
  2. Context size — more tokens = more processing time
  3. Iteration count — more ReAct cycles = more time
  4. Timeouts — protection against hangs
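
Before optimizing latency, measure it. Below is a minimal sketch of a timing wrapper around a single LLM call, assuming the same go-openai client used throughout this chapter; timedChatCompletion is a hypothetical helper name, not part of the library.

import (
    "context"
    "log"
    "time"

    "github.com/sashabaranov/go-openai"
)

// timedChatCompletion wraps one LLM call and logs its duration and token usage,
// so latency and cost can be compared per model.
func timedChatCompletion(ctx context.Context, client *openai.Client, req openai.ChatCompletionRequest) (openai.ChatCompletionResponse, error) {
    start := time.Now()
    resp, err := client.CreateChatCompletion(ctx, req)
    elapsed := time.Since(start)

    if err != nil {
        log.Printf("model=%s latency=%s error=%v", req.Model, elapsed, err)
        return resp, err
    }
    log.Printf("model=%s latency=%s tokens=%d", req.Model, elapsed, resp.Usage.TotalTokens)
    return resp, nil
}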

How It Works (Step by Step)

Step 1: Token Budgets

Set a token limit per request:

const (
    MaxTokensPerRequest = 10000
    MaxIterations       = 10
)

type TokenBudget struct {
    Used  int
    Limit int
}

func checkTokenBudget(budget TokenBudget) error {
    if budget.Used > budget.Limit {
        return fmt.Errorf("token budget exceeded: %d > %d", budget.Used, budget.Limit)
    }
    return nil
}

Step 2: Iteration Limits

Limit the number of ReAct loop iterations (see labs/lab04-autonomy/main.go):

func runAgent(ctx context.Context, client *openai.Client, userInput string) (string, error) {
    maxIterations := 10
    tokenBudget := TokenBudget{Limit: MaxTokensPerRequest}

    for i := 0; i < maxIterations; i++ {
        // req is the chat completion request built from the conversation so far
        resp, err := client.CreateChatCompletion(ctx, req)
        if err != nil {
            return "", err
        }

        tokenBudget.Used += resp.Usage.TotalTokens

        // Check budget
        if err := checkTokenBudget(tokenBudget); err != nil {
            return "", fmt.Errorf("stopping: %v", err)
        }

        // ... rest of code ...
    }

    return "", fmt.Errorf("max iterations (%d) exceeded", maxIterations)
}

Step 3: LLM Result Caching

Cache results for identical requests:

import (
    "crypto/sha256"
    "encoding/hex"
    "sync"
    "time"

    "github.com/sashabaranov/go-openai"
)

type CacheEntry struct {
    Result    string
    ExpiresAt time.Time
}

var cache sync.Map // map[string]*CacheEntry

func getCacheKey(messages []openai.ChatCompletionMessage) string {
    // Build the key from the message contents; the newline separator keeps
    // adjacent messages from blending into the same hash input.
    data := ""
    for _, msg := range messages {
        data += msg.Role + ":" + msg.Content + "\n"
    }
    hash := sha256.Sum256([]byte(data))
    return hex.EncodeToString(hash[:])
}

func getCachedResult(key string) (string, bool) {
    entry, ok := cache.Load(key)
    if !ok {
        return "", false
    }
    e := entry.(*CacheEntry)
    if time.Now().After(e.ExpiresAt) {
        cache.Delete(key)
        return "", false
    }
    return e.Result, true
}

func setCachedResult(key string, result string, ttl time.Duration) {
    cache.Store(key, &CacheEntry{
        Result:    result,
        ExpiresAt: time.Now().Add(ttl),
    })
}
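
One way to wire this cache into the call path is sketched below. cachedChatCompletion is a hypothetical helper that reuses getCacheKey, getCachedResult, and setCachedResult from above (plus the context and go-openai imports); it only caches plain text answers, because responses that contain tool calls cannot be stored as a single string.

// cachedChatCompletion returns a cached answer for an identical conversation,
// falling back to a real LLM call on a cache miss. Best suited to simple,
// single-shot requests (status summaries, lookups) rather than the tool-calling loop.
func cachedChatCompletion(ctx context.Context, client *openai.Client, req openai.ChatCompletionRequest, ttl time.Duration) (string, error) {
    key := getCacheKey(req.Messages)
    if result, ok := getCachedResult(key); ok {
        return result, nil // cache hit: no tokens spent, near-zero latency
    }

    resp, err := client.CreateChatCompletion(ctx, req)
    if err != nil {
        return "", err
    }

    msg := resp.Choices[0].Message
    if len(msg.ToolCalls) == 0 {
        setCachedResult(key, msg.Content, ttl)
    }
    return msg.Content, nil
}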

Step 4: Model Routing by Complexity

Use cheaper models for simple tasks (see labs/lab09-context-optimization/main.go):

func selectModel(taskComplexity string) string {
    switch taskComplexity {
    case "simple":
        return openai.GPT3Dot5Turbo // Cheaper and faster
    case "complex":
        return openai.GPT4 // Better quality, but more expensive
    default:
        return openai.GPT3Dot5Turbo
    }
}

func assessTaskComplexity(userInput string) string {
    // Simple tasks: status check, read logs
    simpleKeywords := []string{"check", "status", "read", "get", "list"}
    for _, keyword := range simpleKeywords {
        if strings.Contains(strings.ToLower(userInput), keyword) {
            return "simple"
        }
    }
    // Complex tasks: analysis, planning, problem solving
    return "complex"
}

Step 5: Fallback Models

Implement a fallback chain for errors or an exceeded budget:

func createChatCompletionWithFallback(ctx context.Context, client *openai.Client, req openai.ChatCompletionRequest) (openai.ChatCompletionResponse, error) {
    models := []string{openai.GPT4, openai.GPT3Dot5Turbo}

    var lastErr error
    for _, model := range models {
        req.Model = model
        resp, err := client.CreateChatCompletion(ctx, req)
        if err == nil {
            return resp, nil
        }
        lastErr = err // remember the error and try the next model
    }

    return openai.ChatCompletionResponse{}, fmt.Errorf("all models failed: %w", lastErr)
}
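
The step above also mentions falling back when the budget is exceeded, not only on errors. Below is a minimal sketch of that variant, assuming the TokenBudget type from Step 1; createChatCompletionWithBudget and the 80% threshold are illustrative choices, not part of the labs.

// createChatCompletionWithBudget routes requests to the cheaper model once most
// of the token budget is spent, instead of failing outright.
func createChatCompletionWithBudget(ctx context.Context, client *openai.Client, req openai.ChatCompletionRequest, budget *TokenBudget) (openai.ChatCompletionResponse, error) {
    // Past 80% of the budget, switch to the cheaper model to stretch what is left.
    if float64(budget.Used) > 0.8*float64(budget.Limit) {
        req.Model = openai.GPT3Dot5Turbo
    }

    resp, err := client.CreateChatCompletion(ctx, req)
    if err != nil {
        return openai.ChatCompletionResponse{}, err
    }

    budget.Used += resp.Usage.TotalTokens
    return resp, nil
}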

Step 6: Timeouts

Set a timeout for the entire agent run and for each LLM call (the same pattern applies to tool calls):

import "context"
import "time"

func runAgentWithTimeout(ctx context.Context, client *openai.Client, userInput string) (string, error) {
    // Timeout for entire agent run (5 minutes)
    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
    defer cancel()

    // ... agent loop ...

    for i := 0; i < maxIterations; i++ {
        // Timeout for each LLM call (30 seconds)
        callCtx, callCancel := context.WithTimeout(ctx, 30*time.Second)
        resp, err := client.CreateChatCompletion(callCtx, req)
        callCancel()

        if err != nil {
            if err == context.DeadlineExceeded {
                return "", fmt.Errorf("LLM call timeout")
            }
            return "", err
        }

        // ... rest of code ...
    }
}

Where to Integrate This in Our Code

Integration Point 1: Agent Loop

In labs/lab04-autonomy/main.go, add a budget check and an iteration limit:

const MaxIterations = 10
const MaxTokensPerRequest = 10000

// In loop:
for i := 0; i < MaxIterations; i++ {
    resp, err := client.CreateChatCompletion(ctx, req)
    if err != nil {
        return "", err
    }
    // Check tokens
    if resp.Usage.TotalTokens > MaxTokensPerRequest {
        return "", fmt.Errorf("token limit exceeded")
    }
    // ... rest of code ...
}

Integration Point 2: Context Optimization

Token counting already exists in labs/lab09-context-optimization/main.go. Add a budget check:

func adaptiveContextManagement(ctx context.Context, client *openai.Client, messages []openai.ChatCompletionMessage, maxTokens int) []openai.ChatCompletionMessage {
    usedTokens := countTokensInMessages(messages)

    // If budget exceeded, apply aggressive optimization
    if usedTokens > MaxTokensPerRequest {
        return compressOldMessages(ctx, client, messages, maxTokens)
    }

    // ... rest of logic ...
    return messages
}

Mini Code Example

A complete example with cost control, based on labs/lab04-autonomy/main.go:

package main

import (
    "context"
    "errors"
    "fmt"
    "os"
    "time"

    "github.com/sashabaranov/go-openai"
)

const (
    MaxTokensPerRequest = 10000
    MaxIterations       = 10
    LLMTimeout          = 30 * time.Second
    AgentTimeout        = 5 * time.Minute
)

type TokenBudget struct {
    Used  int
    Limit int
}

func checkTokenBudget(budget TokenBudget) error {
    if budget.Used > budget.Limit {
        return fmt.Errorf("token budget exceeded: %d > %d", budget.Used, budget.Limit)
    }
    return nil
}

func selectModel(taskComplexity string) string {
    if taskComplexity == "simple" {
        return openai.GPT3Dot5Turbo
    }
    return openai.GPT4
}

func checkDisk() string { return "Disk Usage: 95% (CRITICAL). Large folder: /var/log" }
func cleanLogs() string { return "Logs cleaned. Freed 20GB." }

func main() {
    token := os.Getenv("OPENAI_API_KEY")
    baseURL := os.Getenv("OPENAI_BASE_URL")
    if token == "" {
        token = "dummy"
    }

    config := openai.DefaultConfig(token)
    if baseURL != "" {
        config.BaseURL = baseURL
    }
    client := openai.NewClientWithConfig(config)

    ctx, cancel := context.WithTimeout(context.Background(), AgentTimeout)
    defer cancel()

    userInput := "I'm out of disk space. Fix it."

    // Select a model by task complexity. Hardcoded to "simple" here for brevity;
    // a real agent would derive this from the user input (see assessTaskComplexity in Step 4).
    model := selectModel("simple")

    tokenBudget := TokenBudget{Limit: MaxTokensPerRequest}

    tools := []openai.Tool{
        {
            Type: openai.ToolTypeFunction,
            Function: &openai.FunctionDefinition{
                Name:        "check_disk",
                Description: "Check current disk usage",
            },
        },
        {
            Type: openai.ToolTypeFunction,
            Function: &openai.FunctionDefinition{
                Name:        "clean_logs",
                Description: "Delete old logs to free space",
            },
        },
    }

    messages := []openai.ChatCompletionMessage{
        {Role: openai.ChatMessageRoleSystem, Content: "You are an autonomous DevOps agent."},
        {Role: openai.ChatMessageRoleUser, Content: userInput},
    }

    fmt.Println("Starting Agent Loop...")

    for i := 0; i < MaxIterations; i++ {
        // Check budget before request
        if err := checkTokenBudget(tokenBudget); err != nil {
            fmt.Printf("Budget exceeded: %v\n", err)
            break
        }

        req := openai.ChatCompletionRequest{
            Model:    model,
            Messages: messages,
            Tools:    tools,
        }

        // Timeout for each LLM call
        callCtx, callCancel := context.WithTimeout(ctx, LLMTimeout)
        resp, err := client.CreateChatCompletion(callCtx, req)
        callCancel()

        if err != nil {
            // The client wraps the context error, so compare with errors.Is
            if errors.Is(err, context.DeadlineExceeded) {
                fmt.Printf("LLM call timeout\n")
                break
            }
            panic(fmt.Sprintf("API Error: %v", err))
        }

        msg := resp.Choices[0].Message
        messages = append(messages, msg)

        // Update budget
        tokenBudget.Used += resp.Usage.TotalTokens
        fmt.Printf("Tokens used: %d / %d\n", tokenBudget.Used, tokenBudget.Limit)

        if len(msg.ToolCalls) == 0 {
            fmt.Println("AI:", msg.Content)
            break
        }

        for _, toolCall := range msg.ToolCalls {
            fmt.Printf("Executing tool: %s\n", toolCall.Function.Name)

            var result string
            if toolCall.Function.Name == "check_disk" {
                result = checkDisk()
            } else if toolCall.Function.Name == "clean_logs" {
                result = cleanLogs()
            }

            fmt.Println("Tool Output:", result)

            messages = append(messages, openai.ChatCompletionMessage{
                Role:       openai.ChatMessageRoleTool,
                Content:    result,
                ToolCallID: toolCall.ID,
            })
        }
    }
}

Common Errors

Error 1: No Token Limits

Symptom: The LLM API bill grows uncontrollably. A single request can use 100,000 tokens.

Cause: Token usage is not checked before sending a request.

Solution:

// BAD
resp, _ := client.CreateChatCompletion(ctx, req)
// No token check

// GOOD
tokenBudget.Used += resp.Usage.TotalTokens
if err := checkTokenBudget(tokenBudget); err != nil {
    return "", err
}

Error 2: No Iteration Limit

Symptom: The agent loops, running 50+ iterations and spending thousands of tokens.

Cause: No limit on ReAct loop iterations.

Solution:

// BAD
for {
    // Infinite loop
}

// GOOD
for i := 0; i < MaxIterations; i++ {
    // Limited iterations
}

Error 3: Using GPT-4 for All Tasks

Symptom: Simple requests cost 15x more than needed.

Cause: Always using the most expensive model.

Solution:

// BAD
req.Model = openai.GPT4 // Always GPT-4

// GOOD
model := selectModel(assessTaskComplexity(userInput))
req.Model = model

Error 4: No Caching

Symptom: Identical requests execute repeatedly, wasting tokens.

Cause: No cache for LLM results.

Solution:

// BAD
resp, _ := client.CreateChatCompletion(ctx, req)
// New request every time

// GOOD
key := getCacheKey(messages)
if result, ok := getCachedResult(key); ok {
    return result, nil
}
resp, _ := client.CreateChatCompletion(ctx, req)
setCachedResult(key, resp.Choices[0].Message.Content, 1*time.Hour)

Error 5: No Timeouts

Symptom: The agent hangs for 10+ minutes while the user waits.

Cause: No timeout for LLM calls or tools.

Solution:

// BAD
resp, _ := client.CreateChatCompletion(ctx, req)
// May hang forever

// GOOD
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
resp, err := client.CreateChatCompletion(ctx, req)
if errors.Is(err, context.DeadlineExceeded) {
    return "", fmt.Errorf("timeout")
}

Mini-Exercises

Exercise 1: Implement Token Budget Check

Add a budget check to labs/lab04-autonomy/main.go:

func checkTokenBudget(used int, limit int) error {
    // Your code here
    // Return error if limit exceeded
}

Expected result:

  • The function returns an error if the tokens used exceed the limit
  • The function returns nil if the limit is not exceeded

Exercise 2: Implement Model Routing

Create a function that selects a model by task complexity:

func selectModelByComplexity(userInput string) string {
    // Your code here
    // Return GPT-3.5 for simple tasks, GPT-4 for complex
}

Expected result:

  • Simple tasks (check, status, read) → GPT-3.5
  • Complex tasks (analyze, fix, plan) → GPT-4

Exercise 3: Implement Caching

Add simple in-memory caching for LLM results:

var cache = make(map[string]CacheEntry)

func getCachedResult(key string) (string, bool) {
    // Your code here
}

func setCachedResult(key string, result string, ttl time.Duration) {
    // Your code here
}

Expected result:

  • Identical requests returned from cache
  • Cache has TTL (time to live)

Completion Criteria / Checklist

Completed (production ready):

  • Token budgets implemented with check before each request
  • ReAct loop iteration limit set
  • Model routing by task complexity implemented
  • LLM result caching implemented
  • Timeouts set for LLM calls and agent run
  • Token usage tracked, with a warning when the limit is exceeded

Not completed:

  • No token limits
  • No iteration limit
  • Always using most expensive model
  • No caching
  • No timeouts

Connection with Other Chapters


Navigation: ← Observability and Tracing | Table of Contents | Workflow and State Management in Production →