13. Context Engineering¶

Why This Chapter?¶

Context windows are limited. As conversations grow, you need to decide what stays in context and what gets summarized or dropped. Poor context management wastes tokens, loses important details, and confuses the agent.

This chapter covers practical context management techniques: layers, summarization, fact selection, and adaptive context building.

Real-World Case Study¶

Situation: A long-running conversation with an agent. After 50 turns, the context holds 50K tokens. A new request needs specific information, but it is buried in the history.

Problem:

Include full history: Exceeds context limit, expensive
Include only recent turns: Loses important context from early turns
No strategy: Agent gets confused or misses critical information

Solution: Context engineering uses layers (working memory, summaries, facts), selective retrieval, and summarization of old turns while preserving key facts.

Theory in Simple Terms¶

Context Layers¶

Working memory (recent turns):

The last N conversation turns
Always included
Most relevant for the current task

Summary layer:

Summarized old conversations
Preserves key facts
Reduces token usage

Facts layer:

Extracted important facts from long-term memory
User preferences, decisions, constraints
Persistent between conversations
Note: Storing and retrieving facts is covered in Memory; this chapter only covers how facts are used in context.

Important: Context as an Anchor (Anchoring Bias). If user preferences ("user thinks X", "we need answer Y") or unverified hypotheses enter the facts layer, they become a strong anchor for the model. The model may shift answers toward these preferences, even if actual data points elsewhere.

Problem: Preferences and hypotheses included in context as facts can distort objective analysis.

Solution: Separate entry types: Fact (verified data), Preference (what the user wants or likes), Hypothesis (unverified assumptions). Include preferences and hypotheses in context only when appropriate (personalization), and exclude them for analytical tasks requiring objectivity.

Task state:

Current task progress
What's done, what's pending
Allows resumption
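
The chapter does not show code for the task-state layer, so here is a minimal sketch of what it could look like. The `TaskState` type, its fields, and the `Render` and `Next` methods are assumptions made for illustration, not part of the original design. The rendered text is meant to be placed in a system message next to the facts and the summary.

```go
package main

import (
    "fmt"
    "strings"
)

// TaskState is a hypothetical shape for the task-state layer:
// what the task is, which steps are finished, which are still open.
type TaskState struct {
    Goal    string
    Done    []string
    Pending []string
}

// Render formats the state as plain text suitable for a system message.
func (s TaskState) Render() string {
    var b strings.Builder
    fmt.Fprintf(&b, "Current task: %s\n", s.Goal)
    b.WriteString("Done:\n")
    for _, step := range s.Done {
        fmt.Fprintf(&b, "- [x] %s\n", step)
    }
    b.WriteString("Pending:\n")
    for _, step := range s.Pending {
        fmt.Fprintf(&b, "- [ ] %s\n", step)
    }
    return b.String()
}

// Next returns the first pending step, or "" when nothing is left.
// This is what lets an agent resume after an interruption.
func (s TaskState) Next() string {
    if len(s.Pending) == 0 {
        return ""
    }
    return s.Pending[0]
}
```

Keeping the state as structured data and rendering it only at context-assembly time means the same state can be persisted, resumed, and re-rendered under a different token budget.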

Context Operations¶

Select — Choose what to include
Summarize — Compress old information
Extract — Extract key facts
Layer — Organize by importance/freshness

How It Works (Step by Step)¶

Step 1: Context Manager Interface¶

type ContextManager interface {
    AddMessage(msg openai.ChatCompletionMessage) error
    GetContext(maxTokens int) ([]openai.ChatCompletionMessage, error)
    Summarize() error
    ExtractFacts() ([]Fact, error)
}

type Fact struct {
    Key        string
    Value      string
    Source     string // Which conversation
    Importance int    // 1-10
    Type       string // "fact", "preference", "hypothesis", "constraint"
}

Step 2: Layered Context¶

type LayeredContext struct {
    workingMemory []openai.ChatCompletionMessage // Recent turns
    summary       string                          // Summarized history
    facts         []Fact                          // Extracted facts
    maxWorking    int                             // Max turns in working memory
}

func (c *LayeredContext) GetContext(maxTokens int) ([]openai.ChatCompletionMessage, error) {
    var messages []openai.ChatCompletionMessage

    // Add system prompt with facts
    if len(c.facts) > 0 {
        factsContext := "Important facts:\n"
        for _, fact := range c.facts {
            factsContext += fmt.Sprintf("- %s: %s\n", fact.Key, fact.Value)
        }
        messages = append(messages, openai.ChatCompletionMessage{
            Role:    "system",
            Content: factsContext,
        })
    }

    // Add summary if exists
    if c.summary != "" {
        messages = append(messages, openai.ChatCompletionMessage{
            Role:    "system",
            Content: "Previous conversation summary: " + c.summary,
        })
    }

    // Add working memory (recent turns)
    messages = append(messages, c.workingMemory...)

    // Truncate if exceeds maxTokens
    return truncateToTokenLimit(messages, maxTokens), nil
}

Step 3: Summarization¶

func (c *LayeredContext) Summarize(ctx context.Context, client *openai.Client) error {
    if len(c.workingMemory) <= c.maxWorking {
        return nil // Not needed yet
    }

    // Get old messages for summarization
    oldMessages := c.workingMemory[:len(c.workingMemory)-c.maxWorking]

    // Create summarization prompt; carry the existing summary forward
    // so earlier history is not lost when c.summary is overwritten
    prompt := "Summarize this conversation, preserving key facts and decisions:\n\n"
    if c.summary != "" {
        prompt += "Summary of earlier turns: " + c.summary + "\n\n"
    }
    for _, msg := range oldMessages {
        prompt += fmt.Sprintf("%s: %s\n", msg.Role, msg.Content)
    }

    resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
        Model: "gpt-4o-mini",
        Messages: []openai.ChatCompletionMessage{
            {Role: "system", Content: "You are a summarization agent. Extract key facts and decisions."},
            {Role: "user", Content: prompt},
        },
        Temperature: 0,
    })
    if err != nil {
        return err
    }
    if len(resp.Choices) == 0 {
        return fmt.Errorf("summarization returned no choices")
    }

    c.summary = resp.Choices[0].Message.Content

    // Keep only recent messages in working memory
    c.workingMemory = c.workingMemory[len(c.workingMemory)-c.maxWorking:]

    return nil
}

Step 4: Using Facts from Memory¶

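IMPORTANT: Fact extraction and storage happen in Memory. Here we only use already-extracted facts when assembling context. The GetContext below is a revised version of the one from Step 2, not a second method on the same type.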

func (c *LayeredContext) GetContext(maxTokens int, memory Memory, includePreferences bool) ([]openai.ChatCompletionMessage, error) {
    var messages []openai.ChatCompletionMessage

    // Get facts from memory (don't extract here!)
    facts, err := memory.Retrieve("user_preferences", 10)
    if err != nil {
        return nil, fmt.Errorf("retrieve facts: %w", err)
    }

    // Filter facts by type depending on task
    var filteredFacts []Fact
    for _, fact := range facts {
        if fact.Type == "fact" || fact.Type == "constraint" {
            // Always include verified facts and constraints
            filteredFacts = append(filteredFacts, fact)
        } else if includePreferences && (fact.Type == "preference" || fact.Type == "hypothesis") {
            // Include preferences and hypotheses only when appropriate (personalization)
            filteredFacts = append(filteredFacts, fact)
        }
        // Otherwise exclude preferences/hypotheses for objective analysis
    }

    // Add system prompt with facts
    if len(filteredFacts) > 0 {
        factsContext := "Important facts:\n"
        for _, fact := range filteredFacts {
            // Mark type for clarity
            prefix := ""
            if fact.Type == "preference" {
                prefix = "[User preference] "
            } else if fact.Type == "hypothesis" {
                prefix = "[Hypothesis] "
            }
            factsContext += fmt.Sprintf("- %s%s: %v\n", prefix, fact.Key, fact.Value)
        }
        messages = append(messages, openai.ChatCompletionMessage{
            Role:    "system",
            Content: factsContext,
        })
    }

    // Add summary if exists
    if c.summary != "" {
        messages = append(messages, openai.ChatCompletionMessage{
            Role:    "system",
            Content: "Previous conversation summary: " + c.summary,
        })
    }

    // Add working memory (recent turns)
    messages = append(messages, c.workingMemory...)

    // Truncate if exceeds maxTokens
    return truncateToTokenLimit(messages, maxTokens), nil
}

Token Counting and truncateToTokenLimit¶

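In the previous examples we called truncateToTokenLimit but never implemented it. This section covers token counting and context truncation. Note that the implementation below takes a TokenCounter as a third argument, so the earlier two-argument calls need a counter passed in.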

Why Count Tokens?¶

Every model has a hard context window limit. Exceed it and the request fails with an error. Stay far below it and you leave out context that could have improved the answer. Precise token counting lets you fill the window deliberately instead of guessing.

Simple Counting: Words vs Tokens¶

Precise counting requires the model's tokenizer (e.g., tiktoken for OpenAI). For quick estimates, an approximation works: 1 token ≈ 0.75 words for English text, closer to 0.5 words for Russian (Cyrillic encodes less efficiently).

// TokenCounter — token counting interface.
// Swap implementations: approximate for tests, precise for production.
type TokenCounter interface {
    Count(text string) int
}

// WordBasedCounter — approximate word-based counting.
// Good for quick estimates without external dependencies.
type WordBasedCounter struct {
    TokensPerWord float64 // English ≈ 1.33, Russian ≈ 2.0
}

func (c *WordBasedCounter) Count(text string) int {
    words := len(strings.Fields(text))
    return int(float64(words) * c.TokensPerWord)
}

// TiktokenCounter — precise counting via tiktoken.
// Use in production for accurate budgeting.
type TiktokenCounter struct {
    encoding *tiktoken.Tiktoken // assumes github.com/pkoukk/tiktoken-go, where EncodingForModel returns *Tiktoken
}

func NewTiktokenCounter(model string) (*TiktokenCounter, error) {
    enc, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return nil, fmt.Errorf("encoding for model %s: %w", model, err)
    }
    return &TiktokenCounter{encoding: enc}, nil
}

func (c *TiktokenCounter) Count(text string) int {
    return len(c.encoding.Encode(text, nil, nil))
}
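
To see how the word-based approximation behaves, the sketch below repeats `WordBasedCounter` and adds a hypothetical `CountUp` method (not in the original) that rounds up instead of truncating. The sample sentence and the choice to round up are illustrative assumptions.

```go
package main

import (
    "math"
    "strings"
)

// WordBasedCounter mirrors the approximate counter from this section.
type WordBasedCounter struct {
    TokensPerWord float64 // English ≈ 1.33, Russian ≈ 2.0
}

// Count truncates the estimate, so it can undercount by almost one token.
func (c *WordBasedCounter) Count(text string) int {
    words := len(strings.Fields(text))
    return int(float64(words) * c.TokensPerWord)
}

// CountUp is a hypothetical variant that rounds up.
// For budgeting, overestimating is the safer direction.
func (c *WordBasedCounter) CountUp(text string) int {
    words := len(strings.Fields(text))
    return int(math.Ceil(float64(words) * c.TokensPerWord))
}
```

For the six-word sentence "The quick brown fox jumps over" with `TokensPerWord: 1.33`, the raw estimate is 7.98, so `Count` returns 7 and `CountUp` returns 8. Summed over hundreds of messages, the truncation error adds up, which is one reason to prefer the tokenizer-based counter in production.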

Model Limits¶

Context limits depend on the model. Keep them in one table (ideally loaded from configuration) rather than scattered through the code as magic numbers:

// ModelLimits stores limits for a specific model.
var ModelLimits = map[string]int{
    "gpt-4o":      128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-3.5-turbo": 16_385,
    "claude-3-5-sonnet": 200_000,
}

// SafeLimit returns the input budget: the model's limit minus the room
// reserved for generation (maxOutputTokens).
func SafeLimit(model string, maxOutputTokens int) int {
    limit, ok := ModelLimits[model]
    if !ok {
        limit = 4096 // Conservative default for unknown models
    }
    if maxOutputTokens >= limit {
        return 0 // No room left for input
    }
    return limit - maxOutputTokens
}

Implementing truncateToTokenLimit¶

Drop the oldest non-system messages first, and always keep system messages and the user's last request:

func truncateToTokenLimit(
    messages []openai.ChatCompletionMessage,
    maxTokens int,
    counter TokenCounter,
) []openai.ChatCompletionMessage {
    total := countMessages(messages, counter)
    if total <= maxTokens {
        return messages
    }

    // Split: system messages, middle, and the final non-system message
    var system, middle []openai.ChatCompletionMessage
    var last *openai.ChatCompletionMessage // stays nil if the final message is a system message

    for i := range messages {
        switch {
        case messages[i].Role == "system":
            system = append(system, messages[i])
        case i == len(messages)-1:
            last = &messages[i]
        default:
            middle = append(middle, messages[i])
        }
    }

    // Count fixed parts (system + last request)
    reserved := countMessages(system, counter)
    if last != nil {
        reserved += counter.Count(last.Content) + 4 // +4 for metadata
    }

    // Walk the middle from newest to oldest; everything before start is dropped
    budget := maxTokens - reserved
    start := len(middle)
    runningTotal := 0

    for i := len(middle) - 1; i >= 0; i-- {
        msgTokens := counter.Count(middle[i].Content) + 4
        if runningTotal+msgTokens > budget {
            break
        }
        runningTotal += msgTokens
        start = i
    }

    result := append(system, middle[start:]...)
    if last != nil {
        result = append(result, *last)
    }
    return result
}

func countMessages(messages []openai.ChatCompletionMessage, counter TokenCounter) int {
    total := 0
    for _, msg := range messages {
        total += counter.Count(msg.Content) + 4 // +4 tokens for role and delimiters
    }
    return total
}

Why +4? Each message in the API is encoded with metadata: role, start and end delimiters. For OpenAI this is roughly 4 tokens per message.

Advanced Compression Strategies¶

Basic LLM summarization is just one way to compress context. Let's look at more precise approaches.

Semantic Compression¶

The idea: keep the meaning, drop the filler. Instead of retelling the entire conversation — extract only what affects future decisions.

Key-Value Extraction¶

The idea: turn a long narrative into structured key-value pairs. More compact than a summary, easier for the model to use.

Implementation¶

// CompressionStrategy defines the compression method.
type CompressionStrategy string

const (
    StrategySummarize CompressionStrategy = "summarize" // Standard summarization
    StrategySemantic  CompressionStrategy = "semantic"   // Semantic compression
    StrategyKeyValue  CompressionStrategy = "keyvalue"   // Key-Value extraction
)

// compressContext compresses messages using the chosen strategy.
func compressContext(
    ctx context.Context,
    client *openai.Client,
    messages []openai.ChatCompletionMessage,
    strategy CompressionStrategy,
) (string, error) {
    conversation := formatMessages(messages)

    prompts := map[CompressionStrategy]string{
        StrategySummarize: "Summarize this conversation. Preserve key facts and decisions:\n\n" + conversation,

        StrategySemantic: `Compress this conversation to the minimum.
Rules:
- Keep ONLY facts, decisions, and open questions
- Remove greetings, thanks, repetitions
- Remove reasoning if a final decision exists
- Format: one statement per line

Conversation:
` + conversation,

        StrategyKeyValue: `Extract key facts from the conversation in "key: value" format.
Key categories:
- decision: a decision made
- constraint: a constraint or requirement
- action: an action taken
- open: an unresolved question

Example:
decision:database: Using PostgreSQL
constraint:budget: No more than $100/month

Conversation:
` + conversation,
    }

    prompt, ok := prompts[strategy]
    if !ok {
        return "", fmt.Errorf("unknown strategy: %s", strategy)
    }

    resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
        Model: "gpt-4o-mini",
        Messages: []openai.ChatCompletionMessage{
            {Role: "system", Content: "You compress context. Be as brief as possible."},
            {Role: "user", Content: prompt},
        },
        Temperature: 0,
    })
    if err != nil {
        return "", err
    }
    if len(resp.Choices) == 0 {
        return "", fmt.Errorf("compression returned no choices")
    }
    return resp.Choices[0].Message.Content, nil
}

func formatMessages(messages []openai.ChatCompletionMessage) string {
    var b strings.Builder
    for _, msg := range messages {
        fmt.Fprintf(&b, "[%s]: %s\n", msg.Role, msg.Content)
    }
    return b.String()
}
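
The `keyvalue` strategy returns plain text, so something has to turn it back into structured data before it can be stored. A minimal sketch of such a parser follows. The `KVEntry` type and `parseKeyValueLines` function are hypothetical names, and the sketch assumes the model follows the `category:key: value` shape from the prompt's example. Lines that do not match are skipped rather than treated as errors, since model output is not guaranteed to be well-formed.

```go
package main

import "strings"

// KVEntry is a hypothetical parsed form of one "category:key: value" line.
type KVEntry struct {
    Category string
    Key      string
    Value    string
}

// parseKeyValueLines parses model output line by line.
// The first ": " separates the head from the value; the first ":" inside
// the head separates the category from the key. Malformed lines are skipped.
func parseKeyValueLines(output string) []KVEntry {
    var entries []KVEntry
    for _, line := range strings.Split(output, "\n") {
        line = strings.TrimSpace(line)
        head, value, ok := strings.Cut(line, ": ")
        if !ok || head == "" || value == "" {
            continue
        }
        category, key, hasCategory := strings.Cut(head, ":")
        if !hasCategory {
            // No category prefix: treat the whole head as the key.
            category, key = "", head
        }
        entries = append(entries, KVEntry{
            Category: category,
            Key:      key,
            Value:    strings.TrimSpace(value),
        })
    }
    return entries
}
```

Parsed entries map naturally onto the `Fact` struct from Step 1: for example, a `constraint` category could become a `Fact` with `Type: "constraint"`.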

Choosing the Right Strategy¶

Strategy	Compression Ratio	Information Loss	When to Use
`summarize`	Medium (~3x)	Low	Need context to continue dialog
`semantic`	High (~5-10x)	Medium	Long discussions, need the gist
`keyvalue`	Very high (~10-20x)	High (facts only)	Long-term storage, cross-session

Incremental Summarization¶

The Problem¶

Summarizing the entire history every time is expensive. If a conversation has 100 messages and you summarize every 10, the 10th pass reprocesses all 100 messages, including ones already summarized in earlier passes. Cumulative cost is O(n²) in tokens.

Solution: Update the Existing Summary¶

Instead of summarizing the full history, take the previous summary and augment it with new messages. That's O(n) in tokens.
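
The difference is easy to check with arithmetic. The sketch below counts how much input the summarizer reads in total under each approach, measured in message-equivalents rather than tokens. The function name and the assumption that a summary costs a fixed `summarySize` message-equivalents are illustrative, not from the original text.

```go
package main

// summarizationCost returns the cumulative input read by the summarizer
// after totalMessages messages, summarizing once per batch messages.
//   full:        each round re-reads the entire history so far
//   incremental: each round reads the new batch plus the previous summary
// summarySize is an assumed fixed cost of reading the previous summary.
func summarizationCost(totalMessages, batch, summarySize int) (full, incremental int) {
    if batch <= 0 {
        return 0, 0
    }
    seen := 0
    for seen+batch <= totalMessages {
        seen += batch
        full += seen
        incremental += batch
        if seen > batch {
            // Every round after the first also reads the existing summary.
            incremental += summarySize
        }
    }
    return full, incremental
}
```

With 100 messages, a batch of 10, and a summary worth 2 message-equivalents, full re-summarization reads 10 + 20 + ... + 100 = 550 message-equivalents in total, while the incremental approach reads 118. Doubling the history to 200 messages gives 2100 versus 238: the first roughly quadruples, the second roughly doubles.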

// incrementalSummarize updates the existing summary with new messages.
// Instead of re-summarizing the full history — augments the current summary.
func incrementalSummarize(
    ctx context.Context,
    client *openai.Client,
    currentSummary string,
    newMessages []openai.ChatCompletionMessage,
) (string, error) {
    if len(newMessages) == 0 {
        return currentSummary, nil
    }

    newConversation := formatMessages(newMessages)

    var prompt string
    if currentSummary == "" {
        // First summarization
        prompt = "Summarize this conversation. Preserve key facts, decisions, and open questions:\n\n" + newConversation
    } else {
        // Update existing summary
        prompt = fmt.Sprintf(`Update the conversation summary with new messages.

Current summary:
%s

New messages:
%s

Rules:
- Include ALL important information from the current summary
- Add new facts and decisions from new messages
- If new messages contradict the summary — use the new information
- Remove outdated items if they were resolved in new messages
- Keep the format compact`, currentSummary, newConversation)
    }

    resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
        Model: "gpt-4o-mini",
        Messages: []openai.ChatCompletionMessage{
            {Role: "system", Content: "You update conversation summaries. Be precise and concise."},
            {Role: "user", Content: prompt},
        },
        Temperature: 0,
    })
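    if err != nil {
        return currentSummary, err // On error, keep the old summary
    }
    if len(resp.Choices) == 0 {
        return currentSummary, fmt.Errorf("summarization returned no choices")
    }
    return resp.Choices[0].Message.Content, nil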
}

Using in LayeredContext¶

func (c *LayeredContext) SummarizeIncremental(ctx context.Context, client *openai.Client) error {
    if len(c.workingMemory) <= c.maxWorking {
        return nil
    }

    // Take only messages that overflow working memory
    overflow := c.workingMemory[:len(c.workingMemory)-c.maxWorking]

    // Update summary incrementally (don't re-summarize everything)
    updated, err := incrementalSummarize(ctx, client, c.summary, overflow)
    if err != nil {
        return err
    }

    c.summary = updated
    c.workingMemory = c.workingMemory[len(c.workingMemory)-c.maxWorking:]
    return nil
}

Cost comparison:

Approach	Tokens processed at 100th message	Cumulative growth
Full summarization	~50K (entire history)	O(n²)
Incremental	~2K (summary + 10 new)	O(n)

Context Prioritization¶

When the token budget is tight, you need to decide: which data matters more. Not everything is equally valuable — recent messages matter more than old ones, errors matter more than successful results.

Budget by Layer¶

Divide available tokens across context layers. A fixed ratio guarantees no single layer consumes the entire budget:

// TokenBudget distributes available tokens across context layers.
type TokenBudget struct {
    Total          int     // Total budget (model maxTokens - maxOutputTokens)
    SystemRatio    float64 // Share for system prompt (0.10-0.15)
    FactsRatio     float64 // Share for facts (0.10-0.15)
    SummaryRatio   float64 // Share for summary (0.15-0.20)
    WorkingRatio   float64 // Share for working memory (0.50-0.65)
}

func (b TokenBudget) SystemBudget() int  { return int(float64(b.Total) * b.SystemRatio) }
func (b TokenBudget) FactsBudget() int   { return int(float64(b.Total) * b.FactsRatio) }
func (b TokenBudget) SummaryBudget() int { return int(float64(b.Total) * b.SummaryRatio) }
func (b TokenBudget) WorkingBudget() int { return int(float64(b.Total) * b.WorkingRatio) }

Message Scoring¶

Not all messages are equally useful. Score importance and select within the budget:

// ScoredMessage — a message with an importance score.
type ScoredMessage struct {
    Message    openai.ChatCompletionMessage
    Score      float64
    TokenCount int
}

// scoreMessage scores message importance.
// High score = message should be kept.
func scoreMessage(msg openai.ChatCompletionMessage, position, total int) float64 {
    score := 0.0

    // 1. Recency: recent messages are more important (0.0–0.4)
    recency := float64(position) / float64(total)
    score += recency * 0.4

    // 2. Role: tool call results are more important than plain text
    if msg.Role == "tool" {
        score += 0.2 // Tool call results are important
    }

    // 3. Content: errors and important decisions
    content := strings.ToLower(msg.Content)
    if strings.Contains(content, "error") {
        score += 0.3 // Errors are more important than regular messages
    }
    if strings.Contains(content, "decision") || strings.Contains(content, "chose") {
        score += 0.2 // Decisions are worth remembering
    }

    return score
}

// prioritizeContext assembles context respecting budget and priorities.
func prioritizeContext(
    messages []openai.ChatCompletionMessage,
    facts []Fact,
    summary string,
    budget TokenBudget,
    counter TokenCounter,
) []openai.ChatCompletionMessage {
    var result []openai.ChatCompletionMessage

    // 1. Facts — within budget
    if len(facts) > 0 {
        factsText := buildFactsText(facts, budget.FactsBudget(), counter)
        result = append(result, openai.ChatCompletionMessage{
            Role:    "system",
            Content: factsText,
        })
    }

    // 2. Summary — truncate if it doesn't fit
    if summary != "" {
        if counter.Count(summary) > budget.SummaryBudget() {
            // Summary too long — truncate by sentence
            summary = truncateText(summary, budget.SummaryBudget(), counter)
        }
        result = append(result, openai.ChatCompletionMessage{
            Role:    "system",
            Content: "Previous conversation summary:\n" + summary,
        })
    }

    // 3. Working memory: select by score
    if len(messages) == 0 {
        return result // Nothing to select; also avoids slicing with a negative bound
    }
    scored := make([]ScoredMessage, len(messages))
    for i, msg := range messages {
        scored[i] = ScoredMessage{
            Message:    msg,
            Score:      scoreMessage(msg, i, len(messages)),
            TokenCount: counter.Count(msg.Content) + 4,
        }
    }

    // Always include the last user message: reserve its tokens first
    lastIdx := len(scored) - 1
    workingBudget := budget.WorkingBudget() - scored[lastIdx].TokenCount

    // Rank the remaining messages by descending score.
    // Sorting indices (not messages) keeps the original positions available.
    ranked := make([]int, lastIdx)
    for i := range ranked {
        ranked[i] = i
    }
    sort.SliceStable(ranked, func(a, b int) bool {
        return scored[ranked[a]].Score > scored[ranked[b]].Score
    })

    // Take the highest-scored messages while they fit
    var selected []int
    used := 0
    for _, idx := range ranked {
        if used+scored[idx].TokenCount > workingBudget {
            continue
        }
        selected = append(selected, idx)
        used += scored[idx].TokenCount
    }

    // Restore chronological order
    sort.Ints(selected)
    for _, idx := range selected {
        result = append(result, messages[idx])
    }

    // Last message: always at the end
    result = append(result, messages[lastIdx])

    return result
}
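
`prioritizeContext` calls two helpers, `buildFactsText` and `truncateText`, that the chapter does not define. The sketch below is one possible implementation, written under assumptions: facts are taken in descending `Importance` until the budget runs out, and the summary is cut at sentence boundaries using a naive split on ". ". The `wordCounter` type is only a stand-in so the example runs without a tokenizer.

```go
package main

import (
    "sort"
    "strings"
)

// TokenCounter matches the interface from the token counting section.
type TokenCounter interface {
    Count(text string) int
}

// Fact carries only the fields this sketch needs.
type Fact struct {
    Key        string
    Value      string
    Importance int // 1-10
}

// wordCounter is a stand-in counter: one token per whitespace-separated word.
type wordCounter struct{}

func (wordCounter) Count(text string) int { return len(strings.Fields(text)) }

// buildFactsText renders facts in descending importance, skipping any
// fact whose line would push the text over the budget.
func buildFactsText(facts []Fact, budget int, counter TokenCounter) string {
    sorted := append([]Fact(nil), facts...) // copy: do not reorder the caller's slice
    sort.SliceStable(sorted, func(i, j int) bool {
        return sorted[i].Importance > sorted[j].Importance
    })

    var b strings.Builder
    b.WriteString("Important facts:\n")
    used := counter.Count(b.String())
    for _, f := range sorted {
        line := "- " + f.Key + ": " + f.Value + "\n"
        n := counter.Count(line)
        if used+n > budget {
            continue
        }
        b.WriteString(line)
        used += n
    }
    return b.String()
}

// truncateText keeps whole leading sentences while they fit the budget.
// Splitting on ". " is a simplification and mishandles abbreviations.
func truncateText(text string, budget int, counter TokenCounter) string {
    var b strings.Builder
    used := 0
    for _, sentence := range strings.SplitAfter(text, ". ") {
        n := counter.Count(sentence)
        if used+n > budget {
            break
        }
        b.WriteString(sentence)
        used += n
    }
    return strings.TrimSpace(b.String())
}
```

Cutting the summary at sentence boundaries avoids handing the model a half-finished statement, which it might otherwise try to complete on its own.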

Budget Example¶

For a model with 128K context and maxOutputTokens = 4096:

Layer	Share	Tokens
System prompt	10%	~12,400
Facts	10%	~12,400
Summary	20%	~24,800
Working memory	60%	~74,300
Total input	100%	~123,900
Model response	—	4,096
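
The numbers in the table can be reproduced directly. The sketch below repeats `TokenBudget` and adds a hypothetical `Validate` method (not in the original) that rejects ratios summing to more than 1, since in that case the layers together would overrun the input budget.

```go
package main

import "fmt"

// TokenBudget mirrors the struct from the "Budget by Layer" section.
type TokenBudget struct {
    Total        int
    SystemRatio  float64
    FactsRatio   float64
    SummaryRatio float64
    WorkingRatio float64
}

func (b TokenBudget) SystemBudget() int  { return int(float64(b.Total) * b.SystemRatio) }
func (b TokenBudget) FactsBudget() int   { return int(float64(b.Total) * b.FactsRatio) }
func (b TokenBudget) SummaryBudget() int { return int(float64(b.Total) * b.SummaryRatio) }
func (b TokenBudget) WorkingBudget() int { return int(float64(b.Total) * b.WorkingRatio) }

// Validate is a hypothetical guard: the layer shares must not exceed 100%.
// The small tolerance absorbs floating-point rounding in the sum.
func (b TokenBudget) Validate() error {
    sum := b.SystemRatio + b.FactsRatio + b.SummaryRatio + b.WorkingRatio
    if sum > 1.0+1e-9 {
        return fmt.Errorf("layer ratios sum to %.2f, want <= 1.0", sum)
    }
    return nil
}
```

With `Total: 128000 - 4096` (123,904) and ratios of 0.10, 0.10, 0.20, and 0.60, the methods return 12,390, 12,390, 24,780, and 74,342 tokens, which match the rounded figures in the table.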

Common Errors¶

Error 1: No Summarization¶

Symptom: Context grows infinitely, reaching token limits.

Cause: Old conversations are never summarized.

Solution: Implement periodic summarization when working memory exceeds threshold.
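
One way to make summarization periodic is to trigger it from `AddMessage`, which the `ContextManager` interface declares but the chapter never implements. The sketch below is an assumption about how that could look. It uses a local `Message` type in place of `openai.ChatCompletionMessage` and an injected `Summarizer` function in place of a real LLM call, so it runs without network access.

```go
package main

import "errors"

// Message is a stand-in for openai.ChatCompletionMessage.
type Message struct {
    Role    string
    Content string
}

// Summarizer folds overflow messages into the previous summary.
// In production this would wrap an LLM call such as incrementalSummarize.
type Summarizer func(previous string, overflow []Message) (string, error)

// ErrNoSummarizer is returned when overflow occurs but no summarizer is set.
var ErrNoSummarizer = errors.New("working memory overflow: no summarizer configured")

type LayeredContext struct {
    workingMemory []Message
    summary       string
    maxWorking    int
    summarize     Summarizer
}

// AddMessage appends a message and summarizes the overflow as soon as
// working memory exceeds maxWorking.
func (c *LayeredContext) AddMessage(msg Message) error {
    c.workingMemory = append(c.workingMemory, msg)
    if len(c.workingMemory) <= c.maxWorking {
        return nil
    }
    if c.summarize == nil {
        return ErrNoSummarizer
    }

    cut := len(c.workingMemory) - c.maxWorking
    updated, err := c.summarize(c.summary, c.workingMemory[:cut])
    if err != nil {
        // Keep the overflow in working memory so nothing is lost;
        // the next AddMessage call will retry.
        return err
    }

    c.summary = updated
    // Copy the tail so the old backing array can be garbage collected.
    c.workingMemory = append([]Message(nil), c.workingMemory[cut:]...)
    return nil
}
```

Working memory is trimmed only after the summarizer succeeds, so a failed LLM call never drops messages silently.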

Error 2: Too Aggressive Summarization¶

Symptom: Important details lost in summary, agent makes mistakes.

Cause: Summary too compressed, facts not extracted.

Solution: Extract facts before summarization, save them separately.

Error 3: No Fact Selection¶

Symptom: Including irrelevant facts wastes tokens.

Cause: Including all facts regardless of relevance.

Solution: Score facts by importance, include only highly scored facts.
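
A minimal sketch of that selection, using the `Importance` field of the `Fact` struct from Step 1. The `selectFacts` name, the threshold, and the cap are assumptions chosen for illustration; relevance to the current query is not modeled here.

```go
package main

import "sort"

// Fact mirrors the struct from Step 1.
type Fact struct {
    Key        string
    Value      string
    Source     string
    Importance int // 1-10
    Type       string
}

// selectFacts keeps facts with Importance >= minImportance, orders them
// from most to least important, and returns at most limit of them.
// A negative limit means no cap.
func selectFacts(facts []Fact, minImportance, limit int) []Fact {
    var selected []Fact
    for _, f := range facts {
        if f.Importance >= minImportance {
            selected = append(selected, f)
        }
    }
    // Stable sort keeps the original order among equally important facts.
    sort.SliceStable(selected, func(i, j int) bool {
        return selected[i].Importance > selected[j].Importance
    })
    if limit >= 0 && len(selected) > limit {
        selected = selected[:limit]
    }
    return selected
}
```

A fixed threshold is the simplest policy. A token budget, as in the prioritization section, is the stricter alternative when facts vary a lot in length.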

Error 4: Preferences Included as Facts¶

Symptom: Model shifts answer toward user preferences, even if actual data points elsewhere.

Cause: User preferences or hypotheses included in context as facts without distinguishing types.

Solution:

// GOOD: Distinguish types
fact := Fact{
    Key:   "user_thinks_db_problem",
    Value: "User assumes problem is in DB",
    Type:  "hypothesis", // Not "fact"!
}

// When assembling context:
// verified facts and constraints always go in; preferences and
// hypotheses go in only when personalization is wanted.
if fact.Type == "fact" || fact.Type == "constraint" || includePreferences {
    includeInContext(fact)
}

Practice: For analytical tasks (incidents, diagnostics), exclude preferences and hypotheses from context. Include them only for personalized responses (e.g., recommendations based on user preferences).

Mini-Exercises¶

Exercise 1: Implement Summarization¶

Create a function that summarizes conversation history:

func summarizeConversation(messages []openai.ChatCompletionMessage) (string, error) {
    // TODO: use an LLM to create the summary
    return "", errors.New("not implemented")
}

Expected result:

Summary preserves key facts
Significantly reduces token count
Can recover main points

Completion Criteria / Checklist¶

Completed:

Understand context layers
Can summarize conversations
Extract and store facts
Manage context within token limits

Not completed:

No summarization, context grows infinitely
Too aggressive summarization, facts lost
No fact selection, token waste

Connection with Other Chapters¶

Chapter 11: State Management — Task state is used when assembling context
Chapter 12: Agent Memory Systems — Facts from memory are used in context (storage/retrieval described there)
Chapter 20: Cost & Latency Engineering — Token budgets control context selection policies

IMPORTANT: Context Engineering focuses on assembling context from various sources (memory, state, retrieval). Data storage is described in respective chapters (Memory, State Management, RAG).

What's Next?¶

After mastering context engineering, proceed to:

14. Ecosystem and Frameworks — Learn about agent frameworks