Skip to content

23. Evals in CI/CD

Why This Chapter?

You changed the prompt or code, and the agent's performance degraded. But you only learn about it after deploying to production. Without evals in CI/CD, you can't automatically check quality before deployment.

Real-World Case Study

Situation: You updated the system prompt and deployed changes. After a day, users complain that the agent works worse.

Problem: No automatic quality check before deployment. Changes are deployed without testing.

Solution: Evals in CI/CD pipeline, quality gates, blocking deployment on metric degradation. Now bad changes don't reach production.

Theory in Simple Terms

What Are Quality Gates?

Quality Gates are quality checks that block deployment if metrics have degraded.

How It Works (Step by Step)

Step 1: Quality Gates in CI/CD

Integrate evals into CI/CD pipeline:

GitHub Actions

# .github/workflows/evals.yml
name: Run Evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run evals
        run: go run cmd/evals/main.go
      - name: Check quality gate
        run: |
          if [ "$PASS_RATE" -lt "0.95" ]; then
            echo "Quality gate failed: Pass rate $PASS_RATE < 0.95"
            exit 1
          fi

GitLab CI/CD

# .gitlab-ci.yml
stages:
  - evals

evals:
  stage: evals
  image: golang:1.21
  before_script:
    - go version
  script:
    # Run evals and save result
    - |
      PASS_RATE=$(go run cmd/evals/main.go 2>&1 | grep -oP 'Pass rate: \K[0-9.]+' || echo "0")
      echo "Pass rate: $PASS_RATE"
      # Check quality gate
      if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
        echo "Quality gate failed: Pass rate $PASS_RATE < 0.95"
        exit 1
      fi
  only:
    - merge_requests
  tags:
    - docker

Alternative (if cmd/evals/main.go exports environment variable):

evals:
  stage: evals
  image: golang:1.21
  script:
    - go run cmd/evals/main.go
    - |
      if [ "$PASS_RATE" -lt "0.95" ]; then
        echo "Quality gate failed: Pass rate $PASS_RATE < 0.95"
        exit 1
      fi
  only:
    - merge_requests

Note: Ensure cmd/evals/main.go outputs result in parseable format (e.g., Pass rate: 0.95), or exports environment variable PASS_RATE.

Step 2: Dataset Versioning

Store "golden" scenarios in dataset:

type EvalDataset struct {
    Version string
    Cases   []EvalCase
}

type EvalCase struct {
    Input    string
    Expected string
}

Where to Integrate This in Our Code

Integration Point: CI/CD Pipeline

Create separate file cmd/evals/main.go to run evals:

package main

import (
    "fmt"
    "os"
)

func main() {
    passRate := runEvals()

    // Output result in format parseable in CI/CD
    fmt.Printf("Pass rate: %.2f\n", passRate)

    // Export environment variable for convenience (works in both systems)
    os.Setenv("PASS_RATE", fmt.Sprintf("%.2f", passRate))

    // Quality gate: if pass rate below threshold, exit with error
    if passRate < 0.95 {
        fmt.Printf("Quality gate failed: Pass rate %.2f < 0.95\n", passRate)
        os.Exit(1)
    }

    fmt.Println("Quality gate passed!")
}

Note: This example works with both GitHub Actions and GitLab CI/CD. Environment variable PASS_RATE is available in both cases, and output in format Pass rate: 0.95 allows parsing result via grep or other tools.

Common Errors

Error 1: Evals Not Integrated in CI/CD

Symptom: Evals run manually, bad changes reach production.

Solution: Integrate evals into CI/CD pipeline.

Completion Criteria / Checklist

Completed:

  • Evals integrated in CI/CD
  • Quality gates block deployment on degradation

Not completed:

  • Evals not integrated in CI/CD

Connection with Other Chapters


Navigation: ← Prompt and Program Management | Table of Contents | Data and Privacy →