Feb 2026

GitHub Actions for LLM Eval Pipelines

Eddie Wang, Engineering

You've shipped an LLM-powered feature. The prompt works, the outputs look reasonable, and stakeholders are happy. Then someone tweaks the system prompt to handle an edge case, and three days later a user reports the chatbot is recommending competitors. Nobody noticed because there was no test.

This is the gap most teams are sitting in right now. Traditional CI catches broken builds and failing unit tests, but LLM outputs aren't deterministic. You can't just assert that the response equals some expected string. The output changes every run, even with temperature set to zero (thanks to batching and floating-point nondeterminism on GPU hardware). So teams fall back on manual spot-checking, which scales about as well as you'd expect.

GitHub Actions turns out to be a natural place to close this gap. The infrastructure is already there. Your prompts live in the repo, your API keys are in Actions secrets, and you already have PR-triggered workflows for everything else. The missing piece is knowing what to actually test and which tools make it practical.

What "testing" means for LLM outputs

Before wiring anything into CI, it helps to understand the three main evaluation strategies teams use. Each has different cost, reliability, and setup tradeoffs.

Deterministic assertions

The simplest approach: check whether the output contains (or doesn't contain) specific strings, matches a regex, or falls within a length range. If your prompt is supposed to return valid JSON, you can parse it and assert on the schema. If it should never mention a competitor by name, a substring check catches that.

These checks are fast, free (no extra API calls), and completely reproducible. They won't tell you if the response is "good" in a subjective sense, but they're the first line of defense against obvious regressions. Most teams underuse them.
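
As a concrete sketch, a deterministic check layer can be a few lines of Python. The JSON schema, the `reply` field, and the competitor names here are placeholders, not from any particular product:

```python
import json

def check_response(output: str) -> list[str]:
    """Return failure messages from cheap, reproducible checks."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    if "reply" not in data:
        failures.append('missing "reply" field')
    # Never mention a competitor by name (names are placeholders)
    for banned in ("AcmeCorp", "RivalBot"):
        if banned.lower() in output.lower():
            failures.append(f"mentions competitor {banned}")
    # Keep replies within a sane length range
    if not 1 <= len(data.get("reply", "")) <= 2000:
        failures.append("reply length out of range")
    return failures

# A well-formed, on-policy response produces no failures
assert check_response('{"reply": "You can return items within 30 days."}') == []
```

Checks like these run in milliseconds and cost nothing, so there's no reason not to run them on every PR.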

LLM-as-judge

For subjective quality, you use another LLM to score the output. You give the judge model the original prompt, the generated response, and optionally a reference answer, then ask it to rate accuracy, helpfulness, or whatever dimensions matter to you. Promptfoo, Braintrust, and LangSmith all support this pattern natively.

The tradeoff is cost and latency. Every eval case now requires an additional API call to the judge model. It also introduces its own nondeterminism: the judge might score the same response differently on consecutive runs. You can mitigate this with majority voting (run the judge three times, take the consensus) but that triples the cost.
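
The voting step itself is trivial once the verdicts are collected. A sketch, assuming the judge model has already been called three times on the same response and the calls themselves are omitted; plain "PASS"/"FAIL" strings stand in for real judge output:

```python
from collections import Counter

def majority_verdict(verdicts):
    """Consensus across repeated judge runs.

    `verdicts` would come from N calls to the judge model on the
    same response; here they are illustrative strings.
    """
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner

# Two of three judge runs pass -> consensus PASS
assert majority_verdict(["PASS", "FAIL", "PASS"]) == "PASS"
```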

Embedding similarity scoring

A middle ground between deterministic and LLM-as-judge: embed the generated response and a reference answer, then compute cosine similarity. If the similarity drops below a threshold, the test fails. This catches semantic drift without needing a full judge model call. Embedding API calls are cheap (fractions of a cent per call with OpenAI's text-embedding-3-small) and fast.

The limitation is nuance. Two responses can be semantically similar but differ on a critical detail (like recommending the right product vs. a similar but wrong one). Embedding similarity works best as a coarse regression detector, not a precision quality gate.
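
The scoring math is just cosine similarity over the two vectors. A minimal sketch, with toy two-dimensional vectors standing in for real embedding API output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (assumed nonzero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def similarity_gate(response_vec, reference_vec, threshold=0.8) -> bool:
    """Pass when the generated response stays semantically close to the reference."""
    return cosine_similarity(response_vec, reference_vec) >= threshold

# Identical vectors score 1.0 and pass; orthogonal vectors score 0.0 and fail
assert similarity_gate([1.0, 0.0], [1.0, 0.0])
assert not similarity_gate([1.0, 0.0], [0.0, 1.0])
```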

In practice, most teams combine all three. Deterministic checks as the baseline, embedding similarity for broad semantic coverage, and LLM-as-judge for the high-stakes cases where you need a genuine quality assessment.

Running evals in GitHub Actions

The basic pattern is straightforward: trigger on PR, run your eval suite against the changed prompts, and post results back to the PR as a comment or check. Here's what a minimal workflow looks like using promptfoo, the most widely adopted open-source eval framework (17k+ GitHub stars, now part of OpenAI):

name: Prompt Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - name: Cache promptfoo
        uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}
          restore-keys: |
            ${{ runner.os }}-promptfoo-

      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: |
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json

      - name: Quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "Eval failed with $FAILURES failures"
            exit 1
          fi

A few things to notice here. The paths filter means the eval only runs when prompt files actually change, so you're not burning API credits on every PR. The cache step stores previous LLM responses so identical inputs don't hit the API twice. And the quality gate is a simple failure-count check, but you can make it a percentage threshold instead.

Before vs. after comparisons on PRs

Promptfoo's dedicated GitHub Action (promptfoo/promptfoo-action@v1) goes further than running evals. It checks out the base branch, runs the eval against the old prompts, then checks out the PR branch, runs against the new prompts, and posts a side-by-side diff as a PR comment. You see exactly which test cases got better or worse. The comment includes a link to an interactive web viewer where you can drill into individual responses.

This is powerful for prompt iteration. Instead of merging a prompt change and hoping it doesn't break anything, reviewers can see the impact on every test case before approving the PR.

Scheduled benchmark runs

Not every eval should run on every PR. Full benchmark suites that test hundreds of cases across multiple models are expensive and slow. Run those on a schedule or trigger them manually with workflow_dispatch. A nightly cron job that runs your full eval suite across GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro gives you a daily snapshot of how each model performs on your specific use case.

GitHub Actions' matrix strategy makes parallel model testing straightforward:

strategy:
  matrix:
    model: [gpt-4o, claude-3-5-sonnet, gemini-1.5-pro]
steps:
  - name: Run benchmark
    run: |
      npx promptfoo@latest eval \
        --providers.0.config.model=${{ matrix.model }} \
        -o results-${{ matrix.model }}.json

Each model runs in its own job, in parallel, and you can aggregate the results afterward to compare.
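
The aggregation step can be a small script in a follow-up job. A sketch, assuming each matrix job uploaded its results-&lt;model&gt;.json with the stats structure used elsewhere in this post:

```python
import glob
import json

def summarize(results_glob="results-*.json"):
    """Collect per-model pass rates from each matrix job's output file."""
    summary = {}
    for path in sorted(glob.glob(results_glob)):
        with open(path) as f:
            stats = json.load(f)["results"]["stats"]
        # "results-gpt-4o.json" -> "gpt-4o"
        model = path.removeprefix("results-").removesuffix(".json")
        summary[model] = stats["successes"] / (stats["successes"] + stats["failures"])
    return summary
```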

The tool landscape

Several tools have emerged for running LLM evals in CI, each with a different philosophy.

Promptfoo

The most mature option for CI integration. It's open-source, config-driven (YAML), and has a dedicated GitHub Action that handles before/after comparisons automatically. Promptfoo supports deterministic assertions, LLM-as-judge, embedding similarity, custom Python/JavaScript validators, and even red teaming for security testing. It works with any LLM provider. The fact that OpenAI acquired Promptfoo in early 2026 signals how central eval tooling has become.

The config file defines your test cases, assertions, and providers in one place:

# promptfooconfig.yaml
prompts:
  - prompts/system.txt
providers:
  - openai:gpt-4o
tests:
  - vars:
      query: "What's your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response is helpful and accurate"
      - type: similar
        value: "You can return items within 30 days."
        threshold: 0.8

Braintrust

A commercial platform (with a free tier) that treats evals as a core part of the development loop. Braintrust provides dataset management, scoring functions, and a dashboard for tracking eval results over time. Its CI integration uses a Python or TypeScript SDK: you write eval scripts that call braintrust eval and the results upload to their platform automatically. The strength here is the dashboard: you can track score trends across commits and get alerted when scores drop.

LangSmith

LangChain's evaluation platform. If you're already using LangChain or LangGraph, LangSmith integrates tightly. You define datasets and evaluators through their SDK, then run them in CI with langsmith test. The platform stores traces, so you can debug exactly what happened on a failing eval run. The downside is ecosystem lock-in: it's less useful if your LLM code doesn't use LangChain.

Evidently AI

Evidently released a GitHub Action in mid-2025 that wraps their open-source Python library. It downloads a test dataset, runs your agent against the inputs, evaluates responses with LLM judges or custom functions, and fails the CI job if any test doesn't meet your thresholds. Results show up in GitHub Checks. If you're coming from an ML monitoring background (Evidently started in traditional ML observability), the concepts will feel familiar.

Rolling your own with pytest

If you don't want another tool in the stack, a pytest suite with direct API calls works fine for smaller eval sets. Load your golden test cases from a JSON file, call the OpenAI API (or whatever provider), and assert on the results. You lose the before/after comparison UI and the dashboard, but you gain full control and zero vendor dependency.

import json

import openai
import pytest

client = openai.OpenAI()

with open("golden_tests.json") as f:
    test_cases = json.load(f)

# Load the system prompt once, not on every test case
with open("system_prompt.txt") as f:
    SYSTEM_PROMPT = f.read()

@pytest.mark.parametrize("case", test_cases)
def test_prompt_response(case):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": case["input"]},
        ],
    )
    output = response.choices[0].message.content

    # Deterministic checks
    for required in case.get("must_contain", []):
        assert required.lower() in output.lower(), f"missing: {required}"
    for forbidden in case.get("must_not_contain", []):
        assert forbidden.lower() not in output.lower(), f"found: {forbidden}"

This approach is honest about what it can do. It won't give you pretty PR comments or trend charts, but it'll catch regressions, and it's straightforward to maintain.

Controlling costs

LLM evals in CI can get expensive fast. Every test case that hits the API costs money, and if you've got 200 test cases running on every PR across three models, you're looking at real spend. Here's how teams keep it manageable.

Model tiering

Use a cheaper, faster model for PR checks and reserve the full model for release or nightly runs. If your production system uses GPT-4o, run PR evals against GPT-4o-mini. It's roughly 15x cheaper per token and fast enough to keep CI feedback under a few minutes. The nightly benchmark run uses the production model. This gives you fast feedback on PRs without the sticker shock.
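
One way to wire the tiering is to branch on the event that triggered the workflow. GitHub Actions sets GITHUB_EVENT_NAME in every run; the model names below are illustrative, not prescriptive:

```python
import os

def eval_model() -> str:
    """Pick the eval model from the CI context.

    GITHUB_EVENT_NAME is set by GitHub Actions: "pull_request" on PRs,
    "schedule" on cron runs.
    """
    if os.environ.get("GITHUB_EVENT_NAME") == "pull_request":
        return "gpt-4o-mini"  # roughly 15x cheaper: fast feedback on PRs
    return "gpt-4o"           # production model for nightly/release runs
```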

Response caching

Promptfoo's built-in cache stores API responses keyed by the input. If the prompt and test input haven't changed, it serves the cached response instead of making another API call. Set PROMPTFOO_CACHE_PATH and use actions/cache to persist it across workflow runs. This also makes deterministic test cases genuinely deterministic, since you're replaying the same response.
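
If you're rolling your own harness instead, the same idea is easy to hand-roll: key responses by a hash of everything that affects the output. A sketch where the cache directory and the call_api signature are made up for illustration:

```python
import hashlib
import json
import os

CACHE_DIR = ".llm_cache"  # hypothetical local directory; persist it with actions/cache

def cache_key(model: str, prompt: str) -> str:
    """Hash every input that affects the output, so any change busts the cache."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, call_api) -> str:
    """Serve a stored response when this exact model+prompt was seen before."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_key(model, prompt))
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    response = call_api(model, prompt)  # real API call only on a cache miss
    with open(path, "w") as f:
        f.write(response)
    return response
```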

Rate limiting and concurrency

Running 200 test cases in parallel will hit rate limits on most providers. Promptfoo's -j flag controls concurrency (-j 5 limits to five parallel requests). If you're rolling your own, add retry logic with exponential backoff. It's better to have a slower eval run than a failed one because you hit a 429.
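
A minimal backoff wrapper looks like this, with RuntimeError standing in for whatever rate-limit exception your provider's SDK actually raises:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on rate-limit errors with exponential backoff plus jitter.

    RuntimeError is a stand-in for the provider's rate-limit exception;
    swap in the real type (and catch only that) in practice.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the 429 and fail loudly
            # 1x, 2x, 4x... the base delay, with jitter to avoid thundering herd
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```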

Secrets management for AI API keys

Your eval workflows need API keys, and those keys need to stay secure. GitHub Actions provides a few layers for this.

Repository secrets are the baseline. Store your OPENAI_API_KEY (or ANTHROPIC_API_KEY, GOOGLE_API_KEY, etc.) in Settings > Secrets. They're encrypted at rest and masked in logs.

Environment scoping adds a layer. Create separate environments ("eval-pr" and "eval-nightly") with different API keys. The PR environment might use a key with lower rate limits or a spend cap. The nightly environment uses the full-access key. This limits blast radius if a key leaks from a fork PR.

OIDC federation is the gold standard for cloud providers that support it. Instead of storing a static API key, the workflow requests a short-lived token from your cloud provider using GitHub's OIDC identity. AWS Bedrock supports this natively: your workflow assumes an IAM role scoped to just the Bedrock actions it needs. No long-lived credentials to rotate. GCP Vertex AI works the same way via Workload Identity Federation.
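
For the AWS case, the workflow-side setup is a few lines with the official aws-actions/configure-aws-credentials action; the role ARN below is a placeholder for an IAM role you create and scope yourself:

```yaml
# Sketch: short-lived AWS credentials via OIDC instead of a static key
permissions:
  id-token: write   # allows the job to request a GitHub OIDC token
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/eval-bedrock-role
      aws-region: us-east-1
```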

For providers like OpenAI and Anthropic that don't support OIDC, you're stuck with static API keys. Rotate them regularly and use environment-level protection rules (require approval before the nightly eval can run, for example).

Interpreting results and failing builds

The hardest part of LLM evals in CI isn't running them. It's deciding when to fail the build.

If you fail on any single test case failure, you'll get flaky builds. LLM-as-judge scores fluctuate. An eval that passes 95% of the time and fails 5% isn't necessarily broken; it might just be scoring borderline cases differently. Teams that start with "fail on any failure" usually relax to a threshold within the first week.

A practical approach: set a pass-rate threshold (say, 90%) and only fail the build when you drop below it. You can do this with a simple shell script after the eval run:

PASS_RATE=$(jq '.results.stats.successes /
  (.results.stats.successes + .results.stats.failures) * 100' results.json)

if (( $(echo "$PASS_RATE < 90" | bc -l) )); then
  echo "Quality gate failed: ${PASS_RATE}% pass rate (threshold: 90%)"
  exit 1
fi
echo "Quality gate passed: ${PASS_RATE}% pass rate"

For PR comments with eval diffs, use gh pr comment to post a summary with the pass rate, any regressions, and a link to the detailed results. Promptfoo's --share flag generates a shareable URL to an interactive viewer, so reviewers can explore the results without pulling down JSON files.

Tracking evals over time

A single eval run tells you if something is broken now. Tracking eval scores across commits tells you whether quality is trending up or down. Braintrust and LangSmith both provide built-in dashboards for this. If you're using promptfoo or a custom solution, you can upload eval results to a datastore (even a simple CSV in a GitHub artifact) and build your own trend analysis.

The pattern that works well: upload eval results as GitHub Actions artifacts on every run, and have a separate workflow that aggregates them weekly into a summary. This gives you both the per-PR granularity and the big-picture trend.
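
The CSV version of this is a few lines appended to the eval job. A sketch, assuming the stats structure used earlier in this post and a hypothetical eval_trend.csv kept as an artifact:

```python
import csv
import datetime
import json

def append_trend(results_path="results.json", trend_csv="eval_trend.csv"):
    """Append today's pass rate to a long-lived CSV for trend tracking."""
    with open(results_path) as f:
        stats = json.load(f)["results"]["stats"]
    pass_rate = stats["successes"] / (stats["successes"] + stats["failures"])
    with open(trend_csv, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), pass_rate])
```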

Putting it together

A complete LLM evaluation pipeline in GitHub Actions typically looks like this:

  1. PR trigger with path filters so evals only run when prompts or model config changes.
  2. Cache restoration to avoid re-running identical API calls.
  3. Eval run against a golden test set using a cheaper model for fast feedback.
  4. Quality gate that checks pass rate against a threshold, not a hard zero-failure policy.
  5. PR comment with results summary and a link to the detailed viewer.
  6. Artifact upload for audit trail and trend tracking.

Separately, a nightly scheduled workflow runs the full benchmark suite across production models, uploads results to whatever dashboard you're using, and alerts the team on Slack if scores drop below baseline.

The tooling for this is still early. Expect breaking changes, workflow patterns that evolve fast, and occasional frustration with flaky eval scores. But the alternative is shipping prompt changes blind and finding out from users that something broke. Even a basic eval pipeline with ten golden test cases and a few deterministic assertions is a massive improvement over nothing.
