GitHub Actions
May 2026

Flaky Test Quarantine in GitHub Actions

Eddie Wang

Every team that's lived with CI long enough develops the same coping mechanism: a failing check shows up on a PR, someone mutters "that's flaky," and they hit Re-run. The test passes on the second attempt. The PR merges. Nobody files a ticket.

This works until it doesn't. And lately, it doesn't work at all. GitHub now caps workflow reruns at 50 per run, which means your safety valve has a hard limit. But the bigger problem isn't the cap. It's that habitual rerunning trains your team to ignore red checks, and that's how real regressions slip into production unnoticed.

The sustainable answer is a quarantine pipeline: a system that automatically detects flaky tests, moves them out of the critical path, assigns ownership, and escalates when they sit unfixed too long. Here's how to build one in GitHub Actions.

What Makes a Test Flaky (and How to Detect It Automatically)

A flaky test is one that produces different results on the same code. But "same code" is deceptive. The commit hash might be identical while the runner's disk, network latency, clock skew, or available memory differ across runs. The detection problem is really about separating signal from noise across multiple dimensions.

Three reliable signals for automated detection:

  1. Pass-on-retry. A test fails on attempt 1 and passes on attempt 2 within the same workflow run, with no code changes in between. This is the strongest flake signal because the environment is nearly identical. Most test frameworks expose retry metadata that you can capture.
  2. Variance across identical commits. When the same test fails on commit abc123 in one workflow run but passes in another (say, from a re-run or a different PR pointing at the same merge base), that's a flake. You can track this by storing test results keyed by commit SHA and comparing across runs.
  3. Environment-dependent failures. A test passes on ubuntu-22.04 but fails on ubuntu-24.04, or passes on a self-hosted runner with 8 cores but fails on a 2-core GitHub-hosted runner. These are real bugs in the test code, but they look like flakes to anyone staring at a red PR check.

The key insight: you don't need to determine why a test is flaky to quarantine it. You just need enough statistical confidence that its results aren't trustworthy. Three failures across five identical-commit runs? That's a 60% failure rate on unchanged code. Quarantine it.
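That rule is small enough to sketch in Python. Treat this as an illustration rather than a reference implementation: the Result record, the find_flaky name, and the min_runs and min_failures parameters are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class Result(NamedTuple):
    test_id: str      # hypothetical identifier, e.g. "file::test name"
    commit_sha: str
    passed: bool


def find_flaky(results: Iterable[Result], min_runs: int = 5,
               min_failures: int = 3) -> dict[tuple[str, str], float]:
    """Return {(test_id, commit_sha): failure_rate} for tests whose outcome
    varied on a single commit and that meet the evidence thresholds."""
    outcomes: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for r in results:
        outcomes[(r.test_id, r.commit_sha)].append(r.passed)

    flaky = {}
    for key, runs in outcomes.items():
        failures = runs.count(False)
        mixed = 0 < failures < len(runs)  # both passes and failures were seen
        if mixed and len(runs) >= min_runs and failures >= min_failures:
            flaky[key] = failures / len(runs)
    return flaky


# Five runs on one commit: three failures, two passes.
history = [Result("checkout.test.ts::applies coupon", "abc123", ok)
           for ok in (False, True, False, True, False)]
# A test that fails every time is a real failure, not a flake.
history += [Result("cart.test.ts::adds item", "abc123", False)] * 5

print(find_flaky(history))
```

Only the coupon test is reported. The cart test fails consistently, so it stays in the main pipeline where it belongs.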

Auto-Tagging Flaky Tests from CI Metadata

Manual tagging doesn't scale. If a human has to review each failure and decide "that's flaky" before marking it, you've built a process that depends on someone's goodwill and available time. The goal is zero-human-intervention detection.

The architecture is straightforward: after each test run, a post-processing step parses the results, compares against historical data, and updates a flake registry. Here's how that looks in practice:

name: Test with Flake Detection
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run tests with JSON output
        id: tests
        run: npm test -- --json --outputFile=test-results.json
        # Keep going so the detection steps run; the last step restores the failure.
        continue-on-error: true

      - name: Detect flakes
        if: always()
        run: |
          python3 scripts/detect-flakes.py \
            --results test-results.json \
            --commit ${{ github.sha }} \
            --history .flake-history/

      - name: Update quarantine list
        if: always()
        run: |
          python3 scripts/update-quarantine.py \
            --registry .quarantine/registry.json \
            --threshold 3

      - name: Fail the job if the test step failed
        if: steps.tests.outcome == 'failure'
        run: exit 1

The detect-flakes.py script does the heavy lifting: it reads the JSON test output, checks whether any test that failed this run has passed on the same commit SHA in a previous run, and writes a flake candidate list. The update-quarantine.py script promotes candidates to quarantined status once they've been flagged a configurable number of times (in this example, three).
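The internals of those scripts aren't shown here, so the following is only a guess at what the promotion step could look like. The promote_candidates function, the flag_counts mapping, and the in-memory registry shape are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timezone


def promote_candidates(flag_counts: dict[str, int], registry: dict,
                       threshold: int = 3, now: datetime | None = None) -> list[str]:
    """Add tests flagged at least `threshold` times to the registry.

    Returns the paths quarantined by this call. Mutates `registry` in place.
    """
    now = now or datetime.now(timezone.utc)
    already = {entry["path"] for entry in registry["tests"]}
    newly = []
    for path, count in sorted(flag_counts.items()):
        if count >= threshold and path not in already:
            registry["tests"].append({
                "path": path,
                "quarantinedAt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "consecutivePasses": 0,
            })
            newly.append(path)
    return newly


registry = {"tests": []}
counts = {"src/__tests__/a.test.ts": 3, "src/__tests__/b.test.ts": 1}
# Only a.test.ts has reached the threshold of three flags.
print(promote_candidates(counts, registry))
```

Returning the newly quarantined paths separately makes it easy to drive follow-up steps, such as opening tracking issues, from the same run.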

Store the registry in your repository or an external data store. A JSON file committed to the repo is simplest; a shared database or S3 bucket works better for monorepos or teams that need cross-repository flake tracking.

Routing Quarantined Tests to a Separate Workflow

Once you've identified flaky tests, the next move is separating them from your main CI pipeline. The quarantined tests still run, but they don't gate PRs. This keeps your green rate honest while preserving the signal those tests produce.

You need two workflows:

  1. The main CI workflow runs all tests except those in the quarantine registry. This is a required check for PR merges.
  2. The quarantine workflow runs only quarantined tests. This is an informational check that doesn't block merges.

The main workflow reads the quarantine registry and excludes those test paths:

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

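      - name: Build quarantine exclusion pattern
        id: quarantine
        run: |
          # Join quarantined paths into one regex; stays empty if the registry is missing or empty.
          EXCLUDE=""
          if [ -f .quarantine/registry.json ]; then
            EXCLUDE=$(jq -r '[.tests[].path] | unique | join("|")' .quarantine/registry.json)
          fi
          # Only pass the flag when there is something to exclude: an empty regex matches every path.
          if [ -n "$EXCLUDE" ]; then
            echo "exclude=--testPathIgnorePatterns='$EXCLUDE'" >> "$GITHUB_OUTPUT"
          else
            echo "exclude=" >> "$GITHUB_OUTPUT"
          fi

      - name: Run tests (excluding quarantined)
        run: npm test -- ${{ steps.quarantine.outputs.exclude }}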

And the quarantine workflow runs the inverse set:

# .github/workflows/quarantine.yml
name: Quarantine Tests
on: [push, pull_request]

jobs:
  quarantine:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

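      - name: Run quarantined tests only
        run: |
          # Paths are joined with "|" because Jest treats the value as a regex.
          INCLUDE=$(jq -r '[.tests[].path] | unique | join("|")' .quarantine/registry.json)
          # Nothing quarantined: skip, since an empty pattern would run the whole suite.
          if [ -z "$INCLUDE" ]; then exit 0; fi
          # Retries are set in test setup with jest.retryTimes(2); Jest has no CLI retry flag.
          npx jest --testPathPatterns="$INCLUDE" --json --outputFile=test-results.json
        continue-on-error: true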

      - name: Report quarantine results
        if: always()
        run: |
          python3 scripts/report-quarantine.py \
            --results test-results.json \
            --notify slack

The continue-on-error: true on the quarantine run is deliberate. These tests are known-unreliable, so you don't want their failures to register as a failed check on the PR. But you still want the data, which the reporting step captures.
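One detail worth handling in both workflows: Jest treats these path options as regular expressions, so a "." in a file name matches any character. If you'd rather build the pattern in a script than in jq, a stricter version might look like this. The build_path_pattern helper is hypothetical, and the sketch assumes that Python's re.escape output is also valid for JavaScript's RegExp, which is worth verifying against your Jest version.

```python
import re


def build_path_pattern(registry: dict) -> str:
    """Return one regex alternation matching exactly the quarantined paths,
    or an empty string when nothing is quarantined."""
    paths = sorted({entry["path"] for entry in registry.get("tests", [])})
    return "|".join(re.escape(p) for p in paths)


registry = {"tests": [
    {"path": "src/__tests__/payment-flow.test.ts"},
    {"path": "src/__tests__/websocket-reconnect.test.ts"},
]}
pattern = build_path_pattern(registry)
print(pattern)

# The escaped pattern matches the real file but not a look-alike path.
assert re.search(pattern, "/repo/src/__tests__/payment-flow.test.ts")
assert not re.search(pattern, "/repo/src/__tests__/payment-flowXtest.ts")
```

When the function returns an empty string, the caller should omit the Jest flag entirely rather than pass an empty pattern.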

SLAs and Owner Assignment: The Part Everyone Skips

Here's the uncomfortable truth about quarantine systems: a quarantined test without an owner and a deadline is worse than a red test. At least a red test creates urgency. A quarantined test creates comfort. It's out of the way, nobody's looking at it, and it'll sit there for months.

When a test enters quarantine, the system should automatically:

  • Open a tracking issue with the test name, failure frequency, recent logs, and the quarantine date.
  • Assign an owner based on CODEOWNERS or git blame for the test file. If nobody owns the code the test covers, that's a separate problem worth surfacing.
  • Set a deadline. A 14-day SLA is reasonable for most teams. The test should be fixed, rewritten, or deliberately deleted within that window.

You can automate this with the GitHub CLI in your quarantine detection workflow:

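# For each newly quarantined test, open an issue
while IFS= read -r test; do
  # --assignee needs a GitHub login, not an email, so look up the last commit's author
  # through the API (accurate history needs fetch-depth: 0 on checkout)
  SHA=$(git log -1 --format='%H' -- "$test")
  OWNER=$(gh api "repos/{owner}/{repo}/commits/$SHA" --jq '.author.login // empty')
  # printf turns \n into real newlines; a plain double-quoted string would not
  BODY=$(printf 'This test was auto-quarantined after failing in 3+ runs on identical commits.\n\nOwner (last committer): %s\nSLA: 14 days from quarantine date\nQuarantined: %s' \
    "${OWNER:-unassigned}" "$(date -u +%Y-%m-%d)")
  gh issue create \
    --title "Flaky test quarantined: $test" \
    --body "$BODY" \
    --label "flaky-test,quarantine" \
    ${OWNER:+--assignee "$OWNER"}
done < <(jq -r '.newly_quarantined[]' quarantine-diff.json)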
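The shell loop relies on commit history. If you'd rather resolve owners from CODEOWNERS, a lookup could be sketched as follows. This is deliberately simplified and does not implement GitHub's full pattern syntax: a pattern starting with "/" is treated as a path prefix from the repository root, and anything else is matched as a glob against the file name. The owners_for helper and the sample file contents are hypothetical.

```python
from __future__ import annotations

from fnmatch import fnmatch


def owners_for(path: str, codeowners_text: str) -> list[str]:
    """Return the owners from the last matching rule (later rules win)."""
    owners: list[str] = []
    for line in codeowners_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        pattern, *rule_owners = line.split()
        if pattern.startswith("/"):
            # Simplification: root-anchored patterns are plain prefixes.
            matched = path.startswith(pattern.lstrip("/"))
        else:
            # Simplification: other patterns are globs on the file name only.
            matched = fnmatch(path.rsplit("/", 1)[-1], pattern)
        if matched:
            owners = rule_owners
    return owners


CODEOWNERS = """\
*                  @org/platform
/src/payments/     @alice
*.test.ts          @org/qa   # test files
/src/__tests__/    @bob @carol
"""

# The last matching rule is /src/__tests__/, so its owners are returned.
print(owners_for("src/__tests__/payment-flow.test.ts", CODEOWNERS))
```

An empty result is a useful signal in its own right: it means nobody owns the file, which is the separate problem mentioned above.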

Preventing Quarantine Rot

Without active maintenance, quarantine becomes a dumping ground. I've seen teams with 200+ quarantined tests and no plan to address any of them. At that point, you've effectively deleted a chunk of your test suite and called it process improvement.

Three policies keep quarantine from rotting:

  1. Aging policy. Run a scheduled workflow weekly that checks the quarantine date on every entry. Tests quarantined for more than 14 days without activity on their tracking issue get escalated. Tests quarantined for more than 30 days get a second escalation to the team lead.
  2. Auto-deletion. If a quarantined test's tracking issue has been closed as "won't fix" or has had zero activity for 45 days, automatically remove the test file and its quarantine entry. A test that nobody will fix and nobody will delete is pure noise.
  3. Auto-promotion back to CI. If a quarantined test passes consistently for 10+ consecutive quarantine runs, automatically remove it from quarantine and restore it to the main CI pipeline. The flake might have been fixed incidentally by a related change.

Here's a scheduled workflow that handles aging:

# .github/workflows/quarantine-maintenance.yml
name: Quarantine Maintenance
on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 9 AM UTC

jobs:
  maintain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check quarantine age and promote stable tests
        run: |
          python3 scripts/quarantine-maintenance.py \
            --registry .quarantine/registry.json \
            --escalate-after 14 \
            --delete-after 45 \
            --promote-after 10
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
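The per-entry decision inside a script like quarantine-maintenance.py could be sketched as below. The decide function and the lastIssueActivity field are assumptions for illustration. The sketch also simplifies the policies: escalation is based on quarantine age alone, and it skips the 30-day second escalation and the "won't fix" check.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def decide(entry: dict, now: datetime, escalate_after: int = 14,
           delete_after: int = 45, promote_after: int = 10) -> str:
    """Return "promote", "delete", "escalate", or "keep" for one registry entry."""
    if entry.get("consecutivePasses", 0) >= promote_after:
        return "promote"
    # `lastIssueActivity` is a hypothetical field; fall back to the quarantine date.
    last_activity = parse_ts(entry.get("lastIssueActivity", entry["quarantinedAt"]))
    if (now - last_activity).days >= delete_after:
        return "delete"
    if (now - parse_ts(entry["quarantinedAt"])).days > escalate_after:
        return "escalate"
    return "keep"


now = datetime(2026, 5, 20, 9, 0, tzinfo=timezone.utc)  # assumed "today"
entries = [
    {"path": "a.test.ts", "quarantinedAt": "2026-04-22T14:30:00Z", "consecutivePasses": 0},
    {"path": "b.test.ts", "quarantinedAt": "2026-05-01T09:00:00Z", "consecutivePasses": 10},
    {"path": "c.test.ts", "quarantinedAt": "2026-03-01T00:00:00Z", "consecutivePasses": 2},
    {"path": "d.test.ts", "quarantinedAt": "2026-05-15T09:00:00Z", "consecutivePasses": 1},
]
for entry in entries:
    print(entry["path"], decide(entry, now))
```

Checking promotion first means a test that has recovered is never deleted just because its tracking issue went quiet.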

Framework Integration: Where Each Exposes the Flake Signal

Your detection pipeline needs machine-readable test results with retry metadata. Here's how the major frameworks expose that.

Jest

Jest has no command-line retry flag, but jest-circus (the default test runner since Jest 27) supports jest.retryTimes(n), which you call from a test file or a setup file. Combine this with --json --outputFile=results.json to get structured output. The JSON report is built around each test's final status, so check how much retry detail your Jest version includes before relying on it; comparing results across runs of the same commit works even when only final statuses are available. Jest also supports --testPathIgnorePatterns for exclusion and --testPathPatterns for inclusion (named --testPathPattern before Jest 30), which maps directly onto the two-workflow split.

Playwright

Playwright has the best native flake detection of any major test framework. When you configure retries in playwright.config.ts, it automatically categorizes each test as "passed" (first try), "flaky" (failed then passed on retry), or "failed" (all attempts failed). The test report explicitly labels flaky tests, giving your detection script a first-class signal to work with. You can also access testInfo.retry at runtime to adjust test behavior on retries.
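Pulling that label out of a report might look like the sketch below. The field names (suites, specs, tests, status, and the "expected" value) describe the JSON reporter's layout as assumed here, and the sample report is hand-written, so compare it against a real report from your Playwright version before relying on it. The flaky_specs helper is hypothetical.

```python
def flaky_specs(report: dict) -> list:
    """Collect "file > title" for every spec with a test whose status is "flaky"."""
    found = []

    def walk(suite: dict) -> None:
        for spec in suite.get("specs", []):
            if any(t.get("status") == "flaky" for t in spec.get("tests", [])):
                file_name = spec.get("file", suite.get("file", "?"))
                found.append(f"{file_name} > {spec['title']}")
        for child in suite.get("suites", []):  # describe blocks nest as child suites
            walk(child)

    for top in report.get("suites", []):
        walk(top)
    return found


# Hand-written, trimmed sample in the assumed report shape.
report = {"suites": [{
    "title": "checkout.spec.ts", "file": "checkout.spec.ts",
    "specs": [
        {"title": "applies coupon", "file": "checkout.spec.ts",
         "tests": [{"status": "flaky",
                    "results": [{"status": "failed", "retry": 0},
                                {"status": "passed", "retry": 1}]}]},
        {"title": "shows total", "file": "checkout.spec.ts",
         "tests": [{"status": "expected",
                    "results": [{"status": "passed", "retry": 0}]}]},
    ],
    "suites": [],
}]}

print(flaky_specs(report))
```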

pytest

The pytest-rerunfailures plugin adds --reruns N support and introduces a "rerun" test outcome. Combined with JUnit XML output (--junitxml=results.xml), you get structured data that includes which tests were rerun. You can also use @pytest.mark.flaky(reruns=3) to annotate known flaky tests at the code level, but that's the manual approach you're trying to automate away.

Go test

Go doesn't have built-in retries, but go test -count=N runs each test N times in a single invocation, and go test -json gives you structured output. If a test passes on run 1 and fails on run 2 (or vice versa) within the same -count batch, that's a definitive flake signal. For exclusion, pass -skip (available since Go 1.20) a regex that matches the quarantined test names, or use build tags to exclude quarantined test files. Inverting the list with -run alone is impractical because Go's regular expressions have no negative lookahead.
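Detecting mixed outcomes in a -count batch takes only a few lines. The sketch below reads the event stream that go test -json emits, one JSON object per line with Action, Package, and Test fields, and reports tests that both passed and failed. The flaky_go_tests name and the sample events are made up for the example.

```python
import json
from collections import defaultdict


def flaky_go_tests(json_lines):
    """Given lines of `go test -json -count=N` output, return sorted
    (package, test) pairs that recorded both a pass and a fail."""
    seen = defaultdict(set)
    for line in json_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and any non-JSON build output
        event = json.loads(line)
        # Package-level events carry no "Test" key and are ignored.
        if "Test" in event and event.get("Action") in ("pass", "fail"):
            seen[(event["Package"], event["Test"])].add(event["Action"])
    return sorted(key for key, actions in seen.items()
                  if actions == {"pass", "fail"})


sample = [
    '{"Action":"run","Package":"example.com/app","Test":"TestReconnect"}',
    '{"Action":"pass","Package":"example.com/app","Test":"TestReconnect","Elapsed":0.01}',
    '{"Action":"fail","Package":"example.com/app","Test":"TestReconnect","Elapsed":0.02}',
    '{"Action":"pass","Package":"example.com/app","Test":"TestStable","Elapsed":0.01}',
    '{"Action":"pass","Package":"example.com/app","Test":"TestStable","Elapsed":0.01}',
    '{"Action":"fail","Package":"example.com/app","Elapsed":0.05}',
]
print(flaky_go_tests(sample))
```

In a workflow you would pipe the real output into a script like this, for example by passing sys.stdin as json_lines.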

Why This Matters More After the 50-Rerun Cap

GitHub's documentation states it plainly: "A workflow run can be re-run a maximum of 50 times." That includes both full reruns and partial job reruns.

50 sounds like a lot until you consider a busy monorepo. If your CI matrix runs tests across 4 OS variants and 3 Node versions, that's 12 jobs per push. A single "rerun failed jobs" click retries all failed jobs, and each attempt counts against the cap. If you've got two or three flaky tests that each fail intermittently, you can burn through 50 reruns on a single PR in a bad week.

The rerun cap makes quarantine a prerequisite, not a nice-to-have. If flaky tests stay in your main pipeline, you're spending a finite rerun budget on tests you already know are unreliable. That leaves fewer reruns available for legitimate infrastructure hiccups, like a runner failing to pull a Docker image or a transient network timeout during package installation.

Quarantining your known flaky tests also reduces CI compute costs. Every unnecessary rerun burns runner minutes you're paying for. Teams running on Tenki Runners, for example, save up to 50% on per-minute costs compared to GitHub-hosted runners, but even cheaper runners add up fast when you're re-executing the same unreliable tests three times per PR across dozens of PRs per day.

The Quarantine Registry Format

Whatever format you choose, the registry needs a few fields per test to power the lifecycle described above:

{
  "tests": [
    {
      "path": "src/__tests__/payment-flow.test.ts",
      "testName": "should process refund within timeout",
      "quarantinedAt": "2026-04-22T14:30:00Z",
      "failureRate": 0.35,
      "consecutivePasses": 0,
      "owner": "alice@example.com",
      "trackingIssue": 1842,
      "slaDeadline": "2026-05-06T14:30:00Z"
    },
    {
      "path": "src/__tests__/websocket-reconnect.test.ts",
      "testName": "reconnects after server restart",
      "quarantinedAt": "2026-05-01T09:00:00Z",
      "failureRate": 0.22,
      "consecutivePasses": 7,
      "owner": "bob@example.com",
      "trackingIssue": 1901,
      "slaDeadline": "2026-05-15T09:00:00Z"
    }
  ]
}

The consecutivePasses field is what drives auto-promotion. The second test in this example has passed 7 times in a row; three more clean runs and it returns to the main pipeline automatically. The slaDeadline field drives escalation. The first test has already passed its deadline, which means the maintenance workflow should be pinging someone.
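Here's a small sketch of reading those two fields in Python. The QuarantineEntry class, its method names, and the "current" date used in the example are assumptions; only the JSON field names come from the registry above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class QuarantineEntry:
    path: str
    consecutive_passes: int
    sla_deadline: datetime

    @classmethod
    def from_json(cls, raw: dict) -> "QuarantineEntry":
        # Normalise the trailing "Z" so fromisoformat() works before Python 3.11.
        deadline = datetime.fromisoformat(raw["slaDeadline"].replace("Z", "+00:00"))
        return cls(path=raw["path"],
                   consecutive_passes=raw["consecutivePasses"],
                   sla_deadline=deadline)

    def is_overdue(self, now: datetime) -> bool:
        return now > self.sla_deadline

    def passes_until_promotion(self, promote_after: int = 10) -> int:
        return max(0, promote_after - self.consecutive_passes)


raw_entries = [
    {"path": "src/__tests__/payment-flow.test.ts",
     "consecutivePasses": 0, "slaDeadline": "2026-05-06T14:30:00Z"},
    {"path": "src/__tests__/websocket-reconnect.test.ts",
     "consecutivePasses": 7, "slaDeadline": "2026-05-15T09:00:00Z"},
]
now = datetime(2026, 5, 10, tzinfo=timezone.utc)  # assumed "today" for the example

for entry in map(QuarantineEntry.from_json, raw_entries):
    print(entry.path, entry.is_overdue(now), entry.passes_until_promotion())
```

Parsing the timestamps into timezone-aware datetimes up front avoids comparing ISO strings by hand in every maintenance check.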

Putting It Together: The Full Pipeline

The complete system has four moving parts that form a closed loop:

  1. Detection runs as a post-test step in your main CI workflow. It compares results against history, identifies flake candidates, and moves them into quarantine once there is enough evidence.
  2. Routing splits your test execution across two workflows: the main gate (no quarantined tests) and the quarantine lane (only quarantined tests, non-blocking).
  3. Accountability creates issues, assigns owners, and sets SLA deadlines automatically when tests enter quarantine.
  4. Maintenance runs on a schedule to escalate overdue tests, promote recovered tests back to the main pipeline, and delete abandoned ones.

The net effect: your PR checks reflect actual code quality, not environmental noise. Your CI green rate goes up because it's measuring what it should measure. Your rerun budget is reserved for genuine infrastructure blips. And flaky tests get fixed or removed instead of silently eroding trust in the entire suite.

The hardest part isn't the automation. It's the organizational commitment to treat quarantine as a temporary state with an expiration date, not a permanent parking lot for tests nobody wants to deal with.

Tags

#flaky-tests #test-quarantine #ci-cd
