
When Your Security Scanner Gets Compromised
Every team that's lived with CI long enough develops the same coping mechanism: a failing check shows up on a PR, someone mutters "that's flaky," and they hit Re-run. The test passes on the second attempt. The PR merges. Nobody files a ticket.
This works until it doesn't. GitHub now caps reruns at 50 per workflow run, which means your safety valve has a hard limit. But the bigger problem isn't the cap. It's that habitual rerunning trains your team to ignore red checks, and that's how real regressions slip into production unnoticed.
The sustainable answer is a quarantine pipeline: a system that automatically detects flaky tests, moves them out of the critical path, assigns ownership, and escalates when they sit unfixed too long. Here's how to build one in GitHub Actions.
A flaky test is one that produces different results on the same code. But "same code" is deceptive. The commit hash might be identical while the runner's disk, network latency, clock skew, or available memory differ across runs. The detection problem is really about separating signal from noise across multiple dimensions.
Three reliable signals for automated detection:
The key insight: you don't need to determine why a test is flaky to quarantine it. You just need enough statistical confidence that its results aren't trustworthy. Three failures across five identical-commit runs? That's a 60% failure rate on unchanged code. Quarantine it.
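Here's a minimal sketch of that arithmetic, assuming each CI run leaves behind a simple record of commit, test, and outcome. The RunRecord shape and the failure_rates helper are hypothetical names for illustration, not part of any framework.

```python
from collections import defaultdict
from typing import NamedTuple


class RunRecord(NamedTuple):
    """One test outcome from one CI run (hypothetical record shape)."""
    commit: str
    test_id: str
    passed: bool


def failure_rates(records: list[RunRecord]) -> dict[tuple[str, str], float]:
    """Return the failure rate per (commit, test), only where outcomes are mixed."""
    outcomes: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for rec in records:
        outcomes[(rec.commit, rec.test_id)].append(rec.passed)

    rates = {}
    for key, results in outcomes.items():
        failures = results.count(False)
        # Mixed results on one commit are the flake signal.
        # All-pass is healthy; all-fail points at a real bug, not a flake.
        if 0 < failures < len(results):
            rates[key] = failures / len(results)
    return rates


# Three failures in five runs of the same commit.
records = [
    RunRecord("abc123", "refund-timeout", passed)
    for passed in (False, True, False, True, False)
]
print(failure_rates(records))
```

A test that fails on every run of a commit is deliberately left out: that pattern looks like a genuine regression and should keep blocking the PR.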
Manual tagging doesn't scale. If a human has to review each failure and decide "that's flaky" before marking it, you've built a process that depends on someone's goodwill and available time. The goal is zero-human-intervention detection.
The architecture is straightforward: after each test run, a post-processing step parses the results, compares against historical data, and updates a flake registry. Here's how that looks in practice:
name: Test with Flake Detection
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests with JSON output
        run: npm test -- --json --outputFile=test-results.json
        continue-on-error: true
      - name: Detect flakes
        if: always()
        run: |
          python3 scripts/detect-flakes.py \
            --results test-results.json \
            --commit ${{ github.sha }} \
            --history .flake-history/
      - name: Update quarantine list
        if: always()
        run: |
          python3 scripts/update-quarantine.py \
            --registry .quarantine/registry.json \
            --threshold 3
The detect-flakes.py script does the heavy lifting: it reads the JSON test output, checks whether any test that failed this run has passed on the same commit SHA in a previous run, and writes a flake candidate list. The update-quarantine.py script promotes candidates to quarantined status once they've been flagged a configurable number of times (in this example, three).
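The two scripts aren't shown in full here, so the following is only a sketch of the core logic they describe, under some assumptions: the history is modeled as a dict from commit SHA to the test IDs that have passed on it, and flag counts live in a plain dict. Both shapes, and the function names, are hypothetical.

```python
def detect_candidates(failed_now: set[str], history: dict[str, list[str]], commit: str) -> list[str]:
    """Tests that failed in this run but have passed before on the same commit SHA."""
    passed_before = set(history.get(commit, []))
    return sorted(failed_now & passed_before)


def promote(flag_counts: dict[str, int], candidates: list[str], threshold: int = 3) -> list[str]:
    """Bump each candidate's flag count; return tests that just reached the threshold."""
    newly_quarantined = []
    for test in candidates:
        flag_counts[test] = flag_counts.get(test, 0) + 1
        # Use == so a test is reported once, on the run that crosses the line.
        if flag_counts[test] == threshold:
            newly_quarantined.append(test)
    return newly_quarantined


# "refund" passed earlier on this SHA and fails now: a flake candidate.
# "login" has never passed on this SHA, so it stays a normal failure.
history = {"abc123": ["refund", "search"]}
candidates = detect_candidates({"refund", "login"}, history, "abc123")
counts = {"refund": 2}  # already flagged twice in earlier runs
print(candidates, promote(counts, candidates))
```

Keeping detection and promotion separate means a single unlucky run only produces a candidate, and quarantine needs repeated evidence.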
Store the registry in your repository or an external data store. A JSON file committed to the repo is simplest; a shared database or S3 bucket works better for monorepos or teams that need cross-repository flake tracking.
Once you've identified flaky tests, the next move is separating them from your main CI pipeline. The quarantined tests still run, but they don't gate PRs. This keeps your green rate honest while preserving the signal those tests produce.
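You need two workflows: a main CI workflow that gates PRs and skips quarantined tests, and a quarantine workflow that runs only the quarantined tests without gating anything.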
The main workflow reads the quarantine registry and excludes those test paths:
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build quarantine exclusion pattern
        id: quarantine
        run: |
          EXCLUDE=""
          if [ -f .quarantine/registry.json ]; then
            EXCLUDE=$(jq -r '[.tests[].path] | join("|")' .quarantine/registry.json)
          fi
          # Emit the flag only when there is something to exclude; an empty
          # pattern can match every path and skip the whole suite.
          if [ -n "$EXCLUDE" ]; then
            echo "exclude=--testPathIgnorePatterns='$EXCLUDE'" >> "$GITHUB_OUTPUT"
          else
            echo "exclude=" >> "$GITHUB_OUTPUT"
          fi
      - name: Run tests (excluding quarantined)
        run: npm test -- ${{ steps.quarantine.outputs.exclude }}
And the quarantine workflow runs the inverse set:
# .github/workflows/quarantine.yml
name: Quarantine Tests
on: [push, pull_request]
jobs:
  quarantine:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run quarantined tests only
        run: |
          # Join with "|" so the paths form a single regex alternation.
          INCLUDE=$(jq -r '[.tests[].path] | join("|")' .quarantine/registry.json)
          # Jest has no --retries CLI flag; set retries with jest.retryTimes(2)
          # in the test code. --testPathPatterns is the Jest 30 spelling
          # (earlier versions use --testPathPattern).
          npx jest --testPathPatterns="$INCLUDE" --json --outputFile=test-results.json
        continue-on-error: true
      - name: Report quarantine results
        if: always()
        run: |
          python3 scripts/report-quarantine.py \
            --results test-results.json \
            --notify slack
The continue-on-error: true on the quarantine run is deliberate. These tests are known-unreliable, so you don't want their failures to register as a failed check on the PR. But you still want the data, which the reporting step captures.
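Both workflows build their pattern with a jq one-liner, which treats each path as a raw regex. If you'd rather do that step in Python, here's a rough sketch that escapes the paths first and handles an empty registry. The helper names are hypothetical, and the flag names are simply the ones used in the workflows above.

```python
import json
import re
from typing import Optional

# Characters with special meaning in a JavaScript regular expression.
_JS_SPECIAL = re.compile(r"[.*+?^${}()|\[\]\\]")


def escape_for_js_regex(path: str) -> str:
    """Backslash-escape regex metacharacters so the path matches literally."""
    return _JS_SPECIAL.sub(lambda m: "\\" + m.group(0), path)


def quarantine_pattern(registry_json: str) -> str:
    """Build one regex alternation from the registry's test paths."""
    registry = json.loads(registry_json)
    # A file can hold several quarantined tests, so deduplicate the paths.
    paths = sorted({entry["path"] for entry in registry.get("tests", [])})
    return "|".join(escape_for_js_regex(p) for p in paths)


def jest_args(pattern: str, run_quarantined: bool) -> Optional[list[str]]:
    """Extra Jest CLI args, or None when the quarantine job has nothing to run."""
    if not pattern:
        # Empty registry: the main suite runs unfiltered, the quarantine job is skipped.
        return None if run_quarantined else []
    flag = "--testPathPatterns" if run_quarantined else "--testPathIgnorePatterns"
    return [f"{flag}={pattern}"]


registry = json.dumps({"tests": [
    {"path": "src/__tests__/payment-flow.test.ts"},
    {"path": "src/__tests__/websocket-reconnect.test.ts"},
]})
print(jest_args(quarantine_pattern(registry), run_quarantined=False))
```

Because both workflows read the same registry and derive opposite flags from one pattern, a test can't end up in both pipelines or in neither.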
Here's the uncomfortable truth about quarantine systems: a quarantined test without an owner and a deadline is worse than a red test. At least a red test creates urgency. A quarantined test creates comfort. It's out of the way, nobody's looking at it, and it'll sit there for months.
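When a test enters quarantine, the system should automatically open a tracking issue, assign an owner based on the file's recent git history, and set a fix deadline (14 days in the example below).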
You can automate this with the GitHub CLI in your quarantine detection workflow:
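# For each newly quarantined test, open an issue
jq -r '.newly_quarantined[]' quarantine-diff.json | while IFS= read -r test; do
  # Email of the most recent committer to the file (from git log, not blame)
  OWNER_EMAIL=$(git log --format='%ae' -1 -- "$test")
  # --assignee needs a GitHub login, not an email, so look up the author
  # login of the latest commit touching this path via the REST API.
  OWNER_LOGIN=$(gh api "repos/{owner}/{repo}/commits?path=$test&per_page=1" --jq '.[0].author.login // empty')
  # printf turns \n into real newlines; a plain double-quoted string would not.
  BODY=$(printf 'This test was auto-quarantined after failing in 3+ runs on identical commits.\n\nOwner (from git log): %s\nSLA: 14 days from quarantine date\nQuarantined: %s' "$OWNER_EMAIL" "$(date -u +%Y-%m-%d)")
  gh issue create \
    --title "Flaky test quarantined: $test" \
    --body "$BODY" \
    --label "flaky-test,quarantine" \
    ${OWNER_LOGIN:+--assignee "$OWNER_LOGIN"}
done
Without active maintenance, quarantine becomes a dumping ground. I've seen teams with 200+ quarantined tests and no plan to address any of them. At that point, you've effectively deleted a chunk of your test suite and called it process improvement.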
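Three policies keep quarantine from rotting: escalate any test still quarantined after its 14-day SLA, delete tests that have sat in quarantine for 45 days, and promote tests back to the main pipeline after 10 consecutive passes.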
Here's a scheduled workflow that handles aging:
# .github/workflows/quarantine-maintenance.yml
name: Quarantine Maintenance
on:
  schedule:
    - cron: '0 9 * * 1' # Every Monday at 9 AM UTC
jobs:
  maintain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check quarantine age and promote stable tests
        run: |
          python3 scripts/quarantine-maintenance.py \
            --registry .quarantine/registry.json \
            --escalate-after 14 \
            --delete-after 45 \
            --promote-after 10
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Your detection pipeline needs machine-readable test results with retry metadata. Here's how the major frameworks expose that.
Jest doesn't have native retries built into its core, but jest-circus (the default test runner since Jest 27) supports jest.retryTimes(n). Combine this with --json --outputFile=results.json to get structured output. The JSON report records a final status for each test; check whether your Jest version also exposes per-attempt retry data, and if it doesn't, fall back to comparing results across runs of the same commit. Jest also supports --testPathIgnorePatterns for exclusion and --testPathPatterns for inclusion (--testPathPattern before Jest 30), which maps directly onto the two-workflow split.
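As a starting point for the detection script, here's a small sketch that pulls failed tests out of a Jest --json report. It assumes the report has a top-level testResults array whose entries carry the file path in name and per-test entries in assertionResults with status and fullName; verify those fields against the output of your own Jest version. The failed_tests helper is a hypothetical name.

```python
import json


def failed_tests(report_json: str) -> list[tuple[str, str]]:
    """Return (file path, full test name) for every failed test in the report."""
    report = json.loads(report_json)
    failed = []
    for file_result in report.get("testResults", []):
        for assertion in file_result.get("assertionResults", []):
            if assertion.get("status") == "failed":
                failed.append((file_result["name"], assertion["fullName"]))
    return failed


# A trimmed-down report with one failing and one passing test.
sample_report = json.dumps({
    "testResults": [{
        "name": "/repo/src/__tests__/payment-flow.test.ts",
        "assertionResults": [
            {"fullName": "refunds should process refund within timeout", "status": "failed"},
            {"fullName": "refunds should reject negative amounts", "status": "passed"},
        ],
    }]
})
print(failed_tests(sample_report))
```

The output of this step is what gets compared against earlier runs of the same commit SHA to decide whether a failure is a flake candidate.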
Playwright has the best native flake detection of any major test framework. When you configure retries in playwright.config.ts, it automatically categorizes each test as "passed" (first try), "flaky" (failed then passed on retry), or "failed" (all attempts failed). The test report explicitly labels flaky tests, giving your detection script a first-class signal to work with. You can also access testInfo.retry at runtime to adjust test behavior on retries.
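If you use Playwright's JSON reporter, the flaky label can be read straight from the report. The sketch below assumes the report nests suites inside suites, that each suite lists specs, and that each spec lists tests carrying a status of "expected", "unexpected", "flaky", or "skipped". Treat that layout as an assumption to check against a real report from your Playwright version; the helper names are hypothetical.

```python
import json
from typing import Iterator


def iter_tests(suite: dict) -> Iterator[tuple[str, str, dict]]:
    """Yield (file, spec title, test entry) for a suite and all nested suites."""
    for spec in suite.get("specs", []):
        for test in spec.get("tests", []):
            yield spec.get("file", suite.get("file", "")), spec["title"], test
    for child in suite.get("suites", []):
        yield from iter_tests(child)


def flaky_tests(report_json: str) -> list[tuple[str, str]]:
    """Return (file, title) for tests the reporter marked as flaky."""
    report = json.loads(report_json)
    found = []
    for suite in report.get("suites", []):
        for file, title, test in iter_tests(suite):
            if test.get("status") == "flaky":
                found.append((file, title))
    return found


# One flaky spec at file level, one stable spec inside a nested describe block.
sample_report = json.dumps({"suites": [{
    "file": "checkout.spec.ts",
    "specs": [{"title": "applies coupon", "file": "checkout.spec.ts",
               "tests": [{"status": "flaky"}]}],
    "suites": [{"specs": [{"title": "refunds order", "file": "checkout.spec.ts",
                           "tests": [{"status": "expected"}]}]}],
}]})
print(flaky_tests(sample_report))
```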
The pytest-rerunfailures plugin adds --reruns N support and introduces a "rerun" test outcome. Combined with JUnit XML output (--junitxml=results.xml), you get structured data that includes which tests were rerun. You can also use @pytest.mark.flaky(reruns=3) to annotate known flaky tests at the code level, but that's the manual approach you're trying to automate away.
Go doesn't have built-in retries, but go test -count=N runs each test N times in a single invocation, and go test -json gives you structured output. If a test passes on run 1 and fails on run 2 (or vice versa) within the same -count batch, that's a definitive flake signal. For exclusion, Go's regex engine has no negative lookahead, so -run can't easily invert a list; use the -skip flag (Go 1.20 and later) with a pattern built from the quarantine list, or use build tags to exclude quarantined test files.
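Here's a sketch of how a detection script might read that signal from go test -json output. It relies on each output line being a JSON event with an Action field and, for test-level events, Package and Test fields; the mixed_outcome_tests helper is a hypothetical name.

```python
import json
from collections import defaultdict


def mixed_outcome_tests(json_lines: str) -> list[str]:
    """Return tests that both passed and failed within one `go test -json` stream."""
    outcomes: dict[tuple[str, str], set[str]] = defaultdict(set)
    for line in json_lines.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and any non-JSON build output
        event = json.loads(line)
        # Package-level pass/fail events carry no "Test" field; ignore them.
        if event.get("Action") in ("pass", "fail") and "Test" in event:
            outcomes[(event.get("Package", ""), event["Test"])].add(event["Action"])
    return sorted(
        f"{pkg}.{name}" for (pkg, name), seen in outcomes.items() if seen == {"pass", "fail"}
    )


# Two runs of each test, as produced with -count=2.
sample_stream = "\n".join([
    '{"Action":"pass","Package":"example.com/pay","Test":"TestRefund"}',
    '{"Action":"fail","Package":"example.com/pay","Test":"TestRefund"}',
    '{"Action":"pass","Package":"example.com/pay","Test":"TestStable"}',
    '{"Action":"pass","Package":"example.com/pay","Test":"TestStable"}',
    '{"Action":"fail","Package":"example.com/pay"}',
])
print(mixed_outcome_tests(sample_stream))
```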
GitHub's documentation states it plainly: "A workflow run can be re-run a maximum of 50 times." That includes both full reruns and partial job reruns.
50 sounds like a lot until you consider a busy monorepo. If your CI matrix runs tests across 4 OS variants and 3 Node versions, that's 12 jobs per push. A single "rerun failed jobs" click retries all failed jobs, and each attempt counts against the cap. If you've got two or three flaky tests that each fail intermittently, you can burn through 50 reruns on a single PR in a bad week.
The rerun cap makes quarantine a prerequisite, not a nice-to-have. If flaky tests stay in your main pipeline, you're spending a finite rerun budget on tests you already know are unreliable. That leaves fewer reruns available for legitimate infrastructure hiccups, like a runner failing to pull a Docker image or a transient network timeout during package installation.
Quarantining your known flaky tests also reduces CI compute costs. Every unnecessary rerun burns runner minutes you're paying for. Teams running on Tenki Runners, for example, save up to 50% on per-minute costs compared to GitHub-hosted runners, but even cheaper runners add up fast when you're re-executing the same unreliable tests three times per PR across dozens of PRs per day.
Whatever format you choose, the registry needs a few fields per test to power the lifecycle described above:
{
  "tests": [
    {
      "path": "src/__tests__/payment-flow.test.ts",
      "testName": "should process refund within timeout",
      "quarantinedAt": "2026-04-22T14:30:00Z",
      "failureRate": 0.35,
      "consecutivePasses": 0,
      "owner": "alice@example.com",
      "trackingIssue": 1842,
      "slaDeadline": "2026-05-06T14:30:00Z"
    },
    {
      "path": "src/__tests__/websocket-reconnect.test.ts",
      "testName": "reconnects after server restart",
      "quarantinedAt": "2026-05-01T09:00:00Z",
      "failureRate": 0.22,
      "consecutivePasses": 7,
      "owner": "bob@example.com",
      "trackingIssue": 1901,
      "slaDeadline": "2026-05-15T09:00:00Z"
    }
  ]
}
The consecutivePasses field is what drives auto-promotion. The second test in this example has passed 7 times in a row; three more clean runs and it returns to the main pipeline automatically. The slaDeadline field drives escalation. The first test has already passed its deadline, which means the maintenance workflow should be pinging someone.
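To show how those fields could drive the lifecycle, here's a sketch of the per-entry decision a maintenance script might make. The thresholds mirror the --promote-after 10 and --delete-after 45 flags from the maintenance workflow; reading --delete-after as days spent in quarantine is an assumption, and next_action is a hypothetical helper, not the actual script.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-06T14:30:00Z."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def next_action(entry: dict, now: datetime,
                promote_after: int = 10, delete_after_days: int = 45) -> str:
    """Decide what the maintenance run should do with one registry entry."""
    if entry["consecutivePasses"] >= promote_after:
        return "promote"    # stable again: return it to the main pipeline
    if now - parse_ts(entry["quarantinedAt"]) >= timedelta(days=delete_after_days):
        return "delete"     # assumed meaning: too long in quarantine
    if now >= parse_ts(entry["slaDeadline"]):
        return "escalate"   # SLA missed: ping the owner
    return "keep"


now = datetime(2026, 5, 10, tzinfo=timezone.utc)  # example date for illustration
overdue = {"consecutivePasses": 0, "quarantinedAt": "2026-04-22T14:30:00Z",
           "slaDeadline": "2026-05-06T14:30:00Z"}
recovering = {"consecutivePasses": 7, "quarantinedAt": "2026-05-01T09:00:00Z",
              "slaDeadline": "2026-05-15T09:00:00Z"}
print(next_action(overdue, now), next_action(recovering, now))
```

Checking promotion first means a test that has recovered is released even if its deadline has lapsed, which keeps escalation noise focused on tests that are still failing.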
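The complete system has four moving parts that form a closed loop: automated detection, the split between the main and quarantine workflows, ownership with an SLA, and scheduled maintenance that escalates, deletes, or promotes.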
The net effect: your PR checks reflect actual code quality, not environmental noise. Your CI green rate goes up because it's measuring what it should measure. Your rerun budget is reserved for genuine infrastructure blips. And flaky tests get fixed or removed instead of silently eroding trust in the entire suite.
The hardest part isn't the automation. It's the organizational commitment to treat quarantine as a temporary state with an expiration date, not a permanent parking lot for tests nobody wants to deal with.