GOAT bench
Pipeline-check ships a regression gate that runs the scanner
against a corpus of pinned, vulnerable-by-design CI/CD and IaC
testbeds, then asserts the rule pack still fires on the
misconfigurations each testbed is intended to teach. It's the
proof that "this scanner catches the things it claims to catch" on
real-world surface area, not just on the synthetic bench/cases/
fixtures.
Live status: GOAT bench workflow.
Current state
| Goat | Recall | Findings | Coverage |
|---|---|---|---|
| cicd-goat | 9 / 9 (100%) | 30 | GHA release workflow + 7 Jenkinsfiles |
| cicd-goat-comparison | 27 / 27 (100%) | - | GHA + npm slice of the 120-scenario cross-scanner matrix (pipeline-check leads) |
| cfngoat | 5 / 5 (100%) | 6 | cfngoat.yaml (IAM, KMS, Lambda, CloudTrail) |
| kubernetes-goat | 27 / 27 (100%) | 27 | scenarios/ manifest tree |
| terragoat | pending curation | - | Direct-HCL parsing shipped; expected.txt awaiting population |
41 check IDs are locked across the three fully curated goats
(cicd-goat 9, cfngoat 5, kubernetes-goat 27). Any rule change that
stops one from firing on its goat trips the bench in CI. The
cicd-goat-comparison goat gates the GHA + npm slice with 27 unique
curated check IDs; upstream, that testbed has since grown into a
120-scenario, 16-provider cross-scanner matrix that Pipeline-check
leads (see below).
How it works
bench/goats.yml declares the corpus. The runner (bench/goat_runner.py)
shallow-clones each goat into a tmpdir, runs pipeline-check with
the provider mix declared per goat (Jenkinsfile discovery is
globbed automatically when jenkins is in the goat's pipeline
list), and diffs findings against three committed inputs per goat:
- bench/goats/<slug>/expected.txt: hand-curated check IDs the goat is intended to teach. Each entry maps to a documented challenge or CIS benchmark control. Missing one fails the bench.
- bench/goats/<slug>/allowlist.txt: known false positives, with a one-line justification each. Allowlisted IDs don't gate.
- bench/goats/<slug>/baseline.json: the last committed scan output. Drift in either direction (new findings, resolved findings) surfaces in the report; unallowlisted new findings also gate.
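The three inputs combine into a plain set comparison. The sketch below is one way to express that gate, assuming findings have already been reduced to sets of check IDs. The names `GoatResult` and `evaluate_goat` are hypothetical and are not taken from bench/goat_runner.py, and `XX-999` is a made-up ID used only to show an unallowlisted new finding.

```python
from dataclasses import dataclass


@dataclass
class GoatResult:
    missing: set[str]    # expected IDs that did not fire (gates)
    new: set[str]        # fired IDs in neither baseline nor allowlist (gates)
    resolved: set[str]   # baseline IDs that no longer fire (reported only)

    @property
    def passed(self) -> bool:
        return not self.missing and not self.new


def evaluate_goat(fired, expected, allowlist, baseline) -> GoatResult:
    """Diff one goat's fired check IDs against its three committed inputs."""
    fired, expected = set(fired), set(expected)
    allowlist, baseline = set(allowlist), set(baseline)
    return GoatResult(
        missing=expected - fired,
        new=fired - baseline - allowlist,
        resolved=baseline - fired,
    )


# Illustrative data only: LMB-001 stopped firing, XX-999 is new.
result = evaluate_goat(
    fired={"CF-001", "CT-001", "KMS-001", "XX-999"},
    expected={"CF-001", "CT-001", "LMB-001"},
    allowlist=set(),
    baseline={"CF-001", "CT-001", "KMS-001", "LMB-001"},
)
```

Here `result.missing` and `result.new` are both non-empty, so the goat would fail. Adding `XX-999` to the allowlist would clear the second condition but not the first.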
CI runs nightly on master and on every PR that touches the rule
pack, the chain engine, or the bench code. PRs get a sticky
comment with the per-goat delta; every run uploads the full report
as a goat-bench-report workflow artifact for download.
Reproducing locally
git clone https://github.com/dmartinochoa/pipeline-check
cd pipeline-check
pip install -e .
# Full corpus
python bench/goat_runner.py --markdown
# One goat
python bench/goat_runner.py --goat cfngoat --json
Each clone is shallow (--depth 1). cicd-goat is the largest at
~5000 files; the full bench runs in 1-2 minutes. Exit code 0 means
every scannable goat hit 100% recall on its curated expected list
and no new findings appeared against the baseline that weren't
allowlisted.
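The exit-code rule reads as a conjunction over goats. A minimal sketch, assuming each goat's outcome has been summarized into a small record: `GoatSummary`, its fields, the `scannable` flag, and the choice of `1` as the failing exit code are all assumptions for illustration, since the text above only states what exit code 0 means.

```python
from dataclasses import dataclass


@dataclass
class GoatSummary:
    slug: str
    scannable: bool         # hypothetical flag; how the runner decides this is not shown
    expected_total: int     # size of the curated expected list
    expected_hit: int       # expected IDs that actually fired
    new_unallowlisted: int  # findings in neither baseline nor allowlist

    def recall_cell(self) -> str:
        # Format recall like the status table, e.g. "9 / 9 (100%)".
        if self.expected_total == 0:
            return "pending curation"
        pct = round(100 * self.expected_hit / self.expected_total)
        return f"{self.expected_hit} / {self.expected_total} ({pct}%)"


def bench_exit_code(summaries: list[GoatSummary]) -> int:
    """Return 0 only if every scannable goat has full recall and no new findings."""
    for s in summaries:
        if not s.scannable:
            continue
        if s.expected_hit < s.expected_total or s.new_unallowlisted > 0:
            return 1  # assumed non-zero failure code
    return 0


summaries = [
    GoatSummary("cicd-goat", True, 9, 9, 0),
    GoatSummary("terragoat", False, 0, 0, 0),
]
```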
Per-goat curation
Each goat's expected.txt documents the mapping from fired check
IDs to the goat's intended teaching.
cicd-goat — OWASP CI/CD Top 10
The seven Alice-in-Wonderland challenges
(solutions/)
are anchored against specific CICD-SEC risks:
| Check ID | OWASP CICD-SEC | Goat scenario |
|---|---|---|
| GHA-003 | SEC-4 PPE | release.yaml: script injection via ${{ github.event.* }} |
| GHA-014 | SEC-1 | Deploy missing environment: binding |
| GHA-015 | SEC-7 | Deploy missing timeout-minutes |
| JF-003 | SEC-5 / SEC-7 | agent any (drives cheshire-cat's Direct-PPE-to-Controller) |
| JF-005 | SEC-1 | caterpillar deploy missing manual input |
| JF-011 | SEC-10 | No buildDiscarder retention |
| JF-015 | SEC-7 | No timeout {} wrapper |
| JF-028 | SEC-9 | caterpillar + reportcov publish without SLSA (dormouse's Codecov-style scenario) |
| JF-030 | SEC-1 | mock-turtle dangerous shell idiom (auto-merge bypass) |
cicd-goat-comparison — 120-scenario cross-scanner matrix
greylag-ci/cicd-goat is a
purpose-built testbed for cross-scanner comparison: 120 scenarios
across 16 providers and formats, each isolating one CI/CD or IaC
vulnerability with a minimal fixture. It scores nine scanners head to
head (pipeline-check, Checkov, KICS, Trivy, zizmor, poutine, octoscan,
ciguard, actionlint).
On the 43 GitHub Actions scenarios, where the GHA-specialist scanners compete directly, Pipeline-check leads by a wide margin:
| Scanner | GHA scenarios |
|---|---|
| pipeline-check | 37 / 43 |
| zizmor | 17 / 43 |
| poutine | 14 / 43 |
| octoscan | 13 / 43 |
| Checkov | 10 / 43 |
| KICS | 8 / 43 |
| actionlint | 6 / 43 |
Across all 16 categories Pipeline-check is the top scorer in 14 and the sole leader in 11: GitHub Actions, GitLab CI (14/14), Azure Pipelines (7/7), CircleCI (6/7), Bitbucket Pipelines (7/7), Jenkins (4/6), Tekton (4/4), Argo (5/5), Drone (3/3), Buildkite (2/2), and Cloud Build (2/2). It ties Trivy for first on Dockerfile, Kubernetes, and Helm (3/3 each). Terraform and CloudFormation are scored only for the IaC scanners (Checkov, KICS, Trivy), which lead there.
The local GOAT bench gates a slice of this corpus: the runner scans the
GHA + npm scenarios and asserts 27 curated check IDs still fire, so a
rule regression that would cost Pipeline-check its leaderboard standing
trips CI here first. The full per-scenario expected values for every
scanner live in
tools/scenarios.yaml
in the goat repo.
cfngoat — vulnerable AWS CloudFormation
Every fire maps to a misconfiguration on cfngoat.yaml:
| Check ID | Misconfiguration |
|---|---|
| CF-001 | AWS::IAM::AccessKey long-lived static credential as code |
| CT-001 | Stack deploys AWS resources with no AWS::CloudTrail::Trail (CIS AWS 3.1) |
| KMS-001 | KMS::Key.LogsKey rotation disabled (CIS AWS 3.8) |
| LMB-001 | AnalysisLambda + CleanBucketFunction lack CodeSigningConfigArn (CIS SSC 2.4.2) |
| LMB-003 | AnalysisLambda env vars carry plaintext secret-shaped values (CIS AWS 3.7) |
kubernetes-goat — vulnerable K8s cluster + workloads
27 check IDs grouped by attack pattern. Each maps either to one of the goat's 22 documented scenarios or to the CIS Kubernetes Benchmark coverage that the goat's "K8s CIS benchmarks analysis" challenge explicitly demonstrates:
- Container escape (challenges 2 + 4): K8S-002/003/004 (hostNetwork/PID/IPC), K8S-005 (privileged), K8S-013/014 (hostPath / sensitive hostPath).
- Privilege escalation / root hardening: K8S-006 (allowPrivilegeEscalation), K8S-007/035 (runAsNonRoot / runAsUser 0), K8S-009 (capabilities).
- Sandboxing defense-in-depth: K8S-008 (readOnlyRootFilesystem), K8S-010 (seccompProfile).
- ServiceAccount token surface (challenge 12): K8S-011/012/034.
- RBAC (challenge 16): K8S-020 (cluster-admin binding).
- Sensitive credentials (challenge 1): K8S-018 (Secret literals).
- Namespace / network isolation (challenges 11 + 20): K8S-019/023/031/032.
- Image supply chain (challenges 7, 15): K8S-001 (unpinned digest).
- DoS / resource exhaustion (challenge 13): K8S-015/016 (resource limits), K8S-033 (ResourceQuota).
- Control-plane abuse: K8S-030.
- Operational hygiene: K8S-024 (probes).
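The slash shorthand in the list above (for example K8S-002/003/004) abbreviates several full check IDs. As a quick consistency check, the sketch below expands the shorthand and confirms that the groups add up to 27 distinct IDs. The helper `expand_ids` is hypothetical and is not part of the bench; expected.txt itself presumably lists full IDs.

```python
def expand_ids(shorthand: str) -> list[str]:
    """Expand 'K8S-002/003/004' into ['K8S-002', 'K8S-003', 'K8S-004']."""
    head, *rest = shorthand.split("/")
    prefix, _, _ = head.rpartition("-")
    return [head] + [f"{prefix}-{suffix}" for suffix in rest]


# Shorthand copied from the grouped list, keyed by attack pattern.
groups = {
    "Container escape": ["K8S-002/003/004", "K8S-005", "K8S-013/014"],
    "Privilege escalation / root hardening": ["K8S-006", "K8S-007/035", "K8S-009"],
    "Sandboxing defense-in-depth": ["K8S-008", "K8S-010"],
    "ServiceAccount token surface": ["K8S-011/012/034"],
    "RBAC": ["K8S-020"],
    "Sensitive credentials": ["K8S-018"],
    "Namespace / network isolation": ["K8S-019/023/031/032"],
    "Image supply chain": ["K8S-001"],
    "DoS / resource exhaustion": ["K8S-015/016", "K8S-033"],
    "Control-plane abuse": ["K8S-030"],
    "Operational hygiene": ["K8S-024"],
}

all_ids = [
    check_id
    for shorthands in groups.values()
    for shorthand in shorthands
    for check_id in expand_ids(shorthand)
]
print(len(all_ids), len(set(all_ids)))
```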
terragoat status
bridgecrewio/terragoat
is in the corpus and scans via --tf-source terraform/aws
(direct-HCL parsing shipped post-1.3.0). The goat entry in
bench/goats.yml
is active but expected.txt has not yet been curated. The next
step is to run python bench/goat_runner.py --goat terragoat
--suggest and hand-curate the expected set against terragoat's
documented misconfigurations.
Adding a goat
- Add the goat to bench/goats.yml with a pinned ref: and the provider mix to scan.
- python bench/goat_runner.py --goat <slug> --suggest writes a candidate expected.txt populated with every check ID the current scan fires.
- Hand-curate expected.txt down to the IDs the goat is intended to teach. Document the mapping inline.
- python bench/goat_runner.py --goat <slug> --update-baseline records the drift reference.
- Commit expected.txt, allowlist.txt (empty is fine for now), and baseline.json.
The bench workflow picks the new goat up automatically.
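During the hand-curation step it can help to check that every kept ID carries its inline note. The sketch below assumes a layout of one check ID per line with an optional `#` comment; that layout is a guess for illustration, not the documented expected.txt format, and `parse_expected` and `undocumented` are hypothetical helpers.

```python
def parse_expected(text: str) -> dict[str, str]:
    """Map check ID -> inline note, skipping blank lines and full-line comments."""
    entries: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        check_id, _, note = line.partition("#")
        entries[check_id.strip()] = note.strip()
    return entries


def undocumented(entries: dict[str, str]) -> list[str]:
    """Return the IDs that still lack an inline mapping note."""
    return sorted(cid for cid, note in entries.items() if not note)


# Assumed file layout, shown with three cfngoat IDs from the table above.
sample = """\
# cfngoat: IDs the goat is intended to teach
CF-001   # AWS::IAM::AccessKey static credential
KMS-001  # LogsKey rotation disabled
LMB-001
"""

entries = parse_expected(sample)
print(undocumented(entries))
```

In this sample only LMB-001 is reported, since it is the one entry without a note.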