trufflehog/scripts/test/diff_corpora_results.py
Mustansir 3a022f9d59 Automate corpora testing in CI (#4927)
* add detector corpora test workflow and script

* only run once per PR, make comment descriptive, add handling for manual runs to get PR issue number

* comment out types to see result on all commits

* uncomment types

* remove table from comment

* comment out types

* Phase 0: add explicit pipefail and capture trufflehog stderr

* Phase 1: differential diffing PR vs main

* DEMO: loosen Stripe regex (will revert)

* DEMO: loosen JDBC regex (will revert)

* Phase 1 fix: add --allow-verification-overlap, fix no-diff detection

The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.

Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.

* revert jdbc detector change

* Phase 2: detector scoping, new-detector handling, blast radius, status emoji

* DEMO: loosen JDBC + add fictional acmevault detector

* Phase 2 fix: harden corpus byte counting against early trufflehog exit

awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.
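
The fix above lives in the workflow's shell step; the same "default to 0, validate before arithmetic" idea is sketched here in Python purely as an illustration. The function names `parse_byte_count` and `read_byte_count` are hypothetical and do not come from the repository.

```python
from pathlib import Path


def parse_byte_count(text):
    """Return the non-negative integer in `text`, or 0 if it is empty or malformed."""
    text = text.strip()
    # isdigit() rejects "", signs, and embedded spaces; isascii() rejects
    # non-ASCII digit characters that int() could not parse.
    if text.isascii() and text.isdigit():
        return int(text)
    return 0


def read_byte_count(path):
    """Read a byte-count file, treating a missing or unreadable file as 0."""
    try:
        return parse_byte_count(Path(path).read_text(encoding="utf-8"))
    except OSError:
        return 0
```

With this shape, an empty bytes file (the SIGPIPE case described above) contributes 0 to the running total instead of breaking the arithmetic.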

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (1/3): add hack/extract-keywords for detector keyword introspection

Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.

AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (2/3): add Layer 1 keyword corpus builder + workflow integration

build_keyword_corpus.py queries GitHub Code Search for each changed
detector's pre-filter keywords and emits a zstd-compressed JSONL whose
shape matches the existing S3 corpus exactly: each line is
`{"provenance": {...}, "content": "<raw file content>"}`. The corpora
script's existing `unzstd | jq -r .content` pipe handles it unchanged —
provenance is descriptive only and never reaches trufflehog.

Rate-limit policy is header-driven: the search bucket's
X-RateLimit-Remaining and X-RateLimit-Reset headers gate every call,
with a 2.1s floor between requests as belt-and-suspenders. 403/429s
honor Retry-After or fall back to the reset window. Cap is 100 unique
results per detector, deduped on (repo, path, sha), with a per-keyword
sub-cap so one popular keyword can't starve the others.
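
A minimal sketch of the header-driven wait policy described above, not the actual build_keyword_corpus.py code. It assumes headers arrive as a plain dict with the exact casing shown, that `Retry-After` is expressed in seconds, and that `X-RateLimit-Reset` is a Unix timestamp; the function name `seconds_to_wait` is hypothetical.

```python
MIN_INTERVAL_S = 2.1  # floor between requests, taken from the commit message


def seconds_to_wait(status, headers, now, last_request_at):
    """Return how long to sleep before issuing the next search request."""
    # Floor: never send two requests closer together than MIN_INTERVAL_S.
    wait = max(0.0, MIN_INTERVAL_S - (now - last_request_at))

    reset_at = float(headers.get("X-RateLimit-Reset") or 0)
    until_reset = max(0.0, reset_at - now)

    if status in (403, 429):
        # Throttled: honor Retry-After, else fall back to the reset window.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            return max(wait, float(retry_after))
        return max(wait, until_reset)

    remaining = headers.get("X-RateLimit-Remaining")
    if remaining is not None and int(remaining) <= 0:
        # Bucket exhausted: wait for the reset before trying again.
        return max(wait, until_reset)
    return wait
```

Keeping the decision in a pure function of `(status, headers, now, last_request_at)` makes the policy testable without touching the network.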

A sidecar JSON reports per-detector fetch counts and a thin_l1 list of
detectors whose total returned results were zero (or whose keyword
extraction failed). The diff script reads it via a new
--keyword-corpus-meta arg and renders a single contiguous blockquote
callout above the per-detector details — a sidecar instead of an
in-corpus signal because stdin metadata is dropped from trufflehog's
findings output.

Workflow change: a new "Build keyword corpus (Layer 1)" step fires after
detector detection and overwrites DATASETS via $GITHUB_ENV to append the
keyword corpus path. The corpora script picks it up unchanged through
its existing local-file branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 complete - Heatmap visualization

* Phase 4 rework (1/2): emit heatmap-grid.json sidecar from render_heatmap.py

Add a JSON sidecar that captures the same Δ matrix the PNG renders. The
diff script consumes this to render an emoji-bucketed Markdown table —
GitHub's PR-comment Markdown sanitizer strips data: URLs and serves
artifact zips behind auth, so neither inline base64 nor an artifact
<img src> embed actually displays. The PNG stays for artifact archival
and click-through.

Sidecar shape: {detectors, decoders, deltas, _layout}, with _layout
documenting the deltas[i][j] orientation inline so future readers don't
have to reverse-engineer it. Emitted whenever the grid is non-empty,
even if matplotlib import fails — the comment never depends on the PNG.
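
As an illustration of the sidecar shape named above, the sketch below builds a `{detectors, decoders, deltas, _layout}` object from per-pair deltas. The orientation (rows are detectors, columns are decoders), the `_layout` wording, and the function names are assumptions; the commit message states only that `_layout` documents the orientation.

```python
import json


def build_grid(deltas_by_pair):
    """Build the grid dict from a {(detector, decoder): delta} mapping."""
    detectors = sorted({det for det, _ in deltas_by_pair})
    decoders = sorted({dec for _, dec in deltas_by_pair})
    # Assumed orientation: one row per detector, one column per decoder.
    deltas = [
        [deltas_by_pair.get((det, dec), 0) for dec in decoders]
        for det in detectors
    ]
    return {
        "detectors": detectors,
        "decoders": decoders,
        "deltas": deltas,
        "_layout": "deltas[i][j] is the delta for detectors[i] under decoders[j]",
    }


def dump_grid(grid):
    """Serialize the grid as JSON text, independent of any plotting library."""
    return json.dumps(grid, ensure_ascii=False, indent=2)
```

Because nothing here imports matplotlib, the JSON can be written even when the PNG cannot be rendered.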

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 rework (2/2): replace data-URL embed with emoji-bucketed Markdown table

GitHub's PR-comment Markdown sanitizer strips data: URLs from user
content, so the inline base64 PNG embed shipped in the prior commit
rendered as a broken link in the comment DOM (no <img> tag emitted).
Artifact-zip URLs require auth to download, so a fallback ![](...) link
is also a non-starter — it would render as a broken image.

Switch to a per-(detector, decoder) Δ table built from the grid JSON
sidecar render_heatmap.py now emits. Cells use emoji buckets aligned
with the existing status-emoji thresholds so the visual weight matches
the summary table:

  🟥 Δ ≥ +6   (matches NEW > 5 → 🔴)
  🟧 +1..+5
   0
  🟦 ≤ −1

Renders identically on web, mobile, email notifications, and CI log
replay — every surface a PR comment lands on. The colored matplotlib
PNG stays as a workflow artifact; when --heatmap-artifact-url is
supplied, the table is followed by a click-through link for reviewers
who want the rich version.
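
The bucket thresholds listed above could be sketched as follows. This is an illustration rather than the script's actual code: the names `bucket` and `cell` are hypothetical, and because the zero bucket's symbol is not legible in the list above, a plain `0` is used for that case.

```python
def bucket(delta):
    """Map a per-(detector, decoder) delta to its emoji bucket."""
    if delta >= 6:
        return "🟥"   # aligned with the NEW > 5 regression threshold
    if delta >= 1:
        return "🟧"   # +1..+5
    if delta <= -1:
        return "🟦"   # any decrease
    return "0"        # placeholder for the zero bucket (assumption)


def cell(delta):
    """Render one Markdown table cell: bucket symbol plus the signed delta."""
    if delta == 0:
        return bucket(delta)
    return f"{bucket(delta)} {delta:+d}"
```

A row for one detector is then the detector name followed by `cell(delta)` for each decoder column, joined with `|`.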

Workflow YAML mirrors the rename (--heatmap-png → --heatmap-grid) and
drops the base64-blob log filter — the report is back to plain
human-readable text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 5 complete - Polish

* cleanup, enable verification

* fix bug

* optimizations

* cache keywords corpus

* rewrite comment message

* cache github api corpus per keyword

* cleanup

* remove github corpus

* revert changes for testing

* move Configure AWS credentials step to run only when detector changes are detected

* revert unnecessary changes

* cleanup + bugbot fixes

* run test with bigger (30gb) dataset, loosen jdbc regex

* optimizations

* bugbot fixes

* revert jdbc changes and bugbot fix

* run only on regex and/or keywords change

* bugbot fixes

* bugbot fix

* incorporate brad's comments, loosen jdbc regex to run a test to ensure everything works as expected

* revert test changes

* fix misleading bench skipped message

* run only once when PR opens

* pipe stderr directly to CI log instead of writing to file, loosen jdbc regex to trigger workflow

* testing: run on all commits

* add archive timeout

* remove bigger dataset

* revert testing changes

---------

Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 14:47:53 +05:00

273 lines
9.3 KiB
Python
Executable File
#!/usr/bin/env python3
"""
Diffs two trufflehog JSONL outputs (main vs PR build) and emits a Markdown
report to stdout.
Identity per finding: (DetectorName, Raw or RawV2 fallback). Set semantics —
duplicates within a single scan collapse into one identity, so a regex change
either adds a new (detector, secret) identity or removes one.
Verification is disabled at scan time (--no-verification) to avoid network
calls against a large corpus where thousands of matches could dominate runtime.
The diff measures regex match changes only.
When --changed-detectors is provided, the report focuses on the detectors
changed by the PR. Detectors flagged via --new-detectors are rendered with 🆕
status and their PR-side counts only (no main baseline).
Usage:
    diff_corpora_results.py <main.jsonl> <pr.jsonl>
        [--changed-detectors=<csv>]
        [--new-detectors=<csv>]
"""
import argparse
import json
import sys
from collections import defaultdict
PREAMBLE = (
"Scans a corpus of real-world public code against only the detectors "
"changed in this PR, then compares unique match counts between the PR "
"build and the main baseline to catch regex regressions. Verification "
"is disabled — each detector's regex is measured independently."
)
STATUS_KEY = (
    "- 🔴 regression: >5 new, >20% increase over main, or any removed\n"
    "- ⚠️ warning: 1-5 new and ≤20% increase over main\n"
    "- ✅ clean\n"
    "- 🆕 new detector (no baseline)"
)
# Marker on the very first line of the body so peter-evans/find-comment can
# locate the sticky comment via substring match. Workflow file references the
# same literal — keep the two in sync.
STICKY_COMMENT_MARKER = "<!-- detector-bench -->"
def parse_csv(s):
    """Parse a comma-separated detector list into normalized name set.

    Strips ``.v<n>`` version suffixes and lowercases. JSONL DetectorName is the
    proto enum name (e.g., ``JDBC``); we match case-insensitively by name only,
    since version doesn't appear in the output. Versioned scoping happens at
    the trufflehog --include-detectors level.
    """
    if not s:
        return set()
    out = set()
    for item in s.split(","):
        item = item.strip()
        if not item:
            continue
        if "." in item:
            item = item.split(".", 1)[0]
        out.add(item.lower())
    return out
def load_findings(path):
    """Returns dict: detector_name -> {"identities": set[str], "total": int}."""
    by_detector = defaultdict(lambda: {"identities": set(), "total": 0})
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue
            detector = obj.get("DetectorName") or ""
            if not detector:
                continue
            raw = obj.get("Raw") or obj.get("RawV2") or ""
            by_detector[detector]["identities"].add(raw)
            by_detector[detector]["total"] += 1
    return by_detector
def status_emoji(new_count, removed_count, unique_main):
    """Hybrid threshold: 🔴 on absolute (>5) OR relative (>20% of main) NEW, OR any REMOVED."""
    if removed_count > 0:
        return "🔴"
    if new_count > 5 or new_count > 0.20 * max(unique_main, 1):
        return "🔴"
    if new_count > 0:
        return "⚠️"
    return ""
def build_top_line_summary(rows, changed):
    regressed = sum(1 for r in rows if not r["is_new"] and r["emoji"] == "🔴")
    warned = sum(1 for r in rows if not r["is_new"] and r["emoji"] == "⚠️")
    new_count = sum(1 for r in rows if r["is_new"])
    clean = sum(1 for r in rows if r["emoji"] == "")
    scoped = ", ".join(f"`{d}`" for d in sorted(changed)) if changed else ""
    parts = []
    if regressed:
        parts.append(f"{regressed} regressed")
    if warned:
        parts.append(f"{warned} warned")
    parts += [f"{new_count} new", f"{clean} clean"]
    summary = f"**{' · '.join(parts)}**"
    if scoped:
        summary += f" \u00a0|\u00a0 Scoped to: {scoped}"
    return summary
def render(main, pr, changed=None, new_detectors=None):
    new_detectors = new_detectors or set()
    if changed:
        all_names = {d for d in (set(main) | set(pr))
                     if d.lower() in changed}
        # Detectors that the PR claims to have changed (or added) but that
        # produced zero matches on either side. These don't appear in JSONL,
        # so we surface them as a warning row.
        seen_lower = {d.lower() for d in (set(main) | set(pr))}
        missing = sorted(d for d in changed if d not in seen_lower)
    else:
        all_names = set(main) | set(pr)
        missing = []
    _empty = {"identities": set(), "total": 0}
    rows = []
    has_diff = False
    for d in sorted(all_names):
        # A detector is only treated as fully new if the new_detectors set
        # says so AND main produced no findings for it. When a PR modifies an
        # existing version and adds a new version of the same detector (e.g.
        # jdbc.v1 + jdbc.v2), both collapse to "jdbc" in new_detectors but
        # main still ran against the existing version, so its results must
        # not be discarded.
        is_new = d.lower() in new_detectors and d not in main
        m = main.get(d, _empty)
        p = pr.get(d, _empty)
        new_ids = p["identities"] - m["identities"]
        removed_ids = m["identities"] - p["identities"]
        if is_new:
            emoji = "🆕"
        else:
            emoji = status_emoji(len(new_ids), len(removed_ids), len(m["identities"]))
        if new_ids or removed_ids or m["total"] != p["total"]:
            has_diff = True
        rows.append({
            "detector": d,
            "is_new": is_new,
            "emoji": emoji,
            "total_main": m["total"],
            "total_pr": p["total"],
            "unique_main": len(m["identities"]),
            "unique_pr": len(p["identities"]),
            "new_count": len(new_ids),
            "removed_count": len(removed_ids),
        })
    parts = [
        STICKY_COMMENT_MARKER,
        "## Corpora Test Results",
        "",
        PREAMBLE,
        "",
    ]
    if rows:
        parts += [build_top_line_summary(rows, changed), ""]
    if not rows and not missing:
        parts += ["_(No findings on either side for the changed detectors.)_", ""]
        return "\n".join(parts)
    if rows:
        if has_diff or any(r["is_new"] for r in rows):
            # New detectors first, then the largest combined churn, then name.
            rows.sort(
                key=lambda r: (
                    0 if r["is_new"] else 1,
                    -(r["new_count"] + r["removed_count"]),
                    r["detector"],
                )
            )
        else:
            rows.sort(key=lambda r: r["detector"])
        cols = ["Status", "Detector", "Unique matches (main)", "Unique matches (PR)",
                "New", "Removed"]
        aligns = ["", "", "---:", "---:", "---:", "---:"]
        parts += [
            "| " + " | ".join(cols) + " |",
            "|" + "|".join(a if a else "---" for a in aligns) + "|",
        ]
        for r in rows:
            if r["is_new"]:
                cells = [
                    r["emoji"],
                    r["detector"],
                    "",
                    str(r["unique_pr"]),
                    "",
                    "",
                ]
            else:
                cells = [
                    r["emoji"],
                    r["detector"],
                    str(r["unique_main"]),
                    str(r["unique_pr"]),
                    str(r["new_count"]),
                    str(r["removed_count"]),
                ]
            parts.append("| " + " | ".join(cells) + " |")
        parts.append("")
        parts.append(STATUS_KEY)
        parts.append("")
    if missing:
        parts += [
            "### ⚠️ Changed detectors with zero matches in both builds",
            "",
            "These detectors were modified by the PR but produced no matches "
            "against the corpus on either side. Could be a deliberate scope "
            "narrowing or, more concerning, a regex so loose the engine "
            "silently filtered the flood (issue #3578). Worth a manual look.",
            "",
        ]
        for d in missing:
            parts.append(f"- `{d}`")
        parts.append("")
    return "\n".join(parts)
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("main_jsonl")
    parser.add_argument("pr_jsonl")
    parser.add_argument("--changed-detectors", default="",
                        help="CSV of detectors changed in PR; filters report.")
    parser.add_argument("--new-detectors", default="",
                        help="CSV of detectors present in PR but not main; rendered with 🆕.")
    args = parser.parse_args()
    main_findings = load_findings(args.main_jsonl)
    pr_findings = load_findings(args.pr_jsonl)
    changed = parse_csv(args.changed_detectors)
    new_detectors = parse_csv(args.new_detectors)
    sys.stdout.write(render(
        main_findings,
        pr_findings,
        changed=changed if changed else None,
        new_detectors=new_detectors,
    ))


if __name__ == "__main__":
    main()
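
To make the hybrid threshold concrete, the standalone sketch below restates `status_emoji` from the script and walks a few illustrative inputs through it. The input values are made up for the example; only the function body mirrors the file above.

```python
def status_emoji(new_count, removed_count, unique_main):
    """Restated from the script: absolute (>5) OR relative (>20%) NEW, OR any REMOVED."""
    if removed_count > 0:
        return "🔴"
    if new_count > 5 or new_count > 0.20 * max(unique_main, 1):
        return "🔴"
    if new_count > 0:
        return "⚠️"
    return ""


# (new, removed, unique matches on main) for a few hypothetical detectors.
examples = {
    "no change": (0, 0, 100),
    "3 new on a base of 100": (3, 0, 100),   # 3% increase, under both limits
    "3 new on a base of 10": (3, 0, 10),     # 30% increase trips the relative limit
    "6 new on a base of 1000": (6, 0, 1000), # trips the absolute limit
    "1 removed": (0, 1, 1000),               # any removal is a regression
}
results = {name: status_emoji(*args) for name, args in examples.items()}
```

Note the `max(unique_main, 1)` guard: with no matches on main, a single new match already exceeds the relative limit, so the warning state is only reachable when main has at least five unique matches.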