Files
Mustansir 3a022f9d59 Automate corpora testing in CI (#4927)
* add detector corpora test workflow and script

* only run once per PR, make comment descriptive, add handling for manual runs to get PR issue number

* comment out types to see result on all commits

* uncomment types

* remove table from comment

* comment out types

* Phase 0: add explicit pipefail and capture trufflehog stderr

* Phase 1: differential diffing PR vs main

* DEMO: loosen Stripe regex (will revert)

* DEMO: loosen JDBC regex (will revert)

* Phase 1 fix: add --allow-verification-overlap, fix no-diff detection

The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.

Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.

* revert jdbc detector change

* Phase 2: detector scoping, new-detector handling, blast radius, status emoji

* DEMO: loosen JDBC + add fictional acmevault detector

* Phase 2 fix: harden corpus byte counting against early trufflehog exit

awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (1/3): add hack/extract-keywords for detector keyword introspection

Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.

AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (2/3): add Layer 1 keyword corpus builder + workflow integration

build_keyword_corpus.py queries GitHub Code Search for each changed
detector's pre-filter keywords and emits a zstd-compressed JSONL whose
shape matches the existing S3 corpus exactly: each line is
`{"provenance": {...}, "content": "<raw file content>"}`. The corpora
script's existing `unzstd | jq -r .content` pipe handles it unchanged —
provenance is descriptive only and never reaches trufflehog.

Rate-limit policy is header-driven: the search bucket's
X-RateLimit-Remaining and X-RateLimit-Reset headers gate every call,
with a 2.1s floor between requests as belt-and-suspenders. 403/429s
honor Retry-After or fall back to the reset window. Cap is 100 unique
results per detector, deduped on (repo, path, sha), with a per-keyword
sub-cap so one popular keyword can't starve the others.

A sidecar JSON reports per-detector fetch counts and a thin_l1 list of
detectors whose total returned results were zero (or whose keyword
extraction failed). The diff script reads it via a new
--keyword-corpus-meta arg and renders a single contiguous blockquote
callout above the per-detector details — a sidecar instead of an
in-corpus signal because stdin metadata is dropped from trufflehog's
findings output.

Workflow change: a new "Build keyword corpus (Layer 1)" step fires after
detector detection and overwrites DATASETS via $GITHUB_ENV to append the
keyword corpus path. The corpora script picks it up unchanged through
its existing local-file branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 complete - Heatmap visualization

* Phase 4 rework (1/2): emit heatmap-grid.json sidecar from render_heatmap.py

Add a JSON sidecar that captures the same Δ matrix the PNG renders. The
diff script consumes this to render an emoji-bucketed Markdown table —
GitHub's PR-comment Markdown sanitizer strips data: URLs and serves
artifact zips behind auth, so neither inline base64 nor an artifact
<img src> embed actually displays. The PNG stays for artifact archival
and click-through.

Sidecar shape: {detectors, decoders, deltas, _layout}, with _layout
documenting the deltas[i][j] orientation inline so future readers don't
have to reverse-engineer it. Emitted whenever the grid is non-empty,
even if matplotlib import fails — the comment never depends on the PNG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 rework (2/2): replace data-URL embed with emoji-bucketed Markdown table

GitHub's PR-comment Markdown sanitizer strips data: URLs from user
content, so the inline base64 PNG embed shipped in the prior commit
rendered as a broken link in the comment DOM (no <img> tag emitted).
Artifact-zip URLs require auth to download, so a fallback ![](...) link
is also a non-starter — it would render as a broken image.

Switch to a per-(detector, decoder) Δ table built from the grid JSON
sidecar render_heatmap.py now emits. Cells use emoji buckets aligned
with the existing status-emoji thresholds so the visual weight matches
the summary table:

  🟥 Δ ≥ +6   (matches NEW > 5 → 🔴)
  🟧 +1..+5
   0
  🟦 ≤ −1

Renders identically on web, mobile, email notifications, and CI log
replay — every surface a PR comment lands on. The colored matplotlib
PNG stays as a workflow artifact; when --heatmap-artifact-url is
supplied, the table is followed by a click-through link for reviewers
who want the rich version.

Workflow YAML mirrors the rename (--heatmap-png → --heatmap-grid) and
drops the base64-blob log filter — the report is back to plain
human-readable text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 5 complete - Polish

* cleanup, enable verification

* fix bug

* optimizations

* cache keywords corpus

* rewrite comment message

* cache github api corpus per keyword

* cleanup

* remove github corpus

* revert changes for testing

* move Configure AWS credentials step to run only when detector changes are detected

* revert unnecessary changes

* cleanup + bugbot fixes

* run test with bigger (30gb) dataset, loosen jdbc regex

* optimizations

* bugbot fixes

* revert jdbc changes and bugbot fix

* run only on regex and/or keywords change

* bugbot fixes

* bugbot fix

* incorporate brad's comments, loosen jdbc regex to run a test to ensure everything works as expected

* revert test changes

* fix misleading bench skipped message

* run only once when PR opens

* pipe stderr directly to CI log instead of writing to file, loosen jdbc regex to trigger workflow

* testing: run on all commits

* add archive timeout

* remove bigger dataset

* revert testing changes

---------

Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 14:47:53 +05:00
..