mirror of
https://github.com/trufflesecurity/trufflehog.git
synced 2026-05-16 13:20:35 +00:00
3a022f9d59
* add detector corpora test workflow and script
* only run once per PR, make comment descriptive, add handling for manual runs to get PR issue number
* comment out types to see result on all commits
* uncomment types
* remove table from comment
* comment out types
* Phase 0: add explicit pipefail and capture trufflehog stderr
* Phase 1: differential diffing PR vs main
* DEMO: loosen Stripe regex (will revert)
* DEMO: loosen JDBC regex (will revert)
* Phase 1 fix: add --allow-verification-overlap, fix no-diff detection

The bench uses --no-verification, so the engine's overlap-path dedup (which exists to protect verifiers from duplicate calls) adds noise without value here: it causes shifts in unrelated detectors when only one detector's regex changes. Pair --allow-verification-overlap with --no-verification so each detector's regex behavior is measured independently.

Also fix the false 'no diff vs main' claim that triggered when NEW/REMOVED were zero but total counts differed.

* revert jdbc detector change
* Phase 2: detector scoping, new-detector handling, blast radius, status emoji
* DEMO: loosen JDBC + add fictional acmevault detector
* Phase 2 fix: harden corpus byte counting against early trufflehog exit

awk's END block doesn't run when trufflehog exits before draining stdin (SIGPIPE kills awk first), leaving the bytes file empty and breaking the step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a default of 0 and validate it's an integer before arithmetic.

Also fold unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay out of CI logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (1/3): add hack/extract-keywords for detector keyword introspection

Static AST parse of a detector package to extract the strings returned by its Keywords() method. Used by the upcoming keyword-corpus builder to fan out per-detector GitHub Code Search queries during the corpora bench.
AST-first because each detector lives in its own package; importing them dynamically would require codegen or `plugin`. Falls back to a regex over the function body, then a directory-wide grep, when AST resolution can't statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (2/3): add Layer 1 keyword corpus builder + workflow integration

build_keyword_corpus.py queries GitHub Code Search for each changed detector's pre-filter keywords and emits a zstd-compressed JSONL whose shape matches the existing S3 corpus exactly: each line is `{"provenance": {...}, "content": "<raw file content>"}`. The corpora script's existing `unzstd | jq -r .content` pipe handles it unchanged; provenance is descriptive only and never reaches trufflehog.

Rate-limit policy is header-driven: the search bucket's X-RateLimit-Remaining and X-RateLimit-Reset headers gate every call, with a 2.1s floor between requests as belt-and-suspenders. 403/429s honor Retry-After or fall back to the reset window. Cap is 100 unique results per detector, deduped on (repo, path, sha), with a per-keyword sub-cap so one popular keyword can't starve the others.

A sidecar JSON reports per-detector fetch counts and a thin_l1 list of detectors whose total returned results were zero (or whose keyword extraction failed). The diff script reads it via a new --keyword-corpus-meta arg and renders a single contiguous blockquote callout above the per-detector details. It is a sidecar instead of an in-corpus signal because stdin metadata is dropped from trufflehog's findings output.

Workflow change: a new "Build keyword corpus (Layer 1)" step fires after detector detection and overwrites DATASETS via $GITHUB_ENV to append the keyword corpus path. The corpora script picks it up unchanged through its existing local-file branch.
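The capping and dedup rule described above for the Layer 1 builder could look roughly like the sketch below. It is not the code in build_keyword_corpus.py: the function name `select_results`, the shape of `hits_by_keyword`, and the default sub-cap (the total cap split evenly across keywords) are all assumptions made for illustration. Only the 100-result cap and the (repo, path, sha) dedup key come from the commit message.

```python
from typing import Dict, List, Optional


def select_results(
    hits_by_keyword: Dict[str, List[dict]],
    total_cap: int = 100,
    per_keyword_cap: Optional[int] = None,
) -> List[dict]:
    """Pick up to total_cap unique hits, limiting each keyword's share.

    Hypothetical helper: each hit is assumed to be a dict with
    "repo", "path" and "sha" keys.
    """
    if not hits_by_keyword:
        return []
    if per_keyword_cap is None:
        # Assumption: split the total evenly; the real sub-cap is not stated.
        per_keyword_cap = max(1, total_cap // len(hits_by_keyword))

    seen = set()
    selected: List[dict] = []
    for hits in hits_by_keyword.values():
        taken = 0
        for hit in hits:
            if len(selected) >= total_cap:
                return selected
            if taken >= per_keyword_cap:
                break  # this keyword has used its share
            key = (hit["repo"], hit["path"], hit["sha"])
            if key in seen:
                continue  # same file already found via another keyword
            seen.add(key)
            selected.append(hit)
            taken += 1
    return selected
```

With a sub-cap in place, a keyword that returns thousands of hits can fill only its own share, so results for rarer keywords still make it into the corpus.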
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 complete - Heatmap visualization
* Phase 4 rework (1/2): emit heatmap-grid.json sidecar from render_heatmap.py

Add a JSON sidecar that captures the same Δ matrix the PNG renders. The diff script consumes this to render an emoji-bucketed Markdown table. GitHub's PR-comment Markdown sanitizer strips data: URLs and serves artifact zips behind auth, so neither inline base64 nor an artifact <img src> embed actually displays. The PNG stays for artifact archival and click-through.

Sidecar shape: {detectors, decoders, deltas, _layout}, with _layout documenting the deltas[i][j] orientation inline so future readers don't have to reverse-engineer it. Emitted whenever the grid is non-empty, even if matplotlib import fails; the comment never depends on the PNG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 rework (2/2): replace data-URL embed with emoji-bucketed Markdown table

GitHub's PR-comment Markdown sanitizer strips data: URLs from user content, so the inline base64 PNG embed shipped in the prior commit rendered as a broken link in the comment DOM (no <img> tag emitted). Artifact-zip URLs require auth to download, so a fallback link is also a non-starter: it would render as a broken image.

Switch to a per-(detector, decoder) Δ table built from the grid JSON sidecar render_heatmap.py now emits. Cells use emoji buckets aligned with the existing status-emoji thresholds so the visual weight matches the summary table:

  🟥 Δ ≥ +6 (matches NEW > 5 → 🔴)
  🟧 +1..+5
  ⬜ 0
  🟦 ≤ −1

Renders identically on web, mobile, email notifications, and CI log replay, which is every surface a PR comment lands on. The colored matplotlib PNG stays as a workflow artifact; when --heatmap-artifact-url is supplied, the table is followed by a click-through link for reviewers who want the rich version.
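As a rough illustration of the bucketing above, the sketch below turns a grid sidecar into a Markdown table. It is not the diff script's actual code: the function names `bucket` and `render_table` are hypothetical, and it assumes deltas[i][j] means detector i, decoder j (the real orientation is whatever _layout documents). The sample numbers are made up.

```python
def bucket(delta: int) -> str:
    """Map a finding-count delta to the emoji bucket from the commit message."""
    if delta >= 6:
        return "🟥"
    if delta >= 1:
        return "🟧"
    if delta == 0:
        return "⬜"
    return "🟦"  # any decrease


def render_table(grid: dict) -> str:
    """Render a {detectors, decoders, deltas} grid as a Markdown table.

    Assumes deltas[i][j] is the delta for detectors[i] under decoders[j].
    """
    decoders = grid["decoders"]
    lines = [
        "| detector | " + " | ".join(decoders) + " |",
        "|" + " --- |" * (len(decoders) + 1),
    ]
    for name, row in zip(grid["detectors"], grid["deltas"]):
        cells = [f"{bucket(d)} {d:+d}" if d else bucket(d) for d in row]
        lines.append(f"| {name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only, not real bench output.
    sample = {
        "detectors": ["jdbc", "stripe"],
        "decoders": ["PLAIN", "BASE64"],
        "deltas": [[7, 0], [2, -1]],
    }
    print(render_table(sample))
```

Because the output is plain text and emoji, it survives the sanitizer that strips data: URLs.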
Workflow YAML mirrors the rename (--heatmap-png → --heatmap-grid) and drops the base64-blob log filter; the report is back to plain human-readable text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 5 complete - Polish
* cleanup, enable verification
* fix bug
* optimizations
* cache keywords corpus
* rewrite comment message
* cache github api corpus per keyword
* cleanup
* remove github corpus
* revert changes for testing
* move Configure AWS credentials step to run only when detector changes are detected
* revert unnecessary changes
* cleanup + bugbot fixes
* run test with bigger (30gb) dataset, loosen jdbc regex
* optimizations
* bugbot fixes
* revert jdbc changes and bugbot fix
* run only on regex and/or keywords change
* bugbot fixes
* bugbot fix
* incorporate brad's comments, loosen jdbc regex to run a test to ensure everything works as expected
* revert test changes
* fix misleading bench skipped message
* run only once when PR opens
* pipe stderr directly to CI log instead of writing to file, loosen jdbc regex to trigger workflow
* testing: run on all commits
* add archive timeout
* remove bigger dataset
* revert testing changes

---------

Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>