Automate corpora testing in CI (#4927)

* add detector corpora test workflow and script
* only run once per PR, make comment descriptive, add handling for manual runs to get PR issue number
* comment out types to see result on all commits
* uncomment types
* remove table from comment
* comment out types
* Phase 0: add explicit pipefail and capture trufflehog stderr
* Phase 1: differential diffing PR vs main
* DEMO: loosen Stripe regex (will revert)
* DEMO: loosen JDBC regex (will revert)
* Phase 1 fix: add --allow-verification-overlap, fix no-diff detection

  The bench uses --no-verification, so the engine's overlap-path dedup (which exists to protect verifiers from duplicate calls) adds noise without value here: it causes shifts in unrelated detectors when only one detector's regex changes. Pair --allow-verification-overlap with --no-verification so each detector's regex behavior is measured independently.

  Also fix the false 'no diff vs main' claim that triggered when NEW/REMOVED were zero but total counts differed.

* revert jdbc detector change
* Phase 2: detector scoping, new-detector handling, blast radius, status emoji
* DEMO: loosen JDBC + add fictional acmevault detector
* Phase 2 fix: harden corpus byte counting against early trufflehog exit

  awk's END block doesn't run when trufflehog exits before draining stdin (SIGPIPE kills awk first), leaving the bytes file empty and breaking the step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a default of 0 and validate it's an integer before arithmetic.

  Also fold unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay out of CI logs.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (1/3): add hack/extract-keywords for detector keyword introspection

  Static AST parse of a detector package to extract the strings returned by its Keywords() method. Used by the upcoming keyword-corpus builder to fan out per-detector GitHub Code Search queries during the corpora bench.

  AST-first because each detector lives in its own package; importing them dynamically would require codegen or `plugin`. Falls back to a regex over the function body, then a directory-wide grep, when AST resolution can't statically resolve the return value (helper calls, build-tagged variants).

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 3a (2/3): add Layer 1 keyword corpus builder + workflow integration

  build_keyword_corpus.py queries GitHub Code Search for each changed detector's pre-filter keywords and emits a zstd-compressed JSONL whose shape matches the existing S3 corpus exactly: each line is `{"provenance": {...}, "content": "<raw file content>"}`. The corpora script's existing `unzstd | jq -r .content` pipe handles it unchanged; provenance is descriptive only and never reaches trufflehog.

  Rate-limit policy is header-driven: the search bucket's X-RateLimit-Remaining and X-RateLimit-Reset headers gate every call, with a 2.1s floor between requests as belt-and-suspenders. 403/429s honor Retry-After or fall back to the reset window. Cap is 100 unique results per detector, deduped on (repo, path, sha), with a per-keyword sub-cap so one popular keyword can't starve the others.

  A sidecar JSON reports per-detector fetch counts and a thin_l1 list of detectors whose total returned results were zero (or whose keyword extraction failed). The diff script reads it via a new --keyword-corpus-meta arg and renders a single contiguous blockquote callout above the per-detector details. It is a sidecar instead of an in-corpus signal because stdin metadata is dropped from trufflehog's findings output.

  Workflow change: a new "Build keyword corpus (Layer 1)" step fires after detector detection and overwrites DATASETS via $GITHUB_ENV to append the keyword corpus path. The corpora script picks it up unchanged through its existing local-file branch.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 complete - Heatmap visualization
* Phase 4 rework (1/2): emit heatmap-grid.json sidecar from render_heatmap.py

  Add a JSON sidecar that captures the same Δ matrix the PNG renders. The diff script consumes this to render an emoji-bucketed Markdown table. GitHub's PR-comment Markdown sanitizer strips data: URLs and serves artifact zips behind auth, so neither inline base64 nor an artifact <img src> embed actually displays. The PNG stays for artifact archival and click-through.

  Sidecar shape: {detectors, decoders, deltas, _layout}, with _layout documenting the deltas[i][j] orientation inline so future readers don't have to reverse-engineer it. Emitted whenever the grid is non-empty, even if matplotlib import fails; the comment never depends on the PNG.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 4 rework (2/2): replace data-URL embed with emoji-bucketed Markdown table

  GitHub's PR-comment Markdown sanitizer strips data: URLs from user content, so the inline base64 PNG embed shipped in the prior commit rendered as a broken link in the comment DOM (no <img> tag emitted). Artifact-zip URLs require auth to download, so a fallback ![](...) link is also a non-starter: it would render as a broken image.

  Switch to a per-(detector, decoder) Δ table built from the grid JSON sidecar render_heatmap.py now emits. Cells use emoji buckets aligned with the existing status-emoji thresholds so the visual weight matches the summary table:

    🟥 Δ ≥ +6 (matches NEW > 5 → 🔴)
    🟧 +1..+5
    ⬜ 0
    🟦 ≤ −1

  Renders identically on web, mobile, email notifications, and CI log replay, that is, every surface a PR comment lands on. The colored matplotlib PNG stays as a workflow artifact; when --heatmap-artifact-url is supplied, the table is followed by a click-through link for reviewers who want the rich version.

  Workflow YAML mirrors the rename (--heatmap-png → --heatmap-grid) and drops the base64-blob log filter; the report is back to plain human-readable text.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Phase 5 complete - Polish
* cleanup, enable verification
* fix bug
* optimizations
* cache keywords corpus
* rewrite comment message
* cache github api corpus per keyword
* cleanup
* remove github corpus
* revert changes for testing
* move Configure AWS credentials step to run only when detector changes are detected
* revert unnecessary changes
* cleanup + bugbot fixes
* run test with bigger (30gb) dataset, loosen jdbc regex
* optimizations
* bugbot fixes
* revert jdbc changes and bugbot fix
* run only on regex and/or keywords change
* bugbot fixes
* bugbot fix
* incorporate brad's comments, loosen jdbc regex to run a test to ensure everything works as expected
* revert test changes
* fix misleading bench skipped message
* run only once when PR opens
* pipe stderr directly to CI log instead of writing to file, loosen jdbc regex to trigger workflow
* testing: run on all commits
* add archive timeout
* remove bigger dataset
* revert testing changes

---------

Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:20:35 +00:00 · 2026-05-15 14:47:53 +05:00
parent cd6b46a502
commit 3a022f9d59
5 changed files with 875 additions and 0 deletions
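The emoji buckets described in the commit message (Δ ≥ +6, +1..+5, 0, ≤ −1) can be sketched as a small mapping function. This is only an illustration of the thresholds stated above: `delta_bucket` and `render_row` are hypothetical names, and the actual table code in diff_corpora_results.py is not shown in this excerpt.

```python
def delta_bucket(delta: int) -> str:
    """Map a per-(detector, decoder) match-count delta to an emoji bucket.

    Thresholds follow the commit message; the function name is hypothetical.
    """
    if delta >= 6:
        return "🟥"  # aligned with the NEW > 5 regression threshold
    if delta >= 1:
        return "🟧"  # +1..+5
    if delta == 0:
        return "⬜"
    return "🟦"      # delta <= -1


def render_row(detector: str, deltas: list[int]) -> str:
    """Render one Markdown table row for a detector (hypothetical layout)."""
    cells = " | ".join(delta_bucket(d) for d in deltas)
    return f"| {detector} | {cells} |"
```

Keeping the top bucket at +6 means a red cell in the table corresponds to the same "more than 5 new matches" rule used by the summary status emoji.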
@@ -0,0 +1,266 @@
 name: Corpora Test
 on:
  workflow_dispatch:
  pull_request:
    types: [opened]
    paths:
      - 'pkg/detectors/**'
      - 'pkg/engine/defaults/defaults.go'
      - '.github/workflows/detector-corpora-test.yml'
      - 'scripts/test/detector_corpora_test.sh'
      - 'scripts/test/diff_corpora_results.py'
      - 'scripts/test/detect_changed_detectors.sh'
 env:
  DATASETS: |
    s3://trufflehog-corpora-datasets/contents.2025-11-04.jsonl.zstd
 jobs:
  corpora-test:
    if: ${{ github.repository == 'trufflesecurity/trufflehog' && !github.event.pull_request.head.repo.fork }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout code
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          fetch-depth: 0
          persist-credentials: false
      - name: Install Go
        uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6
        with:
          go-version: "1.25"
      - name: Install dependencies
        run: sudo apt-get install -y zstd jq
      - name: Resolve merge-base
        id: merge_base
        shell: bash
        run: |
          set -o pipefail
          git fetch --no-tags --prune origin main
          MERGE_BASE=$(git merge-base origin/main HEAD)
          echo "Merge base: $MERGE_BASE"
          echo "sha=$MERGE_BASE" >> "$GITHUB_OUTPUT"
      # Determine which detectors changed in this PR. The PR build scopes its
      # scan to the full changed set; the main build excludes detectors that
      # don't exist there yet (new detectors). If the set is empty, the workflow
      # short-circuits with a skip comment: scoping is the entire point of
      # Phase 2, and falling back to scan-all would defeat it.
      - name: Detect changed detectors
        id: detect
        shell: bash
        env:
          BASE_REF: ${{ steps.merge_base.outputs.sha }}
        run: |
          set -o pipefail
          chmod +x scripts/test/detect_changed_detectors.sh
          PR_CSV=$(./scripts/test/detect_changed_detectors.sh --pr-csv || true)
          MAIN_CSV=$(./scripts/test/detect_changed_detectors.sh --main-csv || true)
          NEW_LIST=$(./scripts/test/detect_changed_detectors.sh --new-only || true)
          NEW_CSV=$(echo "$NEW_LIST" | paste -sd, -)
          echo "PR detectors:   $PR_CSV"
          echo "Main detectors: $MAIN_CSV"
          echo "New detectors:  $NEW_CSV"
          echo "pr_csv=$PR_CSV" >> "$GITHUB_OUTPUT"
          echo "main_csv=$MAIN_CSV" >> "$GITHUB_OUTPUT"
          echo "new_csv=$NEW_CSV" >> "$GITHUB_OUTPUT"
          if [[ -n "$PR_CSV" ]]; then
            echo "any_changed=true" >> "$GITHUB_OUTPUT"
          else
            echo "any_changed=false" >> "$GITHUB_OUTPUT"
          fi
      # Sticky comment: find any prior detector-bench comment on the PR by
      # the marker substring and update it in place. The marker (kept in
      # sync with STICKY_COMMENT_MARKER in scripts/test/diff_corpora_results.py)
      # has to appear in BOTH the skip body and the diff body so the same
      # comment flips between them when successive runs take different paths.
      # The skip body is only posted on pull_request events; workflow_dispatch
      # runs with no changed detectors finish silently without posting.
      - name: Find existing skip comment
        if: steps.detect.outputs.any_changed != 'true' && github.event_name == 'pull_request'
        id: find_skip_comment
        uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4
        with:
          issue-number: ${{ github.event.pull_request.number }}
          comment-author: 'github-actions[bot]'
          body-includes: '<!-- detector-bench -->'
      - name: Post or update skip comment
        if: steps.detect.outputs.any_changed != 'true' && github.event_name == 'pull_request'
        uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5
        with:
          comment-id: ${{ steps.find_skip_comment.outputs.comment-id }}
          issue-number: ${{ github.event.pull_request.number }}
          edit-mode: replace
          body: |
            <!-- detector-bench -->
            ## Corpora Test Results
            No detector regex or keyword changes in this PR. Bench skipped.
      - name: Configure AWS credentials
        if: steps.detect.outputs.any_changed == 'true'
        uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37 # v6
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # Cache the main scan results by merge-base + scoped detector set.
      # On later runs for the same PR without a rebase (for example a manual
      # workflow_dispatch; the pull_request trigger fires only on `opened`),
      # both are identical, so the main scan (35 GB of S3 streaming +
      # trufflehog) is skipped entirely.
      - name: Restore main scan cache
        id: main_scan_cache
        if: steps.detect.outputs.any_changed == 'true' && steps.detect.outputs.main_csv != ''
        uses: actions/cache/restore@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
        with:
          path: /tmp/results-main.jsonl
          key: main-scan-v1-${{ steps.merge_base.outputs.sha }}-${{ steps.detect.outputs.main_csv }}
      # Two independent builds run in parallel:
      #   A) prepare main worktree → build main binary (git I/O then CPU)
      #      Skipped on main scan cache hit or when main_csv is empty
      #      (all changed detectors are new — no baseline needed).
      #   B) build PR binary (CPU, no dependencies)
      - name: Build binaries
        if: steps.detect.outputs.any_changed == 'true'
        shell: bash
        env:
          MERGE_BASE: ${{ steps.merge_base.outputs.sha }}
          MAIN_CSV: ${{ steps.detect.outputs.main_csv }}
          MAIN_SCAN_CACHE_HIT: ${{ steps.main_scan_cache.outputs.cache-hit }}
        run: |
          set -o pipefail
          # Chain A: prepare worktree, then build main binary.
          # Skipped when main scan results are already cached, or when all
          # changed detectors are new (main_csv empty — no baseline needed).
          if [[ -n "$MAIN_CSV" && "$MAIN_SCAN_CACHE_HIT" != 'true' ]]; then
            (
              git worktree add /tmp/trufflehog-main-src "$MERGE_BASE"
              cd /tmp/trufflehog-main-src
              CGO_ENABLED=0 go build -o /tmp/trufflehog-main .
            ) &
            PID_MAIN_BUILD=$!
          fi
          # Chain B: build PR binary (no dependencies).
          CGO_ENABLED=0 go build -o /tmp/trufflehog-pr . &
          PID_PR_BUILD=$!
          [[ -n "${PID_MAIN_BUILD:-}" ]] && { wait $PID_MAIN_BUILD || { echo "Main binary build failed" >&2; exit 1; }; }
          wait $PID_PR_BUILD || { echo "PR binary build failed" >&2; exit 1; }
      # PR and main scans share a single S3 stream per dataset file, teed to
      # both binaries simultaneously. The main side is skipped on a cache hit
      # (results already in /tmp/results-main.jsonl) or when main_csv is empty
      # (PR adds only new detectors — no overlap with main).
      - name: Run corpora tests
        if: steps.detect.outputs.any_changed == 'true'
        shell: bash
        env:
          PR_CSV: ${{ steps.detect.outputs.pr_csv }}
          MAIN_CSV: ${{ steps.detect.outputs.main_csv }}
          MAIN_SCAN_CACHE_HIT: ${{ steps.main_scan_cache.outputs.cache-hit }}
        run: |
          set -o pipefail
          files=()
          while IFS= read -r dataset; do
            [[ -z "$dataset" ]] && continue
            files+=("$dataset")
          done <<< "$DATASETS"
          export TRUFFLEHOG_BIN=/tmp/trufflehog-pr
          export OUTPUT_JSONL=/tmp/results-pr.jsonl
          export INCLUDE_DETECTORS="$PR_CSV"
          if [[ -n "$MAIN_CSV" && "$MAIN_SCAN_CACHE_HIT" != 'true' ]]; then
            # Dual-binary: single S3 download teed to both PR and main binaries.
            export TRUFFLEHOG_BIN_MAIN=/tmp/trufflehog-main
            export OUTPUT_JSONL_MAIN=/tmp/results-main.jsonl
            export INCLUDE_DETECTORS_MAIN="$MAIN_CSV"
          elif [[ -z "$MAIN_CSV" ]]; then
            echo "No overlapping detectors in main; skipping main scan."
            : > /tmp/results-main.jsonl
          else
            echo "Main scan cache hit; skipping main scan."
          fi
          ./scripts/test/detector_corpora_test.sh "${files[@]}" \
            || { echo "Corpora scan failed" >&2; exit 1; }
      - name: Save main scan cache
        if: steps.detect.outputs.any_changed == 'true' && steps.detect.outputs.main_csv != '' && steps.main_scan_cache.outputs.cache-hit != 'true'
        uses: actions/cache/save@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
        with:
          path: /tmp/results-main.jsonl
          key: main-scan-v1-${{ steps.merge_base.outputs.sha }}-${{ steps.detect.outputs.main_csv }}
      - name: Diff results
        if: steps.detect.outputs.any_changed == 'true'
        shell: bash
        env:
          CHANGED: ${{ steps.detect.outputs.pr_csv }}
          NEW_DETECTORS: ${{ steps.detect.outputs.new_csv }}
        run: |
          set -o pipefail
          python3 scripts/test/diff_corpora_results.py \
            /tmp/results-main.jsonl /tmp/results-pr.jsonl \
            --changed-detectors="$CHANGED" \
            --new-detectors="$NEW_DETECTORS" \
            > /tmp/diff-report.md
          cat /tmp/diff-report.md
      # workflow_dispatch runs don't carry an issue context, so resolve the
      # PR number by branch lookup. pull_request events fall through to the
      # event's issue number. Output feeds the find/update pair below.
      - name: Resolve PR number
        if: steps.detect.outputs.any_changed == 'true'
        id: resolve_pr
        uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9
        with:
          script: |
            let issue_number;
            if (context.eventName === 'workflow_dispatch') {
              const pulls = await github.rest.pulls.list({
                owner: context.repo.owner,
                repo: context.repo.repo,
                head: `${context.repo.owner}:${context.ref.replace('refs/heads/', '')}`,
                state: 'open',
              });
              if (pulls.data.length === 0) {
                core.setFailed(`No open PR found for branch ${context.ref}`);
                return;
              }
              issue_number = pulls.data[0].number;
            } else {
              issue_number = context.issue.number;
            }
            core.setOutput('issue_number', issue_number);
      - name: Find existing diff comment
        if: steps.detect.outputs.any_changed == 'true'
        id: find_diff_comment
        uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4
        with:
          issue-number: ${{ steps.resolve_pr.outputs.issue_number }}
          comment-author: 'github-actions[bot]'
          body-includes: '<!-- detector-bench -->'
      - name: Post or update diff comment
        if: steps.detect.outputs.any_changed == 'true'
        uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5
        with:
          comment-id: ${{ steps.find_diff_comment.outputs.comment-id }}
          issue-number: ${{ steps.resolve_pr.outputs.issue_number }}
          edit-mode: replace
          body-path: /tmp/diff-report.md
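The find/update pair in the workflow above amounts to a "sticky comment" decision: reuse the bot comment that carries the marker, otherwise create a new one. A minimal sketch of that decision follows, assuming comments are available as plain dictionaries with `id`, `author`, and `body` keys. `plan_comment` and that dictionary shape are hypothetical and are not the API of the peter-evans actions.

```python
MARKER = "<!-- detector-bench -->"   # same literal the workflow searches for
BOT_AUTHOR = "github-actions[bot]"


def plan_comment(existing_comments, new_body):
    """Decide whether to update the sticky comment or create a fresh one.

    existing_comments: iterable of dicts with 'id', 'author', 'body'
    (a hypothetical shape chosen for this sketch).
    """
    for comment in existing_comments:
        if comment["author"] == BOT_AUTHOR and MARKER in comment["body"]:
            # Replace the body in place so the PR keeps a single bench comment.
            return {"action": "update", "comment_id": comment["id"], "body": new_body}
    return {"action": "create", "comment_id": None, "body": new_body}
```

Because both the skip body and the diff report start with the marker, either path finds and overwrites the comment left by the other.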
@@ -10,3 +10,7 @@ tmp/go-test.json
 .captain/detectors/quarantines.yaml
 .captain/detectors/flakes.yaml
 .vscode
 # Python
 __pycache__/
 *.pyc
@@ -0,0 +1,206 @@
 #!/usr/bin/env bash
 #
 # detect_changed_detectors.sh — Phase 2
 #
 # Emits the list of detectors changed between two git refs, formatted for
 # trufflehog's --include-detectors flag (comma-separated, lowercase protobuf
 # enum names, optional ".v<n>" version suffix).
 #
 # Source of truth for each detector's identifier:
 #   - Proto enum name comes from the detector's Type() implementation in its
 #     source files (e.g. `return detectorspb.DetectorType_AzureBatch` →
 #     `azurebatch`). Necessary because the package directory often differs
 #     from the enum name (azure_batch vs AzureBatch, npmtokenv2 vs NpmToken,
 #     close vs closecrm, etc.).
 #   - Version comes from the directory suffix only (`/v<n>`). Detectors that
 #     encode the version in the dir name (e.g. `npmtokenv2`) are emitted
 #     without a version suffix; trufflehog then matches all versions of that
 #     proto type — wider scope but correct.
 #
 # "New detector" detection compares pkg/engine/defaults/defaults.go imports
 # between the two refs. A detector imported at HEAD but not at BASE is new.
 #
 # Modes:
 #   (none)       List all changed detectors at HEAD, one per line, in
 #                <name>[.v<n>] form.
 #   --pr-csv     Same set as default mode, comma-joined.
 #   --main-csv   Changed detectors that also exist at BASE (excludes new),
 #                comma-joined. Use as --include-detectors for the main build.
 #   --new-only   Just the new detectors (in HEAD but not BASE), one per line.
 #
 # Env:
 #   BASE_REF   default origin/main
 #   HEAD_REF   default HEAD
 set -euo pipefail
 MODE="${1:-list}"
 BASE_REF="${BASE_REF:-origin/main}"
 HEAD_REF="${HEAD_REF:-HEAD}"
 REPO_ROOT="$(git rev-parse --show-toplevel)"
 cd "$REPO_ROOT"
 # Resolve BASE to a concrete commit. Workflow already runs `git fetch origin
 # main`; locally that may not be true, so we fall back to `main` if the
 # remote-tracking ref is missing.
 if ! git rev-parse --verify "$BASE_REF" >/dev/null 2>&1; then
    if git rev-parse --verify main >/dev/null 2>&1; then
        BASE_REF=main
    else
        echo "error: cannot resolve BASE_REF=$BASE_REF and no local 'main'" >&2
        exit 1
    fi
 fi
 MERGE_BASE=$(git merge-base "$BASE_REF" "$HEAD_REF")
 # Step 1 — changed detector dirs (relative to repo root).
 # Pattern: pkg/detectors/<name>(/v<n>)?/<file>.go, excludes _test.go and
 # files inside common/, custom_detectors/.
 mapfile -t CHANGED_DIRS < <(
    git diff --name-only "$MERGE_BASE...$HEAD_REF" -- 'pkg/detectors/**/*.go' \
        | grep -Ev '_test\.go$' \
        | grep -Ev '^pkg/detectors/(common|custom_detectors)/' \
        | sed -E 's|^(pkg/detectors/[^/]+(/v[0-9]+)?)/[^/]+\.go$|\1|' \
        | sort -u
 )
 # Step 2 — defaults.go imports at each ref. Each line has form
 #   "github.com/trufflesecurity/trufflehog/v3/pkg/detectors/<name>(/v<n>)?"
 # We extract just the <name>(/v<n>)? portion to use as the dir identifier.
 parse_defaults_imports() {
    local ref="$1"
    git show "$ref:pkg/engine/defaults/defaults.go" 2>/dev/null \
        | grep -oE '"github\.com/trufflesecurity/trufflehog/v3/pkg/detectors/[^"]+"' \
        | sed -E 's|.*/pkg/detectors/||; s|"$||' \
        | sort -u
 }
 mapfile -t HEAD_IMPORTS < <(parse_defaults_imports "$HEAD_REF")
 mapfile -t BASE_IMPORTS < <(parse_defaults_imports "$MERGE_BASE")
 # Set difference: detectors imported at HEAD but not at BASE. The dir
 # identifier (e.g. "github/v2", "stripe") matches the form we extracted in
 # step 1, so we can intersect directly without re-mapping.
 NEW_DIRS_FILE=$(mktemp)
 trap 'rm -f "$NEW_DIRS_FILE"' EXIT
 comm -23 \
    <(printf '%s\n' "${HEAD_IMPORTS[@]+"${HEAD_IMPORTS[@]}"}") \
    <(printf '%s\n' "${BASE_IMPORTS[@]+"${BASE_IMPORTS[@]}"}") \
    > "$NEW_DIRS_FILE"
 is_new_detector() {
    grep -qxF "$1" "$NEW_DIRS_FILE"
 }
 # Step 2b — skip detectors whose diff doesn't touch regex patterns or Keywords.
 # Corpora results only change when the matching logic changes; verification,
 # redaction, or structural changes don't affect match counts.
 has_pattern_change() {
    local dir="$1"
    # Fast path: regex or Keywords() signature on a changed line.
    git diff "$MERGE_BASE...$HEAD_REF" -- "$dir"/*.go 2>/dev/null \
        | grep -qE '^[+-][^+-].*(regexp\.|MustCompile|Keywords)' && return 0
    # Slow path: compare the Keywords() function body between refs to catch
    # changes to the return value (e.g. []string{"old"} → []string{"new"})
    # where the changed lines don't mention "Keywords" themselves.
    local file
    while IFS= read -r file; do
        [[ "$file" == *_test.go ]] && continue
        local head_body base_body
        head_body=$(git show "$HEAD_REF:$file" 2>/dev/null \
            | awk '/func[[:space:]].*Keywords\(\)[[:space:]]*\[\]string/,/^[[:space:]]*\}/' \
            | tail -n +2)
        base_body=$(git show "$MERGE_BASE:$file" 2>/dev/null \
            | awk '/func[[:space:]].*Keywords\(\)[[:space:]]*\[\]string/,/^[[:space:]]*\}/' \
            | tail -n +2)
        [[ "$head_body" != "$base_body" ]] && return 0
    done < <(git diff --name-only "$MERGE_BASE...$HEAD_REF" -- "$dir"/*.go 2>/dev/null)
    return 1
 }
 # Step 3 — for a dir, derive `<protoname>[.v<n>]`.
 detector_id_for_dir() {
    local dir="$1"
    local version=""
    if [[ "$dir" =~ ^pkg/detectors/[^/]+/v([0-9]+)$ ]]; then
        version=".v${BASH_REMATCH[1]}"
    fi
    # Extract proto enum name. Multiple matches are possible (a detector may
    # also reference related types in helpers); the Type() return is by far
    # the most common, so the modal value wins.
    local proto
    proto=$(
        grep -E 'return[[:space:]]+\S*DetectorType_[A-Za-z0-9]+' "$dir"/*.go 2>/dev/null \
            | grep -v '_test\.go' \
            | grep -oE 'DetectorType_[A-Za-z0-9]+' \
            | sort | uniq -c | sort -rn \
            | head -1 \
            | awk '{print $2}' \
            | sed 's/^DetectorType_//' \
            | tr '[:upper:]' '[:lower:]'
    )
    if [[ -z "$proto" ]]; then
        return 1
    fi
    echo "${proto}${version}"
 }
 # Step 4 — emit per mode.
 emit_list() {
    local dir id
    for dir in "${CHANGED_DIRS[@]:-}"; do
        [[ -z "$dir" ]] && continue
        has_pattern_change "$dir" || continue
        if id=$(detector_id_for_dir "$dir"); then
            echo "$id"
        else
            echo "warning: could not resolve detector id for $dir" >&2
        fi
    done | sort -u
 }
 emit_main_list() {
    local dir id
    for dir in "${CHANGED_DIRS[@]:-}"; do
        [[ -z "$dir" ]] && continue
        has_pattern_change "$dir" || continue
        # Strip `pkg/detectors/` prefix to get the import-path form, then
        # check against the new-detector set.
        local import_form="${dir#pkg/detectors/}"
        if is_new_detector "$import_form"; then
            continue
        fi
        if id=$(detector_id_for_dir "$dir"); then
            echo "$id"
        fi
    done | sort -u
 }
 emit_new_list() {
    local dir id
    for dir in "${CHANGED_DIRS[@]:-}"; do
        [[ -z "$dir" ]] && continue
        has_pattern_change "$dir" || continue
        local import_form="${dir#pkg/detectors/}"
        if ! is_new_detector "$import_form"; then
            continue
        fi
        if id=$(detector_id_for_dir "$dir"); then
            echo "$id"
        fi
    done | sort -u
 }
 case "$MODE" in
    list)       emit_list ;;
    --pr-csv)   emit_list | paste -sd, - ;;
    --main-csv) emit_main_list | paste -sd, - ;;
    --new-only) emit_new_list ;;
    *)          echo "Usage: $0 [--pr-csv|--main-csv|--new-only]" >&2; exit 2 ;;
 esac
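The identifier derivation in step 3 of the script above (version from the directory suffix, proto name from the most frequent `DetectorType_*` return) can be sketched in Python for readers who find the grep/sort/uniq pipeline hard to follow. This is an approximation under stated assumptions: `detector_id` is a hypothetical helper that takes source text directly instead of reading files, and ties between equally frequent names may resolve differently than in the shell version.

```python
import re
from collections import Counter

# Same patterns the shell script uses, expressed as Python regexes.
VERSION_DIR_RE = re.compile(r"^pkg/detectors/[^/]+/v([0-9]+)$")
TYPE_RETURN_RE = re.compile(r"return\s+\S*DetectorType_([A-Za-z0-9]+)")


def detector_id(dir_path, sources):
    """Return '<protoname>[.v<n>]' for a detector dir, or None if unresolved.

    sources: list of Go source strings from the directory (non-test files).
    """
    match = VERSION_DIR_RE.match(dir_path)
    version = f".v{match.group(1)}" if match else ""

    # Count every DetectorType_<Name> that appears in a return statement;
    # the modal value is taken to be the detector's own Type().
    counts = Counter(
        name for src in sources for name in TYPE_RETURN_RE.findall(src)
    )
    if not counts:
        return None
    proto = counts.most_common(1)[0][0]
    return proto.lower() + version
```

Directories that encode the version in their name, such as `npmtokenv2`, do not match the `/v<n>` pattern and therefore come out unversioned, which mirrors the behavior described in the script header.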
@@ -0,0 +1,127 @@
 #!/bin/bash
 set -euo pipefail
 if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <corpora_file.jsonl.zstd> [<corpora_file2.jsonl.zstd> ...]" >&2
    exit 1
 fi
 # CI sets OUTPUT_JSONL to per-run paths and skips the human-readable DuckDB
 # summary. Local invocations leave it unset and get the summary table for
 # debugging.
 if [[ -z "${OUTPUT_JSONL+x}" ]]; then
    OUTPUT_JSONL="/tmp/corpora_results.jsonl"
    RUN_DUCKDB_SUMMARY=1
 else
    RUN_DUCKDB_SUMMARY=0
 fi
 > "$OUTPUT_JSONL"
 REPO_ROOT="$(git rev-parse --show-toplevel)"
 TRUFFLEHOG_BIN="${TRUFFLEHOG_BIN:-${REPO_ROOT}/trufflehog}"
 if [[ ! -x "$TRUFFLEHOG_BIN" ]]; then
    CGO_ENABLED=0 go build -o "$TRUFFLEHOG_BIN" "$REPO_ROOT"
 fi
 # When set, scope the scan to specific detectors. Comma-separated, lowercase
 # proto enum names with optional ".v<n>" suffix (matches the format produced
 # by scripts/test/detect_changed_detectors.sh).
 INCLUDE_DETECTORS="${INCLUDE_DETECTORS:-}"
 INCLUDE_FLAG=()
 if [[ -n "$INCLUDE_DETECTORS" ]]; then
    INCLUDE_FLAG=(--include-detectors="$INCLUDE_DETECTORS")
 fi
 if [[ -n "${OUTPUT_JSONL_MAIN:-}" ]]; then
    > "$OUTPUT_JSONL_MAIN"
 fi
 # --no-verification avoids network calls against a large corpus where thousands
 # of matches could trigger API calls, dominating runtime. Verifier behavior is
 # covered by detector unit and integration tests.
 #
 # Dual-binary mode: when TRUFFLEHOG_BIN_MAIN / OUTPUT_JSONL_MAIN /
 # INCLUDE_DETECTORS_MAIN are set, the corpus stream is teed to both the PR
 # binary (stdout side) and the main binary (process substitution) so S3 is
 # only downloaded once.
 scan() {
    local input="$1"
    set +e
    local main_include_flag=()
    if [[ -n "${INCLUDE_DETECTORS_MAIN:-}" ]]; then
        main_include_flag=(--include-detectors="$INCLUDE_DETECTORS_MAIN")
    fi
    local rc=0
    if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
        # Single S3 download teed to both binaries simultaneously.
        unzstd -c "$input" \
            | jq -r .content \
            | tee >(
                "${TRUFFLEHOG_BIN_MAIN}" \
                    --no-update \
                    --no-verification \
                    --allow-verification-overlap \
                    --log-level=3 \
                    --concurrency=8 \
                    --json \
                    --archive-timeout=2h \
                    "${main_include_flag[@]}" \
                    stdin >> "${OUTPUT_JSONL_MAIN}"
              ) \
            | "$TRUFFLEHOG_BIN" \
                --no-update \
                --no-verification \
                --allow-verification-overlap \
                --log-level=3 \
                --concurrency=8 \
                --json \
                --print-avg-detector-time \
                --archive-timeout=2h \
                "${INCLUDE_FLAG[@]}" \
                stdin >> "$OUTPUT_JSONL"
        rc=$?
        wait
    else
        unzstd -c "$input" \
            | jq -r .content \
            | "$TRUFFLEHOG_BIN" \
                --no-update \
                --no-verification \
                --allow-verification-overlap \
                --log-level=3 \
                --concurrency=8 \
                --json \
                --print-avg-detector-time \
                --archive-timeout=2h \
                "${INCLUDE_FLAG[@]}" \
                stdin >> "$OUTPUT_JSONL"
        rc=$?
    fi
    set -e
    return $rc
 }
 for CORPORA_FILE in "$@"; do
    if [[ "$CORPORA_FILE" == s3://* ]]; then
        aws s3 cp "$CORPORA_FILE" - | scan /dev/stdin
    else
        scan "$CORPORA_FILE"
    fi
 done
 if [[ "$RUN_DUCKDB_SUMMARY" == "1" ]]; then
    duckdb -c "
 CREATE TABLE t AS FROM read_json_auto('$OUTPUT_JSONL', ignore_errors=true);
 SELECT
    t.DetectorName detector,
    COUNT(*) total
 FROM t
 GROUP BY all
 ORDER BY total DESC, detector
 LIMIT 50;
 "
 fi
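For local runs without DuckDB installed, the summary query at the end of the script above can be approximated in a few lines of Python. This is a sketch, not part of the commit: `summarize` is a hypothetical helper, and unlike the SQL it silently drops records that have no `DetectorName` instead of grouping them under NULL.

```python
import json
from collections import Counter


def summarize(lines, limit=50):
    """Count findings per DetectorName, ordered by total desc, then name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # comparable in spirit to ignore_errors=true
        if not isinstance(obj, dict):
            continue
        name = obj.get("DetectorName")
        if name:
            counts[name] += 1
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:limit]
```

A caller could pass an open file handle for the results JSONL, since iterating a file yields lines.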
@@ -0,0 +1,272 @@
 #!/usr/bin/env python3
 """
 Diffs two trufflehog JSONL outputs (main vs PR build) and emits a Markdown
 report to stdout.
 Identity per finding: (DetectorName, Raw or RawV2 fallback). Set semantics —
 duplicates within a single scan collapse into one identity, so a regex change
 either adds a new (detector, secret) identity or removes one.
 Verification is disabled at scan time (--no-verification) to avoid network
 calls against a large corpus where thousands of matches could dominate runtime.
 The diff measures regex match changes only.
 When --changed-detectors is provided, the report focuses on the detectors
 changed by the PR. Detectors flagged via --new-detectors are rendered with 🆕
 status and absolute density (no main baseline). When --corpus-bytes is
 provided, a blast-radius column projects matches per 10 GB of scanned content.
 Usage:
    diff_corpora_results.py <main.jsonl> <pr.jsonl>
        [--changed-detectors=<csv>]
        [--new-detectors=<csv>]
        [--corpus-bytes=<n>]
 """
 import argparse
 import json
 import sys
 from collections import defaultdict
 PREAMBLE = (
    "Scans a corpus of real-world public code against only the detectors "
    "changed in this PR, then compares unique match counts between the PR "
    "build and the main baseline to catch regex regressions. Verification "
    "is disabled — each detector's regex is measured independently."
 )
 STATUS_KEY = (
    "- 🔴 regression: >5 new, >20% increase over main, or any removed\n"
    "- ⚠️ warning: 1–5 new and ≤20% increase over main\n"
    "- ✅ clean\n"
    "- 🆕 new detector (no baseline)"
 )
 # Marker on the very first line of the body so peter-evans/find-comment can
 # locate the sticky comment via substring match. Workflow file references the
 # same literal — keep the two in sync.
 STICKY_COMMENT_MARKER = "<!-- detector-bench -->"
 def parse_csv(s):
    """Parse a comma-separated detector list into a set of normalized names.
    Strips ``.v<n>`` version suffixes and lowercases. JSONL DetectorName is the
    proto enum name (e.g., ``JDBC``); we match case-insensitively by name only,
    since version doesn't appear in the output. Versioned scoping happens at
    the trufflehog --include-detectors level.
    """
    if not s:
        return set()
    out = set()
    for item in s.split(","):
        item = item.strip()
        if not item:
            continue
        if "." in item:
            item = item.split(".", 1)[0]
        out.add(item.lower())
    return out
 def load_findings(path):
    """Returns dict: detector_name -> {"identities": set[str], "total": int}."""
    by_detector = defaultdict(lambda: {"identities": set(), "total": 0})
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
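            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Skip valid JSON that isn't an object (e.g. a bare string or
            # number); only dicts can carry DetectorName.
            if not isinstance(obj, dict):
                continue
            detector = obj.get("DetectorName") or ""
            if not detector:
                continue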
            raw = obj.get("Raw") or obj.get("RawV2") or ""
            by_detector[detector]["identities"].add(raw)
            by_detector[detector]["total"] += 1
    return by_detector
 def status_emoji(new_count, removed_count, unique_main):
    """Hybrid threshold: 🔴 on absolute (>5) OR relative (>20% of main) NEW, OR any REMOVED."""
    if removed_count > 0:
        return "🔴"
    if new_count > 5 or new_count > 0.20 * max(unique_main, 1):
        return "🔴"
    if new_count > 0:
        return "⚠️"
    return "✅"
 def build_top_line_summary(rows, changed):
    regressed = sum(1 for r in rows if not r["is_new"] and r["emoji"] == "🔴")
    warned = sum(1 for r in rows if not r["is_new"] and r["emoji"] == "⚠️")
    new_count = sum(1 for r in rows if r["is_new"])
    clean = sum(1 for r in rows if r["emoji"] == "✅")
    scoped = ", ".join(f"`{d}`" for d in sorted(changed)) if changed else ""
    parts = []
    if regressed:
        parts.append(f"{regressed} regressed")
    if warned:
        parts.append(f"{warned} warned")
    parts += [f"{new_count} new", f"{clean} clean"]
    summary = f"**{' · '.join(parts)}**"
    if scoped:
        summary += f" \u00a0|\u00a0 Scoped to: {scoped}"
    return summary
 def render(main, pr, changed=None, new_detectors=None):
    new_detectors = new_detectors or set()
    if changed:
        all_names = {d for d in (set(main) | set(pr))
                     if d.lower() in changed}
        # Detectors that the PR claims to have changed (or added) but that
        # produced zero matches on either side. These don't appear in JSONL,
        # so we surface them in a separate warning section below the table.
        seen_lower = {d.lower() for d in (set(main) | set(pr))}
        missing = sorted(d for d in changed if d not in seen_lower)
    else:
        all_names = set(main) | set(pr)
        missing = []
    _empty = {"identities": set(), "total": 0}
    rows = []
    has_diff = False
    for d in sorted(all_names):
        # A detector is only treated as fully new if the new_detectors set
        # says so AND main produced no findings for it. When a PR modifies an
        # existing version and adds a new version of the same detector (e.g.
        # jdbc.v1 + jdbc.v2), both collapse to "jdbc" in new_detectors but
        # main still ran against the existing version — its results must not
        # be discarded.
        is_new = d.lower() in new_detectors and d not in main
        m = main.get(d, _empty)
        p = pr.get(d, _empty)
        new_ids = p["identities"] - m["identities"]
        removed_ids = m["identities"] - p["identities"]
        if is_new:
            emoji = "🆕"
        else:
            emoji = status_emoji(len(new_ids), len(removed_ids), len(m["identities"]))
        if new_ids or removed_ids or m["total"] != p["total"]:
            has_diff = True
        rows.append({
            "detector": d,
            "is_new": is_new,
            "emoji": emoji,
            "total_main": m["total"],
            "total_pr": p["total"],
            "unique_main": len(m["identities"]),
            "unique_pr": len(p["identities"]),
            "new_count": len(new_ids),
            "removed_count": len(removed_ids),
        })
    parts = [
        STICKY_COMMENT_MARKER,
        "## Corpora Test Results",
        "",
        PREAMBLE,
        "",
    ]
    if rows:
        parts += [build_top_line_summary(rows, changed), ""]
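    if not rows and not missing:
        # Only mention "changed detectors" when the report is actually scoped.
        scope = " for the changed detectors" if changed else ""
        parts += [f"_(No findings on either side{scope}.)_", ""]
        return "\n".join(parts)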
    if rows:
        if has_diff or any(r["is_new"] for r in rows):
            rows.sort(
                key=lambda r: (
                    0 if r["is_new"] else 1,
                    -(r["new_count"] + r["removed_count"]),
                    r["detector"],
                )
            )
        else:
            rows.sort(key=lambda r: r["detector"])
        cols = ["Status", "Detector", "Unique matches (main)", "Unique matches (PR)",
                "New", "Removed"]
        aligns = ["", "", "---:", "---:", "---:", "---:"]
        parts += [
            "| " + " | ".join(cols) + " |",
            "|" + "|".join(a if a else "---" for a in aligns) + "|",
        ]
        for r in rows:
            if r["is_new"]:
                cells = [
                    r["emoji"],
                    r["detector"],
                    "—",
                    str(r["unique_pr"]),
                    "—",
                    "—",
                ]
            else:
                cells = [
                    r["emoji"],
                    r["detector"],
                    str(r["unique_main"]),
                    str(r["unique_pr"]),
                    str(r["new_count"]),
                    str(r["removed_count"]),
                ]
            parts.append("| " + " | ".join(cells) + " |")
        parts.append("")
        parts.append(STATUS_KEY)
        parts.append("")
    if missing:
        parts += [
            "### ⚠️ Changed detectors with zero matches in both builds",
            "",
            "These detectors were modified or added by the PR but produced no "
            "matches against the corpus on either side. This could be a "
            "deliberate scope narrowing or, more concerning, a regex so loose "
            "the engine silently filtered the flood (issue #3578). Worth a "
            "manual look.",
            "",
        ]
        for d in missing:
            parts.append(f"- `{d}`")
        parts.append("")
    return "\n".join(parts)
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("main_jsonl")
    parser.add_argument("pr_jsonl")
    parser.add_argument("--changed-detectors", default="",
                        help="CSV of detectors changed in PR; filters report.")
    parser.add_argument("--new-detectors", default="",
                        help="CSV of detectors present in PR but not main; rendered with 🆕.")
    args = parser.parse_args()
    main_findings = load_findings(args.main_jsonl)
    pr_findings = load_findings(args.pr_jsonl)
    changed = parse_csv(args.changed_detectors)
    new_detectors = parse_csv(args.new_detectors)
    sys.stdout.write(render(
        main_findings,
        pr_findings,
        changed=changed if changed else None,
        new_detectors=new_detectors,
    ))
 if __name__ == "__main__":
    main()
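To make the thresholds concrete, here is a minimal standalone sketch of the set-difference and status logic on made-up data. `classify` restates the rule from `status_emoji` above. `diff_identities` and the sample identities are hypothetical, introduced only for illustration.

```python
RED, WARN, CLEAN = "🔴", "⚠️", "✅"


def classify(new_count, removed_count, unique_main):
    """Same rule as status_emoji: any removal, >5 new, or >20% of main is red."""
    if removed_count > 0:
        return RED
    if new_count > 5 or new_count > 0.20 * max(unique_main, 1):
        return RED
    if new_count > 0:
        return WARN
    return CLEAN


def diff_identities(main_ids, pr_ids):
    """Return (new, removed) identity sets using plain set difference."""
    return pr_ids - main_ids, main_ids - pr_ids


if __name__ == "__main__":
    # Invented sample: main has 10 unique secrets; the PR keeps all and adds 2.
    main_ids = {f"secret-{i}" for i in range(10)}
    pr_ids = main_ids | {"secret-10", "secret-11"}

    new, removed = diff_identities(main_ids, pr_ids)
    # 2 new against 10 on main is exactly 20%, which stays a warning;
    # a third new identity would cross the relative threshold.
    print(len(new), len(removed), classify(len(new), len(removed), len(main_ids)))
```

Because the relative check uses `max(unique_main, 1)`, a detector with no unique matches on main turns red on its first new match, so the warning tier only applies once main has at least five unique identities.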