* add detector corpora test workflow and script
* only run once per PR, make comment descriptive, add handling for manual runs to get PR issue number
* comment out types to see result on all commits
* uncomment types
* remove table from comment
* comment out types
* Phase 0: add explicit pipefail and capture trufflehog stderr
* Phase 1: differential diffing PR vs main
* DEMO: loosen Stripe regex (will revert)
* DEMO: loosen JDBC regex (will revert)
* Phase 1 fix: add --allow-verification-overlap, fix no-diff detection
The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.
Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.
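The no-diff fix above can be pictured with a minimal Python sketch. The function and parameter names are hypothetical and not taken from the bench script; the only point carried over from the commit is that a change in total counts counts as a diff even when nothing is NEW or REMOVED.

```python
def has_diff(new: int, removed: int, total_pr: int, total_main: int) -> bool:
    """Return True when the PR run differs from main in any counted way.

    Hypothetical helper: a diff exists if findings were added or removed,
    or if the overall totals moved even though NEW and REMOVED are zero.
    """
    return new > 0 or removed > 0 or total_pr != total_main
```

With this shape, `has_diff(0, 0, 10, 12)` reports a diff, which is the case the earlier check missed.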
* revert jdbc detector change
* Phase 2: detector scoping, new-detector handling, blast radius, status emoji
* DEMO: loosen JDBC + add fictional acmevault detector
* Phase 2 fix: harden corpus byte counting against early trufflehog exit
awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
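The byte-count hardening described above lives in a shell step; the following is only a Python sketch of the same guard, with a hypothetical function name. It assumes the counter file holds a single non-negative integer when awk's END block ran, and is empty or missing when it did not.

```python
from pathlib import Path


def read_byte_count(path: str) -> int:
    """Return the integer stored in `path`, defaulting to 0.

    Falls back to 0 when the file is missing, unreadable, empty, or holds
    anything other than ASCII digits, so later arithmetic never sees an
    empty operand.
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return 0
    if text.isascii() and text.isdigit():
        return int(text)
    return 0
```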
* Phase 3a (1/3): add hack/extract-keywords for detector keyword introspection
Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.
AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Phase 3a (2/3): add Layer 1 keyword corpus builder + workflow integration
build_keyword_corpus.py queries GitHub Code Search for each changed
detector's pre-filter keywords and emits a zstd-compressed JSONL whose
shape matches the existing S3 corpus exactly: each line is
`{"provenance": {...}, "content": "<raw file content>"}`. The corpora
script's existing `unzstd | jq -r .content` pipe handles it unchanged —
provenance is descriptive only and never reaches trufflehog.
Rate-limit policy is header-driven: the search bucket's
X-RateLimit-Remaining and X-RateLimit-Reset headers gate every call,
with a 2.1s floor between requests as belt-and-suspenders. 403/429s
honor Retry-After or fall back to the reset window. Cap is 100 unique
results per detector, deduped on (repo, path, sha), with a per-keyword
sub-cap so one popular keyword can't starve the others.
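A rough sketch of the rate-limit and dedup policy described above, not the code in build_keyword_corpus.py. The function names, the result-dict keys, and the `per_keyword_cap` parameter are assumptions for illustration; the 2.1s floor, the 100-result cap, and the header names come from the commit message. It also assumes Retry-After is given in seconds and that header lookups use the exact casing shown.

```python
import time
from typing import Iterable, Mapping, Optional

MIN_INTERVAL_S = 2.1    # floor between requests (from the commit message)
MAX_PER_DETECTOR = 100  # unique results per detector (from the commit message)


def seconds_to_wait(status: int, headers: Mapping[str, str],
                    now: Optional[float] = None) -> float:
    """Return how long to sleep before the next search request."""
    now = time.time() if now is None else now
    until_reset = max(0.0, float(headers.get("X-RateLimit-Reset", 0)) - now)
    if status in (403, 429):
        # Honor Retry-After when present, else fall back to the reset window.
        retry_after = headers.get("Retry-After")
        wait = float(retry_after) if retry_after is not None else until_reset
    elif int(headers.get("X-RateLimit-Remaining", 1)) <= 0:
        wait = until_reset
    else:
        wait = 0.0
    return max(MIN_INTERVAL_S, wait)


def dedupe(results_by_keyword: Mapping[str, Iterable[dict]],
           per_keyword_cap: int) -> list:
    """Keep unique (repo, path, sha) results, capped per keyword and overall."""
    seen = set()
    kept = []
    for results in results_by_keyword.values():
        taken = 0
        for r in results:
            if len(kept) >= MAX_PER_DETECTOR:
                return kept
            if taken >= per_keyword_cap:
                break  # leave room for the remaining keywords
            key = (r["repo"], r["path"], r["sha"])
            if key in seen:
                continue
            seen.add(key)
            kept.append(r)
            taken += 1
    return kept
```

The per-keyword cap is what keeps a single popular keyword from consuming the whole per-detector budget.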
A sidecar JSON reports per-detector fetch counts and a thin_l1 list of
detectors whose total returned results were zero (or whose keyword
extraction failed). The diff script reads it via a new
--keyword-corpus-meta arg and renders a single contiguous blockquote
callout above the per-detector details — a sidecar instead of an
in-corpus signal because stdin metadata is dropped from trufflehog's
findings output.
Workflow change: a new "Build keyword corpus (Layer 1)" step fires after
detector detection and overwrites DATASETS via $GITHUB_ENV to append the
keyword corpus path. The corpora script picks it up unchanged through
its existing local-file branch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Phase 4 complete - Heatmap visualization
* Phase 4 rework (1/2): emit heatmap-grid.json sidecar from render_heatmap.py
Add a JSON sidecar that captures the same Δ matrix the PNG renders. The
diff script consumes this to render an emoji-bucketed Markdown table —
GitHub's PR-comment Markdown sanitizer strips data: URLs and serves
artifact zips behind auth, so neither inline base64 nor an artifact
<img src> embed actually displays. The PNG stays for artifact archival
and click-through.
Sidecar shape: {detectors, decoders, deltas, _layout}, with _layout
documenting the deltas[i][j] orientation inline so future readers don't
have to reverse-engineer it. Emitted whenever the grid is non-empty,
even if matplotlib import fails — the comment never depends on the PNG.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Phase 4 rework (2/2): replace data-URL embed with emoji-bucketed Markdown table
GitHub's PR-comment Markdown sanitizer strips data: URLs from user
content, so the inline base64 PNG embed shipped in the prior commit
rendered as a broken link in the comment DOM (no <img> tag emitted).
Artifact-zip URLs require auth to download, so a fallback image link
is also a non-starter: it would render as a broken image.
Switch to a per-(detector, decoder) Δ table built from the grid JSON
sidecar render_heatmap.py now emits. Cells use emoji buckets aligned
with the existing status-emoji thresholds so the visual weight matches
the summary table:
🟥 Δ ≥ +6 (matches NEW > 5 → 🔴)
🟧 +1..+5
⬜ 0
🟦 ≤ −1
Renders identically on web, mobile, email notifications, and CI log
replay — every surface a PR comment lands on. The colored matplotlib
PNG stays as a workflow artifact; when --heatmap-artifact-url is
supplied, the table is followed by a click-through link for reviewers
who want the rich version.
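The bucket thresholds listed above can be sketched in Python as follows. This is an illustration rather than the diff script's code: `bucket` and `render_table` are hypothetical names, and the sketch assumes `deltas[i][j]` is the Δ for detector `i` and decoder `j` (the real orientation is whatever the sidecar's `_layout` field documents).

```python
def bucket(delta: int) -> str:
    """Map a finding-count delta to the emoji bucket from the commit message."""
    if delta >= 6:
        return "🟥"
    if delta >= 1:
        return "🟧"
    if delta == 0:
        return "⬜"
    return "🟦"


def render_table(grid: dict) -> str:
    """Render a Markdown table with one row per detector (assumed orientation)."""
    header = "| detector | " + " | ".join(grid["decoders"]) + " |"
    rule = "|---|" + "---|" * len(grid["decoders"])
    rows = [
        "| " + name + " | " + " | ".join(bucket(d) for d in row) + " |"
        for name, row in zip(grid["detectors"], grid["deltas"])
    ]
    return "\n".join([header, rule, *rows])
```

Because the cells are plain Unicode characters inside an ordinary Markdown table, nothing here depends on image hosting or sanitizer behavior.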
Workflow YAML mirrors the rename (--heatmap-png → --heatmap-grid) and
drops the base64-blob log filter — the report is back to plain
human-readable text.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Phase 5 complete - Polish
* cleanup, enable verification
* fix bug
* optimizations
* cache keywords corpus
* rewrite comment message
* cache github api corpus per keyword
* cleanup
* remove github corpus
* revert changes for testing
* move Configure AWS credentials step to run only when detector changes are detected
* revert unnecessary changes
* cleanup + bugbot fixes
* run test with bigger (30gb) dataset, loosen jdbc regex
* optimizations
* bugbot fixes
* revert jdbc changes and bugbot fix
* run only on regex and/or keywords change
* bugbot fixes
* bugbot fix
* incorporate brad's comments, loosen jdbc regex to run a test to ensure everything works as expected
* revert test changes
* fix misleading bench skipped message
* run only once when PR opens
* pipe stderr directly to CI log instead of writing to file, loosen jdbc regex to trigger workflow
* testing: run on all commits
* add archive timeout
* remove bigger dataset
* revert testing changes
---------
Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* spectralops-detector
* increased context deadline
* registered detector in engine
* Correct comment on SpectralOps API key format
* updated keyword
* fixed test
* Merged main
* Regen protos, Updated desc and ignore secret part in test
* updated protos
* extract subject id in boxoauth detector
* update the detector design for subject id handling
* verify subject ID via CCG auth before populating AnalysisInfo
Instead of blindly pairing every found subject ID with verified
credentials, use the Box CCG token endpoint to verify which subject ID
actually works. Emit one result per credential pair with only the
verified subject ID in AnalysisInfo. Add gock-based tests for all
AnalysisInfo scenarios.
* replace analysisinfo with secretparts
* [INS-379] Populate SecretParts unconditionally in BoxOauth detector
Always set client_id and client_secret in SecretParts regardless of
verification status. subject_id is added only when a valid triplet is
confirmed via verifySubjectID. Previously SecretParts was only populated
for a fully verified triple, leaving unverified results with no
credential data stored.
* address bugbot comment; Non-deterministic map iteration for subject ID selection
---------
Co-authored-by: Charlie Gunyon <camgunz@users.noreply.github.com>
* added gitlab oauth detector
* addressed comments; embed detectors.DefaultMultiPartCredentialProvider and use available ctx in tests
* made comments more meaningful and break loop when secret is verified against one client id
* switch from detectorspb to detector_typepb
* populate secretparts
* add secretparts to results in gitlaboauth2 detector
---------
Co-authored-by: Charlie Gunyon <camgunz@users.noreply.github.com>
* unify common logic in atlassian data center detectors
* initialize url pat once
* engine_test fix
* remove bitbucketdatacenter from defaults test exclude list
* adding customizable successRanges and rotatedRanges to customDetector
* set definitive = true in the legacy 200 path so that an earlier ranged verifier's rangesInEffect = true can't trigger a spurious SetVerificationError after the legacy verifier has already confirmed the secret as live.
* addressed review comments
* Add Pinecone API key detector
Adds a new detector for Pinecone vector database API keys (pcsk_* format).
Detection:
- Regex matches the pcsk_{key_id}_{secret} structure with tight bounds
(4+ char key ID, 40+ char secret, word boundaries) to minimize false positives.
- Extracts the embedded key_id from the token unconditionally, which maps to
the key entry in the Pinecone console and aids revocation.
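As a rough illustration of the detection bounds listed above, here is a Python sketch of such a pattern. It is not the detector's actual regex: the alphanumeric character classes and the `find_keys` helper are assumptions, and only the `pcsk_` prefix, the 4+/40+ length bounds, and the word boundaries come from the commit message.

```python
import re

# Assumed character classes; the real detector may allow a different alphabet.
PINECONE_KEY = re.compile(r"\bpcsk_([A-Za-z0-9]{4,})_([A-Za-z0-9]{40,})\b")


def find_keys(text: str) -> list:
    """Return (full_token, key_id) pairs for every candidate key in text."""
    return [(m.group(0), m.group(1)) for m in PINECONE_KEY.finditer(text)]
```

Capturing the key ID as its own group is what lets it be reported even when verification is skipped or fails.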
Verification:
- Uses GET /indexes (non-state-changing, read-only) with Api-Key header auth.
- Validates response body structure (requires "indexes" JSON key), not just
status codes, to be resilient against API changes.
- 200: Verified. Extracts project_id from index host, total_indexes, and
metadata for up to 5 indexes (name, host, cloud, region).
- 401: Invalid key (handles both plain text and JSON error bodies).
- 403: Valid key with restricted permissions (DataPlane-only roles).
Marked as verified with permission=restricted metadata.
Uses common.SaneHttpClient() with standard timeouts (5s response, 2s dial,
3s TLS) and io.LimitReader (1MB cap) on response body. No SDK dependencies.
Registered as DetectorType Pinecone = 1048.
Made-with: Cursor
* Align Pinecone detector with SecretParts
Migrate Pinecone verification metadata to SecretParts, address the redundant 200-response JSON parsing Bugbot flagged, and add focused regression coverage for malformed verification responses.
* Refactor Pinecone detector and add tests
---------
Co-authored-by: Dylan Ayrey <dylan@Dylans-MacBook-Pro.local>
Co-authored-by: Dustin Decker <dustin@trufflesec.com>
Co-authored-by: Shahzad Haider <shahzadhaider.se@gmail.com>
Co-authored-by: Shahzad Haider <76992801+shahzadhaider1@users.noreply.github.com>
* Update linter to disallow assigning SecretParts later
This means all detectors.Result objects must be created with the
SecretParts field set.
* Update documentation
* Migrate existing detectors to always initialize SecretParts
* added content-type=application/json as default header
* improv: set content type header when it is not present in the config
* handled canonicalization
Non-fatal scanning errors aren't really "errors" - they're consequences of the vicissitudes of scanning a ton of data. They're actionable if you're looking for them but if you're not they're extremely noisy. This commit downgrades them to only show up when you look for them.
* Cache verification info for reuse
* updated hashing
* tried fixing concurrent verification issue
* Added singleflight to avoid concurrent verification
* resolved linter
* Ok I tried something new for a much simpler detector
* some enhancements
* Added test case
* re-added tags
* Deduplicate concurrent credential verification requests via singleflight
* Enforce dedup key via DoWithDedup, remove public WithDedupKey
* remove unused func
* Remove dead io.Copy after io.ReadAll error in singleflight transport
* Preserve deadline on shared singleflight request after WithoutCancel
* Added test cases and moved existing ones to http_test.go
* fixed linter
* Populate SecretParts on single-part detectors
Adds SecretParts: map[string]string{"key": <secret>} to every detector
package that constructs detectors.Result with a single captured secret
value. This is the single-part half of the SecretParts migration (the
linter's common case, ~695 packages).
* Populate SecretParts on multi-part detectors
* Rename AnalysisInfo field to SecretParts on detectors.Result
Mechanical rename of the detectors.Result.AnalysisInfo field to
SecretParts, in preparation for SecretParts replacing Raw.
* Add checksecretparts static analysis for detectors.Result
Introduces a small Go tool under hack/checksecretparts that finds
detector packages which construct detectors.Result without populating
the new SecretParts field.
The check runs in CI as warning-only (continue-on-error) so the ~900
unmigrated detectors don't block unrelated PRs while they're being
migrated; it can be flipped to a hard failure by dropping
continue-on-error and passing -fail once all detectors populate the
field.
Covers composite and pointer literals, ignores test files on both
sides (construction and reference), and suppresses findings for any
package that mentions SecretParts anywhere in its non-test source so
that detectors setting the field via later assignment (x.SecretParts
= ...) are not flagged.
* checksecretparts: count distinct packages in summary
The summary was reporting len(findings) as the number of packages, but
findings contain one entry per construction site; a single package with
multiple detectors.Result{} literals produces multiple entries. Count
distinct packages from Finding.Package and include both totals in the
output.
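The counting fix above amounts to separating construction sites from distinct packages. The tool itself is Go; this Python sketch only illustrates the logic, and the `summarize` name, the `package` key, and the summary wording are hypothetical.

```python
def summarize(findings: list) -> str:
    """Report both totals: one finding per construction site, packages deduplicated."""
    packages = {f["package"] for f in findings}
    return f"{len(findings)} construction site(s) in {len(packages)} package(s)"
```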
* Document SecretParts contract in detector-authoring docs
Adds a "Populating SecretParts" section to both the external and internal
detector-authoring guides. Covers what SecretParts is, the rule that every
Result must populate it, and the shape for single-part vs multi-part
credentials with code examples. Establishes "key" as the convention for
single-part detectors (matches ~48 existing detectors vs ~13 using "token").
* Add man page generation for trufflehog
Add a hidden --generate-man-page flag that uses an enhanced Kingpin
template to produce a standards-compliant roff man page. The man page
includes auto-generated OPTIONS and COMMANDS sections (synced with the
CLI definitions), plus hand-maintained EXAMPLES, EXIT STATUS,
ENVIRONMENT, FILES, BUGS, and SEE ALSO sections.
The DESCRIPTION and EXAMPLES sections call out the interactive TUI
that launches when trufflehog is run without a command in a terminal.
To keep the generated output deterministic, the --concurrency flag is
defined with a static "N" placeholder; the runtime.NumCPU() default is
applied at runtime instead of at flag-definition time. The usage line
also uses the lowercase binary name.
Distribution:
- Makefile `man` target to regenerate locally
- GoReleaser before hook to regenerate at release time with the
correct version injected via ldflags
- Release archives explicitly include LICENSE, README.md, and
docs/man/trufflehog.1
- Homebrew formula installs the man page to man1
The Makefile `test-release` target is also updated to GoReleaser v2
flag syntax (--skip=publish,sign), which is needed for the snapshot
release to succeed locally without cosign installed.
Made-with: Cursor
* Add man page maintenance guardrails
Add a CI job that regenerates the man page on every PR and fails if
the checked-in copy at docs/man/trufflehog.1 is out of date with the
current CLI definitions. Document the regeneration workflow in
CONTRIBUTING.md so contributors know to run `make man` and commit
the result when changing flags or subcommands.
Made-with: Cursor
* add jira data center pat detector
* fix lint and engine test
* fix merge
* make protos and fix imports after merge
* tighten regex
* bugbot fix
* embed DefaultMultiPartCredentialProvider
* make protos
* add improvements and bugbot fixes
* make protos after merge
* deduplicate tokens and use DetectorHttpClientWithNoLocalAddresses
* incorporated comments
* added extra endpoint at the right place
* bugbot fixes
* bugbot fix
* Add AnalysisError type and AnalysisErrorInfo interface
Introduce a shared error type that provides structured metadata
(analyzer type, operation, service, resource) for analysis failures.
This allows the scanner to extract context from errors without
depending on concrete types.
* Wrap errors in simple API analyzers with AnalysisError
Batch A: Airbrake, Anthropic, Asana, DigitalOcean, DockerHub,
ElevenLabs, Fastly, Groq, HuggingFace, Mailchimp, Mailgun, Mux,
Netlify, Ngrok, Notion, OpenAI, Opsgenie, Posthog, Postman,
Sendgrid, Sourcegraph.
Wraps credential validation errors with operation
"validate_credentials" and AnalyzePermissions errors with
operation "analyze_permissions".
* Wrap errors in remaining analyzers with AnalysisError (Batches B-E)
Batch B (OAuth/multi-credential): airtableoauth, airtablepat, datadog,
dropbox, figma, launchdarkly, plaid
Batch C (Complex): bitbucket, databricks, github, gitlab, jira, monday,
planetscale, shopify, slack, square, stripe, twilio
Batch D (Database): mysql, postgres (service: Database)
Batch E (PrivateKey): privatekey (service: crypto)
* Use Type().String() and constants for NewAnalysisError calls
Address PR feedback: replace hardcoded analyzer type strings with
a.Type().String() and replace raw operation/service strings with
package-level constants (OperationValidateCredentials,
OperationAnalyzePermissions, ServiceAPI, ServiceConfig, etc.).
* Omit empty resource parenthetical from AnalysisError messages
Conditionally include "(resource: ...)" only when non-empty,
avoiding cluttered messages like "... (resource: ): ..." that
appear for the majority of analyzers that don't set a resource.
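A small Python sketch of the conditional parenthetical described above. The surrounding message wording and the function name are hypothetical; only the rule that "(resource: ...)" is emitted when the resource is non-empty comes from the commit.

```python
def format_analysis_error(analyzer: str, operation: str, service: str,
                          resource: str, cause: str) -> str:
    """Build an error message, adding the resource parenthetical only when set."""
    msg = f"{analyzer}: {operation} failed for {service}"
    if resource:
        msg += f" (resource: {resource})"
    return f"{msg}: {cause}"
```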
* Wrap no-data error path in GitHub analyzer with AnalysisError
* Added tests to ensure that custom endpoint configuration works in artifactory detectors
* updated tests to use testify for assertions instead of vanilla logic
git built from source can report versions like "2.52.gaea8cc3", causing
an index out of range panic. The patch component is unused, so the regex
now captures only major.minor. Extract the helper into a shared pkg/gitcmd
package to remove duplication with the azureapimanagement detector.
Fixes #4801
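The version-parsing fix above can be illustrated with a Python sketch; the shared helper itself is Go. The pattern and function name here are assumptions, and the sketch assumes the input is the `git --version` output line, which begins with "git version".

```python
import re

# Capture only major.minor; whatever follows (patch, build hash) is ignored.
VERSION_RE = re.compile(r"git version (\d+)\.(\d+)")


def parse_git_version(output: str) -> tuple:
    """Return (major, minor) from `git --version` output."""
    m = VERSION_RE.search(output)
    if m is None:
        raise ValueError(f"unrecognized git version output: {output!r}")
    return int(m.group(1)), int(m.group(2))
```

Since the third component is never read, a source-built version such as "2.52.gaea8cc3" parses the same way as a conventional three-part release.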
* deprecated the squareup detector
* regenerated detectors.pb.go after merging latest main changes
---------
Co-authored-by: Charlie Gunyon <charlie.gunyon@trufflesec.com>
* upgrade golangci-lint
* upgrade golangci-lint to v2 via go tool and align local/CI lint configuration
* pinned version script for local/CI parity
* resolve golangci-lint binary via GOPATH to avoid PATH shadowing
* use golangci-lint-action@v7 in CI with prebuilt binary to avoid Go 1.25 toolchain fetch while restricting make lint command to same version as CI
* incorporate feedback
- extract exact semantic version
- check system path first before GOPATH/bin to avoid potential duplication
* Add cross-reference comments for lint version/args sync
Made-with: Cursor
---------
Co-authored-by: Bryan Beverly <bryan.beverly@trufflesec.com>
We want to move the detector types out of the Scanning team's purview. This splits the detector types into their own proto file (so that detector_type.proto can be owned by the Integrations team), regenerates the pb files with "make protos", and switches the detector files to the newly generated detector_type.pb.go. It also adds detector_type.proto to CODEOWNERS and moves the CODEOWNERS categories that cover larger teams toward the top so that the finer-grained ownership entries are applied properly.
* Add HTML decoder for secret detection in HTML-formatted sources
Sources like MS Teams and Confluence emit HTML rather than plain text,
causing secrets split across tags or embedded in attributes to be missed.
This adds an HTML decoder to the pipeline that extracts text nodes,
high-signal attribute values, script/style/comment content, and code blocks.
It handles syntax-highlight boundary detection, zero-width character stripping,
and double-encoded HTML entity decoding.
Made-with: Cursor
* Fix dead code and plus-sign corruption in HTML decoder
- Remove unreachable "xlink:href" map entry: the html parser splits
namespace-prefixed attributes into separate Namespace/Key fields,
so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts
'+' to space per form-encoding spec, corrupting secrets that contain
literal '+' characters (e.g. base64 values, API keys).
Made-with: Cursor
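The plus-sign corruption in the HTML decoder fix above has a direct analogue in Python's standard library, shown here purely as an illustration (the fix itself uses Go's url.PathUnescape in place of url.QueryUnescape). `unquote_plus` applies form-encoding rules and turns '+' into a space, while `unquote` leaves '+' alone. The sample value is made up.

```python
from urllib.parse import unquote, unquote_plus

# A made-up percent-encoded value containing a literal '+'.
encoded = "abc+def%2Fghi%3D"

form_decoded = unquote_plus(encoded)  # form rules: '+' becomes a space
path_decoded = unquote(encoded)       # '+' is preserved
```

Only the second form keeps a base64-style secret intact, which is why the decoder needs the non-form variant.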
* updated comment around syntaxHighlightPrefixes to guide future additions
* removed Enabled func from HTML struct to follow normal flag conventions
* Fix script/style boundary, redundant br check, and raw-text entity corruption
- Add script/style to blockElements so they get newline boundaries
instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping
it for script/style raw-text content where the HTML parser does not
decode entities.
Made-with: Cursor
* fix expired Azure secrets being silently dropped
The Azure Entra service principal v2 detector dropped findings entirely
when Azure returned AADSTS7000222 (secret expired). The `continue
SecretLoop` in ProcessData skipped result creation, causing expired
secrets to vanish from output and stop receiving last_seen updates.
Changed the ErrSecretExpired handler to emit an unverified result with
the expiry error preserved, consistent with how ErrSecretInvalid and
ErrConditionalAccessPolicy are handled.
Made-with: Cursor
* Address PR feedback: treat expired secret as definitively invalid
Reviewers noted that an expired secret (AADSTS7000222) is not an
indeterminate verification state — it is definitively invalid. Pass nil
instead of the verification error when creating the result, and update
tests accordingly.
Made-with: Cursor
Return an explicit error when AnalyzePermissions yields nil info
instead of passing nil to secretInfoToAnalyzerResult. Wrap classic
PAT repo/gist enumeration errors with context for easier debugging.
* fix(azure-refresh-token): handle AADSTS50173 as explicit revocation signal
* added same functionality for ErrTokenExpired
* removed TokenLoop since it is no longer used
* removed the createResult to avoid guessing at the secret information.
Triggers on release publish events to run the release bot, which
generates release notes using GitHub, Jira, and AI services.
Adapted from the thog repo workflow with trufflehog-specific adjustments:
repository argument set to trufflehog, environment requirement removed
in favor of a repo-level secret, permissions restricted, and a fork
guard added for consistency with other trufflehog workflows.
Made-with: Cursor