trufflehog/pkg
Drew LaFiandra 9bfdb3e85e Add HTML decoder for secret detection in HTML-formatted sources (#4840)
* Add HTML decoder for secret detection in HTML-formatted sources

Sources like MS Teams and Confluence emit HTML rather than plain text,
causing secrets split across tags or embedded in attributes to be missed.
This adds an HTML decoder to the pipeline that extracts text nodes,
high-signal attribute values, script/style/comment content, and code blocks.
It handles syntax-highlight boundary detection, zero-width character stripping,
and double-encoded HTML entity decoding.
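
A minimal sketch of the last two steps named above (zero-width character stripping and double-encoded entity decoding), using only the Go standard library. The helper name `cleanText`, the particular set of zero-width code points, and the two-pass limit are assumptions made for illustration; they are not taken from the decoder in this commit.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// zeroWidthReplacer drops a few common zero-width code points.
// The exact set is an assumption; the real decoder may cover others.
var zeroWidthReplacer = strings.NewReplacer(
	"\u200b", "", // zero width space
	"\u200c", "", // zero width non-joiner
	"\u200d", "", // zero width joiner
	"\ufeff", "", // zero width no-break space (BOM)
)

// cleanText is a hypothetical helper: it strips zero-width characters,
// then unescapes HTML entities at most twice so that a double-encoded
// value such as "&amp;#43;" ends up as "+".
func cleanText(s string) string {
	s = zeroWidthReplacer.Replace(s)
	for i := 0; i < 2; i++ {
		next := html.UnescapeString(s)
		if next == s {
			break // nothing left to decode
		}
		s = next
	}
	return s
}

func main() {
	fmt.Println(cleanText("AK\u200bIA&amp;#43;9")) // AKIA+9
}
```

Bounding the loop at two passes keeps the sketch aligned with the "double-encoded" case described here and avoids decoding further than the source text intended.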

Made-with: Cursor

* Fix dead code and plus-sign corruption in HTML decoder

- Remove unreachable "xlink:href" map entry: the html parser splits
  namespace-prefixed attributes into separate Namespace/Key fields,
  so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts
  '+' to space per form-encoding spec, corrupting secrets that contain
  literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor

* Update comment on syntaxHighlightPrefixes to guide future additions

* Remove Enabled func from HTML struct to follow normal flag conventions

* Fix script/style boundary, redundant br check, and raw-text entity corruption

- Add script/style to blockElements so they get newline boundaries
  instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping
  it for script/style raw-text content where the HTML parser does not
  decode entities.

Made-with: Cursor
2026-04-07 13:24:51 -07:00