mirror of
https://github.com/trufflesecurity/trufflehog.git
synced 2026-05-16 13:20:35 +00:00
9bfdb3e85e
* Add HTML decoder for secret detection in HTML-formatted sources

Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding.

Made-with: Cursor

* Fix dead code and plus-sign corruption in HTML decoder

- Remove unreachable "xlink:href" map entry: the html parser splits namespace-prefixed attributes into separate Namespace/Key fields, so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts '+' to space per form-encoding spec, corrupting secrets that contain literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor

* updated comment around syntaxHighlightPrefixes to guide future additions

* removed Enabled func from HTML struct to follow normal flag conventions

* Fix script/style boundary, redundant br check, and raw-text entity corruption

- Add script/style to blockElements so they get newline boundaries instead of concatenating with adjacent inline text.
- Remove redundant `|| n.Data == "br"` since br is already in blockElements.
- Move residual entity decoding into walkNode per text node, skipping it for script/style raw-text content where the HTML parser does not decode entities.

Made-with: Cursor