Files
Mergen/scripts/rewrite/check_themida_equivalence.py
naci 605a36e8ed lifter: correctness fixes, refactors, and regression tests (#205)
* lifter: restore indirect-jump threshold to 128

* gitignore: glob output_*.ll instead of enumerating dumps

Replace output_finalnoopt.ll / output_no_opts.ll entries with
output_*.ll so ad-hoc lifter dumps (output_rets.ll, output_newpath.ll,
etc.) stop showing up in git status.

* lifter: factor REAL_return path through emitResolvedFunctionReturn

Pull the rax-zext + CreateRet + run/finished bookkeeping out of the
REAL_return branch in lift_ret() into a local lambda so future ret
exit points can reuse it without duplicating four lines of
boilerplate.

Drop the dead returnStruct/myStruct scaffolding and the
originalFunc_finalnopt local: every InsertValue call site has been
commented out for a long time and the locals had no remaining uses.
The active code emits a plain rax return.

No behavior change.

* lifter: advance RSP past continuation slot in ret-to-IAT chain

In the chained import-return pattern (`ret` to IAT slot, IAT slot
holds an external function address, the function returns and control
resumes at the next stack slot's continuation address), the lifter
collapses the two pops into a single `call @import; br contBB`. RSP
was only advanced past the IAT slot itself, so post-call register
state still claimed RSP pointed at the continuation address. Any
downstream stack read from RSP saw stale data and any solver that
constant-folded RSP picked up a value that no longer matched the
post-chain physical layout.

Bump RSP by another `ptrSize` immediately before lowering the
import call so the continuation block inherits the same RSP it would
have under a faithful two-pop lowering.

* lifter/test: regression test for ret-to-IAT chain RSP advancement

Locks in dd95fe7. The microtest stands up a LifterUnderTest, plants
[importVA, contVA] on the stack at an RSP that is intentionally NOT
equal to STACKP_VALUE (so the lift_ret REAL_return short-circuit does
not fire), registers the import in the lifter's importMap, and lifts
a single `ret` (0xC3).

It then asserts that:
- the chain handler emitted a direct call to the registered import
- RSP after the chain equals entry RSP + 16, not + 8

Without the fix the test fails with RSP = entry + 8 (only the IAT
slot pop is modeled), exactly the off-by-8 the fix closes.

Verified the test catches the regression by reverting dd95fe7
locally before re-applying — the failing message reads
"RSP after chain = 0x14FDA8; expected 0x14fdb0".

* scripts/themida: filter lifter-synthesized helpers from import diff

Calls to lifter-emitted helpers (`@exception`, `@fastfail`,
`@not_implemented`, etc.) surfaced as 'extra import (not required)'
lines on every Themida equivalence run. They are not user imports;
they are lowered from INT1/INT3/UD2/INT29/SYSCALL/segment-load
sites in the lifter's own semantics files.

Skip them in `_extract_call_names` so the equivalence diff shows
only real imports. The list of helpers lives next to the call regex
so it stays adjacent to the code that emits them; if a new helper
shows up in the IR (e.g. another illegal-instruction lowering) the
script will surface it as an 'extra import' until the entry is added
here, which is the right tripwire.

Before: example2 \xe2\x80\x94 6 distinct imports, 10 calls (3 noise calls)
After:  example2 \xe2\x80\x94 4 distinct imports, 7 calls (clean)

* lifter/analysis: replace 'TODO: fix?' marker with positive explanation

The 2-value path-solving fork's swap branch had a 'TODO: fix?'
comment from the original draft. Traced both branches and confirmed
the swap is correct:

- When the select's trueValue equals firstcase, condition is the
  select's condition as-is and firstcase\xe2\x86\x92bb_true wires correctly.
- When trueValue equals secondcase, condition still expresses 'true
  picks trueValue' but downstream code uses firstcase\xe2\x86\x92bb_true.
  Swapping firstcase\xe2\x86\x94secondcase makes firstcase refer to the trueVal
  constant so the existing CreateCondBr wiring stays correct without
  a parallel reversed-branch path.

Replaced the TODO with a comment that explains why the swap is
necessary, so future readers do not waste time investigating a
branch that is intentional.

* lifter: accept Register64/Memory64 source for punpcklqdq

Iced classifies operand types by the bytes the instruction actually
accesses, not by physical register width. PUNPCKLQDQ only reads the
low 64 bits of its second operand, so Iced reports Register64 (or
Memory64 for the m128 form) for a source whose physical encoding is
`xmm/m128`. The lift handler's accept check rejected anything other
than Register128/Memory128 and fell through to the not_implemented
exit, so every `punpcklqdq xmm, xmm/m128` site lowered to a bogus
`call @not_implemented; ret` instead of the unpack semantic.

Widen the accept set to Register64 and Memory64 too. The body
already truncates the source to i64 before OR'ing it into the high
half of the result, so a 64-bit-typed source is semantically
identical to a 128-bit one for this handler.

Fixes the two pre-existing oracle test failures
`punpcklqdq_xmm0_xmm1_basic` and
`punpcklqdq_xmm0_xmm1_zero_upper_from_zero_source`. `python test.py
all` stays at 244/244, confirming no semantic regressions.

* lifter: replace lift_jmp's fallthrough switch with an isDirectJump if

The RIP-relative add for direct jumps lived inside a 4-case switch
whose body intentionally fell through into `default: break;`. It
worked, but:

- Implicit fallthrough is a -Wimplicit-fallthrough hazard. Today the
  default does nothing; tomorrow someone adds a body and every direct
  jump silently runs it.
- The switch's discriminator is exactly `isDirectJump`, which is
  already computed two lines above for the path-solver context. The
  switch was a parallel restatement of the same predicate.

Collapse the switch into `if (isDirectJump) { trunc = add(trunc,
ripval); }` so the predicate has one definition and there is no
fallthrough to misuse. Behavior unchanged: the same immediate cases
still get the RIP-relative bump, indirect jumps still skip it, and
`python test.py all` stays at 244/244.

* lifter/test: regression test for SSE memory-form handler dispatch

Lock in that pand/por/pxor accept the `xmm, [mem]` encoding form. The
test lifts `66 0F DB 00`, `66 0F EB 00`, and `66 0F EF 00` (one
`xmm0, [rax]` site each) and asserts that the lifted function does
not contain a direct call to @not_implemented.

Pure structural acceptance: not validating bitwise-AND/OR/XOR
semantics, only that the handler dispatched at all. Iced today
reports Memory128 for these encodings so the test passes against the
existing `Register128 || Memory128` accept sets. If a future Iced
update reclassifies the source operand by bytes-actually-accessed
(the way it already does for punpcklqdq, where it reports
Register64/Memory64 even for an `xmm/m128` encoding) the handler
would silently fall through to `call @not_implemented; ret` and
miscompile every memory-form site \u2014 this test trips first.

* lifter: drop duplicate stdout print on unresolved indirect jmp

`lift_jmp` printed every UnresolvedIndirectJump twice: once as a raw
`std::cout << "[diag] lift_jmp: ..."` and once through
`diagnostics.warning(...)` on the very next line. The diagnostics
framework already persists the warning to `output_diagnostics.json`
at lift completion, and no script or test grep'd the stdout form.

Drop the std::cout. The diagnostic remains in the recorded diagnostics
list, surfaceable via the JSON dump or the in-memory entries vector.
This removes the only unguarded raw `[diag]` print in the lift path
-- the rest are gated on `liftProgressDiagEnabled` or specific hot
addresses for active debugging.

* scripts/themida: fix docstring escape leak in import-filter doc

Audit of #205 caught a literal `\\u2014` and unnecessary
`\\"` escapes in the `_extract_call_names` docstring \xe2\x80\x94 leftovers
from how the surrounding commit (#205, scripts/themida: filter
lifter-synthesized helpers) was authored. Replace the literal
escape with a plain `--` and drop the redundant backslash-quotes;
the docstring now renders cleanly at `help(_extract_call_names)`
and looks normal in the source.

Behavior unchanged: `python test.py themida` still passes with
the same import-diff filter (4 imports, 7 calls for example2).

---------

Co-authored-by: yusufcanislek <yusuf.canislek@meetdandy.com>
2026-05-02 11:58:47 +03:00

265 lines
8.8 KiB
Python

#!/usr/bin/env python3
"""Themida devirtualization semantic-equivalence check.
Premise: a virtualized function, after devirtualization, must call the same
external imports as its non-virtualized counterpart. Structural divergence in
the lifted IR (block count, SSA names, pass-ordering artifacts) is allowed;
semantic divergence at the import boundary is not.
For each entry in ``themida_samples.json``:
- lift the virtualized binary
- extract external call names from the resulting IR
- compare against the manifest's ``required_imports`` list
- fail hard on any required import that is absent
Use ``--update`` to regenerate ``required_imports`` from the reference binary.
Samples whose binaries are not present on disk are skipped, not failed, because
the binaries live outside the repository (``../testthemida/``) and are not
available in CI.
"""
from __future__ import annotations
import argparse
import json
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Set, Tuple
ROOT = Path(__file__).resolve().parents[2]
LIFTER = ROOT / "build_iced" / "lifter.exe"
DEFAULT_MANIFEST = Path(__file__).resolve().parent / "themida_samples.json"
WORKROOT = ROOT.parent / "rewrite-regression-work" / "themida_ir"
# Matches a `call` that targets a named function:
# %N = call i64 @GetStdHandle(...)
# call void @"VirtualizerSDK64.dll#103"(...)
# tail call ptr @foo(...)
# Captures the callee identifier (unquoted or the contents of the quotes).
_CALL_RE = re.compile(
r'''^\s*(?:%\S+\s*=\s*)?(?:tail\s+|musttail\s+|notail\s+)?'''
r'''call\s+\S+\s+@(?:"([^"]+)"|([A-Za-z_][\w.]*))\s*\(''',
re.MULTILINE,
)
def _lift(binary: Path, entry: str, workdir: Path) -> Path:
if not LIFTER.exists():
raise SystemExit(
f"Lifter not found: {LIFTER}. Build it with "
"'cmd /c scripts\\dev\\build_iced.cmd' first."
)
if not binary.exists():
raise SystemExit(f"Binary not found: {binary}")
workdir.mkdir(parents=True, exist_ok=True)
# Clear previous artifacts so a stale file can't mask a new failure.
for stale in ("output.ll", "output_no_opts.ll", "output_diagnostics.json"):
(workdir / stale).unlink(missing_ok=True)
result = subprocess.run(
[str(LIFTER), str(binary), entry],
cwd=workdir,
env=os.environ.copy(),
capture_output=True,
text=True,
timeout=300,
)
ir_path = workdir / "output_no_opts.ll"
if not ir_path.exists():
tail_stdout = (result.stdout or "")[-600:]
tail_stderr = (result.stderr or "")[-600:]
raise SystemExit(
f"Lifter did not emit output_no_opts.ll for {binary} @ {entry}\n"
f"exit_code={result.returncode}\n"
f"stdout-tail:\n{tail_stdout}\n"
f"stderr-tail:\n{tail_stderr}"
)
return ir_path
# Lifter-synthesized helper names that appear as `call @<name>` in the IR but
# are not user imports — emitted by INT/UD2/syscall/segment-load lowering.
# Keeping this list close to the call extractor so it stays in sync with the
# semantics files that emit them (Semantics.ipp, Semantics_Misc.ipp, etc.).
_LIFTER_SYNTH_HELPERS = frozenset({
"exception", # INT1 / INT3 / UD2
"fastfail", # INT29
"not_implemented", # many fallbacks (SCAS/REP, SYSCALL, etc.)
"invalid", # illegal-instruction path
"loadGS", # GS segment register read
"loadDS", # DS segment register read
"pext", # BMI2 PEXT pseudo-intrinsic
})
def _extract_call_names(ir_text: str) -> Dict[str, int]:
"""Return a multiset of call-target identifiers found in IR text.
Intramodule calls to ``@main`` and outlined ``@sub_*`` thunks are excluded
-- we only care about named external imports that the lifter resolved.
Lifter-synthesized helper calls (``@exception``, ``@fastfail``, etc.) are
also excluded so they do not surface as bogus "extra import" diffs.
"""
counts: Dict[str, int] = {}
for match in _CALL_RE.finditer(ir_text):
name = match.group(1) or match.group(2)
if not name:
continue
if name == "main" or name.startswith("sub_") or name.startswith("llvm."):
continue
if name in _LIFTER_SYNTH_HELPERS:
continue
counts[name] = counts.get(name, 0) + 1
return counts
def _filter_imports(imports: Dict[str, int], ignore_patterns: List[str]) -> Dict[str, int]:
if not ignore_patterns:
return dict(imports)
compiled = [re.compile(p) for p in ignore_patterns]
return {k: v for k, v in imports.items() if not any(p.search(k) for p in compiled)}
def _diff(
required: Set[str],
actual: Dict[str, int],
) -> Tuple[List[str], bool]:
lines: List[str] = []
ok = True
missing = sorted(required - set(actual))
if missing:
ok = False
for name in missing:
lines.append(f" MISSING required import: @{name}")
extra = sorted(set(actual) - required)
# Extras are informational — a smarter devirtualizer may legitimately
# surface additional imports. Never fatal.
for name in extra:
lines.append(f" extra import (not required): @{name} x{actual[name]}")
return lines, ok
def _check_sample(sample: dict) -> bool:
name = sample["name"]
virt_rel = sample["virt_binary"]
virt = (ROOT / virt_rel).resolve()
entry = sample["entry"]
ignore = sample.get("ignore_imports", [])
required = set(sample.get("required_imports", []))
if not virt.exists():
print(f"SKIP: {name} — virt binary not present at {virt}")
return True
if not required:
print(
f"FAIL: {name} — manifest has no required_imports; "
f"run `python test.py themida --update` first."
)
return False
ir_path = _lift(virt, entry, WORKROOT / name)
ir_text = ir_path.read_text(encoding="utf-8", errors="replace")
actual = _filter_imports(_extract_call_names(ir_text), ignore)
lines, ok = _diff(required, actual)
summary = (
f"{len(actual)} distinct imports, {sum(actual.values())} calls "
f"(required {len(required)})"
)
if ok:
print(f"PASS: {name}{summary}")
for line in lines:
print(line)
else:
print(f"FAIL: {name}{summary}")
for line in lines:
print(line)
req_str = ", ".join(sorted(required)) or "(none)"
act_str = ", ".join(f"{k}x{v}" for k, v in sorted(actual.items())) or "(none)"
print(f" required: {req_str}")
print(f" actual: {act_str}")
return ok
def _update_sample(sample: dict) -> dict:
name = sample["name"]
ref_rel = sample.get("reference_binary")
entry = sample["entry"]
ignore = sample.get("ignore_imports", [])
if not ref_rel:
print(f"SKIP update: {name} — no reference_binary in manifest")
return sample
ref = (ROOT / ref_rel).resolve()
if not ref.exists():
print(f"SKIP update: {name} — reference binary not present at {ref}")
return sample
ir_path = _lift(ref, entry, WORKROOT / f"{name}_reference")
ir_text = ir_path.read_text(encoding="utf-8", errors="replace")
imports = _filter_imports(_extract_call_names(ir_text), ignore)
updated = dict(sample)
updated["required_imports"] = sorted(imports.keys())
print(
f"UPDATED {name}: {len(imports)} required imports from {ref.name} "
f"({sum(imports.values())} total calls in reference)"
)
return updated
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--manifest", type=Path, default=DEFAULT_MANIFEST)
parser.add_argument(
"--update",
action="store_true",
help="regenerate required_imports from reference binaries",
)
parser.add_argument(
"filter",
nargs="*",
help="only process samples whose name contains any of these tokens",
)
args = parser.parse_args()
manifest = json.loads(args.manifest.read_text(encoding="utf-8"))
all_samples = manifest["samples"]
def matches(sample: dict) -> bool:
return not args.filter or any(tok in sample["name"] for tok in args.filter)
if args.update:
manifest["samples"] = [
_update_sample(s) if matches(s) else s for s in all_samples
]
args.manifest.write_text(
json.dumps(manifest, indent=2) + "\n", encoding="utf-8"
)
print(f"Wrote {args.manifest}")
return
selected = [s for s in all_samples if matches(s)]
if not selected:
raise SystemExit("No samples matched filter")
all_ok = True
for s in selected:
all_ok &= _check_sample(s)
if not all_ok:
raise SystemExit("Themida equivalence check FAILED")
print("All Themida equivalence checks passed.")
if __name__ == "__main__":
main()