15 Commits

Author SHA1 Message Date
naci c8102a69cf themida: correctness gate, diagnostic tracer, ret-to-IAT recognition, gen revisit knob (#182)
* tests: add Themida devirtualization import-equivalence check

Adds python test.py themida that lifts every sample in
scripts/rewrite/themida_samples.json and asserts the resulting IR calls
every import declared in required_imports. Names are pinned against
a lift of the non-virtualized reference binary via --update.

This is a correctness gate that complements the existing
coverage gate ('2544 instructions, 0 errors'). Currently red on
example2-virt.bin: the lifter unrolls the VM without surfacing
GetStdHandle / WriteConsoleA / ReadConsoleA / CharUpperA from the
guest program. That gap is the active devirtualization frontier;
this test makes it visible instead of silently green.

Samples whose binaries are absent (`../testthemida/*.bin` lives
outside the repo) are skipped rather than failed, so the check
runs cleanly in CI without the binaries present.

* diag: add Unicorn-based external-call tracer; document Themida transform

Adds scripts/dev/trace_external_calls.py: loads a PE into Unicorn,
patches every IAT slot with a unique unmapped-address sentinel, then
emulates from the chosen entry. When any call/jmp/ret resolves its
target to a sentinel, logs the call-site address, the mnemonic, and
the addressing form. One-shot diagnostic for answering 'what x86
instruction issues this external call at runtime.'

Using it on example2-virt.bin shows the Themida transform precisely:
- guest imports (GetStdHandle etc) remain in the IAT
- every guest call site is rewritten from 'call [rip+IAT]' to a
  VM-staged 'push target; ret' where target was loaded from the
  IAT upstream
- for example2, the first external call happens at VA 0x14017fa77
  via 'ret 0', popping the GetStdHandle IAT value off the stack
- Themida strips its own SDK markers (VirtualizerSDK64.dll#103/#503)
  from the IAT; our ignore_imports filter already accounts for this

The lifter's current recognition handles direct call-through-IAT
and register-indirect IAT calls (the non-virt binary resolves 5
imports cleanly). It does not recognize the ret-pops-IAT-loaded-
pointer pattern, which is why the virt lift surfaces zero imports.

Also annotates themida_samples.json with these properties inline
so the transform semantics live next to the test that exercises
them.

* diag: trace_external_calls can dump visited PCs and record sentinel push chain

Two additions, both motivated by the example2-virt.bin diagnosis session:

- --dump-visited <path>: writes every unique instruction PC the emulator
  executes, in first-visit order. Diff against the lifter's 'reached
  addresses' trace (MERGEN_DIAG_LIFT_PROGRESS=1) to localise where the
  lifter's static exploration diverges from the dynamic path.

- UC_HOOK_MEM_WRITE for stack-addressed 8-byte writes whose payload is a
  sentinel. Records every such write, not just the first, because Themida
  uses push-pop swap gadgets that stage a sentinel on the stack
  transiently before the 'real' push lands it at the ret-target slot.
  The last-5-pushes summary exposes this.

Findings for example2-virt.bin @ 0x140001000:
- lifter covers emu_pos=0..1298 out of 4210 unique PCs (~30%)
- external call site is at emu_pos=4209; gap of 2911 unvisited PCs
- lifter visits 5 addresses the runtime never takes (wrong concolic branch)
- the 'final push to ret slot' is not a 'push [iat]' but rather
  'sub qword ptr [r14], <const>' — the VM decrypts a pre-staged
  stack slot in place to reconstruct the IAT pointer. Pattern-match
  recognition alone cannot handle this; concrete VM-dispatch unrolling
  is required.

* diag: add MERGEN_NO_LOOP_GEN env gate for loop-generalization

Adds an env-var toggle at the top of canGeneralizeStructuredLoopHeader.
When MERGEN_NO_LOOP_GEN=1, the gate rejects every header, forcing
pure concrete exploration with no phi-widening abstraction.

Diagnostic knob, not a user-facing feature. Used to localise how much
of a lift's coverage depends on generalization vs. the concolic engine.

Measurement on example2-virt.bin @ 0x140001000:

                            gen ON        gen OFF (NO_LOOP_GEN=1)
  blocks_attempted              56          2642   (47x)
  instructions_lifted         2544         34229   (13.5x)
  output_no_opts.ll lines     6022         30481   (5x)
  unique addrs visited          34           338   (10x)
  addrs in 0x14017xxxx           0           103   (call-handler cluster)
  external call site reached:   no           yes (via BB 0x14017fa72)
  themida equivalence test:    red           red (recognition still gap)

Loop-generalization is the dominant reachability blocker on Themida
VM dispatchers at current tuning. Pure concrete exploration reaches
the external-call handler block but does not emit named import calls
because lift_ret has no path to match a resolved ret target against
importMap. Recognition is the next fix surface; reachability is large
mostly because of generalization tuning.

Side-effects of gen OFF that are NOT acceptable in production:
 - Lifter decodes .rdata IAT bytes as instructions (OUTSD error at
   0x140002688 on this sample)
 - Top-revisited addresses hit ~1142x each: the lifter spins in
   tight loops without generalization cutting them off; block budget
   (4096) would fire eventually on a larger sample

So the knob is purely diagnostic. The real production fix is selective
generalization (distinguish 'VM dispatcher' from 'guest loop') plus
lift_ret import recognition.

* lifter: recognize ret-to-IAT as named external call in lift_ret

Adds a recognition path in lift_ret: if the value being popped resolves
to a concrete address that's in importMap, emit callFunctionIR for the
named import, then simulate the external's own ret by popping one more
qword (the continuation address pre-staged by the caller). solvePath
then continues at the continuation instead of trying to lift the IAT
pointer as code.

Two resolution routes:
  1. realval is a ConstantInt (direct push+ret of an IAT load)
  2. realval is symbolic but computePossibleValues folds to a single
     concrete value (obfuscated chains that constant-fold at this path)

Scope limits:
- Non-virt example2.bin lift is unchanged (still resolves 5 imports
  via register-indirect path; the new ret path does not fire because
  the binary uses 'call [iat]', not 'push+ret').
- Virt example2-virt.bin lift: the recognition code runs but does not
  surface imports because the lifter's static resolution of the
  arithmetic-decrypt chain produces wrong concrete targets. E.g. the
  ret at 0x14017fa77 resolves to 0x140002628 (somewhere in .rdata) via
  computePossibleValues; at runtime the emulator sees it pop the
  GetStdHandle IAT pointer (0x140002490). The recognition logic is
  correct; the upstream data flow is lying. Fixing that requires
  selective-generalization tuning or concrete VM unrolling, tracked
  separately.

So β lands as ground work for simpler push+ret thunks and for future
work where state-propagation fidelity improves. It is not a Themida
fix on its own.

* lifter: gate canGeneralize on per-header revisit count

Adds a revisit-count threshold to canGeneralizeStructuredLoopHeader:
below threshold N the gate rejects (concrete exploration continues);
at or above N it falls through to the existing loop-shape checks.
Tunable via MERGEN_GEN_MIN_REVISITS; default is 0 (inert, matches
pre-existing behaviour).

Also promotes ++liftAttemptCounts[addr] out from under the
liftProgressDiagEnabled gate so the counter is always maintained.

Rationale: on Themida example2-virt.bin @ 0x140001000, the existing
gate (always-generalize on first qualifying revisit) abstracts the
VM's dispatch loop too early, cutting reachability to ~30% of the
dynamic execution path. A higher threshold lets the dispatcher run
concretely for more iterations before abstracting. Measurement (all
other settings at defaults):

  T=0   (current)  blocks=   56   insns=  2544   err=0  warn=0
  T=4              blocks=   88   insns=  3842   err=0  warn=4
  T=16             blocks=  393   insns= 11747   err=1  warn=0
  T=32             blocks=  425   insns= 12067   err=1  warn=0
  T=128            blocks=  617   insns= 13987   err=1  warn=0
  MERGEN_NO_LOOP_GEN=1 (kill)
                   blocks= 2642   insns= 34229   err=1  warn=0

Caveat: at T=6, T=8, T=12 the lifter crashes with an access violation
partway through lifting. The crash fires in the Themida dispatcher
state machinery around 0x1400237F9 when generalization fires mid-
iteration with state that the existing machinery is not prepared to
handle. Other nearby T values (T=5, 7, 9, 10, 11, 13-19) are stable.

So the knob is landing as experimental infrastructure with default=0
(no-op). Future work can pair a safe non-zero default with a fix for
the dispatcher-state crash.

---------

Co-authored-by: Claude <claude@anthropic.com>
2026-04-24 14:54:22 +03:00
yusufcanislek 8fba033cc6 Fix VMP gate and loop safety 2026-04-03 15:00:42 +03:00
yusufcanislek bb1af2c073 fix: un-skip calc_cout test — .pdata auto-outline already handles it
The calc_cout test was skipped since Phase 1 with the assumption that
statically-linked STL calls required a new inline policy. In reality,
the .pdata auto-outline (PR #78) already registers all STL functions
as outline targets, and the lifter correctly outlines the operator<<
call and lifts the pure computation (x*3+7).

The test was failing because it used the wrong symbol name ('calc_cout'
instead of the MSVC-mangled '?calc_cout@@YAHH@Z').

Changes:
- Un-skip calc_cout, fix symbol to mangled name, add 4 semantic cases
- Add _INTTOPTR_CALL_RE to strip outlined calls to concrete addresses
  (they segfault in lli but are provably dead for return value)
- Golden hashes: 42 -> 44 files

Zero skipped tests remaining. 29/29 samples, 150 semantic cases.
2026-03-29 12:07:26 +03:00
yusufcanislek 1020775ec0 feat: prototype minimization + canonical IR naming
Two new post-optimization passes that run after the final O2 pipeline:

PrototypeMinimizationPass:
- Removes unused function arguments based on Argument::use_empty()
- Typical reduction: 34 params -> 0-2 (e.g. @main(i64 %RCX) instead of all 16 GPRs + 16 XMMs + 2 ptrs)
- Splices basic blocks into new function, remaps argument uses, erases old function
- Updated check_semantic.py to parse actual IR signatures instead of hardcoded 34-param list

CanonicalNamingPass:
- Strips address-derived suffixes from block/value names for deterministic output
- Blocks: entry, bb1, bb2, ... (sequential)
- Values: semantic prefix preserved, address suffix removed (realadd-5368713230- -> realadd)
- Same input now produces byte-identical IR across rebuilds

Also fixed writeFunctionToFile to use stored module pointer M instead of
fnc->getParent() (dangling after prototype minimization erases the old function).

Review fixes:
- CanonicalNamingPass: use StringMap<unsigned> instead of DenseMap<StringRef> (dangling key)
- PrototypeMinimizationPass: restrict call rewriting to CallInst (not InvokeInst/CallBrInst)
- PrototypeMinimizationPass: guard F->eraseFromParent() with use_empty() check
- check_semantic.py: widen define regex to handle dso_local and other prefixes

All 28 samples pass, 146 semantic cases, 56 golden hashes updated.
2026-03-29 11:00:07 +03:00
yusufcanislek 99b9316d4a fix: exclude C-compiled samples from golden hash determinism check
C-compiled samples (calc_fib, calc_grade, calc_jumptable, calc_mixed,
calc_sum_array, calc_switch) produce toolchain-dependent binaries with
different virtual addresses. The lifted IR contains these addresses, so
IR hashes vary across build environments.

- Remove calc_* entries from golden_ir_hashes.json (34 ASM-stable entries remain)
- Update check_determinism() to only validate files in the golden set;
  extra IR files from C-compiled samples are reported as skipped
- Correctness for C-compiled samples is enforced by semantic tests
  (107 cases verifying actual computed return values via lli)
2026-03-26 08:02:38 +03:00
yusufcanislek eb10474eb8 feat: commit working-tree changes required by rewrite gates
Lifter improvements:
- PathSolver.ipp: enhanced path memoization, switch-target diagnostics
- GEPTracker.ipp: expanded value tracking, graceful bail-out paths
- Semantics_Misc.ipp: clean up CPUID handler (remove dead comments,
  simplify constant emission)

Rewrite infrastructure:
- instruction_microtests.json: add jumptable manifest entries
  (calc_jumptable, jumptable_basic, jumptable_dense) with semantic cases
- golden_ir_hashes.json: add hashes for new jumptable samples
- build_samples.cmd: support C jumptable /O2 compilation pass
- oracle vectors: regenerated (oracle_vectors.json trimmed to current
  seed set, full-handler vectors updated with new handlers)
- run_microtests.cmd / run_all_handlers.cmd: script improvements
- test.py: add jumptable semantic cases to coverage

Dev scripts:
- configure_iced/zydis.cmd, build_iced/zydis.cmd: improved toolchain
  detection and MERGEN_BUILD_JOBS support

Review automation:
- format_comment.py, invariant_guard.py, risk_map.py, shard_pr.py:
  minor fixes aligned with verify_plan public API rename

Docs:
- REWRITE_BASELINE.md: updated coverage summary and script docs
- REVIEWER_RULES.md: minor formatting
2026-03-26 07:53:43 +03:00
yusufcanislek eb5589f7d1 rewrite: centralize manifest validation and add negative contract checks 2026-03-19 01:59:20 +03:00
yusufcanislek 1ed00cc67e Refactor: reorganize lifter/ into subdirectories with PascalCase naming
Directory structure:
  lifter/core/       - LifterClass, pipeline, drivers, application, utils
  lifter/semantics/  - Semantics*.ipp, OperandUtils.ipp, opcodes
  lifter/disasm/     - Disassembler backends, mnemonic/register mappings
  lifter/memory/     - GEPTracker, MemoryPolicy, FileReader
  lifter/analysis/   - PathSolver, CustomPasses
  lifter/test/       - TestInstructions, Tester, test_vectors/

Naming convention standardized to PascalCase:
  fileReader.hpp     -> FileReader.hpp
  lifterClass.hpp    -> LifterClass.hpp
  icedDisassembler*  -> IcedDisassembler*
  utils.h/cpp        -> Utils.h/cpp
  includes.h         -> Includes.h
  pp_macros.hpp      -> PPMacros.hpp
  test_instructions* -> TestInstructions*
  tester.hpp         -> Tester.hpp

Include resolution uses cmake include-directories so no
path prefixes needed in #include directives. All script
paths updated for new test_vectors and opcodes locations.
2026-03-06 18:07:26 +03:00
yusufcanislek 45b3f570bf Address review blockers on switch default and symbolic reads 2026-03-06 16:42:14 +03:00
yusufcanislek 4feb9abd47 Address PR feedback on path solver and determinism gate 2026-03-06 15:16:53 +03:00
yusufcanislek 33f24ed0fc Fix InstructionCache DenseMap corruption: empty/tombstone keys were identical
The InstructionKey::InstructionKeyInfo had getEmptyKey() and getTombstoneKey()
both returning InstructionKey(nullptr, nullptr). LLVM DenseMap requires these
to be distinct sentinel values. This violated the DenseMap contract, causing
bucket corruption during copy/iteration (the old FIXME about 'last item
corrupted').

Fix: use reinterpret_cast sentinel pointers -1 and -2, matching LLVM convention.
Also cleaned up the non-const copy constructor (removed dead local copy and
stale FIXME comment).

Also adds:
- switch_sparse.asm test (non-consecutive case values: 10, 50, 200, 1000)
- calc_cout.cpp test (skipped - documents inline policy limitation with STL)
- C++ compilation support in build_samples.cmd
- Skip mechanism for manifest entries (skip: true + skip_reason)
- Fix test.py update-golden to not run determinism check before updating

68 pattern checks, 40 golden hashes, 108 handler microtests — all green.
2026-03-06 00:47:45 +03:00
yusufcanislek a67bcf3ee2 Add C test binaries, NASM test cases, deterministic IR hashing, SCOPE doc
Test infra:
- test.py: flag checks always-on for quick/all; deterministic IR hash
  verification via SHA-256; update-golden subcommand
- run.ps1: accept both .asm and .c source files in manifest validation
- build_samples.cmd: compile C files with cl.exe /Od /GS- alongside NASM
- CI: rewrite-strict-gate.yml uses test.py defaults (flags always on)

New test cases (10 total):
- 6 NASM: nested_branch, loop_simple, bitchain, multi_arg, diamond, cmov_chain
- 4 C (MSVC /Od): calc_grade (5-way branch), calc_mixed (symbolic+concrete),
  calc_fib (loop->const fold to 13), calc_sum_array (array->const fold to 150)

Manifest: 17 samples, 40 pattern checks
Golden hashes: 34 .ll files (17 optimized + 17 unoptimized)
Handler microtests: 108/111 (97.3%), flags enforced

Docs:
- docs/SCOPE.md: supported/unsupported pattern matrix
2026-03-05 20:31:53 +03:00
yusufcanislek 7362486e82 Address PR review: fail-fast oracle pipeline, stable shift vectors, stricter validation
- Workflow: enforce per-package choco install exit-code checks in rewrite gate
- Seed builder:
  - make sar/shl/shr overrides use immediate count=1 (stable OF semantics)
  - merge DEFAULT_INITIAL into all smoke cases
  - define explicit default flag state (CF/PF/AF/ZF/SF/OF/DF=0, IF=1)
- Enrichment:
  - validate seed schema and cases array
  - validate expected payload type
  - strict computed-helper input checks (required RSI/RDI/RBP where needed)
  - reject malformed initial/register/flag objects
- Oracle generation:
  - fail fast on emulation errors (no silent skip downgrade)
  - validate expected.registers/expected.flags are objects for none/computed
- Lifter/tests:
  - reset hadConditionalBranch/lastBranchTaken per testcase
  - disambiguate branchHelper when true/false destinations are equal
  - reject invalid expected.branch_taken types and non {0,1} ints
- Coverage/reporting:
  - count covered handlers only within opcode universe
  - guard text coverage percentage against divide-by-zero
  - normalize test.py report --vectors path relative to repo root
- Regenerated oracle seeds/vectors and updated full-handler vectors
  - oracle_vectors_full_handlers.json now: 130 cases, 3 skipped (cpuid, rdtsc, ret)
2026-03-04 12:08:00 +03:00
yusufcanislek 9ecd4b68e0 Expand handler test coverage to 75/111 and fix lzcnt/tzcnt/xadd bugs
Semantic fixes:
- lzcnt/tzcnt: inline constant-fold ctlz/cttz intrinsics when source is
  ConstantInt (bypasses dead simplifyValue function)
- xadd: fix register aliasing bug when src==dest by swapping write order
  (write SRC=original first, then DEST=sum last)

Test infrastructure:
- Add enrich_seed.py: auto-populates expected registers/flags for all
  auto-discovered handler cases based on Capstone disassembly analysis
- Add report_coverage.py: handler coverage report (text + JSON)
- Promote full handler oracle vectors as default oracle_vectors.json
- Update run_all_handlers.cmd pipeline with enrichment step
- Add 'report' subcommand to test.py

Coverage: 75/111 handlers tested with oracle (68%), 35 skipped (control
flow/stack/system instructions requiring special setup), 1 uncovered (cli).
79 active test cases pass register checks, 11 have flag mismatches (tracked).
2026-03-04 09:40:33 +03:00
yusufcanislek 567e0d7daf Add rewrite regression automation, vectors, and documentation 2026-03-03 23:04:21 +03:00