Files
Mergen/docs/LOOP_HANDLING.md
T
naci c8102a69cf themida: correctness gate, diagnostic tracer, ret-to-IAT recognition, gen revisit knob (#182)
* tests: add Themida devirtualization import-equivalence check

Adds python test.py themida that lifts every sample in
scripts/rewrite/themida_samples.json and asserts the resulting IR calls
every import declared in required_imports. Names are pinned against
a lift of the non-virtualized reference binary via --update.

This is a correctness gate that complements the existing
coverage gate ('2544 instructions, 0 errors'). Currently red on
example2-virt.bin: the lifter unrolls the VM without surfacing
GetStdHandle / WriteConsoleA / ReadConsoleA / CharUpperA from the
guest program. That gap is the active devirtualization frontier;
this test makes it visible instead of silently green.

Samples whose binaries are absent (`../testthemida/*.bin` lives
outside the repo) are skipped rather than failed, so the check
runs cleanly in CI without the binaries present.

* diag: add Unicorn-based external-call tracer; document Themida transform

Adds scripts/dev/trace_external_calls.py: loads a PE into Unicorn,
patches every IAT slot with a unique unmapped-address sentinel, then
emulates from the chosen entry. When any call/jmp/ret resolves its
target to a sentinel, logs the call-site address, the mnemonic, and
the addressing form. One-shot diagnostic for answering 'what x86
instruction issues this external call at runtime.'

Using it on example2-virt.bin shows the Themida transform precisely:
- guest imports (GetStdHandle etc) remain in the IAT
- every guest call site is rewritten from 'call [rip+IAT]' to a
  VM-staged 'push target; ret' where target was loaded from the
  IAT upstream
- for example2, the first external call happens at VA 0x14017fa77
  via 'ret 0', popping the GetStdHandle IAT value off the stack
- Themida strips its own SDK markers (VirtualizerSDK64.dll#103/#503)
  from the IAT; our ignore_imports filter already accounts for this

The lifter's current recognition handles direct call-through-IAT
and register-indirect IAT calls (the non-virt binary resolves 5
imports cleanly). It does not recognize the ret-pops-IAT-loaded-
pointer pattern, which is why the virt lift surfaces zero imports.

Also annotates themida_samples.json with these properties inline
so the transform semantics live next to the test that exercises
them.

* diag: trace_external_calls can dump visited PCs and record sentinel push chain

Two additions, both motivated by the example2-virt.bin diagnosis session:

- --dump-visited <path>: writes every unique instruction PC the emulator
  executes, in first-visit order. Diff against the lifter's 'reached
  addresses' trace (MERGEN_DIAG_LIFT_PROGRESS=1) to localise where the
  lifter's static exploration diverges from the dynamic path.

- UC_HOOK_MEM_WRITE for stack-addressed 8-byte writes whose payload is a
  sentinel. Records every such write, not just the first, because Themida
  uses push-pop swap gadgets that stage a sentinel on the stack
  transiently before the 'real' push lands it at the ret-target slot.
  The last-5-pushes summary exposes this.

Findings for example2-virt.bin @ 0x140001000:
- lifter covers emu_pos=0..1298 out of 4210 unique PCs (~30%)
- external call site is at emu_pos=4209; gap of 2911 unvisited PCs
- lifter visits 5 addresses the runtime never takes (wrong concolic branch)
- the 'final push to ret slot' is not a 'push [iat]' but rather
  'sub qword ptr [r14], <const>' — the VM decrypts a pre-staged
  stack slot in place to reconstruct the IAT pointer. Pattern-match
  recognition alone cannot handle this; concrete VM-dispatch unrolling
  is required.

* diag: add MERGEN_NO_LOOP_GEN env gate for loop-generalization

Adds an env-var toggle at the top of canGeneralizeStructuredLoopHeader.
When MERGEN_NO_LOOP_GEN=1, the gate rejects every header, forcing
pure concrete exploration with no phi-widening abstraction.

Diagnostic knob, not a user-facing feature. Used to localise how much
of a lift's coverage depends on generalization vs. the concolic engine.

Measurement on example2-virt.bin @ 0x140001000:

                            gen ON        gen OFF (NO_LOOP_GEN=1)
  blocks_attempted              56          2642   (47x)
  instructions_lifted         2544         34229   (13.5x)
  output_no_opts.ll lines     6022         30481   (5x)
  unique addrs visited          34           338   (10x)
  addrs in 0x14017xxxx           0           103   (call-handler cluster)
  external call site reached:   no           yes (via BB 0x14017fa72)
  themida equivalence test:    red           red (recognition still gap)

Loop-generalization is the dominant reachability blocker on Themida
VM dispatchers at current tuning. Pure concrete exploration reaches
the external-call handler block but does not emit named import calls
because lift_ret has no path to match a resolved ret target against
importMap. Recognition is the next fix surface; reachability is large
mostly because of generalization tuning.

Side-effects of gen OFF that are NOT acceptable in production:
 - Lifter decodes .rdata IAT bytes as instructions (OUTSD error at
   0x140002688 on this sample)
 - Top-revisited addresses hit ~1142x each: the lifter spins in
   tight loops without generalization cutting them off; block budget
   (4096) would fire eventually on a larger sample

So the knob is purely diagnostic. The real production fix is selective
generalization (distinguish 'VM dispatcher' from 'guest loop') plus
lift_ret import recognition.

* lifter: recognize ret-to-IAT as named external call in lift_ret

Adds a recognition path in lift_ret: if the value being popped resolves
to a concrete address that's in importMap, emit callFunctionIR for the
named import, then simulate the external's own ret by popping one more
qword (the continuation address pre-staged by the caller). solvePath
then continues at the continuation instead of trying to lift the IAT
pointer as code.

Two resolution routes:
  1. realval is a ConstantInt (direct push+ret of an IAT load)
  2. realval is symbolic but computePossibleValues folds to a single
     concrete value (obfuscated chains that constant-fold at this path)

Scope limits:
- Non-virt example2.bin lift is unchanged (still resolves 5 imports
  via register-indirect path; the new ret path does not fire because
  the binary uses 'call [iat]', not 'push+ret').
- Virt example2-virt.bin lift: the recognition code runs but does not
  surface imports because the lifter's static resolution of the
  arithmetic-decrypt chain produces wrong concrete targets. E.g. the
  ret at 0x14017fa77 resolves to 0x140002628 (somewhere in .rdata) via
  computePossibleValues; at runtime the emulator sees it pop the
  GetStdHandle IAT pointer (0x140002490). The recognition logic is
  correct; the upstream data flow is lying. Fixing that requires
  selective-generalization tuning or concrete VM unrolling, tracked
  separately.

So β lands as ground work for simpler push+ret thunks and for future
work where state-propagation fidelity improves. It is not a Themida
fix on its own.

* lifter: gate canGeneralize on per-header revisit count

Adds a revisit-count threshold to canGeneralizeStructuredLoopHeader:
below threshold N the gate rejects (concrete exploration continues);
at or above N it falls through to the existing loop-shape checks.
Tunable via MERGEN_GEN_MIN_REVISITS; default is 0 (inert, matches
pre-existing behaviour).

Also promotes ++liftAttemptCounts[addr] out from under the
liftProgressDiagEnabled gate so the counter is always maintained.

Rationale: on Themida example2-virt.bin @ 0x140001000, the existing
gate (always-generalize on first qualifying revisit) abstracts the
VM's dispatch loop too early, cutting reachability to ~30% of the
dynamic execution path. A higher threshold lets the dispatcher run
concretely for more iterations before abstracting. Measurement (all
other settings at defaults):

  T=0   (current)  blocks=   56   insns=  2544   err=0  warn=0
  T=4              blocks=   88   insns=  3842   err=0  warn=4
  T=16             blocks=  393   insns= 11747   err=1  warn=0
  T=32             blocks=  425   insns= 12067   err=1  warn=0
  T=128            blocks=  617   insns= 13987   err=1  warn=0
  MERGEN_NO_LOOP_GEN=1 (kill)
                   blocks= 2642   insns= 34229   err=1  warn=0

Caveat: at T=6, T=8, T=12 the lifter crashes with an access violation
partway through lifting. The crash fires in the Themida dispatcher
state machinery around 0x1400237F9 when generalization fires mid-
iteration with state that the existing machinery is not prepared to
handle. Other nearby T values (T=5, 7, 9, 10, 11, 13-19) are stable.

So the knob is landing as experimental infrastructure with default=0
(no-op). Future work can pair a safe non-zero default with a fix for
the dispatcher-state crash.

---------

Co-authored-by: Claude <claude@anthropic.com>
2026-04-24 14:54:22 +03:00

13 KiB

Loop Handling

This file documents how the lifter currently recognizes loops, how it switches a recognized loop into "generalized" lifting mode, and how the generalized state is consumed by downstream value tracking. It also lists the load-bearing hardcoded addresses, the gating contexts, and the known limitations so the next session can change loop behavior without re-excavating the code.

For build/test workflow use docs/BUILDING.md and docs/REWRITE_BASELINE.md. For the support matrix use docs/SCOPE.md. For the pipeline order around loop lifting use ARCHITECTURE.md.

Phases

Loop handling is a sequence of three phases on the same basic block:

Phase What it does Where
Detect Recognize that a backward jump target is a real loop header (not an acyclic backward branch) isStructuredLoopHeaderShape + canGeneralizeStructuredLoopHeader in lifter/core/LifterClass.hpp
Generalize Switch the lifter from concrete per-path execution to a phi-driven "loop mode" at the header branch_backup, load_generalized_backup, record_generalized_loop_backedge in lifter/core/LifterClass_Concolic.hpp
Consume Re-route specific load / register reads through canonical/backedge phi values during loop-mode lifting retrieve_generalized_loop_* family in lifter/core/LifterClass_Concolic.hpp

Detection

isStructuredLoopHeaderShape(BasicBlock*) walks the block chain starting at the candidate header and accepts on the first conditional branch it reaches, with these constraints:

  • Maximum walk depth: 8 hops.
  • The header itself may have up to 2 predecessors; deeper hops in the chain may have only 1 predecessor each.
  • Each hop must terminate with a BranchInst.
  • A non-conditional unconditional br with a single successor is allowed (trampoline relaxation), but a multi-successor non-branch terminator rejects.
  • An empty block on any hop rejects.
  • A cycle in the walk rejects.

Trampoline relaxation: when the entry block is a single unconditional br and a deeper hop has not yet been fully terminated (mid-lift), the chain is still accepted so the header can be latched. The actual loop-vs-acyclic decision is made by blockCanReach and visitedAddresses checks downstream.

canGeneralizeStructuredLoopHeader(addr) then applies the operational guards in this order:

  1. getControlFlow() == ControlFlow::Unflatten — feature-gate.
  2. currentPathSolveAllowsStructuredLoopGeneralization() (or the resolved-target widening) — see Path-solve context gating.
  3. addr <= blockInfo.block_address — only backward targets.
  4. visitedAddresses.contains(addr) — header must already have been lifted at least once.
  5. Not already in pendingLoopGeneralizationAddresses.
  6. Not already in generalizedLoopAddresses.
  7. addrToBB[addr] exists and is non-empty.
  8. isStructuredLoopHeaderShape(it->second).
  9. blockCanReach(header, currentBlock) — confirms an actual cycle.

All guards must pass; any reject is logged via diagnostic output gated on liftProgressDiagEnabled (MERGEN_DIAG_LIFT_PROGRESS=1).

Path-solve context gating

currentPathSolveContext distinguishes how the lifter reached the current point:

Context Generalization allowed?
ConditionalBranch yes
DirectJump yes
IndirectJump only via the resolved-target widening (...ForResolvedTarget)
Ret no

The IndirectJump widening exists because once solvePath has pinned an indirect jump to a concrete address, that target is no longer speculative and a backward edge is a legitimate loop. Ret-path contexts have their own lifecycle and are deliberately excluded from generalization.

Generalized loop state

When generalization fires, the lifter re-enters the header in "loop mode." The state lives in lifter/core/LifterClass_Concolic.hpp:

struct GeneralizedLoopControlFieldState {
  bool valid = false;
  llvm::BasicBlock* headerBlock = nullptr;
  llvm::BasicBlock* canonicalSource = nullptr;
  // Backedge side is variable-width: a loop header may be reached from
  // multiple backedges (N>=1). Size 1 is the common 2-way loop case.
  llvm::SmallVector<llvm::BasicBlock*, 2> backedgeSources;
  uint64_t canonicalControl = 0;
  llvm::SmallVector<uint64_t, 2> backedgeControls;
  llvm::DenseMap<uint64_t, ValueByteReference> canonicalBuffer;
  llvm::SmallVector<llvm::DenseMap<uint64_t, ValueByteReference>, 2> backedgeBuffers;
} activeGeneralizedLoopControlFieldState;

llvm::DenseMap<llvm::BasicBlock*, GeneralizedLoopControlFieldState>
    generalizedLoopControlFieldStates;

activeGeneralizedLoopControlFieldState tracks the state for the loop currently being lifted. generalizedLoopControlFieldStates is the per-header archive used after promotion so a later re-entry can rebuild the state.

Two related stores hold raw register/flag phi nodes per header:

  • generalizedLoopRegisterPhis: BB -> array<PHINode*, REGISTER_COUNT>
  • generalizedLoopFlagPhis: BB -> array<PHINode*, FLAGS_END>

State transitions:

Event What changes
branch_backup(bb, generalized=false) Snapshots current registers/flags/buffer/cache/assumptions/counter into BBbackup[bb].
branch_backup(bb, generalized=true) Appends the snapshot to generalizedLoopBackedgeBackup[bb] (a SmallVector<backup_point, 2>), deduplicated by sourceBlock so a repeat call from the same source replaces its entry in place. BBbackup[bb] only set if absent.
load_backup(bb) Restores BBbackup[bb], clears activeGeneralizedLoopLocalBuffer.
load_generalized_backup(bb) Builds make_generalized_loop_backup(bb) and restores it; populates activeGeneralizedLoopControlFieldState from the canonical/backedge snapshots.
record_generalized_loop_backedge(bb) Promotes the loop: copies activeGeneralizedLoopControlFieldState into the per-header archive, marks the address generalized.

Phi construction at the header

make_generalized_loop_backup(bb, canonical, ArrayRef<backup_point> sources) calls mergeValue for every register and flag slot. With one canonical source and N backedge sources, it produces a (1 + N)-incoming phi (one incoming per distinct sources[i].sourceBlock). Sources duplicating canonical.sourceBlock are filtered before phi construction.

auto mergeValue = [&](Value* canonicalValue,
                      ArrayRef<Value*> backedgeValues,
                      const char* name, PHINode*& phiOut,
                      bool widenFirstBackedge) -> Value* {
  // Require canonical + all backedges present and type-matched. Any
  // mismatch falls back to backedgeValues.front(), preserving the
  // pre-N-way single-backedge semantics for the 2-way case.
  auto* phi = phiBuilder.CreatePHI(canonicalValue->getType(),
                                   1 + backedgeValues.size(), name);
  phi->addIncoming(canonicalValue, canonicalSource);
  for (size_t i = 0; i < backedgeValues.size(); ++i) {
    phi->addIncoming(widenFirstBackedge
                         ? UndefValue::get(backedgeValues[i]->getType())
                         : backedgeValues[i],
                     sources[i]->sourceBlock);
  }
  phiOut = phi;
  return phi;
};

widenFirstBackedge controls whether the backedge incoming is Undef (allowing later folding to refine) or the concrete backedge value:

  • Registers: widenFirstBackedge = !shouldPreserveGeneralizedBackedgeRegisterIndex(i). RSP is preserved (passes the actual backedge value), every other GPR widens to Undef.
  • Flags: always widen to Undef.

Preserving RSP through the first backedge prevents the stack pointer from being treated as "could be anything" inside the loop body.

Consuming the state during loop-mode lifting

When the lifter is in loop mode (currentBlockUsesGeneralizedLoopState() == true) and the active state is valid, several read paths re-route through the state instead of the normal load/register pipeline. All are CRTP-dispatched in lifter/core/LifterClass.hpp and implemented in lifter/core/LifterClass_Concolic.hpp; symbolic-mode stubs in lifter/core/LifterClass_Symbolic.hpp return nullptr so symbolic analysis is unchanged.

Helper What it returns
retrieve_generalized_loop_local_value(addr, bytes) Loop-local stack-buffer value if activeGeneralizedLoopLocalBuffer has it; else nullptr (caller falls back).
retrieve_generalized_loop_control_field_value(loadOffset, bytes, orgLoad) Phi of canonical/backedge values for a load whose offset is controlSlot + (Trunc/ZExt/SExt of) phi with a recognized constant displacement.
retrieve_generalized_loop_control_slot_value(addr, bytes) Phi of canonical/backedge control values when addr == kThemidaControlCursorSlot.
retrieve_generalized_loop_target_slot_value(addr, bytes) Phi of canonical/backedge values for a recognized target slot.
retrieve_generalized_loop_phi_address_value(load, bytes, orgLoad) Phi of loaded values when the load's address is a phi of two concrete addresses derived from canonical/backedge.
retrieve_generalized_loop_local_phi_address_value(load, bytes, orgLoad) Same as above for loop-local stack-buffer addresses.

computePossibleValues (in lifter/memory/GEPTracker.ipp) also has a PHINode case that unions every incoming's value set, so callers downstream of these phis get the full possible-value enumeration instead of an empty fallback.

Hardcoded reference-sample addresses

A handful of constants in lifter/core/LifterClass_Concolic.hpp are tied to the reference Themida sample (testthemida/example2-virt.bin @ 0x140001000):

static constexpr uint64_t kThemidaControlCursorSlot = 0x14004DD19ULL;
static constexpr uint64_t kThemidaLoopCarriedSlot   = 0x14004DC67ULL;
static constexpr std::array<uint64_t, 3> kSupportedGeneralizedControlFieldOffsets = {
    0x6ULL, 0xAULL, 0xCULL};

The diagnostic prints scattered across PathSolver.ipp, LifterClass.hpp, LifterClass_Concolic.hpp, and GEPTracker.ipp that gate on specific Themida addresses (0x1400237F9ULL, 0x140023582-0x1400237FFULL, etc.) only fire under MERGEN_DIAG_LIFT_PROGRESS=1 and are session scaffolding for that sample. They produce no output for any other binary.

Tests

Loop handling has roughly thirty microtests in lifter/test/Tester.hpp. The most relevant groups:

Group Coverage
structured_loop_header_* Acceptance / rejection for conditional, jump-chain, acyclic-backward, non-conditional-terminator, multi-predecessor shapes.
loop_generalization_* Per-context guards: conditional branch allowed, direct jump allowed, indirect jump blocked when unresolved / allowed when resolved, ret blocked.
pending_generalized_loop_* Same guards in the pendingLoopGeneralizationAddresses lifecycle.
generalized_loop_restore_* Backedge flag-state and register-state merging across load_generalized_backup.
generalized_loop_*_creates_phi Each retrieve_generalized_loop_* helper produces the expected phi shape (control slot, control slot displacement, target slot, control field load, local phi address).
compute_possible_values_* The PHI handler unions incomings (also covers cast-width preservation and rolled-arithmetic-chain enumeration).

When changing loop handling, run at minimum:

python test.py micro
python test.py baseline

For changes that touch register/flag phi shape, also run the two Themida gates — a coverage gate and a correctness gate. Both must stay green (or at least not regress).

Coverage gate — confirms the VM unrolls without crashing:

build_iced\lifter.exe ..\testthemida\example2-virt.bin 0x140001000

Inspect output_diagnostics.json for lift_stats.instructions_lifted == 2544 and summary.warning == 0, summary.error == 0. This only certifies that the lifter walked the VM's 2544 handler instructions without reporting errors — it does not certify that the recovered IR is semantically equivalent to the original function.

Correctness gate — confirms the devirtualized IR calls the same external imports as the non-virtualized reference:

python test.py themida

Fails hard if any required import from scripts/rewrite/themida_samples.json is missing from the lifted IR. Required imports are pinned against a lift of the non-virt reference binary; regenerate with python test.py themida --update when the reference changes (not when the virt output changes). Passing this gate means the VM devirtualization recovered the guest program's external-call semantics, not just the VM's own state-machine activity.

This gate is currently red on example2-virt.bin: the lifter unrolls the VM but does not surface the guest's GetStdHandle / WriteConsoleA / ReadConsoleA / CharUpperA calls. That gap is the active Themida-frontier work item — the coverage gate passing while the correctness gate fails is exactly the failure mode the two-gate split is designed to make visible.

Known limitations

Limitation Status
REP/REPE/REPNE-prefixed SCAS Rejected as not_implemented; needs a model for repeated-scan termination.
INT 2 continuation under VMP 3.6 Naive architectural fallthrough is wrong; recovery requires modeling the dispatcher / exception-mediated control flow. See VMP_TESTING_NOTES.md.
Loop unrolling / loop-invariant code motion Not implemented. The lifter relies on LLVM's downstream optimization passes for this once the IR is in shape.