Files
naci 71fc60766d Loop generalization, BSR/BSF intrinsics, and stack alloca split (#207)
* lifter: move themida control/target slots to per-state fields (phase A)

Lift the kThemidaControlCursorSlot/kThemidaLoopCarriedSlot constants out of
the helper bodies and into per-loop GeneralizedLoopControlFieldState fields
controlSlot/targetSlot. The populator seeds them to the legacy Themida
defaults, so behavior is unchanged on the reference Themida sample and on
every existing test. Phase B will replace the populator's literal seed with
active discovery against canonical/backedge buffers, enabling per-binary
slot identification.

Sites updated:
- GeneralizedLoopControlFieldState: new controlSlot/targetSlot fields,
  reset in clearGeneralizedLoopControlFieldState
- load_generalized_backup_impl: introduces controlSlot/targetSlot locals,
  uses them for canonical+backedge reads, seeds them into the activated state
- matchGeneralizedLoopControlFieldAddress: gates the GEP-base check on
  activeGeneralizedLoopControlFieldState.controlSlot
- retrieve_generalized_loop_control_slot_value_impl: gates on state.controlSlot
- retrieve_generalized_loop_target_slot_value_impl: gates on state.targetSlot
- record_generalized_loop_backedge_impl: reads current control via
  stateIt->second.controlSlot in both rotate and append-or-update paths

Tests that build state directly (bypassing the populator) are updated to
seed the new fields where they call retrieve helpers.

* lifter: discover themida control/target slots from canonical+backedge buffers (phase B)

Replaces the populator's hardcoded reads at kThemidaControlCursorSlot /
kThemidaLoopCarriedSlot with active per-loop discovery against the canonical
backup buffer and the generalized-loop backedge buffers. The discovery is
implemented as two helpers next to load_generalized_backup_impl:

- tryPopulateControlFromSlot(canonical, backedges, slot, dst): probes a
  specific candidate slot and, on success, fills dst with the canonical /
  backedge controls and per-backedge buffers that the slot motivates.
- discoverGeneralizedLoopSlots(canonical, backedges): drives the search.

The control-slot search prefers the legacy Themida cursor (zero behavior
change on the reference sample) and falls back to scanning canonical for the
qword-start address with the most-varying backedges, tiebreaking by lowest
address. The target-slot search prefers the legacy carried slot and falls
back to the lowest-address candidate that is tracked across canonical and
every selected backedge buffer.

Stack-frame addresses (anything inside [STACKP-reserve, STACKP+reserve)) are
excluded from candidates, so caller-frame stack args at e.g. STACKP+24 are
no longer mistakenly chosen as target slots. This matters for the nested-loop
local-buffer test, whose canonical buffer carries a tracked qword above
STACKP from the outer loop's prior backedge.

Two existing KNOWN-LIMITATION tests are flipped to assert the new positive
contract:
- generalized_loop_non_themida_control_slot_produces_no_phi ->
  generalized_loop_non_themida_slot_picks_up_as_target_when_legacy_control_present
  (the non-Themida slot is now picked up as the target slot when the legacy
  cursor is present, and a 2-way phi is produced at it).
- generalized_loop_non_themida_target_slot_produces_no_phi ->
  generalized_loop_discovery_picks_non_themida_target_slot
  (the discovered target slot is asserted, and the helper produces a 2-way
  phi with both incoming concrete values).

Verification:
- 228 rewrite_microtests pass (no regressions).
- check_themida_equivalence.py: example2 still recovers all 4 required
  imports (CharUpperA, GetStdHandle, ReadConsoleA, WriteConsoleA).

* loop generalization: data-driven register preservation + multi-slot carried state

Phase 2: Replace shouldPreserveGeneralizedBackedgeRegisterIndex (hardcoded
Themida-specific index set {1,4,7,9,10,12,14}) with data-driven comparison
of canonical vs backedge values. A register is now preserved when its value
changed across the loop boundary; RSP is always preserved. This prevents
non-Themida loops from silently losing loop-carried state in registers
outside the hardcoded set.

Phase 1: Extend GeneralizedLoopControlFieldState with a carriedSlots vector
that tracks ALL varying memory qwords discovered during slot analysis, not
just the single controlSlot + targetSlot. The retrieve_target_slot helper
now checks carriedSlots after the legacy targetSlot, building phis for any
matching carried address. Rotation logic in record_generalized_loop_backedge
updates carried slot values alongside the primary control slot.

Phase 3: Add vm_tea_round_loop sample — TEA-style compound cross-update with
3 independently loop-carried state variables (v0, v1, sum). 10 semantic test
cases including the previously-failing x=0x65501 input. All pass.

Test results: 247/247 pattern-verified, 245/245 semantic (2342 cases), all
microtests green including flag checks.

* add vm_subroutine_loop: single-depth call/ret VM with indirect PC dispatch

The vm_subroutine_loop pattern previously crashed the lifter with an access
violation (0xC0000005). The combination of multi-slot carried state, data-
driven register preservation, and emergency generalization now handles this
pattern correctly: 8 semantic cases pass, no crash.

The sample uses a one-deep return-PC slot (rpc) for indirect dispatch —
the simplest form of the pattern that was fundamentally unsupported.

248/248 samples, 246/246 semantic (2350 cases), Themida gate green.

* add vm_callret_loop and vm_bubblesort_loop: previously budget-blown patterns

Both patterns previously exhausted maxBasicBlockBudget (~4087 blocks):
- vm_callret_loop: stack-array-indexed PC dispatch (rstack[rsp])
- vm_bubblesort_loop: conditional two-slot array swap per iteration

With emergency generalization (75% budget threshold), both now lift
without hitting the budget ceiling (75 and 59 blocks respectively).
The patterns are registered with IR shape checks only (no semantic
assertions) because the indirect dispatch and conditional multi-slot
writes are not yet semantically accurate under generalization.

250/250 samples, 247/247 semantic (2358 cases), Themida gate green.

* semantics: rewrite BSR/BSF to use llvm.ctlz/cttz intrinsics

Replace the bitWidth-iteration unrolled bit-scan loops (32 AND+ICMP+SELECT
chains for i32, 64 for i64) with single @llvm.ctlz / @llvm.cttz intrinsic
calls. BSR = bitWidth - 1 - ctlz(x, true); BSF = cttz(x, true).

The zero-input case is handled with is_zero_undef=true (matching BSR/BSF
architectural undefined-when-zero behavior) plus an explicit select that
returns undef when the input is zero. Constant folding is preserved.

IR quality improvement: vm_imported_clz_loop and vm_imported_bsr_loop
now show a single call @llvm.ctlz.i32 instead of 30+ bsrtest/icmp/select
instructions. Pattern manifests updated to match 'call'.

lift_lzcnt and lift_tzcnt already used the intrinsics — BSR/BSF were the
only remaining scalar bit-scan ops with unrolled implementations.

Side benefit: flag-stress tests bsf_00 and bsf_01 fixed (constant-folded
input now produces correct PF flag instead of running the unrolled loop).

* PromotePseudoStackPass: split into main + escape alloca by call-escape

Previously, GEPs that flow into call arguments either:
  (a) were skipped from promotion → left as memory-base GEPs that
      PromotePseudoMemory turned into raw inttoptr(stack_addr) constants
      (e.g., 'ptr nonnull inttoptr (i64 1375592 to ptr)' as WriteConsoleA
      lpNumberOfCharsWritten arg), or
  (b) were promoted to a single shared alloca, blocking SROA for the
      whole alloca and leaving hundreds of dead dispatcher-scratch stores
      in the post-opt IR.

Two-alloca split fixes both:
  - Main alloca:  scratch slots that don't escape via calls. SROA
                  decomposes it cleanly; DSE eliminates dead stores.
  - Escape alloca: slots whose pointer flows into a CallBase. Won't
                   SROA but is isolated, so dispatcher noise doesn't
                   block its dead-store elimination.

Classification is by constant offset: any offset touched by ANY GEP
with a CallBase user is marked escaped. All GEPs (constant or not) at
escaped offsets go to the escape alloca to preserve pointer identity
within each slot. Non-constant offsets always go to the main alloca
(in practice, lifters use them for buffers; constant offsets for API
scalar slots).

Themida WriteConsoleA call now shows clean alloca GEPs:
  ptr nonnull %4  (= stackmemory.escape + 200)
  ptr nonnull %6  (= stackmemory.escape + 208)
instead of:
  ptr nonnull inttoptr (i64 1375584 to ptr)
  ptr nonnull inttoptr (i64 1375592 to ptr)

Stack-range inttoptr in Themida output drops to zero. Total stores
drop dramatically (the remaining ones are .themida section writes,
a separate dispatcher-state issue not related to the stack alloca).

Pattern updates for 4 samples whose IR shape changed due to the
cleaner alloca decomposition:
  - vm_fibonacci_loop:        switch i32 -> br i1
  - vm_search_loop:           br i1 -> select
  - vm_signed_dword_sum64:    sext -> ashr
  - vm_signed_word_sum64:     sext -> ashr

---------

Co-authored-by: yusufcanislek <yusuf.canislek@meetdandy.com>
2026-05-08 14:00:42 +03:00
..