naci 71fc60766d Loop generalization, BSR/BSF intrinsics, and stack alloca split (#207)
* lifter: move themida control/target slots to per-state fields (phase A)

Lift the kThemidaControlCursorSlot/kThemidaLoopCarriedSlot constants out of
the helper bodies and into per-loop GeneralizedLoopControlFieldState fields
controlSlot/targetSlot. The populator seeds them to the legacy Themida
defaults, so behavior is unchanged on the reference Themida sample and on
every existing test. Phase B will replace the populator's literal seed with
active discovery against canonical/backedge buffers, enabling per-binary
slot identification.

Sites updated:
- GeneralizedLoopControlFieldState: new controlSlot/targetSlot fields,
  reset in clearGeneralizedLoopControlFieldState
- load_generalized_backup_impl: introduces controlSlot/targetSlot locals,
  uses them for canonical+backedge reads, seeds them into the activated state
- matchGeneralizedLoopControlFieldAddress: gates the GEP-base check on
  activeGeneralizedLoopControlFieldState.controlSlot
- retrieve_generalized_loop_control_slot_value_impl: gates on state.controlSlot
- retrieve_generalized_loop_target_slot_value_impl: gates on state.targetSlot
- record_generalized_loop_backedge_impl: reads current control via
  stateIt->second.controlSlot in both rotate and append-or-update paths

Tests that build state directly (bypassing the populator) are updated to
seed the new fields where they call retrieve helpers.

* lifter: discover themida control/target slots from canonical+backedge buffers (phase B)

Replaces the populator's hardcoded reads at kThemidaControlCursorSlot /
kThemidaLoopCarriedSlot with active per-loop discovery against the canonical
backup buffer and the generalized-loop backedge buffers. The discovery is
implemented as two helpers next to load_generalized_backup_impl:

- tryPopulateControlFromSlot(canonical, backedges, slot, dst): probes a
  specific candidate slot and, on success, fills dst with the canonical /
  backedge controls and per-backedge buffers that the slot motivates.
- discoverGeneralizedLoopSlots(canonical, backedges): drives the search.

The control-slot search prefers the legacy Themida cursor (zero behavior
change on the reference sample) and falls back to scanning canonical for the
qword-start address with the most-varying backedges, tiebreaking by lowest
address. The target-slot search prefers the legacy carried slot and falls
back to the lowest-address candidate that is tracked across canonical and
every selected backedge buffer.

Stack-frame addresses (anything inside [STACKP-reserve, STACKP+reserve)) are
excluded from candidates, so caller-frame stack args at e.g. STACKP+24 are
no longer mistakenly chosen as target slots. This matters for the nested-loop
local-buffer test, whose canonical buffer carries a tracked qword above
STACKP from the outer loop's prior backedge.

Two existing KNOWN-LIMITATION tests are flipped to assert the new positive
contract:
- generalized_loop_non_themida_control_slot_produces_no_phi ->
  generalized_loop_non_themida_slot_picks_up_as_target_when_legacy_control_present
  (the non-Themida slot is now picked up as the target slot when the legacy
  cursor is present, and a 2-way phi is produced at it).
- generalized_loop_non_themida_target_slot_produces_no_phi ->
  generalized_loop_discovery_picks_non_themida_target_slot
  (the discovered target slot is asserted, and the helper produces a 2-way
  phi with both incoming concrete values).

Verification:
- 228 rewrite_microtests pass (no regressions).
- check_themida_equivalence.py: example2 still recovers all 4 required
  imports (CharUpperA, GetStdHandle, ReadConsoleA, WriteConsoleA).

* loop generalization: data-driven register preservation + multi-slot carried state

Phase 2: Replace shouldPreserveGeneralizedBackedgeRegisterIndex (hardcoded
Themida-specific index set {1,4,7,9,10,12,14}) with data-driven comparison
of canonical vs backedge values. A register is now preserved when its value
changed across the loop boundary; RSP is always preserved. This prevents
non-Themida loops from silently losing loop-carried state in registers
outside the hardcoded set.

Phase 1: Extend GeneralizedLoopControlFieldState with a carriedSlots vector
that tracks ALL varying memory qwords discovered during slot analysis, not
just the single controlSlot + targetSlot. The retrieve_target_slot helper
now checks carriedSlots after the legacy targetSlot, building phis for any
matching carried address. Rotation logic in record_generalized_loop_backedge
updates carried slot values alongside the primary control slot.

Phase 3: Add vm_tea_round_loop sample — TEA-style compound cross-update with
3 independently loop-carried state variables (v0, v1, sum). 10 semantic test
cases including the previously-failing x=0x65501 input. All pass.

Test results: 247/247 pattern-verified, 245/245 semantic (2342 cases), all
microtests green including flag checks.

* add vm_subroutine_loop: single-depth call/ret VM with indirect PC dispatch

The vm_subroutine_loop pattern previously crashed the lifter with an access
violation (0xC0000005). The combination of multi-slot carried state, data-
driven register preservation, and emergency generalization now handles this
pattern correctly: 8 semantic cases pass, no crash.

The sample uses a one-deep return-PC slot (rpc) for indirect dispatch —
the simplest form of the pattern that was fundamentally unsupported.

248/248 samples, 246/246 semantic (2350 cases), Themida gate green.

* add vm_callret_loop and vm_bubblesort_loop: previously budget-blown patterns

Both patterns previously exhausted maxBasicBlockBudget (~4087 blocks):
- vm_callret_loop: stack-array-indexed PC dispatch (rstack[rsp])
- vm_bubblesort_loop: conditional two-slot array swap per iteration

With emergency generalization (75% budget threshold), both now lift
without hitting the budget ceiling (75 and 59 blocks respectively).
The patterns are registered with IR shape checks only (no semantic
assertions) because the indirect dispatch and conditional multi-slot
writes are not yet semantically accurate under generalization.

250/250 samples, 247/247 semantic (2358 cases), Themida gate green.

* semantics: rewrite BSR/BSF to use llvm.ctlz/cttz intrinsics

Replace the bitWidth-iteration unrolled bit-scan loops (32 AND+ICMP+SELECT
chains for i32, 64 for i64) with single @llvm.ctlz / @llvm.cttz intrinsic
calls. BSR = bitWidth - 1 - ctlz(x, true); BSF = cttz(x, true).

The zero-input case is handled with is_zero_undef=true (matching BSR/BSF
architectural undefined-when-zero behavior) plus an explicit select that
returns undef when the input is zero. Constant folding is preserved.

IR quality improvement: vm_imported_clz_loop and vm_imported_bsr_loop
now show a single call @llvm.ctlz.i32 instead of 30+ bsrtest/icmp/select
instructions. Pattern manifests updated to match 'call'.

lift_lzcnt and lift_tzcnt already used the intrinsics — BSR/BSF were the
only remaining scalar bit-scan ops with unrolled implementations.

Side benefit: flag-stress tests bsf_00 and bsf_01 fixed (constant-folded
input now produces correct PF flag instead of running the unrolled loop).

* PromotePseudoStackPass: split into main + escape alloca by call-escape

Previously, GEPs that flow into call arguments either:
  (a) were skipped from promotion → left as memory-base GEPs that
      PromotePseudoMemory turned into raw inttoptr(stack_addr) constants
      (e.g., 'ptr nonnull inttoptr (i64 1375592 to ptr)' as WriteConsoleA
      lpNumberOfCharsWritten arg), or
  (b) were promoted to a single shared alloca, blocking SROA for the
      whole alloca and leaving hundreds of dead dispatcher-scratch stores
      in the post-opt IR.

Two-alloca split fixes both:
  - Main alloca:  scratch slots that don't escape via calls. SROA
                  decomposes it cleanly; DSE eliminates dead stores.
  - Escape alloca: slots whose pointer flows into a CallBase. Won't
                   SROA but is isolated, so dispatcher noise doesn't
                   block its dead-store elimination.

Classification is by constant offset: any offset touched by ANY GEP
with a CallBase user is marked escaped. All GEPs (constant or not) at
escaped offsets go to the escape alloca to preserve pointer identity
within each slot. Non-constant offsets always go to the main alloca
(in practice, lifters use them for buffers; constant offsets for API
scalar slots).

Themida WriteConsoleA call now shows clean alloca GEPs:
  ptr nonnull %4  (= stackmemory.escape + 200)
  ptr nonnull %6  (= stackmemory.escape + 208)
instead of:
  ptr nonnull inttoptr (i64 1375584 to ptr)
  ptr nonnull inttoptr (i64 1375592 to ptr)

Stack-range inttoptr in Themida output drops to zero. Total stores
drop dramatically (the remaining ones are .themida section writes,
a separate dispatcher-state issue not related to the stack alloca).

Pattern updates for 4 samples whose IR shape changed due to the
cleaner alloca decomposition:
  - vm_fibonacci_loop:        switch i32 -> br i1
  - vm_search_loop:           br i1 -> select
  - vm_signed_dword_sum64:    sext -> ashr
  - vm_signed_word_sum64:     sext -> ashr

---------

Co-authored-by: yusufcanislek <yusuf.canislek@meetdandy.com>
2026-05-08 14:00:42 +03:00
2024-09-25 10:02:20 +03:00
2024-03-23 16:23:33 +03:00
2025-04-24 05:34:33 +03:00
2023-11-13 15:46:01 +03:00

Project Overview:

Mergen is a tool engineered to convert Assembly code into LLVM Intermediate Representation (IR). This tool is designed for:

  • The deobfuscation or devirtualization of obfuscated binary code
  • The enhancement of the reverse engineering process, making it more efficient and effective, especially for complex software systems.

Guide to build & run

To build and run the project, take a look at docs/BUILDING.md.

Rewrite baseline gate

Rewrite work should keep the baseline regression gate green. The gate builds focused PE samples, runs lifter, and verifies lifted IR outputs.

Core Objectives:

  • Deobfuscation

  • Devirtualization

  • Optimization

How does it work?

We symbolicly execute (or symbolicly lift) the target, the idea here is not lifting individual instructions, but lifting a whole function. We dont expect one instruction nor one basic block to behave same each time, instead treat them like they can be and are for different purposes each time. We try to keep the generated IR simple and optimizeable as possible. We also have different needs than an usual compiler. We use analysis to evaluate control flow. We can't depend on LLVM for all of our analysis, because they are created for different goals and could be unoptimal for our use-case.

image

Examples

This is the practical example to illustrate how Mergen solves against virtualized programs.

  1. VMProtect
  2. Branches/Jumptables
  3. Themida 3.1.6.0 LION64 (Red)

Example #1 (VMProtect)

This is our target program

struct test {
    int a;
    int b;
    int c;
};

int maths(test a, int b, int c) {
        return a.a  + b - c;
}

image

image

VMProtect settings, everything is turned off, we virtualize the function on ultra setting. (Tested versions 3.4.0-3.6.0 3.8.1)

image

image

Here, we run mergen. First argument is the name of the file and the second argument is the address of the function. Look how simple it is to run. And we can compile the output so we can explore it using our favorite decompiler.

image

; ModuleID = 'my_lifting_module'
source_filename = "my_lifting_module"

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read)
define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %0, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr nocapture readonly %memory) local_unnamed_addr #0 {
entry:
  %stackmemory = alloca i128, i128 13758960, align 8
  %1 = trunc i64 %r8 to i32
  %2 = trunc i64 %rdx to i32
  %GEPLoadxd-5369456437- = getelementptr i8, ptr %memory, i64 %rcx
  %3 = load i32, ptr %GEPLoadxd-5369456437-, align 4
  %adc-temp-5370242400- = sub i32 %2, %1
  %realnot-5369532059- = add i32 %adc-temp-5370242400-, %3
  %stackmemory10243.sroa.55.1375304.insert.ext10255 = zext i32 %realnot-5369532059- to i64
  ret i64 %stackmemory10243.sroa.55.1375304.insert.ext10255
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read) }

After compiling:

image

image

Now you might notice the registers are a little bit off. This is because of we dont follow the calling conventions, if we were to follow the calling conventions, function signature would look like this:

define i64 @main(i64 %rcx, i64 %rdx, i64 %rdx, i64 %r8, i64 %r9 ...)

So, we just adjust the function signature to look normally. If you have more questions about this part, I suggest you research calling conventions and ABI.

Example #2 (Branches/Jumptables)

So, lets say we have this code. VM's will take the below code then turn it to an indirect jump, its slightly more unconvenient for the reverser.

int maths(int a, int b, int c) {
    if (a > b)
        return a + b + c;
    else
        return a - b - c;
}
next_handler = xxx;
if ( a-b > 0 )
  next_handler = yyy;
jump next_handler;

We try to always analyze values and keep track of them. This allows us to understand control flow. For jumptable-like branches Optimized output would be a simple

define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr nocapture readnone %TEB, ptr nocapture readnone %memory) local_unnamed_addr #0 {
fake_ret:
  %0 = lshr i64 %rcx, 62
  %common.ret.op = and i64 %0, 2
  ret i64 %common.ret.op
}

Unoptimized output. (DCE'd for readability)

source_filename = "my_lifting_module"

define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr %TEB, ptr %memory) {
  %lsb = and i64 %rcx, 255
  %pf1 = mul i64 %lsb, 72340172838076673
  %pf2 = and i64 %pf1, -9205322385119247871
  %pf3 = urem i64 %pf2, 511
  %pf4 = and i64 %pf3, 1
  %pf5 = icmp eq i64 0, %pf4
  %0 = zext i1 %pf5 to i64
  %createrflag2 = shl i64 %0, 2
  %creatingrflag = or i64 2, %createrflag2
  %zeroflag = icmp eq i64 %rcx, 0
  %1 = zext i1 %zeroflag to i64
  %createrflag21 = shl i64 %1, 6
  %creatingrflag2 = or i64 %creatingrflag, %createrflag21
  %signflag = icmp slt i64 %rcx, 0
  %2 = zext i1 %signflag to i64
  %createrflag23 = shl i64 %2, 7
  %creatingrflag4 = or i64 %creatingrflag2, %createrflag23
  %GEPSTORE-5368713221- = getelementptr i8, ptr %memory, i64 1376032
  store i64 %creatingrflag4, ptr %GEPSTORE-5368713221-, align 4
  %realand-5368713229- = and i64 %creatingrflag4, 128
  %shr-lshr-5368713233- = lshr i64 %realand-5368713229-, 7
  %3 = mul i64 %shr-lshr-5368713233-, 4
  %bvalue_indexvalue = add i64 5368713249, %3
  %4 = icmp eq i64 %bvalue_indexvalue, 5368713253
  %lolb- = select i1 %4, i64 5368713264, i64 5368713257
  %GEPSTORE-5368713248- = getelementptr i8, ptr %memory, i64 1376032
  store i64 %lolb-, ptr %GEPSTORE-5368713248-, align 4
  br i1 %4, label %real_ret, label %real_ret41

real_ret:                                         ; preds = %fake_ret
  %inc-5368713273- = add i64 %shr-lshr-5368713233-, 1
  ret i64 %inc-5368713273-

real_ret41:                                       ; preds = %fake_ret
  ret i64 %shr-lshr-5368713233-
}

Notice this part

  %realand-5368713229- = and i64 %creatingrflag4, 128
  %shr-lshr-5368713233- = lshr i64 %realand-5368713229-, 7

We get the flags, then we get the 7th bit which is Sign Flag, then we use the Sign Flag to calculate an address. Through analysis, we determine the address could be one of two values, 5368713257 or 5368713264, then we turn that into a comparison. If address is 5368713257, take one branch, if other, take another. When doing this, it is also important to mark the condition appopriate value because later, we might need to calculate another jump with the same exact value.

Even though we solve the indirect jumps, jumps with more than 2 possible location are not supported. This is because the analysis for them are not implemented yet. This allows us to solve the vm-style branches, but have problem with real life jumptables.

Example #3 (Themida 3.1.6.0 LION64 (Red))

Our target program:

image

Themida settings (we only care about vms atm):

image

image

image

After vm:

image

Running Mergen:

image

Output code: click here So, why our result is not succesful as lifting a binary thats protected by vmp?

Themida actively writes on .themida section. Unlike stack, we cant disregard these writes, because these values might be read by other stuff later.

But, we have a temporary solution to that. Remove all stores into .themida section. Since our program doesnt write into memory, I just commented all the stores. Now we are left with this:

source_filename = "my_lifting_module"

define i64 @main(i64 %rax, i64 %rcx, i64 %rdx, i64 %rbx, i64 %rsp, i64 %rbp, i64 %rsi, i64 %rdi, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, ptr writeonly %memory) local_unnamed_addr #0 {
  %trunc = trunc i64 %r8 to i32
  %trunc1 = trunc i64 %rdx to i32
  %trunc2 = trunc i64 %rcx to i32
  %realadd-5369771371- = add i32 %trunc1, %trunc2
  %realadd-5369582686- = add i32 %realadd-5369771371-, %trunc
  %trunc457139 = zext i32 %realadd-5369582686- to i64
  ret i64 %trunc457139
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: write) }

Technical challenges

  • Loops
  • Self Modifying Code ( especially with conditional modification)
  • Being in an universe where "outlining" and "unrolling" passes doesnt exist.

Getting in touch

Join our Mergen Discord Server to trade ideas or just chatting in general.

S
Description
Languages
C++ 69.1%
C 17.1%
Python 9.9%
PowerShell 1.2%
Batchfile 1%
Other 1.6%