mirror of https://github.com/NaC-L/Mergen.git synced 2026-05-12 09:40:34 +00:00

naci 0292c7d564 chore: gitignore session scratch + note cross-branch build cache hazard (#113)

During the Themida-frontier session, two failure modes cost real time:

1) Ad-hoc lifter runs produced scratch files (internal_0x*.ll, *handoff.md, linked_target.txt, vlizer_stub.txt) that got committed on a research branch and then had to be scrubbed before merge.  Extend .gitignore with patterns matching the observed pollution class.

2) 'python test.py baseline' was run against origin/main with a build_iced/ directory that still held object files from a feature branch.  The resulting lifter binary linked a stale mix of old and new code, producing a failure set that matched neither branch.  This led to a false 'branch matches main' claim that was only caught after CI.  Document the required wipe-and-rebuild in the operator defaults.

No code changes.

Co-authored-by: yusufcanislek <yusuf.canislek@meetdandy.com>

2026-04-22 22:03:56 +03:00

Repository Guidelines

Project Overview

Mergen is a function-level x64 PE to LLVM IR lifter for deobfuscation and devirtualization. The active workflow is rewrite/regression driven: changes are expected to preserve lifted IR shape, runtime semantics, and deterministic outputs.

Primary repo entry points:

README.md — project purpose and high-level entry links
ARCHITECTURE.md — current pipeline order and invariants
docs/SCOPE.md — support matrix and quality contract
docs/REWRITE_BASELINE.md — operational regression workflow

Architecture & Data Flow

The core pipeline is:

CLI entry in lifter/core/Lifter.cpp
Runtime image validation in lifter/core/RuntimeImageContext.hpp
Lifter setup / auto-outline in lifter/core/LifterStages.hpp
Memory policy + paged memory setup in lifter/memory/MemoryPolicySetup.hpp and lifter/core/LifterPipelineStages.hpp
Signature stage in lifter/core/LifterPipelineStages.hpp
Lift loop in lifter/core/LiftDriver.hpp
Fixpoint optimization in lifter/core/MergenPB.hpp
Final post-passes and IR emission

Important invariants:

STACKP_VALUE is fixed at 0x14FEA0 (lifter/core/Includes.h).
Stack reserve is clamped to [0x1000, 0x100000] (lifter/memory/MemoryPolicySetup.hpp).
Pass order is intentional: GEPLoadPass -> ReplaceTruncWithLoadPass -> PromotePseudoStackPass -> PromotePseudoMemory, then O2, then post-passes such as switch normalization and canonical naming (ARCHITECTURE.md, lifter/core/MergenPB.hpp). Do not reorder casually.
The disassembler boundary is normalized through lifter/disasm/CommonDisassembler.hpp; semantics should consume normalized operands, not backend-specific details.
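
The stack-reserve invariant above can be illustrated with a minimal sketch. The real logic lives in C++ in lifter/memory/MemoryPolicySetup.hpp; the constant and function names below are hypothetical and only the numeric bounds come from this document.

```python
# Bounds quoted from the invariants above. The names are illustrative
# assumptions, not identifiers taken from the Mergen sources.
STACK_RESERVE_MIN = 0x1000
STACK_RESERVE_MAX = 0x100000


def clamp_stack_reserve(requested: int) -> int:
    """Clamp a requested stack reserve into [0x1000, 0x100000]."""
    return max(STACK_RESERVE_MIN, min(requested, STACK_RESERVE_MAX))
```

A request below the lower bound is raised to 0x1000, a request above the upper bound is lowered to 0x100000, and anything in between passes through unchanged.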

Key Directories

lifter/core/ — CLI, runtime image setup, pipeline orchestration, ABI/signature handling
lifter/semantics/ — opcode dispatch and instruction semantics (Semantics.ipp, Semantics_*.ipp, x86_64_opcodes.x)
lifter/disasm/ — Iced/Zydis abstraction layer
lifter/memory/ — file-backed memory, page map, pseudo-memory/stack promotion
lifter/analysis/ — custom LLVM passes and path solving
lifter/test/ — in-process instruction/oracle test harness and golden metadata
testcases/rewrite_smoke/ — rewrite smoke corpus sources
scripts/rewrite/ — baseline gate, sample build, manifest validation, oracle generation, semantic checks
scripts/dev/ — preferred configure/build entrypoints
docs/ — current workflow, scope, and reviewer policy docs

Important Files

cmake.toml — source of truth for build configuration; CMakeLists.txt is generated, do not edit it directly.
test.py — primary QA entrypoint.
scripts/rewrite/instruction_microtests.json — source of truth for rewrite smoke samples, expected IR patterns, semantic cases, and CI skips.
lifter/test/test_vectors/oracle_vectors.json — default instruction oracle vectors.
lifter/test/test_vectors/golden_ir_hashes.json — determinism gate for tracked IR outputs.
.editorconfig and .clang-format — formatting contract (2 spaces, LF, UTF-8, 100-column LLVM-based style).
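
As a rough illustration of what a determinism gate like golden_ir_hashes.json implies, the sketch below compares digests of current IR text against recorded ones. The hash algorithm (SHA-256), the line-ending normalization, and the flat name-to-digest layout are all assumptions made for the example; the actual file format and hashing scheme are defined by the repo, not by this sketch.

```python
import hashlib


def ir_digest(ir_text: str) -> str:
    """Hash IR text after normalizing line endings (assumed scheme: SHA-256)."""
    normalized = ir_text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_hash_drift(golden: dict, outputs: dict) -> list:
    """Return sample names whose IR is missing or no longer matches its golden digest.

    golden maps sample name -> expected hex digest (hypothetical layout).
    outputs maps sample name -> freshly lifted IR text.
    """
    drift = []
    for name, expected in sorted(golden.items()):
        ir_text = outputs.get(name)
        if ir_text is None or ir_digest(ir_text) != expected:
            drift.append(name)
    return drift
```

An empty result means every tracked output reproduced byte-for-byte (modulo line endings); any name in the list is a determinism or behavior change that needs review.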

Development Commands

Before running any command in this section, confirm the exact repo root and cwd. Prefer these repo-provided scripts over ad hoc shell commands.

Preferred Windows build flow:

cmd /c scripts\dev\configure_iced.cmd
cmd /c scripts\dev\build_iced.cmd

Alternate Zydis-only lane:

cmd /c scripts\dev\configure_zydis.cmd
cmd /c scripts\dev\build_zydis.cmd

Primary test commands:

python test.py quick
python test.py all
python test.py baseline
python test.py micro --check-flags
python test.py negative
python test.py coverage --full
python test.py report --json

Useful targeted flows:

python test.py micro add
python test.py semantic branch
scripts\rewrite\run.cmd
scripts\rewrite\run_microtests.cmd --check-flags xor
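
The "confirm the exact repo root and cwd" step at the top of this section can be made mechanical. The following is a minimal sketch that assumes cmake.toml and test.py both sit at the repo root (this document lists them without a directory prefix, but the marker choice is still an assumption); find_repo_root is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Assumed root markers: both files are described above as top-level entry points.
ROOT_MARKERS = ("cmake.toml", "test.py")


def find_repo_root(start: Path) -> Path:
    """Walk upward from start until a directory containing every marker is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_file() for marker in ROOT_MARKERS):
            return candidate
    raise FileNotFoundError(f"no directory above {start} contains {ROOT_MARKERS}")
```

Calling find_repo_root(Path.cwd()) before any configure, build, or test command fails loudly when the shell is in the wrong checkout instead of silently running against it.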

Runtime / Tooling Preferences

Platform focus is Windows. CI uses scripts/dev/*.cmd and windows-latest.
Prefer the iced lane by default; use Zydis only when you need the fallback/backend-specific lane.
Configure/build scripts assume Ninja + clang-cl; they do not invoke VsDevCmd.bat.
LLVM_DIR must resolve to LLVM 18; CI currently downloads LLVM 18.1.8.
Cargo is expected on PATH for the iced lane.
Build outputs live in build_iced/, build_zydis/, or other build*/ directories; treat them as generated artifacts.
Regression artifacts are written outside the repo by default to ../rewrite-regression-work/.

Code Conventions & Common Patterns

Extend instruction support through the existing opcode table and semantics files; do not add parallel dispatch paths.
- Wire new entries in lifter/semantics/x86_64_opcodes.x.
- Implement behavior in the appropriate Semantics_*.ipp file.
Preserve the normalized operand model across disassembly and semantics. Cross-check lifter/disasm/CommonDisassembler.hpp, backend adapters, and downstream helpers before changing operand enums or widths.
Memory accesses should go through the existing operand/memory helpers (lifter/semantics/OperandUtils.ipp); bypassing them usually breaks constant folding, page-map behavior, or pseudo-stack promotion.
Call handling is ABI-aware. Check lifter/core/AbiCallContract.hpp and existing control-flow helpers before changing call lowering.
Prefer explicit failures and diagnostics over silent fallbacks. The repo already has structured lift diagnostics (lifter/core/LiftDiagnostics.hpp) and strict negative tests in test.py.
When touching build definitions, update cmake.toml; cmkr regenerates CMakeLists.txt from it.
Keep docs and test manifests in the same change when behavior changes. This repo relies on docs/tests as active contracts, not afterthoughts.

Testing & QA Expectations

python test.py is the canonical entrypoint. quick and all are the main gates used in CI.
The rewrite baseline is manifest-backed: every source in testcases/rewrite_smoke/ must have exactly one manifest entry in scripts/rewrite/instruction_microtests.json.
Golden IR hashing is part of the contract. C/C++-compiled smoke samples are excluded from golden hashes because their IR addresses are toolchain-dependent; they are checked via semantic tests instead.
python test.py negative matters: it guards explicit failure behavior for malformed manifests, unsafe paths, and bad vector schemas.
Use focused verification that matches your change:
- Core/semantics/disasm/test harness changes: python test.py micro --check-flags
- Rewrite script/manifest changes: python test.py baseline and python test.py negative
- Coverage/vector plumbing: python test.py coverage --full and python test.py report --json
- Build script/CMake changes: rerun the affected scripts\dev\configure_*.cmd + build_*.cmd lane
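
The mapping from changed area to focused checks can be written down as data. This is a sketch only: the area labels are hypothetical shorthand for the categories above, the commands are copied from this document, and the build lane is left out because it depends on which configure/build pair was affected.

```python
# Area labels are illustrative; the commands are the ones listed above.
FOCUSED_CHECKS = {
    "lifter": ["python test.py micro --check-flags"],
    "rewrite": ["python test.py baseline", "python test.py negative"],
    "coverage": ["python test.py coverage --full", "python test.py report --json"],
}


def checks_for(areas):
    """Return the de-duplicated commands for the touched areas, in first-seen order."""
    commands = []
    for area in areas:
        for command in FOCUSED_CHECKS[area]:  # KeyError flags an unknown area
            if command not in commands:
                commands.append(command)
    return commands
```

An unknown area raises KeyError rather than returning an empty list, in keeping with the preference for explicit failures over silent fallbacks.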

Operator workflow defaults

Use these with the repo-specific architecture/test rules above.

Confirm the real repo root, source-of-truth file, and owning subsystem before searching or editing.
Narrow search scope before using broad repo scans.
Prefer read, find, grep, ast_grep, edit, ast_edit, and lsp before bash for discovery or structural edits.
Before build/test/git/bash commands, confirm the exact cwd and lane you intend to run.
If you edit the same file twice, re-read it first.
Default to one main line of work; split into subtasks only when file boundaries are real and outputs are independent.
Do not finish non-trivial work without focused verification that matches the changed subsystem.
Before comparing two branches with python test.py baseline or python test.py quick, wipe build_iced/ and rebuild from scratch (rm -rf build_iced && cmd /c scripts\dev\configure_iced.cmd && cmd /c scripts\dev\build_iced.cmd). Incremental builds reuse object files across branches and can link a stale mix of old and new code, producing a lifter binary whose failure set reflects neither branch. This has caused at least one false "branch matches main" claim.
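
The wipe-and-rebuild rule can be scripted so it is hard to skip. The sketch below is an assumption-laden illustration: wipe_build_dir and clean_rebuild are hypothetical helpers, the script paths are the iced lane commands quoted in this document, and the run parameter exists only so the command sequence can be exercised without a Windows toolchain.

```python
import shutil
import subprocess
from pathlib import Path

# Iced lane commands as quoted in this document.
REBUILD_COMMANDS = [
    ["cmd", "/c", r"scripts\dev\configure_iced.cmd"],
    ["cmd", "/c", r"scripts\dev\build_iced.cmd"],
]


def wipe_build_dir(repo_root: Path, build_dir: str = "build_iced") -> bool:
    """Delete the build directory if present; return True when something was removed."""
    target = repo_root / build_dir
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


def clean_rebuild(repo_root: Path, run=subprocess.run) -> None:
    """Wipe build_iced/ and then run configure and build from the repo root."""
    wipe_build_dir(repo_root)
    for command in REBUILD_COMMANDS:
        # check=True stops at the first failing step instead of building on a bad configure.
        run(command, cwd=repo_root, check=True)
```

Running clean_rebuild once per branch, immediately after checkout and before python test.py baseline, keeps object files from one branch out of the other branch's lifter binary.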

What not to do

Do not start with repo-root scans when a narrower directory or entry document can answer the question.
Do not run configure/build/test commands from an assumed cwd.
Do not use bash-first discovery when a specialized tool can answer it.
Do not spawn reviewer/subtask branches just to spread a single code path across multiple agents.

Process Notes For AI Assistants

Prefer docs/REWRITE_BASELINE.md and CI workflows over older generic build docs when commands disagree.
Do not edit generated files or artifact outputs unless the task is explicitly about generation.
Before changing exported behavior, inspect direct consumers and the matching rewrite/test manifests.
If you add a new sample, update both testcases/rewrite_smoke/ and scripts/rewrite/instruction_microtests.json in the same change.
If you change semantics or ABI behavior, expect to update oracle vectors, microtests, semantic expectations, and possibly golden hashes.
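
The "exactly one manifest entry per source" rule is easy to break when adding a sample, so a quick audit helps. The sketch below deliberately takes two plain lists of names and makes no claim about the real schema of scripts/rewrite/instruction_microtests.json; audit_manifest and the file names in the usage note are hypothetical.

```python
from collections import Counter


def audit_manifest(source_names, manifest_sources):
    """Compare sample sources on disk against the source names listed in the manifest.

    Returns sorted lists of sources with no entry, sources with more than one
    entry, and manifest entries that point at no existing source.
    """
    counts = Counter(manifest_sources)
    on_disk = set(source_names)
    listed = set(counts)
    return {
        "missing": sorted(on_disk - listed),
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
        "orphaned": sorted(listed - on_disk),
    }
```

For example, with sources a.asm and b.asm on disk and manifest entries a.asm, a.asm, c.asm, the audit reports b.asm as missing, a.asm as duplicated, and c.asm as orphaned. All three lists being empty is the state the baseline expects.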
