mirror of
https://github.com/NaC-L/Mergen.git
synced 2026-05-12 09:40:34 +00:00
0292c7d564
During the Themida-frontier session, two failure modes cost real time: 1) Ad-hoc lifter runs produced scratch files (internal_0x*.ll, *handoff.md, linked_target.txt, vlizer_stub.txt) that got committed on a research branch and then had to be scrubbed before merge. Extend .gitignore with patterns matching the observed pollution class. 2) 'python test.py baseline' was run against origin/main with a build_iced/ directory that still held object files from a feature branch. The resulting lifter binary linked a stale mix of old and new code, producing a failure set that matched neither branch. This led to a false 'branch matches main' claim that was only caught after CI. Document the required wipe-and-rebuild in the operator defaults. No code changes. Co-authored-by: yusufcanislek <yusuf.canislek@meetdandy.com>
8.9 KiB
8.9 KiB
Repository Guidelines
Project Overview
Mergen is a function-level x64 PE to LLVM IR lifter for deobfuscation and devirtualization. The active workflow is rewrite/regression driven: changes are expected to preserve lifted IR shape, runtime semantics, and deterministic outputs.
Primary repo entry points:
README.md— project purpose and high-level entry linksARCHITECTURE.md— current pipeline order and invariantsdocs/SCOPE.md— support matrix and quality contractdocs/REWRITE_BASELINE.md— operational regression workflow
Architecture & Data Flow
The core pipeline is:
- CLI entry in
lifter/core/Lifter.cpp - Runtime image validation in
lifter/core/RuntimeImageContext.hpp - Lifter setup / auto-outline in
lifter/core/LifterStages.hpp - Memory policy + paged memory setup in
lifter/memory/MemoryPolicySetup.hppandlifter/core/LifterPipelineStages.hpp - Signature stage in
lifter/core/LifterPipelineStages.hpp - Lift loop in
lifter/core/LiftDriver.hpp - Fixpoint optimization in
lifter/core/MergenPB.hpp - Final post-passes and IR emission
Important invariants:
STACKP_VALUEis fixed at0x14FEA0(lifter/core/Includes.h).- Stack reserve is clamped to
[0x1000, 0x100000](lifter/memory/MemoryPolicySetup.hpp). - Pass order is intentional:
GEPLoadPass -> ReplaceTruncWithLoadPass -> PromotePseudoStackPass -> PromotePseudoMemory, then O2, then post-passes such as switch normalization and canonical naming (ARCHITECTURE.md,lifter/core/MergenPB.hpp). Do not reorder casually. - The disassembler boundary is normalized through
lifter/disasm/CommonDisassembler.hpp; semantics should consume normalized operands, not backend-specific details.
Key Directories
lifter/core/— CLI, runtime image setup, pipeline orchestration, ABI/signature handlinglifter/semantics/— opcode dispatch and instruction semantics (Semantics.ipp,Semantics_*.ipp,x86_64_opcodes.x)lifter/disasm/— Iced/Zydis abstraction layerlifter/memory/— file-backed memory, page map, pseudo-memory/stack promotionlifter/analysis/— custom LLVM passes and path solvinglifter/test/— in-process instruction/oracle test harness and golden metadatatestcases/rewrite_smoke/— rewrite smoke corpus sourcesscripts/rewrite/— baseline gate, sample build, manifest validation, oracle generation, semantic checksscripts/dev/— preferred configure/build entrypointsdocs/— current workflow, scope, and reviewer policy docs
Important Files
cmake.toml— source of truth for build configuration;CMakeLists.txtis generated, do not edit it directly.test.py— primary QA entrypoint.scripts/rewrite/instruction_microtests.json— source of truth for rewrite smoke samples, expected IR patterns, semantic cases, and CI skips.lifter/test/test_vectors/oracle_vectors.json— default instruction oracle vectors.lifter/test/test_vectors/golden_ir_hashes.json— determinism gate for tracked IR outputs..editorconfigand.clang-format— formatting contract (2 spaces, LF, UTF-8, 100-column LLVM-based style).
Development Commands
Before running any command in this section, confirm the exact repo root and cwd. Prefer these repo-provided scripts over ad hoc shell commands.
Preferred Windows build flow:
cmd /c scripts\dev\configure_iced.cmd
cmd /c scripts\dev\build_iced.cmd
Alternate Zydis-only lane:
cmd /c scripts\dev\configure_zydis.cmd
cmd /c scripts\dev\build_zydis.cmd
Primary test commands:
python test.py quick
python test.py all
python test.py baseline
python test.py micro --check-flags
python test.py negative
python test.py coverage --full
python test.py report --json
Useful targeted flows:
python test.py micro add
python test.py semantic branch
scripts\rewrite\run.cmd
scripts\rewrite\run_microtests.cmd --check-flags xor
Runtime / Tooling Preferences
- Platform focus is Windows. CI uses
scripts/dev/*.cmdandwindows-latest. - Prefer the iced lane by default; use Zydis only when you need the fallback/backend-specific lane.
- Configure/build scripts assume Ninja +
clang-cl; they do not invokeVsDevCmd.bat. LLVM_DIRmust resolve to LLVM 18; CI currently downloads LLVM 18.1.8.- Cargo is expected on PATH for the iced lane.
- Build outputs live in
build_iced/,build_zydis/, or otherbuild*/directories; treat them as generated artifacts. - Regression artifacts are written outside the repo by default to
../rewrite-regression-work/.
Code Conventions & Common Patterns
- Extend instruction support through the existing opcode table and semantics files; do not add parallel dispatch paths.
- Wire new entries in
lifter/semantics/x86_64_opcodes.x. - Implement behavior in the appropriate
Semantics_*.ippfile.
- Wire new entries in
- Preserve the normalized operand model across disassembly and semantics. Cross-check
lifter/disasm/CommonDisassembler.hpp, backend adapters, and downstream helpers before changing operand enums or widths. - Memory accesses should go through the existing operand/memory helpers (
lifter/semantics/OperandUtils.ipp); bypassing them usually breaks constant folding, page-map behavior, or pseudo-stack promotion. - Call handling is ABI-aware. Check
lifter/core/AbiCallContract.hppand existing control-flow helpers before changing call lowering. - Prefer explicit failures and diagnostics over silent fallbacks. The repo already has structured lift diagnostics (
lifter/core/LiftDiagnostics.hpp) and strict negative tests intest.py. - When touching build definitions, update
cmake.toml; regenerate behavior flows through cmkr intoCMakeLists.txt. - Keep docs and test manifests in the same change when behavior changes. This repo relies on docs/tests as active contracts, not afterthoughts.
Testing & QA Expectations
python test.pyis the canonical entrypoint.quickandallare the main gates used in CI.- The rewrite baseline is manifest-backed: every source in
testcases/rewrite_smoke/must have exactly one manifest entry inscripts/rewrite/instruction_microtests.json. - Golden IR hashing is part of the contract. C/C++-compiled smoke samples are excluded from golden hashes because their IR addresses are toolchain-dependent; they are checked via semantic tests instead.
python test.py negativematters: it guards explicit failure behavior for malformed manifests, unsafe paths, and bad vector schemas.- Use focused verification that matches your change:
- Core/semantics/disasm/test harness changes:
python test.py micro --check-flags - Rewrite script/manifest changes:
python test.py baselineandpython test.py negative - Coverage/vector plumbing:
python test.py coverage --fullandpython test.py report --json - Build script/CMake changes: rerun the affected
scripts\dev\configure_*.cmd+build_*.cmdlane
- Core/semantics/disasm/test harness changes:
Operator workflow defaults
Use these with the repo-specific architecture/test rules above.
- Confirm the real repo root, source-of-truth file, and owning subsystem before searching or editing.
- Narrow search scope before using broad repo scans.
- Prefer
read,find,grep,ast_grep,edit,ast_edit, andlspbefore bash for discovery or structural edits. - Before build/test/git/bash commands, confirm the exact cwd and lane you intend to run.
- If you edit the same file twice, re-read it first.
- Default to one main line of work; split into subtasks only when file boundaries are real and outputs are independent.
- Do not finish non-trivial work without focused verification that matches the changed subsystem.
- Before comparing two branches with
python test.py baseline/quick, wipebuild_iced/(rm -rf build_iced && cmd /c scripts\dev\configure_iced.cmd && cmd /c scripts\dev\build_iced.cmd). Incremental builds reuse object files across branches and will happily link a stale mix of old and new code, producing a lifter binary whose failure set reflects neither branch. This has caused at least one false "branch matches main" claim.
What not to do
- Do not start with repo-root scans when a narrower directory or entry document can answer the question.
- Do not run configure/build/test commands from an assumed cwd.
- Do not use bash-first discovery when a specialized tool can answer it.
- Do not spawn reviewer/subtask branches just to spread a single code path across multiple agents.
Process Notes For AI Assistants
- Prefer
docs/REWRITE_BASELINE.mdand CI workflows over older generic build docs when commands disagree. - Do not edit generated files or artifact outputs unless the task is explicitly about generation.
- Before changing exported behavior, inspect direct consumers and the matching rewrite/test manifests.
- If you add a new sample, update both
testcases/rewrite_smoke/andscripts/rewrite/instruction_microtests.jsonin the same change. - If you change semantics or ABI behavior, expect to update oracle vectors, microtests, semantic expectations, and possibly golden hashes.