XenonRecomp/thirdparty/capstone/docs/ARCHITECTURE.md

# Capstone Architecture overview

## Architecture of Capstone

TODO

## Architecture of a Module

An architecture module is split into two components.

1. The disassembler logic, which decodes bytes to instructions.
2. The mapping logic, which maps the result from component 1 to
a Capstone internal representation and adds additional detail.

### Component 1 - Disassembler logic

The disassembler logic consists exclusively of code from LLVM.
It uses:

- Generated state machines, enums and the like for instruction decoding.
- Handwritten disassembler logic for decoding instruction operands
and controlling the decoding procedure.

### Component 2 - Mapping logic

The mapping component has three different task:

1. Serving as programmable interface for the Capstone core to the LLVM code.
2. Mapping LLVM decoded instructions to a Capstone instruction.
3. Adding additional detail to the Capstone instructions
(e.g. operand `read/write` attributes etc.).

### Instruction representation

There exist two structs which represent an instruction:

- `MCInst`: The LLVM representation of an instruction.
- `cs_insn`: The Capstone representation of an instruction.

The `MCInst` is used by the disassembler component for storing the decoded instruction.
The mapping component on the other hand, uses the `MCInst` to populate the `cs_insn`.

The `cs_insn` is meant to be used by the Capstone core.
It is distinct from the `MCInst`. It uses different instruction identifiers, other operand representation
and holds more details about an instruction.

### Disassembling process

There are two steps in disassembling an instruction.

1. Decoding bytes to a `MCInst`.
2. Decoding the assembler string for the `MCInst` AND mapping it to a `cs_insn` in the same step.

Here is a boiled down explanation about these steps.

**Step 1**

```
                                                                 ARCH_LLVM_getInstr(
                                     ARCH_getInstr(bytes)   ┌───┐   bytes)           ┌─────────┐            ┌──────────┐
                                    ┌──────────────────────►│ A ├──────────────────► │         ├───────────►│          ├────┐
                                    │                       │ R │                    │ LLVM    │            │ LLVM     │    │ Decode
                                    │                       │ C │                    │         │            │          │    │ Instr.
                                    │                       │ H │                    │         │decode(Op0) │          │◄───┘
┌────────┐ disasm(bytes) ┌──────────┴──┐                    │   │                    │ Disass- │ ◄──────────┤ Decoder  │
│CS Core ├──────────────►│ ARCH Module │                    │   │                    │ embler  ├──────────► │ State    │
└────────┘               └─────────────┘                    │ M │                    │         │            │ Machine  │
                                    ▲                       │ A │                    │         │decode(Op1) │          │
                                    │                       │ P │                    │         │ ◄──────────┤          │
                                    │                       │ P │                    │         ├──────────► │          │
                                    │                       │ I │                    │         │            │          │
                                    │                       │ N │                    │         │            │          │
                                    └───────────────────────┤ G │◄───────────────────┤         │◄───────────┤          │
                                                            └───┘                    └─────────┘            └──────────┘
```

In the first decoding step the instruction bytes get forwarded to the
decoder state machine.
After the instruction was identified, the state machine calls decoder functions
for each operand to extract the operand values from the bytes.

The disassembler and the state machine are equivalent to what `llvm-objdump` uses
(in fact they use the same files, except we translated them from C++ to C).

**Step 2**

```
                                      ARCH_printInst(        ARCH_LLVM_printInst(
                                         MCInst,                MCInst,
                                         asm_buf)      ┌───┐    asm_buf)         ┌────────┐            ┌──────────┐
                                      ┌───────────────►│ A ├───────────────────► │        ├───────────►│          ├──────┐
                                      │                │ R │                     │ LLVM   │            │ LLVM     │      │ Decode
                                      │                │ C │                     │        │            │          │      │ Mnemonic
                                      │                │ H │ add_cs_detail(Op0)  │        │ print(Op0) │          │◄─────┘
                                      │                │   │ ◄───────────────────┤        │ ◄──────────┤          │
           printer(MCInst,            │                │   ├───────────────────► │        ├──────────► │ Asm-     │
┌────────┐         asm_buf)┌──────────┴──┐             │   │                     │ Inst   │            │ Writer   │
│CS Core ├────────────────►│ ARCH Module │             │   │                     │ Printer│            │ State    │
└────────┘                 └─────────────┘             │ M │                     │        │            │ Machine  │
                                      ▲                │ A │ add_cs_detail(Op1)  │        │ print(Op1) │          │
                                      │                │ P │ ◄───────────────────┤        │ ◄──────────┤          │
                                      │                │ P ├───────────────────► │        ├──────────► │          │
                                      │                │ I │                     │        │            │          │
                                      │                │ N │                     │        │            │          │
                                      └────────────────┤ G │◄────────────────────┤        │◄───────────┤          │
                                                       └───┘                     └────────┘            └──────────┘
```

The second decoding step passes the `MCInst` and a buffer to the printer.

After determining the mnemonic, each operand is printed by using
functions defined in the `InstPrinter`.

Each time an operand is printed, the mapping component is called
to populate the `cs_insn` with the operand information and details.

Again the `InstPrinter` and `AsmWriter` are translated code from LLVM,
so they mirror the behavior of `llvm-objdump`.
Initial Commit 2024-09-07 18:00:09 +06:00			`# Capstone Architecture overview`

			`## Architecture of Capstone`

			`TODO`

			`## Architecture of a Module`

			`An architecture module is split into two components.`

			`1. The disassembler logic, which decodes bytes to instructions.`
			`2. The mapping logic, which maps the result from component 1 to`
			`a Capstone internal representation and adds additional detail.`

			`### Component 1 - Disassembler logic`

			`The disassembler logic consists exclusively of code from LLVM.`
			`It uses:`

			`- Generated state machines, enums and the like for instruction decoding.`
			`- Handwritten disassembler logic for decoding instruction operands`
			`and controlling the decoding procedure.`

			`### Component 2 - Mapping logic`

			`The mapping component has three different task:`

			`1. Serving as programmable interface for the Capstone core to the LLVM code.`
			`2. Mapping LLVM decoded instructions to a Capstone instruction.`
			`3. Adding additional detail to the Capstone instructions`
			(e.g. operand `read/write` attributes etc.).

			`### Instruction representation`

			`There exist two structs which represent an instruction:`

			- `MCInst`: The LLVM representation of an instruction.
			- `cs_insn`: The Capstone representation of an instruction.

			The `MCInst` is used by the disassembler component for storing the decoded instruction.
			The mapping component on the other hand, uses the `MCInst` to populate the `cs_insn`.

			The `cs_insn` is meant to be used by the Capstone core.
			It is distinct from the `MCInst`. It uses different instruction identifiers, other operand representation
			`and holds more details about an instruction.`

			`### Disassembling process`

			`There are two steps in disassembling an instruction.`

			1. Decoding bytes to a `MCInst`.
			2. Decoding the assembler string for the `MCInst` AND mapping it to a `cs_insn` in the same step.

			`Here is a boiled down explanation about these steps.`

			`Step 1`

			```
			`ARCH_LLVM_getInstr(`
			`ARCH_getInstr(bytes) ┌───┐ bytes) ┌─────────┐ ┌──────────┐`
			`┌──────────────────────►│ A ├──────────────────► │ ├───────────►│ ├────┐`
			`│ │ R │ │ LLVM │ │ LLVM │ │ Decode`
			`│ │ C │ │ │ │ │ │ Instr.`
			`│ │ H │ │ │decode(Op0) │ │◄───┘`
			`┌────────┐ disasm(bytes) ┌──────────┴──┐ │ │ │ Disass- │ ◄──────────┤ Decoder │`
			`│CS Core ├──────────────►│ ARCH Module │ │ │ │ embler ├──────────► │ State │`
			`└────────┘ └─────────────┘ │ M │ │ │ │ Machine │`
			`▲ │ A │ │ │decode(Op1) │ │`
			`│ │ P │ │ │ ◄──────────┤ │`
			`│ │ P │ │ ├──────────► │ │`
			`│ │ I │ │ │ │ │`
			`│ │ N │ │ │ │ │`
			`└───────────────────────┤ G │◄───────────────────┤ │◄───────────┤ │`
			`└───┘ └─────────┘ └──────────┘`
			```

			`In the first decoding step the instruction bytes get forwarded to the`
			`decoder state machine.`
			`After the instruction was identified, the state machine calls decoder functions`
			`for each operand to extract the operand values from the bytes.`

			The disassembler and the state machine are equivalent to what `llvm-objdump` uses
			`(in fact they use the same files, except we translated them from C++ to C).`

			`Step 2`

			```
			`ARCH_printInst( ARCH_LLVM_printInst(`
			`MCInst, MCInst,`
			`asm_buf) ┌───┐ asm_buf) ┌────────┐ ┌──────────┐`
			`┌───────────────►│ A ├───────────────────► │ ├───────────►│ ├──────┐`
			`│ │ R │ │ LLVM │ │ LLVM │ │ Decode`
			`│ │ C │ │ │ │ │ │ Mnemonic`
			`│ │ H │ add_cs_detail(Op0) │ │ print(Op0) │ │◄─────┘`
			`│ │ │ ◄───────────────────┤ │ ◄──────────┤ │`
			`printer(MCInst, │ │ ├───────────────────► │ ├──────────► │ Asm- │`
			`┌────────┐ asm_buf)┌──────────┴──┐ │ │ │ Inst │ │ Writer │`
			`│CS Core ├────────────────►│ ARCH Module │ │ │ │ Printer│ │ State │`
			`└────────┘ └─────────────┘ │ M │ │ │ │ Machine │`
			`▲ │ A │ add_cs_detail(Op1) │ │ print(Op1) │ │`
			`│ │ P │ ◄───────────────────┤ │ ◄──────────┤ │`
			`│ │ P ├───────────────────► │ ├──────────► │ │`
			`│ │ I │ │ │ │ │`
			`│ │ N │ │ │ │ │`
			`└────────────────┤ G │◄────────────────────┤ │◄───────────┤ │`
			`└───┘ └────────┘ └──────────┘`
			```

			The second decoding step passes the `MCInst` and a buffer to the printer.

			`After determining the mnemonic, each operand is printed by using`
			functions defined in the `InstPrinter`.

			`Each time an operand is printed, the mapping component is called`
			to populate the `cs_insn` with the operand information and details.

			Again the `InstPrinter` and `AsmWriter` are translated code from LLVM,
			so they mirror the behavior of `llvm-objdump`.