6.0 KiB
Architecture of the Auto-Sync framework
This document is split into four parts.
- An overview of the update process and which subcomponents of
auto-sync
do what. - The instructions how to update an architecture which already supports
auto-sync
. - Instructions how to refactor an architecture to use
auto-sync
. - Notes about how to add a new architecture to Capstone with
auto-sync
.
Please read the section about capstone module design in ARCHITECTURE.md before proceeding. The architectural understanding is important for the following.
Update procedure
As already described in the ARCHITECTURE
document, Capstone uses translated
and generated source code from LLVM.
Because LLVM is written in C++ and Capstone in C the update process is internally complicated but almost completely automated.
auto-sync
categorizes source files of a module into three groups. Each group is updated differently.
File type | Update method | Edits by hand |
---|---|---|
Generated files | Generated by patched LLVM backends | Never/Not allowed |
Translated LLVM C++ files | CppTranslater and Differ |
Only changes which are too complicated for automation. |
Capstone files | By hand | all |
Let's look at the update procedure for each group in detail.
Note: The only exception to touch generated files is via git patches. This is the last resort if something is broken in LLVM, and we cannot generate correct files.
Generated files
Generated files always have the file extension .inc
.
There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:
- For Capstone:
<ARCH>GenCS<NAME>.inc
. - For LLVM code:
<ARCH>Gen<NAME>.inc
.
The files are generated by refactored LLVM TableGen emitter backends.
The procedure looks roughly like this:
┌──────────┐
1 2 3 4 │CS .inc │
┌───────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ ┌─►│files │
│ .td │ │ │ │ │ │ Code- │ │ └──────────┘
│ files ├────►│ TableGen ├────►│ CodeGen ├────►│ Emitter ├──┤
└───────┘ └──────┬────┘ └───────────┘ └──────────┘ │ ┌──────────┐
│ ▲ └─►│LLVM .inc │
└─────────────────────────────────┘ │files │
└──────────┘
-
LLVM architectures are defined in
.td
files. They describe instructions, operands, features and other properties of an architecture. -
LLVM TableGen parses these files and converts them to an internal representation.
-
In the second step a TableGen component called CodeGen abstracts the these properties even further. The result is a representation which is not specific to any architecture (e.g. the
CodeGenInstruction
class can represent a machine instruction of any architecture). -
The
Code-Emitter
uses the abstract representation of the architecture (provided fromCodeGen
) to generated state machines for instruction decoding. Architecture specific information (think of register names, operand properties etc.) is taken fromTableGen's
internal representation.
The result is emitted to .inc
files. Those are included in the translated C++ files or Capstone code where necessary.
Translation of LLVM C++ files
We use two tools to translate C++ to C files.
First the CppTranslator
and afterward the Differ
.
The CppTranslator
parses the C++ files and patches C++ syntax
with its equivalent C syntax.
Note: For details about this checkout suite/auto-sync/CppTranslator/README.md
.
Because the result of the CppTranslator
is not perfect,
we still have many syntax problems left.
Those need to be fixed partially by hand.
Differ
In order to ease this process we run the Differ
after the CppTranslator
.
The Differ
compares our two versions of C files we have now.
One of them are the C files currently used by the architecture module.
On the other hand we have the translated C files. Those are still faulty and need to be fixed.
Most fixes are syntactical problems. Those were almost always resolved before, during the last update.
The Differ
helps you to compare the files and let you select which version to accept.
Sometimes (not very often though), the newly translated C files contain important changes. Most often though, the old files are already correct.
The Differ
parses both files into an abstract syntax tree and compares certain nodes with the same name
(mostly functions).
The user can choose if she accepts the version from the translated file or the old file. This decision is saved for every node. If there exists a saved decision for two nodes, and the nodes did not change since the last time, it applies the previous decision automatically again.
The Differ
is far from perfect. It only helps to automatically apply "known to be good" fixes
and gives the user a better interface to solve the other problems.
But there will still be syntax errors left afterward. These must be fixed by hand.