2024-09-07 18:15:29 +06:00

6.0 KiB

Architecture of the Auto-Sync framework

This document is split into four parts.

  1. An overview of the update process and which subcomponents of auto-sync do what.
  2. The instructions how to update an architecture which already supports auto-sync.
  3. Instructions how to refactor an architecture to use auto-sync.
  4. Notes about how to add a new architecture to Capstone with auto-sync.

Please read the section about capstone module design in ARCHITECTURE.md before proceeding. The architectural understanding is important for the following.

Update procedure

As already described in the ARCHITECTURE document, Capstone uses translated and generated source code from LLVM.

Because LLVM is written in C++ and Capstone in C the update process is internally complicated but almost completely automated.

auto-sync categorizes source files of a module into three groups. Each group is updated differently.

File type Update method Edits by hand
Generated files Generated by patched LLVM backends Never/Not allowed
Translated LLVM C++ files CppTranslater and Differ Only changes which are too complicated for automation.
Capstone files By hand all

Let's look at the update procedure for each group in detail.

Note: The only exception to touch generated files is via git patches. This is the last resort if something is broken in LLVM, and we cannot generate correct files.

Generated files

Generated files always have the file extension .inc.

There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:

  • For Capstone: <ARCH>GenCS<NAME>.inc.
  • For LLVM code: <ARCH>Gen<NAME>.inc.

The files are generated by refactored LLVM TableGen emitter backends.

The procedure looks roughly like this:

                                                                   ┌──────────┐
    1               2                 3                4           │CS .inc   │
┌───────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐  ┌─►│files     │
│ .td   │     │           │     │           │     │ Code-    │  │  └──────────┘
│ files ├────►│ TableGen  ├────►│  CodeGen  ├────►│ Emitter  ├──┤
└───────┘     └──────┬────┘     └───────────┘     └──────────┘  │  ┌──────────┐
                     │                                 ▲        └─►│LLVM .inc │
                     └─────────────────────────────────┘           │files     │
                                                                   └──────────┘
  1. LLVM architectures are defined in .td files. They describe instructions, operands, features and other properties of an architecture.

  2. LLVM TableGen parses these files and converts them to an internal representation.

  3. In the second step a TableGen component called CodeGen abstracts the these properties even further. The result is a representation which is not specific to any architecture (e.g. the CodeGenInstruction class can represent a machine instruction of any architecture).

  4. The Code-Emitter uses the abstract representation of the architecture (provided from CodeGen) to generated state machines for instruction decoding. Architecture specific information (think of register names, operand properties etc.) is taken from TableGen's internal representation.

The result is emitted to .inc files. Those are included in the translated C++ files or Capstone code where necessary.

Translation of LLVM C++ files

We use two tools to translate C++ to C files.

First the CppTranslator and afterward the Differ.

The CppTranslator parses the C++ files and patches C++ syntax with its equivalent C syntax.

Note: For details about this checkout suite/auto-sync/CppTranslator/README.md.

Because the result of the CppTranslator is not perfect, we still have many syntax problems left.

Those need to be fixed partially by hand.

Differ

In order to ease this process we run the Differ after the CppTranslator.

The Differ compares our two versions of C files we have now. One of them are the C files currently used by the architecture module. On the other hand we have the translated C files. Those are still faulty and need to be fixed.

Most fixes are syntactical problems. Those were almost always resolved before, during the last update. The Differ helps you to compare the files and let you select which version to accept.

Sometimes (not very often though), the newly translated C files contain important changes. Most often though, the old files are already correct.

The Differ parses both files into an abstract syntax tree and compares certain nodes with the same name (mostly functions).

The user can choose if she accepts the version from the translated file or the old file. This decision is saved for every node. If there exists a saved decision for two nodes, and the nodes did not change since the last time, it applies the previous decision automatically again.

The Differ is far from perfect. It only helps to automatically apply "known to be good" fixes and gives the user a better interface to solve the other problems. But there will still be syntax errors left afterward. These must be fixed by hand.