UNDER CONSTRUCTION
This article provides a description of popular assemblers and their architecture-specific differences.
Assemblers
GCC generates assembly code and invokes GNU Assembler (also known as "gas"), which is part of GNU Binutils, to convert the assembly code into machine code. The GCC driver is also capable of accepting assembly input files. Due to GCC's widespread usage, GNU Assembler is arguably the most popular assembler.
Within the LLVM project, the LLVM integrated assembler is a library that is linked by Clang, llvm-mc, and lld (for LTO purposes) to generate machine code. It supports a wide range of GNU Assembler syntax and can be used as a drop-in replacement for GNU Assembler.
On the Windows platform, the Microsoft Macro Assembler (MASM) is widely used.
For the x86 architecture, NASM is another popular assembler.
Architectures
x86
There are two main branches of syntax: Intel syntax and AT&T syntax. AT&T syntax is derived from the PDP-11 assembler syntax and exhibits several key differences:
- The operand list is reversed compared to Intel syntax.
- The four-part generic addressing mode is written as displacement(base,index,scale) instead of [base+index*scale+disp] in Intel syntax.
- Immediate values are prefixed with $, while registers are prefixed with %.
- The mnemonics have a suffix indicating the operand size, e.g. b for 1 byte, w for 2 bytes (word), l for 4 bytes (long), and q for 8 bytes (quad).
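As a rough illustration of the addressing-mode difference listed above, the following Python sketch rewrites an AT&T memory operand into the Intel bracket form. The function name att_mem_to_intel and its regular expression are hypothetical helpers written for this note, not part of any assembler, and the sketch deliberately ignores segment overrides and other corner cases.

```python
import re

# Hypothetical helper: matches displacement(base,index,scale).
# Every component inside and before the parentheses is optional.
_ATT_MEM = re.compile(
    r"^\s*(?P<disp>[^()\s]*)"
    r"\(\s*(?P<base>%\w+)?\s*"
    r"(?:,\s*(?P<index>%\w+)\s*(?:,\s*(?P<scale>[1248]))?)?\s*\)\s*$"
)


def att_mem_to_intel(operand: str) -> str:
    """Rewrite an AT&T memory operand as an Intel-style bracket expression."""
    m = _ATT_MEM.match(operand)
    if m is None:
        raise ValueError(f"not an AT&T memory operand: {operand!r}")

    terms = []
    if m["base"]:
        terms.append(m["base"].lstrip("%"))  # drop the register sigil
    if m["index"]:
        scale = m["scale"] or "1"  # an omitted scale means 1
        terms.append(f"{m['index'].lstrip('%')}*{scale}")

    body = "+".join(terms)
    disp = m["disp"]
    if disp:
        # A negative displacement already carries its own sign.
        joiner = "" if (disp.startswith("-") or not body) else "+"
        body += joiner + disp
    return f"[{body}]"


if __name__ == "__main__":
    for att in ("8(%rax,%rcx,4)", "-4(%rbp)", "var(,%rdi,8)", "(%rsp)"):
        print(f"{att:18} -> {att_mem_to_intel(att)}")
```

The sketch only reorders the four components; it does not add the size directive (such as DWORD PTR) that Intel syntax needs when the operand size cannot be inferred from a register.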
Although the sigils add some complexity to the language, they do provide a distinct advantage: symbol references can be parsed without ambiguity. Many x86 instructions take an operand that can be a register or a memory location. With sigils, the kind of each operand is evident from its spelling, as in subl var, %eax (memory operand) and subl $1, %eax (immediate operand).
% gcc -S a.c
Intel syntax is generally concise, except for the verbose size directives (e.g., DWORD PTR). It is widely used in the Windows environment and within the reverse engineering community.
However, Intel syntax has an ambiguity flaw: a variable whose name collides with a register name cannot be referenced (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53929).
% cat ambiguous.c
I believe it would be beneficial if the designers added sigils to Intel syntax to disambiguate symbol references from registers. The absence of AT&T-style line noise makes Intel syntax code much more readable. Unfortunately, Intel syntax is less popular in software code due to GCC defaulting to AT&T syntax. (Please, really, make -masm=intel the default for x86.)
Using as -msyntax=intel -mnaked-reg allows parsing the input in Intel syntax without a register prefix. This is similar to including a .intel_syntax noprefix directive in the input.
With llvm-mc -x86-asm-syntax=intel, the input can be parsed in Intel syntax. Using -output-asm-variant=1 will print instructions in Intel syntax.
MIPS
Modifiers are utilized to describe different access types of a symbol. This serves as a bonus as it prevents symbol references from being mistaken as register names. However, the function call-like syntax can appear verbose.
lui a0, %tprel_hi(tls)
Power ISA
Power ISA assembly may seem unusual, as general-purpose registers are not prefixed with r. Whether an integer denotes a register or an immediate value depends on its position as an operand in an instruction. I find that this difference slightly hurts readability.
Similar to x86, postfix modifiers are used to describe different access kinds of a symbol.
AArch64
Prefix modifiers are used to describe various access types of a symbol. Personally, this is the modifier syntax that I prefer the most.
add x8, x8, :tprel_hi12:tls
RISC-V
The modifier syntax is copied from MIPS.
The documentation is available at https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.
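To compare the modifier styles described in the sections above side by side (function call-like for MIPS and RISC-V, prefix for AArch64, postfix for x86 and Power ISA), here is a small Python sketch that classifies a symbol reference by its spelling. The function classify_modifier and the style labels are hypothetical names chosen for this note; the postfix inputs var@GOTPCREL and var@toc@ha are illustrative x86 and Power ISA spellings that do not appear in the text above.

```python
import re

# Hypothetical style labels paired with the spelling each one recognizes.
_PATTERNS = [
    # %modifier(symbol), as in MIPS and RISC-V
    ("call-like", re.compile(r"^%(?P<mod>\w+)\((?P<sym>[\w.$]+)\)$")),
    # :modifier:symbol, as in AArch64
    ("prefix", re.compile(r"^:(?P<mod>\w+):(?P<sym>[\w.$]+)$")),
    # symbol@modifier, as in x86 and Power ISA (modifiers may be chained)
    ("postfix", re.compile(r"^(?P<sym>[\w.$]+?)@(?P<mod>[\w@]+)$")),
]


def classify_modifier(expr: str):
    """Return (style, modifier, symbol), or None for a plain symbol."""
    for style, pattern in _PATTERNS:
        m = pattern.match(expr)
        if m:
            return style, m["mod"], m["sym"]
    return None


if __name__ == "__main__":
    for expr in ("%tprel_hi(tls)", ":tprel_hi12:tls", "var@GOTPCREL", "tls"):
        print(expr, "->", classify_modifier(expr))
```

In the call-like and prefix styles the modifier comes first and is introduced by a sigil, so a bare identifier can never be mistaken for a modified reference, which is the disambiguation benefit noted for MIPS.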
Inline assembly
Certain compilers allow the inclusion of assembly code within a high-level language.
The most widely used implementations are GCC Basic Asm and Extended Asm. On Windows, MSVC supports inline assembly for x86-32 but not for x86-64 or Arm.
Clang supports both GCC and MSVC inline assembly. Clang's MSVC inline assembly can be utilized with x86-64.
Some compilers provide additional variants of inline assembly. Here are some relevant links:
- Free Pascal https://wiki.freepascal.org/Asm
- D https://dlang.org/spec/iasm.html
- Jai https://jai.community/t/inline-assembly/139
- Nim https://nim-lang.github.io/Nim/manual.html#statements-and-expressions-assembler-statement
Notes on GNU Assembler
.file and .loc directives are used to create .debug_line.
.cfi* directives are used to create .eh_frame or .debug_frame.
GNU Assembler implements "INDEFINITE REPEAT BLOCK DIRECTIVES: .IRP AND .IRPC" from MACRO-11. Unfortunately, there is no directive equivalent to a counted loop such as for (int i = 0; i < 20; i++). .irpc i,0123456789 gives only 10 iterations, and writing out all integers using .irp is tedious and error-prone.
.rept 3
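One way to avoid typing the integer list by hand, assuming an external generation step is acceptable in the build or test setup, is to emit the .irp block from a script. The Python sketch below is a hypothetical helper (irp_loop is not a GNU Assembler or binutils facility); it only formats text that can then be pasted or piped into an assembly file.

```python
def irp_loop(var: str, count: int, body_lines: list[str]) -> str:
    """Format an .irp block that iterates var over 0..count-1.

    Inside body_lines, refer to the loop variable as \\var
    (a backslash followed by the name), as .irp requires.
    """
    values = ",".join(str(i) for i in range(count))
    lines = [f".irp {var},{values}"]
    lines += [f"  {line}" for line in body_lines]
    lines.append(".endr")  # .irp blocks are closed by .endr
    return "\n".join(lines)


if __name__ == "__main__":
    # Emits a 20-iteration block whose body is ".byte \i".
    print(irp_loop("i", 20, [".byte \\i"]))
```

This keeps the assembly file free of a C preprocessor dependency, at the cost of adding a generation step.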
.if, .ifdef, and .ifndef directives allow us to write conditional code in assembly tests without using a C preprocessor. I often use .ifdef to combine positive tests and negative tests in one file.
# RUN: llvm-mc %s | FileCheck %s
GNU Assembler has supported .incbin since 2001-07 (hey, C/C++ #embed). The review thread mentioned that .incbin had been supported by some other assemblers.
Notes on LLVM integrated assembler
In general, inline assembly is parsed by LLVMMCParser for validation and formatting purposes. Parsing is disabled by default for certain targets, and it can be explicitly disabled with the -fno-integrated-as option.
Let's focus on ELF platforms for the following description, assuming our goal is to create a relocatable object file. The input file can be either LLVM IR (intermediate code; the initial input file may be in C/C++) or assembly language.
If the input is LLVM IR, LLVM creates an MCObjectStreamer object with new MCELFStreamer or a target-registered factory (e.g., AArch64ELFStreamer). The streamer constructor creates an MCAssembler object. For an assembly input file, LLVM additionally creates an MCAsmParser object and an MCTargetAsmParser object.
MSVC inline assembly
TODO
__asm blocks are parsed for Windows target triples. This extension is available on other targets by specifying -fasm-blocks or the broad -fms-extensions. An __asm statement is represented as a clang::MSAsmStmt object.
clang::Parser::ParseMicrosoftAsmStatement parses the inline assembly string and calls llvm::AsmParser::parseMSInlineAsm. It is worth noting that the string may be modified during this process. For a clang::MSAsmStmt object, LLVM IR is generated through clang::CodeGen::CodeGenFunction::EmitAsmStmt.