This article collects notes on z/Architecture, focusing on the ELF ABI and ELF linkers. An lld/ELF patch motivated me to write this post.
z/Architecture is a big-endian mainframe computer architecture supporting 24-bit, 31-bit, and 64-bit addressing modes. It is the latest generation in a lineage stretching back to 1964 with IBM System/360 (32-bit general-purpose registers and 24-bit addressing). This lineage includes System/370 (1970), System/370 Extended Architecture (1983), Enterprise Systems Architecture/370 (1988), and Enterprise Systems Architecture/390 (1990). For a deeper dive into the design choices behind z/Architecture's extension from ESA/390, you can refer to "Development and attributes of z/Architecture."
Linux on IBM Z is a 64-bit operating system on z/Architecture, related to an older effort porting Linux to ESA/390. As the Wikipedia page clarifies:
Historically the Linux kernel architecture designations were "s390" and "s390x" to distinguish between the 32-bit and 64-bit Linux on IBM Z kernels respectively, but "s390" now also refers generally to the one Linux on IBM Z kernel architecture.
Documents
- z/Architecture Principles of Operation: This is the instruction set manual with an unusual name inherited from IBM System/360 Principles of Operation.
- Assembler Language Programming for IBM System z: This book is more readable than Principles of Operation.
- z/Architecture Reference Summary: A concise reference of instructions.
- zSeries ELF Application Binary Interface Supplement (v1.0.2), 2002: This ABI document has been superseded by s390x-abi.
- https://github.com/IBM/s390x-abi: The latest version of the psABI (processor supplement to the System V ABI) resides here. While the absence of updates between 2002 and 2021 might seem odd, rest assured the documentation is actively maintained.
Instruction notes
Each instruction has a length of two, four or six bytes, and must be located at a 2-byte boundary. Six-byte instructions have been available since S/360.
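The length of an instruction can be read off its first byte: the two leftmost opcode bits are 00 for 2 bytes, 01 or 10 for 4 bytes, and 11 for 6 bytes. Below is a minimal sketch of a decoder built on that rule; `insn_length` and `count_insns` are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the instruction whose first byte is `first`.
   The two leftmost opcode bits select the length:
   00 -> 2 bytes, 01 or 10 -> 4 bytes, 11 -> 6 bytes. */
static unsigned insn_length(uint8_t first) {
    switch (first >> 6) {
    case 0:  return 2;
    case 3:  return 6;
    default: return 4;
    }
}

/* Count the instructions in a buffer that holds only whole instructions.
   Hypothetical helper, for illustration only. */
static size_t count_insns(const uint8_t *code, size_t size) {
    size_t n = 0;
    for (size_t off = 0; off < size; off += insn_length(code[off])) {
        assert(off + insn_length(code[off]) <= size); /* no truncated tail */
        n++;
    }
    return n;
}
```

Because every length is even, walking a code buffer this way keeps each instruction on a 2-byte boundary.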
There are 16 64-bit general-purpose registers. r14 is used as the link register while r15 is the stack pointer. In s390x-abi, registers r6 to r13 and r15 are designated as non-volatile (not clobbered by a function call). Registers r2 to r6 are used for integer arguments.
- r6 being non-volatile for argument storage seems uncommon compared to other architectures.
- Only 5 registers (r2 to r6) are used for integer argument storage, which is inadequate. It is unclear why r0 and r1 are not used.
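The register roles above can be written down as two small predicates. This is a simplified sketch: it only models integer arguments and ignores floating-point and aggregate passing, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_INT_ARG_REGS = 5 }; /* r2, r3, r4, r5, r6 */

/* Register number holding the i-th (0-based) integer argument,
   or -1 when the argument is passed on the stack. */
static int int_arg_register(unsigned i) {
    return i < NUM_INT_ARG_REGS ? 2 + (int)i : -1;
}

/* True when a callee must preserve the register (r6 to r13, and r15). */
static bool is_call_saved(unsigned reg) {
    assert(reg < 16);
    return (reg >= 6 && reg <= 13) || reg == 15;
}
```

Note that r6 satisfies both predicates: it carries the fifth argument and must still be preserved by the callee.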
Most instructions have no PC-relative addressing form. Fortunately, only one instruction is needed to load _GLOBAL_OFFSET_TABLE_ (see "Global Offset Table" below) into a register (usually r12).

    larl %r12, _GLOBAL_OFFSET_TABLE_  # r12 = _GLOBAL_OFFSET_TABLE_

The RIL instruction format, consisting of 6 bytes, encodes a register and a 32-bit immediate operand. This enables it to implement valuable instructions like BRASL (Branch Relative And Save Long, like x86's CALL) and LARL (Load Address Relative Long, like x86's LEA with RIP-relative addressing).

    int var;

    // s390x

Function calls are done with BRASL (Branch Relative And Save Long, RIL type).
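To make the RIL layout concrete, here is a sketch of how an assembler or linker might encode LARL and BRASL. The field layout (8-bit opcode, 4-bit register, 4-bit opcode extension, 32-bit big-endian immediate counted in halfwords from the instruction address) is the architected one; the helper names are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a RIL-format instruction with a PC-relative operand.
   Layout: opcode (8 bits) | r1 (4 bits) | opcode extension (4 bits) |
   32-bit signed immediate, big-endian, counted in halfwords. */
static void encode_ril_rel(uint8_t out[6], uint8_t opcode, unsigned r1,
                           unsigned ext, uint64_t insn_addr, uint64_t target) {
    int64_t delta = (int64_t)(target - insn_addr);
    assert(r1 < 16 && ext < 16);
    assert(delta % 2 == 0); /* only even addresses are reachable */
    assert(delta >= -0x100000000LL && delta <= 0xFFFFFFFELL); /* about +-4 GiB */
    uint32_t imm = (uint32_t)(int32_t)(delta / 2);
    out[0] = opcode;
    out[1] = (uint8_t)((r1 << 4) | ext);
    out[2] = (uint8_t)(imm >> 24);
    out[3] = (uint8_t)(imm >> 16);
    out[4] = (uint8_t)(imm >> 8);
    out[5] = (uint8_t)imm;
}

/* larl %r1, target : opcode 0xC0, extension 0 */
static void encode_larl(uint8_t out[6], unsigned r1, uint64_t insn_addr, uint64_t target) {
    encode_ril_rel(out, 0xC0, r1, 0x0, insn_addr, target);
}

/* brasl %r1, target : opcode 0xC0, extension 5 */
static void encode_brasl(uint8_t out[6], unsigned r1, uint64_t insn_addr, uint64_t target) {
    encode_ril_rel(out, 0xC0, r1, 0x5, insn_addr, target);
}
```

The halfword scaling is why a single 6-byte instruction can reach roughly 4 GiB in either direction, and also why the target must sit at an even address.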
Global Offset Table
The .got section has 3 reserved entries. The linker defines _GLOBAL_OFFSET_TABLE_ at the start of .got. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).
The assembler modifier @GOTENT designates a 32-bit immediate operand. The assembler modifier @GOT designates an immediate operand of either 16-bit or 32-bit.
Compilers generate a LGRL (Load Relative Long) instruction to load the GOT entry of a symbol. When the symbol is non-preemptible and not an ifunc, the GOT indirection can be optimized to LARL (Load Address Relative Long). This is similar to x86-64's GOTPCRELX optimization.

    lgrl %r1, var@GOT  # R_390_GOTENT(var)

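The GOT optimization amounts to rewriting the opcode and retargeting the immediate at the symbol itself. The sketch below shows the idea under my own assumptions about how a linker would structure it: `relax_lgrl_to_larl` is a hypothetical name, and the even-address check follows from LARL's halfword-scaled immediate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to rewrite `lgrl %rX, sym@GOT` (C4 X8 imm32) into `larl %rX, sym`
   (C0 X0 imm32). Returns true and patches `insn` in place when valid. */
static bool relax_lgrl_to_larl(uint8_t insn[6], uint64_t insn_addr,
                               uint64_t sym_addr, bool preemptible,
                               bool is_ifunc) {
    assert((insn_addr & 1) == 0); /* instructions are 2-byte aligned */
    if (preemptible || is_ifunc)
        return false; /* the GOT indirection must stay */
    if (insn[0] != 0xC4 || (insn[1] & 0x0F) != 0x08)
        return false; /* not an lgrl */
    int64_t delta = (int64_t)(sym_addr - insn_addr);
    if (delta % 2 != 0)
        return false; /* larl cannot form an odd address */
    if (delta < -0x100000000LL || delta > 0xFFFFFFFELL)
        return false; /* out of range for a 32-bit halfword offset */
    uint32_t imm = (uint32_t)(int32_t)(delta / 2);
    insn[0] = 0xC0;  /* opcode of larl */
    insn[1] &= 0xF0; /* keep the register, set the extension nibble to 0 */
    insn[2] = (uint8_t)(imm >> 24);
    insn[3] = (uint8_t)(imm >> 16);
    insn[4] = (uint8_t)(imm >> 8);
    insn[5] = (uint8_t)imm;
    return true;
}
```

When the function returns false the original load from the GOT is left untouched, which is always correct, just slower.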
Procedure Linkage Table
At 32 bytes per entry, PLTs are notably larger than other architectures. Only the first 14 bytes (encompassing three instructions) are strictly necessary for eager binding.

    larl %r1, .got.plt[n]

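As an illustration of the 14-byte portion, here is a sketch that emits it, assuming the three instructions are `larl %r1, <GOT entry>`, `lg %r1, 0(%r1)` and `br %r1`. The function name and parameters are hypothetical, and the remaining 18 bytes of the entry (used for lazy binding) are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Write the eager-binding part of one PLT entry:
     larl %r1, got_entry   (6 bytes)
     lg   %r1, 0(%r1)      (6 bytes)
     br   %r1              (2 bytes) */
static void write_plt_eager_part(uint8_t out[14], uint64_t plt_entry_addr,
                                 uint64_t got_entry_addr) {
    int64_t delta = (int64_t)(got_entry_addr - plt_entry_addr);
    assert(delta % 2 == 0);
    assert(delta >= -0x100000000LL && delta <= 0xFFFFFFFELL);
    uint32_t imm = (uint32_t)(int32_t)(delta / 2);

    out[0] = 0xC0; out[1] = 0x10;            /* larl %r1, ... */
    out[2] = (uint8_t)(imm >> 24);
    out[3] = (uint8_t)(imm >> 16);
    out[4] = (uint8_t)(imm >> 8);
    out[5] = (uint8_t)imm;

    out[6] = 0xE3; out[7] = 0x10;            /* lg %r1, 0(%r1) */
    out[8] = 0x10; out[9] = 0x00;
    out[10] = 0x00; out[11] = 0x04;

    out[12] = 0x07; out[13] = 0xF1;          /* br %r1 */
}
```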
Relocations
There are 5 absolute relocation types: R_390_{8,16,20,32,64}. They can be used as data relocations (.byte, .short, etc) as well as code relocations.
- R_390_8 is used by instruction formats with an 8-bit immediate operand (e.g. SI).
- R_390_16 is used by instruction formats with a 16-bit immediate operand (e.g. RI).
- R_390_20 is used by instruction formats with a 20-bit displacement (e.g. RSY, RXY).
- R_390_32 is used by instruction formats with a 32-bit immediate operand (e.g. RIL).
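R_390_20 is the odd one out because the 20-bit displacement is not stored contiguously: the low 12 bits (DL) come first, followed by the high 8 bits (DH). Here is a sketch of applying it, assuming the relocated location is byte 2 of an RSY/RXY instruction, where the 4 bytes hold the base register nibble, DL, DH, and the final opcode byte. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when v fits in a signed field of the given width. */
static bool fits_signed(int64_t v, unsigned bits) {
    assert(bits >= 1 && bits < 64);
    int64_t limit = (int64_t)1 << (bits - 1);
    return v >= -limit && v < limit;
}

/* Apply a 20-bit signed displacement to the 4 bytes at `field`:
   base (4 bits) | DL (12 bits) | DH (8 bits) | opcode low byte (8 bits).
   The base register nibble and the opcode byte are preserved. */
static bool apply_r390_20(uint8_t field[4], int64_t value) {
    if (!fits_signed(value, 20))
        return false; /* a linker would report a range error here */
    uint32_t v = (uint32_t)value & 0xFFFFF;
    field[0] = (uint8_t)((field[0] & 0xF0) | ((v >> 8) & 0x0F)); /* DL bits 11..8 */
    field[1] = (uint8_t)(v & 0xFF);                              /* DL bits 7..0 */
    field[2] = (uint8_t)((v >> 12) & 0xFF);                      /* DH */
    return true;
}
```

The other absolute relocations are plain big-endian stores of the matching width after the same kind of range check.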
R_390_GOTPLT* relocations seem unused. GCC never emits the assembler modifier @GOTPLT.
Thread Local Storage
Refer to All about thread-local storage for TLS. On s390x, TLS Variant II is employed, with the glibc implementation completed in 2003. Overall, this design is less efficient than on other architectures. I believe the inefficiency is a self-inflicted problem rather than an architectural limitation.
First, let's look at thread pointer accessing.
- s390: 32-bit thread pointer stored in 32-bit access register a0.
- s390x: 64-bit thread pointer split across a0 and a1, both still 32-bit.
This necessitates three instructions (14 bytes) to retrieve the full thread pointer, while 64-bit access registers would simplify this:

    ear %r0, %a0  # r0 = hi(r0) | a0

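The effect of the three-instruction sequence (`ear`, `sllg` by 32, `ear`) can be modeled in C. This is only an emulation of the register arithmetic; the function names mirror the mnemonics and are not a real API.

```c
#include <assert.h>
#include <stdint.h>

/* ear: copy a 32-bit access register into the low half of a GPR,
   leaving the high half unchanged. */
static uint64_t ear(uint64_t gpr, uint32_t access_reg) {
    return (gpr & 0xFFFFFFFF00000000ULL) | access_reg;
}

/* sllg: 64-bit logical shift left. */
static uint64_t sllg(uint64_t src, unsigned amount) {
    assert(amount < 64);
    return src << amount;
}

/* Assemble the 64-bit thread pointer from a0 (high half) and a1 (low half). */
static uint64_t read_thread_pointer(uint32_t a0, uint32_t a1) {
    uint64_t r0 = 0xDEADBEEFCAFEF00DULL; /* stale register contents */
    r0 = ear(r0, a0);   /* low half  = a0 */
    r0 = sllg(r0, 32);  /* high half = a0, low half = 0 */
    r0 = ear(r0, a1);   /* low half  = a1 */
    return r0;
}
```

The stale initial value shows that the sequence does not depend on the previous contents of r0.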
General dynamic TLS model
In the general dynamic TLS model, a key difference compared to other architectures is the use of __tls_get_offset instead of __tls_get_addr. The process involves several steps, illustrated by the provided assembly code:

    ear %r0, %a0

- Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_: Four instructions are required but can be shared by subsequent TLS accesses. This step can be reordered.
- Obtaining the GOT offset: The offset (a@TLSGD) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure), relocated by dynamic relocations R_390_TLS_DTPMOD and R_390_TLS_DTPOFF. The dynamic loader will set the values to (m, a@DTPOFF), the module ID and an offset of the symbol relative to the dynamic TLS block.
- Finding the offset relative to the current dynamic TLS block (DTPOFF): __tls_get_offset(r2) returns dtv[m] + a@DTPOFF - TP. __tls_get_addr in other architectures just returns dtv[m] + a@DTPOFF.
- Adding the thread pointer to get the symbol address in the current thread.
In glibc, __tls_get_offset is defined as:

    // unsigned long __tls_get_offset(unsigned long offset);

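To compare the two interfaces, here is a toy model in C. It is not glibc's implementation: the `thread_model` type, the flat `dtv` array, and both function names are assumptions made for illustration, and the GOT base that the real function takes implicitly from r12 is passed explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t module; uint64_t offset; } tls_index;

/* One thread in the toy model: dtv[m] is the address of module m's TLS block. */
typedef struct {
    uint64_t tp;
    const uint64_t *dtv;
    uint64_t dtv_len;
} thread_model;

/* Style used by most architectures: returns the final address. */
static uint64_t model_tls_get_addr(const thread_model *t, const tls_index *ti) {
    assert(ti->module < t->dtv_len);
    return t->dtv[ti->module] + ti->offset;
}

/* s390x style: takes the byte offset of the tls_index from the GOT base
   and returns a TP-relative offset. */
static uint64_t model_tls_get_offset(const thread_model *t, const uint64_t *got,
                                     uint64_t got_offset) {
    tls_index ti;
    assert(got_offset % sizeof(uint64_t) == 0);
    memcpy(&ti, got + got_offset / sizeof(uint64_t), sizeof ti);
    return model_tls_get_addr(t, &ti) - t->tp; /* may wrap; the caller adds TP back */
}
```

The caller has to compute `tp + model_tls_get_offset(...)` to reach the same address that `model_tls_get_addr` returns directly, which is the extra addition discussed in this section.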
While this general dynamic approach works, it's considered the least efficient implementation of general dynamic TLS among the architectures I have analyzed. Here is why:
- Inefficient tls_index argument (similar to AArch32): This requires an extra lookup in .data.rel.ro.
- Redundant argument: __tls_get_offset takes the GOT offset instead of the direct GOT entry address.
- Indirect return value: Instead of returning the final TLS symbol address directly, __tls_get_offset only provides an offset, requiring an extra instruction for addition with the TP.
The motivation behind this design might be related to reducing the number of instructions rewritten during TLS optimizations. However, it clearly comes at the cost of performance.
The general-dynamic code sequence can be optimized to initial-exec or local-exec.

    // general-dynamic to initial-exec

In both cases, the linker only needs to patch one instruction, instead of four for PPC64.
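A sketch of the one-instruction patch follows. It assumes the replacements I understand the TLS ABI to describe for s390x: the `brasl %r14, __tls_get_offset@plt` call becomes `lg %r2, 0(%r2,%r12)` for initial-exec, or the 6-byte nop `brcl 0,.` for local-exec, while the literal pool entry is re-relocated separately. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum tls_target { TLS_TO_IE, TLS_TO_LE };

/* brasl %r14, ... is encoded as C0 E5 followed by a 32-bit immediate. */
static int is_brasl_r14(const uint8_t insn[6]) {
    return insn[0] == 0xC0 && insn[1] == 0xE5;
}

/* Replace the __tls_get_offset call in place. */
static void relax_gd_call(uint8_t insn[6], enum tls_target target) {
    /* lg %r2, 0(%r2,%r12): load the TP offset from the GOT entry */
    static const uint8_t lg_r2[6] = {0xE3, 0x22, 0xC0, 0x00, 0x00, 0x04};
    /* brcl 0,. : a 6-byte nop; r2 already holds the TP offset */
    static const uint8_t nop6[6] = {0xC0, 0x04, 0x00, 0x00, 0x00, 0x00};
    assert(is_brasl_r14(insn));
    memcpy(insn, target == TLS_TO_IE ? lg_r2 : nop6, 6);
}
```

Both replacements are exactly 6 bytes, so no surrounding code has to move.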
Local dynamic TLS model
The process involves several steps, illustrated by the provided assembly code:

    lgrl %r2,.LC0  # r2 = *(.LC0) = GOT offset of a tls_index object holding {module_ID, 0}

- Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_.
- Obtaining the GOT offset: The offset (a@TLSLDM) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure): the module ID and a zero. The module ID entry is relocated by a dynamic relocation R_390_TLS_DTPMOD.
- Finding the dynamic TLS block address: __tls_get_offset(r2) returns dtv[m] - TP. It is not dtv[m] + XXX - TP because the second GOT entry is zero.
- Adding DTPOFF to get the symbol address in the current thread.
The first three steps can be shared among TLS symbols.
The local-dynamic code sequence can be optimized to local-exec.

    lgrl %r2,.LC0  # r2 = 0

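The arithmetic behind sharing the lookup among several symbols can be written out as a toy model. The flat `dtv` array and both function names are assumptions for illustration, not glibc internals.

```c
#include <assert.h>
#include <stdint.h>

/* What __tls_get_offset yields in the local-dynamic case: the module's
   TLS block address relative to TP (the second GOT entry is zero). */
static uint64_t ld_block_offset(uint64_t tp, const uint64_t *dtv,
                                uint64_t dtv_len, uint64_t module) {
    assert(module < dtv_len);
    return dtv[module] + 0 - tp;
}

/* Per-symbol step: add TP and the symbol's DTPOFF to the shared result. */
static uint64_t ld_symbol_address(uint64_t tp, uint64_t block_offset,
                                  uint64_t dtpoff) {
    return tp + block_offset + dtpoff;
}
```

One call to `ld_block_offset` can serve any number of `ld_symbol_address` computations for symbols in the same module.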
Initial Exec TLS model

    lgrl %r1, a@INDNTPOFF  # R_390_TLS_IEENT(a); linker resolves this to a GOT holding the TP offset

Unfortunately, initial-exec cannot be optimized to local-exec. PPC32 has a similar initial-exec TLS code sequence and it allows TLS optimization by defining a marker relocation.
Local Exec TLS model
The code sequence loads the TP offset indirectly in a manner similar to AArch32.

    lgrl %r1, .LC0  # r1 = a@NTPOFF

The indirection is unfortunate. The lgfi (Load Immediate) instruction loads a sign-extended 32-bit immediate, which could be used instead.
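To show what that alternative would look like at the byte level, here is a sketch that encodes `lgfi %r1, <TP offset>` (RIL format, opcode 0xC0 with extension 1). This is a hypothetical code sequence, not something the current ABI or any toolchain emits for TLS, and `encode_lgfi` is a made-up helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode lgfi %r1, value. Returns false when value does not fit in a
   sign-extended 32-bit immediate. */
static bool encode_lgfi(uint8_t out[6], unsigned r1, int64_t value) {
    assert(r1 < 16);
    if (value < INT32_MIN || value > INT32_MAX)
        return false;
    uint32_t imm = (uint32_t)(int32_t)value;
    out[0] = 0xC0;
    out[1] = (uint8_t)((r1 << 4) | 0x1);
    out[2] = (uint8_t)(imm >> 24);
    out[3] = (uint8_t)(imm >> 16);
    out[4] = (uint8_t)(imm >> 8);
    out[5] = (uint8_t)imm;
    return true;
}
```

With TLS Variant II the TP offsets of an executable's variables are small negative numbers, so they would fit comfortably in this immediate.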