The ELF specification says for the STV_DEFAULT visibility, "Global and weak symbols are also preemptable, that is, they may by preempted by definitions of the same name in another component." In many implementations, a defined symbol of any binding in the executable cannot be preempted, but a default visibility STB_GLOBAL or STB_WEAK symbol defined in a shared object can be preempted.
It may be a bit surprising that a defined default visibility STB_GLOBAL symbol can be preempted. Consider a translation unit that defines a function f and also contains a caller of f, so the callee (f) is defined in the same translation unit as the call.
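A minimal sketch of such a translation unit follows. The caller name g and the return values are assumptions chosen for illustration only.

```c
/* f has external linkage and default visibility, so when this file is
   compiled with -fpic and linked with -shared, another component may
   provide the definition of f that is used at runtime. */
int f(void) { return 42; }

/* The callee is defined in this very translation unit. With GCC's
   default -fpic behavior the call is still kept as a call to the
   global symbol f instead of being inlined. */
int g(void) { return f() + 1; }
```

Comparing the output of `gcc -O2 -fpic -S` with and without -fno-semantic-interposition on a file like this is a quick way to see the difference discussed in this article.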
GCC's interpretation is: since a -fpic compiled object file can be linked as a shared object, the symbol f is interposable/preemptible at runtime. By default GCC considers the definition inexact and suppresses interprocedural optimizations including inlining.
The emitted assembly looks like:
.globl f
In -shared mode, the linker notices that f is preemptable and will resolve the branch target to a PLT entry with a dynamic relocation R_*_JUMP_SLOT. Assuming interposition doesn't happen, the ideal behavior is that the branch jumps to the target directly.
The combined compiler and linker behavior causes a performance cost of 5% or more. This is a feature used by 0.01% of libraries that penalizes the other 99.99%. Read on.
On PE-COFF, f cannot be interposed. On Mach-O, f can be interposed only if the dylib is linked with -interposable.
GCC -fno-semantic-interposition
GCC 5 introduced -fno-semantic-interposition to optimize -fpic. First, GCC can apply interprocedural optimizations including inlining, as it does for -fno-pic and -fpie. Second, in the emitted assembly, a function call will go through a local alias to avoid a PLT entry if the object is linked with -shared.
.globl f
If the branch instruction uses a regular STB_GLOBAL symbol: the linker notices that the default visibility f is preemptable in -shared mode and will resolve the branch target to a PLT entry with a dynamic relocation R_*_JUMP_SLOT.
.globl f
If f is a non-definition declaration, -fno-semantic-interposition makes no difference in behavior.
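The local alias idea can be imitated by hand at the source level. The following is a rough analogy, not what GCC emits: it assumes an ELF target and a GCC-compatible compiler, uses a hidden-visibility alias rather than a truly local symbol, and the names f_hidden and g are hypothetical.

```c
/* The exported, default visibility definition. In a shared object,
   references to the name f may be preempted. */
int f(void) { return 1; }

/* A second name for the same code. Hidden visibility means the symbol
   is not exported from the shared object, so a call through it cannot
   be preempted and needs no PLT entry. */
extern int f_hidden(void) __attribute__((alias("f"), visibility("hidden")));

/* Calls inside the component go through the non-preemptible name. */
int g(void) { return f_hidden() + 1; }
```

With -fno-semantic-interposition the compiler creates the non-preemptible name itself, so the source does not need this kind of boilerplate.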
Clang -fno-semantic-interposition
Longstanding behavior
It turns out that the first merit of the GCC feature, "interprocedural optimizations including inlining are applicable", is actually Clang's longstanding behavior for definitions with external linkage in -fpic code.
When -fsemantic-interposition was contributed by Serge Guelton, I noted that we should keep the aggressive behavior, even if it differs from GCC, so as not to regress the longstanding optimizations. (ipconstprop, inliner, sccp, and sroa treat normal ExternalLinkage GlobalObjects as non-interposable.) (Before https://reviews.llvm.org/D72197, MC resolved a PC-relative VK_None fixup to a non-local symbol at assembly time (no outstanding relocation) if the target was defined in the same section. Put simply, even if IR optimizations failed to optimize and allowed interposition for the function call in void foo() {} void bar() { foo(); }, the assembler would disallow it.)
If a project really requires symbol interposition to work (extremely rare), it may be unhappy with Clang's default behavior. The project should specify -fsemantic-interposition to disable interprocedural optimizations.
dso_local inference in -fpic -fno-semantic-interposition mode
I contributed an optimization to Clang 11: in -fpic -fno-semantic-interposition mode, default visibility external linkage definitions get the dso_local specifier, like in -fno-pic and -fpie modes. For -fpic code, accesses to a dso_local symbol will go through its local alias .Lfoo$local. For -fno-pic and -fpie code, accesses to a dso_local symbol can use the original foo because the object file shall not be linked with -shared.
With dso_local, there are some noticeable behavior differences:
- variable access: access the local symbol directly instead of going through a GOT indirection
- function call: call .Lfoo$local
- taking the address of a function: similar to a variable access
For the previous C example, the emitted assembly will look like the following.
.globl f
The local alias is a .L symbol. This is deliberate:
- The assembler suppresses the symbol table entry and converts relocations to reference the section symbol instead.
- Tools cannot be confused by two symbols at the same location. In the GCC produced object file, currently llvm-objdump will name the function f.localalias.
This behavior change causes the branch target symbol to have a different type: STT_FUNC -> STT_NOTYPE. In some processor-specific supplementary ABIs, there may be implications for range extension thunks. ABI makers should be aware of this.
GCC doesn't optimize global variable access. Feature request: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100483. This rarely matters for performance, though.
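The three kinds of access affected by dso_local can be seen in one small translation unit. This is a sketch with hypothetical names (var, read_var, call_foo, address_of_foo); the comments describe what the text above says to expect for -fpic code, and the exact instructions depend on the target and compiler version.

```c
int var = 7;
int foo(void) { return 3; }

typedef int (*fn_t)(void);

/* Variable access: without dso_local this goes through a GOT
   indirection; with dso_local the local symbol is accessed directly. */
int read_var(void) { return var; }

/* Function call: without dso_local the call may be routed to a PLT
   entry; with dso_local the call targets the local alias
   (.Lfoo$local in Clang's scheme). */
int call_foo(void) { return foo(); }

/* Taking the address of a function: similar to a variable access. */
fn_t address_of_foo(void) { return foo; }
```

Compiling this with `clang -O2 -fpic -S`, with and without -fno-semantic-interposition, should make the differences visible on targets where the option is effective.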
In Clang cc1, there are three states:
- -fsemantic-interposition: this represents -fpic -fsemantic-interposition. Don't set dso_local on default visibility external linkage definitions. Emit a module flag metadata SemanticInterposition to disallow interprocedural optimizations.
- -fhalf-no-semantic-interposition: this represents -fpic without a semantic interposition option. Don't set dso_local on default visibility external linkage definitions. However, interprocedural optimizations on such definitions are allowed.
- (default): this represents any of -fno-pic, -fpie, and -fpic -fno-semantic-interposition. Set dso_local on default visibility external linkage definitions. Interprocedural optimizations on such definitions are allowed.
Targets
As of Clang 12, -fno-semantic-interposition is only effective on x86.
Hopefully, this optimization will be available on AArch64 (https://reviews.llvm.org/D101873) and RISC-V (https://reviews.llvm.org/D101876).
ThinLTO
I believe -fpic -fno-semantic-interposition works with ThinLTO :) ThinLTO required two changes (D74749, D74751): if a GlobalVariable is converted to a declaration, we should drop the dso_local specifier.
Space overhead
In GCC's foo.localalias scheme, there is an extra symbol table entry (sizeof(Elf64_Sym) = 24) and a string in the string table.
In Clang's .Lfoo$local scheme, this generally costs a STT_SECTION symbol table entry (the entry can usually be suppressed).
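The per-alias cost of the foo.localalias scheme can be estimated with a few lines of C. This sketch assumes a system that provides <elf.h> (for example Linux with glibc or musl); the helper name local_alias_overhead is hypothetical, and the result is an upper bound because it ignores any sharing of string suffixes in the string table.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Bytes added to a 64-bit ELF object by one extra named symbol:
   one Elf64_Sym entry plus the NUL-terminated name in the string table. */
size_t local_alias_overhead(const char *alias_name) {
  return sizeof(Elf64_Sym) + strlen(alias_name) + 1;
}
```

For the name foo.localalias (14 characters) this gives 24 + 14 + 1 = 39 bytes per aliased function.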
Can Clang default to -fno-semantic-interposition?
Clang currently has three states. There are some optimization opportunities between the half state and the full -fno-semantic-interposition. It is natural to ask whether we can drop the half state and make -fpic default to the full state.
This is something I'd like Clang to do, but I'll note that there is still some risk. Some points favor changing the default.
Interprocedural optimizations (including inlining)
For ELF -fpic, Clang never suppresses interprocedural optimizations (including inlining) on default visibility external linkage definitions. So projects relying on blocked interprocedural optimizations have been broken for years. They probably only started working recently, by specifying -fsemantic-interposition.
Assembler behavior for VK_None
.globl f
Before 2020-01 (https://reviews.llvm.org/D72197), the integrated assembler resolved the fixup when the target symbol and the location are in the same section. There was no relocation, so the linker could not produce a PLT.
Non-x86 targets typically use VK_None for branch instructions. x86 uses VK_PLT for -fpie and -fpic.
If a project worked when built with -fno-function-sections on aarch64/ppc/etc. before 2020-01, we have some confidence that the project does not rely on function semantic interposition.
Difference from -fvisibility=protected
A non-default visibility symbol cannot be preempted, even if the binding is STB_WEAK. -fvisibility=protected can make a weak definition protected. If you want a weak definition to be preemptible, you may need __attribute__((weak,visibility("default"))), which is verbose and error-prone.
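A sketch of what that annotation looks like in practice, assuming a GCC-compatible compiler and an ELF target. The function names hook and call_hook are hypothetical.

```c
/* If the file is compiled with -fvisibility=protected, a plain weak
   definition would become protected and could no longer be preempted.
   Spelling out default visibility keeps this one preemptible. */
__attribute__((weak, visibility("default"))) int hook(void) { return 0; }

/* Reaches whichever definition of hook the linker and the dynamic
   loader bind it to: a strong definition elsewhere, or this fallback. */
int call_hook(void) { return hook(); }
```

Every weak definition that is meant to stay overridable needs the same annotation, which is where the verbosity and the risk of forgetting one come from.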
ld -shared -Bsymbolic is very similar to -pie. -Bsymbolic can subsume some optimizations of -fno-semantic-interposition.
- variable access: on x86-64, with R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX, the GOT indirection can be suppressed. However, the code sequence is still longer than without GOT. On PowerPC64, there is a similar TOC optimization. On other architectures, no difference.
- function call: call foo@PLT will not create a PLT entry.
-Bsymbolic-functions only applies to STT_FUNC symbols and is generally safer than -Bsymbolic. The main problem with -Bsymbolic is that it doesn't work with copy relocations. (-Bsymbolic can lead to multiple type info objects, but that actually works because libsupc++/libc++abi do (cough) string comparison.)
LD_PRELOAD
There are several types of LD_PRELOAD usage.
First, use LD_PRELOAD=same_soname.so to replace a DT_NEEDED entry with the same SONAME. Both -fno-semantic-interposition and -Bsymbolic are compatible with such usage.
Second, use LD_PRELOAD=malloc.so to intercept some functions not defined in the application or any of its shared object dependencies. Both -fno-semantic-interposition and -Bsymbolic are compatible.
void *f() { return malloc(0xb612); }
Third, use LD_PRELOAD=different_soname.so to replace a function defined in a shared object dependency whose SONAME is different. Such usage is incompatible with -Bsymbolic. If the function is referenced in its defining translation unit, the call sites are statically bound with -fno-semantic-interposition; otherwise the usage is still compatible.
Applications
Python. CPython 3.10 sets -fno-semantic-interposition. The article "Red Hat Enterprise Linux 8.2 brings faster Python 3.8 run speeds" says there is a huge improvement (up to 30%). This is really an upper bound on what you can see in real-world applications. I think this actually suggests some code problems in CPython.
A small single-digit performance boost (say, 4%) is what I'd normally expect. In https://bugs.archlinux.org/task/70697 and https://bugzilla.redhat.com/show_bug.cgi?id=1956484, I suggest that distributions consider -fno-semantic-interposition and -Bsymbolic-functions when building Clang.