TL;DR: Trail of Bits has developed abi3audit, a new Python tool for checking Python packages for CPython application binary interface (ABI) violations. We’ve used it to discover hundreds of inconsistently and incorrectly tagged package distributions, each of which is a potential source of crashes and exploitable memory corruption due to undetected ABI differences. It’s publicly available under a permissive open source license, so you can use it today!

Python is one of the most popular programming languages, with a correspondingly large package ecosystem: over 600,000 programmers use PyPI to distribute over 400,000 unique packages, powering much of the world’s software.

The age of Python’s packaging ecosystem also sets it apart: among general-purpose languages, it is predated only by Perl’s CPAN. This, combined with the mostly independent development of packaging tooling and standards, has made Python’s ecosystem among the more complex of the major programming language ecosystems. Those complexities include:

  • Two major current packaging formats (source distributions and wheels), as well as a smattering of domain-specific and legacy formats (zipapps, Python Eggs, conda’s own format, &c.);
  • A constellation of different packaging tools and package specification files: setuptools, flit, poetry, and PDM, as well as pip, pipx, and pipenv for actually installing packages;
  • …and a corresponding constellation of package and dependency specification files: pyproject.toml (PEP 518-style), pyproject.toml (Poetry-style), setup.py, setup.cfg, Pipfile, requirements.txt, MANIFEST.in, and so forth.

This post will cover just one tiny piece of Python packaging’s complexity: the CPython stable ABI. We’ll see what the stable ABI is, why it exists, how it’s integrated into Python packaging, and how each piece goes terribly wrong to make accidental ABI violations easy.

The CPython stable API and ABI

Not unlike many other reference implementations, Python’s reference implementation (CPython) is written in C and provides two mechanisms for native interaction:

  • A C Application Programming Interface (API), allowing C and C++ programmers to compile against CPython’s public headers and use any exposed functionality;
  • An Application Binary Interface (ABI), allowing any language with C ABI support (like Rust or Golang) to link against CPython’s runtime and use the same internals

Developers can use the CPython API and ABI to write CPython extensions. These extensions behave exactly like ordinary Python modules but interact directly with the interpreter’s implementation details rather than the “high-level” objects and APIs exposed in Python itself.

CPython extensions are a cornerstone of the Python ecosystem: they provide an “escape hatch” for performance-critical tasks in Python, as well as enable code reuse from native languages (like the broader C, C++, and Rust packaging ecosystems).

At the same time, extensions pose a problem: CPython’s APIs change between releases (as the implementation details of CPython change), meaning that it is unsound, by default, to load a CPython extension into an interpreter of a different version. The implications of this unsoundness vary: a user might get lucky and have no problems at all, might experience crashes due to missing functions or, worst of all, experience memory corruption due to changes in function signatures and structure layouts.

To ameliorate the situation, CPython’s developers created the stable API and ABI: a set of macros, types, functions, and data objects that are guaranteed to remain available and forward-compatible between minor releases. In other words: a CPython extension built for CPython 3.7’s stable API will also load and function correctly on CPython 3.8 and forwards, but is not guaranteed to load and function with CPython 3.6 or earlier.

At the ABI level, this compatibility is referred to as “abi3”, and is optionally tagged in the extension’s filename: mymod.abi3.so, for example, designates a loadable stable-ABI-compatible CPython extension module named mymod. Critically, the Python interpreter does not do anything with this tag — it’s simply ignored.

This is the first strike: CPython has no notion of whether an extension is actually stable-ABI-compatible. We’ll now see how this compounds with the state of Python packaging to produce even more problems.

CPython extensions and packaging

On its own, a CPython extension is just a bare Python module. To be useful to others, it needs to be packaged and distributed like all other modules.

With source distributions, packaging a CPython extension is straightforward (for some definitions of straightforward): the source distribution’s build system (generally setup.py) describes the compilation steps needed to produce the native extension, and the package installer runs these steps during installation.

For example, here’s how we define microx’s native extension (microx_core) using setuptools:

Distributing a CPython extension via source distribution has advantages (✅) and disadvantages (❌):

    ✅API and ABI stability are non-issues: the package either builds during installation or it doesn’t and, when it does build, it runs against the same interpreter that it built against.

    ✅Source builds are burdensome for users: they require end-users of Python software to install the CPython development headers, as well as maintain a native toolchain corresponding to the language or ecosystem that the extension targets. That means requiring a C/C++ (and increasingly, Rust) toolchain on every deployment machine, adding size and complexity.

    ❌Source builds are fundamentally fragile: compilers and native dependencies are in constant flux, leaving end users (who are Python experts at best, not compiled language experts) to debug compiler and linker errors.

The Python packaging ecosystem’s solution to these problems is wheels. Wheels are a binary distribution format, which means that they can (but are not required to) provide pre-compiled binary extensions and other shared objects that can be installed as-is, without custom build steps. This is where ABI compatibility is absolutely essential: binary wheels are loaded blindly by the CPython interpreter, so any mismatch between the actual and expected interpreter ABIs can cause crashes (or worse, exploitable memory corruption).

Because wheels can contain pre-compiled extensions, they need to be tagged for the version(s) of Python that they support. This tagging is done with PEP 425-style “compatibility” tags: microx-1.4.1-cp37-cp37m-macosx_10_15_x86_64.whl designates a wheel that was built for CPython 3.7 on macOS 10.15 for x86-64, meaning that other Python versions, host OSes, and architectures should not attempt to install it.

On its own, this limitation makes wheel packaging for CPython extensions a bit of a hassle:

    ❌In order to support all valid combinations of {Python Version, Host OS, Host Architecture}, the packager must build a valid wheel for each. This means additional test, build, and distribution complexity, as well as exponential CI growth as a package’s support matrix expands.

    ❌Because wheels are (by default) tied to a single Python version, packagers are required to generate a new set of wheels on each Python minor version change. In other words: new Python versions start out without access to a significant chunk of the packaging ecosystem until packagers can play catch up.

This is where the stable ABI becomes critical: instead of building one wheel per Python, version packagers can build an “abi3” wheel for the lowest supported Python version. This comes with the guarantee that the wheel will work on all future (minor) releases, solving both the build matrix size problem and the ecosystem bootstrapping problem above.

Building an “abi3” wheel is a two-step process: the wheel is built locally (usually using the same build system as the source distribution) and then retagged with abi3 as the ABI tag rather than a single Python version (like cp37 for CPython 3.7).

Critically: neither of these steps is validated, because Python’s build tools have no good way to validate them. This leaves us with the second and third strikes:

  • To correctly build a wheel against the stable API and ABI, the build needs to set the Py_LIMITED_API macro to the intended CPython support version (or, for Rust with PyO3, to use the correct build feature). This prevents Python’s C headers from using non-stable functionality or potentially inlining incompatible implementation details.

For example, to build a wheel as cp37-abi3 (stable ABI for CPython 3.7+), the extension needs to either #define Py_LIMITED_API 0x03070000 in its own source code, or use the setuptools.Extension construct’s define_macros argument to configure it. These are easy to forget, and produce no warning when forgotten!

Additionally, when using setuptools, the packager may choose to set py_limited_api=True. But this does not enable any actual API restrictions; it merely adds the .abi3 tag to the built extension’s filename. As you’ll recall this is not currently checked by the CPython interpreter, so this is effectively a no-op.

Critically, it does not affect the actual wheel build. The wheel is built however the underlying setuptools.Extension sees fit: it might be completely right, it might be a little wrong (stable ABI, but for the wrong CPython version), or it might be completely wrong.

This breakdown happens because of the devolved nature of Python packaging: the code that builds extensions is in pypa/setuptools, while the code that builds wheels is in pypa/wheel — two completely separate codebases. Extension building is designed as a black box, a fact that Rust and other language ecosystems take advantage of (there is no Py_LIMITED_API macro to sensibly define in a PyO3-based extension — it’s all handled separately by build features).

To summarize:

  • Stable ABI (“abi3”) wheels are the only reliable way to package native extensions without a massive build matrix.
  • However, none of the dials that control abi3-compatible wheel building talk to each other: it’s possibly to build an abi3-compatible wheel without tagging it as such, or to build a non-abi3 wheel and tag it incorrectly as compatible, or to tag an abi3-compatible wheel as compatible with the wrong CPython version.
  • Consequently, the correctness of the current abi3-compatible wheel ecosystem is suspect. ABI violations are capable of causing crashes and even exploitable memory corruption, so we need to quantify the current state of affairs.

How bad is it, really?

This all seems pretty bad, but it’s just an abstract problem: it’s entirely possible that every Python packager gets their wheel builds right, and hasn’t published any incorrectly tagged (or completely invalid) abi3-style wheels.

To get a sense for how bad things really are, we developed abi3audit. Abi3audit’s entire raison d’être is finding these kinds of ABI violation bugs: it scans individual extensions, Python wheels (which can contain multiple extensions), and entire package histories, reporting back anything that doesn’t match the specified stable ABI version or is entirely incompatible with the stable ABI.

To get a list of auditable packages to feed into abi3audit, I used PyPI’s public BigQuery dataset to generate a list of every abi3-wheel-containing package downloaded from PyPI in the last 21 days:

#standardSQL
SELECT DISTINCT file.project
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.filename LIKE '%abi3%'
  -- Only query the last 21 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 21 DAY)
    AND CURRENT_DATE()

( I chose 21 because I blew through my BigQuery quota while testing. It’d be interesting to see the full list of downloads over a year or the entire history of PyPI, although I’d expect diminishing returns.)

From that query, I got 357 packages, which I’ve uploaded as a GitHub Gist. With those packages saved, a JSON report from abi3audit was only a single invocation away:

The JSON from that audit is also available as a GitHub Gist.

First, some high-level statistics:

  • Of the 357 initial packages queried from PyPI, 339 actually contained auditable wheels. Some were 404s (presumably created and then deleted), while others were tagged with abi3 but did not actually contain any CPython extension modules (which does, technically, make them abi3 compatible!). A handful of these were ctypes-style modules, with either a vendored library or code to load a library that the host was expected to contain.
  • Those 339 remaining packages had a total of 13650 abi3-tagged wheels between them. The largest (in terms of wheels) was eclipse-zenoh-nightly, with 1596 wheels (or nearly 12 percent of all abi3-tagged wheels on PyPI).
  • The 13650 abi3-tagged wheels had a total of 39544 shared objects, each a potential Python extension, between them. In other words: the average abi3-tagged wheel has 2.9 shared objects in it, each of which was audited by abi3audit.
    • Attempting to parse each shared object in each abi3-tagged wheel produced all kinds of curious results: plenty of wheels contained invalid shared objects: ELF files that began with garbage (but contained a valid ELF later in the file), temporary build artifacts that weren’t cleaned up, and a handful of wheels that appeared to contain editor-style swap files for hand-modified binaries. Unfortunately, unlike Moyix, we did not discover any catgirls.

Now, the juicy parts:

Of the 357 valid packages, 54 (15 percent) contained wheels with ABI version violations. In other words: roughly one in six packages had wheels that claimed support for a particular Python version, but actually used the ABI of a newer Python version.

More severely: of those same 357 valid packages, 11 (3.1 percent) contained outright ABI violations. In other words: roughly one in thirty packages had wheels that claimed to be stable ABI compatible, but weren’t at all!

In total, 1139 (roughly 3 percent) Python extensions had version violations, and 90 (roughly 0.02 percent) had outright ABI violations. This suggests two things: that the same packages tend to have ABI violations across multiple wheels and extensions, and that multiple extensions within the same wheel tend to have ABI violations at the same time (which makes sense, since they should share the same build).

Here are some that we found particularly interesting:

PyQt6 and sip

PyQt6 and sip are both part of the Qt project, and both had ABI version violations: multiple wheels were tagged for CPython 3.6 (cp36-abi3), but used APIs that were only stabilized with CPython 3.7.

sip additionally had a handful of wheels with outright ABI violations, all from the internal _Py_DECREF API:

refl1d

refl1d is a NIST-developed reflectometry package. They did a couple of releases tagged for the stable ABI of Python 3.2 (the absolute lowest), while actually targeting the stable ABI of Python 3.11 (the absolute highest — not even released yet!).

hdbcli

hdbcli appears to be a proprietary client for SAP HANA, published by SAP themselves. It’s tagged as abi3, which is cool! Unfortunately, it isn’t actually abi3-compatible:

This, again, suggests building without the correct macros. We’d be able to figure out more with the source code, but this package appears to be completely proprietary.

gdp and pifacecam

These are two smaller packages, but they piqued my interest because both had stable ABI violations that weren’t just the reference/counting helper APIs:

dockerfile

Finally, I liked this one because it turns out to be a Python extension written in Go, not C, C++, or Rust!

The maintainer had the right idea, but didn’t define Py_LIMITED_API to any particular value. So Python’s headers “helpfully” interpreted that as not limited at all:

The path forward

First, the silver lining: most of the extremely popular packages in the list had no ABI violations or version mismatches. Cryptography and bcrypt were spared, for example, indicating strong build controls on their side. Other relatively popular packages had version violations, but they were generally minor (for example: expecting a function that was only stabilized with 3.7, but has been present and the same since 3.3).

Overall, however, these results are not great: they indicate (1) that a significant portion of the “abi3” wheels on PyPI aren’t really abi3-compatible at all (or are compatible with a different version than they claim), and (2) that maintainers don’t fully understand the different knobs that control abi3 tagging (and that those knobs do not actually modify the build itself).

More generally, the results point to a need for better controls, better documentation, and better interoperation between Python’s different packaging components. In nearly all cases, the package’s maintainer has attempted to do the right thing, but seemingly wasn’t aware of the additional steps necessary to actually build an abi3-compatible wheel. In addition to improving the package-side tooling here, the auditing is also automatable: we’ve designed abi3audit in part to demonstrate that it would be possible for PyPI to catch these kinds of wheel errors before they become a part of the public index.