[The first guest post in some years. Welcome, Arne Hormann!]
Hi there, I'm Arne. In July, I stumbled on a lobste.rs link (thanks, Carlana) that led me to Tim’s article on Q Numbers. As I'm regularly working with both floating point numbers and Go, the post made me curious — I was sure I could improve the compression scheme. I posted comments. Tim liked my idea and challenged me to create a PR. I did the encoding but felt overwhelmed with the integration. Tim took over and merged it. And since v1.4.0 released in August 28th, it's in Quamina.
In Tim’s post about that change, he wrote “Arne explained to me how it works in some chat that I can’t find, and to be honest I can’t quite look at this and remember the explanation”. Well, you're reading it right now.
Float64 ·
Let's first talk about the data we are dealing with here. Quamina operates on JSON. While no JSON specification limits
numbers to
IEEE 754 floating point (Go calls
them float64
),
RFC 8259 recommends treating them as such.
They look like this:
1 bit sign; 0 is positive, 1 is negative.
11 bits exponent; It uses a bias of 1023
((1<<10)-1
), and all exponent bits set means Infinity or Not a Number
(NaN
)
1 mantissa high bit (implicit, never stored: 0
if all exponent bits are 0
, otherwise 1
). This is required so
there's exactly one way to store each number.
52 explicit mantissa bits. The 53 mantissa bits are the binary digits of the number itself.
Both 0
and -0
exist and are equal but are represented differently. We have to normalize -0
to 0
for comparability,
but according to Tim’s tests, -0
cannot occur in Quamina because JSON decoding silently converts -0
to 0
.
With an exponent with all bits set, Infinity has a mantissa of 0
. Any other mantissa is NaN. But both sign values are
used. There are a lot of different NaNs; 1<<53 - 2
different ones!
In JSON, Infinity and NaN are not representable as numbers, so we don't have to concern ourselves with them.
Finally, keep in mind that these are binary numbers, not decimals. Decimal 0.1
cannot be accurately encoded. But 0.5
can. And all integers up to 1<<53
, too.
Adding numbers to Quamina · Quamina operates on UTF-8 strings and compares them byte by byte. To add numbers to it, they have to be bytewise-comparable and all bytes have to be valid in UTF-8.
Given these constraints, let's consider the problem of comparability, first.
Sortable bits ·
We can use math.Float64bits()
to convert a float64
into its individual bits (stored as uint64
).
Positive numbers are already perfectly sorted. But they are smaller than negative numbers. To fix that, we always flip the
sign bit so it's 1
for positive and 0
for negative values.
Negative values are sorted exactly the wrong way around. To totally reverse their order, we have to flip all exponent and mantissa bits in addition to the sign bits.
A simple implementation would look like:
func numbitsFromFloat64(f float64) numbits {
u := math.Float64bits(f)
if f < 0 {
return numbits(^u)
}
return numbits(u) | (1 << 63)
}
Now let's look at the actual code:
func numbitsFromFloat64(f float64) numbits {
u := math.Float64bits(f)
mask := (u>>63)*^uint64(0) | (1 << 63)
return numbits(u ^ mask)
}
The mask
line can be a bit of a headscratcher...
u>>63
moves the sign bit of the number to the lowest bit
the result will be 0
for positive values and 1
for negative values
that's multiplied with ^0
- all 64 bits set to true
we now have 0
for positive numbers and all bits set for negative numbers
regardless, always set the sign bit with | (1 << 63)
By xoring mask
to the original bits, we get our transformation for lexically ordered numbers.
The code is a bit hard to parse because it avoids branches for performance reasons.
Sortable bytes · If we were to use the uint64 now, it would work out great. But it has to be split into its individual bytes. And on little endian systems - most of the frequently used ones - the byte order will be wrong and has to be reversed. That's done by storing the bytes in big endian encoding.
UTF-8 bytes ·
Our next problem is the required UTF-8 encoding. Axel suggested to only use the lower 7 bits. That means we need up to 10 byte instead of 8 (7*10 = 70
fits 8*8 = 64
).
Compress ·
The final insight was that trailing 0
bytes do not influence the comparison at all and can be dropped. Which compresses
positive power-of-two numbers and integers even further. Not negative values, though - the required bit inversion messes it
up.
And that's all on this topic concerning Quamina — even if there's so, so much more about floats.