A varint implementation for PG?

From Andres Freund
Subject A varint implementation for PG?
Date
Msg-id 20191210015054.5otdfuftxrqb5gum@alap3.anarazel.de
Responses Re: A varint implementation for PG?  (Craig Ringer <craig@2ndquadrant.com>)
Re: A varint implementation for PG?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

Several times, most recently for the record format in the undo
patchset, I have wished for a fast variable-width integer implementation
for postgres. Using very narrow fixed-width integers, for space
efficiency, solves the space usage problem, but leads to
extensibility / generality problems.

Other cases where I have wanted a variable-width integer encoding
include, non-exhaustively: variable-width TIDs, on-disk representation
for a smaller SQL-level integer, WAL format, wire protocol, ...

I spent a few days experimenting with different variable length
encodings for unsigned and signed integers. My goal is/was to find an
encoding and a prototype that has reasonably low space overhead and that
is very quick to encode/decode.


Probably the most common encoding is often called LEB128. It basically
uses 7 bits in each byte for data, and one bit as a "continuation"
bit. While not bad space-wise, it's not great for decoding (nor
encoding) speed. The problems are a) that there's a branch for each
byte, which results in poorly predictable code, and b) that the 7 bit
groups need to be reassembled into an 8 bit base (not as expensive as
the branching issue, but quite noticeable).

I think a better encoding is something similar to UTF-8. That is,
encoding the byte length in unary in the leading byte, with the data
following that prefix (plus the separator ending the prefix). That
makes it possible to avoid per-byte branching during decoding.

In contrast to UTF-8, in my prototype I decided to encode the prefix in
unary as the number of leading 0 bits - the main advantage is that it
makes it possible to use instructions like "count leading zeroes" to
determine the length (otherwise the data would need to be inverted
first).
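As a sketch of that length-determination trick, assuming the convention
that a first byte of the form 1xxxxxxx means one byte total, 01xxxxxx
two bytes, and so on (the header file in the repository is authoritative
on the actual format, including how the longest lengths are handled --
the all-zero first byte here is my assumption):

```c
#include <stdint.h>

/* Given the first byte of an encoded varint, return the total encoded
 * length in bytes, using a count-leading-zeroes instruction instead of
 * a per-byte loop.  __builtin_clz operates on a 32-bit int and is
 * undefined for 0, so shift the byte into the top position and handle
 * the all-zero byte separately. */
static int
varint_total_len(uint8_t first_byte)
{
	if (first_byte == 0)
		return 9;				/* assumed: zero byte marks 8 data bytes */
	return 1 + __builtin_clz((unsigned int) first_byte << 24);
}
```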


To encode negative numbers, I chose to use what's often called "zigzag"
encoding. One cannot just naively use the above encoding for signed
integers, as otherwise two's complement negative numbers would always be
of the maximum length (due to the leading 1 bit(s)). Instead the
to-be-encoded number is multiplied by two (i.e. <<1), and the sign bit
is stored in the lowest bit; negative numbers are additionally stored
with their bits inverted.

The only disadvantage of that encoding that I am aware of is that
encoded signed varints cannot just be memcmp'd (which they could be
with e.g. a single sign bit).
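The zigzag transform described above can be sketched like this (it is
the same mapping Protocol Buffers use: 0 -> 0, -1 -> 1, 1 -> 2,
-2 -> 3, ...; names are mine, not from the prototype):

```c
#include <stdint.h>

/* Map a signed value to an unsigned one so that small magnitudes stay
 * small: shift left by one and store the sign in the lowest bit, with
 * negative values bit-inverted.  The arithmetic right shift replicates
 * the sign bit across all 64 bits, giving an all-ones XOR mask for
 * negative inputs and zero for non-negative ones. */
static uint64_t
zigzag_encode(int64_t v)
{
	return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

static int64_t
zigzag_decode(uint64_t z)
{
	return (int64_t) (z >> 1) ^ -(int64_t) (z & 1);
}
```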

Alternatively one could store the sign directly after the "separator
bit" (the one ending the unary length indicator). But then the number's
data no longer starts immediately after the indicator, and it's also a
bit harder to write the code for that encoding. I wrote the code for
that variant first, and I think it's very likely worth going for the
simpler approach.


In my benchmarks encoding + decoding a varint in this way costs
approximately 23 cycles on my Intel Skylake CPU. I think pipelining is
hiding a bit of the latency, but not too much. However, as the
benchmark currently is simplistic and just encodes all 32bit integers
in sequence, it's overly friendly to the branch predictor.

Even with those caveats, I think that's a pretty good result. Other
encodings were more expensive. And I think there's definitely some room
for optimization left.



I've pushed a repo containing my initial implementation of the above to
https://github.com/anarazel/varint_experiment and attached a tarball
containing it. The header file contains a more detailed description of
the format.

So far I've only implemented uint64_t and int64_t encoding/decoding
functions. It'd definitely make sense to implement a 4 byte variant, and
potentially also one supporting larger lengths. I think it'd make sense
to have a variant supporting a unary length indicator of up to 64 bits
(i.e. 64 bytes of data, potentially minus one bit).

If data lengths longer than that are required for a use case, it
probably is better to either a) use the max-representable 8 byte integer
as an indicator that the length is stored or b) sacrifice another bit to
represent whether the integer is the data itself or the length. Not sure
if this is worthwhile.


Do others see use in this? If so, does the encoding outlined above sound
reasonable?  I'm using a variant of this for a proposal for a different
undo record format, but I really hope this would be interesting for
others too.



Greetings,

Andres Freund

Attachments
