Re: GIN improvements part 1: additional information

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: GIN improvements part 1: additional information
Дата
Msg-id 52B2B1D7.2050802@vmware.com
обсуждение исходный текст
Ответ на Re: GIN improvements part 1: additional information  (Oleg Bartunov <obartunov@gmail.com>)
Ответы Re: GIN improvements part 1: additional information
Список pgsql-hackers
On 12/19/2013 08:37 AM, Oleg Bartunov wrote:
> Guys,
>
> before digging deep into the art of comp/decomp world I'd like to know
> if you familiar with results of
> http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
> some newer research ?

Yeah, I saw that paper.

> Do we agree in what we really want ? Basically,
> there are three main features: size, compression, decompression speed
> - we should take two :)

According to that Zhang et al paper you linked, the Vbyte method 
actually performs the worst on all of those measures. The other 
algorithms are quite similar in terms of size (PForDelta being the most 
efficient), while PForDelta is significantly faster to compress/decompress.

Just by looking at those numbers, PForDelta looks like a clear winner. 
However, it operates on much bigger batches than the other algorithms; I 
haven't looked at it in detail, but Zhang et al used 128 integer 
batches, and they say that 32 integers is the minimum batch size. If we 
want to use it for the inline posting lists stored in entry tuples, that 
would be quite wasteful if there are only a few item pointers on the tuple.

Also, in the tests I've run, the compression/decompression speed is not 
a significant factor in total performance, with either varbyte encoding 
or Simple9-like encoding I hacked together.

Actually, now that I think about this a bit more, maybe we should go 
with Rice encoding after all? It's the most efficient in terms of size, 
and I believe it would be fast enough.

> Should we design sort of plugin, which could support independent
> storage on disk, so users can apply different techniques, depending on
> data.
>
> What I want to say is that we certainly can play with this very
> challenged task, but we have limited time  before 9.4 and we should
> think in positive direction.

Once we have the code in place to deal with one encoding, it's easy to 
switch the implementation. Making it user-configurable or pluggable 
would be overkill IMHO.

What I'm saying is that we should make sure we get the page format right 
(in particular, I strongly feel we should use the self-contained 
PostingListSegment struct instead of item-indees that I mentioned in the 
other post), with the implementation details hidden in the functions in 
ginpostinglist.c. Then we can easily experiment with different algorithms.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: array_length(anyarray)
Следующее
От: Andres Freund
Дата:
Сообщение: Re: clang's -Wmissing-variable-declarations shows some shoddy programming