Re: [WIP] Effective storage of duplicates in B-tree index.

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: [WIP] Effective storage of duplicates in B-tree index.
Дата
Msg-id a5dec069-c723-bc39-5bec-3e1baebde2c9@iki.fi
обсуждение исходный текст
Ответ на Re: [WIP] Effective storage of duplicates in B-tree index.  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Ответы Re: [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@heroku.com>)
Список pgsql-hackers
On 18/03/16 19:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in
> this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message
> couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write code as less
> invasive as possible, but we need to make sure that algorithm is still
> correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the 
varbyte encoding. It's tempting to commit this without it for now, and 
add the compression later, but I'd like to avoid having to deal with 
multiple binary-format upgrades, so let's figure out the final on-disk 
format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we 
might not want to use that exact scheme for B-tree. Firstly, an 
important criteria when we designed GIN's encoding scheme was to avoid 
expanding on-disk size for any data set, which meant that a TID had to 
always be encoded in 6 bytes or less. We don't have that limitation with 
B-tree, because in B-tree, each item is currently stored as a separate 
IndexTuple, which is much larger. So we are free to choose an encoding 
scheme that's better at packing some values, at the expense of using 
more bytes for other values, if we want to. Some analysis on what we 
want would be nice. (It's still important that removing a TID from the 
list never makes the list larger, for VACUUM.)

Secondly, to be able to just always enable this feature, without a GUC 
or reloption, we might need something that's faster for random access 
than GIN's posting lists. Or can we just add the setting, but it would 
be nice to have some more analysis on the worst-case performance before 
we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're 
similar to what we did in GIN, but again we might want to choose 
differently here. So some discussion on the desired IndexTuple layout is 
in order. (One clear bug is that using the high bit of BlockNumber for 
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)

- Heikki




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: Reviewing freeze map code
Следующее
От: Krzysztof Kaczkowski
Дата:
Сообщение: Cluster on NAS and data center.