Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date:
Msg-id: 003b6e85-ac91-bbee-9e19-e7b032681cc5@postgrespro.ru
In reply to: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
           Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On 09.09.2019 22:54, Peter Geoghegan wrote:
>
> Attached is v11, which makes the kill_prior_tuple optimization work
> with posting list tuples. The only catch is that it can only work when
> all "logical tuples" within a posting list are known-dead, since of
> course there is only one LP_DEAD bit available for each posting list.
>
> The hardest part of this kill_prior_tuple work was writing the new
> _bt_killitems() code, which I'm still not 100% happy with. Still, it
> seems to work well -- new pageinspect LP_DEAD status info was added to
> the second patch to verify that we're setting LP_DEAD bits as needed
> for posting list tuples. I also had to add a new nbtree-specific,
> posting-list-aware version of index_compute_xid_horizon_for_tuples()
> -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
> avoid splitting a posting list with the LP_DEAD bit set. I took a
> naive approach to avoiding that problem, adding code to
> _bt_findinsertloc() to prevent it. Posting list splits are generally
> assumed to be rare, so the fact that this is slightly inefficient
> should be fine IMV.
>
> I also refactored deduplication itself in anticipation of making the
> WAL logging more efficient, and incremental. So, the structure of the
> code within _bt_dedup_one_page() was simplified, without really
> changing it very much (I think). I also fixed a bug in
> _bt_dedup_one_page(). The check for dead items was broken in previous
> versions, because the loop examined the high key tuple in every
> iteration.
>
> Making _bt_dedup_one_page() more efficient and incremental is still
> the most important open item for the patch.


Hi, thank you for the fixes and improvements.
I reviewed them, and everything looks good except the idea of not
splitting dead posting tuples.
According to the comments on scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I'm not sure whether that leads to any real problem, though, or whether
it gets resolved by subsequent visibility checks. In any case, it's
worth adding more comments to _bt_killitems() explaining why it's safe.
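
For readers following the thread, here is a minimal standalone model of
the rule being discussed (illustrative names only, not the patch's
actual code): a posting list tuple has a single LP_DEAD bit, so it may
only be set once every heap TID it contains is known dead.

    #include <stdbool.h>

    typedef struct PostingModel
    {
        int     ntids;      /* number of heap TIDs packed into one index tuple */
        bool   *tid_dead;   /* per-TID "known dead" flags gathered by the scan */
        bool    lp_dead;    /* the single LP_DEAD bit for the whole tuple */
    } PostingModel;

    /*
     * Set LP_DEAD only when every contained TID is known dead.  A single
     * live TID keeps the bit clear, which is why the kill logic must track
     * "logical tuples" rather than physical line pointers.
     */
    static void
    maybe_set_lp_dead(PostingModel *post)
    {
        for (int i = 0; i < post->ntids; i++)
        {
            if (!post->tid_dead[i])
                return;
        }
        post->lp_dead = true;
    }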


Attached is v12, which contains WAL optimizations for posting splits and
page deduplication. Changes relative to the previous version:

* The xl_btree_split record doesn't contain the posting tuple anymore;
instead it keeps an 'in_posting offset' and repeats the logic of
_bt_insertonpg(), as you proposed upthread. (See the sketch below.)
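
To make that concrete, here is a hedged sketch of the record-layout
idea; the field names are mine, not necessarily the patch's. Instead of
logging the rewritten posting tuple, the split record carries just the
position at which the incoming TID lands, and redo re-runs the same
posting-split logic that _bt_insertonpg() uses.

    #include <stdint.h>

    typedef struct SplitRecordModel
    {
        uint16_t    firstright;         /* first item moved to the new right page */
        uint16_t    in_posting_offset;  /* where the new TID lands inside the
                                         * posting tuple being split; 0 means
                                         * no posting split is involved */
    } SplitRecordModel;

    /*
     * On replay, a non-zero in_posting_offset tells recovery to fetch the
     * affected posting tuple from the page image and recompute the split,
     * rather than reading a full replacement tuple from the WAL record.
     */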

* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains
info about the groups of tuples that were deduplicated into posting
tuples. In principle, it would be possible to fit this into some
existing record, but I preferred to keep things clear. (A sketch of
what the record might carry follows.)
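
A hedged sketch of such a record (again, illustrative names only): a
list of intervals, each describing a run of consecutive items that were
merged into one posting tuple. That is enough for redo to rebuild the
posting tuples deterministically from the pre-dedup page, keeping the
record small regardless of tuple width.

    #include <stdint.h>

    typedef struct DedupGroupModel
    {
        uint16_t    base_off;   /* offset of the first item merged into the group */
        uint16_t    nitems;     /* number of consecutive items in the group */
    } DedupGroupModel;

    typedef struct DedupRecordModel
    {
        uint16_t        ngroups;    /* how many posting tuples were formed */
        DedupGroupModel groups[];   /* one interval per posting tuple */
    } DedupRecordModel;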

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return
early when no tuples were deduplicated. I wonder if we can introduce an
internal statistic to tune deduplication? That is, returning to the idea
of BT_COMPRESS_THRESHOLD, which can help avoid extra work on pages that
have very few duplicates or are already full of posting lists. (See the
sketch below.)
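
As a sketch of the early-return idea (BT_COMPRESS_THRESHOLD is the name
from earlier patch versions; the rest is made up for illustration): scan
the page once, count how many line pointers a full dedup pass would
absorb, and skip the page rewrite entirely when the saving is below the
threshold.

    #include <stdbool.h>

    /* assumed knob: minimum number of line pointers a dedup pass must save */
    #define BT_COMPRESS_THRESHOLD   2

    /*
     * nitems is the number of items currently on the page; ndistinct is how
     * many items (posting tuples plus non-duplicate keys) would remain
     * after deduplication.
     */
    static bool
    dedup_is_worthwhile(int nitems, int ndistinct)
    {
        return (nitems - ndistinct) >= BT_COMPRESS_THRESHOLD;
    }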

To be honest, I don't believe that incremental deduplication can really
improve anything: no matter how many items were compressed, we still
rewrite all items from the original page to the new one, so why not do
our best? What do we actually save with the incremental approach?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

