Re: [PoC] Improve dead tuple storage for lazy vacuum

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsF8_ce1Vc9HTqjdh4KknPDekjMJTW6JPq99U+GhdzxuqA@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> It looks like there's no problem in terms of vacuum integration,
> although I've not fully tested it yet. TID store uses the radix tree as
> the main storage, and with the template radix tree, the data types for
> shared and non-shared will be different. TID store can have a union for
> the radix tree, and the structure would be as follows:

>     /* Storage for Tids */
>     union tree
>     {
>         local_radix_tree    *local;
>         shared_radix_tree   *shared;
>     };

We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.

> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need an if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O-intensive operation in many cases.

Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.

> Overall, I think
> there is no problem and I'll investigate it in depth.

Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.

> Apart from that, I've been considering the lock support for the shared
> radix tree. As we discussed before, the current usage (i.e., only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.

Right, that should be enough for PG16.

> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.

For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.

> For example, if we want to support
> Optimistic Lock Coupling[1], 

Interesting, from the same authors!

> we would need to change not only the node
> structure but also the logic, which would probably widen the gap
> between the code for the non-shared and shared radix trees. In that
> case, once we have a better radix tree optimized for the shared case,
> perhaps we can replace the templated shared radix tree with it. I'd
> like to hear your opinion on this line.

I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.

> > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.

I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" refer to a variable that is only used in asserts, so in production builds I would hope that branch gets elided (I haven't checked).

I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.

--
John Naylor
EDB: http://www.enterprisedb.com
