Re: Declarative partitioning - another take

Поиск

Список

Период

Сортировка

От	Amit Langote
Тема	Re: Declarative partitioning - another take
Дата	18 ноября 2016 г. 11:00:07
Msg-id	2587c47e-7ff7-80db-953f-fb532c215219@lab.ntt.co.jp обсуждение исходный текст
Ответ на	Re: Declarative partitioning - another take (Robert Haas <robertmhaas@gmail.com>)
Ответы	Re: Declarative partitioning - another take
Список	pgsql-hackers

Дерево обсуждения

On 2016/11/18 4:14, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Meanwhile, here are updated patch that address most of the following comments.
>
> OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
> a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
> there,

Fixed.

> and I think the entire comment block that starts with "NOTE:
> SQL specifies that a NULL" and ends with "it's unlikely that NULL
> would result." should be changed to say something like /* As for
> catalogued constraints, we treat a NULL result as a success, not a
> failure. */ rather than duplicating an existing comment that doesn't
> quite apply here.

Ah, you're right that the comment does not apply as it is.  I rewrote that
comment.

Oh but wait, that means I can insert rows with NULLs in the range
partition key if I choose to insert it directly into the partition,
whereas I have been thinking all this while that there could never be
NULLs in the partition key of a range partition.  What's more,
get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
every partition key column in case of a range partition.  Is that
wrongheaded altogether?  (also see my reply to your earlier message about
NULLs in the range partition key)

>  Finally, ExecConstraints() contains a new if-block
> whose sole contents are another if-block.  Perhaps if (this && that)
> would be better.

Agreed, should have noticed that.

> Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
> you've chosen is awfully convoluted and probably not that efficient.
> For example, get_partition_for_tuple() contains this loop:
>
> +       prev = parent;
> +       node = parent->downlink;
> +       while (node != NULL)
> +       {
> +               if (node->index >= cur_idx)
> +                       break;
> +
> +               prev = node;
> +               node = node->next;
> +       }
>
> Well, it looks to me like that's an O(n) way to find the n'th
> partition, which seems like a pretty bad idea in performance-critical
> code, which this is.  I think this whole structure needs a pretty
> heavy overhaul.  Here's a proposal:

Thanks for the idea below!

> 1. Forget the idea of a tree.  Instead, let the total number of tables
> in the partitioning hierarchy be N and let the number of those that
> are partitioned be K.  Assign each partitioned table in the hierarchy
> an index between 0 and K-1.  Make your top level data structure (in
> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
> objects, with the partitioning root in entry 0 and the rest in the
> remaining entries.
>
> 2. Within each PartitionDispatch object, store (a) a pointer to a
> PartitionDesc and (b) an array of integers of length equal to the
> PartitionDesc's nparts value.  Each integer i, if non-negative, is the
> final return value for get_partition_for_tuple.  If i == -1, tuple
> routing fails.  If i < -1, we must next route using the subpartition
> whose PartitionDesc is at index -(i+1).  Arrange for the array to be
> in the same order the PartitionDesc's OID list.
>
> 3. Now get_partition_for_tuple looks something like this:
>
> K = 0
> loop:
>     pd = PartitionDispatch[K]
>     idx = list/range_partition_for_tuple(pd->partdesc, ...)
>     if (idx >= -1)
>         return idx
>     K = -(idx + 1)
>
> No recursion, minimal pointer chasing, no linked lists.  The whole
> thing is basically trivial aside from the cost of
> list/range_partition_for_tuple itself; optimizing that is a different
> project.  I might have some details slightly off here, but hopefully
> you can see what I'm going for: you want to keep the computation that
> happens in get_partition_for_tuple() to an absolute minimum, and
> instead set things up in advance so that getting the partition for a
> tuple is FAST.  And you want the data structures that you are using in
> that process to be very compact, hence arrays instead of linked lists.

This sounds *much* better.  Here is a quick attempt at coding the design
you have outlined above in the attached latest set of patches.

PS: I haven't run the patches through pgindent yet.

Thanks,
Amit

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Declarative partitioning - another take

Вложения