Re: Declarative partitioning - another take

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Declarative partitioning - another take
Дата
Msg-id CA+TgmobF2r=f-crrE-k7WM8iFpBKLz3dtBtEc=KmkudYViYcyQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Declarative partitioning - another take  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Ответы Re: Declarative partitioning - another take
Список pgsql-hackers
On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Meanwhile, here are updated patch that address most of the following comments.

OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
there, and I think the entire comment block that starts with "NOTE:
SQL specifies that a NULL" and ends with "it's unlikely that NULL
would result." should be changed to say something like /* As for
catalogued constraints, we treat a NULL result as a success, not a
failure. */ rather than duplicating an existing comment that doesn't
quite apply here.  Finally, ExecConstraints() contains a new if-block
whose sole contents are another if-block.  Perhaps if (this && that)
would be better.

Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
you've chosen is awfully convoluted and probably not that efficient.
For example, get_partition_for_tuple() contains this loop:

+       prev = parent;
+       node = parent->downlink;
+       while (node != NULL)
+       {
+               if (node->index >= cur_idx)
+                       break;
+
+               prev = node;
+               node = node->next;
+       }

Well, it looks to me like that's an O(n) way to find the n'th
partition, which seems like a pretty bad idea in performance-critical
code, which this is.  I think this whole structure needs a pretty
heavy overhaul.  Here's a proposal:

1. Forget the idea of a tree.  Instead, let the total number of tables
in the partitioning hierarchy be N and let the number of those that
are partitioned be K.  Assign each partitioned table in the hierarchy
an index between 0 and K-1.  Make your top level data structure (in
lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
objects, with the partitioning root in entry 0 and the rest in the
remaining entries.

2. Within each PartitionDispatch object, store (a) a pointer to a
PartitionDesc and (b) an array of integers of length equal to the
PartitionDesc's nparts value.  Each integer i, if non-negative, is the
final return value for get_partition_for_tuple.  If i == -1, tuple
routing fails.  If i < -1, we must next route using the subpartition
whose PartitionDesc is at index -(i+1).  Arrange for the array to be
in the same order the PartitionDesc's OID list.

3. Now get_partition_for_tuple looks something like this:

K = 0
loop:   pd = PartitionDispatch[K]   idx = list/range_partition_for_tuple(pd->partdesc, ...)   if (idx >= -1)
returnidx   K = -(idx + 1)
 

No recursion, minimal pointer chasing, no linked lists.  The whole
thing is basically trivial aside from the cost of
list/range_partition_for_tuple itself; optimizing that is a different
project.  I might have some details slightly off here, but hopefully
you can see what I'm going for: you want to keep the computation that
happens in get_partition_for_tuple() to an absolute minimum, and
instead set things up in advance so that getting the partition for a
tuple is FAST.  And you want the data structures that you are using in
that process to be very compact, hence arrays instead of linked lists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: proposal: psql \setfileref
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Fun fact about autovacuum and orphan temp tables