Re: Declarative partitioning

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Declarative partitioning
Дата
Msg-id CA+TgmobAWu=BzbiUScSwCFxq+6Z5iPgi3bjV-6gFna+o=8Y_Hg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Declarative partitioning  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Ответы Re: Declarative partitioning  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Список pgsql-hackers
On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Personally, I would be more inclined to make this a CREATE TABLE statement, like
>>
>> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ...
>> CREATE TABLE subpartition_name PARTITION OF partition_name FOR VALUES ...
>
> I guess the first of which would actually be:
>
> CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES ...
> PARTITION BY ...
>
> Some might think that writing potentially the same PARTITION BY clause 100
> times for 100 level-1 partitions could be cumbersome. That is what
> SUBPARTITION BY notation may be good as a shorthand for.

I think that anybody who is doing that much partitioning is going to
use a tool to generate the DDL anyway, so it doesn't really matter.

>> I think if you've got SUBPARTITION as a keyword in the syntax
>> anywhere, you're doing it wrong.  The toplevel object shouldn't really
>> care whether its children are themselves partitioned or not.
>
> This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It
> might as well be just cascaded PARTITION BYs. The point is to specify as
> many of those with CREATE TABLE toplevel as the number of levels of
> partitioning we want. That does however fix the number of levels in advance.

Which I think is not good.  If we say that a table can be partitioned,
and a table that is a partition can also be partitioned, we've got a
nice general system.  Fixing the number of partitioning levels for the
sake of a little syntactic sugar is, IMHO, getting our priorities
backwards.

> In the patch I have posted, here are some details of the tuple routing
> implementation for instance - the top-level parent knows only of its
> immediate partitions. Once a level-1 partition from that list is
> determined using a tuple's level-1 key, the tuple is passed down to choose
> one of its own partitions using the level-2 key. The key descriptor list
> is associated with the top-level parent alone and the recursive code knows
> to iterate the key list to apply nth key for nth level. The recursion
> happens in the partition module.

I haven't looked at the code, but that sounds like the right approach.

> Now there are also functions to let obtain, say *all* or only *leaf*
> partitions (OIDs) of a top-level parent but they are used for certain DDL
> scenarios. As far as DML is concerned, the level-at-a-time recursive
> approach as described above is used. Queries not yet because the plan is a
> flattened append of leaf partitions anyway.

OK.

> If such notation convenience at the expense of loss of generality is not
> worth it, I'm willing to go ahead and implement SUBPARTITION-less syntax.
>
> CREATE TABLE toplevel() PARTITION BY
> CREATE TABLE partition PARTITION OF toplevel FOR VALUES ... PARTITION BY
> CREATE TABLE subpartition PARTITION OF partition FOR VALUES
> ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE
> ALTER TABLE partitioned DETACH PARTITION name [ USING newname ]
>
> While we are on the syntax story, how about FOR VALUES syntax for range
> partitions (sorry for piggybacking it here in this message). At least some
> people dislike LESS THAN notation. Corey Huinker says we should be using
> range type literals for that. It's not clear though that using range type
> literals directly is a way ahead. For one, range type API expects there to
> exist a range type with given element type. Whereas all we require for
> range partitioning proper is for column type to have a btree operator
> class. Should we require it to have an associated range type as well?
> Don't think that there exists an API to check that either. All in all,
> range types are good to implement things in applications but not so much
> within the backend (unless I'm missing something). I know reinventing the
> wheel is disliked as well but perhaps we could offer something like the
> following because Corey offered some examples which would help from the
> flexibility:
>
> START [ EXCL ] (startval) END [ INCL ] (endval)

I don't think using range type literals is gonna work.  There's no
guarantee that the necessary range types exist.  However, we could
possibly use a syntax inspired by the syntax for range types.  I'm a
little nervous about asking people to type commands with mismatching
braces:

CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 );

...but maybe it's a good idea.  It certainly has the advantage of
being more compact than a lot of there ways we might choose to write
that.  And I think LESS THAN sucks.  It's just clunky and awkward
syntax.

> Some people said something akin to interval partitioning would be good like:
>
> PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START
> WITH value;
>
> But that could be just a UI addition to the design where each partition
> has [startval, endval) bounds. In any case, not a version 1 material I'd
> think.

Agreed; not right now.

> I see the trade-off. I agree that considering the significance for attach
> partition case is quite important.
>
> So, the tuple routing code should be ready to use projection if there
> happens to be a partition with differing tuple descriptor. In the code I
> posted, a ResultRelInfo is lazily built afresh for each inserted tuple in
> ExecInsert's case and for each tuple where the chosen partition is
> different from the previous tuple's in CopyFrom's case. One can feel that
> there is a certain overhead to that approach for the bulk-loading case
> (almost every CopyFrom invocation). Now if projection enters this picture,
> we have to consider that too.  Should we initialize ResultRelInfo's and
> corresponding ProjectionInfo's for all partitions beforehand? Consider how
> ModifyTable currently handles update on inheritance set, for example. That
> would incur unnecessary overhead if only a single tuple is inserted. But
> it would certainly help bulk-loading case. Am I missing something?

I'd say you are missing benchmarking data.  :-)

Why should the single-tuple case be harmed here?

> Sorry for rather incoherent articulation in the previous message. Let me
> try to say something that's perhaps more specific.
>
> Consider that we create partitions as inheritance children, that is,
> creating a partition using:
>
> CREATE TABLE partition PARTITION OF parent FOR VALUES ...
>
> is equivalent to saying
>
> CREATE TABLE partition (...) INHERITS (parent)
>
> except that the latter allows partition to have its own column
> definitions, table constraints and multiple parents as long as things are
> conflict-free. The former requires specifying partition bounding values.
> The common thing between the two then is StoreCatalogInheritance(partOid,
> parentOid) that will mark partition as inheritance child of parent in
> pg_inherits. So, our new partitions are really inheritance children but
> that's not apparent to users (where this last bit is important).

So far, this sounds good to me.

> Then consider ALTER TABLE partition - should we need to handle it in way
> different from existing inheritance code would do -
>
>  * Prevent something whereas regular inheritance wouldn't?
>  * Do something instead of/in addition to whatever regular inheritance does?
>
> Consider adding a column to partition - regular inheritance wouldn't
> prevent it because adding a column to child table doesn't get in the way
> of how inheritance works. Now we can make it so that if table is a
> partition created with CRATE TABLE PARTITION OF (which we can tell because
> the table has pg_partition entry), we should *actively* prevent adding a
> column. Inheritance already prevents dropping an inherited attribute in
> child but says "cannot drop inherited column". IMO in a partition's case,
> it should say "cannot drop columns of a partition".

Agreed.  So maybe pg_inherits gets an additional column that indicates
whether this is "regular" inheritance or "partitioning" inheritance,
and in the latter case there are additional restrictions.

> OTOH, adding a column to parent propagates to all partitions by way of
> ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is
> necessary for partitioned tables (which we can tell from relkind or maybe
> because table has pg_partitioned_rel entry) - especially no need to
> propagate column default and check constraint, etc. Because one cannot
> apply insert/update on partitions anyway.

Hmm.  I feel like maybe that should be allowed.  If the user wants to
bulk-load a bunch of data and knows that it goes in a particular
partition, why should they have to pay the overhead of determining
that anew for each row?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Craig Ringer
Дата:
Сообщение: Re: Writing new unit tests with PostgresNode
Следующее
От: Craig Ringer
Дата:
Сообщение: Re: [JDBC] JDBC behaviour