Thread: On partitioning
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja reference Tom's post http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us which mentions the possibility of a different partitioning implementation than what we have so far.  As it turns out, I've been thinking about partitioning recently, so I thought I would share what I'm thinking so that others can poke holes.  My intention is to try to implement this as soon as possible.

Declarative partitioning
========================

In this design, partitions are first-class objects, not normal tables in inheritance hierarchies.  There are no pg_inherits entries involved at all.  Partitions are a physical implementation detail.  Therefore we do not allow the owner to be changed, or permissions to be granted directly to partitions; all these operations happen to the parent relation instead.

System Catalogs
---------------

In pg_class we have two additional relkind values:

* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation.  It is used to indicate a parent table, i.e. one the user can directly address in DML queries.  Such relations DO NOT have their own storage.  These use the same rules as regular tables for access privileges, ownership and so on.

* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned relation (its parent).  These cannot be addressed directly in DML queries and only limited DDL support is provided.  They don't have their own pg_attribute entries either and therefore they are always identical in column definitions to the parent relation.  Since they are not accessible directly, there is no need for ACL considerations; the parent relation's owner is the owner, and grants are applied to the parent relation only.  XXX --- is there a need for a partition having different column default values than its parent relation?

Partitions are numbered sequentially, normally from 1 onwards; but it is valid to have negative partition numbers and 0.
Partitions don't have names (except automatically generated ones for pg_class.relname, but they are unusable in DDL).

Each partition is assigned an Expression that receives a tuple and returns boolean.  This expression returns true if a given tuple belongs into it, false otherwise.  If a tuple for a partitioned relation is run through the expressions of all partitions, exactly one should return true.  If none returns true, it might be because the partition has not been created yet.  A user-facing error is raised in this case.  (Rationale: if the user creates a partitioned rel and there is no partition that accepts some given tuple, it's the user's fault.)

Additionally, each partitioned relation may have a master expression.  This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs into.

There are two new system catalogs:

pg_partitioned_rel --> (prelrelid, prelexpr)
pg_partition       --> (partrelid, partseq, partexpr, partoverflow)

For partitioned rels that have prelexpr, we run that expression and obtain the partition number; as a crosscheck we run partexpr and ensure it returns true.  For partitioned rels that don't have prelexpr, we run partexpr for each partition in turn until one returns true.  This means that for a properly set up partitioned table, we need to run a single expression on a tuple to find out which partition the tuple belongs into.

Per-partition expressions are formed as each partition is created, and are based on the user-supplied partitioning criterion.  Master expressions are formed at relation creation time.  (XXX Can we change the master expression later, as a result of some ALTER command?  Presumably this would mean that all partitions might need to be rewritten.)

Triggers
--------

(These are user-defined triggers, not partitioning triggers.  In fact there are no partitioning triggers at all.)

Triggers are attached to the parent relation, not to the specific partition.
When a trigger function runs on a tuple inserted, updated or modified on a partition, the data received by the trigger function makes it appear that the tuple belongs to the parent relation.  There is no need to let the trigger know which partition the tuple went in or came from.  XXX is there a need to give it the partition number that the tuple went in?

Syntax
------

CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr )

This creates the main table only: no partitions are created automatically.

We do not support other types of partitioning at this stage.  We will implement these later.

We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a table after the fact).  We leave this as a future improvement.

Allowed actions on RELKIND_PARTITIONED_REL:

* ALTER TABLE <xyz> CREATE PARTITION <n>
  This creates a new partition.

* ALTER TABLE <xyz> CREATE PARTITION FOR <value>
  Same as above; the partition number is determined automatically.

Allowed actions on a RELKIND_PARTITION:

* ALTER PARTITION <n> ON TABLE <xyz> SET TABLESPACE
* ALTER PARTITION <n> ON TABLE <xyz> DROP
* CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
* VACUUM parent PARTITION <n>

As a future extension we will allow partitions to become detached from the parent relation, thus becoming independent tables.  This might be a relatively expensive operation: pg_attribute entries need to be created, for example.

Overflow Partitions
-------------------

There is no explicit concept of overflow partitions.

Vacuum, aging
-------------

PARTITIONED_RELs, not containing tuples directly, do not have relfrozenxid or relminmxid.  Each partition has individual values for these variables.  Autovacuum knows to ignore PARTITIONED_RELs, and considers each RELKIND_PARTITION separately.  Each partition is vacuumed as a normal relation.

Planner
-------

A partitioned relation behaves just like a regular relation for the purposes of the planner.  XXX do we need special considerations regarding relation size estimation?
For scan plans, we need to prepare Append lists which are used to scan for tuples in a partitioned relation.  We can set up fake constraint expressions based on the partitioning expressions, which let the planner discard unnecessary partitions by way of constraint exclusion.

(In the future we might be interested in creating specialized plan and execution nodes that know more about partitioned relations, to avoid creating useless Append trees only to prune them later.)

Executor
--------

When doing an INSERT or UPDATE, ResultRelInfo needs to be expanded for partitioned relations: the target relation of an insertion is the parent relation, but the actual partition needs to be resolved at ModifyTable execution time.  This means RelOptInfo needs to know about partitions; either we deal with them as "other rels", or we create a new RelOptKind.  At any rate, running the partitioning expression on the new tuple would give a partition index.  This needs to be done once for each new tuple.

I think during ExecInsert, after running triggers and before executing constraints, we need to switch resultRelationDesc from the parent relation to the partition-specific relation.

ExecInsertIndexTuples only knows about partitions.  It's an error to call it using a partitioned rel.

Heap Access Method
------------------

For the purposes of low-level routines in heapam.c, only partitions exist; trying to insert or modify tuples in a RELKIND_PARTITIONED_REL is an error.  heap_insert and heap_multi_insert only accept inserting tuples into an individual partition.  These routines do not check that the tuples belong in the specific partition; that's the responsibility of higher-level code.  Because of this, code like COPY will need to make its own checks.  Maybe we should offer another API (in between high-level things such as ModifyTable/COPY and heapam.c) that receives tuples for a PARTITIONED_REL and routes them into specific partitions.
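To make the division of labor concrete, here is a toy sketch of what such a mid-level routing layer could do (Python; all names here -- route_tuples, master_expr, part_exprs -- are invented for illustration, this is not PostgreSQL code): run the master expression once per tuple, crosscheck the per-partition expression as described under System Catalogs, and hand a per-partition batch to something like heap_multi_insert.

```python
# Hypothetical sketch of a mid-level routing layer: the partitioned rel
# supplies a master function (tuple -> partition number) and each partition
# carries a boolean crosscheck predicate, mirroring prelexpr/partexpr above.
from collections import defaultdict

def route_tuples(tuples, master_expr, part_exprs):
    """Group tuples by target partition; a missing partition raises a
    user-facing error, and a failed crosscheck indicates corruption."""
    batches = defaultdict(list)
    for tup in tuples:
        n = master_expr(tup)                  # one expression run per tuple
        if n not in part_exprs:
            raise LookupError(f"no partition {n} for tuple {tup!r}")
        if not part_exprs[n](tup):            # crosscheck partexpr
            raise AssertionError(f"partition {n} rejected tuple {tup!r}")
        batches[n].append(tup)
    return batches

# Range partitioning on the first column: [0,10) -> partition 1, [10,20) -> 2
master = lambda t: t[0] // 10 + 1
parts = {1: lambda t: 0 <= t[0] < 10, 2: lambda t: 10 <= t[0] < 20}
batches = route_tuples([(3, 'a'), (15, 'b'), (7, 'c')], master, parts)
```

The same routine could serve both COPY and ModifyTable, keeping the "tuples belong where the caller says" contract of heapam.c intact.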
Note: need to ensure we do not slow down COPY for the regular case of RELKIND_RELATION.

Taking backups
--------------

pg_dump is able to dump a partitioned relation as a CREATE TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION commands.  The data of all partitions is considered a single COPY operation.  XXX this limits the ability to restore in parallel.  To fix this we might consider using one COPY for each partition.  It's not clear what relation should be mentioned in such a COPY command, though -- my instinct is that it should reference the parent table only, not the individual partition.

Previous Discussion
-------------------

http://www.postgresql.org/message-id/d3c4af540703292358s8ed731el7771ab14083aa610@mail.gmail.com
Auto Partitioning Patch - WIP version 1 (Nikhil Sontakke, March 2007)

http://www.postgresql.org/message-id/20080111231945.GY6934@europa.idg.com.au
Declarative partitioning grammar (Gavin Sherry, January 2008)

http://www.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com
Patch for automating partitions in PostgreSQL 8.4 Beta 2 (Kedar Potdar, June 2009)

http://www.postgresql.org/message-id/20091029111531.96CD.52131E4D@oss.ntt.co.jp
Syntax for partitioning (Itagaki Takahiro, October 2009)

http://www.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com
Partitioning syntax (Itagaki Takahiro, January 2010)

Not really related:
http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
Dynamic Partitioning using Segment Visibility Maps (Simon Riggs, January 2008)

Still To Be Designed
--------------------

* Dependency issues
* Are indexes/constraints inherited from the parent rel?
* Multiple keys?  Subpartitioning?  Hash partitioning?

Open Questions
--------------

* What's the syntax to refer to specific partitions within a partitioned table?
  We could do "TABLE <xyz> PARTITION <n>", but for example if in the future we add hash partitioning, we might need some non-integer addressing (OTOH assigning sequential numbers to hash partitions doesn't seem so bad).  Discussing with users of other DBMSs' partitioning feature, one useful phrase is "TABLE <xyz> PARTITION FOR <value>".

* Do we want to provide partitioned materialized views?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> [ partition sketch ]
> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Hm, actually I'd say they are *not* first class objects; the problem with the existing design is exactly that child tables *are* first class objects.  This is merely a terminology quibble though.

> * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
>   relation (its parent).  These cannot be addressed directly in DML
>   queries and only limited DDL support is provided.  They don't have
>   their own pg_attribute entries either and therefore they are always
>   identical in column definitions to the parent relation.

Not sure that not storing the pg_attribute rows is a good thing; but that's something that won't be clear till you try to code it.

> Each partition is assigned an Expression that receives a tuple and
> returns boolean.  This expression returns true if a given tuple belongs
> into it, false otherwise.

-1, in fact minus a lot.  One of the core problems of the current approach is that the system, particularly the planner, hasn't got a lot of insight into exactly what the partitioning scheme is in a partitioned table built on inheritance.  If you allow the partitioning rule to be a black box then that doesn't get any better.  I want to see a design wherein the system understands *exactly* what the partitioning behavior is.  I'd start with supporting range-based partitioning explicitly, and maybe we could add other behaviors such as hashing later.

In particular, there should never be any question at all that there is exactly one partition that a given row belongs to, not more, not less.  You can't achieve that with a set of independent filter expressions; a meta-rule that says "exactly one of them should return true" is an untrustworthy band-aid.
(This does not preclude us from mapping the tuple through the partitioning rule and finding that the corresponding partition doesn't currently exist.  I think we could view the partitioning rule as a function from tuples to partition numbers, and then we look in pg_class to see if such a partition exists.)

> Additionally, each partitioned relation may have a master expression.
> This receives a tuple and returns an integer, which corresponds to the
> number of the partition it belongs into.

I guess this might be the same thing I'm arguing for, except that I say it is not optional but is *the* way you define the partitioning.  And I don't really want black-box expressions even in this formulation.  If you're looking for arbitrary partitioning rules, you can keep on using inheritance.  The point of inventing partitioning, IMHO, is for the system to have a lot more understanding of the behavior than is possible now.

As an example of the point I'm trying to make, the planner should be able to discard range-based partitions that are eliminated by a WHERE clause with something a great deal cheaper than the theorem prover it currently has to use for the purpose.  Black-box partitioning rules not only don't improve that situation, they actually make it worse.

Other than that, this sketch seems reasonable ...

			regards, tom lane
On Fri, Aug 29, 2014 at 4:56 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> For scan plans, we need to prepare Append lists which are used to scan
> for tuples in a partitioned relation.  We can setup fake constraint
> expressions based on the partitioning expressions, which let the planner
> discard unnecessary partitions by way of constraint exclusion.
>
> (In the future we might be interested in creating specialized plan and
> execution nodes that know more about partitioned relations, to avoid
> creating useless Append trees only to prune them later.)

This seems like a big part of the point of doing first class partitions.  If we have an equivalence class that specifies a constant for all the variables in the master expression then we should be able to look up the corresponding partition as an O(1) operation (or O(log n) if it involves searching a list) rather than iterating over all the partitions and trying to prove lots of exclusions.  We might even need a btree index to store the partitions so that we can handle scaling up and still find the corresponding partitions quickly.

And I think there are still unanswered questions about indexes.  You seem to be implying that users would be free to create any index they want on any partition.  It's probably going to be necessary to support creating an index on the partitioned table which would create an index on each of the partitions and, crucially, automatically create corresponding indexes whenever new partitions are added.

That said, everything that's here sounds pretty spot-on to me.

-- 
greg
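A minimal illustration of the O(log n) lookup being discussed (Python; the boundary values and function names are made up for the sketch): keep the sorted upper bounds of the ranges and binary-search them.  The same array also makes WHERE-clause pruning a cheap slice computation rather than a per-partition constraint proof.

```python
# Range-partition lookup over sorted upper bounds, instead of evaluating a
# predicate per partition or running constraint exclusion.
import bisect

bounds = [10, 20, 30]   # partition 0: key < 10, 1: [10, 20), 2: [20, 30)

def partition_for(key):
    """O(log n) point lookup: index of the partition containing key."""
    i = bisect.bisect_right(bounds, key)
    if i == len(bounds):
        raise LookupError(f"no partition covers key {key}")
    return i

def partitions_for_range(lo, hi):
    """Cheap pruning for WHERE key >= lo AND key < hi: the surviving
    partitions are a contiguous slice of partition numbers."""
    return range(partition_for(lo), bisect.bisect_left(bounds, hi) + 1)
```

With this representation the "btree index over partitions" mentioned above is just the sorted array itself; the planner never touches partitions outside the returned slice.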
2014-08-29 18:35 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> [ partition sketch ]
>> In this design, partitions are first-class objects, not normal tables in
>> inheritance hierarchies.  There are no pg_inherits entries involved at all.
>
> Hm, actually I'd say they are *not* first class objects; the problem with
> the existing design is exactly that child tables *are* first class
> objects.  This is merely a terminology quibble though.
+1 .. even a few partitions slow down planning significantly, from 1ms to 20ms, which is an issue for very simple queries over a PK
Greg Stark <stark@mit.edu> writes:
> And I think there are still unanswered questions about indexes.

One other interesting thought that occurs to me: are we going to support UPDATEs that cause a row to belong to a different partition?  If so, how are we going to handle the update chain links?

			regards, tom lane
Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > And I think there are still unanswered questions about indexes.
>
> One other interesting thought that occurs to me: are we going to support
> UPDATEs that cause a row to belong to a different partition?  If so, how
> are we going to handle the update chain links?

Bah, I didn't mention it?  My current thinking is that it would be disallowed; if you have chosen your partitioning key well enough it shouldn't be necessary.  As a workaround you can always DELETE/INSERT.  Maybe we can allow it later, but for a first cut this seems more than good enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> One other interesting thought that occurs to me: are we going to support
>> UPDATEs that cause a row to belong to a different partition?  If so, how
>> are we going to handle the update chain links?

> Bah, I didn't mention it?  My current thinking is that it would be
> disallowed; if you have chosen your partitioning key well enough it
> shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> Maybe we can allow it later, but for a first cut this seems more than
> good enough.

Hm.  I certainly agree that it's a case that could be disallowed for a first cut, but it'd be nice to have some clue about how we might allow it eventually.

			regards, tom lane
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> One other interesting thought that occurs to me: are we going to support
> >> UPDATEs that cause a row to belong to a different partition?  If so, how
> >> are we going to handle the update chain links?
>
> > Bah, I didn't mention it?  My current thinking is that it would be
> > disallowed; if you have chosen your partitioning key well enough it
> > shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> > Maybe we can allow it later, but for a first cut this seems more than
> > good enough.
>
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.

Not pretty, but we could set t_ctid to some 'magic' value when switching partitions.  Everything chasing ctid chains could then error out when hitting an invisible row with such a t_ctid.

The usecases for doing such updates really are more maintenance style commands, so it's possibly not too bad from a usability POV :(

Greetings,

Andres Freund

-- 
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
>> Hm.  I certainly agree that it's a case that could be disallowed for a
>> first cut, but it'd be nice to have some clue about how we might allow it
>> eventually.

> Not pretty, but we could set t_ctid to some 'magic' value when switching
> partitions.  Everything chasing ctid chains could then error out when
> hitting an invisible row with such a t_ctid.

An actual fix would presumably involve adding a partition number to the ctid chain field in tuples in partitioned tables.  The reason I bring it up now is that we'd have to commit to doing that (or at least leaving room for it) in the first implementation, if we don't want to have an on-disk compatibility break.

There is certainly room to argue that the value of this capability isn't worth the disk space this solution would eat.  But we should have that argument while the option is still feasible ...

> The usecases for doing such
> updates really are more maintenance style commands, so it's possibly not
> too bad from a usability POV :(

I'm afraid that might just be wishful thinking.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> One other interesting thought that occurs to me: are we going to support
> >> UPDATEs that cause a row to belong to a different partition?  If so, how
> >> are we going to handle the update chain links?
>
> > Bah, I didn't mention it?  My current thinking is that it would be
> > disallowed; if you have chosen your partitioning key well enough it
> > shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
> > Maybe we can allow it later, but for a first cut this seems more than
> > good enough.
>
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.

I hesitate to suggest this, but we have free flag bits in MultiXactStatus.  We could use a specially marked multixact member to indicate the OID of the target relation; perhaps set an infomask bit to indicate that this has happened.  Of course, no HOT updates are possible, so I think it's okay from a heap_prune_chain perspective.  This abuses the knowledge that OIDs and XIDs are both 32 bits long.  Since nowhere else do we have the space necessary to store the longer data that a cross-partition update would require, I don't see anything else ATM.

(For a moment I thought about abusing combo CIDs, but that doesn't work because this needs to be persistent and visible from other backends, neither of which is a quality of combocids.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
> >> Hm.  I certainly agree that it's a case that could be disallowed for a
> >> first cut, but it'd be nice to have some clue about how we might allow it
> >> eventually.
>
> An actual fix would presumably involve adding a partition number to the
> ctid chain field in tuples in partitioned tables.  The reason I bring it
> up now is that we'd have to commit to doing that (or at least leaving room
> for it) in the first implementation, if we don't want to have an on-disk
> compatibility break.

Right.  Just adding it unconditionally doesn't sound feasible to me.  Our per-row overhead is already too large.  And it doesn't sound fun to have the first-class partitions use a different heap tuple format than plain relations.

What we could do is to add some sort of 'jump' tuple when moving a tuple from one relation to another.  So, when updating a tuple between partitions we add another in the old partition with xmin_jump = xmax_jump = xmax_old and have the jump tuple's content point to the new relation.  Far from pretty, but it'd only matter overhead wise when used.

> > The usecases for doing such
> > updates really are more maintenance style commands, so it's possibly not
> > too bad from a usability POV :(
>
> I'm afraid that might just be wishful thinking.

I admit that you might very well be right there :(

Greetings,

Andres Freund

-- 
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
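To make the jump-tuple idea concrete, here is a toy model (Python; every structure here is invented for illustration and bears no resemblance to the actual heap tuple layout) of how chain-chasing code might follow such a marker into the new partition:

```python
# Toy model: an update chain where a cross-partition update leaves behind a
# "jump" tuple whose payload names the target (partition, ctid) instead of a
# same-partition t_ctid link.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tuple:
    data: object
    next_ctid: Optional[int] = None   # ordinary same-partition chain link
    jump_to: Optional[tuple] = None   # (partition_no, ctid) left by a cross-partition update

def chase_chain(partitions, part_no, ctid):
    """Follow an update chain to the latest tuple version, hopping to
    another partition whenever a jump tuple is encountered."""
    while True:
        tup = partitions[part_no][ctid]
        if tup.jump_to is not None:
            part_no, ctid = tup.jump_to      # cross-partition hop
        elif tup.next_ctid is not None:
            ctid = tup.next_ctid             # normal t_ctid chain
        else:
            return part_no, ctid, tup.data

# v1 was updated once in place, then updated into partition 2:
old_part = {0: Tuple('v1', next_ctid=1), 1: Tuple(None, jump_to=(2, 0))}
new_part = {0: Tuple('v2')}
result = chase_chain({1: old_part, 2: new_part}, 1, 0)
```

The per-row cost is paid only when a cross-partition update actually happens, which matches the "only matter overhead wise when used" property claimed above.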
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
>> An actual fix would presumably involve adding a partition number to the
>> ctid chain field in tuples in partitioned tables.  The reason I bring it
>> up now is that we'd have to commit to doing that (or at least leaving room
>> for it) in the first implementation, if we don't want to have an on-disk
>> compatibility break.

> What we could do is to add some sort of 'jump' tuple when moving a tuple
> from one relation to another.  So, when updating a tuple between
> partitions we add another in the old partition with xmin_jump =
> xmax_jump = xmax_old and have the jump tuple's content point to the new
> relation.

Hm, that might work.  It sounds more feasible than Alvaro's suggestion of abusing cmax --- I don't think that field is free for use in this context.

			regards, tom lane
On 08/29/2014 07:15 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> One other interesting thought that occurs to me: are we going to support
>>> UPDATEs that cause a row to belong to a different partition?  If so, how
>>> are we going to handle the update chain links?
>> Bah, I didn't mention it?  My current thinking is that it would be
>> disallowed; if you have chosen your partitioning key well enough it
>> shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
>> Maybe we can allow it later, but for a first cut this seems more than
>> good enough.
> Hm.  I certainly agree that it's a case that could be disallowed for a
> first cut, but it'd be nice to have some clue about how we might allow it
> eventually.

There needs to be some structure that is specific to partitions (and not to multiple plain tables), which would then be used for both update chains and cross-partition indexes (as you seem to imply by jumping from indexes to update chains a few posts back).  It would need to replace the plain tid (pagenr, tupnr) with a triple of (partid, pagenr, tupnr).

Cross-partition indexes are especially needed if we want to allow putting UNIQUE constraints on non-partition-key columns.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On 08/29/2014 05:56 PM, Alvaro Herrera wrote:
> Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
> reference Tom's post
> http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us
> which mentions the possibility of a different partitioning
> implementation than what we have so far.  As it turns out, I've been
> thinking about partitioning recently, so I thought I would share what
> I'm thinking so that others can poke holes.  My intention is to try to
> implement this as soon as possible.
>
> Declarative partitioning
> ========================
> ...
> Still To Be Designed
> --------------------
> * Dependency issues
> * Are indexes/constraints inherited from the parent rel?

I'd say mostly yes.  There could be some extra "constraint exclusion type" magic for conditional indexes, but the rest probably should come from the "main table".

And there should be some kind of cross-partition indexes.  As an extra "partitioning" capability, this can probably wait for version 2.

> * Multiple keys?

Why not.  But probably just for hash partitioning.

> Subpartitioning?

Probably not.  If you need speed for huge numbers of partitions, use Greg's idea of keeping the partitions in a tree (or just having a partition index).

> Hash partitioning?

At some point, definitely.

Also, one thing you left unmentioned is dropping (and perhaps also truncating) a partition.  We still may want to do historic data management the same way we do it now, by just getting rid of the whole partition or its data.

At some point we may also want to do redistributing of data between partitions, maybe for the case where we end up with 90% of the data in one partition due to a bad partitioning key or partitioning function choice.  This is again something that is hard now and can therefore be left to a later version.

> Open Questions
> --------------
>
> * What's the syntax to refer to specific partitions within a partitioned
>   table?
> We could do "TABLE <xyz> PARTITION <n>", but for example if in
> the future we add hash partitioning, we might need some non-integer
> addressing (OTOH assigning sequential numbers to hash partitions doesn't
> seem so bad).  Discussing with users of other DBMSs partitioning feature,
> one useful phrase is "TABLE <xyz> PARTITION FOR <value>".

Or more generally TABLE <xyz> PARTITION FOR/WHERE col1=val1, col2=val2, ...;

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Hannu Krosing wrote:
> Cross-partition indexes are especially needed if we want to allow putting
> UNIQUE constraints on non-partition-key columns.

I'm not going to implement cross-partition indexes in the first patch.  They are a huge can of worms.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> In this design, partitions are first-class objects, not normal tables in
> inheritance hierarchies.  There are no pg_inherits entries involved at all.

Whoa.  I always assumed that table inheritance was a stepping-stone to real partitioning, and that real partitioning would be built on top of table inheritance.  In particular, I assume that (as Itagaki Takahiro's patch did those many years ago) we'd add some metadata somewhere to allow fast tuple routing (for both pruning and inserts/updates).  What's the benefit of inventing something new instead?

I'm skeptical about your claim that there will be no pg_inherits entries involved at all.  You need some way to know which partitions go with which parent table.  You can store that many-to-one mapping someplace other than pg_inherits, but it seems to me that that doesn't buy you anything; they're just pg_inherits entries under some other name.  Why reinvent that?

> Each partition is assigned an Expression that receives a tuple and
> returns boolean.  This expression returns true if a given tuple belongs
> into it, false otherwise.  If a tuple for a partitioned relation is run
> through expressions of all partitions, exactly one should return true.
> If none returns true, it might be because the partition has not been
> created yet.  A user-facing error is raised in this case (Rationale: if
> user creates a partitioned rel and there is no partition that accepts
> some given tuple, it's the user's fault.)
>
> Additionally, each partitioned relation may have a master expression.
> This receives a tuple and returns an integer, which corresponds to the
> number of the partition it belongs into.

I agree with Tom: this is a bad design.  In particular, if we want to scale to large numbers of partitions (a principal weakness of the present system) we need the operation of routing a tuple to a partition to be as efficient as possible.
Range partitioning can be O(lg n) where n is the number of partitions: store a list of the boundaries and binary-search it. List partitioning can be O(lg k) where k is the number of values (which may be more than the number of partitions) via a similar technique. Hash partitioning can be O(1). I'm not sure what other kind of partitioning anybody would want to do, but it's likely that they *won't* want it to be O(1) in the number of partitions. So I'd say have *only* the master expression. But, really, I don't think an expression is the right way to store this; evaluating that repeatedly will, I think, still be too slow. Think about what happens in PL/pgsql: minimizing the number of times that you enter and exit the executor helps performance enormously, even if the expressions are simple enough not to need planning. I think the representation should be more like an array of partition boundaries and the pg_proc OID of a comparator. > Per-partition expressions are formed as each partition is created, and > are based on the user-supplied partitioning criterion. Master > expressions are formed at relation creation time. (XXX Can we change > the master expression later, as a result of some ALTER command? > Presumably this would mean that all partitions might need to be > rewritten.) This is another really important point. If you store an opaque expression mapping partitioning keys to partition numbers, you can't do things like this efficiently. With a more transparent representation, like a sorted array of partition boundaries for range partitioning, or a sorted array of hash values for consistent hashing, you can do things like split and merge partitions efficiently, minimizing rewriting. > Planner ------- > > A partitioned relation behaves just like a regular relation for purposes > of planner. XXX do we need special considerations regarding relation > size estimation? 
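The O(lg n) range routing described above amounts to a binary search over the sorted partition bounds. A minimal sketch (illustrative Python, not PostgreSQL code; all names are invented):

```python
import bisect

def route_tuple(bounds, key):
    """Hypothetical O(lg n) range-partition router.

    `bounds` is the sorted list of partition upper bounds, so partition i
    holds keys in [bounds[i-1], bounds[i]).  bisect_right returns the index
    of the first bound strictly greater than `key`, which is exactly the
    partition number the key falls into (equal keys go to the right-hand
    partition, matching half-open ranges).
    """
    return bisect.bisect_right(bounds, key)

# Four partitions: (-inf, 0) -> 0, [0, 5) -> 1, [5, 10) -> 2, [10, inf) -> 3
bounds = [0, 5, 10]
```

Hash partitioning in this scheme would be the O(1) analogue, something like `hash(key) % npartitions`, with no search at all.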
> > For scan plans, we need to prepare Append lists which are used to scan > for tuples in a partitioned relation. We can setup fake constraint > expressions based on the partitioning expressions, which let the planner > discard unnecessary partitions by way of constraint exclusion. So if we're going to do all this, why bother making the partitions anything other than inheritance children? There might be some benefit in having the partitions be some kind of stripped-down object if we could avoid some of these planner gymnastics and get, e.g. efficient run-time partition pruning. But if you're going to generate Append plans and switch ResultRelInfos and stuff just as you would for an inheritance hierarchy, why not just make it an inheritance hierarchy? It seems pretty clear to me that we need partitioned tables to have the same tuple descriptor throughout the relation, for efficient tuple routing and so on. But the other restrictions you're proposing to impose on partitions have no obvious value that I can see. We could have a rule that when you inherit from a partition root, you can only inherit from that one table (no multiple inheritance) and your tuple descriptor must match precisely (down to dropped columns and column ordering) and that would give you everything I think you really need here. There's no gain to be had in forbidding partitions from having different owners, or being selected from directly, or having user-visible names. The first of those is arguably useless, but it's not really causing us any problems, and the latter two are extremely useful features. Unless the partition pruning you are going to implement is so good that it will never fail to recognize a situation where only one partition needs to be scanned, letting users target the partition directly is a very important escape hatch. 
> (In the future we might be interested in creating specialized plan and > execution nodes that know more about partitioned relations, to avoid > creating useless Append trees only to prune them later.) Good idea. > pg_dump is able to dump a partitioned relation as a CREATE > TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION > commands. The data of all partitions is considered a single COPY > operation. > > XXX this limits the ability to restore in parallel. To fix we might consider > using one COPY for each partition. It's not clear what relation should be > mentioned in such a COPY command, though -- my instinct is that it > should reference the parent table only, not the individual partition. Targeting the individual partitions seems considerably better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 30, 2014 at 12:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja > reference Tom's post > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us > which mentions the possibility of a different partitioning > implementation than what we have so far. As it turns out, I've been > thinking about partitioning recently, so I thought I would share what > I'm thinking so that others can poke holes. My intention is to try to > implement this as soon as possible. > +1.
Another thought about this general topic: Alvaro Herrera <alvherre@2ndquadrant.com> writes: > ... > Allowed actions on a RELKIND_PARTITION: > * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz> > ... > Still To Be Designed > -------------------- > * Are indexes/constraints inherited from the parent rel? I think one of the key design decisions we have to make is whether partitions are all constrained to have exactly the same set of indexes. If we don't insist on that it will greatly complicate planning compared to what we'll get if we do insist on it, because then the planner will need to generate a separate customized plan subtree for each partition. Aside from costing planning time, most likely that would forever prevent us from pushing some types of intelligence about partitioning into the executor. Now, in the current model, it's up to the user what indexes to create on each partition, and sometimes one might feel that maintaining a particular index is unnecessary in some partitions. But the flip side of that is it's awfully easy to screw yourself by forgetting to add some index when you add a new partition. So I'm not real sure which approach is superior from a purely user-oriented perspective. I'm not trying to push one or the other answer right now, just noting that this is a critical decision. regards, tom lane
On 08/31/2014 10:03 PM, Tom Lane wrote: > Another thought about this general topic: > > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> ... >> Allowed actions on a RELKIND_PARTITION: >> * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz> >> ... >> Still To Be Designed >> -------------------- >> * Are indexes/constraints inherited from the parent rel? > I think one of the key design decisions we have to make is whether > partitions are all constrained to have exactly the same set of indexes. > If we don't insist on that it will greatly complicate planning compared > to what we'll get if we do insist on it, because then the planner will > need to generate a separate customized plan subtree for each partition. > Aside from costing planning time, most likely that would forever prevent > us from pushing some types of intelligence about partitioning into the > executor. > > Now, in the current model, it's up to the user what indexes to create > on each partition, and sometimes one might feel that maintaining a > particular index is unnecessary in some partitions. But the flip side > of that is it's awfully easy to screw yourself by forgetting to add > some index when you add a new partition. The "forgetting" part is easy to solve by inheriting all indexes from the parent (or template) partition unless explicitly told not to. One other thing that has been bothering me about this proposal is the ability to take partitions offline for maintenance, or to load them offline and then switch them in. In the current scheme we do this using ALTER TABLE ... [NO] INHERIT ... If we also want to have this with the not-directly-accessible partitions, then perhaps it could be done by having a way to move a partition between two tables with exactly the same structure? > So I'm not real sure which > approach is superior from a purely user-oriented perspective. 
What we currently have is a very flexible scheme which has a few drawbacks: 1) unnecessarily complex for the simple case 2) easy to shoot yourself in the foot by forgetting something 3) can be hard on the planner, especially with a huge number of partitions An alternative way of solving these problems is adding some (meta-)constraints to the current way of doing things, plus some more automation:

CREATE TABLE FOR PARTITIONMASTER WITH (
    ALL_INDEXES_SAME=ON,
    SAME_STRUCTURE_ALWAYS=ON,
    SINGLE_INHERITANCE_ONLY=ON,
    NESTED_INHERITS=OFF,
    PARTITION_FUNCTION=default_range_partitioning(int)
);

and then force these when adding inherited tables (in this case partition tables), either via CREATE TABLE or ALTER TABLE. Best Regards -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On Fri, Aug 29, 2014 at 12:35:50PM -0400, Tom Lane wrote: > > Each partition is assigned an Expression that receives a tuple and > > returns boolean. This expression returns true if a given tuple belongs > > into it, false otherwise. > > -1, in fact minus a lot. One of the core problems of the current approach > is that the system, particularly the planner, hasn't got a lot of insight > into exactly what the partitioning scheme is in a partitioned table built > on inheritance. If you allow the partitioning rule to be a black box then > that doesn't get any better. I want to see a design wherein the system > understands *exactly* what the partitioning behavior is. I'd start with > supporting range-based partitioning explicitly, and maybe we could add > other behaviors such as hashing later. > > In particular, there should never be any question at all that there is > exactly one partition that a given row belongs to, not more, not less. > You can't achieve that with a set of independent filter expressions; > a meta-rule that says "exactly one of them should return true" is an > untrustworthy band-aid. > > (This does not preclude us from mapping the tuple through the partitioning > rule and finding that the corresponding partition doesn't currently exist. > I think we could view the partitioning rule as a function from tuples to > partition numbers, and then we look in pg_class to see if such a partition > exists.) There is one situation where you need to be more flexible, and that is if you ever want to support online repartitioning. To do that you have to distinguish between "I want to insert tuple X, which partition should it go into" and "I want to know which partitions I need to look for partition_key=Y". For the latter you really need an expression per partition, or something equivalent. 
If performance is an issue I suppose you could live with having an "old" and a "new" partition scheme, so you couldn't have two "live repartitionings" happening simultaneously. Now, if you want to close the door on online repartitioning forever then that's fine. But being in the position of having to say "yes our partitioning scheme sucks, but we would have to take the database down for a week to fix it" is no fun. Unless logical replication provides a way out maybe?? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
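The old/new-scheme idea above can be sketched roughly like this (hypothetical Python; `old_scheme` and `new_scheme` stand in for the two routing functions and are not anything that exists in PostgreSQL):

```python
def insert_target(new_scheme, key):
    # New tuples are always routed by the new scheme.
    return new_scheme(key)

def partitions_to_scan(old_scheme, new_scheme, key, repartitioning):
    # While a repartitioning is in flight, a tuple may still sit where the
    # old scheme put it, so reads must consult both schemes' answers.
    parts = {new_scheme(key)}
    if repartitioning:
        parts.add(old_scheme(key))
    return parts

# Example: merging partition C (x >= 5) into B while tuples are moved.
old = lambda x: 'C' if x >= 5 else 'B'
new = lambda x: 'B'
```

Once the background move of existing tuples finishes, the old scheme is dropped and reads go back to touching exactly one partition.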
On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Aside from costing planning time, most likely that would forever prevent > us from pushing some types of intelligence about partitioning into the > executor. How would it affect this calculus if there were partitioned indexes which were created on the overall table and guaranteed to exist on each partition that the planner could use -- and then possibly also per-partition indexes that might exist in addition to those? So the planner could make deductions and leave some intelligence about partitions to the executor as long as they only depend on partitioned indexes, but might be able to take advantage of a per-partition index in an unusual situation. I'm imagining for example a partitioned table where only the current partition is read-write and OLTP queries restrict themselves to working only with the current partition. Having excluded the other partitions, the planner is free to use any of the indexes liberally. That said, I think the typical approach to this is to only allow indexes that are defined for the whole table. If the user wants to have different indexes for the current time period they would have a separate table with all the indexes on it that is only moved into the partitioned table once it's finished being used for the atypical queries. Oracle supports "local partitioned indexes" (which are partitioned like the table) and "global indexes" (which span partitions) but afaik it doesn't support indexes on only some partitions. Furthermore, we have partial indexes. Partial indexes mean you can always create a partial index on just one partition's range of keys. The index will exist for all partitions but just be empty for all but the partitions that matter. The planner can plan based on the partial index's WHERE clause, which would accomplish the same thing, I think. -- greg
On 2014-08-29 20:12:16 +0200, Hannu Krosing wrote: > It would need to replace plain tid (pagenr, tupnr) with triple of (partid, > pagenr, tupnr). > > Cross-partition indexes are especially needed if we want to allow putting > UNIQUE constraints on non-partition-key columns. I actually don't think this is necessary. I'm pretty sure that you can build an efficient and correct version of unique constraints with several underlying indexes in different partitions each. The way exclusion constraints are implemented imo is a good guide. I personally think that implementing cross partition indexes has a low enough cost/benefit ratio that I doubt it's wise to tackle it anytime soon. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
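A toy sketch of the idea above: one unique index per partition, with an insert probing every partition's index before accepting the key, in the style of how exclusion constraints are checked. Plain dicts stand in for the per-partition indexes; a real implementation would also need the speculative-insertion locking that exclusion constraints rely on to avoid concurrent-insert races. All names are invented:

```python
def unique_insert(partition_indexes, target_part, key, tid):
    """Enforce a cross-partition UNIQUE constraint without a global index:
    probe each partition's own unique index for `key`, and only insert
    into the target partition's index if no partition already has it."""
    for part, index in partition_indexes.items():
        if key in index:
            raise ValueError("duplicate key %r in partition %r" % (key, part))
    partition_indexes[target_part][key] = tid

# Two partitions, each with its own (empty) unique index.
indexes = {"p1": {}, "p2": {}}
unique_insert(indexes, "p1", 42, (0, 1))  # ok: key 42 seen nowhere yet
```

The probe is O(number of partitions), but each probe is an ordinary index lookup, so no cross-partition TID scheme is needed.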
Greg Stark <stark@mit.edu> writes: > On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Aside from costing planning time, most likely that would forever prevent >> us from pushing some types of intelligence about partitioning into the >> executor. > How would it affect this calculus if there were partitioned indexes > which were created on the overall table and guaranteed to exist on > each partition that the planner could use -- and then possibly also > per-partition indexes that might exist in addition to those? That doesn't actually fix the planning-time issue at all. Either the planner considers each partition individually to create a custom plan for it, or it doesn't. The "push into executor" idea I was alluding to is that we might invent plan constructs like a ModifyTable node that applies to a whole inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be done at runtime. You're not going to get a plan structure like that if the planner is building a separate plan subtree for each partition. regards, tom lane
On 2014-08-31 16:03:30 -0400, Tom Lane wrote: > Another thought about this general topic: > > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > ... > > Allowed actions on a RELKIND_PARTITION: > > * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz> > > ... > > Still To Be Designed > > -------------------- > > * Are indexes/constraints inherited from the parent rel? > > I think one of the key design decisions we have to make is whether > partitions are all constrained to have exactly the same set of indexes. > If we don't insist on that it will greatly complicate planning compared > to what we'll get if we do insist on it, because then the planner will > need to generate a separate customized plan subtree for each partition. > Aside from costing planning time, most likely that would forever prevent > us from pushing some types of intelligence about partitioning into the > executor. > Now, in the current model, it's up to the user what indexes to create > on each partition, and sometimes one might feel that maintaining a > particular index is unnecessary in some partitions. But the flip side > of that is it's awfully easy to screw yourself by forgetting to add > some index when you add a new partition. So I'm not real sure which > approach is superior from a purely user-oriented perspective. I think we're likely to end up with both. In many cases it'll be far superior from a usability and planning perspective to have indices on the 'toplevel table' (do we have a good name for that?). But on the flip side, one of the significant use cases for partitioning is dealing with historical data. In many cases old data has to be saved for years but is barely ever queried. It'd be a shame to inflict all indexes on all partitions for that kind of data. It'd surely be a useful step to add sane partitioning without that capability, but we shouldn't base the design on that decision. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 1, 2014 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The "push into executor" idea I was alluding to is that we might invent > plan constructs like a ModifyTable node that applies to a whole > inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be > done at runtime. You're not going to get a plan structure like that > if the planner is building a separate plan subtree for each partition. Well my message was assuming that in that case it would only consider the partitioned indexes. It would only consider the isolated indexes if the planner was able to identify a specific partition. That's probably the only type of query where such indexes are likely to be useful. -- greg
On 2014-09-01 11:59:37 -0400, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: > > On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Aside from costing planning time, most likely that would forever prevent > >> us from pushing some types of intelligence about partitioning into the > >> executor. > > > How would it affect this calculus if there were partitioned indexes > > which were created on the overall table and guaranteed to exist on > > each partition that the planner could use -- and then possibly also > > per-partition indexes that might exist in addition to those? > > That doesn't actually fix the planning-time issue at all. Either the > planner considers each partition individually to create a custom plan > for it, or it doesn't. We could have information about the indexing situation in child partitions on the toplevel table, i.e. note whether child partitions have individual indexes. And possibly constraints. > The "push into executor" idea I was alluding to is that we might invent > plan constructs like a ModifyTable node that applies to a whole > inheritance^H^H^Hpartitioning tree and leaves the tuple routing to be > done at runtime. You're not going to get a plan structure like that > if the planner is building a separate plan subtree for each partition. It doesn't sound impossible to evaluate at plan time whether to use nodes covering several partitions or to use a separate subplan for individual partitions. We're going to need information about which partitions to scan in those nodes anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 09/01/2014 06:59 PM, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: >> On Sun, Aug 31, 2014 at 9:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Aside from costing planning time, most likely that would forever prevent >>> us from pushing some types of intelligence about partitioning into the >>> executor. > >> How would it affect this calculus if there were partitioned indexes >> which were created on the overall table and guaranteed to exist on >> each partition that the planner could use -- and then possibly also >> per-partition indexes that might exist in addition to those? > > That doesn't actually fix the planning-time issue at all. Either the > planner considers each partition individually to create a custom plan > for it, or it doesn't. Hmm. Couldn't you plan together all partitions that do have the same indexes? In other words, create a custom plan for each group of partitions, rather than each partition? - Heikki
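Heikki's grouping idea could be sketched as: key each partition by its index set and build one custom plan subtree per distinct set, rather than one per partition (illustrative Python; names and representation invented):

```python
from collections import defaultdict

def plan_groups(partition_indexes):
    """Group partitions by their (identical) index sets, so the planner
    would build one customized subtree per group instead of per partition."""
    groups = defaultdict(list)
    for part, indexes in partition_indexes.items():
        groups[frozenset(indexes)].append(part)
    return [sorted(parts) for parts in groups.values()]

groups = plan_groups({
    "p2013": ["pk"],               # cold partition, minimal indexing
    "p2014_q1": ["pk", "ts_idx"],  # hot partitions share an index set
    "p2014_q2": ["pk", "ts_idx"],
})
```

In the common case where every partition carries the same indexes, this degenerates to a single group, i.e. the plan-once behavior Tom described.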
On 09/01/2014 05:52 PM, Andres Freund wrote: > On 2014-08-29 20:12:16 +0200, Hannu Krosing wrote: >> It would need to replace plain tid (pagenr, tupnr) with triple of (partid, >> pagenr, tupnr). >> >> Cross-partition indexes are especially needed if we want to allow putting >> UNIQUE constraints on non-partition-key columns. > I actually don't think this is necessary. I'm pretty sure that you can > build an efficient and correct version of unique constraints with > several underlying indexes in different partitions each. The way > exclusion constraints are implemented imo is a good guide. > > I personally think that implementing cross partition indexes has a low > enough cost/benefit ratio that I doubt it's wise to tackle it anytime > soon. Also it has the downside of (possibly) making DROP PARTITION either slow or wasting space until next VACUUM. So if building composite unique indexes over multiple per-partition indexes is doable, I would much prefer this. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 09/01/2014 11:52 PM, Andres Freund wrote: > I personally think that implementing cross partition indexes has a low > enough cost/benefit ratio that I doubt it's wise to tackle it anytime > soon. UNIQUE constraints on partitioned tables (and thus foreign key constraints pointing to partitioned tables) are a pretty big limitation at the moment. That said, the planner may well be able to use its greater knowledge of the partitioned table structure to do this implicitly, as it knows that a unique index on a partition is also implicitly unique across partitions on the partitioning key. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 09/01/2014 04:03 AM, Tom Lane wrote: > I think one of the key design decisions we have to make is whether > partitions are all constrained to have exactly the same set of indexes. ... and a lot of that comes down to what use cases the partitioning is meant to handle, and what people are expected to continue to DIY with inheritance. Simple range and hash partitioning are the main things being discussed. Other moderately common partitioning uses seem to be hot/cold partitioning, usually on unequal ranges, and the closely related live/dead partitioning for apps that soft-delete data. In both of those you may well want to suppress indexes on the cold/dead portion, much like we currently have partial indexes. In fact, how different is an index that's present on only a subset of partitions from a partial index, in planning terms? We know the partitions it is/isn't on, after all, and can form an expression that finds just those partitions. (I guess the answer there is that partial index planning is probably not smart enough to be useful for this.) > If we don't insist on that it will greatly complicate planning compared > to what we'll get if we do insist on it, because then the planner will > need to generate a separate customized plan subtree for each partition. Seems to me like a "make room to support it in the future, but don't do it now" thing. Partitioning schemes like: [prior years] [last year] [this year] [this month] [this week] could benefit from it, but they also need things like online repartitioning, updates to move tuples across partitions, etc. So it's all in the "let's not lock it out for the future, but let's not tackle it now either" box. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote: > There is one situation where you need to be more flexible, and that is > if you ever want to support online repartitioning. To do that you have > to distinguish between "I want to insert tuple X, which partition > should it go into" and "I want to know which partitions I need to look > for partition_key=Y". > > For the latter you really have need an expression per partition, or > something equivalent. If performance is an issue I suppose you could > live with having an "old" and an "new" partition scheme, so you > couldn't have two "live repartitionings" happening simultaneously. > > Now, if you want to close the door on online repartitioning forever > then that fine. But being in the position of having to say "yes our > partitioning scheme sucks, but we would have to take the database down > for a week to fix it" is no fun. > > Unless logical replication provides a way out maybe?? I am unclear why having information per-partition rather than on the parent table helps with online repartitioning. Robert's idea of using normal table inheritance means we can access/move the data independently of the partitioning system. My guess is that we will need to do repartitioning with some tool, rather than as part of normal database operation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote: > On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote: > > There is one situation where you need to be more flexible, and that is > > if you ever want to support online repartitioning. To do that you have > > to distinguish between "I want to insert tuple X, which partition > > should it go into" and "I want to know which partitions I need to look > > for partition_key=Y". > > I am unclear why having information per-partition rather than on the > parent table helps with online reparitioning. An example: We have three partitions, one for X<0 (A), one for 0<=X<5 (B) and one for X>=5 (C). These are in three different tables. Now we give the command to merge the last two partitions B&C. You now have the choice to lock the table while you move all the tuples from C to B. Or you can make some adjustments such that new tuples that would have gone to C now go to B. And if there is a query for X=10 you look in *both* B & C. Then the existing tuples can be moved from C to B at any time without blocking any other operations. Is this clearer? If you decide up front that which partition to query will be determined by a function that can only return one table, then the above becomes impossible. > Robert's idea of using normal table inheritance means we can access/move > the data independently of the partitioning system. My guess is that we > will need to do repartitioning with some tool, rather than as part of > normal database operation. Doing it as some tool seems like a hack to me. And since the idea was (I thought) that partitions would not be directly accessible from SQL, it has to be in the database itself. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Tue, Sep 2, 2014 at 4:18 PM, Martijn van Oosterhout <kleptog@svana.org> wrote: > On Tue, Sep 02, 2014 at 09:44:17AM -0400, Bruce Momjian wrote: >> On Sun, Aug 31, 2014 at 10:45:29PM +0200, Martijn van Oosterhout wrote: >> > There is one situation where you need to be more flexible, and that is >> > if you ever want to support online repartitioning. To do that you have >> > to distinguish between "I want to insert tuple X, which partition >> > should it go into" and "I want to know which partitions I need to look >> > for partition_key=Y". >> >> I am unclear why having information per-partition rather than on the >> parent table helps with online reparitioning. > > An example: > > We have three partitions, one for X<0 (A), one for 0<=X<5 (B) and one > for X>=5 (C). These are in three different tables. > > Now we give the command to merge the last two partitions B&C. You now > have the choice to lock the table while you move all the tuples from C > to B. > > Or you can make some adjustments such that new tuples that would have gone > to C now go to B. And if there is a query for X=10 that you look in > *both* B & C. Then the existing tuples can be moved from C to B at any > time without blocking any other operations. > > Is this clearer? If you up front decide that which partition to query > will be determined by a function that can only return one table, then > the above becomes impossible. > >> Robert's idea of using normal table inheritance means we can access/move >> the data independently of the partitioning system. My guess is that we >> will need to do repartitioning with some tool, rather than as part of >> normal database operation. > > Doing it as some tool seems like a hack to me. And since the idea was (I > thought) that partitions would not be directly accessable from SQL, it > has to be in the database itself. I agree. 
My main point about reusing the inheritance stuff we've done over the years is that we shouldn't reinvent the wheel, but rather build on what we've already got. If the proposed design somehow involved treating all of the partitions as belonging to the same TID space (which doesn't really seem possible, but let's suspend disbelief) so that you could have a single index that covers all the partitions, and the system would somehow work out which TIDs live in which physical files, then it would be reasonable to view the storage layer as an accident that higher levels of the system don't need to know anything about. But the actual proposal involves having multiple relations that have to get planned just like real tables, and that means all the optimizations that we've done on gathering statistics for inheritance hierarchies, and MergeAppend, and every other bit of planner smarts that we have will be applicable to this new method, too. Let's not do anything that forces us to reinvent all of those things. Now, to be fair, one could certainly argue (and I would agree) that the existing optimizations are insufficient. In particular, the fact that SELECT * FROM partitioned_table WHERE not_the_partitioning_key = 1 has to be planned separately for every partition is horrible, and the fact that SELECT * FROM partitioned_table WHERE partitioning_key = 1 has to use an algorithm that is both O(n) in the partition count and has a relatively high constant factor to exclude all of the non-matching partitions also sucks. But I think we're better off trying to view those as further optimizations that we can apply to certain special cases of partitioning - e.g. when the partitioning syntax is used, constrain all the tables to have identical tuple descriptors and matching indexes (and maybe constraints) so that when you plan, you can do it once and then use the transposed plan for all partitions. Figuring out how to do run-time partition pruning would be awesome, too. 
But I don't see that any of this stuff gets easier by ignoring what's already been built; then you're likely to spend all your time reinventing the crap we've already done, and any cases where the new system misses an optimization that's been achieved in the current system become unpleasant dilemmas for our users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I tend to agree with Robert that partitioning should continue using the inheritance-based implementation. In addition to his point about reinventing things, it could be pointed out that there are discussions/proposals elsewhere about building foreign table inheritance capability; having partitioning use the same general infrastructure would pave the way for including sharding features more easily in the future (perhaps sooner). Maybe I am missing something, but isn't it the case that making partitions a physical implementation detail would make it difficult to support individual partitions being on different servers (sharding, basically)? Moreover, recent FDW development seems to be headed in the direction of substantial core support for foreign objects/tables; it seems worthwhile for the partitioning design to assume a course so that future sharding feature developers can leverage both. Perhaps I am just speculating here, but I thought of adding this one point to the discussion. Having said that, it can also be seen that the subset of inheritance infrastructure that constitutes the partitioning support machinery would have to be changed considerably if we are now onto partitioning 2.0 here. -- Amit
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- > owner@postgresql.org] On Behalf Of Amit Langote > Sent: Friday, September 19, 2014 2:13 PM > To: robertmhaas@gmail.com > Cc: pgsql-hackers@postgresql.org; bruce@momjian.us; tgl@sss.pgh.pa.us; > alvherre@2ndquadrant.com > Subject: Re: [HACKERS] On partitioning > > Hi, > Apologize for having broken the original thread. :( This was supposed to be in reply to - http://www.postgresql.org/message-id/CA+Tgmob5DEtO4SbD15q0OQJjyc05cTk8043Utwu_=XDtvyGNSw@mail.gmail.com -- Amit
On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote: > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja > reference Tom's post > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us > which mentions the possibility of a different partitioning > implementation than what we have so far. As it turns out, I've been > thinking about partitioning recently, so I thought I would share what > I'm thinking so that others can poke holes. My intention is to try to > implement this as soon as possible. I realize there hasn't been much progress on this thread, but I wanted to chime in to say I think our current partitioning implementation is too heavy administratively, error-prone, and performance-heavy. I support a redesign of this feature. I think the current mixture of inheritance, triggers/rules, and check constraints can be properly characterized as a Frankenstein solution, where we paste together parts until we get something that works --- our partitioning badly needs a redesign. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote: > > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja > > reference Tom's post > > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us > > which mentions the possibility of a different partitioning > > implementation than what we have so far. As it turns out, I've been > > thinking about partitioning recently, so I thought I would share what > > I'm thinking so that others can poke holes. My intention is to try to > > implement this as soon as possible. > > I realize there hasn't been much progress on this thread, but I wanted > to chime in to say I think our current partitioning implementation is > too heavy administratively, error-prone, and performance-heavy. On the contrary, I think there was lots of progress; there's lots of useful feedback from the initial design proposal I posted. I am a bit sad to admit that I'm not working on it at the moment as I had originally planned, though, because other priorities slipped in and I am not able to work on this for a while. Therefore if someone else wants to work on this topic, be my guest -- otherwise I hope to get on it in a few months. > I support a redesign of this feature. I think the current mixture of > inheritance, triggers/rules, and check constraints can be properly > characterized as a Frankenstein solution, where we paste together parts > until we get something that works --- our partitioning badly needs a > redesign. Agreed, and I don't think just hiding the stitches is good enough. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > On Fri, Aug 29, 2014 at 11:56:07AM -0400, Alvaro Herrera wrote: > > > Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja > > > reference Tom's post > > > http://www.postgresql.org/message-id/1598.1399826841@sss.pgh.pa.us > > > which mentions the possibility of a different partitioning > > > implementation than what we have so far. As it turns out, I've been > > > thinking about partitioning recently, so I thought I would share what > > > I'm thinking so that others can poke holes. My intention is to try to > > > implement this as soon as possible. > > > > I realize there hasn't been much progress on this thread, but I wanted > > to chime in to say I think our current partitioning implementation is > > too heavy administratively, error-prone, and performance-heavy. > > On the contrary, I think there was lots of progress; there's lots of > useful feedback from the initial design proposal I posted. I am a bit > sad to admit that I'm not working on it at the moment as I had > originally planned, though, because other priorities slipped in and I am > not able to work on this for a while. Therefore if someone else wants > to work on this topic, be my guest -- otherwise I hope to get on it in a > few months. Oh, I just meant code progress --- I agree the discussion was fruitful. > > I support a redesign of this feature. I think the current mixture of > > inheritance, triggers/rules, and check constraints can be properly > > characterized as a Frankenstein solution, where we paste together parts > > until we get something that works --- our partitioning badly needs a > > redesign. > > Agreed, and I don't think just hiding the stitches is good enough. LOL, yeah. I do training on partitioning occasionally and the potential for mistakes is huge. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Hi, > On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote: > > Bruce Momjian wrote: > > > I realize there hasn't been much progress on this thread, but I wanted > > > to chime in to say I think our current partitioning implementation is > > > too heavy administratively, error-prone, and performance-heavy. > > > > On the contrary, I think there was lots of progress; there's lots of > > useful feedback from the initial design proposal I posted. I am a bit > > sad to admit that I'm not working on it at the moment as I had > > originally planned, though, because other priorities slipped in and I am > > not able to work on this for a while. Therefore if someone else wants > > to work on this topic, be my guest -- otherwise I hope to get on it in a > > few months. > > Oh, I just meant code progress --- I agree the discussion was fruitful. > FWIW, I think Robert's criticism regarding not basing this on inheritance scheme was not responded to. He mentions a patch by Itagaki-san (four years ago, abandoned unfortunately); details here: https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress This patch could be resurrected fixing some parts of it as was suggested at the time. But, the most important decisions regarding the patch like storage structure, syntax etc. would require building some consensus whether this is a worthwhile direction. At least some consideration must be given to the idea that we might want to have remote partitions backed by FDW infrastructure in near future, although that may not be the primary goal of partitioning effort. What do others think? -- Amit
Amit Langote wrote: > > On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote: > > > Bruce Momjian wrote: > > > > I realize there hasn't been much progress on this thread, but I wanted > > > > to chime in to say I think our current partitioning implementation is > > > > too heavy administratively, error-prone, and performance-heavy. > > > > > > On the contrary, I think there was lots of progress; there's lots of > > > useful feedback from the initial design proposal I posted. I am a bit > > > sad to admit that I'm not working on it at the moment as I had > > > originally planned, though, because other priorities slipped in and I am > > > not able to work on this for a while. Therefore if someone else wants > > > to work on this topic, be my guest -- otherwise I hope to get on it in a > > > few months. > > > > Oh, I just meant code progress --- I agree the discussion was fruitful. > > FWIW, I think Robert's criticism regarding not basing this on inheritance > scheme was not responded to. It was responded to by ignoring it. I didn't see anybody else supporting the idea that inheritance is in any way a sane thing to base partitioning on. Sure, we have accumulated lots of kludges over the years to cope with the fact that, really, it doesn't work very well. So what. We can keep them, I don't care. Anyway as I said above, I'm not particularly interested in any more discussion on this topic for the time being, since I don't have time to work on this patch. If anybody wants to continue discussing to improve the design some more, and even implement it or parts of it, that's fine with me -- but please expect me not to answer. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote: > Amit Langote wrote: > > > > On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote: > > > > Bruce Momjian wrote: > > > > > I realize there hasn't been much progress on this thread, but I wanted > > > > > to chime in to say I think our current partitioning implementation is > > > > > too heavy administratively, error-prone, and performance-heavy. > > > > > > > > On the contrary, I think there was lots of progress; there's lots of > > > > useful feedback from the initial design proposal I posted. I am a bit > > > > sad to admit that I'm not working on it at the moment as I had > > > > originally planned, though, because other priorities slipped in and I am > > > > not able to work on this for a while. Therefore if someone else wants > > > > to work on this topic, be my guest -- otherwise I hope to get on it in a > > > > few months. > > > > > > Oh, I just meant code progress --- I agree the discussion was fruitful. > > > > FWIW, I think Robert's criticism regarding not basing this on inheritance > > scheme was not responded to. > > It was responded to by ignoring it. I didn't see anybody else > supporting the idea that inheritance is in any way a sane thing to base > partitioning on. Sure, we have accumulated lots of kludges over the > years to cope with the fact that, really, it doesn't work very well. So > what. We can keep them, I don't care. As far as I understood Robert's criticism it was more about the internals, than about the userland representation. To me it's absolutely clear that 'real partitioning' userland shouldn't be based on the current hacks to allow it. But I do think that a first step very well might reuse the planner/executor smarts about it. Even a good chunk of the tablecmds.c logic might be reusable for individual partitions without much change. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, > From: Andres Freund [mailto:andres@2ndquadrant.com] > On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote: > > Amit Langote wrote: > > > FWIW, I think Robert's criticism regarding not basing this on inheritance > > > scheme was not responded to. > > > > It was responded to by ignoring it. I didn't see anybody else > > supporting the idea that inheritance is in any way a sane thing to base > > partitioning on. Sure, we have accumulated lots of kludges over the > > years to cope with the fact that, really, it doesn't work very well. So > > what. We can keep them, I don't care. > > As far as I understdood Robert's criticism it was more about the > internals, than about the userland representation. To me it's absolutely > clear that 'real partitioning' userland shouldn't be based on the > current hacks to allow it. For my understanding: By partitioning 'userland' representation, do you mean an implementation choice where a partition is literally an inheritance child of the partitioned table as registered in pg_inherits? Or something else? Thanks, Amit
On 2014-10-28 14:34:22 +0900, Amit Langote wrote: > > Hi, > > > From: Andres Freund [mailto:andres@2ndquadrant.com] > > On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote: > > > Amit Langote wrote: > > > > FWIW, I think Robert's criticism regarding not basing this on > inheritance > > > > scheme was not responded to. > > > > > > It was responded to by ignoring it. I didn't see anybody else > > > supporting the idea that inheritance is in any way a sane thing to base > > > partitioning on. Sure, we have accumulated lots of kludges over the > > > years to cope with the fact that, really, it doesn't work very well. So > > > what. We can keep them, I don't care. > > > > As far as I understdood Robert's criticism it was more about the > > internals, than about the userland representation. To me it's absolutely > > clear that 'real partitioning' userland shouldn't be based on the > > current hacks to allow it. > > For my understanding: > > By partitioning 'userland' representation, do you mean an implementation > choice where a partition is literally an inheritance child of the partitioned > table as registered in pg_inherits? Or something else? Yes, I mean explicit usage of INHERITS. In my opinion we can reuse (some of) the existing logic for INHERITS to implement "proper" partitioning, but that should be an implementation detail. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com> wrote: > In my opinion we can reuse (some of) the existing logic for INHERITS to > implement "proper" partitioning, but that should be an implementation > detail. Sure, that would be a sensible way to do it. I mostly care about not throwing out all the work that's been done on the planner and executor. Maybe you're thinking we'll eventually replace that with something better, which is fine, but I wouldn't underestimate the effort to make that happen. For example, I think it'd be sensible for the first patch to just add some new user-visible syntax with some additional catalog representation that doesn't actually do all that much yet. Then subsequent patches could use that additional metadata to optimize partition pruning, implement tuple routing, etc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-10-28 08:19:36 -0400, Robert Haas wrote: > On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > In my opinion we can reuse (some of) the existing logic for INHERITS to > > implement "proper" partitioning, but that should be an implementation > > detail. > > Sure, that would be a sensible way to do it. I mostly care about not > throwing out all the work that's been done on the planner and > executor. In that case I'm not sure if there's actual disagreement here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, > owner@postgresql.org] On Behalf Of Robert Haas > Sent: Tuesday, October 28, 2014 9:20 PM > > On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund <andres@2ndquadrant.com> > wrote: > > In my opinion we can reuse (some of) the existing logic for INHERITS to > > implement "proper" partitioning, but that should be an implementation > > detail. > > Sure, that would be a sensible way to do it. I mostly care about not > throwing out all the work that's been done on the planner and > executor. Maybe you're thinking we'll eventually replace that with > something better, which is fine, but I wouldn't underestimate the > effort to make that happen. For example, I think it's be sensible for > the first patch to just add some new user-visible syntax with some > additional catalog representation that doesn't actually do all that > much yet. Then subsequent patches could use that additional metadata > to optimize partition prune, implement tuple routing, etc. > I mentioned upthread the possibility of resurrecting Itagaki-san's patch [1] to try to make things work in this direction. I would be willing to spend time on this. I see useful reviews of the patch by Robert [2], Simon [3] at the time, but it wasn't pursued further. I think those reviews were valuable design input that IMHO would still be relevant. It seems the reviews suggested some significant changes to the design proposed. Of course, there are many other considerations discussed upthread that need to be addressed. Incorporating those changes and others, I think such an approach could be worthwhile. Thoughts? [1] https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress [2] http://www.postgresql.org/message-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com [3] http://www.postgresql.org/message-id/1279196337.1735.9598.camel@ebony Thanks, Amit
On Thu, Nov 6, 2014 at 9:17 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I mentioned upthread about the possibility of resurrecting Itagaki-san's patch [1] to try to make things work in this direction. I would be willing to spend time on this. I see useful reviews of the patch by Robert [2], Simon [3] at the time but it wasn't pursued further. I think those reviews were valuable design input that IMHO would still be relevant. It seems the reviews suggested some significant changes to the design proposed. Of course, there are many other considerations discussed upthread that need to be addressed. Incorporating those changes and others, I think such an approach could be worthwhile. I'd be in favor of that. I am not sure whether the code is close enough to what we need to be really useful, but that's for you to decide. In my view, the main problem we should be trying to solve here is "avoid relying on constraint exclusion". In other words, the syntax for adding a partition should put some metadata into the system catalogs that lets us do partition pruning very very quickly, without theorem-proving. For example, for list or range partitioning, a list of partition bounds would be just right: you could binary-search it. The same metadata should also be suitable for routing inserts to the proper partition, and handling partition motion when a tuple is updated. Now there's other stuff we might very well want to do, but I think making partition pruning and tuple routing fast would be a pretty big win by itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
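To illustrate the kind of lookup Robert describes here, the following is a minimal, hypothetical C sketch (none of these names exist in PostgreSQL; integer keys and exclusive upper bounds are assumed for simplicity): once the catalog stores a sorted array of range-partition upper bounds, pruning and tuple routing reduce to a binary search that is O(log n) in the partition count, rather than an O(n) pass of constraint-exclusion proofs.

```c
#include <assert.h>

/*
 * Hypothetical descriptor for a range-partitioned table.  upper_bounds[i]
 * is the exclusive upper bound of partition i, sorted ascending; the first
 * partition is unbounded below in this sketch.
 */
typedef struct RangePartDesc
{
    int         nparts;
    const int  *upper_bounds;
} RangePartDesc;

/*
 * Return the index of the partition that would hold 'key', or -1 if the
 * key is at or above the highest bound.  This is the binary search that
 * replaces per-partition CHECK-constraint theorem-proving.
 */
static int
range_partition_for_key(const RangePartDesc *desc, int key)
{
    int lo = 0, hi = desc->nparts;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (key < desc->upper_bounds[mid])
            hi = mid;           /* key falls at or below partition mid */
        else
            lo = mid + 1;       /* key falls above partition mid */
    }
    return (lo < desc->nparts) ? lo : -1;
}
```

The same search serves both pruning (restrict a scan to one partition) and insert routing; a real implementation would compare arbitrary datums through the opclass comparator recorded with the partitioning key, as discussed upthread.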
Hi, > From: Robert Haas [mailto:robertmhaas@gmail.com] > Sent: Saturday, November 08, 2014 5:41 AM > > I'd be in favor of that. Thanks! > I am not sure whether the code is close > enough to what we need to be really useful, but that's for you to > decide. Hmm, I'm not entirely convinced about the patch as it stands either, but I will try to restate below what the patch in its current state does anyway (just to refresh):

The patch provides syntax to:
* Specify a partitioning key and optional partition definitions within CREATE TABLE,
* A few ALTER TABLE commands that let you define a partitioning key (partitioning a table after the fact), attach/detach an existing table as a partition of a partitioned table,
* CREATE PARTITION to create a new partition on a partitioned table.

The above commands are merely transformed into ALTER TABLE subcommands that arrange the partitioned table and partitions into an inheritance hierarchy, but with extra information, that is, allowed values for the partition in a new anyarray column called 'pg_inherits.values'. A special case of ATExecAddInherit(), namely ATExecAttachPartitionI(), as part of its processing, also adds partition constraints in the form of appropriate CHECK constraints. So, a few of the manual steps are automated and additional (IMHO non-opaque) metadata (namely partition boundaries/list values) is added. Additionally, defining a partitioning key (PARTITION BY) creates a pg_partition entry that specifies for a partitioned table the following - partition kind (range/list), an opclass for the key value comparison and a key 'expression' (say, "colname % 10").
A few key things I can think of as needing improvement would be (perhaps just reiterating a review of the patch):
* partition pruning would still depend on constraint exclusion using the CHECK constraints (same old)
* there is no tuple-routing at all (same can be said of partition pruning above)
* partition pruning or tuple-routing would require a scan over pg_inherits (perhaps inefficient)
* the partitioning key is an expression, which might not be a good idea in early stages of the implementation (might be better off with just the attnum of the column to partition on?)
* there is no DROP PARTITION (in fact, it is suggested not to go the CREATE/DROP PARTITION route at all) -> ALTER TABLE ... ADD/DROP PARTITION?

Some other important ones:
* dependency handling related oversights
* constraint propagation related oversights

And then some of the oddities of behaviour that I am seeing while trying out things that the patch does. Please feel free to suggest those that I am not seeing. I am sure these improvements need more than just tablecmds.c hacking, which is what the current patch mostly does. The first two points could use separate follow-on patches as I feel they need extensive changes, unless I am missing something. I will try to post possible solutions to these issues provided the metadata in its current form is OK to proceed. > In my view, the main problem we should be trying to solve > here is "avoid relying on constraint exclusion". In other words, the > syntax for adding a partition should put some metadata into the system > catalogs that lets us do partitioning pruning very very quickly, > without theorem-proving. For example, for list or range partitioning, > a list of partition bounds would be just right: you could > binary-search it. The same metadata should also be suitable for > routing inserts to the proper partition, and handling partition motion > when a tuple is updated.
> > Now there's other stuff we might very well want to do, but I think > making partition pruning and tuple routing fast would be a pretty big > win by itself. > Those are definitely the goals worth striving for. Thanks for your time. Regards, Amit
On Mon, Nov 10, 2014 at 8:53 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Above commands are merely transformed into ALTER TABLE subcommands that arrange > partitioned table and partitions into inheritance hierarchy, but with extra > information, that is, allowed values for the partition in a new anyarray column > called 'pg_inherits.values'. A special case of ATExecAddInherit() namely > ATExecAttachPartitionI(), as part of its processing, also adds partition > constraints in the form of appropriate CHECK constraints. So, a few of the > manual steps are automated and additional (IMHO non-opaque) metadata (namely > partition boundaries/list values) is added. I thought putting the partition boundaries into pg_inherits was a strange choice. I'd put it in pg_class, or in pg_partition if we decide to create that. Maybe as anyarray, but I think pg_node_tree might even be better. That can also represent data of some arbitrary type, but it doesn't enforce that everything is uniform. So you could have a list of objects of the form {RANGEPARTITION :lessthan {CONST ...} :partition 16982} or similar. The relcache could load that up and convert the list to a C array, which would then be easy to binary-search. As you say, you also need to store the relevant operator somewhere, and the fact that it's a range partition rather than list or hash, say. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I thought putting the partition boundaries into pg_inherits was a > strange choice. I'd put it in pg_class, or in pg_partition if we > decide to create that. Yeah. I rather doubt that we want this mechanism to be very closely tied to the existing inheritance features. If we do that, we are going to need a boatload of error checks to prevent people from breaking partitioned tables by applying the sort of twiddling that inheritance allows. > Maybe as anyarray, but I think pg_node_tree > might even be better. That can also represent data of some arbitrary > type, but it doesn't enforce that everything is uniform. Of course, the more general you make it, the more likely that it'll be impossible to optimize well. regards, tom lane
On Wed, Nov 12, 2014 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I thought putting the partition boundaries into pg_inherits was a >> strange choice. I'd put it in pg_class, or in pg_partition if we >> decide to create that. > > Yeah. I rather doubt that we want this mechanism to be very closely > tied to the existing inheritance features. If we do that, we are > going to need a boatload of error checks to prevent people from breaking > partitioned tables by applying the sort of twiddling that inheritance > allows. Well, as I said upthread, I think it would be a pretty poor idea to imagine that the first version of this feature is going to obsolete everything we've done with inheritance. Are we going to reinvent the machinery to make inheritance children get scanned when the parent does? Reinvent Merge Append? >> Maybe as anyarray, but I think pg_node_tree >> might even be better. That can also represent data of some arbitrary >> type, but it doesn't enforce that everything is uniform. > > Of course, the more general you make it, the more likely that it'll be > impossible to optimize well. The point for me is just that range and list partitioning probably need different structure, and hash partitioning, if we want to support that, needs something else again. Range partitioning needs an array of partition boundaries and an array of child OIDs. List partitioning needs an array of specific values and a child table OID for each. Hash partitioning needs something probably quite different. We might be able to do it as a pair of arrays - one of type anyarray and one of type OID - and meet all needs that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
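As a rough illustration of the "pair of arrays" idea Robert floats above, here is a hypothetical C sketch (all names and types are invented for the example): one datum array playing the role of the anyarray catalog column, and a parallel array of child-table OIDs. For list partitioning, routing a value then becomes a search over the value array rather than a theorem-proving pass over CHECK constraints.

```c
#include <assert.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid type */

/*
 * Hypothetical descriptor for a list-partitioned table: values[] holds the
 * accepted values (sorted ascending, the anyarray side of the pair), and
 * children[i] is the OID of the child table that accepts values[i].
 */
typedef struct ListPartDesc
{
    int         nvalues;
    const int  *values;
    const Oid  *children;
} ListPartDesc;

/*
 * Return the child relation OID for 'value', or 0 (InvalidOid) if no
 * partition accepts it.  Binary search over the sorted value array.
 */
static Oid
list_partition_for_value(const ListPartDesc *desc, int value)
{
    int lo = 0, hi = desc->nvalues - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (desc->values[mid] == value)
            return desc->children[mid];
        if (desc->values[mid] < value)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return 0;
}
```

Range partitioning would use the same pair-of-arrays shape with bounds instead of discrete values, while hash partitioning would likely drop the datum array entirely and index children[] by hash bucket, which is why a single fixed catalog layout is hard to settle on.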
On 11/12/14, 5:27 PM, Robert Haas wrote: >>> Maybe as anyarray, but I think pg_node_tree >>> might even be better. That can also represent data of some arbitrary >>> type, but it doesn't enforce that everything is uniform. >> Of course, the more general you make it, the more likely that it'll be >> impossible to optimize well. > The point for me is just that range and list partitioning probably > need different structure, and hash partitioning, if we want to support > that, needs something else again. Range partitioning needs an array > of partition boundaries and an array of child OIDs. List partitioning > needs an array of specific values and a child table OID for each. > Hash partitioning needs something probably quite different. We might > be able to do it as a pair of arrays - one of type anyarray and one of > type OID - and meet all needs that way. Another issue is I don't know that we could support multi-key partitions with something like an anyarray. Perhaps that's OK as a first pass, but I expect it'll be one of the next things folks ask for. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
* Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Nov 12, 2014 at 5:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Maybe as anyarray, but I think pg_node_tree > >> might even be better. That can also represent data of some arbitrary > >> type, but it doesn't enforce that everything is uniform. > > > > Of course, the more general you make it, the more likely that it'll be > > impossible to optimize well. Agreed- a node tree seems a bit too far to make this really work well.. But, I'm curious what you were thinking specifically? A node tree which accepts an "argument" of the constant used in the original query and then spits back a table might work reasonably well for that case- but with declarative partitioning, I expect us to eventually be able to eliminate complete partitions from consideration on both sides of a partition-table join and optimize cases where we have two partitioned tables being joined with a compatible join key and only actually do joins between the partitions which overlap each other. I don't see those happening if we're allowing a node tree (only). If having a node tree is just one option among other partitioning options, then we can provide users with the ability to choose what suits their particular needs. > The point for me is just that range and list partitioning probably > need different structure, and hash partitioning, if we want to support > that, needs something else again. Range partitioning needs an array > of partition boundaries and an array of child OIDs. List partitioning > needs an array of specific values and a child table OID for each. > Hash partitioning needs something probably quite different. We might > be able to do it as a pair of arrays - one of type anyarray and one of > type OID - and meet all needs that way. I agree that these will require different structures in the catalog.. 
While reviewing the superuser checks, I expected to have a similar need and discussed various options- having multiple catalog tables, having a single table with multiple columns, having a single table with a 'type' column and then a bytea blob. In the end, it wasn't really necessary as the only thing which I expected to need more than 'yes/no' were the directory permissions (which it looks like might end up killed anyway, much to my sadness..), but while considering the options, I continued to feel like anything but independent tables was hacking around to try and reduce the number of inodes used for folks who don't actually use these features, and that's a terrible reason to complicate the catalog and code, in my view. It occurs to me that we might be able to come up with a better way to address the inode concern and therefore ignore it. There are other considerations to having more catalog tables, but declarative partitioning is an important enough feature, in my view, that I wouldn't care if it required 10 catalog tables to implement. Misrepresenting it with a catalog that's got a bunch of columns, all but one of which are NULL, or by essentially removing the knowledge of the data type from the system by using a type column with some binary blob, isn't doing ourselves or our users any favors. That's not to say that I'm against a solution which only needs one catalog table, but let's not completely throw away proper structure because of inode or other resource consideration issues. We have quite a few other catalog tables which are rarely used and it'd be good to address the issue with those consuming resources independently. I'm not a fan of using pg_class- there are a number of columns in there which I would *not* wish to be allowed to be different per partition (starting with relowner and relacl...). Making those NULL would be just as bad (probably worse, really, since we'd also need to add new columns to pg_class to indicate the partitioning...)
as having a sparsely populated new catalog table. Thanks! Stephen
> From: Stephen Frost [mailto:sfrost@snowman.net]
> Sent: Thursday, November 13, 2014 3:40 PM
>
> > The point for me is just that range and list partitioning probably need different structure, and hash partitioning, if we want to support that, needs something else again. Range partitioning needs an array of partition boundaries and an array of child OIDs. List partitioning needs an array of specific values and a child table OID for each. Hash partitioning needs something probably quite different. We might be able to do it as a pair of arrays - one of type anyarray and one of type OID - and meet all needs that way.
>
> I agree that these will require different structures in the catalog.. While reviewing the superuser checks, I expected to have a similar need and discussed various options- having multiple catalog tables, having a single table with multiple columns, having a single table with a 'type' column and then a bytea blob. In the end, it wasn't really necessary as the only thing which I expected to need more than 'yes/no' were the directory permissions (which it looks like might end up killed anyway, much to my sadness..), but while considering the options, I continued to feel like anything but independent tables was hacking around to try and reduce the number of inodes used for folks who don't actually use these features, and that's a terrible reason to complicate the catalog and code, in my view.

Greenplum uses a single table for this purpose with separate columns for the range and list cases, for example. They store allowed values per partition, though. They have 6 partitioning-related catalog/system views, by the way. Perhaps interesting as a reference.

http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partitions.html

Thanks,
Amit
> owner@postgresql.org] On Behalf Of Amit Langote
> Sent: Thursday, November 13, 2014 3:50 PM
>
> Greenplum uses a single table for this purpose with separate columns for the range and list cases, for example. They store allowed values per partition, though. They have 6 partitioning-related catalog/system views, by the way. Perhaps interesting as a reference.
>
> http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partitions.html

Oops, wrong link. Use this one instead.

http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition_rule.html

> Thanks,
> Amit
On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Agreed- a node tree seems a bit too far to make this really work well.. But, I'm curious what you were thinking specifically?

I gave a pretty specific example in my email.

> A node tree which accepts an "argument" of the constant used in the original query and then spits back a table might work reasonably well for that case-

A node tree is not a function. It's a data structure. So it doesn't have arguments.

> but with declarative partitioning, I expect us to eventually be able to eliminate complete partitions from consideration on both sides of a partition-table join and optimize cases where we have two partitioned tables being joined with a compatible join key and only actually do joins between the partitions which overlap each other. I don't see those happening if we're allowing a node tree (only). If having a node tree is just one option among other partitioning options, then we can provide users with the ability to choose what suits their particular needs.

This seems completely muddled to me. What we're talking about is how to represent the partition definition in the system catalogs. I'm not proposing that the user would "partition by pg_node_tree"; what the heck would that even mean? I'm proposing one way of serializing the partition definitions that the user specifies into something that can be stored in a system catalog, which happens to reuse the existing infrastructure that we use for that same purpose in various other places. I don't have a problem with somebody coming up with another way of representing the data in the catalogs; I'm just brainstorming. But saying that we'll be able to optimize joins better if we store the same data as anyarray rather than pg_node_tree or vice versa doesn't make any sense at all.
> I'm not a fan of using pg_class- there are a number of columns in there > which I would *not* wish to be allowed to be different per partition > (starting with relowner and relacl...). Making those NULL would be just > as bad (probably worse, really, since we'd also need to add new columns > to pg_class to indicate the partitioning...) as having a sparsely > populated new catalog table. I think you are, again, confused as to what we're discussing. Nobody, including Alvaro, has proposed a design where the individual partitions don't have pg_class entries of some kind. What we're talking about is where to store the metadata for partition exclusion and tuple routing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Nov 13, 2014 at 1:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > but with declarative partitioning, I expect us to eventually be able to eliminate complete partitions from consideration on both sides of a partition-table join and optimize cases where we have two partitioned tables being joined with a compatible join key and only actually do joins between the partitions which overlap each other. I don't see those happening if we're allowing a node tree (only). If having a node tree is just one option among other partitioning options, then we can provide users with the ability to choose what suits their particular needs.
>
> This seems completely muddled to me. What we're talking about is how to represent the partition definition in the system catalogs. I'm not proposing that the user would "partition by pg_node_tree"; what the heck would that even mean?

They'd provide an expression which would be able to identify the partition to be used. In a way, this is exactly how many folks do partitioning today with inheritance- consider the if/else trees in triggers for handling new data coming into the parent table. That's also why it wouldn't be easy to optimize for.

> I'm proposing one way of serializing the partition definitions that the user specifies into something that can be stored in a system catalog, which happens to reuse the existing infrastructure that we use for that same purpose in various other places.

Ok, I didn't immediately see how a node tree would be used for this- but I admit that I've not gone back through the entirety of this iteration of the partitioning discussion.

> I don't have a problem with somebody coming up with another way of representing the data in the catalogs; I'm just brainstorming.

Ditto.
> But saying that we'll be able to optimize joins better if we store the same data as anyarray rather than pg_node_tree or vice versa doesn't make any sense at all.

Ok, if the node tree is constrained in what can be stored in it, then I understand how we could still optimize based on what we've stored in it. I'm not entirely sure a node tree makes sense, but at least I understand better.

> > I'm not a fan of using pg_class- there are a number of columns in there which I would *not* wish to be allowed to be different per partition (starting with relowner and relacl...). Making those NULL would be just as bad (probably worse, really, since we'd also need to add new columns to pg_class to indicate the partitioning...) as having a sparsely populated new catalog table.
>
> I think you are, again, confused as to what we're discussing. Nobody, including Alvaro, has proposed a design where the individual partitions don't have pg_class entries of some kind. What we're talking about is where to store the metadata for partition exclusion and tuple routing.

This discussion has gone a few rounds before and, yes, I was just jumping into the middle of this particular round, but I'm pretty sure I'm not the first to point out that storing the individual partition information in pg_class isn't ideal, since there are pieces that we don't actually want to be different per partition, as I outlined previously. Perhaps what that means is we should actually go the other way and move *those* columns into a new catalog instead.

Consider this (totally off-the-cuff):

pg_relation (pg_tables? pg_heaps?)
    oid
    relname
    relnamespace
    reltype
    reloftype
    relowner
    relam (?)
    relhas*
    relisshared
    relpersistence
    relkind (?)
    relnatts
    relchecks
    relacl
    reloptions
    relhowpartitioned (?)

pg_class
    pg_relation.oid
    relfilenode
    reltablespace
    relpages
    reltuples
    reltoastrelid
    reltoastidxid
    relfrozenxid
    relhasindexes (?)
    relpartitioninfo (whatever this ends up being)

The general idea being to separate the user-facing notion of a "table" from the underlying implementation, with the implementation allowing multiple sets of files to be used for each table. It's certainly not for the faint of heart, but we saw what happened with our inheritance structure allowing different permissions on the child tables- we ended up creating a pretty grotty hack to deal with it (going through the parent bypasses the permissions). That's the best solution for that situation, but it's far from ideal and it'd be nice to avoid that same risk with partitioning. Additionally, if each partition is in pg_class, how are we handling name conflicts? Why do individual partitions even need to have a name? Do we allow queries against them directly? etc..

These are just my thoughts on it and I really don't intend to derail progress on having a partitioning system, and I hope that my comments don't lead to that happening.

Thanks,

Stephen
On Thu, Nov 13, 2014 at 9:12 PM, Stephen Frost <sfrost@snowman.net> wrote: >> > I'm not a fan of using pg_class- there are a number of columns in there >> > which I would *not* wish to be allowed to be different per partition >> > (starting with relowner and relacl...). Making those NULL would be just >> > as bad (probably worse, really, since we'd also need to add new columns >> > to pg_class to indicate the partitioning...) as having a sparsely >> > populated new catalog table. >> >> I think you are, again, confused as to what we're discussing. Nobody, >> including Alvaro, has proposed a design where the individual >> partitions don't have pg_class entries of some kind. What we're >> talking about is where to store the metadata for partition exclusion >> and tuple routing. > > This discussion has gone a few rounds before and, yes, I was just > jumping into the middle of this particular round, but I'm pretty sure > I'm not the first to point out that storing the individual partition > information into pg_class isn't ideal since there are pieces that we > don't actually want to be different per partition, as I outlined > previously. Perhaps what that means is we should actually go the other > way and move *those* columns into a new catalog instead. > > Consider this (totally off-the-cuff): > > pg_relation (pg_tables? pg_heaps?) > oid > relname > relnamespace > reltype > reloftype > relowner > relam (?) > relhas* > relisshared > relpersistence > relkind (?) > relnatts > relchecks > relacl > reloptions > relhowpartitioned (?) > > pg_class > pg_relation.oid > relfilenode > reltablespace > relpages > reltuples > reltoastrelid > reltoastidxid > relfrozenxid > relhasindexes (?) > relpartitioninfo (whatever this ends up being) > > The general idea being to seperate the user-facing notion of a "table" > from the underlying implementation, with the implementation allowing > multiple sets of files to be used for each table. 
It's certainly not > for the faint of heart, but we saw what happened with our inheiritance > structure allowing different permissions on the child tables- we ended > up creating a pretty grotty hack to deal with it (going through the > parent bypasses the permissions). That's the best solution for that > situation, but it's far from ideal and it'd be nice to avoid that same > risk with partitioning. Additionally, if each partition is in pg_class, > how are we handling name conflicts? Why do individual partitions even > need to have a name? Do we allow queries against them directly? etc.. There's certainly something to this, but "not for the faint of heart" sounds like an understatement. One of the good things about inheritance is that, if the system doesn't automatically do the right thing, there's usually an escape hatch. If the INSERT trigger you use for tuple routing is too slow, you can insert directly into the target partition. If your query doesn't realize that it can prune away all the partitions but one, or takes too long to do it, you can query directly against that partition. These aren't beautiful things and I'm sure we're all united in wanting a mechanism that will reduce the need to do them, but we need to make sure that we are removing the need for the escape hatch, and not just cementing it shut. In other words, I don't think there is a problem with people querying child tables directly; the problem is that people are forced to do so in order to get good performance. We shouldn't remove the ability for people to do that unless we're extremely certain we've fixed the problem that leads them to wish to do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert,

> I thought putting the partition boundaries into pg_inherits was a strange choice. I'd put it in pg_class, or in pg_partition if we decide to create that.

Hmm, yeah I guess we are better off using pg_inherits for just saying that a partition is an inheritance child. Other details should go elsewhere for sure.

> Maybe as anyarray, but I think pg_node_tree might even be better. That can also represent data of some arbitrary type, but it doesn't enforce that everything is uniform. So you could have a list of objects of the form {RANGEPARTITION :lessthan {CONST ...} :partition 16982} or similar. The relcache could load that up and convert the list to a C array, which would then be easy to binary-search.
>
> As you say, you also need to store the relevant operator somewhere, and the fact that it's a range partition rather than list or hash, say.

I'm wondering here if it's better to keep partition values per partition, wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def.

pg_partitioned_rel stores information like partition kind, key (attribute number(s)?), key opclass(es). Optionally, we could also say here if a given record (in pg_partitioned_rel) represents an actual top-level partitioned table or a partition that is sub-partitioned (wherein this record is just a dummy for keys of sub-partitioning and such); something like partisdummy...

pg_partition_def stores information of individual partitions (/sub-partitions, too?) such as its parent (either an actual top-level partitioned table or a sub-partitioning template), whether this is an overflow/default partition, and partition values.

Such a scheme would be similar to what Greenplum [1] has. Perhaps this duplicates inheritance and can be argued in that sense, though.

Do you think keeping partition-defining values with the top-level partitioned table would make some partitioning schemes (multikey, sub-, etc.) a bit complicated to implement? I cannot offhand imagine the actual implementation difficulties that might be involved myself, but perhaps you have a better idea of such details and would have a say...

Thanks,
Amit

[1] http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition_rule.html
http://gpdb.docs.pivotal.io/4330/index.html#ref_guide/system_catalogs/pg_partition.html
On Wed, Nov 19, 2014 at 10:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Maybe as anyarray, but I think pg_node_tree might even be better. That can also represent data of some arbitrary type, but it doesn't enforce that everything is uniform. So you could have a list of objects of the form {RANGEPARTITION :lessthan {CONST ...} :partition 16982} or similar. The relcache could load that up and convert the list to a C array, which would then be easy to binary-search.
>>
>> As you say, you also need to store the relevant operator somewhere, and the fact that it's a range partition rather than list or hash, say.
>
> I'm wondering here if it's better to keep partition values per partition, wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def.
>
> pg_partitioned_rel stores information like partition kind, key (attribute number(s)?), key opclass(es). Optionally, we could also say here if a given record (in pg_partitioned_rel) represents an actual top-level partitioned table or a partition that is sub-partitioned (wherein this record is just a dummy for keys of sub-partitioning and such); something like partisdummy...
>
> pg_partition_def stores information of individual partitions (/sub-partitions, too?) such as its parent (either an actual top-level partitioned table or a sub-partitioning template), whether this is an overflow/default partition, and partition values.

Yeah, you could do something like this. There's a certain overhead to adding additional system catalogs, though. It means more inodes on disk, probably more syscaches, and more runtime spent probing those additional syscache entries to assemble a relcache entry. On the other hand, it's got a certain conceptual cleanliness to it. I do think at a very minimum it's important to have a Boolean flag in pg_class so that we need not probe what you're calling pg_partitioned_rel if no partitioning information is present there.
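The relcache scheme sketched above (deserialize the bound list, convert it to a sorted array, then binary-search it to route a tuple) can be modeled in a few lines. This is a rough Python sketch of the idea, not PostgreSQL internals; the bound values and partition OIDs are invented:

```python
import bisect

def route_range_tuple(upper_bounds, partition_oids, key):
    """Route a key to a range partition under VALUES LESS THAN semantics:
    partition i accepts keys that are >= the previous bound and strictly
    less than upper_bounds[i].  One binary search over the sorted bound
    array finds the target; bisect_right is used so that a key equal to
    a bound falls into the *next* partition."""
    idx = bisect.bisect_right(upper_bounds, key)
    if idx == len(upper_bounds):
        raise ValueError("no partition accepts key %r" % (key,))
    return partition_oids[idx]

bounds = [100, 200, 300]      # VALUES LESS THAN 100 / 200 / 300
oids = [16982, 16985, 16988]  # hypothetical partition OIDs
print(route_range_tuple(bounds, oids, 150))  # -> 16985
```

The same lookup also serves partition exclusion: a constant from a query's WHERE clause routed this way identifies the single partition a point predicate can touch.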
I might be tempted to go further and add the information you are proposing to put in pg_partitioned_rel in pg_class instead, and just add one new catalog. But it depends on how many columns we end up with. Before going too much further with this I'd mock up schemas for your proposed catalogs and a list of DDL operations to be supported, with the corresponding syntax, and float that here for comment. > Such a scheme would be similar to what Greenplum [1] has. Interesting. > Perhaps this duplicates inheritance and can be argued in that sense, though. > > Do you think keeping partition defining values with the top-level partitioned table would make some partitioning schemes(multikey, sub- , etc.) a bit complicated to implement? I cannot offhand imagine the actual implementation difficultiesthat might be involved myself but perhaps you have a better idea of such details and would have a say... I don't think this is a big deal one way or the other. We're all database folks here, so deciding to normalize for performance or denormalize for conceptual cleanliness shouldn't tax our powers unduly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, > > I'm wondering here if it's better to keep partition values per partition > > wherein we have two catalogs, say, pg_partitioned_rel and pg_partition_def. > > > > pg_partitioned_rel stores information like partition kind, key (attribute > > number(s)?), key opclass(es). Optionally, we could also say here if a given > > record (in pg_partitioned_rel) represents an actual top-level partitioned table > > or a partition that is sub-partitioned (wherein this record is just a dummy for > > keys of sub-partitioning and such); something like partisdummy... > > > > pg_partition_def stores information of individual partitions (/sub-partitions, > > too?) such as its parent (either an actual top level partitioned table or a sub- > > partitioning template), whether this is an overflow/default partition, and > > partition values. > > Yeah, you could do something like this. There's a certain overhead to > adding additional system catalogs, though. It means more inodes on > disk, probably more syscaches, and more runtime spent probing those > additional syscache entries to assemble a relcache entry. On the > other hand, it's got a certain conceptual cleanliness to it. > Hmm, this could be a concern. > I do think at a very minimum it's important to have a Boolean flag in > pg_class so that we need not probe what you're calling > pg_partitioned_rel if no partitioning information is present there. I > might be tempted to go further and add the information you are > proposing to put in pg_partitioned_rel in pg_class instead, and just > add one new catalog. But it depends on how many columns we end up > with. > I think something like pg_class.relispartitioned would be good as a minimum like you said. > Before going too much further with this I'd mock up schemas for your > proposed catalogs and a list of DDL operations to be supported, with > the corresponding syntax, and float that here for comment. 
>

I came up with something like the following:

* Catalog schema:

CREATE TABLE pg_catalog.pg_partitioned_rel
(
    partrelid    oid NOT NULL,
    partkind     oid NOT NULL,
    partissub    bool NOT NULL,
    partkey      int2vector NOT NULL, -- partitioning attributes
    partopclass  oidvector,

    PRIMARY KEY (partrelid, partissub),
    FOREIGN KEY (partrelid) REFERENCES pg_class (oid),
    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
)
WITHOUT OIDS ;

CREATE TABLE pg_catalog.pg_partition_def
(
    partitionid         oid NOT NULL,
    partitionparentrel  oid NOT NULL,
    partitionisoverflow bool NOT NULL,
    partitionvalues     anyarray,

    PRIMARY KEY (partitionid),
    FOREIGN KEY (partitionid) REFERENCES pg_class (oid)
)
WITHOUT OIDS;

ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;

pg_partitioned_rel stores the partitioning information for a partitioned relation. A pg_class relation has a pg_partitioned_rel entry if pg_class.relispartitioned is 'true'. Though this can be challenged by saying we will want to store a sub-partitioning key here too. Do we want a partition relation to be called partitioned itself for the purpose of underlying subpartitions? 'partissub' would be true in that case.

pg_partition_def has a row for each relation that has defined restrictions on the data that the partkey column can take, aka a partition. The data is known to be within the bounds defined by partitionvalues. Perhaps we could divide this into two, viz. rangeupperbound and listvalues, for the two partition types. When we get to multi-level partitioning (sub-partitioning), the partitions described here would actually be either data-containing relations (lowest level) or placeholder relations (upper level). The parentrel is supposed to make it easier to scan for all partitions of a given partitioned relation. The partitioning hierarchy also stays in the form of inheritance stored elsewhere (pg_inherits).

The main reasoning behind two separate catalogs (or at least keeping partition definitions separate) is to make life easier during future enhancements like sub-partitioning.

* DDL syntax (no multi-column partitioning, sub-partitioning support as yet):

-- create partitioned table and child partitions at once.
CREATE TABLE parent (...)
PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
[ (
    PARTITION child
    {
        VALUES LESS THAN { ... | MAXVALUE }    -- for RANGE
      | VALUES [ IN ] ( { ... | DEFAULT } )    -- for LIST
    }
    [ WITH ( ... ) ] [ TABLESPACE tbs ]
    [, ...]
) ] ;

-- define partitioning key on a table
ALTER TABLE parent PARTITION BY [ RANGE | LIST ] ( key_column ) [ opclass ] [ (...) ] ;

-- create a new partition on a partitioned table with specified values
CREATE PARTITION child ON parent VALUES ...;

-- drop a partition of a partitioned table with specified values
DROP PARTITION child ON parent VALUES ...;

-- attach table as a partition to a partitioned table
ALTER TABLE parent ATTACH PARTITION child VALUES ... ;

-- detach a partition (child continues to exist as a regular table)
ALTER TABLE parent DETACH PARTITION child ;

Thanks,
Amit
Sorry, a correction: > CREATE TABLE pg_catalog.pg_partitioned_rel > ( > partrelid oid NOT NULL, > partkind oid NOT NULL, > partissub bool NOT NULL, > partkey int2vector NOT NULL, -- partitioning attributes > partopclass oidvector, > > PRIMARY KEY (partrelid, partissub), Rather, PRIMARY KEY (partrelid) Thanks, Amit
On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Before going too much further with this I'd mock up schemas for your proposed catalogs and a list of DDL operations to be supported, with the corresponding syntax, and float that here for comment.

More people should really comment on this. This is a pretty big deal if it goes forward, so it shouldn't be based on what one or two people think.

> * Catalog schema:
>
> CREATE TABLE pg_catalog.pg_partitioned_rel
> (
>     partrelid oid NOT NULL,
>     partkind oid NOT NULL,
>     partissub bool NOT NULL,
>     partkey int2vector NOT NULL, -- partitioning attributes
>     partopclass oidvector,
>
>     PRIMARY KEY (partrelid, partissub),
>     FOREIGN KEY (partrelid) REFERENCES pg_class (oid),
>     FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> )
> WITHOUT OIDS ;

So, we're going to support exactly two levels of partitioning? partitions with partissub=false and subpartitions with partissub=true? Why not support only one level of partitioning here, but then let the children have their own pg_partitioned_rel entries if they are subpartitioned? That seems like a cleaner design and lets us support an arbitrary number of partitioning levels if we ever need them.

> CREATE TABLE pg_catalog.pg_partition_def
> (
>     partitionid oid NOT NULL,
>     partitionparentrel oid NOT NULL,
>     partitionisoverflow bool NOT NULL,
>     partitionvalues anyarray,
>
>     PRIMARY KEY (partitionid),
>     FOREIGN KEY (partitionid) REFERENCES pg_class (oid)
> )
> WITHOUT OIDS;
>
> ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;

What is an overflow partition and why do we want that? What are you going to do if the partitioning key has two columns of different data types?

> * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
>
> -- create partitioned table and child partitions at once.
> CREATE TABLE parent (...)
> PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> [ (
>     PARTITION child
>     {
>         VALUES LESS THAN { ... | MAXVALUE }    -- for RANGE
>       | VALUES [ IN ] ( { ... | DEFAULT } )    -- for LIST
>     }
>     [ WITH ( ... ) ] [ TABLESPACE tbs ]
>     [, ...]
> ) ] ;

How are you going to dump and restore this, bearing in mind that you have to preserve a bunch of OIDs across pg_upgrade? What if somebody wants to do pg_dump --table name_of_a_partition?

I actually think it will be much cleaner to declare the parent first and then have separate CREATE TABLE statements that glue the children in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
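The VALUES LESS THAN (1, 10000) example above implies row-wise (lexicographic) comparison of multi-column bounds, which also bears on the earlier question of a partitioning key with columns of different data types: each column is compared with its own type's ordering. A rough Python sketch of the idea, with made-up bounds and child names (Python tuples happen to compare lexicographically, column by column, just like the composite bounds would):

```python
import bisect

def route_multicolumn(upper_bounds, children, key):
    """Multi-column range routing: composite bounds compare column by
    column, so a single binary search still suffices even when the key
    columns have different types (here, an int plus a string)."""
    idx = bisect.bisect_right(upper_bounds, key)
    if idx == len(upper_bounds):
        raise ValueError("no partition accepts key %r" % (key,))
    return children[idx]

# Hypothetical bounds: VALUES LESS THAN (1,'b') / (1,'m') / (2,'a')
bounds = [(1, "b"), (1, "m"), (2, "a")]
children = ["child1", "child2", "child3"]
print(route_multicolumn(bounds, children, (1, "f")))  # -> child2
```

A key exactly equal to a bound routes to the next partition, consistent with the strict "less than" semantics of single-column range partitioning.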
Hi Robert,

From: Robert Haas [mailto:robertmhaas@gmail.com]
> > * Catalog schema:
> >
> > CREATE TABLE pg_catalog.pg_partitioned_rel
> > (
> >     partrelid oid NOT NULL,
> >     partkind oid NOT NULL,
> >     partissub bool NOT NULL,
> >     partkey int2vector NOT NULL, -- partitioning attributes
> >     partopclass oidvector,
> >
> >     PRIMARY KEY (partrelid, partissub),
> >     FOREIGN KEY (partrelid) REFERENCES pg_class (oid),
> >     FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> > )
> > WITHOUT OIDS ;
>
> So, we're going to support exactly two levels of partitioning? partitions with partissub=false and subpartitions with partissub=true? Why not support only one level of partitioning here but then let the children have their own pg_partitioned_rel entries if they are subpartitioned? That seems like a cleaner design and lets us support an arbitrary number of partitioning levels if we ever need them.

Yeah, that's what I thought at some point, in favour of dropping partissub altogether. However, not that this design solves it, there is one question - what if we would want to support defining both the partition key and the sub-partition key for a table in advance? That is, without having defined a first-level partition yet; in that case, what level do we associate the sub-(sub-) partitioning key with, or, more to the point, where do we keep it? One way is to replace partissub by partkeylevel, with level 0 being the topmost-level partitioning key and so on, while keeping partrelid equal to the pg_class.oid of the parent. That brings us to the next question of managing hierarchies in pg_partition_def corresponding to partkeylevel in the definition of the topmost partitioned relation. But I guess those are implementation details rather than representational, unless I am being too naïve.
> > CREATE TABLE pg_catalog.pg_partition_def
> > (
> >     partitionid oid NOT NULL,
> >     partitionparentrel oid NOT NULL,
> >     partitionisoverflow bool NOT NULL,
> >     partitionvalues anyarray,
> >
> >     PRIMARY KEY (partitionid),
> >     FOREIGN KEY (partitionid) REFERENCES pg_class (oid)
> > )
> > WITHOUT OIDS;
> >
> > ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
>
> What is an overflow partition and why do we want that?

That would be a default partition. That is, where the tuples that don't belong elsewhere (other defined partitions) go. The VALUES clause of the definition for such a partition would look like:

(a range partition) ... VALUES LESS THAN MAXVALUE
(a list partition) ... VALUES DEFAULT

There has been discussion about whether there shouldn't be such a place for tuples to go. That is, whether it should generate an error if a tuple can't go anywhere (or support auto-creating a new one, like in interval partitioning?)

> What are you going to do if the partitioning key has two columns of different data types?

Sorry, this totally eluded me. Perhaps the 'values' need some more thought. They are one of the most crucial elements of the scheme.

I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) I think partkind switches the interpretation of the field as appropriate. Am I missing something? By the way, I had mentioned we could have two values fields, one each for the range and list partition kinds.

> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
> >
> > -- create partitioned table and child partitions at once.
> > CREATE TABLE parent (...)
> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> > [ (
> >     PARTITION child
> >     {
> >         VALUES LESS THAN { ... | MAXVALUE }    -- for RANGE
> >       | VALUES [ IN ] ( { ... | DEFAULT } )    -- for LIST
> >     }
> >     [ WITH ( ... ) ] [ TABLESPACE tbs ]
> >     [, ...]
> > ) ] ;
>
> How are you going to dump and restore this, bearing in mind that you have to preserve a bunch of OIDs across pg_upgrade? What if somebody wants to do pg_dump --table name_of_a_partition?

Assuming everything (including the partitioned relation and partitions at all levels) has got a pg_class entry of its own, would OIDs be a problem? Or what is the nature of this problem, if it's possible that it may be one?

If someone pg_dumps an individual partition as a table, we could let it be dumped as just a plain table. I am thinking we should be able to do that or should be doing just that (?)

> I actually think it will be much cleaner to declare the parent first and then have separate CREATE TABLE statements that glue the children in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000).

Oh, do you mean to do away with any syntax for defining partitions in CREATE TABLE parent?

By the way, do you mean the following:

CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000)

Instead of,

CREATE PARTITION child ON parent VALUES LESS THAN 10000?

And as for the dump of a partitioned table, it does sound cleaner to do it piece by piece, starting with the parent and its partitioning key (as ALTER on it?) followed by individual partitions using either of the syntaxes above. Moreover, we dump a sub-partition as a partition of its parent partition.

Thanks for your time and valuable input.

Regards,
Amit
On 12/2/14, 9:43 PM, Amit Langote wrote:
>> > What is an overflow partition and why do we want that?
>
> That would be a default partition. That is, where the tuples that don't
> belong elsewhere (other defined partitions) go. The VALUES clause of the
> definition for such a partition would look like:
>
> (a range partition) ... VALUES LESS THAN MAXVALUE
> (a list partition)  ... VALUES DEFAULT
>
> There has been discussion about whether there should be such a place for
> tuples to go at all. That is, whether it should generate an error if a
> tuple can't go anywhere (or support auto-creating a new one like in
> interval partitioning?)

If we are going to do this, should the data just go into the parent? That's what would happen today. FWIW, I think an overflow would be useful, but there should be a way to (dis|en)able it.

>> > What are you going to do if the partitioning key has two columns of
>> > different data types?
>
> Sorry, this totally eluded me. Perhaps the 'values' field needs some more
> thought. It is one of the most crucial elements of the scheme.
>
> I wonder if your suggestion of pg_node_tree plays well here. This could
> then be a list of Consts or some such... And I am thinking it's a concern
> only for range partitions, no? (that is, a multicolumn partition key)
>
> I think partkind switches the interpretation of the field as appropriate.
> Am I missing something? By the way, I had mentioned we could have two
> values fields, one each for the range and list partition kinds.

The more SQL way would be records (composite types). That would make catalog inspection a LOT easier and presumably make it easier to change the partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records are stored internally as tuples; not sure if that would be faster than a List of Consts or a pg_node_tree. Nodes would theoretically allow using things other than Consts, but I suspect that would be a bad idea.

Something else to consider... our user-space support for ranges is now rangetypes, so perhaps that's what we should use for range partitioning. The up-side (which would be a double-edged sword) is that you could leave holes in your partitioning map. Note that in the multi-key case we could still have a record of rangetypes.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
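Jim's rangetype idea can be prototyped in user space on any 9.2+ server. A minimal sketch (all object names invented), including the hole-permitting but overlap-forbidding behaviour he describes:

```sql
-- User-space sketch of a range-partitioning map kept as rangetype values.
-- Overlapping partitions are forbidden by the exclusion constraint; holes
-- are allowed (the double-edged sword mentioned above).
CREATE TABLE partition_map (
    partition_name text PRIMARY KEY,
    bounds         daterange NOT NULL,
    EXCLUDE USING gist (bounds WITH &&)
);

INSERT INTO partition_map VALUES
    ('m2014_11', '[2014-11-01,2014-12-01)'),
    ('m2014_12', '[2014-12-01,2015-01-01)');

-- Routing a tuple then becomes a containment test:
SELECT partition_name FROM partition_map WHERE bounds @> DATE '2014-12-15';
```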
Amit Langote wrote:
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> > What is an overflow partition and why do we want that?
>
> That would be a default partition. That is, where the tuples that
> don't belong elsewhere (other defined partitions) go. VALUES clause of
> the definition for such a partition would look like:
>
> (a range partition) ... VALUES LESS THAN MAXVALUE
> (a list partition) ... VALUES DEFAULT
>
> There has been discussion about whether there shouldn't be such a
> place for tuples to go. That is, it should generate an error if a
> tuple can't go anywhere (or support auto-creating a new one like in
> interval partitioning?)

In my design I initially had overflow partitions too, because I inherited the idea from Itagaki Takahiro's patch. Eventually I realized that it's a useless concept, because you can always have leftmost and rightmost partitions, which are just regular partitions (except they don't have a "low key", resp. "high key"). If you don't define unbounded partitions at either side, it's fine, you just raise an error whenever the user tries to insert a value for which there is no partition.

Not real clear to me how this applies to list partitioning, but I have the hunch that it'd be better to deal with that without overflow partitions as well.

BTW I think auto-creating partitions is a bad idea in general, because you get into lock escalation mess and furthermore you have to waste time checking for existence beforehand, which lowers performance.
Just have a very easy command that users can run ahead of time (something like "CREATE PARTITION FOR VALUE now() + '30 days'", whatever), and preferably one that doesn't fail if the partition already exists; that way, users can have (for instance) a daily create-30-partitions-ahead procedure which most days would only create one partition (the one for 30 days in the future), but whenever the odd case happens that the server is turned off at just that time someday, it creates two -- one belt, 29 suspenders.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
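Sketching Álvaro's create-ahead idea: a daily job that creates partitions up to 30 days out, skipping ones that already exist. The CREATE PARTITION command here is his hypothetical syntax, so this is purely illustrative:

```sql
-- Purely illustrative: a daily create-ahead job. "IF NOT EXISTS" supplies
-- the doesn't-fail-if-present behaviour; the CREATE PARTITION syntax is
-- invented, per Álvaro's sketch above.
DO $$
BEGIN
    FOR i IN 0..30 LOOP
        EXECUTE format(
            'CREATE PARTITION IF NOT EXISTS ON measurement FOR VALUE %L',
            current_date + i);
    END LOOP;
END $$;
```

Most days all but one of the 31 statements would be no-ops, which is exactly the belt-and-suspenders behaviour described above.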
On Wed, Dec 03, 2014 at 10:00:26AM -0300, Alvaro Herrera wrote:
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch. Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key"). If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.

Hi,

Maybe I am not clear on the concept of an overflow partition, but I thought that it functioned to catch any record that did not fit the partitioning scheme. Your end-of-range partitions without a "low key" or "high key" would only catch problems in those areas. If you partitioned on work days of the week, you should not have anything on Saturday/Sunday. How would that work? You would want to catch anything that was not a weekday in the overflow.

Regards,
Ken
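Ken's point is that the gap falls *between* defined list values rather than beyond a leftmost or rightmost bound, so edge partitions cannot catch it. A query sketch of the weekday routing (purely illustrative; the partition names are invented):

```sql
-- Purely illustrative: where each day of one week would be routed under a
-- weekday-only list scheme. Saturday/Sunday fit no defined partition, so
-- only an overflow (default) partition could absorb them.
SELECT d::date AS day,
       CASE WHEN extract(isodow FROM d) < 6
            THEN 'weekday_' || extract(isodow FROM d)
            ELSE 'overflow'
       END AS target_partition
FROM generate_series(DATE '2014-12-01', DATE '2014-12-07',
                     interval '1 day') AS d;
```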
Maybe Vertica's approach will be a useful example:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/AdministratorsGuide/Partitions/PartitioningTables.htm
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/CREATETABLE.htm
... [ PARTITION BY partition-clause ]
--
Mikhail
* ktm@rice.edu (ktm@rice.edu) wrote:
> On Wed, Dec 03, 2014 at 10:00:26AM -0300, Alvaro Herrera wrote:
> > In my design I initially had overflow partitions too, because I
> > inherited the idea from Itagaki Takahiro's patch. Eventually I realized
> > that it's a useless concept, because you can always have leftmost and
> > rightmost partitions, which are just regular partitions (except they
> > don't have a "low key", resp. "high key"). If you don't define
> > unbounded partitions at either side, it's fine, you just raise an error
> > whenever the user tries to insert a value for which there is no
> > partition.
>
> Maybe I am not clear on the concept of an overflow partition, but I
> thought that it functioned to catch any record that did not fit the
> partitioning scheme. Your end-of-range partitions without a "low key" or
> "high key" would only catch problems in those areas. If you partitioned
> on work days of the week, you should not have anything on Saturday/Sunday.
> How would that work? You would want to catch anything that was not a
> weekday in the overflow.

Yeah, I'm not a big fan of just dropping data on the floor either. That's the purview of CHECK constraints and shouldn't be a factor of the partitioning system, imv.

There is a flip side to this though, which is that users who have those CHECK constraints probably don't want to be bothered by having to have an overflow partition, which leads into the question of: if we have them as a supported capability, what would the default be? My gut feeling is that the default should be 'no overflow', in which case I'm not sure it's useful, as it won't be there for those cases where strange data shows up unexpectedly and the system wants to put it somewhere.

Supporting overflow partitions would also mean supporting the ability to move data out of those partitions and into 'real' partitions which the user creates to deal with the odd/new data.
That doesn't strike me as being too much fun for us to have to figure out, though if we do, we might be able to do a better job (with less blocking happening, etc) than the user could. Lastly, my inclination is that it's a capability which could be added later if there is demand for it, so perhaps the best answer is to not include it now (feature creep and all that). Thanks, Stephen
Hi,

> From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
> On 12/2/14, 9:43 PM, Amit Langote wrote:
>
> The more SQL way would be records (composite types). That would make
> catalog inspection a LOT easier and presumably make it easier to change the
> partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> are stored internally as tuples; not sure if that would be faster than a List of
> Consts or a pg_node_tree. Nodes would theoretically allow using things other
> than Consts, but I suspect that would be a bad idea.

While I couldn't find an example in the system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places, like in pg_attrdef and others. Could you please point me to such a usage for reference?

> Something else to consider... our user-space support for ranges is now
> rangetypes, so perhaps that's what we should use for range partitioning. The
> up-side (which would be a double-edged sword) is that you could leave holes
> in your partitioning map. Note that in the multi-key case we could still have a
> record of rangetypes.

That is something I had in mind at least at some point. My general doubt remains about the usage of user-space SQL types for catalog fields, though I may be completely uninitiated about such usage.

Thanks,
Amit
Hi,

> From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
>
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch. Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key"). If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.

I think your mention of a "low key" and "high key" for a partition has forced me into rethinking how I was going about this. For example, in Itagaki-san's patch, only the upper bound for a range partition would go into the catalog, while the CHECK expression for that partition would use the previous partition's upper bound as its lower bound (an expression of the form lower <= key AND key < upper). I'd think that's presumptuous to a certain degree, in that the arrangement does not allow holes in the range. That also means range partitions on either end are unbounded on one side. In fact, what I called the overflow partition would get (last_partitions_upper <= key) as its CHECK expression, and vice versa.

You suggest such unbounded partitions be disallowed? That would mean we do not allow either of the partition bounds to be null in the case of a range partition, and require the list of values to be non-empty in the case of a LIST partition.

> Not real clear to me how this applies to list partitioning, but I have
> the hunch that it'd be better to deal with that without overflow
> partitions as well.

Likewise, the CHECK expression for a LIST overflow partition would look something like NOT (key = ANY (ARRAY[<values-of-all-other-partitions>])).

By the way, I am no longer saying that the primary metadata of partitions is CHECK expressions; I hope we can do away without them for partitioning sooner rather than later. I am looking to have bounds/values stored in the partition definition catalog not as an expression but as something readily usable at the places where it's needed. Suggestions are welcome!

> BTW I think auto-creating partitions is a bad idea in general, because
> you get into lock escalation mess and furthermore you have to waste time
> checking for existence beforehand, which lowers performance. Just have
> a very easy command that users can run ahead of time (something like
> "CREATE PARTITION FOR VALUE now() + '30 days'", whatever), and
> preferably one that doesn't fail if the partition already exists.

Yeah, I mentioned auto-partitioning just to know if that's how people usually prefer to have overflow cases dealt with. I'd much rather focus on straightforward cases at this point. Having said that, I agree that users of partitioning should have a mechanism like the one you mention, though I am not sure about the details.

Thanks,
Amit
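For reference, here is how the constraints under discussion look when written by hand with today's inheritance-based partitioning (runnable on current servers; the table and key values are invented):

```sql
-- Inheritance-style equivalents of the CHECK expressions discussed above.
CREATE TABLE events (day text, payload text);

-- An ordinary list partition: an explicit, non-empty list of values.
CREATE TABLE events_weekday (
    CHECK (day = ANY (ARRAY['mon','tue','wed','thu','fri']))
) INHERITS (events);

-- A LIST "overflow" partition: the complement of every other partition.
CREATE TABLE events_other (
    CHECK (NOT (day = ANY (ARRAY['mon','tue','wed','thu','fri'])))
) INHERITS (events);
```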
On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>
> Hi,
>
> > From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
> > On 12/2/14, 9:43 PM, Amit Langote wrote:
> >
> > >> >What are you going to do if the partitioning key has two columns of
> > >> >different data types?
> > >> >
> > > Sorry, this totally eluded me. Perhaps, the 'values' needs some more thought.
> > They are one of the most crucial elements of the scheme.
> > >
> > > I wonder if your suggestion of pg_node_tree plays well here. This then could
> > be a list of CONSTs or some such... And I am thinking it's a concern only for
> > range partitions, no? (that is, a multicolumn partition key)
> > >
> > > I think partkind switches the interpretation of the field as appropriate. Am I
> > missing something? By the way, I had mentioned we could have two values
> > fields each for range and list partition kind.
> >
> > The more SQL way would be records (composite types). That would make
> > catalog inspection a LOT easier and presumably make it easier to change the
> > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > are stored internally as tuples; not sure if that would be faster than a List of
> > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > than Consts, but I suspect that would be a bad idea.
> >
>
> While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
>
I think you can check the same by manually creating a table
with a user-defined type:

CREATE TYPE typ AS (f1 int, f2 text);
CREATE TABLE part_tab (c1 int, c2 typ);
On Wed, Dec 3, 2014 at 6:30 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Langote wrote:
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>
> > > What is an overflow partition and why do we want that?
> >
> > That would be a default partition. That is, where the tuples that
> > don't belong elsewhere (other defined partitions) go. VALUES clause of
> > the definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a
> > place for tuples to go. That is, it should generate an error if a
> > tuple can't go anywhere (or support auto-creating a new one like in
> > interval partitioning?)
>
> In my design I initially had overflow partitions too, because I
> inherited the idea from Itagaki Takahiro's patch. Eventually I realized
> that it's a useless concept, because you can always have leftmost and
> rightmost partitions, which are just regular partitions (except they
> don't have a "low key", resp. "high key"). If you don't define
> unbounded partitions at either side, it's fine, you just raise an error
> whenever the user tries to insert a value for which there is no
> partition.
>
> Not real clear to me how this applies to list partitioning, but I have
> the hunch that it'd be better to deal with that without overflow
> partitions as well.
>
Well, overflow partitions might not sound like a nice idea and we
might not want to do it, or at least not in the first version. However,
I think it could be useful in certain cases: for example, if in a
long-running transaction the user is able to insert many rows into
appropriate partitions and one row falls outside the defined partitions'
ranges, an error in such a case can annoy the user. I think a similar
situation could occur for bulk insert (COPY).
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > While I couldn't find an example in the system catalogs where a
> > record/composite type is used, there are instances of pg_node_tree at a
> > number of places, like in pg_attrdef and others. Could you please point
> > me to such a usage for reference?
>
> I think you can check the same by manually creating a table
> with a user-defined type.
>
> Create type typ as (f1 int, f2 text);
> Create table part_tab(c1 int, c2 typ);

Is there such a custom-defined type used in some system catalog? I am just not sure how one would put together a custom type to use in a system catalog, given the way a system catalog is created. That's my concern, but it may not be valid.

Thanks,
Amit
On Tue, Dec 2, 2014 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> Before going too much further with this I'd mock up schemas for your
> >> proposed catalogs and a list of DDL operations to be supported, with
> >> the corresponding syntax, and float that here for comment.
>
> More people should really comment on this. This is a pretty big deal
> if it goes forward, so it shouldn't be based on what one or two people
> think.
>
> > * Catalog schema:
> >
> > CREATE TABLE pg_catalog.pg_partitioned_rel
> > (
> > partrelid oid NOT NULL,
> > partkind oid NOT NULL,
> > partissub bool NOT NULL,
> > partkey int2vector NOT NULL, -- partitioning attributes
> > partopclass oidvector,
> >
> > PRIMARY KEY (partrelid, partissub),
> > FOREIGN KEY (partrelid) REFERENCES pg_class (oid),
> > FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> > )
> > WITHOUT OIDS ;
>
> So, we're going to support exactly two levels of partitioning?
> partitions with partissub=false and subpartitions with partissub=true?
> Why not support only one level of partitioning here but then let the
> children have their own pg_partitioned_rel entries if they are
> subpartitioned? That seems like a cleaner design and lets us support
> an arbitrary number of partitioning levels if we ever need them.
>
> > CREATE TABLE pg_catalog.pg_partition_def
> > (
> > partitionid oid NOT NULL,
> > partitionparentrel oid NOT NULL,
> > partitionisoverflow bool NOT NULL,
> > partitionvalues anyarray,
> >
> > PRIMARY KEY (partitionid),
> > FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> > )
> > WITHOUT OIDS;
> >
> > ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
>
> What is an overflow partition and why do we want that?
>
> What are you going to do if the partitioning key has two columns of
> different data types?
>
> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
> >
> > -- create partitioned table and child partitions at once.
> > CREATE TABLE parent (...)
> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> > [ (
> > PARTITION child
> > {
> > VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> > | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> > }
> > [ WITH ( ... ) ] [ TABLESPACE tbs ]
> > [, ...]
> > ) ] ;
>
> How are you going to dump and restore this, bearing in mind that you
> have to preserve a bunch of OIDs across pg_upgrade? What if somebody
> wants to do pg_dump --table name_of_a_partition?
>
Do we really need to support DML or pg_dump for individual partitions?
On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > The more SQL way would be records (composite types). That would make
> > > catalog inspection a LOT easier and presumably make it easier to change the
> > > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records
> > > are stored internally as tuples; not sure if that would be faster than a List of
> > > Consts or a pg_node_tree. Nodes would theoretically allow using things other
> > > than Consts, but I suspect that would be a bad idea.
> > >
> >
> > While I couldn’t find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference?
> >
>
> > I think you can check the same by manually creating table
> > with a user-defined type.
>
> > Create type typ as (f1 int, f2 text);
> > Create table part_tab(c1 int, c2 typ);
>
> Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type to use in a system catalog given the way a system catalog is created. That's my concern but it may not be valid.
>
I think you are right. I think in this case we need something similar
to column pg_index.indexprs, which is of type pg_node_tree (which
seems to be already suggested by Robert). So maybe we can proceed
with this type and see if anyone else has a better idea.
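The pg_index.indexprs precedent is easy to inspect on a running server; pg_get_expr deserializes the stored node tree back into readable SQL:

```sql
-- pg_node_tree in use today: an expression index stores its expressions
-- as a serialized node tree in pg_index.indexprs.
CREATE TABLE t (v text);
CREATE INDEX t_lower_idx ON t (lower(v));

SELECT pg_get_expr(indexprs, indrelid)
FROM pg_index
WHERE indexrelid = 't_lower_idx'::regclass;
-- returns the deparsed expression, e.g. lower(v)
```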
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > Is there such a custom-defined type used in some system catalog? Just
> > not sure how one would put together a custom type to use in a system
> > catalog given the way a system catalog is created. That's my concern
> > but it may not be valid.
>
> I think you are right. I think in this case we need something similar
> to column pg_index.indexprs, which is of type pg_node_tree (which
> seems to be already suggested by Robert). So maybe we can proceed
> with this type and see if anyone else has a better idea.

Yeah, with that, I was thinking we may be able to do something like dump a Node that describes the range partition bounds or list of allowed values (say, RangePartitionValues, ListPartitionValues).

Thanks,
Amit
From: Amit Kapila [mailto:amit.kapila16@gmail.com] Sent: Friday, December 05, 2014 5:10 PM To: Amit Langote Cc: Jim Nasby; Robert Haas; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers Subject: Re: [HACKERS] On partitioning On Fri, Dec 5, 2014 at 12:27 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > From: Amit Kapila [mailto:amit.kapila16@gmail.com] > On Thu, Dec 4, 2014 at 10:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > > > > The more SQL way would be records (composite types). That would make > > > catalog inspection a LOT easier and presumably make it easier to change the > > > partitioning key (I'm assuming ALTER TYPE cascades to stored data). Records > > > are stored internally as tuples; not sure if that would be faster than a List of > > > Consts or a pg_node_tree. Nodes would theoretically allow using things other > > > than Consts, but I suspect that would be a bad idea. > > > > > > > While I couldn't find an example in system catalogs where a record/composite type is used, there are instances of pg_node_tree at a number of places like in pg_attrdef and others. Could you please point me to such a usage for reference? > > > > > I think you can check the same by manually creating table > > with a user-defined type. > > > Create type typ as (f1 int, f2 text); > > Create table part_tab(c1 int, c2 typ); > > Is there such a custom-defined type used in some system catalog? Just not sure how one would put together a custom type to use in a system catalog given the way a system catalog is created. That's my concern but it may not be valid. > > > I think you are right. I think in this case we need something similar > to column pg_index.indexprs which is of type pg_node_tree (which > seems to be already suggested by Robert). So may be we can proceed > with this type and see if any one else has better idea. 
One point raised about/against pg_node_tree was the values represented therein would turn out to be too generalized to be used with advantage during planning. But, it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in case of partitioned tables. Perhaps this is what Robert suggested elsewhere. Thanks, Amit
On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> So, we're going to support exactly two levels of partitioning? >> partitions with partissub=false and subpartitions with partissub=true? >> Why not support only one level of partitioning here but then let the >> children have their own pg_partitioned_rel entries if they are >> subpartitioned? That seems like a cleaner design and lets us support >> an arbitrary number of partitioning levels if we ever need them. > > Yeah, that's what I thought at some point in favour of dropping partissub altogether. However, not that this design solves it, there is one question - if we would want to support defining for a table both partition key and sub-partition key in advance? That is, without having defined a first level partition yet; in that case, what level do we associate sub-(sub-)partitioning key with or more to the point where do we keep it? Do we really need to allow that? I think you let people partition a toplevel table, and then partition its partitions once they've been created. I'm not sure there's a good reason to associate the subpartitioning scheme with the toplevel table. For one thing, that forces all subpartitions to be partitioned the same way - do we want to insist on that? If we do, then I agree that we need to think a little harder here. > That would be a default partition. That is, where the tuples that don't belong elsewhere (other defined partitions) go. VALUES clause of the definition for such a partition would look like: > > (a range partition) ... VALUES LESS THAN MAXVALUE > (a list partition) ... VALUES DEFAULT > > There has been discussion about whether there shouldn't be such a place for tuples to go. That is, it should generate an error if a tuple can't go anywhere (or support auto-creating a new one like in interval partitioning?) I think Alvaro's response further down the thread is right on target. 
But to go into a bit more detail, let's consider the three possible cases:

- Hash partitioning. Every key value gets hashed to some partition. The concept of an overflow or default partition doesn't even make sense.

- List partitioning. Each key for which the user has defined a mapping gets sent to the corresponding partition. The keys that aren't mapped anywhere can either (a) cause an error or (b) get mapped to some default partition. It's probably useful to offer both behaviors. But I don't think it requires a partitionisoverflow column, because you can represent it some other way, such as by making partitionvalues NULL, which is otherwise meaningless.

- Range partitioning. In this case, what you've basically got is a list of partition bounds and a list of target partitions. Suppose there are N partition bounds; then there will be N+1 targets. Some of those targets can be undefined, meaning an attempt to insert a key with that value will error out. For example, suppose the user defines a partition for values 1-3 and 10-13. Then your list of partition bounds looks like this:

1,3,10,13

And your list of destinations looks like this:

undefined,firstpartition,undefined,secondpartition,undefined

More commonly, the ranges will be contiguous, so that there are no gaps. If you have everything <10 in the first partition, everything 10-20 in the second partition, and everything else in a third partition, then you have bounds 10,20 and destinations firstpartition,secondpartition,thirdpartition. If you want values greater than 20 to error out, then you have bounds 10,20 and destinations firstpartition,secondpartition,undefined.

In none of this do you really have "an overflow partition". Rather, the first and last destinations, if defined, catch everything that has a key lower than the lowest key or higher than the highest key. If not defined, you error out.

> I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... 
And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) I guess you could list or hash partition on multiple columns, too. And yes, this is why I thought of pg_node_tree. >> > * DDL syntax (no multi-column partitioning, sub-partitioning support as yet): >> > >> > -- create partitioned table and child partitions at once. >> > CREATE TABLE parent (...) >> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ] >> > [ ( >> > PARTITION child >> > { >> > VALUES LESS THAN { ... | MAXVALUE } -- for RANGE >> > | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST >> > } >> > [ WITH ( ... ) ] [ TABLESPACE tbs ] >> > [, ...] >> > ) ] ; >> >> How are you going to dump and restore this, bearing in mind that you >> have to preserve a bunch of OIDs across pg_upgrade? What if somebody >> wants to do pg_dump --table name_of_a_partition? >> > Assuming everything's (including partitioned relation and partitions at all levels) got a pg_class entry of its own, would OIDs be a problem? Or what is the nature of this problem if it's possible that it may be. For pg_dump --binary-upgrade, you need a statement like SELECT binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for each pg_class entry. So you can't easily have a single SQL statement creating multiple such entries. > Oh, do you mean to do away without any syntax for defining partitions with CREATE TABLE parent? That's what I was thinking. Or at least just make that a shorthand for something that can also be done with a series of SQL statements. > By the way, do you mean the following: > > CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000) > > Instead of, > > CREATE PARTITION child ON parent VALUES LESS THAN 10000? To me, it seems more logical to make it a variant of CREATE TABLE, similar to what we do already with CREATE TABLE tab OF typename. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
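Robert's range-routing scheme above (N sorted bounds, N+1 destinations, with "undefined" slots rejecting tuples) can be sketched in a few lines. This is only an illustration, not PostgreSQL code: the half-open [lower, upper) slot convention and all the names here are assumptions of the sketch.

```python
import bisect

def route(key, bounds, destinations):
    # N sorted bounds split the key space into N + 1 slots; destinations
    # holds one entry per slot, and None means "undefined": no partition
    # accepts such keys, so the insert must raise an error.
    # (Half-open [lower, upper) slots are an illustrative choice.)
    assert len(destinations) == len(bounds) + 1
    dest = destinations[bisect.bisect_right(bounds, key)]
    if dest is None:
        raise ValueError("no partition accepts key %r" % (key,))
    return dest

# Contiguous case from the mail: everything <10, then 10-20, then the rest.
parts = ["firstpartition", "secondpartition", "thirdpartition"]
print(route(5, [10, 20], parts))    # firstpartition
print(route(15, [10, 20], parts))   # secondpartition

# With the last destination undefined, keys past 20 error out.
try:
    route(25, [10, 20], ["firstpartition", "secondpartition", None])
except ValueError as e:
    print(e)
```

The gap case (bounds 1,3,10,13 with undefined slots in between) is just a matter of placing None entries in the destinations list.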
On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Do we really need to support dml or pg_dump for individual partitions? I think we do. It's quite reasonable for a DBA (or developer or whatever) to want to dump all the data that's in a single partition; for example, maybe they have the table partitioned, but also spread across several servers. When the data on one machine grows too big, they want to dump that partition, move it to a new machine, and drop the partition from the old machine. That needs to be easy and efficient. More generally, with inheritance, I've seen the ability to reference individual inheritance children be a real life-saver on any number of occasions. Now, a new partitioning system that is not as clunky as constraint exclusion will hopefully be fast enough that people don't need to do it very often any more. But I would be really cautious about removing the option. That is the equivalent of installing a new fire suppression system and then boarding up the emergency exit. Yeah, you *hope* the new fire suppression system is good enough that nobody will ever need to go out that way any more. But if you're wrong, people will die, so getting rid of it isn't prudent. The stakes are not quite so high here, but the principle is the same. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 5, 2014 at 3:11 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I think you are right. I think in this case we need something similar >> to column pg_index.indexprs which is of type pg_node_tree(which >> seems to be already suggested by Robert). So may be we can proceed >> with this type and see if any one else has better idea. > > Yeah, with that, I was thinking we may be able to do something like dump a Node that describes the range partition bounds or list of allowed values (say, RangePartitionValues, ListPartitionValues). That's exactly the kind of thing I was thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/5/14, 3:42 AM, Amit Langote wrote: >> > I think you are right. I think in this case we need something similar >> >to column pg_index.indexprs which is of type pg_node_tree (which >> >seems to be already suggested by Robert). So may be we can proceed >> >with this type and see if any one else has better idea. > One point raised about/against pg_node_tree was the values represented therein would turn out to be too generalized to be used with advantage during planning. But, it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in case of partitioned tables. Perhaps this is what Robert suggested elsewhere. In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 12/5/14, 1:22 PM, Jim Nasby wrote: > On 12/5/14, 3:42 AM, Amit Langote wrote: >>> > I think you are right. I think in this case we need something similar >>> >to column pg_index.indexprs which is of type pg_node_tree (which >>> >seems to be already suggested by Robert). So may be we can proceed >>> >with this type and see if any one else has better idea. >> One point raised about/against pg_node_tree was the values represented therein would turn out to be too generalized to be used with advantage during planning. But, it seems we could deserialize it in advance back to the internal form (like an array of a struct) as part of the cached relation data. This overhead would only be incurred in case of partitioned tables. Perhaps this is what Robert suggested elsewhere. > > In order to store a composite type in a catalog, we would need to have one field that has the typid of the composite, and the field that stores the actual composite data would need to be a "dumb" varlena that stores the composite HeapTupleHeader. On further thought; if we disallow NULL as a partition boundary, we don't need a separate rowtype; we could just use the one associated with the relation itself. Presumably that would make comparing tuples to the relation list a lot easier. I was hung up on how that would work in the case of ALTER TABLE, but we'd have the same problem with using pg_node_tree: if you alter a table in such a way that *might* affect your partitioning, you have to do some kind of revalidation anyway. The other option would be to use some custom rowtype to store boundary values and have a method that can form a boundary tuple from a real one. Either way, I suspect this is better than frequently evaluating pg_node_trees. There may be one other option. If range partitions are defined in terms of an expression that is different for every partition (ie: (substr(product_key, 1, 4), date_trunc('month', sales_date))) then we could use a hash of that expression to identify a partition. 
In other words, range partitioning becomes a special case of hash partitioning. I do think we need a programmatic means to identify the range of an individual partition and hash won't solve that, but the performance of that case isn't critical so we could use pretty much whatever we wanted to there. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
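Jim's idea of identifying a partition by hashing the per-partition expression value could look roughly like the sketch below. Everything here is hypothetical: the column names, the string-slicing stand-ins for substr and date_trunc, and the use of SHA-1 are illustrative choices only.

```python
import hashlib

def partition_id(product_key, sales_date):
    # Evaluate the partitioning expression for this row: the first four
    # characters of the product key plus the month of the sale, echoing
    # Jim's example (substr(product_key, 1, 4), date_trunc('month', sales_date)).
    expr = (product_key[:4], sales_date[:7])  # 'YYYY-MM' stands in for date_trunc
    # Hash the expression value; every row with the same expression value
    # hashes identically, so routing reduces to a hash lookup, and range
    # partitioning becomes a special case of hash partitioning.
    return hashlib.sha1(repr(expr).encode()).hexdigest()[:8]

a = partition_id("AB12-99", "2014-12-05")
b = partition_id("AB12-42", "2014-12-25")  # same prefix, same month
print(a == b)  # True: both rows land in the same partition
```

As Jim notes, this identifies a partition but cannot recover its range; a separate catalog entry would still be needed for that.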
On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > The other option would be to use some custom rowtype to store boundary > values and have a method that can form a boundary tuple from a real one. > Either way, I suspect this is better than frequently evaluating > pg_node_trees. On what basis do you expect that? Every time you use a view, you're using a pg_node_tree. Nobody's ever complained that having to reload the pg_node_tree column was too slow, and I see no reason to suppose that things would be any different here. I mean, we can certainly invent something new if there is a reason to do so. But you (and a few other people) seem to be trying pretty hard to avoid using the massive amount of infrastructure that we already have to do almost this exact thing, which puzzles the heck out of me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/5/14, 2:02 PM, Robert Haas wrote: > On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> The other option would be to use some custom rowtype to store boundary >> values and have a method that can form a boundary tuple from a real one. >> Either way, I suspect this is better than frequently evaluating >> pg_node_trees. > > On what basis do you expect that? Every time you use a view, you're > using a pg_node_tree. Nobody's ever complained that having to reload > the pg_node_tree column was too slow, and I see no reason to suppose > that things would be any different here. > > I mean, we can certainly invent something new if there is a reason to > do so. But you (and a few other people) seem to be trying pretty hard > to avoid using the massive amount of infrastructure that we already > have to do almost this exact thing, which puzzles the heck out of me. My concern is how to do the routing of incoming tuples. I'm assuming it'd be significantly faster to compare two tuples than to run each tuple through a bunch of nodetrees. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 5, 2014 at 3:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> On what basis do you expect that? Every time you use a view, you're >> using a pg_node_tree. Nobody's ever complained that having to reload >> the pg_node_tree column was too slow, and I see no reason to suppose >> that things would be any different here. >> >> I mean, we can certainly invent something new if there is a reason to >> do so. But you (and a few other people) seem to be trying pretty hard >> to avoid using the massive amount of infrastructure that we already >> have to do almost this exact thing, which puzzles the heck out of me. > > My concern is how to do the routing of incoming tuples. I'm assuming it'd be > significantly faster to compare two tuples than to run each tuple through a > bunch of nodetrees. As I said before, that's a completely unrelated problem. To quickly route tuples for range or list partitioning, you're going to want to have an array of Datums in memory and bsearch it. That says nothing about how they should be stored on disk. Whatever the on-disk representation looks like, the relcache is going to need to reassemble it into an array that can be binary-searched. As long as that's not hard to do - and none of the proposals here would make it hard to do - there's no reason to care about this from that point of view. At least, not that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
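Robert's point, that the on-disk format hardly matters so long as the relcache can deserialize it once into a binary-searchable array, can be sketched as follows. JSON stands in for pg_node_tree here, and the class and field names are invented for illustration.

```python
import bisect
import json

class CachedPartitionInfo:
    """Parse-once cache of partition bounds, in the spirit of the relcache.

    The serialized catalog form (JSON here, pg_node_tree in the real
    discussion) is deserialized a single time into a sorted array, so
    per-tuple routing is a binary search and never re-reads the stored form.
    """
    def __init__(self, serialized):
        info = json.loads(serialized)
        self.bounds = info["bounds"]       # sorted partition bounds
        self.destinations = info["dests"]  # len(bounds) + 1 targets

    def lookup(self, key):
        # Binary search the in-memory array, analogous to bsearching
        # an array of Datums assembled by the relcache.
        return self.destinations[bisect.bisect_right(self.bounds, key)]

cache = CachedPartitionInfo('{"bounds": [10, 20], "dests": ["p1", "p2", "p3"]}')
print(cache.lookup(3), cache.lookup(12), cache.lookup(40))  # p1 p2 p3
```

The point of the sketch is that the cost of the stored representation is paid once per cache build, not once per routed tuple.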
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.
How would you distinguish values in list partition for multiple
columns? I mean for range partition, we are sure there will
be either one value for each column, but for list it could
be multiple and not fixed for each partition, so I think it will not
be easy to support the multicolumn partition key for list
partitions.
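Amit's concern can be illustrated with a toy lookup table: a multicolumn list partition accepts an arbitrary, per-partition set of value tuples rather than one bound per column, so the mapping is inherently variable-sized per partition. The column values and partition names below are purely illustrative.

```python
# Each list partition accepts a set of (col1, col2) value tuples; the
# sets have different sizes per partition, which is the representation
# difficulty for multicolumn list partitioning (contrast range, where
# each partition contributes exactly one bound per column).
accepted = {
    ("US", "retail"): "p_us",
    ("US", "web"): "p_us",      # p_us accepts two tuples
    ("DE", "retail"): "p_eu",   # p_eu accepts one
}

def route_list(row, default=None):
    # Unmapped tuples either error out or fall to a default partition,
    # matching the two list-partitioning behaviors discussed upthread.
    dest = accepted.get(row, default)
    if dest is None:
        raise ValueError("no partition accepts %r" % (row,))
    return dest

print(route_list(("US", "web")))                     # p_us
print(route_list(("JP", "web"), default="p_other"))  # p_other
```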
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do. It's quite reasonable for a DBA (or developer or
> whatever) to want to dump all the data that's in a single partition;
> for example, maybe they have the table partitioned, but also spread
> across several servers. When the data on one machine grows too big,
> they want to dump that partition, move it to a new machine, and drop
> the partition from the old machine. That needs to be easy and
> efficient.
>
> More generally, with inheritance, I've seen the ability to reference
> individual inheritance children be a real life-saver on any number of
> occasions. Now, a new partitioning system that is not as clunky as
> constraint exclusion will hopefully be fast enough that people don't
> need to do it very often any more. But I would be really cautious
> about removing the option. That is the equivalent of installing a new
> fire suppression system and then boarding up the emergency exit.
> Yeah, you *hope* the new fire suppression system is good enough that
> nobody will ever need to go out that way any more. But if you're
> wrong, people will die, so getting rid of it isn't prudent. The
> stakes are not quite so high here, but the principle is the same.
>
Sure, I don't feel we should avoid providing any way to take a dump
of an individual partition, but it needn't be at the level of an
independent table. Maybe something like --table <table_name>
--partition <partition_name>.
In general, I think we should try to avoid exposing that partitions are
individual tables, as that might hinder any future enhancement in that
area (for example, if someone finds a different and better way to
arrange the partition data, then the currently exposed syntax
might leave us feeling blocked).
Hi Robert, > From: Robert Haas [mailto:robertmhaas@gmail.com] > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> So, we're going to support exactly two levels of partitioning? > >> partitions with partissub=false and subpartitions with partissub=true? > >> Why not support only one level of partitioning here but then let the > >> children have their own pg_partitioned_rel entries if they are > >> subpartitioned? That seems like a cleaner design and lets us support > >> an arbitrary number of partitioning levels if we ever need them. > > > > Yeah, that's what I thought at some point in favour of dropping partissub > altogether. However, not that this design solves it, there is one question - if > we would want to support defining for a table both partition key and sub- > partition key in advance? That is, without having defined a first level partition > yet; in that case, what level do we associate sub-(sub-) partitioning key with > or more to the point where do we keep it? > > Do we really need to allow that? I think you let people partition a > toplevel table, and then partition its partitions once they've been > created. I'm not sure there's a good reason to associate the > subpartitioning scheme with the toplevel table. For one thing, that > forces all subpartitions to be partitioned the same way - do we want > to insist on that? If we do, then I agree that we need to think a > little harder here. > To me, it sounds better if we insist on a uniform subpartitioning scheme across all partitions. It seems that's how it's done elsewhere. It would be interesting to hear what others think though. > > That would be a default partition. That is, where the tuples that don't > belong elsewhere (other defined partitions) go. VALUES clause of the > definition for such a partition would look like: > > > > (a range partition) ... VALUES LESS THAN MAXVALUE > > (a list partition) ... 
VALUES DEFAULT > > > > There has been discussion about whether there shouldn't be such a place > for tuples to go. That is, it should generate an error if a tuple can't go > anywhere (or support auto-creating a new one like in interval partitioning?) > > I think Alvaro's response further down the thread is right on target. > But to go into a bit more detail, let's consider the three possible > cases: > > - Hash partitioning. Every key value gets hashed to some partition. > The concept of an overflow or default partition doesn't even make > sense. > > - List partitioning. Each key for which the user has defined a > mapping gets sent to the corresponding partition. The keys that > aren't mapped anywhere can either (a) cause an error or (b) get mapped > to some default partition. It's probably useful to offer both > behaviors. But I don't think it requires a partitionisoverflow > column, because you can represent it some other way, such as by making > partitionvalues NULL, which is otherwise meaningless. > > - Range partitioning. In this case, what you've basically got is a > list of partition bounds and a list of target partitions. Suppose > there are N partition bounds; then there will be N+1 targets. Some of > those targets can be undefined, meaning an attempt to insert a key > with that value will error out. For example, suppose the user defines > a partition for values 1-3 and 10-13. Then your list of partition > bounds looks like this: > > 1,3,10,13 > > And your list of destinations looks like this: > > undefined,firstpartition,undefined,secondpartition,undefined > > More commonly, the ranges will be contiguous, so that there are no > gaps. If you have everything <10 in the first partition, everything > 10-20 in the second partition, and everything else in a third > partition, then you have bounds 10,20 and destinations > firstpartition,secondpartition,thirdpartition. 
If you want values > greater than 20 to error out, then you have bounds 10,20 and > destinations firstpartition,secondpartition,undefined. > > In none of this do you really have "an overflow partition". Rather, > the first and last destinations, if defined, catch everything that has > a key lower than the lowest key or higher than the highest key. If > not defined, you error out. So just to clarify, first and last destinations are considered "defined" if you have something like: ... PARTITION p1 VALUES LESS THAN 10 PARTITION p2 VALUES BETWEEN 10 AND 20 PARTITION p3 VALUES GREATER THAN 20 ... And "not defined" if: ... PARTITION p1 VALUES BETWEEN 10 AND 20 ... In the second case, because no explicit definitions for values less than 10 and greater than 20 are in place, rows with that value error out? If so, that makes sense. > > > I wonder if your suggestion of pg_node_tree plays well here. This then > could be a list of CONSTs or some such... And I am thinking it's a concern only > for range partitions, no? (that is, a multicolumn partition key) > > I guess you could list or hash partition on multiple columns, too. > And yes, this is why I thought of pg_node_tree. > > >> > * DDL syntax (no multi-column partitioning, sub-partitioning support as > yet): > >> > > >> > -- create partitioned table and child partitions at once. > >> > CREATE TABLE parent (...) > >> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ] > >> > [ ( > >> > PARTITION child > >> > { > >> > VALUES LESS THAN { ... | MAXVALUE } -- for RANGE > >> > | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST > >> > } > >> > [ WITH ( ... ) ] [ TABLESPACE tbs ] > >> > [, ...] > >> > ) ] ; > >> > >> How are you going to dump and restore this, bearing in mind that you > >> have to preserve a bunch of OIDs across pg_upgrade? What if somebody > >> wants to do pg_dump --table name_of_a_partition? 
> >> > > Assuming everything's (including partitioned relation and partitions at all > levels) got a pg_class entry of its own, would OIDs be a problem? Or what is > the nature of this problem if it's possible that it may be. > > For pg_dump --binary-upgrade, you need a statement like SELECT > binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for > each pg_class entry. So you can't easily have a single SQL statement > creating multiple such entries. > Hmm, do you mean "pg_dump cannot emit" such a SQL or there shouldn't be one in the first place? > > Oh, do you mean to do away without any syntax for defining partitions with > CREATE TABLE parent? > > That's what I was thinking. Or at least just make that a shorthand > for something that can also be done with a series of SQL statements. > Perhaps this is related to the point just above. So, a single SQL statement that defines partitioning key and few partitions/subpartitions based on the key could be supported provided the resulting set of objects can also be created using an alternative series of steps each of which creates at most one object. Do we want a key definition to have an oid? Perhaps not. > > By the way, do you mean the following: > > > > CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1, 10000) > > > > Instead of, > > > > CREATE PARTITION child ON parent VALUES LESS THAN 10000? > > To me, it seems more logical to make it a variant of CREATE TABLE, > similar to what we do already with CREATE TABLE tab OF typename. > Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice we end up making about how we treat subpartitioning key (one of the points discussed above) Regards, Amit
From: Amit Kapila [mailto:amit.kapila16@gmail.com] Sent: Saturday, December 06, 2014 5:00 PM To: Robert Haas Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers Subject: Re: [HACKERS] On partitioning On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key) > > I guess you could list or hash partition on multiple columns, too. > > How would you distinguish values in list partition for multiple > columns? I mean for range partition, we are sure there will > be either one value for each column, but for list it could > be multiple and not fixed for each partition, so I think it will not > be easy to support the multicolumn partition key for list > partitions. Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used. It is used in combination with range or hash partitioning as composite partitioning. So, perhaps we need not worry about that. Regards, Amit
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila Sent: Saturday, December 06, 2014 5:06 PM To: Robert Haas Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers Subject: Re: [HACKERS] On partitioning On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Do we really need to support dml or pg_dump for individual partitions? > > I think we do. It's quite reasonable for a DBA (or developer or > whatever) to want to dump all the data that's in a single partition; > for example, maybe they have the table partitioned, but also spread > across several servers. When the data on one machine grows too big, > they want to dump that partition, move it to a new machine, and drop > the partition from the old machine. That needs to be easy and > efficient. > > More generally, with inheritance, I've seen the ability to reference > individual inheritance children be a real life-saver on any number of > occasions. Now, a new partitioning system that is not as clunky as > constraint exclusion will hopefully be fast enough that people don't > need to do it very often any more. But I would be really cautious > about removing the option. That is the equivalent of installing a new > fire suppression system and then boarding up the emergency exit. > Yeah, you *hope* the new fire suppression system is good enough that > nobody will ever need to go out that way any more. But if you're > wrong, people will die, so getting rid of it isn't prudent. The > stakes are not quite so high here, but the principle is the same. > > > Sure, I don't feel we should not provide anyway to take dump > for individual partition but not at level of independent table. > May be something like --table <table_name> > --partition <partition_name>. > This does sound cleaner. 
> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked).

Sounds like a concern. I guess you are referring to whether we allow a partition relation to be included in the range table and then some other cases. In the former case we could allow referring to individual partitions by some additional syntax if it doesn't end up looking too ugly or invite a bunch of other issues. This seems to have been discussed a little bit upthread (for example, see "Open Questions" in Alvaro's original proposal and Hannu Krosing's reply).

Regards,
Amit
On Mon, Dec 8, 2014 at 11:01 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> From: Amit Kapila [mailto:amit.kapila16@gmail.com]
> Sent: Saturday, December 06, 2014 5:00 PM
> To: Robert Haas
> Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
> Subject: Re: [HACKERS] On partitioning
>
> On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >
> > > I wonder if your suggestion of pg_node_tree plays well here. This then could be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
> >
> > I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
>
> Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used.
So I think it is better to be clear why we are not planning to
support it, is it that because it is not required by users or
is it due to the reason that code seems to be tricky or is it due
to both of the reasons. It might help us if anyone raises this
during the development of this patch or in general if someone
requests such a feature.
From: Amit Kapila [mailto:amit.kapila16@gmail.com]

> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
> >
> > Irrespective of difficulties of representing it using pg_node_tree, it seems to me that multicolumn list partitioning is not widely used.
>
> So I think it is better to be clear why we are not planning to
> support it, is it that because it is not required by users or
> is it due to the reason that code seems to be tricky or is it due
> to both of the reasons. It might help us if anyone raises this
> during the development of this patch or in general if someone
> requests such a feature.

Coming back to the pg_node_tree representation for list partitions - for each column in a multicolumn list partition key, a value would look like a dumped Node for a List of Consts (all allowed values in a given list partition). The whole key would then be a List of such Nodes (a dump thereof). That's perhaps pretty verbose, but I guess that's supposed to be only a catalog representation. During relcache building, we turn this back into a collection of structs to efficiently locate the partition of interest, whatever the method of doing that ends up being (based on partition type). The relcache step ensures that we have decoupled the concern of quickly locating an interesting partition from its catalog representation.

Of course, there may be flaws in this picture; they would only reveal themselves when actually trying to implement it, or they can be pointed out in advance.

Thanks,
Amit
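The relcache decoding step described above can be sketched outside PostgreSQL. In this illustrative Python simulation (not PostgreSQL source; JSON stands in for the pg_node_tree/nodeToString dump, and all names are hypothetical), the verbose catalog text is decoded once into an in-memory lookup table that is then used to locate the partition of interest:

```python
import json

# Hypothetical stand-in for a pg_node_tree dump: for each partition,
# the list of accepted multicolumn values.
catalog_text = json.dumps({
    "p1": [[1, "a"], [2, "b"]],
    "p2": [[3, "c"]],
})

def build_partition_map(text):
    """Relcache-style step: decode the catalog string once into a
    structure suited to fast lookup (value tuple -> partition name)."""
    spec = json.loads(text)
    return {tuple(v): part for part, values in spec.items() for v in values}

partition_map = build_partition_map(catalog_text)

def locate_partition(row):
    """Route a row to its list partition, or None if no partition accepts it."""
    return partition_map.get(row)
```

The point of the sketch is only the decoupling: however verbose the catalog text is, routing cost depends on the decoded structure, not on the stored representation.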
All,

Pardon me for jumping into this late.

In general, I like Alvaro's approach. However, I wanted to list the major shortcomings of the existing partitioning system (based on complaints by PGX's users and on IRC) and compare them to Alvaro's proposed implementation to make sure that enough of them are addressed, and that the ones which aren't addressed are not being addressed as a clear decision. We can't address *all* of the limitations of the current system, but let's make sure that we're addressing enough of them to make implementing a 2nd partitioning system worthwhile.

Where I have ? is because I'm not clear from Alvaro's proposal whether they're addressed or not.

1. The Trigger Problem
The need to write triggers for INSERT/UPDATE/DELETE.
Addressed.

2. The Clutter Problem
Cluttering up system views and dumps with hundreds of partitioned tables.
Addressed.

3. Creation Problem
The need to write triggers and/or cron jobs to create new partitions.
Addressed.

4. Creation Locking Problem
High probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time.
Not Addressed?

5. Constant Problem
Since current partitioned query planning happens before the rewrite phase, SELECTs do not use partition logic to evaluate even simple expressions, let alone IMMUTABLE or STABLE functions.
Addressed??

6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents the partitioned table from being FK'd.
Not Addressed (but could be addressed in the future)

7. JOIN Problem
Two partitioned tables being JOINed need to append and materialize before the join, causing a very slow join under some circumstances, even if both tables are partitioned on the same ranges.
Not Addressed? (but could be addressed in the future)

8. COPY Problem
Cannot bulk-load into the Master, just into individual partitions.
Addressed.

9. Hibernate Problem
When using the trigger method, inserts into the master partition return 0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.

10. Scaling Problem
Inheritance partitioning becomes prohibitively slow for the planner at somewhere between 100 and 500 partitions depending on various factors.
No idea?

11. Hash Partitioning
Some users would prefer to partition into a fixed number of hash-allocated partitions.
Not Addressed.

12. Extra Constraint Evaluation
Inheritance partitioning evaluates *all* constraints on the partitions, whether they are part of the partitioning scheme or not. This is way expensive if those are, say, polygon comparisons.
Addressed.

Additionally, I believe that Alvaro's proposal will make the following activities, which are supported by partition-by-inheritance, more difficult or impossible. Again, these are probably acceptable because inheritance partitioning isn't going away. However, we should consciously decide that:

A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside the master and then "attach" it, allowing for minimal disturbance of concurrent users.
Could be addressed in the future.

B. Catchall Partition
Many partitioning schemes currently contain a "catchall" partition which accepts rows outside of the range of the partitioning scheme, due to bad input data.
Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the partitioned column.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I guess you could list or hash partition on multiple columns, too.
>
> How would you distinguish values in list partition for multiple
> columns? I mean for range partition, we are sure there will
> be either one value for each column, but for list it could
> be multiple and not fixed for each partition, so I think it will not
> be easy to support the multicolumn partition key for list
> partitions.

I don't understand. If you want to range partition on columns (a, b), you say that, say, tuples with (a, b) values less than (100, 200) go here and the rest go elsewhere. For list partitioning, you say that, say, tuples with (a, b) values of EXACTLY (100, 200) go here and the rest go elsewhere. I'm not sure how useful that is but it's not illogical.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
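The distinction drawn here between multicolumn range and list semantics can be shown with a small hedged sketch (illustrative Python only, used as pseudocode; tuple comparison plays the role of SQL row comparison, and the bound (100, 200) is taken from the example above):

```python
def route_range(row, bound=(100, 200)):
    # Python tuple comparison is lexicographic, like SQL row comparison:
    # (a, b) < (100, 200) iff a < 100, or a == 100 and b < 200.
    return "here" if row < bound else "elsewhere"

def route_list(row, accepted=frozenset({(100, 200)})):
    # List partitioning on (a, b): only an exact match lands "here".
    return "here" if row in accepted else "elsewhere"
```

So (100, 199) routes "here" under the range rule but "elsewhere" under the list rule, which is the whole difference.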
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> So just to clarify, first and last destinations are considered "defined" if you have something like:
>
> ...
> PARTITION p1 VALUES LESS THAN 10
> PARTITION p2 VALUES BETWEEN 10 AND 20
> PARTITION p3 VALUES GREATER THAN 20
> ...
>
> And "not defined" if:
>
> ...
> PARTITION p1 VALUES BETWEEN 10 AND 20
> ...

Yes.

>> For pg_dump --binary-upgrade, you need a statement like SELECT
>> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
>> each pg_class entry. So you can't easily have a single SQL statement
>> creating multiple such entries.
>
> Hmm, do you mean "pg_dump cannot emit" such a SQL or there shouldn't be one in the first place?

I mean that the binary upgrade script needs to set the OID for every pg_class object being restored, and it does that by stashing away up to one (1) pg_class OID before each CREATE statement. If a single CREATE statement generates multiple pg_class entries, this method doesn't work.

> Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice we end up making about how we treat subpartitioning key (one of the points discussed above)

Yes, I think so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
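The binary-upgrade constraint described above can be sketched as a toy simulation (illustrative Python only; the function names are hypothetical stand-ins, not the real binary_upgrade functions): the dump stashes at most one preassigned OID before each CREATE, so a CREATE that produced two pg_class entries would find nothing stashed for the second.

```python
# Simulated one-slot OID stash, mirroring "up to one (1) pg_class OID
# before each CREATE statement".
_next_oid = None

def set_next_pg_class_oid(oid):
    """Stash the OID for the next pg_class entry (at most one)."""
    global _next_oid
    _next_oid = oid

def make_pg_class_entries(count):
    """Simulate a CREATE producing `count` pg_class entries; each entry
    must consume a previously stashed OID."""
    global _next_oid
    assigned = []
    for _ in range(count):
        if _next_oid is None:
            raise RuntimeError("no preassigned OID for extra pg_class entry")
        assigned.append(_next_oid)
        _next_oid = None
    return assigned
```

With one stashed OID, a single-entry CREATE restores cleanly, while a two-entry CREATE fails on the second entry, which is why one statement per pg_class entry is required.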
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, I don't feel we should not provide anyway to take dump
> for individual partition but not at level of independent table.
> May be something like --table <table_name>
> --partition <partition_name>.
>
> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked).

I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works.

Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks numbered 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/08/2014 11:05 AM, Robert Haas wrote:
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point. I think that ship has already sailed: partitions ARE
> tables. We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me. If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't think it's feasible to drop inheritance partitioning at this point; too many users exploit a lot of peculiarities of that system which wouldn't be supported by any other system. So any new partitioning system we're talking about would be *in addition* to the existing system. Hence my prior email trying to make sure that a new proposed system is sufficiently different from the existing one to be worthwhile.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative. If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible.

On the other hand, as long as partitions exist exclusively at the planner layer, we can't fix the existing major shortcomings of inheritance partitioning, such as its inability to handle expressions. Again, see previous.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2014-12-08 14:05:52 -0500, Robert Haas wrote:
> On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Sure, I don't feel we should not provide anyway to take dump
> > for individual partition but not at level of independent table.
> > May be something like --table <table_name>
> > --partition <partition_name>.
> >
> > In general, I think we should try to avoid exposing that partitions are
> > individual tables as that might hinder any future enhancement in that
> > area (example if we someone finds a different and better way to
> > arrange the partition data, then due to the currently exposed syntax,
> > we might feel blocked).
>
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point. I think that ship has already sailed: partitions ARE
> tables. We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me. If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't think this makes much sense. That'd severely restrict our ability to do stuff for a long time. Unless we can absolutely rely on the fact that partitions have the same schema and such we'll rob ourselves of significant optimization opportunities.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative. If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible. There are lots
> of places that we assume that a heap consists of blocks number 0 up
> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> and pieces of the way index vacuuming is handled, which in turn bleeds
> into Hot Standby. You can't just decide that now block numbers are
> going to be replaced by some more complex structure, or even that
> they're now going to be nonlinear, without breaking a huge amount of
> stuff.

I think you're making a wrong fundamental assumption here. Just because we define partitions to not be full relations doesn't mean we have to treat them entirely separate. I don't see why a pg_class.relkind = 'p' entry would be something actually problematic. That'd easily allow to treat them differently in all the relevant places (all of ALTER TABLE, DML et al) and still allow all of the current planner/executor infrastructure. We can even allow direct SELECTs from individual partitions if we want to - that's trivial to achieve.

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/08/2014 11:05 AM, Robert Haas wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point. I think that ship has already sailed: partitions ARE
>> tables. We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me. If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't thing its feasible to drop inheritance partitioning at this
> point; too many user exploit a lot of peculiarities of that system which
> wouldn't be supported by any other system. So any new partitioning
> system we're talking about would be *in addition* to the existing
> system. Hence my prior email trying to make sure that a new proposed
> system is sufficiently different from the existing one to be worthwhile.

I think any new partitioning system should keep the good things about the existing system, of which there are some, and not try to reinvent the wheel. The yardstick for a new system shouldn't be "is this different enough?" but "does this solve the problems without creating new ones?".

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative. If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible.
>
> On the other hand, as long as partitions exist exclusively at the
> planner layer, we can't fix the existing major shortcomings of
> inheritance partitioning, such as its inability to handle expressions.
> Again, see previous.

Huh?
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point. I think that ship has already sailed: partitions ARE
>> tables. We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me. If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't think this makes much sense. That'd severely restrict our
> ability to do stuff for a long time. Unless we can absolutely rely on
> the fact that partitions have the same schema and such we'll rob
> ourselves of significant optimization opportunities.

I don't think that's mutually exclusive with the idea of partitions-as-tables. I mean, you can add code to the ALTER TABLE path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) wherever you want.

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative. If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible. There are lots
>> of places that we assume that a heap consists of blocks number 0 up
>> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
>> and pieces of the way index vacuuming is handled, which in turn bleeds
>> into Hot Standby. You can't just decide that now block numbers are
>> going to be replaced by some more complex structure, or even that
>> they're now going to be nonlinear, without breaking a huge amount of
>> stuff.
>
> I think you're making a wrong fundamental assumption here. Just because
> we define partitions to not be full relations doesn't mean we have to
> treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> entry would be something actually problematic. That'd easily allow to
> treat them differently in all the relevant places (all of ALTER TABLE,
> DML et al) and still allow all of the current planner/executor
> infrastructure. We can even allow direct SELECTs from individual
> partitions if we want to - that's trivial to achieve.

We may just be using different words to talk about more-or-less the same thing, then. What I'm saying is that I want these things to keep working:

- Indexes.
- Merge append and any other inheritance-aware query planning techniques.
- Direct access to individual partitions to bypass tuple-routing/query-planning overhead.

I am not necessarily saying that I have a problem with putting other restrictions on partitions, like requiring them to have the same tuple descriptor or the same ACLs as their parents. Those kinds of details bear discussion, but I'm not intrinsically opposed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
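The relkind-based restriction sketched in this exchange might look roughly like the following toy simulation (illustrative Python, not PostgreSQL's C code; the relkind letters follow the proposal upthread, everything else is an assumption):

```python
RELKIND_RELATION = "r"         # ordinary table
RELKIND_PARTITIONED_REL = "P"  # partitioned parent, directly addressable
RELKIND_PARTITION = "p"        # partition: only limited DDL allowed

def check_alter_table(relkind):
    # One up-front relkind check can cover every DDL path, instead of
    # scattering per-path "if not partitioning root then error" guards.
    if relkind == RELKIND_PARTITION:
        raise PermissionError("cannot run ALTER TABLE on a partition directly")
    return "ok"
```

Direct SELECT from a partition could then simply skip this check, which is the sense in which allowing read access while restricting DDL is said to be trivial.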
On 2014-12-08 14:48:50 -0500, Robert Haas wrote:
> On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I guess I'm in disagreement with you - and, perhaps - the majority on
> >> this point. I think that ship has already sailed: partitions ARE
> >> tables. We can try to make it less necessary for users to ever look
> >> at those tables as separate objects, and I think that's a good idea.
> >> But trying to go from a system where partitions are tables, which is
> >> what we have today, to a system where they are not seems like a bad
> >> idea to me. If we make a major break from how things work today,
> >> we're going to end up having to reimplement stuff that already works.
> >
> > I don't think this makes much sense. That'd severely restrict our
> > ability to do stuff for a long time. Unless we can absolutely rely on
> > the fact that partitions have the same schema and such we'll rob
> > ourselves of significant optimization opportunities.
>
> I don't think that's mutually exclusive with the idea of
> partitions-as-tables. I mean, you can add code to the ALTER TABLE
> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> wherever you want.

That'll be a lot of places you'll need to touch. More fundamentally: Why should we name something a table that's not one?

> >> Besides, I haven't really seen anyone propose something that sounds
> >> like a credible alternative. If we could make partition objects
> >> things that the storage layer needs to know about but the query
> >> planner doesn't need to understand, that'd be maybe worth considering.
> >> But I don't see any way that that's remotely feasible. There are lots
> >> of places that we assume that a heap consists of blocks number 0 up
> >> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> >> and pieces of the way index vacuuming is handled, which in turn bleeds
> >> into Hot Standby. You can't just decide that now block numbers are
> >> going to be replaced by some more complex structure, or even that
> >> they're now going to be nonlinear, without breaking a huge amount of
> >> stuff.
> >
> > I think you're making a wrong fundamental assumption here. Just because
> > we define partitions to not be full relations doesn't mean we have to
> > treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> > entry would be something actually problematic. That'd easily allow to
> > treat them differently in all the relevant places (all of ALTER TABLE,
> > DML et al) and still allow all of the current planner/executor
> > infrastructure. We can even allow direct SELECTs from individual
> > partitions if we want to - that's trivial to achieve.
>
> We may just be using different words to talk about more-or-less the
> same thing, then.

That might be.

> What I'm saying is that I want these things to keep working:
> - Indexes.

Nobody argued against that I think.

> - Merge append and any other inheritance-aware query planning
> techniques.

Same here.

> - Direct access to individual partitions to bypass
> tuple-routing/query-planning overhead.

I think that might be ok in some cases, but in general I'd be very wary to allow that. I think it might be ok to allow direct read access, but everything else I'd be opposed to. I'd much rather go the route of allowing too few things and then gradually opening up if required than the other way round (as that pretty much will never happen because it'll break deployed systems).

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/08/2014 11:40 AM, Robert Haas wrote:
>> I don't thing its feasible to drop inheritance partitioning at this
>> point; too many user exploit a lot of peculiarities of that system which
>> wouldn't be supported by any other system. So any new partitioning
>> system we're talking about would be *in addition* to the existing
>> system. Hence my prior email trying to make sure that a new proposed
>> system is sufficiently different from the existing one to be worthwhile.
>
> I think any new partitioning system should keep the good things about
> the existing system, of which there are some, and not try to reinvent
> the wheel. The yard stick for a new system shouldn't be "is this
> different enough?" but "does this solve the problems without creating
> new ones?".

It's unrealistic to assume that a new system would support all of the features of the existing inheritance partitioning without restriction. In fact, I'd say that such a requirement amounts to saying "don't bother trying".

For example, inheritance allows us to have different indexes, constraints, and even columns on partitions. We can have overlapping partitions, and heterogenous multilevel partitioning (partition this customer by month but partition that customer by week). We can even add triggers on individual partitions to reroute data away from a specific partition. A requirement to support all of these peculiar uses of inheritance partitioning would doom any new partitioning project.

>>> Besides, I haven't really seen anyone propose something that sounds
>>> like a credible alternative. If we could make partition objects
>>> things that the storage layer needs to know about but the query
>>> planner doesn't need to understand, that'd be maybe worth considering.
>>> But I don't see any way that that's remotely feasible.
>>
>> On the other hand, as long as partitions exist exclusively at the
>> planner layer, we can't fix the existing major shortcomings of
>> inheritance partitioning, such as its inability to handle expressions.
>> Again, see previous.
>
> Huh?

Explained in the other email I posted on this thread.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think that's mutually exclusive with the idea of
>> partitions-as-tables. I mean, you can add code to the ALTER TABLE
>> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
>> wherever you want.
>
> That'll be a lot of places you'll need to touch. More fundamentally: Why
> should we name something a table that's not one?

Well, I'm not convinced that it isn't one. And adding a new relkind will involve a bunch of code churn, too. But I don't much care to pre-litigate this: when someone has got a patch, we can either agree that the approach is OK or argue that it is problematic because X. I think we need to hammer down the design in broad strokes first, and I'm not sure we're totally there yet.

>> - Direct access to individual partitions to bypass
>> tuple-routing/query-planning overhead.
>
> I think that might be ok in some cases, but in general I'd be very wary
> to allow that. I think it might be ok to allow direct read access, but
> everything else I'd be opposed. I'd much rather go the route of allowing
> to few things and then gradually opening up if required than the other
> way round (as that pretty much will never happen because it'll break
> deployed systems).

Why?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I think any new partitioning system should keep the good things about
>> the existing system, of which there are some, and not try to reinvent
>> the wheel. The yard stick for a new system shouldn't be "is this
>> different enough?" but "does this solve the problems without creating
>> new ones?".
>
> It's unrealistic to assume that a new system would support all of the
> features of the existing inheritance partitioning without restriction.
> In fact, I'd say that such a requirement amounts to saying "don't
> bother trying".
>
> For example, inheritance allows us to have different indexes,
> constraints, and even columns on partitions. We can have overlapping
> partitions, and heterogenous multilevel partitioning (partition this
> customer by month but partition that customer by week). We can even add
> triggers on individual partitions to reroute data away from a specific
> partition. A requirement to support all of these peculiar uses of
> inheritance partitioning would doom any new partitioning project.

I don't think it has to be possible to support every use case that we can support today; clearly, a part of the goal here is to be LESS general so that we can be more performant. But I think the urge to change too many things at once had better be tempered by a clear-eyed vision of what can reasonably be accomplished in one patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/8/14, 1:05 PM, Robert Haas wrote:
> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative. If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible. There are lots
> of places that we assume that a heap consists of blocks number 0 up
> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> and pieces of the way index vacuuming is handled, which in turn bleeds
> into Hot Standby. You can't just decide that now block numbers are
> going to be replaced by some more complex structure, or even that
> they're now going to be nonlinear, without breaking a huge amount of
> stuff.

Agreed, but it's possible to keep a block/CTID interface while doing something different on the disk.

If you think about it, partitioning is really a hack anyway. It clutters up your logical set implementation with a bunch of physical details. What most people really want when they implement partitioning is simply data locality.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 12/8/14, 12:26 PM, Josh Berkus wrote:
> 4. Creation Locking Problem
> high probability of lock pile-ups whenever a new partition is created on
> demand due to multiple backends trying to create the partition at the
> same time.
> Not Addressed?

Do users actually try and create new partitions during DML? That sounds
doomed to failure in pretty much any system...

> 6. Unique Index Problem
> Cannot create a unique index across multiple partitions, which prevents
> the partitioned table from being FK'd.
> Not Addressed
> (but could be addressed in the future)

And would be extremely useful even with simple inheritance, let alone
partitioning...

> 9. Hibernate Problem
> When using the trigger method, inserts into the master partition return
> 0, which Hibernate and some other ORMs regard as an insert failure.
> Addressed.

It would be really nice to address this with regular inheritance too...

> 11. Hash Partitioning
> Some users would prefer to partition into a fixed number of
> hash-allocated partitions.
> Not Addressed.

Though, you should be able to do that in either system if you bother to
define your own hash in a BEFORE trigger...

> A. COPY/ETL then attach
> In inheritance partitioning, you can easily build a partition outside
> the master and then "attach" it, allowing for minimal disturbance of
> concurrent users. Could be addressed in the future.

How much of the desire for this is because our current "row routing"
solutions are very slow? I suspect that's the biggest reason, and
hopefully Alvaro's proposal mostly eliminates it.

> B. Catchall Partition
> Many partitioning schemes currently contain a "catchall" partition which
> accepts rows outside of the range of the partitioning scheme, due to bad
> input data. Probably not handled on purpose; Alvaro is proposing that
> we reject these instead, or create the partitions on demand, which is a
> legitimate approach.
>
> C. Asymmetric Partitioning / NULLs in partition column
> This is the classic Active/Inactive By Month setup for partitions.
> Could be addressed via special handling for NULL/infinity in the
> partitioned column.

If we allowed for a "catchall partition" and supported normal
inheritance/triggers on that partition then users could continue to do
whatever they needed with data that didn't fit the "normal"
partitioning pattern.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 12/08/2014 02:12 PM, Jim Nasby wrote:
> On 12/8/14, 12:26 PM, Josh Berkus wrote:
>> 4. Creation Locking Problem
>> high probability of lock pile-ups whenever a new partition is created on
>> demand due to multiple backends trying to create the partition at the
>> same time.
>> Not Addressed?
>
> Do users actually try and create new partitions during DML? That sounds
> doomed to failure in pretty much any system...

There is no question that it would be easier for users to create
partitions on demand automatically, particularly if you're partitioning
by something other than time. For a particular case, consider users on
RDS, which has no cron jobs for creating new partitions; it's on demand
or manually.

It's quite possible that there is no good way to work out the locking
for on-demand partitions, but *if* we're going to have a 2nd partition
system, I think it's important to at least discuss the problems with
on-demand creation.

>> 11. Hash Partitioning
>> Some users would prefer to partition into a fixed number of
>> hash-allocated partitions.
>> Not Addressed.
>
> Though, you should be able to do that in either system if you bother to
> define your own hash in a BEFORE trigger...

That doesn't do you any good with the SELECT query, unless you change
your middleware to add a hash(column) to every query, which would be
really hard to do for joins.

>> A. COPY/ETL then attach
>> In inheritance partitioning, you can easily build a partition outside
>> the master and then "attach" it, allowing for minimal disturbance of
>> concurrent users. Could be addressed in the future.
>
> How much of the desire for this is because our current "row routing"
> solutions are very slow? I suspect that's the biggest reason, and
> hopefully Alvaro's proposal mostly eliminates it.

That doesn't always work, though. In some cases the partition is being
built using some fairly complex logic (think of partitions which are
based on matviews) and there's no fast way to create the new data.
Again, this is an acceptable casualty of an improved design, but if it
will be so, we should consciously decide that.

>> B. Catchall Partition
>> Many partitioning schemes currently contain a "catchall" partition which
>> accepts rows outside of the range of the partitioning scheme, due to bad
>> input data. Probably not handled on purpose; Alvaro is proposing that
>> we reject these instead, or create the partitions on demand, which is a
>> legitimate approach.
>>
>> C. Asymmetric Partitioning / NULLs in partition column
>> This is the classic Active/Inactive By Month setup for partitions.
>> Could be addressed via special handling for NULL/infinity in the
>> partitioned column.
>
> If we allowed for a "catchall partition" and supported normal
> inheritance/triggers on that partition then users could continue to do
> whatever they needed with data that didn't fit the "normal" partitioning
> pattern.

That sounds to me like it would fall under the heading of "impossible
levels of backwards-compatibility".

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >> I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
>
> I don't understand. If you want to range partition on columns (a, b),
> you say that, say, tuples with (a, b) values less than (100, 200) go
> here and the rest go elsewhere. For list partitioning, you say that,
> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> rest go elsewhere. I'm not sure how useful that is but it's not
> illogical.

In case of list partitioning, 100 and 200 would respectively be one of
the values in the lists of allowed values for a and b. I thought his
concern is whether this "list of values for each column in partkey" is
as convenient to store and manipulate as range partvalues.

Thanks,
Amit
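[Editor's note: the multi-column range bound Robert describes amounts to a lexicographic, row-wise comparison of the key against the bound. A minimal sketch of that semantics, with invented names and not taken from PostgreSQL sources:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Lexicographic comparison of a two-column key (a, b) against a range
 * partition bound: (a, b) < (100, 200) holds when a < 100, or when
 * a == 100 and b < 200.  Hypothetical illustration only.
 */
static bool
key_less_than(int a, int b, int bound_a, int bound_b)
{
    if (a != bound_a)
        return a < bound_a;
    return b < bound_b;
}
```

Under this rule (99, 999) falls below the bound (100, 200) while (100, 200) itself does not; the list-partitioning analogue Robert mentions would instead require exact equality on every column.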
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > From: Robert Haas [mailto:robertmhaas@gmail.com]
> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >> I guess you could list or hash partition on multiple columns, too.
> > >
> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
> >
> > I don't understand. If you want to range partition on columns (a, b),
> > you say that, say, tuples with (a, b) values less than (100, 200) go
> > here and the rest go elsewhere. For list partitioning, you say that,
> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> > rest go elsewhere. I'm not sure how useful that is but it's not
> > illogical.
> >
>
> In case of list partitioning, 100 and 200 would respectively be one of the values in lists of allowed values for a and b. I thought his concern is whether this "list of values for each column in partkey" is as convenient to store and manipulate as range partvalues.
>
Yeah and also how would user specify the values, as an example
assume that table is partitioned on monthly_salary, so partition
definition would look:
PARTITION BY LIST(monthly_salary)
(
PARTITION salary_less_than_thousand VALUES(300, 900),
PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
...
)
Now if user wants to define multi-column Partition based on
monthly_salary and annual_salary, how do we want him to
specify the values. Basically how to distinguish which values
belong to first column key and which one's belong to second
column key.
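[Editor's note: one reading of the question above is that, if each partition-key column carries its own list of allowed values, routing reduces to a per-column membership test. A sketch under that assumption; all names and the flat-array representation are invented for illustration:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * One allowed-values list per partition-key column.  A tuple is routed
 * to the partition only if every column's value appears in the
 * corresponding column's list.  Hypothetical sketch of the semantics
 * under discussion, not PostgreSQL code.
 */
static bool
value_in_list(int value, const int *list, size_t nvalues)
{
    for (size_t i = 0; i < nvalues; i++)
        if (list[i] == value)
            return true;
    return false;
}

static bool
tuple_matches_partition(const int *keyvals, const int *const *lists,
                        const size_t *nvalues, size_t ncolumns)
{
    /* Every column must match its own list. */
    for (size_t i = 0; i < ncolumns; i++)
        if (!value_in_list(keyvals[i], lists[i], nvalues[i]))
            return false;
    return true;
}
```

Note that independent per-column lists implicitly accept the whole cross product of the lists, which is part of why multi-column list partitioning is awkward to specify.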
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables. I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
>
> Well, I'm not convinced that it isn't one. And adding a new relkind
> will involve a bunch of code churn, too. But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X. I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.
That's right. I think at this point defining the top-level
behaviour/design is very important to proceed; we can decide on the
better implementation approach afterwards (maybe once an initial patch
is ready, because it might not be major work to do it either way). So
here's where we are on this point so far, as per my understanding: I
think that direct operations should be prohibited on partitions, you
think that they should be allowed, and Andres thinks that it might be
better to allow direct operations on partitions for reads only.
>
> >> - Direct access to individual partitions to bypass
> >> tuple-routing/query-planning overhead.
> >
> > I think that might be ok in some cases, but in general I'd be very wary
> > to allow that. I think it might be ok to allow direct read access, but
> > everything else I'd be opposed. I'd much rather go the route of allowing
> > to few things and then gradually opening up if required than the other
> > way round (as that pretty much will never happen because it'll break
> > deployed systems).
>
> Why?
>
Because I think it will be difficult for users to write and maintain
more such code, which is one of the complaints with the previous
system, where the user needs to write triggers to route the tuple to
the appropriate partition.
I think as a first step we should try to improve the tuple routing
algorithm so that it is not a pain for users, or at least so that it is
at par with some of the other competitive database systems; if we are
not able to come up with such an implementation, then maybe we can
think of providing direct access as a special way for users to improve
performance.
Another reason is that fundamentally partitions are managed internally
to divide the user data in a way that makes access cheaper, and we take
the specifications for defining the partitions from users; allowing
operations on these internally managed objects could lead to users
writing quite some code to do what the database actually does
internally. For example, TOAST tables are internally used to manage
large tuples, yet we don't want users to directly perform DML on those
tables.
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>> > I don't understand. If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere. For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere. I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values. Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

Amit, in one of my earlier replies to your question of why we may not
want to implement multi-column list partitioning (lack of user interest
in the feature or possible complexity of the code), I tried to explain
how that may work if we do choose to go that way. Basically, something
we may call PartitionColumnValue should be such that the above issue
can be suitably sorted out.

For example, a partition defining/bounding value would be a
pg_node_tree representation of a List of one of the (say) following
parse nodes, as appropriate:

typedef struct PartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    char        partkind;
    Node       *partrangelower;
    Node       *partrangeupper;
    List       *partlistvalues;
} PartitionColumnValue;

OR separately,

typedef struct RangePartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    Node       *partrangelower;
    Node       *partrangeupper;
} RangePartitionColumnValue;

&

typedef struct ListPartitionColumnValue
{
    NodeTag     type;
    Oid        *partitionid;
    char       *partcolname;
    List       *partlistvalues;
} ListPartitionColumnValue;

where a partition definition would look like:

typedef struct PartitionDef
{
    NodeTag     type;
    RangeVar    partition;
    RangeVar    parentrel;
    char       *kind;
    Node       *values;
    List       *options;
    char       *tablespacename;
} PartitionDef;

PartitionDef.values is an (ordered) List of PartitionColumnValue, each
of which corresponds to one column in the partition key, in that order.

We should be able to devise a way to load the pg_node_tree
representation of PartitionDef.values (on-disk
pg_partition_def.partvalues) into relcache using a "suitable data
structure" so that it becomes readily usable in the variety of contexts
in which we are interested in using this information.

Regards,
Amit
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>> > From: Robert Haas [mailto:robertmhaas@gmail.com]
>> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> > >> I guess you could list or hash partition on multiple columns, too.
>> > >
>> > > How would you distinguish values in list partition for multiple
>> > > columns? I mean for range partition, we are sure there will
>> > > be either one value for each column, but for list it could
>> > > be multiple and not fixed for each partition, so I think it will not
>> > > be easy to support the multicolumn partition key for list
>> > > partitions.
>> >
>> > I don't understand. If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere. For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere. I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values. Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

Perhaps you are talking about "syntactic" difficulties that I totally
missed in my other reply to this mail?

Can we represent the same data by rather using a subpartitioning
scheme? ISTM, semantics would remain the same.

... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

Thanks,
Amit
Josh Berkus wrote:

Hi,

> Pardon me for jumping into this late. In general, I like Alvaro's
> approach.

Please don't call this "Alvaro's approach", as I'm not involved in this
anymore. Amit Langote has taken ownership of it now. While some
resemblance to what I originally proposed might remain, I haven't kept
track of how this has evolved, and this might be a totally different
thing now. Or not. Anyway, I just wanted to comment on a single point:

> 6. Unique Index Problem
> Cannot create a unique index across multiple partitions, which prevents
> the partitioned table from being FK'd.
> Not Addressed
> (but could be addressed in the future)

I think it's unlikely that we will ever create a unique index that
spans all the partitions, actually. Even if there are some wild ideas
on how to implement such a thing, the number of difficult issues that
no one knows how to attack seems too large. I would perhaps be thinking
of allowing foreign keys to be defined on column sets that are prefixed
by partition keys; unique indexes must exist on all partitions on the
same columns, including the partition keys. (Perhaps make an extra
exception that if a partition allows a single value for the partition
column, that column need not be part of the unique index.)

> 10. Scaling Problem
> Inheritance partitioning becomes prohibitively slow for the planner at
> somewhere between 100 and 500 partitions depending on various factors.
> No idea?

At least it was my intention to make the system scale to a huge number
of partitions, but this requires some forward thinking (such as
avoiding loading the index list of all of them, or even opening all of
them at the planner stage), and I think it would be defeated if we want
to keep all the generality of the inheritance-based approach.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Kapila wrote:

> On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
> > wrote:
> > >> I don't think that's mutually exclusive with the idea of
> > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE
> > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> > >> wherever you want.
> > >
> > > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > > should we name something a table that's not one?
> >
> > Well, I'm not convinced that it isn't one. And adding a new relkind
> > will involve a bunch of code churn, too. But I don't much care to
> > pre-litigate this: when someone has got a patch, we can either agree
> > that the approach is OK or argue that it is problematic because X. I
> > think we need to hammer down the design in broad strokes first, and
> > I'm not sure we're totally there yet.
>
> That's right, I think at this point defining the top level behaviour/design
> is very important to proceed, we can decide about the better
> implementation approach afterwards (may be once initial patch is ready,
> because it might not be a major work to do it either way). So here's where
> we are on this point till now as per my understanding, I think that direct
> operations should be prohibited on partitions, you think that they should be
> allowed and Andres think that it might be better to allow direct operations
> on partitions for Read.

FWIW in my original proposal I was rejecting some things that after
further consideration turn out to be possible to allow; for instance,
directly referencing individual partitions in COPY. We could allow
something like

COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT

or maybe

COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT

and this would emit the whole partition for year 2000 of table
lineitems, and only that (the value is just computed on the fly to fit
the partitioning constraints for that individual partition). Then
pg_dump would be able to dump each and every partition separately.

In a similar way we could have COPY FROM allow input into individual
partitions so that such a dump can be restored in parallel for each
partition.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/09/2014 12:17 AM, Amit Langote wrote:
>> Now if user wants to define multi-column Partition based on
>> monthly_salary and annual_salary, how do we want him to
>> specify the values. Basically how to distinguish which values
>> belong to first column key and which one's belong to second
>> column key.
>
> Perhaps you are talking about "syntactic" difficulties that I totally
> missed in my other reply to this mail?
>
> Can we represent the same data by rather using a subpartitioning
> scheme? ISTM, semantics would remain the same.
>
> ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

... or just use arrays.

PARTITION BY LIST ( monthly_salary, annual_salary )
PARTITION salary_small VALUES ({[300,400],[5000,6000]})
) ....

... but that begs the question of how partition by list over two
columns (or more) would even work? You'd need an a*b number of
partitions, and the user would be pretty much certain to miss a few
value combinations. Maybe we should just restrict list partitioning to
a single column for a first release, and wait and see if people ask for
more?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 12/8/14, 5:19 PM, Josh Berkus wrote:
> On 12/08/2014 02:12 PM, Jim Nasby wrote:
>> On 12/8/14, 12:26 PM, Josh Berkus wrote:
>>> 4. Creation Locking Problem
>>> high probability of lock pile-ups whenever a new partition is created on
>>> demand due to multiple backends trying to create the partition at the
>>> same time.
>>> Not Addressed?
>>
>> Do users actually try and create new partitions during DML? That sounds
>> doomed to failure in pretty much any system...
>
> There is no question that it would be easier for users to create
> partitions on demand automatically. Particularly if you're partitioning
> by something other than time. For a particular case, consider users on
> RDS, which has no cron jobs for creating new partitions; it's on demand
> or manually.
>
> It's quite possible that there is no good way to work out the locking
> for on-demand partitions though, but *if* we're going to have a 2nd
> partition system, I think it's important to at least discuss the
> problems with on-demand creation.

Yeah, we should discuss it. Perhaps the right answer here may be our
own job scheduler, something a lot of folks want anyway.

>>> 11. Hash Partitioning
>>> Some users would prefer to partition into a fixed number of
>>> hash-allocated partitions.
>>> Not Addressed.
>>
>> Though, you should be able to do that in either system if you bother to
>> define your own hash in a BEFORE trigger...
>
> That doesn't do you any good with the SELECT query, unless you change
> your middleware to add a hash(column) to every query. Which would be
> really hard to do for joins.
>
>>> A. COPY/ETL then attach
>>> In inheritance partitioning, you can easily build a partition outside
>>> the master and then "attach" it, allowing for minimal disturbance of
>>> concurrent users. Could be addressed in the future.
>>
>> How much of the desire for this is because our current "row routing"
>> solutions are very slow? I suspect that's the biggest reason, and
>> hopefully Alvaro's proposal mostly eliminates it.
>
> That doesn't always work, though. In some cases the partition is being
> built using some fairly complex logic (think of partitions which are
> based on matviews) and there's no fast way to create the new data.
> Again, this is an acceptable casualty of an improved design, but if it
> will be so, we should consciously decide that.

Is there an example you can give here? If the scheme is that
complicated, I'm failing to see how you're supposed to do things like
partition elimination.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
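[Editor's note: the hash-partitioning point above reduces to computing hash(key) mod N on the insert path, and the SELECT-side objection is that the planner cannot prune partitions unless it applies the same function to the query's constants; a hand-rolled BEFORE trigger is invisible to it. A sketch of the routing step; FNV-1a is used purely as a stand-in, since the thread leaves the hash choice open:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Route a key to one of a fixed number of partitions by hashing.
 * FNV-1a 32-bit is illustrative only; whatever hash an actual
 * implementation would use is an open question in this thread.
 */
static uint32_t
fnv1a(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;           /* FNV offset basis */

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

static uint32_t
partition_for_key(const unsigned char *key, size_t len, uint32_t nparts)
{
    return fnv1a(key, len) % nparts;
}
```

Routing is deterministic, so the same key always lands in the same partition; pruning a SELECT requires the planner to evaluate this same mapping for the query's key constants.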
On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/09/2014 12:17 AM, Amit Langote wrote:
> >> Now if user wants to define multi-column Partition based on
> >> > monthly_salary and annual_salary, how do we want him to
> >> > specify the values. Basically how to distinguish which values
> >> > belong to first column key and which one's belong to second
> >> > column key.
> >> >
> > Perhaps you are talking about "syntactic" difficulties that I totally missed in my other reply to this mail?
> >
> > Can we represent the same data by rather using a subpartitioning scheme? ISTM, semantics would remain the same.
> >
> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>
Using SUBPARTITION is not the answer for multi-column partitioning.
I think if we have to support it for list partitioning then something
along the lines of what Josh has mentioned below could work out, but I
don't think it is important to support multi-column partitioning for
list at this stage.
> ... or just use arrays.
>
> PARTITION BY LIST ( monthly_salary, annual_salary )
> PARTITION salary_small VALUES ({[300,400],[5000,6000]})
> ) ....
>
> ... but that begs the question of how partition by list over two columns
> (or more) would even work? You'd need an a*b number of partitions, and
> the user would be pretty much certain to miss a few value combinations.
> Maybe we should just restrict list partitioning to a single column for
> a first release, and wait and see if people ask for more?
>
I also think we should not support multi-column list partitioning in
the first release.
On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Amit Kapila wrote:
> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
> > wrote:
> > > >> I don't think that's mutually exclusive with the idea of
> > > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE
> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> > > >> wherever you want.
> > > >
> > > > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > > > should we name something a table that's not one?
> > >
> > > Well, I'm not convinced that it isn't one. And adding a new relkind
> > > will involve a bunch of code churn, too. But I don't much care to
> > > pre-litigate this: when someone has got a patch, we can either agree
> > > that the approach is OK or argue that it is problematic because X. I
> > > think we need to hammer down the design in broad strokes first, and
> > > I'm not sure we're totally there yet.
> >
> > That's right, I think at this point defining the top level behaviour/design
> > is very important to proceed, we can decide about the better
> > implementation approach afterwards (may be once initial patch is ready,
> > because it might not be a major work to do it either way). So here's where
> > we are on this point till now as per my understanding, I think that direct
> > operations should be prohibited on partitions, you think that they should be
> > allowed and Andres think that it might be better to allow direct operations
> > on partitions for Read.
>
> FWIW in my original proposal I was rejecting some things that after
> further consideration turn out to be possible to allow; for instance
> directly referencing individual partitions in COPY. We could allow
> something like
>
> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> or maybe
> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>
or
COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
COPY [TABLE] lineitems PARTITION <part_1,part_2,> TO STDOUT
I think we should try to support operations on partitions via the main
table wherever it is required.
On Wed, Dec 10, 2014 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 12/09/2014 12:17 AM, Amit Langote wrote:
>> >> Now if user wants to define multi-column Partition based on
>> >> monthly_salary and annual_salary, how do we want him to
>> >> specify the values. Basically how to distinguish which values
>> >> belong to first column key and which one's belong to second
>> >> column key.
>> >
>> > Perhaps you are talking about "syntactic" difficulties that I totally
>> > missed in my other reply to this mail?
>> >
>> > Can we represent the same data by rather using a subpartitioning scheme?
>> > ISTM, semantics would remain the same.
>> >
>> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>
> Using SUBPARTITION is not the answer for multi-column partition,
> I think if we have to support it for List partitioning then something
> on lines what Josh has mentioned below could workout, but I don't
> think it is important to support multi-column partition for List at this
> stage.

Yeah, I realize multicolumn list partitioning and list-list composite
partitioning are different things in many respects. And given how awkward
multicolumn list partitioning is looking to implement, I also think we
should only allow a single column in a list partition key.

>> ... or just use arrays.
>>
>> PARTITION BY LIST ( monthly_salary, annual_salary )
>> PARTITION salary_small VALUES ({[300,400],[5000,6000]})
>> ) ....
>>
>> ... but that begs the question of how partition by list over two columns
>> (or more) would even work? You'd need an a*b number of partitions, and
>> the user would be pretty much certain to miss a few value combinations.
>> Maybe we should just restrict list partitioning to a single column for
>> a first release, and wait and see if people ask for more?
>
> I also think we should not support multi-column list partition in first
> release.

Yes.
Thanks, Amit
On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Amit Kapila wrote:
>> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > > >> I don't think that's mutually exclusive with the idea of
>> > > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE
>> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
>> > > >> wherever you want.
>> > > >
>> > > > That'll be a lot of places you'll need to touch. More fundamentally:
>> > > > Why should we name something a table that's not one?
>> > >
>> > > Well, I'm not convinced that it isn't one. And adding a new relkind
>> > > will involve a bunch of code churn, too. But I don't much care to
>> > > pre-litigate this: when someone has got a patch, we can either agree
>> > > that the approach is OK or argue that it is problematic because X. I
>> > > think we need to hammer down the design in broad strokes first, and
>> > > I'm not sure we're totally there yet.
>> >
>> > That's right, I think at this point defining the top level behaviour/design
>> > is very important to proceed, we can decide about the better
>> > implementation approach afterwards (may be once initial patch is ready,
>> > because it might not be a major work to do it either way). So here's where
>> > we are on this point till now as per my understanding, I think that direct
>> > operations should be prohibited on partitions, you think that they should be
>> > allowed and Andres think that it might be better to allow direct operations
>> > on partitions for Read.
>>
>> FWIW in my original proposal I was rejecting some things that after
>> further consideration turn out to be possible to allow; for instance
>> directly referencing individual partitions in COPY. We could allow
>> something like
>>
>> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
>> or maybe
>> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>
> or
> COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> COPY [TABLE] lineitems PARTITION <part_1,part_2,> TO STDOUT
>
> I think we should try to support operations on partitions via main
> table wherever it is required.

We can also allow to explicitly name a partition

COPY [TABLE] lineitems PARTITION lineitems_2001 TO STDOUT;
Amit Langote wrote:
> On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >> FWIW in my original proposal I was rejecting some things that after
> >> further consideration turn out to be possible to allow; for instance
> >> directly referencing individual partitions in COPY. We could allow
> >> something like
> >>
> >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> >> or maybe
> >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> >
> > or
> > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > COPY [TABLE] lineitems PARTITION <part_1,part_2,> TO STDOUT
> >
> > I think we should try to support operations on partitions via main
> > table wherever it is required.

Um, I think the only difference is that you added the noise word TABLE,
which we currently don't allow in COPY, and that you added the
possibility of using named partitions, about which see below.

> We can also allow to explicitly name a partition
>
> COPY [TABLE] lineitems PARTITION lineitems_2001 TO STDOUT;

The problem with naming partitions is that the user has to pick names
for every partition, which is tedious and doesn't provide any
significant benefit. The input I had from users of other partitioning
systems was that they very much preferred not to name the partitions at
all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
this syntax is exactly what other systems use; it just seemed the
natural choice.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit. The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
> this syntax is exactly what other systems use; it just seemed the
> natural choice.)

FWIW, Oracle does name partitions. It generates the names automatically
if you don't care to specify them, and the partition names for a given
table live in their own namespace that is separate from the toplevel
object namespace. For example:

CREATE TABLE sales
  ( invoice_no NUMBER,
    sale_year  INT NOT NULL,
    sale_month INT NOT NULL,
    sale_day   INT NOT NULL )
  STORAGE (INITIAL 100K NEXT 50K) LOGGING
  PARTITION BY RANGE ( sale_year, sale_month, sale_day )
  ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 )
      TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K),
    PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 )
      TABLESPACE tsb,
    PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 )
      TABLESPACE tsc,
    PARTITION sales_q4 VALUES LESS THAN ( 2000, 01, 01 )
      TABLESPACE tsd )
  ENABLE ROW MOVEMENT;

I don't think this practice has much to recommend it. We're going to
need a way to refer to individual partitions by name, and I don't see
much benefit in making that name something other than what is stored in
pg_class.relname.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
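(The auto-naming behaviour Robert describes could be sketched roughly as
below. This is purely an illustrative sketch in Python, not PostgreSQL or
Oracle code; the function name and naming pattern are invented.)

```python
def auto_partition_names(table, n, taken=()):
    """Generate n partition names unique within the table's own
    per-table namespace, skipping names the user already picked."""
    names, used, i = [], set(taken), 1
    while len(names) < n:
        candidate = "%s_p%d" % (table, i)
        if candidate not in used:
            used.add(candidate)
            names.append(candidate)
        i += 1
    return names
```

The point of a per-table namespace is that generated names only need to be
unique among a table's partitions, not across the whole schema.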
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Agreed, but it's possible to keep a block/CTID interface while doing
> something different on the disk.

Objection: hand-waving.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
>   PARTITION salary_less_than_thousand VALUES(300, 900),
>   PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
>   ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values. Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

I assume you just add some parentheses.

PARTITION BY LIST (colA, colB)
(PARTITION VALUES ((valA1, valB1), (valA2, valB2), (valA3, valB3))

Multi-column list partitioning may or may not be worth implementing,
but the syntax is not a real problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
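(What the parenthesized syntax implies semantically: each partition holds an
explicit set of value *combinations*, and tuple routing is a membership test.
A minimal illustrative sketch in Python, with made-up partition names and
values, not PostgreSQL code:)

```python
# Each partition's bound is a set of (colA, colB) combinations, as in
# PARTITION VALUES ((valA1, valB1), (valA2, valB2), ...).
PARTITIONS = {
    "part1": {(1, "x"), (2, "y")},
    "part2": {(3, "x"), (4, "z")},
}

def route_row(cola, colb):
    """Return the partition whose value list contains this combination."""
    for name, combos in PARTITIONS.items():
        if (cola, colb) in combos:
            return name
    raise LookupError("no partition accepts (%r, %r)" % (cola, colb))
```

This also shows Josh's worry concretely: any (colA, colB) combination not
explicitly listed has no home partition.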
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables. I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
>
> Well, I'm not convinced that it isn't one. And adding a new relkind
> will involve a bunch of code churn, too. But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X. I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.

In heap_create(), do we create storage for a top level partitioned
table (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is
further sub-partitioned? We might allocate storage for a partition at
some point and then later choose to sub-partition it. In such a case,
perhaps, we would have to move existing data to the storage of
subpartitions and deallocate the partition's storage. In other words,
only leaf relations in a partition hierarchy would have storage. Is
there such a notion within the code for some other purpose, or would
we have to invent it for the partitioning scheme?

Thanks,
Amit
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In heap_create(), do we create storage for a top level partitioned
> table (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is
> further sub-partitioned? We might allocate storage for a partition at
> some point and then later choose to sub-partition it. In such a case,
> perhaps, we would have to move existing data to the storage of
> subpartitions and deallocate the partition's storage. In other words,
> only leaf relations in a partition hierarchy would have storage. Is
> there such a notion within the code for some other purpose, or would
> we have to invent it for the partitioning scheme?

I think it would be advantageous to have storage only for the leaf
partitions, because then you don't need to waste time doing a
zero-block sequential scan of the root as part of the append-plan, an
annoyance of the current system. We have no concept for this right
now; in fact, right now, the relkind fully determines whether a given
relation has storage. One idea is to make the leaves relkind = 'r' and
the interior nodes some new relkind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
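(The leaf-only storage rule could be sketched as below: relkind alone decides
storage, leaves stay relkind 'r', and interior nodes of a partitioning
hierarchy get a new relkind, shown here as a placeholder 'P'. Illustrative
Python, not PostgreSQL code.)

```python
RELKIND_RELATION = "r"           # leaf partition: has its own storage
RELKIND_PARTITIONED_TABLE = "P"  # interior node: no storage (placeholder)

def relation_has_storage(relkind):
    """relkind fully determines whether a relation has storage."""
    return relkind == RELKIND_RELATION

def scan_targets(tree):
    """Append-plan style walk: collect only relations with storage,
    skipping zero-block interior nodes entirely.
    tree: list of (relname, relkind, children) triples."""
    out = []
    for relname, relkind, children in tree:
        if relation_has_storage(relkind):
            out.append(relname)
        out.extend(scan_targets(children))
    return out

example = [("lineitems", RELKIND_PARTITIONED_TABLE,
            [("li_2005", RELKIND_RELATION, []),
             ("li_2006", RELKIND_PARTITIONED_TABLE,
              [("li_2006_q1", RELKIND_RELATION, [])])])]
```

The walk visits only the leaves, which is the point of Robert's remark about
avoiding the zero-block scan of the root.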
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Amit Langote wrote:
>
> > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> > > wrote:
>
> > >> FWIW in my original proposal I was rejecting some things that after
> > >> further consideration turn out to be possible to allow; for instance
> > >> directly referencing individual partitions in COPY. We could allow
> > >> something like
> > >>
> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > >> or maybe
> > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> > >>
> > > or
> > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > > COPY [TABLE] lineitems PARTITION <part_1,part_2,> TO STDOUT
> > >
> > > I think we should try to support operations on partitions via main
> > > table whereever it is required.
>
> Um, I think the only difference is that you added the noise word TABLE
> which we currently don't allow in COPY,
Yeah, we could eliminate the TABLE keyword from this syntax; the reason
I kept it was to make the syntax easier to understand. Currently we
don't have a concept of PARTITION in the COPY syntax, but if we want to
introduce such a concept, then it might be better to have the TABLE
keyword for the sake of syntax clarity.
> and that you added the
> possibility of using named partitions, about which see below.
>
> > We can also allow to explicitly name a partition
> >
> > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;
>
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit. The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all,
It seems to me that both Oracle and DB2 support named partitions, so
even though there are users who don't prefer named partitions, I
suspect an equal or greater number of users will prefer them, for the
sake of migration and because they are already used to such a syntax.
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah and also how would user specify the values, as an example
> > assume that table is partitioned on monthly_salary, so partition
> > definition would look:
> >
> > PARTITION BY LIST(monthly_salary)
> > (
> > PARTITION salary_less_than_thousand VALUES(300, 900),
> > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> > ...
> > )
> >
> > Now if user wants to define multi-column Partition based on
> > monthly_salary and annual_salary, how do we want him to
> > specify the values. Basically how to distinguish which values
> > belong to first column key and which one's belong to second
> > column key.
>
> I assume you just add some parentheses.
>
> PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
> (valA2, valB2), (valA3, valB3))
>
> Multi-column list partitioning may or may not be worth implementing,
> but the syntax is not a real problem.
Yeah, either this way or what Josh has suggested upthread. The main
point was that if at all we want to support multi-column list
partitioning, then we need slightly different syntax; however, I feel
that we can leave multi-column list partitioning out of the first
version.
On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah either this way or what Josh has suggested upthread, the main
> point was that if at all we want to support multi-column list partitioning
> then we need to have slightly different syntax, however I feel that we
> can leave multi-column list partitioning for first version.

Yeah, possibly.

I think we could stand to have a lot more discussion about the syntax
here. So far the idea seems to be to copy what Oracle has, but it's
not clear if we're going to have exactly what Oracle has or something
subtly different. I personally don't find the Oracle syntax very
PostgreSQL-ish. Stuff like "VALUES LESS THAN 500" doesn't sit
especially well with me - less than according to which opclass? Are
we going to insist that partitioning must use the default btree
opclass so that we can use that syntax? That seems kind of lame.

There are lots of interesting things we could do here, e.g.:

CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
CREATE TABLE child_name PARTITION OF parent_name
FOR { (value, ...) [ TO (value, ...) ] } [, ...];

So instead of making a hard distinction between range and list
partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30), (120) TO (130);

Now that might be a crappy idea for various reasons, but the point is
there are a lot of details to be hammered out with the syntax, and
there are several ways we can go wrong. If we choose an
overly-limiting syntax, we're needlessly restricting what can be done.
If we choose an overly-permissive syntax, we'll restrict the
optimization opportunities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
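(The unified FOR ... TO ... proposal can be made concrete: every partition
bound is a list of clauses, each either a single point "(v)" or a range
"(lo) TO (hi)", so list and range partitioning differ only in which clauses
appear. An illustrative Python sketch, assuming one plausible convention of
inclusive lower and exclusive upper bounds; not PostgreSQL code:)

```python
def accepts(clauses, value):
    """clauses: list of (lo, hi) pairs; a single point is encoded (v, v).
    Points are checked by equality; ranges as lo <= value < hi."""
    for lo, hi in clauses:
        if lo == hi:
            if value == lo:
                return True
        elif lo <= value < hi:
            return True
    return False

child  = [(3, 3), (5, 5), (7, 7)]    # FOR (3), (5), (7)
child2 = [(8, 12)]                   # FOR (8) TO (12)
child3 = [(20, 30), (120, 130)]      # FOR (20) TO (30), (120) TO (130)
```

One catalog shape then covers both partitioning kinds, which is what makes
RANGE/LIST reducible to noise words, as discussed downthread.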
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
>
> Yeah, possibly.
>
> I think we could stand to have a lot more discussion about the syntax
> here. So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different. I personally don't find the Oracle syntax very
> PostgreSQL-ish. Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass? Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax? That seems kind of lame.

Syntax like VALUES LESS THAN 500 also means we then have to go figure
out that partition's lower bound based on the upper bound of the
previous one. Forget holes in the range if they matter. I expressed
that concern elsewhere in favour of having available both a range's
lower and upper bounds.

> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);

So, no PARTITION BY [RANGE | LIST] clause, huh?

What we are calling pg_partitioned_rel would obtain the following bits
of information from such a definition of a partitioned relation:

* column(s) to partition on and respective opclass(es)
* the level this partitioned relation lies at in the partitioning
  hierarchy (determining its relkind and storage qualification)

By the way, I am not sure how we define a partitioning key on a
partition (in other words, a subpartitioning key on the corresponding
partitioned relation).
Perhaps (only) via ALTER TABLE on a partition relation?

> CREATE TABLE child_name PARTITION OF parent_name
> FOR { (value, ...) [ TO (value, ...) ] } [, ...];

So it's still a CREATE "TABLE", but the 'PARTITION OF' part turns this
"table" into something having the characteristics of a partition
relation, getting all kinds of new treatment at various places. It
appears there is a redistribution of table characteristics between a
partitioned relation and its partition. We take away storage from the
former and instead give it to the latter. On the other hand, the
latter's data is only accessible through the former, perhaps with
escape routes for direct access via some special syntax attached to
various access commands. We also stand to lose certain abilities with
a partitioned relation, such as not being able to define a unique
constraint (other than what the partition key could potentially help
ensure) or use it as the target of a foreign key constraint (just
reiterating).

What we call pg_partition_def obtains the following bits of information
from such a definition of a partition relation:

* parent relation (partitioned relation this is a partition of)
* partition kind (do we even want to keep carrying this around as a
  separate field in the catalog?)
* values this partition holds

The last part being the most important. In case of what we would have
called a 'LIST' partition, this could look like

... FOR VALUES (val1, val2, val3, ...)

assuming we only support a partition key containing just one column in
such a case.

In case of what we would have called a 'RANGE' partition, this could
look like

... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)

How about BETWEEN ... AND ...?

Here we allow a partition key to contain more than one column.
> So instead of making a hard distinction between range and list
> partitioning, you can say:
>
> CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30),
> (120) TO (130);

I would include the noise keyword VALUES, if anything just for
readability.

> Now that might be a crappy idea for various reasons, but the point is
> there are a lot of details to be hammered out with the syntax, and
> there are several ways we can go wrong. If we choose an
> overly-limiting syntax, we're needlessly restricting what can be done.
> If we choose an overly-permissive syntax, we'll restrict the
> optimization opportunities.

I am not sure, but perhaps RANGE and LIST as partitioning kinds may as
well just be noise keywords. We can parse those values into a parse
node such that we don't have to care about whether they describe the
partition as being one kind or the other. Say a List of something like:

typedef struct PartitionColumnValue
{
    NodeTag  type;
    Oid      partitionid;
    char    *partcolname;
    Node    *partrangelower;
    Node    *partrangeupper;
    List    *partlistvalues;
} PartitionColumnValue;

Or we could still add a (char) partkind just to say which of the fields
matter.

We don't need any defining values here for hash partitions if and when
we add support for the same. We would either be using a system-wide
common hash function or we could add something with the partitioning
key definition.

Thanks,
Amit
On Thu, Dec 11, 2014 at 8:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
>
> Yeah, possibly.
>
> I think we could stand to have a lot more discussion about the syntax
> here. So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different. I personally don't find the Oracle syntax very
> PostgreSQL-ish.
I share your concern w.r.t. the difficulties it can create if we don't
do it carefully (one of the issues you have mentioned upthread about
pg_dump; other such things could cause problems if not thought of
carefully from the beginning). One more thing: on a quick check it
seems to me even DB2 uses something similar to Oracle for defining
partitions:

CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
( PARTITION q4_05 STARTING MINVALUE,
  PARTITION q1_06 STARTING '1/1/2006',
  PARTITION q2_06 STARTING '4/1/2006',
  PARTITION q3_06 STARTING '7/1/2006',
  PARTITION q4_06 STARTING '10/1/2006'
  ENDING '12/31/2006' )
> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
> CREATE TABLE child_name PARTITION OF parent_name
> FOR { (value, ...) [ TO (value, ...) ] } [, ...];
>
The only thing which slightly bothers me about this syntax is that I
don't think there is any pressing need for PostgreSQL to use syntax
similar to what some of the other databases use; however, such syntax
does have an advantage for ease of migration and ease of use (as
people are already familiar with it).
> Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass? Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax? That seems kind of lame.
>
Can't we simply specify the opclass along with the column name in the
partition clause, which I feel is similar to what we already do in the
CREATE INDEX syntax?
CREATE TABLE sales
( invoice_no NUMBER,
sale_year INT NOT NULL,
sale_month INT NOT NULL,
sale_day INT NOT NULL )
PARTITION BY RANGE ( sale_year <opclass>)
( PARTITION sales_q1 VALUES LESS THAN (1999)
Wouldn't the default operator class for a partition column fit the
bill for this particular case, as the operators required by this
syntax will be quite simple?
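(Why the opclass matters here: the partition-bound check is just repeated
application of the opclass's "less than", so a non-default opclass changes
which partition a row lands in. A hedged illustrative sketch in Python with
stand-in comparators, not real PostgreSQL opclasses:)

```python
import operator

def find_partition(bounds, value, less_than=operator.lt):
    """bounds: ascending upper bounds (VALUES LESS THAN ...); return the
    index of the first partition whose upper bound the value is
    'less than' under the given comparison."""
    for i, upper in enumerate(bounds):
        if less_than(value, upper):
            return i
    raise LookupError("no partition for value")

# A hypothetical non-default, case-insensitive ordering:
ci_less = lambda a, b: a.lower() < b.lower()
```

With default ordering "P" sorts before "m" (uppercase before lowercase in
ASCII), but under the case-insensitive comparator it sorts after, so the
same value routes to a different partition.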
> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
> CREATE TABLE child_name PARTITION OF parent_name
> FOR { (value, ...) [ TO (value, ...) ] } [, ...];
>
The only thing which slightly bothers me about this syntax is that it
makes it apparent that partitions are separate tables, which would be
inconvenient if we choose to disallow some operations on partitions. I
think it might be better if we treat partitions as a way to divide a
large amount of data, where users are only given the option to specify
the boundaries dividing that data, and the storage mechanism of
partitions remains an internal detail (something like what we do in
the TOAST table case). I am not sure which syntax users will be more
comfortable using; as I have been seeing and using Oracle-type syntax
for a long time, my opinion could be biased in this case. It would be
really helpful if others who need or use partitioning schemes could
share their inputs.
On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In case of what we would have called a 'LIST' partition, this could look like
>
> ... FOR VALUES (val1, val2, val3, ...)
>
> Assuming we only support partition key to contain only one column in such a case.
>
> In case of what we would have called a 'RANGE' partition, this could look like
>
> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
> How about BETWEEN ... AND ... ?

Sure. Mind you, I'm not proposing that the syntax I just mooted is
actually for the best. What I'm saying is that we need to talk about
it.

> I am not sure but perhaps RANGE and LIST as partitioning kinds may as
> well just be noise keywords. We can parse those values into a parse
> node such that we don't have to care about whether they describe a
> partition as being one kind or the other. Say a List of something
> like,
>
> typedef struct PartitionColumnValue
> {
>     NodeTag  type;
>     Oid     *partitionid;
>     char    *partcolname;
>     Node    *partrangelower;
>     Node    *partrangeupper;
>     List    *partlistvalues;
> };
>
> Or we could still add a (char) partkind just to say which of the
> fields matter.
>
> We don't need any defining values here for hash partitions if and
> when we add support for the same. We would either be using a
> system-wide common hash function or we could add something with the
> partitioning key definition.

Yeah, range and list partition definitions are very similar, but hash
partition definitions are a different kettle of fish. I don't think
we really need hash partitioning for anything right away - it's
pretty useless unless you've got, say, a way for the partitions to be
foreign tables living on remote servers - but we shouldn't pick a
design that will make it really hard to add later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/12/14, 8:03 AM, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> In case of what we would have called a 'LIST' partition, this could look like
>>
>> ... FOR VALUES (val1, val2, val3, ...)
>>
>> Assuming we only support partition key to contain only one column in such a case.
>>
>> In case of what we would have called a 'RANGE' partition, this could look like
>>
>> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>>
>> How about BETWEEN ... AND ... ?
> Sure. Mind you, I'm not proposing that the syntax I just mooted is
> actually for the best. What I'm saying is that we need to talk about
> it.

Frankly, if we're going to require users to explicitly define each
partition then I think the most appropriate API would be a function.
Users will be writing code to create new partitions as needed, and
it's generally easier to write code that calls a function as opposed
to glomming a text string together and passing that to EXECUTE.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Sure. Mind you, I'm not proposing that the syntax I just mooted is
>> actually for the best. What I'm saying is that we need to talk about
>> it.
>
> Frankly, if we're going to require users to explicitly define each partition
> then I think the most appropriate API would be a function. Users will be
> writing code to create new partitions as needed, and it's generally easier
> to write code that calls a function as opposed to glomming a text string
> together and passing that to EXECUTE.

I have very little idea what the API you're imagining would actually
look like from this description, but it sounds like a terrible idea.
We don't want to make this infinitely general. We need a *fast* way
to go from a value (or list of values, one per partitioning column) to
a partition OID, and the way to get there is not to call arbitrary
user code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> Sure. Mind you, I'm not proposing that the syntax I just mooted is
>>> actually for the best. What I'm saying is that we need to talk about
>>> it.
>>
>> Frankly, if we're going to require users to explicitly define each partition
>> then I think the most appropriate API would be a function. Users will be
>> writing code to create new partitions as needed, and it's generally easier
>> to write code that calls a function as opposed to glomming a text string
>> together and passing that to EXECUTE.
>
> I have very little idea what the API you're imagining would actually
> look like from this description, but it sounds like a terrible idea.
> We don't want to make this infinitely general. We need a *fast* way
> to go from a value (or list of values, one per partitioning column) to
> a partition OID, and the way to get there is not to call arbitrary
> user code.

I think this was mentioned upthread, but I'll repeat it anyway since
it seems to need repeating.

More than fast, you want it analyzable (by the planner). Ie: it has to
be easy to prove partition exclusion against a where clause.
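The fast value-to-partition lookup Robert calls for reduces, in the range-partitioning case, to a binary search over the sorted partition bounds. A minimal standalone sketch in C (hypothetical names and integer keys for illustration; not PostgreSQL source):

```c
#include <assert.h>

/*
 * Hypothetical sketch: given the sorted upper bounds of nparts range
 * partitions, route a key to a partition index in O(log nparts) with
 * no user code involved.  Partition i covers keys in
 * [upper_bounds[i-1], upper_bounds[i]); a result of nparts means
 * "no partition accepts this key".
 */
int
route_to_partition(const int *upper_bounds, int nparts, int key)
{
    int lo = 0, hi = nparts;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (key < upper_bounds[mid])
            hi = mid;           /* key fits at or before this bound */
        else
            lo = mid + 1;       /* key lies past this bound */
    }
    return lo;                  /* first bound strictly greater than key */
}
```

Real bounds would be Datums compared through the partitioning opclass's comparison function rather than ints, but the lookup stays logarithmic either way.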
Claudio Freire <klaussfreire@gmail.com> writes:
> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I have very little idea what the API you're imagining would actually
>> look like from this description, but it sounds like a terrible idea.
>> We don't want to make this infinitely general. We need a *fast* way
>> to go from a value (or list of values, one per partitioning column) to
>> a partition OID, and the way to get there is not to call arbitrary
>> user code.

> I think this was mentioned upthread, but I'll repeat it anyway since
> it seems to need repeating.

> More than fast, you want it analyzable (by the planner). Ie: it has to
> be easy to prove partition exclusion against a where clause.

Actually, I'm not sure that's what we want. I thought what we really
wanted here was to postpone partition-routing decisions to runtime,
so that the behavior would be efficient whether or not the decision
could be predetermined at plan time.

This still leads to the same point Robert is making: the routing
decisions have to be cheap and fast. But it's wrong to think of it
in terms of planner proofs.

			regards, tom lane
On Fri, Dec 12, 2014 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I have very little idea what the API you're imagining would actually
>>> look like from this description, but it sounds like a terrible idea.
>>> We don't want to make this infinitely general. We need a *fast* way
>>> to go from a value (or list of values, one per partitioning column) to
>>> a partition OID, and the way to get there is not to call arbitrary
>>> user code.
>
>> I think this was mentioned upthread, but I'll repeat it anyway since
>> it seems to need repeating.
>
>> More than fast, you want it analyzable (by the planner). Ie: it has to
>> be easy to prove partition exclusion against a where clause.
>
> Actually, I'm not sure that's what we want. I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
>
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast. But it's wrong to think of it
> in terms of planner proofs.

You'll need proofs whether at the planner or at the execution engine.
A sequential scan over a partition with a query like

select * from foo where date between X and Y

would be ripe for that, but at some point you need to prove that the
where clause excludes whole partitions. Be it at runtime (while
executing the sequential scan node) or planning time.
On 12/12/2014 02:10 PM, Tom Lane wrote:
> Actually, I'm not sure that's what we want. I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
>
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast. But it's wrong to think of it
> in terms of planner proofs.

The other reason I'd really like to have the new partitioning taken
out of the planner: expressions.

Currently, if you have partitions with constraints on, say,
"event_date", the following WHERE clause will NOT use CE and will
scan all partitions:

WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' )
  and '2014-12-11'

This is despite the fact that the expression above gets rewritten to
a constant by the time the query is executed; by then it's too late.
To say nothing of functions like to_timestamp(), now(), etc.

As long as partitions need to be chosen at plan time, I don't see a
good way to fix the expression problem.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Dec 12, 2014 at 7:40 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 12/12/2014 02:10 PM, Tom Lane wrote:
>> Actually, I'm not sure that's what we want. I thought what we really
>> wanted here was to postpone partition-routing decisions to runtime,
>> so that the behavior would be efficient whether or not the decision
>> could be predetermined at plan time.
>>
>> This still leads to the same point Robert is making: the routing
>> decisions have to be cheap and fast. But it's wrong to think of it
>> in terms of planner proofs.
>
> The other reason I'd really like to have the new partitioning taken out
> of the planner: expressions.
>
> Currently, if you have partitions with constraints on, say,
> "event_date", the following WHERE clause will NOT use CE and will scan
> all partitions:
>
> WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and
> '2014-12-11'.
>
> This is despite the fact that the expression above gets rewritten to a
> constant by the time the query is executed; by then it's too late. To
> say nothing of functions like to_timestamp(), now(), etc.
>
> As long as partitions need to be chosen at plan time, I don't see a good
> way to fix the expression problem.

Fair enough, but that's not the same as not requiring easy proofs.
The planner might not be the one doing the proofs, but you still need
proofs.

Even if the proving method is hardcoded into the partitioning method,
as in the case of list or range partitioning, it's still a proof.
With arbitrary functions (which is what prompted me to mention proofs)
you can't do that. A function works very well for inserting, but not
for selecting.

I could be wrong though. Maybe there's a way to turn SQL functions
into analyzable things? But it would still be very easy to shoot
yourself in the foot by writing one that is too complex.
Claudio Freire wrote:

> Fair enough, but that's not the same as not requiring easy proofs. The
> planner might not the one doing the proofs, but you still need proofs.
>
> Even if the proving method is hardcoded into the partitioning method,
> as in the case of list or range partitioning, it's still a proof. With
> arbitrary functions (which is what prompted me to mention proofs) you
> can't do that. A function works very well for inserting, but not for
> selecting.
>
> I could be wrong though. Maybe there's a way to turn SQL functions
> into analyzable things? But it would still be very easy to shoot
> yourself in the foot by writing one that is too complex.

Arbitrary SQL expressions (including functions) are not the thing to
use for partitioning -- at least that's how I understand this whole
discussion. I don't think you want to do "proofs" as such -- they are
expensive.

To make this discussion a bit clearer, there are two things to
distinguish: one is routing tuples, when an INSERT or COPY command
references the partitioned table, into the individual partitions
(ingress); the other is deciding which partitions to read when a
SELECT query wants to read tuples from the partitioned table (egress).

On ingress, what you want is something like being able to do something
on the tuple that tells you which partition it belongs into. Ideally
this is something much lighter than running an expression; if you can
just apply an operator to the partitioning column values, that should
be plenty fast. This requires no proof.

On egress you need some direct way to compare the scan quals with the
partitioning values. I would imagine this to be similar to how scan
quals are compared to the values stored in a BRIN index: each scan
qual has a corresponding operator strategy and a scan key, and you can
say "aye" or "nay" based on a small set of operations that can be run
cheaply, again without any proof or running arbitrary expressions.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
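The "egress" test Alvaro describes can be pictured as a BRIN-like consistency check: given a partition's minimum and maximum partitioning values plus one scan key with its operator strategy, decide whether the partition can possibly contain matches. A standalone C sketch (hypothetical names, btree-convention strategy numbers, integer keys; an illustration, not actual PostgreSQL code):

```c
#include <stdbool.h>

/* Btree-convention strategy numbers: 1 = <, 2 = <=, 3 = =, 4 = >=, 5 = > */
typedef enum Strategy
{
    STRAT_LT = 1, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT
} Strategy;

/*
 * Hypothetical sketch: answer "aye" (true) when a partition holding
 * values in [part_min, part_max] could contain rows satisfying
 * "col <strategy> scan_key"; "nay" (false) means the whole partition
 * can be skipped.  No arbitrary expressions are run.
 */
bool
partition_may_match(int part_min, int part_max, Strategy strat, int scan_key)
{
    switch (strat)
    {
        case STRAT_LT: return part_min < scan_key;
        case STRAT_LE: return part_min <= scan_key;
        case STRAT_EQ: return part_min <= scan_key && scan_key <= part_max;
        case STRAT_GE: return part_max >= scan_key;
        case STRAT_GT: return part_max > scan_key;
    }
    return true;                /* unknown strategy: cannot exclude */
}
```

For a multi-key qual, the per-key answers would simply be ANDed together: any "nay" excludes the partition.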
On 12/12/2014 23:09, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:
>
> Claudio Freire wrote:
>
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
>
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion. I don't think you want to do "proofs" as such -- they are
> expensive.
>
> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
>
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into. Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast. This requires no proof.
>
> On egress you need some direct way to compare the scan quals with the
> partitioning values. I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.

Interesting that you mention BRIN. It does seem that it could be made
to work with BRIN's operator classes.

In fact, a partition-wide BRIN tuple could be stored per partition and
that in itself could be the definition for the partition.

Either preinitialized or dynamically updated. Would work even for
arbitrary routing functions, especially if the operator class to use
is customizable.

I stand corrected.
On 12/12/2014 05:43 AM, Amit Langote wrote:
> [snip]
> In case of what we would have called a 'LIST' partition, this could look like
>
> ... FOR VALUES (val1, val2, val3, ...)
>
> Assuming we only support partition key to contain only one column in such a case.

Hmmm….

[...] PARTITION BY LIST(col1 [, col2, ...])

just like we do for indexes would do. And

CREATE PARTITION child_name OF parent_name
    FOR [VALUES] (val1a,val2a), (val1b,val2b), (val1c,val2c)
    [IN tblspc_name]

just like we do for multi-valued inserts.

> In case of what we would have called a 'RANGE' partition, this could look like
>
> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
> How about BETWEEN ... AND ... ?

Unless I'm missing something obvious, we already have range types for
this, don't we?

... PARTITION BY RANGE (col)

CREATE PARTITION child_name OF parent_name
    FOR [VALUES] '[val1min,val1max)', '[val2min,val2max)', '[val3min,val3max)'
    [IN tblspc_name]

and I guess this should simplify a fully flexible implementation (if
you can construct a RangeType for it, you can use that for
partitioning). This would substitute the ugly (IMHO) "VALUES LESS
THAN" syntax with a more flexible one (even though it might end up
being converted into "less than" boundaries internally for
implementation/optimization purposes).

In both cases we would need to allow for overflows / default partition
different from the parent table. Plus some

ALTER PARTITION part_name TABLESPACE=tblspc_name

The main problem being that we are assuming named partitions here,
which might not be that practical at all.

> [snip]
>> I would include the noise keyword VALUES just for readability if
>> anything.

+1

FWIW, deviating from already "standard" syntax (Oracle-like --as
implemented by PPAS for example-- or DB2-like) is quite
counter-productive unless we have very good reasons for it... which
doesn't mean that we have to do it exactly like they do (specially if
we would like to go the incremental implementation route).

Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?

This might as well help with deciding the final form of partitioning
(and define the first implementation boundaries, too)

Thanks,

    / J.L.
On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
> [snip]
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion. I don't think you want to do "proofs" as such -- they are
> expensive.

Yup. Plus, it looks like (from reading Oracle's documentation) they
end up converting the LESS THAN clauses into range lists internally.
Anyone that can attest to this? (or just disprove it, if I'm wrong)

I just suggested using the existing RangeType infrastructure for this
( <<, >> and && operators, specifically, might do the trick) before
reading your mail citing BRIN.

... which might as well allow some interesting runtime optimizations
when range partitioning is used and *a huge* number of partitions get
defined --- I'm specifically thinking about massive OLTP with very
deep (say, 5 years' worth) archival partitioning where it would be
inconvenient to have the tuple routing information always in memory.

I'm specifically suggesting some ( range_value -> partitionOID )
mapping using a BRIN index for this --- it could be auto-created just
like we do for primary keys.

Just my 2c

Thanks,

    / J.L.
On 12/12/14, 3:48 PM, Robert Haas wrote:
> On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> Sure. Mind you, I'm not proposing that the syntax I just mooted is
>>> actually for the best. What I'm saying is that we need to talk about
>>> it.
>>
>> Frankly, if we're going to require users to explicitly define each partition
>> then I think the most appropriate API would be a function. Users will be
>> writing code to create new partitions as needed, and it's generally easier
>> to write code that calls a function as opposed to glomming a text string
>> together and passing that to EXECUTE.
>
> I have very little idea what the API you're imagining would actually
> look like from this description, but it sounds like a terrible idea.
> We don't want to make this infinitely general. We need a *fast* way
> to go from a value (or list of values, one per partitioning column) to
> a partition OID, and the way to get there is not to call arbitrary
> user code.

You were talking about the syntax for partition creation/definition;
that's the API I was referring to.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 12, 2014 at 09:03:12AM -0500, Robert Haas wrote:
> Yeah, range and list partition definitions are very similar, but
> hash partition definitions are a different kettle of fish. I don't
> think we really need hash partitioning for anything right away -
> it's pretty useless unless you've got, say, a way for the partitions
> to be foreign tables living on remote servers -

There's a patch enabling exactly this feature in the queue for 9.5.

https://commitfest.postgresql.org/action/patch_view?id=1386

> but we shouldn't pick a design that will make it really hard to add
> later.

Indeed not :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 12/13/2014 05:57 PM, José Luis Tallón wrote:
> On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
>> [snip]
>> Arbitrary SQL expressions (including functions) are not the thing to use
>> for partitioning -- at least that's how I understand this whole
>> discussion. I don't think you want to do "proofs" as such -- they are
>> expensive.
>
> Yup. Plus, it looks like (from reading Oracle's documentation) they
> end up converting the LESS THAN clauses into range lists internally.
> Anyone that can attest to this? (or just disprove it, if I'm wrong)
>
> I just suggested using the existing RangeType infrastructure for this
> ( <<, >> and && operators, specifically, might do the trick) before
> reading your mail citing BRIN.
> ... which might as well allow some interesting runtime
> optimizations when range partitioning is used and *a huge* number of
> partitions get defined --- I'm specifically thinking about massive
> OLTP with very deep (say, 5 years' worth) archival partitioning where
> it would be inconvenient to have the tuple routing information always
> in memory.
> I'm specifically suggesting some ( range_value -> partitionOID )
> mapping using a BRIN index for this --- it could be auto-created just
> like we do for primary keys.

Reviewing the existing documentation on this topic, I have stumbled on
an e-mail by Simon Riggs from almost seven years ago

http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site

... where he suggested a way of physically partitioning tables by
using segments, in a way that sounds quite close to what we are
proposing here.

ISTM that the partitioning meta-data might very well be augmented a
bit in the direction Simon pointed to, adding support for "effectively
read-only" and/or "explicitly marked read-only" PARTITIONS (not
segments in this case) as an additional optimization. We would need
some syntax additions (ALTER PARTITION <name> SET READONLY) in this
case. This feature can be added later on, of course.

I'd like to explicitly remark the potentially performance-enhancing
effect of fillfactor=100 (cfr.
http://www.postgresql.org/docs/9.3/static/sql-createtable.html) and
partitions marked "effectively read-only" (cfr. Simon's proposal) when
coupled with "fullscan analyze" vs. the regular sample-based analyze
that autovacuum performs.

When a partition consists of multiple *segments*, a generalization of
the proposed BRIN index (to cover segments in addition to partitions)
will further speed up scans.

Just for the record, allowing some partitions to be moved to foreign
tables (i.e. foreign servers, via postgres_fdw) will multiply the
usefulness of this "partitioned table wide" BRIN index ... now
becoming a real "global index".

> Just my 2c
>
> Thanks,
>
>     / J.L.
On Sun, Dec 14, 2014 at 1:57 AM, José Luis Tallón
<jltallon@adv-solutions.net> wrote:
> On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
>> [snip]
>> Arbitrary SQL expressions (including functions) are not the thing to use
>> for partitioning -- at least that's how I understand this whole
>> discussion. I don't think you want to do "proofs" as such -- they are
>> expensive.
>
> Yup. Plus, it looks like (from reading Oracle's documentation) they end up
> converting the LESS THAN clauses into range lists internally.
> Anyone that can attest to this? (or just disprove it, if I'm wrong)
>
> I just suggested using the existing RangeType infrastructure for this ( <<,
> >> and && operators, specifically, might do the trick) before reading your
> mail citing BRIN.
> ... which might as well allow some interesting runtime optimizations
> when range partitioning is used and *a huge* number of partitions get
> defined --- I'm specifically thinking about massive OLTP with very deep
> (say, 5 years' worth) archival partitioning where it would be inconvenient
> to have the tuple routing information always in memory.
> I'm specifically suggesting some ( range_value -> partitionOID ) mapping
> using a BRIN index for this --- it could be auto-created just like we do for
> primary keys.
>
> Just my 2c

Since we are keen on being able to reuse existing infrastructure, I
think this RangeType and ArrayType stuff is worth thinking about,
though I am afraid we may lose a certain level of generality of
expression, which we might very well be able to afford. That is
difficult to say definitively without studying it in a little more
detail, which I haven't quite done yet. We may be able to go somewhere
with it, perhaps. And of course the original designers of the
infrastructure in question would be better able to vouch for it, I
think.

Thanks,
Amit
On Sun, Dec 14, 2014 at 1:40 AM, José Luis Tallón
<jltallon@adv-solutions.net> wrote:
> On 12/12/2014 05:43 AM, Amit Langote wrote:
>
> Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?
>
> This might as well help with deciding the final form of partitioning
> (and define the first implementation boundaries, too)

Please go ahead.

Thanks,
Amit
Alvaro wrote:
> Claudio Freire wrote:
>
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
>
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion. I don't think you want to do "proofs" as such -- they are
> expensive.

This means that if a user puts arbitrary expressions in a partition
definition, say,

... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'),

we make sure that those expressions are pre-computed to literal
values. The exact time when that happens is open for discussion, I
guess. It could be either DDL time or, if feasible, during relation
cache building, when we compute the value from the pg_node_tree of
this expression, which we may choose to store in the partition
definition catalog. The former entails an obvious challenge of
figuring out how we store the computed value into the catalog
(pg_node_tree of a Const?).

> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
>
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into. Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast. This requires no proof.

And I am thinking this is all executor stuff.

> On egress you need some direct way to compare the scan quals with the
> partitioning values. I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.

My knowledge of this is far from being perfect, though to clear any
confusions -

As far as planning is concerned, I could not imagine how the index
access method way of pruning partitions could be made to work. Of
course, I may be missing something.

When you say a "scan qual has a corresponding operator strategy", I'd
think that is part of a scan key in the executor, no?

Thanks,
Amit
On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On egress you need some direct way to compare the scan quals with the >> partitioning values. I would imagine this to be similar to how scan >> quals are compared to the values stored in a BRIN index: each scan qual >> has a corresponding operator strategy and a scan key, and you can say >> "aye" or "nay" based on a small set of operations that can be run >> cheaply, again without any proof or running arbitrary expressions. >> > > My knowledge of this is far from being perfect, though to clear any confusions - > > As far as planning is concerned, I could not imagine how index access method way of pruning partitions could be made towork. Of course, I may be missing something. Let me be overly verbose, don't take it as patronizing, just answering in lots of detail why this could be a good idea to try. Normal indexes store a pointer for each key value of sorts. So B-Tree gets you a set of tids for each key, and so does GIN and hash. But BRIN is different in that it does the mapping differently. BRIN stores a compact, approximate representation of the set of keys within a page range. It can tell with some degree of (in)accuracy whether a key or key range could be part of that page range or not. The way it does this is abstracted out, but at its core, it stores a "compressed" representation of the key set that takes a constant amount of bits to store, and no more, no matter how many elements. What changes as the element it represents grows, is its accuracy. Currently, BRIN only supports min-max representations. It will store, for each page range, the minimum and maximum of some columns, and when you query it, you can compare range vs range, and discard whole page ranges. Now, what are partitions, if not page ranges? A BRIN tuple is a min-max pair. But BRIN in more general, it could use other data structures to hold that "compressed representation", if someone implemented them. 
Like bloom filters [0].

A BRIN index is a complex data structure because it has to account for physically growing tables, but all the complexities vanish when you fix a "block range" (the BR in BRIN) to a partition. Then, a mere array of BRIN tuples would suffice.

BRIN already contains the machinery to turn quals into something that filters out entire partitions, if you provide the BRIN tuples. And you could even effectively maintain a BRIN index for the partitions (just a BRIN tuple per partition, dynamically updated with every insertion).

If you do that, you start with empty partitions, and each insert updates the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but in theory this could allow very general partition exclusion. In fact it could even work with constraint exclusion right now: you'd have a single-tuple BRIN index for each partition and benefit from it.

But you don't need to pay the price of updating BRIN indexes, as min-max tuples for each partition can be produced while creating the partitions if the syntax already provides the information. Then, it's just a matter of querying this meta-data, which just happens to have the form of a BRIN tuple for each partition.

[0] http://en.wikipedia.org/wiki/Bloom_filter
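For illustration only, the min-max exclusion Claudio describes can be sketched in a few lines of Python. The partition names and bounds below are invented; real BRIN is C code inside the server, and the point is only the shape of the check: a cheap per-partition comparison, no proof machinery, no arbitrary expressions.

```python
# Hypothetical per-partition min-max summaries: (name, min_key, max_key),
# i.e. one "BRIN tuple" per partition.
summaries = [
    ("p1", 0, 999),
    ("p2", 1000, 1999),
    ("p3", 2000, 2999),
]

def partitions_for_range(summaries, lo, hi):
    """Return partitions whose [min, max] summary overlaps [lo, hi].

    This mirrors the BRIN idea: each summary answers "could this
    partition contain matching rows?" with one cheap comparison.
    """
    return [name for name, pmin, pmax in summaries
            if pmin <= hi and pmax >= lo]

# A qual like "key BETWEEN 500 AND 1200" rules out p3 up front.
print(partitions_for_range(summaries, 500, 1200))   # ['p1', 'p2']
```

A qual that falls entirely outside every summary prunes all partitions, which is exactly the "aye or nay without running expressions" behavior described upthread.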
Claudio Freire wrote: > On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> On egress you need some direct way to compare the scan quals with the > >> partitioning values. I would imagine this to be similar to how scan > >> quals are compared to the values stored in a BRIN index: each scan qual > >> has a corresponding operator strategy and a scan key, and you can say > >> "aye" or "nay" based on a small set of operations that can be run > >> cheaply, again without any proof or running arbitrary expressions. > >> > > > > My knowledge of this is far from being perfect, though to clear any > confusions - > > > > As far as planning is concerned, I could not imagine how index access > method way of pruning partitions could be made to work. Of course, I may > be missing something. > > Let me be overly verbose, don't take it as patronizing, just answering > in lots of detail why this could be a good idea to try. > Thanks for explaining. It helps. > Normal indexes store a pointer for each key value of sorts. So B-Tree > gets you a set of tids for each key, and so does GIN and hash. > > But BRIN is different in that it does the mapping differently. BRIN > stores a compact, approximate representation of the set of keys within > a page range. It can tell with some degree of (in)accuracy whether a > key or key range could be part of that page range or not. The way it > does this is abstracted out, but at its core, it stores a "compressed" > representation of the key set that takes a constant amount of bits to > store, and no more, no matter how many elements. What changes as the > element it represents grows, is its accuracy. > > Currently, BRIN only supports min-max representations. It will store, > for each page range, the minimum and maximum of some columns, and > when > you query it, you can compare range vs range, and discard whole page > ranges. > > Now, what are partitions, if not page ranges? 
Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this is a range partition, a list of values in case this is a list partition, and a hash value in case this is a hash partition.

There is debate on the topic, but each of these partitions also happens to be a separate relation. IIUC, BRIN is an access method for a relation (say, a top-level partitioned relation) that comes into play in the executor if that access method survives as the preferred access method chosen by the planner. I cannot see a way to generalize it further and make it support each block range as a separate relation and then use it for partition pruning in the planner. This is assuming a partitioned relation is planned as an Append node which contains a list of plans for surviving partition relations based on, say, restrict quals.

I may be thinking of BRIN as a whole as not being generalized enough, but I may be wrong. Could you point out if so?

> A BRIN tuple is a min-max pair. But BRIN is more general; it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
>
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
>
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.

IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right?

> And you could even effectively maintain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
>
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion.
> In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
>
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.

Thanks,
Amit
On 12/15/2014 07:42 AM, Claudio Freire wrote: > [snip] > If you do that, you start with empty partitions, and each insert > updates the BRIN tuple. Avoiding concurrency loss in this case would > be tricky, but in theory this could allow very general partition > exclusion. In fact it could even work with constraint exclusion right > now: you'd have a single-tuple BRIN index for each partition and > benefit from it. But you don't need to pay the price of updating BRIN > indexes, as min-max tuples for each partition can be produced while > creating the partitions if the syntax already provides the > information. Then, it's just a matter of querying this meta-data which > just happens to have the form of a BRIN tuple for each partition. Yup. Indeed this is the way I outlined in my previous e-mail. The only point being: Why bother with BRIN when we already have the range machinery, and it's trivial to add pointers to partitions from each range? I suggested that BRIN would solve a situation when the amount of partitions is huge (say, thousands) and we might need to be able to efficiently locate the appropriate partition. In this situation, a linear search might become prohibitive, or the data structure (a simple B-Tree, maybe) become too big to be worth keeping in memory. This is where being able to store the "partition index" on disk would be interesting. Moreover, I guess that ---by using this approach (B-Tree[range]->partition_id and/or BRIN)--- we could efficiently answer the question "do we have any tuple with this key in some partition?" which AFAICS is pretty close to us having "global indexes". Regards, / J.L.
On Mon, Dec 15, 2014 at 8:09 AM, José Luis Tallón <jltallon@adv-solutions.net> wrote: > On 12/15/2014 07:42 AM, Claudio Freire wrote: >> >> [snip] > > >> If you do that, you start with empty partitions, and each insert updates >> the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but >> in theory this could allow very general partition exclusion. In fact it >> could even work with constraint exclusion right now: you'd have a >> single-tuple BRIN index for each partition and benefit from it. But you >> don't need to pay the price of updating BRIN indexes, as min-max tuples for >> each partition can be produced while creating the partitions if the syntax >> already provides the information. Then, it's just a matter of querying this >> meta-data which just happens to have the form of a BRIN tuple for each >> partition. > > > Yup. Indeed this is the way I outlined in my previous e-mail. > > The only point being: Why bother with BRIN when we already have the range > machinery, and it's trivial to add pointers to partitions from each range? The part of BRIN I find useful is not its on-disk structure, but all the execution machinery that checks quals against BRIN tuples. It's not a trivial part of code, and is especially useful since it's generalizable. New BRIN operator classes can be created and that's an interesting power to have in partitioning as well. Casting from ranges into min-max BRIN tuples seems quite doable, so both range and list notation should work fine. But BRIN works also for the generic "routing expression" some people seem to really want, and dynamically updated BRIN meta-indexes seem to be the only efficient solution for that. BRIN lacks some features, as you noted, so it does need some love before it's usable for this. But they're features BRIN itself would find useful so you take out two ducks in one shot. 
> I suggested that BRIN would solve a situation when the amount of partitions > is huge (say, thousands) and we might need to be able to efficiently locate > the appropriate partition. In this situation, a linear search might become > prohibitive, or the data structure (a simple B-Tree, maybe) become too big > to be worth keeping in memory. This is where being able to store the > "partition index" on disk would be interesting. BRIN also does a linear search, so it doesn't solve that. BRIN's only power is that it can answer very fast whether some quals rule out a partition.
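The per-partition summary being discussed here — one min-max "BRIN tuple" per partition, widened on every insert and consulted at query time — can be sketched as a toy Python class. This is purely illustrative (the class and its methods are invented), and it deliberately ignores the concurrency problems Claudio flags as the tricky part of a real implementation.

```python
class PartitionSummary:
    """A toy stand-in for a dynamically maintained min-max tuple."""

    def __init__(self):
        self.min = None      # None means the partition is empty
        self.max = None

    def note_insert(self, key):
        # Each insert can only widen the summary, never shrink it.
        if self.min is None or key < self.min:
            self.min = key
        if self.max is None or key > self.max:
            self.max = key

    def may_contain(self, lo, hi):
        if self.min is None:
            return False     # empty partition matches nothing
        return self.min <= hi and self.max >= lo

s = PartitionSummary()
for k in (42, 7, 99):
    s.note_insert(k)
print(s.min, s.max)              # 7 99
print(s.may_contain(100, 200))   # False: the partition can be skipped
```

As the thread notes, when partition bounds come from the DDL syntax the summaries can simply be written once at partition creation time, and the insert-time maintenance cost disappears.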
On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > This means if a user puts arbitrary expressions in a partition definition, say, > > ... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'), > > we make sure that those expressions are pre-computed to literal values. I would expect that to fail, just as it would fail if you tried to build an index using a volatile expression. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert wrote: > On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > This means if a user puts arbitrary expressions in a partition definition, say, > > > > ... FOR VALUES extract(month from current_date) TO extract(month from > current_date + interval '3 months'), > > > > we make sure that those expressions are pre-computed to literal values. > > I would expect that to fail, just as it would fail if you tried to > build an index using a volatile expression. Oops, wrong example, sorry. In case of an otherwise good expression? Thanks, Amit
On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Robert wrote:
>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> This means if a user puts arbitrary expressions in a partition definition, say,
>>>
>>> ... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'),
>>>
>>> we make sure that those expressions are pre-computed to literal values.
>>
>> I would expect that to fail, just as it would fail if you tried to
>> build an index using a volatile expression.
>
> Oops, wrong example, sorry. In case of an otherwise good expression?

I'm not really sure what you are getting at here. An "otherwise-good expression" basically means a constant.

Index expressions have to be things that always produce the same result given the same input, because otherwise you might get a different result when searching the index than you did when building it, and then you would fail to find keys that are actually present. In the same way, partition boundaries also need to be constants. Maybe you could allow expressions that can be constant-folded, but that's about it. If you allow anything else, then the partition boundary might "move" once it's been established, and then some of the data will be in the wrong partition.

What possible use case is there for defining partitions with non-constant boundaries?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
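Robert's "expressions that can be constant-folded" criterion can be sketched in Python. Everything below is invented for illustration — the tuple-based expression tree and the volatile-function list are stand-ins, nothing like PostgreSQL's real node trees or volatility classification — but it shows the rule: evaluate the bound once at DDL time, and reject anything volatile so the boundary can never "move" afterwards.

```python
VOLATILE = {"now", "random"}      # stand-ins for volatile functions

def fold(expr):
    """expr is either a constant or a tuple: (func_name, arg, ...)."""
    if not isinstance(expr, tuple):
        return expr                      # already a literal constant
    func, *args = expr
    if func in VOLATILE:
        # A volatile bound could evaluate differently later, leaving
        # rows in the wrong partition -- so it fails, as expected.
        raise ValueError("volatile function in partition bound: " + func)
    folded = [fold(a) for a in args]
    if func == "+":                      # immutable: fold right away
        return folded[0] + folded[1]
    raise ValueError("unsupported function: " + func)

print(fold(("+", 3, ("+", 1, 1))))       # folds to the literal 5
# fold(("now",)) raises: not allowed as a partition bound
```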
On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Robert wrote:
>>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> > This means if a user puts arbitrary expressions in a partition definition, say,
>>> >
>>> > ... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'),
>>> >
>>> > we make sure that those expressions are pre-computed to literal values.
>>>
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
>
> I'm not really sure what you are getting at here. An "otherwise-good
> expression" basically means a constant. Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present.

I think the point is partitioning based on the result of an expression over row columns. Or if it's not, it should be made anyway:

PARTITION BY LIST (extract(month from date_created)) VALUES (1, 3, 6, 9, 12);

Or something like that.
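Claudio's hypothetical expression-based list key could route tuples roughly as in this Python sketch. The partition names and the month-to-partition map are made up for illustration; a real implementation would evaluate the partitioning expression per row in the executor.

```python
from datetime import date

# Value of extract(month from date_created) -> partition (invented names)
routing = {1: "q1", 3: "q1", 6: "q2", 9: "q3", 12: "q4"}

def route(row):
    key = row["date_created"].month     # the partitioning expression
    try:
        return routing[key]
    except KeyError:
        raise ValueError("no partition accepts key %r" % key)

print(route({"date_created": date(2014, 6, 15)}))   # q2
```

Note the per-row expression evaluation in route() — that is exactly the bulk-loading cost concern raised elsewhere in the thread for COPY into a partitioned table.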
On 12/15/2014 10:55 AM, Robert Haas wrote:
>> This means if a user puts arbitrary expressions in a partition definition, say,
>>
>> ... FOR VALUES extract(month from current_date) TO extract(month from current_date + interval '3 months'),
>>
>> we make sure that those expressions are pre-computed to literal values.
>
> I would expect that to fail, just as it would fail if you tried to
> build an index using a volatile expression.

Yes, I wasn't saying that expressions should be used when *creating* the partitions, which strikes me as a bad idea for several reasons. Expressions should be usable when SELECTing data from the partitions. Right now, they aren't, because the planner picks partitions well before the rewrite phase, which would reduce "extract(month from current_date)" to a constant.

Right now, if you partition by an integer ID even, and do:

SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )

... postgres will scan all partitions because ( 3 + 4 ) is an expression and isn't evaluated until after CE is done.

I don't think there's an easy way to do the expression rewrite while we're still in planning, is there?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 17-12-2014 AM 12:15, Robert Haas wrote:
> On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Robert wrote:
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
>
> I'm not really sure what you are getting at here. An "otherwise-good
> expression" basically means a constant. Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present. In the same way, partition boundaries
> also need to be constants. Maybe you could allow expressions that can
> be constant-folded, but that's about it.

Yeah, this is what I meant: expressions that can be constant-folded. Sorry, the example I chose was pretty lame.

I was just thinking about the kind of thing that something like pg_node_tree would be a good choice for as the on-disk representation of partition values. Though it definitely wouldn't be for storing arbitrary expressions that evaluate to different values at different times.

Thanks,
Amit
On 17-12-2014 AM 12:28, Claudio Freire wrote:
> On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not really sure what you are getting at here. An "otherwise-good
>> expression" basically means a constant. Index expressions have to be
>> things that always produce the same result given the same input,
>> because otherwise you might get a different result when searching the
>> index than you did when building it, and then you would fail to find
>> keys that are actually present.
>
> I think the point is partitioning based on the result of an expression
> over row columns.

Actually, in this case, I was thinking about a partition definition, not a partition key definition. That is, using an expression as a partition value, which has the problems that I see.

> Or if it's not, it should be made anyway:
>
> PARTITION BY LIST (extract(month from date_created)) VALUES (1, 3, 6, 9, 12);
>
> Or something like that.

Such a thing seems very desirable, though there are some tradeoffs compared to having the partitioning key be just attnums. Or at least we can start with that. An arbitrary expression as the partitioning key means that we have to recompute such an expression for each input row. Think how inefficient that may be when bulk-loading into a partitioned table during, say, a COPY. Though there may be ways to fix that.

Thanks,
Amit
On Tue, Dec 16, 2014 at 1:45 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Yes, I wasn't saying that expressions should be used when *creating* the
> partitions, which strikes me as a bad idea for several reasons.
> Expressions should be usable when SELECTing data from the partitions.
> Right now, they aren't, because the planner picks partitions well before
> the rewrite phase which would reduce "extract(month from current_date)"
> to a constant.
>
> Right now, if you partition by an integer ID even, and do:
>
> SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )
>
> ... postgres will scan all partitions because ( 3 + 4 ) is an expression
> and isn't evaluated until after CE is done.

Well, actually, that case works fine:

rhaas=# create table partitioned_table (id integer, data text);
CREATE TABLE
rhaas=# create table child1 (check (id < 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# create table child2 (check (id >= 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# explain select * from partitioned_table where id = (3 + 4);
                               QUERY PLAN
------------------------------------------------------------------------
 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on partitioned_table  (cost=0.00..0.00 rows=1 width=36)
         Filter: (id = 7)
   ->  Seq Scan on child1  (cost=0.00..25.38 rows=6 width=36)
         Filter: (id = 7)
(5 rows)

The reason is that 3 + 4 gets constant-folded pretty early on in the process. But in a more complicated case where the value there isn't known until runtime, yeah, it scans everything. I'm not sure what the best way to fix that is. If the partition bounds were stored in a structured way, as we've been discussing, then the Append or Merge Append node could, when initialized, check which partition the id = X qual routes to and ignore the rest. But that's more iffy with the current representation, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/16/2014 05:52 PM, Robert Haas wrote:
> But in a more complicated case where the value there isn't known until
> runtime, yeah, it scans everything. I'm not sure what the best way to
> fix that is. If the partition bounds were stored in a structured way,
> as we've been discussing, then the Append or Merge Append node could,
> when initialized, check which partition the id = X qual routes to and
> ignore the rest. But that's more iffy with the current
> representation, I think.

Huh. I was just testing:

WHERE event_time BETWEEN timestamptz '2014-12-01'
      AND ( timestamptz '2014-12-01' + interval '1 month' )

In that case, the expression above got folded to constants by the time Postgres did the index scans, but it scanned all partitions. So somehow (timestamptz + interval) doesn't get constant-folded until after planning, at least not on 9.3.

And of course this leaves out common patterns like "now() - interval '30 days'" or "to_timestamp('20141201','YYYYMMDD')".

Anyway, what I'm saying is that I personally regard the inability to handle even moderately complex expressions as a major failing of our existing partitioning scheme (possibly its worst single failing), and I would regard any new partitioning feature which didn't address that issue as suspect.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 12/16/2014 05:52 PM, Robert Haas wrote: >> But in a more complicated case where the value there isn't known until >> runtime, yeah, it scans everything. I'm not sure what the best way to >> fix that is. If the partition bounds were stored in a structured way, >> as we've been discussing, then the Append or Merge Append node could, >> when initialized, check which partition the id = X qual routes to and >> ignore the rest. But that's more iffy with the current >> representation, I think. > > Huh. I was just testing: > > WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz > '2014-12-01' + interval '1 month') > > In that case, the expression above got folded to constants by the time > Postgres did the index scans, but it scanned all partitions. So somehow > (timestamptz + interval) doesn't get constant-folded until after > planning, at least not on 9.3. > > And of course this leaves out common patterns like "now() - interval '30 > days'" or "to_timestamp('20141201','YYYYMMDD')" > > Anyway, what I'm saying is that I personally regard the inability to > handle even moderately complex expressions a major failing of our > existing partitioning scheme (possibly its worst single failing), and I > would regard any new partitioning feature which didn't address that > issue as suspect. I understand, but I think you need to be careful not to stonewall all progress in the name of getting what you want. Getting the partitioning metadata into the system catalogs in a suitable format will be a huge step forward regardless of whether it solves this particular problem right away or not, because it will make it possible to solve this problem in a highly-efficient way, which is quite hard to do right now. 
For example, we could (right now) write code that would do run-time partition pruning by taking the final filter clause, with all values substituted in, and re-checking for partitions that can be pruned via constraint exclusion. But that would be expensive and would often fail to find anything useful. Even in the best case where it works out it's O(n) in the number of partitions, and will therefore perform badly for large numbers of partitions (even, say, 1000). But once the partitioning metadata is stored in the catalog, we can implement this as a binary search -- O(lg n) time -- and the constant factor should be lower -- and it will be pretty easy to skip it in cases where it's useless so that we don't waste cycles spinning our wheels. Whether the initial patch covers all the cases you care about or not, and it probably won't, it will be a really big step towards making it POSSIBLE to handle those cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
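Robert's O(lg n) point can be sketched with Python's bisect, using invented bounds: once sorted range-partition lower bounds live in the catalog, routing a tuple (or pruning for an equality qual) is a binary search rather than an O(n) constraint-exclusion pass over every partition.

```python
import bisect

# partitions[i] covers keys in [lower_bounds[i], lower_bounds[i+1]);
# the last partition is unbounded above. Bounds are illustrative.
lower_bounds = [0, 1000, 2000, 3000]
partitions = ["p1", "p2", "p3", "p4"]

def route(key):
    # Find the rightmost bound <= key in O(lg n) comparisons.
    i = bisect.bisect_right(lower_bounds, key) - 1
    if i < 0:
        raise ValueError("no partition accepts key %r" % key)
    return partitions[i]

print(route(1500))    # p2, without touching the other bounds
```

The same lookup serves both tuple routing on ingress and partition pruning for an id = X qual at Append initialization time.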
On 12/16/2014 07:35 PM, Robert Haas wrote: > On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Anyway, what I'm saying is that I personally regard the inability to >> handle even moderately complex expressions a major failing of our >> existing partitioning scheme (possibly its worst single failing), and I >> would regard any new partitioning feature which didn't address that >> issue as suspect. > > I understand, but I think you need to be careful not to stonewall all > progress in the name of getting what you want. Getting the > partitioning metadata into the system catalogs in a suitable format > will be a huge step forward regardless of whether it solves this > particular problem right away or not, because it will make it possible > to solve this problem in a highly-efficient way, which is quite hard > to do right now. Sure. But there's a big difference between "we're going to take these steps and that problem will be fixable eventually" and "we're going to retain features of the current partitioning system which make that problem impossible to fix." The drift of discussion on this thread *sounded* like the latter, and I've been calling attention to the issue in an effort to make sure that it's not. Last week, I wrote a longish email listing out the common problems users have with our current partitioning as a way of benchmarking the plan for new partitioning. While some people responded to that post, absolutely nobody discussed the list of issues I gave. Is that because there's universal agreement that I got the major issues right? Seems doubtful. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 12/17/2014 08:53 PM, Josh Berkus wrote: > Last week, I wrote a longish email listing out the common problems users > have with our current partitioning as a way of benchmarking the plan for > new partitioning. While some people responded to that post, absolutely > nobody discussed the list of issues I gave. Is that because there's > universal agreement that I got the major issues right? Seems doubtful. That was a good list. - Heikki
On 12/17/2014 11:19 AM, Heikki Linnakangas wrote: > On 12/17/2014 08:53 PM, Josh Berkus wrote: >> Last week, I wrote a longish email listing out the common problems users >> have with our current partitioning as a way of benchmarking the plan for >> new partitioning. While some people responded to that post, absolutely >> nobody discussed the list of issues I gave. Is that because there's >> universal agreement that I got the major issues right? Seems doubtful. > > That was a good list. ;-) Ok, that made my morning. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 12/16/2014 07:35 PM, Robert Haas wrote: >> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> Anyway, what I'm saying is that I personally regard the inability to >>> handle even moderately complex expressions a major failing of our >>> existing partitioning scheme (possibly its worst single failing), and I >>> would regard any new partitioning feature which didn't address that >>> issue as suspect. >> >> I understand, but I think you need to be careful not to stonewall all >> progress in the name of getting what you want. Getting the >> partitioning metadata into the system catalogs in a suitable format >> will be a huge step forward regardless of whether it solves this >> particular problem right away or not, because it will make it possible >> to solve this problem in a highly-efficient way, which is quite hard >> to do right now. > > Sure. But there's a big difference between "we're going to take these > steps and that problem will be fixable eventually" and "we're going to > retain features of the current partitioning system which make that > problem impossible to fix." The drift of discussion on this thread > *sounded* like the latter, and I've been calling attention to the issue > in an effort to make sure that it's not. > > Last week, I wrote a longish email listing out the common problems users > have with our current partitioning as a way of benchmarking the plan for > new partitioning. While some people responded to that post, absolutely > nobody discussed the list of issues I gave. Is that because there's > universal agreement that I got the major issues right? Seems doubtful. I agreed with many of the things you listed but not all of them. However, I don't think it's realistic to burden whatever patch Amit writes with the duty of, for example, making global indexes work. That's a huge problem all of its own. 
Now, conceivably, we could try to solve that as part of the next patch by insisting that the "partitions" have to really be block number ranges within a single relfilenode rather than separate relfilenodes as they are today ... but I think that's a bad design which we would likely regret bitterly. I also think that it would likely make what's being talked about here so complicated that it will never go anywhere. I think it's better that we focus on solving one problem really well - storing metadata for partition boundaries in the catalog so that we can do efficient tuple routing and partition pruning - and leave the other problems for later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 18-12-2014 AM 04:52, Robert Haas wrote: > On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus <josh@agliodbs.com> wrote: >> >> Sure. But there's a big difference between "we're going to take these >> steps and that problem will be fixable eventually" and "we're going to >> retain features of the current partitioning system which make that >> problem impossible to fix." The drift of discussion on this thread >> *sounded* like the latter, and I've been calling attention to the issue >> in an effort to make sure that it's not. >> >> Last week, I wrote a longish email listing out the common problems users >> have with our current partitioning as a way of benchmarking the plan for >> new partitioning. While some people responded to that post, absolutely >> nobody discussed the list of issues I gave. Is that because there's >> universal agreement that I got the major issues right? Seems doubtful. > > I agreed with many of the things you listed but not all of them. > However, I don't think it's realistic to burden whatever patch Amit > writes with the duty of, for example, making global indexes work. > That's a huge problem all of its own. Now, conceivably, we could try > to solve that as part of the next patch by insisting that the > "partitions" have to really be block number ranges within a single > relfilenode rather than separate relfilenodes as they are today ... > but I think that's a bad design which we would likely regret bitterly. > I also think that it would likely make what's being talked about here > so complicated that it will never go anywhere. I think it's better > that we focus on solving one problem really well - storing metadata > for partition boundaries in the catalog so that we can do efficient > tuple routing and partition pruning - and leave the other problems for > later. > Yes, I think partitioning as a whole is a BIG enough project that we need to tackle it as a series of steps each of which is a discussion of its own. 
The first step might as well be discussing how we represent a partitioned table. We have a number of design decisions to make during this step itself and we would definitely want to reach a consensus on these points. Things like where we indicate if a table is partitioned (pg_class), what the partition key looks like, where it is stored, what the partition definition looks like, where it is stored, how we represent arbitrary number of levels in partitioning hierarchy, how we implement that only leaf level relations in a hierarchy have storage, what are implications of all these choices, etc. Some of these points are being discussed. I agree that while we are discussing these points, we could also be discussing how we solve problems of existing partitioning implementation using whatever the above things end up being. Proposed approaches to solve those problems might be useful to drive the first step as well or perhaps that's how it should be done anyway. Thanks, Amit
On 06-01-2015 PM 03:40, Amit Langote wrote:
> I agree that while we are discussing these points, we could also be
> discussing how we solve problems of existing partitioning implementation
> using whatever the above things end up being. Proposed approaches to
> solve those problems might be useful to drive the first step as well or
> perhaps that's how it should be done anyway.

I realize the discussion has not quite brought us to *conclusions* so far, though surely there has been valuable input from people. Anyway, I am starting a new thread with a summary of what has been discussed (please note that the order of the points does not necessarily connote their priority):

* It has been repeatedly pointed out that we may want to decouple partitioning from inheritance, because implementing partitioning as an extension of the inheritance mechanism means that we have to keep all the existing semantics, which might limit what we want to do with the special case of it that is partitioning; in other words, we would find ourselves in the difficult position of having to inject special-case code into a very generalized mechanism, namely inheritance. Specifically, do we regard a partition as a pg_inherits child of its partitioning parent?

* Syntax: do we want to make it similar to one of the many other databases out there? Or we could invent our own. I like the syntax that Robert suggested, which covers the cases of RANGE and LIST partitioning without actually having to use those keywords explicitly; something like the following:

CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]);

CREATE TABLE child PARTITION OF parent_name FOR VALUES { (value, ...) [ TO (value, ...) ] }

So instead of making a hard distinction between range and list partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7);
wherein, child is effectively a LIST partition

CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12);
wherein, child is effectively a RANGE partition on one column

CREATE TABLE child PARTITION OF parent_name FOR VALUES (20, 120) TO (30, 130);
wherein, child is effectively a RANGE partition on two columns

I wonder if we could add a clause like DISTRIBUTED BY to complement PARTITION ON that represents a hash-distributed/partitioned table (that could be a syntax to support sharded tables, maybe; we would definitely want to move ahead in that direction, I guess).

* Catalog: We would like to have a catalog structure suitable to implement capabilities like multi-column partitioning and sub-partitioning (with an arbitrary number of levels in the hierarchy). I had suggested that we create two new catalogs, viz. pg_partitioned_rel and pg_partition_def, to store metadata about the partition key of a partitioned relation and the partition bound info of a partition, respectively. Also, see the point about the on-disk representation of partition bounds.

* It is desirable to treat partitions as pg_class relations, with perhaps a new relkind(s). We may want to choose an implementation where only the lowest-level relations in a partitioning hierarchy have storage; those at the upper layers are mere placeholder relations, though of course with associated constraints determined by the partitioning criteria (with appropriate metadata entered into the additional catalogs).
I am not quite sure whether each kind of relation involved in the partitioning scheme has a separate namespace and, if so, how we implement that.

* In the initial implementation, we could just live with partitioning on a set of columns (and not arbitrary expressions of them).

* We perhaps do not need multi-column LIST partitions, as they are not very widely used and may complicate the implementation.

* There are a number of suggestions about how we represent partition bounds (on disk) - pg_node_tree, RECORD (a composite type or the rowtype associated with the relation itself), etc. An important point to consider here may be that the partition key may contain more than one column.

* How we represent the partition definition in memory (for a given partitioned relation) - an important point to remember is that such a representation should be efficient to iterate through or binary-search. Also see the points about tuple routing and partition pruning.

* Overflow/catchall partition: it seems we do not want/need them. It might seem desirable, for example, in cases where a big transaction enters a large number of tuples all but one of which find a defined partition; we may not want to error out in such a case but instead enter the erring tuple into the overflow partition. If we choose to implement that, we would also like to implement the capability to move the tuples into the appropriate partition once it is defined. Related is the notion of automatically creating partitions if one is not already defined for a just-entered tuple; but there may be locking troubles if many concurrent sessions try to do that.

* Tuple routing: based on the internal representation of partition bounds for the partitions of a given partitioned table, there should be a way to map a just-entered tuple to the partition id it belongs to. The BRIN-like machinery mentioned below could be made to work.

* Partition pruning: again, based on the internal representation of partition bounds for the partitions of a given partitioned table, there should be a way to prune partitions deemed unnecessary per the scan quals. One notable suggestion is to consider BRIN(-like) machinery. For example, it is able to tell from the scan quals whether a particular block range of a given heap needs to be scanned or not, based on the summary info index tuple for the block range. However, the interface is currently suitable for covering a single heap with blocks in the range 0 to N-1 of that heap. What we are looking for here is a hypothetical PartitionMemTuple (or PartitionBound) that is a summary of a whole relation (in this case, the partition), NOT a block range. But I guess the infrastructure is generalized enough that we could make that work. Related then would be an equivalent of ScanKey for the partitioning case. Just as ScanKeyData has correspondence with the index being used, the hypothetical PartitionScanKeyData (which may be an entirely bad/half-baked idea!) would represent the application of a comparison operator between a table column (partitioning key column) and a constant (as per the quals).

Please help bridge the gap in my understanding of these points. I hope we can put the discussion on a concrete footing so that it leads to a way towards implementation sooner rather than later. Some points need more immediate attention, as we would like to first tackle the issue of partition metadata. Reusing existing infrastructure should be encouraged, with obvious enhancements as we find fit.

I am beginning to feel there is a need to prototype a good-enough solution that incorporates the suggestions that have already been provided or will be provided. It may be the only way forward, though I think it definitely worthwhile to spend some time arriving at such a set of good-enough ideas on various aspects.

Thanks,
Amit
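[Editor's note: the partition-pruning point above can be sketched in miniature. This is plain Python for illustration only; prune_partitions and its bound layout are hypothetical, not anything in PostgreSQL. For range partitioning with the upper bounds kept in a sorted list, two binary searches bound the set of partitions a scan range can touch.]

```python
from bisect import bisect_right

def prune_partitions(bounds, lo, hi):
    """Return indexes of the range partitions that may hold values in [lo, hi].

    'bounds' is the sorted list of partition upper bounds: partition i
    covers bounds[i-1] <= v < bounds[i], and partition len(bounds)
    covers everything from bounds[-1] upward."""
    first = bisect_right(bounds, lo)  # partition containing lo
    last = bisect_right(bounds, hi)   # partition containing hi
    return list(range(first, last + 1))
```

For example, with bounds [10, 20, 30], a scan qual of 12 <= v <= 25 keeps only partitions 1 and 2; everything outside that window is pruned without examining the pruned partitions at all.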
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * It has been repeatedly pointed out that we may want to decouple
> partitioning from inheritance because implementing partitioning as an
> extension of inheritance mechanism means that we have to keep all the
> existing semantics which might limit what we want to do with the special
> case of it which is partitioning; in other words, we would find
> ourselves in difficult position where we have to inject a special case
> code into a very generalized mechanism that is inheritance.
> Specifically, do we regard a partition as a pg_inherits child of its
> partitioning parent?

I don't think this is totally an all-or-nothing decision. I think everyone is agreed that we need to not break things that work today -- e.g. Merge Append. What that implies for pg_inherits isn't altogether clear.

> * Syntax: do we want to make it similar to one of the many other
> databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like some other database but not quite. I lean toward inventing our own since I'm not aware of something that we'd want to copy exactly.

> I wonder if we could add a clause like DISTRIBUTED BY to complement
> PARTITION ON that represents a hash distributed/partitioned table (that
> could be a syntax to support sharded tables maybe; we would definitely
> want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much about that now.

> * Catalog: We would like to have a catalog structure suitable to
> implement capabilities like multi-column partitioning, sub-partitioning
> (with arbitrary number of levels in the hierarchy). I had suggested
> that we create two new catalogs viz. pg_partitioned_rel,
> pg_partition_def to store metadata about a partition key of a
> partitioned relation and partition bound info of a partition,
> respectively. Also, see the point about on-disk representation of
> partition bounds

I'm not convinced that there is any benefit in spreading this information across two tables, neither of which exists today. If the representation of the partitioning scheme is going to be a node tree, then there's no point in taking what would otherwise have been a List and storing each element of it in a separate tuple. The overarching point here is that the system catalog structure should be whatever is most convenient for the system internals; I'm not sure we understand what that is yet.

> * It is desirable to treat partitions as pg_class relations with perhaps
> a new relkind(s). We may want to choose an implementation where only the
> lowest level relations in a partitioning hierarchy have storage; those
> at the upper layers are mere placeholder relations though of course with
> associated constraints determined by partitioning criteria (with
> appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to denote that they have no storage; I am not convinced that there's any reason to change the relkind for the leaf nodes. But that's been proposed, so evidently someone thinks there's a reason to do it.

> I am not quite sure if each kind of the relations involved in the
> partitioning scheme have separate namespaces and, if they are, how we
> implement that

I am in favor of having all of the nodes in the hierarchy have names just as relations do today -- pg_class.relname. Anything else seems to me to be complex to implement and of very marginal benefit. But again, it's been proposed.

> * In the initial implementation, we could just live with partitioning on
> a set of columns (and not arbitrary expressions of them)

Seems quite fair.
> * We perhaps do not need multi-column LIST partitions as they are not
> very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to complicate the implementation much. Depending on how the implementation shakes out, prohibiting it might come to seem like more of a wart than allowing it.

> * There are a number of suggestions about how we represent partition
> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> associated with the relation itself), etc. Important point to consider
> here may be that partition key may contain more than one column

Yep.

> * How we represent partition definition in memory (for a given
> partitioned relation) - important point to remember is that such a
> representation should be efficient to iterate through or
> binary-searchable. Also see the points about tuple-routing and
> partition-pruning

Yep.

> * Overflow/catchall partition: it seems we do not want/need them. It
> might seem desirable for example in cases where a big transaction enters
> a large number of tuples all but one of which find a defined partition;
> we may not want to error out in such case but instead enter that erring
> tuple into the overflow partition instead. If we choose to implement
> that, we would like to also implement the capability to move the tuples
> into the appropriate partition once it's defined. Related is the notion
> of automatically creating partitions if one is not already defined for a
> just entered tuple; but there may be locking troubles if many concurrent
> sessions try to do that

I think that dynamically creating new partitions is way beyond the scope of what this patch should be trying to do. If we ever do it at all, it should not be now. The value of a default partition (aka overflow partition) seems to me to be debatable. For range partitioning, it doesn't seem entirely necessary provided that you can define a range with only one endpoint (e.g. partition A has values 1 to 10, B has 11 and up, and C has 0 and down). For list partitioning, though, you might well want something like that. But is it a must-have? Dunno.

> * Tuple-routing: based on the internal representation of partition
> bounds for the partitions of a given partitioned table, there should be
> a way to map a just entered tuple to partition id it belongs to. Below
> mentioned BRIN-like machinery could be made to work
>
> * Partition-pruning: again, based on the internal representation of
> partition bounds for the partitions of a given partitioned table, there
> should be a way to prune partitions deemed unnecessary per scan quals.
> One notable suggestion is to consider BRIN (-like) machinery. For
> example, it is able to tell from the scan quals whether a particular
> block range of a given heap needs to be scanned or not based on summary
> info index tuple for the block range. Though, the interface is currently
> suitable to cover a single heap with blocks in range 0 to N-1 of that
> heap. What we are looking for here is a hypothetical PartitionMemTuple
> (or PartitionBound) that is a summary of a whole relation (in this case,
> the partition) NOT a block range. But I guess the infrastructure is
> generalized enough that we could make that work. Related then would be
> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
> has correspondence with the index being used, the hypothetical
> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
> would represent the application of comparison operator between table
> column (partitioning key column) and a constant (as per quals).

I'm not going to say this couldn't be done, but how is any of it better than having a list of the partition bounds and binary-searching it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
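[Editor's note: Robert's closing question -- why not just binary-search a sorted list of partition bounds -- can be made concrete with a small sketch. Plain Python for illustration only; route_tuple and its bound layout are hypothetical, not PostgreSQL code. With the bounds kept sorted, routing a tuple is a single bisection.]

```python
from bisect import bisect_right

def route_tuple(bounds, key_value):
    """Map a partition-key value to a partition index.

    'bounds' is the sorted list of range-partition upper bounds:
    partition i covers bounds[i-1] <= v < bounds[i], and the last
    partition (index len(bounds)) covers everything above bounds[-1]."""
    return bisect_right(bounds, key_value)
```

With bounds [10, 20, 30], a key of 5 routes to partition 0, a key of 10 to partition 1 (lower bounds are taken as inclusive here), and a key of 42 to partition 3. An O(log n) lookup like this scales to many partitions without evaluating each partition's constraint in turn, which seems to be the contrast being drawn with the BRIN-like proposal.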
On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>> PARTITION ON that represents a hash distributed/partitioned table (that
>> could be a syntax to support sharded tables maybe; we would definitely
>> want to move ahead in that direction I guess)
>
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.

Instead we might want to specify which server (foreign or local) each of the partitions goes to, something like a LOCATED ON clause for each of the partitions, with the local server as the default.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 19-01-2015 PM 12:37, Ashutosh Bapat wrote:
> On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>>> PARTITION ON that represents a hash distributed/partitioned table (that
>>> could be a syntax to support sharded tables maybe; we would definitely
>>> want to move ahead in that direction I guess)
>>
>> Maybe eventually, but let's not complicate things by worrying too much
>> about that now.
>
> Instead we might want to specify which server (foreign or local) each of
> the partitions goes to, something like a LOCATED ON clause for each of the
> partitions, with the local server as the default.

Given how things stand today, we do not allow DDL through the FDW interface, unless I'm missing something. So, we are restricted to only going the other way around, say:

CREATE FOREIGN TABLE partXX PARTITION OF parent SERVER ...;

assuming we like the proposed syntax, CREATE TABLE child PARTITION OF parent. I think this also assumes we are relying on foreign table inheritance; that is, both that partitioning is based on inheritance and that foreign tables support inheritance (which should be the case soon).

Still, I think Robert may be correct in that it will be a while before we integrate foreign tables with the partitioning scheme (I guess mostly the syntax aspect of it).

Thanks,
Amit
On 17-01-2015 AM 02:34, Robert Haas wrote:
> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> * It has been repeatedly pointed out that we may want to decouple
>> partitioning from inheritance because implementing partitioning as an
>> extension of inheritance mechanism means that we have to keep all the
>> existing semantics which might limit what we want to do with the special
>> case of it which is partitioning; in other words, we would find
>> ourselves in difficult position where we have to inject a special case
>> code into a very generalized mechanism that is inheritance.
>> Specifically, do we regard a partition as a pg_inherits child of its
>> partitioning parent?
>
> I don't think this is totally an all-or-nothing decision. I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append. What that implies for pg_inherits isn't altogether
> clear.

One point is that an implementation may end up establishing the parent-partition hierarchy somewhere other than (or in addition to) pg_inherits. One intention would be to avoid tying the partitioning scheme to certain inheritance features that use pg_inherits. For example, consider the call sites of find_all_inheritors(); one notable example is Append/MergeAppend, which would be of interest to partitioning. We would want to reuse that part of the infrastructure, but we might as well write an equivalent, say find_all_partitions(), which scans something other than pg_inherits to get all partitions. Alternatively, we may not want to do that and instead add special-case code to prevent partitioning from fiddling with unnecessary inheritance features in the inheritance code paths. This seems like an important decision to make.

>> * Syntax: do we want to make it similar to one of the many other
>> databases out there? Or we could invent our own?
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite. I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
>
>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>> PARTITION ON that represents a hash distributed/partitioned table (that
>> could be a syntax to support sharded tables maybe; we would definitely
>> want to move ahead in that direction I guess)
>
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.

Agree that we may not want to mix the two too much at this point.

>> * Catalog: We would like to have a catalog structure suitable to
>> implement capabilities like multi-column partitioning, sub-partitioning
>> (with arbitrary number of levels in the hierarchy). I had suggested
>> that we create two new catalogs viz. pg_partitioned_rel,
>> pg_partition_def to store metadata about a partition key of a
>> partitioned relation and partition bound info of a partition,
>> respectively. Also, see the point about on-disk representation of
>> partition bounds
>
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today. If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.

Agree that some concrete idea of the internal representation should help guide the catalog structure. If we are going to cache the partitioning info in the relcache (which we most definitely will), then we should try to make sure to consider the scenario of having a lot of partitioned tables with a lot of individual partitions.
It looks like it would be similar to a scenario where there are a lot of inheritance hierarchies, but the availability of a partitioning feature would definitely cause these numbers to grow larger. Perhaps this is an important point driving this discussion. I guess this remains tied to the decision we would like to make regarding inheritance (pg_inherits, etc.).

>> * It is desirable to treat partitions as pg_class relations with perhaps
>> a new relkind(s). We may want to choose an implementation where only the
>> lowest level relations in a partitioning hierarchy have storage; those
>> at the upper layers are mere placeholder relations though of course with
>> associated constraints determined by partitioning criteria (with
>> appropriate metadata entered into the additional catalogs).
>
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes. But that's been
> proposed, so evidently someone thinks there's a reason to do it.

Again, this remains partly tied to the decisions we make regarding catalog structure.

I am not sure, but would we ever need to tell from a pg_class entry that a leaf relation has partition bounds associated with it? One reason we may not need to is that we would rather use relispartitioned of a non-leaf relation to trigger finding all its partitions and their associated bounds; we don't need to know (or reserve a field for the fact) that a relation has partition bounds associated with it. The bounds can be stored in pg_partition, indexed by relid. Maybe relkind is not the right field for this anyway.

With that said, would we be comfortable with putting the partition key into pg_class (maybe as a pg_node_tree also encapsulating the opclass), so that if relispartitioned, we also look for relpartkey?
>> I am not >> quite sure if each kind of the relations involved in the partitioning >> scheme have separate namespaces and, if they are, how we implement that > > I am in favor of having all of the nodes in the hierarchy have names > just as relations do today -- pg_class.relname. Anything else seems > to me to be complex to implement and of very marginal benefit. But > again, it's been proposed. > The same follows from the my other comments. >> * In the initial implementation, we could just live with partitioning on >> a set of columns (and not arbitrary expressions of them) > > Seems quite fair. > >> * We perhaps do not need multi-column LIST partitions as they are not >> very widely used and may complicate the implementation > > I agree that the use case is marginal; but I'm not sure it needs to > complicate the implementation much. Depending on how the > implementation shakes out, prohibiting it might come to seem like more > of a wart than allowing it. > Hmm, I guess implementation may turn out to be generalized enough that prohibiting it would become a special case and more work. >> * There are a number of suggestions about how we represent partition >> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype >> associated with the relation itself), etc. Important point to consider >> here may be that partition key may contain more than one column > > Yep. > >> * How we represent partition definition in memory (for a given >> partitioned relation) - important point to remember is that such a >> representation should be efficient to iterate through or >> binary-searchable. Also see the points about tuple-routing and >> partition-pruning > > Yep. > >> * Overflow/catchall partition: it seems we do not want/need them. 
It >> might seem desirable for example in cases where a big transaction enters >> a large number of tuples all but one of which find a defined partition; >> we may not want to error out in such case but instead enter that erring >> tuple into the overflow partition instead. If we choose to implement >> that, we would like to also implement the capability to move the tuples >> into the appropriate partition once it's defined. Related is the notion >> of automatically creating partitions if one is not already defined for a >> just entered tuple; but there may be locking troubles if many concurrent >> sessions try to do that > > I think that dynamically creating new partitions is way beyond the > scope of what this patch should be trying to do. If we ever do it at > all, it should not be now. The value of a default partition (aka > overflow partition) seems to me to be debatable. For range > partitioning, it doesn't seem entirely necessary provided that you can > define a range with only one endpoint (e.g. partition A has values 1 > to 10, B has 11 and up, and C has 0 and down). For list partitioning, > though, you might well want something like that. But is it a > must-have? Dunno. > >> * Tuple-routing: based on the internal representation of partition >> bounds for the partitions of a given partitioned table, there should be >> a way to map a just entered tuple to partition id it belongs to. Below >> mentioned BRIN-like machinery could be made to work >> >> * Partition-pruning: again, based on the internal representation of >> partition bounds for the partitions of a given partitioned table, there >> should be a way to prune partitions deemed unnecessary per scan quals. >> One notable suggestion is to consider BRIN (-like) machinery. For >> example, it is able to tell from the scan quals whether a particular >> block range of a given heap needs to be scanned or not based on summary >> info index tuple for the block range. 
>> Though, the interface is currently suitable to cover a single heap
>> with blocks in range 0 to N-1 of that heap.  What we are looking for
>> here is a hypothetical PartitionMemTuple (or PartitionBound) that is
>> a summary of a whole relation (in this case, the partition), NOT a
>> block range.  But I guess the infrastructure is generalized enough
>> that we could make that work.  Related then would be an equivalent
>> of ScanKey for the partitioning case.  Just as ScanKeyData has
>> correspondence with the index being used, the hypothetical
>> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
>> would represent the application of a comparison operator between a
>> table column (partitioning key column) and a constant (as per quals).
>
> I'm not going to say this couldn't be done, but how is any of it
> better than having a list of the partition bounds and binary-searching
> it?

Of course, my description of it is pretty hand-wavy.

A primary question for me about partition-pruning is when we do it.
Should we model it after relation_excluded_by_constraints() and hence
do it entirely at plan time?  But the tone of the discussion is that we
postpone partition-pruning to execution time, hence my perhaps
misdirected attempts to inject it into some executor machinery.

Thanks,
Amit
On 20-01-2015 AM 10:48, Amit Langote wrote:
> On 17-01-2015 AM 02:34, Robert Haas wrote:
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> * It is desirable to treat partitions as pg_class relations with
>>> perhaps a new relkind(s).  We may want to choose an implementation
>>> where only the lowest-level relations in a partitioning hierarchy
>>> have storage; those at the upper layers are mere placeholder
>>> relations, though of course with associated constraints determined
>>> by the partitioning criteria (with appropriate metadata entered
>>> into the additional catalogs).
>>
>> I think the storage-less parents need a new relkind precisely to
>> denote that they have no storage; I am not convinced that there's any
>> reason to change the relkind for the leaf nodes.  But that's been
>> proposed, so evidently someone thinks there's a reason to do it.
>
> Again, this remains partly tied to decisions we make regarding catalog
> structure.
>
> I am not sure, but wouldn't we ever need to tell from a pg_class entry
> that a leaf relation has partition bounds associated with it?  One
> reason I can see that we may not need it is that we would rather use
> relispartitioned of a non-leaf relation to trigger finding all its
> partitions and their associated bounds; we don't need to know (or
> reserve a field for) that a relation has partition bounds associated
> with it.  The bounds can be stored in pg_partition indexed by relid.
> Maybe relkind is not the right field for this anyway.
>
> With that said, would we be comfortable with putting the partition key
> into pg_class (maybe as a pg_node_tree also encapsulating the opclass)
> so that if relispartitioned, we also look for relpartkey?

This paints a picture in which our leaf relations would be plain old
relations.  They are similar in almost all respects (how they are
planned, modified, maintained, ...).
They just have an additional property that the values they can contain
are restricted by, say, pg_partition.values; but that doesn't concern
how they are planned.  Planning-related changes are confined to the
upper layers of the hierarchy instead.  It's kind of like saying that
instead of doing relation_excluded_by_constraints(childrel), we'd do
prune_useless_partitions(&partitionedrel), possibly at some other site
than its counterpart.  I guess that illustrates the point.  I am again
not sure: if we want to limit access to individual partitions except
via some special syntax (which we have been discussing), what does
that mean for the above?  Such access limiting could (only) be
facilitated by a new relkind.

On the other hand, the non-leaf relations are a slightly new kind of
relation in that they do not have storage (they could have a
tablespace, which would be the default tablespace for their underlying
partitions).  Obviously, they do not have indexes pointing at them.
Because they are further partitioned, they are planned differently -
most probably an Append with partition-pruning (almost like Append
with constraint-exclusion, but supposedly quicker because of the
explicit access to the partition definitions, and perhaps done at
execution time).  INSERT/COPY on these involves routing the tuple to
the appropriate leaf relation.

Not surprisingly, this is almost the same picture that Alvaro had
presented, modulo some differences.

Thanks,
Amit
On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Specifically, do we regard partitions as pg_inherits children of
>>> their partitioning parent?
>>
>> I don't think this is totally an all-or-nothing decision.  I think
>> everyone is agreed that we need to not break things that work today
>> -- e.g. Merge Append.  What that implies for pg_inherits isn't
>> altogether clear.
>
> One point is that an implementation may end up establishing the
> parent-partition hierarchy somewhere other than (or in addition to)
> pg_inherits.  One intention would be to avoid tying the partitioning
> scheme to certain inheritance features that use pg_inherits.  For
> example, consider the call sites of find_all_inheritors().  One
> notable example is Append/MergeAppend, which would be of interest to
> partitioning.  We would want to reuse that part of the
> infrastructure, but we might as well write an equivalent, say
> find_all_partitions(), which scans something other than pg_inherits
> to get all partitions.

IMHO, there's little reason to avoid putting pg_inherits entries in
for the partitions, and then this just works.  We can find other ways
to make it work if that turns out to be better, but if we don't have
one, there's no reason to complicate things.

> Agree that some concrete idea of the internal representation should
> help guide the catalog structure.  If we are going to cache the
> partitioning info in the relcache (which we most definitely will),
> then we should try to make sure to consider the scenario of having a
> lot of partitioned tables with a lot of individual partitions.  It
> looks like it would be similar to a scenario where there are a lot of
> inheritance hierarchies.  But availability of the partitioning
> feature would definitely cause these numbers to grow larger.  Perhaps
> this is an important point driving this discussion.

Yeah, it would be good if the costs of supporting, say, 1000
partitions were negligible.
> A primary question for me about partition-pruning is when we do it.
> Should we model it after relation_excluded_by_constraints() and hence
> do it entirely at plan time?  But the tone of the discussion is that
> we postpone partition-pruning to execution time, hence my perhaps
> misdirected attempts to inject it into some executor machinery.

It's useful to prune partitions at plan time, because then you only
have to do the work once.  But sometimes you don't know enough to do
it at plan time, so it's useful to do it at execution time, too.
Then, you can do it differently for every tuple based on the actual
value you have.  There's no point in doing 999 unnecessary relation
scans if we can tell which partition the actual run-time value must be
in.  But I think execution-time pruning can be a follow-on patch.  If
you don't restrict the scope of the first patch as much as possible,
you're not going to have much luck getting this committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21-01-2015 AM 01:42, Robert Haas wrote:
> On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> Specifically, do we regard partitions as pg_inherits children of
>>>> their partitioning parent?
>>>
>>> I don't think this is totally an all-or-nothing decision.  I think
>>> everyone is agreed that we need to not break things that work today
>>> -- e.g. Merge Append.  What that implies for pg_inherits isn't
>>> altogether clear.
>>
>> One point is that an implementation may end up establishing the
>> parent-partition hierarchy somewhere other than (or in addition to)
>> pg_inherits.  One intention would be to avoid tying the partitioning
>> scheme to certain inheritance features that use pg_inherits.  For
>> example, consider the call sites of find_all_inheritors().  One
>> notable example is Append/MergeAppend, which would be of interest to
>> partitioning.  We would want to reuse that part of the
>> infrastructure, but we might as well write an equivalent, say
>> find_all_partitions(), which scans something other than pg_inherits
>> to get all partitions.
>
> IMHO, there's little reason to avoid putting pg_inherits entries in
> for the partitions, and then this just works.  We can find other ways
> to make it work if that turns out to be better, but if we don't have
> one, there's no reason to complicate things.

Ok, I will go forward and stick to the pg_inherits approach for now.
Perhaps the concerns I am expressing have other solutions that don't
require abandoning the pg_inherits approach altogether.

>> Agree that some concrete idea of the internal representation should
>> help guide the catalog structure.  If we are going to cache the
>> partitioning info in the relcache (which we most definitely will),
>> then we should try to make sure to consider the scenario of having a
>> lot of partitioned tables with a lot of individual partitions.  It
>> looks like it would be similar to a scenario where there are a lot
>> of inheritance hierarchies.
>> But availability of the partitioning feature would definitely cause
>> these numbers to grow larger.  Perhaps this is an important point
>> driving this discussion.
>
> Yeah, it would be good if the costs of supporting, say, 1000
> partitions were negligible.
>
>> A primary question for me about partition-pruning is when we do it.
>> Should we model it after relation_excluded_by_constraints() and
>> hence do it entirely at plan time?  But the tone of the discussion
>> is that we postpone partition-pruning to execution time, hence my
>> perhaps misdirected attempts to inject it into some executor
>> machinery.
>
> It's useful to prune partitions at plan time, because then you only
> have to do the work once.  But sometimes you don't know enough to do
> it at plan time, so it's useful to do it at execution time, too.
> Then, you can do it differently for every tuple based on the actual
> value you have.  There's no point in doing 999 unnecessary relation
> scans if we can tell which partition the actual run-time value must
> be in.  But I think execution-time pruning can be a follow-on patch.
> If you don't restrict the scope of the first patch as much as
> possible, you're not going to have much luck getting this committed.

Ok, I will limit myself to focusing on the following things at the
moment:

* Provide syntax in CREATE TABLE to declare the partition key
* Provide syntax in CREATE TABLE to declare a table as a partition of
  a partitioned table, and the values it contains
* Arrange to have the partition key and values stored in appropriate
  catalogs (existing or new)
* Arrange to cache the partitioning info of partitioned tables in the
  relcache

Thanks,
Amit
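[Editor's note: for concreteness, the first two bullets of the list
above might end up looking something like the sketch below.  This is
purely illustrative syntax invented here to show the shape of the two
declarations; the actual grammar -- keywords, how bounds are written,
whether they are inclusive -- is exactly what the discussion has yet to
settle.]

```
-- Hypothetical, illustrative syntax only: declare the partition key...
CREATE TABLE measurement (
    logdate   date,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- ...and declare a table as a partition holding a range of values.
CREATE TABLE measurement_y2015 PARTITION OF measurement
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');
```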
On 21-01-2015 PM 07:26, Amit Langote wrote:
> Ok, I will limit myself to focusing on the following things at the
> moment:
>
> * Provide syntax in CREATE TABLE to declare the partition key

While working on this, I stumbled upon the question of how we deal
with any index definitions following from constraints defined in a
CREATE statement.  I think we do not want a physical index created for
a table that is partitioned (in other words, one that has no heap of
its own).  Under the current mechanisms, constraints like PRIMARY KEY,
UNIQUE and EXCLUSION constraints are enforced via indexes.  It seems
there are really two decisions to make here:

1) How do we deal with any index definitions (either explicit, or
implicit following from constraints defined on the table)?  Do we
allow them by marking them specially, say, in pg_index, as being mere
placeholders/templates, or invent some other mechanism?

2) As a short-term solution, do we simply reject creating any indexes
(/any constraints that require them) on a table whose definition also
includes a PARTITION ON clause, and instead define them on its
partitions (or any relations in the hierarchy that are not further
partitioned)?

Or maybe I'm missing something...

Thanks,
Amit
On 1/25/15 7:42 PM, Amit Langote wrote:
> On 21-01-2015 PM 07:26, Amit Langote wrote:
>> Ok, I will limit myself to focusing on the following things at the
>> moment:
>>
>> * Provide syntax in CREATE TABLE to declare the partition key
>
> While working on this, I stumbled upon the question of how we deal
> with any index definitions following from constraints defined in a
> CREATE statement.  I think we do not want a physical index created
> for a table that is partitioned (in other words, one that has no heap
> of its own).  Under the current mechanisms, constraints like PRIMARY
> KEY, UNIQUE and EXCLUSION constraints are enforced via indexes.  It
> seems there are really two decisions to make here:
>
> 1) How do we deal with any index definitions (either explicit, or
> implicit following from constraints defined on the table)?  Do we
> allow them by marking them specially, say, in pg_index, as being mere
> placeholders/templates, or invent some other mechanism?
>
> 2) As a short-term solution, do we simply reject creating any indexes
> (/any constraints that require them) on a table whose definition also
> includes a PARTITION ON clause, and instead define them on its
> partitions (or any relations in the hierarchy that are not further
> partitioned)?
>
> Or maybe I'm missing something...

Wasn't the idea that the parent table in a partitioned table wouldn't
actually have a heap of its own?  If there's no heap, there can't be
an index.

That said, I think this is premature optimization that could be done
later.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 27-01-2015 AM 05:46, Jim Nasby wrote:
> On 1/25/15 7:42 PM, Amit Langote wrote:
>> On 21-01-2015 PM 07:26, Amit Langote wrote:
>>> Ok, I will limit myself to focusing on the following things at the
>>> moment:
>>>
>>> * Provide syntax in CREATE TABLE to declare the partition key
>>
>> While working on this, I stumbled upon the question of how we deal
>> with any index definitions following from constraints defined in a
>> CREATE statement.  I think we do not want a physical index created
>> for a table that is partitioned (in other words, one that has no
>> heap of its own).  Under the current mechanisms, constraints like
>> PRIMARY KEY, UNIQUE and EXCLUSION constraints are enforced via
>> indexes.  It seems there are really two decisions to make here:
>>
>> 1) How do we deal with any index definitions (either explicit, or
>> implicit following from constraints defined on the table)?  Do we
>> allow them by marking them specially, say, in pg_index, as being
>> mere placeholders/templates, or invent some other mechanism?
>>
>> 2) As a short-term solution, do we simply reject creating any
>> indexes (/any constraints that require them) on a table whose
>> definition also includes a PARTITION ON clause, and instead define
>> them on its partitions (or any relations in the hierarchy that are
>> not further partitioned)?
>>
>> Or maybe I'm missing something...
>
> Wasn't the idea that the parent table in a partitioned table wouldn't
> actually have a heap of its own?  If there's no heap, there can't be
> an index.

Yes, that's right.  Perhaps we should not tackle the heap-less
partitioned relation thingy so soon, as you say below.

> That said, I think this is premature optimization that could be done
> later.

It seems so.

Thanks,
Amit