Обсуждение: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

Поиск
Список
Период
Сортировка

Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Rushabh Lathia
Дата:
Hi,

Consider the below test:

CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
INSERT INTO foo select i,i,i from generate_series(1,4)i;

CREATE TABLE foo_d (like foo);
INSERT INTO foo_d select i,i,i from generate_series(1,9)i;

ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;

Above ATTACH PARTITION should fail with "partition constraint is violated"
error, but instead it's working on a master branch.

Looking further I found that problem introduced with below commit:

commit 4dba331cb3dc1b5ffb0680ed8efae847de216796
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Mar 20 11:19:41 2018 -0300

    Fix CommandCounterIncrement in partition-related DDL
    
    It makes sense to do the CCIs in the places that do catalog updates,
    rather than before the places that error out because the former ones
    fail to do it.  In particular, it looks like StorePartitionBound() and
    IndexSetParentIndex() ought to make their own CCIs.
    
    Per review comments from Peter Eisentraut for row-level triggers on
    partitioned tables.

I noticed that further below commit tried to correct thing in similar
area, but still I am noticing the broken behavior in case of attaching
the partition as DEFAULT.

commit 56163004b8b2151db279744b77138d4d90e2d5cb
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Wed Mar 21 12:03:35 2018 -0300

    Fix relcache handling of the 'default' partition
    
Regards,


--
Rushabh Lathia

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Rushabh Lathia wrote:

> CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
> CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
> INSERT INTO foo select i,i,i from generate_series(1,4)i;
> 
> CREATE TABLE foo_d (like foo);
> INSERT INTO foo_d select i,i,i from generate_series(1,9)i;
> 
> ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;
> 
> Above ATTACH PARTITION should fail with "partition constraint is violated"
> error, but instead it's working on a master branch.

Hmm, offhand I don't quite see why this error fails to be thrown.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Rushabh Lathia
Дата:


On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Rushabh Lathia wrote:

> CREATE TABLE foo (a INT, b INT, c VARCHAR) PARTITION BY LIST(a);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1,2);
> CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (3,4);
> INSERT INTO foo select i,i,i from generate_series(1,4)i;
>
> CREATE TABLE foo_d (like foo);
> INSERT INTO foo_d select i,i,i from generate_series(1,9)i;
>
> ALTER TABLE foo ATTACH PARTITION foo_d DEFAULT;
>
> Above ATTACH PARTITION should fail with "partition constraint is violated"
> error, but instead it's working on a master branch.

Hmm, offhand I don't quite see why this error fails to be thrown.


ATTACH PARTITION should throw an error, because partition table "foo"
already have two partition with key values (1, 2,3 4). And table "foo_d" 
which we are attaching here has :

postgres@76099=#select * from foo_d;
 a | b | c 
---+---+---
 1 | 1 | 1
 2 | 2 | 2
 3 | 3 | 3
 4 | 4 | 4
 5 | 5 | 5
 6 | 6 | 6
 7 | 7 | 7
 8 | 8 | 8
 9 | 9 | 9
(9 rows)

After ATTACH PARTITION, when we see the partition constraint for
the newly attached partition:

postgres@76099=#\d+ foo_d
                                        Table "public.foo_d"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              | 
 b      | integer           |           |          |         | plain    |              | 
 c      | character varying |           |          |         | extended |              | 
Partition of: foo DEFAULT
Partition constraint: (NOT ((a IS NOT NULL) AND (a = ANY (ARRAY[1, 2, 3, 4]))))

So, it says that this partition (table) should not include values (1,2, 3, 4). But
it sill has those values.

postgres@76099=#select tableoid::regclass, * from foo;
 tableoid | a | b | c 
----------+---+---+---
 foo_p1   | 1 | 1 | 1
 foo_p1   | 2 | 2 | 2
 foo_p2   | 3 | 3 | 3
 foo_p2   | 4 | 4 | 4
 foo_d    | 1 | 1 | 1
 foo_d    | 2 | 2 | 2
 foo_d    | 3 | 3 | 3
 foo_d    | 4 | 4 | 4
 foo_d    | 5 | 5 | 5
 foo_d    | 6 | 6 | 6
 foo_d    | 7 | 7 | 7
 foo_d    | 8 | 8 | 8
 foo_d    | 9 | 9 | 9
(13 rows)

So basically ATTACH PARTITION should throw an error when partition
constraint is violated.


--
Rushabh Lathia

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Rushabh Lathia wrote:
> On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:

> > Hmm, offhand I don't quite see why this error fails to be thrown.
>
> ATTACH PARTITION should throw an error, because partition table "foo"
> already have two partition with key values (1, 2,3 4). And table "foo_d"
> which we are attaching here has :

Oh, I understand how it's supposed to work.  I was just saying I don't
understand how this bug occurs.  Is it because we fail to determine the
correct partition constraint for the default partition in time for its
verification scan?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Kyotaro HORIGUCHI
Дата:
At Thu, 29 Mar 2018 13:04:06 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
<20180329160406.ii2wgbkmlnfxtwbt@alvherre.pgsql>
> Rushabh Lathia wrote:
> > On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> > wrote:
> 
> > > Hmm, offhand I don't quite see why this error fails to be thrown.
> >
> > ATTACH PARTITION should throw an error, because partition table "foo"
> > already have two partition with key values (1, 2,3 4). And table "foo_d"
> > which we are attaching here has :
> 
> Oh, I understand how it's supposed to work.  I was just saying I don't
> understand how this bug occurs.  Is it because we fail to determine the
> correct partition constraint for the default partition in time for its
> verification scan?

The reason is that CommandCounterIncrement added in
StorePartitionBound reveals the just added default partition to
get_default_oid_from_partdesc too early.  The revealed partition
has immature constraint and it overrites the right constraint
generated just above.

ATExecAttachPartition checks for default partition oid twice but
the second is just needless before the commit and harms after it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8f83aa4675..96b62bd4b6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14082,11 +14082,9 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
     }
 
     /*
-     * Check whether default partition has a row that would fit the partition
-     * being attached.
+     * Check whether existing default partition has a row that would fit the
+     * partition being attached.
      */
-    defaultPartOid =
-        get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
     if (OidIsValid(defaultPartOid))
     {
         Relation    defaultrel;

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Amit Langote
Дата:
On 2018/03/30 17:31, Kyotaro HORIGUCHI wrote:
> At Thu, 29 Mar 2018 13:04:06 -0300, Alvaro Herrera wrote:
>> Rushabh Lathia wrote:
>>> On Thu, Mar 29, 2018 at 7:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
>>> wrote:
>>
>>>> Hmm, offhand I don't quite see why this error fails to be thrown.
>>>
>>> ATTACH PARTITION should throw an error, because partition table "foo"
>>> already have two partition with key values (1, 2,3 4). And table "foo_d"
>>> which we are attaching here has :
>>
>> Oh, I understand how it's supposed to work.  I was just saying I don't
>> understand how this bug occurs.  Is it because we fail to determine the
>> correct partition constraint for the default partition in time for its
>> verification scan?
> 
> The reason is that CommandCounterIncrement added in
> StorePartitionBound reveals the just added default partition to
> get_default_oid_from_partdesc too early.  The revealed partition
> has immature constraint and it overrites the right constraint
> generated just above.
> 
> ATExecAttachPartition checks for default partition oid twice but
> the second is just needless before the commit and harms after it.

Yes.  What happens as of the commit mentioned in $subject is that the
partition constraint that's set as tab->partition_constraint during the
first call to ValidatePartitionConstraints (which is the correct one) is
overwritten by a wrong one during the 2nd call, which wouldn't happen
before the commit.  In the wrongly occurring 2nd call, we'd end up setting
tab->partition_constraint to the negation of the clause expression that
would've been set by the first call (in this case).  Thus
tab->partition_constraint ends up returning true for all the values it
contains.

I noticed that there were no tests covering this case causing 4dba331cb3
to not notice this failure in the first place.  I updated your patch to
add a few tests.  Also, I revised the comment changed by your patch a bit.

Thanks,
Amit

Вложения

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Jeevan Ladhe
Дата:

Hi,

I noticed that there were no tests covering this case causing 4dba331cb3
to not notice this failure in the first place.  I updated your patch to
add a few tests.  Also, I revised the comment changed by your patch a bit.

1. A minor typo:

+-- check that violating rows are correctly reported when attching as the
s/attching/attaching


2. I think following part of the test is already covered:

+-- trying to add a partition for 2 should fail because the default
+-- partition contains a row that would violate its new constraint which
+-- prevents rows containing 2
+create table defpart_attach_test2 partition of defpart_attach_test for values in (2);
+ERROR:  updated partition constraint for default partition "defpart_attach_test_d" would be violated by some row
+drop table defpart_attach_test;

IIUC, the test in create_table covers the same scenario as of above:

-- check default partition overlap
INSERT INTO list_parted2 VALUES('X');
CREATE TABLE fail_part PARTITION OF list_parted2 FOR VALUES IN ('W', 'X', 'Y');
ERROR:  updated partition constraint for default partition "list_parted2_def" would be violated by some row

Regards,
Jeevan Ladhe

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Amit Langote
Дата:
Thanks Jeevan for reviewing.

On 2018/04/02 13:10, Jeevan Ladhe wrote:
> Hi,
> 
> I noticed that there were no tests covering this case causing 4dba331cb3
>> to not notice this failure in the first place.  I updated your patch to
>> add a few tests.  Also, I revised the comment changed by your patch a bit.
>>
> 
> 1. A minor typo:
> 
> +-- check that violating rows are correctly reported when attching as the
> s/attching/attaching

Oops, fixed.

> 2. I think following part of the test is already covered:
> 
> +-- trying to add a partition for 2 should fail because the default
> +-- partition contains a row that would violate its new constraint which
> +-- prevents rows containing 2
> +create table defpart_attach_test2 partition of defpart_attach_test for
> values in (2);
> +ERROR:  updated partition constraint for default partition
> "defpart_attach_test_d" would be violated by some row
> +drop table defpart_attach_test;
> 
> IIUC, the test in create_table covers the same scenario as of above:
> 
> -- check default partition overlap
> INSERT INTO list_parted2 VALUES('X');
> CREATE TABLE fail_part PARTITION OF list_parted2 FOR VALUES IN ('W', 'X',
> 'Y');
> ERROR:  updated partition constraint for default partition
> "list_parted2_def" would be violated by some row

Sorry, didn't realize that it was already covered in create_tabel.sql.
Removed this one.

Attached updated patch.  Adding this to the v11 open items list.

Thanks,
Amit

Вложения

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
>      /*
> -     * Check whether default partition has a row that would fit the partition
> -     * being attached.
> +     * Check if the default partition contains a row that would belong in the
> +     * partition being attached.
>       */
> -    defaultPartOid =
> -        get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
>      if (OidIsValid(defaultPartOid))

Oh my.  This code is terrible, and I think this patch is wrong.  More
later.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Why do we need AccessExclusiveLock on all children of a relation that we
want to scan to search for rows not satisfying the constraint?  I think
it should be enough to get ShareLock, which prevents INSERT, no?  I have
a feeling I'm missing something here, but I don't know what, and all
tests pass with that change.

Also: the change proposed to remove the get_default_oid_from_partdesc()
call actually fixes the bug, but to me it was not at all obvious why.

To figure out why, I first had to realize that
ValidatePartitionConstraints was lying to me, both in name and in
comments: it doesn't actually validate anything, it merely queues
entries so that alter table's phase 3 would do the validation.  I found
this extremely confusing, so I fixed the comments to match reality, and
later decided to rename the function also.

At that point I was able to understand what the problem was: when
attaching the default partition, we were setting the constraint to be
validated for that partition to the correctly computed partition
constraint; and later in the same function we would set the constraint
to "the immature constraint" because we were now seeing that the default
partition OID was not invalid.  So it was rather a bug in
ValidatePartitionConstraints, in that it was accepting to set the
expression to validate when another expression had already been set!  I
added an assert to protect against this.  And then changed the decision
of whether or not to run this block based on the attached partition
being the default one or not; because as the "if" test was, it was just
a recipe for confusion.  (Now, if you look carefully you realize that
the new test for bound->is_default I added is kinda redundant: it can
only be set if there was a default partition OID at start of the
function, and therefore we know we're not adding a default partition.
And for the case where we *are* adding the default partition, it is
still Invalid because we don't re-read its own OID.  But relying on that
seems brittle because it breaks if we happen to update the default
partition OID in the middle of that function, which is what we were
doing.  Hence the new is_default test.)

I looked at the implementation of ValidatePartitionConstraints and
didn't like it, so I rewrote it also.

This comment is mistaken, too:
-       /*
-        * Skip if the partition is itself a partitioned table.  We can only
-        * ever scan RELKIND_RELATION relations.
-        */
... because it ignores the possibility of a partition being a foreign
table.  The code seems to work, but I think there is no test to cover
the case that a foreign table containing data that doesn't satisfy the
constraint is attached, because when I set that case to return doing
nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
test failed.


Generally speaking, I didn't like ATExecAttachPartition; it's doing too
much that should have been done in ATPrepAttachPartition.  The only
reason that's not broken is because we don't allow ATTACH PARTITION to
appear together with other commands in alter table, which seems
disappointing ... for example, when attaching multiple tables and a
default partition exist, you have to scan the default one multiple
times, which is lame.

(Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
on the parent relation ... I wonder if this can be used to cause a
deadlock during InitResultRelationInfo.)

BTW, I think this is already broken for the case where the default
partition is partitioned and you attach a partition colliding with a row
that's being concurrently inserted in a partition of the default
partition, though I didn't bother to write a test for that.

Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
<20180402191112.wneiyj4v5upnfjst@alvherre.pgsql>
> Why do we need AccessExclusiveLock on all children of a relation that we
> want to scan to search for rows not satisfying the constraint?  I think
> it should be enough to get ShareLock, which prevents INSERT, no?  I have
> a feeling I'm missing something here, but I don't know what, and all
> tests pass with that change.

Mmm. I'm not sure if there's a lock-upgrade case but the
following sentense is left at the last of one of the modified
comments. ATREwriteTables is called with AEL after that (that has
finally no effect in this case).

|                                   But we cannot risk a deadlock by taking
| * a weaker lock now and the stronger one only when needed.

I don't actually find places where the children's lock can be
raised but this seems just following the lock parent first
principle.

By the way check_default_allows_bound() (CREATE TABLE case)
contains a similar usage of find_all_inheritors(default_rel,
AEL).

> Also: the change proposed to remove the get_default_oid_from_partdesc()
> call actually fixes the bug, but to me it was not at all obvious why.

CommandCounterIncrement updates the content of a relcache entry
via invalidation. It can be surprising for callers of a function
like StorePartitionBound.

CommandCounterIncrement
 <skip inval mechanism>
   RelationCacheInvalidateEntry
     RelationFlushRelation
       RelationClearRelation

> To figure out why, I first had to realize that
> ValidatePartitionConstraints was lying to me, both in name and in
> comments: it doesn't actually validate anything, it merely queues
> entries so that alter table's phase 3 would do the validation.  I found
> this extremely confusing, so I fixed the comments to match reality, and
> later decided to rename the function also.

It is reasonable. Removing exccessive extension of lower-level
partitions is also reasonable. The previous code extended
inheritances in different manner for levels at odd and even
depth.

> At that point I was able to understand what the problem was: when
> attaching the default partition, we were setting the constraint to be
> validated for that partition to the correctly computed partition
> constraint; and later in the same function we would set the constraint
> to "the immature constraint" because we were now seeing that the default
> partition OID was not invalid.  So it was rather a bug in
> ValidatePartitionConstraints, in that it was accepting to set the
> expression to validate when another expression had already been set!  I
> added an assert to protect against this.  And then changed the decision
> of whether or not to run this block based on the attached partition
> being the default one or not; because as the "if" test was, it was just
> a recipe for confusion.  (Now, if you look carefully you realize that
> the new test for bound->is_default I added is kinda redundant: it can
> only be set if there was a default partition OID at start of the
> function, and therefore we know we're not adding a default partition.
> And for the case where we *are* adding the default partition, it is
> still Invalid because we don't re-read its own OID.  But relying on that
> seems brittle because it breaks if we happen to update the default
> partition OID in the middle of that function, which is what we were
> doing.  Hence the new is_default test.)

Seems reasonable. But even if we assume so, rereading
defaultPartOid is still breaking the assumption that
defaultPartOid is that at the time of entering to this function
and the added condition just conceals the fact.

So I think it should be an assertion.

| if (OidIsValid(defaultPartOid))
| {
|      /*
|       *  The command cannot be adding default partition if the
|       *  defaultPartOid is valid.
|       */
|      Assert(!cmd->bound->is_default);


> I looked at the implementation of ValidatePartitionConstraints and
> didn't like it, so I rewrote it also.
> 
> This comment is mistaken, too:
> -       /*
> -        * Skip if the partition is itself a partitioned table.  We can only
> -        * ever scan RELKIND_RELATION relations.
> -        */
> ... because it ignores the possibility of a partition being a foreign
> table.  The code seems to work, but I think there is no test to cover
> the case that a foreign table containing data that doesn't satisfy the
> constraint is attached, because when I set that case to return doing
> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
> test failed.

Foreign tables are intentionally not verified on attaching (in my
understanding). ATRewriteTables ignores foreign tables so it
contradicts to what ATRewriteTables thinks. As my understanding
foreign tables are assumed to be unrestrictable by local
constraint after attaching so the users are responsible to the
consistency.

> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
> much that should have been done in ATPrepAttachPartition.  The only
> reason that's not broken is because we don't allow ATTACH PARTITION to
> appear together with other commands in alter table, which seems
> disappointing ... for example, when attaching multiple tables and a
> default partition exist, you have to scan the default one multiple
> times, which is lame.

Indeed, currently only partition commands are isolated from other
alter table subcommands (except all in tablespace cases). We can
improve that in the next step?

> (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
> on the parent relation ... I wonder if this can be used to cause a
> deadlock during InitResultRelationInfo.)

Mmm. That seems strange since the RealtionGetPartitionQual
doesn't return anything (specifically NIL) there, because we
don't allow a partition to be attached to partitioned tables. It
seems totally useless.

Addition to that the code tries to add the partition qual (which
is always NIL) destructively and assign to partConstraint..

> BTW, I think this is already broken for the case where the default
> partition is partitioned and you attach a partition colliding with a row
> that's being concurrently inserted in a partition of the default
> partition, though I didn't bother to write a test for that.

How is it broken? Every attaching partitions are checked for the
specified partition bound and every partitions of the default
partition are also checked against the new default part bound. We
already hold required locks on all the participants.


> Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.

It's reassuring. Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Amit Langote
Дата:
On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Why do we need AccessExclusiveLock on all children of a relation that we
>> want to scan to search for rows not satisfying the constraint?  I think
>> it should be enough to get ShareLock, which prevents INSERT, no?  I have
>> a feeling I'm missing something here, but I don't know what, and all
>> tests pass with that change.

Thinking on this a bit, I see no problem with locking the children with
just ShareLock.  It was just a paranoia that if we're going to lock the
table itself being attached with AEL, we must its children (if any) with
AEL, too.

> Mmm. I'm not sure if there's a lock-upgrade case but the
> following sentense is left at the last of one of the modified
> comments. ATREwriteTables is called with AEL after that (that has
> finally no effect in this case).
> 
> |                                   But we cannot risk a deadlock by taking
> | * a weaker lock now and the stronger one only when needed.
> 
> I don't actually find places where the children's lock can be
> raised but this seems just following the lock parent first
> principle.

No lock upgrade happen as of now.  The comment was added by the commit
972b6ec20bf [1], which removed the code that could cause such a deadlock.
The comment fragment is simply trying to explain why we don't postpone the
locking of children to a later time, say, to the point where we actually
know that they need to be scanned.  Previously the code next to the
comment used to lock the children using AccessShareLock, because at that
point we just needed to check if the table being attached is one of those
children and then later locked with AEL if it turned out that they need to
be scanned to check the partition constraint.

> By the way check_default_allows_bound() (CREATE TABLE case)
> contains a similar usage of find_all_inheritors(default_rel,
> AEL).

Good catch.  Its job is more or less same as
ValidatePartitionConstraints(), except all the children (if any) are
scanned right away instead of queuing it like in the AT case.

>> Also: the change proposed to remove the get_default_oid_from_partdesc()
>> call actually fixes the bug, but to me it was not at all obvious why.
> 
> CommandCounterIncrement updates the content of a relcache entry
> via invalidation. It can be surprising for callers of a function
> like StorePartitionBound.
> 
> CommandCounterIncrement
>  <skip inval mechanism>
>    RelationCacheInvalidateEntry
>      RelationFlushRelation
>        RelationClearRelation

Because of the CCI() after storing the default partition OID into the
system catalog, RelationClearRelation() would changes what
rel->rd_partdesc points to where 'rel' is the ATExecAttachPartition()'s
reference to the relcache entry of the parent that it passed to
StorePartitionBound.

So, whereas the rel->rd_partdesc wouldn't contain the default partition
before StorePartitionBound() was called, it would after.

>> To figure out why, I first had to realize that
>> ValidatePartitionConstraints was lying to me, both in name and in
>> comments: it doesn't actually validate anything, it merely queues
>> entries so that alter table's phase 3 would do the validation.  I found
>> this extremely confusing, so I fixed the comments to match reality, and
>> later decided to rename the function also.
> 
> It is reasonable. Removing exccessive extension of lower-level
> partitions is also reasonable. The previous code extended
> inheritances in different manner for levels at odd and even
> depth.

I like the new code including the naming, but I notice this changes the
order in which we do the locking now.  There are still sites in the code
where the locking order is breadth-first, that is, as determined by
find_all_inheritors(), as this function would too previously.

Also note that beside the breadth-first -> depth-first change, this also
changes the locking order of partitions for a given partitioned table.
The OIDs in partdesc->oids[] are canonically ordered (that is order of
their partition bounds), whereas find_inheritance_children() that's called
by find_all_inheritors() would lock them in the order in which the
individual OIDs were found in the system catalog.

Not sure if there is anything to be alarmed of here, but in all previous
discussions, this has been a thing to avoid.

>> At that point I was able to understand what the problem was: when
>> attaching the default partition, we were setting the constraint to be
>> validated for that partition to the correctly computed partition
>> constraint; and later in the same function we would set the constraint
>> to "the immature constraint" because we were now seeing that the default
>> partition OID was not invalid.  So it was rather a bug in
>> ValidatePartitionConstraints, in that it was accepting to set the
>> expression to validate when another expression had already been set!  I
>> added an assert to protect against this.  And then changed the decision
>> of whether or not to run this block based on the attached partition
>> being the default one or not; because as the "if" test was, it was just
>> a recipe for confusion.  (Now, if you look carefully you realize that
>> the new test for bound->is_default I added is kinda redundant: it can
>> only be set if there was a default partition OID at start of the
>> function, and therefore we know we're not adding a default partition.
>> And for the case where we *are* adding the default partition, it is
>> still Invalid because we don't re-read its own OID.  But relying on that
>> seems brittle because it breaks if we happen to update the default
>> partition OID in the middle of that function, which is what we were
>> doing.  Hence the new is_default test.)
> 
> Seems reasonable.

+1

> But even if we assume so, rereading
> defaultPartOid is still breaking the assumption that
> defaultPartOid is that at the time of entering to this function
> and the added condition just conceals the fact.

Afaics, defaultPartOid is only set at the beginning of
ATExecAttachPartition, so it seems fine.

> So I think it should be an assertion.
> 
> | if (OidIsValid(defaultPartOid))
> | {
> |      /*
> |       *  The command cannot be adding default partition if the
> |       *  defaultPartOid is valid.
> |       */
> |      Assert(!cmd->bound->is_default);

I guess that makes sense, because when trying to attach a default
partition to the table that already has one, check_new_partition_bound
that's called before this errors out before we could get here.

>> I looked at the implementation of ValidatePartitionConstraints and
>> didn't like it, so I rewrote it also.
>>
>> This comment is mistaken, too:
>> -       /*
>> -        * Skip if the partition is itself a partitioned table.  We can only
>> -        * ever scan RELKIND_RELATION relations.
>> -        */
>> ... because it ignores the possibility of a partition being a foreign
>> table.  The code seems to work, but I think there is no test to cover
>> the case that a foreign table containing data that doesn't satisfy the
>> constraint is attached, because when I set that case to return doing
>> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
>> test failed.
> 
> Foreign tables are intentionally not verified on attaching (in my
> understanding). ATRewriteTables ignores foreign tables so it
> contradicts to what ATRewriteTables thinks. As my understanding
> foreign tables are assumed to be unrestrictable by local
> constraint after attaching so the users are responsible to the
> consistency.

That and ATRewriteTable() in phase 3 cannot really cope with being handed
a foreign table as it can only work with RELKIND_RELATION tables.
Actually, the following in ATRewriteTables() also prevents passing foreign
tables:

static void
ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
{
    ListCell   *ltab;

    /* Go through each table that needs to be checked or rewritten */
    foreach(ltab, *wqueue)
    {
        AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);

        /*
         * Foreign tables have no storage, nor do partitioned tables and
         * indexes.
         */
        if (tab->relkind == RELKIND_FOREIGN_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_INDEX)
            continue;

>> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
>> much that should have been done in ATPrepAttachPartition.  The only
>> reason that's not broken is because we don't allow ATTACH PARTITION to
>> appear together with other commands in alter table, which seems
>> disappointing ... for example, when attaching multiple tables and a
>> default partition exist, you have to scan the default one multiple
>> times, which is lame.
> 
> Indeed, currently only partition commands are isolated from other
> alter table subcommands (except all in tablespace cases). We can
> improve that in the next step?

I think we can improve this.

>> (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
>> on the parent relation ... I wonder if this can be used to cause a
>> deadlock during InitResultRelationInfo.)

It does seem that there is a possibility of deadlock, because when
InitResultRelInfo(), that's initializing a partition, calls
RelationGetPartitionQual() to get its partition constraint, the partition
would already have been locked.  So this reverses the generally
established order of locking the parent first; another session which tries
to add a column to the parent, for example, will lock the parent and then
partitions.

I think we need for inserts directly into a partition to lock all of its
ancestors from the root to the direct parent (in that order) before
locking the partition itself, or maybe at least the parent if not all
ancestors.

> Mmm. That seems strange since the RealtionGetPartitionQual
> doesn't return anything (specifically NIL) there, because we
> don't allow a partition to be attached to partitioned tables. It
> seems totally useless.
> 
> Addition to that the code tries to add the partition qual (which
> is always NIL) destructively and assign to partConstraint..

It is not always NIL.  Imagine attaching a partition at a lower level.

create table foo (a int, b char) partition by list (a);
create table foo1 partition of foo for values in (1) partition by list (b);
create table foo1a (a, b) as values (2, 'b');

-- note that we're attaching to foo1, not foo
alter table foo1 attach partition foo1a for values in ('a');

If we didn't include foo1's (the parent) constraint (that is, a = 1), the
above command will wrongly succeed.  It must include a = 1 in the
constraint to be be checked when scanning foo1a.

Although, I noticed there is no test covering this.

>> BTW, I think this is already broken for the case where the default
>> partition is partitioned and you attach a partition colliding with a row
>> that's being concurrently inserted in a partition of the default
>> partition, though I didn't bother to write a test for that.
> 
> How is it broken? Every attaching partitions are checked for the
> specified partition bound and every partitions of the default
> partition are also checked against the new default part bound. We
> already hold required locks on all the participants.

Yes, concurrent insertions to either the default partition or any of its
partitions couldn't be occurring as we'd have locked them.

>> Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.
> 
> It's reassuring. Thanks.

Yes, thank you for taking the time out to clean things up.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=972b6ec20bf



Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Rushabh Lathia
Дата:

On Tue, Apr 3, 2018 at 2:52 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> Hello.
>
> At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Why do we need AccessExclusiveLock on all children of a relation that we
>> want to scan to search for rows not satisfying the constraint?  I think
>> it should be enough to get ShareLock, which prevents INSERT, no?  I have
>> a feeling I'm missing something here, but I don't know what, and all
>> tests pass with that change.

Thinking on this a bit, I see no problem with locking the children with
just ShareLock.  It was just a paranoia that if we're going to lock the
table itself being attached with AEL, we must its children (if any) with
AEL, too.

> Mmm. I'm not sure if there's a lock-upgrade case but the
> following sentense is left at the last of one of the modified
> comments. ATREwriteTables is called with AEL after that (that has
> finally no effect in this case).
>
> |                                   But we cannot risk a deadlock by taking
> | * a weaker lock now and the stronger one only when needed.
>
> I don't actually find places where the children's lock can be
> raised but this seems just following the lock parent first
> principle.

No lock upgrade happen as of now.  The comment was added by the commit
972b6ec20bf [1], which removed the code that could cause such a deadlock.
The comment fragment is simply trying to explain why we don't postpone the
locking of children to a later time, say, to the point where we actually
know that they need to be scanned.  Previously the code next to the
comment used to lock the children using AccessShareLock, because at that
point we just needed to check if the table being attached is one of those
children and then later locked with AEL if it turned out that they need to
be scanned to check the partition constraint.

> By the way check_default_allows_bound() (CREATE TABLE case)
> contains a similar usage of find_all_inheritors(default_rel,
> AEL).

Good catch.  Its job is more or less same as
ValidatePartitionConstraints(), except all the children (if any) are
scanned right away instead of queuing it like in the AT case.

>> Also: the change proposed to remove the get_default_oid_from_partdesc()
>> call actually fixes the bug, but to me it was not at all obvious why.
>
> CommandCounterIncrement updates the content of a relcache entry
> via invalidation. It can be surprising for callers of a function
> like StorePartitionBound.
>
> CommandCounterIncrement
>  <skip inval mechanism>
>    RelationCacheInvalidateEntry
>      RelationFlushRelation
>        RelationClearRelation

Because of the CCI() after storing the default partition OID into the
system catalog, RelationClearRelation() would changes what
rel->rd_partdesc points to where 'rel' is the ATExecAttachPartition()'s
reference to the relcache entry of the parent that it passed to
StorePartitionBound.

So, whereas the rel->rd_partdesc wouldn't contain the default partition
before StorePartitionBound() was called, it would after.

>> To figure out why, I first had to realize that
>> ValidatePartitionConstraints was lying to me, both in name and in
>> comments: it doesn't actually validate anything, it merely queues
>> entries so that alter table's phase 3 would do the validation.  I found
>> this extremely confusing, so I fixed the comments to match reality, and
>> later decided to rename the function also.
>
> It is reasonable. Removing exccessive extension of lower-level
> partitions is also reasonable. The previous code extended
> inheritances in different manner for levels at odd and even
> depth.

I like the new code including the naming, but I notice this changes the
order in which we do the locking now.  There are still sites in the code
where the locking order is breadth-first, that is, as determined by
find_all_inheritors(), as this function would too previously.

Also note that beside the breadth-first -> depth-first change, this also
changes the locking order of partitions for a given partitioned table.
The OIDs in partdesc->oids[] are canonically ordered (that is order of
their partition bounds), whereas find_inheritance_children() that's called
by find_all_inheritors() would lock them in the order in which the
individual OIDs were found in the system catalog.

Not sure if there is anything to be alarmed of here, but in all previous
discussions, this has been a thing to avoid.

>> At that point I was able to understand what the problem was: when
>> attaching the default partition, we were setting the constraint to be
>> validated for that partition to the correctly computed partition
>> constraint; and later in the same function we would set the constraint
>> to "the immature constraint" because we were now seeing that the default
>> partition OID was not invalid.  So it was rather a bug in
>> ValidatePartitionConstraints, in that it was accepting to set the
>> expression to validate when another expression had already been set!  I
>> added an assert to protect against this.  And then changed the decision
>> of whether or not to run this block based on the attached partition
>> being the default one or not; because as the "if" test was, it was just
>> a recipe for confusion.  (Now, if you look carefully you realize that
>> the new test for bound->is_default I added is kinda redundant: it can
>> only be set if there was a default partition OID at start of the
>> function, and therefore we know we're not adding a default partition.
>> And for the case where we *are* adding the default partition, it is
>> still Invalid because we don't re-read its own OID.  But relying on that
>> seems brittle because it breaks if we happen to update the default
>> partition OID in the middle of that function, which is what we were
>> doing.  Hence the new is_default test.)
>
> Seems reasonable.

+1

> But even if we assume so, rereading
> defaultPartOid is still breaking the assumption that
> defaultPartOid is that at the time of entering to this function
> and the added condition just conceals the fact.

Afaics, defaultPartOid is only set at the beginning of
ATExecAttachPartition, so it seems fine.

> So I think it should be an assertion.
>
> | if (OidIsValid(defaultPartOid))
> | {
> |      /*
> |       *  The command cannot be adding default partition if the
> |       *  defaultPartOid is valid.
> |       */
> |      Assert(!cmd->bound->is_default);

I guess that makes sense, because when trying to attach a default
partition to the table that already has one, check_new_partition_bound
that's called before this errors out before we could get here.

>> I looked at the implementation of ValidatePartitionConstraints and
>> didn't like it, so I rewrote it also.
>>
>> This comment is mistaken, too:
>> -       /*
>> -        * Skip if the partition is itself a partitioned table.  We can only
>> -        * ever scan RELKIND_RELATION relations.
>> -        */
>> ... because it ignores the possibility of a partition being a foreign
>> table.  The code seems to work, but I think there is no test to cover
>> the case that a foreign table containing data that doesn't satisfy the
>> constraint is attached, because when I set that case to return doing
>> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
>> test failed.
>
> Foreign tables are intentionally not verified on attaching (in my
> understanding). ATRewriteTables ignores foreign tables so it
> contradicts to what ATRewriteTables thinks. As my understanding
> foreign tables are assumed to be unrestrictable by local
> constraint after attaching so the users are responsible to the
> consistency.

That and ATRewriteTable() in phase 3 cannot really cope with being handed
a foreign table as it can only work with RELKIND_RELATION tables.
Actually, the following in ATRewriteTables() also prevents passing foreign
tables:

static void
ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
{
    ListCell   *ltab;

    /* Go through each table that needs to be checked or rewritten */
    foreach(ltab, *wqueue)
    {
        AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);

        /*
         * Foreign tables have no storage, nor do partitioned tables and
         * indexes.
         */
        if (tab->relkind == RELKIND_FOREIGN_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_INDEX)
            continue;

>> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
>> much that should have been done in ATPrepAttachPartition.  The only
>> reason that's not broken is because we don't allow ATTACH PARTITION to
>> appear together with other commands in alter table, which seems
>> disappointing ... for example, when attaching multiple tables and a
>> default partition exist, you have to scan the default one multiple
>> times, which is lame.
>
> Indeed, currently only partition commands are isolated from other
> alter table subcommands (except all in tablespace cases). We can
> improve that in the next step?

I think we can improve this.

>> (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
>> on the parent relation ... I wonder if this can be used to cause a
>> deadlock during InitResultRelationInfo.)

It does seem that there is a possibility of deadlock, because when
InitResultRelInfo(), that's initializing a partition, calls
RelationGetPartitionQual() to get its partition constraint, the partition
would already have been locked.  So this reverses the generally
established order of locking the parent first; another session which tries
to add a column to the parent, for example, will lock the parent and then
partitions.

I think we need for inserts directly into a partition to lock all of its
ancestors from the root to the direct parent (in that order) before
locking the partition itself, or maybe at least the parent if not all
ancestors.

> Mmm. That seems strange since the RealtionGetPartitionQual
> doesn't return anything (specifically NIL) there, because we
> don't allow a partition to be attached to partitioned tables. It
> seems totally useless.
>
> Addition to that the code tries to add the partition qual (which
> is always NIL) destructively and assign to partConstraint..

It is not always NIL.  Imagine attaching a partition at a lower level.

create table foo (a int, b char) partition by list (a);
create table foo1 partition of foo for values in (1) partition by list (b);
create table foo1a (a, b) as values (2, 'b');

-- note that we're attaching to foo1, not foo
alter table foo1 attach partition foo1a for values in ('a');

If we didn't include foo1's (the parent) constraint (that is, a = 1), the
above command will wrongly succeed.  It must include a = 1 in the
constraint to be be checked when scanning foo1a.

Although, I noticed there is no test covering this.

>> BTW, I think this is already broken for the case where the default
>> partition is partitioned and you attach a partition colliding with a row
>> that's being concurrently inserted in a partition of the default
>> partition, though I didn't bother to write a test for that.
>
> How is it broken? Every attaching partitions are checked for the
> specified partition bound and every partitions of the default
> partition are also checked against the new default part bound. We
> already hold required locks on all the participants.

Yes, concurrent insertions to either the default partition or any of its
partitions couldn't be occurring as we'd have locked them.

>> Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.
>
> It's reassuring. Thanks.

Yes, thank you for taking the time out to clean things up.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=972b6ec20bf




--
Rushabh Lathia

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Thanks for the discussion.  Per your suggestions, I changed the check
for default partition OID to an assert instead of the 'if' condition,
and removed the code that attempted vainly to verify the constraint when
attaching a foreign table as a partition.  And pushed.

I think we're done here, so marked the Open Item as fixed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Robert Haas
Дата:
On Mon, Apr 2, 2018 at 3:11 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Why do we need AccessExclusiveLock on all children of a relation that we
> want to scan to search for rows not satisfying the constraint?  I think
> it should be enough to get ShareLock, which prevents INSERT, no?  I have
> a feeling I'm missing something here, but I don't know what, and all
> tests pass with that change.

I don't think it was a good idea to change this without a lot more
discussion, as part of another commit that really was about something
else, and after feature freeze.

As Kyotaro Horiguchi also mentioned, this introduces a deadlock
hazard.  With current master:

Setup:

create table foo (a int, b text) partition by range (a);
create table foo1 partition of foo for values from (1) to (100);
create table food (a int, b text) partition by range (a);
create table food1 partition of food for values from (1) to (100);

Session 1:
begin;
BEGIN
rhaas=# select * from food1;
 a | b
---+---
(0 rows)

rhaas=# insert into food1 values (1, 'thunk');

Session 2:
rhaas=# alter table foo attach partition food default;

At which point session 1 deadlocks, because the lock has to be
upgraded to AccessExclusiveLock since we're changing the constraints.

Now you might think about relaxing the lock level for the later
acquisition, too, but I'm not sure that's safe.  The issue is that
scanning a relation for rows that don't match the new constraint isn't
by itself sufficient: you also have to be sure that nobody can add one
later.  If they don't have the relation open, they'll certainly
rebuild their relcache entry when they open it, so it'll be fine.  But
if they already have the relation open, I'm not sure we can be certain
it will get rebuilt if, later in the same transaction, they try to
insert data.  This whole area needs more research -- there may very
well be good opportunities to reduce lock levels in this area, but it
needs careful study and analysis.

Please revert the part of this commit that changed the lock level.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Robert Haas wrote:

> I don't think it was a good idea to change this without a lot more
> discussion, as part of another commit that really was about something
> else, and after feature freeze.

> Please revert the part of this commit that changed the lock level.

You're right, that was too hasty.  Will revert.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Alvaro Herrera
Дата:
Robert Haas wrote:

> Please revert the part of this commit that changed the lock level.

Done.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

От
Robert Haas
Дата:
On Thu, Apr 12, 2018 at 3:59 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> Please revert the part of this commit that changed the lock level.
>
> Done.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company