Discussion: problem with RETURNING and update row movement
Hi, While working on [1], I came across a bug. Reproduction steps: create table foo (a int, b int) partition by list (a); create table foo1 (c int, b int, a int); alter table foo1 drop c; alter table foo attach partition foo1 for values in (1); create table foo2 partition of foo for values in (2); create table foo3 partition of foo for values in (3); create or replace function trigfunc () returns trigger language plpgsql as $$ begin new.b := 2; return new; end; $$; create trigger trig before insert on foo2 for each row execute function trigfunc(); insert into foo values (1, 1), (2, 2), (3, 3); update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *; ERROR: attribute 5 of type record has wrong type DETAIL: Table has type record, but query expects integer. The error occurs when projecting the RETURNING list. The problem is that the projection being used when the error occurs belongs to result relation foo2 which is the destination partition of a row movement operation, but it's trying to access a column in the tuple produced by the plan belonging to foo1, the source partition of the row movement. foo2's RETURNING projection can only work correctly when it's being fed tuples from the plan belonging to foo2. Note that the targetlists of the plans belonging to different result relations can be different depending on the respective relation's tuple descriptors, so are not interchangeable. Also, each result relation's RETURNING list is made to refer to its own plan's output. Without row movement, there is only one result relation to consider, so there's no confusion regarding which RETURNING list to compute. With row movement however, while there is only one plan tuple, there are two result relations to consider each with its own RETURNING list. I think we should be computing the *source* relation's RETURNING list, because only that one of the two can consume the plan tuple correctly. Attached is a patch that fixes things to be that way. 
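The point above, that per-relation targetlists are not interchangeable, can be illustrated with a toy sketch (plain Python, not PostgreSQL code; all names are made up for illustration). After the dropped column, foo1 physically stores (dropped, b, a), while foo2 stores (a, b), so each relation's RETURNING projection reads different attribute numbers of its own plan tuple:

```python
# Toy sketch (plain Python, not PostgreSQL code): why one result
# relation's RETURNING projection cannot consume another's plan tuple.
# foo1 was created as (c, b, a) with c later dropped, so its physical
# attribute order differs from foo2's (a, b).

# Plan tuple for a row of foo1, in foo1's physical order: (dropped, b, a)
foo1_plan_tuple = (None, 10, 1)

# Each result relation's projection records, per output column, which
# 1-based attribute of *its own* plan tuple to read.
foo1_projection = {"a": 3, "b": 2}   # matches foo1's layout
foo2_projection = {"a": 1, "b": 2}   # matches foo2's layout

def project(proj, plan_tuple):
    return {col: plan_tuple[attno - 1] for col, attno in proj.items()}

# foo1's tuple through foo1's projection: correct values
assert project(foo1_projection, foo1_plan_tuple) == {"a": 1, "b": 10}

# foo1's tuple through foo2's projection: reads the wrong attributes,
# the analogue of "attribute 5 of type record has wrong type"
assert project(foo2_projection, foo1_plan_tuple) == {"a": None, "b": 10}
```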
By the way, the problem has existed since PG 11, when the UPDATE row movement feature went in. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH-2sq-3Zq-CtuWjfRSyrGPXJBf1nCKKvTHuGVyfQ1OYA%40mail.gmail.com
Attachments
Hi Amit-san, On Thu, Jun 11, 2020 at 6:10 PM Amit Langote <amitlangote09@gmail.com> wrote: > Reproduction steps: > > create table foo (a int, b int) partition by list (a); > create table foo1 (c int, b int, a int); > alter table foo1 drop c; > alter table foo attach partition foo1 for values in (1); > create table foo2 partition of foo for values in (2); > create table foo3 partition of foo for values in (3); > create or replace function trigfunc () returns trigger language > plpgsql as $$ begin new.b := 2; return new; end; $$; > create trigger trig before insert on foo2 for each row execute > function trigfunc(); > insert into foo values (1, 1), (2, 2), (3, 3); > update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *; > ERROR: attribute 5 of type record has wrong type > DETAIL: Table has type record, but query expects integer. Reproduced. Could you add the patch to the next commitfest so that it doesn't get lost? Best regards, Etsuro Fujita
On Sun, Jun 14, 2020 at 4:23 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > Hi Amit-san, > > On Thu, Jun 11, 2020 at 6:10 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Reproduction steps: > > > > create table foo (a int, b int) partition by list (a); > > create table foo1 (c int, b int, a int); > > alter table foo1 drop c; > > alter table foo attach partition foo1 for values in (1); > > create table foo2 partition of foo for values in (2); > > create table foo3 partition of foo for values in (3); > > create or replace function trigfunc () returns trigger language > > plpgsql as $$ begin new.b := 2; return new; end; $$; > > create trigger trig before insert on foo2 for each row execute > > function trigfunc(); > > insert into foo values (1, 1), (2, 2), (3, 3); > > update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *; > > ERROR: attribute 5 of type record has wrong type > > DETAIL: Table has type record, but query expects integer. > > Reproduced. Could you add the patch to the next commitfest so that it > doesn't get lost? Done. Thank you for taking a look. https://commitfest.postgresql.org/28/2597/ -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Amit-san Hello. I've tested your patch. The patch applies cleanly, and make check-world produced no failures. I didn't do a performance test because, in my opinion, this patch has no effect on performance. I reproduced the bug you described before, which your patch now prevents. After applying your patch, I got the correct output, without error, using the test case in the first mail of this thread. Just a small comment about your patch: I felt the test you added in update.sql could be made simpler or shorter. Excuse me if I'm saying something silly, but I suppose you can check that the bug is prevented without defining both the function and its trigger for this case. Neither of them is essentially connected with the row movement between the source partition and the destination partition, so couldn't they be replaced by a simpler expression? Best, Takamichi Osumi
Hi Takamichi-san, On Tue, Jul 14, 2020 at 8:26 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > > Amit san > > > Hello. I've tested your patch. Thanks for that. > Just small comment about your patch. > I felt the test you added in update.sql could be simpler or shorter in other form. > Excuse me if I say something silly. > It's because I supposed you can check the bug is prevented without definitions of both a function and its trigger for this case. Neither of them is essentially connected with the row movement between source partition and destination partition and can be replaced by simpler expression ? Well, it's true that the function and the trigger have nothing to do with the main bug, but it's often good to be sure that the bug-fix isn't breaking cases where they are present and have visible effect. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 11, 2020 at 2:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Hi, > > While working on [1], I came across a bug. > > Reproduction steps: > > create table foo (a int, b int) partition by list (a); > create table foo1 (c int, b int, a int); > alter table foo1 drop c; > alter table foo attach partition foo1 for values in (1); > create table foo2 partition of foo for values in (2); > create table foo3 partition of foo for values in (3); > create or replace function trigfunc () returns trigger language > plpgsql as $$ begin new.b := 2; return new; end; $$; > create trigger trig before insert on foo2 for each row execute > function trigfunc(); > insert into foo values (1, 1), (2, 2), (3, 3); > update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning *; > ERROR: attribute 5 of type record has wrong type > DETAIL: Table has type record, but query expects integer. > > The error occurs when projecting the RETURNING list. The problem is > that the projection being used when the error occurs belongs to result > relation foo2 which is the destination partition of a row movement > operation, but it's trying to access a column in the tuple produced by > the plan belonging to foo1, the source partition of the row movement. > foo2's RETURNING projection can only work correctly when it's being > fed tuples from the plan belonging to foo2. > IIUC, here the problem is related to the below part of code: ExecInsert(..) { /* Process RETURNING if present */ if (resultRelInfo->ri_projectReturning) result = ExecProcessReturning(resultRelInfo, slot, planSlot); .. } The reason is that planSlot is for foo1 and slot is for foo2, and when it tries to access the tuple during ExecProcessReturning(), it results in an error. Is my understanding correct? If so, then can't we somehow ensure that planSlot also belongs to foo2, instead of skipping RETURNING processing in Insert and then later doing more work in Update?
Like Takamichi-san, I also think we don't need the trigger/function here in the test case. If one reads the comment you have added in the test case, it is not evident why a trigger or function is required. If you really think it is important to cover the trigger case, then either have a separate test or at least add some comments on how the trigger helps here or what you want to test. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi Amit, Thanks for taking a look at this. On Mon, Jul 20, 2020 at 8:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > IIUC, here the problem is related to below part of code: > ExecInsert(..) > { > /* Process RETURNING if present */ > if (resultRelInfo->ri_projectReturning) > result = ExecProcessReturning(resultRelInfo, slot, planSlot); > .. > } > > The reason is that planSlot is for foo1 and slot is for foo2 and when > it tries to access tuple during ExecProcessReturning(), it results in > an error. Is my understanding correct? Yes. Specifically, the problem exists if there are any non-target relation attributes in RETURNING which are computed by referring to planSlot, the plan's output tuple, which may be shaped differently among result relations due to their tuple descriptors being different. > If so, then can't we ensure > someway that planSlot also belongs to foo2 instead of skipping return > processing in Insert and then later do more work to perform in Update. I did consider that option but failed to see a way to make it work. I am not sure if there is a way to make a copy of the plan's output tuple (planSlot) that is compatible with the destination partition. Simple conversion using execute_attr_map_slot() is okay when we know the source and the target slots contain relation tuples, but plan's output tuples will also have other junk attributes. Also, not all destination partitions have an associated plan and hence a slot to hold plan tuples. > Like Takamichi-san, I also think here we don't need trigger/function > in the test case. If one reads the comment you have added in the test > case, it is not evident why is trigger or function required. If you > really think it is important to cover the trigger case then either > have a separate test or at least add some comments on how trigger > helps here or what you want to test it. That's fair. I have updated the test comment. 
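As a rough illustration of the conversion problem described above, here is a toy sketch (plain Python, not the real execute_attr_map_slot() API; names are made up): an attribute map can convert a plain relation tuple between partition layouts, but it says nothing about the extra junk attributes a plan tuple carries.

```python
# Toy sketch in the spirit of execute_attr_map_slot() (plain Python,
# not the real API; names are made up). attr_map[i] gives the 1-based
# source attribute for target attribute i+1.

def convert_tuple(attr_map, src):
    return tuple(src[a - 1] for a in attr_map)

# Converting a plain relation tuple from foo1's layout (dropped, b, a)
# to foo2's layout (a, b) works: every target column has a source.
foo1_to_foo2 = [3, 2]
assert convert_tuple(foo1_to_foo2, (None, 10, 1)) == (1, 10)

# A *plan* tuple, however, also carries junk attributes (e.g. the row
# identity and columns from the FROM clause). The map knows nothing
# about them, so they are simply dropped by conversion, which is one
# reason a planSlot compatible with the destination partition cannot
# be produced this way.
foo1_plan_tuple = (None, 10, 1, "row-id-junk", "s.x-junk")
assert convert_tuple(foo1_to_foo2, foo1_plan_tuple) == (1, 10)  # junk lost
```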
To expand on that here, because now we'll be computing RETURNING using the source partition's projection and the tuple in the source partition's format, I wanted to make sure that any changes made by the destination partition's triggers are reflected in the output. PFA v2. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Attachments
On Wed, Jul 22, 2020 at 3:16 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Jul 20, 2020 at 8:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > IIUC, here the problem is related to below part of code: > > ExecInsert(..) > > { > > /* Process RETURNING if present */ > > if (resultRelInfo->ri_projectReturning) > > result = ExecProcessReturning(resultRelInfo, slot, planSlot); > > .. > > } > > > > The reason is that planSlot is for foo1 and slot is for foo2 and when > > it tries to access tuple during ExecProcessReturning(), it results in > > an error. Is my understanding correct? > > Yes. Specifically, the problem exists if there are any non-target > relation attributes in RETURNING which are computed by referring to > planSlot, the plan's output tuple, which may be shaped differently > among result relations due to their tuple descriptors being different. > > > If so, then can't we ensure > > someway that planSlot also belongs to foo2 instead of skipping return > > processing in Insert and then later do more work to perform in Update. > > I did consider that option but failed to see a way to make it work. > > I am not sure if there is a way to make a copy of the plan's output > tuple (planSlot) that is compatible with the destination partition. > Simple conversion using execute_attr_map_slot() is okay when we know > the source and the target slots contain relation tuples, but plan's > output tuples will also have other junk attributes. Also, not all > destination partitions have an associated plan and hence a slot to > hold plan tuples. Yeah, I think it might be possible to create a planSlot to pass to ExecInsert() so that we can process RETURNING within that function, but even if so, that would be cumbersome not only because partitions can have different rowtypes but also because they can have different junk columns; e.g., subplan foreign partitions may have different row ID columns as junk columns.
The proposed patch is simple, so I would vote for it. (Note: in case of a foreign partition, we call ExecForeignInsert() with the source partition’s planSlot in ExecInsert(), which is not correct, but that would be OK because it seems unlikely that the FDW would look at the planSlot for INSERT.) One thing I noticed is that the patch changes the existing behavior. Here is an example: create table range_parted (a text, b bigint) partition by range (a, b); create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10); create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10); create function trigfunc() returns trigger as $$ begin return null; end; $$ language plpgsql; create trigger trig before insert on part_b_1_b_10 for each row execute function trigfunc(); insert into range_parted values ('a', 1); In HEAD: postgres=# update range_parted r set a = 'b' from (values ('a', 1)) s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass, r.*; tableoid | a | b ----------+---+--- (0 rows) UPDATE 0 But with the patch: postgres=# update range_parted r set a = 'b' from (values ('a', 1)) s(x, y) where s.x = r.a and s.y = r.b returning tableoid::regclass, r.*; tableoid | a | b ---------------+---+--- part_a_1_a_10 | b | 1 (1 row) UPDATE 1 This produces RETURNING output even though the INSERT into the destination partition was skipped by the trigger. Another thing is that the patch assumes that the tuple slot passed to ExecInsert() would store the inserted tuple when that function is done, but that's not always true, because in case of a foreign partition, the FDW may return a slot other than the passed-in slot when called from ExecForeignInsert(), in which case the passed-in slot would not store the inserted tuple anymore. To fix these, I modified the patch so that we 1) add to ExecInsert() an output parameter slot to store the inserted tuple, and 2) compute RETURNING based on that parameter.
I also added a regression test case. Attached is an updated version of the patch. Sorry for the long delay. Best regards, Etsuro Fujita
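The two behaviors the revised patch accounts for, a BEFORE trigger suppressing the insert and an FDW handing back a different tuple, can be sketched like this (toy Python, not the actual ExecInsert() code; all names are illustrative):

```python
# Toy sketch (plain Python, not the actual ExecInsert() code; all names
# are illustrative) of the two cases the revised patch must handle when
# computing RETURNING after a row-movement INSERT.

def exec_insert(tuple_in, trigger=None, fdw_modify=None):
    """Return the actually inserted tuple, or None if it was skipped."""
    t = tuple_in
    if trigger is not None:
        t = trigger(t)
        if t is None:
            return None          # BEFORE ROW trigger suppressed the insert
    if fdw_modify is not None:
        t = fdw_modify(t)        # an FDW may hand back a different tuple
    return t

# Case 1: trigger returns NULL, so no row is inserted; the caller must
# not produce a RETURNING row for this tuple.
assert exec_insert((2, 1), trigger=lambda t: None) is None

# Case 2: the insert succeeds but the final tuple differs from the one
# passed in; RETURNING must be computed from the tuple reported back.
assert exec_insert((2, 1), fdw_modify=lambda t: (t[0], 99)) == (2, 99)
```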
Attachments
Yet another thing I noticed is that the patch incorrectly produces values for the tableoid columns specified in the RETURNING list, like this: +UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING tableoid::regclass, *; + tableoid | a | b | c | d | e | x | y +----------------+---+----+-----+---+---------------+---+---- + part_a_1_a_10 | c | 1 | 1 | 1 | in trigfunc() | a | 1 + part_a_10_a_20 | c | 10 | 200 | 1 | in trigfunc() | a | 10 + part_c_1_100 | c | 12 | 96 | 1 | in trigfunc() | b | 12 +(3 rows) The source partitions are shown as tableoid, but the destination partition (ie, part_c_1_c_20) should be shown. To fix this, I modified the patch further so that 1) we override tts_tableOid of the original slot with the OID of the destination partition before calling ExecProcessReturning() if needed, and 2) in ExecProcessReturning(), we only initialize ecxt_scantuple's tts_tableOid when needed, which would save cycles a bit for non-foreign-table-direct-modification cases. Attached is a new version of the patch. Best regards, Etsuro Fujita
Attachments
Fujita-san, Thanks for your time on this. On Sun, Aug 2, 2020 at 5:57 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > Yet another thing I noticed is that the patch incorrectly produces > values for the tableoid columns specified in the RETURNING list, like > this: Yeah, I noticed that too. > +UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), > ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING > tableoid::regclass, *; > + tableoid | a | b | c | d | e | x | y > +----------------+---+----+-----+---+---------------+---+---- > + part_a_1_a_10 | c | 1 | 1 | 1 | in trigfunc() | a | 1 > + part_a_10_a_20 | c | 10 | 200 | 1 | in trigfunc() | a | 10 > + part_c_1_100 | c | 12 | 96 | 1 | in trigfunc() | b | 12 > +(3 rows) > > The source partitions are shown as tableoid, but the destination > partition (ie, part_c_1_c_20) should be shown. To fix this, I > modified the patch further so that 1) we override tts_tableOid of the > original slot with the OID of the destination partition before calling > ExecProcessReturning() if needed, and 2) in ExecProcessReturning(), we > only initialize ecxt_scantuple's tts_tableOid when needed, which would > save cycles a bit for non-foreign-table-direct-modification cases. > > Attached is a new version of the patch. Thanks for the updated patch. I reviewed your changes in v3 too and they looked fine to me. However, I noticed that having system columns like ctid, xmin, etc. in the RETURNING list is now broken, and maybe irreparably, due to the approach we are taking in the patch.
Let me show an example: drop table foo; create table foo (a int, b int) partition by list (a); create table foo1 (c int, b int, a int); alter table foo1 drop c; alter table foo attach partition foo1 for values in (1); create table foo2 partition of foo for values in (2); create table foo3 partition of foo for values in (3); create or replace function trigfunc () returns trigger language plpgsql as $$ begin new.b := 2; return new; end; $$; create trigger trig before insert on foo2 for each row execute function trigfunc(); insert into foo values (1, 1), (2, 2), (3, 3); update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x returning tableoid::regclass, ctid, xmin, xmax, *; tableoid | ctid | xmin | xmax | a | b | x ----------+----------------+------+------------+---+---+--- foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1 foo2 | (0,3) | 782 | 0 | 2 | 2 | 2 foo2 | (0,4) | 782 | 0 | 2 | 2 | 3 (3 rows) During foo1's update, it appears that we are losing the system information in the physical tuple initialized during ExecInsert on foo2 during its conversion back to foo1's reltype using the new code. I haven't been able to figure out how to preserve the system information in HeapTuple contained in the destination slot across the conversion. Note we want to use the latter to project the RETURNING list. By the way, you'll need two adjustments to even get this example working, one of which is a reported problem that system columns in RETURNING list during an operation on partitioned table stopped working in PG 12 [1] for which I've proposed a workaround (attached). Another is that we forgot in our patch to "materialize" the virtual tuple after conversion, which means slot_getsysattr() can't find it to access system columns like xmin, etc. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/141051591267657%40mail.yandex.ru
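The loss described above can be illustrated in the abstract (toy Python, not PostgreSQL code; names are made up): conversion between partition layouts produces a fresh, values-only tuple, so the system information that the insert set on the destination tuple has nowhere to go.

```python
# Toy sketch (plain Python, not PostgreSQL code; names are made up):
# why system information is lost across tuple conversion. A stored heap
# tuple carries system columns alongside user data; converting it to
# the source partition's layout builds a fresh, values-only tuple.

# Tuple as inserted into foo2 (layout: a, b), with system info set by
# the insert
heap_tuple = {
    "sys": {"ctid": (0, 3), "xmin": 782, "xmax": 0},
    "values": (2, 2),
}

# Map from foo2's layout to foo1's physical layout (dropped, b, a);
# 0 marks foo1's dropped column, which has no source attribute.
foo2_to_foo1 = [0, 2, 1]

def convert(attr_map, tup):
    # Only user columns are mapped; the converted tuple starts with no
    # system information at all, which is what slot_getsysattr() would
    # then fail to find without extra work to carry it over.
    return {
        "sys": None,
        "values": tuple(tup["values"][a - 1] if a else None
                        for a in attr_map),
    }

converted = convert(foo2_to_foo1, heap_tuple)
assert converted["values"] == (None, 2, 2)   # user data survives
assert converted["sys"] is None              # ctid/xmin/xmax do not
```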
Attachments
On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote: > However, I noticed that having system columns like ctid, xmin, etc. in > the RETURNING list is now broken and maybe irrepairably due to the > approach we are taking in the patch. Let me show an example: > > drop table foo; > create table foo (a int, b int) partition by list (a); > create table foo1 (c int, b int, a int); > alter table foo1 drop c; > alter table foo attach partition foo1 for values in (1); > create table foo2 partition of foo for values in (2); > create table foo3 partition of foo for values in (3); > create or replace function trigfunc () returns trigger language > plpgsql as $$ begin new.b := 2; return new; end; $$; > create trigger trig before insert on foo2 for each row execute > function trigfunc(); > insert into foo values (1, 1), (2, 2), (3, 3); > update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x > returning tableoid::regclass, ctid, xmin, xmax, *; > tableoid | ctid | xmin | xmax | a | b | x > ----------+----------------+------+------------+---+---+--- > foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1 > foo2 | (0,3) | 782 | 0 | 2 | 2 | 2 > foo2 | (0,4) | 782 | 0 | 2 | 2 | 3 > (3 rows) > > During foo1's update, it appears that we are losing the system > information in the physical tuple initialized during ExecInsert on > foo2 during its conversion back to foo1's reltype using the new code. > I haven't been able to figure out how to preserve the system > information in HeapTuple contained in the destination slot across the > conversion. Note we want to use the latter to project the RETURNING > list. > > By the way, you'll need two adjustments to even get this example > working, one of which is a reported problem that system columns in > RETURNING list during an operation on partitioned table stopped > working in PG 12 [1] for which I've proposed a workaround (attached). 
> Another is that we forgot in our patch to "materialize" the virtual > tuple after conversion, which means slot_getsysattr() can't find it to > access system columns like xmin, etc. The only solution I could think of for this so far is this: + if (map) + { + orig_slot = execute_attr_map_slot(map, + res_slot, + orig_slot); + + /* + * A HACK to install system information into the just + * converted tuple so that RETURNING computes any + * system columns correctly. This would be the same + * information that would be present in the HeapTuple + * version of the tuple in res_slot. + */ + tuple = ExecFetchSlotHeapTuple(orig_slot, true, + &should_free); + tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK); + tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK); + tuple->t_data->t_infomask |= HEAP_XMAX_INVALID; + HeapTupleHeaderSetXmin(tuple->t_data, + GetCurrentTransactionId()); + HeapTupleHeaderSetCmin(tuple->t_data, + estate->es_output_cid); + HeapTupleHeaderSetXmax(tuple->t_data, 0); /* for cleanliness */ + } + /* + * Override tts_tableOid with the OID of the destination + * partition. + */ + orig_slot->tts_tableOid = RelationGetRelid(destRel->ri_RelationDesc); + /* Also the TID. */ + orig_slot->tts_tid = res_slot->tts_tid; ..but it might be too ugly :(. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Amit-san, On Mon, Aug 3, 2020 at 2:55 PM Amit Langote <amitlangote09@gmail.com> wrote: > However, I noticed that having system columns like ctid, xmin, etc. in > the RETURNING list is now broken and maybe irrepairably due to the > approach we are taking in the patch. Let me show an example: > > drop table foo; > create table foo (a int, b int) partition by list (a); > create table foo1 (c int, b int, a int); > alter table foo1 drop c; > alter table foo attach partition foo1 for values in (1); > create table foo2 partition of foo for values in (2); > create table foo3 partition of foo for values in (3); > create or replace function trigfunc () returns trigger language > plpgsql as $$ begin new.b := 2; return new; end; $$; > create trigger trig before insert on foo2 for each row execute > function trigfunc(); > insert into foo values (1, 1), (2, 2), (3, 3); > update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x > returning tableoid::regclass, ctid, xmin, xmax, *; > tableoid | ctid | xmin | xmax | a | b | x > ----------+----------------+------+------------+---+---+--- > foo2 | (4294967295,0) | 128 | 4294967295 | 2 | 2 | 1 > foo2 | (0,3) | 782 | 0 | 2 | 2 | 2 > foo2 | (0,4) | 782 | 0 | 2 | 2 | 3 > (3 rows) > > During foo1's update, it appears that we are losing the system > information in the physical tuple initialized during ExecInsert on > foo2 during its conversion back to foo1's reltype using the new code. > I haven't been able to figure out how to preserve the system > information in HeapTuple contained in the destination slot across the > conversion. Note we want to use the latter to project the RETURNING > list. Thanks for pointing that out! > By the way, you'll need two adjustments to even get this example > working, one of which is a reported problem that system columns in > RETURNING list during an operation on partitioned table stopped > working in PG 12 [1] for which I've proposed a workaround (attached). 
> Another is that we forgot in our patch to "materialize" the virtual > tuple after conversion, which means slot_getsysattr() can't find it to > access system columns like xmin, etc. Ok, I’ll look at those closely. Best regards, Etsuro Fujita
On Mon, Aug 3, 2020 at 4:39 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote: > > By the way, you'll need two adjustments to even get this example > > working, one of which is a reported problem that system columns in > > RETURNING list during an operation on partitioned table stopped > > working in PG 12 [1] for which I've proposed a workaround (attached). > > Another is that we forgot in our patch to "materialize" the virtual > > tuple after conversion, which means slot_getsysattr() can't find it to > > access system columns like xmin, etc. > > The only solution I could think of for this so far is this: > > + if (map) > + { > + orig_slot = execute_attr_map_slot(map, > + res_slot, > + orig_slot); > + > + /* > + * A HACK to install system information into the just > + * converted tuple so that RETURNING computes any > + * system columns correctly. This would be the same > + * information that would be present in the HeapTuple > + * version of the tuple in res_slot. > + */ > + tuple = ExecFetchSlotHeapTuple(orig_slot, true, > + &should_free); > + tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK); > + tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK); > + tuple->t_data->t_infomask |= HEAP_XMAX_INVALID; > + HeapTupleHeaderSetXmin(tuple->t_data, > + GetCurrentTransactionId()); > + HeapTupleHeaderSetCmin(tuple->t_data, > + estate->es_output_cid); > + HeapTupleHeaderSetXmax(tuple->t_data, 0); /* > for cleanliness */ > + } > + /* > + * Override tts_tableOid with the OID of the destination > + * partition. > + */ > + orig_slot->tts_tableOid = > RelationGetRelid(destRel->ri_RelationDesc); > + /* Also the TID. */ > + orig_slot->tts_tid = res_slot->tts_tid; > > ..but it might be too ugly :(. Yeah, I think that would be a bit ugly, and actually, is not correct in case of postgres_fdw foreign table, in which case Xmin and Cmin are also set to 0 [2]. 
I think we should probably first address the tableam issue that you pointed out, but I don't think I'm the right person to do so. Best regards, Etsuro Fujita [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da7d44b627ba839de32c9409aca659f60324de76
On Fri, Aug 7, 2020 at 10:45 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Mon, Aug 3, 2020 at 4:39 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Mon, Aug 3, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > By the way, you'll need two adjustments to even get this example > > > working, one of which is a reported problem that system columns in > > > RETURNING list during an operation on partitioned table stopped > > > working in PG 12 [1] for which I've proposed a workaround (attached). > > > Another is that we forgot in our patch to "materialize" the virtual > > > tuple after conversion, which means slot_getsysattr() can't find it to > > > access system columns like xmin, etc. > > > > The only solution I could think of for this so far is this: > > > > + if (map) > > + { > > + orig_slot = execute_attr_map_slot(map, > > + res_slot, > > + orig_slot); > > + > > + /* > > + * A HACK to install system information into the just > > + * converted tuple so that RETURNING computes any > > + * system columns correctly. This would be the same > > + * information that would be present in the HeapTuple > > + * version of the tuple in res_slot. > > + */ > > + tuple = ExecFetchSlotHeapTuple(orig_slot, true, > > + &should_free); > > + tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK); > > + tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK); > > + tuple->t_data->t_infomask |= HEAP_XMAX_INVALID; > > + HeapTupleHeaderSetXmin(tuple->t_data, > > + GetCurrentTransactionId()); > > + HeapTupleHeaderSetCmin(tuple->t_data, > > + estate->es_output_cid); > > + HeapTupleHeaderSetXmax(tuple->t_data, 0); /* > > for cleanliness */ > > + } > > + /* > > + * Override tts_tableOid with the OID of the destination > > + * partition. > > + */ > > + orig_slot->tts_tableOid = > > RelationGetRelid(destRel->ri_RelationDesc); > > + /* Also the TID. */ > > + orig_slot->tts_tid = res_slot->tts_tid; > > > > ..but it might be too ugly :(. 
> > Yeah, I think that would be a bit ugly, and actually, is not correct > in case of postgres_fdw foreign table, in which case Xmin and Cmin are > also set to 0 [2]. Yeah, need to think a bit harder. > I think we should probably first address the > tableam issue that you pointed out, but I don't think I'm the right > person to do so. Okay, I will try to revive that thread. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
I noticed that this bugfix has stalled, probably because the other bugfix has also stalled. It seems that cleanly returning system columns from table AM is not going to be a simple task -- whatever fix we get for that is likely not going to make it all the way to PG 12. So I suggest that we should fix this bug in 11-13 without depending on that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro, On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I noticed that this bugfix has stalled, probably because the other > bugfix has also stalled. > > It seems that cleanly returning system columns from table AM is not > going to be a simple task -- whatever fix we get for that is likely not > going to make it all the way to PG 12. So I suggest that we should fix > this bug in 11-13 without depending on that. Although I would be reversing course on what I said upthread, I tend to agree with that, because the core idea behind the fix for this issue does not seem likely to be invalidated by any conclusion regarding the other issue. That is, as far as the issue here is concerned, we can't avoid falling back to using the source partition's RETURNING projection whose scan tuple is provided using the source partition's tuple slot. However, given that we have to pass the *new* tuple to the projection, not the old one, we will need a "clean" way to transfer its (the new tuple's) system columns into the source partition's tuple slot. The sketch I gave upthread of what that could look like was not liked by Fujita-san much. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 14, 2020 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I noticed that this bugfix has stalled, probably because the other > > bugfix has also stalled. > > > > It seems that cleanly returning system columns from table AM is not > > going to be a simple task -- whatever fix we get for that is likely not > > going to make it all the way to PG 12. So I suggest that we should fix > > this bug in 11-13 without depending on that. > > Although I would be reversing course on what I said upthread, I tend > to agree with that, because the core idea behind the fix for this > issue does not seem likely to be invalidated by any conclusion > regarding the other issue. That is, as far as the issue here is > concerned, we can't avoid falling back to using the source partition's > RETURNING projection whose scan tuple is provided using the source > partition's tuple slot. I agree on that point. > However, given that we have to pass the *new* tuple to the projection, > not the old one, we will need a "clean" way to transfer its (the new > tuple's) system columns into the source partition's tuple slot. The > sketch I gave upthread of what that could look like was not liked by > Fujita-san much. IIUC, I think two issues are discussed in the thread [1]: (a) there is currently no way to define the set of meaningful system columns for a partitioned table that contains pluggable storages other than standard heap, and (b) even in the case where the set is defined as before, like partitioned tables that don’t contain any pluggable storages, system column values are not obtained from a tuple inserted into a partitioned table in cases as reported in [1], because virtual slots are assigned for partitioned tables [2][3]. (I think the latter is the original issue in the thread, though.) I think we could fix this update-tuple-routing-vs-RETURNING issue without the fix for (a). 
But to transfer system column values in a cleaner way, I think we need to fix (b) first so that we can always obtain/copy them from the new tuple moved to another partition by INSERT following DELETE. (I think we could address this issue for v11 independently of [1], of course.) Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/141051591267657%40mail.yandex.ru [2] https://www.postgresql.org/message-id/CA%2BHiwqHrsNa4e0MfpSzv7xOM94TvX%3DR0MskYxYWfy0kjL0DAdQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/20200811180231.fcssvhelqpnydnx7%40alap3.anarazel.de
On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Mon, Sep 14, 2020 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Sat, Sep 12, 2020 at 5:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > I noticed that this bugfix has stalled, probably because the other > > > bugfix has also stalled. > > > > > > It seems that cleanly returning system columns from table AM is not > > > going to be a simple task -- whatever fix we get for that is likely not > > > going to make it all the way to PG 12. So I suggest that we should fix > > > this bug in 11-13 without depending on that. > > > > Although I would be reversing course on what I said upthread, I tend > > to agree with that, because the core idea behind the fix for this > > issue does not seem likely to be invalidated by any conclusion > > regarding the other issue. That is, as far as the issue here is > > concerned, we can't avoid falling back to using the source partition's > > RETURNING projection whose scan tuple is provided using the source > > partition's tuple slot. > > I agree on that point. > > > However, given that we have to pass the *new* tuple to the projection, > > not the old one, we will need a "clean" way to transfer its (the new > > tuple's) system columns into the source partition's tuple slot. The > > sketch I gave upthread of what that could look like was not liked by > > Fujita-san much. > > IIUC, I think two issues are discussed in the thread [1]: (a) there is > currently no way to define the set of meaningful system columns for a > partitioned table that contains pluggable storages other than standard > heap, and (b) even in the case where the set is defined as before, > like partitioned tables that don’t contain any pluggable storages, > system column values are not obtained from a tuple inserted into a > partitioned table in cases as reported in [1], because virtual slots > are assigned for partitioned tables [2][3]. 
(I think the latter is > the original issue in the thread, though.) Right, (b) can be solved by using a leaf partition's tuple slot as proposed, although (a) needs a long-term fix. > I think we could fix this update-tuple-routing-vs-RETURNING issue > without the fix for (a). But to transfer system column values in a > cleaner way, I think we need to fix (b) first so that we can always > obtain/copy them from the new tuple moved to another partition by > INSERT following DELETE. Yes, I think you are right, although I am still not sure how to "copy" system columns from one partition's slot to another's, that is, without assuming what they are. > (I think we could address this issue for v11 independently of [1], of course.) Yeah. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > IIUC, I think two issues are discussed in the thread [1]: (a) there is > > currently no way to define the set of meaningful system columns for a > > partitioned table that contains pluggable storages other than standard > > heap, and (b) even in the case where the set is defined as before, > > like partitioned tables that don’t contain any pluggable storages, > > system column values are not obtained from a tuple inserted into a > > partitioned table in cases as reported in [1], because virtual slots > > are assigned for partitioned tables [2][3]. (I think the latter is > > the original issue in the thread, though.) > > Right, (b) can be solved by using a leaf partition's tuple slot as > proposed. You mean what is proposed in [3]? > > I think we could fix this update-tuple-routing-vs-RETURNING issue > > without the fix for (a). But to transfer system column values in a > > cleaner way, I think we need to fix (b) first so that we can always > > obtain/copy them from the new tuple moved to another partition by > > INSERT following DELETE. > > Yes, I think you are right. Although, I am still not sure how to > "copy" system columns from one partition's slot to another's, that is, > without assuming what they are. I just thought we assume that partitions support all the existing system attributes until we have the fix for (a), i.e., the slot assigned for a partition must have the getsysattr callback routine from which we can fetch each existing system attribute of an underlying tuple in the slot, regardless of whether that system attribute is used for the partition or not. > > (I think we could address this issue for v11 independently of [1], of course.) > > Yeah. I noticed that I modified your patch incorrectly. Sorry for that. Attached is an updated patch fixing that. 
I also added a bit of code to copy CTID from the new tuple moved to another partition, not just table OID, and did some code/comment adjustments, mainly to match other places. I also created a patch for PG11, which I am attaching as well. Best regards, Etsuro Fujita
Attachments
Fujita-san, On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > IIUC, I think two issues are discussed in the thread [1]: (a) there is > > > currently no way to define the set of meaningful system columns for a > > > partitioned table that contains pluggable storages other than standard > > > heap, and (b) even in the case where the set is defined as before, > > > like partitioned tables that don’t contain any pluggable storages, > > > system column values are not obtained from a tuple inserted into a > > > partitioned table in cases as reported in [1], because virtual slots > > > are assigned for partitioned tables [2][3]. (I think the latter is > > > the original issue in the thread, though.) > > > > Right, (b) can be solved by using a leaf partition's tuple slot as > > proposed. > > You mean what is proposed in [3]? Yes. Although, I am for assigning a dedicated slot to partitions *unconditionally*, whereas the PoC patch Andres shared makes it conditional on either needing tuple conversion between the root and the partition or having a RETURNING projection present in the query. > > > I think we could fix this update-tuple-routing-vs-RETURNING issue > > > without the fix for (a). But to transfer system column values in a > > > cleaner way, I think we need to fix (b) first so that we can always > > > obtain/copy them from the new tuple moved to another partition by > > > INSERT following DELETE. > > > > Yes, I think you are right. Although, I am still not sure how to > > "copy" system columns from one partition's slot to another's, that is, > > without assuming what they are. 
> > I just thought we assume that partitions support all the existing > system attributes until we have the fix for (a), i.e., the slot > assigned for a partition must have the getsysattr callback routine > from which we can fetch each existing system attribute of a underlying > tuple in the slot, regardless of whether that system attribute is used > for the partition or not. Hmm, to copy one slot's system attributes into another, we will also need a way to "set" the system attributes in the destination slot. But maybe I didn't fully understand what you said. > > > (I think we could address this issue for v11 independently of [1], off course.) > > > > Yeah. > > I noticed that I modified your patch incorrectly. Sorry for that. > Attached is an updated patch fixing that. Ah, no problem. Thanks for updating. > I also added a bit of code > to copy CTID from the new tuple moved to another partition, not just > table OID, and did some code/comment adjustments, mainly to match > other places. I also created a patch for PG11, which I am attaching > as well. In the patch for PG 11: + new_tuple->t_self = new_tuple->t_data->t_ctid = + old_tuple->t_self; ... Should we add a one-line comment above this block of code to transfer system attributes? Maybe: /* Also transfer the system attributes. */? BTW, do you think we should alter the test in PG 11 branch to test that system attributes are returned correctly? Once we settle the other issue, we can adjust the HEAD's test to do likewise? -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 23, 2020 at 10:12 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > > IIUC, I think two issues are discussed in the thread [1]: (a) there is > > > > currently no way to define the set of meaningful system columns for a > > > > partitioned table that contains pluggable storages other than standard > > > > heap, and (b) even in the case where the set is defined as before, > > > > like partitioned tables that don’t contain any pluggable storages, > > > > system column values are not obtained from a tuple inserted into a > > > > partitioned table in cases as reported in [1], because virtual slots > > > > are assigned for partitioned tables [2][3]. (I think the latter is > > > > the original issue in the thread, though.) > > > > I think we could fix this update-tuple-routing-vs-RETURNING issue > > > > without the fix for (a). But to transfer system column values in a > > > > cleaner way, I think we need to fix (b) first so that we can always > > > > obtain/copy them from the new tuple moved to another partition by > > > > INSERT following DELETE. > > > > > > Yes, I think you are right. Although, I am still not sure how to > > > "copy" system columns from one partition's slot to another's, that is, > > > without assuming what they are. > > > > I just thought we assume that partitions support all the existing > > system attributes until we have the fix for (a), i.e., the slot > > assigned for a partition must have the getsysattr callback routine > > from which we can fetch each existing system attribute of a underlying > > tuple in the slot, regardless of whether that system attribute is used > > for the partition or not. 
> > Hmm, to copy one slot's system attributes into another, we will also > need a way to "set" the system attributes in the destination slot. > But maybe I didn't fully understand what you said. Sorry, my thought was vague. To store xmin/xmax/cmin/cmax into a given slot, we need to extend the TupleTableSlot struct to contain these attributes as well? Or we need to introduce a new callback routine for that (say, setsysattr)? These would not be back-patchable, though. > > I also created a patch for PG11, which I am attaching > > as well. > > In the patch for PG 11: > > + new_tuple->t_self = new_tuple->t_data->t_ctid = > + old_tuple->t_self; > ... > > Should we add a one-line comment above this block of code to transfer > system attributes? Maybe: /* Also transfer the system attributes. */? Will add that comment. Thanks for reviewing! > BTW, do you think we should alter the test in PG 11 branch to test > that system attributes are returned correctly? Yeah, I think so. I didn’t come up with any good test cases for that, though. > Once we settle the > other issue, we can adjust the HEAD's test to do likewise? Yeah, but for the other issue, I started thinking that we should just forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD... Best regards, Etsuro Fujita
On Thu, Sep 24, 2020 at 04:25:24AM +0900, Etsuro Fujita wrote: > Sorry, my thought was vague. To store xmin/xmax/cmin/cmax into a > given slot, we need to extend the TupleTableSlot struct to contain > these attributes as well? Or we need to introduce a new callback > routine for that (say, setsysattr)? These would not be > back-patchable, though. Please note that the latest patch fails to apply, so this needs a rebase. -- Michael
Attachments
On Thu, Sep 24, 2020 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 24, 2020 at 04:25:24AM +0900, Etsuro Fujita wrote: > > Sorry, my thought was vague. To store xmin/xmax/cmin/cmax into a > > given slot, we need to extend the TupleTableSlot struct to contain > > these attributes as well? Or we need to introduce a new callback > > routine for that (say, setsysattr)? These would not be > > back-patchable, though. > > Please note that the latest patch fails to apply, so this needs a > rebase. I saw the CF-bot failure too yesterday, although it seems that it's because the bot is trying to apply the patch version meant for v11 branch onto HEAD branch. I just checked that the patches apply cleanly to their respective branches. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Wed, Sep 23, 2020 at 10:12 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > On Mon, Sep 14, 2020 at 10:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > > Although, I am still not sure how to > > > > "copy" system columns from one partition's slot to another's, that is, > > > > without assuming what they are. > > > > > > I just thought we assume that partitions support all the existing > > > system attributes until we have the fix for (a), i.e., the slot > > > assigned for a partition must have the getsysattr callback routine > > > from which we can fetch each existing system attribute of a underlying > > > tuple in the slot, regardless of whether that system attribute is used > > > for the partition or not. > > > > Hmm, to copy one slot's system attributes into another, we will also > > need a way to "set" the system attributes in the destination slot. > > But maybe I didn't fully understand what you said. > > Sorry, my thought was vague. To store xmin/xmax/cmin/cmax into a > given slot, we need to extend the TupleTableSlot struct to contain > these attributes as well? Or we need to introduce a new callback > routine for that (say, setsysattr)? These would not be > back-patchable, though. Yeah, I'd think so too. BTW, the discussion so far on the other thread is oblivious to the issue being discussed here, where we need to find a way to transfer system attributes between a pair of partitions that are possibly incompatible with each other in terms of what set of system attributes they support. Although, if we prevent accessing system attributes when performing the operation on partitioned tables, like what you seem to propose below, then we wouldn't really have that problem. 
> > BTW, do you think we should alter the test in PG 11 branch to test > > that system attributes are returned correctly? > > Yeah, I think so. I didn’t come up with any good test cases for that, though. > > > Once we settle the > > other issue, we can adjust the HEAD's test to do likewise? > > Yeah, but for the other issue, I started thinking that we should just > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD... When the command is being performed on a partitioned table you mean? That is, it'd be okay to reference them when the command is being performed directly on a leaf partition, although it's another thing whether the leaf partitions themselves have sensible values to provide for them. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 24, 2020 at 2:47 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > BTW, the discussion so far on the other thread is oblivious to the > issue being discussed here, where we need to find a way to transfer > system attributes between a pair of partitions that are possibly > incompatible with each other in terms of what set of system attributes > they support. Yeah, we should discuss the two issues together. > Although, if we prevent accessing system attributes > when performing the operation on partitioned tables, like what you > seem to propose below, then we wouldn't really have that problem. Yeah, I think so. > > Yeah, but for the other issue, I started thinking that we should just > > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD... > > When the command is being performed on a partitioned table you mean? Yes. One concern about that is triggers: IIUC, triggers on a partition, as things stand, may or may not be able to reference xmin/xmax/cmin/cmax, depending on whether a dedicated tuple slot is used for the partition. We should do something about this if we go in that direction? > That is, it'd be okay to reference them when the command is being > performed directly on a leaf partition, although it's another thing > whether the leaf partitions themselves have sensible values to provide > for them. I think so too. Best regards, Etsuro Fujita
On Thu, Sep 24, 2020 at 7:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > On Thu, Sep 24, 2020 at 2:47 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > > > Yeah, but for the other issue, I started thinking that we should just > > > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD... > > > > When the command is being performed on a partitioned table you mean? > > Yes. One concern about that is triggers: IIUC, triggers on a > partition as-is can or can not reference xmin/xmax/cmin/cmax depending > on whether a dedicated tuple slot for the partition is used or not. > We should do something about this if we go in that direction? Maybe I'm missing something, but assuming that we're talking about prohibiting system attribute access in the RETURNING clause, how does that affect what triggers can or cannot do? AFAICS, only AFTER row-level triggers may sensibly access system attributes and whether/how they can do so has not much to do with the slot that ExecInsert() gets the new tuple in. It seems that the AFTER trigger infrastructure remembers an affected tuple's ctid and fetches it just before calling trigger function by asking the result relation's (e.g., a partition's) access method. 
To illustrate, with HEAD: create table foo (a int, b int) partition by range (a); create table foo1 partition of foo for values from (1) to (2); create or replace function report_system_info () returns trigger language plpgsql as $$ begin raise notice 'ctid: %', new.ctid; raise notice 'xmin: %', new.xmin; raise notice 'xmax: %', new.xmax; raise notice 'cmin: %', new.cmin; raise notice 'cmax: %', new.cmax; raise notice 'tableoid: %', new.tableoid; return NULL; end; $$; create trigger foo_after_trig after insert on foo for each row execute function report_system_info(); begin; insert into foo values (1); NOTICE: ctid: (0,1) NOTICE: xmin: 532 NOTICE: xmax: 0 NOTICE: cmin: 0 NOTICE: cmax: 0 NOTICE: tableoid: 16387 insert into foo values (1); NOTICE: ctid: (0,2) NOTICE: xmin: 532 NOTICE: xmax: 0 NOTICE: cmin: 1 NOTICE: cmax: 1 NOTICE: tableoid: 16387 insert into foo values (1); NOTICE: ctid: (0,3) NOTICE: xmin: 532 NOTICE: xmax: 0 NOTICE: cmin: 2 NOTICE: cmax: 2 NOTICE: tableoid: 16387 -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 24, 2020 at 2:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Sep 24, 2020 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 24, 2020 at 04:25:24AM +0900, Etsuro Fujita wrote: > > > Sorry, my thought was vague. To store xmin/xmax/cmin/cmax into a > > > given slot, we need to extend the TupleTableSlot struct to contain > > > these attributes as well? Or we need to introduce a new callback > > > routine for that (say, setsysattr)? These would not be > > > back-patchable, though. > > > > Please note that the latest patch fails to apply, so this needs a > > rebase. > > I saw the CF-bot failure too yesterday, although it seems that it's > because the bot is trying to apply the patch version meant for v11 > branch onto HEAD branch. I just checked that the patches apply > cleanly to their respective branches. I checked that the last statement is still true, so I've switched the status back to Needs Review. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 30, 2020 at 12:34 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Sep 24, 2020 at 2:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > > I saw the CF-bot failure too yesterday, although it seems that it's > > because the bot is trying to apply the patch version meant for v11 > > branch onto HEAD branch. I just checked that the patches apply > > cleanly to their respective branches. > > I checked that the last statement is still true, so I've switched the > status back to Needs Review. And the patch for HEAD no longer applies, because of a recent refactoring commit that hit update row movement code. Rebased patch for HEAD attached. Patch for PG11 should apply as-is. Here is a summary of where we stand on this, because another issue related to using RETURNING with partitioned tables [1] kind of ties into this. The problem reported on this thread is solved by ExecUpdate() itself evaluating the RETURNING list using the source partition's ri_projectReturning, instead of ExecInsert() doing it using the destination partition's ri_projectReturning. It must work that way, because the tuple descriptor of 'planSlot', which provides the values of the columns mentioned in RETURNING of tables other than the target table, is based on the source partition. Note that the tuple inserted into the destination partition needs to be converted into the source partition's format if their tuple descriptors differ, so that RETURNING correctly returns the new values of the updated tuple. The point we are stuck on is whether this fix will create problems for the case when the RETURNING list contains system columns. Now that we will be projecting the tuple inserted into the destination using its copy in the source partition's format and we have no way of preserving the system information in the tuple across conversions, we need to find a way to make RETURNING project the correct system information. 
Fujita-san has suggested (or agreed with a suggestion made at [1]) that we should simply prevent system information from being projected with RETURNING when the command is being performed on a partitioned table, in which case this becomes a non-issue. If that is not what we decide to do on the other thread, then at least we will have to figure out over there how we will support RETURNING system columns when row movement occurs during an update on partitioned tables. The question I guess is whether that thread must conclude before the fix here can be committed. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/flat/141051591267657%40mail.yandex.ru
Attachments
On Sat, Mar 6, 2021 at 12:52 AM David Steele <david@pgmasters.net> wrote: > On 10/30/20 6:00 AM, Amit Langote wrote: > > The question > > I guess is whether that thread must conclude before the fix here can > > be committed. > > Indeed. But it seems like there is a candidate patch on [1] though that > thread has also stalled. I'll try to get some status on that thread but > the question remains if this patch will be stalled until [1] is resolved. Sorry for the delay in following up on this thread. I've posted new patches on the other. FWIW, I think we should go ahead and apply the patches for the bug reported here. Anyone who tries to project an updated tuple's system columns using RETURNING are likely to face problems one way or another, especially if they have partitioned tables containing partitions of varying table AMs, but at least they won't face the bug discussed here. I've attached patches for all affected branches. -- Amit Langote EDB: http://www.enterprisedb.com
Attachments
Amit Langote <amitlangote09@gmail.com> writes: > FWIW, I think we should go ahead and apply the patches for the bug > reported here. Anyone who tries to project an updated tuple's system > columns using RETURNING are likely to face problems one way or > another, especially if they have partitioned tables containing > partitions of varying table AMs, but at least they won't face the bug > discussed here. Agreed, we should get this fixed in time for the next minor releases. The issue no longer exists on HEAD, thanks to 86dc90056 having got rid of per-target-relation variance in the contents of planSlot. But we still need a fix for the back branches. So I looked over the v13 patch, and found a couple of things I didn't like: * I think what you did in ExecProcessReturning is buggy. It's not a great idea to have completely different processes for getting tableoid set in normal-relation vs foreign-relation cases, and in this case the foreign-relation case was simply wrong. Maybe the bug isn't reachable for lack of support of cross-partition motion with FDWs, but I'm not sure about that. We really need to decouple the RETURNING expressions (which will belong to the source relation) from the value injected for tableoid (which will belong to the destination). * I really disliked the API change that ExecInsert is responsible for computing RETURNING except when it isn't. That's confusing and there's no good reason for it, since it's not really any easier to deal with the case at the call site than inside ExecInsert. In the attached revision I made ExecInsert handle RETURNING calculations by asking the callers to pass in the ResultRelInfo that should be used for the purpose. We could alternatively have taken the responsibility for RETURNING out of ExecInsert altogether, making the callers call ExecProcessReturning. I think that might have netted out slightly simpler than this. 
But we're unlikely to apply such a change in HEAD, so it seemed better to keep the division of responsibilities the same as it is in other branches. Thoughts? (I've not looked at porting this to v12 or v11 yet.) regards, tom lane diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 9572ef0690..ff4a2d0e58 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -149,34 +149,40 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList) /* * ExecProcessReturning --- evaluate a RETURNING list * - * resultRelInfo: current result rel + * projectReturning: the projection to evaluate + * resultRelOid: result relation's OID * tupleSlot: slot holding tuple actually inserted/updated/deleted * planSlot: slot holding tuple returned by top subplan node * + * In cross-partition UPDATE cases, projectReturning and planSlot are as + * for the source partition, and tupleSlot must conform to that. But + * resultRelOid is for the destination partition. + * * Note: If tupleSlot is NULL, the FDW should have already provided econtext's * scan tuple. * * Returns a slot holding the result tuple */ static TupleTableSlot * -ExecProcessReturning(ResultRelInfo *resultRelInfo, +ExecProcessReturning(ProjectionInfo *projectReturning, + Oid resultRelOid, TupleTableSlot *tupleSlot, TupleTableSlot *planSlot) { - ProjectionInfo *projectReturning = resultRelInfo->ri_projectReturning; ExprContext *econtext = projectReturning->pi_exprContext; /* Make tuple and any needed join variables available to ExecProject */ if (tupleSlot) econtext->ecxt_scantuple = tupleSlot; + else + Assert(econtext->ecxt_scantuple); econtext->ecxt_outertuple = planSlot; /* - * RETURNING expressions might reference the tableoid column, so - * reinitialize tts_tableOid before evaluating them. + * RETURNING expressions might reference the tableoid column, so be sure + * we expose the desired OID, ie that of the real target relation. 
*/ - econtext->ecxt_scantuple->tts_tableOid = - RelationGetRelid(resultRelInfo->ri_RelationDesc); + econtext->ecxt_scantuple->tts_tableOid = resultRelOid; /* Compute the RETURNING expressions */ return ExecProject(projectReturning); @@ -368,6 +374,16 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot, CmdType cmdtype * For INSERT, we have to insert the tuple into the target relation * and insert appropriate tuples into the index relations. * + * slot contains the new tuple value to be stored. + * planSlot is the output of the ModifyTable's subplan; we use it + * to access "junk" columns that are not going to be stored. + * In a cross-partition UPDATE, srcSlot is the slot that held the + * updated tuple for the source relation; otherwise it's NULL. + * + * returningRelInfo is the resultRelInfo for the source relation of a + * cross-partition UPDATE; otherwise it's the current result relation. + * We use it to process RETURNING lists, for reasons explained below. + * * Returns RETURNING result if any, otherwise NULL. * ---------------------------------------------------------------- */ @@ -375,6 +391,8 @@ static TupleTableSlot * ExecInsert(ModifyTableState *mtstate, TupleTableSlot *slot, TupleTableSlot *planSlot, + TupleTableSlot *srcSlot, + ResultRelInfo *returningRelInfo, EState *estate, bool canSetTag) { @@ -677,8 +695,40 @@ ExecInsert(ModifyTableState *mtstate, ExecWithCheckOptions(WCO_VIEW_CHECK, resultRelInfo, slot, estate); /* Process RETURNING if present */ - if (resultRelInfo->ri_projectReturning) - result = ExecProcessReturning(resultRelInfo, slot, planSlot); + if (returningRelInfo->ri_projectReturning) + { + /* + * In a cross-partition UPDATE with RETURNING, we have to use the + * source partition's RETURNING list, because that matches the output + * of the planSlot, while the destination partition might have + * different resjunk columns. 
This means we have to map the
+	 * destination slot back to the source's format so we can apply that
+	 * RETURNING list.  This is expensive, but it should be an uncommon
+	 * corner case, so we won't spend much effort on making it fast.
+	 */
+	if (returningRelInfo != resultRelInfo)
+	{
+		Relation	srcRelationDesc = returningRelInfo->ri_RelationDesc;
+		AttrMap    *map;
+
+		map = build_attrmap_by_name_if_req(RelationGetDescr(resultRelationDesc),
+										   RelationGetDescr(srcRelationDesc));
+		if (map)
+		{
+			/* We assume we can use srcSlot to hold the converted tuple */
+			TupleTableSlot *origSlot = slot;
+
+			slot = execute_attr_map_slot(map, slot, srcSlot);
+			slot->tts_tid = origSlot->tts_tid;
+			slot->tts_tableOid = origSlot->tts_tableOid;
+			free_attrmap(map);
+		}
+	}
+
+	result = ExecProcessReturning(returningRelInfo->ri_projectReturning,
+								  RelationGetRelid(resultRelationDesc),
+								  slot, planSlot);
+	}
 
 	return result;
 }
 
@@ -1027,7 +1077,9 @@ ldelete:;
 		}
 	}
 
-	rslot = ExecProcessReturning(resultRelInfo, slot, planSlot);
+	rslot = ExecProcessReturning(resultRelInfo->ri_projectReturning,
+								 RelationGetRelid(resultRelationDesc),
+								 slot, planSlot);
 
 	/*
 	 * Before releasing the target tuple again, make sure rslot has a
@@ -1203,6 +1255,7 @@ lreplace:;
 	{
 		bool		tuple_deleted;
 		TupleTableSlot *ret_slot;
+		TupleTableSlot *orig_slot = slot;
 		TupleTableSlot *epqslot = NULL;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 		int			map_index;
@@ -1309,6 +1362,7 @@ lreplace:;
 											   mtstate->rootResultRelInfo, slot);
 			ret_slot = ExecInsert(mtstate, slot, planSlot,
+								  orig_slot,
 								  resultRelInfo, estate, canSetTag);
 
 			/* Revert ExecPrepareTupleRouting's node change. */
@@ -1505,7 +1559,9 @@ lreplace:;
 
 	/* Process RETURNING if present */
 	if (resultRelInfo->ri_projectReturning)
-		return ExecProcessReturning(resultRelInfo, slot, planSlot);
+		return ExecProcessReturning(resultRelInfo->ri_projectReturning,
+									RelationGetRelid(resultRelationDesc),
+									slot, planSlot);
 
 	return NULL;
 }
 
@@ -2154,7 +2210,9 @@ ExecModifyTable(PlanState *pstate)
 				 * ExecProcessReturning by IterateDirectModify, so no need to
 				 * provide it here.
 				 */
-				slot = ExecProcessReturning(resultRelInfo, NULL, planSlot);
+				slot = ExecProcessReturning(resultRelInfo->ri_projectReturning,
+											RelationGetRelid(resultRelInfo->ri_RelationDesc),
+											NULL, planSlot);
 
 				estate->es_result_relation_info = saved_resultRelInfo;
 				return slot;
@@ -2244,6 +2302,7 @@ ExecModifyTable(PlanState *pstate)
 					slot = ExecPrepareTupleRouting(node, estate, proute,
 												   resultRelInfo, slot);
 				slot = ExecInsert(node, slot, planSlot,
+								  NULL,
 								  estate->es_result_relation_info,
 								  estate, node->canSetTag);
 				/* Revert ExecPrepareTupleRouting's state change. */
 				if (proute)
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index bf939d79f6..807005c40a 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -445,6 +445,46 @@ UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (r
  part_c_1_100 | b | 17 |  95 | 19 | 
 (6 rows)
 
+-- The following tests computing RETURNING when the source and the destination
+-- partitions of an UPDATE row movement operation have different tuple
+-- descriptors, which has been shown to be problematic in the cases where the
+-- RETURNING targetlist contains non-target relation attributes that are
+-- computed by referring to the source partition plan's output tuple.  Also,
+-- a trigger on the destination relation may change the tuple, which must be
+-- reflected in the RETURNING output, so we test that too.
+CREATE TABLE part_c_1_c_20 (LIKE range_parted);
+ALTER TABLE part_c_1_c_20 DROP a, DROP b, ADD a text, ADD b bigint;
+ALTER TABLE range_parted ATTACH PARTITION part_c_1_c_20 FOR VALUES FROM ('c', 1) TO ('c', 20);
+CREATE FUNCTION trigfunc () RETURNS TRIGGER LANGUAGE plpgsql as $$ BEGIN NEW.e := 'in trigfunc()'; RETURN NEW; END; $$;
+CREATE TRIGGER trig BEFORE INSERT ON part_c_1_c_20 FOR EACH ROW EXECUTE FUNCTION trigfunc();
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING tableoid::regclass, *;
+   tableoid    | a | b  |  c  | d |       e       | x | y  
+---------------+---+----+-----+---+---------------+---+----
+ part_c_1_c_20 | c |  1 |   1 | 1 | in trigfunc() | a |  1
+ part_c_1_c_20 | c | 10 | 200 | 1 | in trigfunc() | a | 10
+ part_c_1_c_20 | c | 12 |  96 | 1 | in trigfunc() | b | 12
+(3 rows)
+
+DROP TRIGGER trig ON part_c_1_c_20;
+DROP FUNCTION trigfunc;
+:init_range_parted;
+CREATE FUNCTION trigfunc () RETURNS TRIGGER LANGUAGE plpgsql as $$ BEGIN RETURN NULL; END; $$;
+CREATE TRIGGER trig BEFORE INSERT ON part_c_1_c_20 FOR EACH ROW EXECUTE FUNCTION trigfunc();
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING tableoid::regclass, *;
+ tableoid | a | b | c | d | e | x | y 
+----------+---+---+---+---+---+---+---
+(0 rows)
+
+:show_data;
+   partname   | a | b  |  c  | d  | e 
+--------------+---+----+-----+----+---
+ part_c_1_100 | b | 13 |  97 |  2 | 
+ part_d_15_20 | b | 15 | 105 | 16 | 
+ part_d_15_20 | b | 17 | 105 | 19 | 
+(3 rows)
+
+DROP TABLE part_c_1_c_20;
+DROP FUNCTION trigfunc;
 -- Transition tables with update row movement
 :init_range_parted;
 CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 8c558a7bc7..dc7274a4bb 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -236,6 +236,31 @@ DROP VIEW upview;
 UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted), *;
 :show_data;
 
+-- The following tests computing RETURNING when the source and the destination
+-- partitions of an UPDATE row movement operation have different tuple
+-- descriptors, which has been shown to be problematic in the cases where the
+-- RETURNING targetlist contains non-target relation attributes that are
+-- computed by referring to the source partition plan's output tuple.  Also,
+-- a trigger on the destination relation may change the tuple, which must be
+-- reflected in the RETURNING output, so we test that too.
+CREATE TABLE part_c_1_c_20 (LIKE range_parted);
+ALTER TABLE part_c_1_c_20 DROP a, DROP b, ADD a text, ADD b bigint;
+ALTER TABLE range_parted ATTACH PARTITION part_c_1_c_20 FOR VALUES FROM ('c', 1) TO ('c', 20);
+CREATE FUNCTION trigfunc () RETURNS TRIGGER LANGUAGE plpgsql as $$ BEGIN NEW.e := 'in trigfunc()'; RETURN NEW; END; $$;
+CREATE TRIGGER trig BEFORE INSERT ON part_c_1_c_20 FOR EACH ROW EXECUTE FUNCTION trigfunc();
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING tableoid::regclass, *;
+
+DROP TRIGGER trig ON part_c_1_c_20;
+DROP FUNCTION trigfunc;
+
+:init_range_parted;
+CREATE FUNCTION trigfunc () RETURNS TRIGGER LANGUAGE plpgsql as $$ BEGIN RETURN NULL; END; $$;
+CREATE TRIGGER trig BEFORE INSERT ON part_c_1_c_20 FOR EACH ROW EXECUTE FUNCTION trigfunc();
+UPDATE range_parted r set a = 'c' FROM (VALUES ('a', 1), ('a', 10), ('b', 12)) s(x, y) WHERE s.x = r.a AND s.y = r.b RETURNING tableoid::regclass, *;
+:show_data;
+
+DROP TABLE part_c_1_c_20;
+DROP FUNCTION trigfunc;
 -- Transition tables with update row movement
 :init_range_parted;
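As an aside for readers following the patch: the conversion performed by build_attrmap_by_name_if_req() and execute_attr_map_slot() above can be illustrated outside PostgreSQL. The following Python sketch (an analogy with hypothetical names, not the server's code) shows why a tuple in foo1's physical column order, which differs from foo2's because of foo1's dropped column, must be remapped by attribute name before the other relation's projection can consume it:

```python
# Hypothetical, simplified analogue of PostgreSQL's attrmap machinery:
# build a per-destination-column index into the source tuple by matching
# column names, then use it to convert a tuple between row layouts.

def build_attrmap_if_req(dst_cols, src_cols):
    """Return indexes into the source layout for each destination column,
    or None if the layouts already match (no conversion required)."""
    if dst_cols == src_cols:
        return None
    return [src_cols.index(name) for name in dst_cols]

def convert_tuple(attrmap, src_tuple):
    """Reorder a source-layout tuple into the destination layout."""
    return tuple(src_tuple[i] for i in attrmap)

# foo1 was created as (c, b, a) and then had c dropped, leaving the
# physical order (b, a); foo2, created fresh, stores (a, b).
src_cols = ("b", "a")        # layout of source partition foo1
dst_cols = ("a", "b")        # layout of destination partition foo2

attrmap = build_attrmap_if_req(dst_cols, src_cols)
row_in_foo1 = (5, 1)         # b = 5, a = 1 in foo1's physical order
row_for_foo2 = convert_tuple(attrmap, row_in_foo1)
print(row_for_foo2)          # (1, 5): a = 1, b = 5 in foo2's order
```

Projecting foo2's RETURNING list over the unconverted `(5, 1)` tuple would read attribute positions that mean something else entirely, which is the shape of the error reported at the top of the thread.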
On Thu, Apr 22, 2021 at 9:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > FWIW, I think we should go ahead and apply the patches for the bug
> > reported here.  Anyone who tries to project an updated tuple's system
> > columns using RETURNING is likely to face problems one way or
> > another, especially if they have partitioned tables containing
> > partitions of varying table AMs, but at least they won't face the bug
> > discussed here.
>
> Agreed, we should get this fixed in time for the next minor releases.

Thanks for taking a look at this.

> The issue no longer exists on HEAD, thanks to 86dc90056 having got
> rid of per-target-relation variance in the contents of planSlot.
> But we still need a fix for the back branches.
>
> So I looked over the v13 patch, and found a couple of things
> I didn't like:
>
> * I think what you did in ExecProcessReturning is buggy.  It's
> not a great idea to have completely different processes for
> getting tableoid set in normal-relation vs foreign-relation
> cases, and in this case the foreign-relation case was simply
> wrong.  Maybe the bug isn't reachable for lack of support of
> cross-partition motion with FDWs, but I'm not sure about that.

The two cases I see are some callers passing a valid 'tupleSlot' to
ExecProcessReturning() and some (just one!) not.  The latter case
applies only to foreign relations that are directly modified, and
relies on the FDWs having set econtext->ecxt_scantuple to the correct
slot.  In the first case, the callers should already have made sure
that the correct tableoid is passed through 'tupleSlot', although
Fujita-san had noticed in [1] that my earlier patch failed to do that
in the cross-partition update case.  Along with fixing that problem in
my patch, he also proposed that we set the tableoid of the slot
assigned to ecxt_scantuple only in the second case, to save cycles.
I'm fine with leaving it the way it is, as your updated patch does,
but I don't really see a bug being introduced.  Actually, the code was
like what Fujita-san proposed we do before b8d71745eac changed it to
the current way:

@@ -166,20 +166,15 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	/* Make tuple and any needed join variables available to ExecProject */
 	if (tupleSlot)
 		econtext->ecxt_scantuple = tupleSlot;
-	else
-	{
-		HeapTuple	tuple;
-
-		/*
-		 * RETURNING expressions might reference the tableoid column, so
-		 * initialize t_tableOid before evaluating them.
-		 */
-		Assert(!TupIsNull(econtext->ecxt_scantuple));
-		tuple = ExecFetchSlotHeapTuple(econtext->ecxt_scantuple, true, NULL);
-		tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
-	}
 	econtext->ecxt_outertuple = planSlot;
 
+	/*
+	 * RETURNING expressions might reference the tableoid column, so
+	 * reinitialize tts_tableOid before evaluating them.
+	 */
+	econtext->ecxt_scantuple->tts_tableOid =
+		RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
 	/* Compute the RETURNING expressions */
 	return ExecProject(projectReturning);
 }

> We really need to decouple the RETURNING expressions (which
> will belong to the source relation) from the value injected
> for tableoid (which will belong to the destination).

Yeah, that makes sense.

> * I really disliked the API change that ExecInsert is responsible
> for computing RETURNING except when it isn't.  That's confusing
> and there's no good reason for it, since it's not really any
> easier to deal with the case at the call site than inside ExecInsert.
>
> In the attached revision I made ExecInsert handle RETURNING
> calculations by asking the callers to pass in the ResultRelInfo
> that should be used for the purpose.

That seems fine to me.

> We could alternatively
> have taken the responsibility for RETURNING out of ExecInsert
> altogether, making the callers call ExecProcessReturning.
> I think that might have netted out slightly simpler than this.
> But we're unlikely to apply such a change in HEAD, so it seemed
> better to keep the division of responsibilities the same as it
> is in other branches.

I agree.

> Thoughts?

The patch looks good to me, except one thing:

 	/*
+	 * In a cross-partition UPDATE with RETURNING, we have to use the
+	 * source partition's RETURNING list, because that matches the output
+	 * of the planSlot, while the destination partition might have
+	 * different resjunk columns.  This means we have to map the
+	 * destination slot back to the source's format so we can apply that
+	 * RETURNING list.  This is expensive, but it should be an uncommon
+	 * corner case, so we won't spend much effort on making it fast.
+	 */
+	if (returningRelInfo != resultRelInfo)
+	{

I think we should also add slot != srcSlot to this condition.  The
uncommon corner case is the source and/or the destination partition
having a different column order than the root parent, thus requiring
the tuple to be converted during tuple routing using slots appropriate
to each relation, which causes 'slot' to end up different from
'srcSlot'.  But in the common case, where tuple descriptors match
between all the tables involved, 'slot' should be the same as
'srcSlot'.

> (I've not looked at porting this to v12 or v11 yet.)

I did that; patches attached.  (I haven't changed them to incorporate
the above comment though.)

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAPmGK14QXD5Te_vwGgpuVWXRcrC%2Bd8FyWse0aHSqnDDSeeCRFQ%40mail.gmail.com
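The slot != srcSlot guard proposed above can be sketched abstractly. In this Python analogy (Slot, needs_conversion, and the descriptor tuples are hypothetical stand-ins, not PostgreSQL APIs), a cheap object-identity test short-circuits the descriptor comparison in the common case where tuple routing never produced a converted slot:

```python
# Illustrative sketch of the "slot != srcSlot" idea: test object identity
# first, and only fall back to comparing tuple descriptors when routing
# actually produced a new slot.  In the common case the two names refer to
# the very same slot, so the descriptor comparison is skipped entirely.

class Slot:
    """Toy stand-in for a TupleTableSlot: a descriptor plus values."""
    def __init__(self, desc, values):
        self.desc = desc        # tuple of column names
        self.values = values

def needs_conversion(slot, src_slot):
    if slot is src_slot:        # common case: routing didn't convert the tuple
        return False
    # uncommon case: a converted slot exists, so check the descriptors
    return slot.desc != src_slot.desc

desc = ("a", "b", "c")
src = Slot(desc, (1, 2, 3))
print(needs_conversion(src, src))                     # False: same slot
print(needs_conversion(Slot(("c", "a", "b"), (3, 1, 2)), src))  # True
```

The design choice mirrors the discussion: the identity check costs one pointer comparison, whereas comparing two identical tuple descriptors column by column is pure waste in the common case.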
Attachments
Amit Langote <amitlangote09@gmail.com> writes:
> I think we should also add slot != srcSlot to this condition.

Good idea, should save useless comparisons of identical tupdescs.
Done.

>> (I've not looked at porting this to v12 or v11 yet.)

> I did that; patches attached.  (I haven't changed them to incorporate
> the above comment though.)

Thanks for that, saved me some work.  I've pushed these.

			regards, tom lane
On Fri, Apr 23, 2021 at 12:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > I think we should also add slot != srcSlot to this condition.
>
> Good idea, should save useless comparisons of identical tupdescs.
> Done.
>
> >> (I've not looked at porting this to v12 or v11 yet.)
>
> > I did that; patches attached.  (I haven't changed them to incorporate
> > the above comment though.)
>
> Thanks for that, saved me some work.  I've pushed these.

Thanks for working on closing this and that other issue.

--
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, Apr 23, 2021 at 10:08 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Apr 23, 2021 at 12:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Amit Langote <amitlangote09@gmail.com> writes:
> > >> (I've not looked at porting this to v12 or v11 yet.)
> >
> > > I did that; patches attached.  (I haven't changed them to incorporate
> > > the above comment though.)
> >
> > Thanks for that, saved me some work.  I've pushed these.
>
> Thanks for working on closing this and that other issue.

Thanks to both of you for working on these!

Best regards,
Etsuro Fujita