Thread: executor relation handling
This is the continuation of the discussion at:

https://www.postgresql.org/message-id/7500.1531920772%40sss.pgh.pa.us

Actually, for more background on what I've written below, reading this email in the same discussion would help:

https://www.postgresql.org/message-id/4114.1531674142@sss.pgh.pa.us

Attached find a series of patches, the first of which tries to implement the main topic discussed in the above email, which is to eliminate execution-time locking of relations in PlannedStmt.rtable. One of the other patches removes the plan node fields that are obsoleted by eliminating execution-time locking of relations. Those fields served no purpose other than telling the executor which relations to lock, or more precisely which relations to lock before initializing the plan tree, so that we don't end up upgrading lock strength when the same relation is both a source relation and a target relation.

When working on that, I noticed that the planner fails to remove PlanRowMarks of relations that won't be scanned by a given plan, which results in the executor redundantly locking relations that the planner already deemed unnecessary to scan. The locking would be gone with one of the proposed patches, but having all those PlanRowMarks still imposes some overhead during executor initialization. For example, ExecInitLockRows or ExecInitModifyTable calling ExecFindRowMark would end up going over PlanRowMarks that won't ever be used, which grows especially worse with many partitions.

A short description of each patch follows:

0001-Don-t-lock-range-table-relations-in-the-executor.patch

This replaces all instances of heap_open(relid, <not-NoLock>) in the executor code with heap_open(relid, NoLock). To verify that an appropriate lock is already held on a relation by the time we get into the executor, this also installs an assert-build-only check that confirms that lmgr indeed holds the lock.

To remember which lock was taken when creating a given RTE_RELATION range table entry, this adds a lockmode field to RangeTblEntry. Finally, because we don't lock in the executor and hence there are no concerns about a lock-strength-upgrade hazard, InitPlan doesn't need to initialize ResultRelInfos and ExecRowMarks; that is instead done in the ExecInit* routines of the respective nodes that need those ResultRelInfos and ExecRowMarks. (This doesn't touch index relations though, as they don't have a range table entry.)

0002-Remove-useless-fields-from-planner-nodes.patch

This removes some fields from PlannedStmt whose only purpose currently is to help InitPlan do certain things that it no longer does, as mentioned above. Also removed are some fields in Append, MergeAppend, and ModifyTable whose only purpose currently is to propagate partitioned table RT indexes so that the executor could lock them. Removing them also means that the planner doesn't have to spend cycles initializing them.

0003-Prune-PlanRowMark-of-relations-that-are-pruned-from-.patch

This prevents PlanRowMarks corresponding to relations that won't be scanned by a given plan from appearing in the rowMarks field of LockRows or ModifyTable nodes. This removes significant overhead from the executor initialization of those nodes, especially for partitioned tables with many partitions.

0004-Revise-executor-range-table-relation-opening-closing.patch

This adds two arrays to EState indexed by RT index, one for RangeTblEntrys and another for Relation pointers. The former reduces the cost of fetching a RangeTblEntry by RT index. The latter is used by a newly introduced function, ExecRangeTableRelation(), which replaces heap_open for relations that are contained in the range table. If a given RTE's relation is opened multiple times, only the first call of ExecRangeTableRelation will go to the relcache.
Although improving the executor's performance is not the main goal of these patches, the fact that we're getting rid of redundant processing means there would be at least some speedup, especially with a large number of relations in the range table, such as with partitioned tables with many partitions.

Benchmark script used:

$ cat /tmp/select-lt.sql
select * from lt where b = 999;

'lt' above is a list-partitioned table with 1000 partitions, with one partition for each value in the range 1..1000.

$ for i in 1 2;
> do
> pgbench -n -Mprepared -T 60 -f /tmp/select-lt.sql
> done

master
tps = 768.172129 (excluding connections establishing)
tps = 764.180834 (excluding connections establishing)

patch 0001 (no locking in the executor)
tps = 775.060323 (excluding connections establishing)
tps = 778.772917 (excluding connections establishing)

patch 0002 (remove useless planner node fields)
tps = 782.165436 (excluding connections establishing)
tps = 759.417411 (excluding connections establishing)

patch 0003 (prune PlanRowMarks)
tps = 783.558539 (excluding connections establishing)
tps = 776.106055 (excluding connections establishing)

patch 0004 (executor range table Relation open)
tps = 778.924649 (excluding connections establishing)
tps = 769.093431 (excluding connections establishing)

The speedup is more pronounced with a benchmark that needs RowMarks, because one of the patches (0003) removes overhead around handling them.
$ cat /tmp/select-lt-for-share.sql
select * from lt where b = 999 for share;

master
tps = 94.095985 (excluding connections establishing)
tps = 93.955702 (excluding connections establishing)

patch 0001 (no locking in the executor)
tps = 199.030555 (excluding connections establishing)
tps = 197.630424 (excluding connections establishing)

patch 0002 (remove useless planner node fields)
tps = 194.384994 (excluding connections establishing)
tps = 195.821362 (excluding connections establishing)

patch 0003 (prune PlanRowMarks)
tps = 712.544029 (excluding connections establishing)
tps = 717.540052 (excluding connections establishing)

patch 0004 (executor range table Relation open)
tps = 725.189715 (excluding connections establishing)
tps = 727.344683 (excluding connections establishing)

Will add this to the next CF.

Thanks,
Amit
On 2018/08/16 17:22, Amit Langote wrote:
> 0004-Revise-executor-range-table-relation-opening-closing.patch
>
> This adds two arrays to EState indexed by RT indexes, one for
> RangeTblEntry's and another for Relation pointers. The former reduces the
> cost of fetching RangeTblEntry by RT index. The latter is used by a newly
> introduced function ExecRangeTableRelation(), which replaces heap_open for
> relations that are contained in the range table. If a given RTE's
> relation is opened multiple times, only the first call of
> ExecRangeTableRelation will go to relcache.

David Rowley recently and independently [1] proposed one of the ideas mentioned above (store RangeTblEntrys in an array in EState). As I mentioned in reply to his email, I think his implementation of the idea is better than mine, so I've merged his patch into the above patch, except for one part: instead of removing the es_range_table list altogether, I decided to keep it around so that ExecSerializePlan doesn't have to rebuild one from the array as his patch did.

Updated patches attached; 0001-0003 are the same as v1.

Thanks,
Amit

[1] Make executor's Range Table an array instead of a List
https://postgr.es/m/CAKJS1f9EypD_=xG6ACFdF=1cBjz+Z9hiHHSd-RqLjor+QyA-nw@mail.gmail.com
On 4 September 2018 at 20:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Updated patches attached; 0001-0003 are the same as v1.

I've looked at these. Here's my review so far:

0001:

1. The following does not seem to be true any longer:

+ /*
+ * If you change the conditions under which rel locks are acquired
+ * here, be sure to adjust ExecOpenScanRelation to match.
+ */

per:

@@ -652,28 +654,10 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
 {
 Relation rel;
 Oid reloid;
- LOCKMODE lockmode;

- /*
- * Determine the lock type we need. First, scan to see if target relation
- * is a result relation. If not, check if it's a FOR UPDATE/FOR SHARE
- * relation. In either of those cases, we got the lock already.
- */
- lockmode = AccessShareLock;
- if (ExecRelationIsTargetRelation(estate, scanrelid))
- lockmode = NoLock;
- else
- {
- /* Keep this check in sync with InitPlan! */
- ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true);
-
- if (erm != NULL && erm->relation != NULL)
- lockmode = NoLock;
- }
-

2. Should addRangeTableEntryForRelation() initialize lockmode, or maybe take it as a parameter?

3. Don't think there's a need to capitalise true and false in:

+ * Return TRUE if we acquired a new lock, FALSE if already held.

4. The comment probably should read "lock level to obtain, or 0 if no lock is required" in:

+ int lockmode; /* lock taken on the relation or 0 */

The field should likely also be LOCKMODE, not int.

5. AcquireExecutorLocks() does not need the local variable named rt_index.

0002:

6. I don't think "rootRelation" is a good name for this field. I think "root" is being confused with "target". Nothing says the target is the same as the root.

+ Index rootRelation; /* RT index of root partitioned table */

Perhaps "partitionedTarget" is a better name?

7. Should use "else" instead of "else if" in:

+ /* Top-level Plan must be LockRows or ModifyTable */
+ Assert(IsA(stmt->planTree, LockRows) ||
+ IsA(stmt->planTree, ModifyTable));
+ if (IsA(stmt->planTree, LockRows))
+ rowMarks = ((LockRows *) stmt->planTree)->rowMarks;
+ else if (IsA(stmt->planTree, ModifyTable))
+ rowMarks = ((ModifyTable *) stmt->planTree)->rowMarks;

or you'll likely get a compiler warning on non-Assert enabled builds.

0003:

8. The following code seems repeated enough to warrant a static function:

+ rowMarks = NIL;
+ foreach(lc, root->rowMarks)
+ {
+ PlanRowMark *rc = lfirst(lc);
+
+ if (root->simple_rel_array[rc->rti] != NULL &&
+ IS_DUMMY_REL(root->simple_rel_array[rc->rti]))
+ continue;
+
+ rowMarks = lappend(rowMarks, rc);
+ }

Also, why not reverse the condition and do the lappend inside the if? Save two lines.

9. The following code appears in copy.c, which is pretty much the same as the code in execMain.c:

estate->es_range_table_size = list_length(cstate->range_table);
estate->es_range_table_array = (RangeTblEntry **)
    palloc(sizeof(RangeTblEntry *) * estate->es_range_table_size);
/* Populate the range table array */
i = 0;
foreach(lc, cstate->range_table)
    estate->es_range_table_array[i++] = lfirst_node(RangeTblEntry, lc);

Would it not be better to invent a function with the signature:

void
setup_range_table_array(EState *estate, List *rangeTable)

and use it in both locations?

10. In create_estate_for_relation() I don't think you should remove the line that sets es_range_table.

@@ -199,7 +199,8 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
 rte->rtekind = RTE_RELATION;
 rte->relid = RelationGetRelid(rel->localrel);
 rte->relkind = rel->localrel->rd_rel->relkind;
- estate->es_range_table = list_make1(rte);

If you're keeping es_range_table then I think it needs to always be set properly to help prevent future bugs in that area.

11. ExecRangeTableRelation should look more like:

ExecRangeTableRelation(EState *estate, Index rti)
{
    Relation rel = estate->es_relations[rti - 1];

    if (rel != NULL)
        RelationIncrementReferenceCount(rel);
    else
    {
        RangeTblEntry *rte = exec_rt_fetch(rti, estate->es_range_table_array);

        /*
         * No need to lock the relation, because upstream code
         * must hold the lock already.
         */
        rel = estate->es_relations[rti - 1] = heap_open(rte->relid, NoLock);
    }

    return rel;
}

12. I think this should read: /* Fill in RTEs. es_relations will be populated later. */

+ /* Fill the RTEs, Relations array will be filled later. */

13. I also notice that you're still cleaning up Relations with heap_close or relation_close. Did you consider not doing that and just having a function that's run at the end of execution which closes all non-NULL es_relations? That way you'd not need to perform RelationIncrementReferenceCount inside ExecRangeTableRelation.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thank you for reviewing.

On 2018/09/10 13:36, David Rowley wrote:
> On 4 September 2018 at 20:53, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Updated patches attached; 0001-0003 are the same as v1.
>
> I've looked at these. Here's my review so far:
>
> 0001:
>
> 1. The following does not seem to be true any longer:
>
> + /*
> + * If you change the conditions under which rel locks are acquired
> + * here, be sure to adjust ExecOpenScanRelation to match.
> + */
>
> per:
>
> @@ -652,28 +654,10 @@ ExecOpenScanRelation(EState *estate, Index
> scanrelid, int eflags)
> {
> Relation rel;
> Oid reloid;
> - LOCKMODE lockmode;
>
> - /*
> - * Determine the lock type we need. First, scan to see if target relation
> - * is a result relation. If not, check if it's a FOR UPDATE/FOR SHARE
> - * relation. In either of those cases, we got the lock already.
> - */
> - lockmode = AccessShareLock;
> - if (ExecRelationIsTargetRelation(estate, scanrelid))
> - lockmode = NoLock;
> - else
> - {
> - /* Keep this check in sync with InitPlan! */
> - ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true);
> -
> - if (erm != NULL && erm->relation != NULL)
> - lockmode = NoLock;
> - }
> -

Yeah, removed that comment in ExecBuildRowMark.

> 2. Should addRangeTableEntryForRelation() initialize lockmode, or
> maybe take it as a parameter?

Hmm, that sounds like a good idea. Looking at the various places it's called from though, it's not clear in many instances which lock was taken on the relation that's passed to it, because the lock itself seems to be taken several stack frames removed from the call site. That said, at least for the cases that we care about, that is, the cases in which the RTE being built will be passed to the executor by way of being in a PlannedStmt, it's clear which lock is taken.

So, we can add the lockmode parameter to addRangeTableEntryForRelation as you suggest and pass the actual lockmode value only in the cases we care about, mentioning in the function's header comment that callers may pass NoLock arbitrarily if it's clear the RTE won't be passed to the executor. I've modified the patch that way. Thoughts?

> 3. Don't think there's a need to capitalise true and false in:
>
> + * Return TRUE if we acquired a new lock, FALSE if already held.

OK, fixed.

> 4. The comment probably should read "lock level to obtain, or 0 if no
> lock is required" in:
>
> + int lockmode; /* lock taken on the relation or 0 */

OK.

> The field should likely also be LOCKMODE, not int.

OK.

> 5. AcquireExecutorLocks() does not need the local variable named rt_index

Good catch, removed.

> 0002:
>
> 6. I don't think "rootRelation" is a good name for this field. I think
> "root" is being confused with "target". Nothing says the target
> is the same as the root.
>
> + Index rootRelation; /* RT index of root partitioned table */
>
> Perhaps "partitionedTarget" is a better name?

I realized that we don't need a new Index field here. nominalRelation serves the purpose that rootRelation is meant for, so it seems silly to have two fields with the same value. Instead, let's have a bool partitionedTarget which is set to true if the target (whose RT index is nominalRelation) is a partitioned table.

> 7. Should use "else" instead of "else if" in:
>
> + /* Top-level Plan must be LockRows or ModifyTable */
> + Assert(IsA(stmt->planTree, LockRows) ||
> + IsA(stmt->planTree, ModifyTable));
> + if (IsA(stmt->planTree, LockRows))
> + rowMarks = ((LockRows *) stmt->planTree)->rowMarks;
> + else if (IsA(stmt->planTree, ModifyTable))
> + rowMarks = ((ModifyTable *) stmt->planTree)->rowMarks;
>
> or you'll likely get a compiler warning on non-Assert enabled builds.

Yep, fixed.

> 0003:
>
> 8. The following code seems repeated enough to warrant a static function:
>
> + rowMarks = NIL;
> + foreach(lc, root->rowMarks)
> + {
> + PlanRowMark *rc = lfirst(lc);
> +
> + if (root->simple_rel_array[rc->rti] != NULL &&
> + IS_DUMMY_REL(root->simple_rel_array[rc->rti]))
> + continue;
> +
> + rowMarks = lappend(rowMarks, rc);
> + }
>
> Also, why not reverse the condition and do the lappend inside the if?
> Save two lines.

OK, made this change and added a static function called get_unpruned_rowmarks(PlannerInfo *root).

> 9. The following code appears in copy.c, which is pretty much the same
> as the code in execMain.c:
>
> estate->es_range_table_size = list_length(cstate->range_table);
> estate->es_range_table_array = (RangeTblEntry **)
> palloc(sizeof(RangeTblEntry *) *
> estate->es_range_table_size);
> /* Populate the range table array */
> i = 0;
> foreach(lc, cstate->range_table)
> estate->es_range_table_array[i++] = lfirst_node(RangeTblEntry, lc);
>
> Would it not be better to invent a function with the signature:
>
> void
> setup_range_table_array(EState *estate, List *rangeTable)
>
> and use it in both locations?

Agreed, but I named it ExecInitRangeTable.

> 10. In create_estate_for_relation() I don't think you should remove
> the line that sets es_range_table.
>
> @@ -199,7 +199,8 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
> rte->rtekind = RTE_RELATION;
> rte->relid = RelationGetRelid(rel->localrel);
> rte->relkind = rel->localrel->rd_rel->relkind;
> - estate->es_range_table = list_make1(rte);
>
> If you're keeping es_range_table then I think it needs to always be
> set properly to help prevent future bugs in that area.

My bad, fixed.

> 11. ExecRangeTableRelation should look more like:
>
> ExecRangeTableRelation(EState *estate, Index rti)
> {
>     Relation rel = estate->es_relations[rti - 1];
>
>     if (rel != NULL)
>         RelationIncrementReferenceCount(rel);
>     else
>     {
>         RangeTblEntry *rte = exec_rt_fetch(rti, estate->es_range_table_array);
>
>         /*
>          * No need to lock the relation, because upstream code
>          * must hold the lock already.
>          */
>         rel = estate->es_relations[rti - 1] = heap_open(rte->relid, NoLock);
>     }
>
>     return rel;
> }

Much better, done.

> 12. I think this should read: /* Fill in RTEs. es_relations will be
> populated later. */
>
> + /* Fill the RTEs, Relations array will be filled later. */

I've rewritten the comment.

> 13. I also notice that you're still cleaning up Relations with
> heap_close or relation_close. Did you consider not doing that and just
> having a function that's run at the end of execution which closes all
> non-NULL es_relations? That way you'd not need to perform
> RelationIncrementReferenceCount inside ExecRangeTableRelation.

Agreed, done. I'd be slightly hesitant to remove ExecCloseScanRelation, ExecDestroyPartitionPruneState et al if they weren't just wrappers around heap_close.

Please find attached revised patches.

Thanks,
Amit
Hi Amit,

On 9/12/18 1:23 AM, Amit Langote wrote:
> Please find attached revised patches.

After applying 0004 I'm getting a crash in 'eval-plan-qual' during check-world using

export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" && ./configure --enable-dtrace --with-openssl --with-gssapi --with-libxml --with-llvm --enable-debug --enable-depend --enable-tap-tests --enable-cassert

Confirmed by CFBot in [1].

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/427530296

Best regards,
Jesper
On Wed, Sep 12, 2018 at 9:23 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Hi Amit,
>
> On 9/12/18 1:23 AM, Amit Langote wrote:
>> Please find attached revised patches.
>
> After applying 0004 I'm getting a crash in 'eval-plan-qual' during
> check-world using
>
> export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" &&
> ./configure --enable-dtrace --with-openssl --with-gssapi --with-libxml
> --with-llvm --enable-debug --enable-depend --enable-tap-tests
> --enable-cassert
>
> Confirmed by CFBot in [1].
>
> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/427530296

Thanks Jesper. Will look into it first thing tomorrow morning.

Thanks,
Amit
On 2018/09/13 0:23, Amit Langote wrote:
> On Wed, Sep 12, 2018 at 9:23 PM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Hi Amit,
>>
>> On 9/12/18 1:23 AM, Amit Langote wrote:
>>> Please find attached revised patches.
>>
>> After applying 0004 I'm getting a crash in 'eval-plan-qual' during
>> check-world using
>>
>> export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" &&
>> ./configure --enable-dtrace --with-openssl --with-gssapi --with-libxml
>> --with-llvm --enable-debug --enable-depend --enable-tap-tests
>> --enable-cassert
>>
>> Confirmed by CFBot in [1].
>>
>> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/427530296
>
> Thanks Jesper. Will look into it first thing tomorrow morning.

Attached updated patches.

Besides the issue that caused the eval-plan-qual isolation test to crash, I also spotted and fixed an oversight in the 0002 patch which would lead to EState.es_output_cid being set to the wrong value, causing an unexpected error during tuple locking as a result.

Thanks,
Amit
Hi Amit,

On 9/13/18 12:58 AM, Amit Langote wrote:
> Attached updated patches.
>
> Beside the issue that caused eval-plan-qual isolation test to crash, I
> also spotted and fixed an oversight in the 0002 patch which would lead to
> EState.es_output_cid being set to wrong value and causing unexpected error
> during tuple locking as result of that.

Thanks for the update.

However, the subscription TAP test (src/test/subscription/t/001_rep_changes.pl) is still failing.

CFBot has the same log

https://travis-ci.org/postgresql-cfbot/postgresql/builds/427999969

as locally.

Best regards,
Jesper
Thanks again, Jesper.

On 2018/09/13 20:27, Jesper Pedersen wrote:
> Hi Amit,
>
> On 9/13/18 12:58 AM, Amit Langote wrote:
>> Attached updated patches.
>>
>> Beside the issue that caused eval-plan-qual isolation test to crash, I
>> also spotted and fixed an oversight in the 0002 patch which would lead to
>> EState.es_output_cid being set to wrong value and causing unexpected error
>> during tuple locking as result of that.
>
> Thanks for the update.
>
> However, the subscription TAP test
> (src/test/subscription/t/001_rep_changes.pl) is still failing.
>
> CFBot has the same log
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/427999969
>
> as locally.

My bad. I missed that the logical replication code depends on the affected executor code. Fixed patches attached.

Thanks,
Amit
I've just completed a review of the v5 patch set. I ended up just making the changes myself since Amit mentioned he was on leave for a few weeks.

Summary of changes:

1. Changed the way we verify the lock already exists in debug builds. I reverted some incorrect code added to LockRelationOid that seems to have gotten broken after being rebased on f868a8143a9. I've just added some functions that verify the lock is in the LockMethodLocalHash hash table.

2. Fixed some incorrect lock types being passed into addRangeTableEntryForRelation().

3. Added code in addRangeTableEntryForRelation() to verify we actually hold the lock that the parameter claims we do. (This found all the errors I fixed in #2.)

4. Updated various comments outdated by the patches.

5. Updated the executor README's mention that we close relations when calling the end node function. This is now handled at the end of execution.

6. Renamed nominalRelation to targetRelation. I think this fits better since we're overloading the variable.

7. Used LOCKMODE instead of int in some places.

8. Changed the warning about a relation not being locked to WARNING instead of NOTICE.

9. Renamed get_unpruned_rowmarks() to get_nondummy_rowmarks(). "Pruning" makes me think of partition pruning, but the function checks for dummy rels, which could be dummy for reasons other than partition pruning.

I've attached a diff showing the changes I made along with the full patches, which I tagged as v6.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/09/27 18:15, David Rowley wrote:
> I've just completed a review of the v5 patch set. I ended up just
> making the changes myself since Amit mentioned he was on leave for a
> few weeks.

Thanks David. I'm back today and will look at the updated patches tomorrow.

Regards,
Amit
Hi,

On 9/27/18 5:15 AM, David Rowley wrote:
> I've just completed a review of the v5 patch set. I ended up just
> making the changes myself since Amit mentioned he was on leave for a
> few weeks.
>
> Summary of changes:
>
> 1. Changed the way we verify the lock already exists with debug
> builds. I reverted some incorrect code added to LockRelationOid that
> seems to have gotten broken after being rebased on f868a8143a9. I've
> just added some functions that verify the lock is in the
> LockMethodLocalHash hashtable.
> 2. Fixed some incorrect lock types being passed into
> addRangeTableEntryForRelation()
> 3. Added code in addRangeTableEntryForRelation to verify we actually
> hold the lock that the parameter claims we do. (This found all the
> errors I fixed in #2)
> 4. Updated various comments outdated by the patches
> 5. Updated executor README's mention that we close relations when
> calling the end node function. This is now handled at the end of
> execution.
> 6. Renamed nominalRelation to targetRelation. I think this fits
> better since we're overloading the variable.
> 7. Use LOCKMODE instead of int in some places.
> 8. Changed warning about relation not locked to WARNING instead of NOTICE.
> 9. Renamed get_unpruned_rowmarks() to get_nondummy_rowmarks(). Pruning
> makes me think of partition pruning but the function checks for dummy
> rels. These could be dummy for reasons other than partition pruning.
>
> I've attached a diff showing the changes I made along with the full
> patches which I tagged as v6.

Thanks David and Amit -- this version passes check-world.

I'll also take a deeper look.

Best regards,
Jesper
On 2018/09/27 18:15, David Rowley wrote:
> I've just completed a review of the v5 patch set. I ended up just
> making the changes myself since Amit mentioned he was on leave for a
> few weeks.
>
> Summary of changes:
>
> 1. Changed the way we verify the lock already exists with debug
> builds. I reverted some incorrect code added to LockRelationOid that
> seems to have gotten broken after being rebased on f868a8143a9. I've
> just added some functions that verify the lock is in the
> LockMethodLocalHash hashtable.

Thanks. I guess I wasn't terribly happy with my job of rebasing on top of f868a8143a9, because that commit had made the result of LockAcquireExtended a bit ambiguous for me. I like the new CheckRelationLockedByUs() and LocalLockExists() functions that you added to the lock manager.

> 2. Fixed some incorrect lock types being passed into
> addRangeTableEntryForRelation()
>
> 3. Added code in addRangeTableEntryForRelation to verify we actually
> hold the lock that the parameter claims we do. (This found all the
> errors I fixed in #2)

I see that I'd falsely copy-pasted AccessShareLock in transformRuleStmt, which you've corrected to AccessExclusiveLock. I was not sure about the other cases where you've replaced NoLock with something else, because they won't be checked or asserted, but perhaps that's fine.

> 4. Updated various comments outdated by the patches
>
> 5. Updated executor README's mention that we close relations when
> calling the end node function. This is now handled at the end of
> execution.

Thank you. I see that you also removed some useless code that I had left lying around as a result of code movement, such as the following dead code:

@@ -2711,9 +2711,7 @@ ExecEndModifyTable(ModifyTableState *node)
 {
 int i;

- /*
- * close the result relation(s) if any, but hold locks until xact commit.
- */
+ /* Perform cleanup. */
 for (i = 0; i < node->mt_nplans; i++)
 {
 ResultRelInfo *resultRelInfo = node->resultRelInfo + i;
@@ -2728,7 +2726,6 @@ ExecEndModifyTable(ModifyTableState *node)
 /* Close indices and then the relation itself */
 ExecCloseIndices(resultRelInfo);
 heap_close(resultRelInfo->ri_RelationDesc, NoLock);
- resultRelInfo++;
 }

> 6. Renamed nominalRelation to targetRelation. I think this fits
> better since we're overloading the variable.

That makes sense to me.

> 7. Use LOCKMODE instead of int in some places.
>
> 8. Changed warning about relation not locked to WARNING instead of NOTICE.

Oops, I think I'd forgotten to do that myself.

> 9. Renamed get_unpruned_rowmarks() to get_nondummy_rowmarks(). Pruning
> makes me think of partition pruning but the function checks for dummy
> rels. These could be dummy for reasons other than partition pruning.

Makes sense too.

> I've attached a diff showing the changes I made along with the full
> patches which I tagged as v6.

Thanks a lot for working on that. I've made minor tweaks, which you'll find in the attached updated patches (a .diff file containing the changes from v6 to v7 is also attached).

Regards,
Amit
On 28 September 2018 at 20:00, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've made minor tweaks, which you'll find in
> the attached updated patches (a .diff file containing changes from v6 to
> v7 is also attached).

Thanks for looking over the changes.

I've looked at the v6 to v7 diff and it seems all good, apart from:

+ * The following asserts that the necessary lock on the relation

I think we maybe should switch the word "assert" for "verifies". The Assert is just checking we didn't get a NoLock, and I don't think you're using "assert" to mean the Assert() macro, so it likely should be changed to avoid confusion.

Apart from that, I see nothing wrong with the patches, so I think we should get someone else to look. I'm marking it as ready for committer.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/09/28 17:21, David Rowley wrote:
> On 28 September 2018 at 20:00, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've made minor tweaks, which you'll find in
>> the attached updated patches (a .diff file containing changes from v6 to
>> v7 is also attached).
>
> Thanks for looking over the changes.
>
> I've looked at the v6 to v7 diff and it seems all good, apart from:
>
> + * The following asserts that the necessary lock on the relation
>
> I think we maybe should switch the word "assert" for "verifies". The
> Assert is just checking we didn't get a NoLock and I don't think
> you're using "assert" meaning the Assert() macro, so likely should be
> changed to avoid confusion.

Okay, I've revised the text in the attached updated patch.

> Apart from that, I see nothing wrong with the patches, so I think we
> should get someone else to look. I'm marking it as ready for
> committer.

Thanks for your time reviewing the patches.

Regards,
Amit
On 28 September 2018 at 20:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/09/28 17:21, David Rowley wrote:
>> I think we maybe should switch the word "assert" for "verifies". The
>> Assert is just checking we didn't get a NoLock and I don't think
>> you're using "assert" meaning the Assert() macro, so likely should be
>> changed to avoid confusion.
>
> Okay, I've revised the text in the attached updated patch.

Meh, I just noticed that the WARNING text claims "InitPlan" is the function name. I think it's best to get rid of that. It's pretty much redundant anyway if you do: \set VERBOSITY verbose

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/09/28 17:48, David Rowley wrote:
> On 28 September 2018 at 20:28, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/09/28 17:21, David Rowley wrote:
>>> I think we maybe should switch the word "assert" for "verifies". The
>>> Assert is just checking we didn't get a NoLock and I don't think
>>> you're using "assert" meaning the Assert() macro, so likely should be
>>> changed to avoid confusion.
>>
>> Okay, I've revised the text in the attached updated patch.
>
> Meh, I just noticed that the WARNING text claims "InitPlan" is the
> function name. I think it's best to get rid of that. It's pretty much
> redundant anyway if you do: \set VERBOSITY verbose

Oops, good catch that one. Removed "InitPlan: " from the message in the attached.

Thanks,
Amit
Attachments
Hi,

On 9/28/18 4:58 AM, Amit Langote wrote:
>>> Okay, I've revised the text in the attached updated patch.
>>
>> Meh, I just noticed that the WARNING text claims "InitPlan" is the
>> function name. I think it's best to get rid of that. It's pretty much
>> redundant anyway if you do: \set VERBOSITY verbose
>
> Oops, good catch on that one. Removed "InitPlan: " from the message in
> the attached.

I have looked at the patch (v9) and have no further comments. I can
confirm a speedup in the SELECT FOR SHARE case.

Thanks for working on this!

Best regards,
Jesper
On 2018-Sep-28, Amit Langote wrote:
> On 2018/09/28 17:48, David Rowley wrote:
>> Meh, I just noticed that the WARNING text claims "InitPlan" is the
>> function name. I think it's best to get rid of that. It's pretty much
>> redundant anyway if you do: \set VERBOSITY verbose
>
> Oops, good catch on that one. Removed "InitPlan: " from the message in
> the attached.

Were there two cases of that? Because one still remains.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 September 2018 at 03:49, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Sep-28, Amit Langote wrote:
>> On 2018/09/28 17:48, David Rowley wrote:
>>> Meh, I just noticed that the WARNING text claims "InitPlan" is the
>>> function name. I think it's best to get rid of that. It's pretty much
>>> redundant anyway if you do: \set VERBOSITY verbose
>>
>> Oops, good catch on that one. Removed "InitPlan: " from the message in
>> the attached.
>
> Were there two cases of that? Because one still remains.

Yeah, 0001 added it and 0004 removed it again, replacing it with the
corrected version. I've attached v10, which fixes this and aligns the
WARNING text in ExecInitRangeTable() and addRangeTableEntryForRelation().

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've attached v10 which fixes this and aligns the WARNING text in
> ExecInitRangeTable() and addRangeTableEntryForRelation().

I started poking at this. I thought that it would be a good cross-check
to apply just the "front half" of 0001 (i.e., creation and population of
the RTE lockmode field), and then to insert checks in each of the "back
half" places (executor, plancache, etc.) that the lockmodes they are
computing today match what is in the RTE.

Those checks fell over many times in the regression tests. There seem to
be at least four distinct problems:

1. You set up transformRuleStmt to insert AccessExclusiveLock into the
"OLD" and "NEW" RTEs for a view. This is surely wrong; we do not want to
take exclusive lock on a view just to run a query using the view. It
should (usually, anyway) just be AccessShareLock. However, because
addRangeTableEntryForRelation insists that you hold the requested lock
type *now*, just changing the parameter to AccessShareLock doesn't work.
I hacked around this for the moment by passing NoLock to
addRangeTableEntryForRelation and then changing rte->lockmode after it
returns, but man, that's ugly.

It makes me wonder whether addRangeTableEntryForRelation should be
checking the lockmode at all. I'm inclined to think we should remove the
check for the current lockmode in addRangeTableEntryForRelation, and
instead just assert that the passed lockmode is one of AccessShareLock,
RowShareLock, or RowExclusiveLock, depending on whether the RTE is meant
to represent a plain source table, a FOR UPDATE/SHARE source table, or a
target table. I don't think it's that helpful to check that the caller
got exactly the same lock, especially given the number of places where
the patch already had to cheat by using NoLock.

2. The "forUpdatePushedDown" override in AcquireRewriteLocks isn't
getting modeled in the RTE lockmode fields.
In other words, if we have something like

    CREATE VIEW vv AS SELECT * FROM tab1;
    SELECT * FROM vv FOR UPDATE OF vv;

the checks fall over, because the tab1 RTE in the view's rangetable just
has AccessShareLock, but we need RowShareLock. I fixed this by having
AcquireRewriteLocks actually modify the RTE if it is promoting the lock
level. This is kinda grotty, but we were already assuming that
AcquireRewriteLocks could scribble on the query tree, so it seems safe
enough.

3. There remain some cases where the RTE says RowExclusiveLock but the
executor calculation indicates we only need AccessShareLock. AFAICT,
this happens only when we have a DO ALSO rule that results in an added
query that merely scans the target table. The RTE used for that purpose
is just the original one, so it still has a lockmode suitable for a
target table. We could probably hack the rewriter so that it changes the
RTE lockmode back down to AccessShareLock in these cases. However, I'm
inclined to think that that is something *not* to do, and that we should
let the higher lockmode be used instead, for two reasons:

(1) Asking for both AccessShareLock and RowExclusiveLock on the same
table requires an extra trip through the shared lock manager, for little
value that I can see.

(2) If the DO ALSO rule is run before the main one, we'd be acquiring
AccessShareLock before RowExclusiveLock, resulting in a deadlock hazard
due to lock upgrade. (I think this may be a pre-existing bug, although
it could likely only manifest in corner cases such as where we're
pulling a plan tree out of plancache. In most cases the first thing we'd
acquire on a rule target table is RowExclusiveLock in the parser, before
any rule rewriting could happen.)

4. I also notice some cases (all in FDW tests) where
ExecOpenScanRelation is choosing AccessShareLock although the RTE has
RowShareLock.
These seem to all be cases where we're implementing FOR UPDATE/SHARE via
ROW_MARK_COPY, so that this:

    ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true);

    if (erm != NULL && erm->relation != NULL)
        lockmode = NoLock;

leaves lockmode as AccessShareLock because erm->relation is NULL. Again,
I think this is probably OK, and it'd be better to use RowShareLock for
consistency with other places that might open the rel. It will be a
change in externally-visible behavior, though.

Thoughts?

regards, tom lane
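The decision table Tom describes — result relations need RowExclusiveLock, row-locking marks need RowShareLock, while ROW_MARK_REFERENCE and ROW_MARK_COPY currently fall back to AccessShareLock — can be sketched as a standalone function. This is an illustrative model only, not executor source; the enum and function names here are hypothetical, and the enum ordering merely mirrors increasing lock strength.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock levels, ordered by increasing strength. */
typedef enum
{
    LOCK_ACCESS_SHARE,
    LOCK_ROW_SHARE,
    LOCK_ROW_EXCLUSIVE
} LockLevel;

/* Hypothetical row-mark kinds, loosely mirroring PlanRowMark types. */
typedef enum
{
    MARK_NONE,        /* no FOR UPDATE/SHARE mark on this relation */
    MARK_ROW_LOCKING, /* FOR [NO KEY] UPDATE / FOR [KEY] SHARE */
    MARK_REFERENCE,   /* re-fetch row by TID; no row-level lock taken */
    MARK_COPY         /* whole-row copy, e.g. for foreign tables */
} MarkKind;

/* Current behavior as described in points 3 and 4: REFERENCE/COPY marks
 * open the relation with only AccessShareLock, even when the RTE recorded
 * RowShareLock at parse time. */
static LockLevel scan_lock_level(bool is_result_rel, MarkKind mark)
{
    if (is_result_rel)
        return LOCK_ROW_EXCLUSIVE;
    if (mark == MARK_ROW_LOCKING)
        return LOCK_ROW_SHARE;
    return LOCK_ACCESS_SHARE;
}
```

The assertion added to ExecOpenScanRelation in the follow-up message encodes only the weaker invariant that this computed level never exceeds the lock recorded in the RTE, which tolerates both the point-3 and point-4 discrepancies.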
I wrote:
> I started poking at this. I thought that it would be a good cross-check
> to apply just the "front half" of 0001 (i.e., creation and population of
> the RTE lockmode field), and then to insert checks in each of the
> "back half" places (executor, plancache, etc) that the lockmodes they
> are computing today match what is in the RTE.
> Those checks fell over many times in the regression tests.

Here's a version that doesn't fall over (for me), incorporating fixes as
I suggested for points 1 and 2, and weakening the back-half assertions
enough to let points 3 and 4 succeed. I'm inclined to commit this to see
if the buildfarm can find any problems I missed.

Some cosmetic changes:

* I renamed the RTE field to rellockmode in the name of greppability.

* I rearranged the order of the arguments for
addRangeTableEntryForRelation to make it more consistent with the other
addRangeTableEntryXXX functions.

regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index 4f1d365..7dfa327 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1415,6 +1415,7 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender, rte.rtekind = RTE_RELATION; rte.relid = relId; rte.relkind = RELKIND_RELATION; /* no need for exactness here */ + rte.rellockmode = AccessShareLock; context.rtables = list_make1(list_make1(&rte)); diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 9176f62..4ad5868 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -2549,6 +2549,7 @@ AddRelationNewConstraints(Relation rel, pstate->p_sourcetext = queryString; rte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, NULL, false, true); diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index d06662b..2d5bc8a 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -833,20 +833,21 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt, if
(stmt->relation) { + LOCKMODE lockmode = is_from ? RowExclusiveLock : AccessShareLock; + RangeTblEntry *rte; TupleDesc tupDesc; List *attnums; ListCell *cur; - RangeTblEntry *rte; Assert(!stmt->query); /* Open and lock the relation, using the appropriate lock type. */ - rel = heap_openrv(stmt->relation, - (is_from ? RowExclusiveLock : AccessShareLock)); + rel = heap_openrv(stmt->relation, lockmode); relid = RelationGetRelid(rel); - rte = addRangeTableEntryForRelation(pstate, rel, NULL, false, false); + rte = addRangeTableEntryForRelation(pstate, rel, lockmode, + NULL, false, false); rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT); tupDesc = RelationGetDescr(rel); diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index 3d82edb..d5cb62d 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -528,6 +528,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) rte->rtekind = RTE_RELATION; rte->relid = intoRelationAddr.objectId; rte->relkind = relkind; + rte->rellockmode = RowExclusiveLock; rte->requiredPerms = ACL_INSERT; for (attnum = 1; attnum <= intoRelationDesc->rd_att->natts; attnum++) diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c index cee0ef9..2fd17b2 100644 --- a/src/backend/commands/policy.c +++ b/src/backend/commands/policy.c @@ -567,7 +567,9 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id) qual_expr = stringToNode(qual_value); /* Add this rel to the parsestate's rangetable, for dependencies */ - addRangeTableEntryForRelation(qual_pstate, rel, NULL, false, false); + addRangeTableEntryForRelation(qual_pstate, rel, + AccessShareLock, + NULL, false, false); qual_parse_rtable = qual_pstate->p_rtable; free_parsestate(qual_pstate); @@ -589,8 +591,9 @@ RemoveRoleFromObjectPolicy(Oid roleid, Oid classid, Oid policy_id) with_check_qual = stringToNode(with_check_value); /* Add this rel to the parsestate's rangetable, for dependencies 
*/ - addRangeTableEntryForRelation(with_check_pstate, rel, NULL, false, - false); + addRangeTableEntryForRelation(with_check_pstate, rel, + AccessShareLock, + NULL, false, false); with_check_parse_rtable = with_check_pstate->p_rtable; free_parsestate(with_check_pstate); @@ -752,11 +755,13 @@ CreatePolicy(CreatePolicyStmt *stmt) /* Add for the regular security quals */ rte = addRangeTableEntryForRelation(qual_pstate, target_table, + AccessShareLock, NULL, false, false); addRTEtoQuery(qual_pstate, rte, false, true, true); /* Add for the with-check quals */ rte = addRangeTableEntryForRelation(with_check_pstate, target_table, + AccessShareLock, NULL, false, false); addRTEtoQuery(with_check_pstate, rte, false, true, true); @@ -928,6 +933,7 @@ AlterPolicy(AlterPolicyStmt *stmt) ParseState *qual_pstate = make_parsestate(NULL); rte = addRangeTableEntryForRelation(qual_pstate, target_table, + AccessShareLock, NULL, false, false); addRTEtoQuery(qual_pstate, rte, false, true, true); @@ -950,6 +956,7 @@ AlterPolicy(AlterPolicyStmt *stmt) ParseState *with_check_pstate = make_parsestate(NULL); rte = addRangeTableEntryForRelation(with_check_pstate, target_table, + AccessShareLock, NULL, false, false); addRTEtoQuery(with_check_pstate, rte, false, true, true); @@ -1096,8 +1103,9 @@ AlterPolicy(AlterPolicyStmt *stmt) qual = stringToNode(qual_value); /* Add this rel to the parsestate's rangetable, for dependencies */ - addRangeTableEntryForRelation(qual_pstate, target_table, NULL, - false, false); + addRangeTableEntryForRelation(qual_pstate, target_table, + AccessShareLock, + NULL, false, false); qual_parse_rtable = qual_pstate->p_rtable; free_parsestate(qual_pstate); @@ -1137,8 +1145,9 @@ AlterPolicy(AlterPolicyStmt *stmt) with_check_qual = stringToNode(with_check_value); /* Add this rel to the parsestate's rangetable, for dependencies */ - addRangeTableEntryForRelation(with_check_pstate, target_table, NULL, - false, false); + addRangeTableEntryForRelation(with_check_pstate, 
target_table, + AccessShareLock, + NULL, false, false); with_check_parse_rtable = with_check_pstate->p_rtable; free_parsestate(with_check_pstate); diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 29d8353..9e60bb3 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13632,7 +13632,8 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy) * rangetable entry. We need a ParseState for transformExpr. */ pstate = make_parsestate(NULL); - rte = addRangeTableEntryForRelation(pstate, rel, NULL, false, true); + rte = addRangeTableEntryForRelation(pstate, rel, AccessShareLock, + NULL, false, true); addRTEtoQuery(pstate, rte, true, true, true); /* take care of any partition expressions */ diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 36778b6..b2de1cc 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -577,10 +577,12 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString, * 'OLD' must always have varno equal to 1 and 'NEW' equal to 2. */ rte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, makeAlias("old", NIL), false, false); addRTEtoQuery(pstate, rte, false, true, true); rte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, makeAlias("new", NIL), false, false); addRTEtoQuery(pstate, rte, false, true, true); diff --git a/src/backend/commands/view.c b/src/backend/commands/view.c index ffb71c0..e73f1d8 100644 --- a/src/backend/commands/view.c +++ b/src/backend/commands/view.c @@ -353,7 +353,7 @@ DefineViewRules(Oid viewOid, Query *viewParse, bool replace) * by 2... * * These extra RT entries are not actually used in the query, - * except for run-time permission checking. + * except for run-time locking and permission checking. 
*--------------------------------------------------------------- */ static Query * @@ -386,9 +386,11 @@ UpdateRangeTableOfViewParse(Oid viewOid, Query *viewParse) * OLD first, then NEW.... */ rt_entry1 = addRangeTableEntryForRelation(pstate, viewRel, + AccessShareLock, makeAlias("old", NIL), false, false); rt_entry2 = addRangeTableEntryForRelation(pstate, viewRel, + AccessShareLock, makeAlias("new", NIL), false, false); /* Must override addRangeTableEntry's default access-check flags */ diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 5443cbf..2b7cf26 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -864,6 +864,7 @@ InitPlan(QueryDesc *queryDesc, int eflags) Relation resultRelation; resultRelationOid = getrelid(resultRelationIndex, rangeTable); + Assert(rt_fetch(resultRelationIndex, rangeTable)->rellockmode == RowExclusiveLock); resultRelation = heap_open(resultRelationOid, RowExclusiveLock); InitResultRelInfo(resultRelInfo, @@ -904,6 +905,7 @@ InitPlan(QueryDesc *queryDesc, int eflags) Relation resultRelDesc; resultRelOid = getrelid(resultRelIndex, rangeTable); + Assert(rt_fetch(resultRelIndex, rangeTable)->rellockmode == RowExclusiveLock); resultRelDesc = heap_open(resultRelOid, RowExclusiveLock); InitResultRelInfo(resultRelInfo, resultRelDesc, @@ -924,8 +926,11 @@ InitPlan(QueryDesc *queryDesc, int eflags) /* We locked the roots above. 
*/ if (!list_member_int(plannedstmt->rootResultRelations, resultRelIndex)) + { + Assert(rt_fetch(resultRelIndex, rangeTable)->rellockmode == RowExclusiveLock); LockRelationOid(getrelid(resultRelIndex, rangeTable), RowExclusiveLock); + } } } } @@ -955,6 +960,7 @@ InitPlan(QueryDesc *queryDesc, int eflags) { PlanRowMark *rc = (PlanRowMark *) lfirst(l); Oid relid; + LOCKMODE rellockmode; Relation relation; ExecRowMark *erm; @@ -964,6 +970,7 @@ InitPlan(QueryDesc *queryDesc, int eflags) /* get relation's OID (will produce InvalidOid if subquery) */ relid = getrelid(rc->rti, rangeTable); + rellockmode = rt_fetch(rc->rti, rangeTable)->rellockmode; /* * If you change the conditions under which rel locks are acquired @@ -975,9 +982,13 @@ InitPlan(QueryDesc *queryDesc, int eflags) case ROW_MARK_NOKEYEXCLUSIVE: case ROW_MARK_SHARE: case ROW_MARK_KEYSHARE: + Assert(rellockmode == RowShareLock); relation = heap_open(relid, RowShareLock); break; case ROW_MARK_REFERENCE: + /* RTE might be a query target table */ + Assert(rellockmode == AccessShareLock || + rellockmode == RowExclusiveLock); relation = heap_open(relid, AccessShareLock); break; case ROW_MARK_COPY: diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index 5b3eaec..da84d5d 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -657,20 +657,32 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags) /* * Determine the lock type we need. First, scan to see if target relation * is a result relation. If not, check if it's a FOR UPDATE/FOR SHARE - * relation. In either of those cases, we got the lock already. + * relation. + * + * Note: we may have already gotten the desired lock type, but for now + * don't try to optimize; this logic is going away soon anyhow. */ lockmode = AccessShareLock; if (ExecRelationIsTargetRelation(estate, scanrelid)) - lockmode = NoLock; + lockmode = RowExclusiveLock; else { /* Keep this check in sync with InitPlan! 
*/ ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true); - if (erm != NULL && erm->relation != NULL) - lockmode = NoLock; + if (erm != NULL) + { + if (erm->markType == ROW_MARK_REFERENCE || + erm->markType == ROW_MARK_COPY) + lockmode = AccessShareLock; + else + lockmode = RowShareLock; + } } + /* lockmode per above logic must not be more than we previously acquired */ + Assert(lockmode <= rt_fetch(scanrelid, estate->es_range_table)->rellockmode); + /* Open the relation and acquire lock as needed */ reloid = getrelid(scanrelid, estate->es_range_table); rel = heap_open(reloid, lockmode); diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 7c8220c..925cb8f 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -2356,6 +2356,7 @@ _copyRangeTblEntry(const RangeTblEntry *from) COPY_SCALAR_FIELD(rtekind); COPY_SCALAR_FIELD(relid); COPY_SCALAR_FIELD(relkind); + COPY_SCALAR_FIELD(rellockmode); COPY_NODE_FIELD(tablesample); COPY_NODE_FIELD(subquery); COPY_SCALAR_FIELD(security_barrier); diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index 378f2fa..3bb91c9 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -2630,6 +2630,7 @@ _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b) COMPARE_SCALAR_FIELD(rtekind); COMPARE_SCALAR_FIELD(relid); COMPARE_SCALAR_FIELD(relkind); + COMPARE_SCALAR_FIELD(rellockmode); COMPARE_NODE_FIELD(tablesample); COMPARE_NODE_FIELD(subquery); COMPARE_SCALAR_FIELD(security_barrier); diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 93f1e2c..22dbae1 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -3131,6 +3131,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node) case RTE_RELATION: WRITE_OID_FIELD(relid); WRITE_CHAR_FIELD(relkind); + WRITE_INT_FIELD(rellockmode); WRITE_NODE_FIELD(tablesample); break; case RTE_SUBQUERY: diff --git 
a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 519deab..ce55658 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -1361,6 +1361,7 @@ _readRangeTblEntry(void) case RTE_RELATION: READ_OID_FIELD(relid); READ_CHAR_FIELD(relkind); + READ_INT_FIELD(rellockmode); READ_NODE_FIELD(tablesample); break; case RTE_SUBQUERY: diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 22c010c..89625f4 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -6021,6 +6021,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid) rte->rtekind = RTE_RELATION; rte->relid = tableOid; rte->relkind = RELKIND_RELATION; /* Don't be too picky. */ + rte->rellockmode = AccessShareLock; rte->lateral = false; rte->inh = false; rte->inFromCl = true; @@ -6143,6 +6144,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid) rte->rtekind = RTE_RELATION; rte->relid = tableOid; rte->relkind = RELKIND_RELATION; /* Don't be too picky. */ + rte->rellockmode = AccessShareLock; rte->lateral = false; rte->inh = true; rte->inFromCl = true; diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index c020600..226927b 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -1037,6 +1037,7 @@ transformOnConflictClause(ParseState *pstate, */ exclRte = addRangeTableEntryForRelation(pstate, targetrel, + RowExclusiveLock, makeAlias("excluded", NIL), false, false); exclRte->relkind = RELKIND_COMPOSITE_TYPE; diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c index d6b93f5..660011a 100644 --- a/src/backend/parser/parse_clause.c +++ b/src/backend/parser/parse_clause.c @@ -217,6 +217,7 @@ setTargetTable(ParseState *pstate, RangeVar *relation, * Now build an RTE. 
*/ rte = addRangeTableEntryForRelation(pstate, pstate->p_target_relation, + RowExclusiveLock, relation->alias, inh, false); pstate->p_target_rangetblentry = rte; diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 60b8de0..da600dc 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -1208,15 +1208,22 @@ addRangeTableEntry(ParseState *pstate, rte->alias = alias; /* + * Identify the type of lock we'll need on this relation. It's not the + * query's target table (that case is handled elsewhere), so we need + * either RowShareLock if it's locked by FOR UPDATE/SHARE, or plain + * AccessShareLock otherwise. + */ + lockmode = isLockedRefname(pstate, refname) ? RowShareLock : AccessShareLock; + + /* * Get the rel's OID. This access also ensures that we have an up-to-date * relcache entry for the rel. Since this is typically the first access - * to a rel in a statement, be careful to get the right access level - * depending on whether we're doing SELECT FOR UPDATE/SHARE. + * to a rel in a statement, we must open the rel with the proper lockmode. */ - lockmode = isLockedRefname(pstate, refname) ? RowShareLock : AccessShareLock; rel = parserOpenTable(pstate, relation, lockmode); rte->relid = RelationGetRelid(rel); rte->relkind = rel->rd_rel->relkind; + rte->rellockmode = lockmode; /* * Build the list of effective column names using user-supplied aliases @@ -1262,10 +1269,20 @@ addRangeTableEntry(ParseState *pstate, * * This is just like addRangeTableEntry() except that it makes an RTE * given an already-open relation instead of a RangeVar reference. + * + * lockmode is the lock type required for query execution; it must be one + * of AccessShareLock, RowShareLock, or RowExclusiveLock depending on the + * RTE's role within the query. The caller should always hold that lock mode + * or a stronger one. 
+ * + * Note: properly, lockmode should be declared LOCKMODE not int, but that + * would require importing storage/lock.h into parse_relation.h. Since + * LOCKMODE is typedef'd as int anyway, that seems like overkill. */ RangeTblEntry * addRangeTableEntryForRelation(ParseState *pstate, Relation rel, + int lockmode, Alias *alias, bool inh, bool inFromCl) @@ -1275,10 +1292,15 @@ addRangeTableEntryForRelation(ParseState *pstate, Assert(pstate != NULL); + Assert(lockmode == AccessShareLock || + lockmode == RowShareLock || + lockmode == RowExclusiveLock); + rte->rtekind = RTE_RELATION; rte->alias = alias; rte->relid = RelationGetRelid(rel); rte->relkind = rel->rd_rel->relkind; + rte->rellockmode = lockmode; /* * Build the list of effective column names using user-supplied aliases diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index f5c1e2a..42bf9bf 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -2526,7 +2526,9 @@ transformIndexStmt(Oid relid, IndexStmt *stmt, const char *queryString) * relation, but we still need to open it. */ rel = relation_open(relid, NoLock); - rte = addRangeTableEntryForRelation(pstate, rel, NULL, false, true); + rte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, + NULL, false, true); /* no to join list, yes to namespaces */ addRTEtoQuery(pstate, rte, false, true, true); @@ -2635,9 +2637,11 @@ transformRuleStmt(RuleStmt *stmt, const char *queryString, * qualification. */ oldrte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, makeAlias("old", NIL), false, false); newrte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, makeAlias("new", NIL), false, false); /* Must override addRangeTableEntry's default access-check flags */ @@ -2733,9 +2737,11 @@ transformRuleStmt(RuleStmt *stmt, const char *queryString, * them in the joinlist. 
*/ oldrte = addRangeTableEntryForRelation(sub_pstate, rel, + AccessShareLock, makeAlias("old", NIL), false, false); newrte = addRangeTableEntryForRelation(sub_pstate, rel, + AccessShareLock, makeAlias("new", NIL), false, false); oldrte->requiredPerms = 0; @@ -2938,6 +2944,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt, pstate->p_sourcetext = queryString; rte = addRangeTableEntryForRelation(pstate, rel, + AccessShareLock, NULL, false, true); diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index acc6498..6e420d8 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -795,7 +795,8 @@ copy_table(Relation rel) copybuf = makeStringInfo(); pstate = make_parsestate(NULL); - addRangeTableEntryForRelation(pstate, rel, NULL, false, false); + addRangeTableEntryForRelation(pstate, rel, AccessShareLock, + NULL, false, false); attnamelist = make_copy_attnamelist(relmapentry); cstate = BeginCopyFrom(pstate, rel, NULL, false, copy_read_data, attnamelist, NIL); diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 42345f9..e247539 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -199,6 +199,7 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel) rte->rtekind = RTE_RELATION; rte->relid = RelationGetRelid(rel->localrel); rte->relkind = rel->localrel->rd_rel->relkind; + rte->rellockmode = AccessShareLock; estate->es_range_table = list_make1(rte); resultRelInfo = makeNode(ResultRelInfo); diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c index 327e5c3..50788fe 100644 --- a/src/backend/rewrite/rewriteHandler.c +++ b/src/backend/rewrite/rewriteHandler.c @@ -89,6 +89,10 @@ static Bitmapset *adjust_view_column_set(Bitmapset *cols, List *targetlist); * These locks will ensure that the relation schemas don't change under us * 
while we are rewriting and planning the query. * + * Caution: this may modify the querytree, therefore caller should usually + * have done a copyObject() to make a writable copy of the querytree in the + * current memory context. + * * forExecute indicates that the query is about to be executed. * If so, we'll acquire RowExclusiveLock on the query's resultRelation, * RowShareLock on any relation accessed FOR [KEY] UPDATE/SHARE, and @@ -101,13 +105,11 @@ static Bitmapset *adjust_view_column_set(Bitmapset *cols, List *targetlist); * forUpdatePushedDown indicates that a pushed-down FOR [KEY] UPDATE/SHARE * applies to the current subquery, requiring all rels to be opened with at * least RowShareLock. This should always be false at the top of the - * recursion. This flag is ignored if forExecute is false. + * recursion. This flag is ignored if forExecute is false. If it is true, + * though, we adjust RTE rellockmode fields to reflect the higher lock level. * * A secondary purpose of this routine is to fix up JOIN RTE references to - * dropped columns (see details below). Because the RTEs are modified in - * place, it is generally appropriate for the caller of this routine to have - * first done a copyObject() to make a writable copy of the querytree in the - * current memory context. + * dropped columns (see details below). Such RTEs are modified in-place. 
* * This processing can, and for efficiency's sake should, be skipped when the * querytree has just been built by the parser: parse analysis already got @@ -170,12 +172,22 @@ AcquireRewriteLocks(Query *parsetree, lockmode = AccessShareLock; else if (rt_index == parsetree->resultRelation) lockmode = RowExclusiveLock; - else if (forUpdatePushedDown || - get_parse_rowmark(parsetree, rt_index) != NULL) + else if (forUpdatePushedDown) + { + lockmode = RowShareLock; + /* Upgrade RTE's lock mode to reflect pushed-down lock */ + if (rte->rellockmode == AccessShareLock) + rte->rellockmode = RowShareLock; + } + else if (get_parse_rowmark(parsetree, rt_index) != NULL) lockmode = RowShareLock; else lockmode = AccessShareLock; + Assert(!forExecute || lockmode == rte->rellockmode || + (lockmode == AccessShareLock && + rte->rellockmode == RowExclusiveLock)); + rel = heap_open(rte->relid, lockmode); /* @@ -1593,6 +1605,7 @@ ApplyRetrieveRule(Query *parsetree, /* Clear fields that should not be set in a subquery RTE */ rte->relid = InvalidOid; rte->relkind = 0; + rte->rellockmode = 0; rte->tablesample = NULL; rte->inh = false; /* must not be set for a subquery */ @@ -2920,8 +2933,14 @@ rewriteTargetView(Query *parsetree, Relation view) * very much like what the planner would do to "pull up" the view into the * outer query. Perhaps someday we should refactor things enough so that * we can share code with the planner.) + * + * Be sure to set rellockmode to the correct thing for the target table. + * Since we copied the whole viewquery above, we can just scribble on + * base_rte instead of copying it. 
*/ - new_rte = (RangeTblEntry *) base_rte; + new_rte = base_rte; + new_rte->rellockmode = RowExclusiveLock; + parsetree->rtable = lappend(parsetree->rtable, new_rte); new_rt_index = list_length(parsetree->rtable); @@ -3101,8 +3120,8 @@ rewriteTargetView(Query *parsetree, Relation view) new_exclRte = addRangeTableEntryForRelation(make_parsestate(NULL), base_rel, - makeAlias("excluded", - NIL), + RowExclusiveLock, + makeAlias("excluded", NIL), false, false); new_exclRte->relkind = RELKIND_COMPOSITE_TYPE; new_exclRte->requiredPerms = 0; diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index fc034ce..049b204 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -1878,12 +1878,14 @@ RI_Initial_Check(Trigger *trigger, Relation fk_rel, Relation pk_rel) pkrte->rtekind = RTE_RELATION; pkrte->relid = RelationGetRelid(pk_rel); pkrte->relkind = pk_rel->rd_rel->relkind; + pkrte->rellockmode = AccessShareLock; pkrte->requiredPerms = ACL_SELECT; fkrte = makeNode(RangeTblEntry); fkrte->rtekind = RTE_RELATION; fkrte->relid = RelationGetRelid(fk_rel); fkrte->relkind = fk_rel->rd_rel->relkind; + fkrte->rellockmode = AccessShareLock; fkrte->requiredPerms = ACL_SELECT; for (i = 0; i < riinfo->nkeys; i++) diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index eecd64e..f6693ea 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -1000,6 +1000,7 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty) oldrte->rtekind = RTE_RELATION; oldrte->relid = trigrec->tgrelid; oldrte->relkind = relkind; + oldrte->rellockmode = AccessShareLock; oldrte->alias = makeAlias("old", NIL); oldrte->eref = oldrte->alias; oldrte->lateral = false; @@ -1010,6 +1011,7 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty) newrte->rtekind = RTE_RELATION; newrte->relid = trigrec->tgrelid; newrte->relkind = relkind; + newrte->rellockmode = AccessShareLock; newrte->alias = 
makeAlias("new", NIL); newrte->eref = newrte->alias; newrte->lateral = false; @@ -3206,6 +3208,7 @@ deparse_context_for(const char *aliasname, Oid relid) rte->rtekind = RTE_RELATION; rte->relid = relid; rte->relkind = RELKIND_RELATION; /* no need for exactness here */ + rte->rellockmode = AccessShareLock; rte->alias = makeAlias(aliasname, NIL); rte->eref = rte->alias; rte->lateral = false; diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 7271b58..15df189 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -1539,6 +1539,8 @@ AcquireExecutorLocks(List *stmt_list, bool acquire) else lockmode = AccessShareLock; + Assert(lockmode == rte->rellockmode); + if (acquire) LockRelationOid(rte->relid, lockmode); else @@ -1609,6 +1611,8 @@ ScanQueryForLocks(Query *parsetree, bool acquire) lockmode = RowShareLock; else lockmode = AccessShareLock; + Assert(lockmode == rte->rellockmode || + (lockmode == AccessShareLock && rte->rellockmode == RowExclusiveLock)); if (acquire) LockRelationOid(rte->relid, lockmode); else diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index 07e0576..1e064d4 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -53,6 +53,6 @@ */ /* yyyymmddN */ -#define CATALOG_VERSION_NO 201809241 +#define CATALOG_VERSION_NO 201809291 #endif diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 62209a8..3056819 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -973,9 +973,21 @@ typedef struct RangeTblEntry * that the tuple format of the tuplestore is the same as the referenced * relation. This allows plans referencing AFTER trigger transition * tables to be invalidated if the underlying table is altered. + * + * rellockmode is really LOCKMODE, but it's declared int to avoid having + * to include lock-related headers here. 
It must be RowExclusiveLock if + * the RTE is an INSERT/UPDATE/DELETE target, else RowShareLock if the RTE + * is a SELECT FOR UPDATE/FOR SHARE source, else AccessShareLock. + * + * Note: in some cases, rule expansion may result in RTEs that are marked + * with RowExclusiveLock even though they are not the target of the + * current query; this happens if a DO ALSO rule simply scans the original + * target table. We leave such RTEs with their original lockmode so as to + * avoid getting an additional, lesser lock. */ Oid relid; /* OID of the relation */ char relkind; /* relation kind (see pg_class.relkind) */ + int rellockmode; /* lock level that query requires on the rel */ struct TableSampleClause *tablesample; /* sampling info, or NULL */ /* diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h index b9792ac..687a7d7 100644 --- a/src/include/parser/parse_relation.h +++ b/src/include/parser/parse_relation.h @@ -69,6 +69,7 @@ extern RangeTblEntry *addRangeTableEntry(ParseState *pstate, bool inFromCl); extern RangeTblEntry *addRangeTableEntryForRelation(ParseState *pstate, Relation rel, + int lockmode, Alias *alias, bool inh, bool inFromCl); diff --git a/src/test/modules/test_rls_hooks/test_rls_hooks.c b/src/test/modules/test_rls_hooks/test_rls_hooks.c index cab67a6..d492697 100644 --- a/src/test/modules/test_rls_hooks/test_rls_hooks.c +++ b/src/test/modules/test_rls_hooks/test_rls_hooks.c @@ -80,8 +80,8 @@ test_rls_hooks_permissive(CmdType cmdtype, Relation relation) qual_pstate = make_parsestate(NULL); - rte = addRangeTableEntryForRelation(qual_pstate, relation, NULL, false, - false); + rte = addRangeTableEntryForRelation(qual_pstate, relation, AccessShareLock, + NULL, false, false); addRTEtoQuery(qual_pstate, rte, false, true, true); role = ObjectIdGetDatum(ACL_ID_PUBLIC); @@ -143,8 +143,8 @@ test_rls_hooks_restrictive(CmdType cmdtype, Relation relation) qual_pstate = make_parsestate(NULL); - rte = 
addRangeTableEntryForRelation(qual_pstate, relation, NULL, false, - false); + rte = addRangeTableEntryForRelation(qual_pstate, relation, AccessShareLock, + NULL, false, false); addRTEtoQuery(qual_pstate, rte, false, true, true); role = ObjectIdGetDatum(ACL_ID_PUBLIC);
I wrote: > 1. You set up transformRuleStmt to insert AccessExclusiveLock into > the "OLD" and "NEW" RTEs for a view. This is surely wrong; we do > not want to take exclusive lock on a view just to run a query using > the view. It should (usually, anyway) just be AccessShareLock. > However, because addRangeTableEntryForRelation insists that you > hold the requested lock type *now*, just changing the parameter > to AccessShareLock doesn't work. > I hacked around this for the moment by passing NoLock to > addRangeTableEntryForRelation and then changing rte->lockmode > after it returns, but man that's ugly. It makes me wonder whether > addRangeTableEntryForRelation should be checking the lockmode at all. It occurred to me that it'd be reasonable to insist that the caller holds a lock *at least as strong* as the one being recorded in the RTE, and that there's also been discussions about verifying that some lock is held when something like heap_open(foo, NoLock) is attempted. So I dusted off the part of 0001 that did that, producing the attached delta patch. Unfortunately, I can't commit this, because it exposes at least two pre-existing bugs :-(. So we'll need to fix those first, which seems like it should be a separate thread. I'm just parking this here for the moment. I think that the call sites should ultimately look like Assert(CheckRelationLockedByMe(...)); but for hunting down the places where the assertion currently fails, it's more convenient if it's just an elog(WARNING). 
			regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3395445..f8a55ee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1137,6 +1137,11 @@ relation_open(Oid relationId, LOCKMODE lockmode)
 	if (!RelationIsValid(r))
 		elog(ERROR, "could not open relation with OID %u", relationId);
 
+	if (lockmode == NoLock &&
+		!CheckRelationLockedByMe(r, AccessShareLock, true))
+		elog(WARNING, "relation_open: no lock held on %s",
+			 RelationGetRelationName(r));
+
 	/* Make note that we've accessed a temporary relation */
 	if (RelationUsesLocalBuffers(r))
 		MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPREL;
@@ -1183,6 +1188,11 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
 	if (!RelationIsValid(r))
 		elog(ERROR, "could not open relation with OID %u", relationId);
 
+	if (lockmode == NoLock &&
+		!CheckRelationLockedByMe(r, AccessShareLock, true))
+		elog(WARNING, "try_relation_open: no lock held on %s",
+			 RelationGetRelationName(r));
+
 	/* Make note that we've accessed a temporary relation */
 	if (RelationUsesLocalBuffers(r))
 		MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPREL;
@@ -8084,6 +8094,10 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 	idx_rel = RelationIdGetRelation(replidindex);
 
+	if (!CheckRelationLockedByMe(idx_rel, AccessShareLock, true))
+		elog(WARNING, "ExtractReplicaIdentity: no lock held on %s",
+			 RelationGetRelationName(idx_rel));
+
 	/* deform tuple, so we have fast access to columns */
 	heap_deform_tuple(tp, desc, values, nulls);
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index da600dc..aaf5544 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -28,6 +28,7 @@
 #include "parser/parse_enr.h"
 #include "parser/parse_relation.h"
 #include "parser/parse_type.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
@@ -1272,7 +1273,7 @@ addRangeTableEntry(ParseState *pstate,
  *
  * lockmode is the lock type required for query execution; it must be one
  * of AccessShareLock, RowShareLock, or RowExclusiveLock depending on the
- * RTE's role within the query.  The caller should always hold that lock mode
+ * RTE's role within the query.  The caller must hold that lock mode
  * or a stronger one.
  *
  * Note: properly, lockmode should be declared LOCKMODE not int, but that
@@ -1295,6 +1296,9 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	Assert(lockmode == AccessShareLock ||
 		   lockmode == RowShareLock ||
 		   lockmode == RowExclusiveLock);
+	if (!CheckRelationLockedByMe(rel, lockmode, true))
+		elog(WARNING, "addRangeTableEntryForRelation: no lock held on %s",
+			 RelationGetRelationName(rel));
 
 	rte->rtekind = RTE_RELATION;
 	rte->alias = alias;
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index dc0a439..517ff8e 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -288,6 +288,51 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ * CheckRelationLockedByMe
+ *
+ * Returns true if current transaction holds a lock on 'relation' of mode
+ * 'lockmode'.  If 'orstronger' is true, a stronger lockmode is also OK.
+ * ("Stronger" is defined as "numerically higher", which is a bit
+ * semantically dubious but is OK for the purposes we use this for.)
+ */
+bool
+CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION(tag,
+						 relation->rd_lockInfo.lockRelId.dbId,
+						 relation->rd_lockInfo.lockRelId.relId);
+
+	if (LockHeldByMe(&tag, lockmode))
+		return true;
+
+	if (orstronger)
+	{
+		LOCKMODE	slockmode;
+
+		for (slockmode = lockmode + 1;
+			 slockmode <= AccessExclusiveLock;
+			 slockmode++)
+		{
+			if (LockHeldByMe(&tag, slockmode))
+			{
+#ifdef NOT_USED
+				/* Sometimes this might be useful for debugging purposes */
+				elog(WARNING, "lock mode %s substituted for %s on relation %s",
+					 GetLockmodeName(tag.locktag_lockmethodid, slockmode),
+					 GetLockmodeName(tag.locktag_lockmethodid, lockmode),
+					 RelationGetRelationName(relation));
+#endif
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * LockHasWaitersRelation
  *
  * This is a function to check whether someone else is waiting for a
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 831ae62..10f6f60 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -564,6 +564,30 @@ DoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2)
 }
 
 /*
+ * LockHeldByMe -- test whether lock 'locktag' is held with mode 'lockmode'
+ *		by the current transaction
+ */
+bool
+LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
+{
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+
+	/*
+	 * See if there is a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag)); /* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = lockmode;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										  (void *) &localtag,
+										  HASH_FIND, NULL);
+
+	return (locallock && locallock->nLocks > 0);
+}
+
+/*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *		lock would wake up other processes waiting for it.
  */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index a217de9..e5356b7 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -45,6 +45,8 @@ extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void LockRelation(Relation relation, LOCKMODE lockmode);
 extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
+					bool orstronger);
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index ff4df7f..a37fda7 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -540,6 +540,7 @@ extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
 extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
+extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
 extern bool LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode,
 			   bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
On 1 October 2018 at 06:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It occurred to me that it'd be reasonable to insist that the caller
> holds a lock *at least as strong* as the one being recorded in the RTE,
> and that there's also been discussions about verifying that some lock
> is held when something like heap_open(foo, NoLock) is attempted.
> So I dusted off the part of 0001 that did that, producing the
> attached delta patch.

My imagination struggles to think of a case, but perhaps one day in the
future we might have a lock manager that coordinates locks on multiple
nodes. If so, is there not a risk that one day we might have a lock level
greater than AccessExclusiveLock, meaning the following would get broken:

+		for (slockmode = lockmode + 1;
+			 slockmode <= AccessExclusiveLock;
+			 slockmode++)

For index strategies we do:

#define BTGreaterStrategyNumber		5
#define BTMaxStrategyNumber			5

So would it not be better to add the following to lockdefs.h?

#define MaxLockLevel				8

then use that to terminate the loop.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
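The suggested change can be sketched outside the server. The following is a hypothetical, self-contained mock-up: the lock-mode numbering is copied from lockdefs.h, `MaxLockLevel` is David's proposed addition, and `held_modes`/`lock_held` stand in for the local lock table and LockHeldByMe().

```c
#include <stdbool.h>

/* Lock mode numbering as in src/include/storage/lockdefs.h */
#define NoLock						0
#define AccessShareLock				1
#define RowShareLock				2
#define RowExclusiveLock			3
#define ShareUpdateExclusiveLock	4
#define ShareLock					5
#define ShareRowExclusiveLock		6
#define ExclusiveLock				7
#define AccessExclusiveLock			8
#define MaxLockLevel				8	/* proposed addition */

typedef int LOCKMODE;

/* Stand-in for LockHeldByMe(): a bitmask of modes held on one relation */
static unsigned held_modes;

static bool
lock_held(LOCKMODE lockmode)
{
	return (held_modes & (1u << lockmode)) != 0;
}

/*
 * Sketch of CheckRelationLockedByMe()'s "orstronger" scan, with the loop
 * terminated by MaxLockLevel instead of the literal AccessExclusiveLock.
 */
static bool
check_locked(LOCKMODE lockmode, bool orstronger)
{
	if (lock_held(lockmode))
		return true;

	if (orstronger)
	{
		LOCKMODE	slockmode;

		for (slockmode = lockmode + 1;
			 slockmode <= MaxLockLevel;
			 slockmode++)
		{
			if (lock_held(slockmode))
				return true;
		}
	}

	return false;
}
```

With this shape, adding a hypothetical mode above AccessExclusiveLock only requires bumping MaxLockLevel, and the scan keeps working.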
David Rowley <david.rowley@2ndquadrant.com> writes: > On 1 October 2018 at 06:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> + for (slockmode = lockmode + 1; >> + slockmode <= AccessExclusiveLock; >> + slockmode++) > So would it not be better to add the following to lockdefs.h? > #define MaxLockLevel 8 > then use that to terminate the loop. Good idea, will do. regards, tom lane
On 2018/09/30 5:04, Tom Lane wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> I've attached v10 which fixes this and aligns the WARNING text in >> ExecInitRangeTable() and addRangeTableEntryForRelation(). > > I started poking at this. Thanks a lot for looking at this. > I thought that it would be a good cross-check > to apply just the "front half" of 0001 (i.e., creation and population of > the RTE lockmode field), and then to insert checks in each of the > "back half" places (executor, plancache, etc) that the lockmodes they > are computing today match what is in the RTE. > > Those checks fell over many times in the regression tests. > > There seem to be at least four distinct problems: > > 1. You set up transformRuleStmt to insert AccessExclusiveLock into > the "OLD" and "NEW" RTEs for a view. This is surely wrong; we do > not want to take exclusive lock on a view just to run a query using > the view. It should (usually, anyway) just be AccessShareLock. > > However, because addRangeTableEntryForRelation insists that you > hold the requested lock type *now*, just changing the parameter > to AccessShareLock doesn't work. > > I hacked around this for the moment by passing NoLock to > addRangeTableEntryForRelation and then changing rte->lockmode > after it returns, but man that's ugly. It makes me wonder whether > addRangeTableEntryForRelation should be checking the lockmode at all. > > I'm inclined to think we should remove the check for current lockmode > in addRangeTableEntryForRelation, and instead just assert that the > passed lockmode must be one of AccessShareLock, RowShareLock, or > RowExclusiveLock depending on whether the RTE is meant to represent > a plain source table, a FOR UPDATE/SHARE source table, or a target > table. I don't think it's that helpful to be checking that the > caller got exactly the same lock, especially given the number of > places where the patch had to cheat already by using NoLock. 
Yeah, addRangeTableEntryForRelation should not insist on holding the "exact" lock, but rather *at least* as strong as the passed in lock mode. I see that the patch you posted downthread does that. In the original patch, the lock mode used for handling the CREATE RULE command was being conflated with the lock mode to require on NEW and OLD RTEs that will be added to the rule's query. > 2. The "forUpdatePushedDown" override in AcquireRewriteLocks isn't > getting modeled in the RTE lockmode fields. In other words, if we > have something like > > CREATE VIEW vv AS SELECT * FROM tab1; > SELECT * FROM vv FOR UPDATE OF vv; > > the checks fall over, because the tab1 RTE in the view's rangetable > just has AccessShareLock, but we need RowShareLock. I fixed this > by having AcquireRewriteLocks actually modify the RTE if it is > promoting the lock level. This is kinda grotty, but we were already > assuming that AcquireRewriteLocks could scribble on the query tree, > so it seems safe enough. Thanks for fixing that. > 3. There remain some cases where the RTE says RowExclusiveLock but > the executor calculation indicates we only need AccessShareLock. > AFAICT, this happens only when we have a DO ALSO rule that results > in an added query that merely scans the target table. I've seen something like that happen for ON CONFLICT's excluded pseudo-relation RTE too, because the executor deems only the RTE fetched with result relation RT index to require RowExclusiveLock. > The RTE used > for that purpose is just the original one, so it still has a lockmode > suitable for a target table. > > We could probably hack the rewriter so that it changes the RTE lockmode > back down to AccessShareLock in these cases. Is the problematic RTE one coming from an action query? If so, isn't it distinct from the original RTE and its rellockmode independently determined based on whether the action is select or not? I'm not sure if I understand why we'd need to change its lock mode. 
> However, I'm inclined to > think that that is something *not* to do, and that we should let the > higher lockmode be used instead, for two reasons: > > (1) Asking for both AccessShareLock and RowExclusiveLock on the same > table requires an extra trip through the shared lock manager, for little > value that I can see. > > (2) If the DO ALSO rule is run before the main one, we'd be acquiring > AccessShareLock before RowExclusiveLock, resulting in deadlock hazard > due to lock upgrade. (I think this may be a pre-existing bug, although > it could likely only manifest in corner cases such as where we're pulling > a plan tree out of plancache. In most cases the first thing we'd acquire > on a rule target table is RowExclusiveLock in the parser, before any > rule rewriting could happen.) > > 4. I also notice some cases (all in FDW tests) where ExecOpenScanRelation > is choosing AccessShareLock although the RTE has RowShareLock. These seem > to all be cases where we're implementing FOR UPDATE/SHARE via > ROW_MARK_COPY, so that this: > > ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true); > > if (erm != NULL && erm->relation != NULL) > lockmode = NoLock; > > leaves lockmode as AccessShareLock because erm->relation is NULL. > Again, I think this is probably OK, and it'd be better to use > RowShareLock for consistency with other places that might open > the rel. It will be a change in externally-visible behavior though. For this and the other cases (AcquireRewriteLocks, AcquireExecutorLocks, etc.), I wonder whether we couldn't just *not* recalculate the lock mode based on inspecting the query tree to cross-check with rellockmode? Why not just use rellockmode for locking? Maybe, okay to keep doing that in debug builds though. Also, are the discrepancies like this to be considered bugs of the existing logic? Thanks, Amit
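What using rellockmode directly might look like can be sketched in a self-contained way. The stub struct and helper names below are hypothetical, not the real PostgreSQL types; the assert mirrors the cross-check the patch currently installs in AcquireExecutorLocks/ScanQueryForLocks, including the DO ALSO exception Tom describes.

```c
#include <assert.h>
#include <stdbool.h>

#define AccessShareLock		1
#define RowShareLock		2
#define RowExclusiveLock	3

/* Hypothetical stand-in for the parts of RangeTblEntry that matter here */
typedef struct RangeTblEntryStub
{
	int			relid;			/* pretend OID */
	int			rellockmode;	/* recorded at parse time */
	bool		is_result_rel;
	bool		has_rowmark;
} RangeTblEntryStub;

/* What AcquireExecutorLocks computes today by re-inspecting the query */
static int
recomputed_lockmode(const RangeTblEntryStub *rte)
{
	if (rte->is_result_rel)
		return RowExclusiveLock;
	if (rte->has_rowmark)
		return RowShareLock;
	return AccessShareLock;
}

/* Proposed endpoint: lock with rellockmode, cross-check in debug builds */
static int
lockmode_to_take(const RangeTblEntryStub *rte)
{
	assert(rte->rellockmode == recomputed_lockmode(rte) ||
		   /* DO ALSO rules can leave a target-table lockmode on a scan */
		   (recomputed_lockmode(rte) == AccessShareLock &&
			rte->rellockmode == RowExclusiveLock));
	return rte->rellockmode;
}
```

The point of the sketch is that the recomputation survives only inside the assert; the lock actually taken comes straight from the RTE.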
On 2018/10/01 2:18, Tom Lane wrote: > I wrote: >> 1. You set up transformRuleStmt to insert AccessExclusiveLock into >> the "OLD" and "NEW" RTEs for a view. This is surely wrong; we do >> not want to take exclusive lock on a view just to run a query using >> the view. It should (usually, anyway) just be AccessShareLock. >> However, because addRangeTableEntryForRelation insists that you >> hold the requested lock type *now*, just changing the parameter >> to AccessShareLock doesn't work. >> I hacked around this for the moment by passing NoLock to >> addRangeTableEntryForRelation and then changing rte->lockmode >> after it returns, but man that's ugly. It makes me wonder whether >> addRangeTableEntryForRelation should be checking the lockmode at all. > > It occurred to me that it'd be reasonable to insist that the caller > holds a lock *at least as strong* as the one being recorded in the RTE, > and that there's also been discussions about verifying that some lock > is held when something like heap_open(foo, NoLock) is attempted. > So I dusted off the part of 0001 that did that, producing the > attached delta patch. > > Unfortunately, I can't commit this, because it exposes at least two > pre-existing bugs :-(. So we'll need to fix those first, which seems > like it should be a separate thread. I'm just parking this here for > the moment. > > I think that the call sites should ultimately look like > > Assert(CheckRelationLockedByMe(...)); > > but for hunting down the places where the assertion currently fails, > it's more convenient if it's just an elog(WARNING). Should this check that we're not in a parallel worker process? Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2018/09/30 5:04, Tom Lane wrote: >> 3. There remain some cases where the RTE says RowExclusiveLock but >> the executor calculation indicates we only need AccessShareLock. >> AFAICT, this happens only when we have a DO ALSO rule that results >> in an added query that merely scans the target table. > I've seen something like that happen for ON CONFLICT's excluded > pseudo-relation RTE too, because the executor deems only the RTE fetched > with result relation RT index to require RowExclusiveLock. OK, I had not carefully inspected every case. > For this and the other cases (AcquireRewriteLocks, AcquireExecutorLocks, > etc.), I wonder whether we couldn't just *not* recalculate the lock mode > based on inspecting the query tree to cross-check with rellockmode? Why > not just use rellockmode for locking? Right, that's exactly where we want to end up. This intermediate state of the patch is just an attempt to verify that we understand when and how relying on rellockmode will change the behavior. > Also, are the discrepancies like this to be > considered bugs of the existing logic? Mmm ... hard to say. In the DO ALSO case, it's possible that query execution would take AccessShareLock and then RowExclusiveLock, which at least in principle creates a lock-upgrade deadlock hazard. So I think standardizing on taking the rellockmode will be an improvement, but I don't know that I'd call the existing behavior a bug; I certainly wouldn't risk trying to back-patch a change for it. In the ROW_MARK_COPY case, it's conceivable that with the existing code query re-execution would take only AccessShareLock on a FOR UPDATE target table, where the parser had originally taken RowShareLock. Again, it seems like making the behavior more consistent is an improvement, but not something we'd try to back-patch. regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2018/10/01 2:18, Tom Lane wrote: >> I think that the call sites should ultimately look like >> Assert(CheckRelationLockedByMe(...)); >> but for hunting down the places where the assertion currently fails, >> it's more convenient if it's just an elog(WARNING). > Should this check that we're not in a parallel worker process? Hmm. I've not seen any failures in the parallel parts of the regular regression tests, but maybe I'd better do a force_parallel_mode run before committing. In general, I'm not on board with the idea that parallel workers don't need to get their own locks, so I don't really want to exclude parallel workers from this check. But if it's not safe for that today, fixing it is beyond the scope of this particular patch. regards, tom lane
On 1 October 2018 at 19:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > For this and the other cases (AcquireRewriteLocks, AcquireExecutorLocks, > etc.), I wonder whether we couldn't just *not* recalculate the lock mode > based on inspecting the query tree to cross-check with rellockmode? Why > not just use rellockmode for locking? Maybe, okay to keep doing that in > debug builds though. Also, are the discrepancies like this to be > considered bugs of the existing logic? I got the impression Tom was just leaving that in for a while to let the buildfarm verify the new code is getting the same lock level as the old code. Of course, I might be wrong. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On 1 October 2018 at 19:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> For this and the other cases (AcquireRewriteLocks, AcquireExecutorLocks, >> etc.), I wonder whether we couldn't just *not* recalculate the lock mode >> based on inspecting the query tree to cross-check with rellockmode? Why >> not just use rellockmode for locking? Maybe, okay to keep doing that in >> debug builds though. Also, are the discrepancies like this to be >> considered bugs of the existing logic? > I got the impression Tom was just leaving that in for a while to let > the buildfarm verify the new code is getting the same lock level as > the old code. Of course, I might be wrong. Yeah, exactly. My plan is to next switch to taking the locks based on rellockmode, and then if that doesn't show any problems to switch the executor to just Assert that there's already a suitable lock, and then lastly to proceed with ripping out the no-longer-needed logic that supports the downstream calculations of lockmode. So a bit more granular than what Amit submitted, but we'll get to the same place in the end, with more confidence that we didn't break anything. regards, tom lane
I wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> Should this check that we're not in a parallel worker process? > Hmm. I've not seen any failures in the parallel parts of the regular > regression tests, but maybe I'd better do a force_parallel_mode > run before committing. > In general, I'm not on board with the idea that parallel workers don't > need to get their own locks, so I don't really want to exclude parallel > workers from this check. But if it's not safe for that today, fixing it > is beyond the scope of this particular patch. So the place where that came out in the wash is the commit I just made (9a3cebeaa) to change the executor from taking table locks to asserting that somebody else took them already. To make that work, I had to make both ExecOpenScanRelation and relation_open skip checking for lock-held if IsParallelWorker(). This makes me entirely uncomfortable with the idea that parallel workers can be allowed to not take any locks of their own. There is no basis for arguing that we have field proof that that's safe, because *up to now, parallel workers in fact did take their own locks*. And it seems unsafe on its face, because there's nothing that really guarantees that the parent process won't go away while children are still running. (elog(FATAL) seems like a counterexample, for instance.) I think that we ought to adjust parallel query to insist that children do take locks, and then revert the IsParallelWorker() exceptions I made here. I plan to leave that point in abeyance till we've got the rest of these changes in place, though. The easiest way to do it will doubtless change once we've centralized the executor's table-opening logic, so trying to code it right now seems like a waste of effort. regards, tom lane
On 2018/10/04 5:16, Tom Lane wrote:
> I wrote:
>> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>>> Should this check that we're not in a parallel worker process?
>
>> Hmm. I've not seen any failures in the parallel parts of the regular
>> regression tests, but maybe I'd better do a force_parallel_mode
>> run before committing.
>> In general, I'm not on board with the idea that parallel workers don't
>> need to get their own locks, so I don't really want to exclude parallel
>> workers from this check. But if it's not safe for that today, fixing it
>> is beyond the scope of this particular patch.
>
> So the place where that came out in the wash is the commit I just made
> (9a3cebeaa) to change the executor from taking table locks to asserting
> that somebody else took them already.

Thanks for getting that done.

> To make that work, I had to make
> both ExecOpenScanRelation and relation_open skip checking for lock-held
> if IsParallelWorker().

Yeah, I had to do that too when rebasing the remaining patches.

> This makes me entirely uncomfortable with the idea that parallel workers
> can be allowed to not take any locks of their own. There is no basis
> for arguing that we have field proof that that's safe, because *up to
> now, parallel workers in fact did take their own locks*. And it seems
> unsafe on its face, because there's nothing that really guarantees that
> the parent process won't go away while children are still running.
> (elog(FATAL) seems like a counterexample, for instance.)
>
> I think that we ought to adjust parallel query to insist that children
> do take locks, and then revert the IsParallelWorker() exceptions I made
> here.

Maybe I'm missing something here, but isn't the necessary adjustment just
that the relations are opened with locks if inside a parallel worker?

> I plan to leave that point in abeyance till we've got the rest
> of these changes in place, though.
> The easiest way to do it will doubtless change once we've centralized
> the executor's table-opening logic, so trying to code it right now
> seems like a waste of effort.

Okay. I've rebased the remaining patches. I broke down one of the
patches into 2 and re-ordered the patches as follows:

0001: introduces a function that opens range table relations and
maintains them in an array indexed by RT index

0002: introduces a new field in EState that's an array of RangeTblEntry
pointers and revises macros used in the executor that access RTEs to
return them from the array (David Rowley co-authored this one)

0003: moves result relation and ExecRowMark initialization out of
InitPlan and into ExecInit* routines of respective nodes

0004: removes useless fields from certain planner nodes whose only
purpose has been to assist the executor lock relations in proper order

0005: teaches planner to remove PlanRowMarks corresponding to dummy
relations

Thanks,
Amit
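As a rough illustration of what 0001 does, here is a hypothetical, self-contained mock-up of the open-once-per-RTE behavior ("if a given RTE's relation is opened multiple times, only the first call goes to relcache"). All names below are stubs, not the real executor types; the actual ExecRangeTableRelation also deals with locking, which is omitted here.

```c
#include <stddef.h>

#define MAX_RTABLE 8

typedef struct RelationStub { int relid; } RelationStub;

static RelationStub relcache[MAX_RTABLE];	/* pretend relcache */
static int	relcache_accesses;				/* counts "heap_open" calls */

/* Stand-in for heap_open(relid, NoLock) */
static RelationStub *
stub_heap_open(int relid)
{
	relcache_accesses++;
	relcache[relid].relid = relid;
	return &relcache[relid];
}

/* EState fragment: Relation pointers indexed by range-table index */
typedef struct EStateStub
{
	RelationStub *es_relations[MAX_RTABLE]; /* NULL until first use */
} EStateStub;

/* Sketch of ExecRangeTableRelation(): open on first call, reuse after */
static RelationStub *
exec_range_table_relation(EStateStub *estate, int rti)
{
	if (estate->es_relations[rti] == NULL)
		estate->es_relations[rti] = stub_heap_open(rti);
	return estate->es_relations[rti];
}
```

Plan nodes that share an RTE then share one Relation pointer for the whole plan tree, instead of each paying for a relcache lookup.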
Attachments
- v12-0001-Revise-executor-range-table-relation-locking-and.patch
- v12-0002-Introduce-an-array-of-RangeTblEntry-pointers-in-.patch
- v12-0003-Move-result-rel-and-ExecRowMark-initilization-to.patch
- v12-0004-Remove-useless-fields-from-planner-nodes.patch
- v12-0005-Prune-PlanRowMark-of-relations-that-are-pruned-f.patch
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2018/10/04 5:16, Tom Lane wrote: >> I think that we ought to adjust parallel query to insist that children >> do take locks, and then revert the IsParallelWorker() exceptions I made >> here. > Maybe I'm missing something here, but isn't the necessary adjustment just > that the relations are opened with locks if inside a parallel worker? Yeah, that's one plausible way to fix it. I hadn't wanted to prejudge the best way before we finish the other changes, though. > I've rebased the remaining patches. I broke down one of the patches into > 2 and re-ordered the patches as follows: Thanks, will start looking at these today. regards, tom lane
Hi, On 2018-10-03 16:16:11 -0400, Tom Lane wrote: > I wrote: > > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > >> Should this check that we're not in a parallel worker process? > > > Hmm. I've not seen any failures in the parallel parts of the regular > > regression tests, but maybe I'd better do a force_parallel_mode > > run before committing. > > In general, I'm not on board with the idea that parallel workers don't > > need to get their own locks, so I don't really want to exclude parallel > > workers from this check. But if it's not safe for that today, fixing it > > is beyond the scope of this particular patch. > > So the place where that came out in the wash is the commit I just made > (9a3cebeaa) to change the executor from taking table locks to asserting > that somebody else took them already. To make that work, I had to make > both ExecOpenScanRelation and relation_open skip checking for lock-held > if IsParallelWorker(). > > This makes me entirely uncomfortable with the idea that parallel workers > can be allowed to not take any locks of their own. There is no basis > for arguing that we have field proof that that's safe, because *up to > now, parallel workers in fact did take their own locks*. And it seems > unsafe on its face, because there's nothing that really guarantees that > the parent process won't go away while children are still running. > (elog(FATAL) seems like a counterexample, for instance.) > I think that we ought to adjust parallel query to insist that children > do take locks, and then revert the IsParallelWorker() exceptions I made > here. I plan to leave that point in abeyance till we've got the rest > of these changes in place, though. The easiest way to do it will > doubtless change once we've centralized the executor's table-opening > logic, so trying to code it right now seems like a waste of effort. I've not really followed this thread, and just caught up to here. 
It seems entirely unacceptable to not acquire locks on workers to me.
Maybe I'm missing something, but why do/did the patches in this thread
require that / introduce that?  We didn't have that kind of concept
before, no?  The group locking stuff should rely / require that kind of
thing, no?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> I've not really followed this thread, and just caught up to here.  It
> seems entirely unacceptable to not acquire locks on workers to me.
> Maybe I'm missing something, but why do/did the patches in this thread
> require that / introduce that?  We didn't have that kind of concept
> before, no?  The group locking stuff should rely / require that kind of
> thing, no?

I'm possibly confused, but I thought that the design of parallel query
involved an expectation that workers didn't need to get their own locks.
What we've determined so far in this thread is that workers *do* get
their own locks (or did before yesterday), but I'd been supposing that
that was accidental not intentional.

In any case, I definitely intend that they will be getting their own
locks again after the dust has settled.  Panic not.

			regards, tom lane
Hi,

On 2018-10-04 15:27:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I've not really followed this thread, and just caught up to here.  It
> > seems entirely unacceptable to not acquire locks on workers to me.
> > Maybe I'm missing something, but why do/did the patches in this thread
> > require that / introduce that?  We didn't have that kind of concept
> > before, no?  The group locking stuff should rely / require that kind of
> > thing, no?
>
> I'm possibly confused, but I thought that the design of parallel query
> involved an expectation that workers didn't need to get their own
> locks.

Not as far as I'm aware of - but I'm not exactly the expert there.
There's an exception that some lock classes don't conflict between the
leader and the workers - that's group locking (a1c1af2a1f60).  But the
locks still have to be acquired, and I think it's quite dangerous not
to do so.  The group locking logic is required because otherwise it'd
be trivial to get into deadlocks, and some of the restrictions around
parallel query are required to make that safe.

> What we've determined so far in this thread is that workers *do* get
> their own locks (or did before yesterday), but I'd been supposing that
> that was accidental not intentional.

I don't think it was accidental.

Greetings,

Andres Freund
On 2018-10-04 12:34:44 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-10-04 15:27:59 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I've not really followed this thread, and just caught up to here.  It
> > > seems entirely unacceptable to not acquire locks on workers to me.
> > > Maybe I'm missing something, but why do/did the patches in this thread
> > > require that / introduce that?  We didn't have that kind of concept
> > > before, no?  The group locking stuff should rely / require that kind of
> > > thing, no?
> >
> > I'm possibly confused, but I thought that the design of parallel query
> > involved an expectation that workers didn't need to get their own
> > locks.
>
> Not as far as I'm aware of - but I'm not exactly the expert
> there.  There's an exception that some lock classes don't conflict
> between the leader and the workers - that's group locking
> (a1c1af2a1f60).  But the locks still have to be acquired, and I think
> it's quite dangerous not to do so.  The group locking logic is required
> because otherwise it'd be trivial to get into deadlocks, and some of the
> restrictions around parallel query are required to make that safe.

Re-read docs + code just to make sure.  Here's the relevant readme parts:

src/backend/access/transam/README.parallel:

To prevent unprincipled deadlocks when running in parallel mode, this code
also arranges for the leader and all workers to participate in group
locking.  See src/backend/storage/lmgr/README for more details.

src/backend/storage/lmgr/README:

Group Locking
-------------

As if all of that weren't already complicated enough, PostgreSQL now supports
parallelism (see src/backend/access/transam/README.parallel), which means that
we might need to resolve deadlocks that occur between gangs of related
processes rather than individual processes.  This doesn't change the basic
deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.
We choose to regard locks held by processes in the same parallel group as
non-conflicting.  This means that two processes in a parallel group can hold a
self-exclusive lock on the same relation at the same time, or one process can
acquire an AccessShareLock while the other already holds AccessExclusiveLock.
This might seem dangerous and could be in some cases (more on that below), but
if we didn't do this then parallel query would be extremely prone to
self-deadlock.  For example, a parallel query against a relation on which the
leader already had AccessExclusiveLock would hang, because the workers would
try to lock the same relation and be blocked by the leader; yet the leader
can't finish until it receives completion indications from all workers.  An
undetected deadlock results.  This is far from the only scenario where such a
problem happens.  The same thing will occur if the leader holds only
AccessShareLock, the worker seeks AccessShareLock, but between the time the
leader attempts to acquire the lock and the time the worker attempts to
acquire it, some other process queues up waiting for an AccessExclusiveLock.
In this case, too, an indefinite hang results.

...

So yes, locks are expected to be acquired in workers.

Greetings,

Andres Freund
On Thu, Oct 4, 2018 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm possibly confused, but I thought that the design of parallel query
> involved an expectation that workers didn't need to get their own locks.

You are, indeed, confused.  A heck of a lot of effort went into making
sure that the workers COULD take their own locks, and into trying to
make sure that didn't break anything.  That effort may or may not have
been entirely successful, but I'm pretty sure that having them NOT take
locks is going to be a lot worse.

> What we've determined so far in this thread is that workers *do* get
> their own locks (or did before yesterday), but I'd been supposing that
> that was accidental not intentional.

Nope, that was intentional.

> In any case, I definitely intend that they will be getting their own
> locks again after the dust has settled.  Panic not.

/me unloads metaphorical bazooka.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Oct 4, 2018 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What we've determined so far in this thread is that workers *do* get
>> their own locks (or did before yesterday), but I'd been supposing that
>> that was accidental not intentional.

> Nope, that was intentional.

Fair enough --- in which case, the patch series we're working on here
was broken to suppose that it could get away with removing that.  As I
said, I'll make sure the locking is back before I finish with this.

			regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> I've rebased the remaining patches.  I broke down one of the patches into
> 2 and re-ordered the patches as follows:

> 0001: introduces a function that opens range table relations and maintains
> them in an array indexed by RT index

> 0002: introduces a new field in EState that's an array of RangeTblEntry
> pointers and revises macros used in the executor that access RTEs to
> return them from the array (David Rowley co-authored this one)

I've pushed 0001 and 0002 with mostly cosmetic changes.  One thing I
wanted to point out explicitly, though, is that I found this bit of 0002
to be a seriously bad idea:

--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -20,6 +20,7 @@
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "utils/hsearch.h"
 #include "utils/queryenvironment.h"

Please do not add #includes of fundamental headers to other fundamental
headers without clearing it with somebody.  There's little enough
structure to our header collection now.  I don't want to end up in a
situation where effectively the entire header set gets pulled into
every .c file, or worse that we have actual reference loops in the
headers.  (This is not an academic problem; somebody actually created
such a loop awhile back.  Cleaning it up, by the time we'd recognized
the problem, was really painful.)

> 0003: moves result relation and ExecRowMark initialization out of InitPlan
> and into ExecInit* routines of respective nodes

I am finding myself pretty unconvinced by this one; it seems like mostly
a random reallocation of responsibility with little advantage.
The particular thing that brought me to a screeching halt was seeing that
the patch removed ExecFindRowMark, despite the fact that that's part of
our advertised FDW API (see fdwhandler.sgml), and it didn't provide any
alternative way for an FDW to find out at runtime whether it's subject
to a row locking requirement.

I thought for a minute about just leaving the function in place, but
that wouldn't work because both nodeLockRows and nodeModifyTable are
written so that they find^H^H^Hbuild their rowmarks only after recursing
to initialize their child plan nodes; so a child node that tried to use
ExecFindRowMark during ExecInitNode would get the wrong answer.  Of
course, we could consider changing the order of operations during
initialization of those node types, but I'm not really seeing a
compelling reason why we should whack things around that much.

So I'm inclined to just omit 0003.  AFAICS this would only mean that
we couldn't drop the global PlanRowMarks list from PlannedStmt, which
does not bother me much.

> 0005: teaches planner to remove PlanRowMarks corresponding to dummy relations

I'm not entirely sold on the value of that either?

			regards, tom lane
On 2018/10/05 5:59, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> I've rebased the remaining patches.  I broke down one of the patches into
>> 2 and re-ordered the patches as follows:
>
>> 0001: introduces a function that opens range table relations and maintains
>> them in an array indexed by RT index
>
>> 0002: introduces a new field in EState that's an array of RangeTblEntry
>> pointers and revises macros used in the executor that access RTEs to
>> return them from the array (David Rowley co-authored this one)
>
> I've pushed 0001 and 0002 with mostly cosmetic changes.

Thanks a lot.

> One thing I
> wanted to point out explicitly, though, is that I found this bit of 0002
> to be a seriously bad idea:
>
> --- a/src/include/nodes/execnodes.h
> +++ b/src/include/nodes/execnodes.h
> @@ -20,6 +20,7 @@
>  #include "executor/instrument.h"
>  #include "lib/pairingheap.h"
>  #include "nodes/params.h"
> +#include "nodes/parsenodes.h"
>  #include "nodes/plannodes.h"
>  #include "utils/hsearch.h"
>  #include "utils/queryenvironment.h"
>
> Please do not add #includes of fundamental headers to other fundamental
> headers without clearing it with somebody.  There's little enough
> structure to our header collection now.  I don't want to end up in a
> situation where effectively the entire header set gets pulled into
> every .c file, or worse that we have actual reference loops in the
> headers.  (This is not an academic problem; somebody actually created
> such a loop awhile back.  Cleaning it up, by the time we'd recognized
> the problem, was really painful.)

Okay, sorry about that.  I was slightly nervous that I had to do it when
doing it, but forgot to mention that explicitly in the commit message or
the email.
>> 0003: moves result relation and ExecRowMark initialization out of InitPlan
>> and into ExecInit* routines of respective nodes
>
> I am finding myself pretty unconvinced by this one; it seems like mostly
> a random reallocation of responsibility with little advantage.  The
> particular thing that brought me to a screeching halt was seeing that
> the patch removed ExecFindRowMark, despite the fact that that's part
> of our advertised FDW API (see fdwhandler.sgml), and it didn't provide
> any alternative way for an FDW to find out at runtime whether it's
> subject to a row locking requirement.
>
> I thought for a minute about just leaving the function in place, but
> that wouldn't work because both nodeLockRows and nodeModifyTable are
> written so that they find^H^H^Hbuild their rowmarks only after recursing
> to initialize their child plan nodes; so a child node that tried to use
> ExecFindRowMark during ExecInitNode would get the wrong answer.  Of
> course, we could consider changing the order of operations during
> initialization of those node types, but I'm not really seeing a compelling
> reason why we should whack things around that much.
>
> So I'm inclined to just omit 0003.  AFAICS this would only mean that
> we couldn't drop the global PlanRowMarks list from PlannedStmt, which
> does not bother me much.

To be honest, I too had begun to fail to see the point of this patch
since yesterday.  In fact, getting this one to pass make check-world
took a bit of head-scratching due to its interaction with EvalPlanQual's
EState building, so I was almost ready to drop it from the series.  But
I felt that it might still be a good idea to get rid of what was
described as special case code.  Reading your argument against it based
on how it messes up some of the assumptions regarding ExecRowMark
design, I'm willing to let go of this one as more or less a cosmetic
improvement.
>> 0005: teaches planner to remove PlanRowMarks corresponding to dummy relations
>
> I'm not entirely sold on the value of that either?

If you look at the first email on this thread, you can see that trimming
the PlanRowMarks list down to just the ones needed during execution
results in a non-trivial improvement in SELECT FOR SHARE on partitioned
tables with a large number of partitions:

<quote>
Speedup is more pronounced with a benchmark that needs RowMarks, because
one of the patches (0003) removes overhead around handling them.

$ cat /tmp/select-lt-for-share.sql
select * from lt where b = 999 for share;

master

tps = 94.095985 (excluding connections establishing)
tps = 93.955702 (excluding connections establishing)

<snip>

patch 0003 (prune PlanRowMarks)

tps = 712.544029 (excluding connections establishing)
tps = 717.540052 (excluding connections establishing)
</quote>

But on reflection, this seems more like adding special case code to the
planner just for this, rather than solving the more general problem of
initializing *any* partition information after pruning, being discussed
over at the "speeding up planning with partitions" thread [1].  So, I
don't mind dropping this one too.

So, that leaves us with only the patch that revises the plan node fields
in light of having relieved the executor of doing any locking of its own
in *some* cases.  That patch's premise is that we don't need the fields
that the patch removes, because they're only referenced for locking
purposes.  But if those plan nodes support parallel execution, and hence
will be passed to parallel workers that will need to lock the tables
contained in them, then they had better contain all the information
needed for locking *every* affected table, so we had better not remove
it.  The plan nodes in question are Append, MergeAppend, and ModifyTable,
of which only the first two support parallel execution, but maybe we
should just leave all of them alone for now.
IOW, I'm not sure about that patch either.  Thoughts?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2018/10/05 5:59, Tom Lane wrote:
>> So I'm inclined to just omit 0003.  AFAICS this would only mean that
>> we couldn't drop the global PlanRowMarks list from PlannedStmt, which
>> does not bother me much.

> To be honest, I too had begun to fail to see the point of this patch
> since yesterday.  In fact, getting this one to pass make check-world
> took a bit of head-scratching due to its interaction with EvalPlanQual's
> EState building, so I was almost ready to drop it from the series.  But
> I felt that it might still be a good idea to get rid of what was
> described as special case code.  Reading your argument against it based
> on how it messes up some of the assumptions regarding ExecRowMark
> design, I'm willing to let go of this one as more or less a cosmetic
> improvement.

OK.  We do need to fix the comments in InitPlan that talk about acquiring
locks and how that makes us need to do things in a particular order,
but I'll go take care of that.

> So, that leaves us with only the patch that revises the plan node fields
> in light of having relieved the executor of doing any locking of its own
> in *some* cases.  That patch's premise is that we don't need the fields
> that the patch removes, because they're only referenced for locking
> purposes.  But if those plan nodes support parallel execution, and hence
> will be passed to parallel workers that will need to lock the tables
> contained in them, then they had better contain all the information
> needed for locking *every* affected table, so we had better not remove
> it.

No, I think this is unduly pessimistic.  We need to make sure that a
parallel worker has lock on tables it's actually touching, but I don't
see why that should imply a requirement to hold lock on parent tables
it never touches.
The reasons why we need locks on tables not physically accessed by the
query are (a) to ensure that we've blocked, or received sinval messages
for, any DDL related to views or partition parent tables, in case that
would invalidate the plan; (b) to allow firing triggers safely, in the
case of partition parent tables.  Neither of these issues apply to a
parallel worker -- the plan is already frozen before it can ever start,
and it isn't going to be firing any triggers either.

In particular, I think it's fine to get rid of
ExecLockNonLeafAppendTables.  In the parent, that's clearly useless code
now: we have already locked *every* RTE_RELATION entry in the rangetable,
either when the parser/rewriter/planner added the RTE to the list to
begin with, or during AcquireExecutorLocks if the plan was retrieved
from the plancache.  In a child it seems unnecessary as long as we're
locking the leaf rels we actually touch.

Possibly, we might fix the problem of inadequate locking in worker
processes by having them do the equivalent of AcquireExecutorLocks, ie
just run through the whole RT list and lock everything.  I think we'd
soon end up doing that if we ever try to allow parallelization of
non-read-only queries; but that's a long way off AFAIK.  For read-only
queries it seems like it'll be fine if we just rejigger
ExecGetRangeTableRelation to take a lock when IsParallelWorker.

			regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> 0004: removes useless fields from certain planner nodes whose only purpose
> has been to assist the executor lock relations in proper order

I've pushed most of 0004 now; obviously, not the parts removing
PlannedStmt.rowMarks, since that's not possible without rearrangement
of the executor's RowMark handling.

I didn't like the idea of unifying ModifyTable.nominalRelation with the
partition root info.  Those fields serve different masters ---
nominalRelation, at least in its original intent, is only meant for use
of EXPLAIN and might have nothing to do with what happens at execution.
So even though unifying them would work today, we might regret it down
the line.  Instead I left that field alone and added a separate
rootRelation field to carry the partition root RT index, which ends up
being the same number of fields anyway since we don't need a flag for
is-the-nominal-relation-a-partition-root.

Still need to think a bit more about whether we want 0005 in anything
like its current form.

			regards, tom lane
I wrote:
> Still need to think a bit more about whether we want 0005 in
> anything like its current form.

So I poked at that for a bit, and soon realized that the *main* problem
there is that ExecFindRowMark() eats O(N^2) time due to repeated
searches of the es_rowMarks list.  While the patch as stated would
improve that for cases where most of the partitions can be pruned at
plan time, it does nothing much for cases where they can't.  However,
it's pretty trivial to fix that: let's just use an array not a list.
Patch 0001 attached does that.

A further improvement we could consider is to avoid opening the relcache
entries for pruned-away relations.  I could not find a way to do that
that was less invasive than removing ExecRowMark.relation and requiring
callers to call a new function to get the relation if they need it.
Patch 0002 attached is a delta on top of 0001 that does that.

Replicating your select-lt-for-share test case as best I can (you never
actually specified it carefully), I find that the TPS rate on my
workstation goes from about 250 tps with HEAD to 920 tps with patch 0001
or 1130 tps with patch 0002.  This compares to about 1600 tps for the
non-FOR-SHARE version of the query.

However, we should keep in mind that without partitioning overhead
(ie "select * from lt_999 where b = 999 for share"), the TPS rate is
over 25800 tps.  Most of the overhead in the partitioned case seems to
be from acquiring locks on rangetable entries that we won't ever use,
and none of these patch variants are touching that problem.  So ISTM
that the *real* win for this scenario is going to come from teaching
the system to prune unwanted relations from the query altogether, not
just from the PlanRowMark list.

Keeping that comparison in mind, I'm inclined to think that 0001 is the
best thing to do for now.  The incremental win from 0002 is not big
enough to justify the API break it creates, while your 0005 is not
really attacking the problem the right way.
			regards, tom lane

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 7480cf5..aadf749 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -91,21 +91,22 @@ execCurrentOf(CurrentOfExpr *cexpr,
 	 * the other code can't, while the non-FOR-UPDATE case allows use of WHERE
 	 * CURRENT OF with an insensitive cursor.
 	 */
-	if (queryDesc->estate->es_rowMarks)
+	if (queryDesc->estate->es_rowmarks)
 	{
 		ExecRowMark *erm;
-		ListCell   *lc;
+		Index		i;
 
 		/*
 		 * Here, the query must have exactly one FOR UPDATE/SHARE reference to
 		 * the target table, and we dig the ctid info out of that.
 		 */
 		erm = NULL;
-		foreach(lc, queryDesc->estate->es_rowMarks)
+		for (i = 0; i < queryDesc->estate->es_range_table_size; i++)
 		{
-			ExecRowMark *thiserm = (ExecRowMark *) lfirst(lc);
+			ExecRowMark *thiserm = queryDesc->estate->es_rowmarks[i];
 
-			if (!RowMarkRequiresRowShareLock(thiserm->markType))
+			if (thiserm == NULL ||
+				!RowMarkRequiresRowShareLock(thiserm->markType))
 				continue;		/* ignore non-FOR UPDATE/SHARE items */
 
 			if (thiserm->relid == table_oid)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b6abad5..0a766ef 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -909,61 +909,68 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	}
 
 	/*
-	 * Next, build the ExecRowMark list from the PlanRowMark(s), if any.
+	 * Next, build the ExecRowMark array from the PlanRowMark(s), if any.
 	 */
-	estate->es_rowMarks = NIL;
-	foreach(l, plannedstmt->rowMarks)
+	if (plannedstmt->rowMarks)
 	{
-		PlanRowMark *rc = (PlanRowMark *) lfirst(l);
-		Oid			relid;
-		Relation	relation;
-		ExecRowMark *erm;
+		estate->es_rowmarks = (ExecRowMark **)
+			palloc0(estate->es_range_table_size * sizeof(ExecRowMark *));
+		foreach(l, plannedstmt->rowMarks)
+		{
+			PlanRowMark *rc = (PlanRowMark *) lfirst(l);
+			Oid			relid;
+			Relation	relation;
+			ExecRowMark *erm;
 
-		/* ignore "parent" rowmarks; they are irrelevant at runtime */
-		if (rc->isParent)
-			continue;
+			/* ignore "parent" rowmarks; they are irrelevant at runtime */
+			if (rc->isParent)
+				continue;
 
-		/* get relation's OID (will produce InvalidOid if subquery) */
-		relid = exec_rt_fetch(rc->rti, estate)->relid;
+			/* get relation's OID (will produce InvalidOid if subquery) */
+			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-		/* open relation, if we need to access it for this mark type */
-		switch (rc->markType)
-		{
-			case ROW_MARK_EXCLUSIVE:
-			case ROW_MARK_NOKEYEXCLUSIVE:
-			case ROW_MARK_SHARE:
-			case ROW_MARK_KEYSHARE:
-			case ROW_MARK_REFERENCE:
-				relation = ExecGetRangeTableRelation(estate, rc->rti);
-				break;
-			case ROW_MARK_COPY:
-				/* no physical table access is required */
-				relation = NULL;
-				break;
-			default:
-				elog(ERROR, "unrecognized markType: %d", rc->markType);
-				relation = NULL;	/* keep compiler quiet */
-				break;
-		}
+			/* open relation, if we need to access it for this mark type */
+			switch (rc->markType)
+			{
+				case ROW_MARK_EXCLUSIVE:
+				case ROW_MARK_NOKEYEXCLUSIVE:
+				case ROW_MARK_SHARE:
+				case ROW_MARK_KEYSHARE:
+				case ROW_MARK_REFERENCE:
+					relation = ExecGetRangeTableRelation(estate, rc->rti);
+					break;
+				case ROW_MARK_COPY:
+					/* no physical table access is required */
+					relation = NULL;
+					break;
+				default:
+					elog(ERROR, "unrecognized markType: %d", rc->markType);
+					relation = NULL;	/* keep compiler quiet */
+					break;
+			}
 
-		/* Check that relation is a legal target for marking */
-		if (relation)
-			CheckValidRowMarkRel(relation, rc->markType);
-
-		erm = (ExecRowMark *) palloc(sizeof(ExecRowMark));
-		erm->relation = relation;
-		erm->relid = relid;
-		erm->rti = rc->rti;
-		erm->prti = rc->prti;
-		erm->rowmarkId = rc->rowmarkId;
-		erm->markType = rc->markType;
-		erm->strength = rc->strength;
-		erm->waitPolicy = rc->waitPolicy;
-		erm->ermActive = false;
-		ItemPointerSetInvalid(&(erm->curCtid));
-		erm->ermExtra = NULL;
-
-		estate->es_rowMarks = lappend(estate->es_rowMarks, erm);
+			/* Check that relation is a legal target for marking */
+			if (relation)
+				CheckValidRowMarkRel(relation, rc->markType);
+
+			erm = (ExecRowMark *) palloc(sizeof(ExecRowMark));
+			erm->relation = relation;
+			erm->relid = relid;
+			erm->rti = rc->rti;
+			erm->prti = rc->prti;
+			erm->rowmarkId = rc->rowmarkId;
+			erm->markType = rc->markType;
+			erm->strength = rc->strength;
+			erm->waitPolicy = rc->waitPolicy;
+			erm->ermActive = false;
+			ItemPointerSetInvalid(&(erm->curCtid));
+			erm->ermExtra = NULL;
+
+			Assert(erm->rti > 0 && erm->rti <= estate->es_range_table_size &&
+				   estate->es_rowmarks[erm->rti - 1] == NULL);
+
+			estate->es_rowmarks[erm->rti - 1] = erm;
+		}
 	}
 
 	/*
@@ -2394,13 +2401,12 @@ ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo)
 ExecRowMark *
 ExecFindRowMark(EState *estate, Index rti, bool missing_ok)
 {
-	ListCell   *lc;
-
-	foreach(lc, estate->es_rowMarks)
+	if (rti > 0 && rti <= estate->es_range_table_size &&
+		estate->es_rowmarks != NULL)
 	{
-		ExecRowMark *erm = (ExecRowMark *) lfirst(lc);
+		ExecRowMark *erm = estate->es_rowmarks[rti - 1];
 
-		if (erm->rti == rti)
+		if (erm)
 			return erm;
 	}
 	if (!missing_ok)
@@ -3131,6 +3137,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	estate->es_range_table_array = parentestate->es_range_table_array;
 	estate->es_range_table_size = parentestate->es_range_table_size;
 	estate->es_relations = parentestate->es_relations;
+	estate->es_rowmarks = parentestate->es_rowmarks;
 	estate->es_plannedstmt = parentestate->es_plannedstmt;
 	estate->es_junkFilter = parentestate->es_junkFilter;
 	estate->es_output_cid = parentestate->es_output_cid;
@@ -3148,7 +3155,6 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	}
 	/* es_result_relation_info must NOT be copied */
 	/* es_trig_target_relations must NOT be copied */
-	estate->es_rowMarks = parentestate->es_rowMarks;
 	estate->es_top_eflags = parentestate->es_top_eflags;
 	estate->es_instrument = parentestate->es_instrument;
 	/* es_auxmodifytables must NOT be copied */
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index d98e90e..71c6b5d 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -113,6 +113,7 @@ CreateExecutorState(void)
 	estate->es_range_table_array = NULL;
 	estate->es_range_table_size = 0;
 	estate->es_relations = NULL;
+	estate->es_rowmarks = NULL;
 
 	estate->es_plannedstmt = NULL;
 	estate->es_junkFilter = NULL;
@@ -142,8 +143,6 @@ CreateExecutorState(void)
 
 	estate->es_tupleTable = NIL;
 
-	estate->es_rowMarks = NIL;
-
 	estate->es_processed = 0;
 	estate->es_lastoid = InvalidOid;
@@ -709,6 +708,12 @@ ExecInitRangeTable(EState *estate, List *rangeTable)
 	 */
 	estate->es_relations = (Relation *)
 		palloc0(estate->es_range_table_size * sizeof(Relation));
+
+	/*
+	 * es_rowmarks is also parallel to the es_range_table_array, but it's
+	 * allocated only if needed.
+	 */
+	estate->es_rowmarks = NULL;
 }
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 657b593..8347089 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -34,6 +34,7 @@ struct PlanState;
 
 /* forward references in this file */
 struct ParallelHashJoinState;
+struct ExecRowMark;
 struct ExprState;
 struct ExprContext;
 struct RangeTblEntry;			/* avoid including parsenodes.h here */
@@ -491,6 +492,8 @@ typedef struct EState
 	Index		es_range_table_size;	/* size of the range table arrays */
 	Relation   *es_relations;	/* Array of per-range-table-entry Relation
 								 * pointers, or NULL if not yet opened */
+	struct ExecRowMark **es_rowmarks;	/* Array of per-range-table-entry
+										 * ExecRowMarks, or NULL if none */
 	PlannedStmt *es_plannedstmt;	/* link to top of plan tree */
 	const char *es_sourceText;	/* Source text from QueryDesc */
@@ -537,8 +540,6 @@ typedef struct EState
 
 	List	   *es_tupleTable;	/* List of TupleTableSlots */
 
-	List	   *es_rowMarks;	/* List of ExecRowMarks */
-
 	uint64		es_processed;	/* # of tuples processed */
 	Oid			es_lastoid;		/* last oid processed (by INSERT) */
@@ -607,7 +608,9 @@ typedef struct EState
  * node that sources the relation (e.g., for a foreign table the FDW can use
  * ermExtra to hold information).
  *
- * EState->es_rowMarks is a list of these structs.
+ * EState->es_rowmarks is an array of these structs, indexed by RT index,
+ * with NULLs for irrelevant RT indexes.  es_rowmarks itself is NULL if
+ * there are no rowmarks.
  */
 typedef struct ExecRowMark
 {
@@ -629,7 +632,7 @@ typedef struct ExecRowMark
  * additional runtime representation of FOR [KEY] UPDATE/SHARE clauses
 *
 * Each LockRows and ModifyTable node keeps a list of the rowmarks it needs to
- * deal with.  In addition to a pointer to the related entry in es_rowMarks,
+ * deal with.  In addition to a pointer to the related entry in es_rowmarks,
 * this struct carries the column number(s) of the resjunk columns associated
 * with the rowmark (see comments for PlanRowMark for more detail).  In the
 * case of ModifyTable, there has to be a separate ExecAuxRowMark list for
@@ -638,7 +641,7 @@
  */
 typedef struct ExecAuxRowMark
 {
-	ExecRowMark *rowmark;		/* related entry in es_rowMarks */
+	ExecRowMark *rowmark;		/* related entry in es_rowmarks */
 	AttrNumber	ctidAttNo;		/* resno of ctid junk attribute, if any */
 	AttrNumber	toidAttNo;		/* resno of tableoid junk attribute, if any */
 	AttrNumber	wholeAttNo;		/* resno of whole-row junk attribute, if any */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0a766ef..1775e77 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -919,7 +919,6 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		{
 			PlanRowMark *rc = (PlanRowMark *) lfirst(l);
 			Oid			relid;
-			Relation	relation;
 			ExecRowMark *erm;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
@@ -929,32 +928,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			/* get relation's OID (will produce InvalidOid if subquery) */
 			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-			/* open relation, if we need to access it for this mark type */
-			switch (rc->markType)
-			{
-				case ROW_MARK_EXCLUSIVE:
-				case ROW_MARK_NOKEYEXCLUSIVE:
-				case ROW_MARK_SHARE:
-				case ROW_MARK_KEYSHARE:
-				case ROW_MARK_REFERENCE:
-					relation = ExecGetRangeTableRelation(estate, rc->rti);
-					break;
-				case ROW_MARK_COPY:
-					/* no physical table access is required */
-					relation = NULL;
-					break;
-				default:
-					elog(ERROR, "unrecognized markType: %d", rc->markType);
-					relation = NULL;	/* keep compiler quiet */
-					break;
-			}
-
-			/* Check that relation is a legal target for marking */
-			if (relation)
-				CheckValidRowMarkRel(relation, rc->markType);
-
 			erm = (ExecRowMark *) palloc(sizeof(ExecRowMark));
-			erm->relation = relation;
 			erm->relid = relid;
 			erm->rti = rc->rti;
 			erm->prti = rc->prti;
@@ -963,6 +937,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
+			erm->ermChecked = false;
 			ItemPointerSetInvalid(&(erm->curCtid));
 			erm->ermExtra = NULL;
 
@@ -2462,6 +2437,20 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 	return aerm;
 }
 
+Relation
+ExecRowMarkGetRelation(EState *estate, ExecRowMark *erm)
+{
+	Relation	relation = ExecGetRangeTableRelation(estate, erm->rti);
+
+	/* Check that relation is a legal target for marking, if we haven't */
+	if (!erm->ermChecked)
+	{
+		CheckValidRowMarkRel(relation, erm->markType);
+		erm->ermChecked = true;
+	}
+	return relation;
+}
+
 
 /*
  * EvalPlanQual logic --- recheck modified tuple(s) to see if we want to
@@ -2935,10 +2924,9 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate)
 
 		if (erm->markType == ROW_MARK_REFERENCE)
 		{
+			Relation	relation = ExecRowMarkGetRelation(epqstate->estate, erm);
 			HeapTuple	copyTuple;
 
-			Assert(erm->relation != NULL);
-
 			/* fetch the tuple's ctid */
 			datum = ExecGetJunkAttribute(epqstate->origslot,
 										 aerm->ctidAttNo,
@@ -2948,18 +2936,18 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate)
 				continue;
 
 			/* fetch requests on foreign tables must be passed to their FDW */
-			if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+			if (relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 			{
 				FdwRoutine *fdwroutine;
 				bool		updated = false;
 
-				fdwroutine = GetFdwRoutineForRelation(erm->relation, false);
+				fdwroutine = GetFdwRoutineForRelation(relation, false);
 				/* this should have been checked already, but let's be safe */
 				if (fdwroutine->RefetchForeignRow == NULL)
 					ereport(ERROR,
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot lock rows in foreign table \"%s\"",
-									RelationGetRelationName(erm->relation))));
+									RelationGetRelationName(relation))));
 				copyTuple = fdwroutine->RefetchForeignRow(epqstate->estate,
 														  erm,
 														  datum,
@@ -2979,7 +2967,7 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate)
 			Buffer		buffer;
 
 			tuple.t_self = *((ItemPointer) DatumGetPointer(datum));
-			if (!heap_fetch(erm->relation, SnapshotAny, &tuple, &buffer,
+			if (!heap_fetch(relation, SnapshotAny, &tuple, &buffer,
 							false, NULL))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 6db345a..4e161c0 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -74,6 +74,7 @@ lnext:
 	{
 		ExecAuxRowMark *aerm = (ExecAuxRowMark *) lfirst(lc);
 		ExecRowMark *erm = aerm->rowmark;
+		Relation	relation;
 		HeapTuple  *testTuple;
 		Datum		datum;
 		bool		isNull;
@@ -122,19 +123,22 @@ lnext:
 		if (isNull)
 			elog(ERROR, "ctid is NULL");
 
+		/* access the tuple's relation */
+		relation = ExecRowMarkGetRelation(estate, erm);
+
 		/* requests for foreign tables must be passed to their FDW */
-		if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+		if (relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
 			FdwRoutine *fdwroutine;
 			bool		updated = false;
 
-			fdwroutine = GetFdwRoutineForRelation(erm->relation, false);
+			fdwroutine = GetFdwRoutineForRelation(relation, false);
 			/* this should have been checked already, but let's be safe */
 			if (fdwroutine->RefetchForeignRow == NULL)
 				ereport(ERROR,
 						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 						 errmsg("cannot lock rows in foreign table \"%s\"",
-								RelationGetRelationName(erm->relation))));
+								RelationGetRelationName(relation))));
 			copyTuple = fdwroutine->RefetchForeignRow(estate,
 													  erm,
 													  datum,
@@ -180,7 +184,7 @@ lnext:
 			break;
 		}
 
-		test = heap_lock_tuple(erm->relation, &tuple,
+		test = heap_lock_tuple(relation, &tuple,
 							   estate->es_output_cid,
 							   lockmode, erm->waitPolicy, true,
 							   &buffer, &hufd);
hufd.xmax); @@ -286,6 +290,7 @@ lnext: { ExecAuxRowMark *aerm = (ExecAuxRowMark *) lfirst(lc); ExecRowMark *erm = aerm->rowmark; + Relation relation; HeapTupleData tuple; Buffer buffer; @@ -309,13 +314,16 @@ lnext: continue; } + /* access the tuple's relation */ + relation = ExecRowMarkGetRelation(estate, erm); + /* foreign tables should have been fetched above */ - Assert(erm->relation->rd_rel->relkind != RELKIND_FOREIGN_TABLE); + Assert(relation->rd_rel->relkind != RELKIND_FOREIGN_TABLE); Assert(ItemPointerIsValid(&(erm->curCtid))); /* okay, fetch the tuple */ tuple.t_self = erm->curCtid; - if (!heap_fetch(erm->relation, SnapshotAny, &tuple, &buffer, + if (!heap_fetch(relation, SnapshotAny, &tuple, &buffer, false, NULL)) elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck"); diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index c9ed198..4dbbb3b 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -190,6 +190,7 @@ extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo, extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo); extern ExecRowMark *ExecFindRowMark(EState *estate, Index rti, bool missing_ok); extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist); +extern Relation ExecRowMarkGetRelation(EState *estate, ExecRowMark *erm); extern TupleTableSlot *EvalPlanQual(EState *estate, EPQState *epqstate, Relation relation, Index rti, int lockmode, ItemPointer tid, TransactionId priorXmax); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 8347089..e2cd7ea 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -597,16 +597,20 @@ typedef struct EState * ExecRowMark - * runtime representation of FOR [KEY] UPDATE/SHARE clauses * - * When doing UPDATE, DELETE, or SELECT FOR [KEY] UPDATE/SHARE, we will have an - * ExecRowMark for each non-target relation in the query (except 
inheritance - * parent RTEs, which can be ignored at runtime). Virtual relations such as - * subqueries-in-FROM will have an ExecRowMark with relation == NULL. See - * PlanRowMark for details about most of the fields. In addition to fields - * directly derived from PlanRowMark, we store an activity flag (to denote - * inactive children of inheritance trees), curCtid, which is used by the - * WHERE CURRENT OF code, and ermExtra, which is available for use by the plan - * node that sources the relation (e.g., for a foreign table the FDW can use - * ermExtra to hold information). + * When doing UPDATE, DELETE, or SELECT FOR [KEY] UPDATE/SHARE, we will have + * an ExecRowMark for each non-target relation in the query (except + * inheritance parent RTEs, which can be ignored at runtime). See PlanRowMark + * for details about most of the fields. In addition to fields directly + * derived from PlanRowMark, we store an activity flag (to denote inactive + * children of inheritance trees); a "checked" flag (to denote whether we've + * verified the relation is of appropriate type to be marked); curCtid, which + * is used by the WHERE CURRENT OF code; and ermExtra, which is available for + * use by the plan node that sources the relation (e.g., for a foreign table + * the FDW can use ermExtra to hold information). + * + * For performance reasons, we don't open or verify the relkind of tables + * that are never accessed in a query; this is important in partitioned + * tables with many partitions. Use ExecRowMarkGetRelation to do that. * * EState->es_rowmarks is an array of these structs, indexed by RT index, * with NULLs for irrelevant RT indexes. 
es_rowmarks itself is NULL if @@ -614,8 +618,7 @@ typedef struct EState */ typedef struct ExecRowMark { - Relation relation; /* opened and suitably locked relation */ - Oid relid; /* its OID (or InvalidOid, if subquery) */ + Oid relid; /* relation's OID (InvalidOid if subquery) */ Index rti; /* its range table index */ Index prti; /* parent range table index, if child */ Index rowmarkId; /* unique identifier for resjunk columns */ @@ -623,6 +626,7 @@ typedef struct ExecRowMark LockClauseStrength strength; /* LockingClause's strength, or LCS_NONE */ LockWaitPolicy waitPolicy; /* NOWAIT and SKIP LOCKED */ bool ermActive; /* is this mark relevant for current tuple? */ + bool ermChecked; /* have we verified relation's relkind? */ ItemPointerData curCtid; /* ctid of currently locked tuple, if any */ void *ermExtra; /* available for use by relation source node */ } ExecRowMark;
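The ermChecked flag added above implements a check-once-on-first-access pattern: the relkind validation formerly done eagerly in InitPlan now happens the first time ExecRowMarkGetRelation is called. A minimal self-contained C sketch of that pattern (the types and names here are simplified stand-ins, not PostgreSQL's actual structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for Relation and ExecRowMark; illustrative only. */
typedef struct { int relkind_checks; } RelationModel;

typedef struct
{
    RelationModel *rel;         /* stands in for the RT-indexed relation */
    bool           ermChecked;  /* have we verified the relkind yet? */
} ExecRowMarkModel;

/*
 * Modeled after ExecRowMarkGetRelation: fetch the relation, validating
 * that it is a legal target for marking only on the first call.
 */
static RelationModel *
row_mark_get_relation(ExecRowMarkModel *erm)
{
    if (!erm->ermChecked)
    {
        erm->rel->relkind_checks++;  /* stands in for CheckValidRowMarkRel */
        erm->ermChecked = true;
    }
    return erm->rel;
}

/* Driver: returns 0 if the check ran exactly once across repeated calls. */
static int
demo_check_once(void)
{
    RelationModel rel = {0};
    ExecRowMarkModel erm = {&rel, false};

    row_mark_get_relation(&erm);
    row_mark_get_relation(&erm);
    row_mark_get_relation(&erm);
    return (rel.relkind_checks == 1) ? 0 : 1;
}
```

The payoff is that rowmarks whose relations are never accessed (e.g., pruned partitions) pay neither the relcache open nor the validation cost.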
On 8 October 2018 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > However, we should keep in mind that without partitioning overhead > (ie "select * from lt_999 where b = 999 for share"), the TPS rate > is over 25800 tps. Most of the overhead in the partitioned case seems > to be from acquiring locks on rangetable entries that we won't ever > use, and none of these patch variants are touching that problem. > So ISTM that the *real* win for this scenario is going to come from > teaching the system to prune unwanted relations from the query > altogether, not just from the PlanRowMark list. Idle thought: I wonder if we could add another field to the RangeTblEntry; "delaylock". Set that to true in the planner for all other_member rels that are partitions then not obtain locks on those during AcquireExecutorLocks(). Instead, grab the lock in ExecGetRangeTableRelation() the first time through. We'd still obtain the lock for the table named in the query at the normal time so cached plans could properly be invalidated. We'd need to ensure that anything that could be changed in the partitions to cause a plan to become invalid properly obtains a lock on the partitioned table, all the way to the top of the hierarchy. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
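The delaylock idea above amounts to: skip some RTEs in AcquireExecutorLocks(), then take the lock lazily, exactly once, on first access in ExecGetRangeTableRelation(). A hedged, self-contained C model of that behavior (the delaylock flag, the lock counter, and the function names are all illustrative, not actual PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of an RTE carrying the proposed "delaylock" flag. */
typedef struct
{
    bool delaylock;   /* proposed: defer locking past AcquireExecutorLocks */
    bool locked;
    int  lock_count;  /* how many times we "locked" this RTE */
} RteModel;

/* Models AcquireExecutorLocks(): lock everything not marked delaylock. */
static void
acquire_executor_locks(RteModel *rtes, int nrtes)
{
    for (int i = 0; i < nrtes; i++)
    {
        if (!rtes[i].delaylock)
        {
            rtes[i].locked = true;
            rtes[i].lock_count++;
        }
    }
}

/* Models ExecGetRangeTableRelation(): take the delayed lock on first use. */
static void
get_range_table_relation(RteModel *rte)
{
    if (!rte->locked)
    {
        rte->locked = true;
        rte->lock_count++;
    }
}

/*
 * Driver: the table named in the query is locked up front (so cached plans
 * can still be invalidated); a partition the executor touches is locked
 * exactly once however often it is accessed; a pruned partition is never
 * locked at all.  Returns 0 on the expected behavior.
 */
static int
demo_delaylock(void)
{
    RteModel rtes[3] = {
        {false, false, 0},  /* table named in the query */
        {true,  false, 0},  /* partition the executor touches */
        {true,  false, 0},  /* partition pruned at runtime */
    };

    acquire_executor_locks(rtes, 3);
    get_range_table_relation(&rtes[1]);
    get_range_table_relation(&rtes[1]);

    if (rtes[0].lock_count != 1) return 1;
    if (rtes[1].lock_count != 1) return 2;
    if (rtes[2].lock_count != 0) return 3;
    return 0;
}
```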
David Rowley <david.rowley@2ndquadrant.com> writes: > On 8 October 2018 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So ISTM that the *real* win for this scenario is going to come from >> teaching the system to prune unwanted relations from the query >> altogether, not just from the PlanRowMark list. > Idle thought: I wonder if we could add another field to the > RangeTblEntry; "delaylock". Set that to true in the planner for all > other_member rels that are partitions then not obtain locks on those > during AcquireExecutorLocks(). Instead, grab the lock in > ExecGetRangeTableRelation() the first time through. Hmm, I'm afraid that's not terribly safe, unless there are ALTER TABLE restrictions on partitions that I'm not aware of. For instance, a leaf partition might have an index it didn't inherit from the parent, and the query plan might be intending to use that index. If you don't take lock on the leaf, you might not know that the index has been dropped so that a new plan is needed. The idea I had in mind was to allow hard pruning of any leaf that's been excluded *at plan time* based on partition constraints seen in its parent rel(s). That should be safe enough as long as we take locks on all the non-leaf rels --- then we'd know about any change in the constraint situation. Rather than coping with renumbering the RTEs, it might be easiest to invent an "RTE_DUMMY" RTE type that a hard-pruned RTE could be changed to. regards, tom lane
On 8 October 2018 at 13:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The idea I had in mind was to allow hard pruning of any leaf that's > been excluded *at plan time* based on partition constraints seen in > its parent rel(s). That should be safe enough as long as we take > locks on all the non-leaf rels --- then we'd know about any change > in the constraint situation. > > Rather than coping with renumbering the RTEs, it might be easiest > to invent an "RTE_DUMMY" RTE type that a hard-pruned RTE could be > changed to. The problem with that is that, if we get [1] done in PG12, then the RTE_DUMMYs would not exist, as we'd only have RTEs in the range table for partitions that survived plan-time pruning. It also leaves a problem to solve in the unneeded locks being taken on partitions for PREPAREd queries using run-time pruning. [1] https://commitfest.postgresql.org/20/1778/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I wrote: > Keeping that comparison in mind, I'm inclined to think that 0001 > is the best thing to do for now. The incremental win from 0002 > is not big enough to justify the API break it creates, while your > 0005 is not really attacking the problem the right way. I've pushed 0001 now. I believe that closes out all the patches discussed in this thread, so I've marked the CF entry committed. Thanks for all the hard work! regards, tom lane
On 2018/10/07 3:59, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2018/10/05 5:59, Tom Lane wrote: >>> So I'm inclined to just omit 0003. AFAICS this would only mean that >>> we couldn't drop the global PlanRowMarks list from PlannedStmt, which >>> does not bother me much. > >> To be honest, I too had begun to fail to see the point of this patch since >> yesterday. In fact, getting this one to pass make check-world took a bit >> of head-scratching due to its interaction with EvalPlanQuals EState >> building, so I was almost to drop it from the series. But I felt that it >> might still be a good idea to get rid of what was described as special >> case code. Reading your argument against it based on how it messes up >> some of the assumptions regarding ExecRowMark design, I'm willing to let >> go of this one as more or less a cosmetic improvement. > > OK. We do need to fix the comments in InitPlan that talk about acquiring > locks and how that makes us need to do things in a particular order, but > I'll go take care of that. Thanks for doing that. >> So, that leaves us with only the patch that revises the plan node fields >> in light of having relieved the executor of doing any locking of its own >> in *some* cases. That patch's premise is that we don't need the fields >> that the patch removes because they're only referenced for the locking >> purposes. But, if those plan nodes support parallel execution and hence >> will be passed to parallel workers for execution who will need to lock the >> tables contained in the plan nodes, then they better contain all the >> information needed for locking *every* affected tables, so we had better >> not removed it. > > No, I think this is unduly pessimistic. We need to make sure that a > parallel worker has lock on tables it's actually touching, but I don't > see why that should imply a requirement to hold lock on parent tables > it never touches. 
> > The reasons why we need locks on tables not physically accessed by the > query are (a) to ensure that we've blocked, or received sinval messages > for, any DDL related to views or partition parent tables, in case that > would invalidate the plan; (b) to allow firing triggers safely, in > the case of partition parent tables. Neither of these issues apply to > a parallel worker -- the plan is already frozen before it can ever > start, and it isn't going to be firing any triggers either. > > In particular, I think it's fine to get rid of > ExecLockNonLeafAppendTables. In the parent, that's clearly useless code > now: we have already locked *every* RTE_RELATION entry in the rangetable, > either when the parser/rewriter/planner added the RTE to the list to begin > with, or during AcquireExecutorLocks if the plan was retrieved from the > plancache. In a child it seems unnecessary as long as we're locking the > leaf rels we actually touch. > > Possibly, we might fix the problem of inadequate locking in worker > processes by having them do the equivalent of AcquireExecutorLocks, ie > just run through the whole RT list and lock everything. I think we'd soon > end up doing that if we ever try to allow parallelization of non-read-only > queries; but that's a long way off AFAIK. For read-only queries it seems > like it'll be fine if we just rejigger ExecGetRangeTableRelation to take a > lock when IsParallelWorker. Okay, thanks for the explanation. It's now clearer to me why parallel workers don't really need to lock non-leaf relations. Thanks, Amit
On 2018/10/08 3:55, Tom Lane wrote: > I didn't like the idea of unifying ModifyTable.nominalRelation with > the partition root info. Those fields serve different masters --- > nominalRelation, at least in its original intent, is only meant for > use of EXPLAIN and might have nothing to do with what happens at > execution. So even though unifying them would work today, we might > regret it down the line. Instead I left that field alone and added > a separate rootRelation field to carry the partition root RT index, > which ends up being the same number of fields anyway since we don't > need a flag for is-the-nominal-relation-a-partition-root. Thanks for pushing that. I'd also named it 'rootRelation' in my original patch before David had objected to calling it that, because a command may not specify the "actual" root of a partition tree; it could be a non-root partitioned table. He'd suggested 'partitionedTarget' for the new field [1], stressing the "target" part. Maybe 'rootRelation' isn't too confusing though. Thanks, Amit [1] https://www.postgresql.org/message-id/CAKJS1f_M0jkgL-d%3Dk-rf6TMzghATDmZ67nzja1tz4h3G%3D27e7Q%40mail.gmail.com
On 2018/10/09 0:38, Tom Lane wrote: > I wrote: >> Keeping that comparison in mind, I'm inclined to think that 0001 >> is the best thing to do for now. The incremental win from 0002 >> is not big enough to justify the API break it creates, while your >> 0005 is not really attacking the problem the right way. > > I've pushed 0001 now. I believe that closes out all the patches > discussed in this thread, so I've marked the CF entry committed. > Thanks for all the hard work! Thanks a lot for reviewing and committing. Regards, Amit
On 2018/10/08 9:29, David Rowley wrote: > On 8 October 2018 at 13:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The idea I had in mind was to allow hard pruning of any leaf that's >> been excluded *at plan time* based on partition constraints seen in >> its parent rel(s). That should be safe enough as long as we take >> locks on all the non-leaf rels --- then we'd know about any change >> in the constraint situation. >> >> Rather than coping with renumbering the RTEs, it might be easiest >> to invent an "RTE_DUMMY" RTE type that a hard-pruned RTE could be >> changed to. > > The problem with that is that, if we get [1] done in PG12, then the > RTE_DUMMYs would not exist, as we'd only have RTEs in the range table > for partitions that survived plan-time pruning. ... > [1] https://commitfest.postgresql.org/20/1778/ Yeah, the patch proposed at [1] postpones creating partition RTEs (hence locking them) to a point after pruning, which also means we create only the necessary RTEs. In fact, it's not just the RTEs, but child PlanRowMarks, whose creation is postponed to after pruning. So, I admitted upthread that my proposed patch here would only add code that will become useless if we're able to get [1] in. Thanks, Amit
On 2018/10/08 8:18, Tom Lane wrote: > I wrote: >> Still need to think a bit more about whether we want 0005 in >> anything like its current form. > > So I poked at that for a bit, and soon realized that the *main* problem > there is that ExecFindRowMark() eats O(N^2) time due to repeated searches > of the es_rowMarks list. Yeah, I proposed the patch only because of stumbling across O(N^2) behavior, but went to solve it with the planner hack, which I agree is an inferior solution overall. > While the patch as stated would improve that > for cases where most of the partitions can be pruned at plan time, it > does nothing much for cases where they can't. However, it's pretty > trivial to fix that: let's just use an array not a list. Patch 0001 > attached does that. > > A further improvement we could consider is to avoid opening the relcache > entries for pruned-away relations. I could not find a way to do that > that was less invasive than removing ExecRowMark.relation and requiring > callers to call a new function to get the relation if they need it. > Patch 0002 attached is a delta on top of 0001 that does that. > > Replicating your select-lt-for-share test case as best I can (you never > actually specified it carefully), I find that the TPS rate on my > workstation goes from about 250 tps with HEAD to 920 tps with patch 0001 > or 1130 tps with patch 0002. This compares to about 1600 tps for the > non-FOR-SHARE version of the query. > > However, we should keep in mind that without partitioning overhead > (ie "select * from lt_999 where b = 999 for share"), the TPS rate > is over 25800 tps. Most of the overhead in the partitioned case seems > to be from acquiring locks on rangetable entries that we won't ever > use, and none of these patch variants are touching that problem. > So ISTM that the *real* win for this scenario is going to come from > teaching the system to prune unwanted relations from the query > altogether, not just from the PlanRowMark list. 
Agreed, which is why I mentioned the other patch [1], which gets us closer to that goal. > Keeping that comparison in mind, I'm inclined to think that 0001 > is the best thing to do for now. The incremental win from 0002 > is not big enough to justify the API break it creates, while your > 0005 is not really attacking the problem the right way. I agree that 0002's improvement is only incremental and would lose its appeal if we're able to solve the bigger problem of removing range table and other overhead when planning with large number of partitions, but that might take a while. Thanks, Amit [1] https://commitfest.postgresql.org/20/1778/
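For readers following along, the "use an array not a list" fix quoted above can be sketched as a small self-contained C model (the struct and function names are simplified stand-ins for EState/ExecRowMark/ExecFindRowMark, not the actual PostgreSQL definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for ExecRowMark; only the RT index matters here. */
typedef struct { int rti; } RowMarkModel;

/*
 * Models the committed change: instead of a list that ExecFindRowMark
 * scans (O(N) per lookup, hence O(N^2) across N lookups during executor
 * startup), keep an array indexed by RT index, with NULL entries for
 * RT indexes that carry no rowmark.
 */
typedef struct
{
    RowMarkModel **es_rowmarks;         /* NULL if the query has no rowmarks */
    int            es_range_table_size;
} EStateModel;

/* O(1) replacement for the old list scan; RT indexes are 1-based. */
static RowMarkModel *
find_row_mark(EStateModel *estate, int rti)
{
    if (estate->es_rowmarks == NULL)
        return NULL;
    assert(rti >= 1 && rti <= estate->es_range_table_size);
    return estate->es_rowmarks[rti - 1];
}

/* Driver: returns 0 if lookups hit the right marks and misses return NULL. */
static int
demo_rowmark_array(void)
{
    RowMarkModel rm2 = {2}, rm5 = {5};
    RowMarkModel *slots[5] = {NULL, &rm2, NULL, NULL, &rm5};
    EStateModel estate = {slots, 5};

    if (find_row_mark(&estate, 2) != &rm2) return 1;
    if (find_row_mark(&estate, 5) != &rm5) return 2;
    if (find_row_mark(&estate, 3) != NULL) return 3;
    return 0;
}
```

The cost is a pointer per range-table entry, which is cheap next to the per-lookup savings when there are many partitions.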
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2018/10/08 3:55, Tom Lane wrote: >> I didn't like the idea of unifying ModifyTable.nominalRelation with >> the partition root info. Those fields serve different masters --- >> nominalRelation, at least in its original intent, is only meant for >> use of EXPLAIN and might have nothing to do with what happens at >> execution. So even though unifying them would work today, we might >> regret it down the line. Instead I left that field alone and added >> a separate rootRelation field to carry the partition root RT index, >> which ends up being the same number of fields anyway since we don't >> need a flag for is-the-nominal-relation-a-partition-root. > Thanks for pushing that. I'd also named it 'rootRelation' in my original > patch before David had objected to calling it that, because a command may > not specify the "actual" root of a partition tree; it could be a non-root > partitioned table. He'd suggested 'partitionedTarget' for the new field > [1], stressing the "target" part. Maybe, 'rootRelation' isn't too > confusing though. Well, it's the root so far as the current query is concerned --- we do not take any semantic account of partitioning levels that might exist above the table named in the query, do we? regards, tom lane
On Tue, Oct 9, 2018 at 11:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> > On 2018/10/08 3:55, Tom Lane wrote:
> >> I didn't like the idea of unifying ModifyTable.nominalRelation with
> >> the partition root info.  Those fields serve different masters ---
> >> nominalRelation, at least in its original intent, is only meant for
> >> use of EXPLAIN and might have nothing to do with what happens at
> >> execution.  So even though unifying them would work today, we might
> >> regret it down the line.  Instead I left that field alone and added
> >> a separate rootRelation field to carry the partition root RT index,
> >> which ends up being the same number of fields anyway since we don't
> >> need a flag for is-the-nominal-relation-a-partition-root.
>
> > Thanks for pushing that.  I'd also named it 'rootRelation' in my original
> > patch before David had objected to calling it that, because a command may
> > not specify the "actual" root of a partition tree; it could be a non-root
> > partitioned table.  He'd suggested 'partitionedTarget' for the new field
> > [1], stressing the "target" part.  Maybe, 'rootRelation' isn't too
> > confusing though.
>
> Well, it's the root so far as the current query is concerned --- we do not
> take any semantic account of partitioning levels that might exist above
> the table named in the query, do we?

We don't, and I personally agree with the reasoning behind calling it
rootRelation.

Thanks,
Amit
On Sat, Oct 6, 2018 at 2:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The reasons why we need locks on tables not physically accessed by the
> query are (a) to ensure that we've blocked, or received sinval messages
> for, any DDL related to views or partition parent tables, in case that
> would invalidate the plan; (b) to allow firing triggers safely, in
> the case of partition parent tables.  Neither of these issues apply to
> a parallel worker -- the plan is already frozen before it can ever
> start, and it isn't going to be firing any triggers either.

That last part could *easily* change in a future release.  We've
already started to allow CTAS with parallel query, and there have
already been multiple people wanting to allow more.  It would be a
shame if we threw up additional obstacles in the way of that...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Oct 6, 2018 at 2:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The reasons why we need locks on tables not physically accessed by the
>> query are (a) to ensure that we've blocked, or received sinval messages
>> for, any DDL related to views or partition parent tables, in case that
>> would invalidate the plan; (b) to allow firing triggers safely, in
>> the case of partition parent tables.  Neither of these issues apply to
>> a parallel worker -- the plan is already frozen before it can ever
>> start, and it isn't going to be firing any triggers either.

> That last part could *easily* change in a future release.  We've
> already started to allow CTAS with parallel query, and there have
> already been multiple people wanting to allow more.  It would be a
> shame if we threw up additional obstacles in the way of that...

I hardly think that this is the most serious issue in the way of
doing non-read-only things in parallel workers.

In any case, a parallel worker would surely have to open any
relations it is going to fire triggers for.  If it gets the correct
lock when it does that, all is well.  If not, the Assert in
relation_open will complain.

			regards, tom lane
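[The "Assert in relation_open" pattern Tom refers to can be sketched standalone. This is a simplified model, not PostgreSQL's lmgr or CheckRelationLockedByMe(); the names, the tiny lock table, and the numeric treat-higher-mode-as-stronger comparison are all assumptions made for illustration. The idea is: callers that pass NoLock still get checked, in assert-enabled builds, against the locks this backend already holds.]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for a backend-local record of held relation locks. */
#define MAX_LOCKS 16

typedef struct
{
    unsigned relid;
    int      lockmode;          /* higher number = stronger, for this sketch */
} LocalLock;

static LocalLock held[MAX_LOCKS];
static int nheld = 0;

static void
lock_relation(unsigned relid, int lockmode)
{
    assert(nheld < MAX_LOCKS);
    held[nheld].relid = relid;
    held[nheld].lockmode = lockmode;
    nheld++;
}

/* Rough analogue of CheckRelationLockedByMe(): do we hold a lock on relid
 * at least as strong as lockmode? */
static bool
locked_by_me(unsigned relid, int lockmode)
{
    for (int i = 0; i < nheld; i++)
        if (held[i].relid == relid && held[i].lockmode >= lockmode)
            return true;
    return false;
}

/* Open a relation without taking a lock, but assert that an appropriate
 * lock was already taken earlier (e.g. by the parser or plancache). */
static void
open_relation_nolock(unsigned relid, int min_lockmode)
{
    assert(locked_by_me(relid, min_lockmode));
    /* ... the actual relcache open would happen here ... */
}
```

A parallel worker firing triggers would, in this scheme, be caught the same way as any other caller: if it opens a relation without having acquired the lock first, the assertion fails in debug builds.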
On Tue, Oct 9, 2018 at 2:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > That last part could *easily* change in a future release.  We've
> > already started to allow CTAS with parallel query, and there have
> > already been multiple people wanting to allow more.  It would be a
> > shame if we threw up additional obstacles in the way of that...
>
> I hardly think that this is the most serious issue in the way of
> doing non-read-only things in parallel workers.

My concern, as I said, is about adding new obstacles.

> In any case, a parallel worker would surely have to open any
> relations it is going to fire triggers for.  If it gets the correct
> lock when it does that, all is well.  If not, the Assert in
> relation_open will complain.

Well, in that case, no issues.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On Sun, Sep 30, 2018 at 7:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I think that the call sites should ultimately look like
>
>         Assert(CheckRelationLockedByMe(...));
>
> but for hunting down the places where the assertion currently fails,
> it's more convenient if it's just an elog(WARNING).

I just hit one of the asserts (in relation_open()) added in b04aeb0a053.
Here's a simple reproducer:

DROP TABLE IF EXISTS test;
CREATE TABLE test (id integer primary key);
PREPARE s AS DELETE FROM test WHERE id = 1;
EXECUTE s;
EXECUTE s;

This comes from ExecInitIndexScan() and ExecInitBitmapIndexScan(), which
open the index without lock if the parent table is a target relation:

    /*
     * Open the index relation.
     *
     * If the parent table is one of the target relations of the query, then
     * InitPlan already opened and write-locked the index, so we can avoid
     * taking another lock here.  Otherwise we need a normal reader's lock.
     */
    relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
    indexstate->iss_RelationDesc = index_open(node->indexid,
                                              relistarget ? NoLock : AccessShareLock);

And digging into InitPlan() up to ExecInitModifyTable():

    /*
     * If there are indices on the result relation, open them and save
     * descriptors in the result relation info, so that we can add new
     * index entries for the tuples we add/update.  We need not do this
     * for a DELETE, however, since deletion doesn't affect indexes.  Also,
     * inside an EvalPlanQual operation, the indexes might be open
     * already, since we share the resultrel state with the original
     * query.
     */
    if (resultRelInfo->ri_RelationDesc->rd_rel->relhasindex &&
        operation != CMD_DELETE &&
        resultRelInfo->ri_IndexRelationDescs == NULL)
        ExecOpenIndices(resultRelInfo,
                        node->onConflictAction != ONCONFLICT_NONE);

So, this is problematic with a cached plan on a DELETE query, since the
lock on the target relation's index will never be acquired (the lock is
acquired on the first execution in get_relation_info()).
This doesn't seem unsafe though, since DROP INDEX [CONCURRENTLY] will
still acquire a lock on the index relation, or wait out the query's xid
in the CONCURRENTLY case, before dropping the index.

I'm not sure what the best way to fix this problem is.  I wanted to
modify ExecInitIndexScan() and ExecInitBitmapIndexScan() to acquire the
lock for a DELETE on the target relation, however I don't think that we
have that information at this point.  Maybe just unconditionally acquire
an AccessShareLock on the index in those functions?
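[One reason unconditionally taking AccessShareLock in those functions is plausible: re-acquiring a lock this backend already holds never goes back to the shared lock manager, it only bumps a reference count in the backend-local lock table. The sketch below models that behavior in standalone C; the names and structure are illustrative only, not lmgr's actual implementation.]

```c
#include <assert.h>

#define ACCESS_SHARE_LOCK 1

/* One entry per (relation, lockmode) this backend holds. */
typedef struct
{
    unsigned relid;
    int      lockmode;
    int      nlocal;            /* local reference count */
} LockEntry;

static LockEntry table[8];
static int nentries = 0;
static int shared_acquisitions = 0;   /* trips to the "shared" lock manager */

static void
acquire_lock(unsigned relid, int lockmode)
{
    /* Already held?  Then only local bookkeeping is needed. */
    for (int i = 0; i < nentries; i++)
    {
        if (table[i].relid == relid && table[i].lockmode == lockmode)
        {
            table[i].nlocal++;
            return;
        }
    }
    /* First acquisition: this is the only path that hits shared state. */
    assert(nentries < 8);
    table[nentries++] = (LockEntry){relid, lockmode, 1};
    shared_acquisitions++;
}
```

Under this model, re-executing a cached plan that unconditionally locks the index pays only the local-table lookup on every execution after the first, so the "avoid taking another lock" optimization buys relatively little.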
Julien Rouhaud <rjuju123@gmail.com> writes:
> I just hit one of the asserts (in relation_open()) added in
> b04aeb0a053.  Here's a simple reproducer:

Yeah, I think this is the same issue being discussed in

https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us

I imagine the patch David recently posted in that thread will fix this,
but can you check, and/or review the patch?

			regards, tom lane
On Sun, Feb 10, 2019 at 12:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > I just hit one of the asserts (in relation_open()) added in
> > b04aeb0a053.  Here's a simple reproducer:
>
> Yeah, I think this is the same issue being discussed in
>
> https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us

Ah indeed, sorry I totally missed that thread.

> I imagine the patch David recently posted in that thread will fix this,
> but can you check, and/or review the patch?

I tried quickly and it does fix my test case.  It's quite late here, so
I'll review the patch tomorrow.

Thanks.