Discussion: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve efficiency
Reorder EPQ work, to fix rowmark related bugs and improve efficiency.

In ad0bda5d24ea I changed the EvalPlanQual machinery to store
substitution tuples in slot, instead of using plain HeapTuples. The
main motivation for that was that using HeapTuples will be inefficient
for future tableams.  But it turns out that that conversion was buggy
for non-locking rowmarks - the wrong tuple descriptor was used to
create the slot.

As a secondary issue 5db6df0c0 changed ExecLockRows() to begin EPQ
earlier, to allow to fetch the locked rows directly into the EPQ
slots, instead of having to copy tuples around. Unfortunately, as Tom
complained, that forces some expensive initialization to happen
earlier.

As a third issue, the test coverage for EPQ was clearly insufficient.

Fixing the first issue is unfortunately not trivial: Non-locked row
marks were fetched at the start of EPQ, and we don't have the type
information for the rowmarks available at that point. While we could
change that, it's not easy. It might be worthwhile to change that at
some point, but to fix this bug, it seems better to delay fetching
non-locking rowmarks until they're actually needed, rather than
fetching them eagerly. They're referenced at most once, and in cases
where EPQ fails, might never be referenced. Fetching them when needed
also increases locality a bit.

To be able to fetch rowmarks during execution, rather than
initialization, we need to be able to access the active EPQState, as
that contains necessary data. To do so move EPQ related data from
EState to EPQState, and, only for EStates created as part of EPQ,
reference the associated EPQState from EState.

To fix the second issue, change EPQ initialization to allow
EvalPlanQualSlot() to be used before EvalPlanQualBegin() (but
obviously still requiring EvalPlanQualInit() to have been done).

As these changes made struct EState harder to understand, e.g. by
adding multiple EStates, significantly reorder the members, and add a
lot more comments.
Also add a few more EPQ tests, including one that fails for the first
issue above. More is needed.

Reported-By: yi huang
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@mail.gmail.com
    https://postgr.es/m/24530.1562686693@sss.pgh.pa.us
Backpatch: 12-, where the EPQ changes were introduced

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/27cc7cd2bc8a5e8efc8279bc5d2a8ae42fd8ad33

Modified Files
--------------
src/backend/commands/trigger.c                 |   3 +-
src/backend/executor/execMain.c                | 424 +++++++++++++------------
src/backend/executor/execScan.c                |  66 +++-
src/backend/executor/execUtils.c               |   2 -
src/backend/executor/nodeIndexonlyscan.c       |  24 +-
src/backend/executor/nodeIndexscan.c           |  22 +-
src/backend/executor/nodeLockRows.c            |  14 +-
src/backend/executor/nodeModifyTable.c         |  11 +-
src/include/executor/executor.h                |  10 +-
src/include/nodes/execnodes.h                  |  83 ++++-
src/test/isolation/expected/eval-plan-qual.out | 273 +++++++++++++++-
src/test/isolation/specs/eval-plan-qual.spec   |  46 ++-

12 files changed, 705 insertions(+), 273 deletions(-)
On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <andres@anarazel.de> wrote:
> Reorder EPQ work, to fix rowmark related bugs and improve efficiency.

Did this cause the following failure in eval-plan-qual on REL_12_STABLE?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

--
Thomas Munro
https://enterprisedb.com
Hi,

On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <andres@anarazel.de>
>wrote:
>> Reorder EPQ work, to fix rowmark related bugs and improve efficiency.
>
>Did this cause the following failure in eval-plan-qual on
>REL_12_STABLE?
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

Argh, yes, probably. But it's just isolationtester I think, not a backend bug. Need to backport the recent fix in master to at least 12.

Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <andres@anarazel.de>
>> Did this cause the following failure in eval-plan-qual on
>> REL_12_STABLE?
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

Just saw that :-(

> Argh, yes, probably. But it's just isolationtester I think, not a backend bug. Need to backport the recent fix in master to at least 12.

> Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

I think the first question is why is only prairiedog showing this.
Let's not panic until we understand that.

prairiedog looks to be about 20 minutes away from finishing the
isolation tests for master, so that will be a useful data point
as to whether there's any 12/HEAD difference.  I'll dig deeper
as soon as that run finishes.

			regards, tom lane
Hi,

On September 9, 2019 11:01:28 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro
><thomas.munro@gmail.com> wrote:
>>> On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <andres@anarazel.de>
>>> Did this cause the following failure in eval-plan-qual on
>>> REL_12_STABLE?
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41
>
>Just saw that :-(
>
>> Argh, yes, probably. But it's just isolationtester I think, not a backend bug. Need to backport the recent fix in master to at least 12.
>> Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?
>
>I think the first question is why is only prairiedog showing this.

We only saw the similar failure on master on a very small number of machines as well. Given the somewhat unusual way packets have to go out to be likely to be problematic, that's not too surprising.

>Let's not panic until we understand that.

If it's the packet ordering thing in isolationtester, then I think there's no need to panic, personally. And the output does look a lot like that.

>prairiedog looks to be about 20 minutes away from finishing the
>isolation tests for master, so that will be a useful data point
>as to whether there's any 12/HEAD difference. I'll dig deeper
>as soon as that run finishes.

Cool.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On September 9, 2019 11:01:28 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the first question is why is only prairiedog showing this.

> We only saw the similar failure on master on a very small number of machines as well. Given the somewhat unusual way packets have to go out to be likely to be problematic, that's not too surprising.

Actually, now that I look at the prior discussion, it was *only*
prairiedog that we saw that case on.  However, in the last little
while mandrill and bowerbird have come up with failures more or
less like prairiedog's.  Interestingly, bowerbird passed that test
the first time, so it's intermittent there.

I did just see prairiedog get past eval-plan-qual in master, though
it won't finish that run for a good while yet.

It appears to me that this is indeed a case of notice-reporting
timing problems in isolationtester, since these WARNING messages
are handled the same way as notices.  I want to try to reproduce
manually on prairiedog to confirm that, but it seems like a pretty
likely explanation.

If that is the explanation, I agree that there's no need for a
panic response (ie re-wrapping the beta4 tarball).  I doubt that
very many people are going to be testing beta4 on machines that
are slow enough to observe this issue --- and if there are people
testing on slow machines, they probably aren't masochistic enough
to be doing check-world, anyway.

>>> Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

As for that, now that we realize that this applies to more than
just NOTICEs, I think we should back-patch the code change in
30717637c at least to v11, maybe all the way.  I don't see any
WARNINGs in the isolation expected files before v11, but it
hardly seems unlikely that we might back-patch some future test
that expects those to be printed in a consistent way.
The case for back-patching ebd499282 (allow NOTICEs to print)
is weaker, but it still seems like it might be a hazard for
back-patching test cases if we don't do so.

On balance I'm inclined to back-patch both changes.  Thoughts?

			regards, tom lane
On 2019-Sep-09, Tom Lane wrote:

> As for that, now that we realize that this applies to more than
> just NOTICEs, I think we should back-patch the code change in
> 30717637c at least to v11, maybe all the way.  I don't see any
> WARNINGs in the isolation expected files before v11, but it
> hardly seems unlikely that we might back-patch some future test
> that expects those to be printed in a consistent way.
>
> The case for back-patching ebd499282 (allow NOTICEs to print)
> is weaker, but it still seems like it might be a hazard for
> back-patching test cases if we don't do so.
>
> On balance I'm inclined to back-patch both changes.  Thoughts?

As well as a28e10e82e54, I suppose.  I agree with keeping the tool
similar across branches, if we're going to do this.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Sep-09, Tom Lane wrote:
>> It appears to me that this is indeed a case of notice-reporting
>> timing problems in isolationtester, since these WARNING messages
>> are handled the same way as notices.  I want to try to reproduce
>> manually on prairiedog to confirm that, but it seems like a pretty
>> likely explanation.

Yeah, seems confirmed.  I ran forty cycles of eval-plan-qual
without a failure with current HEAD.  I then reverted 30717637c,
and eval-plan-qual fell over six times out of six, with the same
diffs shown in the buildfarm report.  So that patch is definitely
a prerequisite to making this version of eval-plan-qual stable.

>> On balance I'm inclined to back-patch both changes.  Thoughts?

> As well as a28e10e82e54, I suppose.  I agree with keeping the tool
> similar across branches, if we're going to do this.

Hm, good point.  My first thought was that a28e10e82e54 is just
cosmetic, but it isn't entirely, because it suppresses notice
reports on the control connection.  That means it might actually
be a prerequisite to having stable output with ebd499282 (the
change of client_min_messages).

After reviewing the git log a little more, I'm inclined to think
we should only back-patch this stuff to 9.6, which is where 38f8bdcac
("Modify the isolation tester so that multiple sessions can wait")
and a number of follow-up patches came in.  That was enough of a
quantum jump in flexibility that it'd likely limit our ability to
back-patch tests further than that anyway.  Also I don't think the
patches mentioned here would apply without that ...

			regards, tom lane
Hi,

On 2019-09-09 22:13:22 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Sep-09, Tom Lane wrote:
> >> It appears to me that this is indeed a case of notice-reporting
> >> timing problems in isolationtester, since these WARNING messages
> >> are handled the same way as notices.  I want to try to reproduce
> >> manually on prairiedog to confirm that, but it seems like a pretty
> >> likely explanation.
>
> Yeah, seems confirmed.  I ran forty cycles of eval-plan-qual
> without a failure with current HEAD.  I then reverted 30717637c,
> and eval-plan-qual fell over six times out of six, with the same
> diffs shown in the buildfarm report.  So that patch is definitely
> a prerequisite to making this version of eval-plan-qual stable.

Phew. Thanks for testing that.

> >> On balance I'm inclined to back-patch both changes.  Thoughts?
>
> > As well as a28e10e82e54, I suppose.  I agree with keeping the tool
> > similar across branches, if we're going to do this.
>
> Hm, good point.  My first thought was that a28e10e82e54 is just
> cosmetic, but it isn't entirely, because it suppresses notice
> reports on the control connection.  That means it might actually
> be a prerequisite to having stable output with ebd499282 (the
> change of client_min_messages).
>
> After reviewing the git log a little more, I'm inclined to think
> we should only back-patch this stuff to 9.6, which is where 38f8bdcac
> ("Modify the isolation tester so that multiple sessions can wait")
> and a number of follow-up patches came in.  That was enough of a
> quantum jump in flexibility that it'd likely limit our ability to
> back-patch tests further than that anyway.  Also I don't think the
> patches mentioned here would apply without that ...

That seems like a good plan to me.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-09-09 22:13:22 -0400, Tom Lane wrote:
>> Hm, good point.  My first thought was that a28e10e82e54 is just
>> cosmetic, but it isn't entirely, because it suppresses notice
>> reports on the control connection.  That means it might actually
>> be a prerequisite to having stable output with ebd499282 (the
>> change of client_min_messages).
>>
>> After reviewing the git log a little more, I'm inclined to think
>> we should only back-patch this stuff to 9.6, which is where 38f8bdcac
>> ("Modify the isolation tester so that multiple sessions can wait")
>> and a number of follow-up patches came in.  That was enough of a
>> quantum jump in flexibility that it'd likely limit our ability to
>> back-patch tests further than that anyway.  Also I don't think the
>> patches mentioned here would apply without that ...

> That seems like a good plan to me.

Hearing no votes against, I'll go make it so.

			regards, tom lane