Discussion: Re: pgsql: Support partition pruning at execution time
Changed CC to pgsql-hackers.

Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Frankly, I don't like this.  I would rather have an
> > instrument->ntuples2 rather than these "divide this by nloops,
> > sometimes" schizoid counters.  This is already being misused by
> > ON CONFLICT (see "other_path" in show_modifytable_info).  But it
> > seems like a correct fix would require more code.
>
> The question then becomes whether to put back nfiltered3, or to do
> something more to your liking.  Think I'd vote for the latter.

Doing it properly is not a lot of code actually.  Patch attached.
ON CONFLICT is not changed by this patch, but that's a straightforward
change.

Questions:

1. Do we want to back-patch this to 10?  I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

2. Do we want to revert Andrew's test stabilization patch?  If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible".  That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero).  Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
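As a rough sketch of what the proposal amounts to -- the field set is
trimmed and the increment site invented for illustration; the attached
patch may differ:

    /*
     * Trimmed-down illustration of the proposed change; the real struct
     * Instrumentation (src/include/executor/instrument.h) also carries
     * timing and buffer-usage fields omitted here.
     */
    typedef struct Instrumentation
    {
        double      ntuples;    /* total tuples produced */
        double      ntuples2;   /* proposed: secondary, node-specific counter */
        double      nloops;     /* number of run cycles for this node */
        double      nfiltered1; /* tuples removed by scanqual or joinqual */
        double      nfiltered2; /* tuples removed by "other" quals */
    } Instrumentation;

    /*
     * An index-only scan would then count heap fetches into the generic
     * counter instead of a private IndexOnlyScanState field, e.g. in
     * IndexOnlyNext() when the visibility map forces a heap visit:
     */
    if (node->ss.ps.instrument)
        node->ss.ps.instrument->ntuples2 += 1;

Because the counter lives in the structure that parallel query already
ships from workers to the leader, per-worker values get aggregated for
free.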
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

You can't back-patch a change in struct Instrumentation; that'd be an
ABI break.  Maybe there are no third parties directly touching that
struct, but I wouldn't bet on it.

> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I did not especially like the original test output, because even without
the bug at hand, it seemed to me the number of heap fetches might vary
depending on BLCKSZ.  Given that the point of the test is just to check
partition pruning, seems like IOS vs regular indexscan isn't a critical
difference.  I'd keep Andrew's change but fix the comment.

			regards, tom lane
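The ABI hazard is concrete: an extension compiled against the release
headers bakes the struct's size and field offsets into its binary.  A
contrived, hypothetical example of third-party code that would break:

    /*
     * Hypothetical extension code compiled against the 10.x headers.
     * The array stride (sizeof(Instrumentation)) and the offsets of
     * ntuples/nloops are fixed at compile time; if a minor release
     * inserted a new field, this would silently read the wrong bytes
     * from structs allocated by the server.
     */
    static double
    total_rows_per_loop(Instrumentation *instr, int nplans)
    {
        double      sum = 0.0;

        for (int i = 0; i < nplans; i++)
            sum += instr[i].ntuples /
                (instr[i].nloops > 0 ? instr[i].nloops : 1);
        return sum;
    }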
On 11 April 2018 at 03:14, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I think it's probably a good idea to revert it once the instrumentation
is working correctly.  It appears this found a bug in that code, so it's
probably useful to keep just in case something else breaks it in the
future.

I don't think there is too much risk of instability from other sources.
There's no reason an auto-vacuum would trigger and cause a change in
heap fetches.  We only delete one row from lprt_a; that's not going to
trigger an auto-vacuum.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
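For reference, with default settings autovacuum only vacuums a table
once its dead tuples exceed

    vacuum threshold = autovacuum_vacuum_threshold
                       + autovacuum_vacuum_scale_factor * reltuples
                     = 50 + 0.2 * reltuples

so a single deleted row in these small tables stays far below the
trigger point.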
On 11 April 2018 at 03:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> 2. Do we want to revert Andrew's test stabilization patch?  If I
>> understand correctly, the problem is the inverse of what was diagnosed:
>> "any running transaction at the time of the test could prevent pages
>> from being set as all-visible".  That's correct, but the test doesn't
>> depend on pages being all-visible -- quite the contrary, it depends on
>> the pages NOT being all-visible (which is why the HeapFetches counts are
>> all non-zero).  Since the pages contain very few tuples, autovacuum
>> should never process the tables anyway.
>
> I did not especially like the original test output, because even without
> the bug at hand, it seemed to me the number of heap fetches might vary
> depending on BLCKSZ.  Given that the point of the test is just to check
> partition pruning, seems like IOS vs regular indexscan isn't a critical
> difference.  I'd keep Andrew's change but fix the comment.

hmm, I don't feel strongly about reverting or not, but if [auto-]vacuum
has not visited the table, then I don't see why BLCKSZ would have an
effect here.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

I haven't looked at this closely, but have you considered adding
bespoke code for IndexOnlyScan that works like
ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
already do rather than jamming this into struct Instrumentation?

I'm inclined to view any node-specific instrumentation that's not being
pulled back to the leader as a rough edge to be filed down when it
bothers somebody more than an outright bug, but perhaps that is an
unduly lenient view.  Still, if we take the view that it's an outright
bug, I suspect we find that there may be at least a few more of those.

I was pretty much oblivious to this problem during the initial parallel
query development and mistakenly assumed that bringing over struct
Instrumentation was good enough.  It emerged late in the game that this
wasn't really the case, but holding up the whole feature because some
nodes might have details not reported on a per-worker basis didn't
really seem to make sense.  Whether that was the right call is
obviously arguable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
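For comparison, the bespoke route would mean a retrieval function
modeled on the Sort/Hash ones -- roughly like the sketch below, where
SharedIndexOnlyScanInfo, IndexOnlyScanWorkerInfo, winfo, and
shared_info are invented names for illustration, not existing code:

    /*
     * Hypothetical sketch modeled on ExecSortRetrieveInstrumentation():
     * copy the per-worker counters out of dynamic shared memory into
     * backend-local memory, so EXPLAIN can still read them after the
     * workers exit and the DSM segment is detached.
     */
    static void
    ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
    {
        SharedIndexOnlyScanInfo *shared = node->shared_info;
        Size        size;

        if (shared == NULL)
            return;
        size = offsetof(SharedIndexOnlyScanInfo, winfo)
            + shared->num_workers * sizeof(IndexOnlyScanWorkerInfo);
        node->shared_info = palloc(size);
        memcpy(node->shared_info, shared, size);
    }

plus the matching Estimate/InitializeDSM/InitializeWorker pieces that
set up the shared area in the first place.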
Robert Haas wrote:
> On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
> <alvherre@alvh.no-ip.org> wrote:
> > Questions:
> > 1. Do we want to back-patch this to 10?  I suppose (without checking)
> > that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> > index-only scans, so I think we should do that.
>
> I haven't looked at this closely, but have you considered adding
> bespoke code for IndexOnlyScan that works like
> ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
> already do rather than jamming this into struct Instrumentation?

Thanks for pointing these out -- I hadn't come across these.

My initial impression is that those two are about transferring
instrumentation state that's quite a bit more complicated than what my
patch proposes for index-only scan, which is just a single integer
counter.  For example, in the case of Sort, for each worker we display
separately the method and type and amount of memory.  In the hash case,
we aggregate all of them together for some reason (though I'm not clear
about this one:

    /*
     * In a parallel-aware hash join, for now we report the
     * maximum peak memory reported by any worker.
     */
    hinstrument.space_peak =
        Max(hinstrument.space_peak, worker_hi->space_peak);

-- why shouldn't we sum these values?)

In contrast, in an index-only scan you have a single counter, and the
distribution of fetches across workers doesn't really matter, so it
seems okay to aggregate them all in a single counter.  And it being so
simple, it seems reasonable to me to put it in Instrumentation rather
than have a dedicated struct.

> I'm inclined to view any node-specific instrumentation that's not
> being pulled back to the leader as a rough edge to be filed down when
> it bothers somebody more than an outright bug, but perhaps that is an
> unduly lenient view.  Still, if we take the view that it's an outright
> bug, I suspect we find that there may be at least a few more of those.

OK.  As Tom indicated, it's not possible to backpatch this anyway.
Given that nobody has complained to date, it seems okay to be lenient
about this.  Seeking a backpatchable solution seems more trouble than
it's worth.

> I was pretty much oblivious to this problem during the initial
> parallel query development and mistakenly assumed that bringing over
> struct Instrumentation was good enough.  It emerged late in the game
> that this wasn't really the case, but holding up the whole feature
> because some nodes might have details not reported on a per-worker
> basis didn't really seem to make sense.  Whether that was the right
> call is obviously arguable.

I certainly don't blame you for that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
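To make the "it's just a counter" argument concrete: once the field is
in Instrumentation, the leader-side accumulation is a one-line addition
to the existing InstrAggNode() in instrument.c.  The surrounding lines
below are paraphrased from memory; only the ntuples2 line is new:

    /* Excerpt sketch of InstrAggNode(dst, add), run once per worker: */
    dst->ntuples += add->ntuples;
    dst->ntuples2 += add->ntuples2;     /* proposed: heap fetches et al. */
    dst->nloops += add->nloops;
    dst->nfiltered1 += add->nfiltered1;
    dst->nfiltered2 += add->nfiltered2;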
On Tue, Apr 10, 2018 at 12:29 PM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> In contrast, in an index-only scan you have a single counter, and the
> distribution of fetches across workers doesn't really matter, so it
> seems okay to aggregate them all in a single counter.  And it being so
> simple, it seems reasonable to me to put it in Instrumentation rather
> than have a dedicated struct.

I don't have a strong opinion on that.  Since we know how many tuples
were processed by each worker, knowing how many heap fetches we have on
a per-worker basis seems like a good thing to have, too.  On the other
hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that anyway, since
it knows about displaying per-worker instrumentation (look for /* Show
worker detail */).  If it doesn't, then that's probably bad, because
I'm pretty sure that when I installed that code the stuff that got
displayed for worker instrumentation pretty much matched the stuff that
got displayed for overall instrumentation.

In any case, part of the point is that Instrumentation is supposed to
be a generic structure that contains things that are for the most part
common to all node types.  So what MERGE did to that structure looks
like an inadvisable kludge to me.  I'd get rid of that and do it a
different way rather than propagate the idea that nfilteredX is scratch
space that can mean something different in every separate node.

>> I was pretty much oblivious to this problem during the initial
>> parallel query development and mistakenly assumed that bringing over
>> struct Instrumentation was good enough.  It emerged late in the game
>> that this wasn't really the case, but holding up the whole feature
>> because some nodes might have details not reported on a per-worker
>> basis didn't really seem to make sense.  Whether that was the right
>> call is obviously arguable.
>
> I certainly don't blame you for that.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
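The worker-detail code in question loops over the per-worker
Instrumentation copies brought back from the workers, roughly as below
(simplified and recalled from memory, not quoted verbatim from
explain.c):

    /* "Show worker detail" in ExplainNode(): one entry per worker */
    WorkerInstrumentation *w = planstate->worker_instrument;

    for (int n = 0; n < w->num_workers; n++)
    {
        Instrumentation *instrument = &w->instrument[n];
        double      nloops = instrument->nloops;

        if (nloops <= 0)
            continue;
        appendStringInfo(es->str,
                         "Worker %d: actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
                         n,
                         1000.0 * instrument->startup / nloops,
                         1000.0 * instrument->total / nloops,
                         instrument->ntuples / nloops,
                         nloops);
    }

Any counter that lives in struct Instrumentation is thus already
available here per worker; displaying per-worker heap fetches would
mean extending this loop with node-specific output.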
Robert Haas wrote:
> I don't have a strong opinion on that.  Since we know how many tuples
> were processed by each worker, knowing how many heap fetches we have
> on a per-worker basis seems like a good thing to have, too.  On the
> other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
> anyway, since it knows about displaying per-worker instrumentation
> (look for /* Show worker detail */).  If it doesn't, then that's
> probably bad, because I'm pretty sure that when I installed that code
> the stuff that got displayed for worker instrumentation pretty much
> matched the stuff that got displayed for overall instrumentation.

So, after some experimentation I was able to produce this output with
the previously proposed patch:

                                        QUERY PLAN
 ───────────────────────────────────────────────────────────────────────────────────────
 Gather Merge  (cost=1000.51..27083147.02 rows=22950230 width=12) (actual time=1198.906..1198.906 rows=0 loops=1)
   Output: tbl1.col1, tprt_1.col1, tprt_1.col1
   Workers Planned: 5
   Workers Launched: 5
   Buffers: shared hit=267516
   ->  Nested Loop  (cost=0.43..24318369.79 rows=4590046 width=12) (actual time=1183.291..1183.291 rows=0 loops=6)
         Output: tbl1.col1, tprt_1.col1, tprt_1.col1
         Join Filter: (tbl1.col1 = tprt_1.col1)
         Rows Removed by Join Filter: 1200012
         Buffers: shared hit=1836184
         Worker 0: actual time=1174.526..1174.526 rows=0 loops=1
           Buffers: shared hit=245599
         Worker 1: actual time=1178.845..1178.845 rows=0 loops=1
           Buffers: shared hit=270694
         Worker 2: actual time=1185.703..1185.703 rows=0 loops=1
           Buffers: shared hit=313607
         Worker 3: actual time=1189.595..1189.595 rows=0 loops=1
           Buffers: shared hit=383714
         Worker 4: actual time=1176.965..1176.965 rows=0 loops=1
           Buffers: shared hit=355054
         ->  Parallel Index Only Scan using tprt1_idx on public.tprt_1  (cost=0.43..63100.29 rows=360004 width=4) (actual time=0.041..209.113 rows=300003 loops=6)
               Output: tprt_1.col1
               Heap Fetches: 1800018
               Buffers: shared hit=36166
               Worker 0: actual time=0.035..192.255 rows=240750 loops=1
                 Buffers: shared hit=4849
               Worker 1: actual time=0.031..184.332 rows=265307 loops=1
                 Buffers: shared hit=5387
               Worker 2: actual time=0.046..272.412 rows=307509 loops=1
                 Buffers: shared hit=6098
               Worker 3: actual time=0.035..212.866 rows=376064 loops=1
                 Buffers: shared hit=7650
               Worker 4: actual time=0.036..201.076 rows=348048 loops=1
                 Buffers: shared hit=7006
         ->  Seq Scan on public.tbl1  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.002 rows=4 loops=1)

 Time: 1205,533 ms (00:01,206)

You're right that per-worker heap fetches would probably be better, but
that seems like a larger project than I'm willing to tackle at this
point in the devel cycle -- mainly because that "worker detail" does
not have node-specific handling, so there's a bunch of new code to be
written there.  This amount of detail seems about right to me: I'm not
sure that you really care all that much about how many fetches each
worker had to do, but rather how many there were in the table as a
whole.
Another query (without verbose) gives this:

                                        QUERY PLAN
 ────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=20000001002.10..20249583898.07 rows=80390828 width=12) (actual time=1554.752..6633.272 rows=3600036 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Nested Loop  (cost=20000000002.10..20241543815.27 rows=80390828 width=12) (actual time=1547.552..5832.355 rows=3600036 loops=1)
         Join Filter: (tbl1.col1 = tprt_1.col1)
         Rows Removed by Join Filter: 21600216
         Buffers: shared hit=75688 read=54346
         ->  Merge Append  (cost=2.10..371288.65 rows=6305163 width=4) (actual time=0.758..1980.024 rows=6300063 loops=1)
               Sort Key: tprt_1.col1
               Buffers: shared hit=75688 read=54345
               ->  Index Only Scan using tprt1_idx on tprt_1  (cost=0.43..77500.44 rows=1800018 width=4) (actual time=0.211..387.916 rows=1800018 loops=1)
                     Heap Fetches: 1800018
                     Buffers: shared hit=22028 read=14035
               ->  Index Only Scan using tprt2_idx on tprt_2  (cost=0.43..121147.34 rows=2700027 width=4) (actual time=0.180..628.350 rows=2700027 loops=1)
                     Heap Fetches: 2700027
                     Buffers: shared hit=40597 read=26251
               ->  Index Only Scan using tprt3_idx on tprt_3  (cost=0.42..29722.56 rows=900009 width=4) (actual time=0.153..184.429 rows=900009 loops=1)
                     Heap Fetches: 900009
                     Buffers: shared hit=6538 read=7029
               ->  Index Only Scan using tprt4_idx on tprt_4  (cost=0.15..82.41 rows=2550 width=4) (actual time=0.023..0.023 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
               ->  Index Only Scan using tprt5_idx on tprt_5  (cost=0.15..82.41 rows=2550 width=4) (actual time=0.018..0.018 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
               ->  Index Only Scan using tprt6_idx on tprt_6  (cost=0.42..29734.56 rows=900009 width=4) (actual time=0.165..191.343 rows=900009 loops=1)
                     Heap Fetches: 900009
                     Buffers: shared hit=6523 read=7030
         ->  Materialize  (cost=10000000000.00..10000000048.25 rows=2550 width=4) (actual time=0.000..0.000 rows=4 loops=6300063)
               Buffers: shared read=1
               ->  Seq Scan on tbl1  (cost=10000000000.00..10000000035.50 rows=2550 width=4) (actual time=0.033..0.036 rows=4 loops=1)
                     Buffers: shared read=1
 Planning Time: 4.344 ms
 Execution Time: 7008.095 ms
(35 filas)

Looks alright to me.

> In any case, part of the point is that Instrumentation is supposed to
> be a generic structure that contains things that are for the most part
> common to all node types.  So what MERGE did to that structure looks
> like an inadvisable kludge to me.  I'd get rid of that and do it a
> different way rather than propagate the idea that nfilteredX is
> scratch space that can mean something different in every separate
> node.

Agreed.  My proposed patch adds another generic counter, "ntuples2",
which can be used at least by ON CONFLICT.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
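For the ON CONFLICT follow-up, the generic counter would replace the
current misuse of the filter counters -- sketched below with assumed
call sites; the two functions named do exist, but the exact lines are
illustrative, not from the patch:

    /*
     * In ExecOnConflictUpdate() (nodeModifyTable.c): count each tuple
     * that takes the ON CONFLICT path into the generic counter.
     */
    if (mtstate->ps.instrument)
        mtstate->ps.instrument->ntuples2 += 1;

    /*
     * In show_modifytable_info() (explain.c): read it back, replacing
     * the "other_path" misuse of a filter counter mentioned upthread.
     */
    other_path = mtstate->ps.instrument->ntuples2 / nloops;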