Thread: Re: pgsql: Support partition pruning at execution time

Re: pgsql: Support partition pruning at execution time

From: Alvaro Herrera

Changed CC to pgsql-hackers.

Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> > Frankly, I don't like this.  I would rather have an instrument->ntuples2
> > rather than these "divide this by nloops, sometimes" schizoid counters.
> > This is already being misused by ON CONFLICT (see "other_path" in
> > show_modifytable_info).  But it seems like a correct fix would require
> > more code.
> 
> The question then becomes whether to put back nfiltered3, or to do
> something more to your liking.  Think I'd vote for the latter.

Doing it properly is not a lot of code actually.  Patch attached.  ON
CONFLICT is not changed by this patch, but that's a straightforward
change.
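
In rough outline the change is just the following (names here are only
illustrative -- see the attached patch for the real thing):

    /* instrument.h: one more general-purpose counter */
    double      ntuples2;       /* secondary node-specific tuple counter */

    /* nodeIndexonlyscan.c: charge each heap fetch to it */
    if (node->ss.ps.instrument)
        node->ss.ps.instrument->ntuples2 += 1;

    /*
     * explain.c: report the counter; the workers' copies have already
     * been summed into the leader's Instrumentation by the existing
     * parallel-instrumentation machinery.
     */
    ExplainPropertyFloat("Heap Fetches",
                         planstate->instrument->ntuples2, 0, es);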

Questions:
1. Do we want to back-patch this to 10?  I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

2. Do we want to revert Andrew's test stabilization patch?  If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible".  That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero).  Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: pgsql: Support partition pruning at execution time

From: Tom Lane

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

You can't back-patch a change in struct Instrumentation; that'd be
an ABI break.  Maybe there are no third parties directly touching that
struct, but I wouldn't bet on it.
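
Concretely, any third-party code that embeds or allocates the struct by
value, say something like

    /* hypothetical extension code, compiled against the 10.x headers */
    Instrumentation *instrs =
        (Instrumentation *) palloc0(nworkers * sizeof(Instrumentation));

    instrs[2].nloops += 1;      /* size and field offsets baked in at compile time */

would read and write the wrong bytes against a server whose
Instrumentation had grown in a minor release, until that extension is
rebuilt.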

> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I did not especially like the original test output, because even without
the bug at hand, it seemed to me the number of heap fetches might vary
depending on BLCKSZ.  Given that the point of the test is just to check
partition pruning, seems like IOS vs regular indexscan isn't a critical
difference.  I'd keep Andrew's change but fix the comment.

            regards, tom lane


Re: pgsql: Support partition pruning at execution time

From: David Rowley

On 11 April 2018 at 03:14, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I think it's probably a good idea to revert it once the
instrumentation is working correctly. It appears this found a bug in
that code, so it's probably useful to keep just in case something else
breaks it in the future.

I don't think there is too much risk of instability from other
sources. There's no reason an auto-vacuum would trigger and cause a
change in heap fetches. We only delete one row from lprt_a; that's not
going to trigger an auto-vacuum.
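
(For reference, with the default settings autovacuum only picks up a
table once

    dead tuples > autovacuum_vacuum_threshold
                  + autovacuum_vacuum_scale_factor * reltuples
                = 50 + 0.2 * reltuples

so a single deleted row in lprt_a is nowhere near enough.)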

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: Support partition pruning at execution time

From: David Rowley

On 11 April 2018 at 03:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> 2. Do we want to revert Andrew's test stabilization patch?  If I
>> understand correctly, the problem is the inverse of what was diagnosed:
>> "any running transaction at the time of the test could prevent pages
>> from being set as all-visible".  That's correct, but the test doesn't
>> depend on pages being all-visible -- quite the contrary, it depends on
>> the pages NOT being all-visible (which is why the HeapFetches counts are
>> all non-zero).  Since the pages contain very few tuples, autovacuum
>> should never process the tables anyway.
>
> I did not especially like the original test output, because even without
> the bug at hand, it seemed to me the number of heap fetches might vary
> depending on BLCKSZ.  Given that the point of the test is just to check
> partition pruning, seems like IOS vs regular indexscan isn't a critical
> difference.  I'd keep Andrew's change but fix the comment.

Hmm, I don't feel strongly about reverting or not, but if
[auto-]vacuum has not visited the table, then I don't see why BLCKSZ
would have an effect here.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: Support partition pruning at execution time

From: Robert Haas

On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

I haven't looked at this closely, but have you considered adding
bespoke code for IndexOnlyScan that works like
ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
already do rather than jamming this into struct Instrumentation?
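
For concreteness, that pattern is basically: keep a per-worker value in a
DSM chunk and have the leader copy it into backend-local memory before the
DSM goes away.  A completely hypothetical sketch (none of these names
exist today), mirroring what nodeSort.c does:

    typedef struct SharedIndexOnlyScanInfo
    {
        int         num_workers;
        long        heap_fetches[FLEXIBLE_ARRAY_MEMBER];
    } SharedIndexOnlyScanInfo;

    void
    ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
    {
        SharedIndexOnlyScanInfo *si = node->shared_info;
        Size        size;

        if (si == NULL)
            return;

        /* copy the DSM contents so they survive DestroyParallelContext() */
        size = offsetof(SharedIndexOnlyScanInfo, heap_fetches)
            + si->num_workers * sizeof(long);
        node->shared_info = palloc(size);
        memcpy(node->shared_info, si, size);
    }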

I'm inclined to view any node-specific instrumentation that's not
being pulled back to the leader as a rough edge to be filed down when
it bothers somebody more than an outright bug, but perhaps that is an
unduly lenient view.  Still, if we take the view that it's an outright
bug, I suspect we'd find at least a few more of those.
I was pretty much oblivious to this problem during the initial
parallel query development and mistakenly assumed that bringing over
struct Instrumentation was good enough.  It emerged late in the game
that this wasn't really the case, but holding up the whole feature
because some nodes might have details not reported on a per-worker
basis didn't really seem to make sense.  Whether that was the right
call is obviously arguable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Support partition pruning at execution time

From: Alvaro Herrera

Robert Haas wrote:
> On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
> <alvherre@alvh.no-ip.org> wrote:
> > Questions:
> > 1. Do we want to back-patch this to 10?  I suppose (without checking)
> > that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> > index-only scans, so I think we should do that.
> 
> I haven't looked at this closely, but have you considered adding
> bespoke code for IndexOnlyScan that works like
> ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
> already do rather than jamming this into struct Instrumentation?

Thanks for pointing these out -- I hadn't come across these.

My initial impression is that those two are about transferring
instrumentation state that's quite a bit more complicated than what my
patch proposes for index-only scan, which is just a single integer
counter.  For example, in the case of Sort, for each worker we separately
display the sort method, space type, and amount of memory used.  In the
hash case,
we aggregate all of them together for some reason (though I'm not clear
about this one:
                /*
                 * In a parallel-aware hash join, for now we report the
                 * maximum peak memory reported by any worker.
                 */
                hinstrument.space_peak =
                    Max(hinstrument.space_peak, worker_hi->space_peak);
 -- why shouldn't we sum these values?)

In contrast, an index-only scan has just a single counter, and the
distribution of fetches across workers doesn't really matter, so it
seems okay to aggregate them all into one value.  And since it's so
simple, it seems reasonable to me to put it in Instrumentation rather
than have a dedicated struct.
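
That's also why no new transfer code should be needed: when the leader
folds each worker's Instrumentation into the node's totals, the generic
counters are simply summed (roughly like this, in instrument.c's
aggregation routine), so the new counter just rides along:

    /* InstrAggNode(): accumulate one worker's counters into the node total */
    dst->ntuples    += add->ntuples;
    dst->ntuples2   += add->ntuples2;   /* the new counter is just one more line */
    dst->nloops     += add->nloops;
    dst->nfiltered1 += add->nfiltered1;
    dst->nfiltered2 += add->nfiltered2;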

> I'm inclined to view any node-specific instrumentation that's not
> being pulled back to the leader as a rough edge to be filed down when
> it bothers somebody more than an outright bug, but perhaps that is an
> unduly lenient view.  Still, if we take the view that it's an outright
> bug, I suspect we find that there may be at least a few more of those.

OK.  As Tom indicated, it's not possible to backpatch this anyway.
Given that nobody has complained to date, it seems okay to be lenient
about this.  Seeking a backpatchable solution seems more trouble than it's
worth.

> I was pretty much oblivious to this problem during the initial
> parallel query development and mistakenly assumed that bringing over
> struct Instrumentation was good enough.  It emerged late in the game
> that this wasn't really the case, but holding up the whole feature
> because some nodes might have details not reported on a per-worker
> basis didn't really seem to make sense.  Whether that was the right
> call is obviously arguable.

I certainly don't blame you for that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Support partition pruning at execution time

From: Robert Haas

On Tue, Apr 10, 2018 at 12:29 PM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> In contrast, in an indexonly scan you have a single counter and it
> doesn't really matter the distribution of fetches done by workers, so it
> seems okay to aggregate them all in a single counter.  And it being so
> simple, it seems reasonable to me to put it in Instrumentation rather
> than have a dedicated struct.

I don't have a strong opinion on that.  Since we know how many tuples
were processed by each worker, knowing how many heap fetches we have
on a per-worker basis seems like a good thing to have, too.  On the
other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
anyway, since it knows about displaying per-worker instrumentation
(look for /* Show worker detail */).  If it doesn't, then that's
probably bad, because I'm pretty sure that when I installed that code
the stuff that got displayed for worker instrumentation pretty much
matched the stuff that got displayed for overall instrumentation.

In any case, part of the point is that Instrumentation is supposed to
be a generic structure that contains things that are for the most part
common to all node types.  So what MERGE did to that structure looks
like an inadvisable kludge to me.  I'd get rid of that and do it a
different way rather than propagate the idea that nfilteredX is
scratch space that can mean something different in every separate
node.

>> I was pretty much oblivious to this problem during the initial
>> parallel query development and mistakenly assumed that bringing over
>> struct Instrumentation was good enough.  It emerged late in the game
>> that this wasn't really the case, but holding up the whole feature
>> because some nodes might have details not reported on a per-worker
>> basis didn't really seem to make sense.  Whether that was the right
>> call is obviously arguable.
>
> I certainly don't blame you for that.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Support partition pruning at execution time

From: Alvaro Herrera

Robert Haas wrote:
> 
> I don't have a strong opinion on that.  Since we know how many tuples
> were processed by each worker, knowing how many heap fetches we have
> on a per-worker basis seems like a good thing to have, too.  On the
> other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
> anyway, since it knows about displaying per-worker instrumentation
> (look for /* Show worker detail */).  If it doesn't, then that's
> probably bad, because I'm pretty sure that when I installed that code
> the stuff that got displayed for worker instrumentation pretty much
> matched the stuff that got displayed for overall instrumentation.

So, after some experimentation I was able to produce this output with
the previously proposed patch:
                                                            QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather Merge  (cost=1000.51..27083147.02 rows=22950230 width=12) (actual time=1198.906..1198.906 rows=0 loops=1)
   Output: tbl1.col1, tprt_1.col1, tprt_1.col1
   Workers Planned: 5
   Workers Launched: 5
   Buffers: shared hit=267516
   ->  Nested Loop  (cost=0.43..24318369.79 rows=4590046 width=12) (actual time=1183.291..1183.291 rows=0 loops=6)
         Output: tbl1.col1, tprt_1.col1, tprt_1.col1
         Join Filter: (tbl1.col1 = tprt_1.col1)
         Rows Removed by Join Filter: 1200012
         Buffers: shared hit=1836184
         Worker 0: actual time=1174.526..1174.526 rows=0 loops=1
           Buffers: shared hit=245599
         Worker 1: actual time=1178.845..1178.845 rows=0 loops=1
           Buffers: shared hit=270694
         Worker 2: actual time=1185.703..1185.703 rows=0 loops=1
           Buffers: shared hit=313607
         Worker 3: actual time=1189.595..1189.595 rows=0 loops=1
           Buffers: shared hit=383714
         Worker 4: actual time=1176.965..1176.965 rows=0 loops=1
           Buffers: shared hit=355054
         ->  Parallel Index Only Scan using tprt1_idx on public.tprt_1  (cost=0.43..63100.29 rows=360004 width=4) (actual time=0.041..209.113 rows=300003 loops=6)
               Output: tprt_1.col1
               Heap Fetches: 1800018
               Buffers: shared hit=36166
               Worker 0: actual time=0.035..192.255 rows=240750 loops=1
                 Buffers: shared hit=4849
               Worker 1: actual time=0.031..184.332 rows=265307 loops=1
                 Buffers: shared hit=5387
               Worker 2: actual time=0.046..272.412 rows=307509 loops=1
                 Buffers: shared hit=6098
               Worker 3: actual time=0.035..212.866 rows=376064 loops=1
                 Buffers: shared hit=7650
               Worker 4: actual time=0.036..201.076 rows=348048 loops=1
                 Buffers: shared hit=7006
         ->  Seq Scan on public.tbl1  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.002 rows=4 loo
 Time: 1205,533 ms (00:01,206)

You're right that per-worker heap fetches would probably be better, but
that seems like a project larger than I'm willing to tackle at this
point in the devel cycle -- mainly because that "worker detail" does not
have node-specific handling, so there's a bunch of new code to be
written there.  This amount of detail seems about right to me: I'm not
sure that you really care all that much about how many fetches each
worker had to do, but rather you care about how many there were in the
table as a whole.

Another query (without verbose) gives this:

                                                          QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=20000001002.10..20249583898.07 rows=80390828 width=12) (actual time=1554.752..6633.272 rows=3600036 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Nested Loop  (cost=20000000002.10..20241543815.27 rows=80390828 width=12) (actual time=1547.552..5832.355 rows=3600036 loops=1)
         Join Filter: (tbl1.col1 = tprt_1.col1)
         Rows Removed by Join Filter: 21600216
         Buffers: shared hit=75688 read=54346
         ->  Merge Append  (cost=2.10..371288.65 rows=6305163 width=4) (actual time=0.758..1980.024 rows=6300063 loops=1)
               Sort Key: tprt_1.col1
               Buffers: shared hit=75688 read=54345
               ->  Index Only Scan using tprt1_idx on tprt_1  (cost=0.43..77500.44 rows=1800018 width=4) (actual time=0.211..387.916 rows=1800018 loops=1)
                     Heap Fetches: 1800018
                     Buffers: shared hit=22028 read=14035
               ->  Index Only Scan using tprt2_idx on tprt_2  (cost=0.43..121147.34 rows=2700027 width=4) (actual time=0.180..628.350 rows=2700027 loops=1)
                     Heap Fetches: 2700027
                     Buffers: shared hit=40597 read=26251
               ->  Index Only Scan using tprt3_idx on tprt_3  (cost=0.42..29722.56 rows=900009 width=4) (actual time=0.153..184.429 rows=900009 loops=1)
                     Heap Fetches: 900009
                     Buffers: shared hit=6538 read=7029
               ->  Index Only Scan using tprt4_idx on tprt_4  (cost=0.15..82.41 rows=2550 width=4) (actual time=0.023..0.023 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
               ->  Index Only Scan using tprt5_idx on tprt_5  (cost=0.15..82.41 rows=2550 width=4) (actual time=0.018..0.018 rows=0 loops=1)
                     Heap Fetches: 0
                     Buffers: shared hit=1
               ->  Index Only Scan using tprt6_idx on tprt_6  (cost=0.42..29734.56 rows=900009 width=4) (actual time=0.165..191.343 rows=900009 loops=1)
                     Heap Fetches: 900009
                     Buffers: shared hit=6523 read=7030
         ->  Materialize  (cost=10000000000.00..10000000048.25 rows=2550 width=4) (actual time=0.000..0.000 rows=4 loops=6300063)
               Buffers: shared read=1
               ->  Seq Scan on tbl1  (cost=10000000000.00..10000000035.50 rows=2550 width=4) (actual time=0.033..0.036 rows=4 loops=1)
                     Buffers: shared read=1
 Planning Time: 4.344 ms
 Execution Time: 7008.095 ms
(35 rows)

Looks alright to me.

> In any case part of the point is that Instrumentation is supposed to
> be a generic structure that contains things that are for the most part
> common to all node types.  So what MERGE did to that structure looks
> like in inadvisable kludge to me.  I'd get rid of that and do it a
> different way rather than propagate the idea that nfilteredX is
> scratch space that can mean something different in every separate
> node.

Agreed.  My proposed patch adds another generic counter "tuples2", which
can be used at least by ON CONFLICT.
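
For illustration, the ON CONFLICT follow-up would look about like this
(untested sketch) -- bump the new counter where the conflicting path is
taken, and stop reading nfiltered2 in show_modifytable_info():

    /* nodeModifyTable.c, ON CONFLICT path: count one conflicting tuple */
    if (mtstate->ps.instrument)
        mtstate->ps.instrument->ntuples2 += 1;

    /* explain.c, show_modifytable_info() */
    total = mtstate->mt_plans[0]->instrument->ntuples;
    other_path = mtstate->ps.instrument->ntuples2;      /* was nfiltered2 */
    insert_path = total - other_path;

    ExplainPropertyFloat("Tuples Inserted", insert_path, 0, es);
    ExplainPropertyFloat("Conflicting Tuples", other_path, 0, es);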

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services