Re: issue with query optimizer when joining two partitioned tables

Поиск
Список
Период
Сортировка
От Anish Kejariwal
Тема Re: issue with query optimizer when joining two partitioned tables
Дата
Msg-id CAOpcnr-xdAUq7+aeqOSYYK=HdOUY6L+H7dF=8O3qP+7Td9th1Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: issue with query optimizer when joining two partitioned tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-performance
Thanks Tom and Heikki!  I really appreciate your help.

I went ahead and loaded all the data.  In the icream table, I now have ~175 partitions, each with 4041866 records.

The data finished loading 12 hours ago, and I then ran the same query I gave you guys, and it took 25 seconds since it used the wrong execution plan as expected.

 HashAggregate  (cost=27680.90..28045.81 rows=11228 width=41) (actual time=24769.190..24817.618 rows=11028 loops=1)
   ->  Hash Join  (cost=1304.04..18901.88 rows=292634 width=41) (actual time=3938.965..24688.718 rows=11028 loops=1)
         Hash Cond: (rc.widget_id = r.widget_id)
         ->  Append  (cost=0.00..12110.95 rows=292634 width=24) (actual time=2854.925..22887.638 rows=309579 loops=1)
               ->  Seq Scan on icecream rc  (cost=0.00..25.60 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=1)
                     Filter: ((widgetset_id = 5) AND (dataset_id = 283))
               ->  Index Scan using icecream_part_283_widgetset_id_idx on icecream_part_283 rc  (cost=0.00..12085.35 rows=292633 width=24) (actual time=2854.915..2
1784.769 rows=309579 loops=1)
                     Index Cond: (widgetset_id = 5)
                     Filter: (dataset_id = 283)
         ->  Hash  (cost=1163.69..1163.69 rows=11228 width=21) (actual time=1083.704..1083.704 rows=11028 loops=1)
               Buckets: 2048  Batches: 1  Memory Usage: 604kB
               ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=528.216..1066.659 rows=11028 loops=1)
                     ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=528.017..528.017 rows=0 loops=1)
                           Recheck Cond: (widgetset_id = 5)
                           Filter: (widget_type_id = 4)
                           ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=527.995..527.995 rows=0 loops=1)
                                 Index Cond: (widgetset_id = 5)
                     ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.191..512.847 rows=1
1028 loops=1)
                           Index Cond: (widget_type_id = 4)
                           Filter: (widgetset_id = 5)
 Total runtime: 24844.016 ms
(21 rows)


I then changed my enable* options to force it to use a nested loop:
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;


 HashAggregate  (cost=460004.79..460369.70 rows=11228 width=41) (actual time=298.014..341.822 rows=11028 loops=1)
   ->  Nested Loop  (cost=4.28..338747.85 rows=4041898 width=41) (actual time=0.175..248.529 rows=11028 loops=1)
         Join Filter: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.053..42.532 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.014..0.014 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.032..18.410 rows=11028 lo
ops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Append  (cost=0.00..29.99 rows=6 width=24) (actual time=0.009..0.012 rows=1 loops=11028)
               ->  Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.001..0.001 rows=0 loops=11028)
                     Filter: (rc.dataset_id = 283)
               ->  Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc  (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo
ps=11028)
                     Index Cond: (rc.widget_id = r.widget_id)
                     Filter: (rc.dataset_id = 283)
 Total runtime: 361.180 ms
(19 rows)

The query was nice and fast as expected.

I then restored the enable* options back to default.  The query was slow again and taking around ~19 seconds.  So, this gives us some information about whether autoanalyze is running in the background:
-I don't think waiting is making a difference.  I loaded the data 12 hours ago, and then ran the query and the query was very slow.  I then ran the query again but for a different dataset and it was also slow.  It was only once I changed the enable* parameters could I get my expected performance.  
-I would have tried select * from ice-cream again, but with 176 partitions this query is no longer feasible.  But, I do agree that select * from icecream is causing the statistics to be updated.

So, then I followed Tom's suggestion to defeat the empty-table heuristic.

select relpages from pg_class where relname = 'icecream';

 relpages 
----------
        0
(1 row)

Ok, so the planner thinks that the parent table is empty.  I then ran:
update pg_class set relpages = 1 where relname = 'icecream';


 HashAggregate  (cost=201199.27..201564.18 rows=11228 width=41) (actual time=277.195..304.620 rows=11028 loops=1)
   ->  Nested Loop  (cost=4.28..79942.45 rows=4041894 width=41) (actual time=0.227..231.181 rows=11028 loops=1)
         Join Filter: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.125..40.834 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.022..0.022 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.019..0.019 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.100..18.964 rows=11028 lo
ops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Append  (cost=0.00..6.99 rows=2 width=24) (actual time=0.008..0.012 rows=1 loops=11028)
               ->  Seq Scan on icecream rc  (cost=0.00..0.00 rows=1 width=24) (actual time=0.001..0.001 rows=0 loops=11028)
                     Filter: (rc.dataset_id = 283)
               ->  Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc  (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo
ps=11028)
                     Index Cond: (rc.widget_id = r.widget_id)
                     Filter: (rc.dataset_id = 283)
 Total runtime: 318.634 ms
(19 rows)


Wow!  that fixes it.  Thanks you so much!!!! I've been struggling with this issue for 2-3 days.  (Also, in the past, I've seen inconsistent performance with this query, which may be the result of the planner sometimes choosing the wrong plan, but I'll chase that down later).

Tom said: But maybe we should reconsider the heuristic for tables that are members of inheritance trees --- particularly parents of inheritance trees.

I agree.  I think postgres should get updated to take this into account.  I shouldn't have to set the relpages to 1 for all the empty parent tables that I have partitioned.  Should I file this as a bug/enhancement?

Also, do I need to worry about about autoanalyze/autovacuum setting back relpages to zero for the parent icecream table?

thanks!!!
Anish



On Sat, Jul 9, 2011 at 10:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 09.07.2011 00:36, Anish Kejariwal wrote:
>> My guess as to what happened:
>> -because the icecream parent table has zero records, the query optimizer
>> chooses the incorrect execution plan
>> -when I do select * from icecream, the optimizer now knows how many records
>> are really in the icecream table, by knowing that the icecream table has
>> partitions.

> "select * from icecream" won't have any direct effect on the
> optimization of subsequent queries. What probably happened is that
> autoanalyze ran in the background while you ran that select, and
> analyzed some of the partitions. Simply waiting a while would've had the
> same effect.

Yeah.  Also, the reason that a manual vacuum on icecream changes things
yet again is that in 9.0 and up, we have a notion of summary stats
across the whole inheritance tree, but autoanalyze hasn't been taught to
gather those.  The manual command on the parent table does gather them,
though.

So what's happening here is that we suddenly have an accurate idea of
the size of the join product as a result of having inheritance summary
stats to estimate with, and that drives the estimated cost of the merge
or hash join down out of the stratosphere.  The estimated cost of the
nestloop goes down a lot too, but not as much.

I experimented with a similar case here, and it seems like a lot of the
remaining error in the nestloop estimate comes from this:

>> ->   Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
>>        Filter: (dataset_id = 281)

The indexscan on the nonempty child partition is estimated at less than
10 cost units, so this is a *large* fraction of what the planner sees as
the per-outer-row cost of a nestloop.  And with more than 11000 rows on
the other side of the join, that discourages it from using the nestloop.
In reality of course this takes negligible time compared to examining
the child partition.

Now why is the seqscan cost estimate so large, when actually the parent
icecream table is totally empty?  It's because the planner has been
taught to never believe that an empty table is empty.  If memory serves,
it's really estimating on an assumption that the table contains 10 pages
and some corresponding number of rows.  This is a reasonable defensive
posture when dealing with ordinary tables, I think, since most likely
if the catalogs say the table is empty that's just a leftover from when
it was created.  But maybe we should reconsider the heuristic for tables
that are members of inheritance trees --- particularly parents of
inheritance trees.

I was able to defeat the empty-table heuristic here by doing

update pg_class set relpages = 1 where relname = 'icecream';

and then I started getting much more realistic estimates in my test
case.  (It still wanted to use a merge join initially, but after
knocking down random_page_cost it went to the nestloop.)  It would
be interesting to see what sorts of results Anish gets with that.

                       regards, tom lane

В списке pgsql-performance по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: INSERT query times
Следующее
От: Gael Le Mignot
Дата:
Сообщение: Re: Memory usage of auto-vacuum