Обсуждение: Identical DB's, different execution plans

Поиск
Список
Период
Сортировка

Identical DB's, different execution plans

От
Doug Eck
Дата:
I have two identical databases that run the same query each morning.  Starting this morning, something caused the first db to start using a different execution plan for the query, resulting in much worse performance.  I've have tried several things this morning, but I am currently stumped on what would be causing the different execution plans.

The query and the results of the explain analyze on the two db's:

db1=> explain analyze
select
    t1.bn,
    t2.mu,
    t1.nm,
    t1.root,
    t1.suffix,
    t1.type
from
     t1,
     t2
where
    t2.eff_dt = current_date
    and t1.active = true
    and t1.bn = t2.sn;

The slower plan used on db1:
                                                               QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=145.12..38799.61 rows=7876 width=47) (actual time=6.494..352.166 rows=8437 loops=1)
   ->  Bitmap Heap Scan on t2  (cost=145.12..19464.74 rows=10898 width=22) (actual time=6.472..22.684 rows=12204 loops=1)
         Recheck Cond: (eff_dt = ('now'::text)::date)
         ->  Bitmap Index Scan on t2_nu1  (cost=0.00..142.40 rows=10898 width=0) (actual time=4.013..4.013 rows=24482 loops=1)
               Index Cond: (eff_dt = ('now'::text)::date)
   ->  Index Scan using t1_uc2 on t1  (cost=0.00..1.76 rows=1 width=32) (actual time=0.012..0.026 rows=1 loops=12204)
         Index Cond: ((t1.bn)::text = (t2.sn)::text)
         Filter: active
 Total runtime: 353.629 ms
(9 rows)

Time: 354.795 ms


And the faster plan from db2:


                                                                         QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=21371.63..21720.78 rows=7270 width=47) (actual time=60.412..80.865 rows=8437 loops=1)
   Merge Cond: ("outer"."?column6?" = "inner"."?column3?")
   ->  Sort  (cost=8988.56..9100.55 rows=44794 width=32) (actual time=30.685..33.370 rows=8438 loops=1)
         Sort Key: (t1.bn)::text
         ->  Seq Scan on t1  (cost=0.00..5528.00 rows=44794 width=32) (actual time=0.008..18.280 rows=8439 loops=1)
               Filter: active
   ->  Sort  (cost=12383.07..12409.32 rows=10500 width=22) (actual time=29.718..33.515 rows=12204 loops=1)
         Sort Key: (t2.sn)::text
         ->  Index Scan using t2_nu1 on t2  (cost=0.00..11681.77 rows=10500 width=22) (actual time=0.052..13.295 rows=12204 loops=1)
               Index Cond: (eff_dt = ('now'::text)::date)
 Total runtime: 83.385 ms
(11 rows)

t2.eff_dt is defined as a date, t1.active is a boolean, all other fields are varchar.  Table t1 has a unique index (uc2) on field bn and a second unique index (uc3) on fields (root, suffix).  Table t2 has a unique index (uc1) on (sn, eff_dt), and a non-unique index (nu1) on eff_dt.

Table t1 has 12204 rows.  Table t2 has 7.1M rows, 12204 of which have eff_dt = current_date. 

Both database have autovacuum turned on, and both have been vacuumed and analyzed in the last 24 hours.

Any ideas as to what could the first db to opt for the slower subquery rather than the merge?

Thanks in advance.

Re: Identical DB's, different execution plans

От
Tom Lane
Дата:
Doug Eck <deck1@yahoo.com> writes:
> Any ideas as to what could the first db to opt for the slower subquery rather than the merge?

Not from the information given.  Presumably db1 thinks that the
mergejoin plan would be slower, but why it thinks that isn't clear yet.
Try setting enable_nestloop = off (and enable_hashjoin = off if it then
wants a hashjoin) and then post the EXPLAIN ANALYZE results.

            regards, tom lane

Re: Identical DB's, different execution plans

От
Doug Eck
Дата:
Setting enable_nestloop = off did result in a hash join, so I also set enable_hashjoin = off.

The new plan from the slower db:
                                                                  QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=20195.54..46442.99 rows=7876 width=47) (actual time=136.531..478.708 rows=8437 loops=1)
   Merge Cond: ((t1.bn)::text = "inner"."?column3?")
   ->  Index Scan using t1_uc2 on t1  (cost=0.00..25604.74 rows=204906 width=32) (actual time=0.061..327.285 rows=8438 loops=1)
         Filter: active
   ->  Sort  (cost=20195.54..20222.79 rows=10898 width=22) (actual time=136.461..138.621 rows=12204 loops=1)
         Sort Key: (t2.sn)::text
         ->  Bitmap Heap Scan on t2  (cost=145.12..19464.74 rows=10898 width=22) (actual time=7.580..120.144 rows=12204 loops=1)
               Recheck Cond: (eff_dt = ('now'::text)::date)
               ->  Bitmap Index Scan on t2_nu1  (cost=0.00..142.40 rows=10898 width=0) (actual time=4.964..4.964 rows=24483 loops=1)
                     Index Cond: (eff_dt = ('now'::text)::date)
 Total runtime: 480.344 ms
(11 rows)

And the faster one:

                                                                         QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=21371.63..21720.78 rows=7270 width=47) (actual time=60.435..80.604 rows=8437 loops=1)
   Merge Cond: ("outer"."?column6?" = "inner"."?column3?")
   ->  Sort  (cost=8988.56..9100.55 rows=44794 width=32) (actual time=30.498..33.093 rows=8438 loops=1)
         Sort Key: (t1.bn)::text
         ->  Seq Scan on t1  (cost=0.00..5528.00 rows=44794 width=32) (actual time=0.010..17.950 rows=8439 loops=1)
               Filter: active
   ->  Sort  (cost=12383.07..12409.32 rows=10500 width=22) (actual time=29.928..33.658 rows=12204 loops=1)
         Sort Key: (t2.sn)::text
         ->  Index Scan using t2_nu1 on t2  (cost=0.00..11681.77 rows=10500 width=22) (actual time=0.062..13.356 rows=12204 loops=1)
               Index Cond: (eff_dt = ('now'::text)::date)
 Total runtime: 83.054 ms
(11 rows)

And the query again:

explain analyze
select
    t1.bn,
    t2.mu,
    t1.nm,
    t1.root,
    t1.suffix,
    t1.type
from
     t1,
     t2
where
    t2.eff_dt = current_date
    and t1.active = true
    and t1.bn = t2.sn;

Thanks.

----- Original Message ----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Doug Eck <deck1@yahoo.com>
Cc: pgsql-performance@postgresql.org
Sent: Monday, September 29, 2008 11:42:01 AM
Subject: Re: [PERFORM] Identical DB's, different execution plans

Doug Eck <deck1@yahoo.com> writes:
> Any ideas as to what could the first db to opt for the slower subquery rather than the merge?

Not from the information given.  Presumably db1 thinks that the
mergejoin plan would be slower, but why it thinks that isn't clear yet.
Try setting enable_nestloop = off (and enable_hashjoin = off if it then
wants a hashjoin) and then post the EXPLAIN ANALYZE results.

            regards, tom lane

Re: Identical DB's, different execution plans

От
Tom Lane
Дата:
Doug Eck <deck1@yahoo.com> writes:
> The new plan from the slower db:

>    ->  Index Scan using t1_uc2 on t1  (cost=0.00..25604.74 rows=204906 width=32) (actual time=0.061..327.285
rows=8438loops=1) 
>          Filter: active

This seems a bit fishy.  In the first place, with such a simple filter
condition it shouldn't be that far off on the rowcount estimate.  In
the second place, the cost estimate is more than twice what the other
server estimates to do a seqscan and sort of the same data, and the
rowcount estimate is five times as much.  So there's something really
significantly different about the t1 tables in the two cases.

The first thing you ought to do is to look at the pg_class.relpages
and reltuples entries for t1 in both databases.  What I am suspecting is
that for some reason the "slow" db has suffered a lot of bloat in that
table, leading to a corresponding increase in the cost of a seqscan.
If so, a VACUUM FULL or CLUSTER should fix it, though you'll next need
to look into why routine vacuumings weren't happening.  (It looks like
t2 may be a bit bloated as well.)

If that's not it, we'll need to probe deeper ...

            regards, tom lane

Re: Identical DB's, different execution plans

От
Doug Eck
Дата:
Tom,

You nailed it.  The t1 table was using 9600 relpages versus 410 after the vacuum full.  The two databases are now showing similar execution plans and times.

Thanks for your help.  It is greatly appreciated.

Doug Eck

----- Original Message ----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Doug Eck <deck1@yahoo.com>
Cc: pgsql-performance@postgresql.org
Sent: Monday, September 29, 2008 6:20:20 PM
Subject: Re: [PERFORM] Identical DB's, different execution plans

Doug Eck <deck1@yahoo.com> writes:
> The new plan from the slower db:

>    ->  Index Scan using t1_uc2 on t1  (cost=0.00..25604.74 rows=204906 width=32) (actual time=0.061..327.285 rows=8438 loops=1)
>          Filter: active

This seems a bit fishy.  In the first place, with such a simple filter
condition it shouldn't be that far off on the rowcount estimate.  In
the second place, the cost estimate is more than twice what the other
server estimates to do a seqscan and sort of the same data, and the
rowcount estimate is five times as much.  So there's something really
significantly different about the t1 tables in the two cases.

The first thing you ought to do is to look at the pg_class.relpages
and reltuples entries for t1 in both databases.  What I am suspecting is
that for some reason the "slow" db has suffered a lot of bloat in that
table, leading to a corresponding increase in the cost of a seqscan.
If so, a VACUUM FULL or CLUSTER should fix it, though you'll next need
to look into why routine vacuumings weren't happening.  (It looks like
t2 may be a bit bloated as well.)

If that's not it, we'll need to probe deeper ...

            regards, tom lane