Обсуждение: 8.4.4, 9.0, and 9.1 Planner Differences

Поиск
Список
Период
Сортировка

8.4.4, 9.0, and 9.1 Planner Differences

От
Anthony Presley
Дата:
Hi there!

We have a dev machine running 9.0.1 (an i3 laptop, with a regular hard disk, with 4GB of RAM, and a mostly untuned postgresql.conf file).  The changed lines are:
  shared_buffers = 512MB
  temp_buffers = 48MB
  work_mem = 32MB
  maintenance_work_mem = 348MB
  checkpoint_segments = 10
  effective_cache_size = 512MB

The same database is loaded onto a production server running 9.1.1 (dual QC processors, RAID-10 SAS drives, 36GB of RAM), which replicates to a backup server.  This has a lot of changed properties:
  shared_buffers = 8500MB
  work_mem = 35MB
  maintenance_work_mem = 512MB
  wal_level = hot_standby
  checkpoint_segments = 50
  max_wal_senders = 3
  wal_keep_segments = 144
  random_page_cost = 1.0
  effective_cache_size = 16384MB
  effective_io_concurrency = 6

The same DB is loaded on both the production and the dev environment, and in all cases (about 5000 distinct different queries), the production environment is about 500x faster, except for one type of query (both databases were loaded from the same pg_dump on an 8.4.4 database):

    On the dev box, we have:  http://explain.depesz.com/s/rwU   - about 131 seconds
    On the production box, we have:  http://explain.depesz.com/s/3dt  - about .25 seconds

For the life of me, I don't understand why it would be slower.  What can we do to speed up this one query?

By the way, on 8.4.4, the query took about 84 seconds.  I cannot understand why the 9.0 is so blazing fast, but 8.4.4 and 9.1.1 are slower.  We've checked the query results (they are identical) to make sure we're not missing any data.


--
Anthony

Re: 8.4.4, 9.0, and 9.1 Planner Differences

От
Tom Lane
Дата:
Anthony Presley <anthony@resolution.com> writes:
> We have a dev machine running 9.0.1 (an i3 laptop, with a regular hard disk,
> with 4GB of RAM, and a mostly untuned postgresql.conf file).  The changed
> lines are:
>   shared_buffers = 512MB
>   temp_buffers = 48MB
>   work_mem = 32MB
>   maintenance_work_mem = 348MB
>   checkpoint_segments = 10
>   effective_cache_size = 512MB

> The same database is loaded onto a production server running 9.1.1 (dual QC
> processors, RAID-10 SAS drives, 36GB of RAM), which replicates to a backup
> server.  This has a lot of changed properties:
>   shared_buffers = 8500MB
>   work_mem = 35MB
>   maintenance_work_mem = 512MB
>   wal_level = hot_standby
>   checkpoint_segments = 50
>   max_wal_senders = 3
>   wal_keep_segments = 144
>   random_page_cost = 1.0
>   effective_cache_size = 16384MB
>   effective_io_concurrency = 6

That random_page_cost setting is going to have a huge effect on the
planner's choices, and the larger effective_cache_size setting will
likely affect plans too.  I don't find it surprising in the least
that you're getting different plan choices ... and even less so when
your "dev" and "production" DBs aren't even the same major version.
You might want to think about making your dev environment more like
your production.

> The same DB is loaded on both the production and the dev environment, and in
> all cases (about 5000 distinct different queries), the production
> environment is about 500x faster, except for one type of query (both
> databases were loaded from the same pg_dump on an 8.4.4 database):

>     On the dev box, we have:  http://explain.depesz.com/s/rwU   - about 131
> seconds
>     On the production box, we have:  http://explain.depesz.com/s/3dt  -
> about .25 seconds

Did you mislabel these?  Because if you didn't, the numbers are right
in line with what you say above.  But anyway, the problem with the
slower query appears to be poor rowcount estimates, leading the planner
to use a nestloop join when it shouldn't.  You haven't provided nearly
enough context to let anyone guess why the estimates are off, other
than boilerplate suggestions like making sure the tables have been
ANALYZEd recently, and maybe increasing the statistics targets.

            regards, tom lane

Re: 8.4.4, 9.0, and 9.1 Planner Differences

От
Anthony Presley
Дата:
On Sat, Oct 22, 2011 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anthony Presley <anthony@resolution.com> writes:
> We have a dev machine running 9.0.1 (an i3 laptop, with a regular hard disk,
> with 4GB of RAM, and a mostly untuned postgresql.conf file).  The changed
> lines are:
>   shared_buffers = 512MB
>   temp_buffers = 48MB
>   work_mem = 32MB
>   maintenance_work_mem = 348MB
>   checkpoint_segments = 10
>   effective_cache_size = 512MB

> The same database is loaded onto a production server running 9.1.1 (dual QC
> processors, RAID-10 SAS drives, 36GB of RAM), which replicates to a backup
> server.  This has a lot of changed properties:
>   shared_buffers = 8500MB
>   work_mem = 35MB
>   maintenance_work_mem = 512MB
>   wal_level = hot_standby
>   checkpoint_segments = 50
>   max_wal_senders = 3
>   wal_keep_segments = 144
>   random_page_cost = 1.0
>   effective_cache_size = 16384MB
>   effective_io_concurrency = 6

That random_page_cost setting is going to have a huge effect on the
planner's choices, and the larger effective_cache_size setting will
likely affect plans too.  I don't find it surprising in the least
that you're getting different plan choices ... and even less so when
your "dev" and "production" DBs aren't even the same major version.
You might want to think about making your dev environment more like
your production.

Tom - thanks for your input.

Upgrading to 9.1.1 on the dev box is certainly next on our list ... I like to make sure that the dev team uses a MUCH slower box than the production server, making sure that if the developers are making things fast for the machines, it's really fast on the production box.  For all of our queries except this one, this strategy is "working".

> The same DB is loaded on both the production and the dev environment, and in
> all cases (about 5000 distinct different queries), the production
> environment is about 500x faster, except for one type of query (both
> databases were loaded from the same pg_dump on an 8.4.4 database):

>     On the dev box, we have:  http://explain.depesz.com/s/rwU   - about 131
> seconds
>     On the production box, we have:  http://explain.depesz.com/s/3dt  -
> about .25 seconds

Did you mislabel these?  Because if you didn't, the numbers are right
in line with what you say above.  But anyway, the problem with the
slower query appears to be poor rowcount estimates, leading the planner
to use a nestloop join when it shouldn't.  You haven't provided nearly
enough context to let anyone guess why the estimates are off, other
than boilerplate suggestions like making sure the tables have been
ANALYZEd recently, and maybe increasing the statistics targets.

I *did* mis-label them.  The 131 seconds is actually the production box.  IE:
  production is ... http://explain.depesz.com/s/rwU

The .25 seconds is the development box.  IE:
  development is ... http://explain.depesz.com/s/3dt

I wasn't surprised that the plans are different.  I was surprised that the development box *spanked* the production system.

Here's the actual query:
select 
    preference0_.*
from
    preference preference0_, location location1_, employee employee2_
where
    preference0_.employee_id=employee2_.id 
    and preference0_.location_id=location1_.id 
    and location1_.corporation_id=41197 
    and employee2_.deleted='N' 
    and preference0_.deleted='N' 
    and
    (preference0_.id not in (
    select preference3_.id from preference preference3_, location location4_, employee employee5_ 
where preference3_.employee_id=employee5_.id and preference3_.location_id=location4_.id 
and location4_.corporation_id=41197 and employee5_.deleted='N' and preference3_.deleted='N' 
and (preference3_.startDate>'2011-11-03 00:00:00' or preference3_.endDate<'2011-11-02 00:00:00'))
    ) and
    (preference0_.employee_id in
        (select employee6_.id from employee employee6_ inner join app_user user7_ on employee6_.user_id=user7_.id 
inner join user_location userlocati8_ on user7_.id=userlocati8_.user_id, location location9_ 
where userlocati8_.location_id=location9_.id and userlocati8_.location_id=6800 and userlocati8_.deleted='N' 
and location9_.deleted='N' and employee6_.deleted='N')
    ) order by preference0_.date_created;

I have tried setting the statistics on employee.user_id to be 100 and 1000, and the rest are the default (100).

I've run both an "ANALYZE" and a "VACUUM ANALYZE" on the production system - both "generally", and on each of the above tables (employee, app_user, location, preference).

Here's an updated explain of the most recent attempt.  About 5 minutes after I analyzed them:

What else would I need to provide?


--
Anthony Presley

Re: 8.4.4, 9.0, and 9.1 Planner Differences

От
Tom Lane
Дата:
Anthony Presley <anthony@resolution.com> writes:
> I have tried setting the statistics on employee.user_id to be 100 and 1000,
> and the rest are the default (100).

> I've run both an "ANALYZE" and a "VACUUM ANALYZE" on the production system -
> both "generally", and on each of the above tables (employee, app_user,
> location, preference).

> Here's an updated explain of the most recent attempt.  About 5 minutes after
> I analyzed them:
>   http://explain.depesz.com/s/G32

Looks like the biggest estimation errors are on the location_id joins.
Maybe you should be cranking up the stats targets on those columns.

            regards, tom lane