Discussion: Fighting the planner >:-(


Fighting the planner >:-(

From: Casey Allen Shobe
Date:
So where I'm working, a performance issue was identified that affected many functions: the (SQL language) functions took an int argument and used it in a WHERE clause against a column (config_id) that was stored as varchar, leading to an inefficient cast when the query was parameterized.  We could work around that with (select $3::text) instead of just $3, but since the data is actually all numbers under 65k, we altered the data type of the column to smallint rather than editing a boatload of functions with a hacky workaround.
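To illustrate the two forms (a simplified sketch of the relevant clause only; the real function bodies are larger):

  -- As written: the int parameter is compared against the varchar column,
  -- and the cast effectively lands on the column side, defeating its index:
  WHERE config_id = $3
  -- The hacky workaround: hide the cast parameter inside a sub-select, so
  -- the comparison is against a text value and the column is left un-cast:
  WHERE config_id = (SELECT $3::text)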

For most functions, this fixed the problem.

However, it had a drastically negative impact on the query in question, which originally took 2 minutes, 45 seconds.  After adding a couple of indexes, with config_id still a varchar, that time dropped to 42 seconds.  But when the data type is smallint, the query runs for many hours - I let it run for 4.5 hours yesterday before cancelling it.

It's pretty clear that the planner is making horrid misestimates and picking a terrible plan.  I would appreciate any advice for getting this into a better state.

Here are the explain plans:

When config_id is a varchar, it executes in 42 seconds:

When config_id is a smallint, it runs too long to allow to complete, but clearly the plan is bad:

Here is the query, along with rowcounts and schema of every table involved in the query:

PostgreSQL version is 8.4, and most of our GUCs are at their defaults.

Thanks in advance for any suggestions.
--
Casey Allen Shobe
casey@shobe.info


Re: Fighting the planner >:-(

From: Casey Allen Shobe
Date:
My apologies - I included the wrong version of the query before... during testing I had tried deparameterizing a few of the input parameters.  I also accidentally left out the schema for the network_config_tot2 table from the initial paste.

Here is an updated paste, which shows the correct query in a prepare statement.  The explain plans are from explain execute hewitt_test (...): http://pgsql.privatepaste.com/00c582c840

Here is the correct explain plan for this statement (still bad): http://explain.depesz.com/s/c46



--
Casey Allen Shobe
casey@shobe.info


Re: Fighting the planner >:-(

From: Richard Huxton
Date:
On 01/02/13 17:54, Casey Allen Shobe wrote:
> My apologies - I included the wrong version of the query
> before...during  testing I had tried deparameterizing a few of the
> input parameters.  I also accidentally left out the schema for the
> network_config_tot2 table from the initial paste.
>
> Here is an updated paste, which shows the correct query in a prepare
> statement.  The explain plans are from explain execute hewitt_test (...):
> http://pgsql.privatepaste.com/00c582c840
>
> Here is the correct explain plan for this statement (still bad):
> http://explain.depesz.com/s/c46

Three quick observations before the weekend.

1. You said config_id was now "smallint" in your email, but it reads
"int" in the pastes above.
    Doesn't matter much which, but just checking we've got the right pastes.

2. The total estimated cost of both queries is about the same
(477,225.19 for the varchar, 447,623.86 for the int).
    This suggests something about your configuration doesn't match the
performance of your machine, since presumably the int version is taking
at least twice as long as the varchar one.

3. Interestingly, the config_id search on both plans seems to be using a
Bitmap Index, so I'm not sure that's the root cause. However, the
varchar version seems to have a literal string it's matching against. If
you've manually substituted in a literal value, that could be skewing
the tests.

And two things for you to try if you would:

1. Can you just check and see if any of the row estimates are horribly
off for any particular clause in the query?

2. You mention your config settings are mostly at default. What's your
work_mem and can you increase it? You can issue a SET for the current
session, no need to change it globally. If you've got the RAM try
doubling it, then double it again. See what happens to your plan then.
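Something like this, for example (the starting figure below is only a guess - double from there):

  SET work_mem = '64MB';   -- session-local; no restart or global change needed
  -- then re-prepare the statement and look at:
  -- EXPLAIN EXECUTE hewitt_test (...);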

--
   Richard Huxton
   Archonet Ltd


Re: Fighting the planner >:-(

From: Casey Allen Shobe
Date:
On Fri, Feb 1, 2013 at 1:50 PM, Richard Huxton <dev@archonet.com> wrote:
> 1. You said config_id was now "smallint" in your email, but it reads "int" in the pastes above.
>    Doesn't matter much which, but just checking we've got the right pastes.

You have the correct pastes.  I did an alter to int in an attempt to see if that made any difference.  It didn't.  It takes a couple hours to do that alter so I didn't convert it back to smallint.
 
> 2. The total estimated cost of both queries is about the same (477,225.19 for the varchar, 447,623.86 for the int).
>    This suggests something about your configuration doesn't match the performance of your machine, since presumably the int version is taking at least twice as long as the varchar one.

Configuration is pretty standard.  As for the machine, it's a VM in an ESXi setup, with dedicated resources.  The disk is very fast and is barely touched.  One CPU sits at 100% for as long as I let the query run.  There is 18GB RAM and PostgreSQL is the only service running on the machine.
 
> 3. Interestingly, the config_id search on both plans seems to be using a Bitmap Index, so I'm not sure that's the root cause. However, the varchar version seems to have a literal string it's matching against. If you've manually substituted in a literal value, that could be skewing the tests.

That's why I sent the followup re-parameterizing everything.  And the explains are on prepared statements with the parameterization.  If I just put the parameter values directly into the query and run it straight, it's fast.

> 1. Can you just check and see if any of the row estimates are horribly off for any particular clause in the query?

Yes they are.  The places where the estimate is rows=1, particularly.
 
> 2. You mention your config settings are mostly at default. What's your work_mem and can you increase it? You can issue a SET for the current session, no need to change it globally. If you've got the RAM try doubling it, then double it again. See what happens to your plan then.

21861KB.  I tried setting it to 192MB and re-preparing the same statement.  Here's the explain execute: http://explain.depesz.com/s/pZ0, which looks identical to before.

--
Casey Allen Shobe
casey@shobe.info

Re: Fighting the planner >:-(

From: Casey Allen Shobe
Date:
Rhodiumtoad on IRC helped me figure out how to move part of the query into a CTE in order to work around the planner problem.  This is a hack, but it brings the time down from many hours to 17 seconds, which is better than it was even with the better plan in the first place!  For some reason it actually gets another 2 seconds faster by putting it in a SQL function rather than using prepare/execute.

Hopefully some improvements to the planner can come from this information?

Here is the CTE version of the query:  http://pgsql.privatepaste.com/2f7fd3f669
...and here is its explain analyze:  http://explain.depesz.com/s/5ml
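The general shape of the hack, for anyone curious (a simplified sketch with made-up predicates and join keys - the real query is in the paste above): in 8.4 a CTE is always materialized, so it acts as an optimizer fence around the part the planner misestimates:

  WITH f AS (                                 -- 8.4 materializes this and plans it
      SELECT *                                -- in isolation, so the bad estimates
      FROM hewitt_1_0_factors_precalc_new h   -- can't drive the join plan outside
      WHERE h.data_type = 'Historical'        -- (predicates abbreviated)
  )
  SELECT ...                                  -- outer query as before, but reading
  FROM f                                      -- from the fenced CTE instead of the
  JOIN census_user c ON (...)                 -- table directly (join keys omitted)
  WHERE ...;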

--
Casey Allen Shobe
casey@shobe.info


Re: Fighting the planner >:-(

From: Виктор Егоров
Date:
2013/2/1 Casey Allen Shobe <casey@shobe.info>:
> Hopefully some improvements to the planner can come from this information?
>
> Here is the CTE version of the query:
> http://pgsql.privatepaste.com/2f7fd3f669
> ...and  here is it's explain analyze:  http://explain.depesz.com/s/5ml

Estimated rows for ‘hewitt_1_0_factors_precalc_new’ are 1000x less than actual,
and for ‘census_user’ the estimate is 100x less than actual.

How many rows are in those tables and what is your statistics target?
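If you're not sure of the target, this shows the cluster-wide default (per-column overrides, if any, live in pg_attribute.attstattarget):

  SHOW default_statistics_target;   -- 100 out of the box on 8.4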


--
Victor Y. Yegorov


Re: Fighting the planner >:-(

From: Casey Allen Shobe
Date:
On Fri, Feb 1, 2013 at 4:12 PM, Виктор Егоров <vyegorov@gmail.com> wrote:
> Estimated rows for ‘hewitt_1_0_factors_precalc_new’ are 1000x less than actual,
> and for ‘census_user’ the estimate is 100x less than actual.
>
> How many rows are in those tables and what is your statistics target?

Rowcounts are shown in the earlier paste link, but apparently I forgot to include the census table - hewitt_1_0_factors_precalc_new has 4,135,890 rows, and census_user has 1,846,439 rows.

Statistics target is the default at 100.

--
Casey Allen Shobe
casey@shobe.info


Re: Fighting the planner >:-(

From: Виктор Егоров
Date:
2013/2/1 Casey Allen Shobe <casey@shobe.info>:
> Rowcounts are shown in the earlier paste link, but apparently I forgot to
> include the census table - hewitt_1_0_factors_precalc_new has 4,135,890
> rows, and census_user has 1,846,439 rows.
>
> Statistics target is the default at 100.

I would try the following:
ALTER TABLE hewitt_1_0_factors_precalc_new SET STATISTICS 1000;
ALTER TABLE census_user SET STATISTICS 500;
ALTER TABLE census_output SET STATISTICS 500;
and analyze them afterwards. I hope I guessed the ‘census_output’ name correctly.

And could you kindly share the plan after:
SET enable_nestloop TO off;


--
Victor Y. Yegorov


Re: Fighting the planner >:-(

From: Casey Allen Shobe
Date:
I'll get back to you on this Monday - I'm heading home for the week now.

However, I was unable to adjust the statistics target using that command:

alter table opportunity.census_user set statistics 500;
ERROR:  syntax error at or near "statistics"
LINE 1: alter table opportunity.census_user set statistics 500;
                                             ^



--
Casey Allen Shobe
casey@shobe.info


Re: Fighting the planner >:-(

From: Виктор Егоров
Date:
2013/2/2 Casey Allen Shobe <casey@shobe.info>:
> However I was unable to adjust the statistics target using that command:
>
> alter table opportunity.census_user set statistics 500;
> ERROR:  syntax error at or near "statistics"
> LINE 1: alter table opportunity.census_user set statistics 500;

I'm sorry for this, my bad. Try the following:

ALTER TABLE census_user ALTER parent_id SET STATISTICS 500;
ALTER TABLE census_user ALTER stakeholder_code SET STATISTICS 500;

Do the same for all the columns in the ‘hewitt_1_0_factors_precalc_new_pkey’ index,
setting the target to 1000. I would also update the target for the columns in
this filter:
Filter: (((h.discount_type)::text = ANY ('{"Avg Comp",Blue}'::text[]))
AND ((h.data_type)::text = 'Historical'::text) AND ((h.source)::text =
'Hewitt 1.0'::text)
AND ((h.service_catg_scheme)::text = '11+3'::text))
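Spelled out for those filter columns, that would be something like this (assuming
the ‘h’ alias is hewitt_1_0_factors_precalc_new, as the estimates above suggest):

  ALTER TABLE hewitt_1_0_factors_precalc_new ALTER discount_type       SET STATISTICS 1000;
  ALTER TABLE hewitt_1_0_factors_precalc_new ALTER data_type           SET STATISTICS 1000;
  ALTER TABLE hewitt_1_0_factors_precalc_new ALTER source              SET STATISTICS 1000;
  ALTER TABLE hewitt_1_0_factors_precalc_new ALTER service_catg_scheme SET STATISTICS 1000;
  ANALYZE hewitt_1_0_factors_precalc_new;   -- new targets take effect at the next analyze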


--
Victor Y. Yegorov