Обсуждение: how to change the index chosen in plan?

Поиск
Список
Период
Сортировка

how to change the index chosen in plan?

От
Rural Hunter
Дата:
I have a query like this:
select a.* from a inner join b on a.aid=b.aid where a.col1=33 a.col2=44
and b.bid=8
postgresql selected the index on a.col1 then selected the index on
b.bid. But in my situation, I know that the query will be faster if it
chose the index on b.bid first since there are only a few rows with
value 8. So I re-wrote the query as below:
select a.* from a where a.aid in (select aid from b where bid=8) and
a.col1=33 a.col2=44
But surprisingly, postgresql didn't change the plan. it still chose to
index scan on a.col1. How can I re-wirte the query so postgresql will
scan on b.bid first?

Re: how to change the index chosen in plan?

От
Tom Lane
Дата:
Rural Hunter <ruralhunter@gmail.com> writes:
> I have a query like this:
> select a.* from a inner join b on a.aid=b.aid where a.col1=33 a.col2=44
> and b.bid=8
> postgresql selected the index on a.col1 then selected the index on
> b.bid. But in my situation, I know that the query will be faster if it
> chose the index on b.bid first since there are only a few rows with
> value 8.

If you know that and the planner doesn't, maybe ANALYZE is called for.

            regards, tom lane

Re: how to change the index chosen in plan?

От
Rural Hunter
Дата:
No, it's not the analyze problem. For some other values on b.bid such
as 9, 10, the plan is fine since there a a lot of rows in table b for
them. But for some specific values such as 8 I want the plan changed.

于2012年6月8日 22:10:58,Tom Lane写到:
> Rural Hunter <ruralhunter@gmail.com> writes:
>> I have a query like this:
>> select a.* from a inner join b on a.aid=b.aid where a.col1=33 a.col2=44
>> and b.bid=8
>> postgresql selected the index on a.col1 then selected the index on
>> b.bid. But in my situation, I know that the query will be faster if it
>> chose the index on b.bid first since there are only a few rows with
>> value 8.
>
> If you know that and the planner doesn't, maybe ANALYZE is called for.
>
>             regards, tom lane
>



Re: how to change the index chosen in plan?

От
"Kevin Grittner"
Дата:
Rural Hunter <ruralhunter@gmail.com> wrote:
> 于2012年6月8日 22:10:58,Tom Lane写到:
>> Rural Hunter <ruralhunter@gmail.com> writes:
>>> I have a query like this:
>>> select a.* from a inner join b on a.aid=b.aid where a.col1=33
>>> a.col2=44 and b.bid=8
>>> postgresql selected the index on a.col1 then selected the index
>>> on b.bid. But in my situation, I know that the query will be
>>> faster if it chose the index on b.bid first since there are only
>>> a few rows with value 8.
>>
>> If you know that and the planner doesn't, maybe ANALYZE is called
>> for.
>>
> No, it's not the analyze problem.

So you ran ANALYZE and retried?  If not, please do.

> For some other values on b.bid such as 9, 10, the plan is fine
> since there a a lot of rows in table b for them.

So it uses the same plan regardless of the number of rows in table b
for the value?  That sure *sounds* like you need to run ANALYZE,
possibly after adjusting the statistics target for a column or two.

> But for some specific values such as 8 I want the plan changed.

If you approach it from that line of thought, you will be unlikely
to reach a good long-term solution.  PostgreSQL has a costing model
to determine which plan is expected to be cheapest (fastest).  This
is based on statistics gathered during ANALYZE and on costing
factors.  Generally, if it's not choosing the fastest plan, you
aren't running ANALYZE frequently enough or with a fine-grained
enough statistics target _or_ you need to adjust your costing
factors to better model your actual costs.

You haven't given us a lot of clues about which it is that you need
to do, but there is *some* suggestion that you need to ANALYZE.  If
you *try* that and it doesn't solve your problem, please read this
page and provide more information:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin

Re: how to change the index chosen in plan?

От
Rural Hunter
Дата:
Hi Kevin,
Thanks for your detailed explanation.

于 2012/6/8 22:37, Kevin Grittner 写道:
> Rural Hunter <ruralhunter@gmail.com> wrote:
>> 于2012年6月8日 22:10:58,Tom Lane写到:
>>> Rural Hunter <ruralhunter@gmail.com> writes:
>>>> I have a query like this:
>>>> select a.* from a inner join b on a.aid=b.aid where a.col1=33
>>>> a.col2=44 and b.bid=8
>>>> postgresql selected the index on a.col1 then selected the index
>>>> on b.bid. But in my situation, I know that the query will be
>>>> faster if it chose the index on b.bid first since there are only
>>>> a few rows with value 8.
>>> If you know that and the planner doesn't, maybe ANALYZE is called
>>> for.
>>>
>> No, it's not the analyze problem.
>
> So you ran ANALYZE and retried?  If not, please do.
Yes, I did.
>
>> For some other values on b.bid such as 9, 10, the plan is fine
>> since there a a lot of rows in table b for them.
>
> So it uses the same plan regardless of the number of rows in table b
> for the value?
yes.
> That sure *sounds* like you need to run ANALYZE,
> possibly after adjusting the statistics target for a column or two.
  How can adjust the statistics target?
>
>> But for some specific values such as 8 I want the plan changed.
>
> If you approach it from that line of thought, you will be unlikely
> to reach a good long-term solution.  PostgreSQL has a costing model
> to determine which plan is expected to be cheapest (fastest).  This
> is based on statistics gathered during ANALYZE and on costing
> factors.  Generally, if it's not choosing the fastest plan, you
> aren't running ANALYZE frequently enough or with a fine-grained
> enough statistics target _or_ you need to adjust your costing
> factors to better model your actual costs.
>
> You haven't given us a lot of clues about which it is that you need
> to do, but there is *some* suggestion that you need to ANALYZE.  If
> you *try* that and it doesn't solve your problem, please read this
> page and provide more information:
>
> http://wiki.postgresql.org/wiki/SlowQueryQuestions
Sorry the actual tables and query are very complicated so I just
simplified the problem with my understanding. I rechecked the query and
found it should be simplified like this:
select a.* from a inner join b on a.aid=b.aid where a.col1=33 and
a.col2=44 and a.time<now() and b.bid=8 order by a.time limit 10
There is an index on (a.col1,a.col2,a.time). If I remove the order-by
clause, I can get the plan as I expected. I think that's why postgresql
selected that index. But still I want the index on b.bid selected first
for value 8 since there are only several rows with bid 8. though for
other normal values there might be several kilo to million rows.
>
> -Kevin
>



Re: how to change the index chosen in plan?

От
"Kevin Grittner"
Дата:
Rural Hunter <ruralhunter@gmail.com> wrote:

>   How can adjust the statistics target?

default_statistics_target

http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER

or ALTER TABLE x ALTER COLUMN y SET STATISTICS n

http://www.postgresql.org/docs/current/interactive/sql-altertable.html

> Sorry the actual tables and query are very complicated so I just
> simplified the problem with my understanding. I rechecked the
> query and found it should be simplified like this:
> select a.* from a inner join b on a.aid=b.aid where a.col1=33 and
> a.col2=44 and a.time<now() and b.bid=8 order by a.time limit 10
> There is an index on (a.col1,a.col2,a.time). If I remove the
> order-by clause, I can get the plan as I expected. I think that's
> why postgresql selected that index.

Sounds like it expects the sort to be expensive, which means it
probably expects a large number of rows.  An EXPLAIN ANALYZE of the
query with and without the ORDER BY might be instructive.  It would
also help to know what version of PostgreSQL you have and how it is
configured, all of which shows up in the results of the query on
this page:

http://wiki.postgresql.org/wiki/Server_Configuration

> But still I want the index on b.bid selected first
> for value 8 since there are only several rows with bid 8. though
> for other normal values there might be several kilo to million
> rows.

An EXPLAIN ANALYZE of one where you think the plan is a good choice
might also help.

Oh, and just to be sure -- are you actually running queries with the
literals like you show, or are you using prepared statements with
placeholders and plugging the values in after the statement is
prepared?  Sample code, if possible, might help point to or
eliminate issues with a cached plan.  If you're running through a
cached plan, there is no way for it to behave differently based on
the value plugged into the query -- the plan has already been set
before you get to that point.

-Kevin

Re: how to change the index chosen in plan?

От
Rural Hunter
Дата:
于 2012/6/9 0:39, Kevin Grittner 写道:
> Rural Hunter <ruralhunter@gmail.com> wrote:
>
>>    How can adjust the statistics target?
>
> default_statistics_target
>
> http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
>
> or ALTER TABLE x ALTER COLUMN y SET STATISTICS n
>
> http://www.postgresql.org/docs/current/interactive/sql-altertable.html
Thanks, I will check detail.
>
>> Sorry the actual tables and query are very complicated so I just
>> simplified the problem with my understanding. I rechecked the
>> query and found it should be simplified like this:
>> select a.* from a inner join b on a.aid=b.aid where a.col1=33 and
>> a.col2=44 and a.time<now() and b.bid=8 order by a.time limit 10
>> There is an index on (a.col1,a.col2,a.time). If I remove the
>> order-by clause, I can get the plan as I expected. I think that's
>> why postgresql selected that index.
>
> Sounds like it expects the sort to be expensive, which means it
> probably expects a large number of rows.  An EXPLAIN ANALYZE of the
> query with and without the ORDER BY might be instructive.  It would
> also help to know what version of PostgreSQL you have and how it is
> configured, all of which shows up in the results of the query on
> this page:
>
> http://wiki.postgresql.org/wiki/Server_Configuration
>
Here is the output:
name | current_setting

-----------------------------+---------------------------------------------------------------------------------------------------------------
version | PostgreSQL 9.1.3 on x86_64-unknown-linux-gnu, compiled by gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
archive_command | test ! -f /dbbk/postgres/logarch/%f.gz && gzip -c %p
 >/dbbk/postgres/logarch/%f.gz
archive_mode | on
autovacuum | on
autovacuum_freeze_max_age | 2000000000
checkpoint_segments | 20
client_encoding | UTF8
effective_cache_size | 150GB
full_page_writes | off
lc_collate | zh_CN.utf8
lc_ctype | zh_CN.utf8
listen_addresses | *
log_autovacuum_min_duration | 30min
log_destination | stderr
log_line_prefix | %t [%u@%h]
log_min_duration_statement | 10s
log_statement | ddl
logging_collector | on
maintenance_work_mem | 10GB
max_connections | 2500
max_stack_depth | 2MB
max_wal_senders | 1
port | 3500
server_encoding | UTF8
shared_buffers | 60GB
synchronous_commit | off
TimeZone | PRC
track_activities | on
track_counts | on
vacuum_freeze_table_age | 1000000000
wal_buffers | 16MB
wal_level | hot_standby
work_mem | 8MB
(33 rows)

>> But still I want the index on b.bid selected first
>> for value 8 since there are only several rows with bid 8. though
>> for other normal values there might be several kilo to million
>> rows.
>
> An EXPLAIN ANALYZE of one where you think the plan is a good choice
> might also help.
Ok, I get out a simple version of the actualy query. Here is the explain
anaylze without order-by, which is I wanted:
http://explain.depesz.com/s/p1p

Another with the order-by which I want to avoid:
http://explain.depesz.com/s/ujU

This is the count of rows in article_label with value 3072(which I
referred as table b in previous mail):
# select count(*) from article_label where lid=3072;
count
-------
56
(1 row)

>
> Oh, and just to be sure -- are you actually running queries with the
> literals like you show, or are you using prepared statements with
> placeholders and plugging the values in after the statement is
> prepared?  Sample code, if possible, might help point to or
> eliminate issues with a cached plan.  If you're running through a
> cached plan, there is no way for it to behave differently based on
> the value plugged into the query -- the plan has already been set
> before you get to that point.
Yes, I ran the query directly wih psql.
>
> -Kevin
>



Re: how to change the index chosen in plan?

От
"Kevin Grittner"
Дата:
Rural Hunter  wrote:
> 于 2012/6/9 0:39, Kevin Grittner 写道:

> name | current_setting

> full_page_writes | off

There may be exceptions on some file systems, but generally turning
this off leaves you vulnerable to possible database corruption if you
OS or hardware crashes.

> max_connections | 2500

Yikes!  You may want to look in to a connection pooler which can take
2500 client connections and funnel them into a much smaller number of
database connections.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections

> shared_buffers | 60GB

You might want to compare your performance with this setting against
a smaller setting.  Many benchmarks have shown settings about a
certain point (like 8MB to 12 MB) to be counter-productive, although
a few have shown increased performance going past that.  It really
seems to depend on your hardware and workload, so you have to test to
find the "sweet spot" for your environment.

> work_mem | 8MB

With so many connections, I can understand being this low.  One of
the advantages of using connection pooling to funnel your user
connections into fewer database conncections is that you can boost
this, which might help considerably with some types of queries.

None of the above, however, really gets to your immediate problem.
What is most significant about your settings with regard to the
problem query is what's *not* in that list.  You appear to have a
heavily cached active data set, based on the row counts and timings
in EXPLAIN ANALYZE output, and you have not adjusted your cost
factors, which assume less caching.

Try setting these on a connection and then running your queries on
that connection.

set seq_page_cost = 0.1;
set random_page_cost = 0.1;
set cpu_tuple_cost = 0.03;

> Ok, I get out a simple version of the actualy query. Here is the
> explain anaylze without order-by, which is I wanted:
> http://explain.depesz.com/s/p1p
>
> Another with the order-by which I want to avoid:
> http://explain.depesz.com/s/ujU

You neglected to mention the LIMIT clause in your earlier
presentation of the problem.  A LIMIT can have a big impact on plan
choice.  Is the LIMIT 10 part of the actual query you want to
optimize?  Either way it would be helpful to see the EXPLAIN ANALYZE
output for the the query without the LIMIT clause.

-Kevin

Re: how to change the index chosen in plan?

От
"Kevin Grittner"
Дата:
"Kevin Grittner"  wrote:

>> shared_buffers | 60GB
>
> You might want to compare your performance with this setting against
> a smaller setting. Many benchmarks have shown settings about a
> certain point (like 8MB to 12 MB) to be counter-productive, although
> a few have shown increased performance going past that. It really
> seems to depend on your hardware and workload, so you have to test to
> find the "sweet spot" for your environment.

Er, I meant "8GB to 12GB", not MB.  Sorry.

-Kevin

Re: how to change the index chosen in plan?

От
Rural Hunter
Дата:
于 2012/6/9 22:39, Kevin Grittner 写道:
> Rural Hunter  wrote:
>> 于 2012/6/9 0:39, Kevin Grittner 写道:
>
>> name | current_setting
>
>> full_page_writes | off
>
> There may be exceptions on some file systems, but generally turning
> this off leaves you vulnerable to possible database corruption if you
> OS or hardware crashes.
Yes, I understand. My situation is, the io utiliztion of my system is
quite high so I turned this off to reduce the io utilization. We have a
replication server to serve as the hot standby if there is any issue on
the primary. So currently I think it's acceptable option to me.
>
>> max_connections | 2500
>
> Yikes!  You may want to look in to a connection pooler which can take
> 2500 client connections and funnel them into a much smaller number of
> database connections.
>
> https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
>
>> shared_buffers | 60GB
>
> You might want to compare your performance with this setting against
> a smaller setting.  Many benchmarks have shown settings about a
> certain point (like 8MB to 12 MB) to be counter-productive, although
> a few have shown increased performance going past that.  It really
> seems to depend on your hardware and workload, so you have to test to
> find the "sweet spot" for your environment.
>
>> work_mem | 8MB
>
> With so many connections, I can understand being this low.  One of
> the advantages of using connection pooling to funnel your user
> connections into fewer database conncections is that you can boost
> this, which might help considerably with some types of queries.
>
> None of the above, however, really gets to your immediate problem.
> What is most significant about your settings with regard to the
> problem query is what's *not* in that list.  You appear to have a
> heavily cached active data set, based on the row counts and timings
> in EXPLAIN ANALYZE output, and you have not adjusted your cost
> factors, which assume less caching.
Thanks for the advices. As of now we don't see overall performance issue
on the db. I will adjust these settings based on your advices if we
begin to see overall performance degrade.
>
> Try setting these on a connection and then running your queries on
> that connection.
>
> set seq_page_cost = 0.1;
> set random_page_cost = 0.1;
> set cpu_tuple_cost = 0.03;
I tried these settings but don't see noticeable improvement. The plan is
not changed.
>
>> Ok, I get out a simple version of the actualy query. Here is the
>> explain anaylze without order-by, which is I wanted:
>> http://explain.depesz.com/s/p1p
>>
>> Another with the order-by which I want to avoid:
>> http://explain.depesz.com/s/ujU
>
> You neglected to mention the LIMIT clause in your earlier
> presentation of the problem.  A LIMIT can have a big impact on plan
> choice.  Is the LIMIT 10 part of the actual query you want to
> optimize?  Either way it would be helpful to see the EXPLAIN ANALYZE
> output for the the query without the LIMIT clause.
Yes, sorry for that. I do need the limit clause in the query to show
only part of the results to the user(common multi-pages view). Without
the limit clause, I got the plan as I wanted:
http://explain.depesz.com/s/Qdu

So looks either I remove the order-by or limit clause, I can get what I
wanted. But I do need the both in the query...

>
> -Kevin
>



Re: how to change the index chosen in plan?

От
"Kevin Grittner"
Дата:
Rural Hunter  wrote:
> 于 2012/6/9 22:39, Kevin Grittner 写道:

>> You neglected to mention the LIMIT clause in your earlier
>> presentation of the problem. A LIMIT can have a big impact on plan
>> choice. Is the LIMIT 10 part of the actual query you want to
>> optimize? Either way it would be helpful to see the EXPLAIN
>> ANALYZE output for the the query without the LIMIT clause.
> Yes, sorry for that. I do need the limit clause in the query to
> show only part of the results to the user(common multi-pages view).
> Without the limit clause, I got the plan as I wanted:
> http://explain.depesz.com/s/Qdu
>
> So looks either I remove the order-by or limit clause, I can get
> what I wanted. But I do need the both in the query...

Well, we're still doing diagnostic steps.  What this one shows is
that your statistics are leading the planner to believe that there
will be 20846 rows with lid = 3072, while there are really only 62.
If it knew the actual number I doubt it would choose the slower plan.

The next thing I would try is:

ALTER TABLE article_label ALTER COLUMN lid SET STATISTICS = 5000;
ANALYZE article_label;

Then try the query without LIMIT and see if you get something on the
right order of magnitude comparing the estimated rows to actual on
that index scan.  You can try different STATISTICS values until you
get the lowest value that puts the estimate in the right
neighborhood.  Higher settings will increase plan time; lower
settings may lead to bad plans.

Once you've got a decent estimate, try with the ORDER BY and LIMIT
again.

If you have a hard time getting a good estimate even with a high
statistics target, you should investigate whether you have extreme
table bloat.

-Kevin

Re: how to change the index chosen in plan?

От
Rural Hunter
Дата:
于 2012/6/11 20:07, Kevin Grittner 写道:
> Rural Hunter  wrote:
>> 于 2012/6/9 22:39, Kevin Grittner 写道:
>
>>> You neglected to mention the LIMIT clause in your earlier
>>> presentation of the problem. A LIMIT can have a big impact on plan
>>> choice. Is the LIMIT 10 part of the actual query you want to
>>> optimize? Either way it would be helpful to see the EXPLAIN
>>> ANALYZE output for the the query without the LIMIT clause.
>> Yes, sorry for that. I do need the limit clause in the query to
>> show only part of the results to the user(common multi-pages view).
>> Without the limit clause, I got the plan as I wanted:
>> http://explain.depesz.com/s/Qdu
>>
>> So looks either I remove the order-by or limit clause, I can get
>> what I wanted. But I do need the both in the query...
>
> Well, we're still doing diagnostic steps.  What this one shows is
> that your statistics are leading the planner to believe that there
> will be 20846 rows with lid = 3072, while there are really only 62.
> If it knew the actual number I doubt it would choose the slower plan.
>
> The next thing I would try is:
>
> ALTER TABLE article_label ALTER COLUMN lid SET STATISTICS = 5000;
> ANALYZE article_label;
>
> Then try the query without LIMIT and see if you get something on the
> right order of magnitude comparing the estimated rows to actual on
> that index scan.  You can try different STATISTICS values until you
> get the lowest value that puts the estimate in the right
> neighborhood.  Higher settings will increase plan time; lower
> settings may lead to bad plans.
>
> Once you've got a decent estimate, try with the ORDER BY and LIMIT
> again.
I set statistics to 5000 and got estimated row count 559. Set statistics
to 8000 and got estimated row count 393.  At this step, I run the query
with both order-by and limit clause and got the expected result.
Kevin, Thank you very much for your patience and step-by-step guidance!
I learnt a lot from this case!
>
> If you have a hard time getting a good estimate even with a high
> statistics target, you should investigate whether you have extreme
> table bloat.
>
> -Kevin
>