Discussion: how to change the index chosen in a plan?
I have a query like this:

select a.* from a inner join b on a.aid=b.aid
where a.col1=33 and a.col2=44 and b.bid=8

PostgreSQL selected the index on a.col1, then the index on b.bid. But in
my situation, I know the query will be faster if it chooses the index on
b.bid first, since there are only a few rows with value 8. So I rewrote
the query as below:

select a.* from a
where a.aid in (select aid from b where bid=8)
  and a.col1=33 and a.col2=44

But surprisingly, PostgreSQL didn't change the plan: it still chose an
index scan on a.col1. How can I rewrite the query so PostgreSQL will
scan on b.bid first?
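For reference, this is how one could compare what plan each form of the query actually gets (a sketch; the table and column names are the simplified ones from the post):

```sql
-- Compare the plans produced by the two query forms.
EXPLAIN ANALYZE
SELECT a.* FROM a INNER JOIN b ON a.aid = b.aid
WHERE a.col1 = 33 AND a.col2 = 44 AND b.bid = 8;

EXPLAIN ANALYZE
SELECT a.* FROM a
WHERE a.aid IN (SELECT aid FROM b WHERE bid = 8)
  AND a.col1 = 33 AND a.col2 = 44;
```

Note that PostgreSQL flattens an `IN (subquery)` into a join during planning, which is why the rewrite tends to produce exactly the same plan as the explicit join.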
Rural Hunter <ruralhunter@gmail.com> writes:
> I have a query like this:
> select a.* from a inner join b on a.aid=b.aid
> where a.col1=33 and a.col2=44 and b.bid=8
> postgresql selected the index on a.col1 then selected the index on
> b.bid. But in my situation, I know that the query will be faster if it
> chose the index on b.bid first since there are only a few rows with
> value 8.

If you know that and the planner doesn't, maybe ANALYZE is called for.

			regards, tom lane
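A minimal way to act on this suggestion, assuming the simplified table names from the post: refresh the statistics and verify when the tables were last analyzed.

```sql
-- Refresh planner statistics for both tables.
ANALYZE a;
ANALYZE b;

-- Check when each table was last analyzed (manually or by autovacuum).
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('a', 'b');
```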
No, it's not an analyze problem. For some other values of b.bid, such
as 9 or 10, the plan is fine since there are a lot of rows in table b
for them. But for some specific values such as 8 I want the plan
changed.

On 2012-06-08 22:10:58, Tom Lane wrote:
> Rural Hunter <ruralhunter@gmail.com> writes:
>> I have a query like this:
>> select a.* from a inner join b on a.aid=b.aid
>> where a.col1=33 and a.col2=44 and b.bid=8
>> postgresql selected the index on a.col1 then selected the index on
>> b.bid. But in my situation, I know that the query will be faster if it
>> chose the index on b.bid first since there are only a few rows with
>> value 8.
>
> If you know that and the planner doesn't, maybe ANALYZE is called for.
>
> 			regards, tom lane
Rural Hunter <ruralhunter@gmail.com> wrote:
> On 2012-06-08 22:10:58, Tom Lane wrote:
>> Rural Hunter <ruralhunter@gmail.com> writes:
>>> I have a query like this:
>>> select a.* from a inner join b on a.aid=b.aid
>>> where a.col1=33 and a.col2=44 and b.bid=8
>>> postgresql selected the index on a.col1 then selected the index
>>> on b.bid. But in my situation, I know that the query will be
>>> faster if it chose the index on b.bid first since there are only
>>> a few rows with value 8.
>>
>> If you know that and the planner doesn't, maybe ANALYZE is called
>> for.
>
> No, it's not the analyze problem.

So you ran ANALYZE and retried? If not, please do.

> For some other values on b.bid such as 9, 10, the plan is fine
> since there are a lot of rows in table b for them.

So it uses the same plan regardless of the number of rows in table b
for the value? That sure *sounds* like you need to run ANALYZE,
possibly after adjusting the statistics target for a column or two.

> But for some specific values such as 8 I want the plan changed.

If you approach it from that line of thought, you will be unlikely
to reach a good long-term solution. PostgreSQL has a costing model
to determine which plan is expected to be cheapest (fastest). This
is based on statistics gathered during ANALYZE and on costing
factors. Generally, if it's not choosing the fastest plan, you
aren't running ANALYZE frequently enough or with a fine-grained
enough statistics target _or_ you need to adjust your costing
factors to better model your actual costs.

You haven't given us a lot of clues about which it is that you need
to do, but there is *some* suggestion that you need to ANALYZE. If
you *try* that and it doesn't solve your problem, please read this
page and provide more information:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin
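One way to see what the planner currently believes about a skewed column like b.bid is to look at the pg_stats view (a sketch; 'b' and 'bid' are the simplified names from the thread):

```sql
-- Inspect the gathered statistics for the skewed column.
-- n_distinct plus the most common values and their frequencies
-- determine the row estimate for a predicate like "bid = 8".
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'b' AND attname = 'bid';
```

If the value 8 is absent from most_common_vals, the planner falls back to an average-frequency estimate, which can badly overestimate rare values in a skewed distribution.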
Hi Kevin, thanks for your detailed explanation.

On 2012/6/8 22:37, Kevin Grittner wrote:
> Rural Hunter <ruralhunter@gmail.com> wrote:
>> On 2012-06-08 22:10:58, Tom Lane wrote:
>>> Rural Hunter <ruralhunter@gmail.com> writes:
>>>> I have a query like this:
>>>> select a.* from a inner join b on a.aid=b.aid
>>>> where a.col1=33 and a.col2=44 and b.bid=8
>>>> postgresql selected the index on a.col1 then selected the index
>>>> on b.bid. But in my situation, I know that the query will be
>>>> faster if it chose the index on b.bid first since there are only
>>>> a few rows with value 8.
>>> If you know that and the planner doesn't, maybe ANALYZE is called
>>> for.
>> No, it's not the analyze problem.
>
> So you ran ANALYZE and retried? If not, please do.

Yes, I did.

>> For some other values on b.bid such as 9, 10, the plan is fine
>> since there are a lot of rows in table b for them.
>
> So it uses the same plan regardless of the number of rows in table b
> for the value?

Yes.

> That sure *sounds* like you need to run ANALYZE,
> possibly after adjusting the statistics target for a column or two.

How can I adjust the statistics target?

>> But for some specific values such as 8 I want the plan changed.
>
> If you approach it from that line of thought, you will be unlikely
> to reach a good long-term solution. PostgreSQL has a costing model
> to determine which plan is expected to be cheapest (fastest). This
> is based on statistics gathered during ANALYZE and on costing
> factors. Generally, if it's not choosing the fastest plan, you
> aren't running ANALYZE frequently enough or with a fine-grained
> enough statistics target _or_ you need to adjust your costing
> factors to better model your actual costs.
>
> You haven't given us a lot of clues about which it is that you need
> to do, but there is *some* suggestion that you need to ANALYZE. If
> you *try* that and it doesn't solve your problem, please read this
> page and provide more information:
>
> http://wiki.postgresql.org/wiki/SlowQueryQuestions

Sorry, the actual tables and query are very complicated, so I just
simplified the problem from my understanding. I rechecked the query
and found it should be simplified like this:

select a.* from a inner join b on a.aid=b.aid
where a.col1=33 and a.col2=44 and a.time<now() and b.bid=8
order by a.time limit 10

There is an index on (a.col1,a.col2,a.time). If I remove the order-by
clause, I get the plan I expected; I think that's why postgresql
selected that index. But I still want the index on b.bid used first
for value 8, since there are only several rows with bid 8, though for
other normal values there might be thousands to millions of rows.

> -Kevin
Rural Hunter <ruralhunter@gmail.com> wrote:

> How can I adjust the statistics target?

default_statistics_target

http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER

or ALTER TABLE x ALTER COLUMN y SET STATISTICS n

http://www.postgresql.org/docs/current/interactive/sql-altertable.html

> Sorry, the actual tables and query are very complicated, so I just
> simplified the problem from my understanding. I rechecked the
> query and found it should be simplified like this:
> select a.* from a inner join b on a.aid=b.aid where a.col1=33 and
> a.col2=44 and a.time<now() and b.bid=8 order by a.time limit 10
> There is an index on (a.col1,a.col2,a.time). If I remove the
> order-by clause, I get the plan I expected. I think that's
> why postgresql selected that index.

Sounds like it expects the sort to be expensive, which means it
probably expects a large number of rows. An EXPLAIN ANALYZE of the
query with and without the ORDER BY might be instructive. It would
also help to know what version of PostgreSQL you have and how it is
configured, all of which shows up in the results of the query on
this page:

http://wiki.postgresql.org/wiki/Server_Configuration

> But still I want the index on b.bid selected first
> for value 8 since there are only several rows with bid 8, though
> for other normal values there might be thousands to millions of
> rows.

An EXPLAIN ANALYZE of one where you think the plan is a good choice
might also help.

Oh, and just to be sure -- are you actually running queries with the
literals like you show, or are you using prepared statements with
placeholders and plugging the values in after the statement is
prepared? Sample code, if possible, might help point to or
eliminate issues with a cached plan. If you're running through a
cached plan, there is no way for it to behave differently based on
the value plugged into the query -- the plan has already been set
before you get to that point.

-Kevin
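A sketch of the two approaches mentioned here (the target values are illustrative, not recommendations; 'b' and 'bid' are the simplified names from the thread):

```sql
-- Raise the statistics target for one column only, then re-analyze.
ALTER TABLE b ALTER COLUMN bid SET STATISTICS 1000;
ANALYZE b;

-- Or raise the default; ANALYZE run afterwards in this session
-- uses the new target for all columns without a per-column setting.
SET default_statistics_target = 500;
ANALYZE b;
```

The per-column form is usually preferable: it confines the extra ANALYZE and planning cost to the one skewed column that needs it.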
On 2012/6/9 0:39, Kevin Grittner wrote:
> Rural Hunter <ruralhunter@gmail.com> wrote:
>
>> How can I adjust the statistics target?
>
> default_statistics_target
>
> http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
>
> or ALTER TABLE x ALTER COLUMN y SET STATISTICS n
>
> http://www.postgresql.org/docs/current/interactive/sql-altertable.html

Thanks, I will check the details.

>> Sorry, the actual tables and query are very complicated, so I just
>> simplified the problem from my understanding. I rechecked the
>> query and found it should be simplified like this:
>> select a.* from a inner join b on a.aid=b.aid where a.col1=33 and
>> a.col2=44 and a.time<now() and b.bid=8 order by a.time limit 10
>> There is an index on (a.col1,a.col2,a.time). If I remove the
>> order-by clause, I get the plan I expected. I think that's
>> why postgresql selected that index.
>
> Sounds like it expects the sort to be expensive, which means it
> probably expects a large number of rows. An EXPLAIN ANALYZE of the
> query with and without the ORDER BY might be instructive. It would
> also help to know what version of PostgreSQL you have and how it is
> configured, all of which shows up in the results of the query on
> this page:
>
> http://wiki.postgresql.org/wiki/Server_Configuration

Here is the output:

            name             | current_setting
-----------------------------+-----------------------------------------------
 version                     | PostgreSQL 9.1.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
 archive_command             | test ! -f /dbbk/postgres/logarch/%f.gz && gzip -c %p >/dbbk/postgres/logarch/%f.gz
 archive_mode                | on
 autovacuum                  | on
 autovacuum_freeze_max_age   | 2000000000
 checkpoint_segments         | 20
 client_encoding             | UTF8
 effective_cache_size        | 150GB
 full_page_writes            | off
 lc_collate                  | zh_CN.utf8
 lc_ctype                    | zh_CN.utf8
 listen_addresses            | *
 log_autovacuum_min_duration | 30min
 log_destination             | stderr
 log_line_prefix             | %t [%u@%h]
 log_min_duration_statement  | 10s
 log_statement               | ddl
 logging_collector           | on
 maintenance_work_mem        | 10GB
 max_connections             | 2500
 max_stack_depth             | 2MB
 max_wal_senders             | 1
 port                        | 3500
 server_encoding             | UTF8
 shared_buffers              | 60GB
 synchronous_commit          | off
 TimeZone                    | PRC
 track_activities            | on
 track_counts                | on
 vacuum_freeze_table_age     | 1000000000
 wal_buffers                 | 16MB
 wal_level                   | hot_standby
 work_mem                    | 8MB
(33 rows)

>> But still I want the index on b.bid selected first
>> for value 8 since there are only several rows with bid 8, though
>> for other normal values there might be thousands to millions of
>> rows.
>
> An EXPLAIN ANALYZE of one where you think the plan is a good choice
> might also help.

Ok, I got a simple version of the actual query. Here is the explain
analyze without order-by, which is what I wanted:
http://explain.depesz.com/s/p1p

Another with the order-by, which I want to avoid:
http://explain.depesz.com/s/ujU

This is the count of rows in article_label with value 3072 (which I
referred to as table b in my previous mail):

# select count(*) from article_label where lid=3072;
 count
-------
    56
(1 row)

> Oh, and just to be sure -- are you actually running queries with the
> literals like you show, or are you using prepared statements with
> placeholders and plugging the values in after the statement is
> prepared? Sample code, if possible, might help point to or
> eliminate issues with a cached plan. If you're running through a
> cached plan, there is no way for it to behave differently based on
> the value plugged into the query -- the plan has already been set
> before you get to that point.

Yes, I ran the query directly with psql.

> -Kevin
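For completeness, the cached-plan behavior Kevin asks about can be demonstrated directly (a sketch with the simplified names; on 9.1, a prepared statement is planned once without knowledge of the parameter value, so both executions share one generic plan — 9.2 and later can choose custom plans per value):

```sql
PREPARE q(int) AS
  SELECT a.* FROM a INNER JOIN b ON a.aid = b.aid
  WHERE a.col1 = 33 AND a.col2 = 44 AND b.bid = $1;

-- On 9.1 both of these show the same generic plan,
-- regardless of how selective the value is.
EXPLAIN EXECUTE q(8);
EXPLAIN EXECUTE q(9);

DEALLOCATE q;
```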
Rural Hunter wrote:
> On 2012/6/9 0:39, Kevin Grittner wrote:

> name | current_setting

> full_page_writes | off

There may be exceptions on some file systems, but generally turning
this off leaves you vulnerable to possible database corruption if
your OS or hardware crashes.

> max_connections | 2500

Yikes! You may want to look into a connection pooler which can take
2500 client connections and funnel them into a much smaller number of
database connections.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections

> shared_buffers | 60GB

You might want to compare your performance with this setting against
a smaller setting. Many benchmarks have shown settings above a
certain point (like 8MB to 12 MB) to be counter-productive, although
a few have shown increased performance going past that. It really
seems to depend on your hardware and workload, so you have to test to
find the "sweet spot" for your environment.

> work_mem | 8MB

With so many connections, I can understand being this low. One of
the advantages of using connection pooling to funnel your user
connections into fewer database connections is that you can boost
this, which might help considerably with some types of queries.

None of the above, however, really gets to your immediate problem.
What is most significant about your settings with regard to the
problem query is what's *not* in that list. You appear to have a
heavily cached active data set, based on the row counts and timings
in the EXPLAIN ANALYZE output, and you have not adjusted your cost
factors, which assume less caching. Try setting these on a
connection and then running your queries on that connection:

set seq_page_cost = 0.1;
set random_page_cost = 0.1;
set cpu_tuple_cost = 0.03;

> Ok, I got a simple version of the actual query. Here is the
> explain analyze without order-by, which is what I wanted:
> http://explain.depesz.com/s/p1p
>
> Another with the order-by, which I want to avoid:
> http://explain.depesz.com/s/ujU

You neglected to mention the LIMIT clause in your earlier
presentation of the problem. A LIMIT can have a big impact on plan
choice. Is the LIMIT 10 part of the actual query you want to
optimize? Either way it would be helpful to see the EXPLAIN ANALYZE
output for the query without the LIMIT clause.

-Kevin
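The experiment suggested here, spelled out as one session (a sketch with the simplified names; SET LOCAL confines the cost-factor changes to the transaction so they don't leak into other work on the connection):

```sql
BEGIN;
-- Lower the cost factors to reflect a heavily cached data set.
SET LOCAL seq_page_cost = 0.1;
SET LOCAL random_page_cost = 0.1;
SET LOCAL cpu_tuple_cost = 0.03;

-- Compare the plan with and without the LIMIT under these costs.
EXPLAIN ANALYZE
SELECT a.* FROM a INNER JOIN b ON a.aid = b.aid
WHERE a.col1 = 33 AND a.col2 = 44 AND a.time < now() AND b.bid = 8
ORDER BY a.time;

EXPLAIN ANALYZE
SELECT a.* FROM a INNER JOIN b ON a.aid = b.aid
WHERE a.col1 = 33 AND a.col2 = 44 AND a.time < now() AND b.bid = 8
ORDER BY a.time LIMIT 10;

ROLLBACK;  -- discard the SET LOCALs
```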
"Kevin Grittner" wrote:

>> shared_buffers | 60GB
>
> You might want to compare your performance with this setting against
> a smaller setting. Many benchmarks have shown settings above a
> certain point (like 8MB to 12 MB) to be counter-productive, although
> a few have shown increased performance going past that. It really
> seems to depend on your hardware and workload, so you have to test to
> find the "sweet spot" for your environment.

Er, I meant "8GB to 12GB", not MB. Sorry.

-Kevin
On 2012/6/9 22:39, Kevin Grittner wrote:
> Rural Hunter wrote:
>> On 2012/6/9 0:39, Kevin Grittner wrote:
>
>> name | current_setting
>
>> full_page_writes | off
>
> There may be exceptions on some file systems, but generally turning
> this off leaves you vulnerable to possible database corruption if
> your OS or hardware crashes.

Yes, I understand. My situation is that the IO utilization of my
system is quite high, so I turned this off to reduce it. We have a
replication server to serve as the hot standby if there is any issue
on the primary, so currently I think it's an acceptable option for me.

>> max_connections | 2500
>
> Yikes! You may want to look into a connection pooler which can take
> 2500 client connections and funnel them into a much smaller number of
> database connections.
>
> https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
>
>> shared_buffers | 60GB
>
> You might want to compare your performance with this setting against
> a smaller setting. Many benchmarks have shown settings above a
> certain point (like 8MB to 12 MB) to be counter-productive, although
> a few have shown increased performance going past that. It really
> seems to depend on your hardware and workload, so you have to test to
> find the "sweet spot" for your environment.
>
>> work_mem | 8MB
>
> With so many connections, I can understand being this low. One of
> the advantages of using connection pooling to funnel your user
> connections into fewer database connections is that you can boost
> this, which might help considerably with some types of queries.
>
> None of the above, however, really gets to your immediate problem.
> What is most significant about your settings with regard to the
> problem query is what's *not* in that list. You appear to have a
> heavily cached active data set, based on the row counts and timings
> in the EXPLAIN ANALYZE output, and you have not adjusted your cost
> factors, which assume less caching.

Thanks for the advice. As of now we don't see an overall performance
issue on the db. I will adjust these settings based on your advice if
we begin to see overall performance degrade.

> Try setting these on a connection and then running your queries on
> that connection.
>
> set seq_page_cost = 0.1;
> set random_page_cost = 0.1;
> set cpu_tuple_cost = 0.03;

I tried these settings but don't see a noticeable improvement. The
plan is not changed.

>> Ok, I got a simple version of the actual query. Here is the
>> explain analyze without order-by, which is what I wanted:
>> http://explain.depesz.com/s/p1p
>>
>> Another with the order-by, which I want to avoid:
>> http://explain.depesz.com/s/ujU
>
> You neglected to mention the LIMIT clause in your earlier
> presentation of the problem. A LIMIT can have a big impact on plan
> choice. Is the LIMIT 10 part of the actual query you want to
> optimize? Either way it would be helpful to see the EXPLAIN ANALYZE
> output for the query without the LIMIT clause.

Yes, sorry for that. I do need the limit clause in the query to show
only part of the results to the user (a common multi-page view).
Without the limit clause, I get the plan I wanted:
http://explain.depesz.com/s/Qdu

So it looks like if I remove either the order-by or the limit clause,
I can get what I want. But I need both in the query...

> -Kevin
Rural Hunter wrote:
> On 2012/6/9 22:39, Kevin Grittner wrote:
>> You neglected to mention the LIMIT clause in your earlier
>> presentation of the problem. A LIMIT can have a big impact on plan
>> choice. Is the LIMIT 10 part of the actual query you want to
>> optimize? Either way it would be helpful to see the EXPLAIN
>> ANALYZE output for the query without the LIMIT clause.
> Yes, sorry for that. I do need the limit clause in the query to
> show only part of the results to the user (a common multi-page
> view). Without the limit clause, I get the plan I wanted:
> http://explain.depesz.com/s/Qdu
>
> So it looks like if I remove either the order-by or the limit
> clause, I can get what I want. But I need both in the query...

Well, we're still doing diagnostic steps. What this one shows is
that your statistics are leading the planner to believe that there
will be 20846 rows with lid = 3072, while there are really only 62.
If it knew the actual number I doubt it would choose the slower plan.

The next thing I would try is:

ALTER TABLE article_label ALTER COLUMN lid SET STATISTICS 5000;
ANALYZE article_label;

Then try the query without LIMIT and see if you get something on the
right order of magnitude comparing the estimated rows to actual on
that index scan. You can try different STATISTICS values until you
get the lowest value that puts the estimate in the right
neighborhood. Higher settings will increase plan time; lower
settings may lead to bad plans.

Once you've got a decent estimate, try with the ORDER BY and LIMIT
again.

If you have a hard time getting a good estimate even with a high
statistics target, you should investigate whether you have extreme
table bloat.

-Kevin
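The tuning loop described here, as a sketch (repeat with different targets until the estimated row count for lid = 3072 is in the neighborhood of the actual count):

```sql
-- Raise the per-column target and rebuild the statistics.
ALTER TABLE article_label ALTER COLUMN lid SET STATISTICS 5000;
ANALYZE article_label;

-- Compare the planner's estimate for lid = 3072...
EXPLAIN SELECT * FROM article_label WHERE lid = 3072;

-- ...against the actual row count.
SELECT count(*) FROM article_label WHERE lid = 3072;
```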
On 2012/6/11 20:07, Kevin Grittner wrote:
> Rural Hunter wrote:
>> On 2012/6/9 22:39, Kevin Grittner wrote:
>
>>> You neglected to mention the LIMIT clause in your earlier
>>> presentation of the problem. A LIMIT can have a big impact on plan
>>> choice. Is the LIMIT 10 part of the actual query you want to
>>> optimize? Either way it would be helpful to see the EXPLAIN
>>> ANALYZE output for the query without the LIMIT clause.
>> Yes, sorry for that. I do need the limit clause in the query to
>> show only part of the results to the user (a common multi-page
>> view). Without the limit clause, I get the plan I wanted:
>> http://explain.depesz.com/s/Qdu
>>
>> So it looks like if I remove either the order-by or the limit
>> clause, I can get what I want. But I need both in the query...
>
> Well, we're still doing diagnostic steps. What this one shows is
> that your statistics are leading the planner to believe that there
> will be 20846 rows with lid = 3072, while there are really only 62.
> If it knew the actual number I doubt it would choose the slower plan.
>
> The next thing I would try is:
>
> ALTER TABLE article_label ALTER COLUMN lid SET STATISTICS 5000;
> ANALYZE article_label;
>
> Then try the query without LIMIT and see if you get something on the
> right order of magnitude comparing the estimated rows to actual on
> that index scan. You can try different STATISTICS values until you
> get the lowest value that puts the estimate in the right
> neighborhood. Higher settings will increase plan time; lower
> settings may lead to bad plans.
>
> Once you've got a decent estimate, try with the ORDER BY and LIMIT
> again.

I set statistics to 5000 and got an estimated row count of 559. Set
statistics to 8000 and got an estimated row count of 393. At this
step, I ran the query with both the order-by and limit clauses and
got the expected result.

Kevin, thank you very much for your patience and step-by-step
guidance! I learnt a lot from this case!

> If you have a hard time getting a good estimate even with a high
> statistics target, you should investigate whether you have extreme
> table bloat.
>
> -Kevin
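A quick first check for the table bloat Kevin mentions, using the statistics collector (a rough approximation; pgstattuple or a dedicated bloat query gives more accurate numbers):

```sql
-- A high dead-to-live tuple ratio is a rough bloat indicator,
-- as is a long gap since the last (auto)vacuum.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'article_label';
```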