Обсуждение: Questions on 7.2.1 query plan choices
This seems pretty basic...I'd appreciate someone showing me the error of my ways...Questions below this schema... $ psql -c "\d freetext" Table "freetext" Column | Type | Modifiers ------------------+-----------------------+---------------------------- value | text | key | integer | not null isindexed | boolean | not null default 'f'::bool tobeindexed | boolean | default 'f'::bool Indexes: indexed_idx Unique keys: freetext_pkey $ psql -c "\d indexed_idx" Index "indexed_idx" Column | Type -------------+--------- tobeindexed | boolean isindexed | boolean btree $ psql -c "\d freetext_pkey" Index "freetext_pkey" Column | Type --------+--------- key | integer unique btree 1) There are over 700,000 rows in the table below, but only about 1,300 matching the where clause. How can I (and should I) get the planner to choose to traverse indexed_idx instead of a sequential scan? The following is immediately after calling 'analyze'... $ psql -c "explain select key, value from freetext where tobeindexed = 't' and isindexed = 'f' NOTICE: QUERY PLAN: Seq Scan on freetext (cost=0.00..102114.21 rows=296161 width=1138) $ psql -c "select count(key) from freetext" count -------- 728868 (1 row) $ psql -c "select count(key) from freetext where tobeindexed = 't' and isindexed = 'f'" count ------- 1319 (1 row) 2) Why does the planner choose to first scan freetext_pkey when choosing indexed_idx would narrow the 700K rows down to 1300 in the query below? As it is, it is apparently doing the equivalent of a backward seqscan of 700K rows right of the bat. $ psql -c "explain select key, value from freetext where tobeindexed = 't' and isindexed = 'f' order by key desc limit 25; NOTICE: QUERY PLAN: Limit (cost=0.00..267.87 rows=25 width=1144) -> Index Scan Backward using freetext_pkey on freetext (cost=0.00..3165306.12 rows=295414 width=1144) -Ed
On Wed, 17 Apr 2002, Ed Loehr wrote: > 1) There are over 700,000 rows in the table below, but only about 1,300 > matching the where clause. How can I (and should I) get the planner to > choose to traverse indexed_idx instead of a sequential scan? The following > is immediately after calling 'analyze'... > > $ psql -c "explain select key, value from freetext where tobeindexed = 't' > and isindexed = 'f' > NOTICE: QUERY PLAN: > > Seq Scan on freetext (cost=0.00..102114.21 rows=296161 width=1138) ... > 2) Why does the planner choose to first scan freetext_pkey when choosing > indexed_idx would narrow the 700K rows down to 1300 in the query below? Well, I don't know postgres so well, but here's my guess: Since tobeindexed and isindexed are boolean values, there are only four possible values for the index. This means that the selectivity of this index is going to be really, really low. Now that's all right if the query optimizer has some idea that, for this particular set of values, only a few thousand tuples will be retrieved. But it has to know this in advance somehow, because doing the queries above as an index lookup rather than a table scan or primary key index scan would be totally stupid if 100,000 rows were going to be returned. Now a query optimizer, lacking the right statistics on key distribution, would probably guess that on average you're going to select 1/4 of the rows [175,000 in this case] for any index value. (That would be my personal guess, if you hadn't told me that the key distribution was different.) An analyze might help, though I don't know if postgres collects the right statistics to fix this problem. But since you know yourself that you're going to have a relatively small number of rows where tobeindexed = true, you're in a perfect position to use one of the coolest features of postgres: a partial index. Create the index for only those rows where tobeindexed = t (see the docs for syntax details), and you end up with a 1300 row index rather than a 700,000 row index for a start. With any luck, the postgres query optimizer will also realize that there is no way something using that index can ever come up with more than the number of rows in the index itself, which at 1300 is very selective, and it will then use it first. Of course, I hope someone will correct me if I'm wrong here. At any rate, please post and let us know how it turns out. Partial seem to me to be designed exactly for situations like the one above. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Ed Loehr <pggeneral@bluepolka.net> writes: > $ psql -c "explain select key, value from freetext where tobeindexed = 't' > and isindexed = 'f' > NOTICE: QUERY PLAN: > Seq Scan on freetext (cost=0.00..102114.21 rows=296161 width=1138) > $ psql -c "select count(key) from freetext" > count > -------- > 728868 > (1 row) > $ psql -c "select count(key) from freetext where tobeindexed = 't' and > isindexed = 'f'" > count > ------- > 1319 > (1 row) The problem here is that the planner is estimating 296161 rows retrieved instead of 1319. If it were right, then a seqscan would be the right choice. My guess is that there is a strong correlation between the tobeindexed and isindexed columns --- but the current statistical model has no clue about cross-column correlations, so you get an estimate that's just based on the product of the frequencies independently. Curt Sampson's nearby remarks about partial indexes are not a bad suggestion. An even more direct attack is to combine these two columns into a single column with four states (you could use smallint or "char"-with-the-quotes). The 7.2 planner *would* have a pretty good idea about the relative frequencies of the different states, and would make the right seqscan-vs-indexscan choice depending on which state you were scanning for. regards, tom lane
On Thu, 18 Apr 2002, Tom Lane wrote: > Curt Sampson's nearby remarks about partial indexes are not a bad > suggestion. I just tried this out, and the disk space savings alone were pretty stunning. On a 300,000 row table with about 1750 TRUE values and the rest FALSE, the full index was over 5 MB and the partial was less than 50K. But it turns out that the analyzer's stats were good enough that it made little difference to performance. Once I analyzed the table, even with the full index postgres figured out that the index scan (estimating 1300 values, in this case) would be faster. So I guess it's key correlation thing that did it, or perhaps he just had not analzyed the table. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote: > >>Curt Sampson's nearby remarks about partial indexes are not a bad >>suggestion. > > I just tried this out, and the disk space savings alone were pretty > stunning. On a 300,000 row table with about 1750 TRUE values and > the rest FALSE, the full index was over 5 MB and the partial was > less than 50K. > > But it turns out that the analyzer's stats were good enough that > it made little difference to performance. Once I analyzed the table, > even with the full index postgres figured out that the index scan > (estimating 1300 values, in this case) would be faster. > > So I guess it's key correlation thing that did it, or perhaps he > just had not analzyed the table. Interesting. I analyzed immediately prior to running explain and the queries, so that is not the source. It is repeatable. Ed
On Thu, 18 Apr 2002, Ed Loehr wrote: > Interesting. I analyzed immediately prior to running explain and the > queries, so that is not the source. It is repeatable. How many rows are set for each individual boolean in that index? (I.e., if you use just 'tobeindexed = true', how many would that return? And the same for isindexed.) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC