Обсуждение: Questions on 7.2.1 query plan choices

Поиск
Список
Период
Сортировка

Questions on 7.2.1 query plan choices

От
Ed Loehr
Дата:
This seems pretty basic...I'd appreciate someone showing me the error of my
ways...Questions below this schema...

$ psql -c "\d freetext"
                            Table "freetext"
       Column      |         Type          |         Modifiers
------------------+-----------------------+----------------------------
  value            | text                  |
  key              | integer               | not null
  isindexed        | boolean               | not null default 'f'::bool
  tobeindexed      | boolean               | default 'f'::bool
Indexes: indexed_idx
Unique keys: freetext_pkey

$ psql -c "\d indexed_idx"
   Index "indexed_idx"
    Column    |  Type
-------------+---------
  tobeindexed | boolean
  isindexed   | boolean
btree

$ psql -c "\d freetext_pkey"
Index "freetext_pkey"
  Column |  Type
--------+---------
  key    | integer
unique btree


1)  There are over 700,000 rows in the table below, but only about 1,300
matching the where clause.  How can I (and should I) get the planner to
choose to traverse indexed_idx instead of a sequential scan?  The following
is immediately after calling 'analyze'...

$ psql -c "explain select key, value from freetext where tobeindexed = 't'
and isindexed = 'f'
NOTICE:  QUERY PLAN:

Seq Scan on freetext  (cost=0.00..102114.21 rows=296161 width=1138)

$ psql -c "select count(key) from freetext"
  count
--------
  728868
(1 row)

$ psql -c "select count(key) from freetext where tobeindexed = 't' and
isindexed = 'f'"
  count
-------
   1319
(1 row)


2)  Why does the planner choose to first scan freetext_pkey when choosing
indexed_idx would narrow the 700K rows down to 1300 in the query below?  As
it is, it is apparently doing the equivalent of a backward seqscan of 700K
rows right of the bat.

$ psql -c "explain select key, value from freetext where tobeindexed = 't'
and isindexed = 'f' order by key desc limit 25;
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..267.87 rows=25 width=1144)
   ->  Index Scan Backward using freetext_pkey on freetext
(cost=0.00..3165306.12 rows=295414 width=1144)


-Ed


Re: Questions on 7.2.1 query plan choices

От
Curt Sampson
Дата:
On Wed, 17 Apr 2002, Ed Loehr wrote:

> 1)  There are over 700,000 rows in the table below, but only about 1,300
> matching the where clause.  How can I (and should I) get the planner to
> choose to traverse indexed_idx instead of a sequential scan?  The following
> is immediately after calling 'analyze'...
>
> $ psql -c "explain select key, value from freetext where tobeindexed = 't'
> and isindexed = 'f'
> NOTICE:  QUERY PLAN:
>
> Seq Scan on freetext  (cost=0.00..102114.21 rows=296161 width=1138)
...

> 2)  Why does the planner choose to first scan freetext_pkey when choosing
> indexed_idx would narrow the 700K rows down to 1300 in the query below?

Well, I don't know postgres so well, but here's my guess:

Since tobeindexed and isindexed are boolean values, there are only four
possible values for the index. This means that the selectivity of this
index is going to be really, really low.

Now that's all right if the query optimizer has some idea that, for this
particular set of values, only a few thousand tuples will be retrieved.
But it has to know this in advance somehow, because doing the queries
above as an index lookup rather than a table scan or primary key index
scan would be totally stupid if 100,000 rows were going to be returned.

Now a query optimizer, lacking the right statistics on key distribution,
would probably guess that on average you're going to select 1/4 of the
rows [175,000 in this case] for any index value. (That would be my
personal guess, if you hadn't told me that the key distribution was
different.)

An analyze might help, though I don't know if postgres collects the
right statistics to fix this problem.

But since you know yourself that you're going to have a relatively small
number of rows where tobeindexed = true, you're in a perfect position to
use one of the coolest features of postgres: a partial index. Create the
index for only those rows where tobeindexed = t (see the docs for syntax
details), and you end up with a 1300 row index rather than a 700,000 row
index for a start. With any luck, the postgres query optimizer will also
realize that there is no way something using that index can ever come up
with more than the number of rows in the index itself, which at 1300 is
very selective, and it will then use it first.

Of course, I hope someone will correct me if I'm wrong here. At any
rate, please post and let us know how it turns out. Partial seem to me
to be designed exactly for situations like the one above.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Questions on 7.2.1 query plan choices

От
Tom Lane
Дата:
Ed Loehr <pggeneral@bluepolka.net> writes:
> $ psql -c "explain select key, value from freetext where tobeindexed = 't'
> and isindexed = 'f'
> NOTICE:  QUERY PLAN:

> Seq Scan on freetext  (cost=0.00..102114.21 rows=296161 width=1138)

> $ psql -c "select count(key) from freetext"
>   count
> --------
>   728868
> (1 row)

> $ psql -c "select count(key) from freetext where tobeindexed = 't' and
> isindexed = 'f'"
>   count
> -------
>    1319
> (1 row)

The problem here is that the planner is estimating 296161 rows retrieved
instead of 1319.  If it were right, then a seqscan would be the right
choice.  My guess is that there is a strong correlation between the
tobeindexed and isindexed columns --- but the current statistical model
has no clue about cross-column correlations, so you get an estimate
that's just based on the product of the frequencies independently.

Curt Sampson's nearby remarks about partial indexes are not a bad
suggestion.  An even more direct attack is to combine these two
columns into a single column with four states (you could use smallint
or "char"-with-the-quotes).  The 7.2 planner *would* have a pretty
good idea about the relative frequencies of the different states, and
would make the right seqscan-vs-indexscan choice depending on which
state you were scanning for.

            regards, tom lane

Re: Questions on 7.2.1 query plan choices

От
Curt Sampson
Дата:
On Thu, 18 Apr 2002, Tom Lane wrote:

> Curt Sampson's nearby remarks about partial indexes are not a bad
> suggestion.

I just tried this out, and the disk space savings alone were pretty
stunning.  On a 300,000 row table with about 1750 TRUE values and
the rest FALSE, the full index was over 5 MB and the partial was
less than 50K.

But it turns out that the analyzer's stats were good enough that
it made little difference to performance. Once I analyzed the table,
even with the full index postgres figured out that the index scan
(estimating 1300 values, in this case) would be faster.

So I guess it's key correlation thing that did it, or perhaps he
just had not analzyed the table.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Questions on 7.2.1 query plan choices

От
Ed Loehr
Дата:
Curt Sampson wrote:

>
>>Curt Sampson's nearby remarks about partial indexes are not a bad
>>suggestion.
>
> I just tried this out, and the disk space savings alone were pretty
> stunning.  On a 300,000 row table with about 1750 TRUE values and
> the rest FALSE, the full index was over 5 MB and the partial was
> less than 50K.
>
> But it turns out that the analyzer's stats were good enough that
> it made little difference to performance. Once I analyzed the table,
> even with the full index postgres figured out that the index scan
> (estimating 1300 values, in this case) would be faster.
>
> So I guess it's key correlation thing that did it, or perhaps he
> just had not analzyed the table.


Interesting.  I analyzed immediately prior to running explain and the
queries, so that is not the source.  It is repeatable.

Ed




Re: Questions on 7.2.1 query plan choices

От
Curt Sampson
Дата:
On Thu, 18 Apr 2002, Ed Loehr wrote:

> Interesting.  I analyzed immediately prior to running explain and the
> queries, so that is not the source.  It is repeatable.

How many rows are set for each individual boolean in that index?
(I.e., if you use just 'tobeindexed = true', how many would that
return? And the same for isindexed.)

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC