Обсуждение: Question on Index usage

Поиск
Список
Период
Сортировка

Question on Index usage

От
Michael Monnerie
Дата:
http://www.postgresql.org/docs/8.3/interactive/using-explain.html

I tried reading that page, but it's still not clear to me, why the index
dbmail_messages_1 is better than dbmail_messages_7:

\d dbmail_messages
                                    Tabelle »public.dbmail_messages«
     Spalte     |          Typ          |
Attribute
----------------+-----------------------+---------------------------------------------------------------
 message_idnr   | bigint                | not null default
nextval('dbmail_message_idnr_seq'::regclass)
 mailbox_idnr   | bigint                |
 physmessage_id | bigint                |
<snip other colums not important>
Indexe:
    »dbmail_messages_pkey« PRIMARY KEY, btree (message_idnr)
    »dbmail_messages_1« btree (mailbox_idnr)
    »dbmail_messages_2« btree (physmessage_id)
    »dbmail_messages_3« btree (seen_flag)
    »dbmail_messages_4« btree (unique_id)
    »dbmail_messages_5« btree (status)
    »dbmail_messages_6« btree (status) WHERE status < 2::smallint
    »dbmail_messages_7« btree (mailbox_idnr, status, seen_flag) CLUSTER
    »dbmail_messages_8« btree (mailbox_idnr, status, recent_flag)

Then I do a SELECT that joins this with another table:

EXPLAIN ANALYZE SELECT 1 FROM dbmail_messages msg JOIN
dbmail_physmessage pm ON ( pm.id = msg.physmessage_id ) WHERE
message_idnr BETWEEN 3178782 AND 3616157 AND mailbox_idnr = 3236 AND
status IN (0,1,2) ORDER BY message_idnr ASC;
                                                                          QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8593.80..8605.30 rows=4599 width=8) (actual
time=707.878..713.738 rows=5228 loops=1)
   Sort Key: msg.message_idnr
   Sort Method:  quicksort  Memory: 438kB
   ->  Hash Join  (cost=6891.33..8314.02 rows=4599 width=8) (actual
time=667.133..697.308 rows=5228 loops=1)
         Hash Cond: (msg.physmessage_id = pm.id)
         ->  Index Scan using dbmail_messages_1 on dbmail_messages msg
(cost=0.00..1324.96 rows=4599 width=16) (actual time=0.094..10.958
rows=5228 loops=1)
               Index Cond: (mailbox_idnr = 3236)
               Filter: ((message_idnr >= 3178782) AND (message_idnr <=
3616157) AND (status = ANY ('{0,1,2}'::integer[])))
         ->  Hash  (cost=4008.37..4008.37 rows=230637 width=8) (actual
time=666.628..666.628 rows=229809 loops=1)
               ->  Seq Scan on dbmail_physmessage pm
(cost=0.00..4008.37 rows=230637 width=8) (actual time=0.012..303.139
rows=229809 loops=1)
 Total runtime: 729.972 ms

Good, and now I
DROP INDEX dbmail_messages_1;
because anyway there are the _7 and _8 indices which both have
mailbox_idnr as their first column, so sorting is the same as in the _1
index. And _7 is even used by CLUSTER. But that makes
                                                                   QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10601.66..10613.20 rows=4615 width=8) (actual
time=759.415..765.292 rows=5228 loops=1)
   Sort Key: msg.message_idnr
   Sort Method:  quicksort  Memory: 438kB
   ->  Hash Join  (cost=7036.45..10320.79 rows=4615 width=8) (actual
time=720.352..748.912 rows=5228 loops=1)
         Hash Cond: (msg.physmessage_id = pm.id)
         ->  Bitmap Heap Scan on dbmail_messages msg
(cost=145.12..3330.99 rows=4650 width=16) (actual time=50.689..60.132
rows=5228 loops=1)
               Recheck Cond: (mailbox_idnr = 3236)
               Filter: ((message_idnr >= 3178782) AND (message_idnr <=
3616157) AND (status = ANY ('{0,1,2}'::integer[])))
               ->  Bitmap Index Scan on dbmail_messages_7
(cost=0.00..143.96 rows=5288 width=0) (actual time=50.628..50.628
rows=15759 loops=1)
                     Index Cond: (mailbox_idnr = 3236)
         ->  Hash  (cost=4008.37..4008.37 rows=230637 width=8) (actual
time=669.219..669.219 rows=229809 loops=1)
               ->  Seq Scan on dbmail_physmessage pm
(cost=0.00..4008.37 rows=230637 width=8) (actual time=0.014..301.793
rows=229809 loops=1)
 Total runtime: 781.238 ms

So while with the _1 index an "Index Scan" is used, with the _7 it needs
a "Bitmap Index Scan" plus a "Bitmap Heap Scan". Can somebody explain
why the _1 index cannot be deleted without loosing performance? The
plain "Index Scan" could be used with _7 or _8 anyway.

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Question on Index usage

От
Alvaro Herrera
Дата:
Michael Monnerie wrote:


> EXPLAIN ANALYZE SELECT 1 FROM dbmail_messages msg JOIN
> dbmail_physmessage pm ON ( pm.id = msg.physmessage_id ) WHERE
> message_idnr BETWEEN 3178782 AND 3616157 AND mailbox_idnr = 3236 AND
> status IN (0,1,2) ORDER BY message_idnr ASC;


[...]
>          ->  Bitmap Heap Scan on dbmail_messages msg
> (cost=145.12..3330.99 rows=4650 width=16) (actual time=50.689..60.132
> rows=5228 loops=1)
>                Recheck Cond: (mailbox_idnr = 3236)
>                Filter: ((message_idnr >= 3178782) AND (message_idnr <=
> 3616157) AND (status = ANY ('{0,1,2}'::integer[])))
>                ->  Bitmap Index Scan on dbmail_messages_7
> (cost=0.00..143.96 rows=5288 width=0) (actual time=50.628..50.628
> rows=15759 loops=1)
>                      Index Cond: (mailbox_idnr = 3236)
>          ->  Hash  (cost=4008.37..4008.37 rows=230637 width=8) (actual
> time=669.219..669.219 rows=229809 loops=1)
[...]

> So while with the _1 index an "Index Scan" is used, with the _7 it needs
> a "Bitmap Index Scan" plus a "Bitmap Heap Scan". Can somebody explain
> why the _1 index cannot be deleted without loosing performance? The
> plain "Index Scan" could be used with _7 or _8 anyway.

Seems like the problem is that it is not pushing the "status IN"
condition as part of the index condition for some reason, and instead
using it as a filter.  Maybe something to do with the selectivity of
that clause?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Question on Index usage

От
Michael Monnerie
Дата:
On Mittwoch 21 Januar 2009 Alvaro Herrera wrote:
> Seems like the problem is that it is not pushing the "status IN"
> condition as part of the index condition for some reason, and instead
> using it as a filter.  Maybe something to do with the selectivity of
> that clause?

I was reading your answer several times, and I think I understand now
what you mean. You mean "it doesn't use index _7 because the status IN
(0,1,2) isn't usable". I didn't think about that, but that's another
good point.

Instead, I meant: Why can't I just delete index _1, if anyway I have
index _7 with the same field, just plus additional fields. I thought
that would be redundant:

Index _1 = mailbox_idnr
Index _7 = mailbox_idnr,status,seen_flag

So I would guess that Index _1 is redundant, and I can delete it because
the planner will use _7 instead. After all, for searching any
mailbox_idnr in that table (~234k entries) it doesn't really matter if
you use index _1 or _7, the sort is the same (if you don't care about
the other fields).

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Question on Index usage

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Seems like the problem is that it is not pushing the "status IN"
> condition as part of the index condition for some reason, and instead
> using it as a filter.  Maybe something to do with the selectivity of
> that clause?

No, AFAIR the planner will *always* include every possibly relevant
condition for a given index.  If the condition is useless or nearly so,
that might prompt it to pick a different index instead, but not to omit
the indexqual.  I'm thinking it's not believing it can use the status
condition with that index, perhaps for datatype reasons.  What is status
declared as, exactly?

Also, what PG version is this?

            regards, tom lane

Re: Question on Index usage

От
Michael Monnerie
Дата:
On Mittwoch 21 Januar 2009 Tom Lane wrote:
> No, AFAIR the planner will *always* include every possibly relevant
> condition for a given index.  If the condition is useless or nearly
> so, that might prompt it to pick a different index instead, but not
> to omit the indexqual.  I'm thinking it's not believing it can use
> the status condition with that index, perhaps for datatype reasons.
>  What is status declared as, exactly?

OK, now I'm sure you're speaking in a direction I didn't mean ;-)
But still, it could use the index _7 for the status field, as that's
sorted and 0,1,2 would be all together.

status         | smallint              | not null default (0)::smallint

Maybe it doesn't do because it could well be 0,1,3 also, and it's not
checking if all the values in the IN() statement appear in a row?

> Also, what PG version is this?

8.3 on openSUSE 11.1: select version();
PostgreSQL 8.3.5 on x86_64-unknown-linux-gnu, compiled by GCC gcc (SUSE
Linux) 4.3.2 [gcc-4_3-branch revision 141291]

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Question on Index usage

От
Michael Monnerie
Дата:
On Mittwoch 21 Januar 2009 Alvaro Herrera wrote:
> Seems like the problem is that it is not pushing the "status IN"
> condition as part of the index condition for some reason, and instead
> using it as a filter.  Maybe something to do with the selectivity of
> that clause?

I was reading your answer several times, and I think I understand now
what you mean. You mean "it doesn't use index _7 because the status IN
(0,1,2) isn't usable". I didn't think about that, but that's another
good point.

Instead, I meant: Why can't I just delete index _1, if anyway I have
index _7 with the same field, just plus additional fields. I thought
that would be redundant:

Index _1 = mailbox_idnr
Index _7 = mailbox_idnr,status,seen_flag

So I would guess that Index _1 is redundant, and I can delete it because
the planner will use _7 instead. After all, for searching any
mailbox_idnr in that table (~234k entries) it doesn't really matter if
you use index _1 or _7, the sort is the same (if you don't care about
the other fields).

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4