Обсуждение: Confusion about composite indexes

Поиск
Список
Период
Сортировка

Confusion about composite indexes

От
Bill Mitchell
Дата:
I am searching for some logic behind the selection of an index in postgres -- it seems that if I have a composite index based on both columns in a join table, it's only referenced if I query on the first term in the composite index.  I've read http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and over and think that this is the same scenario as what I face.

As an example:
OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a million rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

  Table "public.outlet_media"
  Column   |  Type  | Modifiers
-----------+--------+-----------
 outlet_id | bigint | not null
 media_id  | bigint | not null
Indexes:
    "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
    "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
    "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select outlet_id from outlet order by random() limit 50);
                                                                QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual time=841.115..884.669 rows=50 loops=1)
   ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual time=841.048..841.090 rows=50 loops=1)
         ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual time=840.980..841.011 rows=50 loops=1)
               ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8) (actual time=840.978..840.991 rows=50 loops=1)
                     Sort Key: (random())
                     Sort Method: top-N heapsort  Memory: 27kB
                     ->  Seq Scan on outlet  (cost=0.00..31511.88 rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
   ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
         Index Cond: (outlet_id = outlet.outlet_id)
 Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select media_id from media where media_name='Online News');
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual time=0.034..0.034 rows=0 loops=1)
   Hash Cond: (outlet_media.media_id = media.media_id)
   ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16) (actual time=0.012..0.012 rows=1 loops=1)
   ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 0kB
         ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
               ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                     Filter: ((media_name)::text = 'Online News'::text)
 Total runtime: 0.084 ms
(9 rows)


Thanks in advance for whatever light can be shed.  If it's safer for me to just create individual indexes on each of the two columns  (" Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time")

regards
Bill

Re: Confusion about composite indexes

От
Chris Curvey
Дата:


On Mon, May 21, 2012 at 3:34 PM, Bill Mitchell <bill@publicrelay.com> wrote:
I am searching for some logic behind the selection of an index in postgres -- it seems that if I have a composite index based on both columns in a join table, it's only referenced if I query on the first term in the composite index.  I've read http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and over and think that this is the same scenario as what I face.

As an example:
OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a million rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

  Table "public.outlet_media"
  Column   |  Type  | Modifiers
-----------+--------+-----------
 outlet_id | bigint | not null
 media_id  | bigint | not null
Indexes:
    "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
    "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
    "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select outlet_id from outlet order by random() limit 50);
                                                                QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual time=841.115..884.669 rows=50 loops=1)
   ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual time=841.048..841.090 rows=50 loops=1)
         ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual time=840.980..841.011 rows=50 loops=1)
               ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8) (actual time=840.978..840.991 rows=50 loops=1)
                     Sort Key: (random())
                     Sort Method: top-N heapsort  Memory: 27kB
                     ->  Seq Scan on outlet  (cost=0.00..31511.88 rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
   ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
         Index Cond: (outlet_id = outlet.outlet_id)
 Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select media_id from media where media_name='Online News');
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual time=0.034..0.034 rows=0 loops=1)
   Hash Cond: (outlet_media.media_id = media.media_id)
   ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16) (actual time=0.012..0.012 rows=1 loops=1)
   ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 0kB
         ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
               ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                     Filter: ((media_name)::text = 'Online News'::text)
 Total runtime: 0.084 ms
(9 rows)


Thanks in advance for whatever light can be shed.  If it's safer for me to just create individual indexes on each of the two columns  (" Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time")

regards
Bill

 
You are absolutely correct.  It may be easier to understand if you think about the structure of the index keys.    For clarity, I'm going to pretend that the "outlet" id is a character instead of an integer.

The index is sorted in the order of the columns in the index.  So our sorted values might look something like this (my apologies if the spacing does not come through properly)

outlet   media
A        1
A                  2
A                  3
B        1
B                  2
B                  3
C        1
C                  2
C                  3

So if I want to search for all the entries where outlet = 'a', then I find the first 'A' record, search until I find something that is not 'A', and I know that I'm done.

But to find all the entries where media = 1, I would have to search every key in the index.  And because there is an unknown amount of work involved in searching the index, the optimizer assumes that using the index will be more work than just scanning the table.

So what should you do?  I would keep the 2-key index, because you want that for uniqueness.  And the index as you have structured it is useful for searching if you are specifying the outlet.  But you may want to add a non-unique index on the media_id column, so that you will have a useful index for searching.  

The Standard Index Tradeoff Disclaimer (TM) applies:  indexes can speed up searches, but they will slow down inserts/updates/deletes because the index must be maintained.


 
--
e-Mail is the equivalent of a postcard written in pencil.  This message may not have been sent by me, or intended for you.  It may have been read or even modified while in transit.  e-Mail disclaimers have the same force in law as a note passed in study hall.  If your corporate attorney says that you need an disclaimer in your signature, you need a new corporate attorney.

Re: Confusion about composite indexes

От
Merlin Moncure
Дата:
On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill@publicrelay.com> wrote:
> I am searching for some logic behind the selection of an index in postgres
> -- it seems that if I have a composite index based on both columns in a join
> table, it's only referenced if I query on the first term in the composite
> index.  I've read
> http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and
> over and think that this is the same scenario as what I face.
>
> As an example:
> OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
> MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
> OUTLET_MEDIA: a join table used to correlate, and this has about a million
> rows
>
> Each outlet may have 1+ Media (technically, 0 or more, in this schema)
>
>   Table "public.outlet_media"
>   Column   |  Type  | Modifiers
> -----------+--------+-----------
>  outlet_id | bigint | not null
>  media_id  | bigint | not null
> Indexes:
>     "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
> Foreign-key constraints:
>     "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
>     "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
> outlet(outlet_id)
>
> When I test performance, using an OUTLET_ID, the query uses the
> outlet_media_pkey index
>
> # explain analyze select * from outlet_media where outlet_id in (select
> outlet_id from outlet order by random() limit 50);
>                                                                 QUERY
> PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual
> time=841.115..884.669 rows=50 loops=1)
>    ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual
> time=841.048..841.090 rows=50 loops=1)
>          ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual
> time=840.980..841.011 rows=50 loops=1)
>                ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8)
> (actual time=840.978..840.991 rows=50 loops=1)
>                      Sort Key: (random())
>                      Sort Method: top-N heapsort  Memory: 27kB
>                      ->  Seq Scan on outlet  (cost=0.00..31511.88
> rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
>    ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43
> rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
>          Index Cond: (outlet_id = outlet.outlet_id)
>  Total runtime: 884.759 ms
> (10 rows)
>
> However if I try the reverse, to search using the MEDIA_ID
> # explain analyze select * from outlet_media where media_id in (select
> media_id from media where media_name='Online News');
>                                                       QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual
> time=0.034..0.034 rows=0 loops=1)
>    Hash Cond: (outlet_media.media_id = media.media_id)
>    ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16)
> (actual time=0.012..0.012 rows=1 loops=1)
>    ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
> rows=0 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 0kB
>          ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual
> time=0.013..0.013 rows=0 loops=1)
>                ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8)
> (actual time=0.012..0.012 rows=0 loops=1)
>                      Filter: ((media_name)::text = 'Online News'::text)
>  Total runtime: 0.084 ms
> (9 rows)
>
>
> Thanks in advance for whatever light can be shed.  If it's safer for me to
> just create individual indexes on each of the two columns  (" Multicolumn
> indexes should be used sparingly. In most situations, an index on a single
> column is sufficient and saves space and time")

There are a couple of things going on here.  First of all, 'order by
random() limit..' guarantees a full seq scan and a short of whatever
you're ordering, so it's never going to be very efficient.  When you
wen the other way, you supplied a hard search term which can be fed to
the index and optimized through.

Secondly, if you have a mapping table that has to support searches in
both directions, it's pretty typical to do something like this:

create table map(a int references ..., b int references..., primary key(a,b));
create index on map(b);

So you can get fully index lookups on all of a, b, ab, and ba.  the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order.  They can
still help somewhat, but to a lesser degree.

merlin

Re: Confusion about composite indexes

От
Dmitriy Igrishin
Дата:


2012/5/22 Merlin Moncure <mmoncure@gmail.com>
On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill@publicrelay.com> wrote:
> I am searching for some logic behind the selection of an index in postgres
> -- it seems that if I have a composite index based on both columns in a join
> table, it's only referenced if I query on the first term in the composite
> index.  I've read
> http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and
> over and think that this is the same scenario as what I face.
>
> As an example:
> OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
> MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
> OUTLET_MEDIA: a join table used to correlate, and this has about a million
> rows
>
> Each outlet may have 1+ Media (technically, 0 or more, in this schema)
>
>   Table "public.outlet_media"
>   Column   |  Type  | Modifiers
> -----------+--------+-----------
>  outlet_id | bigint | not null
>  media_id  | bigint | not null
> Indexes:
>     "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
> Foreign-key constraints:
>     "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
>     "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
> outlet(outlet_id)
>
> When I test performance, using an OUTLET_ID, the query uses the
> outlet_media_pkey index
>
> # explain analyze select * from outlet_media where outlet_id in (select
> outlet_id from outlet order by random() limit 50);
>                                                                 QUERY
> PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual
> time=841.115..884.669 rows=50 loops=1)
>    ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual
> time=841.048..841.090 rows=50 loops=1)
>          ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual
> time=840.980..841.011 rows=50 loops=1)
>                ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8)
> (actual time=840.978..840.991 rows=50 loops=1)
>                      Sort Key: (random())
>                      Sort Method: top-N heapsort  Memory: 27kB
>                      ->  Seq Scan on outlet  (cost=0.00..31511.88
> rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
>    ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43
> rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
>          Index Cond: (outlet_id = outlet.outlet_id)
>  Total runtime: 884.759 ms
> (10 rows)
>
> However if I try the reverse, to search using the MEDIA_ID
> # explain analyze select * from outlet_media where media_id in (select
> media_id from media where media_name='Online News');
>                                                       QUERY
> PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual
> time=0.034..0.034 rows=0 loops=1)
>    Hash Cond: (outlet_media.media_id = media.media_id)
>    ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16)
> (actual time=0.012..0.012 rows=1 loops=1)
>    ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
> rows=0 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 0kB
>          ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual
> time=0.013..0.013 rows=0 loops=1)
>                ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8)
> (actual time=0.012..0.012 rows=0 loops=1)
>                      Filter: ((media_name)::text = 'Online News'::text)
>  Total runtime: 0.084 ms
> (9 rows)
>
>
> Thanks in advance for whatever light can be shed.  If it's safer for me to
> just create individual indexes on each of the two columns  (" Multicolumn
> indexes should be used sparingly. In most situations, an index on a single
> column is sufficient and saves space and time")

There are a couple of things going on here.  First of all, 'order by
random() limit..' guarantees a full seq scan and a short of whatever
you're ordering, so it's never going to be very efficient.  When you
wen the other way, you supplied a hard search term which can be fed to
the index and optimized through.

Secondly, if you have a mapping table that has to support searches in
both directions, it's pretty typical to do something like this:

create table map(a int references ..., b int references..., primary key(a,b));
create index on map(b);

So you can get fully index lookups on all of a, b, ab, and ba.  the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order.  They can
still help somewhat, but to a lesser degree.
BTW, I would like to know is it worth it to create 3rd index on map(a)
to reduce the size of the index which will be used by the planer
to save some server's RAM (obviously, at the cost of extra disk space) ?

--
// Dmitriy.


Re: Confusion about composite indexes

От
Merlin Moncure
Дата:
On Mon, May 21, 2012 at 3:36 PM, Dmitriy Igrishin <dmitigr@gmail.com> wrote:
>> So you can get fully index lookups on all of a, b, ab, and ba.  the
>> primary key can't optimize ba because indexes only fully match if
>> candidate fields are supplied from left to right order.  They can
>> still help somewhat, but to a lesser degree.
>
> BTW, I would like to know is it worth it to create 3rd index on map(a)
> to reduce the size of the index which will be used by the planer
> to save some server's RAM (obviously, at the cost of extra disk space) ?

What Dmitriy is talking about here is that even though an index on
(a,b) can efficiently (in terms of searching through the tree) match
terms on just 'a', you still pay a price because the entries on the
index have to store the data for b as well,  So even though it's
algorithmically efficient you have to browse more data to do it which
pressures RAM.  In other words, an index on just 'a' is ideal for
searches on just 'a', although a,b is much better than (b,a) or no
index at all.

I personally think that generally it's better not to do that in most
cases especially if you're indexing integer keys since you're not
making *that* much difference on the overall index size.  Also,
primary key indexes are much more likely to have to stay 'hot' in the
cache anyways since they will be serving fkey reference lookups and
stuff like that so in the end you might be consuming *more* ram, not
less.

An exception might be if your key on a,b has a very small 'a' and a
very large 'b'.  But that's pretty rare in practice and it's usually a
good idea to avoid indexing large fields if you can help it.  It
really depends on the workload.

merlin

Re: Confusion about composite indexes

От
Bill Mitchell
Дата:
Thanks to everybody's input -- as a first-time poster to this listserv,
I wasn't sure how long it would take to get a response. ;)

I was frankly astonished to see that the composite index on (a,b) was
used when I searched for (a), but Chris' response makes total sense.

In this case, I don't want to go with a MAP due to the fact that I'm
actually using Java Hibernate to generate this schema and access it.

My sample query of using RANDOM() to select a random subset of the
overall outlets was actually to try and defeat any prior caching of
results, and give a more reasonable measurement -- but I didn't realize
the implications.  I had thought that coupled with a MAX clause at the
end it would simply randomize and then bail out early instead of a full
table scan - so thanks to Merlin for pointing that out.

I'll go with a 2nd index on MEDIA_ID and do some measurements of speed
increase, but it makes a lot more sense now.

thank you Postgres gurus!  :D

regards
Bill


On 5/21/12 5:11 PM, Merlin Moncure wrote:
> On Mon, May 21, 2012 at 3:36 PM, Dmitriy Igrishin <dmitigr@gmail.com> wrote:
>>> So you can get fully index lookups on all of a, b, ab, and ba.  the
>>> primary key can't optimize ba because indexes only fully match if
>>> candidate fields are supplied from left to right order.  They can
>>> still help somewhat, but to a lesser degree.
>> BTW, I would like to know is it worth it to create 3rd index on map(a)
>> to reduce the size of the index which will be used by the planer
>> to save some server's RAM (obviously, at the cost of extra disk space) ?
> What Dmitriy is talking about here is that even though an index on
> (a,b) can efficiently (in terms of searching through the tree) match
> terms on just 'a', you still pay a price because the entries on the
> index have to store the data for b as well,  So even though it's
> algorithmically efficient you have to browse more data to do it which
> pressures RAM.  In other words, an index on just 'a' is ideal for
> searches on just 'a', although a,b is much better than (b,a) or no
> index at all.
>
> I personally think that generally it's better not to do that in most
> cases especially if you're indexing integer keys since you're not
> making *that* much difference on the overall index size.  Also,
> primary key indexes are much more likely to have to stay 'hot' in the
> cache anyways since they will be serving fkey reference lookups and
> stuff like that so in the end you might be consuming *more* ram, not
> less.
>
> An exception might be if your key on a,b has a very small 'a' and a
> very large 'b'.  But that's pretty rare in practice and it's usually a
> good idea to avoid indexing large fields if you can help it.  It
> really depends on the workload.
>
> merlin