Discussion: Better index strategy for many fields with few values


Better index strategy for many fields with few values

From
Oscar Picasso
Date:
Hi,

I want to optimize something like this.

- My items table:
code int              -- can take one of 100 values
property varchar(250) -- can take one of 5000 values
param01 char(10)      -- can take one of 10 values
param02 char(10)      -- can take one of 10 values
...
[ 20 similar columns ]
...
param20 char(10)     -- can take one of 10 values

- The kind of query I want to optimize:
select * from items
where code between 5 and 22
and param01 = 'P'
and param02 = 'G'
...
[ all the 20 paramXX columns are used in the query ]
...
and param20 = 'C';


How can I optimize this kind of query?

I was thinking about using a multicolumn index, but I have read that we should limit multicolumn indexes to at most 2 or 3 columns.

If that's true then 22 columns for a multicolumn index seems way too much. Or maybe it is workable as every column uses only a very limited set of values?

I was also thinking about using a functional index.

What do you think would be the best solution in such a case?

Thanks.

Oscar



Re: Better index strategy for many fields with few values

From
PFC
Date:
> - My items table:
> code int              -- can take one of 100 values
> property varchar(250) -- can take one of 5000 values
> param01 char(10)      -- can take one of 10 values
> param02 char(10)      -- can take one of 10 values
> ...
> [ 20 similar columns ]
> ...
> param20 char(10)     -- can take one of 10 values

    Instead of 20 columns, you could use a "param" field containing
an array of 20 TEXT values.
    Then create a simple index on (code, param) and SELECT WHERE code BETWEEN
... AND param = '{P,G,....,C}'

    If you don't want to modify your structure, you can create a functional
index on an array {param1...param20}, but your queries will be a bit
uglier.
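A minimal sketch of what PFC describes, assuming the table from the original post (the index name and the ALTER step are illustrative, not from the thread):

```sql
-- Replace the 20 paramXX columns with a single array column
ALTER TABLE items ADD COLUMN param text[];

-- One simple two-column index then covers the whole query
CREATE INDEX items_code_param_idx ON items (code, param);

-- The 20 equality conditions collapse into one array comparison
SELECT *
FROM items
WHERE code BETWEEN 5 AND 22
  AND param = '{P,G,C}'::text[];  -- all 20 values listed in practice
```

Note that this only helps when every paramXX value is constrained, since the index compares the array as a whole.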

Re: Better index strategy for many fields with few values

From
Markus Schaber
Date:
Hi, Oscar,

Oscar Picasso wrote:

> [ all the 20 paramXX columns are used in the query ]

> How can I optimize this kind of query?

PostgreSQL 8.1 has so-called bitmap index scans, which can combine
several index scans before actually accessing the data.

So I think it's best to create an index on each of the paramXX columns,
and see with EXPLAIN ANALYZE what it is doing.
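A sketch of that setup, with hypothetical index names:

```sql
-- One single-column index per parameter column
CREATE INDEX items_param01_idx ON items (param01);
CREATE INDEX items_param02_idx ON items (param02);
-- ... and so on for the remaining paramXX columns ...

-- 8.1 can then AND the matching indexes together with a
-- bitmap index scan; verify with:
EXPLAIN ANALYZE
SELECT * FROM items
WHERE code BETWEEN 5 AND 22
  AND param01 = 'P'
  AND param02 = 'G';
```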

> I was thinking about using a multicolumn index, but I have read that
> we should limit multicolumn indexes to at most 2 or 3 columns.

Yes, that's true, the index overhead gets too high.

> If that's true then 22 columns for a multicolumn index seems way too
> much. Or maybe it is workable as every column uses only a very limited
> set of values?

Yes, I think that a 22 column index is way too much, especially with the
new bitmap index scans available.

> I was also thinking about using a functional index.

If there's a logical relation between those values so that they can easily
be combined, that may be a good alternative.


I just had another weird idea:

As each of your paramXX columns can take only 10 values, it might also be
feasible to use a bunch of 10 conditional (partial) indices, like:

CREATE INDEX foo1 ON table (param1, param2) WHERE param0='1st value';
CREATE INDEX foo2 ON table (param1, param2) WHERE param0='2nd value';
CREATE INDEX foo3 ON table (param1, param2) WHERE param0='3rd value';
[...]

This way, you don't have the index bloat of a 3-column index, but 10
2-column indices that cover 1/10th of the table each.

For 22 columns, you'd need a bunch of seven such indices plus a
single-column one, or you can use a mix of 3+1 and 2+1 column indexes.
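Spelled out for one such group (index names and the partitioning column are illustrative):

```sql
-- Ten partial indexes, one per value of the partitioning column;
-- each covers roughly 1/10th of the table
CREATE INDEX items_p01_a ON items (param02, param03) WHERE param01 = 'A';
CREATE INDEX items_p01_b ON items (param02, param03) WHERE param01 = 'B';
-- ... eight more, one per remaining value of param01 ...

-- A query that pins param01 can use exactly one of the small indexes
SELECT * FROM items
WHERE param01 = 'A'
  AND param02 = 'P'
  AND param03 = 'G';
```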

I'd like to see the query plans from explain analyze.

Btw, I expect query planning time to get rather significant for so many
columns, so geqo tuning, tuning work_mem (for the bitmap scans) and
prepared statements will pay off.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Better index strategy for many fields with few values

From
"Jim C. Nasby"
Date:
On Wed, Apr 12, 2006 at 02:59:32PM +0200, Markus Schaber wrote:
> > I was thinking about using a multicolumn index, but I have read that
> > we should limit multicolumn indexes to at most 2 or 3 columns.
>
> Yes, that's true, the index overhead gets too high.
>
> > I was also thinking about using a functional index.
>
> If there's a logical relation between those values so that they can easily
> be combined, that may be a good alternative.

How would that be any better than just doing a multi-column index?

> I just had another weird idea:
>
> As each of your paramXX columns can take only 10 values, it might also be
> feasible to use a bunch of 10 conditional (partial) indices, like:
>
> CREATE INDEX foo1 ON table (param1, param2) WHERE param0='1st value';
> CREATE INDEX foo2 ON table (param1, param2) WHERE param0='2nd value';
> CREATE INDEX foo3 ON table (param1, param2) WHERE param0='3rd value';
> [...]

Not all that weird; it's known as index partitioning.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Better index strategy for many fields with few values

From
"Luke Lonergan"
Date:
Oscar,

On 4/10/06 9:58 AM, "Oscar Picasso" <oscgoogle@yahoo.com> wrote:

> - My items table:
> code int              -- can take one of 100 values
> property varchar(250) -- can take one of 5000 values
> param01 char(10)      -- can take one of 10 values
> param02 char(10)      -- can take one of 10 values
> ...
> [ 20 similar columns ]
> ...
> param20 char(10)      -- can take one of 10 values
>
> - The kind of query I want to optimize:
> select * from items
> where code between 5 and 22
> and param01 = 'P'
> and param02 = 'G'
> ...
> [ all the 20 paramXX columns are used in the query ]
> ...
> and param20 = 'C';

Bizgres 0.9 has an on-disk bitmap index which will likely improve this query
speed by a very large amount over normal Postgres 8.1.

- Luke



Re: Better index strategy for many fields with few values

From
"Jim Nasby"
Date:
Adding -performance back in
-----Original Message-----
From: Oscar Picasso [mailto:oscgoogle@yahoo.com]
Sent: Wednesday, April 12, 2006 5:51 PM
To: Jim Nasby
Subject: Re: [PERFORM] Better index stategy for many fields with few values

I would like to try it.

However, in another post I added that, contrary to what I stated initially, not all the paramXX columns are mandatory in the query. So it seems that requirement makes the problem more complex.

Doesn't this new requirement rule out this solution? 
No, just group the columns logically.
By the way, I have tested indexing each column individually and checked what happens in relation to bitmap scans. My test table is 1 million rows. The EXPLAIN ANALYZE command shows that a bitmap scan is sometimes used, but I still end up with queries that can take up to 10s, which is way too much.


"Jim C. Nasby" <jnasby@pervasive.com> wrote:
On Wed, Apr 12, 2006 at 02:59:32PM +0200, Markus Schaber wrote:
> > I was thinking about using a multicolumn index, but I have read that
> > we should limit multicolumn indexes to at most 2 or 3 columns.
>
> Yes, that's true, the index overhead gets too high.
>
> > I was also thinking about using a functional index.
>
> If there's a logical relation between those values so that they can easily
> be combined, that may be a good alternative.

How would that be any better than just doing a multi-column index?

> I just had another weird idea:
>
> As each of your paramXX columns can take only 10 values, it might also be
> feasible to use a bunch of 10 conditional (partial) indices, like:
>
> CREATE INDEX foo1 ON table (param1, param2) WHERE param0='1st value';
> CREATE INDEX foo2 ON table (param1, param2) WHERE param0='2nd value';
> CREATE INDEX foo3 ON table (param1, param2) WHERE param0='3rd value';
> [...]

Not all that weird; it's known as index partitioning.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461




Re: Better index strategy for many fields with few values

From
Markus Schaber
Date:
Hi, Jim,

Jim C. Nasby wrote:

>>> I was also thinking about using a functional index.
>> If there's a logical relation between those values so that they can easily
>> be combined, that may be a good alternative.
> How would that be any better than just doing a multi-column index?

10 different values per column, and 20 columns are 10^20 value combinations.

Partitioning it by the first column gives 10^19 combinations, which is
smaller than 2^64 and thus fits into a 64-bit value.

And I just guess that a 10-partition functional index on a long value
could perform better than a multi-column index on 20 columns of
character(10), if only because it is approx. 1/25th in size.
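A sketch of such a functional index for three of the columns (the value-to-digit mapping and all names are hypothetical; the thread does not give the actual parameter values):

```sql
-- IMMUTABLE helper mapping each of the 10 known parameter values
-- to a digit 0..9; the real mapping is application-specific
CREATE OR REPLACE FUNCTION param_digit(p char(10)) RETURNS integer AS $$
  SELECT CASE p
    WHEN 'A' THEN 0
    WHEN 'B' THEN 1
    WHEN 'C' THEN 2
    -- ... one WHEN arm per remaining value, through 9 ...
  END;
$$ LANGUAGE sql IMMUTABLE;

-- Pack the columns into one integer, one decimal digit per column;
-- partitioning on param01 keeps the packed key small
CREATE INDEX items_packed_a_idx ON items (
  (param_digit(param02) * 100
   + param_digit(param03) * 10
   + param_digit(param04))
) WHERE param01 = 'A';
```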

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Better index strategy for many fields with few values

From
Markus Schaber
Date:
Hi, Jim,

Jim Nasby wrote:
> Adding -performance back in
>     I would like to try it.
>
>     However, in another post I added that, contrary to what I stated
>     initially, not all the paramXX columns are mandatory in the query. So
>     it seems that requirement makes the problem more complex.

Okay, this rules out my functional index over 19 columns.

>     Doesn't this new requirement rule out this solution?
>
> No, just group the columns logically.

Yes, that's the solution.

If you have common groups of columns that appear and disappear
synchronously, pack those together in a (possibly partitioned and/or
functional) index.

Then rely on the query planner to combine the appropriate indexes
via a bitmap index scan.

>     By the way, I have tested indexing each column individually and checked
>     what happens in relation to bitmap scans. My test table is 1 million
>     rows. The EXPLAIN ANALYZE command shows that a bitmap scan is sometimes
>     used, but I still end up with queries that can take up to 10s, which is
>     way too much.

Is it on the first query, or on repeated queries?

It might be that you're I/O bound, and the backend has to fetch indexes
and rows from disk into RAM.

I currently don't know whether the order of indices in a multi-index
bitmap scan is relevant, but I could imagine that it may be useful to
have the most selective index scanned first.

And keep in mind that, assuming an equal distribution of your
parameters, every index bitmap hits 1/10th of the whole table on
average, so the selectivity generally is low.

The selectivity of a partitioned 3-column index will be much better
(about 1/10000th of the whole table), and fewer index scans and bitmaps
have to be generated.

A functional index may also make sense to CLUSTER the table to optimize
the locality of search results (and so reduce disk I/O). In case your
table has low write activity but high read-only activity, the overhead
that comes with the additional index is negligible compared to the
performance improvement proper CLUSTERing can generate.
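A sketch of that clustering step, assuming a functional index named items_cluster_idx already exists (on the 8.1 release discussed here the syntax is `CLUSTER index ON table`; later releases use `CLUSTER table USING index`):

```sql
-- Physically reorder the table by the index so rows matching the
-- same parameter combination end up on neighbouring pages
CLUSTER items_cluster_idx ON items;

-- CLUSTER is one-shot: the ordering degrades as rows are written,
-- so re-run it periodically, and refresh planner statistics after
ANALYZE items;
```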

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Better index strategy for many fields with few values

From
Markus Schaber
Date:
Hi, Oscar,

Please reply to the list and not privately, so others can learn from
your replies, and possibly have better ideas than me.

Oscar Picasso wrote:

> I cannot group the columns logically. Any column may or may not appear
> in a query.

That's suboptimal.

> Summarizing what I have learned:
> - I cannot use multicolumn indexes because I cannot group the columns
> logically.
> - I cannot use functional indexes
> - I cannot use clustering.

You still can have a set of partitioned multi-column indices,
overlapping enough that every combination of columns is covered (or risk
a sequential sub scan for the last two or three columns, this should not
hurt too much if the first 17 columns were selective enough).

The main problem with indices is that they also decrease write performance.

If disk cost is not a limiting factor, it will make sense to have WAL,
table and indexes on different disks / RAID arrays, to parallelize writes.

Btw, I guess you have multiple, concurrent users?

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Better index strategy for many fields with few values

From
Oscar Picasso
Date:
Hi Markus,

Markus Schaber <schabi@logix-tt.com> wrote:

>Hi, Oscar,
>
>Please reply to the list and not privately, so others can learn from
>your replies, and possibly have better ideas than me.

That was my intention. I made a mistake.

>Oscar Picasso wrote:
>
>> I cannot group the columns logically. Any column may or may not appear
>> in a query.
>
>That's suboptimal.
>
>> Summarizing what I have learned:
>> - I cannot use multicolumn indexes because I cannot group the columns
>> logically.
>> - I cannot use functional indexes
>> - I cannot use clustering.
>
>You still can have a set of partitioned multi-column indices,
>overlapping enough that every combination of columns is covered (or risk
>a sequential sub scan for the last two or three columns, this should not
>hurt too much if the first 17 columns were selective enough).
>
>The main problem with indices is that they also decrease write performance.
>
>If disk cost is not a limiting factor, it will make sense to have WAL,
>table and indexes on different disks / RAID arrays, to parallelize writes.
>
>Btw, I guess you have multiple, concurrent users?

Yes I do.

I have just made other tests with only the individual indexes, and performance is much better than previously. Obviously
there was an I/O problem during my initial test.

Something interesting though. If I use only a few columns in the query, the results come very quickly and pg does a
sequential scan.

When it reaches some threshold (4 or 5 columns), pg switches to bitmap scans. It then takes an almost constant time (~
2500ms) no matter how many more columns I add to the where clause.

Interestingly enough, queries with many columns are less common. They also return fewer results, and many times no
result at all.