Обсуждение: pg 7.4.rc1, Range query performance

Поиск
Список
Период
Сортировка

pg 7.4.rc1, Range query performance

От
ow
Дата:
Postgresql 7.4.rc1 on i686 pc linux compiled by gcc 2.06.
All configuration settings are default.


Hi,

Trying to find a way to improve range query performance.

The table Test has about 30 million records.

-- DLong, Dtimestamp, Dint, etc are domains of the respective types.
create table Test (
id              DLong               not null,
a               Dtimestamp              null,
b               Dint                not null,
c               Dint                not null,
d               Dstring                 null,
constraint PK_id primary key (id),
constraint AK_abc unique (a, b, c)
);

The following query retrieves a single record, it runs against AK index and is
extremely fast (30-150 ms) for the  table of this size:

-- returns result in 30-150 ms
select * from Test
where a = '2002-09-01' and b = 5 and c = 255

OTOH, the following range query that returns 30 records performs much slower,
about 33000 ms. The query is using AK index, as it should, but why does it take
so much longer to scan the index for the range of just 30 records? I see that
PG is hitting the disk very intensively for this query. Can the query be
rewritten, etc to improve performance? Thanks

select * from Test
where a >= '2002-09-01' and a <= '2002-09-30' and b = 5 and c = 255

QUERY PLAN
Index Scan using ak_abc on test  (cost=0.00..106.27 rows=30 width=53) (actual
time=33.536..33200.998 rows=30 loops=1) Index Cond: (((a)::timestamp without time zone >= '2002-09-01
00:00:00'::timestamp without time zone) AND ((a)::timestamp without time zone
<= '2002-09-30 00:00:00'::timestamp without time zone) AND ((b)::integer 
= 5) AND ((c) (..)
Total runtime: 33201.219 ms








__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree


Re: pg 7.4.rc1, Range query performance

От
Bruce Momjian
Дата:
Try CLUSTER --- that usually helps with index scans on ranges.

---------------------------------------------------------------------------

ow wrote:
> Postgresql 7.4.rc1 on i686 pc linux compiled by gcc 2.06.
> All configuration settings are default.
> 
> 
> Hi,
> 
> Trying to find a way to improve range query performance.
> 
> The table Test has about 30 million records.
> 
> -- DLong, Dtimestamp, Dint, etc are domains of the respective types.
> create table Test (
> id              DLong               not null,
> a               Dtimestamp              null,
> b               Dint                not null,
> c               Dint                not null,
> d               Dstring                 null,
> constraint PK_id primary key (id),
> constraint AK_abc unique (a, b, c)
> );
> 
> The following query retrieves a single record, it runs against AK index and is
> extremely fast (30-150 ms) for the  table of this size:
> 
> -- returns result in 30-150 ms
> select * from Test
> where a = '2002-09-01'
>   and b = 5
>   and c = 255
> 
> OTOH, the following range query that returns 30 records performs much slower,
> about 33000 ms. The query is using AK index, as it should, but why does it take
> so much longer to scan the index for the range of just 30 records? I see that
> PG is hitting the disk very intensively for this query. Can the query be
> rewritten, etc to improve performance? Thanks
> 
> select * from Test
> where a >= '2002-09-01'
>   and a <= '2002-09-30'
>   and b = 5
>   and c = 255
> 
> QUERY PLAN
> Index Scan using ak_abc on test  (cost=0.00..106.27 rows=30 width=53) (actual
> time=33.536..33200.998 rows=30 loops=1)
>   Index Cond: (((a)::timestamp without time zone >= '2002-09-01
> 00:00:00'::timestamp without time zone) AND ((a)::timestamp without time zone
> <= '2002-09-30 00:00:00'::timestamp without time zone) AND ((b)::integer 
> = 5) AND ((c) (..)
> Total runtime: 33201.219 ms
> 
> 
> 
> 
> 
> 
> 
> 
> __________________________________
> Do you Yahoo!?
> Protect your identity with Yahoo! Mail AddressGuard
> http://antispam.yahoo.com/whatsnewfree
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>                http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: pg 7.4.rc1, Range query performance

От
Stephan Szabo
Дата:
On Sat, 8 Nov 2003, ow wrote:

> Postgresql 7.4.rc1 on i686 pc linux compiled by gcc 2.06.
> All configuration settings are default.
>
>
> Hi,
>
> Trying to find a way to improve range query performance.
>
> The table Test has about 30 million records.
>
> -- DLong, Dtimestamp, Dint, etc are domains of the respective types.
> create table Test (
> id              DLong               not null,
> a               Dtimestamp              null,
> b               Dint                not null,
> c               Dint                not null,
> d               Dstring                 null,
> constraint PK_id primary key (id),
> constraint AK_abc unique (a, b, c)
> );

I'm not sure that AK_abc is the best index for check a range on a and
single values on b and c.  I'd think that something like an index
on (b,c,a) would probably be better for this purpose (without doing any
testing ;) ).


Re: pg 7.4.rc1, Range query performance

От
ow
Дата:
Hi,

I tried CLUSTER and it did improve performance, somewhat. The query against
"clustered" table performs about five (5) times better than the same table but
"non-clustered". However, even after that table was clustered, the difference
in performance between single record query and range query is significant:

table Test (see below) has 10M records single record - 31 ms and remains mostly constant as table grows range query
returning30 records - about 10 secs and grows together with the
 
table

Also, CLUSTER is locking the table (in our case this also means locking the
database), so it may be impossible to use it in production on large tables
(impossible in our case).

It feels like I really have a problem here. Any ideas? Thanks

P.S. For the future I would consider implementing "CREATE [CLUSTERED] INDEX"


--- Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> 
> Try CLUSTER --- that usually helps with index scans on ranges.
> 
> ---------------------------------------------------------------------------
> 
> ow wrote:
> > Postgresql 7.4.rc1 on i686 pc linux compiled by gcc 2.06.
> > All configuration settings are default.
> > 
> > 
> > Hi,
> > 
> > Trying to find a way to improve range query performance.
> > 
> > The table Test has about 30 million records.
> > 
> > -- DLong, Dtimestamp, Dint, etc are domains of the respective types.
> > create table Test (
> > id              DLong               not null,
> > a               Dtimestamp              null,
> > b               Dint                not null,
> > c               Dint                not null,
> > d               Dstring                 null,
> > constraint PK_id primary key (id),
> > constraint AK_abc unique (a, b, c)
> > );
> > 
> > The following query retrieves a single record, it runs against AK index and
> is
> > extremely fast (30-150 ms) for the  table of this size:
> > 
> > -- returns result in 30-150 ms
> > select * from Test
> > where a = '2002-09-01'
> >   and b = 5
> >   and c = 255
> > 
> > OTOH, the following range query that returns 30 records performs much
> slower,
> > about 33000 ms. The query is using AK index, as it should, but why does it
> take
> > so much longer to scan the index for the range of just 30 records? I see
> that
> > PG is hitting the disk very intensively for this query. Can the query be
> > rewritten, etc to improve performance? Thanks
> > 
> > select * from Test
> > where a >= '2002-09-01'
> >   and a <= '2002-09-30'
> >   and b = 5
> >   and c = 255
> > 
> > QUERY PLAN
> > Index Scan using ak_abc on test  (cost=0.00..106.27 rows=30 width=53)
> (actual
> > time=33.536..33200.998 rows=30 loops=1)
> >   Index Cond: (((a)::timestamp without time zone >= '2002-09-01
> > 00:00:00'::timestamp without time zone) AND ((a)::timestamp without time
> zone
> > <= '2002-09-30 00:00:00'::timestamp without time zone) AND ((b)::integer 
> > = 5) AND ((c) (..)
> > Total runtime: 33201.219 ms
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > __________________________________
> > Do you Yahoo!?
> > Protect your identity with Yahoo! Mail AddressGuard
> > http://antispam.yahoo.com/whatsnewfree
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: Have you searched our list archives?
> > 
> >                http://archives.postgresql.org
> > 
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree


Re: pg 7.4.rc1, Range query performance

От
ow
Дата:
--- Stephan Szabo <sszabo@megazone.bigpanda.com> wrote:
> I'm not sure that AK_abc is the best index for check a range on a and
> single values on b and c.  I'd think that something like an index
> on (b,c,a) would probably be better for this purpose (without doing any
> testing ;) ).

That would not work for us since most of the time users are working with most
recent data, hence column "a" is the first in the index.

Thanks





__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree


Re: pg 7.4.rc1, Range query performance

От
Bruce Momjian
Дата:
ow wrote:
> Hi,
> 
> I tried CLUSTER and it did improve performance, somewhat. The query against
> "clustered" table performs about five (5) times better than the same table but
> "non-clustered". However, even after that table was clustered, the difference
> in performance between single record query and range query is significant:
> 
> table Test (see below) has 10M records
>   single record - 31 ms and remains mostly constant as table grows
>   range query returning 30 records - about 10 secs and grows together with the
> table
> 
> Also, CLUSTER is locking the table (in our case this also means locking the
> database), so it may be impossible to use it in production on large tables
> (impossible in our case).
> 
> It feels like I really have a problem here. Any ideas? Thanks
> 
> P.S. For the future I would consider implementing "CREATE [CLUSTERED] INDEX"

Strange 30 records takes 30x the time than one record.  Can you run
ANALYZE and send us an EXPLAIN of the query to make sure it hasn't
changed?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: pg 7.4.rc1, Range query performance

От
Stephan Szabo
Дата:
On Mon, 10 Nov 2003, ow wrote:

> --- Stephan Szabo <sszabo@megazone.bigpanda.com> wrote:
> > I'm not sure that AK_abc is the best index for check a range on a and
> > single values on b and c.  I'd think that something like an index
> > on (b,c,a) would probably be better for this purpose (without doing any
> > testing ;) ).
>
> That would not work for us since most of the time users are working with most
> recent data, hence column "a" is the first in the index.

Are you doing alot of queries where you're only searching on a?  If so,
then you may want to consider a second, non-unique index to speed up this
sort of query. With a range query on a and fixed values on b and c, you're
really not utilizing an index in that order efficiently.


Re: pg 7.4.rc1, Range query performance

От
ow
Дата:
--- Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Strange 30 records takes 30x the time than one record.  Can you run
> ANALYZE and send us an EXPLAIN of the query to make sure it hasn't
> changed?
> 

explain analyze select * from Test
where a >= '2002-06-18' and a <= '2002-07-18' and b = 5 and c = 255

QUERY PLAN
Index Scan using ak_abc on test  (cost=0.00..121.23 rows=34 width=53) (actual
time=18.060..10726.387 rows=31 loops=1) Index Cond: (((a)::timestamp without time zone >= '2002-06-18
00:00:00'::timestamp without time zone) AND ((a)::timestamp without time zone
<= '2002-07-18 00:00:00'::timestamp without time zone) AND ((b)::integer = 5)
AND ((c) (..)
Total runtime: 10726.663 ms

Thanks







__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree


Re: pg 7.4.rc1, Range query performance

От
Bruce Momjian
Дата:
ow wrote:
> --- Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> > Strange 30 records takes 30x the time than one record.  Can you run
> > ANALYZE and send us an EXPLAIN of the query to make sure it hasn't
> > changed?
> > 
> 
> explain analyze select * from Test
> where a >= '2002-06-18'
>   and a <= '2002-07-18'
>   and b = 5
>   and c = 255
> 
> QUERY PLAN
> Index Scan using ak_abc on test  (cost=0.00..121.23 rows=34 width=53) (actual
> time=18.060..10726.387 rows=31 loops=1)
>   Index Cond: (((a)::timestamp without time zone >= '2002-06-18
> 00:00:00'::timestamp without time zone) AND ((a)::timestamp without time zone
> <= '2002-07-18 00:00:00'::timestamp without time zone) AND ((b)::integer = 5)
> AND ((c) (..)
> Total runtime: 10726.663 ms

OK, I see now.  You must have a lot of rows from '2002-06-18' to
'2002-07-18', but only 33 with the b,c conditions --- not much we can do
to speed this up because the condition on 'a' isn't restrictive enough
--- not sure if b or c is either.  It is all those lookups to find the
rows that match a, then b/c that is taking the time.  In fact, it now
make sense that it takes 30x time because all the time is spent
traversing the index looking for match #1, then match #2, etc.  We would
do this quickly if there were lots of rows matching a specific 'a'
value, e.g.

> explain analyze select * from Test
> where a = '2002-06-18'   <---
>   and b = 5
>   and c = 255

The index/cluster is grouping the rows, but the grouping is by timestamp
value, not by range >= '2002-06-18' and <= '2002-07-18'.  Even though
you have index a,b,c, it really is only using the index on 'a' because
the index on b,c only happens when you have multiple duplicate 'a'
values, but in this case you have an entire months worth.  The only
quick way would be to create a functional index on 'a', and cluster on
that:
create index ii on x (date_part("month", a), b,c)

or something like that.  You can't actually index on a function and then
ordinary columns so you would need a pretty fancy function in plpgsql
that converted the a,b,c value into a nice text string and then index on
that.  Then if you used that function call in your WHERE clause, the
index would be used and it would be very fast because all your 'a'
values would be the same, and it could then jump to b and c quickly.

Sorry there isn't a simple solution.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: pg 7.4.rc1, Range query performance

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Sorry there isn't a simple solution.

But there is: make an index with the column order (b,c,a).
        regards, tom lane


Re: pg 7.4.rc1, Range query performance

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Sorry there isn't a simple solution.
> 
> But there is: make an index with the column order (b,c,a).

Oh, yea, right.  If he puts the columns he is doing a straight equals
comparison first, the 'a' comparison will work fine.  Great.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: pg 7.4.rc1, Range query performance

От
ow
Дата:
--- Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Sorry there isn't a simple solution.
> > 
> > But there is: make an index with the column order (b,c,a).
> 
> Oh, yea, right.  If he puts the columns he is doing a straight equals
> comparison first, the 'a' comparison will work fine.  Great.
> 

Yes, I think Stephan Szabo was trying to convince me of something like that for
quite sometime; it was not clear to me then. I'll give it a try.

I still think "CREATE [CLUSTERED] INDEX" (or whatever) is an important feature
and  should be implemented in pgSql. Locking large (or huge) tables for
prolonged periods of time may not be acceptable in many setups.

Thanks






__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree