Discussion: 7.3.1 index use / performance
Hi,
I have just finished migrating my test system to 7.3.1
and I am experiencing some performance problems.
i have a table "noon"
Table "public.noon"
Column | Type | Modifiers
------------------------+------------------------+-----------
v_code | character varying(4) |
log_no | bigint |
report_date | date |
report_time | time without time zone |
voyage_no | integer |
charterer | character varying(12) |
port | character varying(24) |
duration | character varying(4) |
rotation | character varying(9) |
......
with a total of 278 columns.
it has indexes:
Indexes: noonf_date btree (report_date),
noonf_logno btree (log_no),
noonf_rotation btree (rotation text_ops),
noonf_vcode btree (v_code),
noonf_voyageno btree (voyage_no)
On the test 7.3.1 system (a FreeBSD 4.7-RELEASE-p2, Celeron 1.2GHz
400Mb, with 168Mb for pgsql),
i get:
dynacom=# EXPLAIN ANALYZE select
FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
v_code='4500' and rotation='NOON ' and report_date between
'2002-01-07' and '2003-01-07';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..4.46 rows=1 width=39)
(actual time=0.27..52.89 rows=259 loops=1)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <=
'2003-01-07'::date))
Filter: ((v_code = '4500'::character varying) AND (rotation = 'NOON
'::character varying))
Total runtime: 53.98 msec
(4 rows)
After I drop the noonf_date index I actually get better performance,
because the backend now uses the more appropriate index, noonf_vcode:
dynacom=# EXPLAIN ANALYZE select
FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
v_code='4500' and rotation='NOON ' and report_date between
'2002-01-07' and '2003-01-07';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_vcode on noon (cost=0.00..3122.88 rows=1
width=39) (actual time=0.16..13.92 rows=259 loops=1)
Index Cond: (v_code = '4500'::character varying)
Filter: ((rotation = 'NOON '::character varying) AND (report_date
>= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
Total runtime: 14.98 msec
(4 rows)
On the pgsql 7.2.3 development system (a RH linux 2.4.7, PIII 1 GHz,
1Mb, with 168M for pgsql), I always get the right index use:
dynacom=# EXPLAIN ANALYZE select
FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
v_code='4500' and rotation='NOON ' and report_date between
'2002-01-07' and '2003-01-07';
NOTICE: QUERY PLAN:
Index Scan using noonf_vcode on noon (cost=0.00..3046.38 rows=39
width=39) (actual time=0.09..8.55 rows=259 loops=1)
Total runtime: 8.86 msec
EXPLAIN
Is there something I am missing?
Is this reasonable behaviour?
P.S.
Yes, I vacuum analyzed both systems before the queries were issued.
==================================================================
Achilleus Mantzios
S/W Engineer
IT dept
Dynacom Tankers Mngmt
Nikis 4, Glyfada
Athens 16610
Greece
tel: +30-10-8981112
fax: +30-10-8981877
email: achill@matrix.gatewaynet.com
mantzios@softlab.ece.ntua.gr
Achilleus Mantzios wrote:
<cut>
> it has indexes:
> Indexes: noonf_date btree (report_date),
>          noonf_logno btree (log_no),
>          noonf_rotation btree (rotation text_ops),
>          noonf_vcode btree (v_code),
>          noonf_voyageno btree (voyage_no)
<cut>
> Index Scan using noonf_date on noon (cost=0.00..4.46 rows=1 width=39)
> (actual time=0.27..52.89 rows=259 loops=1)
> Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <=
> '2003-01-07'::date))
> Filter: ((v_code = '4500'::character varying) AND (rotation = 'NOON
> '::character varying))
> Total runtime: 53.98 msec
> (4 rows)
<cut>

Maybe it is not an answer to your question, but why don't you help Postgres
by yourself? For this kind of query it's better to drop the index on
report_date - your report period is one year, and that condition matches
about 10% of the records (I suppose).

It would be better to replace the two indexes on v_code and rotation with
one index on both fields. What kinds of queries do you have? How many
records does each "where" condition return? Use indexes on the fields
whose conditions yield the smallest number of rows.

Regards,
Tomasz Myrta
On Tue, 7 Jan 2003, Tomasz Myrta wrote:
> Maybe it is not an answer to your question, but why don't you help
> Postgres by yourself?
Thanx,
but I don't think the issue here is to help postgresql by myself.
I can always stick to 7.2.3, or use indexes that 7.3.1 will
acknowledge, like noonf_vcode_date on noon (v_code,report_date).
(Unfortunately, when I create the above noonf_vcode_date index, it is only
used until the next vacuum analyze - hackers, is this an issue too?)
But these options are not interesting from a postgresql perspective :)
> For this kind of queries it's better to drop index on report_date - your
> report period is one year and answer to this condition is 10% records (I
> suppose)
I cannot drop the index on the report_date since a lot of other queries
need it.
> It would be better to change 2 indexes on v_code and rotation into one
> index based on both fields.
> What kind of queries do you have? How many records returns each "where"
> condition? Use indexes on fields, on which condition result in smallest
> amount of rows.
>
> Regards,
> Tomasz Myrta
>
On Tue, 7 Jan 2003, Achilleus Mantzios wrote:

> i am just in the stage of having migrated my test system to 7.3.1
> and i am experiencing some performance problems.
>
> i have a table "noon"
<cut>
> Index Scan using noonf_date on noon (cost=0.00..4.46 rows=1 width=39)
> (actual time=0.27..52.89 rows=259 loops=1)
<cut>
> Index Scan using noonf_vcode on noon (cost=0.00..3122.88 rows=1
> width=39) (actual time=0.16..13.92 rows=259 loops=1)

What do the statistics for the three columns actually look like, and what
are the real distributions and counts like? Given an estimated cost of
around 4 for the first scan, my guess would be that it's not expecting a
lot of rows between 2002-01-07 and 2003-01-07, which would make that a
reasonable plan.
On Tue, 7 Jan 2003, Stephan Szabo wrote:
>
> On Tue, 7 Jan 2003, Achilleus Mantzios wrote:
>
> > i am just in the stage of having migrated my test system to 7.3.1
> > and i am experiencing some performance problems.
> >
> > i have a table "noon"
> > Table "public.noon"
> > Column | Type | Modifiers
> > ------------------------+------------------------+-----------
> > v_code | character varying(4) |
> > log_no | bigint |
> > report_date | date |
> > report_time | time without time zone |
> > voyage_no | integer |
> > charterer | character varying(12) |
> > port | character varying(24) |
> > duration | character varying(4) |
> > rotation | character varying(9) |
> > ......
> >
> > with a total of 278 columns.
> >
> > it has indexes:
> > Indexes: noonf_date btree (report_date),
> > noonf_logno btree (log_no),
> > noonf_rotation btree (rotation text_ops),
> > noonf_vcode btree (v_code),
> > noonf_voyageno btree (voyage_no)
> >
> > On the test 7.3.1 system (a FreeBSD 4.7-RELEASE-p2, Celeron 1.2GHz
> > 400Mb, with 168Mb for pgsql),
> > i get:
> > dynacom=# EXPLAIN ANALYZE select
> > FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
> > v_code='4500' and rotation='NOON ' and report_date between
> > '2002-01-07' and '2003-01-07';
> > QUERY PLAN
> >
>
> > -------------------------------------------------------------------------------------------------------------------
> > Index Scan using noonf_date on noon (cost=0.00..4.46 rows=1 width=39)
> > (actual time=0.27..52.89 rows=259 loops=1)
>
>
> > Index Scan using noonf_vcode on noon (cost=0.00..3122.88 rows=1
> > width=39) (actual time=0.16..13.92 rows=259 loops=1)
>
>
> What do the statistics for the three columns actually look like and what
> are the real distributions and counts like?
The two databases (the test 7.3.1 and the development 7.2.3) are identical
(loaded from the same pg_dump).
About the stats on these 3 columns I get: (see also attachment 1 to avoid
indentation/wraparound problems)
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs |
histogram_bounds | correlation
------------+-----------+-------------+-----------+-----------+------------+-----------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+-------------
public | noon | v_code | 0 | 8 | 109 |
{4630,4650,4690,4670,4520,4610,4550,4560,4620,4770} |
{0.0283333,0.028,0.0256667,0.0243333,0.024,0.0236667,0.0233333,0.0233333,0.0226667,0.0226667} |
{2070,3210,4330,4480,4570,4680,4751,4820,4870,4940,6020}
| -0.249905
public | noon | report_date | 0 | 4 | 3408 |
{2001-11-14,1998-10-18,2000-04-03,2000-07-04,2000-12-20,2000-12-31,2001-01-12,2001-10-08,2001-12-25,1996-01-23}|
{0.002,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00133333}|
{"0001-12-11
BC",1994-09-27,1996-03-26,1997-07-29,1998-08-26,1999-03-29,1999-11-30,2000-09-25,2001-05-25,2002-01-17,2002-12-31}|
-0.812295
public | noon | rotation | 0 | 13 | 6 | {"NOON ","PORT LOG ","ARRIVAL
",DEPARTURE,"SEA ","NEXT PORT"} |
{0.460333,0.268667,0.139,0.119667,0.007,0.00533333} |
|
0.119698
(3 rows)
About distributions, i have:
dynacom=# SELECT rotation,count(*) from noon group by rotation;
rotation | count
-----------+-------
| 2
000000000 | 65
ARRIVAL | 1
ARRIVAL | 15471
DEPARTURE | 15030
NEXT PORT | 462
NOON | 50874
PORT LOG | 25688
SEA | 1202
(9 rows)
dynacom=# SELECT v_code,count(*) from noon group by v_code;
v_code | count
--------+-------
0004 | 1
1030 | 1
2070 | 170
2080 | 718
2110 | 558
2220 | 351
2830 | 1373
2840 | 543
2860 | 407
2910 | 418
3010 | 352
3020 | 520
3060 | 61
3130 | 117
3140 | 1
3150 | 752
3160 | 811
3170 | 818
3180 | 1064
3190 | 640
3200 | 998
3210 | 1512
3220 | 595
3230 | 374
3240 | 514
3250 | 13
3260 | 132
3270 | 614
4010 | 413
4020 | 330
4040 | 728
4050 | 778
4060 | 476
4070 | 534
4310 | 759
4320 | 424
4330 | 549
4360 | 366
4370 | 334
4380 | 519
4410 | 839
4420 | 183
4421 | 590
4430 | 859
4450 | 205
4470 | 861
4480 | 766
4490 | 169
4500 | 792
4510 | 2116
4520 | 2954
4530 | 2142
4531 | 217
4540 | 2273
4550 | 2765
4560 | 2609
4570 | 2512
4580 | 1530
4590 | 1987
4600 | 308
4610 | 2726
4620 | 2698
4630 | 2813
4640 | 1733
4650 | 2655
4660 | 2139
4661 | 65
4670 | 2607
4680 | 1729
4690 | 2587
4700 | 2101
4710 | 1830
4720 | 1321
4730 | 1258
4740 | 1506
4750 | 1391
4751 | 640
4760 | 1517
4770 | 2286
4780 | 1353
4790 | 1209
4800 | 2414
4810 | 770
4820 | 1115
4830 | 1587
4840 | 983
4841 | 707
4850 | 1297
4860 | 375
4870 | 1440
4880 | 456
4881 | 742
4890 | 210
4891 | 45
4900 | 2
4910 | 1245
4920 | 414
4930 | 1130
4940 | 1268
4950 | 949
4960 | 836
4970 | 1008
4980 | 1239
5510 | 477
5520 | 380
5530 | 448
5540 | 470
5550 | 352
5560 | 148
5570 | 213
5580 | 109
5590 | 55
6010 | 246
6020 | 185
9180 | 1
(Not all the above vessels are active or belong to me:) )
The distribution on the report_date has no probabilistic significance
since each report_date usually corresponds to one row.
So,
dynacom=# SELECT count(*) from noon;
count
--------
108795
(1 row)
dynacom=#
Now for the specific query the counts are as follows:
dynacom=# select count(*) from noon where v_code='4500';
count
-------
792
(1 row)
dynacom=# select count(*) from noon where rotation='NOON ';
count
-------
50874
(1 row)
dynacom=# select count(*) from noon where report_date between '2002-01-07'
and '2003-01-07';
count
-------
7690
(1 row)
dynacom=#
> Given an estimated cost of around 4 for the first scan, my guess would be
> that it's not expecting alot of rows between 2002-01-07 and 2003-01-07
> which would make that a reasonable plan.
>
As we can see, the rows returned for v_code='4500' (792) are far fewer than
the rows returned for report_date between '2002-01-07' and '2003-01-07'
(7690).
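A quick sanity check of those numbers (a back-of-the-envelope Python sketch over the counts quoted above; this is observed data, not what the planner actually computes):

```python
# Observed counts from the queries above.
total     = 108795   # SELECT count(*) FROM noon
vcode     = 792      # v_code = '4500'
daterange = 7690     # report_date between 2002-01-07 and 2003-01-07
noon_rot  = 50874    # rotation = 'NOON     '

sel_vcode = vcode / total        # ~0.0073: very selective
sel_dates = daterange / total    # ~0.0707: about ten times less selective

# Under an attribute-independence assumption, the three conditions
# together would be expected to match roughly:
expected = total * sel_vcode * sel_dates * (noon_rot / total)
# ~26 rows estimated vs. 259 actually returned - the columns are clearly
# correlated, but either way v_code is the most selective place to start.
```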
Is there a way to provide you with more information?
And I must note that the two databases were built after a fresh
createdb on both systems (and, as I said, they are identical).
But, for some reason, 7.2.3 *always* finds the best index to use :)
Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> About the stats on these 3 columns i get:
Does 7.2 generate the same stats? (minus the schemaname of course)
Also, I would like to see the results of these queries on both versions,
so that we can see what the planner thinks the index selectivity is:
EXPLAIN ANALYZE select * from noon where
v_code='4500';
EXPLAIN ANALYZE select * from noon where
report_date between '2002-01-07' and '2003-01-07';
regards, tom lane
On Tue, 7 Jan 2003, Tom Lane wrote:
> Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> > About the stats on these 3 columns i get:
>
> Does 7.2 generate the same stats? (minus the schemaname of course)
Not absolutely but close:
(See attachment)
>
> Also, I would like to see the results of these queries on both versions,
> so that we can see what the planner thinks the index selectivity is:
>
> EXPLAIN ANALYZE select * from noon where
> v_code='4500';
>
> EXPLAIN ANALYZE select * from noon where
> report_date between '2002-01-07' and '2003-01-07';
>
On 7.3.1 (On a FreeBSD)
=======================
dynacom=# EXPLAIN ANALYZE select * from noon where v_code='4500';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_vcode on noon (cost=0.00..3066.64 rows=829
width=1974) (actual time=2.02..1421.14 rows=792 loops=1)
Index Cond: (v_code = '4500'::character varying)
Total runtime: 1424.82 msec
(3 rows)
dynacom=# EXPLAIN ANALYZE select * from noon where report_date between
'2002-01-07' and '2003-01-07';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..15919.50 rows=11139
width=1974) (actual time=2.05..13746.17 rows=7690 loops=1)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <=
'2003-01-07'::date))
Total runtime: 13775.48 msec
(3 rows)
On 7.2.3 (Linux)
==================
dynacom=# EXPLAIN ANALYZE select * from noon where v_code='4500';
NOTICE: QUERY PLAN:
Index Scan using noonf_vcode on noon (cost=0.00..3043.45 rows=827
width=1974) (actual time=19.59..927.06 rows=792 loops=1)
Total runtime: 928.86 msec
dynacom=# EXPLAIN ANALYZE select * from noon where report_date between
'2002-01-07' and '2003-01-07';
NOTICE: QUERY PLAN:
Index Scan using noonf_date on noon (cost=0.00..16426.45 rows=11958
width=1974) (actual time=29.64..8854.05 rows=7690 loops=1)
Total runtime: 8861.90 msec
EXPLAIN
> regards, tom lane
>
Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
>> Also, I would like to see the results of these queries on both versions,
>> so that we can see what the planner thinks the index selectivity is:
>>
> [ data supplied ]
There is something really, really bizarre going on there. You have
dynacom=# EXPLAIN ANALYZE select * from noon where report_date between '2002-01-07' and '2003-01-07';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..15919.50 rows=11139 width=1974) (actual time=2.05..13746.17 rows=7690
loops=1)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
Total runtime: 13775.48 msec
(3 rows)
and from your earlier message
dynacom=# EXPLAIN ANALYZE select FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
v_code='4500' and rotation='NOON ' and report_date between '2002-01-07' and '2003-01-07';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..4.46 rows=1 width=39) (actual time=0.27..52.89 rows=259 loops=1)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
Filter: ((v_code = '4500'::character varying) AND (rotation = 'NOON'::character varying))
Total runtime: 53.98 msec
(4 rows)
There is no way that adding the filter condition should have reduced the
estimated runtime for this plan --- reducing the estimated number of
output rows, yes, but not the runtime. And in fact I can't duplicate
that when I try it here. I did this on 7.3.1:
regression=# create table noon (v_code character varying(4) ,
regression(# report_date date ,
regression(# rotation character varying(9));
CREATE TABLE
regression=# create index noonf_date on noon(report_date);
CREATE INDEX
regression=# EXPLAIN select * from noon where report_date between
regression-# '2002-01-07' and '2003-01-07';
QUERY PLAN
---------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..17.08 rows=5 width=25)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
(2 rows)
regression=# explain select * from noon where
regression-# v_code='4500' and rotation='NOON ' and report_date between
regression-# '2002-01-07' and '2003-01-07';
QUERY PLAN
--------------------------------------------------------------------------------
------------------
Index Scan using noonf_date on noon (cost=0.00..17.11 rows=1 width=25)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
Filter: ((v_code = '4500'::character varying) AND (rotation = 'NOON '::character varying))
(3 rows)
Note that the cost went up, not down.
I am wondering about a compiler bug, or some other peculiarity on your
platform. Can anyone else using FreeBSD try the above experiment and
see if they get different results from mine on 7.3.* (or CVS tip)?
regards, tom lane
On Tue, 7 Jan 2003, Tom Lane wrote:
> There is no way that adding the filter condition should have reduced the
> estimated runtime for this plan --- reducing the estimated number of
> output rows, yes, but not the runtime. And in fact I can't duplicate
My case persists:
After a clean install of the database, and after vacuum analyze,
I get:
dynacom=# EXPLAIN ANALYZE select
FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
report_date between '2002-01-07' and '2003-01-07';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..16458.54 rows=10774
width=39) (actual time=0.13..205.86 rows=7690 loops=1)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <=
'2003-01-07'::date))
Total runtime: 233.22 msec
dynacom=# EXPLAIN ANALYZE select
FUELCONSUMPTION,rpm,Steam_Hours,voyage_activity,ldin from noon where
report_date between '2002-01-07' and '2003-01-07' and v_code='4500';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using noonf_vcode on noon (cost=0.00..3092.52 rows=83
width=39) (actual time=0.15..15.08 rows=373 loops=1)
Index Cond: (v_code = '4500'::character varying)
Filter: ((report_date >= '2002-01-07'::date) AND (report_date <=
'2003-01-07'::date))
Total runtime: 16.56 msec
(4 rows)
I thought PostgreSQL itself (hub.org) ran on FreeBSD in some capacity;
is there any FreeBSD 4.7 server with pgsql 7.3.1 you could use?
Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> My case persists:
> After clean install of the database, and after vacuum analyze,
> i get
Um ... is it persisting? That looks like it's correctly picked the
vcode index this time. Strange behavior though. By "clean install"
do you mean you rebuilt Postgres, or just did dropdb/createdb/reload
data?
regards, tom lane
On Tue, 7 Jan 2003, Tom Lane wrote:
> Achilleus Mantzios <achill@matrix.gatewaynet.com> writes:
> > My case persists:
> > After clean install of the database, and after vacuum analyze,
> > i get
>
> Um ... is it persisting? That looks like it's correctly picked the
> vcode index this time. Strange behavior though. By "clean install"
> do you mean you rebuilt Postgres, or just did dropdb/createdb/reload
> data?
Just dropdb/createdb/reload.
>
> regards, tom lane
>
> I am wondering about a compiler bug, or some other peculiarity on your
> platform. Can anyone else using FreeBSD try the above experiment and
> see if they get different results from mine on 7.3.* (or CVS tip)?

On FreeBSD 4.7 I received the exact same results as Tom using the
statements shown by Tom.

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
Rod Taylor <rbt@rbt.ca> writes:
>> I am wondering about a compiler bug, or some other peculiarity on your
>> platform. Can anyone else using FreeBSD try the above experiment and
>> see if they get different results from mine on 7.3.* (or CVS tip)?
> On FreeBSD 4.7 I received the exact same results as Tom using the
> statements shown by Tom.
On looking at the code, I do see part of a possible mechanism for this
behavior: cost_index calculates the estimated cost for qual-clause
evaluation like this:
/*
* Estimate CPU costs per tuple.
*
* Normally the indexquals will be removed from the list of restriction
* clauses that we have to evaluate as qpquals, so we should subtract
* their costs from baserestrictcost. XXX For a lossy index, not all
* the quals will be removed and so we really shouldn't subtract their
* costs; but detecting that seems more expensive than it's worth.
* Also, if we are doing a join then some of the indexquals are join
* clauses and shouldn't be subtracted. Rather than work out exactly
* how much to subtract, we don't subtract anything.
*/
cpu_per_tuple = cpu_tuple_cost + baserel->baserestrictcost;
if (!is_injoin)
cpu_per_tuple -= cost_qual_eval(indexQuals);
In theory, indexQuals will always be a subset of the qual list on which
baserestrictcost was computed, so we should always end up with a
cpu_per_tuple value at least as large as cpu_tuple_cost. I am wondering
if somehow in Achilleus's situation, cost_qual_eval() is producing a
silly result leading to negative cpu_per_tuple. I don't see how that
could happen though --- nor why it would happen on his machine and not
other people's.
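That C snippet reduces to the following toy model (a hypothetical Python paraphrase with invented cost numbers; only the arithmetic structure mirrors cost_index):

```python
# Toy paraphrase of the per-tuple CPU cost computation quoted above.
# The cost values are invented for illustration.
CPU_TUPLE_COST = 0.01

def cpu_per_tuple(baserestrictcost, indexqual_cost, is_injoin=False):
    cost = CPU_TUPLE_COST + baserestrictcost
    if not is_injoin:
        # Quals already enforced by the index are subtracted out.
        cost -= indexqual_cost
    return cost

# Normal case: the index quals are a subset of the restriction quals, so
# indexqual_cost <= baserestrictcost and the result stays >= CPU_TUPLE_COST.
normal = cpu_per_tuple(baserestrictcost=0.0075, indexqual_cost=0.0025)

# The pathological case being wondered about: if cost_qual_eval() returned
# a silly (too large) number, the per-tuple cost would go negative, and a
# scan would look *cheaper* the more quals were attached to it.
broken = cpu_per_tuple(baserestrictcost=0.0025, indexqual_cost=0.02)
```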
regards, tom lane
On Tue, 7 Jan 2003, Tom Lane wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> >> I am wondering about a compiler bug, or some other peculiarity on your
> >> platform. Can anyone else using FreeBSD try the above experiment and
> >> see if they get different results from mine on 7.3.* (or CVS tip)?
>
> > On FreeBSD 4.7 I received the exact same results as Tom using the
> > statements shown by Tom.
>
> On looking at the code, I do see part of a possible mechanism for this
> behavior: cost_index calculates the estimated cost for qual-clause
> evaluation like this:
>
This bizarre decreasing-index-cost (when adding conditions) behaviour was
maybe due to some vacuums
(I can't remember how many reloads and vacuums I did to the database
in the period between the two emails).
However, my linux machine with the same pgsql 7.3.1, with a fully clean
installation, also shows the same symptoms:
it chooses the slow index, after some (random)
vacuums it chooses the right index, and then after some more vacuums it
chooses the bad index again.
>
> regards, tom lane
>
Just to close off the thread, here is the end-result of investigating
Achilleus Mantzios' problem.
------- Forwarded Message
Date: Wed, 08 Jan 2003 11:54:36 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Achilleus Mantzios <achill@matrix.gatewaynet.com>
Subject: Re: [SQL] [PERFORM] 7.3.1 index use / performance
I believe I see what's going on. You have a number of silly outlier
values in the report_date column --- quite a few instances of '10007-06-09'
for example. Depending on whether ANALYZE's random sample happens to
include one of these, the histogram generated by ANALYZE might look like
this (it took about half a dozen tries with ANALYZE to get this result):
dynacom=# analyze noon;
ANALYZE
dynacom=# select histogram_bounds from pg_stats where attname = 'report_date';
histogram_bounds
-----------------------------------------------------------------------------------------------------------------------------
{1969-06-26,1994-09-24,1996-04-05,1997-07-21,1998-08-27,1999-03-13,1999-11-11,2000-08-18,2001-04-18,2002-01-04,10007-06-09}
(1 row)
in which case we get this:
dynacom=# EXPLAIN select * from noon where
dynacom-# report_date between '2002-01-07' and '2003-01-07';
QUERY PLAN
---------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..4.08 rows=1 width=1975)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
(2 rows)
Seeing this histogram, the planner assumes that one-tenth of the table
is uniformly distributed between 2002-01-04 and 10007-06-09, which leads
it to the conclusion that the range between 2002-01-07 and 2003-01-07
probably contains only about one row, which causes it to prefer a scan
on report_date rather than on v_code.
The reason the problem comes and goes is that any given ANALYZE run
might or might not happen across one of the outliers. When it doesn't,
you get a histogram that leads to reasonably accurate estimates.
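The histogram arithmetic can be sketched as follows (a simplified Python model of an equal-height-histogram range estimate with linear interpolation inside buckets; the crude day-number helper is a stand-in for real date arithmetic, since Python's datetime cannot even represent year 10007, which is exactly the outlier at issue):

```python
# Simplified model of turning an equal-height histogram into a range
# selectivity. Bucket bounds are the ones ANALYZE produced above.

def day(y, m, d):
    # Crude, monotonic day number; precision does not matter for the sketch.
    return y * 365 + m * 30 + d

bounds = [day(1969, 6, 26), day(1994, 9, 24), day(1996, 4, 5),
          day(1997, 7, 21), day(1998, 8, 27), day(1999, 3, 13),
          day(1999, 11, 11), day(2000, 8, 18), day(2001, 4, 18),
          day(2002, 1, 4), day(10007, 6, 9)]   # note the bogus outlier

def range_selectivity(lo, hi, bounds):
    # Each bucket holds 1/nbuckets of the rows, assumed uniformly
    # distributed between its bounds (linear interpolation).
    nbuckets = len(bounds) - 1
    sel = 0.0
    for b_lo, b_hi in zip(bounds, bounds[1:]):
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap > 0:
            sel += (overlap / (b_hi - b_lo)) / nbuckets
    return sel

total = 108795
lo, hi = day(2002, 1, 7), day(2003, 1, 7)

# With the outlier, the query range is a one-year sliver of a bucket that
# is ~8000 years wide, so the estimate comes out at roughly one row:
with_outlier = total * range_selectivity(lo, hi, bounds)

# With a sane last bound (2002-12-31, as in the good histogram quoted
# earlier in the thread), the same range covers most of the last bucket:
bounds[-1] = day(2002, 12, 31)
without_outlier = total * range_selectivity(lo, hi, bounds)
```

With the outlier the model estimates about one row; without it, around ten thousand, which matches the rows=10774/11139 estimates seen in the good plans earlier in the thread.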
There are a couple of things you could do about this. One is to
increase the statistics target for report_date (see ALTER TABLE SET
STATISTICS) so that a finer-grained histogram is generated for the
report_date column. The other thing, which is more work but probably
the best answer in the long run, is to fix the outliers, which I imagine
must be incorrect entries.
You could perhaps put a constraint on report_date to prevent bogus
entries from sneaking in in the future.
It looks like increasing the stats target would be worth doing also,
if you make many queries using ranges of report_date.
regards, tom lane
------- End of Forwarded Message