Обсуждение: Problems with inequalities on numeric fields in 6.5

Поиск
Список
Период
Сортировка

Problems with inequalities on numeric fields in 6.5

От
Martin Weinberg
Дата:
Folks,

I have som large data basees (from 7 million to 20 million
records and growing) and want to optimize selection in a
range between two float4 values.  I have made indices on the
relevant variables.

I notice that a query such as:

   select count(*) from mydata where x='3.4';

executes in under 10 seconds.  However, a query such as

   select count(*) from mydata where x>'3.4' and x<'3.5';

takes at least 20 minutes.  EXPLAIN suggests that both
are using an index scan.

Not forcing float4 conversion does much better, e.g.

   select count(*) from mydata where x>3.4 and x<3.5;

returns in under a minute although EXPLAIN suggests
that an sequential scan is used.

Can anyone give me some guidance on what that best
strategy is do select records in a range of float4
values?

Thanks!

--Martin

P.S. This behavior is also present in 6.4.2.

===========================================================================

Martin Weinberg                      Phone: (413) 545-3821
Dept. of Physics and Astronomy       FAX:   (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA  01003-4525

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Bruce Momjian
Дата:
> Not forcing float4 conversion does much better, e.g.
>
>    select count(*) from mydata where x>3.4 and x<3.5;
>

OK, let me ask.  Vacuum analyze.  What does pg_statistics show for
min/max values?  What does EXPLAIN show?



--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Martin Weinberg
Дата:
Bruce Momjian wrote on Tue, 06 Jul 1999 23:24:12 EDT
>> Not forcing float4 conversion does much better, e.g.
>>
>>    select count(*) from mydata where x>3.4 and x<3.5;
>>
>
>OK, let me ask.  Vacuum analyze.  What does pg_statistics show for
>min/max values?  What does EXPLAIN show?
>

The _true_ variable name is called "j_m" in the table "lmctot" and
the database is called lmc.

From "select * from pg_statistic", I have min and max to be
2.731 and 99.999 for that float4 field.

Explain for the converted values:
--------------------------------
lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE:  QUERY PLAN:

Aggregate  (cost=62349.97 rows=788100 width=4)
  ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)

And for the uncast values:
-------------------------

lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE:  QUERY PLAN:

Aggregate  (cost=62349.97 rows=788100 width=4)
  ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)


lmc=> explain select count(*) from lmctot where j_m>3.4 and
j_m<3.5;NOTICE:  QUERY PLAN:

Aggregate  (cost=423901.50 rows=788100 width=4)
  ->  Seq Scan on lmctot  (cost=423901.50 rows=788100 width=4)

EXPLAIN


Any clues?  I tried looking at the "verbose" output but that
is beyond me.

I appreciate the help.  I need to figure out whether pgsql
will do the job for this application.

--M

===========================================================================

Martin Weinberg                      Phone: (413) 545-3821
Dept. of Physics and Astronomy       FAX:   (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA  01003-4525

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Bruce Momjian
Дата:
> lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=62349.97 rows=788100 width=4)
>   ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)
>
>
Please try this:

 lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and
 j_m<3.5::float4

Also, given your min/max, I am not sure why it thinks it is going to get
788,100 rows.  How many rows in the table again?

Does (3.5-3.4)/(max-min) * #rows = 788k?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Martin Weinberg
Дата:
Bruce Momjian wrote on Wed, 07 Jul 1999 04:07:00 EDT
>> lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
>> NOTICE:  QUERY PLAN:
>>
>> Aggregate  (cost=62349.97 rows=788100 width=4)
>>   ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)
>>
>>
>Please try this:
>
> lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and
> j_m<3.5::float4
>
>Also, given your min/max, I am not sure why it thinks it is going to get
>788,100 rows.  How many rows in the table again?
>
>Does (3.5-3.4)/(max-min) * #rows = 788k?
>

Thanks, Bruce!

Yes, I tried the latter query and it's the same:

--------------------------------------------------

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5::float4;
NOTICE:  QUERY PLAN:

Aggregate  (cost=62349.97 rows=788100 width=4)
  ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)

EXPLAIN

--------------------------------------------------
I've tried all permutations of the conversions in the ranges with
similar results (and vacuum analyzed several times as well as
dumped and reloaded and reloaded from scracth).  We have
a larger database with 20M rows which has a similar behavior.

There are 7092894 rows in database "lmc".  So:

(3.5-3.4)/(99.999-2.731) = 7292.1 != 788k

A clue?

Thanks again,

--M

===========================================================================

Martin Weinberg                      Phone: (413) 545-3821
Dept. of Physics and Astronomy       FAX:   (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA  01003-4525

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Bruce Momjian
Дата:
> Thanks, Bruce!
>
> Yes, I tried the latter query and it's the same:
>
> --------------------------------------------------
>
> lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5::float4;
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=62349.97 rows=788100 width=4)
>   ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)
>
> EXPLAIN
>
> --------------------------------------------------
> I've tried all permutations of the conversions in the ranges with
> similar results (and vacuum analyzed several times as well as
> dumped and reloaded and reloaded from scracth).  We have
> a larger database with 20M rows which has a similar behavior.
>
> There are 7092894 rows in database "lmc".  So:
>
> (3.5-3.4)/(99.999-2.731) = 7292.1 != 788k
>
> A clue?

I have just fixed a problem with index size estimates.  Try adding
#include <math.h> to the top of backend/optimizer/util/plancat.c.  That
may fix the estimated number of tuples returned.  However, it don't
think you are going to get better performance, since you are already
using the index in the above case.  The only big win I can think of is
to use CLUSTER on that field.  That should speed things up quite a bit.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Martin Weinberg
Дата:
Bruce Momjian wrote on Wed, 07 Jul 1999 12:29:13 EDT
>> Thanks, Bruce!
>>
>> Yes, I tried the latter query and it's the same:
>>
>> --------------------------------------------------
>>
>> lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:
>:float4;
>> NOTICE:  QUERY PLAN:
>>
>> Aggregate  (cost=62349.97 rows=788100 width=4)
>>   ->  Index Scan using j on lmctot  (cost=62349.97 rows=788100 width=4)
>>
>> EXPLAIN
>>
>> --------------------------------------------------
>> I've tried all permutations of the conversions in the ranges with
>> similar results (and vacuum analyzed several times as well as
>> dumped and reloaded and reloaded from scracth).  We have
>> a larger database with 20M rows which has a similar behavior.
>>
>> There are 7092894 rows in database "lmc".  So:
>>
>> (3.5-3.4)/(99.999-2.731) = 7292.1 != 788k
>>
>> A clue?
>
>I have just fixed a problem with index size estimates.  Try adding
>#include <math.h> to the top of backend/optimizer/util/plancat.c.  That
>may fix the estimated number of tuples returned.  However, it don't
>think you are going to get better performance, since you are already
>using the index in the above case.  The only big win I can think of is
>to use CLUSTER on that field.  That should speed things up quite a bit.
>

Hi Bruce,

Ok.  Sorry about the delay.

I added the math.h but that doesn't seem to change the
query plan output.

I then dropped all the indices, made a new one on three of the
variables and clustered:

create index m_col on lmctot using btree (j_m, h_m, k_m);
cluster m_col on lmctot;
vacuum analyze;

where the j_m, h_m, k_m are three float4 fields.

The cluster took about 18 hours on my 7.1 million records
(this is a dual 450Mhz Xeon Linux box).  Not sure why
this was so slow.

Anyway, this *hugely* improved queries of form:

select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:

although the explain query plan output is identical.  However
using h_m or k_m (not the first variable in the index) appears
to be doing a sequential scan.  Is that right?

I then made indices on h_m and k_m, vacuum analyzed and tried
again, but got identical performance.  If this is the way
it is, so be it, but I have the feeling that something is
not working properly.

Any ideas?

Again, with _heaps_ of thanks,

--Martin


===========================================================================

Martin Weinberg                      Phone: (413) 545-3821
Dept. of Physics and Astronomy       FAX:   (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA  01003-4525


Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

От
Bruce Momjian
Дата:
> Hi Bruce,
>
> Ok.  Sorry about the delay.
>
> I added the math.h but that doesn't seem to change the
> query plan output.
>
> I then dropped all the indices, made a new one on three of the
> variables and clustered:
>
> create index m_col on lmctot using btree (j_m, h_m, k_m);
> cluster m_col on lmctot;
> vacuum analyze;
>
> where the j_m, h_m, k_m are three float4 fields.
>
> The cluster took about 18 hours on my 7.1 million records
> (this is a dual 450Mhz Xeon Linux box).  Not sure why
> this was so slow.

That is a long time.

> Anyway, this *hugely* improved queries of form:
>
> select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:
>
> although the explain query plan output is identical.  However
> using h_m or k_m (not the first variable in the index) appears
> to be doing a sequential scan.  Is that right?

Yes, that is right.  The index is only on the one field, and can only
use secondary index variables after the first one is matched.

>
> I then made indices on h_m and k_m, vacuum analyzed and tried
> again, but got identical performance.  If this is the way
> it is, so be it, but I have the feeling that something is
> not working properly.

The big problem is that you can only cluster on one index.

What cluster has done is prevent the system from bouncing all over the
disk getting matching rows.  They are all sequential on the disk.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026