Обсуждение: indexes and floats

Поиск
Список
Период
Сортировка

indexes and floats

От
Vince Vielhaber
Дата:
Is this a bug that the index doesn't work on floats or is it a datatype
mismatch thing?

-----

template1=> create table foo (x float4, y float4);
CREATE
template1=> create index foox on foo(x);
CREATE
template1=> create index fooy on foo(y);
CREATE
template1=> insert into foo values(123.45,678.90);
INSERT 239376 1
template1=> insert into foo values(123.25,678.95);
INSERT 239377 1
template1=> explain select * from foo where x = 123.4;
NOTICE:  QUERY PLAN:

Seq Scan on foo  (cost=0.00 size=0 width=8)

EXPLAIN
template1=>

-----

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null
       # include <std/disclaimers.h>                   TEAM-OS2
   Online Searchable Campground Listings    http://www.camping-usa.com
       "There is no outfit less entitled to lecture me about bloat
               than the federal government"  -- Tony Snow
==========================================================================



Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> Is this a bug that the index doesn't work on floats or is it a
> datatype mismatch thing?
> Seq Scan on foo  (cost=0.00 size=0 width=8)

Likely neither. Notice the cost of a sequential scan; it is zero. Why
bother doing an index scan?

You need at least two more things to happen to provoke Postgres into
using your index:

1) Have more entries in the table; rule of thumb is O(100) before an
index will be used.

2) Run vacuum after getting that many entries to make sure the optimizer
will know about the configuration of the tables and indices.

Good luck.

                   - Tom

Re: [HACKERS] indexes and floats

От
Vince Vielhaber
Дата:
On 03-Aug-98 Thomas G. Lockhart wrote:
>> Is this a bug that the index doesn't work on floats or is it a
>> datatype mismatch thing?
>> Seq Scan on foo  (cost=0.00 size=0 width=8)
>
> Likely neither. Notice the cost of a sequential scan; it is zero. Why
> bother doing an index scan?
>
> You need at least two more things to happen to provoke Postgres into
> using your index:
>
> 1) Have more entries in the table; rule of thumb is O(100) before an
> index will be used.
>
> 2) Run vacuum after getting that many entries to make sure the optimizer
> will know about the configuration of the tables and indices.

Here's why I'm bringing it up.  I ran vacuum on the database yesterday.
The table consists of city, state (both text), lon and lat (both float4).
The text index is working but not the floats.  I have another table with
12,000 - 14,000 rows and I'm getting the same thing.

-----

campsites=> explain select lon from locations where lon = 83.5;
NOTICE:  QUERY PLAN:

Seq Scan on locations  (cost=7263.30 size=2 width=4)

EXPLAIN
campsites=> select count(*) from locations;
 count
------
169797
(1 row)

campsites=>

-----


Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null
       # include <std/disclaimers.h>                   TEAM-OS2
   Online Searchable Campground Listings    http://www.camping-usa.com
       "There is no outfit less entitled to lecture me about bloat
               than the federal government"  -- Tony Snow
==========================================================================



Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> >> Is this a bug that the index doesn't work on floats or is it a
> >> datatype mismatch thing?
> >> Seq Scan on foo  (cost=0.00 size=0 width=8)
> The table consists of city, state (both text), lon and lat (both
> float4). The text index is working but not the floats.  I have another
> table with 12,000 - 14,000 rows and I'm getting the same thing.

Hmm. You are right. I'm glad you insisted on pursuing this, but I might
have been able to see the problem right away if you had described it in
detail (e.g. "floats" is not clearly float4) and if your initial example
had actually been with float4 data (it showed a width of 8 which is
likely to be a float8) :/

Anyway, the problem seems to be specific to float4. The workaround is to
use float8 for now, and we'll look into fixing the problem for the next
release...

                     - Tom

Test cases from the regression database follow:

regression=> create table test (f4 float4);
CREATE
regression=> insert into test select seqno from bt_f8_heap;
INSERT 0 10000
regression=> create index ti on test using btree (f4);
CREATE
regression=> vacuum;
VACUUM
regression=> explain select f4 from test where f4 = 500.0;
NOTICE:  QUERY PLAN:

Seq Scan on test  (cost=394.00 size=1 width=4)

EXPLAIN

regression=> create table test8 as select seqno from bt_f8_heap;
SELECT
regression=> create index t8 on test8 using btree (seqno);
CREATE
regression=> vacuum;
VACUUM
regression=> explain select seqno from test8 where seqno = 500.0;
NOTICE:  QUERY PLAN:

Index Scan using t8 on test8  (cost=2.05 size=1 width=8)

EXPLAIN

Re: [HACKERS] indexes and floats

От
Tom Lane
Дата:
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
>> The table consists of city, state (both text), lon and lat (both
>> float4). The text index is working but not the floats.  I have another
>> table with 12,000 - 14,000 rows and I'm getting the same thing.

> Anyway, the problem seems to be specific to float4. The workaround is to
> use float8 for now, and we'll look into fixing the problem for the next
> release...

Ah-hah, all of a sudden this looks *real* familiar.  I bet it's because
pgsql is not noticing that "500.0" can be interpreted as a float4.
Let's try it.

play=> create table f8 (x float8);
CREATE
play=> \copy f8 from countup;            // 0..999 in a text file
Successfully copied.
play=> create index f8_i on f8 using btree (x);
CREATE
play=> create table f4 (x float4);
CREATE
play=> \copy f4 from countup;
Successfully copied.
play=> create index f4_i on f4 using btree (x);
CREATE
play=> vacuum verbose analyze;
.... blah blah
play=> explain select x from f8 where x = 500;
NOTICE:  QUERY PLAN:

Seq Scan on f8  (cost=40.00 size=100 width=8)

EXPLAIN
play=> explain select x from f8 where x = 500.0;
NOTICE:  QUERY PLAN:

Index Scan using f8_i on f8  (cost=2.05 size=1 width=8)

EXPLAIN
play=> explain select x from f4 where x = 500;
NOTICE:  QUERY PLAN:

Seq Scan on f4  (cost=40.00 size=100 width=4)

EXPLAIN
play=> explain select x from f4 where x = 500.0;
NOTICE:  QUERY PLAN:

Seq Scan on f4  (cost=40.00 size=1 width=4)

EXPLAIN
play=> explain select x from f4 where x = 500 :: float4;
NOTICE:  QUERY PLAN:

Index Scan using f4_i on f4  (cost=2.05 size=1 width=4)

EXPLAIN
play=> explain select x from f4 where x = 500.0 :: float4;
ERROR:  parser_typecast: cannot cast this expression to type 'float4'
play=>

(This is with cvs sources of a few days ago.)

I see two different bugs here:

1. select ... where x = constant; is optimized to an index scan
only if the constant is of the exact type of the field x.
Apparently, the coercion of the constant to match the type of x
happens only after the optimizer decides it doesn't know what to do.
The coercion ought to happen *before* the optimizer runs.
(I have no idea whether this is a new bug caused by the recent
type-system changes, or whether it existed in 6.3.2 and earlier.)

2. Type coercion fails for "500.0 :: float4" (ditto for "500.0 :: float8"
btw).  Presumably this is a simple localized bug in the parser or the type
coercion logic for floats.

I had previously complained of bug #1 in connection with OIDs;
with the present sources, "where oid = 123456" will not use
an index on OID, while "where oid = 123456::oid" will.

I will bet lunch (at the nearest McD's, I'm not rich ;-)) that
Vince Vielhaber's recent gripe about
    select city from locations where lower(city) = lower('st. ignace');
failing to use an index
    create index locations_city on locations(lower(city) text_ops);
is an artifact of the same sort of type-mismatch problem.

            regards, tom lane

Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> Ah-hah, all of a sudden this looks *real* familiar.  I bet it's
> because pgsql is not noticing that "500.0" can be interpreted as a
> float4. Let's try it.

Oh, you have nailed it! This is interesting because (probably) a query
like

  select f4 from t4 where f4 = 500.0;

is being automatically "upgraded" in the parser backend to

  select f4 from t4 where float8(f4) = 500.0;

So, since there is no functional index float8(f4) on the table we cannot
use an existing index on f4 to advantage.

This may be a result of my recent enhancements to the automatic type
coersion code. But I'm a little suprised that v6.3.x doesn't just
complain about a type mismatch but instead actually works. It may be
that the old code which converted constants using intermediate strings
worked (sort of) for this case. In general, the pre-enhancement code
_only_ tried to convert constants, and complained about type mismatches
when non-constants were involved.

So far in the examples, imho this (new?) feature isn't a _fatal_
problem, because you want to handle the following query correctly (I'll
switch to an int column to make it clearer):

  select i4 from t4 where i4 < 500.1;

Now, if we do the "optimizable thing" blindly, then we would transform
this to

  select i4 from t4 where i4 < int4(500.1);

But of course this would produce the wrong result if the table contains
a value of 500. Perhaps something a bit different could be implemented,
but it probably wouldn't generalize very well with the extensible type
system.

So, is there a problem to fix, or just documentation to write? I've
already written a new "Type Conversion" chapter for the User's Guide,
but haven't mentioned this specific issue.

                      - Tom

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> 1. select ... where x = constant; is optimized to an index scan
> only if the constant is of the exact type of the field x.
> Apparently, the coercion of the constant to match the type of x
> happens only after the optimizer decides it doesn't know what to do.
> The coercion ought to happen *before* the optimizer runs.
> (I have no idea whether this is a new bug caused by the recent
> type-system changes, or whether it existed in 6.3.2 and earlier.)

I think the current sources Thomas has modified work better, perhaps?

>
> 2. Type coercion fails for "500.0 :: float4" (ditto for "500.0 :: float8"
> btw).  Presumably this is a simple localized bug in the parser or the type
> coercion logic for floats.
>
> I had previously complained of bug #1 in connection with OIDs;
> with the present sources, "where oid = 123456" will not use
> an index on OID, while "where oid = 123456::oid" will.
>
> I will bet lunch (at the nearest McD's, I'm not rich ;-)) that
> Vince Vielhaber's recent gripe about
>     select city from locations where lower(city) = lower('st. ignace');
> failing to use an index
>     create index locations_city on locations(lower(city) text_ops);
> is an artifact of the same sort of type-mismatch problem.
>
>             regards, tom lane
>
>


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
Vadim Mikheev
Дата:
Tom Lane wrote:
>
> I will bet lunch (at the nearest McD's, I'm not rich ;-)) that
> Vince Vielhaber's recent gripe about
>         select city from locations where lower(city) = lower('st. ignace');
> failing to use an index
>         create index locations_city on locations(lower(city) text_ops);
> is an artifact of the same sort of type-mismatch problem.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No. This is the result of using lower('st. ignace') - function:
optimizer considers clause as usable for index only for
constants and parameters!
We discussed this ~ month ago. lower('st. ignace') could be
replaced by parameter of PARAM_EXEC type (implemented
for subqueries) to be 1. considered by optimizer as index key
value, 2. evaluated _ONCE_ by executor.
As I mentioned before, lower('st. ignace') will be evaluated
FOR EACH tuple in SeqScan!!!

PARAM_EXEC was implemented to handle queries like this:

select * from A where A.x = (select max(B.y) from B)

- subquery will be executed once and index on A (x) will be
used (if exists).

Optimizer just rewrites this query as
select * from A where A.x = _parameter_
and stores information about _parameter_ in InitPlan of
execution plan.

Look:

vac=> explain select * from test where x = lower('a');
NOTICE:  QUERY PLAN:

Seq Scan on test  (cost=40.00 size=100 width=12)
^^^^^^^^
EXPLAIN
vac=> explain select * from test where x = (select lower('a'));
NOTICE:  QUERY PLAN:

Index Scan using itest2 on test  (cost=2.05 size=1 width=12)
^^^^^^^^^^
  InitPlan
    ->  Result  (cost=0.00 size=0 width=0)

Nevertheless,

vac=> explain select * from test where lower(x) = (select lower('a'));
NOTICE:  IndexSelectivity: no key -1 in index 20305
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NOTICE:  QUERY PLAN:

Seq Scan on test  (cost=40.00 size=100 width=12)
^^^^^^^^
  InitPlan
    ->  Result  (cost=0.00 size=0 width=0)

- something is broken for functional indices...

Vadim

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> > Ah-hah, all of a sudden this looks *real* familiar.  I bet it's
> > because pgsql is not noticing that "500.0" can be interpreted as a
> > float4. Let's try it.
>
> Oh, you have nailed it! This is interesting because (probably) a query
> like
>
>   select f4 from t4 where f4 = 500.0;
>
> is being automatically "upgraded" in the parser backend to
>
>   select f4 from t4 where float8(f4) = 500.0;
>
> So, since there is no functional index float8(f4) on the table we cannot
> use an existing index on f4 to advantage.
>
> This may be a result of my recent enhancements to the automatic type
> coersion code. But I'm a little suprised that v6.3.x doesn't just
> complain about a type mismatch but instead actually works. It may be
> that the old code which converted constants using intermediate strings
> worked (sort of) for this case. In general, the pre-enhancement code
> _only_ tried to convert constants, and complained about type mismatches
> when non-constants were involved.

Yes, it did numeric constants, I think.



--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
Tom Lane
Дата:
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
> Oh, you have nailed it! This is interesting because (probably) a query
> like
>   select f4 from t4 where f4 = 500.0;
> is being automatically "upgraded" in the parser backend to
>   select f4 from t4 where float8(f4) = 500.0;
> So, since there is no functional index float8(f4) on the table we cannot
> use an existing index on f4 to advantage.

OK, that sounds plausible.  But in my examples,

    play=> explain select x from f8 where x = 500;
    NOTICE:  QUERY PLAN:
    Seq Scan on f8  (cost=40.00 size=100 width=8)

Your explanation implies that here, the parser is converting to
     select x from f8 where int4(x) = 500;
which is wrong for the same accuracy-loss reasons you cite later.
(And if that isn't what it's doing, what then?)

I think it would be a good idea if someone actually dug into this
and verified what's going on.  I have found some other cases that
lead me to think there's more to this than we understand just yet.
With an index on an int4 field, I get

tree=> explain select * from marketorderhistory where sequenceNo = 140000;
NOTICE:  QUERY PLAN:
Index Scan using marketorderhistory_sequenceno_i on marketorderhistory
  (cost=2.05 size=1 width=100)

tree=> explain select * from marketorderhistory where sequenceNo > 140000;
NOTICE:  QUERY PLAN:
Seq Scan on marketorderhistory  (cost=63.38 size=449 width=100)

which doesn't look like it could be explained by parser type coercions.
Perhaps this one just indicates an omission from the list of
type-specific routines that can be used for index comparisons?  If so,
maybe there are other omissions affecting the results for other types.

> ... you want to handle the following query correctly (I'll
> switch to an int column to make it clearer):
>   select i4 from t4 where i4 < 500.1;
> Now, if we do the "optimizable thing" blindly, then we would transform
> this to
>   select i4 from t4 where i4 < int4(500.1);
> But of course this would produce the wrong result if the table contains
> a value of 500. Perhaps something a bit different could be implemented,
> but it probably wouldn't generalize very well with the extensible type
> system.

That's a good point.  Still, it would be nice if the system had some
reasonable amount of smarts about the "primitive" types that the parser
has constant syntax for.  In particular I think an automatic coercion of
an int constant to float where needed would be a reasonable thing to
expect.  That's not happening now, see my example above.

> So, is there a problem to fix, or just documentation to write?

This one is most certainly a bug:

play=> select x from f4 where x = 500.0 :: float4;
ERROR:  parser_typecast: cannot cast this expression to type 'float4'

Beyond that, if I can force the right thing to happen by casting
the constant to the type of the field, then I can live with it.
I have seen a number of cases where the system wouldn't use an index
even with a cast, however, so I'm not a happy camper yet.

            regards, tom lane

Re: [HACKERS] indexes and floats

От
Vince Vielhaber
Дата:
On 04-Aug-98 Tom Lane wrote:

> OK, that sounds plausible.  But in my examples,
>
>       play=> explain select x from f8 where x = 500;
>       NOTICE:  QUERY PLAN:
>       Seq Scan on f8  (cost=40.00 size=100 width=8)
>
> Your explanation implies that here, the parser is converting to
>        select x from f8 where int4(x) = 500;
> which is wrong for the same accuracy-loss reasons you cite later.
> (And if that isn't what it's doing, what then?)
>
> I think it would be a good idea if someone actually dug into this
> and verified what's going on.  I have found some other cases that
> lead me to think there's more to this than we understand just yet.
> With an index on an int4 field, I get

What file(s) are these decisions actually made in?

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null
       # include <std/disclaimers.h>                   TEAM-OS2
   Online Searchable Campground Listings    http://www.camping-usa.com
       "There is no outfit less entitled to lecture me about bloat
               than the federal government"  -- Tony Snow
==========================================================================



Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> Your explanation implies that here, the parser is converting to
>          select x from f8 where int4(x) = 500;
> which is wrong for the same accuracy-loss reasons you cite later.
> (And if that isn't what it's doing, what then?)

No, that isn't what it would do. The parser does know about a heirarchy
of built-in data types, and would not downgrade a float8 to an int4.
I've sent some e-mail about the new parser features, and have written
them up in doc/src/sgml/typeconv.sgml (but haven't generated a new html
version yet).

> I think it would be a good idea if someone actually dug into this
> and verified what's going on.  I have found some other cases that
> lead me to think there's more to this than we understand just yet.

I'm pretty sure that there is another thing happening, which is getting
in the way of using indices with your example using

  where lowercase(col) = lowercase('const')

> That's a good point.  Still, it would be nice if the system had some
> reasonable amount of smarts about the "primitive" types that the
> parser has constant syntax for.  In particular I think an automatic
> coercion of an int constant to float where needed would be a
> reasonable thing to expect.  That's not happening now, see my example
> above.

I agree with your points, but it already does exactly what you would
want. I don't see an example above illustrating this problem.

> > So, is there a problem to fix, or just documentation to write?
>
> This one is most certainly a bug:
>
> play=> select x from f4 where x = 500.0 :: float4;
> ERROR:  parser_typecast: cannot cast this expression to type 'float4'

Might be easy. I'll look at it... Oh, the current workaround is to
specify it as

  where x = '500.0'::float4;

or

  where x = float4(500.0);

> Beyond that, if I can force the right thing to happen by casting
> the constant to the type of the field, then I can live with it.
> I have seen a number of cases where the system wouldn't use an index
> even with a cast, however, so I'm not a happy camper yet.

Yeah, that was Vadim's example and he sez we need to add a bit of code
to get things working the way you'd expect.

                    - Tom

Re: [HACKERS] indexes and floats

От
Tom Lane
Дата:
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
> No, that isn't what it would do. The parser does know about a heirarchy
> of built-in data types, and would not downgrade a float8 to an int4.

OK, that's good.  But if it knows that much, I'd expect it to be able
to *upgrade* an int4 to a float8 where appropriate.  That's not
happening, as in my prior example:

>> play=> explain select x from f8 where x = 500.0;
>> NOTICE:  QUERY PLAN:
>> Index Scan using f8_i on f8  (cost=2.05 size=1 width=8)
>>
>> play=> explain select x from f8 where x = 500;
>> NOTICE:  QUERY PLAN:
>> Seq Scan on f8  (cost=40.00 size=100 width=8)

The thing that I find peculiar about all this is that the system doesn't
reject the query completely, as you'd expect it to if there were really
a type mismatch problem.  It takes it and just does it a lot slower than
it should.  It seems the type coercion will happen, but too late for the
optimizer to notice.

            regards, tom lane

Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> > The parser does know about a heirarchy
> > of built-in data types, and would not downgrade a float8 to an int4.
> OK, that's good.  But if it knows that much, I'd expect it to be able
> to *upgrade* an int4 to a float8 where appropriate.  That's not
> happening

Ah, but it _is_ happening. The problem lies later. Read on...

> , as in my prior example:
> >> play=> explain select x from f8 where x = 500;
> >> NOTICE:  QUERY PLAN:
> >> Seq Scan on f8  (cost=40.00 size=100 width=8)
> It seems the type coercion will happen, but too late for the
> optimizer to notice.

This is the specific case discussed by Vadim, and is, I believe,
directly related to the

  select city from locations where lower(city) = lower('st. ignace');

example given by Vince.

The parser is converting this query to become

  select x from f8 where x = float8(500);

The problem appears to be that the optimizer/executor does not know how
to evaluate the constant string once-only, and insists on doing a
sequential scan for some reason. Not really in the parser's hands.
Remember, the types are all set and the functions are all defined by the
time the actual optimizer/executor sees anything. There may be a slight
bit of "optimization" which happens in the parser, but only to shoehorn
SQL92 into the Postgres backend. Oh, there are some heuristics in the
parser to fudge the distinctions between char/varchar/text types to
allow reuse of some of the support code, but that is a Bad Thing in
principle.

The downside to putting more heuristics into the parser (as opposed to
upgrading the optimizer/executor to handle the case) is that it embeds
more assumptions about the _behavior_ of, for example, int4->float8
conversions and would reduce the flexibility of the extensible type
system.

istm that we should be focusing on Vadim's hints on what it would take
to use indices with function calls on constants...

                         - Tom

Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> istm that we should be focusing on Vadim's hints on what it would take
> to use indices with function calls on constants...

Looking at Vadim's note again, maybe it will be the parser's duty to
insert the PARAM_EXEC node; will need more details or some time to look
at it...

                   - Tom

Re: [HACKERS] indexes and floats

От
Tom Lane
Дата:
"Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
> The parser is converting this query to become
>   select x from f8 where x = float8(500);

> The problem appears to be that the optimizer/executor does not know how
> to evaluate the constant string once-only, and insists on doing a
> sequential scan for some reason.

Ah, it's finally starting to make some sense to me.  What you're saying
is that this is a failure to do constant-folding.

Doing a sequential scan would be appropriate if the righthand side of
the comparison needed to be evaluated afresh at each row.  If the
optimizer thinks that, then that explains a lot.

The question then is why the righthand side doesn't look like a
constant.  I'd have expected that any expression not involving a table
attribute would be evaluated once (folded into a constant) before any
decisions are made on how to perform the scan.  Is that reasonable, or
is there some aspect of SQL semantics that makes it wrong?

If it is supposed to be happening, could it be that float8() is for
some reason not marked as a safely foldable function?

While I'm asking dumb questions: are "float8(500)" and "500::float8"
treated differently?  Actually, I can see that they are:

    play=> explain select x from f8 where x = 500::float8;
    NOTICE:  QUERY PLAN:
    Index Scan using f8_i on f8  (cost=2.05 size=1 width=8)

    play=> explain select x from f8 where x = float8(500);
    NOTICE:  QUERY PLAN:
    Seq Scan on f8  (cost=40.00 size=100 width=8)

But why?  Is there a difference in semantics?

            regards, tom lane

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> The problem appears to be that the optimizer/executor does not know how
> to evaluate the constant string once-only, and insists on doing a
> sequential scan for some reason. Not really in the parser's hands.
> Remember, the types are all set and the functions are all defined by the
> time the actual optimizer/executor sees anything. There may be a slight
> bit of "optimization" which happens in the parser, but only to shoehorn
> SQL92 into the Postgres backend. Oh, there are some heuristics in the
> parser to fudge the distinctions between char/varchar/text types to
> allow reuse of some of the support code, but that is a Bad Thing in
> principle.
>
> The downside to putting more heuristics into the parser (as opposed to
> upgrading the optimizer/executor to handle the case) is that it embeds
> more assumptions about the _behavior_ of, for example, int4->float8
> conversions and would reduce the flexibility of the extensible type
> system.
>
> istm that we should be focusing on Vadim's hints on what it would take
> to use indices with function calls on constants...

I understand the problem now.  There is code in the optimizer to handle
functions on constants, but it does not appear to be in a place where
index matching can use it.

I will be looking over the code in the next few weeks, but not sure if I
can figure it all out by then.  I have added it to my personal TODO
list.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> "Thomas G. Lockhart" <lockhart@alumni.caltech.edu> writes:
> > The parser is converting this query to become
> >   select x from f8 where x = float8(500);
>
> > The problem appears to be that the optimizer/executor does not know how
> > to evaluate the constant string once-only, and insists on doing a
> > sequential scan for some reason.
>
> Ah, it's finally starting to make some sense to me.  What you're saying
> is that this is a failure to do constant-folding.


Yep.  I believe it happens in the executor, but doesn't appear to happen
in the optimizer at a time when it would be useful.  You can create
functional indexes, and I think they are matched by the function, just
not constants.

>
> Doing a sequential scan would be appropriate if the righthand side of
> the comparison needed to be evaluated afresh at each row.  If the
> optimizer thinks that, then that explains a lot.


Yep.  I think that is the issue, and index matching does not
pre-evaluate a function on a constant.

>
> The question then is why the righthand side doesn't look like a
> constant.  I'd have expected that any expression not involving a table
> attribute would be evaluated once (folded into a constant) before any
> decisions are made on how to perform the scan.  Is that reasonable, or
> is there some aspect of SQL semantics that makes it wrong?
>
> If it is supposed to be happening, could it be that float8() is for
> some reason not marked as a safely foldable function?
>
> While I'm asking dumb questions: are "float8(500)" and "500::float8"
> treated differently?  Actually, I can see that they are:
>
>     play=> explain select x from f8 where x = 500::float8;
>     NOTICE:  QUERY PLAN:
>     Index Scan using f8_i on f8  (cost=2.05 size=1 width=8)
>
>     play=> explain select x from f8 where x = float8(500);
>     NOTICE:  QUERY PLAN:
>     Seq Scan on f8  (cost=40.00 size=100 width=8)
>
> But why?  Is there a difference in semantics?

Sure.  In the :: case (or CAST (const AS type)), the parser actually
converts the type INSIDE the parser to the proper type.  In the float8()
case, the value conversion is delayed until the executor.

I may be wrong in some of this, but that is what I think is happening.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
Vadim Mikheev
Дата:
Thomas G. Lockhart wrote:
>
> > istm that we should be focusing on Vadim's hints on what it would take
> > to use indices with function calls on constants...
>
> Looking at Vadim's note again, maybe it will be the parser's duty to
> insert the PARAM_EXEC node; will need more details or some time to look
> at it...

Sorry, but maybe it would be better to add new attribute
to pg_proc (and change CREATE FUNCTION syntax) to let
parser know does result of function call on constants depend
on execution time or not. This would be much better just
execute function in parser and replace function with
constant.
Currently, only random(), now() and SPI-functions should be
replaced by PARAM_EXEC, all others could be evaluated by parser.

Vadim

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> Thomas G. Lockhart wrote:
> >
> > > istm that we should be focusing on Vadim's hints on what it would take
> > > to use indices with function calls on constants...
> >
> > Looking at Vadim's note again, maybe it will be the parser's duty to
> > insert the PARAM_EXEC node; will need more details or some time to look
> > at it...
>
> Sorry, but maybe it would be better to add new attribute
> to pg_proc (and change CREATE FUNCTION syntax) to let
> parser know does result of function call on constants depend
> on execution time or not. This would be much better just
> execute function in parser and replace function with
> constant.
> Currently, only random(), now() and SPI-functions should be
> replaced by PARAM_EXEC, all others could be evaluated by parser.

I see.  We have pg_proc.proiscachable.  This looks like a good
candidate, and is not used for anything currently.  Maybe rename it to
something clearer.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
Brook Milligan
Дата:
   > Sorry, but maybe it would be better to add new attribute
   > to pg_proc (and change CREATE FUNCTION syntax) to let
   > parser know does result of function call on constants depend
   > on execution time or not. This would be much better just
   > execute function in parser and replace function with
   > constant.
   > Currently, only random(), now() and SPI-functions should be
   > replaced by PARAM_EXEC, all others could be evaluated by parser.

   I see.  We have pg_proc.proiscachable.  This looks like a good
   candidate, and is not used for anything currently.  Maybe rename it to
   something clearer.

Wouldn't the concept of VOLATILE be relevant here (ala the same
keyword in C and C++)?  In those languages the keyword influences the
nature of optimizations that can be done, which seems to be the
problems involved here.  We could have a function bool
pg_proc.is_volatile() to determine whether or not the result of the
function call can be determined from its arguments.  If so and the
arguments are constants, the function could be evaluated once in the
parser; otherwise, it would need to be evaluated for each tuple.

The current situation, I presume, corresponds with all functions being
VOLATILE and hence evaluated many times.  We could 1) add new syntax
for declaring user-defined functions to optionally include VOLATILE
(and/or NONVOLATILE; default being VOLATILE in the absense of a
keyword to maintain compatibility and to be safe), 2) have the
is_volatile() function return appropriate values for all "standard"
functions defined in the system, and 3) include handling of the
distinction in the parser.

Cheers,
Brook

Re: [HACKERS] indexes and floats

От
dg@informix.com (David Gould)
Дата:
>
>    > Sorry, but maybe it would be better to add new attribute
>    > to pg_proc (and change CREATE FUNCTION syntax) to let
>    > parser know does result of function call on constants depend
>    > on execution time or not. This would be much better just
>    > execute function in parser and replace function with
>    > constant.
>    > Currently, only random(), now() and SPI-functions should be
>    > replaced by PARAM_EXEC, all others could be evaluated by parser.
>
>    I see.  We have pg_proc.proiscachable.  This looks like a good
>    candidate, and is not used for anything currently.  Maybe rename it to
>    something clearer.
>
> Wouldn't the concept of VOLATILE be relevant here (ala the same
> keyword in C and C++)?  In those languages the keyword influences the
> nature of optimizations that can be done, which seems to be the
> problems involved here.  We could have a function bool
> pg_proc.is_volatile() to determine whether or not the result of the
> function call can be determined from its arguments.  If so and the
> arguments are constants, the function could be evaluated once in the
> parser; otherwise, it would need to be evaluated for each tuple.

In Illustra this is called "variant". A variant function (eg random()) must
be evaluated each time. A nonvariant function can have it's result cached
not just in the parser, and not just for constants, but also for any run of
identical argument values such as might occur in a table scan.

For example assuming projected_price_calc() is an expensive but nonvariant
function and the rows returned from the join are ordered by stock symbol
the following query wins big time if the result of projected_price_calc()
can be cached from one execution to the next.

select symbol, price from stocks, trades
 where trades.symbol = stocks.symbol
   and projected_price_calc(stocks.symbol) < price;

That is, don't think of this as a special case in the parser for doing
constant folding, think of it as a general way of "memoizing" or cacheing
function results. Hence the current name "iscacheble" is actually pretty
good.

Also, perhaps instead of doing constant folding in the parser, consider makeing
it part of rewrite. This pass would just traverse the tree looking for
functions with constant arguements and would pre-evaluate them and and
save the result in the wherever the cacheable results would be stored. No
special case required except that the optimizer would have to notice that
a pre-computed result was available to use as a key value.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - If simplicity worked, the world would be overrun with insects. -

Re: [HACKERS] indexes and floats

От
Vadim Mikheev
Дата:
David Gould wrote:
>
> >
> > Wouldn't the concept of VOLATILE be relevant here (ala the same
> > keyword in C and C++)?  In those languages the keyword influences the
> > nature of optimizations that can be done, which seems to be the
> > problems involved here.  We could have a function bool
> > pg_proc.is_volatile() to determine whether or not the result of the
> > function call can be determined from its arguments.  If so and the
> > arguments are constants, the function could be evaluated once in the
> > parser; otherwise, it would need to be evaluated for each tuple.
                                                     ^^^^^^^^^^^^^^
> In Illustra this is called "variant". A variant function (eg random()) must
> be evaluated each time. A nonvariant function can have it's result cached
               ^^^^^^^^^
I don't like the idea to evaluate random() in WHERE x = random()
for _each_ tuple. The same for WHERE insert_date = now() - 5
/* 5 days */. Imho, these functions should be evaluated ONCE
by executor, but EACH time when executor starts. There is big
difference between evaluation in parser/planner and in executor
due to ability (currently, very limitted) to have prepared plans
(parsed/planned). And nothing more. Imho, queries must return
consistent set of data: if I want to get rows inserted 5 days
ago and run query with WHERE insert_date = now() - 5 near 12PM
then I risk to get inconsistent set of rows if now() will be
evaluated for EACH tuple.

> Also, perhaps instead of doing constant folding in the parser, consider makeing
> it part of rewrite. This pass would just traverse the tree looking for
> functions with constant arguements and would pre-evaluate them and and
> save the result in the wherever the cacheable results would be stored. No
> special case required except that the optimizer would have to notice that
> a pre-computed result was available to use as a key value.

This is bad for performance.

Vadim

Re: [HACKERS] indexes and floats

От
dg@informix.com (David Gould)
Дата:
Vadim writes:
> David Gould wrote:
...                                                      ^^^^^^^^^^^^^^
> > In Illustra this is called "variant". A variant function (eg random()) must
> > be evaluated each time. A nonvariant function can have it's result cached
>                ^^^^^^^^^
> I don't like the idea to evaluate random() in WHERE x = random()
> for _each_ tuple. The same for WHERE insert_date = now() - 5
> /* 5 days */. Imho, these functions should be evaluated ONCE
> by executor, but EACH time when executor starts. There is big
> difference between evaluation in parser/planner and in executor
> due to ability (currently, very limitted) to have prepared plans
> (parsed/planned). And nothing more. Imho, queries must return
> consistent set of data: if I want to get rows inserted 5 days
> ago and run query with WHERE insert_date = now() - 5 near 12PM
> then I risk to get inconsistent set of rows if now() will be
> evaluated for EACH tuple.

Perhaps random was a bad example, the point I was trying to make was that
it is a function that is not wholely determined by its arguments.

However, I disagree with your contention about random() and now() also. Think
about the statement

insert into samples
  select random(), xval, yval from points;

The intent being to associate a random value with each of the x,y pairs...


> > Also, perhaps instead of doing constant folding in the parser, consider makeing
> > it part of rewrite. This pass would just traverse the tree looking for
> > functions with constant arguements and would pre-evaluate them and and
> > save the result in the wherever the cacheable results would be stored. No
> > special case required except that the optimizer would have to notice that
> > a pre-computed result was available to use as a key value.
>
> This is bad for performance.

What_ is the "This" that is bad for performance? It is not clear what you
mean here.

Do you mean memoizing? If so, I strongly disagree, and there is a ton of
work on this that supports me.

Or do you mean adding the cacheable function optimization to rewrite? If so,
why is this a problem? I am assuming that this can be piggybacked on one
of the existing expression tree traversals, so I just don't see how
this creates more work than doing it in the parser.

> Vadim
>

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - If simplicity worked, the world would be overrun with insects. -

Re: [HACKERS] indexes and floats

От
Vadim Mikheev
Дата:
David Gould wrote:
>
> >                ^^^^^^^^^
> > I don't like the idea to evaluate random() in WHERE x = random()
> > for _each_ tuple. The same for WHERE insert_date = now() - 5
> > /* 5 days */. Imho, these functions should be evaluated ONCE
> > by executor, but EACH time when executor starts. There is big
> > difference between evaluation in parser/planner and in executor
> > due to ability (currently, very limitted) to have prepared plans
> > (parsed/planned). And nothing more. Imho, queries must return
> > consistent set of data: if I want to get rows inserted 5 days
> > ago and run query with WHERE insert_date = now() - 5 near 12PM
> > then I risk to get inconsistent set of rows if now() will be
> > evaluated for EACH tuple.
>
> Perhaps random was a bad example, the point I was trying to make was that
> it is a function that is not wholely determined by its arguments.
>
> However, I disagree with your contention about random() and now() also.
> Think about the statement
>
> insert into samples
>   select random(), xval, yval from points;
>
> The intent being to associate a random value with each of the x,y pairs...

Imho, we have to treat random() etc differently in target list and
qual! WHERE defines set of rows to be returned by query, functions
in target list define representation of this set.

>
> > > Also, perhaps instead of doing constant folding in the parser, consider makeing
> > > it part of rewrite. This pass would just traverse the tree looking for
> > > functions with constant arguements and would pre-evaluate them and and
> > > save the result in the wherever the cacheable results would be stored. No
> > > special case required except that the optimizer would have to notice that
> > > a pre-computed result was available to use as a key value.
> >
> > This is bad for performance.
>
> What_ is the "This" that is bad for performance? It is not clear what you
> mean here.
>
> Do you mean memoizing? If so, I strongly disagree, and there is a ton of
> work on this that supports me.
>
> Or do you mean adding the cacheable function optimization to rewrite? If so,
                                                            ^^^^^^^^^^
This. Why should we traverse the tree once more time?
What the problem to let parser handle these functions?

> why is this a problem? I am assuming that this can be piggybacked on one
> of the existing expression tree traversals, so I just don't see how
> this creates more work than doing it in the parser.

Vadim

Re: [HACKERS] indexes and floats

От
Bruce Momjian
Дата:
> David Gould wrote:
> >
> > >                ^^^^^^^^^
> > > I don't like the idea to evaluate random() in WHERE x = random()
> > > for _each_ tuple. The same for WHERE insert_date = now() - 5
> > > /* 5 days */. Imho, these functions should be evaluated ONCE
> > > by executor, but EACH time when executor starts. There is big
> > > difference between evaluation in parser/planner and in executor
> > > due to ability (currently, very limitted) to have prepared plans
> > > (parsed/planned). And nothing more. Imho, queries must return
> > > consistent set of data: if I want to get rows inserted 5 days
> > > ago and run query with WHERE insert_date = now() - 5 near 12PM
> > > then I risk to get inconsistent set of rows if now() will be
> > > evaluated for EACH tuple.
> >
> > Perhaps random was a bad example, the point I was trying to make was that
> > it is a function that is not wholely determined by its arguments.
> >
> > However, I disagree with your contention about random() and now() also.
> > Think about the statement
> >
> > insert into samples
> >   select random(), xval, yval from points;
> >
> > The intent being to associate a random value with each of the x,y pairs...
>
> Imho, we have to treat random() etc differently in target list and
> qual! WHERE defines set of rows to be returned by query, functions
> in target list define representation of this set.
>
> >
> > > > Also, perhaps instead of doing constant folding in the parser, consider makeing
> > > > it part of rewrite. This pass would just traverse the tree looking for
> > > > functions with constant arguements and would pre-evaluate them and and
> > > > save the result in the wherever the cacheable results would be stored. No
> > > > special case required except that the optimizer would have to notice that
> > > > a pre-computed result was available to use as a key value.
> > >
> > > This is bad for performance.
> >
> > What_ is the "This" that is bad for performance? It is not clear what you
> > mean here.
> >
> > Do you mean memoizing? If so, I strongly disagree, and there is a ton of
> > work on this that supports me.
> >
> > Or do you mean adding the cacheable function optimization to rewrite? If so,
>                                                             ^^^^^^^^^^
> This. Why should we traverse the tree once more time?
> What the problem to let parser handle these functions?
>
> > why is this a problem? I am assuming that this can be piggybacked on one
> > of the existing expression tree traversals, so I just don't see how
> > this creates more work than doing it in the parser.

Yep.  Rewrite already is confusing.  Parser has this type of handling in
there already.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] indexes and floats

От
"Thomas G. Lockhart"
Дата:
> Imho, queries must return
> consistent set of data: if I want to get rows inserted 5 days
> ago and run query with WHERE insert_date = now() - 5 near 12PM
> then I risk to get inconsistent set of rows if now() will be
> evaluated for EACH tuple.

This is perhaps a bad example, because now() (and 'now') return the
transaction time, guaranteed to be the same for each row evaluation of a
query, and the same for every query within a transaction.

What should be behavior of

  update x set i = random();

be? I would have assumed that random() is evaluated once; are there any
truely volatile functions in SQL92?

> > Also, perhaps instead of doing constant folding in the parser,
> > consider making it part of rewrite. This pass would just traverse
> > the tree looking for functions with constant arguements and would
> > pre-evaluate them and and save the result in the wherever the
> > cacheable results would be stored. No special case required except
> > that the optimizer would have to notice that a pre-computed result
> > was available to use as a key value.
> This is bad for performance.

What makes this bad for performance? An extra traversal of the parse
tree? Or something else??

                      - Tom

Re: [HACKERS] indexes and floats

От
Vadim Mikheev
Дата:
Thomas G. Lockhart wrote:
>
> > Imho, queries must return
> > consistent set of data: if I want to get rows inserted 5 days
> > ago and run query with WHERE insert_date = now() - 5 near 12PM
> > then I risk to get inconsistent set of rows if now() will be
> > evaluated for EACH tuple.
>
> This is perhaps a bad example, because now() (and 'now') return the
> transaction time, guaranteed to be the same for each row evaluation of a
> query, and the same for every query within a transaction.

yes, this is bad example -:)

>
> What should be behavior of
>
>   update x set i = random();
>
> be? I would have assumed that random() is evaluated once; are there any

I don't know. I mostly concern about WHERE just because of this
defines data set to be returned by query.

> truely volatile functions in SQL92?

This is what I also would like to know.

>
> > > Also, perhaps instead of doing constant folding in the parser,
> > > consider making it part of rewrite. This pass would just traverse
> > > the tree looking for functions with constant arguements and would
> > > pre-evaluate them and and save the result in the wherever the
> > > cacheable results would be stored. No special case required except
> > > that the optimizer would have to notice that a pre-computed result
> > > was available to use as a key value.
> > This is bad for performance.
>
> What makes this bad for performance? An extra traversal of the parse
                                          ^^^^^^^^^^^^^^^
This. I don't see any reasons for this.

> tree? Or something else??

Vadim

Re: [HACKERS] indexes and floats

От
dg@informix.com (David Gould)
Дата:
> This is perhaps a bad example, because now() (and 'now') return the
> transaction time, guaranteed to be the same for each row evaluation of a
> query, and the same for every query within a transaction.
>
> What should be behavior of
>
>   update x set i = random();
>
> be? I would have assumed that random() is evaluated once; are there any
> truely volatile functions in SQL92?

Hmmm, I would have thought that was a clever way to get a nice random
value into each row ;-)

But seriously, we have the iscacheable attribute for functions. So,

  update pg_func set iscacheable = 'false' where pg_func.name = 'random';

and you get different random values in all the rows. Or do

update pg_func set iscacheable = 'true' where pg_func.name = 'random';

and you get the same random value in all the rows.

Or at least this is how postgres was intended to work. It was meant to
be an extensible table driven system. Hardcoding stuff "because it is the
right way" is really assuming that we know better than the users what they
want to do. Almost never true.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - If simplicity worked, the world would be overrun with insects. -

Re: [HACKERS] indexes and floats

От
dg@informix.com (David Gould)
Дата:
Vadim writes:
> Imho, we have to treat random() etc differently in target list and
> qual! WHERE defines set of rows to be returned by query, functions
> in target list define representation of this set.

Maybe, but it sure makes the system harder to explain. For example, what
happens to a function in the target list of a subquery select? And what
should happen.

But, perhaps we should just do whatever is easiest since anyone who writes
a query:

 select * from x where x.key = random();

probably deserves whatever they get ;-)

Seriously, I have never seen random used in a where clause. And I don't think
that queries like

 select * from events where event.time = timenow();

make a lot of sense either. How did event.time get to be _equal_ to the
current time. Given that it had to be inserted earlier and the time of
the select would not have been known then.

My point is anyone who uses a variant function in a where clause already
is pretty confused. To tell them that variant functions aren't variant when
used in where clauses probably will just confuse them more.

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - If simplicity worked, the world would be overrun with insects. -