Обсуждение: Like vs '=' bug with indexing

Поиск
Список
Период
Сортировка

Like vs '=' bug with indexing

От
m w
Дата:
I am reposting this because I'm not sure it actually
made it to the list.


I have a function to transform text into a
pseudo-metaphone variable, take this example:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 3 ;   song     | metatext
-------------+----------Born To Run | brntornBorn To Run | brntornBorn To Run | brntorn
(3 rows)

Here is the problem: Depending on whether there is an
index or not, 'like' behaves differently. Here is a
transcript:

cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;   song     | metatext
-------------+----------Born To Run | brntorn
(1 row)
cddbsql=# create index cdsongs_meta_song on cdsongs
(metatext(song)) ;
CREATE
cddbsql=# select song, metatext(song) from cdsongs
where metatext(song) like metatext('born to run')
limit 1 ;song | metatext
------+----------
(0
rows)     

This happens in both 7.0 and
7.1.


__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35 
a year!  http://personal.mail.yahoo.com/


Re: Like vs '=' bug with indexing

От
Tom Lane
Дата:
m w <mttf2000@yahoo.com> writes:
> Here is the problem: Depending on whether there is an
> index or not, 'like' behaves differently.

Please provide a complete, self-contained example with which
we can reproduce the problem.
        regards, tom lane


Re: Like vs '=' bug with indexing

От
m w
Дата:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> m w <mttf2000@yahoo.com> writes:
> > Here is the problem: Depending on whether there is
> an
> > index or not, 'like' behaves differently.
> 
> Please provide a complete, self-contained example
> with which
> we can reproduce the problem.

I am trying to create a test function and some data
that reproduces the problem easily, but I think I know
what it is.

It is a two bug issue. I had a bug in my code in that
I added the "\0" to the end of my strings when I
converted from a C string to a postgres "text" object.


I think this exposes a bug in postgres where either
index or table scans (I'm not sure which just yet)
treat a zero differently than a non-zero in a varchar.

It looks as if some section of code is using the zero
to terminate a string and another section of code is
not. So at some point data which should be the same
differes either on length of data, or the trailing
zero compared to an uninitialized byte.

Removing the terminating zero from the postgres string
fixes the problem, but, if I understand postgres well
enough, this should not make a difference, and should
be reported as a bug anyway.


__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35 
a year!  http://personal.mail.yahoo.com/


Re: Like vs '=' bug with indexing

От
Tom Lane
Дата:
m w <mttf2000@yahoo.com> writes:
> I think this exposes a bug in postgres where either
> index or table scans (I'm not sure which just yet)
> treat a zero differently than a non-zero in a varchar.

Embedded zeroes aren't supported in char/varchar/text fields,
and cannot be supported in a portable fashion, since these datatypes
rely on functions like strcoll() that don't allow embedded nulls in
strings.  It wouldn't surprise me too much if there are inconsistent
behaviors between indexscans and seqscans for such invalid data.

It doesn't seem real practical for us to examine the output of every
C-coded function to make sure it produces a valid value of the datatype.
Illegal returned values are a fault of the function, and ensuing
misbehaviors are still its fault ...
        regards, tom lane


Re: Like vs '=' bug with indexing

От
m w
Дата:
 Embedded zeroes aren't supported in
> char/varchar/text fields,
> and cannot be supported in a portable fashion, since
> these datatypes
> rely on functions like strcoll() that don't allow
> embedded nulls in
> strings.  It wouldn't surprise me too much if there
> are inconsistent
> behaviors between indexscans and seqscans for such
> invalid data.
> 
> It doesn't seem real practical for us to examine the
> output of every
> C-coded function to make sure it produces a valid
> value of the datatype.
> Illegal returned values are a fault of the function,
> and ensuing
> misbehaviors are still its fault ...

Fair enough, but I think it should raise a caution
flag when two different behaviors can be seen with the
same query. This may have a common cause with other
index vs non-index behavior.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35 
a year!  http://personal.mail.yahoo.com/


Re: Like vs '=' bug with indexing

От
Hannu Krosing
Дата:
Tom Lane wrote:
> 
> m w <mttf2000@yahoo.com> writes:
> > I think this exposes a bug in postgres where either
> > index or table scans (I'm not sure which just yet)
> > treat a zero differently than a non-zero in a varchar.
> 
> Embedded zeroes aren't supported in char/varchar/text fields,
> and cannot be supported in a portable fashion, since these datatypes
> rely on functions like strcoll() that don't allow embedded nulls in
> strings.

Is there no simple (i.e. cheap) way to disallow \0 alltogether for 
these types then ?

perhaps just strip them out in textin() (or is it text_in()) ?

> It wouldn't surprise me too much if there are inconsistent
> behaviors between indexscans and seqscans for such invalid data.

should'nt they both use the _same_ strcoll() and friends ?

> It doesn't seem real practical for us to examine the output of every
> C-coded function to make sure it produces a valid value of the datatype.
> Illegal returned values are a fault of the function, and ensuing
> misbehaviors are still its fault ...

Should we not examine "the _possible_ outputs of every C-coded function 
to make sure it produces a valid value of the datatype" ;)

For me producing an invalid data for a datatype seems very much like 
a bug and it _should_ be reported.

-------------
Hannu


Re: Like vs '=' bug with indexing

От
m w
Дата:
--- Hannu Krosing <hannu@tm.ee> wrote:

> Should we not examine "the _possible_ outputs of
> every C-coded function 
> to make sure it produces a valid value of the
> datatype" ;)
> 
> For me producing an invalid data for a datatype
> seems very much like 
> a bug and it _should_ be reported.

No, I think Tom is right, there should be no
validation on C functions incorporated into Postgres
by users. Who wants that overhead in a production
system?

However, I think when the same SQL query produces
different results when you add an index, speaks of an
inconsistency in the system, which could be the source
of other problems.

I have seen a couple posts where results from an index
scan are not the same as the results from a table
scan, granted they were language issues, but still, my
gut tells me if I set the length of a variable to x,
and a trailing zero is included, the system should
either fail consistently or work consistently. I don't
care which, it should just be consistent.

Inconsistent behavior indicates that a different
matching algorithm is used if one uses an index
instead of a table scan. That scares me.

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35 
a year!  http://personal.mail.yahoo.com/


Re: Like vs '=' bug with indexing

От
Tom Lane
Дата:
Hannu Krosing <hannu@tm.ee> writes:
> Is there no simple (i.e. cheap) way to disallow \0 alltogether for 
> these types then ?
> perhaps just strip them out in textin() (or is it text_in()) ?

They *are* stripped out in textin(), by virtue of the fact that
textin expects to see a null-terminated input string.

>> It wouldn't surprise me too much if there are inconsistent
>> behaviors between indexscans and seqscans for such invalid data.

> should'nt they both use the _same_ strcoll() and friends ?

Irrelevant; the issue is that the various comparison operators may
produce inconsistent results given invalid input.  For instance
texteq() short-circuits to a FALSE result if the lengths of the
inputs are different, which means that  'ab\0' = 'ab'  will produce
false, even though a strcoll-based comparison will claim they are
equal.  I don't think that means that texteq() is wrong to check the
lengths first.

> Should we not examine "the _possible_ outputs of every C-coded function 
> to make sure it produces a valid value of the datatype" ;)

Go for it.

Possibly chr() should reject chr(0) ...
        regards, tom lane


Re: Like vs '=' bug with indexing

От
Tom Lane
Дата:
m w <mttf2000@yahoo.com> writes:
> Inconsistent behavior indicates that a different
> matching algorithm is used if one uses an index
> instead of a table scan. That scares me.

A seq scan and an index scan are inherently different algorithms,
so I don't see exactly how you think we can avoid this risk.

In particular, if you are dealing with a btree index and a "WHERE
column = constant" query, then a seq scan is only going to be concerned
with the behavior of the '=' operator --- does it return TRUE or not
for any particular row?  But an index search is inherently going to make
ordered comparisons (<, =, >).  So there is always a potential for
inconsistent behavior if the ordering operators produce results that are
inconsistent with simple '='.  We cannot design that away --- all we can
do is fix such bugs when one is discovered in a particular datatype.
        regards, tom lane


Re: Like vs '=' bug with indexing

От
Hannu Krosing
Дата:
Tom Lane wrote:
> 
> Hannu Krosing <hannu@tm.ee> writes:
> > Is there no simple (i.e. cheap) way to disallow \0 alltogether for
> > these types then ?
> > perhaps just strip them out in textin() (or is it text_in()) ?
> 
> They *are* stripped out in textin(), by virtue of the fact that
> textin expects to see a null-terminated input string.

Ok, I was mistaken to think that pg_trigger.tgargs contained real \0's 
and not fakes and I was able to get similar output from other char 
types by using \\000 - You never can tell how many \\\\ are required to 
input a single \ to next level.

Also I remember being told that bytea _can_ hold embedded \0, no ?
> > Should we not examine "the _possible_ outputs of every C-coded function
> > to make sure it produces a valid value of the datatype" ;)
> 
> Go for it.
> 
> Possibly chr() should reject chr(0) ...

there is no function chr() at least in 7.0.2. 

and char is not usable from psql (gives out strange errors);

hannu=# select char(0);
ERROR:  length for type 'bpchar' must be at least 1
hannu=# select char(32);
ERROR:  parser: parse error at or near ";"
hannu=# select char('a');
ERROR:  parser: parse error at or near "'"

could it be possible somehow distinguish between user callable (safe) 
functions and "internal" ones ?

------------
Hannu


Re: Like vs '=' bug with indexing

От
Tom Lane
Дата:
Hannu Krosing <hannu@tm.ee> writes:
> Also I remember being told that bytea _can_ hold embedded \0, no ?

Yes.  What has that got to do with text et al?

>> Possibly chr() should reject chr(0) ...
>
> there is no function chr() at least in 7.0.2. 

I think it used to be called ichar(), but then someone pointed out that
Oracle calls it chr().

> and char is not usable from psql (gives out strange errors);

char(n) is a type name, not a function call.
        regards, tom lane


Re: Like vs '=' bug with indexing

От
Hannu Krosing
Дата:
Tom Lane wrote:
> 
> Hannu Krosing <hannu@tm.ee> writes:
> > Also I remember being told that bytea _can_ hold embedded \0, no ?
> 
> Yes.  What has that got to do with text et al?
> 
> >> Possibly chr() should reject chr(0) ...
> >
> > there is no function chr() at least in 7.0.2.
> 
> I think it used to be called ichar(), but then someone pointed out that
> Oracle calls it chr().
> 
> > and char is not usable from psql (gives out strange errors);
> 
> char(n) is a type name, not a function call.

\df list it as a function call taking either bpchar or text argument.


hannu=# \df                                  List of functions      Result        |       Function       |

 
Arguments                 
---------------------+----------------------+------------------------------------------
...char                | char                 | bpchar char                | char                 | text 
...


Maybe the cast functions (named similar to types) should be removed from
\df output 
or at least marked as such, also in pg_proc?

------------
Hannu