Bug in index scans with Locale support enabled

Поиск
Список
Период
Сортировка
От Barry Lind
Тема Bug in index scans with Locale support enabled
Дата
Msg-id 3A31830F.964982C0@xythos.com
обсуждение исходный текст
Ответы Re: Bug in index scans with Locale support enabled  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Bug in index scans with Locale support enabled  (Peter Eisentraut <peter_e@gmx.net>)
Список pgsql-hackers
In researching a problem I have uncovered the following bug in index
scans when Locale support is enabled.

Given a 7.0.3 postgres installation built with Locale support enabled
and a default US RedHat 7.0 Linux installation (meaning that the LANG
environment variable is set to en_US) to enable the US english locale
and

Given the following table and index structure with the following data:

create table test (test_col text);
create index test_index on test (test_col);
insert into test values ('abc.xyz');
insert into test values ('abcxyz');
insert into test values ('abc/xyz');

If you run the query:

select * from test where test_col >= 'abc.';

One would normally expect to only get one record returned, but instead
all records are returned.

The reason for this is that in the en_US locale all non-alphanumeric
characters are ignored when doing string comparisons.  So the data above
gets treated as:
abc.xyz = abcxyz = abc/xyz  (as the non-alphanumeric characters of '.'
and '/' are ignored).  This implys that the above query will then return
all rows as the constant 'abc.' is the same as 'abc' for comparison
purposes and all rows are >= 'abc'.

Note that if you use a different locale for example en_UK, you will get
different results as this locale does not ignore the . and / in the
comparison.

Now the real problem comes in when either the like or regex operators
are used in a sql statement.  Consider the following sql:

select * from text where test_col like 'abc/%';

This query should return one row, the row for 'abc/xyz'.  However if the
above query is executed via an index scan it will return the wrong
number of rows (0 in this case).

Why is this?  Well the query plan created for the above like expression
looks like the following:
select * from text where test_col >= 'abc/' and test_col < 'abc0';

In order to use the index the like has been changed into a '>=' and a
'<' for the constant prefix ('abc/') and the constant prefix with the
last character incremented by one ('/abc0')  (0 is the next character
after / in ASCII).

Given what was shown above about how the en_US locale does comparisons
we know that the non-alphanumeric characters are ignored.  So the query
essentially becomes:
select * from text where test_col >= 'abc' and test_col < 'abc0';
and the data it is comparing against is 'abcxyz' in all cases (once the
.'s an /'s are removed).  Therefore since 'abcxyz' > 'abc0', no rows are
returned.

Over the last couple of months that I have been on the postgres mail
lists there have been a few people who reported that queries of the form
"like '/aaa/bbb/%' don't work.  From the above information I have
determined that such queries don't work if:
a) database is built with Locale support enabled (--enable-locale)
b) the database is running with locale en_US
c) the column the like is being performed on is indexed
d) the query execution plan uses the above index

(Discovering the exact set of circumstances for how to reproduce this
has driven me crazy for a while now).

The current implementation for converting the like into an index scan
doesn't work with Locale support enabled and the en_US locale as shown
above.

thanks,
--Barry

PS. my test case:

drop table test;
create table test (test_col text);
create index test_index on test (test_col);
insert into test values ('abc.xyz');
insert into test values ('abcxyz');
insert into test values ('abc/xyz');
explain select * from test where test_col like 'abc/%';
select * from test where test_col like 'abc/%';


when run against postgres 7.0.3 with locale support enabled (used the
standard RPMs on postgresql.org for RedHat) with LANG=en_US:

barry=# drop table test;
DROP
barry=# create table test (test_col text);
CREATE
barry=# create index test_index on test (test_col);
CREATE
barry=# insert into test values ('abc.xyz');
INSERT 227611 1
barry=# insert into test values ('abcxyz');
INSERT 227612 1
barry=# insert into test values ('abc/xyz');
INSERT 227613 1
barry=# explain select * from test where test_col like 'abc/%';
NOTICE:  QUERY PLAN:

Index Scan using test_index on test  (cost=0.00..8.14 rows=10 width=12)

EXPLAIN
barry=# select * from test where test_col like 'abc/%';
 test_col
----------
(0 rows)

barry=#



when run against postgres 7.0.3 with locale support enabled (used the
standard RPMs on postgresql.org) with LANG=en_UK:

barry=# drop table test;
DROP
barry=# create table test (test_col text);
CREATE
barry=# create index test_index on test (test_col);
CREATE
barry=# insert into test values ('abc.xyz');
INSERT 227628 1
barry=# insert into test values ('abcxyz');
INSERT 227629 1
barry=# insert into test values ('abc/xyz');
INSERT 227630 1
barry=# explain select * from test where test_col like 'abc/%';
NOTICE:  QUERY PLAN:

Index Scan using test_index on test  (cost=0.00..8.14 rows=10 width=12)

EXPLAIN
barry=# select * from test where test_col like 'abc/%';
 test_col
----------
 abc/xyz
(1 row)

barry=#

Note the second query (under en_UK) returns the correct rows, but the
first query (under en_US) returned the wrong number of rows.


PPS.  Another way to work around the problem is to turn off locale
specific collation using the environment variable LC_COLLATE and setting
it to the value C.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: 7.0.3(nofsync) vs 7.1
Следующее
От: mlw
Дата:
Сообщение: Re: OK, does anyone have any better ideas?