Обсуждение: Comparing strings with non-ASCII characters

Поиск
Список
Период
Сортировка

Comparing strings with non-ASCII characters

От
Artur Rataj
Дата:
Hi,

Is it possible in Postgres to use non-ASCII characters and have enabled
index optimizations for the ~ operator? Would it work correctly with
locale set to C for expressions of the form ~ '^string.*'? Or is there a
way to substitue expressions like this with two comparisons? I have tried
to do it like that >= 'string' and <= 'string_' where _ would be a
character sorted after any other used in the searched table, but I did
not found such a character in the locale pl_PL. Anyway, why the index
optimizations are disabled for the ~ operator and not for the < or >
operators in the locale? Is not there a similar problem with all of the
three operators in a non-ASCII locale? Could not the lack of index
optimizations be fixed in Postgres for locales like pl_PL with a simple
reordering of characters for a collator, at least to get the < and >
operators working right with index optimizations in such locales?

Best regards,
Artur Rataj


Re: Comparing strings with non-ASCII characters

От
Tom Lane
Дата:
Artur Rataj <arataj@iitis.gliwice.pl> writes:
> Anyway, why the index optimizations are disabled for the ~ operator
> and not for the < or > operators in the locale?

The < and > operators are compatible with the index ordering, by
definition.  The trouble with ~ is that it is not doing ordering,
but pattern matching.  You can exploit an index to narrow down
the candidates for a left-anchored pattern only when the index
ordering is strict lexicographic ... and in all too many non-C
locales, it ain't.  Digraphs, multipass sorting rules, and things
like that are killers.

We have tried and failed (repeatedly) to find a way to use non-C
indexes for LIKE and ~.  If you'd like to try again, you're more
than welcome, but I suggest you read the PG list archives to learn
the reasons why previous attempts didn't work.

The long-run answer to this will probably be to allow individual columns
to be declared as being of particular locales; then you could assign C
locale to a column you need to do pattern matching on.  This seems to
require building our own locale library :-( ... so it's not going to
happen quickly.

BTW, you might think about using full-text search (contrib/tsearch
or contrib/fulltextindex) instead of pattern matching.

            regards, tom lane

Re: Comparing strings with non-ASCII characters

От
Artur Rataj
Дата:
I have set locale to C, made initdb and started postgres, and the index is
not used again.

explain select idescr.string from isbibc_fti_descr idescr where
idescr.string ~ '^string.*';
NOTICE:  QUERY PLAN:

Seq Scan on isbibc_fti_descr idescr  (cost=0.00..63910.04 rows=17033
width=44)

On Wed, 13 Nov 2002, Artur Rataj wrote:

> Hi,
>
> Is it possible in Postgres to use non-ASCII characters and have enabled
> index optimizations for the ~ operator? Would it work correctly with
> locale set to C for expressions of the form ~ '^string.*'? Or is there a
> way to substitue expressions like this with two comparisons? I have tried
> to do it like that >= 'string' and <= 'string_' where _ would be a
> character sorted after any other used in the searched table, but I did
> not found such a character in the locale pl_PL. Anyway, why the index
> optimizations are disabled for the ~ operator and not for the < or >
> operators in the locale? Is not there a similar problem with all of the
> three operators in a non-ASCII locale? Could not the lack of index
> optimizations be fixed in Postgres for locales like pl_PL with a simple
> reordering of characters for a collator, at least to get the < and >
> operators working right with index optimizations in such locales?
>

Best regards,
Artur Rataj


Re: Comparing strings with non-ASCII characters

От
Artur Rataj
Дата:
On Wed, 13 Nov 2002, Tom Lane wrote:

> BTW, you might think about using full-text search (contrib/tsearch
> or contrib/fulltextindex) instead of pattern matching.
>

Thank you. I needed the ~ operator to match entries in a table being a
full text index itself. Now I use C locale and rows are sorted outside
Postgres, the index now is used with the expressions of the form
~ '^string.*', but does such expressions work correctly in C locale with
non-ASCII characters?

Regards,
Artur Rataj


Re: Comparing strings with non-ASCII characters

От
Tom Lane
Дата:
Artur Rataj <arataj@iitis.gliwice.pl> writes:
> Thank you. I needed the ~ operator to match entries in a table being a
> full text index itself. Now I use C locale and rows are sorted outside
> Postgres, the index now is used with the expressions of the form
> ~ '^string.*', but does such expressions work correctly in C locale with
> non-ASCII characters?

Don't see why they wouldn't --- in C locale, bytes are bytes.  But you
should experiment before trusting that opinion ...

            regards, tom lane