Обсуждение: Comparing strings with non-ASCII characters
Hi, Is it possible in Postgres to use non-ASCII characters and have enabled index optimizations for the ~ operator? Would it work correctly with locale set to C for expressions of the form ~ '^string.*'? Or is there a way to substitue expressions like this with two comparisons? I have tried to do it like that >= 'string' and <= 'string_' where _ would be a character sorted after any other used in the searched table, but I did not found such a character in the locale pl_PL. Anyway, why the index optimizations are disabled for the ~ operator and not for the < or > operators in the locale? Is not there a similar problem with all of the three operators in a non-ASCII locale? Could not the lack of index optimizations be fixed in Postgres for locales like pl_PL with a simple reordering of characters for a collator, at least to get the < and > operators working right with index optimizations in such locales? Best regards, Artur Rataj
Artur Rataj <arataj@iitis.gliwice.pl> writes: > Anyway, why the index optimizations are disabled for the ~ operator > and not for the < or > operators in the locale? The < and > operators are compatible with the index ordering, by definition. The trouble with ~ is that it is not doing ordering, but pattern matching. You can exploit an index to narrow down the candidates for a left-anchored pattern only when the index ordering is strict lexicographic ... and in all too many non-C locales, it ain't. Digraphs, multipass sorting rules, and things like that are killers. We have tried and failed (repeatedly) to find a way to use non-C indexes for LIKE and ~. If you'd like to try again, you're more than welcome, but I suggest you read the PG list archives to learn the reasons why previous attempts didn't work. The long-run answer to this will probably be to allow individual columns to be declared as being of particular locales; then you could assign C locale to a column you need to do pattern matching on. This seems to require building our own locale library :-( ... so it's not going to happen quickly. BTW, you might think about using full-text search (contrib/tsearch or contrib/fulltextindex) instead of pattern matching. regards, tom lane
I have set locale to C, made initdb and started postgres, and the index is not used again. explain select idescr.string from isbibc_fti_descr idescr where idescr.string ~ '^string.*'; NOTICE: QUERY PLAN: Seq Scan on isbibc_fti_descr idescr (cost=0.00..63910.04 rows=17033 width=44) On Wed, 13 Nov 2002, Artur Rataj wrote: > Hi, > > Is it possible in Postgres to use non-ASCII characters and have enabled > index optimizations for the ~ operator? Would it work correctly with > locale set to C for expressions of the form ~ '^string.*'? Or is there a > way to substitue expressions like this with two comparisons? I have tried > to do it like that >= 'string' and <= 'string_' where _ would be a > character sorted after any other used in the searched table, but I did > not found such a character in the locale pl_PL. Anyway, why the index > optimizations are disabled for the ~ operator and not for the < or > > operators in the locale? Is not there a similar problem with all of the > three operators in a non-ASCII locale? Could not the lack of index > optimizations be fixed in Postgres for locales like pl_PL with a simple > reordering of characters for a collator, at least to get the < and > > operators working right with index optimizations in such locales? > Best regards, Artur Rataj
On Wed, 13 Nov 2002, Tom Lane wrote: > BTW, you might think about using full-text search (contrib/tsearch > or contrib/fulltextindex) instead of pattern matching. > Thank you. I needed the ~ operator to match entries in a table being a full text index itself. Now I use C locale and rows are sorted outside Postgres, the index now is used with the expressions of the form ~ '^string.*', but does such expressions work correctly in C locale with non-ASCII characters? Regards, Artur Rataj
Artur Rataj <arataj@iitis.gliwice.pl> writes: > Thank you. I needed the ~ operator to match entries in a table being a > full text index itself. Now I use C locale and rows are sorted outside > Postgres, the index now is used with the expressions of the form > ~ '^string.*', but does such expressions work correctly in C locale with > non-ASCII characters? Don't see why they wouldn't --- in C locale, bytes are bytes. But you should experiment before trusting that opinion ... regards, tom lane