UTF8 regexp and char classes still does not work

From	Sergey Burladyan
Subject	UTF8 regexp and char classes still does not work
Date	28 September 2010, 18:35:12
Msg-id	877hi5a6wr.fsf@home.progtech.ru
Replies	Re: UTF8 regexp and char classes still does not work
List	pgsql-hackers

I see this in the 9.0 release notes:
- Support locale-specific regular expression processing with UTF-8 server encoding (Tom Lane)
  Locale-specific regular expression functionality includes case-insensitive matching and
  locale-specific character classes.

But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):
select version();
                                                        version
------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | f        | t

All three must be true, as in the KOI8-R database below:

create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | t        | t
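
For reference, the octal escapes in the two queries denote different kinds of values. In the UTF-8 query, \320\266 and \320\226 are two-byte sequences, while in the KOI8-R query \326 and \366 are single bytes (214 and 246). A minimal sketch of decoding the two-byte UTF-8 sequences into code points follows; the function name utf8_decode2 is hypothetical and is not part of PostgreSQL.

```c
#include <stdint.h>

/*
 * Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
 * Returns 0 if the bytes do not form a valid two-byte sequence.
 * Hypothetical helper for illustration only.
 */
uint32_t
utf8_decode2(unsigned char b0, unsigned char b1)
{
    uint32_t cp;

    /* lead byte must be 110xxxxx, continuation byte must be 10xxxxxx */
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return 0;

    cp = ((uint32_t) (b0 & 0x1F) << 6) | (uint32_t) (b1 & 0x3F);

    /* reject overlong encodings of values below 0x80 */
    return (cp >= 0x80) ? cp : 0;
}
```

With this helper, utf8_decode2(0320, 0266) gives 0x0436 (CYRILLIC SMALL LETTER ZHE) and utf8_decode2(0320, 0226) gives 0x0416 (CYRILLIC CAPITAL LETTER ZHE). Both values are above 255, unlike the single-byte KOI8-R codes.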

As far as I can see, Tom's patch 0d323425 changed only functions like pg_wc_isalpha, but
pg_wc_isalpha is called from the
static struct cvec *
cclass(struct vars * v,        /* context */
       const chr *startp,      /* where the name starts */
       const chr *endp,        /* just past the end of the name */
       int cases)              /* case-independent? */

function, and this function has the comment "For the moment, assume that only char codes < 256 can be in these classes"
and it calls pg_wc_isalpha like this:

 
for (i = 0; i <= UCHAR_MAX; i++)
{
    if (pg_wc_isalpha((chr) i))
        addchr(cv, (chr) i);
}
UCHAR_MAX is 255, so character codes above 255 are never tested.
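
The effect of that loop bound can be sketched in isolation. This is a simplified model, not the PostgreSQL source: the chr typedef is a stand-in, demo_isalpha is a hypothetical predicate in place of pg_wc_isalpha (it accepts ASCII letters and the range U+0410..U+044F only), and class_contains is a hypothetical helper that reports whether a given code would be added by a loop scanning 0..max_code.

```c
#include <limits.h>

/* Stand-in for the regex engine's character type (an assumption). */
typedef unsigned int chr;

/*
 * Hypothetical replacement for pg_wc_isalpha: treats ASCII letters and
 * the basic Cyrillic letters U+0410..U+044F as alphabetic.
 */
int
demo_isalpha(chr c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    return (c >= 0x0410 && c <= 0x044F);
}

/*
 * Model of the quoted loop: scan codes 0..max_code and report whether
 * "wanted" would have been added to the character class.
 */
int
class_contains(chr max_code, chr wanted)
{
    chr i;

    for (i = 0; i <= max_code; i++)
    {
        if (demo_isalpha(i) && i == wanted)
            return 1;
    }
    return 0;
}
```

In this model, class_contains(UCHAR_MAX, 'g') is 1, but class_contains(UCHAR_MAX, 0x0436) is 0, because the scan stops at 255 even though the predicate itself accepts 0x0436. That matches the t | f | t result reported for the UTF-8 database. Raising the bound here only illustrates the point and is not a proposed fix.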

I do not fully understand the regular expression algorithm, but I think the cclass function also needs a fix.

-- 
Sergey Burladyan
