Discussion: UTF8 regexp and char classes still does not work

UTF8 regexp and char classes still does not work

From
Sergey Burladyan
Date:
I see this in the 9.0 release notes:
- Support locale-specific regular expression processing with UTF-8 server encoding (Tom Lane)
  Locale-specific regular expression functionality includes case-insensitive matching and
  locale-specific character classes.

But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):

select version();
                                                        version
------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | f        | t

All three should be true, as below:

create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | t        | t

As far as I can see, Tom's patch 0d323425 changed only functions like pg_wc_isalpha, but
pg_wc_isalpha is called from the function

static struct cvec *
cclass(struct vars * v,        /* context */
       const chr *startp,      /* where the name starts */
       const chr *endp,        /* just past the end of the name */
       int cases)              /* case-independent? */

and this function has the comment "For the moment, assume that only char codes < 256 can be in these classes"
and calls pg_wc_isalpha like this:

for (i = 0; i <= UCHAR_MAX; i++)
{
    if (pg_wc_isalpha((chr) i))
        addchr(cv, (chr) i);
}

UCHAR_MAX is 255.

I do not fully understand this regular expression algorithm, but I think the cclass function also needs a fix.
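
To make the failure concrete, here is a minimal sketch in C. It is not PostgreSQL code, and decode_utf8_2byte and reached_by_uchar_loop are hypothetical helper names. It decodes the two UTF-8 bytes from the test query and checks whether the resulting code point could ever be visited by a loop bounded by UCHAR_MAX:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code point.
 * Hypothetical helper for illustration only; not part of PostgreSQL.
 */
static unsigned int
decode_utf8_2byte(const unsigned char *s)
{
    assert(s != NULL);
    /* Lead byte must be 110xxxxx, continuation byte must be 10xxxxxx. */
    assert((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80);
    return ((unsigned int) (s[0] & 0x1F) << 6) | (unsigned int) (s[1] & 0x3F);
}

/*
 * Would a loop of the form "for (i = 0; i <= UCHAR_MAX; i++)" ever
 * visit this code point?
 */
static int
reached_by_uchar_loop(unsigned int cp)
{
    return cp <= UCHAR_MAX;
}
```

For the bytes \320\266 the decoder yields U+0436 (CYRILLIC SMALL LETTER ZHE), which is above 255, so the loop never offers it to pg_wc_isalpha; 'g' (U+0067) is within range, which matches the f and t results in the test output.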

-- 
Sergey Burladyan


Re: UTF8 regexp and char classes still does not work

From
Tom Lane
Date:
Sergey Burladyan <eshkinkot@gmail.com> writes:
> As far as I can see, Tom's patch 0d323425 changed only functions like pg_wc_isalpha, but
> pg_wc_isalpha is called from the function
> static struct cvec *
> cclass(struct vars * v,        /* context */
>            const chr *startp,  /* where the name starts */
>            const chr *endp,    /* just past the end of the name */
>            int cases)          /* case-independent? */
> and this function has the comment "For the moment, assume that only char codes < 256 can be in these classes"
> and calls pg_wc_isalpha like this:
>
> for (i = 0; i <= UCHAR_MAX; i++)
> {
>     if (pg_wc_isalpha((chr) i))
>         addchr(cv, (chr) i);
> }
> UCHAR_MAX is 255

Hmm, you're right.  I only tested that on Latin1 characters, for which
it does work because those have Unicode points below 256.  I'm not
sure of a reasonable solution for the general case --- we certainly
don't want this function iterating up to 2^21 or thereabouts.

Your test case seems to be using KOI8 encoding, though, which doesn't
have anything to do with UTF8 behavior.

        regards, tom lane


Re: UTF8 regexp and char classes still does not work

From
Sergey Burladyan
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Hmm, you're right.  I only tested that on Latin1 characters, for which
> it does work because those have Unicode points below 256.  I'm not
> sure of a reasonable solution for the general case --- we certainly
> don't want this function iterating up to 2^21 or thereabouts.

Yes, I understand the problem. How does Perl do this? Maybe the Unicode table could
be precomputed, or linked into the postgres binary from an external source?
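
One way the precomputed-table idea could look, purely as a sketch: a sorted array of code point ranges searched with binary search, so a membership test costs O(log n) and nothing has to iterate over the roughly 2^21 possible code points. The table below is a tiny hand-picked excerpt (ASCII letters and the basic Cyrillic letters U+0410 to U+044F), not real Unicode data, and cp_range, alpha_ranges, in_ranges and sketch_isalpha are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/* One inclusive range of code points. Hypothetical type for this sketch. */
typedef struct
{
    unsigned int first;
    unsigned int last;
} cp_range;

/*
 * Illustrative excerpt only: sorted by "first", non-overlapping.
 * A real table would be generated from the Unicode data files.
 */
static const cp_range alpha_ranges[] = {
    {0x0041, 0x005A},           /* A-Z */
    {0x0061, 0x007A},           /* a-z */
    {0x0410, 0x044F},           /* basic Cyrillic letters */
};

/* Binary search for cp in a sorted table of n ranges. */
static int
in_ranges(unsigned int cp, const cp_range *tab, size_t n)
{
    size_t      lo = 0;
    size_t      hi = n;

    assert(tab != NULL);
    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (cp < tab[mid].first)
            hi = mid;
        else if (cp > tab[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Table-driven stand-in for an isalpha test on code points. */
static int
sketch_isalpha(unsigned int cp)
{
    return in_ranges(cp, alpha_ranges,
                     sizeof(alpha_ranges) / sizeof(alpha_ranges[0]));
}
```

With this excerpt, sketch_isalpha(0x0436) is true for CYRILLIC SMALL LETTER ZHE. A table in this form could also be walked range by range rather than character by character when building a character class.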

> Your test case seems to be using KOI8 encoding, though, which doesn't
> have anything to do with UTF8 behavior.

It is just an example of the expected result. See the first test: it is UTF8, two bytes per character:
> > --- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
> > select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
> >  ?column? | ?column? | ?column? 
> > ----------+----------+----------
> >  t        | f        | t


-- 
Sergey Burladyan