tolower() identifier downcasing versus multibyte encodings

From	Tom Lane
Subject	tolower() identifier downcasing versus multibyte encodings
Date	19 March 2011, 04:11:12
Msg-id	558.1300507858@sss.pgh.pa.us
Replies	Re: tolower() identifier downcasing versus multibyte encodings (Marko Kreen <markokr@gmail.com>) Re: tolower() identifier downcasing versus multibyte encodings (Bruce Momjian <bruce@momjian.us>)
List	pgsql-hackers

I've been able to reproduce the behavior described here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
It's specific to UTF8 locales on Mac OS X.  I'm not sure if the
problem can manifest anywhere else; considering that OS X's UTF8
locales have a general reputation of being broken, it may only
happen on that platform.

What is happening is that downcase_truncate_identifier() tries to
downcase identifiers like this:
    unsigned char ch = (unsigned char) ident[i];

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

This is of course incapable of successfully downcasing any multibyte
characters, but there's an assumption that isupper() won't return TRUE
for a character fragment in a multibyte locale.  However, on OS X
it seems that that's not the case :-(.  For the particular example
cited by Francisco Figueiredo, I see the byte sequence \303\251
converted to \343\251, because isupper() returns TRUE for \303 and
then tolower() returns \343.  The byte \251 is not changed, but the
damage is already done: we now have an invalidly-encoded string.
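
To make the failure concrete, here is a minimal sketch that reproduces the byte-level effect without needing an OS X machine. The helpers sim_isupper() and sim_tolower() are hypothetical stand-ins, written on the assumption that the platform's isupper()/tolower() treat high-bit-set bytes as Latin-1 letters (which is consistent with \303 becoming \343); they are not the real libc functions. The validator only checks lead/continuation byte structure, which is enough to show that \343\251 is a truncated three-byte sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: models the reported behaviour by treating high-bit-set bytes
 * as Latin-1 characters (0xC0..0xDE except 0xD7 are "upper case", and the
 * lower-case form is +0x20).  Hypothetical stand-ins for isupper/tolower. */
static bool sim_isupper(unsigned char ch)
{
    return ch >= 0xC0 && ch <= 0xDE && ch != 0xD7;
}

static unsigned char sim_tolower(unsigned char ch)
{
    return sim_isupper(ch) ? (unsigned char) (ch + 0x20) : ch;
}

/* Bytewise downcasing in the style of the loop quoted above.
 * result must have room for len + 1 bytes. */
static void bytewise_downcase(const char *ident, char *result, size_t len)
{
    assert(ident != NULL && result != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if ((ch & 0x80) && sim_isupper(ch))
            ch = sim_tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}

/* Structural UTF-8 check: every lead byte must be followed by the right
 * number of 10xxxxxx continuation bytes.  It does not reject overlong
 * forms or surrogates; that is not needed for this illustration. */
static bool utf8_structurally_valid(const char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char ch = (unsigned char) s[i];
        size_t need;

        if (ch < 0x80)
            need = 0;
        else if ((ch & 0xE0) == 0xC0)
            need = 1;
        else if ((ch & 0xF0) == 0xE0)
            need = 2;
        else if ((ch & 0xF8) == 0xF0)
            need = 3;
        else
            return false;       /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return false;       /* sequence runs off the end of the string */
        for (size_t k = 1; k <= need; k++)
            if (((unsigned char) s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

Feeding the two bytes \303\251 through bytewise_downcase() yields \343\251: the input passes utf8_structurally_valid(), the output does not, because \343 announces a three-byte character and only one continuation byte follows.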

It looks like the blame for the subsequent "disappearance" of the bogus
data lies with fprintf back on the client side; that surprises me a bit
because I'd only heard of glibc being so cavalier with data it thought
was invalidly encoded.  But anyway, the origin of the problem is in the
downcasing transformation.

We could possibly fix this by not attempting the downcasing
transformation on high-bit-set characters unless the encoding is
single-byte.  However, we have the exact same downcasing logic embedded
in the functions in src/port/pgstrcasecmp.c, and those don't have any
convenient way of knowing what the prevailing encoding is --- when
compiled for frontend use, they can't use pg_database_encoding_max_length.
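
A rough sketch of this first option, for the backend case only. The function name downcase_encoding_aware and its max_encoding_length parameter are hypothetical; the parameter stands in for whatever pg_database_encoding_max_length() would report, and is passed explicitly here so the example is self-contained. It does nothing to solve the frontend problem described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: only consult isupper()/tolower() for high-bit-set
 * bytes when the encoding is single-byte, so that fragments of multibyte
 * characters are never altered.  result must have room for len + 1 bytes. */
static void downcase_encoding_aware(const char *ident, char *result,
                                    size_t len, int max_encoding_length)
{
    int single_byte = (max_encoding_length == 1);

    assert(ident != NULL && result != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (single_byte && (ch & 0x80) && isupper(ch))
            ch = (unsigned char) tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}
```

With max_encoding_length greater than 1 (as for UTF8), the bytes \303\211 pass through untouched regardless of what the platform's isupper() claims, while ASCII letters are still folded.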

Or we could bite the bullet and start using str_tolower(), but the
performance implications of that are unpleasant; not to mention that
we really don't want to re-introduce the "Turkish problem" with
unexpected handling of i/I in identifiers.

Or we could go the other way and stop downcasing non-ASCII letters
altogether.

None of these options seem terribly attractive.  Thoughts?
        regards, tom lane
