Re: psql weird behaviour with charset encodings

From Tom Lane
Subject Re: psql weird behaviour with charset encodings
Msg-id 3797.1273276002@sss.pgh.pa.us
In response to Re: psql weird behaviour with charset encodings  (hernan gonzalez <hgonzalez@gmail.com>)
Responses Re: psql weird behaviour with charset encodings  (hgonzalez@gmail.com)
List pgsql-general

hernan gonzalez <hgonzalez@gmail.com> writes:
> The issue is that psql tries (apparently) to convert to UTF8
> (even when he plans to output the raw text -LATIN9 in this case)
> just for computing the lenght of the field, to build the table.
> And because for this computation he (apparently) rely on the string
> routines with it's own locale, instead of the DB or client encoding.

I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations.  But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

                fprintf(fout, "%.*s", bytes_to_output,
                        this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print.  However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.  I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.
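
For anyone who wants to check this hypothesis in isolation, a small harness along the following lines might help. It is only a sketch and not psql code: the helper name precision_print_result and its -2 "locale unavailable" return value are made up for illustration. It switches LC_CTYPE to the named locale, prints with a "%.*s" precision, and hands back whatever fprintf returned, so the caller can see whether the call failed or how many bytes it claims to have written.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical test helper (not part of psql).
 *
 * Sets LC_CTYPE to locale_name, prints at most "precision" units of str
 * with "%.*s", restores the "C" locale, and returns fprintf's result:
 * the number of bytes written, or a negative value on error.
 * Returns -2 if the requested locale cannot be selected.
 */
int
precision_print_result(const char *locale_name, const char *str,
                       int precision, FILE *out)
{
    int rc;

    assert(locale_name != NULL && str != NULL && out != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;              /* locale not installed on this system */

    rc = fprintf(out, "%.*s", precision, str);

    setlocale(LC_CTYPE, "C");   /* do not leak the locale change */
    return rc;
}
```

Calling it with a UTF-8 locale (for example "en_US.UTF-8", assuming that locale is installed) and a LATIN9 string such as "\xe9t\xe9" would show whether fprintf rejects the string on a given platform. The result is what is being tested here, so no particular outcome is assumed.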

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero.  I think we're going to have to
change the code to not rely on %.*s here.  Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte.  (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)
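
One way to stop relying on %.*s would be to hand the already-counted bytes to fwrite, which takes no notice of the locale. The following is a minimal sketch under that assumption, not the actual fix: the function name write_bytes is hypothetical, and nothing in this message says what form the eventual change took.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of psql).
 *
 * Writes exactly nbytes bytes starting at buf to out, with no
 * interpretation of the data as characters in any encoding.
 * Returns 1 if all bytes were written, 0 on a short write.
 */
int
write_bytes(FILE *out, const char *buf, size_t nbytes)
{
    assert(out != NULL && buf != NULL);

    return fwrite(buf, 1, nbytes, out) == nbytes;
}
```

With such a helper, the call quoted earlier could plausibly become write_bytes(fout, this_line->ptr + bytes_output[j], bytes_to_output), assuming bytes_to_output is never negative.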

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

            regards, tom lane
