Discussion: psql weird behaviour with charset encodings

psql weird behaviour with charset encodings

From
hernan gonzalez
Date:
(Disclaimer: I've been using PostgreSQL for quite a long time, and I
usually deal with non-ASCII LATIN9 characters,
but that has never been a problem until now.)

My issue, summarized: when psql is invoked by a user whose locale
differs from that of the database, the tabular output
is wrong for some text fields. The weird thing is that those text
fields are not just garbled, but empty. Even weirder,
this does not happen in the expanded output format (\x). Apparently
it's not a terminal problem (everything looks right after \x),
nor a client_encoding issue (same reason). So...?

Details follow.
My scenario: Fedora 12, PostgreSQL 8.4.3 compiled from source.

Database encoding (global) LATIN9.
User postgres locale: LANG=en_US.iso885915,
User root locale LANG=en_US.UTF-8

When I connect as the postgres user, everything is fine.
When I connect as root, it is not... except with \x.
Example (here the last_name field has one non-ASCII character, N WITH TILDE):

========================================================================

[root@myserver ~]# su - postgres
[postgres@myserver ~]$ psql db
psql (8.4.3)
db=# \set
...
ENCODING = 'LATIN9'
db=# select first_name,last_name,birth_date from persons where rid= 143;
  first_name  | last_name | birth_date
--------------+-----------+------------
 Guillermo    | Calcaño   | 1996-09-30
db=# \x
db=# select first_name,last_name,birth_date from persons where rid= 143;
-[ RECORD 1 ]------------
first_name | Guillermo
last_name  | Calcaño
birth_date | 1996-09-30


[root@myserver ~]# /usr/local/pgsql/bin/psql -U postgres db
psql (8.4.3)
db=# \set
...
ENCODING = 'LATIN9'
db=# select first_name,last_name,birth_date from persons where rid= 143;
  first_name  | last_name | birth_date
--------------+-----------+------------
 Guillermo    |    | 1996-09-30
(1 row)
db=# \x
db=# select first_name,last_name,birth_date from persons where rid= 143;
-[ RECORD 1 ]------------
first_name | Guillermo
last_name  | Calcaño
birth_date | 1996-09-30

==================================================================

It looks as if psql, in tabular form, needs to compute the length of
the field to output, and for this it uses the user's locale (not
the client_encoding, mind you, but the locale of the user that is
running the psql process). In case of a mismatch,
it cannot decode the string and compute the length, so... it assumes
length=0 (?)
Is this the expected behaviour? Has this behaviour changed recently?

--
Hernán J. González
http://hjg.com.ar/

Re: psql weird behaviour with charset encodings

From
Tom Lane
Date:
hernan gonzalez <hgonzalez@gmail.com> writes:
> My scenario: Fedora 12, PostgreSQL 8.4.3 compiled from source.

> Database encoding (global) LATIN9.
> User postgres locale: LANG=en_US.iso885915,
> User root locale LANG=en_US.UTF-8

> When I connect as the postgres user, everything is fine.
> When I connect as root, it is not... except with \x.

What's client_encoding set to in the two cases?  If it's not utf8,
does changing it to that improve matters?  Alternatively, see what
xterm (or whatever terminal window you're using) thinks the encoding
is, and change it to match psql's client_encoding.

            regards, tom lane

Re: psql weird behaviour with charset encodings

From
hernan gonzalez
Date:
It's surely not an xterm problem: I see the characters fine with just the
\x formatting. I can also check the output by redirecting it to a file.

My original client_encoding seems to be LATIN9 in both cases,
according to the \set output.

If I change it (for the root user) to UTF8 with "SET CLIENT_ENCODING
TO 'UTF8';", the conversion from LATIN9 to UTF8 does take place
(and I see the characters fine after switching my terminal to UTF8).

(BTW: I understood from the docs that "\set ENCODING 'UTF8';" is
equivalent, but this does nothing in my case.)

But I actually don't want that. I want psql not to attempt any charset
conversion, and just give me the raw text as it is stored in the db. It seems
to do this when \x is set. But it seems that it needs to compute the length
of the strings, and for this it uses the string routines of
its own locale.

I'm still uncertain whether this is the expected behaviour.


--
Hernán J. González
http://hjg.com.ar/

Re: psql weird behaviour with charset encodings

From
Tom Lane
Date:
hernan gonzalez <hgonzalez@gmail.com> writes:
> But I actually don't want that. I want psql not to attempt any charset
> conversion, and just give me the raw text as it is stored in the db.

Well, that's what it's doing (given the default setting with
client_encoding equal to server_encoding), and then xterm is
misdisplaying the text because xterm thinks it's utf8.  I'm
not very clear on why the \x case seems to display correctly
anyway, but you really need the terminal's encoding to agree
with client_encoding in order to get non-ASCII characters to
work well.

            regards, tom lane

Re: psql weird behaviour with charset encodings

From
hernan gonzalez
Date:
Mmm, no: \x displays correctly for me because psql sends
the raw text (in LATIN9) and I have set my terminal to LATIN9 (ISO-8859-15).

And it's not that "xterm is misdisplaying" the text; psql
is outputting
an EMPTY (zero-length) string for that field.
(I can even send the output to a file with \o and inspect it in hexadecimal
to re-verify that it's not a terminal problem. It's not.)

The issue is that psql (apparently) tries to convert to UTF8
(even when it plans to output the raw text, LATIN9 in this case)
just to compute the length of the field when building the table,
and for this computation it (apparently) relies on the string
routines with its own locale, instead of the DB or client encoding.

That's why no problem arises with the \x switch.

I believe this is wrong, because when the client does not specify
an encoding, no conversion should be attempted.

Hernán

--
Hernán J. González
http://hjg.com.ar/

Re: psql weird behaviour with charset encodings

From
Tom Lane
Date:
hernan gonzalez <hgonzalez@gmail.com> writes:
> The issue is that psql (apparently) tries to convert to UTF8
> (even when it plans to output the raw text, LATIN9 in this case)
> just to compute the length of the field when building the table,
> and for this computation it (apparently) relies on the string
> routines with its own locale, instead of the DB or client encoding.

I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations.  But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

                fprintf(fout, "%.*s", bytes_to_output,
                        this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print.  However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.  I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero.  I think we're going to have to
change the code to not rely on %.*s here.  Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte.  (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

            regards, tom lane
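
One way to avoid relying on %.*s, sketched here under assumptions: hand the already-computed byte count to fwrite, which copies bytes without interpreting them in any locale. The helper name write_bytes is hypothetical and is not taken from the psql source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not psql's actual code): write exactly nbytes bytes
 * from buf to fout. fwrite copies raw bytes, so the data is never parsed
 * as multibyte characters of the current LC_CTYPE encoding.
 * Returns 0 on success, -1 if fewer bytes than requested were written.
 */
int
write_bytes(FILE *fout, const char *buf, size_t nbytes)
{
    assert(fout != NULL && buf != NULL);
    return fwrite(buf, 1, nbytes, fout) == nbytes ? 0 : -1;
}
```

With such a helper, the fprintf call quoted above might become something like write_bytes(fout, this_line->ptr + bytes_output[j], bytes_to_output), assuming bytes_to_output is non-negative at that point.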

Re: psql weird behaviour with charset encodings

From
hgonzalez@gmail.com
Date:
> However, it appears that glibc's printf
> code interprets the parameter as the number of *characters* to print,
> and to determine what's a character it assumes the string is in the
> environment LC_CTYPE's encoding.

Well, I have trouble believing that myself :-)
This would be nasty... Are you sure?

I couldn't reproduce that.
I made a quick test, passing a UTF-8 encoded string
(5 bytes corresponding to 4 Unicode chars: "niño"),
and my glibc (same Fedora 12) seems to count bytes,
as it should.

#include <stdio.h>

int main(void) {
        char s[] = "ni\xc3\xb1o";
        printf("|%.*s|\n", 5, s);
        return 0;
}

This, compiled with gcc 4.4.3 and run with my root locale (UTF-8),
did not pad a blank, i.e. it worked as expected.

Hernán

Re: psql weird behaviour with charset encodings

From
hernan gonzalez
Date:
Sorry about an error in my previous example (I mixed up width and precision).
But the conclusion is the same: it works on bytes.

#include <stdio.h>

int main(void) {
        char s[] = "ni\xc3\xb1o"; /* 5 bytes, 4 UTF-8 chars */
        printf("|%*s|\n", 6, s);  /* this should pad a blank */
        printf("|%.*s|\n", 4, s); /* this should eat a char */
        return 0;
}

[root@myserv tmp]#  ./a.out | od -t cx1
0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a


Hernán
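
The byte-versus-character distinction in the test above can be made explicit with a small sketch. The function names are hypothetical, and the expected strings hold when the program stays in the default "C" locale, i.e. when setlocale is never called.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes s is valid UTF-8; strlen(s) gives the byte count instead.
 */
size_t
utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/* "%*s": width is a MINIMUM field width; shorter strings are padded. */
int
format_width(char *out, size_t outsize, int width, const char *s)
{
    return snprintf(out, outsize, "|%*s|", width, s);
}

/* "%.*s": precision is a MAXIMUM; longer strings are cut off. */
int
format_precision(char *out, size_t outsize, int precision, const char *s)
{
    return snprintf(out, outsize, "|%.*s|", precision, s);
}
```

For "ni\xc3\xb1o", strlen reports 5 while utf8_codepoints reports 4. In the "C" locale a width of 6 pads a single blank and a precision of 4 drops only the final "o", which matches the od dump above: both are measured in bytes.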



On Fri, May 7, 2010 at 10:48 PM,  <hgonzalez@gmail.com> wrote:
>> However, it appears that glibc's printf
>> code interprets the parameter as the number of *characters* to print,
>> and to determine what's a character it assumes the string is in the
>> environment LC_CTYPE's encoding.
>
> Well, I have trouble believing that myself :-)
> This would be nasty... Are you sure?
>
> I couldn't reproduce that.
> I made a quick test, passing a UTF-8 encoded string
> (5 bytes corresponding to 4 Unicode chars: "niño"),
> and my glibc (same Fedora 12) seems to count bytes,
> as it should.
>
> #include <stdio.h>
>
> int main(void) {
>         char s[] = "ni\xc3\xb1o";
>         printf("|%.*s|\n", 5, s);
>         return 0;
> }
>
> This, compiled with gcc 4.4.3 and run with my root locale (UTF-8),
> did not pad a blank, i.e. it worked as expected.
>
> Hernán

Re: psql weird behaviour with charset encodings

From
Tom Lane
Date:
hernan gonzalez <hgonzalez@gmail.com> writes:
> Sorry about an error in my previous example (I mixed up width and precision).
> But the conclusion is the same: it works on bytes.

This example works like that because it's running in C locale always.
Try something like this:

#include<stdio.h>
#include<locale.h>

int main () {
        char s[] = "ni\xc3qo"; /* 5 bytes , not valid utf8 */

        setlocale(LC_ALL, "");
        printf("|%.*s|\n",3,s);
        return 0;
}


I get different (and undesirable) effects depending on LANG.

            regards, tom lane
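
A variation on the test above, offered as a sketch rather than a diagnosis: the C standard lets the printf family report an output or encoding error through a negative return value, so checking that value is one way to notice a call that printed nothing. The helper name and the -2 sentinel are hypothetical. Which locales are installed, and how a given glibc version treats invalid bytes, are system-dependent.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: switch the process to the named locale, then format
 * at most nbytes bytes of s into out with "%.*s".
 * Returns snprintf's result (negative on an output or encoding error),
 * or -2 if the locale is not available on this system.
 * Note: this changes the locale of the whole process.
 */
int
format_prefix_in_locale(const char *locale_name, char *out, size_t outsize,
                        int nbytes, const char *s)
{
    assert(locale_name != NULL && out != NULL && s != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -2;
    return snprintf(out, outsize, "%.*s", nbytes, s);
}
```

Calling it with "C" and then with a UTF-8 locale name such as "en_US.UTF-8" (assuming that locale is installed) on the string "ni\xc3qo" and comparing the two return values shows whether the local libc treats the precision as a plain byte count in both cases.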

Re: psql weird behaviour with charset encodings

From
hernan gonzalez
Date:
Wow, you are right, this is bizarre...

And it's not that glibc intends to compute the length in Unicode chars;
it actually counts bytes (plain C chars), as it should, when computing
field widths...
But, for some strange reason, when some width calculation is involved
it tries to parse the char[] using the locale encoding (when there's no point
in doing so!) and, if that fails, it silently truncates the printf output.
So it seems more like a glibc bug to me than an interpretation issue (bytes vs chars).
I posted some details on Stack Overflow:
http://stackoverflow.com/questions/2792567/printf-field-width-bytes-or-chars

BTW, I understand that PostgreSQL uses locale semantics in the server code.
But is this really necessary/appropriate on the client (psql) side?
Couldn't we stick
with the C locale here?

--
Hernán J. González
http://hjg.com.ar/

Re: psql weird behaviour with charset encodings

From
Tom Lane
Date:
hernan gonzalez <hgonzalez@gmail.com> writes:
> BTW, I understand that PostgreSQL uses locale semantics in the server code.
> But is this really necessary/appropriate on the client (psql) side?
> Couldn't we stick with the C locale here?

As far as that goes, I think we have to turn on that machinery in order
to have gettext() work (ie, to have localized error messages).

            regards, tom lane