Re: plperlu problem with utf8

From: Alex Hunsaker
Subject: Re: plperlu problem with utf8
Msg-id AANLkTimZXAk3RVC7tSXkMtUCR9shYVAC7=uyPnwvQZK+@mail.gmail.com
In response to: Re: plperlu problem with utf8  ("David E. Wheeler" <david@kineticode.com>)
Responses: Re: plperlu problem with utf8  ("David E. Wheeler" <david@kineticode.com>)
List: pgsql-hackers
On Thu, Dec 16, 2010 at 20:24, David E. Wheeler <david@kineticode.com> wrote:
> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>
>> You might argue this is a bug with URI::Escape as I *think* all uri's
>> will be utf8 encoded.  Anyway, I think postgres is doing the right
>> thing here.
>
> No, URI::Escape is fine. The issue is that if you don't decode text to
> Perl's internal form, it assumes that it's Latin-1.

So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

I'm saying they are not, and if you want \xc3\xa9 to be treated as
chr(233) you need to tell perl what encoding the string is in (err,
well, actually decode it so it's in "perl space" as proper unicode
characters).
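
A minimal sketch of that point in plain Perl, using only the core Encode module (the variable names are illustrative, not from the thread): the two-octet string and chr(233) only compare equal once the octets have been decoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xc3\xa9";
my $raw_length = length($bytes);

# Without decoding, Perl treats each octet as one Latin-1 character,
# so the string is two characters long and is not equal to "\xe9".
my $raw_equal = ($bytes eq "\xe9") ? 1 : 0;

# Decoding a copy turns the two octets into one character, chr(233).
my $copy = $bytes;
my $chars = decode('UTF-8', $copy);
my $decoded_length = length($chars);
my $decoded_equal = ($chars eq chr(233)) ? 1 : 0;

printf "raw:     eq=%d length=%d\n", $raw_equal, $raw_length;
printf "decoded: eq=%d length=%d\n", $decoded_equal, $decoded_length;
```

Run as a standalone script, this should report eq=0 length=2 for the raw octets and eq=1 length=1 after decoding.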

> Maybe I'm misunderstanding, but it seems to me that:
>
> * String arguments passed to PL/Perl functions should be decoded from
>   the server encoding to Perl's internal representation before the
>   function actually gets them.

Currently postgres has 2 behaviors:
1) If the database is utf8, turn on the utf8 flag. According to the
perldoc snippet I quoted, this should mean it's a sequence of utf8 bytes
and perl should interpret it as such.
2) If it's not utf8, we just leave it as octets.

So in "perl space" length($_[0]) returns the number of characters when
you pass in a multibyte char *not* the number of bytes.  Which is
correct, so um check we do that.  Right?

In the URI::Escape example we have:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
  use URI::Escape; warn(length($_[0]));
  return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING: 38 at line 2

OK, that length looks right. Just for grins, let's try adding one multibyte char:

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
WARNING:  39 at line 2.
CONTEXT:  PL/Perl function "url_decode"
          url_decode
-------------------------------
 comment passer le réveillon☺
(1 row)

Still right. Now let's try the utf8::decode version that "works".  Only
let's look at the length of the string we are returning instead of the
one we are passing in:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
  use URI::Escape;
  utf8::decode($_[0]);
  my $str = uri_unescape($_[0]);
  warn(length($str));
  return $str;
$$ LANGUAGE plperlu;

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
         url_decode
-----------------------------
 comment passer le réveillon
(1 row)

Looks harmless enough...

# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     27
(1 row)

Wait a minute... those lengths should match.

Post patch they do:
# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     28
(1 row)
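
One way to read the 27 versus 28, as a rough sketch outside PostgreSQL. It only imitates the two return paths with the core Encode module, and that imitation is an assumption rather than the actual PL/Perl code: handing the octets over untouched lets a UTF-8 database count 27 characters, while encoding Perl's 28 characters into UTF-8 first keeps the count at 28.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The 28 octets uri_unescape() hands back: "r", 0xC3, 0xA9, "veillon".
my $str = "comment passer le r\xc3\xa9veillon";
my $perl_length = length($str);

# Pre-patch imitation (assumption): the octets go out as they are, and
# a UTF-8 database reads 0xC3 0xA9 as a single character.
my $raw_octets = $str;
my $old_length = length(decode('UTF-8', $raw_octets));

# Post-patch imitation (assumption): each of Perl's 28 characters is
# encoded to UTF-8 first, so the database counts 28 characters too.
my $encoded = encode('UTF-8', $str);
my $new_length = length(decode('UTF-8', $encoded));

print "perl=$perl_length old=$old_length new=$new_length\n";
```

This should print perl=28 old=27 new=28, matching the warn() output and the two length() results above.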

Still confused? Yeah me too.  Maybe this will help:

#!/usr/bin/perl
use URI::Escape;
my $str = uri_unescape("%c3%a9");
die "first match" if($str =~ m/\xe9/);
utf8::decode($str);
die "2nd match" if($str =~ m/\xe9/);

gives:
$ perl t.pl
2nd match at t.pl line 6.

See? Either uri_unescape() should be decoding that utf8, or you need
to do it *after* you call uri_unescape().  Hence my remark that it could
maybe be considered a bug in uri_unescape().
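
The ordering can be sketched without URI::Escape installed. The percent_decode() helper below is a hypothetical stand-in for uri_unescape() (a bare regex substitution, not the module's code); the only point is that the utf8 decode has to come after the percent-decoding.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): turn each %XX into one octet.
sub percent_decode {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

my $input = 'r%C3%A9veillon';

# Wrong order: decoding first does nothing useful on pure ASCII, so the
# octets produced afterwards are never interpreted as UTF-8.
my $early = $input;
utf8::decode($early);
$early = percent_decode($early);

# Right order: percent-decode to octets, then decode those octets.
my $late = percent_decode($input);
utf8::decode($late);

printf "early=%d late=%d\n", length($early), length($late);
```

Decoding too early leaves the string 10 characters long (two separate characters for the accented letter); decoding afterwards gives 9, and only then does m/\xe9/ match.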

> * Values returned from PL/Perl functions that are in Perl's internal
>   representation should be encoded into the server encoding before
>   they're returned.
> I didn't really follow all of the above; are you aiming for the same thing?

Yeah, the patch addresses this part.  Right now we just spit out
whatever the internal format happens to be.

Anyway, it's all probably clear as mud; this part of perl is one of the
hardest IMO.

