Re: plperlu problem with utf8

From David Christensen
Subject Re: plperlu problem with utf8
Msg-id AB507853-2325-4B0A-AEE1-A989436BF021@endpoint.com
In response to Re: plperlu problem with utf8  ("David E. Wheeler" <david@kineticode.com>)
Responses Re: plperlu problem with utf8  (Alex Hunsaker <badalex@gmail.com>)
Re: plperlu problem with utf8  ("David E. Wheeler" <david@kineticode.com>)
List pgsql-hackers
On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:

> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>
>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>
>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>
> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.

This is a correct assertion as to Perl's behavior.  As far as PostgreSQL is (or should be) concerned in this case, this
is the correct handling for URI::Escape: the input string to the function was all ASCII (= valid UTF-8), so the function
author needs to be responsible for the proper conversion to the internal encoding.  This would just be a simple
decode_utf8() call in the case that you've URI-escaped UTF-8-encoded Unicode; however, since URI escaping is only
defined for octets, the meaning of those unescaped octets is detached from their encoding.  There are similar issues
with other modules which traditionally have not distinguished between characters and bytes (Digest::MD5, for example).
Digest::MD5 does not work on wide characters, as the algorithm only deals with octets, so you need to pick a target
encoding for wide characters and digest the encoded octets rather than the characters.
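
To make the Digest::MD5 point concrete, here is a minimal sketch using only core modules. The sample string and the
choice of UTF-8 as the target encoding are my own assumptions for illustration; any encoding that can represent the
characters would do.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# A character string containing a code point above 255 (U+263A).
my $chars = "r\x{e9}veillon\x{263a}";

# Digest::MD5 only understands octets, so handing it a wide character dies
# ("Wide character in subroutine entry"); eval traps that here.
my $direct_ok = eval { md5_hex($chars); 1 } ? 1 : 0;

# Pick a target encoding (UTF-8 here, an assumption) and digest the octets.
my $octets = encode_utf8($chars);
my $digest = md5_hex($octets);

print "direct call succeeded: $direct_ok\n";
print "octets: ", length($octets), ", md5: $digest\n";
```

The digest is a property of the octets, so two callers who pick different target encodings will get different digests
for the same characters.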

>> Im saying they are not, and if you want \xc3\xa9 to be treated as
>> chr(233) you need to tell perl what encoding the string is in (err
>> well actually decode it so its in "perl space" as unicode characters
>> correctly).
>
> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.

+1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments.
In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like
types.  If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion
regardless of data type.  For any other server_encoding, the data would need to be converted from the server_encoding
to UTF-8, presumably using the built-in conversions, before passing it off to the first code path.  Similar handling
would be needed for the return values, again datatype-dependent.
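
The three cases can be sketched in plain Perl. This is only an illustration of the intended semantics, not what plperl
does: the helper names arg_to_perl and perl_to_result are hypothetical, and the sketch uses Encode.pm where the real
thing would use the server's built-in conversions and SV macros in XS. The Encode names passed in are assumed to
correspond to the given server_encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: octets from the server -> what the function body sees.
sub arg_to_perl {
    my ($server_encoding, $encode_name, $octets) = @_;
    return $octets if $server_encoding eq 'SQL_ASCII';   # byte soup: do nothing
    return decode($encode_name, $octets);                # octets -> characters
}

# Hypothetical inverse for return values.
sub perl_to_result {
    my ($server_encoding, $encode_name, $value) = @_;
    return $value if $server_encoding eq 'SQL_ASCII';
    return encode($encode_name, $value);                 # characters -> octets
}

my $from_utf8  = arg_to_perl('UTF8',      'UTF-8',      "r\xc3\xa9veillon");
my $from_latin = arg_to_perl('LATIN1',    'iso-8859-1', "r\xe9veillon");
my $from_soup  = arg_to_perl('SQL_ASCII', undef,        "r\xc3\xa9veillon");

printf "utf8: %d chars, latin1: %d chars, sql_ascii: %d octets\n",
    length($from_utf8), length($from_latin), length($from_soup);
```

With both text paths decoded, the function body sees the same nine characters no matter which encoding the database
uses, while the SQL_ASCII path passes the ten octets through untouched.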

This certainly seems like it could be inefficient in the case that we're using a non-utf8 server_encoding (there are
people who do?? :-P); however, from the standpoint of correctness, any plperl function that deals with this data as
characters (using common things like regexes, length, ord, chr, substring, \w metachars) has the potential to operate
incorrectly when provided with data that is in a different encoding than perl's internal format.  One thought I had was
that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly
decode if we needed, but the problems with this are: a) there's no guarantee that Encode.pm will have an alias/support
for the specific server_encoding name as provided by Pg, and b) in the case of plperl (i.e., not u), there are
horrendous issues when trying to deal with Safe.pm and character encoding.  Recent upgrades of the Encode module
included with perl 5.10+ have caused issues wherein circular dependencies between Encode and Encode::Alias have made it
impossible to load in a Safe container without major pain.  (There may be some better options than I'd had on a
previous project, given that we're embedding our own interpreters and accessing more through the XS guts, so I'm not
ruling out this possibility completely.)
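
As a small illustration of the correctness concern, the following sketch runs the same character-level operations on
undecoded UTF-8 octets and on the decoded string. It assumes a plain perl with no locale or unicode_strings feature in
effect; the two-octet sample is my own.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $octets = "\xc3\xa9";            # UTF-8 encoding of U+00E9, left undecoded
my $chars  = decode_utf8($octets);  # one character in perl's internal format

# Run the same character-level operations on both forms.
my %undecoded = (
    length => length($octets),
    ord    => ord($octets),
    word   => ($octets =~ /^\w+$/ ? 1 : 0),
);
my %decoded = (
    length => length($chars),
    ord    => ord($chars),
    word   => ($chars =~ /^\w+$/ ? 1 : 0),
);

for my $op (qw(length ord word)) {
    print "$op: undecoded=$undecoded{$op} decoded=$decoded{$op}\n";
}
```

The undecoded form reports two "characters", a first code point of 195, and fails \w, which is exactly the kind of
silent misbehavior a function author would not expect.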

Perhaps we could set a function attribute or similar indicating that we want the input decoded properly (whether or not
this should be the default), or at the very least expose a function to the plperl[u]? runtime that would decode/upgrade
on demand without the caller of the function needing to know the encoding of the database it is running in.  This would
solve issue a), because any supported server_encoding would have an internal conversion to utf8, and would solve b),
because we're avoiding the conversion from inside Safe and simply running our XS function on the input data.  (As much
as I hate the ugliness of it, if we decide the decoding behavior shouldn't be the default, we could even use one of
those ugly function pragmas in the function bodies.)

>>> Maybe I'm misunderstanding, but it seems to me that:
>>>
>>> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
>>
>> Currently postgres has 2 behaviors:
>> 1) If the database is utf8, turn on the utf8 flag. According to the
>> perldoc snippet I quoted this should mean its a sequence of utf8 bytes
>> and should interpret it as such.
>
> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?

I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure the
parameters come in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify
the string-like types/text data.  (Perhaps any type which has a binary cast to text would be a sufficient definition
here.  Do domains automatically inherit binary casts from their originating types?)

>> 2) its not utf8, so we just leave it as octets.
>
> Which mean's Perl will assume that it's Latin-1, IIUC.

This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier.  This would produce bogus
results for any non-UTF-8, non-ASCII, non-latin-1 encoding, even if it does not bite most people in general usage.

>> So in "perl space" length($_[0]) returns the number of characters when
>> you pass in a multibyte char *not* the number of bytes.  Which is
>> correct, so um check we do that.  Right?
>
> Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:
>
>    CREATE OR REPLACE FUNCTION perlgets(
>        TEXT
>    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
>       my $text = shift;
>       return_next {
>           length  => length $text,
>           is_utf8 => utf8::is_utf8($text) ? 1 : 0
>       };
>    $$;
>
> In a utf-8 database:
>
>    utf8=# select * from perlgets('foo');
>     length │ is_utf8
>    ────────┼─────────
>          8 │ t
>    (1 row)

This example seems bogus; wouldn't length be 3 if this is the example text this was run with?  Additionally, since all
ASCII is trivially UTF-8, I think a better example would use a string with hi-bit characters, so that if this was
improperly handled the lengths wouldn't match: length($all_ascii) == length(encode_utf8($all_ascii)) vs.
length($hi_bit) < length(encode_utf8($hi_bit)).  I don't see that this test shows us much with the test case as given.
The is_utf8() function merely returns the state of the SvUTF8 flag, which doesn't speak to UTF-8 validity (i.e., this
need not be set on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate whether there
are hi-bit characters in the string (i.e., given encode_utf8($hi_bit_string), the source string $hi_bit_string (in
perl's internal format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will
not, even though the underlying data, as represented in perl, will be identical).
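
A sketch of the comparison I have in mind, using only Encode from core. The sample strings are my own; the ASCII one is
a plain literal in a script without "use utf8".

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $all_ascii = "foo";                              # plain literal, flag off
my $hi_bit    = decode_utf8("r\xc3\xa9veillon");    # characters, flag on
my $encoded   = encode_utf8($hi_bit);               # octets again, flag off

# ASCII: character count and octet count agree, so it proves nothing.
printf "ascii:   chars=%d octets=%d flag=%d\n",
    length($all_ascii), length(encode_utf8($all_ascii)),
    utf8::is_utf8($all_ascii) ? 1 : 0;

# Hi-bit: the counts differ, and the flag differs between the two forms.
printf "hi-bit:  chars=%d flag=%d\n", length($hi_bit),  utf8::is_utf8($hi_bit)  ? 1 : 0;
printf "encoded: octets=%d flag=%d\n", length($encoded), utf8::is_utf8($encoded) ? 1 : 0;
```

The flag tracks how perl is storing the scalar, not whether the content is valid UTF-8, so a test built on is_utf8()
alone cannot tell you whether the argument was decoded correctly.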

> In a latin-1 database:
>
>    latin=# select * from perlgets('foo');
>     length │ is_utf8
>    ────────┼─────────
>          8 │ f
>    (1 row)
>
> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.

See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case) data
that has not been decoded from the server_encoding to perl's internal format, this would be exactly the expectation for
the state of that flag.

> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do
no encoding on the data itself, which would mean that the utf8 flag for that variable would be unset.  If this is not
currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no character
semantics attached to it.

>> In the URI::Escape example we have:
>>
>> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>>  use URI::Escape;
>>  warn(length($_[0]));
>>  return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>>
>> # select url_decode('comment%20passer%20le%20r%C3%A9veillon');
>> WARNING: 38 at line 2
>
> What's the output? And what's the encoding of the database?
>
>> Ok that length looks right, just for grins lets try add one multibyte char:
>>
>> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
>> WARNING:  39 CONTEXT:  PL/Perl function "url_decode" at line 2.
>>         url_decode
>> -------------------------------
>> comment passer le réveillon☺
>> (1 row)
>>
>> Still right,
>
> The length is right, but the é is wrong. It looks like Perl thinks it's latin-1. Or, rather, unescape_uri() doesn't know that it should be returning utf-8 characters. That *might* be a bug in URI::Escape.

I think this has been addressed by others in previous emails.

>> now lets try the utf8::decode version that "works".  Only
>> lets look at the length of the string we are returning instead of the
>> one we are passing in:
>>
>> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>>  use URI::Escape;
>>  utf8::decode($_[0]);
>>  my $str = uri_unescape($_[0]);
>>  warn(length($str));
>>  return $str;
>> $$ LANGUAGE plperlu;
>>
>> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
>> WARNING:  28 at line 5.
>> CONTEXT:  PL/Perl function "url_decode"
>>        url_decode
>> -----------------------------
>> comment passer le réveillon
>> (1 row)
>>
>> Looks harmless enough...
>
> Looks far better, in fact. Interesting that URI::Escape does the right thing only if the utf8 flag has been turned on in the string passed to it. But in Perl it usually won't be, because the encoded string should generally have only ASCII characters.

I think you'll find that this "correct display" is actually an artifact of your terminal being set to a
UTF-8-compatible encoding and interpreting the raw output as a UTF-8 sequence in its display; the returned count is
actually the number of octets.  Compare:

$ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
28

$ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
27


>> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
>> WARNING:  28 at line 5.
>> CONTEXT:  PL/Perl function "url_decode"
>> length
>> --------
>>    27
>> (1 row)
>>
>> Wait a minute... those lengths should match.
>>
>> Post patch they do:
>> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
>> WARNING:  28 at line 5.
>> CONTEXT:  PL/Perl function "url_decode"
>> length
>> --------
>>    28
>> (1 row)
>>
>> Still confused? Yeah me too.
>
> Yeah…

As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version
is 28.  I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle
code points > 255 in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8
first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8
into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>> Maybe this will help:
>>
>> #!/usr/bin/perl
>> use URI::Escape;
>> my $str = uri_unescape("%c3%a9");
>> die "first match" if($str =~ m/\xe9/);
>> utf8::decode($str);
>> die "2nd match" if($str =~ m/\xe9/);
>>
>> gives:
>> $ perl t.pl
>> 2nd match at t.pl line 6.
>>
>> see? Either uri_unescape() should be decoding that utf8() or you need
>> to do it *after* you call uri_unescape().  Hence the maybe it could be
>> considered a bug in uri_unescape().
>
> Agreed.

-1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.
Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially:
sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift)) }.  This is definitely not a bug in uri_unescape, as
it is only defined to return octets.
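
Spelled out as a runnable sketch: to avoid a CPAN dependency, unescape_octets below is a hand-rolled stand-in for
uri_unescape (an assumption on my part that plain percent-decoding is all we need here), and unescape_utf8 is the
hypothetical wrapper from the paragraph above.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for URI::Escape::uri_unescape: percent-decode to octets only.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Hypothetical wrapper: unescape first, then decode the octets as UTF-8.
sub unescape_utf8 { decode_utf8(unescape_octets(shift)) }

my $in     = 'comment%20passer%20le%20r%C3%A9veillon';
my $octets = unescape_octets($in);
my $chars  = unescape_utf8($in);

print length($octets), " octets, ", length($chars), " characters\n";
print "octets match \\xe9: ", ($octets =~ /\xe9/ ? "yes" : "no"), "\n";
print "chars match \\xe9:  ", ($chars  =~ /\xe9/ ? "yes" : "no"), "\n";
```

The decode step belongs to the caller because only the caller knows which encoding the escaped octets were in.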

>>> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
>>> I didn't really follow all of the above; are you aiming for the same thing?
>>
>> Yeah, the patch address this part.  Right now we just spit out
>> whatever the internal format happens to be.
>
> Ah, excellent.

I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return
(-> server_encoding).  This should be for any character-type data insofar as it is feasible, but ISTR there is already
datatype-specific marshaling occurring.

>> Anyway its all probably clear as mud, this part of perl is one of the
>> hardest IMO.
>
> No question.


There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a
few things up.

Regards,

David
--
David Christensen
End Point Corporation
david@endpoint.com





