Re: Careful PL/Perl Release Not Required

From: David E. Wheeler
Subject: Re: Careful PL/Perl Release Not Required
Msg-id: 9C246AAB-E67A-4F23-AA5E-66B59BC4F876@kineticode.com
In reply to: Re: Careful PL/Perl Release Not Required  (Alex Hunsaker <badalex@gmail.com>)
Responses: Re: Careful PL/Perl Release Not Required  (Andrew Dunstan <andrew@dunslane.net>)
Re: Careful PL/Perl Release Not Required  (Alex Hunsaker <badalex@gmail.com>)
List: pgsql-hackers
On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> I'd like to quibble with you over this point if I may. :-)
> Per perldoc: JSON::XS
> "utf8" flag disabled
>           When "utf8" is disabled (the default), then
> "encode"/"decode" generate and expect Unicode strings ...
>
> So
> - If you are on < 9.1 and a utf8 database you want to pass
> utf8(false), as you have a Unicode string.

Right. That's what I realized yesterday, thanks to our exchange. I updated my code for that. The use of the term
"Unicode string" in the JSON::XS docs is really confusing, though. A scalar with the utf8 flag on is not a Unicode
string. It's Perl's representation of a string. It has no encoding (it's "decoded").

Like I said, the terminology is awful.

> - If you are on < 9.1 and on a non utf8 database you would want to
> pass utf8(false) as the string is *not* Unicode, its byte soup. Its in
> some _other_ encoding say EUC_JP. You would need to decode() it into
> Unicode first.

Or use utf8() or utf8(1). Then JSON::XS will decode it for you.

> - If you are on 9.1 and a utf8 database you still want to pass
> utf8(false) as the string is still unicode.
>
> - if you are on 9.1 and a non utf8 database you want to pass
> utf8(false) as the string is _now_ unicode.

Right.

> So... it seems you always want to pass false. The only case I can
> where you would want to pass true is you are on < 9.1 with a SQL_ASCII
> database and you know for a fact the string represents a utf8 byte
> sequence.
>
> Or am I missing something obvious?

Yes, that you can pass no value to utf8(), or a true value, and it will decode a UTF-8-encoded string for you.
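
To make the two modes concrete, here is a minimal sketch. It assumes the core JSON::PP module as a stand-in for JSON::XS (the two are documented to share the utf8() interface), and the variable names are made up for illustration. With utf8 enabled, decode() takes UTF-8 bytes; with it disabled, decode() takes an already-decoded string.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, used here as a stand-in for JSON::XS (assumption)

# UTF-8 encoded bytes for the JSON text ["é"]: the é is the two bytes 0xC3 0xA9.
my $bytes = qq{["\xc3\xa9"]};

# utf8 enabled: decode() expects UTF-8 bytes and decodes them itself.
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes)->[0];

# utf8 disabled (the default): decode() expects an already-decoded string,
# so the caller has to decode the bytes first.
my $chars = $bytes;
utf8::decode($chars) or die "input is not valid UTF-8";
my $from_chars = JSON::PP->new->utf8(0)->decode($chars)->[0];

printf "utf8(1) on bytes: length %d\n", length $from_bytes;
printf "utf8(0) on chars: length %d\n", length $from_chars;
```

Both paths should end up with the same one-character string; what differs is who does the decoding.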

>>> If you do have to change your semantics/functions, could you post an
>>> example? I'd like to make sure its because you were hitting one of
>>> those nasty corner cases and not something new is broken.
>>
>> I think that people who have non-utf-8 databases might be surprised.
>
> Yeah, surprised it does the right thing and its actually usable now ;).

Yes, but they might need to change their code, is what I'm saying.

>
>>>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:
>
>> No, he had to add the decode line, IIRC:
>>
>> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>>  use strict;
>>  use URI::Escape;
>>  utf8::decode($_[0]);
>>  return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>>
>> Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't
>> need to call utf8::decode(). That is, in a latin-1 database:
>
> Meh, no, not really. He will still need to call decode.

Why? In 9.1, won't params passed to PL/Perl functions in non-SQL_ASCII databases already be decoded?

> The problem is
> uri_unescape() does not assume an encoding on the URI. It could be
> UTF-16 encoded for all it knows (UTF-8 is probably standard, but thats
> not the point, it knows nothing about Unicode or encodings).

Yes, but if you don't want surprises, I think you want to pass a decoded string to it.

> For example, lets say you have a latin-1 accented e "é" the byte
> sequence is the one byte: 0xe9. If you were to uri_escape that you get
> the 3 byte ascii string "%E9":
> $ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
> %E9
>
> If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: e9, len: 1
>
> What if we want to uri_escape a UTF-16 accented e? Thats two hex bytes 0x00e9:
> $ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
> %00%E9
>
> What happens we uri_unescape that? Do we get back a Unicode string
> that has one character? No. And why should we? How is uri_unescape
> supposed to know what %00%E9 represent? All it knows is thats 2
> separate bytes:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: 00e9, len: 2

Yeah, this is why URI::Escape needs a uri_unescape_utf8() function to complement uri_escape_utf8(). But to get around
that, you would of course decode the return value yourself.
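
A minimal sketch of that workaround, using only core Perl. The unescape_bytes() helper is hypothetical and merely imitates what uri_unescape() does (each %XX becomes one byte), so the example runs without URI::Escape installed:

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape::uri_unescape(): each %XX becomes one byte.
sub unescape_bytes {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

my $raw = unescape_bytes("%C3%A9");   # two bytes: 0xC3 0xA9

# Decode the result yourself: interpret those bytes as UTF-8.
my $decoded = $raw;
utf8::decode($decoded) or die "result is not valid UTF-8";

printf "raw:     hex %s, length %d\n", unpack("H*", $raw), length $raw;
printf "decoded: U+%04X, length %d\n", ord($decoded), length $decoded;
```

The raw result is two bytes long; after utf8::decode() it is the single character U+00E9.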

> Now, lets say you want to uri_escape a utf8 accented e, thats the two
> byte sequence: 0xc3 0xa9:
> $ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
> %C3%A9
>
> Ok, what happens when we uri_unescape those?:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: c3a9, len: 2
>
> So, plperl will also return 2 characters here.
>
> In the cited case he was passing "%C3%A9" to uri_unescape() and
> expecting it to return 1 character. The additional utf8::decode() will
> tell perl the string is in utf8 so it will then return 1 char. The
> point being, decode is needed and with it, the function will work pre
> and post 9.1.

Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was
utf-8, and should be in 9.1 if the database isn't sql_ascii?

> In-fact on a latin-1 database it sure as heck better return two
> characters, it would be a bug if it only returned 1 as that would mean
> it would be treating a series of latin1 bytes as a series of utf8
> bytes!

If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's
Perl's internal representation.

If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Best,

David


