pl/perl and utf-8 in sql_ascii databases

From: Christoph Berg
Subject: pl/perl and utf-8 in sql_ascii databases
Msg-id: 20120209102116.GA14429@msgid.df7cb.de
Replies: Re: pl/perl and utf-8 in sql_ascii databases  (Alex Hunsaker <badalex@gmail.com>)
List: pgsql-hackers

Hi,

we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup that you might see in
email headers from the internet). For this reason, the database uses
sql_ascii encoding. The columns are text, as most characters are
ascii, so bytea didn't seem the right way to go.

Currently we are on 8.3 and are trying to upgrade to 9.1, but the
plperlu functions we have are acting up.

Old behavior on 8.3 .. 9.0:

sql_ascii =# create or replace function whitespace(text) returns text
language plperlu as $$ $a = shift; $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace (E'\200'); -- 0x80 is not valid utf-8
 whitespace
------------

sql_ascii =# select whitespace (E'\200')::bytea;
 whitespace
------------
 \x80

New behavior on 9.1.2:

sql_ascii =# select whitespace (E'\200');
ERROR:  XX000: Malformed UTF-8 character (fatal) at line 1.
CONTEXT:  PL/Perl function "whitespace"
LOCATION:  plperl_call_perl_func, plperl.c:2037

A crude workaround is:

sql_ascii =# create or replace function whitespace_utf8_off(text)
returns text language plperlu as $$ use Encode; $a = shift;
Encode::_utf8_off($a); $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace_utf8_off (E'\200');
 whitespace_utf8_off
---------------------
 \u0080

sql_ascii =# select whitespace_utf8_off (E'\200')::bytea;
 whitespace_utf8_off
---------------------
 \xc280

(Note that the workaround is not perfect, as the resulting 0x80..0xff
bytes are still tagged as utf8.)
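
For illustration, here is a small standalone C sketch (not PostgreSQL or
Perl code; both function names are made up for this example). It shows
that a lone 0x80 byte is not well-formed UTF-8, and that treating the
byte as a Latin-1 code point and re-encoding it yields the two bytes
0xC2 0x80. That matches the \xc280 seen above, on the assumption that
this is the kind of upgrade applied to the returned string. The
validity check is deliberately simplified and does not reject overlong
forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the n bytes at s have well-formed UTF-8 lead/continuation
 * structure, else 0.  Simplified: overlong forms and surrogates are not
 * rejected.
 */
int
is_wellformed_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n)
    {
        unsigned char c = s[i];
        size_t extra;

        if (c < 0x80)
            extra = 0;              /* plain ASCII */
        else if ((c & 0xE0) == 0xC0)
            extra = 1;              /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            extra = 2;              /* 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            extra = 3;              /* 4-byte sequence */
        else
            return 0;               /* bare continuation byte or 0xF8..0xFF */

        if (extra > n - i - 1)
            return 0;               /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;           /* expected a continuation byte */

        i += extra + 1;
    }
    return 1;
}

/*
 * Re-encode n bytes as UTF-8, treating each input byte as a Latin-1
 * code point.  out must have room for 2 * n bytes.  Returns the number
 * of bytes written.
 */
size_t
latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (in[i] < 0x80)
            out[o++] = in[i];
        else
        {
            out[o++] = (unsigned char) (0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char) (0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}
```

Calling is_wellformed_utf8 on the single byte 0x80 returns 0, while
latin1_to_utf8 turns that byte into 0xC2 0x80, which does pass the
check.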


I think the bug is in plperl_helpers.h:

/*
 * Create a new SV from a string assumed to be in the current database's
 * encoding.
 */
static inline SV *
cstr2sv(const char *str)
{
    SV         *sv;
    char       *utf8_str = utf_e2u(str);

    sv = newSVpv(utf8_str, 0);
    SvUTF8_on(sv);

    pfree(utf8_str);

    return sv;
}

In sql_ascii databases, utf_e2u does not do any recoding, yet
SvUTF8_on still marks the string as utf-8 even though it isn't.
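
The direction of a possible fix can be sketched with a toy model in
standalone C. This is not the real plperl code: the struct, the enum
and the function name below are hypothetical stand-ins for an SV, the
database encoding and cstr2sv, and the recoding step for other
encodings is left out. The only point is that the utf-8 tag is skipped
when the database encoding is sql_ascii.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the database encoding. */
enum toy_encoding { TOY_SQL_ASCII, TOY_UTF8 };

/* Hypothetical stand-in for a Perl SV: a byte string plus a utf-8 tag. */
struct toy_sv
{
    const char *bytes;
    size_t      len;
    bool        utf8_flag;
};

/*
 * Toy version of cstr2sv: wrap the string, and tag it as utf-8 only if
 * the database encoding is something other than sql_ascii.  (Real code
 * would also recode the string to UTF-8 first; that is omitted here.)
 */
struct toy_sv
toy_cstr2sv(const char *str, enum toy_encoding db_encoding)
{
    struct toy_sv sv;

    assert(str != NULL);

    sv.bytes = str;
    sv.len = strlen(str);
    sv.utf8_flag = (db_encoding != TOY_SQL_ASCII);

    return sv;
}
```

With this rule, the byte string "\x80" coming from a sql_ascii database
stays untagged, so nothing downstream is told to interpret it as utf-8.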

(Returned values might also need fixing.)

In my view, this is clearly a bug in pl/perl on sql_ascii databases.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
