Обсуждение: BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table

Поиск
Список
Период
Сортировка

BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table

От
rpegues@tripwire.com
Дата:
The following bug has been logged on the website:

Bug reference:      14038
Logged by:          Reece
Email address:      rpegues@tripwire.com
PostgreSQL version: 9.5.1
Operating system:   Centos6
Description:

We have a table with an update trigger where if you modify a certain column,
we change the name of the row by calling a function.

In the function, substring() the name and then add a random string to that.
However, the substring appears to cut a unicode character in half, and the
update trigger then updates the name with the broken string.

After this, just doing a select * on the table returns:
invalid byte sequence for encoding "UTF8": 0xe5 0x5b 0x44

I don't think the update trigger should allow saving an invalid utf8 string
to the table for one, but the substring() cutting the unicode character in
half also might be a bug?

REPRODUCE:  run the sql below.



CREATE OR REPLACE FUNCTION public.deleted_name(text, integer)
RETURNS text AS
$BODY$
    declare
        v_old_name alias for $1;
        v_max_length alias for $2;
        l_new_name text;
        l_tag varchar(16);
        l_tag_length integer := 0;
begin
    l_tag := '[DEL ' || int4(extract(epoch from now())) || ']';
    l_tag_length := length(l_tag);
    l_new_name := substring(v_old_name, 1, v_max_length - l_tag_length) ||
l_tag;
    return l_new_name;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION public.deleted_name(text, integer)
  OWNER TO postgres;




CREATE OR REPLACE FUNCTION public.test_bug_update()
  RETURNS trigger AS
$BODY$
declare
begin
    if ( old.deleted = 'f' and new.deleted = 't') then
        new.name := deleted_name(new.name, 64);
    end if;
    return new;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION public.nc_scan_profile_update()
  OWNER TO postgres;




DROP TABLE test_bug;
CREATE TABLE public.test_bug
(
  id bigint NOT NULL,
  name character varying(256) NOT NULL,
  deleted boolean NOT NULL DEFAULT false
);
ALTER TABLE public.test_bug
  OWNER TO postgres;
GRANT ALL ON TABLE public.test_bug TO postgres;

CREATE TRIGGER test_bug_trigger
  BEFORE UPDATE
  ON public.test_bug
  FOR EACH ROW
  EXECUTE PROCEDURE public.test_bug_update();


INSERT INTO test_bug (id, name) values (1, 'JST Standard Scan
Profile-外部機関用情報共有システム');
UPDATE test_bug SET deleted=true where id=1;
SELECT * FROM test_bug;
rpegues@tripwire.com writes:
> We have a table with an update trigger where if you modify a certain column,
> we change the name of the row by calling a function.
> In the function, substring() the name and then add a random string to that.
> However, the substring appears to cut a unicode character in half, and the
> update trigger then updates the name with the broken string.

That should not happen if Postgres knows it's dealing with unicode data.
What have you got the database's encoding set to?

            regards, tom lane

Re: BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table

От
Reece Pegues
Дата:
Looks like the database is created with ENCODING =3D 'SQL_ASCII'

So I assume it was thus saving the data that way, and then if the client
encoding is utf8 it tried to encode to that and failed?

-Reece

On 3/21/16, 12:46 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>rpegues@tripwire.com writes:
>> We have a table with an update trigger where if you modify a certain
>>column,
>> we change the name of the row by calling a function.
>> In the function, substring() the name and then add a random string to
>>that.=20
>> However, the substring appears to cut a unicode character in half, and
>>the
>> update trigger then updates the name with the broken string.
>
>That should not happen if Postgres knows it's dealing with unicode data.
>What have you got the database's encoding set to?
>
>            regards, tom lane
Reece Pegues <RPegues@tripwire.com> writes:
> Looks like the database is created with ENCODING = 'SQL_ASCII'

Basically what that does is defeats all encoding checks inside the
backend; it'll store whatever bytes you give it.  So yeah, substring()
is expected to deal in bytes not characters in this encoding.

> So I assume it was thus saving the data that way, and then if the client
> encoding is utf8 it tried to encode to that and failed?

If client declares its encoding, the backend will verify correct encoding
before transmitting data; but if the database encoding is SQL_ASCII then
no actual conversion happens, only a validity check at transmit/receive.

            regards, tom lane

Re: BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table

От
Reece Pegues
Дата:
I see, thanks Tom!=20

On 3/21/16, 5:04 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>Reece Pegues <RPegues@tripwire.com> writes:
>> Looks like the database is created with ENCODING =3D 'SQL_ASCII'
>
>Basically what that does is defeats all encoding checks inside the
>backend; it'll store whatever bytes you give it.  So yeah, substring()
>is expected to deal in bytes not characters in this encoding.
>
>> So I assume it was thus saving the data that way, and then if the client
>> encoding is utf8 it tried to encode to that and failed?
>
>If client declares its encoding, the backend will verify correct encoding
>before transmitting data; but if the database encoding is SQL_ASCII then
>no actual conversion happens, only a validity check at transmit/receive.
>
>            regards, tom lane