Обсуждение: BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table
BUG #14038: substring cuts unicode char in half, allowing to save broken utf8 into table
От
rpegues@tripwire.com
Дата:
The following bug has been logged on the website: Bug reference: 14038 Logged by: Reece Email address: rpegues@tripwire.com PostgreSQL version: 9.5.1 Operating system: Centos6 Description: We have a table with an update trigger where if you modify a certain column, we change the name of the row by calling a function. In the function, substring() the name and then add a random string to that. However, the substring appears to cut a unicode character in half, and the update trigger then updates the name with the broken string. After this, just doing a select * on the table returns: invalid byte sequence for encoding "UTF8": 0xe5 0x5b 0x44 I don't think the update trigger should allow saving an invalid utf8 string to the table for one, but the substring() cutting the unicode character in half also might be a bug? REPRODUCE: run the sql below. CREATE OR REPLACE FUNCTION public.deleted_name(text, integer) RETURNS text AS $BODY$ declare v_old_name alias for $1; v_max_length alias for $2; l_new_name text; l_tag varchar(16); l_tag_length integer := 0; begin l_tag := '[DEL ' || int4(extract(epoch from now())) || ']'; l_tag_length := length(l_tag); l_new_name := substring(v_old_name, 1, v_max_length - l_tag_length) || l_tag; return l_new_name; end; $BODY$ LANGUAGE plpgsql VOLATILE COST 100; ALTER FUNCTION public.deleted_name(text, integer) OWNER TO postgres; CREATE OR REPLACE FUNCTION public.test_bug_update() RETURNS trigger AS $BODY$ declare begin if ( old.deleted = 'f' and new.deleted = 't') then new.name := deleted_name(new.name, 64); end if; return new; end; $BODY$ LANGUAGE plpgsql VOLATILE COST 100; ALTER FUNCTION public.nc_scan_profile_update() OWNER TO postgres; DROP TABLE test_bug; CREATE TABLE public.test_bug ( id bigint NOT NULL, name character varying(256) NOT NULL, deleted boolean NOT NULL DEFAULT false ); ALTER TABLE public.test_bug OWNER TO postgres; GRANT ALL ON TABLE public.test_bug TO postgres; CREATE TRIGGER test_bug_trigger BEFORE UPDATE ON public.test_bug FOR EACH ROW EXECUTE PROCEDURE public.test_bug_update(); INSERT INTO test_bug (id, name) values (1, 'JST Standard Scan Profile-å¤é¨æ©é¢ç¨æ å ±å ±æã·ã¹ãã '); UPDATE test_bug SET deleted=true where id=1; SELECT * FROM test_bug;
rpegues@tripwire.com writes: > We have a table with an update trigger where if you modify a certain column, > we change the name of the row by calling a function. > In the function, substring() the name and then add a random string to that. > However, the substring appears to cut a unicode character in half, and the > update trigger then updates the name with the broken string. That should not happen if Postgres knows it's dealing with unicode data. What have you got the database's encoding set to? regards, tom lane
Looks like the database is created with ENCODING =3D 'SQL_ASCII' So I assume it was thus saving the data that way, and then if the client encoding is utf8 it tried to encode to that and failed? -Reece On 3/21/16, 12:46 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: >rpegues@tripwire.com writes: >> We have a table with an update trigger where if you modify a certain >>column, >> we change the name of the row by calling a function. >> In the function, substring() the name and then add a random string to >>that.=20 >> However, the substring appears to cut a unicode character in half, and >>the >> update trigger then updates the name with the broken string. > >That should not happen if Postgres knows it's dealing with unicode data. >What have you got the database's encoding set to? > > regards, tom lane
Reece Pegues <RPegues@tripwire.com> writes: > Looks like the database is created with ENCODING = 'SQL_ASCII' Basically what that does is defeats all encoding checks inside the backend; it'll store whatever bytes you give it. So yeah, substring() is expected to deal in bytes not characters in this encoding. > So I assume it was thus saving the data that way, and then if the client > encoding is utf8 it tried to encode to that and failed? If client declares its encoding, the backend will verify correct encoding before transmitting data; but if the database encoding is SQL_ASCII then no actual conversion happens, only a validity check at transmit/receive. regards, tom lane
I see, thanks Tom!=20 On 3/21/16, 5:04 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: >Reece Pegues <RPegues@tripwire.com> writes: >> Looks like the database is created with ENCODING =3D 'SQL_ASCII' > >Basically what that does is defeats all encoding checks inside the >backend; it'll store whatever bytes you give it. So yeah, substring() >is expected to deal in bytes not characters in this encoding. > >> So I assume it was thus saving the data that way, and then if the client >> encoding is utf8 it tried to encode to that and failed? > >If client declares its encoding, the backend will verify correct encoding >before transmitting data; but if the database encoding is SQL_ASCII then >no actual conversion happens, only a validity check at transmit/receive. > > regards, tom lane