Обсуждение: Migration error " invalid byte sequence for encoding "UTF8": 0xff " from mysql 5.5 to postgresql 9.1
When I do migration from Mysql to PostgreSQL:
			
		firstly dump data from mysql in cmd(encoding is GBK) is WIN8:
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql --default-character-set=utf8 --skip-add-locks --compact --no-create-info --skip-quote-names -uroot -p test >dbdata.sql
then load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost  -d test -U postgres <  dbdata.sql
I got the error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xff
I checked in dbdata.sql using UltraEdit, data "0xff" really exists as followings:
蛾4?4抿\0xfF???±??x¤?
and I use UltraEdit menu-> view -> encoding page to get: 936 GBK.
DDL in Mysql 5.5 is:
CREATE TABLE `personpicture` (
  `ID` char(32) NOT NULL,
  `Picture` mediumblob,
...
  KEY `personId` (`PersonID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DDL in PostgreSQL 9.1 is:
create database test encoding 'utf8';
CREATE TABLE personpicture (
  ID char(32) NOT NULL,
  Picture BYTEA,
  ....
  PRIMARY KEY (ID)
);
The error is related to Picture mediumblob and BYTEA, which is stored pic binary data.
My Operationg system is WIN8 64bit.
I've set postgresql.conf before starting everything, but error still occurs:
backslash_quote = on
escape_string_warning = off
standard_conforming_strings = off
How to resolve it?
Thanks!
peng
sunpeng <bluevaley@gmail.com> wrote: > load data to postgresql in cmd(encoding is GBK) is WIN8: > psql -h localhost -d test -U postgres < dbdata.sql > > I got the error: > ERROR: invalid byte sequence for encoding "UTF8": 0xff If the encoding is GBK then you will get errors (or incorrect characters) if it is read as UTF8. Try setting the environment variable PGCLIENTENCODING. http://www.postgresql.org/docs/9.1/static/app-psql.html -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.
And i use the following cmd to dump mysql data:
mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4'  into outfile "d:\\1.txt" ;
I got the ansi file, and use Ultraedit to see first 16 bytes:
FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
It's different from mysql workbench to see:
FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01
peng
 On Tue, Jul 1, 2014 at 9:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
sunpeng <bluevaley@gmail.com> wrote:If the encoding is GBK then you will get errors (or incorrect
> load data to postgresql in cmd(encoding is GBK) is WIN8:
> psql -h localhost -d test -U postgres < dbdata.sql
>
> I got the error:
> ERROR: invalid byte sequence for encoding "UTF8": 0xff
characters) if it is read as UTF8. Try setting the environment
variable PGCLIENTENCODING.
http://www.postgresql.org/docs/9.1/static/app-psql.html
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
sunpeng wrote: >>> load data to postgresql in cmd(encoding is GBK) is WIN8: >>> >>> psql -h localhost -d test -U postgres < dbdata.sql >>> >>> I got the error: >>> ERROR: invalid byte sequence for encoding "UTF8": 0xff >> If the encoding is GBK then you will get errors (or incorrect >> characters) if it is read as UTF8. Try setting the environment >> variable PGCLIENTENCODING. >> >> http://www.postgresql.org/docs/9.1/static/app-psql.html > I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs. > And i use the following cmd to dump mysql data: > mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4' into outfile > "d:\\1.txt" ; > I got the ansi file, and use Ultraedit to see first 16 bytes: > FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C > > It's different from mysql workbench to see: > FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01 Changing the terminal code page won't do anything, it's probably the data that are in a different encoding. I don't know enough about MySQL to know which encoding it uses when dumping data, but the man page of "mysqldump" tells me: --set-charset Add SET NAMES default_character_set to the output. This option is enabled by default. So is there a SET NAMES command in the dump? If yes, what is the argument? You will have to tell PostgreSQL the encoding of the data. As Kevin pointed out, you can do that by setting the environment variable PGCLIENT ENCODING to the correct value. Then PostgreSQL will convert the data automatically. Yours, Laurenz Albe
Thank you, friend, I use  --hex-blob :
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql --default-character-set=utf8 --skip-add-locks --compact --no-create-info --skip-quote-names --hex-blob -uroot -p test videorecresult >dbdata.sql
to dump mysql data.
And replace blob data "0x...." into "E'\\xx....'" to load data into postgresql.
On Fri, Jul 4, 2014 at 3:27 PM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
sunpeng wrote:
>>> load data to postgresql in cmd(encoding is GBK) is WIN8:
>>>
>>> psql -h localhost -d test -U postgres < dbdata.sql
>>>
>>> I got the error:
>>> ERROR: invalid byte sequence for encoding "UTF8": 0xff
>> If the encoding is GBK then you will get errors (or incorrect
>> characters) if it is read as UTF8. Try setting the environment
>> variable PGCLIENTENCODING.
>>
>> http://www.postgresql.org/docs/9.1/static/app-psql.html> I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.Changing the terminal code page won't do anything, it's probably the data
> And i use the following cmd to dump mysql data:
> mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4' into outfile
> "d:\\1.txt" ;
> I got the ansi file, and use Ultraedit to see first 16 bytes:
> FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
>
> It's different from mysql workbench to see:
> FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01
that are in a different encoding.
I don't know enough about MySQL to know which encoding it uses when dumping data,
but the man page of "mysqldump" tells me:
--set-charset
Add SET NAMES default_character_set to the output. This option is enabled by default.
So is there a SET NAMES command in the dump? If yes, what is the argument?
You will have to tell PostgreSQL the encoding of the data.
As Kevin pointed out, you can do that by setting the environment variable
PGCLIENT ENCODING to the correct value. Then PostgreSQL will convert the
data automatically.
Yours,
Laurenz Albe
On 7/4/2014 2:12 AM, sunpeng wrote: > Thank you, friend, I use --hex-blob : > mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql > --default-character-set=utf8 --skip-add-locks --compact > --no-create-info --skip-quote-names --hex-blob -uroot -p test > videorecresult >dbdata.sql > to dump mysql data. > And replace blob data "0x...." into "E'\\xx....'" to load data into > postgresql. regardless of all that, 0xFF is not a valid UTF8 character code. perhaps you should store the data in a postgres BYTEA, or at least use character encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.) -- john r pierce 37N 122W somewhere on the middle of the left coast
John R Pierce wrote: > On 7/4/2014 2:12 AM, sunpeng wrote: >> Thank you, friend, I use --hex-blob : >> mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql >> --default-character-set=utf8 --skip-add-locks --compact >> --no-create-info --skip-quote-names --hex-blob -uroot -p test >> videorecresult >dbdata.sql >> to dump mysql data. >> And replace blob data "0x...." into "E'\\xx....'" to load data into >> postgresql. > regardless of all that, 0xFF is not a valid UTF8 character code. perhaps > you should store the data in a postgres BYTEA, or at least use character > encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.) Exactly. According to mysqldump's man page, the affected field must be BINARY, VARBINARY, the BLOB types or BIT. For these PostgreSQL's "bytea" would definitely be the correct data type, and there wouldn't be any encoding problems with that. Yours, Laurenz Albe