Обсуждение: UTF8 frustrations

Поиск
Список
Период
Сортировка

UTF8 frustrations

От
jesse.waters@gmail.com
Дата:
Trying to do pg_restore from one UTF8 encoded db to another UTF8 encoded db
DB_source:
 Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
 psql 8.2.4

DB Destination:
 Debian GNU/Linux 4.0
 psql 8.1.9

I've tried:
  pg_dump from the source box & from destination box

  from destination server
  pg_dump -i -h source_server db > db.8.1.9.sql

No matter which way I issue pg_dump command it always fails on the same record.

ERROR:
pg_restore: ERROR:  invalid byte sequence for encoding "UTF8": 0xdf69
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY logs, line 69238382
pg_restore: [archiver (db)] error returned by PQendcopy: ERROR:
invalid byte sequence for encoding "UTF8": 0xdf69
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY logs, line 69238382


UTF8 to UTF8 and everything in between is UTF8, and it still fails,
  I'm  tempted to delete this record and keep going but I'm determined
to find a solution.

TIA for any help,

 Jesse Waters

Re: UTF8 frustrations

От
Michael Glaesemann
Дата:
On Sep 4, 2007, at 14:48 , jesse.waters@gmail.com wrote:

> Trying to do pg_restore from one UTF8 encoded db to another UTF8
> encoded db
> DB_source:
>  Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
>  psql 8.2.4
>
> DB Destination:
>  Debian GNU/Linux 4.0
>  psql 8.1.9


> UTF8 to UTF8 and everything in between is UTF8, and it still fails,
>   I'm  tempted to delete this record and keep going but I'm determined
> to find a solution.

(1) PostgreSQL does not support dump and restore from a new version
of PostgreSQL to an older version.

(2) There were changes to UTF8 handling between 8.1 and 8.2. I'm a
bit surprised you get an error like this as 8.1 is more lax than 8.2:
it accepts some invalid sequences as valid.

If you're really determined to do this, I recommend editing the dump
file by hand to remove or edit this record so it does not contain the
invalid sequence (as you've suggested).

Michael Glaesemann
grzm seespotcode net



Re: UTF8 frustrations

От
"Scott Marlowe"
Дата:
On 9/4/07, jesse.waters@gmail.com <jesse.waters@gmail.com> wrote:
> Trying to do pg_restore from one UTF8 encoded db to another UTF8 encoded db
> DB_source:
>  Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
>  psql 8.2.4
>
> DB Destination:
>  Debian GNU/Linux 4.0
>  psql 8.1.9
>
> I've tried:
>   pg_dump from the source box & from destination box
>
>   from destination server
>   pg_dump -i -h source_server db > db.8.1.9.sql

Are you issuing the dump and psql / restore command on the same machine?
As previously mentioned, dumping from newer to older is not supported.
 pg_dump from 8.1 might not understand the data structures it finds in
an 8.2 db, and pg_dump from 8.2 might create a dump that 8.1 doesn't
support.  Note that going the other way you are encouraged to use
pg_dump from 8.2 to dump the 8.1 database for importing to 8.2.

However, In this case I'm guessing that the problem is that you've got
different client encodings on each end.  i.e. you're dumping with one
encoding setting and restoring with another.  Note that pgsql
autoconverts from the server's encoding to the client's encoding at
the request of the client. so, you'll need to check your client
encoding from psql on the source and target machines to see if they
match.

show client_encoding ;

will tell you what your client encoding is.

Re: UTF8 frustrations

От
jesse.waters@gmail.com
Дата:
Thanks all for the quick replies.

Here is the latest issue, to verify that the pg_dump works, I'm going
to do dump and restore on the same host/cluster.

Source:
 DB_source:
 Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
 psql 8.2.4
Destination:
 same machine different db name

echo $LANG
en_US.UTF-8

SET client_encoding = 'UTF8';

Command used:
  pg_dump -Fc srcdb > db.dump
  pg_restore -d devdb db.dump

Results, same error. Now I'm really concerned.

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1625; 0 16680 TABLE
DATA logs watersj
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xdf69
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding
CONTEXT:  COPY logs, line 69238382
WARNING: errors ignored on restore: 1


Jesse Waters

Re: UTF8 frustrations

От
"Scott Marlowe"
Дата:
On 9/5/07, jesse.waters@gmail.com <jesse.waters@gmail.com> wrote:
> Thanks all for the quick replies.
>
> SET client_encoding = 'UTF8';
>
> Results, same error. Now I'm really concerned.

How / where are you setting the client_encoding?  I'm wonder if it's
really set when the pg_dump / pg_restore commands are being issued, or
if you're inheriting a different encoding along with the new
connection it gets.

You can try:

alter user set client_encoding='UTF8'

for the postgres user to see if that helps.

Re: UTF8 frustrations

От
"Albe Laurenz"
Дата:
jesse.waters@gmail.com wrote:
>
> Here is the latest issue, to verify that the pg_dump works, I'm going
> to do dump and restore on the same host/cluster.
>
> Source:
>  DB_source:
>  Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
>  psql 8.2.4
> Destination:
>  same machine different db name
>
> echo $LANG
> en_US.UTF-8
>
> SET client_encoding = 'UTF8';
>
> Command used:
>   pg_dump -Fc srcdb > db.dump
>   pg_restore -d devdb db.dump
>
> Results, same error. Now I'm really concerned.
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 1625; 0 16680 TABLE
DATA logs watersj
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xdf69
> HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding
> CONTEXT:  COPY logs, line 69238382
> WARNING: errors ignored on restore: 1

I can recreate this behaviour with 8.2.4 (UTF-8).

psql> CREATE TABLE test (id serial PRIMARY KEY, val text);

psql> INSERT INTO test (val) VALUES (E'\xdf\x69');

psql> \q

$ pg_dump -F c -f x.dmp -t test testdb

$ pg_restore -c -d testdb x.dmp

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1840; 0 45883 TABLE
DATA test laurenz
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
for encoding "UTF8": 0xdf69
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT:  COPY test, line 1
WARNING: errors ignored on restore: 1

The problem is that in (at least) one record in your table
watersj.logs, there is a corrupt string.

Unfortunately (as demonstrated above) it is possible to enter
corrupt data into a PostgreSQL database, this is what must have
happened in your case.

I suggest that you identify and correct this string in the original
database, then everything should work fine.

You can extract the offending row from the dump, that should
help to identify it. 69238382 rows is a little unwieldy, but
tools like awk can help:

pg_restore db.dump | awk '/^COPY logs /,/^\\\.$/ { if (lineno==69238382)
print $0; ++lineno }'

I think there is the desire to fix problems like this in 8.3,
but I don't think that's done yet.

Yours,
Laurenz Albe