Обсуждение: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

Поиск
Список
Период
Сортировка

Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Bill Moran
Дата:
I'm probably missing something painfully obvious here, but it's not
obvious to me ...

I've pulled a dump of a production database to put it in our lab for
various types of testing ... I'm glad I did, as this testing is telling
me we'll have issues if we try to upgrade.

First off, it's my understanding that with SQL_ASCII "encoding", that
PostgreSQL does no checking for valid/invalid characters, per the docs:
http://www.postgresql.org/docs/8.2/static/multibyte.html

The beginning of the dump file I am restoring has the following:
--
-- PostgreSQL database dump
--

SET client_encoding = 'SQL_ASCII';
[...]

But when I try to pull the dump in with psql, I get the following errors:
ERROR:  invalid byte sequence for encoding "UTF8": 0xa0
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 

Connecting to the database and issuing "show client_encoding" shows that
the database is indeed set to SQL_ASCII.

Now ... I'm expecting the server to accept any byte sequence, since we're
using SQL_ANSII, but that is (obviously) not the case.  Am I missing
something obvious here?  Grepping the entire dump file shows absolutely
no references to UTF8 ... so why is the server trying to validate the
byte string as UTF8?

--
Bill Moran
Collaborative Fusion Inc.

Re: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Michael Fuhr
Дата:
On Tue, Feb 27, 2007 at 08:43:27AM -0500, Bill Moran wrote:
> First off, it's my understanding that with SQL_ASCII "encoding", that
> PostgreSQL does no checking for valid/invalid characters, per the docs:
> http://www.postgresql.org/docs/8.2/static/multibyte.html

Correct.  As the documentation says, SQL_ASCII "is not so much a
declaration that a specific encoding is in use, as a declaration
of ignorance about the encoding."

> The beginning of the dump file I am restoring has the following:
> --
> -- PostgreSQL database dump
> --
>
> SET client_encoding = 'SQL_ASCII';
> [...]
>
> But when I try to pull the dump in with psql, I get the following errors:
> ERROR:  invalid byte sequence for encoding "UTF8": 0xa0
> HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
>
> Connecting to the database and issuing "show client_encoding" shows that
> the database is indeed set to SQL_ASCII.

client_encoding doesn't show the database encoding, it shows the
client encoding; execute "show server_encoding" to see the database
encoding.  You can also use "psql -l" or "\l" from within psql to
see all databases and their encodings.  The error suggests that the
database encoding is UTF8.

> Now ... I'm expecting the server to accept any byte sequence, since we're
> using SQL_ANSII, but that is (obviously) not the case.  Am I missing
> something obvious here?  Grepping the entire dump file shows absolutely
> no references to UTF8 ... so why is the server trying to validate the
> byte string as UTF8?

Probably because the database is UTF8 (see above).  Either create
the database as SQL_ASCII (see createdb's -E option) or change the
client_encoding setting in the dump to whatever the encoding really
is (probably LATIN1 or WIN1252 for Western European languages).

--
Michael Fuhr

Re: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Bill Moran
Дата:
In response to Michael Fuhr <mike@fuhr.org>:

[snip]

> > The beginning of the dump file I am restoring has the following:
> > --
> > -- PostgreSQL database dump
> > --
> >
> > SET client_encoding = 'SQL_ASCII';
> > [...]
> >
> > But when I try to pull the dump in with psql, I get the following errors:
> > ERROR:  invalid byte sequence for encoding "UTF8": 0xa0
> > HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
> >
> > Connecting to the database and issuing "show client_encoding" shows that
> > the database is indeed set to SQL_ASCII.
>
> client_encoding doesn't show the database encoding, it shows the
> client encoding; execute "show server_encoding" to see the database
> encoding.  You can also use "psql -l" or "\l" from within psql to
> see all databases and their encodings.  The error suggests that the
> database encoding is UTF8.

Ahh ... see, I knew there'd be something obvious I was missing, but all
the google searches I tried were focused on client encoding, so I wasn't
getting this answer.

The database was, indeed, UTF8, which is the default on newer Postgres.
I find it odd that I can't alter the encoding after database creation,
but in this case it wasn't a big deal, as it's a lab system and I can
just back up a few steps and start over.

Good to learn this change now, as opposed to during the next production
deployment :)

> > Now ... I'm expecting the server to accept any byte sequence, since we're
> > using SQL_ANSII, but that is (obviously) not the case.  Am I missing
> > something obvious here?  Grepping the entire dump file shows absolutely
> > no references to UTF8 ... so why is the server trying to validate the
> > byte string as UTF8?
>
> Probably because the database is UTF8 (see above).  Either create
> the database as SQL_ASCII (see createdb's -E option) or change the
> client_encoding setting in the dump to whatever the encoding really
> is (probably LATIN1 or WIN1252 for Western European languages).

Yup.  Thanks for the quick response!

--
Bill Moran
Collaborative Fusion Inc.

Re: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Michael Fuhr
Дата:
On Tue, Feb 27, 2007 at 10:31:47AM -0500, Bill Moran wrote:
> The database was, indeed, UTF8, which is the default on newer Postgres.

initdb determines the encoding from your environment.  If you're
on a Unix-like platform run "env | grep UTF" and you'll probably
see that LANG and/or LC_* is set to something like "en_US.UTF-8".

> I find it odd that I can't alter the encoding after database creation,
> but in this case it wasn't a big deal, as it's a lab system and I can
> just back up a few steps and start over.

Altering the database encoding would mean that all string data would
have to be checked and possibly converted.  Doing that on a large
running system would be problematic; it would probably be just as
easy to dump and restore the entire database.

--
Michael Fuhr

Re: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Tom Lane
Дата:
Bill Moran <wmoran@collaborativefusion.com> writes:
> In response to Michael Fuhr <mike@fuhr.org>:
>>> Connecting to the database and issuing "show client_encoding" shows that
>>> the database is indeed set to SQL_ASCII.
>>
>> client_encoding doesn't show the database encoding, it shows the
>> client encoding; execute "show server_encoding" to see the database
>> encoding.

> The database was, indeed, UTF8, which is the default on newer Postgres.

No, it's not necessarily the default --- the default is the encoding
used by whatever locale you initdb'd in.  But what I find odd about the
above is that client_encoding should default to equal server_encoding
unless something on the client side specifically overrides it.  The
above behavior suggests that you've got a .psqlrc or PGCLIENTENCODING
environment variable or some such that is forcing client_encoding to
SQL_ASCII when the server encoding is something different.  That strikes
me as a pretty bad practice; there is use for forcing client_encoding to
something specific, but forcing it to SQL_ASCII seems useless and
possibly dangerous.

            regards, tom lane

Re: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8

От
Bill Moran
Дата:
In response to Tom Lane <tgl@sss.pgh.pa.us>:

> Bill Moran <wmoran@collaborativefusion.com> writes:
> > In response to Michael Fuhr <mike@fuhr.org>:
> >>> Connecting to the database and issuing "show client_encoding" shows that
> >>> the database is indeed set to SQL_ASCII.
> >>
> >> client_encoding doesn't show the database encoding, it shows the
> >> client encoding; execute "show server_encoding" to see the database
> >> encoding.
>
> > The database was, indeed, UTF8, which is the default on newer Postgres.
>
> No, it's not necessarily the default --- the default is the encoding
> used by whatever locale you initdb'd in.

Ok, but Postgres' idea of what locale translates to what encoding has
changed from 8.1.4 to 8.1.8 ... or FreeBSD changed something about the
way it uses locales ... or something ...

> But what I find odd about the
> above is that client_encoding should default to equal server_encoding
> unless something on the client side specifically overrides it.  The
> above behavior suggests that you've got a .psqlrc or PGCLIENTENCODING
> environment variable or some such that is forcing client_encoding to
> SQL_ASCII when the server encoding is something different.  That strikes
> me as a pretty bad practice; there is use for forcing client_encoding to
> something specific, but forcing it to SQL_ASCII seems useless and
> possibly dangerous.

Well, the explanation for that got snipped -- the file created by pg_dump
had a set client_encoding statement at the beginning of it, which seems
logical to me: the database this was dumped from was SQL_ASCII, so the
restore process should attempt to replicate that.  I'm chalking this up
to "Bill didn't notice that the new databases were getting created with a
different encoding."


--
Bill Moran
Collaborative Fusion Inc.