Обсуждение: Unable to restore dump due to client encoding issues -- or, when is SQL_ASCII really UTF8
I'm probably missing something painfully obvious here, but it's not obvious to me ... I've pulled a dump of a production database to put it in our lab for various types of testing ... I'm glad I did, as this testing is telling me we'll have issues if we try to upgrade. First off, it's my understanding that with SQL_ASCII "encoding", that PostgreSQL does no checking for valid/invalid characters, per the docs: http://www.postgresql.org/docs/8.2/static/multibyte.html The beginning of the dump file I am restoring has the following: -- -- PostgreSQL database dump -- SET client_encoding = 'SQL_ASCII'; [...] But when I try to pull the dump in with psql, I get the following errors: ERROR: invalid byte sequence for encoding "UTF8": 0xa0 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". Connecting to the database and issuing "show client_encoding" shows that the database is indeed set to SQL_ASCII. Now ... I'm expecting the server to accept any byte sequence, since we're using SQL_ANSII, but that is (obviously) not the case. Am I missing something obvious here? Grepping the entire dump file shows absolutely no references to UTF8 ... so why is the server trying to validate the byte string as UTF8? -- Bill Moran Collaborative Fusion Inc.
On Tue, Feb 27, 2007 at 08:43:27AM -0500, Bill Moran wrote: > First off, it's my understanding that with SQL_ASCII "encoding", that > PostgreSQL does no checking for valid/invalid characters, per the docs: > http://www.postgresql.org/docs/8.2/static/multibyte.html Correct. As the documentation says, SQL_ASCII "is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding." > The beginning of the dump file I am restoring has the following: > -- > -- PostgreSQL database dump > -- > > SET client_encoding = 'SQL_ASCII'; > [...] > > But when I try to pull the dump in with psql, I get the following errors: > ERROR: invalid byte sequence for encoding "UTF8": 0xa0 > HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". > > Connecting to the database and issuing "show client_encoding" shows that > the database is indeed set to SQL_ASCII. client_encoding doesn't show the database encoding, it shows the client encoding; execute "show server_encoding" to see the database encoding. You can also use "psql -l" or "\l" from within psql to see all databases and their encodings. The error suggests that the database encoding is UTF8. > Now ... I'm expecting the server to accept any byte sequence, since we're > using SQL_ANSII, but that is (obviously) not the case. Am I missing > something obvious here? Grepping the entire dump file shows absolutely > no references to UTF8 ... so why is the server trying to validate the > byte string as UTF8? Probably because the database is UTF8 (see above). Either create the database as SQL_ASCII (see createdb's -E option) or change the client_encoding setting in the dump to whatever the encoding really is (probably LATIN1 or WIN1252 for Western European languages). -- Michael Fuhr
In response to Michael Fuhr <mike@fuhr.org>: [snip] > > The beginning of the dump file I am restoring has the following: > > -- > > -- PostgreSQL database dump > > -- > > > > SET client_encoding = 'SQL_ASCII'; > > [...] > > > > But when I try to pull the dump in with psql, I get the following errors: > > ERROR: invalid byte sequence for encoding "UTF8": 0xa0 > > HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". > > > > Connecting to the database and issuing "show client_encoding" shows that > > the database is indeed set to SQL_ASCII. > > client_encoding doesn't show the database encoding, it shows the > client encoding; execute "show server_encoding" to see the database > encoding. You can also use "psql -l" or "\l" from within psql to > see all databases and their encodings. The error suggests that the > database encoding is UTF8. Ahh ... see, I knew there'd be something obvious I was missing, but all the google searches I tried were focused on client encoding, so I wasn't getting this answer. The database was, indeed, UTF8, which is the default on newer Postgres. I find it odd that I can't alter the encoding after database creation, but in this case it wasn't a big deal, as it's a lab system and I can just back up a few steps and start over. Good to learn this change now, as opposed to during the next production deployment :) > > Now ... I'm expecting the server to accept any byte sequence, since we're > > using SQL_ANSII, but that is (obviously) not the case. Am I missing > > something obvious here? Grepping the entire dump file shows absolutely > > no references to UTF8 ... so why is the server trying to validate the > > byte string as UTF8? > > Probably because the database is UTF8 (see above). Either create > the database as SQL_ASCII (see createdb's -E option) or change the > client_encoding setting in the dump to whatever the encoding really > is (probably LATIN1 or WIN1252 for Western European languages). Yup. Thanks for the quick response! -- Bill Moran Collaborative Fusion Inc.
On Tue, Feb 27, 2007 at 10:31:47AM -0500, Bill Moran wrote: > The database was, indeed, UTF8, which is the default on newer Postgres. initdb determines the encoding from your environment. If you're on a Unix-like platform run "env | grep UTF" and you'll probably see that LANG and/or LC_* is set to something like "en_US.UTF-8". > I find it odd that I can't alter the encoding after database creation, > but in this case it wasn't a big deal, as it's a lab system and I can > just back up a few steps and start over. Altering the database encoding would mean that all string data would have to be checked and possibly converted. Doing that on a large running system would be problematic; it would probably be just as easy to dump and restore the entire database. -- Michael Fuhr
Bill Moran <wmoran@collaborativefusion.com> writes: > In response to Michael Fuhr <mike@fuhr.org>: >>> Connecting to the database and issuing "show client_encoding" shows that >>> the database is indeed set to SQL_ASCII. >> >> client_encoding doesn't show the database encoding, it shows the >> client encoding; execute "show server_encoding" to see the database >> encoding. > The database was, indeed, UTF8, which is the default on newer Postgres. No, it's not necessarily the default --- the default is the encoding used by whatever locale you initdb'd in. But what I find odd about the above is that client_encoding should default to equal server_encoding unless something on the client side specifically overrides it. The above behavior suggests that you've got a .psqlrc or PGCLIENTENCODING environment variable or some such that is forcing client_encoding to SQL_ASCII when the server encoding is something different. That strikes me as a pretty bad practice; there is use for forcing client_encoding to something specific, but forcing it to SQL_ASCII seems useless and possibly dangerous. regards, tom lane
In response to Tom Lane <tgl@sss.pgh.pa.us>: > Bill Moran <wmoran@collaborativefusion.com> writes: > > In response to Michael Fuhr <mike@fuhr.org>: > >>> Connecting to the database and issuing "show client_encoding" shows that > >>> the database is indeed set to SQL_ASCII. > >> > >> client_encoding doesn't show the database encoding, it shows the > >> client encoding; execute "show server_encoding" to see the database > >> encoding. > > > The database was, indeed, UTF8, which is the default on newer Postgres. > > No, it's not necessarily the default --- the default is the encoding > used by whatever locale you initdb'd in. Ok, but Postgres' idea of what locale translates to what encoding has changed from 8.1.4 to 8.1.8 ... or FreeBSD changed something about the way it uses locales ... or something ... > But what I find odd about the > above is that client_encoding should default to equal server_encoding > unless something on the client side specifically overrides it. The > above behavior suggests that you've got a .psqlrc or PGCLIENTENCODING > environment variable or some such that is forcing client_encoding to > SQL_ASCII when the server encoding is something different. That strikes > me as a pretty bad practice; there is use for forcing client_encoding to > something specific, but forcing it to SQL_ASCII seems useless and > possibly dangerous. Well, the explanation for that got snipped -- the file created by pg_dump had a set client_encoding statement at the beginning of it, which seems logical to me: the database this was dumped from was SQL_ASCII, so the restore process should attempt to replicate that. I'm chalking this up to "Bill didn't notice that the new databases were getting created with a different encoding." -- Bill Moran Collaborative Fusion Inc.