Re: pg_dump/restore encoding woes

Поиск
Список
Период
Сортировка
От Amit Khandekar
Тема Re: pg_dump/restore encoding woes
Дата
Msg-id CACoZds2K+imiraAZNx85eb4ibxnK2EJ3mNJJ=5RP=XAh2HaWGQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: pg_dump/restore encoding woes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: pg_dump/restore encoding woes  (Amit Khandekar <amit.khandekar@enterprisedb.com>)
Список pgsql-hackers



On 27 August 2013 20:06, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 26.08.2013 18:59, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com>  writes:

The pg_dump -E option just sets client_encoding, but I think it would be
better for -E to only set the encoding used in the dump, and
PGCLIENTENCODING env variable (if set) was used to determine the
encoding of the command-line arguments. Opinions?

I think this is going to be a lot easier said than done, but feel free
to see if you can make it work.  (As you point out, we don't have
any client-side encoding conversion infrastructure, but I don't see
how you're going to make this work without it.)

First set client_encoding to PGCLIENTENCODING (ie. let libpq do its default thing), and run the queries that fetch the OIDs of any matching tables. Only after doing that, set client_encoding to that specified by -E. That's actually pretty easy to do, as pg_dump already issues all the queries that include user-given strings first.

There's one small wrinkle in that: if the dump fails because the server sends an error, the error would come from the server in the -E encoding, because that's used as the client_encoding after the initial queries. Logically, the errors should be printed in PGCLIENTENCODING. But I think we can live with that.

The pg_restore part is more difficult, as pg_restore needs to work without a database connection at all. There the conversion has to be done client-side.


A second issue is whether we should divorce -E and PGCLIENTENCODING like
that, when they have always meant the same thing.  You mentioned the
alternative of looking at pg_dump's locale environment to determine the
command line encoding --- would that be better?

Hmm. I wasn't thinking of looking at the locale environment as an alternative to the divorce, just as a way to determine the default for the client encoding if PGCLIENTENCODING is not set.

It would be good for pg_dump to be consistent with other tools (reindexdb etc.). If we say that LC_CTYPE determines the encoding of command-line options, regardless of PGCLIENTENCODING, we should do the same in other tools, too. And that would be weird for the other tools. So no, I don't think we should do that.

My plan is to assume that the command-line options are in the "client encoding", and the client encoding is determined by the following things, in this order:

1. client_encoding setting given explicitly in the command line, as part of the connection string.
2. PGCLIENTENCODING
3. LC_CTYPE
4. same as server_encoding

The encoding to be used in the pg_dump output is a different concept, and there are reasons to *not* want the dump to be in the client encoding. Hence using server_encoding for that would be a better default than the current client encoding. This isn't so visible today, as client_encoding defaults to server_encoding anyway, but it will be if we set client_encoding='auto' (ie. look at LC_CTYPE) by default.

There are certainly cases where you'd want to use the client encoding as the dump output encoding. For example if you pipe the pg_dump output to some custom script that mangles it. Or if you just want to look at the output, ie. if the output goes to a terminal. But for the usual case of taking a backup, using the server encoding makes more sense. One option is to check if the output is a tty (like psql does), and output the dump in client or server encoding depending on that (if -E is not given).

So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels that using PGCLIENTENCODING to specify the output encoding was more accidental than on purpose. Looking at the history, the implementation has been that way forever, but the docs were adjusted by commit b524cb36 in 2005 to document that fact.

Here is a set of three patches that I've been working on:

0001-Divorce-pg_dump-E-option-from-PGCLIENTENCODING.patch

Separates pg_dump -E from PGCLIENTENCODING.

Since AH->encoding is now used only to represent dump encoding, we should rename it as AH->dump_encoding.

-----

The standard_conforming_strings parameter is currently set in setup_connection. You have moved it in setup_dump_encoding(). Is it in any way related to encoding ? If not, I think we should keep it in setup_connection().

-----

If a user supplies pg_dump --role=ROLENAME, we do the SET-ROLE in setup_connection. The role name is in client encoding. In case of parallel pg_dump, the setup_connection() in the parent process does it right, but in the worker processes, the client_encoding is already set to the dump_encoding when -E is given (parent has already updated it's client_encoding to dump_encoding and I think the child inherits client_encoding from parent), and then in the child when SET-ROLE is called through setup_connection, it does not work because role name is in client encoding, not dump encoding.

$ pg_dump -t "pöö" -E LATIN1 --role=uöö  postgres
# Above succeeds

$ `which pg_dump` -t "pöö" -E LATIN1 -j 5 -f dumpdir -Fd postgres --role=uöö 
pg_dump: [archiver (db)] query failed: ERROR:  role "uöö" does not exist
pg_dump: [parallel archiver] query was: SET ROLE "uöö"
pg_dump: [archiver (db)] query failed: ERROR:  role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR:  role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR:  role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR:  role "uöö" does not exist

 

0002-Set-client_encoding-auto-in-all-client-utilities.patch 

Set client_encoding='auto' in all the client utilities, like we already did in psql. This fixes e.g "reindexdb -t" with a table name with non-ASCII chars


This patch looks good, I don't have any issues with this.

 
0003-Convert-object-names-to-archive-encoding-before-matc.patch

Use iconv(3) in pg_restore to do encoding conversion in the client. This involves a lot of autoconf changes that I'm not 100% sure about, other than that it's pretty straightforward.

I haven't looked into this one yet.
 


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Erik Rijkers"
Дата:
Сообщение: Re: Minmax indexes
Следующее
От: Marc Mamin
Дата:
Сообщение: invalid regexp crashes the server on windows or 9.3