Re: COPY command character set

From: Peter Headland
Subject: Re: COPY command character set
Msg-id 71F491F5DA99604A80DE49424BF3D02B0CD9A27B@exchange8.actuate.com
In reply to: Re: COPY command character set  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
> There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said "UTF signatures"
aka the "byte order mark". IOW, the "magic number" bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian   FF FE
UTF-16 big-endian      FE FF
UTF-8                  EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.
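
Until COPY handles this itself, the signature can be stripped before the
file is loaded. A minimal sketch in Python, assuming a hypothetical helper
named strip_bom (not part of Postgres or any existing tool); it checks only
the three signatures listed above and ignores UTF-32:

```python
import codecs

# The three signatures from the table above, paired with a Python codec name.
# UTF-32 is deliberately not handled in this sketch.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def strip_bom(data: bytes):
    """Return (encoding, payload) with any leading signature removed.

    encoding is None when the data does not start with a known signature,
    in which case the payload is returned unchanged.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data
```

For a UTF-8 file, the returned payload could then be written back out and
passed to COPY as usual.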

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses ICU? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...
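
In the meantime, a UTF-16 file from such a tool can be converted outside
the database before COPY sees it. A rough sketch in Python, assuming a
hypothetical helper named utf16_to_utf8 and assuming the input carries one
of the two UTF-16 signatures (without one, the byte order is a guess, so
the sketch refuses):

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Transcode signed UTF-16 bytes to UTF-8 with no signature."""
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        text = data[2:].decode("utf-16-be")
    else:
        # No signature: byte order is unknown, so do not guess.
        raise ValueError("no UTF-16 signature found; byte order unknown")
    # Encoding a str with "utf-8" never emits the EF BB BF signature.
    return text.encode("utf-8")
```

The result is plain UTF-8 with no signature bytes, which is what COPY
expects today when client_encoding is UTF8.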

--
Peter Headland
Architect
Actuate Corporation


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] COPY command character set

"Peter Headland" <pheadland@actuate.com> writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before.  I
don't remember if it was rejected or is just languishing on the TODO
list.  I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

            regards, tom lane
