Обсуждение: Maintaining accents with "COPY" ?
Hi I'm currently doing a CSV export using COPY: COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER; This works great apart from accents are not preserved in the output, for example é gets converted to random characters, e.g.√© or similar. How can I preserve accents ? Thanks ! Laura
> On 25/05/2023 09:14 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:
>
> I'm currently doing a CSV export using COPY:
>
> COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
>
> This works great apart from accents are not preserved in the output, for
> example é gets converted to random characters, e.g. √© or similar.
>
> How can I preserve accents ?
Looks like an encoding issue and a mismatch between database encoding and client
encoding. You can check both with:
SHOW server_encoding;
SHOW client_encoding;
Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):
SET client_encoding = 'utf8';
COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';
--
Erik
> Looks like an encoding issue and a mismatch between database encoding and client > encoding. You can check both with: > > SHOW server_encoding; > SHOW client_encoding; > > Then either set the client encoding or use COPY's encoding option to match the > database encoding (I assume utf8 in this example): > > SET client_encoding = 'utf8'; > COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8'; > > -- > Erik Hi Erik, Looks like you could well be right about encoding: postgres=# SHOW server_encoding; server_encoding ----------------- UTF8 (1 row) postgres=# SHOW client_encoding; client_encoding ----------------- SQL_ASCII (1 row) I will try your suggestion...
On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
> I'm currently doing a CSV export using COPY:
>
> COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
>
>
> This works great apart from accents are not preserved in the output,
> for example é gets converted to random characters, e.g. √© or similar.
How do you check the output?
If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.
> How can I preserve accents ?
They probably already are preserved.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Вложения
> On 25/05/2023 13:26 CEST Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
> > I'm currently doing a CSV export using COPY:
> >
> > COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
> >
> > This works great apart from accents are not preserved in the output,
> > for example é gets converted to random characters, e.g. √© or similar.
>
> How do you check the output?
>
> If a single character is turned into 2 or 3 characters the issue is
> usually that the program which produces the output (in the case of COPY
> I think that would be the PostgreSQL server, not the client) produces
> UTF-8, but the program consuming it expects an 8-bit character set
> (typically windows-1252). See if oyu can tell that program that the file
> is in UTF-8.
>
> > How can I preserve accents ?
>
> They probably already are preserved.
You're right. The bytes are probably interpreted as Mac OS Roman:
$ echo é | iconv -f macintosh
é
$ echo -n é | xxd
00000000: c3a9
--
Erik
> On 25/05/2023 12:08 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote: > > > Looks like an encoding issue and a mismatch between database encoding and > > client encoding. You can check both with: > > > > SHOW server_encoding; > > SHOW client_encoding; > > > > Then either set the client encoding or use COPY's encoding option to match > > the database encoding (I assume utf8 in this example): > > > > SET client_encoding = 'utf8'; > > COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8'; > > Hi Erik, > > Looks like you could well be right about encoding: > > postgres=# SHOW server_encoding; > server_encoding > ----------------- > UTF8 > (1 row) > > postgres=# SHOW client_encoding; > client_encoding > ----------------- > SQL_ASCII > (1 row) > > I will try your suggestion... The client encoding is not the problem here. Using SQL_ASCII effectively uses the server encoding. SQL_ASCII basically means uninterpreted bytes/characters. From https://www.postgresql.org/docs/15/multibyte.html#id-1.6.11.5.7: "If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. (However, if the server's character set is not SQL_ASCII, the server will still check that incoming data is valid for that encoding; so the net effect is as though the client character set were the same as the server's.) Just as for the server, use of SQL_ASCII is unwise unless you are working with all-ASCII data." -- Erik