Обсуждение: Importing a Windows database (in en_GB.CP1252) to linux
Hello,
I have a (custom) backup created on a Windows machine in en_GB.CP1252 encoding. And of course, some characters can't be imported because they don't exist in UTF-8.
So I created a new cluster on PG18 port 5433 initialized in WIN1252 encoding:
$ \l imlocal
List of databases
Name | Owner | Encoding | Locale Provider | Collate | Ctype | Locale | ICU Rules | Access privileges
---------+-------+----------+-----------------+---------+-------+--------+-----------+-------------------
imlocal | cat | WIN1252 | libc | C | C | ∅ | ∅ | ∅
(1 row)
I am now trying to import the data in that database but I keep getting this error:
$ pg_restore -p 5433 -t csakafl -d imlocal imlocal20251127.backup
pg_restore: error: COPY failed for table "csakafl": ERROR: invalid byte sequence for encoding "UTF8": 0x92
CONTEXT: COPY csakafl, line 298
So pg_restore still thinks I want to use UTF8.
So I created an en_GB.CP1252 locales definitions folder like this:
mkdir ~/locales
cd ~/locales
cp cp /usr/share/i18n/charmaps/CP1252.gz .
gunzip CP1252.gz
localedef -cv -i en_GB -f CP1252 ./en_GB.CP1252
Then I now want to use this to restore my database so I tried setting:
setenv LOCPATH ~/locales
setenv LC_ALL en_GB.CP1252
setenv LANG en_GB.CP1252
And even forcing LC_ALL when launching pg_restore:
$ LC_ALL=en_GB.CP1252 pg_restore -p 5433 -t csakafl -d imlocal imlocal20251127.backup
pg_restore: error: COPY failed for table "csakafl": ERROR: invalid byte sequence for encoding "UTF8": 0x92
CONTEXT: COPY csakafl, line 298
What am I doing wrong?
Thanks for your help,
JC
On Mon, 2025-12-01 at 14:37 +0100, Jean-Christophe BOGGIO wrote: > I have a (custom) backup created on a Windows machine in en_GB.CP1252 encoding. > And of course, some characters can't be imported because they don't exist in UTF-8. Hm? Which character can be encoded in WINDOWS-1252, but not in UTF-8? I don't think that can be the problem. > So I created a new cluster on PG18 port 5433 initialized in WIN1252 encoding: > > $ \l imlocal > List of databases > Name | Owner | Encoding | Locale Provider | Collate | Ctype | Locale | ICU Rules | Access privileges > ---------+-------+----------+-----------------+---------+-------+--------+-----------+------------------- > imlocal | cat | WIN1252 | libc | C | C | ∅ | ∅ | ∅ > (1 row) > > I am now trying to import the data in that database but I keep getting this error: > > $ pg_restore -p 5433 -t csakafl -d imlocal imlocal20251127.backup > pg_restore: error: COPY failed for table "csakafl": ERROR: invalid byte sequence for encoding "UTF8": 0x92 > CONTEXT: COPY csakafl, line 298 > > So pg_restore still thinks I want to use UTF8. That looks like pg_restore sets a wrong client_encoding, which is weird. What do you get for pg_restore -p 5433 -t csakafl -s -f - imlocal20251127.backup | grep client_encoding If the dump was taken from a WINDOWS-1252 encoded database, that line should read SET client_encoding = 'WIN1252'; and everything should work fine. But apparently, the client_encoding is set to UTF-8 in your case. How did that happen? How exactly did you take that dump? Did you do anything (like an encoding conversion) with the dump after you took it? Yours, Laurenz Albe
Laurenz,
That looks like pg_restore sets a wrong client_encoding, which is weird. What do you get for pg_restore -p 5433 -t csakafl -s -f - imlocal20251127.backup | grep client_encoding
SET client_encoding = 'UTF8';
If the dump was taken from a WINDOWS-1252 encoded database, that line should read SET client_encoding = 'WIN1252'; and everything should work fine. But apparently, the client_encoding is set to UTF-8 in your case.
Yes, exactly.
How did that happen? How exactly did you take that dump?
This backup is a transfer from an iSeries DB2 database. It has been a nightmare to get this working (and took around 10 days to finalize). We set up a FDW Server using odbc_fdw, recreated all the tables (around 2k) and INSERTed the DB2 data to the PG tables.
Then we used PgAdmin that came with PostgreSQL 17 on the Windows machine.
I double-checked with the client: the database is in en_GB.CP1252.
BTW, if I try to import the same backup on a UTF8 database I get the exact same error (invalid byte sequence 0x92)
Did you do anything (like an encoding conversion) with the dump after you took it?
No, the backup is in custom format so I can't touch it (or at least I don't know how I could).
Where can I go from here?
On Tue, 2025-12-02 at 10:39 +0100, Jean-Christophe BOGGIO wrote: > > That looks like pg_restore sets a wrong client_encoding, which is weird. > > > > What do you get for > > > > pg_restore -p 5433 -t csakafl -s -f - imlocal20251127.backup | grep client_encoding > > SET client_encoding = 'UTF8'; > > > > How did that happen? How exactly did you take that dump? > > This backup is a transfer from an iSeries DB2 database. It has been a nightmare to > get this working (and took around 10 days to finalize). We set up a FDW Server using > odbc_fdw, recreated all the tables (around 2k) and INSERTed the DB2 data to the PG tables. > > Then we used PgAdmin that came with PostgreSQL 17 on the Windows machine. > > I double-checked with the client: the database is in en_GB.CP1252. The DB2 database or the PostgreSQL database? It must be the DB2 database, because otherwise the dump would contain SET client_encoding = 'WIN1252'; That is, unless you created the dump with pg_dump --encoding=UTF8 But then, the dump couldn't contain non-UTF-8 characters. Having used pgAdmin, you probably don't know the pg_dump command line that was used. My best guess is that odbc_fdw has a bug that does not check if the strings are properly encoded, and you somehow got corrupted data in your PostgreSQL database. But I am not sure. > > Did you do anything (like an encoding conversion) with the dump after you took it? > > No, the backup is in custom format so I can't touch it (or at least I don't know how I could). > > Where can I go from here? You can try the following: - convert the custom format dump into an SQL script with pg_restore -f script.sql imlocal20251127.backup - edit script.sql and change the line to read SET client_encoding = 'WIN1252'; - restore that dump with "psql": psql -f script.sql -d newdb That should work if *all* the strings are in WINDOWS-1252 encoding. Yours, Laurenz Albe
Laurenz, Thanks a lot for your quick reply. Le 02/12/2025 à 12:26, Laurenz Albe a écrit : >> I double-checked with the client: the database is in en_GB.CP1252. > The DB2 database or the PostgreSQL database? > > It must be the DB2 database, because otherwise the dump would contain No: the PostgreSQL one. I guess the DB2 is encoded (internally) in EBCDIC... Yes... At least that's what CSV exports directly made from the DB shows. > My best guess is that odbc_fdw has a bug that does not check if the strings are > properly encoded, and you somehow got corrupted data in your PostgreSQL database. > But I am not sure. I don't know really but I could have the client do some tests on tables that we know contain invalid byte sequences. Maybe I could try doing manual pg_dumps with different -E parameter to verify that it's not something enforced by pgAdmin. > You can try the following: > > - convert the custom format dump into an SQL script with > > pg_restore -f script.sql imlocal20251127.backup > > - edit script.sql and change the line to read > > SET client_encoding = 'WIN1252'; > > - restore that dump with "psql": > > psql -f script.sql -d newdb > > That should work if *all* the strings are in WINDOWS-1252 encoding. Thanks a lot for the proposal. It seems to be doing the job! It's running right now and so far no "bad encoding" error appeared (it's a 311Gb backup so it will take some time). I'll let you know. Gratefully, JC