Re: Linux x Windows LOCALE/ENCODING compatibility

Поиск
Список
Период
Сортировка
От Marco Colombo
Тема Re: Linux x Windows LOCALE/ENCODING compatibility
Дата
Msg-id 4CE45402.3060204@esiway.net
обсуждение исходный текст
Ответ на Re: Linux x Windows LOCALE/ENCODING compatibility  (Carlos Henrique Reimer <carlos.reimer@opendb.com.br>)
Список pgsql-general
On 11/09/2010 02:31 AM, Carlos Henrique Reimer wrote:
> Hi,
>
> I'm currently in the process of moving the data from the Windows server
> to the new Linux box but facing some problems with the encoding.
>
> Additional configuration information: Windows is running PG 8.3 and the
> new Linux box is PG 8.4.
>
> Windows dump command:
> pg_dump -U postgres -Fc -v -f "f:\backup
>
> Linux restore command:
> pg_restore -v -n brasil -d mapas /backup
>
>
> pg_restore error:
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 3574; 0 40805 TABLE
> DATA cidade
>   postgres
> pg_restore: [archiver (db)] COPY failed: ERROR:  character 0x81 of
> encoding "WIN
> 1252" has no equivalent in "UTF8"
> CONTEXT:  COPY cidade, line 6
>
> I also tried to dump using pg_dump -E UTF8 but then I got:
> pg_dump: SQL command failed
> pg_dump: Error message from server: ERROR:  character 0x81 of encoding
> "WIN1252" has no equivalent in "UTF8"
> pg_dump: The command was: COPY brasil.cidade (gid, "municpio",
> "municpi0", uf, longitude, latitude, the_geom) TO stdout;
> pg_dump: *** aborted because of error
>
> How can I fix this error?
>

Well, that's pretty much evidence that the text you have on windows is
NOT win1252-encoded. Or some parts of it, at least. According to this page:

http://en.wikipedia.org/wiki/Windows-1252

0x81 is undefined in win1252.

Please note that if the old DB has SQL_ASCII encoding, you may have
mixed encoding text in the same database (either on purpose or by
mistake) and you have to either keep using SQL_ASCII in the new DB
(which means keeping potentially invalid data around), or do proper
conversion to UTF-8 (which possibly has do be done differently for
different tables, or even different rows in the same table) BEFORE
importing it in the new DB.

To convert from one encoding to another, you can use, surprisingly, the
'convert' function. The tricky part is that once you have text of
unknown encoding, you can only guess. That is, try to convert it from
some reasonable encodings to UTF-8, and look at the result. If it looks
right, that may be the right encoding. :)

BTW, 0x81 is a rather weird char, it's invalid in many common encodings.
It's valid in win-1251, where it represents U+0403 (CYRILLIC CAPITAL
LETTER GJE), Ѓ (if you don't have the font, you won't be able to see
it), but I think it's an unlikely candidate.

In both CP437 and CP850 (old DOS encodings for western languages) it
represents U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), ü (again, I
hope you can see it). That's a better candidate, it's possible someone
accessed the DB either directly, or via a web application, from a
Windows ME or older system.

Unfortunately, AFAIK, PostgreSQL doesn't support the cp850 encoding.
Maybe you can dump the data (even a single table) in text format and use
a text editor (or shell utils like perl, sed, tr) to replace offending
charaters with their proper win-1252 counterparts (e.g., 0x81 can be
replaced with 0xfc, which is u with diaresis in win-1252).

I hope it helps.

.TM.

В списке pgsql-general по дате отправления:

Предыдущее
От: "Marc Mamin"
Дата:
Сообщение: Re: Survey on backing up unlogged tables: help us with PostgreSQL development!
Следующее
От: "Marc Mamin"
Дата:
Сообщение: Re: Survey on backing up unlogged tables: help us with PostgreSQL development!