Re: Differences in UTF8 between 8.0 and 8.1

From	Paul Lindner
Subject	Re: Differences in UTF8 between 8.0 and 8.1
Date	October 26, 2005, 22:01:33
Msg-id	20051027005951.GA27655@inuus.com
In reply to	Re: Differences in UTF8 between 8.0 and 8.1 (Andrew - Supernews <andrew+nonews@supernews.com>)
Responses	Re: Differences in UTF8 between 8.0 and 8.1 Re: Differences in UTF8 between 8.0 and 8.1
List	pgsql-hackers

On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data, it appears to be a sequence of Latin-1
characters.  Every byte has the eighth bit set, and they all seem to
pass the 8.0 check.
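
As a rough illustration of the point above, the quoted byte sequence can
be inspected in a few lines of Python.  This is only a sketch: Python's
strict UTF-8 decoder stands in for the server-side check, and its rules
are assumed, not confirmed, to match what 8.1 enforces.  The variable
names are made up for the example.

```python
# The byte sequence quoted above, written as hex.
data = bytes.fromhex("c1f9d4c2d0c7d2b9")

# Every byte has the eighth (high) bit set.
all_high = all(b & 0x80 for b in data)

# A strict UTF-8 decode rejects the sequence as a whole.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The final two bytes (d2 b9) are well-formed UTF-8 on their own.
tail = data[-2:].decode("utf-8")

# Latin-1 maps every byte to a character, so this decode cannot fail.
as_latin1 = data.decode("latin-1")

print(all_high, strict_ok, hex(ord(tail)), len(as_latin1))
```

Only the trailing d2 b9 pair decodes cleanly (to U+04B9); the other six
bytes are rejected by a strict decoder, which fits the reading that the
data is really single-byte text.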

In a million rows I found 2 examples of this.

However, I'm running into another problem now.  The command:
 iconv -c -f UTF8 -t UTF8

does strip out the invalid characters.  However, iconv reads the
entire file into memory before it writes out any data.  This is not so
good for multi-gigabyte dump files and doesn't allow for it to be used
in a pipe between pg_dump and psql.

Anyone have any other recommendations?  GNU recode might do it, but
I'm a bit stymied by the syntax.  A quick perl script using
Text::Iconv didn't work either.  I'm off to look at some other perl
modules and will try to create a script so I can strip out the invalid
characters.
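
One possible shape for such a filter, sketched in Python instead of
Perl: an incremental decoder with the "ignore" error handler drops bytes
that are not valid UTF-8 while processing the input in fixed-size
chunks, so memory use stays bounded and it can sit in a pipe.  The
function name and chunk size are invented for the example, and what
Python treats as valid UTF-8 is assumed, not verified, to be acceptable
to 8.1.

```python
import codecs

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice


def strip_invalid_utf8(src, dst, chunk_size=CHUNK_SIZE):
    """Copy binary stream src to dst, dropping invalid UTF-8 bytes."""
    # The incremental decoder buffers a multi-byte character that is
    # split across two chunks instead of treating it as an error.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; a truncated sequence at end of input is dropped.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

To use it between pg_dump and psql, the script would end with a call
such as strip_invalid_utf8(sys.stdin.buffer, sys.stdout.buffer) after
importing sys.  Note that this silently discards the offending bytes,
just as iconv -c does; it makes no attempt to convert Latin-1 data.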

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com
