Re: BUG #5532: Valid UTF8 sequence errors as invalid

From Tom Lane
Subject Re: BUG #5532: Valid UTF8 sequence errors as invalid
Msg-id 14170.1277922093@sss.pgh.pa.us
In reply to Re: BUG #5532: Valid UTF8 sequence errors as invalid  (Mike Lewis <mikelikespie@gmail.com>)
Responses Re: BUG #5532: Valid UTF8 sequence errors as invalid
List pgsql-bugs
Mike Lewis <mikelikespie@gmail.com> writes:
> I've run into a fair amount of unicode errors when trying to copy in log
> files.  Would you recommend using bytea or another data type instead of text
> or varchar... or at least copying to a staging table with bytea's and
> filtering out invalid rows when moving it to the main table?

My guess is that you're working with data that was originally
represented in UTF16, and you've used a tool that doesn't really know
what it's doing to convert to UTF8.  A correct conversion has to reunite
surrogate pairs into wider-than-16-bit Unicode characters and then
encode those as single UTF8 sequences.  Dunno if you can easily identify
the culprit, but fixing that conversion is the long-term solution.

(BTW, I should think that iconv or some related tool would have a
solution for fixing this miscoding; it's not an uncommon problem.)

            regards, tom lane
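
The conversion described in the message (reuniting surrogate pairs and re-encoding each result as a single UTF8 sequence) could be sketched in Python roughly as follows. This is not from the original message: the function name `repair_cesu8` is hypothetical, and the sketch assumes the input is otherwise well-formed UTF8 in which each UTF16 surrogate was encoded as its own 3-byte sequence. It relies on Python's standard `surrogatepass` error handler.

```python
def repair_cesu8(raw: bytes) -> bytes:
    """Re-encode bytes in which each UTF-16 surrogate was written as its own
    3-byte sequence, producing standard UTF-8. (Hypothetical helper.)"""
    # Decode leniently: "surrogatepass" lets the individually encoded
    # surrogates through as lone surrogate code points instead of failing.
    text = raw.decode("utf-8", errors="surrogatepass")

    # Round-trip through UTF-16: writing the surrogates out as 16-bit units
    # and decoding strictly reunites each high/low pair into one character.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    repaired = utf16.decode("utf-16-le")  # raises on an unpaired surrogate

    # Encode the reunited characters as ordinary UTF-8 sequences.
    return repaired.encode("utf-8")


if __name__ == "__main__":
    # U+1F600 miscoded as two 3-byte sequences (surrogates D83D and DE00).
    broken = b"\xed\xa0\xbd\xed\xb8\x80"
    print(repair_cesu8(broken))
```

Input that is already valid UTF8 passes through unchanged. An unpaired surrogate raises `UnicodeDecodeError`, which could be used to route such rows to a staging table rather than loading them.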
