Re: How to remove non-UTF values from a table?

Поиск
Список
Период
Сортировка
От Howard Cole
Тема Re: How to remove non-UTF values from a table?
Дата
Msg-id 4B28E942.8090402@selestial.com
обсуждение исходный текст
Ответ на Re: How to remove non-UTF values from a table?  (Phoenix Kiula <phoenix.kiula@gmail.com>)
Ответы Re: How to remove non-UTF values from a table?  (Richard Broersma <richard.broersma@gmail.com>)
Список pgsql-general
Phoenix Kiula wrote:
> On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@selestial.com> wrote:
>
>> Phoenix Kiula wrote:
>>
>>> An easy question for some I hope.
>>>
>>> I have a DB from 8.2 days that when I now dump and try to take into
>>> the 8.3.7, it gives me errors about utf-8 stuff.
>>>
>>> I tried searching this list's archives but could not come up with an
>>> answer.
>>>
>>> Google returns some sites like these:
>>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>>> but I'm not clear on how to use them.
>>>
>>> Following the SQL on this site I could identify some columns that
>>> contain text like this:
>>>
>>>    "Évolution générale de la situation démographique"
>>>
>>> So my guess is that the non-English characters were originally not
>>> getting written in proper utf-8 variants.
>>>
>>> Is there any SQL possibility to find these columns and replace them
>>> with utf-8 equivalents using some postgresql commands? Couldn't find
>>> anything in the "Strings functions" (chapter 9 of manual).
>>>
>>> We're on CentOS.
>>>
>>> Thanks!
>>>
>>> My recommendation would be to install the iconv utility and run it on a
>>> plain text (pg_dump -Fp) backup as suggested in the google article - and
>>> then reimport the clean UTF-8.
>>>
>>> I am surprised that you managed to install the original backup on 8.3
>>> because it seems to be much more strict on encoding - Unless your database
>>> is not in UTF-8?
>>>
>
>
>
> Thanks Howard.
>
> I ran the SQL and it finds anything that has non-English characters.
> For example:
>
>
>
http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1
>
>
> Part of this URL is actually in Japanese, but when I paste it in this
> email it comes up with all these percentage signs. I suppose this is
> "url encoded".
>
> Shouldn't this be valid UTF-8? How does PG calculate if something is
> not valid UTF-8?
>
> Thanks.
>
>
I am sure that postgres uses standard test to see if text is valid UTF8,
however I guess you will only get a warning when you try to read or
write the data, and the warning only appears in the log as far as I am
aware. Encoding checking has become more strict with each upgrade to
postgres, this is why you are seeing errors as you import into 8.3 from
an 8.2 backup.

I still maintain that you should check and fix using iconv rather than
SQL. Read your distro notes on how to install iconv (if it isn't already
installed) and run it on your plain text backup. You can then pinpoint
individual changes using diff if you want to find it in your exising 8.2
database. Better still, just fix using Iconv then import the clean data
into 8.3 or 8.4

Howard Cole
www.selestial.com


В списке pgsql-general по дате отправления:

Предыдущее
От: Howard Cole
Дата:
Сообщение: Re: Interesting Benchmark Article
Следующее
От: Michael Clark
Дата:
Сообщение: Re: Possible causes for database corruption and solutions