Re: finding bogus UTF-8

Поиск
Список
Период
Сортировка
От dennis jenkins
Тема Re: finding bogus UTF-8
Дата
Msg-id AANLkTi=VwFEMRst2Wpw-ZM-1Qr0+jiYtL+ETSO1ueUC7@mail.gmail.com
обсуждение исходный текст
Ответ на finding bogus UTF-8  (Scott Ribe <scott_ribe@elevated-dev.com>)
Ответы Re: finding bogus UTF-8
Список pgsql-general
I'm working on a project to convert a large database form SQL_ASCII to
UTF-8.  I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run through a small (efficient) C program that logs each line that
contains ANY "unclean" ASCII text.
3) Parse that log with a small perl program (hashes are easier in perl
than C) to produce a report, and emit some SQL.
4) Construct SQL update statements to "repair" the original data.
5) Repeat at step #1 until the database is clean.
6) pg_dump (SQL_ASCII) -> pg_restore -EUTF8 new database.
7) Profit!

If you are interested, I can email to you the C and Perl source.

It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
./bad-ascii-report.pl > unclean-ascii.rpt

real    11m11.804s
user    18m2.579s
sys     2m25.803s

# grep "^--" unclean-ascii.rpt
-- some_table 4051021
-- other_table 16

^^^ Numbers are count of rows that need cleaning.
Entire "rpt" file contains SQL comments "--" and SQL select statements
of the form:
select * from table where primary_key in (1, 2, 3, 4, ....);

The perl script contains a hash that maps table names to primary key
column IDs (to pick up when parsing the raw SQL restore "COPY"
script).  I will need to purge my secret schema stuff from it before
sharing it with anyone.

My solution is probably not perfect, and probably not optimal, but it
is working great so far.  I'm almost done cleaning up my database and
hope to attempt a real UTF8 restore in the near future.


On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors
(AMACPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such
values?
>
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

В списке pgsql-general по дате отправления:

Предыдущее
От: Andy Colson
Дата:
Сообщение: Re: Schema version control
Следующее
От: dennis jenkins
Дата:
Сообщение: Re: finding bogus UTF-8