Discussion: finding bogus UTF-8
I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
I'm working on a project to convert a large database from SQL_ASCII to UTF-8. I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run it through a small (efficient) C program that logs each line that contains ANY "unclean" ASCII text.
3) Parse that log with a small perl program (hashes are easier in perl than C) to produce a report, and emit some SQL.
4) Construct SQL update statements to "repair" the original data.
5) Repeat at step #1 until the database is clean.
6) pg_dump (SQL_ASCII) -> pg_restore -EUTF8 new database.
7) Profit!

If you are interested, I can email you the C and Perl source. It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester | ./bad-ascii-report.pl > unclean-ascii.rpt

real    11m11.804s
user    18m2.579s
sys     2m25.803s

# grep "^--" unclean-ascii.rpt
-- some_table 4051021
-- other_table 16

^^^ Numbers are counts of rows that need cleaning. The entire "rpt" file contains SQL comments ("--") and SQL select statements of the form:

select * from table where primary_key in (1, 2, 3, 4, ....);

The perl script contains a hash that maps table names to primary key column IDs (to pick up when parsing the raw SQL restore "COPY" script). I will need to purge my secret schema stuff from it before sharing it with anyone.

My solution is probably not perfect, and probably not optimal, but it is working great so far. I'm almost done cleaning up my database and hope to attempt a real UTF-8 restore in the near future.

On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
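The dump-scan-report loop above can be sketched in Python. The original C and Perl tools are not shown in the thread, so the COPY-header parsing and the byte ranges treated as "unclean" here are assumptions, not the author's actual code:

```python
def scan_dump(lines):
    """Scan a plain-text pg_dump stream (as bytes lines) and count, per
    table, the COPY data rows containing bytes outside printable ASCII
    (tab, newline, and carriage return are allowed through).

    Assumes the dump's 'COPY tablename (...) FROM stdin;' header layout,
    with each COPY block terminated by a lone backslash-dot line.
    """
    counts = {}
    table = None
    for raw in lines:
        if raw.startswith(b"COPY "):
            table = raw.split()[1].decode("ascii", "replace")
            continue
        if raw.rstrip(b"\n") == b"\\.":   # end of this table's data
            table = None
            continue
        if table and any(b > 126 or (b < 32 and b not in (9, 10, 13))
                         for b in raw):
            counts[table] = counts.get(table, 0) + 1
    return counts
```

A report in the spirit of the one above would then just iterate over the returned dict and print `-- table count` lines.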
On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

If you know which table and column the data is in, you can also do something like this. (I typed this up without checking the syntax of it. The basic idea is to cast the column as bytea, encode with the 'escape' method, then grep for back-slashes.)

select * from bad_table
where encode(bad_column::bytea, 'escape') ~ '\\\\';
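Why grepping for backslashes works: PostgreSQL's 'escape' encoding passes printable ASCII through untouched and renders everything else as backslash escapes, so any backslash in the output flags a suspect byte. A rough Python model of that encoding (an illustration, not PostgreSQL's exact implementation):

```python
def pg_escape(data: bytes) -> str:
    """Rough model of PostgreSQL's encode(bytea, 'escape'): printable
    ASCII passes through, a backslash is doubled, and every other byte
    becomes a \\nnn octal escape."""
    out = []
    for b in data:
        if b == 0x5C:                # literal backslash -> doubled
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:      # printable ASCII passes through
            out.append(chr(b))
        else:                        # control or high byte -> \nnn octal
            out.append("\\%03o" % b)
    return "".join(out)
```

So a clean ASCII value produces no backslashes at all, while any high or control byte produces one, which is exactly what the query's regexp looks for.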
> If you are interested, I can email you the C and Perl source.
>
> It runs like this:
>
> # time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester | ./bad-ascii-report.pl > unclean-ascii.rpt

http://www.ecoligames.com/~djenkins/pgsql/

Disclaimer: I offer NO warranty. Use at your own risk. The code does minimal error checking (it's a hack / tool for manual use, not reliable production use). The C code compiles cleanly with gcc. The Perl code uses no libraries (just a STDIN -> STDOUT processor). This code should run damn near anywhere.

The code will stay on my web server until I forget about it and re-org stuff in a few weeks, so grab it while you can.
On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?
I hit this problem too, if I remember correctly when trying to upgrade a database from 8.3 to 8.4. I ended up aborting the upgrade, since the upgrade documentation made no mention of this and I didn't have time to dig into it at the time. A tool to find all instances of this would be very helpful.
--
Glenn Maynard
Glenn Maynard wrote:
> On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
>     I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?
>
> I hit this problem too, if I remember correctly when trying to upgrade a database from 8.3 to 8.4. I ended up aborting the upgrade, since the upgrade documentation made no mention of this and I didn't have time to dig into it at the time. A tool to find all instances of this would be very helpful.

I'm about to pipe the ascii output of a database dump through a perl script that removes any unwanted characters. To help define what 'unwanted characters' are, compare the ord() values to the decimal values at http://www.asciitable.com/

while (<>) {
    $_ =~ s/(.)/((ord($1) >= 0)  && (ord($1) <= 8))  ||
                ((ord($1) >= 11) && (ord($1) <= 31)) ||
                 (ord($1) >= 127) ? "" : $1/egs;
    print;
}

Comments would be appreciated.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent the government from wasting the labors of the people under the pretense of taking care of them."
- Thomas Jefferson
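The same per-byte filter (delete 0-8, 11-31, and 127-255, keeping tab and newline) can be expressed with a byte translation table, which avoids running a regexp substitution on every character. A Python sketch of the equivalent filter:

```python
# Bytes to delete: controls below tab, controls between LF and space
# (which also drops CR, as the perl version does), and everything >= 127.
DELETE = bytes(range(0, 9)) + bytes(range(11, 32)) + bytes(range(127, 256))

def clean(data: bytes) -> bytes:
    """Delete the same byte ranges as the perl one-liner above."""
    return data.translate(None, DELETE)
```

Run as a STDIN -> STDOUT filter it would wrap this in a loop over `sys.stdin.buffer`; note that, like the perl version, it silently drops high bytes rather than repairing them, so genuinely meaningful non-ASCII data is lost.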
On Thu, Feb 10, 2011 at 9:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

CREATE OR REPLACE FUNCTION is_utf8(text) RETURNS bool AS $$
try:
    args[0].decode('utf8')
    return True
except UnicodeDecodeError:
    return False
$$ LANGUAGE plpythonu STRICT;

--
marko
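The same check can be run client-side against raw bytes (for example over a dump file) without needing plpythonu installed. A sketch in Python 3, where the decode lives on bytes rather than on str as in the Python 2 function above:

```python
def is_utf8(raw: bytes) -> bool:
    """Client-side equivalent of the plpythonu is_utf8() above:
    True iff the byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With the server-side function installed, finding the bad rows is then just `SELECT * FROM bad_table WHERE NOT is_utf8(bad_column);` — the same predicate this function applies locally.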
On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers <lists@serioustechnology.com> wrote: > comments would be appreciated. > If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool: cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt
Vick Khera wrote:
> On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers <lists@serioustechnology.com> wrote:
>> Comments would be appreciated.
>
> If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool:
>
> cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt

I toyed with tr for a bit, but could not get it to work. The above did not work for me either. Not exactly sure what it's doing, but here's a couple of diff lines:

1619c1619
< days integer DEFAULT 28,
---
> days integer DEFAULT 2,

So it appears 'tr' is deleting the '8' character, rather than the octal value for 008.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent the government from wasting the labors of the people under the pretense of taking care of them."
- Thomas Jefferson
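A likely explanation for the diff above (my reading of tr's escape rules, not something confirmed in the thread): tr's \ooo escape consumes only octal digits, and '8' is not one, so '\008' parses as NUL ('\00') followed by a literal '8' — which is why every '8' was deleted. Decimal 8 is octal 010, so the intended range would be '\000-\010\013-\037\177-\377'. The arithmetic, as a quick Python check:

```python
# '8' is not an octal digit, so tr stops the escape at '\00' (NUL) and
# treats the trailing '8' as a literal character to delete.
assert chr(0o10) == "\x08"    # octal 010 is decimal 8 (backspace)
assert int("010", 8) == 8     # so '\010', not '\008', is the endpoint
```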
On Tue, Feb 15, 2011 at 5:06 PM, Geoffrey Myers <lists@serioustechnology.com> wrote:
> I toyed with tr for a bit, but could not get it to work. The above did not work for me either. Not exactly sure what it's doing, but here's a couple of diff lines:

Check your shell escaping. You may need \\ to protect the \.