Discussion: How to find broken UTF-8 characters?
Hi, I regularly have to import from Excel files that hold rather simple text and number columns. That works. Occasionally there are unwanted special characters at the end of text columns that Excel and pgAdmin either show as an upward arrow with a short leg on top, bent 90° to the right, or that are invisible in UTF-8 aware programs, or that get displayed as ? by Access. A text viewer shows "ÔÇÄ", or E2 80 8E in hex, for the invisible thingy. My database is Unicode so it doesn't mind those freak chars. The problem arises when I need to export those records to CSV with pgAdmin. pgAdmin complains about not being able to store those lines in the local charset.

How can I find those broken UTF-8 characters?
How can I get rid of them?
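A quick Python sketch (assuming only the bytes quoted above) shows what that invisible character actually is, and why the Latin-1 export trips over it:

```python
import unicodedata

# The invisible "thingy": bytes E2 80 8E from the post, decoded as UTF-8.
invisible = bytes([0xE2, 0x80, 0x8E]).decode("utf-8")
print(hex(ord(invisible)))          # 0x200e
print(unicodedata.name(invisible))  # LEFT-TO-RIGHT MARK

# It has no Latin-1 equivalent, which is exactly what the export trips over:
try:
    invisible.encode("latin-1")
except UnicodeEncodeError as e:
    print(e.reason)
```

So the character is U+200E LEFT-TO-RIGHT MARK, an invisible Unicode formatting character that indeed cannot be stored in a Latin-1 charset.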
On 04/26/10 04:12, Andreas wrote:
> Excel files ... pgAdmin ... Access.

Looks like complete offtopic.

> How can I find those broken UTF-8 characters?
> How can I get rid of them?

iconv -c

BUT you should not have those characters at all. If one occurred, it is most probably an error, AND you should get rid of the error itself -- not of its consequences.
On 26.04.2010 12:12, silly sad wrote:
> On 04/26/10 04:12, Andreas wrote:
>
> Looks like complete offtopic.

Not anymore. The bad characters are in the DB now. I'd need some command that filters somehow for inconvertible (Unicode --> local charset) data. How can I find those Unicode characters that already sneaked in? Actually there shouldn't be anything within the tables that NEEDS to be encoded in Unicode.

Something like
SELECT * FROM tab_1 WHERE field_x <> ConvertToLocal(field_x)
might be a good start.

>> How can I get rid of them?
> iconv -c

AFAIK iconv would translate on file system level, but I would think that messed up an already messed-up Excel workbook even further. I'd be glad to handle CSV, too.

> BUT
> you should not have those characters at all
> if one occurred, it is most probably an error

Sure, but those files hit me through a chain of people who consider it OK to convert data across numerous file formats, cut, edit, save as X, send per mail ... then they hit me and I am the one to clean up.

> AND
> you should get rid of the error itself -- not of its consequences.

Like quitting the job and growing flowers instead? I'll consider this. ;)
>>> How can I get rid of them?
>> iconv -c
> AFAIK iconv would translate on file system level, but I would think that
> messed up an already messed-up Excel workbook even further.
> I'd be glad to handle CSV, too.

pg_dump | iconv -c | psql
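What `iconv -c` does in that pipeline can be sketched in Python (the helper name is mine): it converts to the target charset and silently drops every character that has no equivalent, rather than failing.

```python
def iconv_c(text: str, target: str = "latin-1") -> str:
    """Mimic `iconv -c` to LATIN1: drop unconvertible characters."""
    # errors="ignore" discards characters with no Latin-1 equivalent,
    # the same lossy behavior as iconv's -c flag.
    return text.encode(target, errors="ignore").decode(target)

print(iconv_c("dash \u2013 and mark \u200e here"))
```

Note this is lossy by design: the en dash and the left-to-right mark simply vanish from the output, which is fine for cleanup but destroys information.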
Hi,

while writing the reply below I found it sounds like being OT, but it's actually not. I just need a way to check if a column contains values that CANNOT be converted from UTF8 to Latin1.

I tried:
Select convert_to(my_column::text, 'LATIN1') from my_table;

It raises an error that says, translated:
ERROR: character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«

I'd like to select all those records. When I know which record has faulty content, I can correct it.

If this is really OT on the SQL list, then please tell me where to ask.

On 28.04.2010 15:18, Justin Graf wrote:
> On 4/26/2010 8:41 AM, Andreas wrote:
>>>> How can I get rid of them?
>>> iconv -c
>> AFAIK iconv would translate on file system level, but I would think
>> that messed up an already messed-up Excel workbook even further.
>> I'd be glad to handle CSV, too.
>
> I would look at a macro/script to have Excel dump the data out as CSV,
> then move the data into Postgres.

It's like this. I load the spreadsheet into an Access DB and let a VBA script stuff the data into PG via ADODB/ODBC. Often I have to clean up more obvious things than obscure characters, or amend the info from other sources, before I can upload it to PG.

> Now these are not illegal UTF chars. If those values were wacky,
> PostgreSQL would not have allowed you to insert the record.
> Ô = utf code 212, Ç = utf code 199, Ä = utf code 196

Those are even in Latin1. They were only one example. I suppose where I find them, the 3 codes form a multibyte sequence that can't be displayed, or doesn't get displayed as a usual letter but as some symbol or Asian-looking glyph which definitely doesn't belong there. I saw occasionally that such a wacky symbol replaced some other signs that are language specific, like ä, ö, ü. Then the next character is missing too, so something is mixing up the encoding and combines 2 chars into 1 UTF-8 code.
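For reference, 0xe28093 from that error message is the UTF-8 encoding of U+2013 (EN DASH). The record filter being asked for can be sketched client-side in Python (the helper name and sample rows are made up for illustration):

```python
# Verify the byte sequence from the error message:
assert bytes([0xE2, 0x80, 0x93]).decode("utf-8") == "\u2013"  # EN DASH

def fails_latin1(text: str) -> bool:
    """True if convert_to(text, 'LATIN1') would raise in PostgreSQL."""
    try:
        text.encode("latin-1")
        return False
    except UnicodeEncodeError:
        return True

# Filter sample rows the way the wanted SELECT would:
rows = [(1, "plain ASCII"), (2, "umlauts äöü"), (3, "a \u2013 dash")]
print([pk for pk, txt in rows if fails_latin1(txt)])  # [3]
```

Note that row 2 passes: umlauts are perfectly valid Latin-1, so only characters outside the Latin-1 repertoire get flagged.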
> To force a string into a specific encoding we have the Convert,
> Convert_From and Convert_to functions, see section 9.5 in the help files.

The problem is that pgAdmin complains those signs aren't convertible and drops the whole record out of the result of the select that I'd like to dump into a CSV.

> Select convert('MyUtf8', 'UTF8', 'LATIN1')
> or
> Select convert_to('MyUtf8', 'LATIN1')

I found them before but didn't understand their output. E.g.
Select convert('1aäßx', 'utf8', 'LATIN1');
Result = "1a\344\337x"
so it translated ä = 344 and ß = 337. The other 3 are just as they were before. How can this be valid in a single-byte charset like Latin1? Especially as ä, ß are E4 and DF. Why do they come out as escaped codes when they are in Latin1 as well as 1, a and x?

> Whatever pg client library used to move Excel data to PG may have
> incorrectly converted some of the data or moved formatting information
> into the database. I have seen Access and Excel do mightily odd
> things when connecting to DBs. I don't know about current versions,
> but the 2000 and 2003 Excels did really stupid things when trying to
> write to DBs, including MSSQL.

Cute ... we use Access 2000 and 2003 :(
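A possible explanation (my reading, worth verifying against the docs): convert() returns bytea, and the classic bytea output format shows non-printable and non-ASCII bytes as OCTAL escapes. \344 and \337 are octal for 0xE4 (ä) and 0xDF (ß), so the conversion actually worked. A Python sketch:

```python
# Octal 344 and 337 are exactly the Latin-1 bytes E4 and DF:
assert int("344", 8) == 0xE4 and int("337", 8) == 0xDF

latin1 = "1aäßx".encode("latin-1")
print(latin1)  # b'1a\xe4\xdfx'

# Reproduce the escaped form the post reports, octal-escaping every byte
# outside the printable ASCII range:
escaped = "".join(chr(b) if 0x20 <= b < 0x7F else f"\\{b:03o}"
                  for b in latin1)
print(escaped)  # 1a\344\337x
```

So "1a\344\337x" is a perfectly valid 5-byte Latin-1 string; only its display is escaped, because bytea output has no charset and plays it safe with anything non-ASCII.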
On 4/28/2010 10:34 PM, Andreas wrote:
> Hi,
>
> while writing the reply below I found it sounds like being OT, but it's actually not.
> I just need a way to check if a column contains values that CANNOT be converted from UTF8 to Latin1.
> I tried:
> Select convert_to(my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says, translated:
> ERROR: character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«
>
> I'd like to select all those records.
> When I know which record has faulty content, I can correct it.
>
> If this is really OT on the SQL list, then please tell me where to ask.

That's easy enough. You need to write an Update statement using a regular expression to replace all non-legal Latin/ASCII chars:
http://www.postgresql.org/docs/8.4/interactive/functions-string.html

The command is regexp_replace('MyBadString', 'SearchForAllNonAsciiChars', 'ReplaceWithBlankString').

I'm pretty sure this is the regular expression to find all non-ASCII chars: [^\x00-\xFF]

To test it, try:
Select regexp_replace(MyColumn, '[^\x00-\xFF]', ' ') from screweduptable

If the regular expression does not work, I'm dry when it comes to regular expressions. Did I say I hate regular expressions?
They're damn near impossible to write. Once you get the expression right and working, the Update is straightforward:
Update mytable set mybadcolumn = regexp_replace(mybadcolumn, '[^\x00-\xFF]', ' ')

> Select convert('MyUtf8', 'UTF8', 'LATIN1')
> or
> Select convert_to('MyUtf8', 'LATIN1')
>
> I found them before but didn't understand their output. E.g.
> Select convert('1aäßx', 'utf8', 'LATIN1');
> Result = "1a\344\337x"
> so it translated ä = 344 and ß = 337. The other 3 are just as they were before.
> How can this be valid in a single-byte charset like Latin1?
> Especially as ä, ß are E4 and DF.
> Why do they come out as escaped codes when they are in Latin1 as well as 1, a and x?

Someone with more knowledge of how convert() works is going to have to explain why they have been escaped. pgAdmin may have escaped them. But those characters are valid Latin1 characters:

http://en.wikipedia.org/wiki/%C3%84
http://en.wikipedia.org/wiki/%C3%9F
ß = latin Beta

It seems Access and Excel are putting extra bits of data into the field. In the past I had to change inserts/updates from Access so it would send data in a specific char encoding. I had problems where Access was using a Windows encoding, the ODBC driver converted it to Latin1, and the MSSQL database put it in UTF8. It was no fun cleaning it up.
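The suggested cleanup can be sketched client-side with Python's re module (the helper name is made up). Note the character class here starts at \x01 rather than \x00, since NUL bytes are problematic in strings:

```python
import re

def strip_non_latin1(text: str) -> str:
    """Replace every character with no Latin-1 equivalent by a space,
    like the proposed regexp_replace(..., '[^...]', ' ') Update."""
    # Codepoints 1..255 are kept; everything above U+00FF is blanked out.
    return re.sub("[^\x01-\xff]", " ", text)

print(strip_non_latin1("a\u200eb \u2013 c"))  # 'a b   c'
```

Running the replacement first on a SELECT (rather than an Update) is a cheap way to preview what would change before destroying any data.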
>> Whatever pg client library used to move Excel data to PG may have
>> incorrectly converted some of the data or moved formatting information
>> into the database. I have seen Access and Excel do mightily odd
>> things when connecting to DBs. I don't know about current versions,
>> but the 2000 and 2003 Excels did really stupid things when trying to
>> write to DBs, including MSSQL.
>
> Cute ... we use Access 2000 and 2003 :(

Been more screwed by Excel and Access flakiness, and by Access programmers thinking they are DBAs.
On 2010-04-29, Andreas <maps.on@gmx.net> wrote:
> Hi,
>
> while writing the reply below I found it sounds like being OT, but it's
> actually not.
> I just need a way to check if a column contains values that CANNOT be
> converted from UTF8 to Latin1.
> I tried:
> Select convert_to(my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says, translated:
> ERROR: character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«

Use a regular expression. ISO8859-1 is easy: all the characters are grouped together in Unicode, so the regular expression consists of a single inverted range class.

SELECT pkey FROM tabname WHERE ( textfield || textfield2 || textfield3 ) ~ ('[^'||chr(1)||'-'||chr(255)||']');
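Why a single inverted range suffices can be checked in Python: Latin-1 occupies exactly the first 256 Unicode codepoints, in order, so "has no Latin-1 equivalent" reduces to a codepoint comparison (a sketch; the helper and sample table are hypothetical):

```python
def has_non_latin1(*fields: str) -> bool:
    r"""Mimic (f1 || f2 || f3) ~ '[^\x01-\xff]' from the post."""
    return any(ord(c) > 0xFF or ord(c) == 0 for c in "".join(fields))

# Every Latin-1 codepoint round-trips through the encoding unchanged,
# which is what makes the single range class correct:
assert all(chr(i).encode("latin-1")[0] == i for i in range(256))

table = [(1, "ok", "äöü"), (2, "bad", "\u2013")]
print([pk for pk, a, b in table if has_non_latin1(a, b)])  # [2]
```

Concatenating the text fields, as the SQL does, checks them all in one pass; the SELECT then returns only the primary keys of rows that need fixing.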
On 2010-04-29, Justin Graf <justin@magwerks.com> wrote:
> I'm pretty sure this is the regular expression to find all non-ASCII
> chars: [^\x00-\xFF]

Not in postgres. \x00 does not work well in strings, and \xFF is invalid UTF-8. This is why I used chr(). (Also, ASCII is undefined past \x7F ... but the original request was for LATIN-1, which does end at chr(255).)