Re: [SPAM]-D] How to find broken UTF-8 characters ?

Поиск
Список
Период
Сортировка
От Andreas
Тема Re: [SPAM]-D] How to find broken UTF-8 characters ?
Дата
Msg-id 4BD8F038.6060601@gmx.net
обсуждение исходный текст
Ответ на How to find broken UTF-8 characters ?  (Andreas <maps.on@gmx.net>)
Ответы Re: [SPAM]-D] How to find broken UTF-8 characters ?  (Justin Graf <justin@magwerks.com>)
Список pgsql-sql
Hi,

while writing the reply below I found it sounds like beeing OT but it's 
actually not.
I just need a way to check if a collumn contains values that CAN NOT be 
converted from Utf8 to Latin1.
I tried:
Select convert_to (my_column::text, 'LATIN1') from my_table;

It raises an error that says translated:
ERROR:  character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«

I'd like to select all those records.
When I know which record has faulty content I can correct it.

If this is really OT on the SQL list then please tell me where to ask.


Am 28.04.2010 15:18, schrieb Justin Graf:
> On 4/26/2010 8:41 AM, Andreas wrote:
>>>> How can I get rid of them?
>>> iconv -c
>> AFAIK iconv would translate on file system level but I would think 
>> that messed up a allready messed up Excel workmap even further.
>> I'd be glad to handle csv, too.
>
> I would look at a macro/script to have excel dump the data out in CSV 
> then move data to into Postgres

It's like this.
I load the spreadsheet into an Access-DB and let a VBA skript stuff the 
data into PG via ADODB/ODBC.
Often I have to clean up more obvious things than obscure characters or 
amend the info out of other sources before I can upload it to PG.

>
> Now these are not illegal UTF chars.  If those values where wacky  
> Postgresql would not have allowed you insert the record.
> Ô = utf code 212, Ç = utf code 199, Ä = utf code 196

Those are even in Latin1.
They were only 1 example. I suppose where I find them the 3 codes form a 
multibyte code that can't be displayd or don't get displayd as a usual 
letter but some symbol or asian-looking thing which definately doesn't 
belong there.
I saw occasionally that such a wacky symbol replaced some other signes 
that are language specific like ä, ö, ü. Then the next sign is missing 
too, so something is mixing up the encoding and combines 2 chars into 1 
utf8-code.

> To force a string into a specific encoding  we have the Covert, 
> Convert_From and  Cover_to  see section 9.5 in the help files

The problem is, that pgAdmin complains those signes aren't convertible 
and drops the whole record out of the result of the select that I'd like 
to dump into a csv.


> Select covert('MyUtf8', 'UTF8', 'LATIN')
> or
> Select covert_to('MyUtf8',  'LATIN')

I found them before but didn't understand their output.
e.g.
Select convert('1aäßx', 'utf8', 'LATIN1') ;
Result = "1a\344\337x"
so it translated  ä = 344  and  ß = 337. The other 3 are just as they 
were before.
How can this be valid in a single byte charset like Latin1?
Especially as ä, ß are E4 and DF.
Why do they come out as escaped codes when they are in Latin1 aswell as 
1, a and x?

> What ever pg client library used to move Excel data to PG my have 
> incorrectly converted some of the data or moved formatting information 
> into the database.  I have seen Access and Excel do mightily odd 
> things when connecting to DB's  I don't know about current versions 
> but 2000 and 2003 Excels did really stupid things when trying to write 
> to DB's including MSSQL.

Cute ... we use Access 2000 and 2003   :(




В списке pgsql-sql по дате отправления:

Предыдущее
От: Gary Chambers
Дата:
Сообщение: Re: Inserting Multiple Random Rows
Следующее
От: Edward Ross
Дата:
Сообщение: problem converting strings to timestamps with time zone