Discussion: Encoding change question...


Encoding change question...

From: Karl Denninger
Date:
So I have myself a nice pickle here.....

I've got a database which was originally created with SQL_ASCII for the
encoding (anything-goes text fields).

Unfortunately, I have a bunch of data that was encoded in UTF-8 that's
in an RSS feed that I need to load into said database.  iconv barfs all
over this file when I try to turn it into ISO-8859 (which is what the
web application currently attached to that database is emitting and
collecting).  It appears the problem is (mostly) things like the
stylized double quotes.
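
For reference: if the iconv involved is GNU libiconv or the glibc iconv
(an assumption), its -c flag simply drops characters that have no mapping
in the target encoding, which is at least a quick way to see how much of
the feed is affected.  A rough check, not a fix:

# Compare byte counts with and without the unconvertible characters dropped.
wc -c < rss-marketticker.php
iconv -f UTF-8 -t ISO-8859-1 -c < rss-marketticker.php | wc -c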

So I figured I'd go the other way, and convert what I have now in the
tables into UTF8.

Well, except that doesn't work either.

ticker=# select convert_to(subject, 'utf8') from post where
ordinal='2098167';
                           convert_to
----------------------------------------------------------------
 1%: the interest rate on IBM\222s most recent three-year bond.
(1 row)

\222 is the correct code point for the styled single apostrophe at that
position in the ISO-8859-1 source.  However, the UTF-8 prefix is missing,
as are the other two bytes of the sequence (that is, I got the code point
but not the two bytes that should be in front of it).  And if I set the
code page on the web site to UTF-8, and also set the encoding on the SQL
session to UTF-8, I don't get the three-byte sequence - I just get the
one byte.  That's a bust.

There are TWO fields in this database that need to be converted.  I
presumed I could do something like this:

# update post set field1 = convert_to(field1, 'utf8');

It runs to completion without complaint but produces the above.  No good.
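
A possible explanation, offered as a guess: convert_to() always converts
from the database encoding, and with a SQL_ASCII database that conversion
is a no-op, so the bytes come back untouched.  If the stored text really
is Windows-1252/Latin-1 (an assumption), a sketch that names the source
encoding explicitly with convert() might look like the following; the
outer convert_to/convert_from calls are themselves no-ops in a SQL_ASCII
database and just move between text and bytea, and the whole thing will
error out on any byte that isn't valid WIN1252:

-- Sketch only: assumes field1 currently holds Windows-1252 bytes.
update post
   set field1 = convert_from(
                  convert(convert_to(field1, 'SQL_ASCII'), 'WIN1252', 'UTF8'),
                  'SQL_ASCII');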

So.... is there a way to do this?  I do NOT want to dump, run iconv on
the dumped file, and then reload the database if I can possibly avoid it.
Yes, I know that will work (I've tested it on my development box), but
being able to do this "hot" would be DRAMATICALLY preferred.  If I go the
iconv route I am basically rebuilding the entire database, with all
that's involved in doing so in terms of downtime and such.  VERY
undesirable.

(Equally valid would be a way to convert UTF-8 to ISO-8859-1 (Latin1),
thereby allowing me to convert the incoming data stream to what's
already in the system, but thus far I've found no joy on that at all.)
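
One hedged possibility for that direction, assuming a GNU libiconv or
glibc iconv: appending //TRANSLIT to the target encoding asks iconv to
approximate characters that have no exact Latin-1 equivalent (curly
quotes become plain ones, for instance) instead of failing outright.
The output file name below is just a placeholder:

iconv -f UTF-8 -t ISO-8859-1//TRANSLIT < rss-marketticker.php > rss-latin1.php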

Ideas?

-- Karl



Re: Encoding change question...

From: "Peter C. Lai"
Date:
The double quotes aren't UTF-8; they're from people copying and pasting
from Microsoft stuff, which is Windows-1252. So try that with iconv
instead of UTF-8.

On 2010-08-16 12:40:03PM -0500, Karl Denninger wrote:
> Unfortunately, I have a bunch of data that was encoded in UTF-8 that's
> in an RSS feed that I need to load into said database.  iconv barfs all
> over this file when I try to turn it into ISO-8859 (which is what the
> web application currently attached to that database is emitting and
> collecting).  It appears the problem is (mostly) things like the
> stylized double quotes.

--
===========================================================
Peter C. Lai                 | Bard College at Simon's Rock
Systems Administrator        | 84 Alford Rd.
Information Technology Svcs. | Gt. Barrington, MA 01230 USA
peter AT simons-rock.edu     | (413) 528-7428
===========================================================


Re: Encoding change question...

From: Karl Denninger
Date:
Peter C. Lai wrote:
> The double quotes aren't UTF-8; they're from people copying and pasting
> from Microsoft stuff, which is Windows-1252. So try that with iconv
> instead of UTF-8.

No, the problem is that the existing data in the database and the web
app that uses it are both ISO-8859-1.

The issue is that the data I need to load INTO the system is in UTF-8
(and really is), and is full of three-byte sequences.  iconv barfs all
over it trying to go to ISO-8859-1, so I can't convert the INCOMING data
to what's in the system now.  It also won't convert it to Windows-1252
(which is more or less a superset of ISO-8859, so the server might not
choke on it too badly).

FS/karl:~/tmp> iconv -f UTF-8 -t WINDOWS-1252 <rss-marketticker.php >/dev/null
iconv: (stdin):2766:6125: cannot convert

FS/karl:~/tmp> iconv -f UTF-8 -t ISO-8859-1 <rss-marketticker.php >/dev/null
iconv: (stdin):222:391: cannot convert

The incoming data set is some ~30,000 records.  The data already in the
system is ~2 million records.  Obviously, converting the incoming data
would be the better choice, but it appears I can't.

Thus the desire to go the other way - turn the existing data in the
database into UTF-8, which is probably desirable in the future anyway.

I can run iconv on a pg_dumpall of the database TO UTF-8 and it succeeds
(at least in theory), but the database into which I want to load this
data set is HUGE, and dumping and reloading it isn't on my "A" list of
things to do.
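
For completeness, the dump-and-convert route being avoided would look
roughly like this, with placeholder database and file names, and assuming
the existing data is uniformly Windows-1252/Latin-1:

# Dump, convert the dump to UTF-8, and reload into a fresh UTF8 database.
pg_dump ticker | iconv -f WINDOWS-1252 -t UTF-8 > ticker-utf8.sql
createdb -E UTF8 ticker_utf8
psql -d ticker_utf8 -f ticker-utf8.sql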

If I have to I have to - but if I can avoid this I certainly WANT TO.

What I don't understand is why the Postgres function "convert_to()"
doesn't emit the THREE-BYTE sequence and only emits the code point.
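
A hedged guess at the why: with the database encoding set to SQL_ASCII,
convert_to() has nothing to convert from, so the single byte passes
through as-is.  Spelling out the source encoding with convert() should
yield the full three-byte sequence (\342\200\231), assuming the stored
byte really is the Windows-1252 apostrophe:

-- Sketch: WIN1252 as the source encoding is an assumption.
select convert(convert_to(subject, 'SQL_ASCII'), 'WIN1252', 'UTF8')
  from post where ordinal = '2098167';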


-- Karl
