Обсуждение: converting databases form SQL_ASCII to UTF8

Поиск
Список
Период
Сортировка

converting databases form SQL_ASCII to UTF8

От
Geoffrey Myers
Дата:
We are moving our databases to new hardware soon, so we felt it would be
a good time to get the encoding correct.  Our databases are currently
SQL_ASCII and we plan to move them to UTF8.

So, as previously noted, there are certain characters that won't load
into a UTF8 database from a dump of the SQL_ASCII database.

Here's our problem.  We planned on moving databases a few at a time.
Problem is, there is a process that pushes data from one database to
another.  If this process attempts to push data from a SQL_ASCII
database to a new UTF8 database and it has one of these characters
mentioned above, the process fails.

So, now the question is, is this effort even worth our effort?

What is the harm in leaving our databases SQL_ASCII encoded?

Thanks for any insights.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Re: converting databases form SQL_ASCII to UTF8

От
Vick Khera
Дата:
On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers <lists@serioustechnology.com> wrote:
Here's our problem.  We planned on moving databases a few at a time. Problem is, there is a process that pushes data from one database to another.  If this process attempts to push data from a SQL_ASCII database to a new UTF8 database and it has one of these characters mentioned above, the process fails.

The database's enforcement of the encoding should be the last layer that does so.  Your applications should be enforcing strict utf-8 encoding from start to finish.  Once this is done, and the old data already in the DB is properly encoded as utf-8, then there should be no problems switching on the utf-8 encoding in postgres to get that final layer of verification.

Re: converting databases form SQL_ASCII to UTF8

От
Adrian Klaver
Дата:
On Friday, April 22, 2011 8:00:08 am Geoffrey Myers wrote:

>
> What is the harm in leaving our databases SQL_ASCII encoded?

SQL_ASCII is basically no encoding. The world is slowly but surely moving to
Unicode, sooner or later you are going to hit the unknown encoding/Unicode wall.
Probably better to get it over with now.

>
> Thanks for any insights.

--
Adrian Klaver
adrian.klaver@gmail.com

Re: converting databases form SQL_ASCII to UTF8

От
Vick Khera
Дата:
On Fri, Apr 22, 2011 at 11:16 AM, Geoffrey Myers <geof@serioustechnology.com> wrote:
Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?

you have no guarantees that the data stored within is utf-8. that is all.  if you can make such guarantees from within your application, then you have some confidence, but never 100%.  but it depends on what access you permit to the DB. it is kind of like removing FKs and enforcing them from the application layer.  you can do it, but you just don't have a 100% guarantee that the DB is consistent.

Re: converting databases form SQL_ASCII to UTF8

От
Geoffrey Myers
Дата:
Vick Khera wrote:
> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers
> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote:
>
>     Here's our problem.  We planned on moving databases a few at a time.
>     Problem is, there is a process that pushes data from one database to
>     another.  If this process attempts to push data from a SQL_ASCII
>     database to a new UTF8 database and it has one of these characters
>     mentioned above, the process fails.
>
>
> The database's enforcement of the encoding should be the last layer that
> does so.  Your applications should be enforcing strict utf-8 encoding
> from start to finish.  Once this is done, and the old data already in
> the DB is properly encoded as utf-8, then there should be no problems
> switching on the utf-8 encoding in postgres to get that final layer of
> verification.

Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?

--
Geoffrey Myers
Myers Consulting Inc.
770.592.1651

Re: converting databases form SQL_ASCII to UTF8

От
Rob Sargent
Дата:

On 04/22/2011 09:16 AM, Geoffrey Myers wrote:
> Vick Khera wrote:
>> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers
>> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote:
>>
>> Here's our problem. We planned on moving databases a few at a time.
>> Problem is, there is a process that pushes data from one database to
>> another. If this process attempts to push data from a SQL_ASCII
>> database to a new UTF8 database and it has one of these characters
>> mentioned above, the process fails.
>>
>>
>> The database's enforcement of the encoding should be the last layer
>> that does so. Your applications should be enforcing strict utf-8
>> encoding from start to finish. Once this is done, and the old data
>> already in the DB is properly encoded as utf-8, then there should be
>> no problems switching on the utf-8 encoding in postgres to get that
>> final layer of verification.
>
> Totally agree. Still, the question remains, why not leave it as SQL_ASCII?
>
Maybe because you'll have to consistently remember that you're doing
non-standard stuff?

Re: converting databases form SQL_ASCII to UTF8

От
Eric McKeeth
Дата:
On Fri, Apr 22, 2011 at 9:16 AM, Geoffrey Myers <geof@serioustechnology.com> wrote:
Vick Khera wrote:

The database's enforcement of the encoding should be the last layer that does so.  Your applications should be enforcing strict utf-8 encoding from start to finish.  Once this is done, and the old data already in the DB is properly encoded as utf-8, then there should be no problems switching on the utf-8 encoding in postgres to get that final layer of verification.

Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?


Well, if your data is supposed to be UTF-8 encoded, then any of those characters with invalid encoding in UTF-8 could reasonably be viewed as data errors. Leaving the database in SQL-ASCII allows those errors to continue accumulating, which will make a switch in the future even harder. If the lack of being able to check encoding errors at the database level doesn't bother you, and you're fine with risking bigger pain later in order to avoid pain now, then I see no compelling reason to move away from SQL_ASCII.

-Eric

Re: converting databases form SQL_ASCII to UTF8

От
Steve Crawford
Дата:
On 04/22/2011 08:00 AM, Geoffrey Myers wrote:
> We are moving our databases to new hardware soon, so we felt it would
> be a good time to get the encoding correct.  Our databases are
> currently SQL_ASCII and we plan to move them to UTF8.
We are in the same boat, fortunately only on one older server we are
upgrading and fortunately for internal apps....
>
>
> So, as previously noted, there are certain characters that won't load
> into a UTF8 database from a dump of the SQL_ASCII database.
>
> Here's our problem.  We planned on moving databases a few at a time.
> Problem is, there is a process that pushes data from one database to
> another.  If this process attempts to push data from a SQL_ASCII
> database to a new UTF8 database and it has one of these characters
> mentioned above, the process fails.
>
> So, now the question is, is this effort even worth our effort?
>
> What is the harm in leaving our databases SQL_ASCII encoded?

SQL_ASCII is a synonym for "no encoding." You put in a stream of bytes
and that's what you get out. That's OK if the byte-stream has exactly
the same meaning to every application and user. If that's not the case
then you have bytes in your database but you don't know what those bytes
are supposed to represent.

In a way, it's like having a generic integer column but depending on the
user or the application, that column might represent a unix epoch
timestamp, an age in years, a salary, a weight in grams, furlongs per
fortnight, etc. And there is no indicator to say which it is. Not good.

We are in the final stages of cleaning up our last bit of non-utf8 data
and the above some what silly example is actually not far from the
truth. Due to data that arrived from web-inputs, spreadsheet imports,
command-line, internal-apps, etc. we have been faced with cleaning
tables where one row has only basic ASCII data, another has UTF8, while
others have various different Microsoft encodings. With data like that
it's pretty much impossible to guarantee that even a simple web-report
will display fully correctly.

The longer you wait, the worse it gets. Even though it is only a tiny
percentage of our data, cleaning it is still a pain.

Is it worth it? Dunno - you have to calculate the cost/benefit. For us
it was a no-brainer to bite-the-bullet and do it.

Cheers,
Steve


Re: converting databases form SQL_ASCII to UTF8

От
Dimitri Fontaine
Дата:
Geoffrey Myers <lists@serioustechnology.com> writes:
> So, now the question is, is this effort even worth our effort?
> What is the harm in leaving our databases SQL_ASCII encoded?

You're declaring bankruptcy on being able to make any sense of the data
you stored.  Is that really what you think you need?

For converting, you might be interested into those two blog entries:

  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_1.html
  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: converting databases form SQL_ASCII to UTF8

От
Jasen Betts
Дата:
On 2011-04-22, Geoffrey Myers <geof@serioustechnology.com> wrote:
> Vick Khera wrote:
>> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers
>> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote:
>>
>>     Here's our problem.  We planned on moving databases a few at a time.
>>     Problem is, there is a process that pushes data from one database to
>>     another.  If this process attempts to push data from a SQL_ASCII
>>     database to a new UTF8 database and it has one of these characters
>>     mentioned above, the process fails.
>>
>>
>> The database's enforcement of the encoding should be the last layer that
>> does so.  Your applications should be enforcing strict utf-8 encoding
>> from start to finish.  Once this is done, and the old data already in
>> the DB is properly encoded as utf-8, then there should be no problems
>> switching on the utf-8 encoding in postgres to get that final layer of
>> verification.
>
> Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?

perhaps you want sorted output in some locale other than 'C'?
or maybe want to take a substring in the database...

utf8 in SQL-ASCII is just a string of octets

utf8 in a utf8 database is a string of unicode characters.

--
⚂⚃ 100% natural

Re: converting databases form SQL_ASCII to UTF8

От
Geoffrey Myers
Дата:
Jasen Betts wrote:
> On 2011-04-22, Geoffrey Myers <geof@serioustechnology.com> wrote:
>> Vick Khera wrote:
>>> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers
>>> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote:
>>>
>>>     Here's our problem.  We planned on moving databases a few at a time.
>>>     Problem is, there is a process that pushes data from one database to
>>>     another.  If this process attempts to push data from a SQL_ASCII
>>>     database to a new UTF8 database and it has one of these characters
>>>     mentioned above, the process fails.
>>>
>>>
>>> The database's enforcement of the encoding should be the last layer that
>>> does so.  Your applications should be enforcing strict utf-8 encoding
>>> from start to finish.  Once this is done, and the old data already in
>>> the DB is properly encoded as utf-8, then there should be no problems
>>> switching on the utf-8 encoding in postgres to get that final layer of
>>> verification.
>> Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?
>
> perhaps you want sorted output in some locale other than 'C'?
> or maybe want to take a substring in the database...
>
> utf8 in SQL-ASCII is just a string of octets
>
> utf8 in a utf8 database is a string of unicode characters.


We finally have a solution in place. A bug in my code was making the
problem bigger then it really is.  Gotta love those bugs.


--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson