Обсуждение: converting databases form SQL_ASCII to UTF8
We are moving our databases to new hardware soon, so we felt it would be a good time to get the encoding correct. Our databases are currently SQL_ASCII and we plan to move them to UTF8. So, as previously noted, there are certain characters that won't load into a UTF8 database from a dump of the SQL_ASCII database. Here's our problem. We planned on moving databases a few at a time. Problem is, there is a process that pushes data from one database to another. If this process attempts to push data from a SQL_ASCII database to a new UTF8 database and it has one of these characters mentioned above, the process fails. So, now the question is, is this effort even worth our effort? What is the harm in leaving our databases SQL_ASCII encoded? Thanks for any insights. -- Until later, Geoffrey "I predict future happiness for America if they can prevent the government from wasting the labors of the people under the pretense of taking care of them." - Thomas Jefferson
On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers <lists@serioustechnology.com> wrote:
Here's our problem. We planned on moving databases a few at a time. Problem is, there is a process that pushes data from one database to another. If this process attempts to push data from a SQL_ASCII database to a new UTF8 database and it has one of these characters mentioned above, the process fails.
The database's enforcement of the encoding should be the last layer that does so. Your applications should be enforcing strict utf-8 encoding from start to finish. Once this is done, and the old data already in the DB is properly encoded as utf-8, then there should be no problems switching on the utf-8 encoding in postgres to get that final layer of verification.
On Friday, April 22, 2011 8:00:08 am Geoffrey Myers wrote: > > What is the harm in leaving our databases SQL_ASCII encoded? SQL_ASCII is basically no encoding. The world is slowly but surely moving to Unicode, sooner or later you are going to hit the unknown encoding/Unicode wall. Probably better to get it over with now. > > Thanks for any insights. -- Adrian Klaver adrian.klaver@gmail.com
On Fri, Apr 22, 2011 at 11:16 AM, Geoffrey Myers <geof@serioustechnology.com> wrote:
Totally agree. Still, the question remains, why not leave it as SQL_ASCII?
you have no guarantees that the data stored within is utf-8. that is all. if you can make such guarantees from within your application, then you have some confidence, but never 100%. but it depends on what access you permit to the DB. it is kind of like removing FKs and enforcing them from the application layer. you can do it, but you just don't have a 100% guarantee that the DB is consistent.
Vick Khera wrote: > On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers > <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote: > > Here's our problem. We planned on moving databases a few at a time. > Problem is, there is a process that pushes data from one database to > another. If this process attempts to push data from a SQL_ASCII > database to a new UTF8 database and it has one of these characters > mentioned above, the process fails. > > > The database's enforcement of the encoding should be the last layer that > does so. Your applications should be enforcing strict utf-8 encoding > from start to finish. Once this is done, and the old data already in > the DB is properly encoded as utf-8, then there should be no problems > switching on the utf-8 encoding in postgres to get that final layer of > verification. Totally agree. Still, the question remains, why not leave it as SQL_ASCII? -- Geoffrey Myers Myers Consulting Inc. 770.592.1651
On 04/22/2011 09:16 AM, Geoffrey Myers wrote: > Vick Khera wrote: >> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers >> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote: >> >> Here's our problem. We planned on moving databases a few at a time. >> Problem is, there is a process that pushes data from one database to >> another. If this process attempts to push data from a SQL_ASCII >> database to a new UTF8 database and it has one of these characters >> mentioned above, the process fails. >> >> >> The database's enforcement of the encoding should be the last layer >> that does so. Your applications should be enforcing strict utf-8 >> encoding from start to finish. Once this is done, and the old data >> already in the DB is properly encoded as utf-8, then there should be >> no problems switching on the utf-8 encoding in postgres to get that >> final layer of verification. > > Totally agree. Still, the question remains, why not leave it as SQL_ASCII? > Maybe because you'll have to consistently remember that you're doing non-standard stuff?
On Fri, Apr 22, 2011 at 9:16 AM, Geoffrey Myers <geof@serioustechnology.com> wrote:
Vick Khera wrote:Totally agree. Still, the question remains, why not leave it as SQL_ASCII?The database's enforcement of the encoding should be the last layer that does so. Your applications should be enforcing strict utf-8 encoding from start to finish. Once this is done, and the old data already in the DB is properly encoded as utf-8, then there should be no problems switching on the utf-8 encoding in postgres to get that final layer of verification.
Well, if your data is supposed to be UTF-8 encoded, then any of those characters with invalid encoding in UTF-8 could reasonably be viewed as data errors. Leaving the database in SQL-ASCII allows those errors to continue accumulating, which will make a switch in the future even harder. If the lack of being able to check encoding errors at the database level doesn't bother you, and you're fine with risking bigger pain later in order to avoid pain now, then I see no compelling reason to move away from SQL_ASCII.
-Eric
On 04/22/2011 08:00 AM, Geoffrey Myers wrote: > We are moving our databases to new hardware soon, so we felt it would > be a good time to get the encoding correct. Our databases are > currently SQL_ASCII and we plan to move them to UTF8. We are in the same boat, fortunately only on one older server we are upgrading and fortunately for internal apps.... > > > So, as previously noted, there are certain characters that won't load > into a UTF8 database from a dump of the SQL_ASCII database. > > Here's our problem. We planned on moving databases a few at a time. > Problem is, there is a process that pushes data from one database to > another. If this process attempts to push data from a SQL_ASCII > database to a new UTF8 database and it has one of these characters > mentioned above, the process fails. > > So, now the question is, is this effort even worth our effort? > > What is the harm in leaving our databases SQL_ASCII encoded? SQL_ASCII is a synonym for "no encoding." You put in a stream of bytes and that's what you get out. That's OK if the byte-stream has exactly the same meaning to every application and user. If that's not the case then you have bytes in your database but you don't know what those bytes are supposed to represent. In a way, it's like having a generic integer column but depending on the user or the application, that column might represent a unix epoch timestamp, an age in years, a salary, a weight in grams, furlongs per fortnight, etc. And there is no indicator to say which it is. Not good. We are in the final stages of cleaning up our last bit of non-utf8 data and the above some what silly example is actually not far from the truth. Due to data that arrived from web-inputs, spreadsheet imports, command-line, internal-apps, etc. we have been faced with cleaning tables where one row has only basic ASCII data, another has UTF8, while others have various different Microsoft encodings. With data like that it's pretty much impossible to guarantee that even a simple web-report will display fully correctly. The longer you wait, the worse it gets. Even though it is only a tiny percentage of our data, cleaning it is still a pain. Is it worth it? Dunno - you have to calculate the cost/benefit. For us it was a no-brainer to bite-the-bullet and do it. Cheers, Steve
Geoffrey Myers <lists@serioustechnology.com> writes: > So, now the question is, is this effort even worth our effort? > What is the harm in leaving our databases SQL_ASCII encoded? You're declaring bankruptcy on being able to make any sense of the data you stored. Is that really what you think you need? For converting, you might be interested into those two blog entries: http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_1.html http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 2011-04-22, Geoffrey Myers <geof@serioustechnology.com> wrote: > Vick Khera wrote: >> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers >> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote: >> >> Here's our problem. We planned on moving databases a few at a time. >> Problem is, there is a process that pushes data from one database to >> another. If this process attempts to push data from a SQL_ASCII >> database to a new UTF8 database and it has one of these characters >> mentioned above, the process fails. >> >> >> The database's enforcement of the encoding should be the last layer that >> does so. Your applications should be enforcing strict utf-8 encoding >> from start to finish. Once this is done, and the old data already in >> the DB is properly encoded as utf-8, then there should be no problems >> switching on the utf-8 encoding in postgres to get that final layer of >> verification. > > Totally agree. Still, the question remains, why not leave it as SQL_ASCII? perhaps you want sorted output in some locale other than 'C'? or maybe want to take a substring in the database... utf8 in SQL-ASCII is just a string of octets utf8 in a utf8 database is a string of unicode characters. -- ⚂⚃ 100% natural
Jasen Betts wrote: > On 2011-04-22, Geoffrey Myers <geof@serioustechnology.com> wrote: >> Vick Khera wrote: >>> On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers >>> <lists@serioustechnology.com <mailto:lists@serioustechnology.com>> wrote: >>> >>> Here's our problem. We planned on moving databases a few at a time. >>> Problem is, there is a process that pushes data from one database to >>> another. If this process attempts to push data from a SQL_ASCII >>> database to a new UTF8 database and it has one of these characters >>> mentioned above, the process fails. >>> >>> >>> The database's enforcement of the encoding should be the last layer that >>> does so. Your applications should be enforcing strict utf-8 encoding >>> from start to finish. Once this is done, and the old data already in >>> the DB is properly encoded as utf-8, then there should be no problems >>> switching on the utf-8 encoding in postgres to get that final layer of >>> verification. >> Totally agree. Still, the question remains, why not leave it as SQL_ASCII? > > perhaps you want sorted output in some locale other than 'C'? > or maybe want to take a substring in the database... > > utf8 in SQL-ASCII is just a string of octets > > utf8 in a utf8 database is a string of unicode characters. We finally have a solution in place. A bug in my code was making the problem bigger then it really is. Gotta love those bugs. -- Until later, Geoffrey "I predict future happiness for America if they can prevent the government from wasting the labors of the people under the pretense of taking care of them." - Thomas Jefferson