Обсуждение: Dumping in LATIN1 and restoring in UTF-8
Hi all. Here is my use case: I've an application which uses PostgreSQL as backend. Up to now, the database was encoded in SQL_ASCII or LATIN1. Now, we need to migrate to UTF-8. What we tried, was to: 1) dump the database using pg_dump, in tar format (we had blob); 2) modifying the result, using some conversion tool (like recode) 3) destroying the old database 4) recreating the database with UNICODE setting 5) restoring the database using pg_restore The result was not what I expected. The pg_restore was using the LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded in UTF-8... The problem lied in the toc.dat file, which stated that the client encoding was LATIN1, instead of UTF-8. The solution in the end has been to manually modifying the toc.dat file, substituting the LATIN1 string with UTF-8 (plus a space, since the toc.dat is a binary file). Even though it worked for us, I wonder if there is any other way to accomplish the same result, at least to specify the encoding for the restore. Regards Marco -- Marco Bizzarri http://notenotturne.blogspot.com/
Marco Bizzarri schrieb: > Hi all. > > Here is my use case: I've an application which uses PostgreSQL as > backend. Up to now, the database was encoded in SQL_ASCII or LATIN1. > Now, we need to migrate to UTF-8. > > What we tried, was to: > > 1) dump the database using pg_dump, in tar format (we had blob); > 2) modifying the result, using some conversion tool (like recode) > > > 3) destroying the old database > 4) recreating the database with UNICODE setting > 5) restoring the database using pg_restore > > The result was not what I expected. The pg_restore was using the > LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded > in UTF-8... > > The problem lied in the toc.dat file, which stated that the client > encoding was LATIN1, instead of UTF-8. > > The solution in the end has been to manually modifying the toc.dat > file, substituting the LATIN1 string with UTF-8 (plus a space, since > the toc.dat is a binary file). > > Even though it worked for us, I wonder if there is any other way to > accomplish the same result, at least to specify the encoding for the > restore. Yes, its actually quite esay: you dump as you feel apropriate, then create the database with the encoding you want, restore w/o creating database and you are done. Restore sets the client encoding to what it actually was in the dump data (in your case latin-1) and the database would be utf-8 - postgres automatically recodes. No need for iconv and friends. Regards Tino
On 7/6/06, Tino Wildenhain <tino@wildenhain.de> wrote: > Marco Bizzarri schrieb: > > Hi all. > > > > Here is my use case: I've an application which uses PostgreSQL as > > backend. Up to now, the database was encoded in SQL_ASCII or LATIN1. > > Now, we need to migrate to UTF-8. > > > > What we tried, was to: > > > > 1) dump the database using pg_dump, in tar format (we had blob); > > 2) modifying the result, using some conversion tool (like recode) > > > > > > 3) destroying the old database > > 4) recreating the database with UNICODE setting > > 5) restoring the database using pg_restore > > > > The result was not what I expected. The pg_restore was using the > > LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded > > in UTF-8... > > > > The problem lied in the toc.dat file, which stated that the client > > encoding was LATIN1, instead of UTF-8. > > > > The solution in the end has been to manually modifying the toc.dat > > file, substituting the LATIN1 string with UTF-8 (plus a space, since > > the toc.dat is a binary file). > > > > Even though it worked for us, I wonder if there is any other way to > > accomplish the same result, at least to specify the encoding for the > > restore. > > Yes, its actually quite esay: you dump as you feel apropriate, > then create the database with the encoding you want, > restore w/o creating database and you are done. > Restore sets the client encoding to what it actually was > in the dump data (in your case latin-1) and the database > would be utf-8 - postgres automatically recodes. No need > for iconv and friends. > > Regards > Tino > First of all, thank you for your answer. However, I suspect I did not understand your answer, since the commands I used were: 1) pg_dump -Ft -b -f dump.sql.tar database 2) dropdb database 3) createdb -E UNICODE database 4) pg_restore -d database dump.sql.tar According to my experience, this produces a "double encoding". As you can see, I hand-created the database, with the proper encoding. However, when I reimported the database, the result was a latin1 encoded in utf-8, rather than a pure utf-8. How my procedure was different with respect to yours? I will make some test with a sample database, and enabling the logging, so that I can understand the commands which are issued. Regards Marco -- Marco Bizzarri http://notenotturne.blogspot.com/
... >> Yes, its actually quite esay: you dump as you feel apropriate, >> then create the database with the encoding you want, >> restore w/o creating database and you are done. >> Restore sets the client encoding to what it actually was >> in the dump data (in your case latin-1) and the database >> would be utf-8 - postgres automatically recodes. No need >> for iconv and friends. >> >> Regards >> Tino >> > > First of all, thank you for your answer. However, I suspect I did not > understand your answer, since the commands I used were: > > 1) pg_dump -Ft -b -f dump.sql.tar database > 2) dropdb database > 3) createdb -E UNICODE database > 4) pg_restore -d database dump.sql.tar > > According to my experience, this produces a "double encoding". As you > can see, I hand-created the database, with the proper encoding. > However, when I reimported the database, the result was a latin1 > encoded in utf-8, rather than a pure utf-8. > > How my procedure was different with respect to yours? That was the correct way. I wonder if you have recoding support enabled? Did you build postgres yourself? Latin-1 double encoded into utf-8 seems not like possible... utf-8 barfs on most latin-1 characters, current 8.1 is very picky about it. So maybe you can work with a small test table to find out what's going wrong here. (The changing of the client_enccoding setting in the backup is only needed in the case when you had data in the wrong encoding - like SQLAscii filled with latin-1 or something) Regards Tino
On 7/6/06, Tino Wildenhain <tino@wildenhain.de> wrote: > ... > >> Yes, its actually quite esay: you dump as you feel apropriate, > >> then create the database with the encoding you want, > >> restore w/o creating database and you are done. > >> Restore sets the client encoding to what it actually was > >> in the dump data (in your case latin-1) and the database > >> would be utf-8 - postgres automatically recodes. No need > >> for iconv and friends. > >> > >> Regards > >> Tino > >> > > > > First of all, thank you for your answer. However, I suspect I did not > > understand your answer, since the commands I used were: > > > > 1) pg_dump -Ft -b -f dump.sql.tar database > > 2) dropdb database > > 3) createdb -E UNICODE database > > 4) pg_restore -d database dump.sql.tar > > > > According to my experience, this produces a "double encoding". As you > > can see, I hand-created the database, with the proper encoding. > > However, when I reimported the database, the result was a latin1 > > encoded in utf-8, rather than a pure utf-8. > > > > How my procedure was different with respect to yours? > > That was the correct way. I wonder if you have recoding support > enabled? Did you build postgres yourself? Support for recoding? I don't know... I compiled myself postgres, which is, BTW, 7.4.8. How can I check if auto recoding is enabled? > Latin-1 double encoded into utf-8 seems not like possible... > utf-8 barfs on most latin-1 characters, current 8.1 is very > picky about it. So maybe you can work with a small > test table to find out what's going wrong here. Yes, I will do... I understand postgresql in later release became much more "picky" about encoding. > > (The changing of the client_enccoding setting in the backup is only > needed in the case when you had data in the wrong encoding > - like SQLAscii filled with latin-1 or something) Ok, thanks! Regards Marco -- Marco Bizzarri http://notenotturne.blogspot.com/