Обсуждение: Unicode problem ???
Hi everyone, I have a PostgreSQL 7.4 database running, which was initdb-ed using standard (SQL_ASCII) encoding, with -lc-collate=C option set. Everything is running fine, but I just discovered something funny. If text containing characters like ë is inserted into the database (text field), the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it looks fine, including these special characters. HOWEVER, when I get this data from within a Delphi application, using psqlODBC (07.03.0200), the special characters get corrupted (ë becomes something like ë). It seems to be a unicode problem. What do you think, should I re-create the database using UNICODE encoding? I'd like to know beforehand if this will solve the problem, since I do not want to recreate the database just to find out that it does not fix things. Or does the problem involve psqlODBC. I looked into some of its settings, but I can't find anything that involves character encoding... I hope someone can give me a pointer here. Thanks in advance, Alexander Priem The Netherlands.
> Hi everyone, > > I have a PostgreSQL 7.4 database running, which was initdb-ed > using standard > (SQL_ASCII) encoding, with -lc-collate=C option set. > > Everything is running fine, but I just discovered something > funny. If text > containing characters like ë is inserted into the database > (text field), the > inserting goes fine. When I view this data using phpPgAdmin > (v3.2.1) it > looks fine, including these special characters. > > HOWEVER, when I get this data from within a Delphi application, using > psqlODBC (07.03.0200), the special characters get corrupted (ë becomes > something like ë). It seems to be a unicode problem. > > What do you think, should I re-create the database using > UNICODE encoding? > I'd like to know beforehand if this will solve the problem, > since I do not > want to recreate the database just to find out that it does > not fix things. > > Or does the problem involve psqlODBC. I looked into some of > its settings, > but I can't find anything that involves character encoding... > > I hope someone can give me a pointer here. > > Thanks in advance, > Alexander Priem > The Netherlands. > I'm a Delphi (7 Enterprise) user myself. I've ran into that sort of encoding problems to. On the postgres backend I use UNICODE. I'll tell you what I know, I hope it helps a bit. First off: there seems to be a difference between the datatypes string and widestring in Delphi. widestring is recommendedfor use with 'wide byte strings' (or something like that), for example UNICODE. string variables however can'tbe trusted for the full 100% in that case. I don't know exactly how it worked, but I'm sure some browsing in the DelphiHelp and other rescources will explain things. Second: I know that in my case I have to set the CLIENT_ENCODING TO 'LATIN1' in my odbc connect string to make sure thatall the characters are translated correct. I'm sorry that I can't be more specific about these things since I don't remember exactly WHY these things are done, butI do know they help :-) This was all set-up and figured out by one of my colleagues a while ago. I was not excactly intothe database itself, but restricted my work to delphi coding en general development. Therefore I hope that someone else can shed more light onto the matter so I can learn a bit too :-) Kind Regards, Stijn Vanroye.
Am Mittwoch, 21. April 2004 14:37 schrieb Priem, Alexander: > I have a PostgreSQL 7.4 database running, which was initdb-ed using > standard (SQL_ASCII) encoding, with -lc-collate=C option set. > > Everything is running fine, but I just discovered something funny. If text > containing characters like ë is inserted into the database (text field), > the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it > looks fine, including these special characters. You can try to update pg_database to change the database encoding to LATIN1. SQL_ASCII is only totally supported for, well, ASCII characters.
> > Hi everyone, > > > > I have a PostgreSQL 7.4 database running, which was initdb-ed > > using standard > > (SQL_ASCII) encoding, with -lc-collate=C option set. > > > > Everything is running fine, but I just discovered something > > funny. If text > > containing characters like ë is inserted into the database > > (text field), the > > inserting goes fine. When I view this data using phpPgAdmin > > (v3.2.1) it > > looks fine, including these special characters. > > > > HOWEVER, when I get this data from within a Delphi application, using > > psqlODBC (07.03.0200), the special characters get corrupted (ë becomes > > something like ë). It seems to be a unicode problem. > > > > What do you think, should I re-create the database using > > UNICODE encoding? > > I'd like to know beforehand if this will solve the problem, > > since I do not > > want to recreate the database just to find out that it does > > not fix things. > > > > Or does the problem involve psqlODBC. I looked into some of > > its settings, > > but I can't find anything that involves character encoding... > > > > I hope someone can give me a pointer here. > > > > Thanks in advance, > > Alexander Priem > > The Netherlands. > > > I'm a Delphi (7 Enterprise) user myself. I've ran into that sort of > encoding problems to. On the postgres backend I use UNICODE. > > I'll tell you what I know, I hope it helps a bit. > First off: there seems to be a difference between the datatypes string and > widestring in Delphi. widestring is recommended for use with 'wide byte > strings' (or something like that), for example UNICODE. string variables > however can't be trusted for the full 100% in that case. I don't know > exactly how it worked, but I'm sure some browsing in the Delphi Help and > other rescources will explain things. > > Second: I know that in my case I have to set the CLIENT_ENCODING TO > 'LATIN1' in my odbc connect string to make sure that all the characters > are translated correct. > > I'm sorry that I can't be more specific about these things since I don't > remember exactly WHY these things are done, but I do know they help :-) > This was all set-up and figured out by one of my colleagues a while ago. I > was not excactly into the database itself, but restricted my work to > delphi coding en general development. > Therefore I hope that someone else can shed more light onto the matter so > I can learn a bit too :-) > > Kind Regards, > > Stijn Vanroye. I just tried something else. If I use SQL Explorer (A Borland tool for manipulating/viewing databases through DSN's) to look at this database, using the same (psqlODBC) DSN, I also get the wrong characters. So it seems Delphi doesn't have anything to do with it. On the other hand, SQL Explorer is also a Borland product... But I also just tried to view this field as a widestring. I queried the database and used ShowMessage(VarToWideStr(rs.Fields[1].Value)) to view the text. Didn't help... I also tried to change the client_encoding via SQL. I just gave a "set client_encoding to 'unicode'" command, followed by the command to retrieve the text. Also didn't help. "set client_encoding to 'latin1'" didn't help either. Is this the correct way to change the encoding or do I needd to rebuild the database? Maybe I am not really changing the encoding this way??? Alexander Priem.
Hi Stijn, I tried adding "set client_encoding to 'LATIN1'" to the Connect Settings of psqlODBC. It didn't help. I still get the same weird characters when I look at the data using SQL explorer. Could this be due to the fact that the database was CREATED using SQL_ASCII encoding? Maybe your solution only works when the database was created using LATIN1 or UNICODE encoding. The strange thing is that everything is FINE when I use phpPgAdmin to look at the data, so there must be a conversion problem somewhere in the ODBC layer... Maybe I'll just try recreating the database using UNICODE or LATIN1 encoding tomorrow. UNICODE seems the best, right? Or does LATIN1 have more possibilities than UNICODE? Groeten vanuit Veenendaal, Alexander.
> Hi Stijn, Hy Alexander, > I tried adding "set client_encoding to 'LATIN1'" to the > Connect Settings of > psqlODBC. It didn't help. I still get the same weird > characters when I look > at the data using SQL explorer. Damn (pardon me). I realy hoped that would do the trick. > Could this be due to the fact that the database was CREATED > using SQL_ASCII > encoding? Maybe your solution only works when the database > was created using > LATIN1 or UNICODE encoding. Could be, as I said I use Unicode. According to the email from Peter Eisentraut on the same topic: //Quote SQL_ASCII is only totally supported for, well, ASCII characters. //End Quote Sounds to me that SQL_ASCII is more limited then UNICODE. > The strange thing is that everything is FINE when I use > phpPgAdmin to look > at the data, so there must be a conversion problem somewhere > in the ODBC > layer... Guess so to. I think that's a question for the developpers. In any case there's a difference between what the ODBC does and wath phpPgAdmin does. I've always used pgAdmin (currently III). No problems there eigther. > Maybe I'll just try recreating the database using UNICODE or > LATIN1 encoding > tomorrow. UNICODE seems the best, right? Or does LATIN1 have more > possibilities than UNICODE? Of what I hear, UNICODE indeed seems the best option. But then again, that encoding stuff is still a bit of a mistery tome. What I personally don't understand is: if all my databases are UNICODE, why do I have to set the Client encoding to latin1to get a correct result? > Groeten vanuit Veenendaal, > > Alexander. Stijn.
Priem, Alexander said: > Could this be due to the fact that the database was CREATED using > SQL_ASCII > encoding? Maybe your solution only works when the database was created > using > LATIN1 or UNICODE encoding. Yes, I suspect > Maybe I'll just try recreating the database using UNICODE or LATIN1 > encoding > tomorrow. UNICODE seems the best, right? Or does LATIN1 have more > possibilities than UNICODE? Unicode has a larger character set than latin-1. But if you only need to support latin-1 then use that... Of course, if you have to upgrade to support unicode later, you'll wish you had started off using latin-1! I'm no expert on this, so I hope this info is correct. John Sidney-Woollett
I am also using postgres database with delphi. Therefor I don't use ODBC but I use the Zeos Database components for delphi. I had similar problems though. My database is in iso8859-7 and I needed Greek chars to appear corectly in my application. What I did is I am running an sql statement when my application is starting up (after my datasource connection to the db etc) and I am setting my client's encoding to iso8859-7. No problems since then ... Greeks show ok .. they are stored ok and the search on the data works file. Hope I helped a little Alexander Antonakakis John Sidney-Woollett wrote: > Priem, Alexander said: > >>Could this be due to the fact that the database was CREATED using >>SQL_ASCII >>encoding? Maybe your solution only works when the database was created >>using >>LATIN1 or UNICODE encoding. > > > Yes, I suspect > > >>Maybe I'll just try recreating the database using UNICODE or LATIN1 >>encoding >>tomorrow. UNICODE seems the best, right? Or does LATIN1 have more >>possibilities than UNICODE? > > > Unicode has a larger character set than latin-1. But if you only need to > support latin-1 then use that... Of course, if you have to upgrade to > support unicode later, you'll wish you had started off using latin-1! > > I'm no expert on this, so I hope this info is correct. > > John Sidney-Woollett > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > >
In article <200404211522.39036.peter_e@gmx.net>, Peter Eisentraut <peter_e@gmx.net> writes: > Am Mittwoch, 21. April 2004 14:37 schrieb Priem, Alexander: >> I have a PostgreSQL 7.4 database running, which was initdb-ed using >> standard (SQL_ASCII) encoding, with -lc-collate=C option set. >> >> Everything is running fine, but I just discovered something funny. If text >> containing characters like ë is inserted into the database (text field), >> the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it >> looks fine, including these special characters. > You can try to update pg_database to change the database encoding to LATIN1. > SQL_ASCII is only totally supported for, well, ASCII characters. Or even better to LATIN9. This also includes the Euro symbol and the French OE ligature forgotten in LATIN1.
> What I personally don't understand is: if all my databases > are UNICODE, why do I have to set the Client encoding to latin1 > to get a correct result? Because LATIN1 isn't just a subset of UNICODE. If the data coming out of the database *is* UNICODE *and* your client *does* handle UNICODE directly you might get away with not setting a client_encoding, same goes for DB=latin1 + client=latin1. If, however, the DB delivers UNICODE but your client really wants LATIN1 you need to tell the database to convert the stored UNICODE to LATIN1 before delivery. That's what the client_encoding is for. Or so is my understanding of it. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
So if I understand correctly, I could best try the following : * Create a new database, using UNICODE encoding, since this supports the largest character set (inc. Euro-sign etc.) * Then, set the client-encoding to LATIN9, so that clients get support for special dutch characters. The client-encoding can also be changed on-the-fly, so it won't matter if I don't get this right the first time. Most important is the database (cluster) encoding... Am I right in this? Thanks for all the help. Alexander Priem.
> > What I personally don't understand is: if all my databases > > are UNICODE, why do I have to set the Client encoding to latin1 > > to get a correct result? Karsten Hilbert wrote: > Because LATIN1 isn't just a subset of UNICODE. If the data > coming out of the database *is* UNICODE *and* your client > *does* handle UNICODE directly you might get away with not > setting a client_encoding, same goes for DB=latin1 + > client=latin1. If, however, the DB delivers UNICODE but your > client really wants LATIN1 you need to tell the database to > convert the stored UNICODE to LATIN1 before delivery. That's > what the client_encoding is for. > > Or so is my understanding of it. > > Karsten Thanks, that's a pretty clear explication. Only one question: Wouldn't it be better if I just set my client encoding to UNICODE in stead of LATIN1? I suppose the UNICODE encoding setis understood by windows (and delphi), since I write progs for a win enviroment. Stijn.
Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye: > Wouldn't it be better if I just set my client encoding to UNICODE in stead > of LATIN1? I suppose the UNICODE encoding set is understood by windows (and > delphi), since I write progs for a win enviroment. That really depends exclusively on what your client program/environment can tolerate. You may have to set your console, widget set or desktop to Unicode or something like that.
> Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye: > > Wouldn't it be better if I just set my client encoding to > UNICODE in stead > > of LATIN1? I suppose the UNICODE encoding set is understood > by windows (and > > delphi), since I write progs for a win enviroment. Peter Eisentraut wrote: > That really depends exclusively on what your client > program/environment can > tolerate. You may have to set your console, widget set or > desktop to Unicode > or something like that. Ok thanks. All I've got to do now is find out if windows and/or delphi understand UNICODE. Stijn.
Hi everyone, I solved the problem with the special characters (thanks to you all). I recreated my entire database (luckily I keep scripts for table/index/view creation) and initdb-ed it using --lc-collate=C --encoding=UNICODE. In my psqlODBC DSN settings I added "set client_encoding='LATIN9';" to the Connect Settings and that solved all my problems regarding the special characters. Thanks a lot, everyone! Alexander Priem.
In article <71E201BE5E881C46811BA160694C5FCB0FA934@fs1000.farcourier.com>, "Stijn Vanroye" <s.vanroye@farcourier.com> writes: >> Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye: >> > Wouldn't it be better if I just set my client encoding to >> UNICODE in stead >> > of LATIN1? I suppose the UNICODE encoding set is understood >> by windows (and >> > delphi), since I write progs for a win enviroment. > Peter Eisentraut wrote: >> That really depends exclusively on what your client >> program/environment can >> tolerate. You may have to set your console, widget set or >> desktop to Unicode >> or something like that. > Ok thanks. All I've got to do now is find out if windows and/or delphi understand UNICODE. I'd say you should first decide if you really need Unicode. If you're dealing exclusively with English/French/Spanish/German or so and if you're pretty sure you'll never touch Polish/Russian/Chinese, you can stick to Latin-1 or Latin-9 and happily ignore Unicode.
> I'd say you should first decide if you really need Unicode. If you're > dealing exclusively with English/French/Spanish/German or so and if > you're pretty sure you'll never touch Polish/Russian/Chinese, you can > stick to Latin-1 or Latin-9 and happily ignore Unicode. But wouldn't it be better to pick unicode just in case? Or would Latin-1 / Latin-9 perform better, compared to Unicode? Alexander Priem
Stijn Vanroye wrote: > Of what I hear, UNICODE indeed seems the best option. But then again, that > encoding stuff is still a bit of a mistery to me. What I personally don't The following is cut from the documentation of XML Ada (xmlada-1.0/docs/xml_2.html#SEC6) ( which is available at http://libre.act-europe.fr/xmlada/) <quote> We now know how each encoded character can be represented by an integer value (code point) depending on the character set. Character encoding schemes deal with the representation of a sequence of integers to a sequence of code units. A code unit is a sequence of bytes on a computer architecture. There exists a number of possible encoding schemes. Some of them encode all integers on the same number of bytes. They are called fixed-width encoding forms, and include the standard encoding for Internet emails (7bits, but it can't encode all characters), as well as the simple 8bits scheme, or the EBCDIC scheme. Among them is also the UTF-32 scheme which is defined in the Unicode standard. Another set of encoding schemes encode integers on a variable number of bytes. These include two schemes that are also defined in the Unicode standard, namely Utf-8 and Utf-16. Unicode doesn't impose any specific encoding. However, it is most often associated with one of the Utf encodings. They each have their own properties and advantages: Utf32 This is the simplest of all these encodings. It simply encodes all the characters on 32 bits (4 bytes). This encodes all the possible characters in Unicode, and is obviously straightforward to manipulate. However, given that the first 65535 characters in Unicode are enough to encode all known languages currently in use, Utf32 is also a waste of space in most cases. Utf16 For the above reason, Utf16 was defined. Most characters are only encoded on two bytes (which is enough for the first 65535 and most current characters). In addition, a number of special code points have been defined, known as surrogate pairs, that make the encoding of integers greater than 65535 possible. The integers are then encoded on four bytes. As a result, Utf16 is thus much more memory-efficient and requires less space than Utf32 to encode sequences of characters. However, it is also more complex to decode. Utf8 This is an even more space-efficient encoding, but is also more complex to decode. More important, it is compatible with the most currently used simple 8bit encoding. Utf8 has the following properties: Characters 0 to 127 (ASCII) are encoded simply as a single byte. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. Characters greater than 127 are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte can appear as part of any other character. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. UTF-8 encoded characters may theoretically be up to six bytes long, however the first 16-bit characters are only up to three bytes long. Note that the encodings above, except for Utf8, have two versions, depending on the chosen byte order on the machine. </quote> So yes, Unicode in Utf8 is tricky to handle /Björn