Discussion: ENCODING (Unicode)
To my surprise, I recently found out that it seems to make no difference which encoding a database, or even initdb, uses. I could insert and retrieve Unicode and extended ASCII with a Java program, and everything is good. However, when the values are displayed by pgAdmin or a cygwin shell, both extended ASCII and Unicode characters are shown as a sequence of 2 characters, and those don't correspond to the bits one would expect if a 2-byte character were split into two 1-byte (8-bit) characters. The best explanation I could come up with is that pgAdminII does not support Unicode encodings, and the shell obviously does not either. This is not surprising, because most pre-Java, pre-.NET apps were and are like that, although Windows has the notion of wchar... If I'm right, I wonder whether Unicode is expected to be supported sometime in the future. And in general, all feedback is appreciated.

Sincerely,
Reshat.

-------------------------------------------------------------------------------------------
If you see my certificate with this message, you should be able to send me encrypted e-mail. Please consult your e-mail client for details if you would like to do that.
> -----Original Message-----
> From: Reshat Sabiq [mailto:sabiq@purdue.edu]
> Sent: 20 May 2003 00:33
> To: pgadmin-support@postgresql.org
> Subject: [pgadmin-support] ENCODING (Unicode)
>
> To my surprise i recently found out that there is no
> difference in what encoding a database or even initdb has.
> I could insert and retrieve Unicode and extended ASCII with a
> Java program, and everything is good. However, when the
> values are displayed by pgadmin or a cygwin shell, both
> extended ASCII and Unicode characters are shown as a sequence
> of 2 characters, and those don't correspond to the bits one
> would expect to have if 2 byte-character were split into 2
> 1-byte (8-bit) characters. The best explanation i could come
> up with is that pgAdminII is not supporting Unicode
> encodings, and the shell obviously also does not. This is not
> surprising, because most of pre-Java, pre-.NET apps were and
> are like that, although Windows has the notions of wchar...
> If i'm right, i wonder if Unicode is expected to be supported
> sometimes in the future. And in general all feedback is appreciated.

Hi,

pgAdmin doesn't support Unicode, as you say, primarily because Visual Basic doesn't either (at least not to any degree of usefulness). pgAdmin III will eventually support Unicode, as it is being written in C++ with eventual support being catered for as much as possible.

Regards, Dave.
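The two-character display described in the original post can be reproduced outside the database. The following is only a sketch, and it assumes that the client reads the stored UTF-8 bytes as a single-byte encoding such as Latin-1; the thread does not say which code page pgAdmin II or the cygwin shell actually used. The helper name `show_as_latin1` is made up for this illustration.

```python
# Sketch: why one non-ASCII character can appear as two unrelated characters.
# Assumption: the displaying client decodes UTF-8 bytes as Latin-1.

def show_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")

original = "\u00e9"                  # e with acute accent, one character
raw = original.encode("utf-8")       # stored as two bytes: 0xC3 0xA9
garbled = show_as_latin1(original)   # displayed as two characters

print(raw.hex(), len(raw))
print(garbled, len(garbled))
```

The bytes are 0xC3 0xA9 rather than 0x00 0xE9, which would explain why the two displayed characters do not look like the code point simply split into two halves.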
Thanks. Glad to hear about pgAdminIII...

To follow up on this topic: given that I can insert and retrieve Unicode values in either an ASCII-based or a Unicode-based DB, is a Unicode-based DB less efficient? I remember reading something about it a while ago. I don't immediately see why that would be the case, though, because special characters are 2 bytes either way, assuming we are not simplifying Unicode characters into ASCII.

Sincerely,
Reshat.
> -----Original Message-----
> From: Reshat Sabiq [mailto:sabiq@purdue.edu]
> Sent: 21 May 2003 08:11
> To: Dave Page
> Cc: pgadmin-support@postgresql.org; pgsql-novice@postgresql.org
> Subject: Re: [pgadmin-support] ENCODING (Unicode)
>
> Thanks. Glad to hear about pgAdminIII... To follow up on this topic:
> Given that I can insert and retrieve Unicode values into either ASCII-based
> or Unicode-based DB, is Unicode-based DB less efficient? I remember reading
> something about it a while ago. I don't see immediately why that would be the
> case though, because special characters are 2 bytes either way, assuming we
> are not simplifying Unicode characters into ASCII.

Sorry, I can't answer that - perhaps someone on pgsql-general@postgresql.org would know?
Regards, Dave.
On Wednesday 21 May 2003 09:10, Reshat Sabiq wrote:
> Given that i can insert and retrieve Unicode values into either ASCII-based
> or Unicode-based DB, is Unicode-based DB less efficient? I remember reading
> something about it a while ago. I don't see immediately why that would be
> the case though, because special characters are 2 bytes either way,
> assuming we are not simplifying Unicode characters into ASCII.

Dear Reshat,

In Unicode (UTF-8), characters are coded on 1 byte (US-English letters), 2 bytes (Western and Eastern European languages) and 3 bytes (most other languages, including Asian and Indian languages; characters outside the Basic Multilingual Plane take 4 bytes). Technically, you can store UTF-8 values in an ASCII-based database.

But storing UTF-8 in an ASCII database is not recommended, for several reasons:

- the query parser might not work well with text values (because it will not know how many bytes make up one UTF-8 letter).

- server-side languages are multi-byte safe only when they know the database encoding. If you calculate the length of a UTF-8 string in PLpgSQL stored in an ASCII database, the result will probably be wrong for special characters.

So, the answer is:

1) If you need to search and display multilingual text, you need a UTF-8 database. You will be able to combine all languages in a single database: Arabic, Polish, Japanese, etc. But be aware that you will also need a full UTF-8 chain behind the database. Not all web servers are UTF-8 compliant... Your web pages will also need to be saved in UTF-8. Take PHP, for example: you will need to enable the mbstring extension (--enable-mbstring) at compilation. The recommended way is to design your pages under GNU/Linux, as it supports UTF-8 encoding very well.

2) If you need to search and display English or Western languages only, an ASCII-based database is enough.

Stay tuned. The team will soon test pgAdmin3 UTF-8 compliance. As far as I can tell, I could browse UTF-8 data in pgAdmin3.

Cheers,
Jean-Michel
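The byte widths and the length problem mentioned in this message can be checked with a few lines of Python. This is a sketch only: it assumes that a database unaware of the encoding counts bytes rather than characters, and it stands in for that behaviour with `len()` on the encoded bytes. The function name `utf8_width` is hypothetical.

```python
# Sketch: bytes per character in UTF-8, and byte length versus character length.
# Assumption: a database with no knowledge of UTF-8 would report the byte count.

def utf8_width(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

print(utf8_width("A"))        # US-English letter
print(utf8_width("\u00e9"))   # Western European: e with acute accent
print(utf8_width("\u0142"))   # Eastern European: Polish l with stroke
print(utf8_width("\u65e5"))   # Japanese kanji

city = "\u0141\u00f3d\u017a"            # the Polish city name Lodz, with diacritics
char_count = len(city)                  # what a UTF-8 aware length would give
byte_count = len(city.encode("utf-8"))  # what a byte-oriented length would give
print(char_count, byte_count)
```

For the four-letter city name the two counts differ (4 characters, 7 bytes), which is the kind of mismatch a length calculation in an ASCII database could produce.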
Jean-Michel POURE wrote:
> In unicode (UTF-8), characters are coded on 1 byte (US-English letters), 2
> bytes (Western and Eastern Europe languages) and 3 bytes (all other languages
> including Asian and Indian languages). Technically, you can store UTF-8
> values in an ASCII-based database.
>
> But, storing UTF-8 in an ASCII database is not recommended, for several
> reasons :
>
> - the query parser might not work well with text values (because it will not
> know whether 1 UTF-8 letter is made of 1, 2 or 3 bytes).
>
> - server-side languages are multi-byte safe. If you calculate the length of an
> UTF-8 string in PLpgSQL stored in an ASCII database, it will probably fail
> for special characters.

Thanks for your feedback, Jean-Michel.

You made a good point; I forgot about the queries. I guess each character is converted into 4 bytes while parsing, so it makes a lot of difference between one 2-byte character (4 bytes) and two 1-byte characters (8 bytes).

However, I haven't heard of UTF-8 supporting 3-byte values. From what I know, special characters are 2 bytes in UTF-8. A 2-byte Unicode set is enough to cover all characters, including Asian ones (with Chinese taking a couple of dozen thousand characters). I read something recently about 3-byte character support in one of the standards (UTF-16?), but the RFC said there are no 3-byte assignments yet, because the 2-byte range is currently enough...

But you are right, I should use UNICODE encoding when I use characters beyond extended ASCII. As for applications, I usually use Java, which supports Unicode. I'm glad that PHP does so as well. And I sure look forward to pgAdmin3.

Good luck,
Reshat.
On Wednesday 21 May 2003 19:19, Reshat Sabiq wrote:
> However, i haven't heard of UTF-8 supporting 3-byte values. From what i
> know, special characters are 2 bytes in UTF-8. 2-byte Unicode set is
> enough to cover all characters, including Asian (with Chinese taking a
> couple dozen thousands of characters).

I can see three characters when browsing Japanese glyphs. I don't know more about the specs.

Cheers,
Jean-Michel
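The "2 bytes" and "three characters" observations may both be right if they refer to different encodings of the same code point. As a sketch, assuming the 2-byte figure comes from UTF-16 and the three displayed characters come from UTF-8 bytes shown one per character, the sizes can be compared as follows. The helper `encoded_sizes` and the sample characters are chosen for this illustration only.

```python
# Sketch: size of one character in UTF-8 versus UTF-16.
# "utf-16-be" is used so that no byte-order mark is added to the count.

def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

kanji = "\u65e5"               # a common Japanese kanji in the 2-byte Unicode range
supplementary = "\U00020000"   # an ideograph outside the 2-byte Unicode range

print(encoded_sizes("A"))
print(encoded_sizes(kanji))
print(encoded_sizes(supplementary))
```

The kanji takes 3 bytes in UTF-8 but 2 bytes in UTF-16, while the supplementary ideograph takes 4 bytes in both.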