Thread: ENCODING (Unicode)

ENCODING (Unicode)

From
Reshat Sabiq
Date:
To my surprise, I recently found out that it makes no difference what encoding a database, or even initdb, has.
I could insert and retrieve Unicode and extended ASCII with a Java program, and everything is good. However, when the
values are displayed by pgAdmin or a Cygwin shell, both extended ASCII and Unicode characters are shown as a sequence of
2 characters, and those don't correspond to the bits one would expect if a 2-byte character were split into two
1-byte (8-bit) characters. The best explanation I could come up with is that pgAdmin II does not support Unicode
encodings, and the shell obviously does not either. This is not surprising, because most pre-Java, pre-.NET apps were
and are like that, although Windows has the notion of wchar...
If I'm right, I wonder whether Unicode is expected to be supported sometime in the future. In general, all feedback is
appreciated.
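
The "sequence of 2 characters" symptom can be reproduced outside the database. The following is a minimal Python sketch, assuming the stored bytes are UTF-8 and the client decodes them with a single-byte code page (Latin-1 is used here as a stand-in; the thread does not say which code page pgAdmin II or the Cygwin shell actually used). It also suggests why the two characters do not look like the halves of a 16-bit value: UTF-8 spreads the code point's bits over a lead byte and a continuation byte.

```python
# Assumption: the database holds UTF-8 bytes and the client decodes them
# as Latin-1, one byte per character.
text = "\u00e9"                          # é, a single "extended ASCII" character
utf8_bytes = text.encode("utf-8")        # b'\xc3\xa9': lead byte + continuation byte
misread = utf8_bytes.decode("latin-1")   # every byte becomes its own character

# What a naive split of the 16-bit code point 0x00E9 would give instead.
naive_split = text.encode("utf-16-be")   # b'\x00\xe9'

print(len(text), len(utf8_bytes), repr(misread))
print(utf8_bytes.hex(), "vs", naive_split.hex())
```

The misread string is two characters long (U+00C3 followed by U+00A9), which matches the display problem described above.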

Sincerely,
Reshat.

-------------------------------------------------------------------------------------------
If you see my certificate with this message, you should be able to send me encrypted e-mail.
Please consult your e-mail client for details if you would like to do that.


Re: ENCODING (Unicode)

From
"Dave Page"
Date:
> -----Original Message-----
> From: Reshat Sabiq [mailto:sabiq@purdue.edu]
> Sent: 20 May 2003 00:33
> To: pgadmin-support@postgresql.org
> Subject: [pgadmin-support] ENCODING (Unicode)
>
>
> To my surprise, I recently found out that it makes no
> difference what encoding a database, or even initdb, has.
> I could insert and retrieve Unicode and extended ASCII with a
> Java program, and everything is good. However, when the
> values are displayed by pgAdmin or a Cygwin shell, both
> extended ASCII and Unicode characters are shown as a sequence
> of 2 characters, and those don't correspond to the bits one
> would expect if a 2-byte character were split into two
> 1-byte (8-bit) characters. The best explanation I could come
> up with is that pgAdmin II does not support Unicode
> encodings, and the shell obviously does not either. This is not
> surprising, because most pre-Java, pre-.NET apps were and
> are like that, although Windows has the notion of wchar...
> If I'm right, I wonder whether Unicode is expected to be supported
> sometime in the future. In general, all feedback is appreciated.

Hi,

As you say, pgAdmin doesn't support Unicode, primarily because Visual
Basic doesn't either (at least not to any useful degree). pgAdmin III
will eventually support Unicode: it is being written in C++, and the
code is designed to cater for that support as much as possible.

Regards, Dave.

Re: ENCODING (Unicode)

From
Reshat Sabiq
Date:


Dave Page wrote:
>> -----Original Message-----
>> From: Reshat Sabiq [mailto:sabiq@purdue.edu]
>> Sent: 20 May 2003 00:33
>> To: pgadmin-support@postgresql.org
>> Subject: [pgadmin-support] ENCODING (Unicode)
>>
>> To my surprise, I recently found out that it makes no
>> difference what encoding a database, or even initdb, has.
>> I could insert and retrieve Unicode and extended ASCII with a
>> Java program, and everything is good. However, when the
>> values are displayed by pgAdmin or a Cygwin shell, both
>> extended ASCII and Unicode characters are shown as a sequence
>> of 2 characters, and those don't correspond to the bits one
>> would expect if a 2-byte character were split into two
>> 1-byte (8-bit) characters. The best explanation I could come
>> up with is that pgAdmin II does not support Unicode
>> encodings, and the shell obviously does not either. This is not
>> surprising, because most pre-Java, pre-.NET apps were and
>> are like that, although Windows has the notion of wchar...
>> If I'm right, I wonder whether Unicode is expected to be supported
>> sometime in the future. In general, all feedback is appreciated.
>
> Hi,
>
> pgAdmin doesn't support Unicode as you say, primarily because Visual
> Basic doesn't either (at least not to any degree of usefulness). pgAdmin
> III will eventually support Unicode as it is being written in C++ with
> eventual support being catered for as much as possible.
>
> Regards, Dave.

Thanks. Glad to hear about pgAdmin III...
To follow up on this topic:
Given that I can insert and retrieve Unicode values in either an ASCII-based or a Unicode-based DB, is a Unicode-based DB less efficient? I remember reading something about it a while ago. I don't immediately see why that would be the case, though, because special characters are 2 bytes either way, assuming we are not simplifying Unicode characters into ASCII.


Sincerely,
Reshat.


Re: ENCODING (Unicode)

From
"Dave Page"
Date:
 
> -----Original Message-----
> From: Reshat Sabiq [mailto:sabiq@purdue.edu]
> Sent: 21 May 2003 08:11
> To: Dave Page
> Cc: pgadmin-support@postgresql.org; pgsql-novice@postgresql.org
> Subject: Re: [pgadmin-support] ENCODING (Unicode)
>
> Thanks. Glad to hear about pgAdminIII...
> To follow up on this topic:
> Given that i can insert and retrieve Unicode values into either ASCII-based
> or Unicode-based DB, is Unicode-based DB less efficient? I remember reading
> something about it a while ago. I don't see immediately why that would be the
> case though, because special characters are 2 bytes either way, assuming we
> are not simplifying Unicode characters into ASCII.

Sorry, I can't answer that - perhaps someone on pgsql-general@postgresql.org would know?

Regards, Dave.

Re: ENCODING (Unicode)

From
Jean-Michel POURE
Date:
On Wednesday 21 May 2003 09:10, Reshat Sabiq wrote:
> Given that i can insert and retrieve Unicode values into either ASCII-based
> or Unicode-based DB, is Unicode-based DB less efficient? I remember reading
> something about it a while ago. I don't see immediately why that would be
> the case though, because special characters are 2 bytes either way,
> assuming we are not simplifying Unicode characters into ASCII.

Dear Reshat,

In Unicode (UTF-8), characters are encoded in 1 byte (US-English letters), 2
bytes (Western and Eastern European languages) or 3 bytes (most other
languages, including Asian and Indian languages). Technically, you can store
UTF-8 values in an ASCII-based database.
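
The byte counts above can be checked with a short Python sketch. The sample characters are chosen only for illustration and do not come from the thread. Note that the sketch also includes a character outside the 16-bit range, which UTF-8 encodes in 4 bytes, one more than the 1 to 3 bytes described above.

```python
# Illustrative sample characters (assumptions, not taken from the thread).
samples = {
    "A": "US-English letter",
    "\u00e9": "Western European letter (é)",
    "\u0142": "Polish letter (ł)",
    "\u65e5": "Japanese kanji (日)",
    "\u0905": "Devanagari letter (अ)",
    "\U0001F600": "character outside the 16-bit range",
}

# Number of bytes each character occupies once encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {widths[ch]} byte(s)")
```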

But storing UTF-8 in an ASCII database is not recommended, for several
reasons:

- the query parser might not work well with text values (because it will not
know whether 1 UTF-8 letter is made of 1, 2 or 3 bytes).

- server-side languages are multibyte-safe only when the database encoding
is. If you calculate the length of a UTF-8 string stored in an ASCII database
in PL/pgSQL, the result will probably be wrong for special characters.
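
The length problem in the second point can be modelled in a few lines of Python. This is only a sketch of the effect, not of PostgreSQL's implementation: it assumes that a server which treats the data as single-byte text ends up counting bytes, while an encoding-aware server counts characters.

```python
word = "na\u00efve"                      # "naïve": 5 characters
raw = word.encode("utf-8")               # the bytes as they would be stored

char_length = len(word)                  # encoding-aware count
byte_length = len(raw)                   # count when every byte is one "character"

# Assumption: a single-byte view of the same data, modelled here with Latin-1.
single_byte_view = raw.decode("latin-1")

print(char_length, byte_length, len(single_byte_view))
```

The two counts agree for pure ASCII text and drift apart by one for every 2-byte character, which is why the problem shows up only with special characters.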

So, the answer is:

1) If you need to search and display multilingual text, you need a UTF-8
database. You will be able to combine all languages in a single database:
Arabic, Polish, Japanese, etc.

But be aware that you will also need a full UTF-8 chain behind the database.
Not all web servers are UTF-8 compliant... Your web pages will also need to
be saved as UTF-8. Take PHP, for example: you will need to enable the
mbstring extension at compile time.

The recommended way is to design your pages under GNU/Linux, as it supports
UTF-8 encoding very well.

2) If you need to search and display English or Western languages only, an
ASCII-based database is enough.

Stay tuned. The team will soon test pgAdmin3 for UTF-8 compliance. So far, I
have been able to browse UTF-8 data in pgAdmin3.

Cheers,
Jean-Michel

Re: ENCODING (Unicode)

From
Reshat Sabiq
Date:
Jean-Michel POURE wrote:
> In Unicode (UTF-8), characters are encoded in 1 byte (US-English letters), 2
> bytes (Western and Eastern European languages) or 3 bytes (most other
> languages, including Asian and Indian languages). Technically, you can store
> UTF-8 values in an ASCII-based database.
>
> But storing UTF-8 in an ASCII database is not recommended, for several
> reasons:
>
> - the query parser might not work well with text values (because it will not
> know whether 1 UTF-8 letter is made of 1, 2 or 3 bytes).
>
> - server-side languages are multibyte-safe only when the database encoding
> is. If you calculate the length of a UTF-8 string stored in an ASCII database
> in PL/pgSQL, the result will probably be wrong for special characters.

Thanks for your feedback, Jean-Michel.

You made a good point; I forgot about the queries. I guess each
character is converted into 4 bytes while parsing, so it makes a big
difference whether the parser sees one 2-byte character (4 bytes) or two
1-byte characters (8 bytes).

However, I haven't heard of UTF-8 supporting 3-byte values. From what I
know, special characters are 2 bytes in UTF-8. The 2-byte Unicode set is
enough to cover all characters, including Asian ones (with Chinese taking
a couple dozen thousand characters). I read something recently about
3-byte character support in one of the standards (UTF-16?), but the RFC
said there are no 3-byte assignments yet, because the 2-byte range is
currently enough...
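
The recollection above mixes two things: the 16-bit range of code points and the number of bytes UTF-8 needs to encode them. The sketch below, with a hypothetical helper named `code_units`, compares the two encodings for a kanji inside the 16-bit range and for a CJK ideograph beyond it (U+20000, part of a block assigned in Unicode 3.1). It suggests that a character can fit in one 16-bit UTF-16 unit and still take 3 bytes in UTF-8, and that characters beyond the 16-bit range do exist.

```python
from typing import Tuple


def code_units(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 16-bit code units) for one character."""
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-be" writes no byte-order mark, so every code unit is 2 bytes.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_len, utf16_units


bmp_kanji = "\u65e5"            # inside the 16-bit range
supplementary = "\U00020000"    # beyond the 16-bit range

print(code_units(bmp_kanji))
print(code_units(supplementary))
```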

But you are right, I should use the UNICODE encoding when I use characters
beyond extended ASCII.
As for applications, I usually use Java, which supports Unicode. I'm
glad that PHP does so as well. And I sure look forward to pgAdmin3.

Good luck,
Reshat.


Re: ENCODING (Unicode)

From
Jean-Michel POURE
Date:
On Wednesday 21 May 2003 19:19, Reshat Sabiq wrote:
> However, I haven't heard of UTF-8 supporting 3-byte values. From what I
> know, special characters are 2 bytes in UTF-8. The 2-byte Unicode set is
> enough to cover all characters, including Asian ones (with Chinese taking
> a couple dozen thousand characters).

When I browse Japanese glyphs, I can see three characters for each one. I
don't know more about the specs.

Cheers,
Jean-Michel