Discussion: UTF-8 safe ascii() function

UTF-8 safe ascii() function

From
Jean-Michel POURE
Date:
Dear all,

I would like to transform UTF-8 strings into Java-Unicode. Example :
- Latin1 : 'é'
- UTF-8 : 'é'
- Java Unicode = '\u00233'

Basically, a Unicode compatible ascii() function would be fine.
ascii('é') should return 233.

1) Has anyone written a UTF-8 safe wrapper for the ascii() function? If yes,
would you be so kind as to publish this function on the list?

2) Are there plans to add a UTF-8 safe ascii() function to PostgreSQL?

Best regards,
Jean-Michel POURE

Re: [HACKERS] UTF-8 safe ascii() function

From
Patrice Hédé
Date:
Hi Jean-Michel,

Jean-Michel POURE <jm.poure@freesurf.fr> wrote:
> Dear all,
>
> I would like to transform UTF-8 strings into Java-Unicode. Example :
> - Latin1 : 'é'
> - UTF-8 : 'é'
> - Java Unicode = '\u00233'
>
> Basically, a Unicode compatible ascii() function would be fine.
> ascii('é') should return 233.
>
> 1) Has anyone written a UTF-8 safe wrapper for the ascii() function?
> If yes, would you be so kind as to publish this function on the list?

OK, I just gave it a try, see the attachment.

The function takes the first character of a TEXT element and
returns its UCS2 value. I only ran some basic tests (i.e. I have not
tried 3- or 4-byte UTF-8 chars). The function follows the
Unicode 3.2 spec.

SELECT utf8toucs2('a'), utf8toucs2('é');
 utf8toucs2 | utf8toucs2
------------+------------
         97 |        233
(1 row)

The function returns -1 on error.
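
The attachment itself is not reproduced in this archive. As a rough illustration of the behaviour described above (first character in, code point out, -1 on error), here is a minimal Python sketch. The name `utf8_to_ucs2` and the decision to reject 4-byte sequences (they fall outside UCS-2) are assumptions made for this example, not the attached code.

```python
def utf8_to_ucs2(data: bytes) -> int:
    """Return the code point of the first UTF-8 character in data, or -1 on error.

    Hypothetical helper: only 1- to 3-byte sequences are accepted, because
    4-byte sequences encode code points that do not fit in UCS-2.
    """
    if not data:
        return -1
    lead = data[0]
    if lead < 0x80:                    # 1 byte: 0xxxxxxx
        return lead
    if 0xC2 <= lead <= 0xDF:           # 2 bytes: 110xxxxx 10xxxxxx
        need, value, minimum = 1, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        need, value, minimum = 2, lead & 0x0F, 0x800
    else:
        return -1                      # stray continuation byte, invalid lead, or 4-byte lead
    if len(data) < 1 + need:
        return -1                      # truncated sequence
    for byte in data[1:1 + need]:
        if byte & 0xC0 != 0x80:        # every trailing byte must look like 10xxxxxx
            return -1
        value = (value << 6) | (byte & 0x3F)
    if value < minimum or 0xD800 <= value <= 0xDFFF:
        return -1                      # overlong encoding or surrogate code point
    return value


print(utf8_to_ucs2("a".encode("utf-8")), utf8_to_ucs2("é".encode("utf-8")))
```

The sketch works on raw bytes so that the multi-byte handling is visible; a real server-side function would read the bytes of the TEXT value instead.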

> 2) Are there plans to add an ascii() UTF-8 safe function to
> PostgreSQL?

I don't think the function I did is useful as such. It would be better
to make a function that converts the whole string or something.

By the way, what is the encoding for Java Unicode ? is it always "\u"
followed by 5 hex digits (in which case your example is wrong) ? Then,
it shouldn't be too difficult to make the relevant function, though I'm
wondering if the Java programme would convert an incoming '\' 'u' '0'
'0' '2' '3' '3' to the corresponding UCS2/UTF16 character ?

Maybe we should have some similar input (and output ?) functionality in
psql, but then I would much prefer the Perl way, which is
\x{hex_digits}, which is unambiguous.
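
For reference, a Java `\u` escape is always followed by exactly four hex digits, so 'é' is written `\u00E9`, and characters outside the Basic Multilingual Plane are written as a surrogate pair. The following Python sketch shows what a whole-string conversion could look like for both notations. The function names `to_java_escapes` and `to_perl_escapes` are made up for this example, and leaving ASCII characters (including a literal backslash) untouched is a simplifying assumption.

```python
def to_java_escapes(text: str) -> str:
    """Replace every non-ASCII character with Java-style \\uXXXX escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")          # BMP: one 4-digit escape
        else:
            cp -= 0x10000                       # beyond BMP: UTF-16 surrogate pair
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


def to_perl_escapes(text: str) -> str:
    """Replace every non-ASCII character with Perl-style \\x{HEX} escapes."""
    return "".join(
        ch if ord(ch) < 0x80 else f"\\x{{{ord(ch):X}}}" for ch in text
    )


print(to_java_escapes("é"))   # \u00E9
print(to_perl_escapes("é"))   # \x{E9}
```

The braces in the Perl form mark where the digits end, which is why it needs no fixed width; the Java form relies on the four-digit rule instead.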

Regards,

Patrice

--
Patrice Hédé
email: patrice hede(à)islande org
www  : http://www.islande.org/


Attachments

Re: [HACKERS] UTF-8 safe ascii() function

From
Jean-Michel POURE
Date:
Dear Patrice,

Thank you very much. This will save the lives of Java users.

> I don't think the function I did is useful as such. It would be better
> to make a function that converts the whole string or something.

Yes, this would save the lives of some JavaScript users. Java Unicode notation
is the only Unicode notation understood by JavaScript.

> By the way, what is the encoding for Java Unicode ? is it always "\u"
> followed by 5 hex digits (in which case your example is wrong) ? Then,
> it shouldn't be too difficult to make the relevant function, though I'm
> wondering if the Java programme would convert an incoming '\' 'u' '0'
> '0' '2' '3' '3' to the corresponding UCS2/UTF16 character ?

Java Unicode notation is not case sensitive ('\u' = '\U') and is followed by
a hexadecimal value.

> Maybe we should have some similar input (and output ?) functionality in
> psql, but then I would much prefer the Perl way, which is
> \x{hex_digits}, which is unambiguous.

This would be perfect. We should also handle the HTML Unicode notations:
&#{dec_digits} and &#x{hex_digits}, as they are unambiguous.
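
As an illustration of the two HTML forms, here is a small Python sketch that rewrites non-ASCII characters as numeric character references. The function name `to_html_refs` and its `hexadecimal` flag are assumptions made for this example. Note that a complete reference ends with a semicolon, which is what makes it unambiguous.

```python
def to_html_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace every non-ASCII character with an HTML numeric character reference."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)              # ASCII passes through unchanged
        elif hexadecimal:
            parts.append(f"&#x{cp:X};")   # hexadecimal form, e.g. &#xE9;
        else:
            parts.append(f"&#{cp};")      # decimal form, e.g. &#233;
    return "".join(parts)


print(to_html_refs("é"))                    # &#233;
print(to_html_refs("é", hexadecimal=True))  # &#xE9;
```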

Cheers,
Jean-Michel



Re: [HACKERS] UTF-8 safe ascii() function

From
Jean-Michel POURE
Date:
On Sunday 19 May 2002 11:44, Patrice Hédé wrote:
> The function takes the first character of a TEXT element and
> returns its UCS2 value. I only ran some basic tests (i.e. I have not
> tried 3- or 4-byte UTF-8 chars). The function follows the
> Unicode 3.2 spec.

Hi Patrice,

I tried a Japanese character :
SELECT utf8toucs2 ('支'::text) which returns -1

Do you know why it does not return the UCS-2 value?
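
For comparison, the expected value can be worked out by hand: '支' is encoded in UTF-8 as the three bytes E6 94 AF, and stripping the marker bits gives U+652F, i.e. 25903. The short Python check below is only a worked illustration of that arithmetic, not part of the function under discussion.

```python
raw = "支".encode("utf-8")            # three bytes: E6 94 AF
b0, b1, b2 = raw

# 1110xxxx 10xxxxxx 10xxxxxx: keep 4 + 6 + 6 payload bits
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(raw.hex(), hex(code_point), code_point)   # e694af 0x652f 25903
```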

Cheers,
Jean-Michel POURE

Re: [HACKERS] UTF-8 safe ascii() function

From
Patrice Hédé
Date:
Jean-Michel POURE <jm.poure@freesurf.fr> wrote:

> I tried a Japanese character :
> SELECT utf8toucs2 ('支'::text) which returns -1
>
> Do you know why it does not return the UCS-2 value?

Oops, my mistake. I forgot to update a test after a copy-paste. Here is
a new version which should be correct this time ! :)

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/

Attachments

Re: [HACKERS] UTF-8 safe ascii() function

From
Jean-Michel POURE
Date:
On Sunday 19 May 2002 21:14, Patrice Hédé wrote:
> Oops, my mistake. I forgot to update a test after a copy-paste. Here is
> a new version which should be correct this time ! :)

Thanks Patrice, merci Patrice !