Discussion: Knowing the length(convert(username using windows_1251_to_utf8))

Knowing the length(convert(username using windows_1251_to_utf8))

From:
"Alexander Farber"
Date:
Hello PostgreSQL users!

I have this data stored in WIN1251 encoding, which
is being fetched by a libpq application I'm developing:

phpbb=> show client_encoding;
-----------------
 WIN1251
(1 row)

phpbb=> \d phpbb_users;
........
 username              | character varying(25)  | not null default ''::character
........

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
    username     | length | length
-----------------+--------+--------
 Лукашенко И. В. |     15 |     26
(1 row)

My problem is that I need the username in the utf8 encoding.
So I use the convert(username using windows_1251_to_utf8)
which works fine except one thing:

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

I was using char name[25 + 1] initially, but now I see
that it isn't sufficient. Should I use char name[25 * 2 + 1] ?

How do you usually handle such cases?

Thank you for any advice
Alex


--
http://preferans.de

Re: Knowing the length(convert(username using windows_1251_to_utf8))

From:
"Alexander Farber"
Date:
And an additional question, please:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

Thank you
Alex

On 1/11/07, Alexander Farber <alexander.farber@gmail.com> wrote:
> phpbb=> show client_encoding;
> -----------------
>  WIN1251
> (1 row)
>
> phpbb=> \d phpbb_users;
> ........
>  username              | character varying(25)  | not null default ''::character
> ........
>
> phpbb=> select username, length(username), length(convert(username
> using windows_1251_to_utf8)) from phpbb_users where user_id=224;
>     username     | length | length
> -----------------+--------+--------
>  Лукашенко И. В. |     15 |     26
> (1 row)
>

--
http://preferans.de

Re: Knowing the length(convert(username using windows_1251_to_utf8))

From:
Martijn van Oosterhout
Date:
On Thu, Jan 11, 2007 at 10:19:38AM +0100, Alexander Farber wrote:
> Hello PostgreSQL users!
>
> I have this data stored in WIN1251 encoding, which
> is being fetched by a libpq application I'm developing:

<snip>

> phpbb=> select username, length(username), length(convert(username
> using windows_1251_to_utf8)) from phpbb_users where user_id=224;
>     username     | length | length
> -----------------+--------+--------
>  Лукашенко И. В. |     15 |     26
> (1 row)
>
> My problem is that I need the username in the utf8 encoding.
> So I use the convert(username using windows_1251_to_utf8)
> which works fine except one thing:


If you need the string in UTF-8, why not just set the "client_encoding"
to "utf8"? Then the server will only send you strings in utf8, and no
conversion is necessary.

> Is there please a way to know the length of the utf8 data?
> (I'm using a fixed char array in my C program)

UTF-8 is always variable length, I think up to 4 bytes per character.
Maybe you shouldn't be using a fixed-length array?

> How do you usually handle such cases?

Variable length arrays.
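
Editor's sketch of the variable-length approach, in C since the application uses libpq. It assumes the field pointer and byte length come from PQgetvalue(res, row, col) and PQgetlength(res, row, col); dup_field is a hypothetical helper name, not part of libpq.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy one text-format field into a freshly allocated, NUL-terminated
 * buffer sized from its byte length, so no fixed array is needed.
 * With libpq, value and len would come from PQgetvalue() and PQgetlength().
 * Returns NULL if allocation fails; the caller frees the result.
 * dup_field is a hypothetical helper name.
 */
char *dup_field(const char *value, int len)
{
    assert(value != NULL && len >= 0);

    char *copy = malloc((size_t)len + 1);   /* +1 for the terminator */
    if (copy == NULL)
        return NULL;

    memcpy(copy, value, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

The copy is only needed if the value must outlive the PGresult; until PQclear is called, the pointer returned by PQgetvalue can be used directly.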

In your next email you ask:
> Can I still be sure that the data returned in the
> convert(username using windows_1251_to_utf8)
> column will be 0-terminated or should I fetch
> the data length using PQgetlength and maintain
> that value in my C-program?

On the client end (as long as you're not doing binary transfers) the
strings are always null terminated.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: Knowing the length(convert(username using windows_1251_to_utf8))

From:
"Alexander Farber"
Date:
Hi Martijn,

On 1/11/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> If you need the string in UTF-8, why not just set the "client_encoding"
> to "utf8"? Then the server will only send you strings in utf8, and no
> conversion is necessary.

actually you are right, because I need all my data in UTF8 anyway
(for a web flash client). So I've followed your advice and added:

   PQsetClientEncoding(conn, "UTF8")

and now my program works the same, but without that convert().

> > Is there please a way to know the length of the utf8 data?
> > (I'm using a fixed char array in my C program)
>
> UTF-8 is always variable length, I think up to 4 bytes per character.
> Maybe you shouldn't be using a fixed-length array?

Ok I'll go for the 4 times bigger fixed array for now,
because I'd like to keep my webchat-like app quick.
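
Editor's sketch of the fixed-array approach. It assumes the column stays varchar(25) and that the server sends valid, NUL-terminated UTF-8. The names USERNAME_MAX_CHARS, USERNAME_BUF_SIZE and utf8_copy_trunc are hypothetical. Four bytes per character is the upper bound for any UTF-8 text; characters that originate from WIN1251 need at most three, so the 4x array leaves headroom.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define USERNAME_MAX_CHARS 25                          /* varchar(25) */
#define USERNAME_BUF_SIZE  (USERNAME_MAX_CHARS * 4 + 1) /* 4 bytes/char + NUL */

/*
 * Copy src into dst (capacity dstsize bytes, terminator included),
 * truncating only on a UTF-8 character boundary so that a multi-byte
 * character is never cut in half. Assumes src is valid UTF-8.
 * Returns the number of bytes copied, not counting the terminator.
 */
size_t utf8_copy_trunc(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);

    size_t n = strlen(src);
    if (n >= dstsize) {
        n = dstsize - 1;
        /* If the first dropped byte is a continuation byte (10xxxxxx),
         * back up to the lead byte and drop that character entirely. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With char name[USERNAME_BUF_SIZE], a call such as utf8_copy_trunc(name, sizeof name, PQgetvalue(res, 0, 0)) never overruns the array, and the boundary check only matters if the column is later widened.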

> In your next email you ask:
> > Can I still be sure that the data returned in the
> > convert(username using windows_1251_to_utf8)
> > column will be 0-terminated or should I fetch
> > the data length using PQgetlength and maintain
> > that value in my C-program?
>
> On the client end (as long as you're not doing binary transfers) the
> strings are always null terminated.

May I ask you an off-topic question? I've read several
docs on Unicode, but they are difficult to understand.

Do you think that a UTF8 string will ever have a 0 byte
inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
on the UTF8 values I'll be fetching from my database?

Regards
Alex

PS: Using postgresql-server-8.1.4 on OpenBSD 4.0-stable



--
http://preferans.de

Re: Knowing the length(convert(username using windows_1251_to_utf8))

From:
Martijn van Oosterhout
Date:
On Thu, Jan 11, 2007 at 12:37:32PM +0100, Alexander Farber wrote:
> May I ask you an off-topic question? I've read several
> docs on Unicode, but they are difficult to understand.

Have you read the Unicode FAQ?

http://www.cl.cam.ac.uk/~mgk25/unicode.html

> Do you think that a UTF8 string will ever have a 0 byte
> inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
> on the UTF8 values I'll be fetching from my database?

The answers to your questions are no and yes, respectively. See the FAQ.
That is also one of the reasons why Linux/Unix went for utf-8: it
required minimal changes to programs (and in particular, the C
library).
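
Editor's sketch of what this means for the byte-oriented C functions: strlen still works on UTF-8 but reports bytes, while the number of characters can be obtained by skipping continuation bytes (those of the form 10xxxxxx). utf8_char_count is a hypothetical helper name, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count characters (code points) in a NUL-terminated UTF-8 string.
 * Every character contributes exactly one byte that is NOT a
 * continuation byte, so counting those bytes counts the characters.
 */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the username from the first message, strlen gives 26 and utf8_char_count gives 15, matching the two length columns in the psql output. strcmp and strlcpy are likewise safe as byte operations, though strlcpy may truncate in the middle of a multi-byte character.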

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
