Discussion: C11: should we use char32_t for unicode code points?
Now that we're using C11, should we use char32_t for unicode code
points?
Right now, we use pg_wchar for two purposes:
1. to abstract away some problems with wchar_t on platforms where
it's 16 bits; and
2. hold unicode code point values
In UTF8, they are equivalent and can be freely cast back and forth,
but not necessarily in other encodings. That can be confusing in some
contexts. Attached is a patch to use char32_t for the second purpose.
Both are equivalent to uint32, so there's no functional change and no
actual typechecking, it's just for readability.
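
For illustration, the kind of change involved (signatures sketched
from pg_wchar.h; the patch may differ in detail):

    /* before: the result is an encoding-dependent pg_wchar */
    pg_wchar utf8_to_unicode(const unsigned char *c);

    /* after: the same value, but the type says "Unicode code point" */
    char32_t utf8_to_unicode(const unsigned char *c);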
Is this helpful, or needless code churn?
Regards,
Jeff Davis
Attachments
> Now that we're using C11, should we use char32_t for unicode code
> points?
>
> Right now, we use pg_wchar for two purposes:
>
> 1. to abstract away some problems with wchar_t on platforms where
>    it's 16 bits; and
> 2. hold unicode code point values
>
> In UTF8, they are equivalent and can be freely cast back and forth,
> but not necessarily in other encodings. That can be confusing in some
> contexts. Attached is a patch to use char32_t for the second purpose.
>
> Both are equivalent to uint32, so there's no functional change and no
> actual typechecking, it's just for readability.
>
> Is this helpful, or needless code churn?

Unless char32_t is solely used for the Unicode code point data, I think
it would be better to define something like "pg_unicode" and use it
instead of directly using char32_t, because it would be cleaner for
code readers.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> Unless char32_t is solely used for the Unicode code point data, I
> think it would be better to define something like "pg_unicode" and
> use it instead of directly using char32_t because it would be
> cleaner for code readers.

That was my original idea, but then I saw that apparently char32_t is
intended for Unicode code points:

https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

But I am also OK with a new type if others find it more readable.

Regards,
Jeff Davis
On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use it instead of directly using char32_t because it would be
> > cleaner for code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a codepoint, but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1, and otherwise the encoding is
unknown. The C23 standard resolved that insanity and required UTF-32,
and there are no known systems[1] that didn't already conform, but I
guess you could static_assert(__STDC_UTF_32__, "char32_t must use
UTF-32 encoding"). It's also defined as at least, not exactly, 32
bits, but we already require the machine to have uint32_t, so it must
be exactly 32 bits for us, and we could
static_assert(sizeof(char32_t) == 4) for good measure. So all up, the
standard type matches our existing assumptions about pg_wchar *if*
the database encoding is UTF8.

IIUC you're proposing that all the stuff that only works when the
database encoding is UTF8 should be flipped over to the new type, and
that seems like a really good idea to me: remaining uses of pg_wchar
would be warnings that the encoding is only conditionally known. It'd
be documentation without new type safety, though: for example, I
think you missed a spot, the return type of the definition of
utf8_to_unicode() (I didn't search exhaustively). Only in C++ is it a
distinct type that would catch that and a few other mistakes.

Do you consider explicit casts between e.g. pg_wchar and char32_t to
be useful documentation for humans, when coercion should just work? I
kinda thought we were trying to cut down on useless casts; they might
signal something but can also hide bugs. Should the few places that
deal in surrogates be using char16_t instead?

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use). That'd
highlight the exact points at which we make that assumption and
centralise the assertion about database encoding, and then the code
that compares with various known cut-off values would be clearly in
the char32_t world.

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like
it would make anything more readable to me. ISO has standardised this
for the industry, so I'd vote for adopting it without indirection
that makes the reader work harder to understand what it is. The churn
doesn't seem excessive either; it's fairly well contained stuff
already moving around a lot in recent releases with all your recent
and ongoing revamping work.

There is one small practical problem, though: Apple hasn't got around
to supplying <uchar.h> in its C SDK yet. It's there for C++ only, and
isn't needed for the type in C++ anyway. I don't think that alone
warrants a new name wart, as the standard tells us it must match
uint_least32_t, so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.
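
For example (a sketch, assuming a system that provides <uchar.h>;
C11's <assert.h> supplies the static_assert macro):

    #include <assert.h>     /* static_assert in C11 */
    #include <uchar.h>      /* char32_t */

    /* C11 leaves the char32_t encoding unspecified unless
     * __STDC_UTF_32__ is defined to 1; C23 finally requires UTF-32 */
    static_assert(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");

    /* char32_t is uint_least32_t, i.e. *at least* 32 bits; we assume
     * exactly 32, which holds anywhere uint32_t exists */
    static_assert(sizeof(char32_t) == 4, "char32_t must be 32 bits");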
Since it confused me briefly: Apple does provide <unicode/uchar.h>,
but that's a coincidentally named ICU header, and on that subject I
see that ICU hasn't adopted these types yet but there are some hints
that they're thinking about it; meanwhile their C++ interfaces have
begun to document that they are acceptable in a few template
functions.

All other target systems have it AFAICS. Windows: tested by CI;
MinGW: found discussion; *BSD, Solaris, Illumos: found man pages.

As for the conversion functions in <uchar.h>, they're of course
missing on macOS, but they also depend on the current locale, so it's
almost like C, POSIX and NetBSD have conspired to make them as
useless to us as possible. They solve the "size and encoding of
wchar_t is undefined" problem, but there are no _l() variants and we
can't depend on uselocale() being available. Probably wouldn't be
much use to us anyway considering our more complex and general
transcoding requirements; I just thought about this while
contemplating hypothetical pre-C23 systems that don't use UTF-32,
specifically what would break if such a system existed: probably
nothing, as long as you don't use these. I guess another way you
could tell would be if you used the fancy new U-prefixed
character/string literal syntax, but I can't see much need for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t"
(bogus _t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting
On Sat, 2025-10-25 at 16:21 +1300, Thomas Munro wrote:
> I
> guess you could static_assert(__STDC_UTF_32__, "char32_t must use
> UTF-32 encoding").
Done.
> It's also defined as at least, not exactly, 32
> bits but we already require the machine to have uint32_t so it must
> be
> exactly 32 bits for us, and we could static_assert(sizeof(char32_t)
> ==
> 4) for good measure.
What would be the problem if it were larger than 32 bits?
I don't mind adding the asserts, but it's slightly awkward because
StaticAssertDecl() isn't defined yet at the point we are including
uchar.h.
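
Roughly, the arrangement has to look something like this (a sketch,
not the actual patch):

    /* early in c.h, alongside the other system headers: */
    #ifdef HAVE_UCHAR_H
    #include <uchar.h>
    #endif

    /* ... much later, once StaticAssertDecl() has been defined ... */
    #ifdef HAVE_UCHAR_H
    StaticAssertDecl(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");
    StaticAssertDecl(sizeof(char32_t) == 4, "char32_t must be 32 bits");
    #endif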
> IIUC you're proposing that all the stuff that only works when
> database
> encoding is UTF8 should be flipped over to the new type, and that
> seems like a really good idea to me: remaining uses of pg_wchar would
> be warnings that the encoding is only conditionally known.
Exactly. The idea is to make pg_wchar stand out more as a platform-
dependent (or encoding-dependent) representation, and remove the doubt
when someone sees char32_t.
> It'd be
> documentation without new type safety though: for example I think you
> missed a spot, the return type of the definition of utf8_to_unicode()
> (I didn't search exhaustively).
Right, it's not offering type safety. Fixed the omission.
> Do you consider explicit casts between eg pg_wchar and char32_t to be
> useful documentation for humans, when coercion should just work? I
> kinda thought we were trying to cut down on useless casts, they might
> signal something but can also hide bugs.
The patch doesn't add any explicit casts, except in to_char32() and
to_pg_wchar(), so I assume that the callsites of those functions are
what you meant by "explicit casts"?
We can get rid of those functions if you want. The main reason they
exist is for a place to comment on the safety of converting pg_wchar to
char32_t. I can put that somewhere else, though.
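
For concreteness, they have roughly this shape (a sketch; the exact
bodies in the patch may differ):

    #include "mb/pg_wchar.h"    /* pg_wchar, GetDatabaseEncoding() */

    /* Safe only when the database encoding is UTF-8, where pg_wchar
     * values are already Unicode code points. */
    static inline char32_t
    to_char32(pg_wchar wc)
    {
        Assert(GetDatabaseEncoding() == PG_UTF8);
        return (char32_t) wc;
    }

    static inline pg_wchar
    to_pg_wchar(char32_t c)
    {
        Assert(GetDatabaseEncoding() == PG_UTF8);
        return (pg_wchar) c;
    }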
> Should the few places that
> deal in surrogates be using char16_t instead?
Yes, done.
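
For example, the surrogate helpers in pg_wchar.h end up with this
shape (sketched here; the real definitions may differ in detail):

    #include <uchar.h>      /* char16_t */

    /* first code unit of a UTF-16 surrogate pair? */
    static inline bool
    is_utf16_surrogate_first(char16_t c)
    {
        return (c >= 0xD800 && c <= 0xDBFF);
    }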
> I wonder if the XXX_libc_mb() functions that contain our hard-coded
> assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> use your to_char32_t() too (probably with a longer name
> pg_wchar_to_char32_t() if it's in a header for wider use).
I don't think those functions do depend on UTF-32. iswalpha(), etc.,
take a wint_t, which is just a wchar_t that can also be WEOF.
And if we don't use to_char32/to_pg_wchar in there, I don't see much
need for it outside of pg_locale_builtin.c, but if the need arises we
can move it to a header file and give it a longer name.
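
To illustrate what I mean, those call sites are roughly of this shape
(a sketch with a hypothetical name, not the actual code):

    #include <wctype.h>
    #include "mb/pg_wchar.h"    /* pg_wchar */

    /* libc classification works on wint_t (a wchar_t or WEOF); the
     * pg_wchar value is handed over without interpreting its encoding */
    static int
    isalpha_mb(pg_wchar wc)
    {
        return iswalpha((wint_t) wc);
    }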
> That'd
> highlight the exact points at which we make that assumption and
> centralise the assertion about database encoding, and then the code
> that compares with various known cut-off values would be clearly in
> the char32_t world.
The asserts about UTF-8 in pg_locale_libc.c are there because the
previous code only took those code paths for UTF-8, and I preserved
that. Also there is some code that depends on UTF-8 for decoding, but I
don't think anything in there depends on UTF-32 specifically.
> There is one small practical problem though: Apple hasn't got around
> to supplying <uchar.h> in its C SDK yet. It's there for C++ only,
> and
> isn't needed for the type in C++ anyway. I don't think that alone
> warrants a new name wart, as the standard tells us it must match
> uint_least32_t, so we can just define it ourselves if
> !defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
> around to that.
Thank you, I added a configure test for uchar.h and some more
preprocessor logic in c.h.
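
The fallback is roughly this shape (sketched; the standard requires
char32_t to be the same type as uint_least32_t, and char16_t the same
as uint_least16_t):

    #include <stdint.h>

    #ifdef HAVE_UCHAR_H
    #include <uchar.h>
    #elif !defined(__cplusplus)
    /* C++ has char16_t/char32_t built in; in C we supply them */
    typedef uint_least16_t char16_t;
    typedef uint_least32_t char32_t;
    #endif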
> Since it confused me briefly: Apple does provide <unicode/uchar.h>
> but
> that's a coincidentally named ICU header, and on that subject I see
> that ICU hasn't adopted these types yet but there are some hints that
> they're thinking about it; meanwhile their C++ interfaces have begun
> to document that they are acceptable in a few template functions.
Even when they fully move to char32_t, we will still have to support
the older ICU versions for a long time.
> All other target systems have it AFAICS. Windows: tested by CI,
> MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.
Great, thank you!
> They solve the "size and encoding of wchar_t is
> undefined" problem
One thing I never understood about this is that it's our code that
converts from the server encoding to pg_wchar (e.g.
pg_latin12wchar_with_len()), so we must understand the representation
of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
understand the encoding of wchar_t, too, right?
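
For example, pg_latin12wchar_with_len() is (roughly, from memory, so
treat as approximate) just zero-extension:

    static int
    pg_latin12wchar_with_len(const unsigned char *from, pg_wchar *to,
                             int len)
    {
        int     cnt = 0;

        /* each server-encoding byte becomes one pg_wchar, zero-extended */
        while (len > 0 && *from)
        {
            *to++ = *from++;
            len--;
            cnt++;
        }
        *to = 0;
        return cnt;
    }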
> In passing, we seem to have a couple of mentions of "pg_wchar_t"
> (bogus _t) in existing comments.
Thank you. I'll fix that separately.
Regards,
Jeff Davis