Discussion: C11: should we use char32_t for unicode code points?
Now that we're using C11, should we use char32_t for unicode code
points?
Right now, we use pg_wchar for two purposes:
1. to abstract away some problems with wchar_t on platforms where
it's 16 bits; and
2. hold unicode code point values
In UTF8, they are equivalent and can be freely cast back and forth,
but not necessarily in other encodings. That can be confusing in some
contexts. Attached is a patch to use char32_t for the second purpose.
Both are equivalent to uint32, so there's no functional change and no
actual typechecking, it's just for readability.
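
For illustration, the kind of change involved (signatures sketched
from pg_wchar.h; the patch may differ in detail):

    /* before: the result is an encoding-dependent pg_wchar */
    pg_wchar utf8_to_unicode(const unsigned char *c);

    /* after: the same value, but the type says "Unicode code point" */
    char32_t utf8_to_unicode(const unsigned char *c);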
Is this helpful, or needless code churn?
Regards,
Jeff Davis
Attachments
> Now that we're using C11, should we use char32_t for unicode code
> points?
>
> Right now, we use pg_wchar for two purposes:
>
> 1. to abstract away some problems with wchar_t on platforms where
>    it's 16 bits; and
> 2. hold unicode code point values
>
> In UTF8, they are equivalent and can be freely cast back and forth,
> but not necessarily in other encodings. That can be confusing in some
> contexts. Attached is a patch to use char32_t for the second purpose.
>
> Both are equivalent to uint32, so there's no functional change and no
> actual typechecking, it's just for readability.
>
> Is this helpful, or needless code churn?

Unless char32_t is solely used for the Unicode code point data, I think
it would be better to define something like "pg_unicode" and use it
instead of directly using char32_t, because it would be cleaner for
code readers.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> Unless char32_t is solely used for the Unicode code point data, I
> think it would be better to define something like "pg_unicode" and
> use it instead of directly using char32_t because it would be
> cleaner for code readers.

That was my original idea, but then I saw that apparently char32_t is
intended for Unicode code points:

https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

But I am also OK with a new type if others find it more readable.

Regards,
Jeff Davis
On Sat, Oct 25, 2025 at 4:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-24 at 18:43 +0900, Tatsuo Ishii wrote:
> > Unless char32_t is solely used for the Unicode code point data, I
> > think it would be better to define something like "pg_unicode" and
> > use it instead of directly using char32_t because it would be
> > cleaner for code readers.
>
> That was my original idea, but then I saw that apparently char32_t is
> intended for Unicode code points:
>
> https://www.gnu.org/software/gnulib/manual/html_node/The-char32_005ft-type.html

It's definitely a codepoint, but C11 only promised UTF-32 encoding if
__STDC_UTF_32__ is defined to 1, and otherwise the encoding is
unknown. The C23 standard resolved that insanity and required UTF-32,
and there are no known systems[1] that didn't already conform, but I
guess you could static_assert(__STDC_UTF_32__, "char32_t must use
UTF-32 encoding"). It's also defined as at least, not exactly, 32
bits, but we already require the machine to have uint32_t, so it must
be exactly 32 bits for us, and we could
static_assert(sizeof(char32_t) == 4) for good measure. So all up, the
standard type matches our existing assumptions about pg_wchar *if*
the database encoding is UTF8.

IIUC you're proposing that all the stuff that only works when the
database encoding is UTF8 should be flipped over to the new type, and
that seems like a really good idea to me: remaining uses of pg_wchar
would be warnings that the encoding is only conditionally known. It'd
be documentation without new type safety, though: for example, I
think you missed a spot, the return type of the definition of
utf8_to_unicode() (I didn't search exhaustively). Only in C++ is it a
distinct type that would catch that and a few other mistakes.

Do you consider explicit casts between e.g. pg_wchar and char32_t to
be useful documentation for humans, when coercion should just work? I
kinda thought we were trying to cut down on useless casts; they might
signal something but can also hide bugs. Should the few places that
deal in surrogates be using char16_t instead?

I wonder if the XXX_libc_mb() functions that contain our hard-coded
assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
use your to_char32_t() too (probably with a longer name
pg_wchar_to_char32_t() if it's in a header for wider use). That'd
highlight the exact points at which we make that assumption and
centralise the assertion about database encoding, and then the code
that compares with various known cut-off values would be clearly in
the char32_t world.

> But I am also OK with a new type if others find it more readable.

Adding yet another name to this soup doesn't immediately sound like
it would make anything more readable to me. ISO has standardised this
for the industry, so I'd vote for adopting it without indirection
that makes the reader work harder to understand what it is. The churn
doesn't seem excessive either; it's fairly well contained stuff
already moving around a lot in recent releases with all your recent
and ongoing revamping work.

There is one small practical problem, though: Apple hasn't got around
to supplying <uchar.h> in its C SDK yet. It's there for C++ only, and
isn't needed for the type in C++ anyway. I don't think that alone
warrants a new name wart, as the standard tells us it must match
uint_least32_t, so we can just define it ourselves if
!defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
around to that.
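
For example (a sketch, assuming a system that provides <uchar.h>;
C11's <assert.h> supplies the static_assert macro):

    #include <assert.h>     /* static_assert in C11 */
    #include <uchar.h>      /* char32_t */

    /* C11 leaves the char32_t encoding unspecified unless
     * __STDC_UTF_32__ is defined to 1; C23 finally requires UTF-32 */
    static_assert(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");

    /* char32_t is uint_least32_t, i.e. *at least* 32 bits; we assume
     * exactly 32, which holds anywhere uint32_t exists */
    static_assert(sizeof(char32_t) == 4, "char32_t must be 32 bits");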
Since it confused me briefly: Apple does provide <unicode/uchar.h>,
but that's a coincidentally named ICU header, and on that subject I
see that ICU hasn't adopted these types yet but there are some hints
that they're thinking about it; meanwhile their C++ interfaces have
begun to document that they are acceptable in a few template
functions.

All other target systems have it AFAICS. Windows: tested by CI;
MinGW: found discussion; *BSD, Solaris, Illumos: found man pages.

As for the conversion functions in <uchar.h>, they're of course
missing on macOS, but they also depend on the current locale, so it's
almost like C, POSIX and NetBSD have conspired to make them as
useless to us as possible. They solve the "size and encoding of
wchar_t is undefined" problem, but there are no _l() variants and we
can't depend on uselocale() being available. Probably wouldn't be
much use to us anyway considering our more complex and general
transcoding requirements; I just thought about this while
contemplating hypothetical pre-C23 systems that don't use UTF-32,
specifically what would break if such a system existed: probably
nothing, as long as you don't use these. I guess another way you
could tell would be if you used the fancy new U-prefixed
character/string literal syntax, but I can't see much need for that.

In passing, we seem to have a couple of mentions of "pg_wchar_t"
(bogus _t) in existing comments.

[1] https://thephd.dev/c-the-improvements-june-september-virtual-c-meeting
On Sat, 2025-10-25 at 16:21 +1300, Thomas Munro wrote:
> I
> guess you could static_assert(__STDC_UTF_32__, "char32_t must use
> UTF-32 encoding").
Done.
> It's also defined as at least, not exactly, 32
> bits but we already require the machine to have uint32_t so it must
> be
> exactly 32 bits for us, and we could static_assert(sizeof(char32_t)
> ==
> 4) for good measure.
What would be the problem if it were larger than 32 bits?
I don't mind adding the asserts, but it's slightly awkward because
StaticAssertDecl() isn't defined yet at the point we are including
uchar.h.
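
Roughly, the arrangement has to look something like this (a sketch,
not the actual patch):

    /* early in c.h, alongside the other system headers: */
    #ifdef HAVE_UCHAR_H
    #include <uchar.h>
    #endif

    /* ... much later, once StaticAssertDecl() has been defined ... */
    #ifdef HAVE_UCHAR_H
    StaticAssertDecl(__STDC_UTF_32__, "char32_t must use UTF-32 encoding");
    StaticAssertDecl(sizeof(char32_t) == 4, "char32_t must be 32 bits");
    #endif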
> IIUC you're proposing that all the stuff that only works when
> database
> encoding is UTF8 should be flipped over to the new type, and that
> seems like a really good idea to me: remaining uses of pg_wchar would
> be warnings that the encoding is only conditionally known.
Exactly. The idea is to make pg_wchar stand out more as a platform-
dependent (or encoding-dependent) representation, and remove the doubt
when someone sees char32_t.
> It'd be
> documentation without new type safety though: for example I think you
> missed a spot, the return type of the definition of utf8_to_unicode()
> (I didn't search exhaustively).
Right, it's not offering type safety. Fixed the omission.
> Do you consider explicit casts between eg pg_wchar and char32_t to be
> useful documentation for humans, when coercion should just work? I
> kinda thought we were trying to cut down on useless casts, they might
> signal something but can also hide bugs.
The patch doesn't add any explicit casts, except in to_char32() and
to_pg_wchar(), so I assume that the callsites of those functions are
what you meant by "explicit casts"?
We can get rid of those functions if you want. The main reason they
exist is for a place to comment on the safety of converting pg_wchar to
char32_t. I can put that somewhere else, though.
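
For concreteness, they have roughly this shape (a sketch; the exact
bodies in the patch may differ):

    #include "mb/pg_wchar.h"    /* pg_wchar, GetDatabaseEncoding() */

    /* Safe only when the database encoding is UTF-8, where pg_wchar
     * values are already Unicode code points. */
    static inline char32_t
    to_char32(pg_wchar wc)
    {
        Assert(GetDatabaseEncoding() == PG_UTF8);
        return (char32_t) wc;
    }

    static inline pg_wchar
    to_pg_wchar(char32_t c)
    {
        Assert(GetDatabaseEncoding() == PG_UTF8);
        return (pg_wchar) c;
    }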
> Should the few places that
> deal in surrogates be using char16_t instead?
Yes, done.
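
For example, the surrogate helpers in pg_wchar.h end up with this
shape (sketched here; the real definitions may differ in detail):

    #include <uchar.h>      /* char16_t */

    /* first code unit of a UTF-16 surrogate pair? */
    static inline bool
    is_utf16_surrogate_first(char16_t c)
    {
        return (c >= 0xD800 && c <= 0xDBFF);
    }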
> I wonder if the XXX_libc_mb() functions that contain our hard-coded
> assumptions that libc's wchar_t values are in UTF-16 or UTF-32 should
> use your to_char32_t() too (probably with a longer name
> pg_wchar_to_char32_t() if it's in a header for wider use).
I don't think those functions do depend on UTF-32. iswalpha(), etc.,
take a wint_t, which is just a wchar_t that can also be WEOF.
And if we don't use to_char32/to_pg_wchar in there, I don't see much
need for it outside of pg_locale_builtin.c, but if the need arises we
can move it to a header file and give it a longer name.
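
To illustrate what I mean, those call sites are roughly of this shape
(a sketch with a hypothetical name, not the actual code):

    #include <wctype.h>
    #include "mb/pg_wchar.h"    /* pg_wchar */

    /* libc classification works on wint_t (a wchar_t or WEOF); the
     * pg_wchar value is handed over without interpreting its encoding */
    static int
    isalpha_mb(pg_wchar wc)
    {
        return iswalpha((wint_t) wc);
    }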
> That'd
> highlight the exact points at which we make that assumption and
> centralise the assertion about database encoding, and then the code
> that compares with various known cut-off values would be clearly in
> the char32_t world.
The asserts about UTF-8 in pg_locale_libc.c are there because the
previous code only took those code paths for UTF-8, and I preserved
that. Also there is some code that depends on UTF-8 for decoding, but I
don't think anything in there depends on UTF-32 specifically.
> There is one small practical problem though: Apple hasn't got around
> to supplying <uchar.h> in its C SDK yet. It's there for C++ only,
> and
> isn't needed for the type in C++ anyway. I don't think that alone
> warrants a new name wart, as the standard tells us it must match
> uint_least32_t, so we can just define it ourselves if
> !defined(__cplusplus) && !defined(HAVE_UCHAR_H), until Apple gets
> around to that.
Thank you, I added a configure test for uchar.h and some more
preprocessor logic in c.h.
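
The fallback is roughly this shape (sketched; the standard requires
char32_t to be the same type as uint_least32_t, and char16_t the same
as uint_least16_t):

    #include <stdint.h>

    #ifdef HAVE_UCHAR_H
    #include <uchar.h>
    #elif !defined(__cplusplus)
    /* C++ has char16_t/char32_t built in; in C we supply them */
    typedef uint_least16_t char16_t;
    typedef uint_least32_t char32_t;
    #endif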
> Since it confused me briefly: Apple does provide <unicode/uchar.h>
> but
> that's a coincidentally named ICU header, and on that subject I see
> that ICU hasn't adopted these types yet but there are some hints that
> they're thinking about it; meanwhile their C++ interfaces have begun
> to document that they are acceptable in a few template functions.
Even when they fully move to char32_t, we will still have to support
the older ICU versions for a long time.
> All other target systems have it AFAICS. Windows: tested by CI,
> MinGW: found discussion, *BSD, Solaris, Illumos: found man pages.
Great, thank you!
> They solve the "size and encoding of wchar_t is
> undefined" problem
One thing I never understood about this is that it's our code that
converts from the server encoding to pg_wchar (e.g.
pg_latin12wchar_with_len()), so we must understand the representation
of pg_wchar. And we cast directly from pg_wchar to wchar_t, so we
understand the encoding of wchar_t, too, right?
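
For example, pg_latin12wchar_with_len() is (roughly, from memory, so
treat as approximate) just zero-extension:

    static int
    pg_latin12wchar_with_len(const unsigned char *from, pg_wchar *to,
                             int len)
    {
        int     cnt = 0;

        /* each server-encoding byte becomes one pg_wchar, zero-extended */
        while (len > 0 && *from)
        {
            *to++ = *from++;
            len--;
            cnt++;
        }
        *to = 0;
        return cnt;
    }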
> In passing, we seem to have a couple of mentions of "pg_wchar_t"
> (bogus _t) in existing comments.
Thank you. I'll fix that separately.
Regards,
Jeff Davis