Discussion: Trying out native UTF-8 locales on Windows
Here's a very short patch to experiment with the idea of using Windows' native UTF-8 support when possible, ie when using "en-US.UTF-8" in a UTF-8 database. Otherwise it continues to use the special Windows-only wchar_t conversion that allows for locales with non-matching encodings, ie the reason you're allowed to use "English_United States.1252" in a UTF-8 database on that OS, something we wouldn't allow on Unix.

As I understand it, that mechanism dates from the pre-Windows 10 era, when Windows had no .UTF-8 locales but users wanted or needed to use UTF-8 databases. I think some locales used encodings that we don't even support as server encodings, eg SJIS in Japan, so that was a workaround. I assume you could use "ja-JP.UTF-8" these days.

CI tells me it compiles and passes, but I am not a Windows person; I'm primarily interested in code cleanup and removing weird platform differences. I wonder if someone directly interested in Windows would like to experiment with this and report whether (1) it works as expected and (2) "en-US.UTF-8" loses performance compared to "en-US" (which I guess uses WIN1252 encoding and triggers the conversion path?), and similarly for other locale pairs you might be interested in?

It's possible that strcoll_l() converts the whole string to wchar_t internally anyway, in which case it might turn out to be marginally slower. We often have to copy the char strings up front ourselves in the regular strcoll_l() path in order to null-terminate them, something that is skipped in the wchar_t conversion path, which combines widening with null-termination in one step. Not sure if that'd kill the idea, but it'd at least be nice to know whether we might eventually be able to drop the special code paths and strange configuration possibilities compared to Unix, and use it in less performance-critical paths. At the very least, the comments are wrong...
Attachments
On 10/27/2025 10:22 PM, Thomas Munro wrote:
> Here's a very short patch to experiment with the idea of using
> Windows' native UTF-8 support when possible, ie when using
> "en-US.UTF-8" in a UTF-8 database. Otherwise it continues to use the
> special Windows-only wchar_t conversion that allows for locales with
> non-matching locales, ie the reason you're allowed to use
> "English_United States.1252" in a UTF-8 database on that OS, something
> we wouldn't allow on Unix.
>
> As I understand it, that mechanism dates from the pre-Windows 10 era
> when it had no .UTF-8 locales but users wanted or needed to use UTF-8
> databases. I think some locales used encodings that we don't even
> support as server encodings, eg SJIS in Japan, so that was a
> workaround. I assume you could use "ja-JP.UTF-8" these days.
>
> CI tells me it compiles and passes, but I am not a Windows person, I'm
> primarily interested in code cleanup and removing weird platform
> differences. I wonder if someone directly interested in Windows would
> like to experiment with this and report whether (1) it works as
> expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
> (which I guess uses WIN1252 encoding and triggers the conversion
> path?), and similarly for other locale pairs you might be interested
> in?
I wrote a standalone test to check this. Results on Windows 11 x64,
16 cores, ACP=1252.
(1) Correctness: PASS. strcoll_l() with UTF-8 locale matches wcscoll_l()
for all 26 test cases (ASCII, accents, umlauts, ß, Greek, etc).
Sorting 38 German/French words with both methods produces identical
order.
(2) Performance vs WIN1252: It depends on the data.
Basic comparison with short real-world strings (1M iterations each):
Test                      UTF8-new   UTF8-cur   WIN1252
-----------------------   --------   --------   -------
'hello' vs 'world'           82 ms     108 ms     76 ms
'apple' vs 'banana'          85 ms     110 ms     77 ms
'PostgreSQL' vs 'MySQL'      89 ms     113 ms     83 ms
UTF8-new = strcoll_l with UTF-8 locale (proposed patch)
UTF8-cur = wcscoll_l via conversion (current PostgreSQL)
WIN1252 = strcoll_l with legacy locale (baseline)
For ASCII strings, there's a crossover around 15-20 characters
(500K iterations each):
Length     UTF8   WIN1252   Ratio
------   ------   -------   -----
     5    43 ms     40 ms   0.93x (UTF8 7% slower)
    10    50 ms     48 ms   0.96x (UTF8 4% slower)
    20    57 ms     65 ms   1.13x (UTF8 13% faster)
    50   104 ms    122 ms   1.17x (UTF8 17% faster)
   100   150 ms    195 ms   1.30x (UTF8 30% faster)
   500   550 ms    783 ms   1.43x (UTF8 43% faster)
For accented characters (á = 2 bytes UTF-8, 1 byte WIN1252), UTF-8
is consistently ~2x slower, as expected from the byte count
(500K iterations each):
Chars      UTF8   WIN1252   Ratio
-----    ------   -------   -----
    5     55 ms     42 ms   0.76x
   50    233 ms    117 ms   0.50x
  200    694 ms    342 ms   0.49x
With 200-char ASCII strings, UTF-8 beats WIN1252 even when the
difference is at position 0 (500K iterations each):
Difference at    UTF8   WIN1252
-------------   ------  -------
Position 0      168 ms   260 ms
Position 199    252 ms   342 ms
This suggests WIN1252's strcoll_l() has poor scaling characteristics
that the UTF-8 implementation avoids. I don't have an explanation
for this.
The patch is correct, and the new strcoll_l() path is 10-25% faster than
the current wcscoll_l() conversion path. Whether UTF-8 locale is faster
or slower than WIN1252 depends on string length and content - but users
choosing UTF-8 locales presumably want Unicode support, not WIN1252
compatibility.
I can test more if needed. I can also provide the test program for
anyone who wants it.
--
Bryan Green
EDB: https://www.enterprisedb.com
On Fri, Jan 2, 2026 at 5:25 PM Bryan Green <dbryan.green@gmail.com> wrote:
> The patch is correct, and the new strcoll_l() path is 10-25% faster than
> the current wcscoll_l() conversion path. Whether UTF-8 locale is faster
> or slower than WIN1252 depends on string length and content - but users
> choosing UTF-8 locales presumably want Unicode support, not WIN1252
> compatibility.
Thanks Bryan! This all sounds quite promising!
I can see three future pathways for this line of work:
1. We just do this opportunistically, ie when locale encoding happens
to be UTF-8, as in this patch, and call it a day, keeping the
mismatched encoding support indefinitely.
2. We additionally formalise it: there could be a build option
PG_ALLOW_MISMATCHED_ENCODINGS, true for Windows and false for Unix by
default, but a Unix developer could enable it to test that mode of
operation. We could use our own transcoding functions instead of the
Windows ones so that the code is identical on all platforms, ie
testable by all for general project sanity.
3. We decide we want to kill support for mismatched encodings. I
guess that means that the upgrade path for existing clusters would
involve switching your existing locale names eg in pg_database and
pg_collation from eg "English_United State.1252" or "en-US" to
"en-US.UTF-8", for databases that are using UTF-8 encoding. I don't
know how exactly that should be done, ie manually before pg_upgrade,
by pg_upgrade itself, or something else. On-the-fly translation would
also be possible but probably a bit too magical. In this pathway we
get to delete these code differences entirely, and I think our general
encoding and portability situation would be greatly improved.
I am in favour of #3 at least eventually, and I posted patches to try
that out a few years ago[1]. If we can't agree to do that, my next
preference would be #2. What do you think? #3 would be inflicting
one-time pain on users. Would it be worth it? (And if you're
interested in the general topic of encodings and portability, there
are many more problems to solve, working on those[2]...).
One small thing that would help reduce the pain of #3, if we think
there's any chance we're going to go that way, would be to ask the EDB
installer team to stop listing locales with mismatched encodings in
their GUI for initdb, for the benefit of new clusters being born in
the wild today.
An approach for existing clusters could be to produce a file
upgrade-windows-locale-encodings.sql that could be run against a
cluster, containing UPDATE pg_database SET datlocale = 'en-US.UTF-8'
WHERE encoding = 6 AND datlocale IN ('English_United States.1252',
'en-US') and so on, extracted from some official source, or something
like that? Or maybe it would just spit out the UPDATE statements for
human review and execution. I assume/hope those locales are really
the "same" in every important respect. One minor problem to think
about is that the historical "English_..." names don't report a
version, so that field is empty. I'm less sure what the situation is
with pg_collation. It is initially populated by
pg_import_system_collations() which iterates over the system locales
(it doesn't actually seem that wise to me that it translates "-" to
"_" in those names, what is the point of messing with the true
system-defined names?!) and I don't know off the top of my head if it
finishes up with both "en-US" (presumably implying .1252 encoding) and
"en-US.UTF-8" entries. If so it's less obvious how to upgrade
automatically, perhaps by ALTERing any columns to use the one that
points to "en-US.UTF-8" instead of the one that points to "en-US"
where appropriate, but perhaps a simple SQL query could find simple
cases and recommend the ALTER statements that would achieve that?
More complicated places where COLLATE might be hiding would of course
be a Turing tarpit beyond analysis. Presumably pg_upgrade would
simply fail if you don't upgrade the source database in this way
first, and tell you why.
[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com#6158915417859e029d60456312fd1cc7
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLrx02%3DwT647oYz7x0GW%3DPyQdma0s4U6w9h%3DsuFjcciTw%40mail.gmail.com