Обсуждение: Documenting more pitfalls of non-default collations?
Hi there,
I mentioned to Jeremy at pgConf.dev that using non-default collations
in some SQL idioms can produce undesired results, and he asked me to
send an email. An example idiom is the way Django implements
case-insensitive comparisons using "upper(x) = upper(y)" [1][2][3] ,
which returns false if x = y but they have different collations that
produce different uppercase.
For example, assuming the default collation performs standard Unicode
case mapping:
# select upper('i') = upper('i' collate "tr-x-icu");
f
# select upper('é') = upper('é' collate "C");
f
Or with collations in DDL instead:
# create table t (
tr text collate "tr-x-icu",
c text collate "C"
);
CREATE TABLE
# insert into t values ('i', 'é');
INSERT 0 1
# select count(*) from t where upper('i') = upper(tr);
0
# select count(*) from t where upper('é') = upper(c);
0
This is expected, given a careful reading of the collation docs, but
it's not really highlighted in any of the examples--in each example
that doesn't produce an error, all of the collation-sensitive
functions/operators end up applying the same collation. Maybe there
should be an example that applies different collations in different
subexpressions, and/or a warning against constructions like "upper(x)
= upper(y)"?
[1] https://github.com/django/django/blame/stable/5.1.x/django/db/backends/postgresql/operations.py#L175
[2] https://github.com/django/django/blame/stable/5.1.x/django/db/backends/postgresql/base.py#L155
[3] https://code.djangoproject.com/ticket/32485
On Mon, 2024-06-10 at 23:55 -0700, Will Mortensen wrote:
> I mentioned to Jeremy at pgConf.dev that using non-default collations
> in some SQL idioms can produce undesired results, and he asked me to
> send an email. An example idiom is the way Django implements
> case-insensitive comparisons using "upper(x) = upper(y)" [1][2][3] ,
> which returns false if x = y but they have different collations that
> produce different uppercase.
Hi,
Thank you for the examples.
There are quite a few subtleties to getting case-insensitive
comparisons right, and neither LOWER() nor UPPER() get everything quite
right even if the collation is the same.
For instance (for almost any locale other than "C"):
UPPER(LOWER(U&'\1E9E')) != UPPER(U&'\1E9E')
And:
LOWER(UPPER(U&'\03C2')) != LOWER(U&'\03C2')
The results of UPPER() and LOWER() can also change if some language
adds a new case variant in the future, which could be a problem if the
results are stored somewhere.
How should we document all of that? If we include too many caveats,
it's just frustrating.
Instead, I propose that we implement Unicode "case folding" in PG18,
which solves these issues by transforming the string to a canonical
form suitable for case-insensitive comparison. (In most cases, the
results are the same as LOWER(), but there are exceptions specifically
to avoid the problems above.)
Then, we can just have a section in the docs on "case folding" to
describe the right way to use it. That still leaves one caveat: the
handling of dotted- and dotless-i. But one caveat is a lot easier to
keep track of.
Regards,
Jeff Davis
Hi Jeff, Unicode case folding would be great! I wonder if there might still be situations where multiple collations are applied within the same expression/statement and this produces a surprising result. So far I'm mostly coming up with examples that seem more contrived or less surprising, but maybe one would still be informative?