Обсуждение: Use CASEFOLD() internally rather than LOWER()
There are a number of internal callers of LOWER(), and conceptually
those should all be using CASEFOLD(). Patches attached.
I'm not sure if we want the citext patch -- it would require REINDEX of
all existing citext indexes after upgrade, and there's already a
documented tip ("Consider using nondeterministic collations...), so
perhaps it's a legacy extension anyway.
It would be nice to make the tsearch change this release, as there are
already changes that could require a reindex.
I didn't change pg_trgm yet, because I think that we have to change the
regex machinery to be aware of more than two case variants first (and
potentially increasing string lengths, too).
Regards,
Jeff Davis
Вложения
> On Jan 13, 2026, at 02:22, Jeff Davis <pgsql@j-davis.com> wrote:
>
> There are a number of internal callers of LOWER(), and conceptually
> those should all be using CASEFOLD(). Patches attached.
>
> I'm not sure if we want the citext patch -- it would require REINDEX of
> all existing citext indexes after upgrade, and there's already a
> documented tip ("Consider using nondeterministic collations...), so
> perhaps it's a legacy extension anyway.
>
> It would be nice to make the tsearch change this release, as there are
> already changes that could require a reindex.
>
> I didn't change pg_trgm yet, because I think that we have to change the
> regex machinery to be aware of more than two case variants first (and
> potentially increasing string lengths, too).
>
> Regards,
> Jeff Davis
>
>
>
<v1-0001-ILIKE-use-CASEFOLD-rather-than-LOWER.patch><v1-0002-citext-use-CASEFOLD-rather-than-LOWER.patch><v1-0003-dict_xsyn-use-CASEFOLD-rather-than-LOWER.patch><v1-0004-tsearch-use-CASEFOLD-rather-than-LOWER.patch>
Hi Jeff,
Thanks for the patch. I have reviewed the patch set and got a few comments for tests:
1 - 0001
```
+SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_C_UTF8;
```
Do we want to added one more test:
```
SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_UNICODE_FAST;
?column?
----------
t
(1 row)
```
Which tests the different behaviors against collations.
2 - 0002
Do we need to add a test:
```
SELECT 'straße'::citext = 'STRASSE'::citext;
?column?
----------
f
(1 row)
```
I initially thought to add test cases with different collations, but after debugging, I found that citext intentionally
ignoresspecified collation.
3 - 0003 LGTM. Seems the existing test coverage is good enough.
4 - 0004
I thought to suggest add a test:
```
SELECT to_tsvector('straße') @@ to_tsquery('strasse');
?column?
----------
f
(1 row)
```
But I don’t see existing tests under backend/tsearch. So, I’m now not sure whether or not to insist the suggestion.
BWT, while reviewing this patch, I noticed a copy-paste error in str_casefold():
```
errmsg("could not determine which collation to use for %s function",
- "lower()"),
+ "casefold()”),
```
I have posted a patch to fix. See [1].
[1] https://postgr.es/m/CAEoWx2mMmm9fTZYgE-r_T-KPTFR1rKO029QV-S-6n=7US_9EMA@mail.gmail.com
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, 2026-01-13 at 10:14 +0800, Chao Li wrote:
> 1 - 0001
> ```
> +SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_C_UTF8;
> ```
>
> Do we want to added one more test:
> ```
> SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_UNICODE_FAST;
> ?column?
> ----------
> t
> (1 row)
> ```
> Which tests the different behaviors against collations.
I am confused, doesn't the patch already have both of those tests for
both collations? What change are you suggesting?
> I initially thought to add test cases with different collations, but
> after debugging, I found that citext intentionally ignores specified
> collation.
Adding a test that's dependent on the database-wide collation is more
heavyweight because it needs to create a new database.
Regards,
Jeff Davis
Jeff Davis wrote: > There are a number of internal callers of LOWER(), and conceptually > those should all be using CASEFOLD(). Patches attached. I tried 0001 with a non-UTF8 database and got quickly stuck: create database latin9 encoding='LATIN9' template=template0 locale='fr_FR.ISO-8859-15@euro'; \c latin9 select 'abc' ilike 'bc'; ERROR: Unicode case folding can only be performed if server encoding is UTF8 Presumably the internal calls to casefold should be conditionalized on the encoding being UTF-8? Best regards, -- Daniel Vérité https://postgresql.verite.pro/
On Sat, 2026-02-28 at 14:27 +0100, Daniel Verite wrote:
> I tried 0001 with a non-UTF8 database and got quickly stuck:
Attached new versions. I moved the encoding check into the SQL-callable
casefold() function, and other callers use str_casefold(). That
slightly simplifies what happens in ILIKE, also.
I removed the citext changes. citext has somewhat of a legacy status, I
think, so I'm not sure it makes sense to try to modernize or change it.
Also, some SQL-language functions in citext use LOWER(), so the changes
aren't enough: we'd need to make the SQL CASEFOLD function callable in
other encodings, and also run a citext upgrade script to change the
definitions.
Note that these changes affect the result of some expressions (e.g.
ILIKE), so could theoretically make an expression index or predicate
index inconsistent.
Regards,
Jeff Davis
Вложения
On Tue, Mar 3, 2026 at 1:01 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2026-02-28 at 14:27 +0100, Daniel Verite wrote:
> I tried 0001 with a non-UTF8 database and got quickly stuck:
Attached new versions. I moved the encoding check into the SQL-callable
casefold() function, and other callers use str_casefold(). That
slightly simplifies what happens in ILIKE, also.
I removed the citext changes. citext has somewhat of a legacy status, I
think, so I'm not sure it makes sense to try to modernize or change it.
Also, some SQL-language functions in citext use LOWER(), so the changes
aren't enough: we'd need to make the SQL CASEFOLD function callable in
other encodings, and also run a citext upgrade script to change the
definitions.
Note that these changes affect the result of some expressions (e.g.
ILIKE), so could theoretically make an expression index or predicate
index inconsistent.
Thanks for the patches!
After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm still
uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
With builtin collations, these produce different results.
uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
With builtin collations, these produce different results.
Вложения
On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote: > After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm > still > uses str_tolower() for trigram extraction (trgm_op.c:352 and :948). > With builtin collations, these produce different results. Interesting, thank you. As stated in the original message, I was unsure about changing pg_trgm without adjusting the regex logic, also: https://www.postgresql.org/message-id/64d7949bad90545f981ac7513fb0b4954daca2c9.camel@j-davis.com do you have a suggestion about an easy way to do that, or should we revisit in the next cycle? Regards, Jeff Davis
On Tue, Mar 24, 2026 at 4:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote:
> After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm
> still
> uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
> With builtin collations, these produce different results.
Interesting, thank you. As stated in the original message, I was unsure
about changing pg_trgm without adjusting the regex logic, also:
https://www.postgresql.org/message-id/64d7949bad90545f981ac7513fb0b4954daca2c9.camel@j-davis.com
do you have a suggestion about an easy way to do that, or should we
revisit in the next cycle?
pg_trgm appears to be lossy, with recheck logic. I would think you just need to make it give answers which at least include everything that a regex would match, and then allow recheck to prune that down. My concern is having pg_trgm give less than all the answers, so that after recheck you get fewer results than a seqscan would have returned. Would switching to casefold be strictly broader than regex? If so, you would just need to convert pg_trgm to use casefold and then rely on the recheck machinery.
Sorry if this misses something discussed upthread. I'm clearly assuming here that you don't mind that such a change necessitates a REINDEX.
Mark Dilger
On Wed, 2026-03-25 at 07:40 -0700, Mark Dilger wrote:
> pg_trgm appears to be lossy, with recheck logic. I would think you
> just need to make it give answers which at least include everything
> that a regex would match, and then allow recheck to prune that down.
> My concern is having pg_trgm give less than all the answers, so that
> after recheck you get fewer results than a seqscan would have
> returned. Would switching to casefold be strictly broader than
> regex?
I think the precise question would be: "are there any two characters
that lowercase to the same character but do not casefold to the same
character?".
I don't have a counterexample, so perhaps using casefold would still be
fine.
Thoughts? Should we enhance regexes to consider more than two case
variants first, or should we proceed with some of these patches (and/or
a similar change to pg_trgm)?
> Sorry if this misses something discussed upthread. I'm clearly
> assuming here that you don't mind that such a change necessitates a
> REINDEX.
That's a concern. It may depend on how big the impact would be -- for
libc I don't think it would matter because lowercasing and casefolding
are the same thing.
Regards,
Jeff Davis
On Wed, Mar 25, 2026 at 2:02 PM Jeff Davis <pgsql@j-davis.com> wrote:
I think the precise question would be: "are there any two characters
that lowercase to the same character but do not casefold to the same
character?".
I don't know. I'll set up a test to iterate across all locales across all character pairs... no, I didn't find any on my system. Other searching suggests that the Turkish and Azerbaijani locale do have this characteristic, with I (U+0049) lowercasing to ı (U+0131) and case folding to i (U+0069) while ı (U+0131) lowercases to ı (U+0131) but also case folds to ı (U+0131). I have not confirmed that empirically, though.
I don't have a counterexample, so perhaps using casefold would still be
fine.
Thoughts? Should we enhance regexes to consider more than two case
variants first, or should we proceed with some of these patches (and/or
a similar change to pg_trgm)?
I don't want to take a strong position either way. I'm still wrapping my head around the various implications of the proposed changes, and don't feel I have a complete picture yet.
Mark Dilger