Обсуждение: Use CASEFOLD() internally rather than LOWER()

Поиск
Список
Период
Сортировка

Use CASEFOLD() internally rather than LOWER()

От
Jeff Davis
Дата:
There are a number of internal callers of LOWER(), and conceptually
those should all be using CASEFOLD(). Patches attached.

I'm not sure if we want the citext patch -- it would require REINDEX of
all existing citext indexes after upgrade, and there's already a
documented tip ("Consider using nondeterministic collations...), so
perhaps it's a legacy extension anyway.

It would be nice to make the tsearch change this release, as there are
already changes that could require a reindex.

I didn't change pg_trgm yet, because I think that we have to change the
regex machinery to be aware of more than two case variants first (and
potentially increasing string lengths, too).

Regards,
    Jeff Davis



Вложения

Re: Use CASEFOLD() internally rather than LOWER()

От
Chao Li
Дата:

> On Jan 13, 2026, at 02:22, Jeff Davis <pgsql@j-davis.com> wrote:
>
> There are a number of internal callers of LOWER(), and conceptually
> those should all be using CASEFOLD(). Patches attached.
>
> I'm not sure if we want the citext patch -- it would require REINDEX of
> all existing citext indexes after upgrade, and there's already a
> documented tip ("Consider using nondeterministic collations...), so
> perhaps it's a legacy extension anyway.
>
> It would be nice to make the tsearch change this release, as there are
> already changes that could require a reindex.
>
> I didn't change pg_trgm yet, because I think that we have to change the
> regex machinery to be aware of more than two case variants first (and
> potentially increasing string lengths, too).
>
> Regards,
> Jeff Davis
>
>
>
<v1-0001-ILIKE-use-CASEFOLD-rather-than-LOWER.patch><v1-0002-citext-use-CASEFOLD-rather-than-LOWER.patch><v1-0003-dict_xsyn-use-CASEFOLD-rather-than-LOWER.patch><v1-0004-tsearch-use-CASEFOLD-rather-than-LOWER.patch>

Hi Jeff,

Thanks for the patch. I have reviewed the patch set and got a few comments for tests:

1 - 0001
```
+SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_C_UTF8;
```

Do we want to added one more test:
```
SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_UNICODE_FAST;
 ?column?
----------
 t
(1 row)
```
Which tests the different behaviors against collations.

2 - 0002

Do we need to add a test:
```
SELECT 'straße'::citext = 'STRASSE'::citext;
 ?column?
----------
 f
(1 row)
```

I initially thought to add test cases with different collations, but after debugging, I found that citext intentionally
ignoresspecified collation. 

3 - 0003 LGTM. Seems the existing test coverage is good enough.

4 - 0004

I thought to suggest add a test:
```
SELECT to_tsvector('straße') @@ to_tsquery('strasse');
 ?column?
----------
 f
(1 row)
```

But I don’t see existing tests under backend/tsearch. So, I’m now not sure whether or not to insist the suggestion.

BWT, while reviewing this patch, I noticed a copy-paste error in str_casefold():
```
                                 errmsg("could not determine which collation to use for %s function",
-                                               "lower()"),
+                                               "casefold()”),
```

I have posted a patch to fix. See [1].

[1] https://postgr.es/m/CAEoWx2mMmm9fTZYgE-r_T-KPTFR1rKO029QV-S-6n=7US_9EMA@mail.gmail.com

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Use CASEFOLD() internally rather than LOWER()

От
Jeff Davis
Дата:
On Tue, 2026-01-13 at 10:14 +0800, Chao Li wrote:
> 1 - 0001
> ```
> +SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_C_UTF8;
> ```
>
> Do we want to added one more test:
> ```
> SELECT U&'straße' ILIKE U&'STRASSE' COLLATE PG_UNICODE_FAST;
>  ?column?
> ----------
>  t
> (1 row)
> ```
> Which tests the different behaviors against collations.

I am confused, doesn't the patch already have both of those tests for
both collations? What change are you suggesting?

> I initially thought to add test cases with different collations, but
> after debugging, I found that citext intentionally ignores specified
> collation.

Adding a test that's dependent on the database-wide collation is more
heavyweight because it needs to create a new database.


Regards,
    Jeff Davis




Re: Use CASEFOLD() internally rather than LOWER()

От
"Daniel Verite"
Дата:
    Jeff Davis wrote:

> There are a number of internal callers of LOWER(), and conceptually
> those should all be using CASEFOLD(). Patches attached.

I tried 0001 with a non-UTF8 database and got quickly stuck:

create database latin9 encoding='LATIN9'
 template=template0
 locale='fr_FR.ISO-8859-15@euro';

\c latin9

select 'abc' ilike 'bc';
ERROR:    Unicode case folding can only be performed if server encoding is UTF8

Presumably the internal calls to casefold should be conditionalized
on the encoding being UTF-8?


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/



Re: Use CASEFOLD() internally rather than LOWER()

От
Jeff Davis
Дата:
On Sat, 2026-02-28 at 14:27 +0100, Daniel Verite wrote:
> I tried 0001 with a non-UTF8 database and got quickly stuck:

Attached new versions. I moved the encoding check into the SQL-callable
casefold() function, and other callers use str_casefold(). That
slightly simplifies what happens in ILIKE, also.

I removed the citext changes. citext has somewhat of a legacy status, I
think, so I'm not sure it makes sense to try to modernize or change it.
Also, some SQL-language functions in citext use LOWER(), so the changes
aren't enough: we'd need to make the SQL CASEFOLD function callable in
other encodings, and also run a citext upgrade script to change the
definitions.

Note that these changes affect the result of some expressions (e.g.
ILIKE), so could theoretically make an expression index or predicate
index inconsistent.

Regards,
    Jeff Davis


Вложения

Re: Use CASEFOLD() internally rather than LOWER()

От
Mark Dilger
Дата:

On Tue, Mar 3, 2026 at 1:01 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2026-02-28 at 14:27 +0100, Daniel Verite wrote:
> I tried 0001 with a non-UTF8 database and got quickly stuck:

Attached new versions. I moved the encoding check into the SQL-callable
casefold() function, and other callers use str_casefold(). That
slightly simplifies what happens in ILIKE, also.

I removed the citext changes. citext has somewhat of a legacy status, I
think, so I'm not sure it makes sense to try to modernize or change it.
Also, some SQL-language functions in citext use LOWER(), so the changes
aren't enough: we'd need to make the SQL CASEFOLD function callable in
other encodings, and also run a citext upgrade script to change the
definitions.

Note that these changes affect the result of some expressions (e.g.
ILIKE), so could theoretically make an expression index or predicate
index inconsistent.

Thanks for the patches!

After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm still
uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
With builtin collations, these produce different results.

Вложения

Re: Use CASEFOLD() internally rather than LOWER()

От
Jeff Davis
Дата:
On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote:
> After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm
> still
> uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
> With builtin collations, these produce different results.

Interesting, thank you. As stated in the original message, I was unsure
about changing pg_trgm without adjusting the regex logic, also:

https://www.postgresql.org/message-id/64d7949bad90545f981ac7513fb0b4954daca2c9.camel@j-davis.com

do you have a suggestion about an easy way to do that, or should we
revisit in the next cycle?

Regards,
    Jeff Davis




Re: Use CASEFOLD() internally rather than LOWER()

От
Mark Dilger
Дата:


On Tue, Mar 24, 2026 at 4:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2026-03-21 at 20:14 -0700, Mark Dilger wrote:
> After v2-0001, ILIKE uses str_casefold() for matching, but pg_trgm
> still
> uses str_tolower() for trigram extraction (trgm_op.c:352 and :948).
> With builtin collations, these produce different results.

Interesting, thank you. As stated in the original message, I was unsure
about changing pg_trgm without adjusting the regex logic, also:

https://www.postgresql.org/message-id/64d7949bad90545f981ac7513fb0b4954daca2c9.camel@j-davis.com

do you have a suggestion about an easy way to do that, or should we
revisit in the next cycle?

pg_trgm appears to be lossy, with recheck logic.  I would think you just need to make it give answers which at least include everything that a regex would match, and then allow recheck to prune that down.  My concern is having pg_trgm give less than all the answers, so that after recheck you get fewer results than a seqscan would have returned.  Would switching to casefold be strictly broader than regex?  If so, you would just need to convert pg_trgm to use casefold and then rely on the recheck machinery.

Sorry if this misses something discussed upthread.  I'm clearly assuming here that you don't mind that such a change necessitates a REINDEX. 

--

Mark Dilger

Re: Use CASEFOLD() internally rather than LOWER()

От
Jeff Davis
Дата:
On Wed, 2026-03-25 at 07:40 -0700, Mark Dilger wrote:
> pg_trgm appears to be lossy, with recheck logic.  I would think you
> just need to make it give answers which at least include everything
> that a regex would match, and then allow recheck to prune that down. 
> My concern is having pg_trgm give less than all the answers, so that
> after recheck you get fewer results than a seqscan would have
> returned.  Would switching to casefold be strictly broader than
> regex?

I think the precise question would be: "are there any two characters
that lowercase to the same character but do not casefold to the same
character?".

I don't have a counterexample, so perhaps using casefold would still be
fine.

Thoughts? Should we enhance regexes to consider more than two case
variants first, or should we proceed with some of these patches (and/or
a similar change to pg_trgm)?

> Sorry if this misses something discussed upthread.  I'm clearly
> assuming here that you don't mind that such a change necessitates a
> REINDEX. 

That's a concern. It may depend on how big the impact would be -- for
libc I don't think it would matter because lowercasing and casefolding
are the same thing.

Regards,
    Jeff Davis




Re: Use CASEFOLD() internally rather than LOWER()

От
Mark Dilger
Дата:


On Wed, Mar 25, 2026 at 2:02 PM Jeff Davis <pgsql@j-davis.com> wrote:
I think the precise question would be: "are there any two characters
that lowercase to the same character but do not casefold to the same
character?".

I don't know.  I'll set up a test to iterate across all locales across all character pairs... no, I didn't find any on my system.  Other searching suggests that the Turkish and Azerbaijani locale do have this characteristic, with I (U+0049) lowercasing to ı (U+0131) and case folding to i (U+0069) while ı (U+0131) lowercases to ı (U+0131) but also case folds to ı (U+0131).  I have not confirmed that empirically, though.
 
I don't have a counterexample, so perhaps using casefold would still be
fine.

Thoughts? Should we enhance regexes to consider more than two case
variants first, or should we proceed with some of these patches (and/or
a similar change to pg_trgm)?

I don't want to take a strong position either way.  I'm still wrapping my head around the various implications of the proposed changes, and don't feel I have a complete picture yet.

--

Mark Dilger