Обсуждение: Use correct collation in pg_trgm

Поиск
Список
Период
Сортировка

Use correct collation in pg_trgm

От
David Geier
Дата:
Hi hackers,

In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
for converting trigrams to lower-case. Here are some examples where
today the collation is ignored:

CREATE EXSTENSION pg_trgm;
CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');

postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
                  show_trgm
---------------------------------------------
 {"  i"," is",anb,bul,ist,nbu,sta,tan,"ul "}

CREATE TABLE test(col TEXT COLLATE "turkish");
INSERT INTO test VALUES ('ISTANBUL');

postgres=# select show_trgm(col) FROM test;
                  show_trgm
---------------------------------------------
 {"  i"," is",anb,bul,ist,nbu,sta,tan,"ul "}

postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
 similarity
------------
        0.5

If the database is initialized via initdb --locale="tr_TR.utf8", the
output changes:

postgres=# SELECT show_trgm('ISTANBUL');
                       show_trgm
--------------------------------------------------------
 {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}

and

postgres=# select show_trgm(col) FROM test;
                       show_trgm
--------------------------------------------------------
 {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}

postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
 similarity
------------
          1

tr_TR.utf8 converts capital I to ı which is a multibyte character, while
my default collation converts I to i.

The attached patch attempts to fix that. I grepped for all occurrences
of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
collation OID instead DEFAULT_COLLATION_OID.

The corresponding regression tests pass.

[1]
https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com

--
David Geier
Вложения

Re: Use correct collation in pg_trgm

От
Zsolt Parragi
Дата:
Hello!

The patch is simple and it does what it says it does, I verified the
difference in behavior with/without it.

I think the test case included in the email should be part of the
patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql?
It also needs a proper commit message, and seems like the affected
indexes will need a REINDEX after this fix.



Re: Use correct collation in pg_trgm

От
Kirill Reshke
Дата:
On Wed, 21 Jan 2026 at 20:36, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
> CREATE EXSTENSION pg_trgm;
> CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
>
> postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
>                   show_trgm
> ---------------------------------------------
>  {"  i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> CREATE TABLE test(col TEXT COLLATE "turkish");
> INSERT INTO test VALUES ('ISTANBUL');
>
> postgres=# select show_trgm(col) FROM test;
>                   show_trgm
> ---------------------------------------------
>  {"  i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
>  similarity
> ------------
>         0.5
>
> If the database is initialized via initdb --locale="tr_TR.utf8", the
> output changes:
>
> postgres=# SELECT show_trgm('ISTANBUL');
>                        show_trgm
> --------------------------------------------------------
>  {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> and
>
> postgres=# select show_trgm(col) FROM test;
>                        show_trgm
> --------------------------------------------------------
>  {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
>  similarity
> ------------
>           1
>
> tr_TR.utf8 converts capital I to ı which is a multibyte character, while
> my default collation converts I to i.
>
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead DEFAULT_COLLATION_OID.
>
> The corresponding regression tests pass.
>
> [1]
> https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
>
> --
> David Geier

Hi!
LGTM

--
Best regards,
Kirill Reshke



Re: Use correct collation in pg_trgm

От
David Geier
Дата:
Hi!

Thanks for reviewing.

On 22.01.2026 07:32, Zsolt Parragi wrote:
> Hello!
> 
> The patch is simple and it does what it says it does, I verified the
> difference in behavior with/without it.

While reading through [1] I realized that the word boundary detection
also uses the wrong collation. Patch 0002 fixes that.

> I think the test case included in the email should be part of the
> patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql?
> It also needs a proper commit message, and seems like the affected
> indexes will need a REINDEX after this fix.

- I've added tests to 0001 and 0002 based on what each commit fixes.
- I've improved the commit messages.

Looking at [2], it seems like we don't include release note changes in
bug fix commits but rather collect them retroactively before cutting the
release.

[1]
https://www.postgresql.org/message-id/f30299bf-ad8e-4125-bf80-e0a8663991b6%40eisentraut.org

[2]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0

--
David Geier
Вложения

Re: Use correct collation in pg_trgm

От
Zsolt Parragi
Дата:
> While reading through [1] I realized that the word boundary detection
> also uses the wrong collation. Patch 0002 fixes that.

Good catch, I considered checking that when I reviewed the patch, but
I thought that it would be an issue with CJK languages, and I was
completely wrong about that.

Updated patch looks good, I only noticed two minor things:

* the new test should have a newline at the end of the file
* and probably a conditional skip based on locale availability, the
citext_utf8.sql test case does something similar



Re: Use correct collation in pg_trgm

От
Alexander Korotkov
Дата:
Hi!

Thank you for working on this.

On Mon, Jan 26, 2026 at 3:24 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote:
>
> > While reading through [1] I realized that the word boundary detection
> > also uses the wrong collation. Patch 0002 fixes that.
>
> Good catch, I considered checking that when I reviewed the patch, but
> I thought that it would be an issue with CJK languages, and I was
> completely wrong about that.
>
> Updated patch looks good, I only noticed two minor things:
>
> * the new test should have a newline at the end of the file
> * and probably a conditional skip based on locale availability, the
> citext_utf8.sql test case does something similar

I wonder about existing indexes, which already use default collation
not column collation.  Should we add a release note saying they might
be obsolete?  Alternatively, given we now have opclass options, we may
add a new opclass option defining whether to use column collation
(must be false for existing indexes).

------
Regards,
Alexander Korotkov
Supabase



Re: Use correct collation in pg_trgm

От
David Geier
Дата:
Attached is v3 of the patch with the following changes:

On 26.01.2026 14:33, Alexander Korotkov wrote:
>> Updated patch looks good, I only noticed two minor things:
>>
>> * the new test should have a newline at the end of the file

Done.

>> * and probably a conditional skip based on locale availability, the
>> citext_utf8.sql test case does something similar

Done, the same way collate.icu.utf8.sql does it. This test also uses the
Turkish collation tr-x-icu.

> I wonder about existing indexes, which already use default collation
> not column collation.  Should we add a release note saying they might
> be obsolete?  Alternatively, given we now have opclass options, we may
> add a new opclass option defining whether to use column collation
> (must be false for existing indexes).

I think adding a release note is fine. We discussed this in [1] already
and there's precedence for doing it this way. See [2].

Looking at [3], it seems like we don't include release notes in
bug fix commits but rather collect them retroactively before cutting the
release. The file doc/src/sgml/release-19.sgml is also still completely
empty on master.

--
David Geier


[1]
https://www.postgresql.org/message-id/CAEze2WiUL9idZBbuUN%2BMuWqr6DcPr_-C91E9MTx%3DH62Xx5fHaQ%40mail.gmail.com

[2] https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION

[3]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0
Вложения

Re: Use correct collation in pg_trgm

От
Zsolt Parragi
Дата:
>> * the new test should have a newline at the end of the file
>
> Done.

It is still missing.

Otherwise it looks good to me.

(I don't see a commitfest entry for it?)



Re: Use correct collation in pg_trgm

От
David Geier
Дата:
On 02.02.2026 21:33, Zsolt Parragi wrote:
>>> * the new test should have a newline at the end of the file
>>
>> Done.
> 
> It is still missing.

Sorry, my bad. Now it's there.

> Otherwise it looks good to me.
> 
> (I don't see a commitfest entry for it?)

I thought Alexander might pick it up without CF entry. I created now
one, see [1]. Will you add yourself as reviewer?

--
David Geier

[1] https://commitfest.postgresql.org/patch/6454/
Вложения