Обсуждение: Use correct collation in pg_trgm
Hi hackers,
In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
for converting trigrams to lower-case. Here are some examples where
today the collation is ignored:
CREATE EXSTENSION pg_trgm;
CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
CREATE TABLE test(col TEXT COLLATE "turkish");
INSERT INTO test VALUES ('ISTANBUL');
postgres=# select show_trgm(col) FROM test;
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
0.5
If the database is initialized via initdb --locale="tr_TR.utf8", the
output changes:
postgres=# SELECT show_trgm('ISTANBUL');
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
and
postgres=# select show_trgm(col) FROM test;
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
1
tr_TR.utf8 converts capital I to ı which is a multibyte character, while
my default collation converts I to i.
The attached patch attempts to fix that. I grepped for all occurrences
of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
collation OID instead DEFAULT_COLLATION_OID.
The corresponding regression tests pass.
[1]
https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
--
David Geier
Вложения
Hello! The patch is simple and it does what it says it does, I verified the difference in behavior with/without it. I think the test case included in the email should be part of the patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql? It also needs a proper commit message, and seems like the affected indexes will need a REINDEX after this fix.
On Wed, 21 Jan 2026 at 20:36, David Geier <geidav.pg@gmail.com> wrote:
>
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
> CREATE EXSTENSION pg_trgm;
> CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
>
> postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> CREATE TABLE test(col TEXT COLLATE "turkish");
> INSERT INTO test VALUES ('ISTANBUL');
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 0.5
>
> If the database is initialized via initdb --locale="tr_TR.utf8", the
> output changes:
>
> postgres=# SELECT show_trgm('ISTANBUL');
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> and
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 1
>
> tr_TR.utf8 converts capital I to ı which is a multibyte character, while
> my default collation converts I to i.
>
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead DEFAULT_COLLATION_OID.
>
> The corresponding regression tests pass.
>
> [1]
> https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
>
> --
> David Geier
Hi!
LGTM
--
Best regards,
Kirill Reshke
Hi! Thanks for reviewing. On 22.01.2026 07:32, Zsolt Parragi wrote: > Hello! > > The patch is simple and it does what it says it does, I verified the > difference in behavior with/without it. While reading through [1] I realized that the word boundary detection also uses the wrong collation. Patch 0002 fixes that. > I think the test case included in the email should be part of the > patch, maybe as a new file contrib/pg_trgm/sql/pg_trgm_collation.sql? > It also needs a proper commit message, and seems like the affected > indexes will need a REINDEX after this fix. - I've added tests to 0001 and 0002 based on what each commit fixes. - I've improved the commit messages. Looking at [2], it seems like we don't include release note changes in bug fix commits but rather collect them retroactively before cutting the release. [1] https://www.postgresql.org/message-id/f30299bf-ad8e-4125-bf80-e0a8663991b6%40eisentraut.org [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0 -- David Geier
Вложения
> While reading through [1] I realized that the word boundary detection > also uses the wrong collation. Patch 0002 fixes that. Good catch, I considered checking that when I reviewed the patch, but I thought that it would be an issue with CJK languages, and I was completely wrong about that. Updated patch looks good, I only noticed two minor things: * the new test should have a newline at the end of the file * and probably a conditional skip based on locale availability, the citext_utf8.sql test case does something similar
Hi! Thank you for working on this. On Mon, Jan 26, 2026 at 3:24 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote: > > > While reading through [1] I realized that the word boundary detection > > also uses the wrong collation. Patch 0002 fixes that. > > Good catch, I considered checking that when I reviewed the patch, but > I thought that it would be an issue with CJK languages, and I was > completely wrong about that. > > Updated patch looks good, I only noticed two minor things: > > * the new test should have a newline at the end of the file > * and probably a conditional skip based on locale availability, the > citext_utf8.sql test case does something similar I wonder about existing indexes, which already use default collation not column collation. Should we add a release note saying they might be obsolete? Alternatively, given we now have opclass options, we may add a new opclass option defining whether to use column collation (must be false for existing indexes). ------ Regards, Alexander Korotkov Supabase
Attached is v3 of the patch with the following changes: On 26.01.2026 14:33, Alexander Korotkov wrote: >> Updated patch looks good, I only noticed two minor things: >> >> * the new test should have a newline at the end of the file Done. >> * and probably a conditional skip based on locale availability, the >> citext_utf8.sql test case does something similar Done, the same way collate.icu.utf8.sql does it. This test also uses the Turkish collation tr-x-icu. > I wonder about existing indexes, which already use default collation > not column collation. Should we add a release note saying they might > be obsolete? Alternatively, given we now have opclass options, we may > add a new opclass option defining whether to use column collation > (must be false for existing indexes). I think adding a release note is fine. We discussed this in [1] already and there's precedence for doing it this way. See [2]. Looking at [3], it seems like we don't include release notes in bug fix commits but rather collect them retroactively before cutting the release. The file doc/src/sgml/release-19.sgml is also still completely empty on master. -- David Geier [1] https://www.postgresql.org/message-id/CAEze2WiUL9idZBbuUN%2BMuWqr6DcPr_-C91E9MTx%3DH62Xx5fHaQ%40mail.gmail.com [2] https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fb1a18810f0
Вложения
>> * the new test should have a newline at the end of the file > > Done. It is still missing. Otherwise it looks good to me. (I don't see a commitfest entry for it?)
On 02.02.2026 21:33, Zsolt Parragi wrote: >>> * the new test should have a newline at the end of the file >> >> Done. > > It is still missing. Sorry, my bad. Now it's there. > Otherwise it looks good to me. > > (I don't see a commitfest entry for it?) I thought Alexander might pick it up without CF entry. I created now one, see [1]. Will you add yourself as reviewer? -- David Geier [1] https://commitfest.postgresql.org/patch/6454/